Why I’m not a corporate investor

Party like it’s 1993.

-Russ [Russell Beattie Notebook]

1993 was also when I first heard the word ‘Linux’, which, because we were British, was pronounced ‘Line-ux’. One of my Uni acquaintances showed it to me. “Look, free Unix on a PC”. “That won’t go anywhere”, I thought… This rates alongside my other great predictions such as, ‘Netscape are having an IPO, should I invest? Naa, they probably won’t be worth much’. And Yahoo, and Red Hat… I console myself that I was a poor student and didn’t have much to invest at the time anyway. I wouldn’t have had enough money to get rich, just enough to have had one whale of a time at Uni.

BEEP

Fed my book-buying habit today with the acquisition of the O’Reilly BEEP book. Build your own network protocol. Cool. BEEP looks very interesting as an alternative to all the contortions distributed application developers have to go through to make their applications work over HTTP. It provides a framework where most of the complex low-level stuff is done for you, and you just have to build your application-specific stuff on top of it. So the developer gets to decide whether the connection should be pull, push or both, stateless or stateful, pipelined or multiplexed, etc. And security appears to be pluggable too.

I seem to remember Paul Hammant mentioning something about writing a BEEP module for AltRMI, which sounds like a great idea, especially for doing asynchronous callbacks. Must read more in case I’m totally wrong…

You’re not a *nix geek unless…

…you’ve replaced sendmail as your default MTA (in my case with postfix). Oh my word. Talk about stressful. At one point I thought I’d just obliterated a whole day’s worth of incoming mail because I kicked off fetchmail (thinking I was ready when I wasn’t), and postfix threw a wobbly. Thankfully it kept all the undelivered messages so after a few frantic minutes skimming the docs, hacking the config and one ‘postfix flush’ later, all my email reappeared. Phew.

I flatter myself that I can usually puzzle my way through most techie things, but email delivery systems are way more complex than I ever imagined. I had no idea what I was getting into when I started. It’s still not working as I expected but I appear to be able to send email, so I think I’ll leave it until my palms stop sweating.

Distributed Lucene

Interesting article by Mark Harwood here regarding distributed Lucene indexes. Using distributed indexes is how Google achieves its scalability, I believe, but they are a fairly special case.

If scalability in the sense of concurrent users is the issue, I tend to favour multiple identical boxes with a load balancer and an RPC frontend. This can be as simple as a servlet, or you can use SOAP or XML-RPC etc. (Possibly RMI, although I’ve never tried that across a load balancer). Doing things this way is probably a lot simpler to manage than splitting your indexes across boxes, and means that even if your queries are asymmetric (i.e. 85% of the queries are for the same thing), the load can be fairly balanced. Reliability is achieved for free as well: if a box dies, just stop sending requests there. Given Lucene’s performance (it has been used to index collections of more than 10 million documents) it’s pretty unlikely that your dataset will get so large that sheer size starts to affect your query times. Unless of course, you are Google 🙂
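To make the "identical boxes behind a load balancer" idea a bit more concrete, here is a minimal sketch in Java. Everything in it is made up for illustration: the class names SearchBalancer and Backend are hypothetical, and each box is represented by a plain function standing in for whatever RPC call (servlet, SOAP, XML-RPC) you would really make. It only shows the two properties mentioned above: requests are spread round-robin, and a box that fails simply stops receiving requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: round-robin over identical search boxes, skipping dead ones.
class SearchBalancer {

    // One search box. The function is a stand-in for a real RPC call.
    static final class Backend {
        final String name;
        final Function<String, String> search;
        volatile boolean alive = true;

        Backend(String name, Function<String, String> search) {
            this.name = name;
            this.search = search;
        }
    }

    private final List<Backend> backends;
    private int next = 0;

    SearchBalancer(List<Backend> backends) {
        this.backends = new ArrayList<>(backends);
    }

    // Try each box at most once, starting from the round-robin position.
    // Synchronized for simplicity; a real balancer would not hold a lock
    // for the duration of a remote call.
    synchronized String query(String q) {
        for (int attempts = 0; attempts < backends.size(); attempts++) {
            Backend b = backends.get(next);
            next = (next + 1) % backends.size();
            if (!b.alive) {
                continue; // this box already failed, skip it
            }
            try {
                return b.search.apply(q);
            } catch (RuntimeException e) {
                b.alive = false; // box died: just stop sending requests there
            }
        }
        throw new IllegalStateException("no live backends");
    }

    public static void main(String[] args) {
        List<Backend> boxes = new ArrayList<>();
        boxes.add(new Backend("box-a", q -> "box-a answered " + q));
        boxes.add(new Backend("box-b", q -> { throw new RuntimeException("down"); }));
        boxes.add(new Backend("box-c", q -> "box-c answered " + q));

        SearchBalancer balancer = new SearchBalancer(boxes);
        for (int i = 1; i <= 4; i++) {
            System.out.println(balancer.query("query" + i));
        }
    }
}
```

Because every box holds the same index, it doesn't matter which one answers, which is why a skewed query mix still spreads evenly. A real setup would also want some way of bringing a recovered box back into rotation, which this sketch leaves out.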

Lucene hints

Lucene is great, but some of the default settings are heavily biased towards interactive indexing and searching. If you’re building an index in a batch process style, set the IndexWriter.mergeFactor value to something big. I use 10,000, which makes it burn about 500 meg of RAM while indexing, but speeds it up a lot over the default value of 10. YMMV as ever.
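For a rough feel of why a big merge factor helps batch indexing, here is a back-of-the-envelope sketch in Java. It is not Lucene code and doesn't call the Lucene API; it assumes a simplified model in which segments are merged whenever mergeFactor of them pile up at one level, so each document gets rewritten roughly once per level. The class and method names (MergeFactorEstimate, mergeLevels) are made up for illustration.

```java
// Hypothetical estimate under a simplified merge model, not Lucene's actual code.
class MergeFactorEstimate {

    // Smallest number of levels L such that mergeFactor^L >= docCount.
    // In this model each document is rewritten about once per level.
    static int mergeLevels(long docCount, int mergeFactor) {
        if (docCount < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need docCount >= 1 and mergeFactor >= 2");
        }
        int levels = 0;
        long capacity = 1; // documents one segment can hold at the current level
        while (capacity < docCount) {
            capacity *= mergeFactor;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        long docs = 10_000_000L;
        System.out.println("mergeFactor 10:    " + mergeLevels(docs, 10) + " levels");
        System.out.println("mergeFactor 10000: " + mergeLevels(docs, 10000) + " levels");
    }
}
```

For 10 million documents the model gives 7 levels at the default of 10 and 2 levels at 10,000, so far less copying of the same data. The price is that more has to be held before anything gets merged, which fits with the extra RAM mentioned above.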