Distributed Lucene

Interesting article by Mark Harwood here regarding distributed Lucene indexes. Using distributed indexes is, I believe, how Google achieves its scalability, but they are a fairly special case.

If scalability in the sense of concurrent users is the issue, I tend to favour multiple identical boxes, each holding a full copy of the index, behind a load balancer with an RPC frontend. This can be as simple as a servlet, or you can use SOAP, XML-RPC, etc. (Possibly RMI, although I’ve never tried that across a load balancer.) Doing things this way is probably a lot simpler to manage than splitting your indexes across boxes, and it means that even if your queries are skewed (i.e. 85% of the queries are for the same thing), the load can still be spread fairly evenly. Reliability comes for free as well: if a box dies, just stop sending requests there. Given Lucene’s performance (it has been used to index collections of more than 10 million documents), it’s pretty unlikely that your dataset will get so large that sheer size starts to affect your query times. Unless, of course, you are Google 🙂
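
To make the "identical boxes behind a load balancer" idea concrete, here is a rough sketch of the selection logic in Java. It is only an illustration, not how any particular load balancer works: the class name `SearchBoxBalancer`, its methods, and the box names are all made up for the example, and it assumes something else (a health check, say) decides when a box is dead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical round-robin chooser over identical search boxes.
 * Every box is assumed to hold the same full index, so any live box
 * can answer any query.
 */
class SearchBoxBalancer {
    private final List<String> boxes;
    private final Set<String> dead = ConcurrentHashMap.newKeySet();
    private final AtomicInteger cursor = new AtomicInteger();

    SearchBoxBalancer(List<String> boxes) {
        if (boxes.isEmpty()) {
            throw new IllegalArgumentException("need at least one box");
        }
        this.boxes = new ArrayList<>(boxes);
    }

    /** Stop sending requests to a box (e.g. after a failed health check). */
    void markDead(String box) {
        dead.add(box);
    }

    /** Put a recovered box back into rotation. */
    void markAlive(String box) {
        dead.remove(box);
    }

    /** Returns the next live box in rotation, skipping dead ones. */
    String next() {
        // Try each box at most once per call before giving up.
        for (int i = 0; i < boxes.size(); i++) {
            int idx = Math.floorMod(cursor.getAndIncrement(), boxes.size());
            String box = boxes.get(idx);
            if (!dead.contains(box)) {
                return box;
            }
        }
        throw new IllegalStateException("no live boxes");
    }
}
```

A frontend servlet would call `next()` for each incoming query and forward it to whichever box comes back. Because the rotation ignores what the query is about, a popular query is spread over all the boxes just like any other.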

Darren Hobbs