Distributed Lucene

Interesting article by Mark Harwood here regarding distributed Lucene indexes. Using distributed indexes is how Google achieves its scalability, I believe, but they are a fairly special case.

If scalability in the sense of concurrent users is the issue, I tend to favour multiple identical boxes with a load balancer and an RPC frontend. This can be as simple as a servlet, or you can use SOAP, XML-RPC, etc. (possibly RMI, although I’ve never tried that across a load balancer). Doing things this way is probably a lot simpler to manage than splitting your indexes across boxes, and it means that even if your queries are asymmetric (i.e. 85% of the queries are for the same thing), the load can be fairly balanced. Reliability is achieved for free as well: if a box dies, just stop sending requests there. Given Lucene’s performance (it has been used to index collections of more than 10 million documents), it’s pretty unlikely that your dataset will get so large that sheer size starts to affect your query times. Unless, of course, you are Google 🙂
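
To make the “if a box dies just stop sending requests there” idea concrete, here is a rough Java sketch of the rotation a load balancer performs over identical search boxes. The class name SearchBoxPool, its methods, and the box names are all made up for illustration; in practice you would normally lean on an off-the-shelf load balancer rather than hand-rolled code.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical round-robin picker over identical search boxes. */
class SearchBoxPool {
    private final List<String> boxes;                        // every box holds the same full index
    private final Set<String> dead = ConcurrentHashMap.newKeySet();
    private final AtomicInteger next = new AtomicInteger();

    SearchBoxPool(List<String> boxes) {
        this.boxes = List.copyOf(boxes);
    }

    /** Called when a health check or a failed request shows a box is down. */
    void markDead(String box) { dead.add(box); }

    /** Called when the box comes back. */
    void markAlive(String box) { dead.remove(box); }

    /** Returns the next live box in rotation, skipping dead ones. */
    String pick() {
        for (int i = 0; i < boxes.size(); i++) {
            int idx = Math.floorMod(next.getAndIncrement(), boxes.size());
            String box = boxes.get(idx);
            if (!dead.contains(box)) {
                return box;
            }
        }
        throw new IllegalStateException("no live search boxes");
    }

    public static void main(String[] args) {
        SearchBoxPool pool = new SearchBoxPool(List.of("box-a", "box-b", "box-c"));
        pool.markDead("box-b");                              // simulate a dead box
        for (int i = 0; i < 4; i++) {
            System.out.println(pool.pick());                 // box-b never appears
        }
    }
}
```

Because every box can answer every query, it does not matter which one a popular query lands on, which is why asymmetric query loads still spread evenly.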

Lucene hints

Lucene is great, but some of the default settings are heavily biased towards interactive indexing and searching. If you’re building an index in a batch process style, set the IndexWriter.mergeFactor value to something big. I use 10,000, which makes it burn about 500 meg of RAM while indexing, but speeds it up a lot over the default value of 10. YMMV as ever.

More mail musings

Next Generation Email Clients.

Wow, you want a lot. 🙂

“A reasonable man adapts himself to the world around him. An unreasonable man expects the world to adapt to him. Therefore, all progress is made by unreasonable men.” – George Bernard Shaw.

Of course, he was a lot more eloquent than me. I just look pointedly at the title of my blog 🙂

First of all you should probably be using IMAP in the short term as it will provide a much better means for accessing email in a centralized location. The downside is that IMAP tends to require a good connection speed because the messages stay on the server and are downloaded on demand, as opposed to the POP strategy which downloads all messages in your INBOX and lets you organize and store them locally.

Yeah, I’ll probably end up doing something like that. Although I can never do things the easy way, so what I might well end up doing is running my own local IMAP server…

As for a cross-platform GUI client which actually works well, I have yet to find one which satisfies me. On W2K I use Outlook, and while it is adequate there are a number of things which really piss me off. Sometimes Outlook just sits there for 10 or 20 minutes “checking for messages”. Then there is the virus issue. On OS X I use the included Mail application which works pretty darn well. In the worst-case scenario I resort to Pine on Linux (if I can only get in via SSH).

The most powerful Windows email client I’ve ever used is The Bat!. Also used by Ron Jeffries, I believe.

On my side I think I have abandoned the desktop client in favor of a web client. There are just too many issues with synchronization across clients which are too difficult to overcome. The existing protocols (IMAP and POP) do not really work well when you get into the level of tens of thousands of messages in hundreds of folders. You will essentially need to have a custom server and protocol which deals with these issues so that synchronization is not completely up to the client. Unfortunately this will be a painful if not impossible uphill battle due to the fact that people have their email servers in place and would be very reluctant to replace them with your server.

I wasn’t really suggesting a replacement for established servers, but something more along the lines of a web service. As connectivity increases and more people have permanent connections, it’s not unforeseeable that having your own static IP will be as common as having a phone number. If you believe the IPv6 hype, even your fridge will have a net presence in the not-too-far future. Anyway, if you have your own server on the net, you have more options with regard to applications. Imagine the scenario: your powerful home server is collecting and indexing your email, news feeds, etc. according to the criteria you have defined. You have your lightweight wireless device / laptop with you, and can simply hook up to your central system and be presented with a condensed and sorted view of all the stuff it has for you. Read some emails, send some replies, organise your calendar, all centrally stored and managed from your personal server. Your personal server could equally well be a hosted service, much like many bloggers already have.

Here are my ideas for a web-based email/information manager:

  • It will not emulate a typical client-side application. No folder trees. No drag and drop. My idea is a single “INBOX” which is an aggregation of your different message sources such as email, RSS, newsgroups, etc.
  • It will link together email/contact management/history/tasks/issues etc. in a fashion which makes it easy to view the lifetime of a particular discussion as well as the results of a discussion.
  • It will provide multi-user functionality so that a group of users could share some messages which are related to the group but not others which are related to the individual only.
  • It will work with POP and IMAP servers.
  • It will track EVERYTHING which comes in and goes out.
  • It will link with JIRA! 🙂

[All Things Java]

All good stuff. Especially the JIRA bit 🙂

Interesting. I think MS

Interesting. I think MS has its work cut out for itself to build a large open source community anywhere near the sizes of Linux, GNU or Jakarta. Especially if it’s under a dodgy MS licence. All I’ve seen so far have been C# ports of existing Java open source projects (like NAnt, NUnit etc). Is there really much of a C# open source community out there? [James Strachan’s Radio Weblog]

As someone who’s turned most of their development attention to .NET (splitter), I’m feeling kind of isolated. OSS is totally alien to the existing MS community and the Java community view .NET as the plague. An umbrella is badly needed for the .NET OSS community.

How long until we see dotnet.apache.org? 🙂

[Joe’s Jelly]

C# is similar enough to Java that it isn’t hard to pick up for Java developers who haven’t rejected it merely on the basis of its heritage. As companies (like Joe’s clients) start to demand .NET stuff, developers coming in to .NET who are used to the availability of OSS tools for Java will probably start to produce equivalents for the .NET world. Although what Microsoft were thinking with their first workspaces license escapes me. Unless they offer a compelling technical reason not to use SourceForge, I think they may struggle. The community will decide, and it doesn’t really matter whether the tools are hosted on SourceForge, Apache, or GotDotNet workspaces.

Java email redux

I am humbled, and impressed. Humbled by once again realising how much I don’t know, and impressed by how much others do, and how important human interaction is in what we do.

In one single comment, Chris has pointed me to at least three Java-related email projects I somehow managed to miss. I’d heard of ICEMail already, but forgot to mention it.

Must brush up on my Google terms – it’s too easy to artificially narrow the criteria and not get what you were after. Google should just know what I want, not what I say I want!

So what do I want? It’s more than just a mail reader. I have a ‘too many email addresses’ issue. I have several personal ones and a work one. I want to be able to aggregate all my personal ones and read them at work and at home. I want emails I write to be stored centrally, irrespective of where I am when I write them. At home I use Windows XP for most things, and FreeBSD for everything else. Part of the ‘everything else’ includes my email. But I don’t want to have to switch chairs just to write an email. Fetchmail, Procmail and Mutt form a powerful triumvirate, and I can still fire up KMail if I feel like it and have it work. What I’d love is to have something like KMail as a cross-platform remote client, so I could use a nice GUI from wherever I can get a net connection, without having to lug my entire email directory around with me. It would have to be client-server. I don’t want to try and rethread a 15,000 message folder back and forth over the net. The server should do that and just show me the results.

The other thing I want (don’t ask much, do I?) is actually where I started. The concept of a smart mail reader that is also a newsgroup, web page, RSS, etc. aggregator that can index all my stuff and show me the relationships between things, sort of like ZOË taken to its logical extreme.

Then I just have the problem of what to do after breakfast… 🙂

Dreams and aspirations

Inspired by ZOË, and motivated partly by the fact that Mutt is still the best go-anywhere remote mail reader, I’ve been daydreaming about a cross-platform client-server GUI (not web based) Java mail program with built-in smart indexing, context-sensitive threading and dynamic linking.

What have I achieved so far? I’ve got some very simplistic code that scans an mbox file and creates EmailMessage objects.

On the way I discovered that:

  • There appear to be no open source Java libraries for parsing mbox mail files.
  • Mozilla started work on a Java email reader called Grendel in 1997, but it seems to have died. There appears to be a lot of code in CVS, but most of it seems to be pre-Java 1.2, and it requires Netscape jars which probably aren’t open source.
  • There is apparently no official spec for the mbox format hence…
  • …the implementation varies significantly between applications and platforms. About the only thing you can assume is that each message starts with ‘From ‘ preceded by a blank line or the start of the file.
  • The ‘From ‘ address at the start of each message isn’t necessarily the same as the ‘From:’ header.
  • Jamie Zawinski (ex-Mozillian) has lots of useful and eclectic information on his site.
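
For what it’s worth, the one rule in the list above that seems safe to rely on (a message starts at a ‘From ‘ line preceded by a blank line or the start of the file) is enough for a crude splitter. The sketch below is not the code I actually wrote; MboxSplitter and its method names are invented for illustration, it ignores folded headers and ‘>From ‘ quoting, and it will mis-split any body that has an unquoted ‘From ‘ line right after a blank line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, deliberately naive mbox splitter. */
class MboxSplitter {

    /** Splits raw mbox text into messages; each one starts with its "From " line. */
    static List<String> split(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        boolean prevBlank = true;                  // start of file counts as a boundary
        for (String line : mbox.split("\r?\n", -1)) {
            if (prevBlank && line.startsWith("From ")) {
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            }
            if (current != null) {                 // anything before the first "From " is dropped
                current.append(line).append('\n');
            }
            prevBlank = line.isEmpty();
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    /** Envelope sender: the second token on the leading "From " line. */
    static String envelopeSender(String message) {
        String first = message.substring(0, message.indexOf('\n'));
        String[] parts = first.split(" +");
        return parts.length > 1 ? parts[1] : "";
    }

    /** Value of the From: header, or null if the header block has none. */
    static String fromHeader(String message) {
        for (String line : message.split("\n")) {
            if (line.isEmpty()) {
                break;                             // blank line ends the headers
            }
            if (line.regionMatches(true, 0, "From:", 0, 5)) {
                return line.substring(5).trim();
            }
        }
        return null;
    }
}
```

Comparing envelopeSender with fromHeader on a mailing-list message is a quick way to see the two ‘From’ values disagree.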

Quality of service

Late night ramblings…. Back to blogging… ahhhh.

I sorta mentioned my new server setup in passing before, but I’m using OrionServer again and I love it. I was thinking about slapping a hacked copy of Weblogic on the server so I can work and play in the same environment, but Orion is just so nice I decided to play nice.

-Russ [Russell Beattie Notebook]

Nice to have you back, Russ.

On Weblogic: Whoa, you’ve only got 4 gig of disk and a gig of RAM, are you sure she can take it captain?

On JohnCompanies: I received the following at 10pm PST on a Friday night, after signing up at 1pm. Impressed.

Hi,

We have a really nice welcome message that we send out to new customers for our FreeBSD product, but I haven’t finished writing the welcome message for our linux customers – I assume you’d rather not wait for that and just get an informal welcome.

…[account stuff skipped]…

P.P.S. We are having a scheduled maintenance tomorrow (saturday) night for about 20 minutes … this is rare and it is just a coincidence that it is happening the day after you sign up.

A hosting firm that actually keeps their customers informed? Remarkable.

Finally:

…we are very happy to have you as a customer…

I think I’m going to be very happy being one.

On hosting and domain names

I’ve been hanging on to darrenhobbs.com for a couple of years, but never done anything with it. I always felt it was a bit narcissistic somehow, don’t ask me why. Anyway, having discovered JohnCompanies thanks to Russell, and having suffered before from not having full control over my hosting, I decided to take the plunge and sign up for a Linux account with them. So hopefully in a few days http://www.darrenhobbs.com will go live. What shall I do with it?

And what do I use for an email address? me@darrenhobbs.com? mail@darrenhobbs.com?