Looking for a good Java search engine
Saturday, May 6th, 2006
The title is a bit misleading. I’m not actually looking for a search engine written in Java; if I were, I’d head straight to Nutch. What I’m looking for is a search engine that covers only Java articles and tutorials. The company I work for has software to build vertical search engines (in fact, I wrote most of it), so I really should be eating my own dog food. But you know how it is: the cobbler’s children never have any shoes. So while I wait for some servers to free up, I decided to have a go with similar systems that are freely available on the web.
The two sites I’m trying out are Rollyo and Swicki. Both sites let you specify a list of web sites and search across them. The sites I used for my test were:
- On Java
- IBM developerWorks (just the Java section)
- java.net (just the articles)
Rollyo
First up is Rollyo. It’s very simple to set up the search engine; I won’t go into the specifics because anyone could figure it out. Rollyo is actually backed by Yahoo! search, and it soon becomes clear that it’s just a site-restricted search in Yahoo. Two of the sites on my list cover many topics and I wanted only a sub-directory of each, but Rollyo searches the whole site, which is not what I want.
My test search is concurrency deadlock detection. The first two results are the same page, there are plenty of other duplicates, and I get non-Java results back, e.g. ‘DB2 for z/OS: DB2 Universal Database concurrency’. So far, not so good. A few more searches turn up similar results. Restricting the set of data I want to search and getting rid of duplicates are basic functionality; Rollyo fails on both counts and won’t get any more of my time.
Rollyo - Java Articles
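As a rough illustration of the two basics mentioned above (restricting results to part of a site and dropping duplicates), here is a minimal Java sketch. It is not how Rollyo or Yahoo work internally; the class name ResultFilter, the method filter, and the example.com URLs are all hypothetical, and it only handles exact duplicate URLs, not near-duplicate pages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: post-filters a ranked list of result URLs.
class ResultFilter {

    // Keeps only URLs that start with one of the allowed prefixes
    // (e.g. a sub-directory of a site) and drops exact repeats.
    public static List<String> filter(List<String> urls, List<String> allowedPrefixes) {
        // LinkedHashSet keeps the original ranking order and ignores duplicates.
        Set<String> kept = new LinkedHashSet<>();
        for (String url : urls) {
            for (String prefix : allowedPrefixes) {
                if (url.startsWith(prefix)) {
                    kept.add(url);
                    break; // one matching prefix is enough
                }
            }
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        // Made-up URLs, purely for illustration.
        List<String> results = Arrays.asList(
                "http://example.com/java/deadlock-detection",
                "http://example.com/db2/concurrency",
                "http://example.com/java/deadlock-detection");
        List<String> allowed = Arrays.asList("http://example.com/java/");
        for (String url : filter(results, allowed)) {
            System.out.println(url);
        }
    }
}
```

A prefix check like this is the simplest possible form of "just the Java section"; a real engine would more likely apply the restriction at crawl or index time rather than on the result list.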
Swicki
Next up is Swicki. The first difference is that they have their own crawler, which means they have a lot more control over the data. The second is that they are community focused, i.e. a group of like-minded people contribute sites to the search engine, rather than you building it up yourself. Setting up the search was simple, and I didn’t have to create an account either (although I did, so I could keep my search engine). Swicki also says you can search just parts of sites without any special configuration; just make sure the directory is included in the URL.
So how did concurrency deadlock detection fare this time? A lot better than with Rollyo. There were plenty of relevant links and no duplicates. Swicki actually covered more sites than I selected, but they appear to be relevant, so that’s fine with me. Was it perfect? No, because some database deadlock information crept in, which I wasn’t interested in. But since they control the data set (remember, they crawl the web themselves), you can personalise the system. They claim it learns from your behaviour. I haven’t used it enough to verify that claim, but what I can see is that for every result you can promote it or its site, or delete the result or the site it comes from. So over time the results will get better.
So first impressions are good. I’ll have to use it a bit longer to see if it does learn. I’ll try to add a search box for it to this site, but since I didn’t make this design and I’m not great with CSS, it might be a while. In the meantime, here’s the link:
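To make the promote/delete idea concrete, here is a minimal Java sketch of how per-result feedback could reorder a result list. This is only a guess at the general shape of such a feature, not Swicki’s actual algorithm; the class FeedbackRanker and its methods are hypothetical, and for brevity it tracks individual URLs only, not whole sites.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: remembers user feedback and applies it to later result lists.
class FeedbackRanker {
    private final Map<String, Integer> promotions = new HashMap<>();
    private final Set<String> deleted = new HashSet<>();

    // Each promotion of a URL bumps its count by one.
    public void promote(String url) {
        promotions.merge(url, 1, Integer::sum);
    }

    // A deleted URL never appears in reranked results again.
    public void delete(String url) {
        deleted.add(url);
    }

    // Drops deleted URLs, then moves more-promoted URLs towards the top.
    public List<String> rerank(List<String> results) {
        List<String> kept = new ArrayList<>();
        for (String url : results) {
            if (!deleted.contains(url)) {
                kept.add(url);
            }
        }
        // List.sort is stable, so URLs with equal promotion counts
        // keep the order the search engine originally gave them.
        kept.sort((a, b) -> Integer.compare(
                promotions.getOrDefault(b, 0),
                promotions.getOrDefault(a, 0)));
        return kept;
    }
}
```

With feedback stored this way, a database deadlock page that keeps creeping in could be deleted once and would stay out of later searches, while promoted Java articles would drift upwards.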
