Archive for the 'Programming' Category


Looking for a good Java search engine

Saturday, May 6th, 2006

The title is a bit misleading. I’m not actually looking for a search engine written in Java; if I were, I’d head straight to Nutch. What I’m looking for is a search engine that covers only Java articles and tutorials. The company I work for has software to build vertical search engines (in fact I wrote most of it), so I really should be eating my own dog food. But you know how it is: the cobbler’s children never have any shoes. So while I wait for some servers to free up, I decided to have a go with similar systems that are freely available on the web.

The two sites I’m trying out are Rollyo and Swicki. Both sites let you specify a list of web sites and search across them. The sites I used for my test were:

Rollyo

First up is Rollyo. It’s very simple to set up the search engine; I won’t bother going into the specifics because anyone could figure it out. Rollyo is actually backed by Yahoo! search, and it soon becomes clear that it’s just a site-restricted search in Yahoo. Two of the sites on my list cover many topics and I just wanted a sub-directory of each, but Rollyo searches the whole site, which is not what I want.

My test search is concurrency deadlock detection. The first two results are the same page, there are plenty of other duplicates, and I get non-Java results back, e.g. ‘DB2 for z/OS: DB2 Universal Database concurrency’. So far not so good. A few more searches turn up similar results. Restricting the set of data I want to search and getting rid of duplicates are basic functionality; Rollyo fails on both counts and won’t get any more of my time.

Rollyo - Java Articles

Swicki

Next up is Swicki. First difference: they have their own crawler, which means they have a lot more control over the data. Second, they are community focused, i.e. a group of like-minded people contribute sites to the search engine, rather than you building it up yourself. Setting up the search was simple, and I didn’t have to create an account either (although I did so I could keep my search engine). Swicki also says you can search just parts of sites without any special configuration; just make sure the directory is included in the URL.

So how did concurrency deadlock detection fare this time? A lot better than Rollyo. There were plenty of relevant links, no duplicates, and Swicki actually covered more sites than I selected, but they appear to be relevant so that’s fine with me. Was it perfect? No, because some database deadlock information crept in, which I wasn’t interested in. But since they control the data set (remember they crawl the web themselves) you can personalise the system. They claim it learns from your behaviour, and I haven’t used it enough to verify that. What I can see is that for every result you can promote it or its site, or delete the result or the site it comes from. So over time the results should get better.

So first impressions are good. I’ll have to use it a bit longer to see if it does learn. I’ll try to stick a search box for it on this site, but since I didn’t make this design and I’m not great with CSS, it might be a while. In the meantime, here’s the link:

Swicki - Java Articles


Cleverest coder?

Tuesday, May 2nd, 2006

I came across a post on the Guardian’s tech blog called Are You Europe’s Cleverest Coder?. Google and TopCoder are running a competition in Europe for the first time. It sounds like a laugh and with a top prize of €2,500, it’s definitely worth having a go, but unfortunately I’m away during the qualifying rounds.

One thing that does bother me though is the label ‘Cleverest coder’. To be fair it is the Guardian’s title, and isn’t used by the competition. Why do I have a problem with it? Well, the score is based solely on time. Okay, that’s not strictly true: your program has to pass the tests first, otherwise you get 0, but the discriminator is the time you took, i.e.:

Total points awarded = MP * (0.3 + (0.7 * TT^2) / (10 * PT^2 + TT^2))

where PT is the time spent coding a problem, TT is the total time allocated for coding all problems, and MP is the maximum points available for that problem.
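
To get a feel for how heavily the clock weighs on the score, here is a minimal sketch in Java. It assumes the scoring formula from TopCoder’s published rules, MP * (0.3 + 0.7 * TT^2 / (10 * PT^2 + TT^2)); the class name, method name and sample numbers (a 500-point problem in a 75-minute coding phase) are made up for illustration.

```java
// Sketch only: the formula is an assumption taken from TopCoder's published
// rules, and the names and sample numbers are illustrative.
final class ScoreSketch {

    /**
     * @param mp maximum points available for the problem
     * @param pt time spent coding the problem (same unit as tt)
     * @param tt total time allocated for coding all problems
     */
    static double points(double mp, double pt, double tt) {
        return mp * (0.3 + (0.7 * tt * tt) / (10 * pt * pt + tt * tt));
    }

    public static void main(String[] args) {
        double tt = 75; // minutes, illustrative
        for (double pt : new double[] {5, 15, 30, 60}) {
            System.out.printf("solved in %2.0f min: %.1f of 500 points%n",
                    pt, points(500, pt, tt));
        }
    }
}
```

Under this formula a passing solution never drops below 30% of the maximum, so the remaining 70% is decided purely by how fast you typed.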

In my experience, the program that was written the quickest is rarely the best, hence I wouldn’t label the winner of this contest the cleverest coder. But I’ll have to take a look at the actual competition before I criticise it too much.


Who came up with DOM and XPath in Java?

Monday, April 24th, 2006

I’ve hated the DOM implementation in Java for a long time. Today I used XPath for the first time, and now I hate it too. Up to now I’ve used a collection of utility methods that just iterate over nodes until they find one with a matching tag name and/or attribute set. After my experience today I’m back to them. Seriously, how wordy is this? (Exception handling excluded for ‘brevity’)

DocumentBuilderFactory docfactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docbuilder = docfactory.newDocumentBuilder();

// Assume we've got the file as an InputSource
Document docroot = docbuilder.parse(filestream);

XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xpath.evaluate("/errors", docroot, XPathConstants.NODESET);

int length = nodes.getLength();
for (int i = 0; i < length; i++) {
    Node node = nodes.item(i);

    if (node instanceof Element) {
        Element e = (Element) node;
        // ...
    }
    else {
        // ...
    }
}

You get the idea. The code I was writing was meant to pull out all ‘errors’ blocks, consolidate them, update a count attribute, then replace the old errors blocks with the new one. This meant there were more XPaths, parsing of Integers, then converting them back to Strings; it was ridiculous. To make it even worse, the nodes returned in the NodeList were, as far as I could tell, copies and not the original nodes, so I couldn’t remove them. If I’m reading the API correctly, a document fragment is returned, so to be fair it is documented, but when element.getParentNode().removeChild(element) is failing, it’s hard to get past the frustration and make sense of the docs.
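
For reference, here is a minimal sketch of that consolidation using only the standard javax.xml APIs. It assumes each ‘errors’ element carries a numeric count attribute, that the merged element belongs directly under the document root, and that the root itself is not an ‘errors’ element; the XML layout, class name and method names are invented for illustration. With the XPath implementation bundled in the JDK, the nodes in the returned NodeList should be the document’s own nodes, so removeChild is expected to work on them; other implementations may behave differently.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not the code from the post.
final class ConsolidateErrors {

    /** Replaces every errors element with one merged element; returns the summed count. */
    static int consolidate(Document doc) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//errors", doc, XPathConstants.NODESET);

        int total = 0;
        for (int i = 0; i < nodes.getLength(); i++) {
            Element old = (Element) nodes.item(i);
            total += Integer.parseInt(old.getAttribute("count"));
            old.getParentNode().removeChild(old); // detach the original node
        }

        Element merged = doc.createElement("errors");
        merged.setAttribute("count", Integer.toString(total));
        doc.getDocumentElement().appendChild(merged);
        return total;
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<report><errors count=\"2\"/><item/><errors count=\"3\"/></report>");
        System.out.println("total = " + consolidate(doc));
        System.out.println("errors elements left: "
                + doc.getElementsByTagName("errors").getLength());
    }
}
```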

Why can’t I have something like:

Document doc = new Document(...path to XML file...);
List matchingNodes = doc.find("/errors");

for (Element errors : matchingNodes) {
    // ...
}

That’s not too un-Java is it? Okay, the return type of my find method isn’t well defined, but that can be worked around.
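
Something close to that wish can be layered over the standard API. The following is a minimal sketch, not a library: XmlFinder and find are hypothetical names, and all it does is copy the Element matches of an XPath expression into a typed List so the enhanced for loop works. Matches that are not elements (text nodes, attributes) are silently skipped, which is one way of working around the return type problem.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical wrapper around the standard XPath API.
final class XmlFinder {

    /** Returns the elements matched by the XPath expression, in document order. */
    static List<Element> find(Node context, String expression)
            throws XPathExpressionException {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, context, XPathConstants.NODESET);
        List<Element> elements = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (nodes.item(i) instanceof Element) {
                elements.add((Element) nodes.item(i));
            }
        }
        return elements;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(
                        "<errors><error id=\"a\"/><error id=\"b\"/></errors>")));
        for (Element error : find(doc, "/errors/error")) {
            System.out.println(error.getAttribute("id"));
        }
    }
}
```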

Why is this API so unwieldy?


Character Sets and Encodings

Saturday, April 1st, 2006

Long gone are the days where each character was represented by a number, or more specifically a byte. You know, when your character routines were simple like:

/* lower: convert c to lower case; ASCII only */
int lower(int c)
{
    if (c >= 'A' && c <= 'Z')
        return c + 'a' - 'A';
    else
        return c;
}

(The C Programming Language, page 43)

Of course that only worked for character sets with consecutive letters, e.g. ASCII, but not EBCDIC.

But even early on I knew this type of function was bad, and that wherever possible you should just use the supplied library routines that take care of all the character set nastiness for you. In fact with Java’s String class, it’s amazing how long you can stay ignorant of character sets and encodings. This is because in the English-speaking world, our characters map to the same bytes in almost all encodings.
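
To make the difference concrete, here is a rough Java sketch comparing an ASCII-only routine in the style of the C function with the library call. asciiLower is a hypothetical helper written for this comparison; Character.toLowerCase is the standard library method.

```java
// Illustrative comparison; asciiLower is a made-up helper.
final class LowerCaseSketch {

    /** ASCII-only conversion: correct for 'A'..'Z' and nothing else. */
    static char asciiLower(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + 'a' - 'A') : c;
    }

    public static void main(String[] args) {
        char[] samples = {'Q', '\u00C9', '\u0416'}; // Q, E with acute, Cyrillic Zhe
        for (char c : samples) {
            System.out.println(c + " -> ascii: " + asciiLower(c)
                    + ", library: " + Character.toLowerCase(c));
        }
    }
}
```

The hand-rolled version leaves anything outside A to Z untouched, while the library routine knows the lower-case forms of the accented and Cyrillic letters.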

I think now’s a good time to take a quick break and go read Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

Now hopefully from that you’ve got three key things:

  1. There is no such thing as plain text. Bytes on a disk are worthless without knowing the character encoding.
  2. Character set != Character encoding
  3. Character != Byte

Plain text was gone the moment someone wanted to store something besides an English character. You shouldn’t assume a ‘text file’ is encoded in US-ASCII. One of the conveniences of the newer character encodings (e.g. ISO-8859-1 and UTF-8) is that they let you get away with this if you’re dealing with English text. But you still shouldn’t do it!

For a long time a character was mapped directly to a byte or a byte sequence, hence character sets and character encodings were one and the same thing. This changed with Unicode (which is what Java uses to represent characters internally), which brought in the concept of a ‘code point’. A code point is a unique number assigned to a character, independent of how it is stored as bytes, e.g.:

Unicode code point     U+0041     U+00DF     U+6771     U+10400
Representative glyph   A          ß          東         𐐀
UTF-32 code units      00000041   000000DF   00006771   00010400
UTF-16 code units      0041       00DF       6771       D801 DC00
UTF-8 code units       41         C3 9F      E6 9D B1   F0 90 90 80

(Table from Supplementary Characters in the Java Platform)

Included in the above table are the byte representations of the characters in the various UTF-x encodings. The thing to note is that UTF-8 and UTF-16 are variable length: the ‘x’ is the size in bits of one code unit, which is the smallest amount of space a character can occupy, but a character may need more than one code unit. UTF-32 is the exception: it always uses exactly 32 bits because that covers all the code points. This brings us to the third point: a character is no longer necessarily represented by one byte, although it can be. This breaks a lot of character handling routines that assume characters are 8 bits long. It’s worth checking out the Unicode 4.0 support in J2SE 1.5 to see how the use of the ‘char’ type is going out of fashion.
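
The encoded forms are easy to check from Java itself. This is a minimal sketch using only standard library calls (String.getBytes with StandardCharsets, String.codePointCount, Character.toChars); the class name and the hex helper are made up for the example, and it assumes a Java 7 or later runtime. It prints individual bytes, so the UTF-16 surrogate pair for U+10400 shows up as four bytes rather than two 16-bit units.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: prints the encoded form of a few code points.
final class CodeUnits {

    /** Hex dump of the text encoded in the given charset, e.g. "C3 9F". */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codePoints = {0x0041, 0x00DF, 0x6771, 0x10400};
        for (int cp : codePoints) {
            String s = new String(Character.toChars(cp));
            System.out.printf("U+%04X  chars=%d  code points=%d  UTF-16=%s  UTF-8=%s%n",
                    cp, s.length(), s.codePointCount(0, s.length()),
                    hex(s, StandardCharsets.UTF_16BE), hex(s, StandardCharsets.UTF_8));
        }
    }
}
```

For U+10400 the String has a length of 2 but contains a single code point, which is exactly the case that trips up code assuming one char per character.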

It should be fairly obvious why reading a UTF-8 encoded file as ASCII could produce a lot of garbage, but at the same time your English characters would be fine.
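
Here is a small sketch of that failure, assuming a Java 7 or later runtime for StandardCharsets. The class and method names are invented for the example. It encodes a word as UTF-8 and then decodes the same bytes as US-ASCII, printing the resulting characters as code points so the console’s own encoding doesn’t muddy the picture.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: decode UTF-8 bytes with the wrong charset.
final class WrongCharset {

    /** Encodes text as UTF-8, then decodes the bytes as if they were US-ASCII. */
    static String utf8ReadAsAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "Stra\u00DFe": every letter except the sharp s is plain ASCII.
        String garbled = utf8ReadAsAscii("Stra\u00DFe");
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

The ASCII letters come back intact, while the bytes that made up the sharp s are replaced with U+FFFD, the Unicode replacement character.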

So why does this matter if Java uses Unicode to store Strings? You’d assume it would also have a default encoding, so the following would be standard:

String text = ....;
FileOutputStream fos = new FileOutputStream("/tmp/dump.txt");
fos.write(text.getBytes());

You might expect that this file could be read back in with a FileInputStream and give you the same text every time. That holds if you run the read and write programs on the same machine, but the default character encoding depends on the JVM and operating system, so it can break as soon as the file moves to another machine.

How to find the default character set for your JVM

import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

/**
 * How to determine the default encoding
 */
public class CharacterSet {

    public static void main(String[] args) {
        // in JDK 1.4, defaultEncodingName will typically be "Cp1252"
        // In an Applet, this requires signing for privilege.
        String defaultEncodingName = System.getProperty( "file.encoding" );
        log(defaultEncodingName);

        // in JDK 1.5+, will typically be "windows-1252"
        // First, get the Charset/encoding then convert to String.
        defaultEncodingName = Charset.defaultCharset().name();
        log(defaultEncodingName);

        // I'm told this circumlocution has the nice property you can even use
        // it in an unsigned Applet.
        defaultEncodingName = new OutputStreamWriter( System.out ).getEncoding();
        log(defaultEncodingName);
    }

    private static void log(String msg) {
        System.out.println(msg);
    }
}

Output (IBM 1.5.0 JDK on Linux)

ANSI_X3.4-1968
US-ASCII
ASCII

Clearly a lot of variation! My preference is to specify UTF-8 when I’m reading and writing my own files because I deal mostly with English text and this does save bytes. It also allows the files to be viewed in almost any text reader.
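
Specifying the encoding explicitly looks something like this. It is a minimal sketch assuming a Java 7 or later runtime (java.nio.file and StandardCharsets); the class name, method names and temporary file are placeholders for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: never rely on the platform default encoding.
final class Utf8RoundTrip {

    /** Writes the text as UTF-8 bytes, whatever the JVM default is. */
    static void write(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    /** Reads the file back, decoding the bytes as UTF-8. */
    static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("dump", ".txt");
        try {
            String text = "Stra\u00DFe \u6771";
            write(file, text);
            System.out.println("round trip ok: " + read(file).equals(text));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Because both sides name the charset, the result no longer depends on which machine or JVM runs the reader.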

But what if you don’t control the bytes, e.g. you download a file from the web? This becomes a bit trickier. Sometimes they tell you what the encoding is, e.g. in the ‘Content-Type’ header or in a meta tag; sometimes they don’t. Thankfully browsers have had to deal with this problem for years, and the Mozilla project has produced character set detectors, which have been ported to Java. Definitely worth looking into if you have to handle text files from unknown sources.
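
For the easy case, where the server does tell you, pulling the charset out of a ‘Content-Type’ value might look like the sketch below. ContentTypeCharset and charsetOf are hypothetical names, the parsing is deliberately naive, and the choice of fallback is an assumption you would have to make for your own data. It does no detection at all; that is the job of the detectors mentioned above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: naive parsing of a Content-Type header value.
final class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, or the fallback if none is usable. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" parameter.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```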


Rails 1.1 Follow Up

Wednesday, March 29th, 2006

In the end DreamHost uninstalled Rails 1.1 and all other dependencies. An odd move if you ask me, since it involved more work than fixing the upgrade. I think the main point of contention was that Typo (a Ruby blogging program) doesn’t work with 1.1.

Before that I did manage to get all my sites working properly. There are three main ways to do this:

  1. rake freeze_gems
  2. Copy Rails 1.0 to your vendor directory:
    svn export http://dev.rubyonrails.org/svn/rails/tags/rel_1-0-0 rails
    
    Or do it manually as laid out here.
  3. Or insert the appropriate lines from the wiki to specify certain versions of Rails and its components. This requires your host to have older versions of the gems. A few caveats:
    • Make sure you have require 'rubygems'
    • Make sure the dependencies come before any lines that need them, e.g.:
      ActiveRecord::Base.configurations = File.open("#{RAILS_ROOT}/config/database.yml") { |f| YAML::load(f) }
      
    • Comment out the default require lines for Rails

Also make sure you kill any old Ruby processes so the changes can take effect. To be honest I used a combination of 2 and 3. I’m fairly certain only one made a difference, and I know which one, but since the site is live and is now working, I’m very reluctant to change anything.

As for future Rails hosts, my shortlist is:
  1. OCS Solutions
  2. Planet Aragon
  3. and a distant third, TextDrive

I think I’ll go with OCS Solutions, because they seem very serious about Rails, they support lighttpd, I couldn’t find any bad press about them and they’re cheap. DreamHost is good for testing Rails apps, but I definitely wouldn’t recommend them for client hosting.


Rails 1.1 Pain

Wednesday, March 29th, 2006

DreamHost upgraded to Rails 1.1 last night, which broke my sites. Of course this was just a few hours after I announced them, making me look like a complete tit. Strictly speaking I should have locked my version of Rails to prevent this sort of thing happening, but a warning would have been nice. I doubt they would have upgraded to an incompatible version of PHP without mentioning it. Actually the problem is not so much the upgrade as the fact that it was a partial upgrade. They forgot to upgrade activerecord, so no Rails sites work. So far it’s been over 10 hours without a word from their support team. All they need to do is roll out one gem across their servers.

I managed to get one site back up but the other is still down. DreamHost have always been a bit slow when it comes to Rails performance, so I’m on the lookout for a new host for my Rails sites. A quick search shows users on TextDrive (the official RoR host) went through a similar unannounced upgrade. If you know of any solid Rails hosting company, let me know.


Typo Sidebar - CalendarHelper

Wednesday, March 8th, 2006

One of the sidebar plugins I wrote needed a calendar. I decided to use Jeremy Voorhis’s calendar-helper. There are examples on his page of how to use it so I won’t go into those details here. This post is about how to get it working with Typo.

It proved fiddly because putting the file in the ‘components’ directory made it visible in my controller class (with require 'calendar-helper' and include CalendarHelper) but not in ‘content.rhtml’. After a lot of Googling I found that the component needed to be installed as a global view helper. Those instructions aren’t the clearest, so here’s what you need to do:

  1. Install calendar_helper.rb to /typo/vendor/plugins/calendar_helper/lib
  2. In calendar_helper.rb change module CalendarHelper to module ActionView::Helpers::CalendarHelper
  3. At the bottom of calendar_helper.rb (after the last end) add
    ActionView::Base.send(:include, ActionView::Helpers::CalendarHelper)
  4. Create a file, ‘/typo/vendor/plugins/calendar_helper/init.rb’, and put in it:
    require 'calendar_helper'
    

And after those four easy steps the ‘calendar’ function is now available in ‘content.rhtml’.


Typo Sidebar Tutorial

Wednesday, March 8th, 2006

I’ve been playing around a bit recently with Typo, a Rails blogging program. The main reason I chose Typo is that I wanted to add some news features to a golf league manager I’ve also written in Rails. Merging the two code bases is probably a bit much at this stage, but I did want to add something to Typo to integrate the two. The best way to do that is with sidebar components. This is a brief tutorial on how to write a sidebar component. In my examples I’ll use the category sidebar from Typo.

Read the rest of this entry »


ONJava slipping?

Thursday, February 16th, 2006

Not really, but apparently provocative (misleading) headlines are the way to get on the ‘A-List’. ;) (Note: He’s changed the paragraph about titles so it’s much softer than what I got in my RSS feed).

So what’s wrong with ONJava? It’s their latest article, Integrating Ant with Eclipse, Part 1. For those of you who don’t use Eclipse and Ant, to integrate them you need to do:

  1. Window > Show View > Ant
  2. Drag build.xml from the Package Explorer into the new window pane

And you’re done. How does this warrant an article?

On closer examination it’s an excerpt from an oldish book, and most of it is about setting up Eclipse and writing an Ant build script. But if you’re not already using Eclipse and have an Ant build script, why would you read an article titled ‘Integrating Ant with Eclipse’?


Does logic belong in the database?

Tuesday, February 14th, 2006

Builder UK has done an in-depth interview with David Heinemeier Hansson, Ruby on Rails: The importance of being 1.0. I like Rails and would like to use it more, but was never very happy with its lack of SQL features. From the interview:

Regarding the specifics, it’s no secret that I’m not a big fan of logic in the database. I don’t think the database is an appropriate place to maintain a coherent domain model. And I don’t think you should integrate multiple applications through the database. So if you follow that and shield your database from access of multiple applications, you can move all of that logic you would have put in stored procedures, triggers, and what have you into an object-oriented model that can take advantage of the last 20-plus years of progress in software-development techniques.

That’s a fair point, but in the past I’m pretty sure DHH mentioned foreign keys were unnecessary too, which I found frustrating when I switched from PostgreSQL to MySQL (DreamHost doesn’t support the former).

But now I look back at it, it doesn’t look bad. Rails has several features that make it possible to maintain data integrity and allow you to keep all the details of your model in one place, e.g. has_many and belongs_to in ActiveRecord. Also Rails apps tend to be built from the ground up (rather than built on top of legacy systems), so if you control the only access points to the database, why not limit your checks to one place? There are two things that come to mind that made me think keeping all the data integrity in the source code was a bad thing.

ACS 4.0 Tcl

My first real programming job was at ArsDigita. They had a toolkit known as the ACS (ArsDigita Community System), which at the time was written in Tcl. Tcl is a scripting language, and a fairly simple one at that. One of the big new features in ACS 4 was object orientation. I can’t remember how they did OO in Tcl but I do recall the lengths they went to to mimic OO features in the database. Object hierarchies were modelled by table hierarchies, with lots of constraints; we even wrote OO data retrieval functions in Oracle PL/SQL. Beyond that there was also a lot of emphasis on reducing the number of SQL queries each web page makes and optimizing them wherever possible. If you’re interested, here’s the full indoctrination.

JDBC

The other thing I’m going to blame is how hard it is to do SQL in Java. It’s just painful. How verbose is this!

Connection c = ...; // I won't bother including connection setup code
PreparedStatement ps = null;
ResultSet rs = null;

try {
    ps = c.prepareStatement("select id, name, password from users where id = ?");
    ps.setString(1, id); // id is the lookup key supplied by the caller
    rs = ps.executeQuery();

    if (rs.next()) {
        // Named userId so it doesn't clash with the id variable used above
        String userId = rs.getString(1);
        String name = rs.getString(2);
        String password = rs.getString(3);

        User user = new User(userId, name, password);
        return user;
    }
}
finally {
    // I won't bother with closing code either
    close(rs);
    close(ps);
}

With Rails you declare your User class as:

class User < ActiveRecord::Base

end

and the lookup code is:

user = User.find(id)

and you’re done. True, ActiveRecord limits you to fairly simple object models, but when the framework makes your life this easy, you can cope. My main point is that using SQL in Java is painful enough that I’d push as much logic as I could down into the database in the form of constraints, cascades, triggers, PL/SQL, etc. so I could simplify my JDBC code.

So does logic belong in the database? Yes and no. If you think your database is going to outlive your application, then it’s probably a good idea to make sure it contains all your data integrity rules. But that’s not always the case; sometimes the database is just a store that’s meaningless without your application, so why duplicate logic? Rails is full of ‘it really doesn’t have to be that hard’ moments. I’m surprised I’m still coming across them.
