Archive for the 'Java' Category


Looking for a good Java search engine

Saturday, May 6th, 2006

The title is a bit misleading. I’m not actually looking for a search engine written in Java; if I were, I’d head straight to Nutch. What I’m looking for is a search engine that covers only Java articles and tutorials. The company I work for has software to build vertical search engines (in fact I wrote most of it), so I really should be eating my own dog food. But you know how it is: the cobbler’s children never have any shoes. So while I wait for some servers to free up, I decided to have a go with similar systems that are freely available on the web.

The two sites I’m trying out are Rollyo and Swicki. Both sites let you specify a list of web sites and search across them. The sites I used for my test were:

Rollyo

First up is Rollyo. It’s very simple to set up the search engine; I won’t bother going into the specifics because anyone could figure it out. Rollyo is actually backed by Yahoo! search, and it soon becomes clear that it’s just a site-restricted search in Yahoo. Two of the sites on my list cover many topics and I just wanted a sub-directory of each, but Rollyo searches the whole site, which is not what I want.

My test search is concurrency deadlock detection. The first two results are the same page, there are plenty of other duplicates, and I get non-Java results back, e.g. ‘DB2 for z/OS: DB2 Universal Database concurrency’. So far not so good. A few more searches turn up similar results. Restricting the set of data I want to search and getting rid of duplicates are basic functionality; Rollyo fails on both counts and won’t get any more of my time.

Rollyo - Java Articles

Swicki

Next up is Swicki. First difference: they have their own crawler, which means they have a lot more control over the data. Second, they are community focused, i.e. a group of like-minded people contribute sites to the search engine, rather than you building it up yourself. Setting up the search was simple, and I didn’t have to create an account either (although I did so I could keep my search engine). Swicki also says you can search just parts of sites without any special configuration; just make sure the directory is included in the URL.

So how did concurrency deadlock detection fare this time? A lot better than Rollyo. There were plenty of relevant links, no duplicates, and Swicki actually covered more sites than I selected, but they appear to be relevant so that’s fine with me. Was it perfect? No, because some database deadlock information crept in, which I wasn’t interested in. But since they control the data set (remember they crawl the web themselves) you can personalise the system. They claim it learns from your behaviour. I haven’t used it enough to verify that claim, but what I can see is that for every result you can promote it or its site, or delete the result or the site it comes from. So over time the results should get better.

So first impressions are good. I’ll have to use it a bit longer to see if it does learn. I’ll try to stick a search box for it on this site, but since I didn’t make this design and I’m not great with CSS, it might be a while. In the meantime, here’s the link:

Swicki - Java Articles


Who came up with DOM and XPath in Java?

Monday, April 24th, 2006

I’ve hated the DOM implementation in Java for a long time. Today I used XPath for the first time, and now I hate it too. Up to now I’ve used a collection of utility methods that just iterate over nodes until they find one with a matching tag name and/or attribute set. After my experience today I’m back to them. Seriously, how wordy is this? (Exception handling excluded for ‘brevity’)

DocumentBuilderFactory docfactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docbuilder = docfactory.newDocumentBuilder();

// Assume we've got the file as an InputSource
Document docroot = docbuilder.parse(filestream);

XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xpath.evaluate("/errors", docroot, XPathConstants.NODESET);

int length = nodes.getLength();
for (int i = 0; i < length; i++) {
    Node node = nodes.item(i);

    if (node instanceof Element) {
        Element e = (Element) node;
        // ...
    }
    else {
        // ...
    }
}

You get the idea. The code I was writing was meant to pull out all ‘errors’ blocks, consolidate them, update a count attribute, then replace the old errors blocks with the new one. This meant there were more XPaths, parsing of Integers, then converting them back to Strings; it was ridiculous. To make it even worse, the nodes returned in the NodeList were, as far as I could tell, copies and not the original nodes, so I couldn’t remove them. If I’m reading the API correctly, a document fragment is returned, so to be fair it is documented, but when element.getParentNode().removeChild(element) is failing, it’s hard to get past the frustration and make sense of the docs.

Why can’t I have something like:

Document doc = new Document(...path to XML file...);
List matchingNodes = doc.find("/errors");

for (Element errors : matchingNodes) {
    // ...
}

That’s not too un-Java is it? Okay, the return type of my find method isn’t well defined, but that can be worked around.
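For what it’s worth, something close to that wished-for find method can be layered over the standard API. The following is only a minimal sketch: XmlFinder, parse and find are made-up names for illustration, not part of any library, and the sample XML is invented. It wraps the checked exceptions and filters the NodeList down to Elements so the caller can use a for-each loop.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; the names are illustrative only.
class XmlFinder {

    // Parse an XML string into a DOM Document, wrapping checked exceptions.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Evaluate an XPath expression and keep only the Element matches.
    static List<Element> find(Node context, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, context, XPathConstants.NODESET);
            List<Element> elements = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                if (node instanceof Element) {
                    elements.add((Element) node);
                }
            }
            return elements;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Invented sample document for the demonstration.
        Document doc = parse("<log><errors count='1'/><errors count='2'/></log>");
        for (Element errors : find(doc, "//errors")) {
            System.out.println(errors.getAttribute("count"));
        }
    }
}
```

It doesn’t make the underlying API any less wordy, it just hides the wordiness in one place.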

Why is this API so unwieldy?


Character Sets and Encodings

Saturday, April 1st, 2006

Long gone are the days where each character was represented by a number, or more specifically a byte. You know, when your character routines were simple like:

/* lower: convert c to lower case; ASCII only */
int lower(int c)
{
    if (c >= 'A' && c <= 'Z')
        return c + 'a' - 'A';
    else
        return c;
}

(The C Programming Language, page 43)

Of course that only worked for character sets with consecutive letters, e.g. ASCII, but not EBCDIC.

But even early on I knew this type of function was bad, and that wherever possible you should just use the supplied library routines that take care of all the character set nastiness for you. In fact with Java’s String class, it’s amazing how long you can stay ignorant of character sets and encodings. This is because in the English-speaking world, our characters map to the same bytes in almost all encodings.

I think now’s a good time to take a quick break and go read Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

Now hopefully from that you’ve got three key things:

  1. There is no such thing as plain text. Bytes on a disk are worthless without knowing the character encoding.
  2. Character set != Character encoding
  3. Character != Byte

Plain text was gone the moment someone wanted to store something besides an English character. You shouldn’t assume a ‘text file’ is encoded in US-ASCII. One of the conveniences of the newer character encodings (e.g. ISO-8859-1 and UTF-8) is that they let you get away with this if you’re dealing with English text. But you still shouldn’t do it!

For a long time a character was mapped directly to a byte or a byte sequence, hence character sets and character encodings were one and the same thing. This changed with Unicode (which is what Java uses to represent characters internally), which brought in the concept of a ‘code point’. A code point is a unique number assigned to a character, independent of how it is stored as bytes, e.g.:

Unicode code point     U+0041     U+00DF     U+6771     U+10400
Representative glyph   A          ß          東         𐐀
UTF-32 code units      00000041   000000DF   00006771   00010400
UTF-16 code units      0041       00DF       6771       D801 DC00
UTF-8 code units       41         C3 9F      E6 9D B1   F0 90 90 80

(Table from Supplementary Characters in the Java Platform)

Included in the above table are the code unit representations of each character in the different UTF-x encodings. The ‘x’ is the size of one code unit in bits. UTF-8 and UTF-16 are variable length: a character takes at least one code unit, but may require more. UTF-32 is fixed length, because a single 32-bit unit covers every code point. This brings us to the third point: a character is no longer necessarily represented by one byte, although it can be. This breaks a lot of character handling routines that assume characters are 8 bits long. It’s worth reading about the Unicode 4.0 support in J2SE 1.5 to see how the use of the ‘char’ type is going out of fashion.
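The UTF-8 and UTF-16 rows of the table can be reproduced with a few lines of Java. This is just a sketch to make the code unit idea concrete; the class and method names (CodeUnits, utf8, utf16) are mine, not from the article the table came from.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper; names are made up for this example.
class CodeUnits {

    // UTF-8 code units (bytes) of a single code point, as spaced hex.
    static String utf8(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) hex.append(' ');
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    // UTF-16 code units (Java chars) of a single code point, as spaced hex.
    static String utf16(int codePoint) {
        StringBuilder hex = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (hex.length() > 0) hex.append(' ');
            hex.append(String.format("%04X", (int) unit));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        int[] codePoints = {0x0041, 0x00DF, 0x6771, 0x10400};
        for (int cp : codePoints) {
            System.out.printf("U+%04X  UTF-16: %-10s UTF-8: %s%n",
                    cp, utf16(cp), utf8(cp));
        }
    }
}
```

Note how U+10400 needs two chars in UTF-16, which is exactly why a Java ‘char’ can no longer be treated as a whole character.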

It should be fairly obvious why reading a UTF-8 encoded file as ASCII could produce a lot of garbage, but at the same time your English characters would be fine.

So why does this matter if Java uses Unicode to store Strings? You’d assume it would also have a default encoding, so the following would be standard:

String text = ....;
FileOutputStream fos = new FileOutputStream("/tmp/dump.txt");
fos.write(text.getBytes());

This file could be read back in with a FileInputStream and you’d get the same text each time. This is true if you run the read and write programs on the same machine, but the default character encoding depends on the JVM and operating system, so a file written on one machine can come back garbled on another.

How to find the default character set for your JVM

import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

/**
 * How to determine the default encoding
 */
public class CharacterSet {

    public static void main(String[] args) {
        // in JDK 1.4, defaultEncodingName will typically be "Cp1252"
        // In an Applet, this requires signing for privilege.
        String defaultEncodingName = System.getProperty( "file.encoding" );
        log(defaultEncodingName);

        // in JDK 1.5+, will typically be "windows-1252"
        // First, get the Charset/encoding then convert to String.
        defaultEncodingName = Charset.defaultCharset().name();
        log(defaultEncodingName);

        // I'm told this circumlocution has the nice property you can even use
        // it in an unsigned Applet.
        defaultEncodingName = new OutputStreamWriter( System.out ).getEncoding();
        log(defaultEncodingName);
    }

    private static void log(String msg) {
        System.out.println(msg);
    }
}

Output (IBM 1.5.0 JDK on Linux)

ANSI_X3.4-1968
US-ASCII
ASCII

Clearly a lot of variation! My preference is to specify UTF-8 when I’m reading and writing my own files because I deal mostly with English text and this does save bytes. It also allows the files to be viewed in almost any text reader.
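Here is roughly what specifying the encoding looks like, as a minimal sketch rather than production code. Utf8RoundTrip and roundTrip are names I’ve made up, and it uses the java.nio.file API (Java 7 and later) with a temporary file so it can be run anywhere.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative example; class and method names are made up.
class Utf8RoundTrip {

    // Write text to a temp file as UTF-8, read it back as UTF-8.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("dump", ".txt");
            try {
                // Encode explicitly instead of relying on text.getBytes()
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                // Decode with the same charset the file was written in
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "Stra\u00DFe \u6771";
        System.out.println(original.equals(roundTrip(original)));
    }
}
```

Because both sides name the charset, the result no longer depends on whatever the JVM’s default happens to be.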

But what if you don’t control the bytes, e.g. you download a file from the web? This becomes a bit trickier. Sometimes the server tells you what the encoding is, e.g. in the ‘Content-Type’ header or in a meta tag; sometimes it doesn’t. Thankfully browsers have had to deal with this problem for years, and the Mozilla project has produced character set detectors, which have been ported to Java. Definitely worth looking into if you have to handle text files from unknown sources.
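For the easy case, where the header does tell you, pulling the charset out of a ‘Content-Type’ value might look something like this. It’s a rough sketch under my own assumptions: ContentTypeCharset and fromHeader are invented names, the regex is deliberately simple and doesn’t try to handle every legal header form, and the fallback is whatever the caller chooses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper; names and the simplified regex are assumptions.
class ContentTypeCharset {

    // Matches charset=xyz or charset="xyz", case-insensitively.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Return the charset named in the header, or the fallback if the
    // header is missing, has no charset, or names an unknown one.
    static Charset fromHeader(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromHeader("text/html; charset=ISO-8859-1",
                StandardCharsets.UTF_8));
        System.out.println(fromHeader("text/html", StandardCharsets.UTF_8));
    }
}
```

When the fallback kicks in you are guessing, which is where the detectors mentioned above come in.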


ONJava slipping?

Thursday, February 16th, 2006

Not really, but apparently provocative (misleading) headlines are the way to get on the ‘A-List’. ;) (Note: He’s changed the paragraph about titles so it’s much softer than what I got in my RSS feed).

So what’s wrong with ONJava? It’s their latest article, Integrating Ant with Eclipse, Part 1. For those of you who don’t use Eclipse and Ant, to integrate them you need to do:

  1. Window > Show View > Ant
  2. Drag build.xml from the Package Explorer into the new window pane

And you’re done. How does this warrant an article?

On closer examination it’s an excerpt from an oldish book, and most of it is about setting up Eclipse and writing an Ant build script. But if you’re not already using Eclipse and have an Ant build script, why would you read an article titled ‘Integrating Ant with Eclipse’?


Does logic belong in the database?

Tuesday, February 14th, 2006

Builder UK has done an in-depth interview with David Heinemeier Hansson, Ruby on Rails: The importance of being 1.0. I like Rails and would like to use it more, but was never very happy with its lack of SQL features. From the interview:

Regarding the specifics, it’s no secret that I’m not a big fan of logic in the database. I don’t think the database is an appropriate place to maintain a coherent domain model. And I don’t think you should integrate multiple applications through the database. So if you follow that and shield your database from access of multiple applications, you can move all of that logic you would have put in stored procedures, triggers, and what have you into an object-oriented model that can take advantage of the last 20-plus years of progress in software-development techniques.

That’s a fair point, but in the past I’m pretty sure DHH mentioned foreign keys were unnecessary too, which I found frustrating when I switched from PostgreSQL to MySQL (DreamHost doesn’t support the former).

But now I look back at it, it doesn’t look bad. Rails has several features that make it possible to maintain data integrity and allow you to keep all the details of your model in one place, e.g. has_many and belongs_to in ActiveRecord. Also Rails apps tend to be built from the ground up (rather than built on top of legacy systems), so if you control the only access points to the database, why not limit your checks to one place? There are two things that come to mind that made me think keeping all the data integrity in the source code was a bad thing.

ACS 4.0 Tcl

My first real programming job was at ArsDigita. They had a toolkit known as the ACS (ArsDigita Community System), which at the time was written in Tcl. Tcl is a scripting language, a fairly simple one at that. One of the big new features in ACS 4 was object orientation. I can’t remember how they did OO in Tcl but I do recall the extents they went to to mimic OO features in the database. Object hierarchies were modelled by table hierarchies, with lots of constraints; we even wrote OO data retrieval functions in Oracle PL/SQL. Beyond that there was also a lot of emphasis on reducing the number of SQL queries each web page makes and optimizing them wherever possible. If you’re interested, here’s the full indoctrination.

JDBC

The other thing I’m going to blame is how hard it is to do SQL in Java. It’s just painful. How verbose is this!

Connection c = ... // I won't bother including connection setup code
PreparedStatement ps = null;
ResultSet rs = null;

try {
    ps = c.prepareStatement("select id, name, password from users where id = ?");
    ps.setString(1, id);
    rs = ps.executeQuery();

    if (rs.next()) {
        // Use a new name so it doesn't clash with the id bound to the query above
        String userId = rs.getString(1);
        String name = rs.getString(2);
        String password = rs.getString(3);

        User user = new User(userId, name, password);
        return user;
    }
} 
finally {
    // I won't bother with closing code either
    close(rs);
    close(ps);
}

With Rails you declare your User class as:

class User < ActiveRecord::Base

end

and the lookup code is:

user = User.find(id)

and you’re done. True, ActiveRecord limits you to fairly simple object models, but when the framework makes your life this easy, you can cope. My main point is that using SQL in Java is painful enough that I’d push as much logic as I could down into the database in the form of constraints, cascades, triggers, PL/SQL, etc. so I could simplify my JDBC code.

So does logic belong in the database? Yes and no. If you think your database is going to outlive your application, then it’s probably a good idea to make sure it contains all your data integrity rules. But that’s not always the case; sometimes the database is just a store that’s meaningless without your application, so why duplicate logic? Rails is full of ‘it really doesn’t have to be that hard’ moments. I’m surprised I’m still coming across them.


Enum<E extends Enum<E>>

Tuesday, February 14th, 2006

Just as I begin to get comfortable with Java 5 I come across the class declaration ‘Enum<E extends Enum<E>>’. Now there’s a head scratcher!

Update: Here’s a good explanation of what’s going on: http://madbean.com/2004/mb2004-3

Basically it’s an idiom that lets a base class declare methods whose parameter and return types are the subclass itself. For example, Enum’s compareTo(E) only accepts constants of the same enum type.
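The same trick works outside of Enum. Here’s a small made-up example (SelfTypeDemo, Item and Folder are my names, nothing from the JDK) where a base class method returns the subclass type, so calls can be chained without casts:

```java
// Illustrative example of the self-bounded generic idiom.
class SelfTypeDemo {

    // T is "the concrete subclass of Item", just as E is in Enum<E extends Enum<E>>.
    static abstract class Item<T extends Item<T>> {
        private String label = "";

        // Each subclass returns itself, typed as the subclass.
        protected abstract T self();

        // Declared once in the base class, yet returns the subclass type.
        T label(String newLabel) {
            this.label = newLabel;
            return self();
        }

        String label() {
            return label;
        }
    }

    static class Folder extends Item<Folder> {
        private int children;

        @Override
        protected Folder self() {
            return this;
        }

        Folder children(int count) {
            this.children = count;
            return this;
        }

        int children() {
            return children;
        }
    }

    public static void main(String[] args) {
        // label() comes from Item but returns Folder, so children() can follow it
        Folder folder = new Folder().label("docs").children(3);
        System.out.println(folder.label() + " has " + folder.children() + " children");
    }
}
```

Without the self-referencing bound, label() would return plain Item and the chained children() call wouldn’t compile.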


What does String#length do?

Monday, August 22nd, 2005

I just came across John O’Conner’s ‘How long is your String?’ post. The gist of the post is that String#length in Java doesn’t always return the number of characters in your string: it counts 16-bit code units, so a character outside the Basic Multilingual Plane counts as two. If you’re just writing an English app, you’re fine, but if you plan to i18n your app you have another thing to worry about.

There are plenty of places in my current project where I use string.length() (e.g. making sure a user name and password are long enough, etc.) but now to be sure it needs to be replaced by:

String str = ....;
int len = str.codePointCount(0, str.length());

Of course user name and password lengths might not make sense in a non-alphabetic language anyway.
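To see the difference, here’s a tiny sketch (StringLengths is just a name I picked) comparing the two counts on a string that contains U+10400, which needs a surrogate pair:

```java
// Illustrative comparison of code units versus code points.
class StringLengths {

    // What String#length reports: 16-bit code units.
    static int codeUnits(String s) {
        return s.length();
    }

    // The number of actual Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a' followed by U+10400, written as its surrogate pair
        String s = "a\uD801\uDC00";
        System.out.println("code units:  " + codeUnits(s));
        System.out.println("code points: " + codePoints(s));
    }
}
```

For plain English text the two numbers are always the same, which is why this is so easy to miss.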

At least they keep the definition of String#length consistent, which in case you never bothered to read the Javadoc (like me) is:

Returns the length of this string. The length is equal to the number of 16-bit Unicode characters in the string.

So you can work out the size of the data fairly easily, e.g. to make sure the data is not too big for a database field. And if your database is using the same size for characters you don’t have to do any calculations. But I can see some pretty gnarly bugs coming out of that, so it’s probably best to work in bytes, i.e. you have to do something like:

String str = ....;
int len = str.getBytes().length;
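One caveat: getBytes() with no argument uses the platform default encoding, so the byte count can change from machine to machine. A hedged sketch of the same check with the charset spelled out follows; ByteLengths, encodedLength and fits are invented names, and which charset matches your database is something you’d have to find out for yourself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper; names are made up for this example.
class ByteLengths {

    // Number of bytes the string occupies in the given encoding.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    // Would the encoded string fit in a field of maxBytes bytes?
    static boolean fits(String s, Charset charset, int maxBytes) {
        return encodedLength(s, charset) <= maxBytes;
    }

    public static void main(String[] args) {
        String street = "Stra\u00DFe";
        System.out.println(encodedLength(street, StandardCharsets.ISO_8859_1));
        System.out.println(encodedLength(street, StandardCharsets.UTF_8));
        System.out.println(fits(street, StandardCharsets.UTF_8, 6));
    }
}
```

The same six-character string needs a different number of bytes depending on the encoding, so the check only means something if it uses the encoding the data will actually be stored in.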

If you’re interested in reading more about character sets, Tim Bray has a pretty good article: Characters vs. Bytes.
