Long gone are the days when each character was represented by a number, or more specifically a byte. You know, when your character routines were as simple as:
/* lower: convert c to lower case; ASCII only */
int lower(int c)
{
    if (c >= 'A' && c <= 'Z')
        return c + 'a' - 'A';
    else
        return c;
}
(The C Programming Language, page 43)
Of course that only worked for character sets with consecutive letters, e.g. ASCII, but not EBCDIC.
But even early on I knew this type of function was bad, and that wherever possible you should just use the supplied library routines, which take care of all the character set nastiness for you. In fact with Java’s String class, it’s amazing how long you can remain ignorant of character sets and encodings. This is because in the English-speaking world, our characters map to the same bytes in almost all common encodings.
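As a rough Java counterpart to the C routine above, here is a minimal sketch that hands the work to the library. The class name `LowerCase` and the helper `lower` are made up for illustration; `Character.toLowerCase(int)` is the standard library call (available since J2SE 1.5).

```java
class LowerCase {
    // Library-backed version of lower(): works for any Unicode code point,
    // not just the 26 ASCII capitals.
    static int lower(int codePoint) {
        return Character.toLowerCase(codePoint);
    }

    public static void main(String[] args) {
        // Print code points in hex so the result does not depend on the console encoding.
        int[] samples = {'A', 0x00C4, '1'};   // 'A', A with diaeresis, a digit
        for (int cp : samples) {
            System.out.println(String.format("U+%04X -> U+%04X", cp, lower(cp)));
        }
    }
}
```

Unlike the hand-rolled version, this makes no assumption about letters being consecutive in the character set.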
I think now’s a good time to take a quick break and go read Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Now hopefully from that you’ve got three key things:
- There is no such thing as plain text. Bytes on a disk are worthless without knowing the character encoding.
- Character set != Character encoding
- Character != Byte
Plain text was gone the moment someone wanted to store something besides an English character. You shouldn’t assume a ‘text file’ is encoded in US-ASCII. One of the conveniences of the newer character encodings (e.g. ISO-8859-1 and UTF-8) is that they let you get away with this if you’re dealing with English text. But you still shouldn’t do it!
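To make "Character != Byte" concrete, the following sketch counts how many bytes the same one-character strings occupy under different encodings. The class `ByteCounts` and its helper are hypothetical names; it assumes Java 7 or later for `StandardCharsets`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteCounts {
    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file itself pure ASCII.
        String[] samples = {"A", "\u00DF", "\u6771"};
        Charset[] charsets = {StandardCharsets.UTF_8, StandardCharsets.UTF_16BE};
        for (String s : samples) {
            for (Charset cs : charsets) {
                System.out.println(String.format("U+%04X in %s: %d byte(s)",
                        s.codePointAt(0), cs.name(), encodedLength(s, cs)));
            }
        }
    }
}
```

Only the ASCII letter takes a single byte in UTF-8; the other two characters take two and three.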
For a long time a character was mapped directly to a byte or a byte sequence, hence character sets and character encodings were effectively one and the same thing. This changed with Unicode (which is what Java uses to represent characters internally), which brought in the concept of a ‘code point’. A code point is a unique number assigned to a character, independent of how it is stored as bytes, e.g.:
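| Unicode code point | U+0041 | U+00DF | U+6771 | U+10400 |
| --- | --- | --- | --- | --- |
| Representative glyph | A | ß | 東 | 𐐀 |
| UTF-32 code units | 00000041 | 000000DF | 00006771 | 00010400 |
| UTF-16 code units | 0041 | 00DF | 6771 | D801 DC00 |
| UTF-8 code units | 41 | C3 9F | E6 9D B1 | F0 90 90 80 |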
(Table from Supplementary Characters in the Java Platform)
Included in the above table are the encoded forms of each code point in the various UTF-x encodings. The ‘x’ is the size in bits of one code unit. UTF-8 and UTF-16 are variable length: a character may need more than one code unit (one to four bytes in UTF-8, one or two 16-bit units in UTF-16). UTF-32 always uses exactly 32 bits, because a single unit covers every code point. This brings us to the third point: a character is no longer necessarily represented by one byte, although it can be. This breaks a lot of character handling routines that assume characters are 8 bits long. It’s worth reading about the Unicode 4.0 support in J2SE 1.5 to see how the use of the ‘char’ type is going out of fashion.
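The U+10400 column shows why a single Java ‘char’ is not always a whole character. Here is a small sketch using the code point methods added in J2SE 1.5; the class name `CodePoints` and the constant `DESERET` are illustrative only.

```java
class CodePoints {
    // U+10400 lies outside the 16-bit range, so Java stores it as a surrogate pair.
    static final String DESERET = new String(Character.toChars(0x10400));

    // Number of real characters (code points), as opposed to 16-bit chars.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("chars:       " + DESERET.length());
        System.out.println("code points: " + codePointCount(DESERET));
        System.out.println(String.format("first char:  %04X", (int) DESERET.charAt(0)));
        System.out.println(String.format("code point:  U+%X", DESERET.codePointAt(0)));
    }
}
```

`length()` counts UTF-16 code units, so code that loops over `charAt` sees two halves of one character here.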
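It should be fairly obvious why reading a UTF-8 encoded file as ASCII could produce a lot of garbage, but at the same time your English characters would be fine: UTF-8 encodes the ASCII range as the same single bytes, and only the other characters become multi-byte sequences.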
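A quick way to see this for yourself is to encode a string as UTF-8 and then deliberately decode it with the wrong charset. This is only a sketch: `WrongDecoding` and `decodeUtf8BytesAs` are invented names, and it assumes Java 7 or later for `StandardCharsets`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongDecoding {
    // Encode the text as UTF-8, then decode those bytes with some other charset.
    static String decodeUtf8BytesAs(String text, Charset wrongCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrongCharset);
    }

    public static void main(String[] args) {
        String german = "Stra\u00DFe";   // contains U+00DF, two bytes in UTF-8
        String misread = decodeUtf8BytesAs(german, StandardCharsets.US_ASCII);
        // Show code points rather than glyphs so the console encoding cannot interfere.
        for (int i = 0; i < misread.length(); i++) {
            System.out.print(String.format("U+%04X ", (int) misread.charAt(i)));
        }
        System.out.println();
    }
}
```

The ASCII letters survive, while each byte of the two-byte sequence is decoded on its own and turns into a junk character.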
So why does this matter if Java uses Unicode to store Strings? You’d assume it would also have a default encoding, so the following would be standard:
String text = ....;
FileOutputStream fos = new FileOutputStream("/tmp/dump.txt");
fos.write(text.getBytes());
This file could be read back in with a FileInputStream and you’d get the same text each time. This is true if you run the read and write programs on the same machine, but getBytes() with no argument uses the platform’s default character encoding, which depends on the JVM and operating system.
How to find the default character set for your JVM
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

/**
 * How to determine the default encoding
 */
public class CharacterSet {
    public static void main(String[] args) {
        // in JDK 1.4, defaultEncodingName will typically be "Cp1252"
        // In an Applet, this requires signing for privilege.
        String defaultEncodingName = System.getProperty( "file.encoding" );
        log(defaultEncodingName);

        // in JDK 1.5+, will typically be "windows-1252"
        // First, get the Charset/encoding then convert to String.
        defaultEncodingName = Charset.defaultCharset().name();
        log(defaultEncodingName);

        // I'm told this circumlocution has the nice property you can even use
        // it in an unsigned Applet.
        defaultEncodingName = new OutputStreamWriter( System.out ).getEncoding();
        log(defaultEncodingName);
    }

    private static void log(String msg) {
        System.out.println(msg);
    }
}
Output (IBM 1.5.0 JDK on Linux)
ANSI_X3.4-1968
US-ASCII
ASCII
Clearly a lot of variation! My preference is to specify UTF-8 when I’m reading and writing my own files, because I deal mostly with English text and UTF-8 needs only one byte per character for it. It also allows the files to be viewed in almost any text reader.
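Naming the encoding explicitly on both sides might look something like the sketch below. It assumes Java 7 or later (for `java.nio.file` and `StandardCharsets`); on older JVMs, `getBytes("UTF-8")` and `new String(bytes, "UTF-8")` play the same role. The class `Utf8RoundTrip` and its methods are illustrative names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8RoundTrip {
    // Write the text as UTF-8, regardless of the platform default.
    static void write(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    // Read the bytes back, again stating that they are UTF-8.
    static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Convenience wrapper: write to a temporary file, read it back, clean up.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("dump", ".txt");
            try {
                write(file, text);
                return read(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Stra\u00DFe \u6771";
        System.out.println(text.equals(roundTrip(text)));
    }
}
```

Because neither side consults the default charset, the result should be the same on any JVM and operating system.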
But what if you don’t control the bytes, e.g. you download a file from the web? This becomes a bit trickier. Sometimes the server tells you what the encoding is, e.g. in the ‘Content-Type’ header or in a meta tag; sometimes it doesn’t. Thankfully browsers have had to deal with this problem for years, and the Mozilla project has produced character set detectors, which have been ported to Java. Definitely worth looking into if you have to handle text files from unknown sources.
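For the case where the server does tell you, here is a rough sketch of pulling the charset out of a ‘Content-Type’ header value. `ContentTypeCharset` and `charsetOf` are hypothetical helpers, not part of any library, and the parsing is deliberately simplistic (it ignores edge cases such as semicolons inside quoted values). The fallback is whatever you choose to assume when nothing usable is declared.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {
    // Returns the charset named in a header value such as "text/html; charset=UTF-8",
    // or the fallback when none is declared or the name is unknown to this JVM.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(charsetOf("text/html; charset=UTF-8", fallback));
        System.out.println(charsetOf("text/html", fallback));
    }
}
```

When the header is missing or wrong, this is the point at which a detector like the ones mentioned above would have to take over.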