Character Sets and Encodings

Long gone are the days when each character was represented by a single number, or more specifically a single byte. You know, when your character routines were as simple as:

/* lower: convert c to lower case; ASCII only */
int lower(int c)
{
    if (c >= 'A' && c <= 'Z')
        return c + 'a' - 'A';
    else
        return c;
}

(The C Programming Language, page 43)

Of course that only worked for character sets with consecutive letters, e.g. ASCII, but not EBCDIC.

But even early on I knew this type of function was bad, and that wherever possible you should just use the supplied library routines, which take care of all the character set nastiness for you. In fact with Java’s String class, it’s amazing how long you can stay ignorant of character sets and encodings. This is because in the English-speaking world, our characters map to the same bytes in almost all encodings.
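As a rough Java counterpart to the C routine above, here is a minimal sketch that leans on the library instead of doing arithmetic on character values. The class name LowerCaseDemo and its helper methods are my own, purely for illustration.

```java
import java.util.Locale;

class LowerCaseDemo {
    // Lower-case a single char using the library's Unicode tables,
    // so no assumption about consecutive letters is needed.
    static char lower(char c) {
        return Character.toLowerCase(c);
    }

    // Lower-case a whole string. Passing an explicit locale keeps the
    // result independent of the machine's default locale.
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(lower('Z'));
        // U+00C4 (A with diaeresis) maps to U+00E4, which the C version would miss
        System.out.println(lower('\u00C4') == '\u00E4');
        System.out.println(lower("HELLO"));
    }
}
```

Unlike the C version, this also handles letters outside A to Z.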

I think now’s a good time to take a quick break and go read Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

Now hopefully from that you’ve got three key things:

  1. There is no such thing as plain text. Bytes on a disk are worthless without knowing the character encoding.
  2. Character set != Character encoding
  3. Character != Byte

Plain text was gone the moment someone wanted to store something besides an English character. You shouldn’t assume a ‘text file’ is encoded in US-ASCII. One of the conveniences of the newer character encodings (e.g. ISO-8859-1 and UTF-8) is that they encode the ASCII characters exactly as ASCII does, so they let you get away with this if you’re dealing with English text. But you still shouldn’t do it!
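To see why English text lets you get away with it, here is a small sketch that encodes the same string three ways and compares the bytes. It assumes Java 7 or later for StandardCharsets; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SameBytesDemo {
    // True if the text encodes to identical bytes in US-ASCII,
    // ISO-8859-1 and UTF-8.
    static boolean sameInAllThree(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, latin1) && Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        // Plain English: the three encodings agree byte for byte.
        System.out.println(sameInAllThree("Hello, world"));
        // One sharp s (U+00DF) is enough to make them disagree.
        System.out.println(sameInAllThree("Stra\u00DFe"));
    }
}
```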

For a long time a character was mapped to a byte or a byte sequence, hence character sets and character encodings were one and the same thing. This changed with Unicode (which is what Java uses to represent characters internally), which brought in the concept of a ‘code point’. A code point is a unique number assigned to a character, independent of how that character is stored as bytes, e.g.:

Unicode code point   Representative glyph   UTF-32 code units   UTF-16 code units   UTF-8 code units
U+0041               A                      00000041            0041                41
U+00DF               ß                      000000DF            00DF                C3 9F
U+6771               東                     00006771            6771                E6 9D B1
U+10400              𐐀                      00010400            D801 DC00           F0 90 90 80

(Table from Supplementary Characters in the Java Platform)
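If you want to check the UTF-16 and UTF-8 columns yourself, something like the following sketch should do it. It prints individual bytes (big-endian for UTF-16), so D801 DC00 shows up as D8 01 DC 00. The class and helper names are mine, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodeUnitsDemo {
    // Encode a single code point in the given charset and return the
    // resulting bytes as space-separated upper-case hex.
    static String hex(int codePoint, Charset charset) {
        String s = new String(Character.toChars(codePoint));
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codePoints = {0x0041, 0x00DF, 0x6771, 0x10400};
        for (int cp : codePoints) {
            System.out.printf("U+%04X  UTF-16BE: %-12s UTF-8: %s%n",
                    cp,
                    hex(cp, StandardCharsets.UTF_16BE),
                    hex(cp, StandardCharsets.UTF_8));
        }
    }
}
```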

Included in the above table are the code units of each character in the different UTF-x encodings. The ‘x’ is the size in bits of one code unit. UTF-8 and UTF-16 are variable length: a character takes one to four 8-bit units in UTF-8, and one or two 16-bit units in UTF-16. UTF-32 is fixed length, because a single 32-bit unit covers every code point. This brings us to the third point: a character is no longer necessarily represented by one byte, although it can be. This breaks a lot of character handling routines that assume characters are 8 bits long. It’s worth reading about the Unicode 4.0 support in J2SE 1.5 to see how the use of the ‘char’ type is going out of fashion, since a single 16-bit char can no longer hold every character.
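Here is a short sketch of what that means for Java code: the length of a String counts 16-bit chars, not characters, so the code point methods added in J2SE 1.5 are the safer way to walk a string. The class name is hypothetical.

```java
class CodePointDemo {
    // Number of 16-bit char units in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "A" followed by U+10400, which needs a surrogate pair in UTF-16.
        String s = "A" + new String(Character.toChars(0x10400));
        System.out.println("chars: " + charCount(s));
        System.out.println("code points: " + codePointCount(s));

        // Step through the string one code point at a time, advancing
        // by one or two chars depending on the code point.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);
        }
    }
}
```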

It should be fairly obvious why reading a UTF-8 encoded file as ASCII could produce a lot of garbage, but at the same time your English characters would be fine.
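A quick way to watch this happen is to encode a string as UTF-8 and then decode the bytes as US-ASCII, as in this sketch (class and method names are mine; StandardCharsets assumes Java 7 or later). The ASCII letters come through untouched, while the bytes of the accented letter do not decode back to the original character.

```java
import java.nio.charset.StandardCharsets;

class Utf8AsAsciiDemo {
    // Encode the text as UTF-8, then (wrongly) decode the bytes as US-ASCII.
    static String decodeAsAscii(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String english = "plain English";
        String accented = "na\u00EFve";   // contains U+00EF, i with diaeresis

        System.out.println(english.equals(decodeAsAscii(english)));
        System.out.println(accented.equals(decodeAsAscii(accented)));
        System.out.println(decodeAsAscii(accented));
    }
}
```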

So why does this matter if Java uses Unicode to store Strings? You’d assume it would also have a default encoding, so the following would be standard:

String text = ....;
FileOutputStream fos = new FileOutputStream("/tmp/dump.txt");
fos.write(text.getBytes());

You’d expect that this file could be read back in with a FileInputStream and that you’d get the same text each time. This is true if you run the read and write programs on the same machine, but the default character encoding depends on the JVM and operating system, so it can differ from one machine to the next.
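To illustrate what goes wrong, this sketch stands in for two machines by passing the ‘default’ encodings explicitly: one charset for the writer and one for the reader. The class and method are hypothetical, and ISO-8859-1 and UTF-8 are just convenient examples of two defaults that disagree.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encode with the writer's charset, then decode with the reader's,
    // as if the bytes had travelled between two machines via a file.
    static String roundTrip(String text, Charset writer, Charset reader) {
        byte[] onDisk = text.getBytes(writer);
        return new String(onDisk, reader);
    }

    public static void main(String[] args) {
        String text = "Stra\u00DFe";

        // Same encoding on both sides: the text survives.
        System.out.println(text.equals(
                roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));

        // Different encodings: the non-ASCII character is lost.
        System.out.println(text.equals(
                roundTrip(text, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));
    }
}
```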

How to find the default character set for your JVM

import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

/**
 * How to determine the default encoding
 */
public class CharacterSet {

    public static void main(String[] args) {
        // in JDK 1.4, defaultEncodingName will typically be "Cp1252"
        // In an Applet, this requires signing for privilege.
        String defaultEncodingName = System.getProperty( "file.encoding" );
        log(defaultEncodingName);

        // in JDK 1.5+, will typically be "windows-1252"
        // First, get the Charset/encoding then convert to String.
        defaultEncodingName = Charset.defaultCharset().name();
        log(defaultEncodingName);

        // I'm told this circumlocution has the nice property you can even use
        // it in an unsigned Applet.
        defaultEncodingName = new OutputStreamWriter( System.out ).getEncoding();
        log(defaultEncodingName);
    }

    private static void log(String msg) {
        System.out.println(msg);
    }
}

Output (IBM 1.5.0 JDK on Linux)

ANSI_X3.4-1968
US-ASCII
ASCII

Clearly a lot of variation! My preference is to specify UTF-8 when I’m reading and writing my own files because I deal mostly with English text and this does save bytes. It also allows the files to be viewed in almost any text reader.
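For what it’s worth, here is roughly how I would spell out UTF-8 explicitly on both the writing and the reading side, so the JVM default never comes into it. This is a sketch assuming Java 8 or later (for java.nio.file and UncheckedIOException); the class and method names are made up.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileDemo {
    // Write the text as UTF-8, regardless of the platform default.
    static void write(Path file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Read the whole file back, decoding it as UTF-8.
    static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Convenience for the example: write to a temp file and read it back.
    static String roundTripThroughTempFile(String text) {
        try {
            Path tmp = Files.createTempFile("dump", ".txt");
            try {
                write(tmp, text);
                return read(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Stra\u00DFe \u6771";
        System.out.println(text.equals(roundTripThroughTempFile(text)));
    }
}
```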

But what if you don’t control the bytes, e.g. you download a file from the web? This becomes a bit trickier. Sometimes the server tells you the encoding, e.g. in the ‘Content-Type’ header or in a meta tag; sometimes it doesn’t. Thankfully browsers have had to deal with this problem for years, and the Mozilla project has produced character set detectors, which have been ported to Java. Definitely worth looking into if you have to handle text files from unknown sources.
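For the easy case, where the server does tell you, a sketch like this pulls the charset parameter out of a Content-Type value and falls back to a caller-supplied default otherwise. The class and method names are hypothetical, the parsing is deliberately simplistic, and the fallback is where a detector library would plug in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {
    // Return the charset named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1", or the fallback if the value
    // has no usable charset parameter.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```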


