Debugging Unicode Problems
This page describes what to do in a very specific situation. Namely, you've got some
character data in one place (typically a database) which has to go through various
steps and then ends up being shown to the user (often on a web page). Unfortunately,
some characters aren't being displayed correctly. Due to the many steps involved,
the problem can occur in various places. This page aims to help you find out what's
wrong simply and reliably.
Step 1: Understand the basics of Unicode
If you feel comfortable with Unicode, character encodings, etc., feel free to skip this
step. Basically, you need to know a little bit about what characters are and what
conversions are likely to be applied to them before going much further. See
my article on the subject (and the articles it references) for more information.
Step 2: Try to identify the possible conversions involved
If you can work out all the places where things might be going wrong, it's much easier
to then isolate which one is actually at fault. Also bear in mind not just how you're retrieving
the data, but how the data got there in the first place. (Some problems I've seen
have been due to an old application writing to and reading from the database in an
incorrect way, but the bugs cancelling each other out. No problems occur when it's just
this broken application which accesses the database, but things go wrong when anything
else does.) Steps involved may well include fetching the data from the database,
reading it from a file, sending it across a web connection, or displaying it on the
screen.
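To make the "bugs cancelling each other out" case concrete, here is a minimal sketch of one way it can happen. The BuggyWrite and BuggyRead methods are hypothetical stand-ins for an old application, and the choice of UTF-8 and ISO-8859-1 is an assumption for illustration only; your own mix of encodings may well differ.

```csharp
using System;
using System.Text;

static class CancellingBugsDemo
{
    // Formats each UTF-16 code unit as four hex digits, e.g. "0068 0069".
    public static string ToHex(string value)
    {
        StringBuilder builder = new StringBuilder();
        foreach (char c in value)
        {
            if (builder.Length > 0)
            {
                builder.Append(' ');
            }
            builder.Append(((int)c).ToString("x4"));
        }
        return builder.ToString();
    }

    // Hypothetical buggy writer: the text is encoded as UTF-8, but those
    // bytes are then stored as if they were ISO-8859-1 text.
    public static string BuggyWrite(string original)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(original);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8);
    }

    // Hypothetical buggy reader: the mirror-image mistake, which undoes BuggyWrite.
    public static string BuggyRead(string stored)
    {
        byte[] bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(stored);
        return Encoding.UTF8.GetString(bytes);
    }

    static void Main()
    {
        string original = "caf\u00e9";
        string stored = BuggyWrite(original);
        Console.WriteLine(ToHex(original));          // 0063 0061 0066 00e9
        Console.WriteLine(ToHex(stored));            // 0063 0061 0066 00c3 00a9
        Console.WriteLine(ToHex(BuggyRead(stored))); // 0063 0061 0066 00e9
    }
}
```

The old application sees the right text after its round trip, but the stored value is wrong, so any correctly written application reading the same data sees 00c3 00a9 where it should see 00e9.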
Step 3: Verify the data at each step
The first lesson here is not to trust anything which tries to log
the character data as a sequence of glyphs. Instead, you should log
the character data as a sequence of Unicode values (integers). For
instance, if I had a string containing the word "hello", I would display
it as "0068 0065 006c 006c 006f". (Using hex makes it easier to check values
against the Unicode code charts later.) To achieve this, step through
each character in the string and display the character however you would
display an integer. For instance, here is a method to dump all the
characters in a string to the console:
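    static void DumpString (string value)
    {
        // Each char is a UTF-16 code unit, so a character outside the
        // Basic Multilingual Plane shows up as two values (a surrogate pair).
        foreach (char c in value)
        {
            // Four lower-case hex digits, followed by a space.
            Console.Write("{0:x4} ", (int)c);
        }
        Console.WriteLine();
    }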
Depending on your exact environment, your method of logging will vary, but using
something like the above should give you what you need. My
article on strings gives a more detailed version of this debugging method.
The reason for doing this is that it gets rid of problems with fonts, other encoding
issues, etc. If you can't log even plain ASCII hex digits properly, you're in a world
of trouble anyway - but you may well not be able to log Unicode in a reliable way,
and as you already know you've got some problems on the Unicode front, it's worth
being safe.
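One thing to be aware of when checking logged values against the code charts: a .NET char is a UTF-16 code unit, so a character above U+FFFF is logged by the method above as two values. If your data may contain such characters, a variation along the following lines might be easier to read. This is only a sketch; the DumpCodePoints name is my own invention, and it logs an unpaired surrogate as-is rather than treating it as an error.

```csharp
using System;
using System.Text;

static class CodePointDump
{
    // Returns the code points of value in hex, combining each valid
    // surrogate pair into a single value.
    public static string DumpCodePoints(string value)
    {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < value.Length; i++)
        {
            int codePoint;
            if (char.IsHighSurrogate(value[i])
                && i + 1 < value.Length
                && char.IsLowSurrogate(value[i + 1]))
            {
                codePoint = char.ConvertToUtf32(value[i], value[i + 1]);
                i++; // Skip the low surrogate which has just been consumed.
            }
            else
            {
                // A BMP character, or an unpaired surrogate logged as-is.
                codePoint = value[i];
            }
            if (builder.Length > 0)
            {
                builder.Append(' ');
            }
            builder.Append(codePoint.ToString("x4"));
        }
        return builder.ToString();
    }

    static void Main()
    {
        // U+1D11E (musical symbol G clef) is stored as the pair d834 dd1e.
        Console.WriteLine(DumpCodePoints("a\U0001D11Eb")); // 0061 1d11e 0062
    }
}
```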
Now you need to make sure there's a test case to use. Find some (preferably small)
example of where your application is failing, make sure you know exactly what the result
should be, and then log the actual result at each of your possible problem points. (Some
may be out of your control, but usually if you log as soon as you receive some data and
just before you send some data, you'll find the problem.)
Having logged a problematic string, you should verify whether or not it's what it should be.
This is where the Unicode code charts page
comes in. You can either pick which block you believe the correct character is in, or
you can search for your character alphabetically. Check that each character in the string
has its proper Unicode value. As soon as you find a point in your application flow where
the character data is corrupted, you should investigate that area of the code, find out
why it's being corrupted and fix it. When you've got it right throughout the application
flow, the application should be working properly.
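Checking long strings by eye gets tedious, so once you know what the result should be, you may prefer to let the code find the first difference at each logging point. The following is a rough sketch of that idea; CorruptionFinder and its methods are hypothetical names rather than part of any library, and the comparison is by UTF-16 code unit.

```csharp
using System;

static class CorruptionFinder
{
    // Returns the index of the first UTF-16 code unit at which the two
    // strings differ, or -1 if they are identical.
    public static int FirstDifference(string expected, string actual)
    {
        int shared = Math.Min(expected.Length, actual.Length);
        for (int i = 0; i < shared; i++)
        {
            if (expected[i] != actual[i])
            {
                return i;
            }
        }
        return expected.Length == actual.Length ? -1 : shared;
    }

    // Describes the first difference using hex values only, so the
    // message itself can be logged safely.
    public static string Describe(string expected, string actual)
    {
        int index = FirstDifference(expected, actual);
        if (index == -1)
        {
            return "match";
        }
        string want = index < expected.Length
            ? ((int)expected[index]).ToString("x4") : "(end)";
        string got = index < actual.Length
            ? ((int)actual[index]).ToString("x4") : "(end)";
        return string.Format("index {0}: expected {1}, got {2}", index, want, got);
    }

    static void Main()
    {
        Console.WriteLine(Describe("caf\u00e9", "caf\u00e9"));       // match
        Console.WriteLine(Describe("caf\u00e9", "caf\u00c3\u00a9")); // index 3: expected 00e9, got 00c3
        Console.WriteLine(Describe("caf\u00e9", "caf?"));            // index 3: expected 00e9, got 003f
    }
}
```

Calling Describe with the same expected string at each suspect point in the flow shows the first place where the result stops being "match", which is the area of code to investigate.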
Conclusion
Like so many problems in software engineering, problems with text are usually best
solved with a "divide and conquer" approach. Once you're confident in each step,
you should be able to be confident in the whole. If you come across particularly awkward
examples while working out what's going wrong, I'd strongly advise you to write unit
tests covering them - both as documentation for what can happen, and as a guard against
future regressions.