
Problems Galore

 

Things can and will go wrong. Everybody makes mistakes. My own view is that I do not give a hoot about who makes what mistake, as long as it is caught before end users or clients get their hands on the software - but I have found that this view is not necessarily shared in every company or group I have worked with over the last two decades.

If I had to summarize the evolution in the software industry with regard to internationalization and localization over the last twenty years in two paragraphs, it would go like this.


On the positive side, programming languages, libraries and frameworks have become better at doing internationalization. Translators are better trained and use better tools. For some time I was responsible for some of the latter, including the translation memory tool you know as SDL X. And our friends in England published a short article I wrote entitled “A DIY translation manager”, which described how to emulate the MS Word user interface of TRADOS with a simple MS Access database backend. XML has come along and taken much of the bite out of code pages and character sets. And Unicode is finally unifying the world.

On the negative side, XML has come along and, together with greater use of scripting and the internet, added to what I call the “layer problem”. More than ever, user interface strings can and will be used across different software components that do not handle strings in the same way. You may end up with a broken character in a database, you may see a square box or question marks on a web page or in a command window. You may get insidious runtime errors because of a bad use of single or double quotes in javascript. And, as much as I hate to say this, documentation has gotten worse even as it has improved. Automation of documentation for C# and Java, for example, is now largely in the hands of individual developers, and that means very different quality within even small components. Last but not least, there is the proliferation of devices, from mobile phones to medical devices large and small, various pads to point of sale terminals and radio frequency identification chips.

While many people continue to argue about the correct definitions of internationalization and localization or expect erudite explanations of character set, encoded character set and font, I personally focus on “transition”. Every single time a piece of text (a string) is handed off to some other component, there is the potential for a problem (a bug). Once you think in these terms, much of the work becomes easier. The concept that much of internationalization is about transition makes the need for good pseudo-translation pretty obvious. Just as it makes good QA obvious.

The following is a collection of snappy headlines with examples of common problems.

 

you can quote me on this

It is probably fair to say that the cumulative cost of single and double ASCII quotes over the past 30 years or so is on the order of millions of dollars across the software industry.
From \'escapes\' that do not survive the translation process, to a French year 2000 (l'an 2000) that becomes a lan 2000 in the ui, and on to javascript errors that surface when a customer hits a page - ASCII quotes (U+0027, U+0022) continue to surprise.
There is a very simple solution for web pages: use smart quotes (U+2018 through U+201D). See
ASCII and Unicode quotation marks.
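To make the bite concrete, here is a small sketch (plain Node-flavored javascript, with made-up strings) of how an innocent apostrophe breaks generated script, and two ways around it:

```javascript
// A French translation containing an ASCII apostrophe (U+0027):
const translated = "l'an 2000";

// Naively generating inline javascript with single-quoted literals
// produces broken code: alert('l'an 2000');
const generated = "alert('" + translated + "');";

// One defensive option: escape ASCII apostrophes before embedding.
const escaped = "alert('" + translated.replace(/'/g, "\\'") + "');";

// Or sidestep the problem entirely with a curly apostrophe (U+2019),
// which has no syntactic meaning in javascript or HTML attributes.
const curly = "alert('" + translated.replace(/'/g, "\u2019") + "');";
```

Neither option is free - escaping must survive every later processing layer, which is exactly the transition problem again.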

 

when good strings go bad: Mojibake

Instead of me laboriously trying to explain and find examples of character corruption, Wikipedia will do the explaining: Mojibake around the world.

 

garbled download filenames

A variation on a still common theme. It is common in Firefox if you do not use MIME encoding with a base64 encoded filename in the CONTENT-DISPOSITION header. The MIME encoded string has the form =?charset?B?...?=, where B stands for base64. So, it should look more or less like this: CONTENT-DISPOSITION: inline; filename==?utf-8?B?44GC44GE44GG44GI44GK44GL44GN44GP44GR44GTLnhscw==?=
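Here is a minimal sketch of building such a header value with Node's Buffer (the function name is my own):

```javascript
// Build an RFC 2047 style encoded-word for a UTF-8 filename.
// The =?utf-8?B?...?= wrapper tells the browser the charset and
// that the payload is base64 ("B") encoded.
function encodeFilename(name) {
  const b64 = Buffer.from(name, 'utf8').toString('base64');
  return '=?utf-8?B?' + b64 + '?=';
}

// あいうえおかきくけこ.xls
console.log('CONTENT-DISPOSITION: inline; filename=' +
            encodeFilename('あいうえおかきくけこ.xls'));
```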

 

for entities' sake

Do not use entity references unless you absolutely have to. Which means, use entity references only to escape markup inside xml, for example: <mystring>This is a &lt;b&gt; bold &lt;/b&gt; statement</mystring>

Why you should never use them to output text to a web page:

The page source is no longer human readable for many languages

They can easily mask a bad "content-type". If you do not set the charset and the server defaults to iso-8859-1, your entities will display correctly, but:

Form submissions may come back in the wrong encoding

Other software that consumes the page may run into errors
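If you only need to protect markup, a tiny escaper like this sketch (the function name is my own) covers it - and leaves the Japanese, Greek, etc. alone for the charset declaration to handle:

```javascript
// Escape only the characters that are markup in xml/HTML;
// everything else, including non-ASCII text, passes through raw.
// Pair this with a correct charset declaration, not numeric entities.
function escapeMarkup(s) {
  return s.replace(/&/g, '&amp;')   // & first, or it double-escapes
          .replace(/</g, '&lt;')
          .replace(/>/g, '&gt;');
}
```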

 

to sort or to collate

Displaying sorted lists of text can be done in different ways. If your entries come from a database, you can rely on whatever collation is set for it, you can show the list in its binary sort order, or you can try to sort in accordance with the language of the user. You can download a simple web application for text sorting right here.
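A quick javascript illustration of the difference, using Intl.Collator (available in any modern browser or Node build with ICU data):

```javascript
// Binary (code unit) sort vs. language-aware collation.
const words = ['Zebra', 'Äpfel', 'Apfel'];

// Array.prototype.sort compares UTF-16 code units by default,
// so 'Äpfel' (Ä is U+00C4) lands after 'Zebra'.
const binary = [...words].sort();

// Intl.Collator applies the sorting rules of a language,
// which puts 'Äpfel' right next to 'Apfel'.
const collator = new Intl.Collator('de');
const collated = [...words].sort(collator.compare);
```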

 

duplicate translations

Issue: files coming back from translation have duplicate translations. In other words, you have two different English strings but they have the same translation.
Solution: The generic advice is to never make your ui dependent on uniqueness of strings.

There are instances where this is correct behavior. An over-simplified example would be something like this:
en1=Goto -> translation1=Gehe nach
en2=Go to -> translation2=Gehe nach

There are other instances where the English is artificially "unique". See the article on currencies for examples.

And then there are instances where it is simply a translation error.

The problem is compounded when a data loading utility is used and uniqueness constraints are enforced. The solution for uniqueness constraints is:
You need to write or provide a utility that is file based and that can check a file for uniqueness violations before import. It is critical that all violations in a given file are found in a single run. Many tools, xml parsers for example, will stop at the first error, and this can lead to numerous, time-consuming iterations.
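A sketch of such a check in javascript - the key=value parsing and the function name are my own assumptions, adapt them to your file format:

```javascript
// Report ALL duplicate keys in one pass, instead of stopping
// at the first violation the way many parsers do.
function findDuplicateKeys(text) {
  const seen = new Map();
  const duplicates = [];
  for (const line of text.split('\n')) {
    const eq = line.indexOf('=');
    if (eq < 0) continue;                 // skip non-entry lines
    const key = line.slice(0, eq).trim();
    if (seen.has(key)) duplicates.push(key);
    else seen.set(key, true);
  }
  return duplicates;
}
```

Run it before import, hand the full list back in one go, and you save yourself the slow fix-one-reload-repeat cycle.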

 

(ir)regular expressions

Issue: Regular expressions sometimes fail with non-ASCII text. Character classes such as \w or [a-z] quietly exclude accented and non-Latin letters.

Solution: While regular expressions have their uses, a slightly more verbose approach to coding may well be better in the long run for shipped products. "Multi-byte" regular expression bugs need particularly serious testing.
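A small javascript illustration of how an ASCII-only character class quietly truncates text:

```javascript
const word = 'café';

// \w is [A-Za-z0-9_], so the match stops before the é.
const asciiMatch = word.match(/\w+/)[0];

// Unicode property escapes (with the /u flag) match letters
// in any script.
const unicodeMatch = word.match(/\p{L}+/u)[0];
```

The nasty part is that the ASCII version does not error out - it just silently returns less than you expect.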

 

string theory and math

In Java, JavaScript and .NET, you can expect a "char" to hold one 16-bit UTF-16 code unit. Since Unicode supplementary characters need two of these, make sure your string math is right.
.NET: A single Char object usually represents a single code point; that is, the numeric value of the Char equals the code point. However, a code point might require more than one encoded element. For example, a Unicode supplementary code point (a surrogate pair) is encoded with two Char objects.
To work with each Unicode character instead of each Char object, use the System.Globalization.StringInfo class.
Java: "char" and code point (see String.codePointCount and related methods) are the respective equivalents.
JavaScript: see once upon a time (surrogates)
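A quick javascript illustration of the same arithmetic trap:

```javascript
// .length counts 16-bit code units, not characters.
const s = 'a\u{1F600}';            // 'a' plus one emoji (U+1F600)

const units = s.length;            // 3: the emoji is a surrogate pair
const codePoints = [...s].length;  // 2: spread iterates code points
```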

How about some Oracle PL/SQL math:

CREATE OR REPLACE PROCEDURE mgmt$mask_sendMsg (msg IN VARCHAR2) IS
  msg1 VARCHAR2(1020);
  len  INTEGER := length(msg);
  i    INTEGER := 1;
BEGIN
  dbms_output.enable (1000000);
  LOOP
    msg1 := SUBSTR (msg, i, 255);
    dbms_output.put_line (msg1);
    len := len - 255;
    i := i + 255;
    EXIT WHEN len <= 0;
  END LOOP;
END mgmt$mask_sendMsg;
/

 

Issue: This procedure does not work properly if the input string contains multi-byte characters.

We get a "string buffer size is too small" error from the call to dbms_output.put_line(). It works if you change the substr length to 100.

Solution: substr and length count characters, while buffer limits are measured in bytes, so 255 multi-byte characters can overflow a byte-sized limit. Find the Oracle reference "SQL Functions for Different Length Semantics" (substrb, lengthb and friends).

 

 

databases

Issue: Text stored in a MS SQL table using "N" datatypes fails to import into a DB2 utf-8 table.
Solution: Here is a rule of thumb for using databases to store text in "char/varchar" columns:

If you only support Oracle and/or DB2: use a Unicode database (set the character set for the whole database to utf-8)

If you support Oracle and MS SQL (and DB2): use N datatypes (including clobs) and "graphic" types for DB2. This will make the RDBMS behavior more consistent.

Since every commercial RDBMS comes with its own set of challenges and documentation, go and check there for details you may never need.

 

pseudo translation

This involves adding language specific text (something Japanese, French, etc.) to extracted English strings and then testing "Japanese", "French" and the like. One of the most widely known benefits is the detection of hardcoded strings. But, equally important are:

detection of encoding problems (see "when good strings go bad" above)

detection of layout problems

detection of bad programming practices (for example, using an extracted string as a variable name)

Inadequate pseudo translation: never add only ASCII characters to a string. For example, the string 'myenstring=enter a number' would be badly "pseudo translated" as something like this:
myenstring=[PS] enter a number
myenstring=$$ enter a number $$
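For contrast, here is one possible recipe for an adequate pseudo-translation (the markers and the expansion factor are my own choices, not a standard):

```javascript
// Add non-ASCII characters AND simulate text expansion.
function pseudoTranslate(s) {
  // ~30% padding mimics the growth of real translations
  const padding = 'X'.repeat(Math.ceil(s.length * 0.3));
  // Japanese characters flush out encoding bugs; the brackets
  // make truncation at either end visible in the ui
  return '[日本 ' + s + ' ' + padding + ']';
}
```

For example, pseudoTranslate('enter a number') yields '[日本 enter a number XXXXX]' - if you ever see '[?? enter a number XXX' on screen, you have found an encoding bug and a truncation bug in one glance.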

 

creative placeholders

In general, placeholders should be simple, numeric tokens of the type {0}, {1}, {2} and the like. Unfortunately, there are "standards" with a long history (such as the %s, %d from C/C++) as well as various well-intentioned, more recent approaches. Do not get creative, for example: The Accounting Pay Period yyy / zzz for the aaa entry is not open.
If your product has old or proprietary placeholders, be prepared to mitigate bugs. One way to do this would be to split one message into two in order to get around the need to re-order placeholders in a translation.
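A sketch of why numbered tokens matter: they let a translation re-order the arguments without any code change (the format function below is a toy of my own, not a library API):

```javascript
// Substitute numbered placeholders ({0}, {1}, ...) from an array.
function format(template, args) {
  return template.replace(/\{(\d+)\}/g, (match, index) => args[Number(index)]);
}

// English, and a hypothetical translation that swaps the arguments:
const en = format('{0} of {1}', ['3', '10']);        // '3 of 10'
const reordered = format('{1} / {0}', ['3', '10']);  // '10 / 3'
```

With %s-style placeholders, the second template would silently print the arguments in the wrong order.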

 

SBCS, MBCS, DBCS

In a Unicode world, you should use these terms only in a very narrow sense. For example, Microsoft MSDN says: "Multibyte character sets (MBCS) are an alternative to Unicode for supporting character sets, like Japanese and Chinese, that cannot be represented in a single byte. If you are programming for an international market, consider using either Unicode or MBCS, or enabling your program so you can build it for either by changing a switch. The most common MBCS implementation is double-byte character sets (DBCS). Visual C++ in general, and MFC in particular, is fully enabled for DBCS."

If you can spend a minute on Wikipedia Multi-byte_character_set, you are on your way to saying "Chinese, Japanese, Korean" or "CJK" instead of MBCS when you really mean CJK. Also, please note that modern "MBCS" encodings also contain Western European, Russian, Greek, etc. - just not in the location that you would expect.

 

fonts: squares

Issue: Users see squares instead of the desired characters or glyphs.
Solution: Check what fonts the application is calling. If possible, use the "generic" (a.k.a. logical) font names "SansSerif" or "Serif". You can find a font test tool in the Downloads section of this web site. Browsers are generally smarter than desktop applications when it comes to displaying content.

 

fonts: question marks

Issue: Users see question marks instead of the desired characters or glyphs.
Solution: Check that encoding conversion is done correctly and encoding is specified at all transition points.

 

american == english?

The Oracle NLSGDK allows developers to map languages and locales (Oracle traditionally uses different names than Java). The result of the mapping may not always be as expected:

Locale locale = new Locale( "en" );
String oraLang = LocaleMapper.getOraLanguage( locale );
String oraLangFromJavaLang = LocaleMapper.getOraLangFromJavaLang( locale.toString() );
System.out.println("oraLang = " + oraLang);
System.out.println("oraLangFromJavaLang = " + oraLangFromJavaLang);

The output is:
oraLang = AMERICAN
oraLangFromJavaLang = ENGLISH

 

once upon a time (surrogates)

Once upon a time, the notion of more than 64K characters, glyphs or whatever you want to call them, seemed frivolous. And then someone decided that pairs of unused code units could be combined to reach higher characters, and surrogates were born. And with them, parsing strings by/into chars became a little bit more complicated.
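A small javascript illustration of what that means in practice:

```javascript
// Indexing into a string can split a surrogate pair into two
// meaningless halves; code-point aware iteration keeps them whole.
const s = 'a\u{1D11E}';      // 'a' + MUSICAL SYMBOL G CLEF (U+1D11E)

const broken = s.charAt(1);          // a lone high surrogate, '\uD834'
const chars = [];
for (const ch of s) chars.push(ch);  // for...of iterates code points
```

Anything that walks a string char by char - truncation, reversal, masking - needs the code-point aware variant.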

 

size matters

Be nice to your translators and tell them they can use abbreviations if your code does not allow text to expand enough. Allowing expansion is preferred, but there will be times when it cannot be done. Let them use abbreviations and, on your side, make sure that punctuation marks in your strings do not have any functional meaning in your code.