
What a Difference a Space Makes - Language Detection with Minimal Linguistics


A sample program in Java that uses the one man's garbage concept is here for free. A question for you: what results in one big mess that makes for a spine-chilling 30-second news blip? And you know it will happen again - and again. One answer is automatic language detection by computers. After this blatant attention-getting device, we will move on to a calmer discussion of some aspects of language detection. The catchy introduction is in no way meant to belittle the great strides linguists and computer scientists have made toward reliably detecting the language of documents.

By the end of this article, you will have a description of a simple, practical, quite reliable algorithm for document language detection. There will be no mention of what I call "the easy way out", which you can roughly equate with the push toward adding language tags to everything as you author it. This article is limited to documents, that is, written text. Your smartphone asking "did you mean water" when you are trying to call your friend Walter via a voice command is a related but different subject, and the author admits that he uses voice-activated dialing only for entertainment when all the popular science shows on the local public broadcasting network are reruns. Also note that we will deal only with detecting a language, not with the more difficult tasks of indexing texts for later full-text search, machine translation, or other uses.

Signs of undeniable progress in this area are everywhere. One of the more visible ones is software by Xerox, the veteran company that brought us some earlier notable successes in computational linguistics. Xerox now has a product that does what translators have long seen as the obvious thing to do: it looks at every sentence of a document.

Doesn't this make good sense? How come it took so long? One obvious reason is the complexity of natural languages. Unraveling this complexity and transforming it into something a computer program can use to come up with a good, reproducible result was rightly seen as daunting. Then there was the dire shortage of specialists and the immense boredom associated with going through text after text to mark it up, or tag it, with meta information such as gender, nouns, predicates and the relationship between the various parts of speech.

Another reason for what some saw as slow progress was that these efforts lived in a niche of little interest to a broader audience. But for documents this situation changed with the massive expansion of computing itself and the Internet in particular. There were several developments that spurred advances, for example, the use of statistics, neural computing, or the desire of large enterprises and libraries to automate document handling. At the end of the day, however, it may very well be that spam emails did for computational linguistics what the adult video industry did for CDs. Pent-up demand made looking at language sexy.

As more people analyzed written language, approaches and algorithms became better and more widely distributed. Applying my well-known tendency to simplify a complex question to the point of utter ridicule, let us look at two basic approaches to language detection.

Before this, however, it may be worthwhile to spend a minute on a discussion of how a human determines the language of a document. Do not say you have not been warned about simplification, so here we go: who, what, how?

If my grandmother picks up a newspaper and starts reading, there is no deep analysis involved: the newspaper will always be in German, and she will read it with gusto. In contrast, if the lawyer I plan to hire to sue my employer for harassment looks through the emails I submitted, she will look for specific "markers" that indicate my claim is good or that I made up the whole thing. She will also look at how the various parts of speech relate to each other and factor this into her evaluation.

Now, what happens if the devious grandson replaces the German newspaper with one that is in English? Grandma will look at it and say, "What is this?" ("Was ist das?") Now think of your own grandmother -- she, on the other hand, would pick up that very same newspaper and read it without thinking twice.

With my grandmother, I have basically two choices. I can pick up the newspaper, explain (or not) that it is in English, give her the German one, and never mention it again. Or, as you may have guessed, I can point to some of the words and say, see, this is English. The next time she finds an English newspaper where she expected a German one, she may be able to say, oh, right, this is English.

And there you have it: you have just invented an algorithm you can use in software. Find a bunch of things, or words, that are typical and unique for English, have the computer match them, and you are done. This approach uses explicit knowledge to determine the language.
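This explicit approach can be sketched in a few lines of Java. The marker lists below are illustrative placeholders, not vetted linguistic data; the idea is simply to count how many "typical and unique" words of each language appear in the text:

```java
import java.util.List;
import java.util.Map;

public class MarkerWords {
    // Hypothetical marker words, assumed typical of each language.
    static final Map<String, List<String>> MARKERS = Map.of(
        "English", List.of("the", "and", "which", "with"),
        "German",  List.of("und", "aber", "nicht", "ist"));

    static String detect(String text) {
        String best = "unknown";
        int bestHits = 0;
        for (var entry : MARKERS.entrySet()) {
            int hits = 0;
            // Count how many tokens of the text appear in this language's list.
            for (String word : text.toLowerCase().split("\\W+"))
                if (entry.getValue().contains(word)) hits++;
            if (hits > bestHits) { bestHits = hits; best = entry.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("The fox and the dog")); // → English
    }
}
```

With realistic word lists this already works surprisingly well for longer texts, which is exactly the point the grandmother story is making.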

A second major solution to our problem uses implicit characteristics. It does not say, oh this specific word, for example, “train”, is English, whereas the word “desu” is Japanese, and so on. This second solution uses the knowledge that a language has a structure, or a fingerprint. It will say, the structure you gave me is typical for, say, Russian. Hence, the language of the text is Russian. The program will say: I am quite certain it is Russian, even though I did not look at a dictionary of words.

This elegant solution that says I do not want to deal with all the moving parts of a language, uses a technique called n-grams. Of course, Wikipedia has a good definition but I like mine better, mostly because it is more entertaining. The "gram" has nothing to do with weights and measures but everything with a representation; think of it as a diagram, if you will. And the "n" in n-gram is simply a placeholder for a number. Small numbers, ranging from 1 to 4, are most commonly used. So, 3-gram really means that you split your data into chunks of 3 letters, or characters.

Imagine the 3-gram to look like a rake. Apply it to this sentence: “You take a sentence”. Your n-gram "rake" will yield "you", "tak", "e a". You can include spaces and punctuation or ignore them. You can shift the rake by three, as in the previous sentence, or maybe by one and get overlapping 3-grams, like "you", "ou ", "u t", and so forth. Of course, many languages do not separate words by spaces, and then there are the shape-shifting languages, such as Arabic. You will also encounter numbers, which are of less use for our question "what is the language of this document?"
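The rake itself is only a handful of lines of Java. This sliding-window sketch produces the overlapping 3-grams from the sentence above, keeping spaces as discussed:

```java
import java.util.ArrayList;
import java.util.List;

public class Rake {
    // Slide a window of width n across the text, one character at a time,
    // producing overlapping n-grams; spaces are kept.
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++)
            grams.add(text.substring(i, i + n));
        return grams;
    }

    public static void main(String[] args) {
        List<String> grams = ngrams("you take a sentence", 3);
        System.out.println(grams.subList(0, 3)); // → [you, ou , u t]
    }
}
```

To shift the rake by three instead of one, change the loop increment from `i++` to `i += n`.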

So, the n-gram technique provides you with a set of fingerprints that allows you to say with some certainty that a document is in this or that language. The fingerprint image is quite appropriate, and I will tell you later why I do not want to use the DNA image just yet. Just as a set of fingerprints can identify a larger entity, namely you, the n-gram set of data can identify the larger entity "language". Both techniques have very similar essential characteristics. It is no coincidence that forensic linguistics can speak of "manuscript fingerprints" in identifying the author of a ransom note or the latest work of a certain William Shakespeare. The accuracy of both depends heavily on how similar or how different the samples are, as well as on how large a sample is. A single thumbprint may be good enough to exclude you as the suspect in a document forgery case (negative proof of identity) but it may not be good enough to admit you into the United States (positive proof of identity). And this is the reason why all those websites where you paste some text into a box so they can identify the language for you will always say that the result is more accurate if your piece of text is longer.
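One common way to turn the fingerprint idea into code is to build an n-gram frequency profile per language and compare a sample against each. This is a rough sketch; the reference snippets below are far too small for real use and merely stand in for proper training corpora:

```java
import java.util.HashMap;
import java.util.Map;

public class NGramProfile {
    // Count overlapping 3-grams; this map is the "fingerprint" of a text.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String t = text.toLowerCase();
        for (int i = 0; i + 3 <= t.length(); i++)
            counts.merge(t.substring(i, i + 3), 1, Integer::sum);
        return counts;
    }

    // Score similarity by counting the sample's 3-grams that also
    // occur in the reference profile.
    static int overlap(Map<String, Integer> sample, Map<String, Integer> reference) {
        int score = 0;
        for (String gram : sample.keySet())
            if (reference.containsKey(gram)) score += sample.get(gram);
        return score;
    }

    public static void main(String[] args) {
        // Tiny illustrative reference texts; a real system needs far more data.
        Map<String, Integer> english = profile("the quick brown fox jumps over the lazy dog and the cat");
        Map<String, Integer> german  = profile("der schnelle braune fuchs springt gern und der faule hund auch");
        Map<String, Integer> sample  = profile("the dog and the fox");
        System.out.println(overlap(sample, english) > overlap(sample, german) ? "English" : "German");
    }
}
```

As the article says, the longer the sample, the more grams there are to match, and the more confident the verdict becomes.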

Just as with the number of fingers, we need a larger number of samples to identify closely related languages with greater accuracy. But just as with the smudged fingerprints from CSI, the quality of the sample can make or break your detection despite a sufficient quantity of items.

With regard to a document, the term smudge should not be interpreted as the ubiquitous coffee stains on the desk of a typical engineer, but is used here for something else that blurs the lines: foreign words, loan words, the overlay of the sentence structure of one language over that of another. Extreme cases of this are language mixes like Spanglish or Franglais. Just as for fingerprints, the approach would be to try and identify the smudge and exclude it as best as possible from the evaluation.

At this point, we have two techniques we can apply to the goal of language detection. Grandma's unique words and her grandson's new-fangled n-grams. Let's go back and look at words in more detail. As you may have wanted to exclaim earlier: what about the frequency? It is not simply about a word as being English or Japanese, it is also about how often it is used. Part of the beauty of the n-gram approach is that it sort of encapsulates this too. Linguists have collected information about the frequency of words for decades. Teachers have used it for generations of students. And authors and programmers have incorporated one version of frequency lists in many programs: stop lists. A nice misnomer that one is -- it should be called skip list because that is its usual function. Skip the entries of that list when searching for something in a text, skip its contents when indexing a database column.

One man's garbage

My treasure. I used a stoplist for the first time in a small program to create a raw document index from a 100-plus-page technical manual. The translation company I worked at prided itself on being a purveyor of value-added services. Anyway, one day Hewlett-Packard asked if we could do a translation of a manual and, while we were at it, could we also create an index? They thought having an index was a good idea and had somehow not realized it until the book was already printed. The lead linguist made a face, clearly conscious of the amount of work it would take: copy and paste a piece of text, add a page number, and make sure the whole thing is correctly sorted. Enter the engineer. Armed with a good stoplist (10 minutes to find it in an online help authoring package), and after a couple of hours of coding and tweaking, out comes a raw index with page numbers. We have a happy linguist. The story ends with puzzled faces, though. One day, the folks from HP showed up and expressed their satisfaction with the work. I do not recall who told them we had generated a raw index of the English, but I do remember their encouraging comment that the linguistic algorithms must have been quite a challenge. Imagine the look on their faces when I said, oh no, the small program has no knowledge of grammar and such. The approximate date of this episode was late 1996.

Before you start coding your language detection tool, get some stoplists. They are everywhere on the Internet. Then augment the stoplists with entries from "most frequent words" lists. Add a sprinkling of pronouns and a serving of prepositions. Shake well. The "shake well" part is important to avoid any copyright violation. Mix them up, add your own, and you are safe. No, this will not negatively affect accuracy, as you will see later.
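Putting the recipe together, a minimal word-based detector can score a text by what fraction of its tokens appear in each language's merged list. The tiny sets below are placeholders for the real stoplists you just downloaded and mixed:

```java
import java.util.Map;
import java.util.Set;

public class StoplistDetector {
    // Placeholder stoplists; a real detector would load the merged lists of
    // stop words, frequent words, pronouns, and prepositions per language.
    static final Map<String, Set<String>> STOPLISTS = Map.of(
        "English", Set.of("the", "of", "and", "to", "in", "is", "it", "you"),
        "French",  Set.of("le", "la", "de", "et", "un", "est", "que", "vous"));

    static String detect(String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        String best = "unknown";
        double bestRatio = 0.0;
        for (var entry : STOPLISTS.entrySet()) {
            long hits = 0;
            for (String token : tokens)
                if (entry.getValue().contains(token)) hits++;
            // Normalize by text length so long and short texts compare fairly.
            double ratio = (double) hits / tokens.length;
            if (ratio > bestRatio) { bestRatio = ratio; best = entry.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("It is the end of the story")); // → English
    }
}
```

Because every language's list is a grab bag of its most frequent little words, the exact composition of the lists matters far less than their coverage, which is why the "shake well" step does not hurt accuracy.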