(C) 2012 sussenburger.com
A sample program in Java that uses the one man's garbage concept is available here for free.
As you reach the end of the article, you will have a description of a simple, practical,
quite reliable algorithm for document language detection. There will be no mention
of what I call "the easy way out", which you can roughly equate with the push toward
adding language tags to everything as you author it. This article is limited to documents,
written text. Your smartphone asking you "did you mean water?" when you are trying
to call your friend Walter via a voice command is a related but different subject,
and the author admits that he uses voice-activated dialing only for entertainment,
when all the popular science shows on the local public broadcasting network are reruns.
Signs of undeniable progress in this area are everywhere. One of the more visible ones is software by Xerox, the veteran company that brought us some earlier notable successes in computational linguistics. Xerox now has a product that does what translators have long seen as the obvious thing to do: it looks at every sentence of a document.
Doesn't this make good sense? How come it took so long? One obvious reason is the complexity of natural languages. Unraveling this complexity and transforming it into something a computer program can use to come up with a good, reproducible result was rightly seen as daunting. Then there was the dire shortage of specialists and the immense boredom associated with going through text after text to mark it up, or tag it, with meta information such as gender, nouns, predicates and the relationship between the various parts of speech.
Another reason for what some saw as slow progress was that these efforts lived in
a niche of little interest to a broader audience. But for documents this situation
changed with the massive expansion of computing itself and the Internet in particular.
There were several developments that spurred advances, for example: the use of statistics,
neural computing, and the desire of large enterprises and libraries to automate document
handling. At the end of the day, however, it may very well be that spam emails did
for computational linguistics what the adult video industry did for CDs.
As more people analyzed written language, approaches and algorithms became better
and more widely distributed. Applying my well-known weakness for simplification, I will describe one such algorithm here.
Before this, however, it may be worthwhile to spend a minute on how a human determines the language of a document. Do not say you have not been warned about simplification; here we go: who, what, how?
If my grandmother picks up a newspaper and starts reading, there is no deep analysis involved: the newspaper will always be in German, and she will read it with gusto. In contrast, if the lawyer I plan to hire to sue my employer for harassment looks through the emails I submitted, she will look for specific "markers" that indicate my claim is good or that I made up the whole thing. She will also look at how the various parts of speech relate to each other and factor this into her evaluation.
Now, what happens if the devious grandson replaces the German newspaper with one
that is in English? Grandma will look at it and say, "Was ist das?" ("What is this?")
Picture your own grandmother and how she would react.
With my grandmother, I have basically two choices. I can pick up the newspaper, explain or not that it is in English, give her the German one and never mention it again. Or, as you may have guessed, I can point to some of the words and say, see this is English. The next time she finds an English newspaper where she expected a German one, she may be able to say, oh, right this is English.
And there you have it: you have just invented an algorithm you can use in software. Find a bunch of things, or words, that are typical and unique for English, have the computer match them, and you are done. This approach uses explicit knowledge to determine the language.
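Grandma's method translates almost directly into code. Below is a minimal sketch in Java of this explicit-knowledge approach; the class name, the language codes, and the tiny marker-word lists are illustrative stand-ins, not real stoplists.

```java
import java.util.List;
import java.util.Map;

// Grandma's explicit-knowledge algorithm: match words that are typical
// and (nearly) unique for each language, count the hits, and pick the
// language with the most hits.
public class WordListDetector {

    // Illustrative marker words only; real lists would be much larger.
    static final Map<String, List<String>> MARKERS = Map.of(
            "en", List.of("the", "and", "of", "is", "this"),
            "de", List.of("der", "und", "das", "ist", "nicht"),
            "fr", List.of("le", "et", "les", "est", "pas"));

    static String detect(String text) {
        String best = "unknown";
        int bestHits = 0;
        String[] words = text.toLowerCase().split("\\W+");
        for (Map.Entry<String, List<String>> entry : MARKERS.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("Was ist das? Das ist nicht der Spiegel."));  // prints "de"
        System.out.println(detect("This is the newspaper of the day."));        // prints "en"
    }
}
```

One caveat: Java's `\W` is ASCII-based by default, so accented characters split words apart; a production version would compile the pattern with `Pattern.UNICODE_CHARACTER_CLASS`.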
A second major solution to our problem uses implicit characteristics. It does not say, oh this specific word, for example, “train”, is English, whereas the word “desu” is Japanese, and so on. This second solution uses the knowledge that a language has a structure, or a fingerprint. It will say, the structure you gave me is typical for, say, Russian. Hence, the language of the text is Russian. The program will say: I am quite certain it is Russian, even though I did not look at a dictionary of words.
This elegant solution, which says, I do not want to deal with all the moving parts of
a language, uses a technique called n-grams. An n-gram is simply a sequence of n consecutive characters taken from the text.
Imagine the 3-gram: slide a window of three characters across a sentence and collect every overlapping three-character slice. Count how often each slice occurs and you have a frequency profile, the fingerprint of the text.
So, the n-gram profile of a document can be compared against reference profiles built from texts of known languages, and the closest match wins. Think of each language's profile as a set of fingerprints on file.
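The structural approach can be sketched just as briefly. The following Java fragment builds 3-gram frequency profiles and scores a document against them by counting shared trigrams. The class name, the crude scoring function, and the tiny reference texts are assumptions for illustration; a real detector would train profiles on large corpora and use a proper distance measure.

```java
import java.util.HashMap;
import java.util.Map;

// The implicit, structural approach: slice text into overlapping
// three-character sequences (3-grams), count them, and score a document
// against reference profiles by how many trigrams they share.
public class TrigramDetector {

    static Map<String, Integer> profile(String text) {
        String t = text.toLowerCase();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= t.length(); i++) {
            counts.merge(t.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Crude similarity: sum of the document's trigram counts for trigrams
    // that also occur in the reference profile.
    static int score(Map<String, Integer> doc, Map<String, Integer> ref) {
        int s = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            if (ref.containsKey(e.getKey())) {
                s += e.getValue();
            }
        }
        return s;
    }

    public static void main(String[] args) {
        Map<String, Integer> en = profile("the quick brown fox jumps over the lazy dog");
        Map<String, Integer> de = profile("der schnelle braune fuchs springt ueber den faulen hund");
        Map<String, Integer> doc = profile("the fox and the dog");
        System.out.println(score(doc, en) > score(doc, de) ? "en" : "de");  // prints "en"
    }
}
```

Note that the program never consults a dictionary: the shared-trigram count alone pushes the English sample ahead of the German one.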
Just as with the number of fingers, we need a larger number of samples to identify closely related languages with greater accuracy. But just as with the smudged fingerprints from CSI, the quality of the sample can make or break your detection despite a sufficient quantity of items.
With regard to a document, the term smudge should not be interpreted as the ubiquitous coffee stains on the desk of a typical engineer, but is used here for something else that blurs the lines: foreign words, loan words, the overlay of the sentence structure of one language over that of another. Extreme cases of this are language mixes like Spanglish or Franglais. Just as for fingerprints, the approach would be to try and identify the smudge and exclude it as best as possible from the evaluation.
At this point, we have two techniques we can apply to the goal of language detection.
Grandma's unique words and her grandson's new-fangled n-grams.
My treasure. I used a stoplist for the first time in a small program to create a
raw document index from a 100-plus-page technical manual. The translation company
I was at prided itself on being a purveyor of value-added services.
Before you start coding your language detection tool, get some stoplists. They are everywhere on the Internet. Then augment the stoplists with entries from "most frequent words" lists. Add a sprinkling of pronouns and a serving of prepositions. Shake well. The "shake well" part is important to avoid any copyright violation. Mix them up and add your own, and you are safe. No, this will not negatively affect accuracy, as you will see later.
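The "shake well" step is little more than set union. Here is a minimal Java sketch; the class name and the sample entries are invented for illustration, standing in for real lists pulled off the Internet.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// The "shake well" recipe as set union: merge a downloaded stoplist,
// a most-frequent-words list, and your own pronouns and prepositions
// into one set per language. Duplicates collapse and order is lost,
// which is exactly the point.
public class StoplistBuilder {

    @SafeVarargs
    static Set<String> shakeWell(List<String>... lists) {
        Set<String> merged = new HashSet<>();
        for (List<String> list : lists) {
            merged.addAll(list);
        }
        return merged;
    }

    public static void main(String[] args) {
        Set<String> english = shakeWell(
                List.of("the", "and", "of"),         // stoplist fragment
                List.of("the", "to", "in"),          // "most frequent words" fragment
                List.of("he", "she", "with", "at")); // own pronouns and prepositions
        System.out.println(english.size());  // prints 9 (duplicates collapsed)
    }
}
```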