Friday, March 25, 2011

How to detect language of a document

Recently I wrote a program to detect the human language of a given text. I was asked to do this task in 24 hours. After a little googling I finally found the paper N-Gram-Based Text Categorization by these two guys,
William B. Cavnar and John M. Trenkle, from Ann Arbor, Michigan.
Definitely worth a read...

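Firstly, what is an n-gram?
An N-gram is an N-character slice of a longer string (from the paper, verbatim).
So for APPLE you will have the n-grams _, A, P, P, L, E, then _A, AP, PP, PL, LE, E_, then _AP, APP, PPL, PLE, LE_, E__ etc. The underscore stands for the blank padding added at the start and end of the word.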
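
Here is a small sketch of how those slices could be generated. The function name char_ngrams is my own, and the padding rule (one underscore in front, n-1 underscores at the end) is my reading of the examples in the paper, so treat it as an assumption rather than the one true way.

```python
def char_ngrams(word, n):
    """Return the character n-grams of `word`, in order of appearance.

    Assumed padding rule: one '_' before the word and n-1 '_' after it,
    so a word of length k always yields k+1 n-grams.
    """
    padded = "_" + word + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, char_ngrams("APPLE", n))
```

Note that the repeated P shows up as its own slice (PP, APP, PPL), which is easy to miss when writing the n-grams out by hand.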

The basic algorithm, if you don't have the patience to read the paper, is:

1) Create an n-gram based profile for each language, i.e. count how often every n-gram occurs in a sample document written in that language.
2) Sort this profile by frequency, highest first, so the most frequently occurring n-grams end up on top.
3) To find the language of a new document, build its profile the same way and sort it by highest frequency too.
4) Now compute the distance between the document's profile and each language profile, and pick the language with the minimum distance. If the document is written in that language the distance should be very small, because the same n-grams (roughly, the word fragments and syllables) occur with similar frequency in both.
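
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not my original program. The function names, the tiny training samples, the n-gram lengths 1 to 5 and the cutoff of 300 n-grams per profile are all assumptions made for the example. For the distance I use the "out-of-place" measure from the paper: for every n-gram in the document profile, add up how far its rank is from its rank in the language profile, and charge a fixed maximum penalty when the n-gram is missing.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Steps 1-3: count character n-grams, return them most frequent first."""
    counts = Counter()
    # Keep letters only; digits and punctuation act as word separators.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for n in range(1, max_n + 1):
            padded = "_" + word + "_" * (n - 1)  # assumed padding rule
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Highest frequency first; ties broken alphabetically so the result
    # is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]


def out_of_place(doc_profile, lang_profile):
    """Step 4: sum of rank differences between two sorted profiles."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)  # charged for n-grams the language lacks
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def detect_language(text, lang_profiles):
    """Return the key of the language profile closest to `text`."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


if __name__ == "__main__":
    # Toy training data; a real profile needs far more text per language.
    samples = {
        "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
        "de": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
    }
    profiles = {lang: ngram_profile(text) for lang, text in samples.items()}
    print(detect_language("the dog and the cat", profiles))
```

With training text this small the result is only a toy; the language profiles should be built from a decent amount of text per language before you trust the answer.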