Those of you on Twitter will have seen the little ‘translate this tweet’ suggestions that it puts up. If you’re from or in New Zealand you will probably have seen that te reo Māori is often recognised by the algorithm as Latvian, presumably because Latvian also has long vowels indicated by macrons. I’ve always been surprised by this, because Latvian looks so different.
It turns out I was right to be surprised. Even looking just at individual letters, it’s very easy to distinguish the two. I downloaded 74,000 paragraphs of Latvian Wikipedia, a total of 6.5 million letters, and looked at how long the Latvians can go without using letters or patterns that don’t appear in te reo: specifically, s, z, j, v, d, c, any g that isn’t part of ng, the six accented consonants, and any consonant at the end of a word. On average, I only needed to wait five letters to know the language is Latvian rather than Māori, and 99% of the time it took less than 21 letters.
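For the curious, here is a minimal sketch of that kind of calculation in Python. It is not the code behind the numbers above, and it makes some assumptions of its own: the function names `waiting_times` and `summarise` are made up for illustration, a ‘marker’ is any letter outside the Māori alphabet (a broader set than the list above), and the waiting time is counted as the number of letters read up to and including each marker.

```python
import math
import unicodedata
from statistics import mean

# Assumption: the Māori alphabet is these vowels plus h, k, m, n, p, r, t, w,
# with 'g' allowed only straight after 'n' (the digraph 'ng').
VOWELS = set("aeiouāēīōū")
CONSONANTS = set("hkmnprtwg")


def waiting_times(text):
    """Letters read up to and including each marker that rules out te reo."""
    # Compose macrons into single characters and ignore case.
    text = unicodedata.normalize("NFC", text).lower()
    waits = []
    count = 0
    for i, ch in enumerate(text):
        if not ch.isalpha():
            continue  # spaces, digits and punctuation are not counted
        count += 1
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in VOWELS:
            marker = False
        elif ch not in CONSONANTS:
            marker = True  # a letter te reo does not use
        elif ch == "g" and prev != "n":
            marker = True  # 'g' outside 'ng'
        else:
            marker = not nxt.isalpha()  # consonant at the end of a word
        if marker:
            waits.append(count)
            count = 0
    return waits


def summarise(waits):
    """Mean and (nearest-rank) 99th percentile of a non-empty list of waits."""
    if not waits:
        raise ValueError("no markers found")
    ordered = sorted(waits)
    p99 = ordered[math.ceil(0.99 * len(ordered)) - 1]
    return mean(ordered), p99


print(waiting_times("Tēnā koe, kia ora"))  # [] : nothing rules out te reo
print(waiting_times("Labdien"))            # [1, 2, 1, 3]
```

Feeding the waits from a large sample of text into `summarise` gives an average and a 99th percentile of the same sort as the figures quoted above, though the exact values will depend on how markers and waiting times are defined.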
Another language that Twitter often guesses is Finnish. That makes more sense: many of the letters not used in Māori are also rare or absent in Finnish, and ‘g’ appears mostly as ‘ng’. However, Finnish does have ‘s’, ‘y’, ‘ä’ and ‘ö’, and it has words ending in consonants, so it should also be feasible to distinguish.
Update: Indonesian is another popular guess, but it has ‘d’, ‘j’, ‘y’ and ‘b’, and it has lots of words ending with consonants. The average time to rule out te reo is slightly longer, at nearly 6 characters, and the 99th percentile is 22 letters. So if the algorithm can’t tell, it should probably guess it’s not Indonesian.
Update: For very short tweets, and those in mixed languages, nothing’s going to work, but this is about tweets where the answer is obvious to a human.