Sunday, April 20, 2008

An input-optimization puzzle

As I was walking home the other day I was thinking about the way predictive text input works on cell phones, and how it's simultaneously cool that, because of the fact that only a tiny frequency of the possible combinations of our 26 letters are actually used as words, with only 8 symbols we're capable of entering English text without a great deal of difficulty, and irritating that a lot of common words have duplicate input sequences: soon/room, of/me, he/if, etc.; and the less frequent but still aggravatingly abundant book/cool, mind/mine/nine, cast/cart/bart, good/home/gone/hood/hoof/hone/goof, and on and on.

For reference, the way my cell phone (I assume this is a standard at this point?) groups the letters is as follows:

2 - abc
3 - def
4 - ghi
5 - jkl
6 - mno
7 - pqrs
8 - tuv
9 - wxyz

Well, I found myself thinking, what if we weren't required to disperse the letters across the keys in alphabetical order? Couldn't there be some kind of optimized arrangment such that the numbers of duplicated sequences would be dramatically reduced? I figured that I would be in a good position to try this out, having spent a lot of time thinking about the frequency of letters in English.

To this end, I read the TWL (US Scrabble) dictionary into a database and set up some automatic calculations to derive cell-phone input sequences for all the words. As it turns out, the standard layout has about 27,000 duplications out of the approximately 163,000 words in the dictionary (16.5% duplication). This shouldn't be hard to beat, right?

As it turns out, my optimism (and ego) was unwarranted. My first attempt gave me 33,000 duplications (20%), significantly worse than the baseline. After about a half hour of tinkering I managed to get it down to 25,000 (15.3%), which salved my wounded self-confidence but still was not nearly as good as I had hoped.

There must be some principles that are going to underlie any successful attempt. One is clearly that no two vowels (including y) should appear on the same key. Another one that I embarrassingly only discovered after the aforementioned half hour is that consonants frequently used in derivation and inflection should also not coincide, notably d, s, r, n.

So here's the challenge: can you come up with a scheme that's better than the 16.5% baseline standard, and/or better than my 15.3% best so far? If you send me your letter groupings, I'll feed them into my program and post the results here. Maybe together we can figure out the Grand Unified Theory of Duplication Avoidance in Cell-Phone Text Entry. Or something.


Sabrina said...

do you have to know how to use a computer to solve this one?

Josh said...

No, not at all, I'll do the actual calculation -- you'd just tell me which letters of the alphabet to assign to which cell phone keys.