gregr:: Discussion #3 Message : 2004-10-31 08.03.43 peter

Discussion #3 Message: 2004-10-31 08.03.43 peter

I've attached a new version of criptodic.py

It has a new function, sigProb, that operates very similarly to countWords, save for two differences. First, when it finds a word, it makes sure it hasn't just found the first part of a longer word. For example, countWords, when parsing the string 'fisherman', would find the words 'fish' and 'man', whereas sigProb would find the entire string 'fisherman'. Second, instead of simply counting the number of words it has found in a string, it assigns a value to each word based on the word's length, and returns the sum of these values for all of the words it finds. Right now the value is calculated as 4 ^ (word length - 1). It's an exponential function because the probability that a word will show up randomly decreases exponentially with its length, so a longer word is exponentially stronger evidence that the string is real English.
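
To make the scoring concrete, here is a rough sketch of the idea in Python. It is not the code from criptodic.py: the name sig_prob_sketch, the greedy longest-match scan, and passing the wordlist in as a set are all my own assumptions for illustration.

```python
def sig_prob_sketch(text, words):
    """Score a string by the words found in it, preferring longer words.

    Hypothetical sketch: scans left to right and, at each position, takes
    the longest entry of `words` that starts there. Each match of length n
    contributes 4 ** (n - 1) to the score.
    """
    longest = max((len(w) for w in words), default=0)
    score = 0
    i = 0
    while i < len(text):
        match_len = 0
        # Try the longest candidate first so that 'fisherman' wins over 'fish'.
        for n in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + n] in words:
                match_len = n
                break
        if match_len:
            score += 4 ** (match_len - 1)
            i += match_len  # skip past the matched word
        else:
            i += 1  # no word starts here; move on one character
    return score


words = {"fish", "man", "fisherman"}
print(sig_prob_sketch("fisherman", words))  # one 9-letter word: 4 ** 8 = 65536
print(sig_prob_sketch("fishxman", words))   # 'fish' + 'man': 64 + 16 = 80
```

With this weighting, one nine-letter word outscores many short ones, which is the point of making the value exponential in the word length.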

Fred has suggested another possible measure of the 'Englishness' of a string. This function would go through a string and essentially remove every word it finds from it and return the number of remaining characters.
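
A rough sketch of how that measure might look, again in Python. The name leftover_chars_sketch and the greedy longest-match removal are assumptions on my part; Fred's suggestion doesn't specify how overlapping words should be handled.

```python
def leftover_chars_sketch(text, words):
    """Count the characters left after removing every word found in text.

    Hypothetical sketch: scans left to right, removing the longest entry
    of `words` that starts at each position. Lower results suggest a more
    'English' string; 0 means the whole string was covered by words.
    """
    longest = max((len(w) for w in words), default=0)
    leftover = 0
    i = 0
    while i < len(text):
        match_len = 0
        for n in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + n] in words:
                match_len = n
                break
        if match_len:
            i += match_len  # the matched word is removed
        else:
            leftover += 1   # this character belongs to no word
            i += 1
    return leftover


words = {"fish", "man"}
print(leftover_chars_sketch("xfishyman", words))  # 'x' and 'y' remain: 2
```

Unlike the weighted sum, this measure goes down as a string looks more like English, so the two would need to be ranked in opposite directions.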

Each of these systems for ranking strings does better sometimes than others, and every tractable implementation of them seems to have a few flaws or places where it will not rank strings as desired; however, at this point the tool appears (to me at least) to be pretty good at what it aims to do. Whether or not what it aims to do is actually useful remains to be seen.

I still haven't gotten around to optimizing that one recursive function, but with the (significantly shorter) SCOWL wordlist, it's less of an issue.

(last modified 2004-10-31)