Tuesday, October 26
I'm writing the main portion of the program in Python, because I was bored in class and that's the language that was in front of me. I have the list of words loading into the program, and I have a binary search tree. Now all I have to do is put the one into the other and implement the reading and ranking of the strings.
Ok, a little farther along. The word list is loaded into the binary search tree. I think I'm going to rewrite the algorithm that does that, however, as it is terribly inefficient. Apparently there are some times when recursion isn't the best solution. Who knew.
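A sketch of one possible fix (illustrative, not what the program does yet; it assumes the list arrives already sorted, and the names here are made up): build the tree from the midpoint outward, so it comes out balanced and the recursion only goes about log(N) deep instead of N.

    # Sketch: build a balanced search tree from an already-sorted word
    # list. Node and build_tree are illustrative names, not the real
    # program's. Splitting at the midpoint keeps the tree balanced, so
    # the recursion is only about log(N) deep.
    class Node:
        def __init__(self, word, left=None, right=None):
            self.word = word
            self.left = left
            self.right = right

    def build_tree(sorted_words):
        if not sorted_words:
            return None
        mid = len(sorted_words) // 2
        return Node(sorted_words[mid],
                    build_tree(sorted_words[:mid]),
                    build_tree(sorted_words[mid + 1:]))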
Well... the program just did something that I might actually call 'working' at this point. It's still very unoptimized (loading the tree of dictionary words takes longer than it needs to, and the number of comparisons it needs to do to a string is exponential), but I have ideas to get both of those working better.
Yeah, I was hoping to be more productive tonight, but I kept making dumb blunders. Anyway, progress is progress. Here's the program's final output:
(It ignores 'is' because it only has two characters; at the moment it ignores words of two characters or less to speed up the processing.)
Hmmm, I'm having trouble walking through a string, testing different parts in a way that is both complete and tractable.
For example, say you're looking through vjkukurockovgzv, trying to count all of the words therein. Here's the order in which I'm checking substrings as words right now: v, vj, vjk, vjku, vjkuk, [...], j, jk, jku, jkuk, jkuku, [...], k, ku, kuk, kuku, [...]. That makes the cost blow up combinatorially with the length of the string (something like N!, I think, once the recursion branches). One solution might be to add an extra function to the word-lookup system that can tell when no word begins with a particular string. That could bring the efficiency of the system much closer to N.
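As a sketch of that prefix idea (hypothetical helper names; it assumes the whole word list fits in memory): precompute every prefix of every word once, and the "no word starts with this" test becomes a single set lookup.

    # Sketch: precompute every prefix of every word. build_prefixes is
    # an illustrative name, not part of the real program.
    def build_prefixes(words):
        prefixes = set()
        for w in words:
            for i in range(1, len(w) + 1):
                prefixes.add(w[:i])
        return prefixes

    # While lengthening a chunk of the string:
    #     if chunk not in prefixes: give up on this chunk
    # which prunes a whole branch of extensions with one lookup.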
Ok, I've gotten the number of comparisons for the four short strings above down from 696 to 387. I'll do a few tests on longer strings to get an idea of the significance of this change.
Ok, it will move through 25 Caesar-shifted ciphertexts of length 58 in 21,060 comparisons. The problem now is with strings containing several words. I added a bit of stats-generating code to the program and got this:
Yeah. I think the problem is rooted in the idea that I want to count the absolute maximum number of words in each string. That doesn't seem to be really necessary at all, and it's terribly inefficient, so I'm going to look into a workaround.
Ok, yeah, that was the problem. Changing one line of code gives:
With the same number of words counted in each string. Huzzah! It is now fast enough to run well on ciphers of about the size we're considering.
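The log doesn't show the actual one-line change, but the difference described above looks something like this sketch (splits and is_word are stand-ins for the chunk-scanning loop and the tree lookup, and the skip-a-character handling is left out):

    # Illustration only, not the real diff.
    def splits(s, is_word):
        # every (word, rest-of-string) pair where a word starts s
        return [(s[:i], s[i:]) for i in range(1, len(s) + 1)
                if is_word(s[:i])]

    def countwords_max(s, is_word):
        # exhaustive: the best count over every possible first word
        return max((1 + countwords_max(rest, is_word)
                    for word, rest in splits(s, is_word)), default=0)

    def countwords_greedy(s, is_word):
        # cheap: commit to the first word found and recurse just once
        for word, rest in splits(s, is_word):
            return 1 + countwords_greedy(rest, is_word)
        return 0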
Ok, I've got some workable source code. It's rough around the edges right now: it assumes input files have certain names and crashes if it can't find them, it doesn't accept command-line arguments, it spits out a lot of debug info, and it isn't commented. I'll update them periodically. They require a list of words, separated by newlines. For wordlists, I suggest the SCOWL list (http://prdownloads.sourceforge.net/wordlist/scowl-6.zip). I combined a few of the English wordlists there with a special hacker wordlist, omitting words that were two characters or shorter (wordlist2.zip), and made another omitting words that were three characters or shorter (wordlist3.zip).
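Loading one of those lists is simple; a sketch (the file name and function name here are just examples):

    # Sketch: load a newline-separated word list, dropping words of two
    # characters or fewer (as in wordlist2.zip).
    def load_wordlist(path, min_length=3):
        with open(path) as f:
            words = (line.strip().lower() for line in f)
            return sorted(set(w for w in words if len(w) >= min_length))

    words = load_wordlist('wordlist.txt')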
|
Ok, Fred wanted me to give an explanation of the algorithm.
Here it is in pseudocode:
    countwords( string ) {
        grab a chunk of the string, from left to right, starting with just one character
        if the chunk is a word {
            return 1 + countwords( string without the chunk )
        }
        if there aren't any words that start with the chunk {
            return countwords( string without the first character )
        }
        get a new chunk of the string, adding more to the length of the chunk,
        unless it reaches the end of the string, in which case it starts over
        on the string without the first character
    }

Yeah, that was harder than I thought it would be. At this point, the actual code is clearer than that, to me at least. Here's a walkthrough of it working on a particular string: vjkukurockovgzv

    check v: v is not a word, but words can start with v
    check vj: vj is not a word, and there aren't any words that start with vj
    check j: j is not a word, but...
    check jk: jk is not a word, and...
    check k: [...]
    check r: r is not a word, but words can...
    [...]
    check rock: rock is a word.
    check o: o is not a word, but words can...
    check ov: ov is not a word, and there aren't any words that...
    [...]
    check zv: zv is not a word, and there aren't any...
    check v: v is not a word, but words can...
    I'm at the end of the stream.
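For anyone who'd rather read Python, here's a minimal runnable rendering of that pseudocode (set lookups stand in for the binary-tree search, and the helper names are mine, not the program's):

    # Minimal sketch of countwords. 'words' is a set of dictionary
    # words; 'prefixes' holds every prefix of every word (see the
    # build_prefixes sketch above).
    def countwords(s, words, prefixes):
        if not s:
            return 0
        for i in range(1, len(s) + 1):
            chunk = s[:i]
            if chunk in words:
                # found a word: count it and continue past it
                return 1 + countwords(s[i:], words, prefixes)
            if chunk not in prefixes:
                # no word starts with this chunk: drop one character
                return countwords(s[1:], words, prefixes)
        # ran off the end of the string: drop one character
        return countwords(s[1:], words, prefixes)

    # e.g. with a typical wordlist, countwords('vjkukurockovgzv', ...)
    # should find 'rock' and return 1.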
I've attached a new version of criptodic.py.
It has a new function, sigProb, that operates very similarly to countWords, save for two differences. First, when it finds a word, it makes sure it hasn't just found the first part of a longer word: for example, when parsing the string 'fisherman', countWords would find the words 'fish' and 'man', whereas sigProb would find the entire string 'fisherman'. Second, instead of simply counting the number of words it finds in a string, it assigns a value to each word based on the word's length and returns the sum of those values for all of the words it finds. Right now the value is calculated as 4 ^ (word length - 1). It's an exponential function because the probability that a word will show up randomly decreases exponentially with its length, so longer words should count for exponentially more.
Fred has suggested another possible measure of the 'Englishness' of a string: a function that would go through a string, essentially remove every word it finds, and return the number of remaining characters. Each of these ranking systems does better on some strings than on others, and every tractable implementation of them seems to have a few flaws or places where it will not rank strings as desired; however, at this point the tool appears (to me at least) to be pretty good at what it aims to do. Whether what it aims to do is actually useful remains to be seen. I still haven't gotten around to optimizing that one recursive function, but with the (significantly shorter) SCOWL wordlist, it's less of an issue.
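A sketch of what that scoring pass might look like (set lookups again stand in for the tree; the real sigProb is in criptodic.py and may differ):

    # Sketch of the sigProb idea: like countwords, but prefer the
    # longest word starting at each position, and weight each word
    # found as 4 ** (length - 1).
    def sigprob(s, words, prefixes):
        if not s:
            return 0
        best = None                       # longest word starting here
        for i in range(1, len(s) + 1):
            chunk = s[:i]
            if chunk in words:
                best = chunk              # keep going: maybe longer
            if chunk not in prefixes:
                break
        if best is None:
            return sigprob(s[1:], words, prefixes)
        return 4 ** (len(best) - 1) + sigprob(s[len(best):], words, prefixes)

    # e.g. on 'fisherman' this credits one nine-letter word (4 ** 8)
    # rather than 'fish' plus 'man'.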
The English Word Finder is a good idea.
My current work is the Tcl Workspace for JohnnyX/PhreakNic7 code. An Englishness function (in Tcl) would be nice. Maybe something that uses both letter & bigraph frequency to match. strick
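A rough sketch of that letter & bigraph idea, in Python since that's what the rest of the page uses (a Tcl version would be structurally the same; the little frequency tables below are abridged, approximate, and purely illustrative):

    # Sketch: score a string by how English its letter and bigraph
    # statistics look. Higher scores mean more English-looking.
    LETTER_FREQ = {'e': .127, 't': .091, 'a': .082, 'o': .075,
                   'i': .070, 'n': .067, 's': .063, 'h': .061, 'r': .060}
    COMMON_BIGRAPHS = {'th', 'he', 'in', 'er', 'an',
                       're', 'on', 'at', 'en', 'nd'}

    def englishness(s):
        s = s.lower()
        letters = sum(LETTER_FREQ.get(c, 0) for c in s)
        bigraphs = sum(1 for a, b in zip(s, s[1:])
                       if a + b in COMMON_BIGRAPHS)
        return letters + bigraphs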