First I wrote a quick program called 'wf' (for word frequency). It is a Perl script that counts how many times each word is used in a file and produces a report based on those counts. I ran this on my spam folder (I've saved all my spam messages since 9/96) and found the results intriguing.
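The wf script itself isn't shown here, so the following is only a rough sketch of the idea, written in Python rather than Perl. The function names `word_frequencies` and `report`, the tokenizing rule (lowercased runs of letters and apostrophes), and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word occurs, ignoring case.

    Assumption: a "word" is a run of letters and apostrophes.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def report(counts, top=10):
    """Return the most frequent words as (word, count) pairs, highest first."""
    return counts.most_common(top)

# Made-up sample text standing in for a spam folder.
sample = "Free money! Act now to claim your free prize. Free, free, free."
counts = word_frequencies(sample)
for word, n in report(counts, top=3):
    print(f"{n:5d}  {word}")
```

Run against a whole mail folder, a report like this makes the over-used words stand out right away.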
Then I wrote a program called 'sq' (spam quotient) that reads in a table generated by wf (and edited by me to exclude commonly used English words) and weights a message based on its word content. The results are fairly impressive.
Update, 7/30/98: I wrote a modified version of wf that adds up words from my spam folder and then subtracts words from my other mail folder. Then I edited that table to remove common words. The result should be a more accurate analysis. I also added code that awards 100 points for each exclamation point and 100 points for each all-caps word.
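The add-then-subtract step might look something like the sketch below, again in Python rather than the original Perl. The names `count_words` and `build_table` are hypothetical. Two things here are assumptions: the common-word list replaces what was really a hand edit of the table, and words that end up with a negative count are kept as negative weights.

```python
import re
from collections import Counter

def count_words(messages):
    """Total word counts over a list of message strings, ignoring case."""
    counts = Counter()
    for text in messages:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def build_table(spam_messages, other_messages, common_words=()):
    """Spam counts minus non-spam counts, with common words dropped.

    Assumption: negative results are kept, so words typical of
    ordinary mail pull a score down.
    """
    table = count_words(spam_messages)
    table.subtract(count_words(other_messages))  # subtract() keeps negatives
    for word in common_words:
        table.pop(word, None)
    return dict(table)

# Made-up folders for illustration only.
spam = ["Free money now", "Claim your free prize now"]
other = ["Lunch now?", "Your meeting notes"]
table = build_table(spam, other, common_words={"your"})
print(table)
```

In this toy data "now" appears in both folders, so its weight drops from 2 to 1, while "free" keeps its full count.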
If you're curious, you can paste in a message here and see what score the spam processor gives it. It will return the weight of the message overall, and the weight per word (spam quotient), which takes into account the length of the message.
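To make the two numbers concrete, here is a minimal sketch of how an sq-style score could be computed, in Python rather than the original Perl. The function name `score_message`, the sample table, and its weights are invented for illustration. It assumes the per-word figure is simply the overall weight divided by the number of words, and that one-letter words such as "I" do not count as all-caps.

```python
import re

EXCLAMATION_BONUS = 100  # points per exclamation point
ALL_CAPS_BONUS = 100     # points per all-caps word

def score_message(text, table):
    """Return (overall weight, weight per word) for one message."""
    words = re.findall(r"[A-Za-z']+", text)
    # Sum the table weight of every word; unknown words score 0.
    weight = sum(table.get(w.lower(), 0) for w in words)
    weight += EXCLAMATION_BONUS * text.count("!")
    # Assumption: single letters like "I" or "A" are not all-caps words.
    weight += ALL_CAPS_BONUS * sum(1 for w in words if len(w) > 1 and w.isupper())
    quotient = weight / len(words) if words else 0.0
    return weight, quotient

# Hypothetical word table; real weights would come from wf.
table = {"free": 40, "money": 25, "meeting": -10}
weight, quotient = score_message("FREE money!! Act now", table)
print(weight, quotient)  # 365 91.25
```

Here the words contribute 65 points, the two exclamation points 200, and the one all-caps word 100, for 365 overall, or 91.25 per word across four words.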
| Thu Dec 4 23:46:04 PST 2008 | spam/index.src | Updated: Thu Aug 18 2005 9:32.04 | Viewed: never |