this is cool

One day I was thinking, "wouldn't it rule if I could tell the probability that a message is spam based on the content of the message?" Sure, I can write procmail rules until the chicken-cows come home, but what I wanted to test was some kind of quotient based on word frequency.

First I wrote a quick program called 'wf' (for word frequency). It is a Perl script that counts how many times each word is used in a file, and produces a report based on the numbers. I ran this on my spam folder (I've saved all my spam messages since 9/96) and found the results intriguing.
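
To give a feel for what a word counter like this does, here is a minimal sketch in Python (not the original Perl script). The function names `word_frequencies` and `report`, and the rule for what counts as a word, are assumptions made for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word appears, ignoring case.

    Assumption: a "word" is a run of letters and apostrophes.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def report(counts, top=10):
    """Return 'count word' lines, most frequent word first."""
    return [f"{n:6d} {w}" for w, n in counts.most_common(top)]

if __name__ == "__main__":
    sample = "Free money! FREE offer, free"
    for line in report(word_frequencies(sample)):
        print(line)
```

Running something like this over a whole mail folder would just mean feeding it the concatenated text of every message.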

Then I wrote a program called 'sq' (spam quotient) that reads in a table generated by wf (and edited by me to exclude commonly-used English words) and weights a message based on its word content. The results are fairly impressive.
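
The scoring idea can be sketched roughly like this in Python. This is an illustration, not the actual sq code: the name `spam_quotient`, the dictionary format of the table, and the choice to score unknown words as zero are all assumptions.

```python
import re

def spam_quotient(message, weights):
    """Score a message against a word-weight table.

    `weights` maps a lowercase word to its weight (hypothetical format).
    Words missing from the table contribute nothing (an assumption).
    Returns (total weight, weight per word).
    """
    words = re.findall(r"[a-z']+", message.lower())
    total = sum(weights.get(w, 0) for w in words)
    # Dividing by the word count keeps long messages from
    # scoring high just because they are long.
    per_word = total / len(words) if words else 0.0
    return total, per_word
```

With a made-up table of `{"free": 50, "money": 30}`, the message "Free money for you" gets a total weight of 80 and a per-word weight of 20.0.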

Update, 7/30/98: I wrote a modified version of wf that adds up words from my spam folder, and then subtracts words from my other mail folder. Then I edited that table for common words. The result should be a more accurate analysis. I also added code to add 100 points for each exclamation point and 100 points for each all-caps word.
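
Here is a rough Python sketch of those two changes, again as an illustration rather than the real code. The names `difference_table` and `style_bonus` are made up, and treating single letters like "I" or "A" as not all-caps is my assumption; the original rule may differ.

```python
import re
from collections import Counter

def count_words(text):
    """Case-insensitive word counts (words = letters and apostrophes)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def difference_table(spam_text, other_text):
    """Spam counts minus non-spam counts, per word.

    Words more common in regular mail end up with negative weights.
    """
    table = count_words(spam_text)
    table.subtract(count_words(other_text))  # keeps zero and negative entries
    return dict(table)

def style_bonus(message, points=100):
    """100 points per exclamation point and per all-caps word.

    Assumption: one-letter words such as "I" do not count as all-caps.
    """
    exclamations = message.count("!")
    shouted = [w for w in re.findall(r"[A-Za-z]+", message)
               if len(w) > 1 and w.isupper()]
    return points * (exclamations + len(shouted))
```

For example, "BUY NOW!!! I mean it" has three exclamation points and two all-caps words, so it picks up a 500-point bonus.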

If you're curious, you can paste in a message here and see what score the spam processor gives it. It will return the weight of the message overall, and the weight per word (spam quotient), which takes into account the length of the message.


Updated: Thu Aug 18 2005 9:32.04

Copyright © 1998-1999 by Nick Johnson. All rights reserved.