The database is very simple. The PostgreSQL data definition looks like this:
CREATE TABLE mail (
    msgid     text NOT NULL,
    delivery  timestamp NOT NULL,
    body      text,
    flag      char NOT NULL DEFAULT 'H',
    score     decimal(5,2) NOT NULL DEFAULT '0.00',
    predicted char NOT NULL DEFAULT 'Y',
    PRIMARY KEY (msgid),
    CHECK (flag = 'H' OR flag = 'S'),
    CHECK (predicted = 'Y' OR predicted = 'N')
);

CREATE INDEX delivery_idx ON mail(delivery);
The column msgid is the Message-Id of the email; duplicates overwrite existing entries. The time of arrival is stored in delivery. The whole message, headers included, goes into body. This column may be NULL to save space.
flag tells whether the message is considered ham or spam. score is the current score assigned by the mailfilter. predicted keeps track of whether a message was classified correctly.
I added this to .procmailrc:
:0 c
| /home/slicks/crm114db/add.sh default

:0fw: 0.crm114.lock
| /usr/share/crm114/mailfilter.crm -u /home/slicks/crm114db/

:0:
* ^X-CRM114-Status: SPAM.*
.spam/
The first rule stores all messages in the database. The second classifies the mails. The final rule separates spam from ham.
The traditional way is to pipe wrongly classified messages back to the mailfilter. However, messages tend to get slightly modified along the way, so the filter ends up being trained on text that differs from what it originally scored.
In this setup, the piped message only serves to extract the Message-Id, so the original message can be retrieved from the database. The column predicted is set to 'N' and flag to the argument provided.
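The reclassification step can be sketched as follows. The real script is classify.pl in Perl; this Python version, including the extract_msgid helper and the exact shape of the UPDATE statement, is only an illustration of the logic, not the actual code:

```python
import re

def extract_msgid(raw_mail):
    """Pull the Message-Id out of a raw message's header block."""
    headers = raw_mail.split("\n\n", 1)[0]
    m = re.search(r"^Message-Id:\s*(<[^>]+>)", headers,
                  re.IGNORECASE | re.MULTILINE)
    return m.group(1) if m else None

def reclassify_sql(raw_mail, flag):
    """Build the UPDATE that marks a message as misclassified
    and records the corrected flag ('H' or 'S')."""
    assert flag in ("H", "S")
    msgid = extract_msgid(raw_mail)
    if msgid is None:
        return None  # no Message-Id, nothing to look up
    return ("UPDATE mail SET predicted = 'N', flag = %s WHERE msgid = %s",
            (flag, msgid))
```

The body of the piped message is deliberately ignored: only the Message-Id matters, since the pristine original is fetched from the database.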
macro index X "unset wait_key\n/home/slicks/bin/learnspam\nset wait_key\n" "Learn as spam"
macro index H "unset wait_key\n/home/slicks/bin/learnnonspam\nset wait_key\n" "Learn as ham"
macro pager X "unset wait_key\n/home/slicks/bin/learnspam\nset wait_key\n" "Learn as spam"
macro pager H "unset wait_key\n/home/slicks/bin/learnnonspam\nset wait_key\n" "Learn as ham"
unignore message-id:
The last line ensures that mutt will pipe that header, though it shouldn't be necessary.
learnspam looks like this:
#!/bin/sh
cd /home/slicks/crm114db
./classify.pl S
For learnnonspam, substitute H for S.
There are three Perl scripts: add.pl adds mails to the database, classify.pl reclassifies mails after they were wrongly classified, and train.pl trains on the messages in the database until their scores and their flags are consistent (TUNE).
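The train-until-consistent loop of train.pl can be sketched like this. The real script is Perl; this Python sketch, the sign convention (positive score meaning ham), and the learn callback are all assumptions for illustration, not taken from the actual setup:

```python
# Assumption: positive scores mean ham, negative mean spam.
def consistent(flag, score):
    """Does the stored score agree with the stored flag?"""
    return score > 0 if flag == "H" else score < 0

def train_pass(rows, learn):
    """One pass: feed every inconsistent message to the learner
    (e.g. crm114 in learn mode); return how many needed training.
    rows is a list of (msgid, body, flag, score) tuples."""
    wrong = [(msgid, body, flag) for msgid, body, flag, score in rows
             if not consistent(flag, score)]
    for msgid, body, flag in wrong:
        learn(body, flag)
    return len(wrong)
```

Passes are repeated, rescoring the database after each one, until train_pass returns 0, i.e. every score agrees with its flag.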
The margin is calculated rather ad hoc.
I once deduced that the PDF of a repeated binary stochastic variable is proportional to x^a * (1-x)^b on the interval [0,1], with its maximum at the expected value a/(a+b), where a is the number of 1 events and b the number of 0 events. This is quite intuitive.
Finding the smallest interval containing 90% of the integral is not easily solved correctly, especially not in SQL.
I assumed that since the maximum is almost at 1 (accuracy is close to 100%), the interval would be about as large as in the case where all events are 1's. One can easily calculate that in that case, for a 90% margin, the interval is [0.1^(1/(n+1)), 1], with n the number of emails.
The interval is very asymmetric around the maximum, so I took the whole interval width as the error margin.
The margin is useful to avoid reporting 100% accuracy on small sets of mails that are all classified correctly, whether by coincidence or simply because the set is small.