Spammers have gotten smarter

Today I actually read some text that got through the spam filter: total nonsense that almost makes sense. Here's an excerpt:

Popularity of blogs helped also to popularize concept web content mechanisms is such as or Atom have. Xml am perform operations instead using or Feeditem Item want or cannot changed exception readunread property in attached. System time or downloaded via Http Https parsed is normalized unified Identifies updated is Merges reflect last. Those tricky a details platform shields even of supports upcoming Support or Whether is implement innovative scenarios basically deal with Common Feed List. [etc]

This reminded me of computer-generated text, so I searched for Markov chain text generators. Here's one example. Study its output (links are near the bottom of the page): it's the same kind of "nonsense making sense".

Bayesian filtering is, in a sense, the inverse of Markov chain text generation: both methods are based on word statistics. The problem with Markov-generated text is that its statistical properties closely match those of real text, so the Bayesian filter doesn't classify it as spam.
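To make the statistical side concrete, here's a minimal sketch of the naive Bayes idea behind such filters. It's a toy I wrote for illustration, not any real filter's implementation: it counts word frequencies per class and scores a message by the log-odds of spam vs. ham, with add-one smoothing for unseen words.

```python
import math
from collections import Counter

def train(spam_docs, ham_docs):
    """Count word frequencies in each class."""
    spam = Counter(w for d in spam_docs for w in d.lower().split())
    ham = Counter(w for d in ham_docs for w in d.lower().split())
    vocab = set(spam) | set(ham)
    return spam, ham, vocab

def spam_score(message, spam, ham, vocab):
    """Log-odds that the message is spam under a naive Bayes model."""
    n_spam, n_ham, v = sum(spam.values()), sum(ham.values()), len(vocab)
    score = 0.0
    for w in message.lower().split():
        p_spam = (spam[w] + 1) / (n_spam + v)  # add-one (Laplace) smoothing
        p_ham = (ham[w] + 1) / (n_ham + v)
        score += math.log(p_spam / p_ham)
    return score  # > 0 means "more spam-like"
```

The point is that the score depends only on per-word statistics: a generated message whose word frequencies mimic legitimate text scores like legitimate text.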

Generating garbage with the required statistical properties is relatively easy; it just takes a word list and a good Markov model. Once generated, it takes real human understanding to classify it.
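A Markov text generator really is that simple. The following is a sketch of the standard technique (names and parameters are my own, not from any generator linked above): record which words follow each pair of words in a corpus, then walk the chain, picking a random successor at each step.

```python
import random
from collections import defaultdict

def build_model(text, order=2):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        model[state].append(words[i + order])
    return model

def generate(model, order=2, length=20, seed=0):
    """Walk the chain from a random starting state."""
    rng = random.Random(seed)
    state = rng.choice(list(model.keys()))
    out = list(state)
    for _ in range(length):
        choices = model.get(tuple(out[-order:]))
        if not choices:  # dead end: this state was never seen with a successor
            break
        out.append(rng.choice(choices))
    return " ".join(out)
```

Feed it a pile of legitimate e-mail and the output has locally plausible word transitions, which is exactly what makes it slip past a word-statistics filter.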

I haven't studied the theory behind Markov processes and Bayesian filtering deeply, so I might be talking half-rubbish :) But given the amount and kind of spam that gets through the filter, I have a feeling that spammers are slowly winning the battle.

1 comment:

Unknown said...

Actually, we're talking almost the same.