2008-01-10

Some regexps

I have started developing some CRM114 scripts to classify Usenet articles. CRM114 uses the TRE regexp engine internally, and I needed a regexp that splits an article into headers and body; the separation happens at the first empty line. So here's the regexp; it seems to work :)

/((?:(?n:.+):*:_nl:)+):*:_nl:(.+)/

Some peculiarities: :*:_nl: is the CRM114 literal for a newline, and ?n: is a TRE flag telling it that the dot (.) should not match the newline character (by default it does, which makes TRE different from PCRE). The headers end up in the 1st matching group and the article body in the 2nd.
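
For comparison, here is a rough sketch of the same split using Python's re module instead of TRE. This is only an illustration, not part of the CRM114 script: the names SPLIT_RE and split_article and the sample article are made up for the example. Note that the flag logic is inverted: in Python the dot does not match a newline by default, so the header part needs no flag, while the body part uses (?s:...) so that the dot can cross newlines.

```python
import re

# Headers: one or more non-empty lines, each terminated by a newline.
# Then the empty separator line, then the body.
# (?s:...) makes the dot match newlines inside the body group only.
SPLIT_RE = re.compile(r"((?:.+\n)+)\n((?s:.+))")


def split_article(text):
    """Return (headers, body), or None if the text does not match."""
    m = SPLIT_RE.match(text)  # match() anchors at the start of the text
    if m is None:
        return None
    return m.group(1), m.group(2)


# Hypothetical sample article, just to exercise the pattern.
article = "From: a@example.com\nSubject: test\n\nFirst line\n\nSecond paragraph\n"
headers, body = split_article(article)
```

As in the CRM114 version, the headers land in the 1st group and the body in the 2nd; later empty lines inside the body stay part of the body because the header group stops at the first one.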
