Hi, I've been experimenting with the ExtractText plugin, using it to extract text from a PDF and re-inject that text into the SpamAssassin stream for processing.
I've written a small Perl script (I'm not a Perl expert) that runs a few programs to gather info about the PDF, extract the text, and print it to stdout. Occasionally, with PDFs that contain control characters such as line feeds, it causes SpamAssassin to stop processing. I believe it's related to the encoding (or lack thereof) of the output. Is this possible? Would extended characters cause SpamAssassin to choke?

I've stored the text output from the PDF in an array, then used encode('ISO-8859-1',$l) before printing each line. It's not entirely reliable, however, and still occasionally causes SpamAssassin to choke. What's the proper way to encode output from a raw data stream before passing it to SpamAssassin?

    # Print each extracted line encoded as ISO-8859-1, or a marker if there is no text
    if (@pdftext && (scalar(@pdftext) > 1)) {
        foreach my $l (@pdftext) {
            printf("%s", encode('iso-8859-1', $l));
        }
    } else {
        print "NoText ";
    }
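
For anyone trying to reproduce this, here is a minimal sketch of one possible cleanup step, not a confirmed fix. It assumes the extraction tool emits UTF-8 bytes (that is an assumption; adjust the encoding name if your tool differs) and that stray control characters are what upsets downstream processing, which I have not verified. The helper name sanitize_line and the sample data are hypothetical. It decodes the raw bytes, drops control characters other than tab and newline, and re-encodes as UTF-8 instead of ISO-8859-1, since ISO-8859-1 cannot represent most non-Latin characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: clean one line of raw extractor output.
# Assumes the input is a byte string that is supposed to be UTF-8.
sub sanitize_line {
    my ($raw) = @_;

    # Decode bytes into Perl characters. With the default CHECK value,
    # malformed byte sequences are replaced with U+FFFD rather than dying.
    my $text = decode('UTF-8', $raw);

    # Remove C0/C1 control characters and DEL, keeping tab (0x09)
    # and newline (0x0A). This also drops form feeds and carriage returns.
    $text =~ s/[\x00-\x08\x0B-\x1F\x7F-\x{9F}]//g;

    # Re-encode as UTF-8 bytes for printing to stdout.
    return encode('UTF-8', $text);
}

# Sample data standing in for the real extractor output (hypothetical).
my @pdftext = (
    "Invoice\x0C total: 10 \xE2\x82\xAC\n",   # form feed plus a euro sign
    "bad byte: \xFF\n",                       # invalid UTF-8 byte
);

if (@pdftext > 1) {
    print sanitize_line($_) for @pdftext;
} else {
    print "NoText\n";
}
```

Whether UTF-8 is the right target depends on what the plugin expects from the external tool, so treat the choice of output encoding as something to check against the plugin documentation.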