Hi,

I've been playing around with the ExtractText plugin and using it to
extract text from a PDF and re-inject that text back into the
spamassassin stream for processing.

I've written a small Perl script (I'm not a Perl expert) that runs a
few programs to gather info about the PDF, extracts the text, and prints
it to stdout.

Occasionally, PDFs that contain control characters such as line feeds
cause SpamAssassin to stop processing. I believe it's related to the
encoding (or lack thereof) of the output. Is this possible? Would
extended characters cause SpamAssassin to choke?

I store the text output from the PDF in an array, then call
encode('ISO-8859-1',$l) on each line before printing it. This is not
entirely reliable, however, and SpamAssassin still occasionally
chokes.

What's the proper way to encode output from a raw data stream before
passing it to spamassassin?

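use Encode qw(encode);

# Print the extracted text only if there is more than one line of it.
# Note: ">" is the numeric comparison; "gt" compares as strings.
if (@pdftext > 1) {
   foreach my $l (@pdftext) {
      print encode('iso-8859-1', $l);
   }
} else {
   print "NoText ";
}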
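
For comparison, here is a rough sketch of a variant of the snippet above. It rests on two assumptions I have not confirmed: that stray control characters are what trips up processing, and that the extracted lines are already decoded Perl character strings. It strips control characters other than tab and newline and then encodes to UTF-8 instead of ISO-8859-1, so characters outside Latin-1 are not replaced with substitution characters. The helper name sanitize_line and the sample @pdftext contents are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of any plugin): remove control characters,
# keeping tab and newline, then encode the result as UTF-8 bytes.
sub sanitize_line {
    my ($line) = @_;
    # The negated class matches any character that is a control character
    # (Unicode category Cc) and is neither a tab nor a newline.
    $line =~ s/[^\P{Cc}\t\n]//g;
    return encode('UTF-8', $line);
}

# Sample data standing in for the text extracted from a PDF.
my @pdftext = ("First line\x{0C}\n", "caf\x{E9} \x{00}menu\n");

if (@pdftext > 1) {
    print sanitize_line($_) for @pdftext;
} else {
    print "NoText ";
}
```

Whether UTF-8 or ISO-8859-1 is the right target depends on what the consumer of this output expects, which is exactly the part I'm unsure about.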
