Re: Plugin extracting text from docs

Theo Van Dinter Thu, 25 Jun 2009 14:15:39 -0700

On Thu, Jun 25, 2009 at 3:41 PM, Jonas Eckerman<jonas_li...@frukt.org> wrote:
> Matus' example was a Word document that contained a PDF (which might in turn
> contain an image). A plugin that knows how to read Word documents could
> extract the text of the Word document and then use "set_rendered" to make
> that available to SA. It cannot currently extract the PDF and make it
> available to any plugins that know how to read PDFs though.


My view would be that if someone is going to try making things as
convoluted as that, a) we've won because no one is going to go
through the trouble of opening that doc, b) the convolution is a
fingerprint that you could write a rule for and then you don't care
what the content actually is.  For example, you'd render something
like "doc_pdf_jpg", which would make an obvious Bayes token.  In the
same way for a zip file, you could do "zip_pdf zip_jpg zip_txt", etc,
and they'd all be different tokens.
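
To make the fingerprint idea concrete, here is a minimal sketch in Python (not SA's Perl, and not using any SpamAssassin API). The nested-tuple description of an attachment and the name chain_tokens are made up for illustration; a real plugin would have to work out the nesting itself and then hand the resulting string to the rendering step.

```python
# Hypothetical description of an attachment: (type, [children]).
# Nothing here is SpamAssassin API; it only shows how the tokens are built.

def chain_tokens(node, prefix=""):
    """Return one token per innermost object, joining container types with '_'."""
    kind, children = node
    path = f"{prefix}_{kind}" if prefix else kind
    if not children:
        return [path]
    tokens = []
    for child in children:
        tokens.extend(chain_tokens(child, path))
    return tokens

# A Word doc holding a PDF holding a JPEG, and a zip holding three files.
doc = ("doc", [("pdf", [("jpg", [])])])
archive = ("zip", [("pdf", []), ("jpg", []), ("txt", [])])

print(" ".join(chain_tokens(doc)))      # doc_pdf_jpg
print(" ".join(chain_tokens(archive)))  # zip_pdf zip_jpg zip_txt
```

Each distinct nesting gives a distinct string, which is all a rule or Bayes needs to tell them apart.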

But yes, you're right, the Message/Message::Node stuff wasn't designed
with the idea of supporting multiple independent data objects from a
single mime part.  I can see the argument for "treat embedded files
similar to multipart", but I still lean towards mime structure only.

> For some stuff coordination would be needed, yes. But not for what I'm
> thinking of.

Why not?  If you have no coordination, you would possibly look for
images first, then pdfs, then word docs, and end up not getting
anywhere.  If it's all your plugin, you can configure the order.  If
it's not, you need coordination.  For example, as from above, if
there's a zip file with a doc which has a pdf which has a jpg, and your
plugin doesn't handle zip but another one does ...
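
A rough sketch of that ordering problem, again in Python and with made-up names (NESTING, single_pass and until_done are illustrative, not anything in SA). Each "handler" only knows how to open one type. Run once each in the wrong order, they stop at the first layer; getting to the jpg needs either the right order or something that keeps re-offering newly extracted objects to every handler.

```python
# Made-up example message: a zip holding a doc holding a pdf holding a jpg.
NESTING = {"zip": ["doc"], "doc": ["pdf"], "pdf": ["jpg"]}

def single_pass(order, start):
    """Each handler runs once, in the given order, on whatever exists so far."""
    available = [start]
    for kind in order:
        for item in list(available):
            if item == kind:
                available.extend(NESTING.get(item, []))
    return available

def until_done(handlers, start):
    """Coordinated version: every newly extracted object is offered to the handlers."""
    available, queue = [start], [start]
    while queue:
        item = queue.pop(0)
        if item in handlers:
            found = NESTING.get(item, [])
            available.extend(found)
            queue.extend(found)
    return available

print(single_pass(["jpg", "pdf", "doc", "zip"], "zip"))  # ['zip', 'doc']
print(single_pass(["zip", "doc", "pdf", "jpg"], "zip"))  # ['zip', 'doc', 'pdf', 'jpg']
print(until_done({"zip", "doc", "pdf", "jpg"}, "zip"))   # ['zip', 'doc', 'pdf', 'jpg']
print(until_done({"doc", "pdf", "jpg"}, "zip"))          # ['zip']
```

The last call is the case where nothing handles zip: without another plugin opening the archive and passing its contents along, the doc is never seen at all.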

> The most common thing to extract apart from text will most likely be images.
> Any OCR text extractor tied into my plugin would get to see those images,
> but any OCR SA plugins run after my plugin won't. It might be good to make
> extracted images available to those, and other image handling plugins.

But yours already ran, so who cares about the others?

Seriously.

If you're expending the resources to OCR the same image in an email
multiple times ...  You clearly either have a lot of hardware or not a
lot of mail.
