Jim Fulton wrote at 2005-5-27 10:45 -0400:
> ...
>> You cannot make text extraction cheap (as it handles potentially large
>> data).
>You can't make it cheap in all applications.  For most applications,
>text extraction and comparison is very cheap.
>I'm guessing that you are referring to indexing large (book size)
>documents.  I would argue that this is pretty specialized.

No, I am speaking about a repository of office documents (letters,
reports, drafts, documentation, ...), which apparently is not too
rare, at least in a Plone-like context.

>And it is usually not the case that text extraction is expensive.

Last year I analysed text extraction from office documents.

  wvWare extraction for documents on the order of 1 MB in size
  took on the order of seconds; OpenOffice text extraction took
  on the order of 10 seconds (after optimization; the standard
  way took about twice as long).

I definitely do not want to pay this time for every change to a
metadatum or every workflow change. While a user accepts a delay of
some seconds when he uploads a large document, he finds it
unacceptable to wait for seconds when he performs, e.g., a workflow
action on such a document.
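
To make the concern concrete, here is a minimal sketch of one way to
keep the slow extraction out of metadata and workflow changes: remember
a digest of the document body and run the extractor again only when the
body itself has changed. All names here (ExtractionCache, text_for,
fake_extractor) are hypothetical and not part of any Zope or Plone API;
the extractor is a stand-in for a real converter such as wvWare.

```python
import hashlib


class ExtractionCache:
    """Hypothetical helper: re-run a slow extractor only when the body changes."""

    def __init__(self, extractor):
        self._extractor = extractor  # slow callable: bytes -> str
        self._digest = None          # digest of the body last extracted
        self._text = None            # text extracted from that body
        self.calls = 0               # how often the extractor really ran

    def text_for(self, body):
        # Hashing the body is cheap compared with running a converter.
        digest = hashlib.sha1(body).hexdigest()
        if digest != self._digest:
            self._text = self._extractor(body)
            self._digest = digest
            self.calls += 1
        return self._text


def fake_extractor(body):
    # Stand-in for a real converter, which may take seconds per document.
    return body.decode("utf-8").upper()


cache = ExtractionCache(fake_extractor)
cache.text_for(b"report body")   # upload: the extractor runs
cache.text_for(b"report body")   # workflow action: cached text is reused
```

With something like this, a workflow action pays only for hashing the
body, while an upload of a changed document still pays the full
extraction cost.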

