Re: Metadata, TextExtractor

Alexander Klimetschek Fri, 14 Aug 2009 14:33:47 -0700

On Tue, Aug 4, 2009 at 8:16 AM, Zhou Wu <[email protected]> wrote:
> 1. It looks as if anyone who wants to store a document's metadata in the
> repository has to do it on their own. Why can't we publish the metadata
> (this should be configurable) during the text extraction stage? If I do it
> myself, I have to process the document a second time just for the metadata,
> which hurts performance badly. Please note that in v2.0, the metadata object
> is indeed obtained by Tika during that stage, but is then discarded. Without
> the metadata in place, we miss too much searchable information in the
> repository.


Metadata extraction cannot easily be handled generically, because it
depends on your input content, on what you want as metadata, and on how
you define your node structure. Hence a generic solution could only be
a hook on a JCR save() that lets you do anything with the changed
content and add additional properties. But that is not a good idea, as
a save would then always imply many subsequent changes to your content.
And since you need full API access anyway to express your metadata
structure freely, this is best done at the JCR API level by the
application, not by the repository.
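
To illustrate what "done by the application" could look like, here is a
minimal sketch. It assumes the application has already run its own
parser (Tika, for example) and holds the extracted metadata as a plain
map. NodeSketch, MetadataMapper, the metadata keys and the "my:"
property names are all hypothetical and not part of Jackrabbit or Tika;
NodeSketch only stands in for a JCR node so that the example runs
without a repository.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a JCR node: a bag of named string properties. */
final class NodeSketch {
    final Map<String, String> properties = new LinkedHashMap<>();

    // Plays the role that javax.jcr.Node.setProperty(String, String) would play.
    void setProperty(String name, String value) {
        properties.put(name, value);
    }
}

/** Application-defined mapping from extracted metadata keys to property names. */
final class MetadataMapper {
    private final Map<String, String> keyToProperty;

    MetadataMapper(Map<String, String> keyToProperty) {
        // Defensive copy so later changes to the caller's map have no effect.
        this.keyToProperty = new LinkedHashMap<>(keyToProperty);
    }

    /**
     * Copies every mapped, non-empty metadata value onto the node.
     * Keys without a mapping are ignored. Returns the number of
     * properties written.
     */
    int apply(Map<String, String> extracted, NodeSketch node) {
        int written = 0;
        for (Map.Entry<String, String> entry : keyToProperty.entrySet()) {
            String value = extracted.get(entry.getKey());
            if (value != null && !value.isEmpty()) {
                node.setProperty(entry.getValue(), value);
                written++;
            }
        }
        return written;
    }
}

final class MetadataMapperDemo {
    public static void main(String[] args) {
        // The mapping is the application's decision, not the repository's.
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("title", "my:title");
        mapping.put("Author", "my:author");

        // Assumed output of the application's own parser run.
        Map<String, String> extracted = new HashMap<>();
        extracted.put("title", "Quarterly Report");
        extracted.put("Author", "Zhou");
        extracted.put("Content-Type", "application/pdf"); // unmapped, ignored

        NodeSketch node = new NodeSketch();
        int written = new MetadataMapper(mapping).apply(extracted, node);
        System.out.println(written + " properties set: " + node.properties);
    }
}
```

With a real repository, the setProperty() calls would go to the
javax.jcr.Node holding the document, followed by one Session.save(), so
the binary and its metadata are persisted together in a single save
that the application controls.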

Full-text extraction is different, because it does not change the JCR
content. It "only" extracts text from binary or string properties and
makes it available to the full-text search index.

Regards,
Alex

-- 
Alexander Klimetschek
[email protected]
