On Tue, Aug 4, 2009 at 8:16 AM, Zhou Wu<[email protected]> wrote:
> 1. It looks if one wants to put the metadata from a document in a
> repository, one has to do by his/her own. Why cannot we publish the metadata
> (it should be configurable) during the text extracting stages? If I do it by
> myself, I have to process the document once more just for the metadata --
> affecting performance badly. Please note that in v2.0, the metadata object
> is indeed obtained by Tika during the stage, but is discarded. Without
> metadata in place, we miss too much searchable information in the
> repository.

Metadata extraction cannot easily be handled generically, because it depends on your input content, on what you want as metadata, and on how you define your node structure. A generic solution could therefore only be a hook on JCR save() that lets you do anything with the changed content and add further properties. That is not a good idea, though, because every save would then imply many follow-up changes to your content. And since you need full API access anyway to express your metadata structure freely, this is best done at the JCR API level by the application, not by the repository.

Full-text extraction is different because it does not change the JCR content. It "only" extracts the full text from binary or string properties and makes it available to the full-text search index.

Regards,
Alex

--
Alexander Klimetschek
[email protected]
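
To illustrate why the mapping from extracted metadata to properties is application-specific, here is a minimal sketch, not actual Jackrabbit or Tika code. Everything in it is an assumption made for illustration: `FakeNode` is a hypothetical stand-in for a JCR node (a real application would call `setProperty` on `javax.jcr.Node` and then save the session), and the `doc:title` / `doc:author` property names and the extractor keys are invented. The point is only that the application, not the repository, decides which extracted keys become which properties.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataMapping {

    // Hypothetical stand-in for a JCR node: a bag of named string properties.
    // A real application would use javax.jcr.Node.setProperty(name, value).
    static final class FakeNode {
        final Map<String, String> properties = new LinkedHashMap<>();

        void setProperty(String name, String value) {
            properties.put(name, value);
        }
    }

    // Application-defined mapping: extractor key -> property name.
    // Both sides are illustrative assumptions, not a fixed schema.
    static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("title", "doc:title");
        MAPPING.put("Author", "doc:author");
    }

    // Copies only the mapped, non-empty metadata values onto the node
    // and returns how many properties were written.
    static int applyMetadata(FakeNode node, Map<String, String> extracted) {
        int written = 0;
        for (Map.Entry<String, String> entry : MAPPING.entrySet()) {
            String value = extracted.get(entry.getKey());
            if (value != null && !value.isEmpty()) {
                node.setProperty(entry.getValue(), value);
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        // Pretend these values came out of a single extraction pass.
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("title", "Quarterly report");
        extracted.put("Author", "A. Author");
        extracted.put("Content-Type", "application/pdf"); // unmapped, ignored

        FakeNode node = new FakeNode();
        int written = applyMetadata(node, extracted);
        System.out.println(written + " properties set: " + node.properties);
    }
}
```

A different application would swap in its own mapping and node layout, which is why this step fits at the API level rather than inside the repository.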
