Re: Controlling Tika's metadata
I have the same problem with discarding the metadata title. I thought the captureAttr parameter (which can be set in solrconfig.xml or passed as a GET/POST parameter) was responsible for that. I set it to false both in the XML and as a request parameter, but I still get "not multivalued field" errors because metadata literals deliver content to a non-multivalued field. ;( I'm using 3.1, though.

On 02.02.2011 17:13, Grant Ingersoll wrote:

> On Jan 28, 2011, at 5:38 PM, Andreas Kemkes wrote:
>
>> Just getting my feet wet with the text extraction using both schema and solrconfig settings from the example directory in the 1.4 distribution, so I might miss something obvious.
>>
>> Trying to provide my own title (and discarding the one received through Tika's metadata) wasn't straightforward. I had to use the following:
>>
>>     fmap.title=tika_title (to discard the Tika title)
>>     literal.attr_title=New Title (to provide the correct one)
>>     fmap.attr_title=title (to map it back to the field as I would like to use title in searches)
>>
>> Is there anything easier than the above? How can this best be generalized to other metadata provided by Tika (which in our use case will be mostly ignored, as it is provided separately)?
>
> You can provide your own ContentHandler (see the wiki docs). I think it would be reasonable to patch the ExtractingRequestHandler to have a no metadata option and it wouldn't be that hard.
Re: Controlling Tika's metadata
This is the same issue I brought up in this thread: http://search-lucene.com/m/s8sOH1YG1TP

As a workaround I wrote an UpdateProcessor to copy/move fields around (SOLR-2599).

I think we need a separate fmap for Tika-generated fields (say tmap), so the problem could be fixed by:

    tmap.title=tika_title
    literal.title=My client provided title

In this way we can cleanly rename or ignore Tika-generated metadata. Perhaps also an option to add a prefix to all Tika-generated fields?

    tika.prefix=tika_

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 2. feb. 2011, at 17.13, Grant Ingersoll wrote:

> On Jan 28, 2011, at 5:38 PM, Andreas Kemkes wrote:
>
>> Just getting my feet wet with the text extraction using both schema and solrconfig settings from the example directory in the 1.4 distribution, so I might miss something obvious.
>>
>> Trying to provide my own title (and discarding the one received through Tika's metadata) wasn't straightforward. I had to use the following:
>>
>>     fmap.title=tika_title (to discard the Tika title)
>>     literal.attr_title=New Title (to provide the correct one)
>>     fmap.attr_title=title (to map it back to the field as I would like to use title in searches)
>>
>> Is there anything easier than the above? How can this best be generalized to other metadata provided by Tika (which in our use case will be mostly ignored, as it is provided separately)?
>
> You can provide your own ContentHandler (see the wiki docs). I think it would be reasonable to patch the ExtractingRequestHandler to have a no metadata option and it wouldn't be that hard.
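
For illustration only, the tmap and tika.prefix parameters proposed in the message above can be modelled in a few lines of Python. This is a sketch of the suggested semantics, not an existing Solr API: the function name merge_fields, its arguments, and the rule that an explicit tmap entry takes precedence over the prefix are all assumptions made for the example.

```python
def merge_fields(tika_metadata, literals, tmap=None, tika_prefix=""):
    """Model of the *proposed* tmap / tika.prefix behaviour (hypothetical).

    tika_metadata: field name -> value as extracted by Tika
    literals:      field name -> value supplied by the client (literal.*)
    tmap:          optional renames applied only to Tika-generated fields
    tika_prefix:   prefix for Tika fields that have no explicit tmap entry
    """
    tmap = tmap or {}
    doc = {}
    # Rename Tika-generated fields first so they cannot collide with literals.
    for name, value in tika_metadata.items():
        target = tmap.get(name, tika_prefix + name)
        doc.setdefault(target, []).append(value)
    # Client-provided literals keep their names untouched.
    for name, value in literals.items():
        doc.setdefault(name, []).append(value)
    return doc


doc = merge_fields(
    {"title": "Title from Tika", "author": "Someone"},
    {"title": "My client provided title"},
    tmap={"title": "tika_title"},
)
print(doc)
```

With the rename in place, the title field holds exactly one value (the client's), which is the situation a non-multivalued title field requires.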
Re: Controlling Tika's metadata
On Jan 28, 2011, at 5:38 PM, Andreas Kemkes wrote:

> Just getting my feet wet with the text extraction using both schema and solrconfig settings from the example directory in the 1.4 distribution, so I might miss something obvious.
>
> Trying to provide my own title (and discarding the one received through Tika's metadata) wasn't straightforward. I had to use the following:
>
>     fmap.title=tika_title (to discard the Tika title)
>     literal.attr_title=New Title (to provide the correct one)
>     fmap.attr_title=title (to map it back to the field as I would like to use title in searches)
>
> Is there anything easier than the above? How can this best be generalized to other metadata provided by Tika (which in our use case will be mostly ignored, as it is provided separately)?

You can provide your own ContentHandler (see the wiki docs). I think it would be reasonable to patch the ExtractingRequestHandler to have a no metadata option and it wouldn't be that hard.
Controlling Tika's metadata
Just getting my feet wet with the text extraction using both schema and solrconfig settings from the example directory in the 1.4 distribution, so I might miss something obvious.

Trying to provide my own title (and discarding the one received through Tika's metadata) wasn't straightforward. I had to use the following:

    fmap.title=tika_title (to discard the Tika title)
    literal.attr_title=New Title (to provide the correct one)
    fmap.attr_title=title (to map it back to the field as I would like to use title in searches)

Is there anything easier than the above? How can this best be generalized to other metadata provided by Tika (which in our use case will be mostly ignored, as it is provided separately)?

Thanks in advance for your responses.
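
As a rough sketch, the three-parameter workaround from the message above could be assembled into an extract request like this. The host, port, the /update/extract handler path, the literal.id value, and the helper name build_extract_params are assumptions for illustration; whether tika_title is actually discarded depends on how the schema treats that field.

```python
from urllib.parse import urlencode


def build_extract_params(doc_id, title):
    """Build the query string for the title workaround (hypothetical helper)."""
    params = [
        ("literal.id", doc_id),          # assumes the schema's unique key is "id"
        ("fmap.title", "tika_title"),    # move Tika's title out of the way
        ("literal.attr_title", title),   # supply the client's own title
        ("fmap.attr_title", "title"),    # map it back onto the title field
    ]
    return urlencode(params)


query = build_extract_params("doc1", "New Title")
# Assumed local Solr instance; adjust to your own deployment.
url = "http://localhost:8983/solr/update/extract?" + query
print(url)
```

The document itself would still have to be sent as the body of a POST to that URL; the sketch only shows how the parameters fit together.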