Re: Controlling Tika's metadata

2011-06-17 Thread alexander sulz

I have the same problem with discarding the metadata title.
I thought the captureAttr parameter (which can be set in
solrconfig.xml or passed as a GET/POST parameter) was responsible for that?
I set it to false both in the XML and as a request parameter, but I still
get "not multivalued" field errors because both the metadata and the
literals deliver content to a field that is not multivalued. ;(


using 3.1 though.
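
The collision behind that error can be illustrated with a toy model. This is only a sketch, not Solr's actual code: the function names are hypothetical, and it assumes (as the workaround quoted in this thread implies) that fmap renames are applied to literal.* names as well as to Tika's metadata names.

```python
from collections import defaultdict


def merge_fields(tika_metadata, literals, fmap=None):
    """Toy model: rename every incoming field via fmap, then collect values.

    Assumption: the same fmap table is applied to Tika metadata and to
    literals, so both can end up in the same target field.
    """
    fmap = fmap or {}
    doc = defaultdict(list)
    for source in (tika_metadata, literals):
        for name, value in source.items():
            doc[fmap.get(name, name)].append(value)
    return dict(doc)


def single_valued_violations(doc, single_valued):
    """Return the single-valued fields that received more than one value."""
    return sorted(f for f in single_valued if len(doc.get(f, [])) > 1)


tika = {"title": "Title from Tika"}

# A literal title plus Tika's title both land in the single-valued "title".
doc = merge_fields(tika, {"title": "New Title"})
print(single_valued_violations(doc, {"title"}))
```

In this model the field ends up with two values, which is the situation a non-multivalued field rejects.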


Re: Controlling Tika's metadata

2011-06-17 Thread Jan Høydahl
This is the same issue I brought up in this thread: 
http://search-lucene.com/m/s8sOH1YG1TP

As a workaround I wrote an UpdateProcessor to copy/move fields around 
(SOLR-2599). 
I think we need a separate fmap for Tika-generated fields (say tmap), so that
the problem could be fixed by:

tmap.title=tika_title
literal.title=My client provided title

In this way we can cleanly rename or ignore Tika-generated metadata. Perhaps
there could also be an option to add a prefix to all Tika-generated fields?

tika.prefix=tika_
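
To make the proposal concrete, here is a minimal sketch of the intended semantics. Note that tmap and tika.prefix are only suggestions in this message, not existing parameters; the function name is hypothetical, and the rule that an explicit tmap entry wins over the prefix is an assumption.

```python
def apply_tika_mapping(tika_metadata, literals, tmap=None, prefix=""):
    """Sketch of the proposed behaviour (hypothetical, not a Solr API).

    Only Tika-generated fields are renamed: an explicit tmap entry is used
    if present, otherwise the prefix is prepended. Literals are stored
    under exactly the name the client supplied.
    """
    tmap = tmap or {}
    doc = {}
    for name, value in tika_metadata.items():
        target = tmap.get(name, prefix + name)
        doc.setdefault(target, []).append(value)
    for name, value in literals.items():
        doc.setdefault(name, []).append(value)
    return doc


tika = {"title": "Title from Tika", "author": "Someone"}

# tmap.title=tika_title & literal.title=My client provided title
print(apply_tika_mapping(tika, {"title": "My client provided title"},
                         tmap={"title": "tika_title"}))

# tika.prefix=tika_ moves every Tika-generated field out of the way
print(apply_tika_mapping(tika, {"title": "My client provided title"},
                         prefix="tika_"))
```

Because literals bypass the Tika-only mapping, the client-supplied title never shares a field with the extracted one.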

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com




Re: Controlling Tika's metadata

2011-02-02 Thread Grant Ingersoll

On Jan 28, 2011, at 5:38 PM, Andreas Kemkes wrote:

 Just getting my feet wet with the text extraction using both schema and 
 solrconfig settings from the example directory in the 1.4 distribution, so I 
 might miss something obvious.
 
 Trying to provide my own title (and discarding the one received through
 Tika's metadata) wasn't straightforward. I had to use the following:
 
 fmap.title=tika_title (to discard the Tika title)
 literal.attr_title=New Title (to provide the correct one)
 fmap.attr_title=title (to map it back to the field as I would like to use
 title in searches)
 
 Is there anything easier than the above?
 
 How can this best be generalized to other metadata provided by Tika (which in 
 our use case will be mostly ignored, as it is provided separately)?

You can provide your own ContentHandler (see the wiki docs).  I think it would 
be reasonable to patch the ExtractingRequestHandler to have a "no metadata" 
option, and it wouldn't be that hard.

Controlling Tika's metadata

2011-01-28 Thread Andreas Kemkes
Just getting my feet wet with the text extraction using both schema and 
solrconfig settings from the example directory in the 1.4 distribution, so I 
might miss something obvious.

Trying to provide my own title (and discarding the one received through Tika's 
metadata) wasn't straightforward. I had to use the following:

fmap.title=tika_title (to discard the Tika title)
literal.attr_title=New Title (to provide the correct one)
fmap.attr_title=title (to map it back to the field as I would like to use title 
in searches)

Is there anything easier than the above?

How can this best be generalized to other metadata provided by Tika (which in 
our use case will be mostly ignored, as it is provided separately)?

Thanks in advance for your responses.
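
For reference, the three parameters above could be assembled into an extract request URL roughly as follows. This is a sketch under assumptions: the host, port, and /update/extract path are taken to match the example configuration, and the literal.id parameter and the extract_url helper are illustrative additions that are not part of the original message.

```python
from urllib.parse import urlencode


def extract_url(base_url, doc_id, title):
    """Build the query string for the title workaround described above."""
    params = [
        ("literal.id", doc_id),           # assumed unique-key literal
        ("fmap.title", "tika_title"),     # move Tika's title out of the way
        ("literal.attr_title", title),    # supply the desired title
        ("fmap.attr_title", "title"),     # map it back onto the title field
    ]
    return base_url + "?" + urlencode(params)


# Assumed local example instance; adjust to the actual deployment.
url = extract_url("http://localhost:8983/solr/update/extract", "doc1", "New Title")
print(url)
```

The document itself would still have to be sent as the request body (for example as a file upload); only the parameter handling is shown here.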