Chris:

Yes, I only see the output below.

I'm familiar with the information in
http://wiki.apache.org/solr/ExtractingRequestHandler, except for
the tika.config part, which I haven't touched.

Even when running documents through Tika directly, the metadata output
depends heavily on what metadata the document itself contains.  I haven't
found the right place in the Tika source code yet either.  Would digging into
POI, PDFBox, etc. get me any further?  A matrix listing the complete set of
metadata for the most popular formats would be very helpful to me.  I would
be glad to help put one together, given some direction.

Thanks,

Andreas

PS: I've also noticed some differences in the date formats being used (with
Tika 0.9).  Is that something I should be concerned about when using it
through SolrCell?

<meta name="Creation-Date" content="Mon May 17 10:10:15 PDT 2010"/> (from a Word document)
<meta name="Creation-Date" content="2011-01-03T18:45:50Z"/> (from a PDF)
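
To make the difference concrete, here is a minimal sketch (Python, and not
anything that Tika or Solr ships) of normalizing both values to the ISO 8601
UTC form that Solr date fields expect.  The first value looks like the output
of java.util.Date.toString().  The function name and the small time zone
table are my own assumptions, and the table only covers the abbreviations
needed for this example.

```python
from datetime import datetime, timedelta

# Assumed abbreviation-to-offset table (hours from UTC); extend as needed.
TZ_OFFSETS = {"PST": -8, "PDT": -7, "UTC": 0, "GMT": 0}

ISO_UTC = "%Y-%m-%dT%H:%M:%SZ"


def normalize_tika_date(value):
    """Return value as an ISO 8601 UTC string, or None if unrecognized."""
    # Case 1: already ISO 8601 in UTC, as in the PDF example.
    try:
        return datetime.strptime(value, ISO_UTC).strftime(ISO_UTC)
    except ValueError:
        pass

    # Case 2: "Mon May 17 10:10:15 PDT 2010", as in the Word example.
    parts = value.split()
    if len(parts) != 6 or parts[4] not in TZ_OFFSETS:
        return None

    # Drop the zone abbreviation and parse the rest as local wall-clock time.
    naive_text = " ".join(parts[:4] + parts[5:])
    try:
        local = datetime.strptime(naive_text, "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None

    # Shift from the local zone to UTC using the assumed offset table.
    utc = local - timedelta(hours=TZ_OFFSETS[parts[4]])
    return utc.strftime(ISO_UTC)


print(normalize_tika_date("Mon May 17 10:10:15 PDT 2010"))
print(normalize_tika_date("2011-01-03T18:45:50Z"))
```

Anything the function does not recognize comes back as None, so unexpected
formats can be logged rather than silently indexed.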




________________________________
From: "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Sent: Fri, February 25, 2011 4:11:00 PM
Subject: Re: Tika metadata extracted per supported document format?

Hi Andreas,

> java -jar tika-app-0.9.jar --list-met-models
> TikaMetadataKeys
> PROTECTED
> RESOURCE_NAME_KEY
> TikaMimeKeys
> MIME_TYPE_MAGIC
> TIKA_MIME_FILE
> 
> Both 0.8 and 0.9 give me the same list.  Is that a configuration issue?

Strange -- those are the only met models you're seeing listed?

> 
> I'm a bit unclear if that gets me to what I was looking for - metadata 
> like "content_type" or "last_modified".  Or am I confusing Tika metadata 
> with SolrCell metadata?
> 
> I thought SolrCell metadata comes from Tika, or does it not?

It does come from Tika, that's for sure, but SolrCell has a configuration
for the ExtractingRequestHandler that remaps the field names from Tika to
Solr, so that's probably where names like those are coming from.
Check this out:

http://wiki.apache.org/solr/ExtractingRequestHandler
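
A rough sketch of that remapping, in Python for brevity.  The lowernames()
helper only approximates what the handler's lowernames=true option does to
Tika's field names, and the host, document id, and uprefix value are
hypothetical placeholders; only the parameter names (literal.*, lowernames,
fmap.*, uprefix) are taken from the wiki page above.

```python
from urllib.parse import urlencode


def lowernames(field):
    """Approximate lowernames=true: lowercase, other characters become '_'."""
    return "".join(ch.lower() if ch.isalnum() else "_" for ch in field)


# Tika-style names on the left, the Solr-side names they would turn into.
for tika_name in ("Content-Type", "Last-Modified", "Creation-Date"):
    print(tika_name, "->", lowernames(tika_name))

# Hypothetical request parameters, for illustration only.
params = {
    "literal.id": "doc1",      # placeholder document id
    "lowernames": "true",      # Content-Type becomes content_type
    "fmap.content": "text",    # move the extracted body into the 'text' field
    "uprefix": "attr_",        # prefix for fields not defined in the schema
}
query = urlencode(params)

# Assumed default example host and handler path.
print("http://localhost:8983/solr/update/extract?" + query)
```

Nothing is sent to Solr here; the script only prints the mapped names and
the request URL, so it is safe to run without a server.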

HTH!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


      
