Chris: Yes, I only see the output below.
I'm familiar with the information in http://wiki.apache.org/solr/ExtractingRequestHandler, except for the tika.config part, which I haven't touched.

Even when running documents through Tika directly, the metadata output is highly dependent on what metadata the document contains (obviously). I haven't found the right place in the Tika source code yet either. Would digging into POI, PDFBox, ... help me any further in my pursuit?

A matrix that lists the complete set of metadata for the most popular formats would sure be helpful to me. I would help provide it, if properly directed.

Thanks,
Andreas

PS: I've also noticed some differences in the date formats being used (using version 0.9). Is that something I should be concerned about when using it through SolrCell?

<meta name="Creation-Date" content="Mon May 17 10:10:15 PDT 2010"/> (from a Word document)
<meta name="Creation-Date" content="2011-01-03T18:45:50Z"/> (from a PDF)

________________________________
From: "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Sent: Fri, February 25, 2011 4:11:00 PM
Subject: Re: Tika metadata extracted per supported document format?

Hi Andreas,

> java -jar tika-app-0.9.jar --list-met-models
> TikaMetadataKeys
>  PROTECTED
>  RESOURCE_NAME_KEY
> TikaMimeKeys
>  MIME_TYPE_MAGIC
>  TIKA_MIME_FILE
>
> Both 0.8 and 0.9 give me the same list. Is that a configuration issue?

Strange -- those are the only met models you're seeing listed?

>
> I'm a bit unclear if that gets me to what I was looking for - metadata
> like "content_type" or "last_modified". Or am I confusing Tika metadata
> with SolrCell metadata?
>
> I thought SolrCell metadata comes from Tika, or does it not?

It does come from Tika, that's for sure, but in SolrCell there is a configuration for the ExtractingRequestHandler that remaps the field names from Tika to Solr. So that's probably where it's coming from.
Check this out:

http://wiki.apache.org/solr/ExtractingRequestHandler

HTH!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
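
Note on the two Creation-Date formats quoted in the PS: Solr date fields expect ISO 8601 in UTC (for example 2011-01-03T18:45:50Z), so the Word-style value would likely need normalizing before indexing. The following is a minimal sketch, not anything taken from Tika or SolrCell. The function name normalize_creation_date and the TZ_OFFSETS table are hypothetical, and the sketch assumes the Word value follows the "day month date time zone year" layout shown above and that month and weekday names are in English.

```python
from datetime import datetime, timedelta, timezone

# Assumed abbreviation-to-offset table (hours from UTC). Zone abbreviations
# are ambiguous in general, so this mapping is for illustration only.
TZ_OFFSETS = {"PDT": -7, "PST": -8, "UTC": 0, "GMT": 0}


def normalize_creation_date(value: str) -> str:
    """Return the date as an ISO 8601 UTC string such as 2011-01-03T18:45:50Z."""
    try:
        # ISO 8601 form, as in the PDF example.
        parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        parsed = parsed.replace(tzinfo=timezone.utc)
    except ValueError:
        # Text form, as in the Word example: "Mon May 17 10:10:15 PDT 2010".
        parts = value.split()
        if len(parts) != 6 or parts[4] not in TZ_OFFSETS:
            raise ValueError(f"unrecognized date format: {value!r}")
        zone = timezone(timedelta(hours=TZ_OFFSETS[parts[4]]))
        # Parse everything except the zone abbreviation, then attach the zone.
        without_zone = " ".join(parts[:4] + parts[5:])
        naive = datetime.strptime(without_zone, "%a %b %d %H:%M:%S %Y")
        parsed = naive.replace(tzinfo=zone)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    for raw in ("Mon May 17 10:10:15 PDT 2010", "2011-01-03T18:45:50Z"):
        print(raw, "->", normalize_creation_date(raw))
```

The zone abbreviation is handled by hand because Python's %Z directive only recognizes a few names reliably; values with abbreviations missing from the table are rejected rather than guessed.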