Re: Tika metadata extracted per supported document format?
Chris:

Yes, I only see the output below. I'm familiar with the information at http://wiki.apache.org/solr/ExtractingRequestHandler, except for the tika.config part, which I haven't touched.

Even when running documents through Tika directly, the metadata output depends heavily on what metadata the document contains (obviously). I haven't found the right place in the Tika source code yet either. Would digging into POI, PDFBox, ... help me any further in my pursuit?

A matrix that lists the complete set of metadata for the most popular formats would certainly be helpful to me. I would help provide it, if properly directed.

Thanks,

Andreas

PS: I've also noticed some differences in the date formats being used (using version 0.9). Is that something I should be concerned about when using it through SolrCell?

(from a Word document)
(from a PDF)

From: "Mattmann, Chris A (388J)"
To: "solr-user@lucene.apache.org"
Sent: Fri, February 25, 2011 4:11:00 PM
Subject: Re: Tika metadata extracted per supported document format?

Hi Andreas,

> java -jar tika-app-0.9.jar --list-met-models
> TikaMetadataKeys
> PROTECTED
> RESOURCE_NAME_KEY
> TikaMimeKeys
> MIME_TYPE_MAGIC
> TIKA_MIME_FILE
>
> Both 0.8 and 0.9 give me the same list. Is that a configuration issue?

Strange -- those are the only met models you're seeing listed?

> I'm a bit unclear if that gets me to what I was looking for - metadata
> like "content_type" or "last_modified". Or am I confusing Tika metadata
> with SolrCell metadata?
>
> I thought SolrCell metadata comes from Tika, or does it not?

It does come from Tika, that's for sure, but in SolrCell there is a configuration for the ExtractingRequestHandler that remaps the field names from Tika to Solr. So that's probably where it's coming from. Check this out:

http://wiki.apache.org/solr/ExtractingRequestHandler

HTH!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
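The per-format matrix Andreas asks about could, as one possible approach, be assembled empirically by running sample documents through Tika and recording which keys appear for each format. The sketch below only shows the aggregation step. It assumes the metadata has already been extracted into plain dictionaries; the function name `build_metadata_matrix`, the format labels, and the key names are hypothetical placeholders, not real Tika output.

```python
from collections import defaultdict


def build_metadata_matrix(samples):
    """Build a format-by-key presence matrix.

    `samples` is an iterable of (format_name, metadata_dict) pairs,
    one pair per sample document that was already run through Tika.
    """
    keys_by_format = defaultdict(set)
    for format_name, metadata in samples:
        # Collect every key seen for this format across all its samples.
        keys_by_format[format_name].update(metadata)

    # The matrix columns are the union of keys seen in any format.
    all_keys = sorted(set().union(*keys_by_format.values()))
    return {
        fmt: {key: key in keys for key in all_keys}
        for fmt, keys in keys_by_format.items()
    }


# Placeholder input: format labels and key names are made up for illustration.
samples = [
    ("format_a", {"key_1": "...", "key_2": "..."}),
    ("format_b", {"key_2": "..."}),
    ("format_a", {"key_3": "..."}),
]

matrix = build_metadata_matrix(samples)
for fmt in sorted(matrix):
    present = [key for key, seen in matrix[fmt].items() if seen]
    print(fmt, present)
```

A matrix built this way reflects only the sample documents used, so a key missing from it may still be produced for other documents of the same format.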
Re: Tika metadata extracted per supported document format?
Hi Andreas,

> java -jar tika-app-0.9.jar --list-met-models
> TikaMetadataKeys
> PROTECTED
> RESOURCE_NAME_KEY
> TikaMimeKeys
> MIME_TYPE_MAGIC
> TIKA_MIME_FILE
>
> Both 0.8 and 0.9 give me the same list. Is that a configuration issue?

Strange -- those are the only met models you're seeing listed?

> I'm a bit unclear if that gets me to what I was looking for - metadata
> like "content_type" or "last_modified". Or am I confusing Tika metadata
> with SolrCell metadata?
>
> I thought SolrCell metadata comes from Tika, or does it not?

It does come from Tika, that's for sure, but in SolrCell there is a configuration for the ExtractingRequestHandler that remaps the field names from Tika to Solr. So that's probably where it's coming from. Check this out:

http://wiki.apache.org/solr/ExtractingRequestHandler

HTH!

Cheers,
Chris
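To illustrate the remapping Chris describes, here is a rough sketch of how a Tika key such as "Content-Type" could end up as "content_type" in Solr. It imitates, in plain Python, the behaviour the ExtractingRequestHandler wiki page documents for the `lowernames` and `fmap.<source>=<target>` parameters. It is not Solr's actual code, and the function name `remap_field_names`, the target field `modified_dt`, and the sample values are assumptions made up for illustration.

```python
import re


def remap_field_names(tika_metadata, lowernames=True, fmap=None):
    """Roughly imitate SolrCell's renaming of Tika metadata keys.

    Assumed order, as described on the Solr wiki: lowercase first,
    then apply explicit source-to-target mappings.
    """
    fmap = fmap or {}
    solr_doc = {}
    for name, value in tika_metadata.items():
        if lowernames:
            # Lowercase, then turn every non-alphanumeric character into "_".
            name = re.sub(r"[^a-z0-9]", "_", name.lower())
        # Apply an explicit mapping if one exists for this (lowercased) name.
        name = fmap.get(name, name)
        solr_doc[name] = value
    return solr_doc


# Made-up sample metadata, not output captured from Tika.
tika_metadata = {
    "Content-Type": "application/pdf",
    "Last-Modified": "placeholder-date",
}

solr_doc = remap_field_names(tika_metadata, fmap={"last_modified": "modified_dt"})
print(solr_doc)
```

With `lowernames` switched off and no mappings given, the keys pass through unchanged. That would explain why the names seen in Solr can differ from those Tika reports directly.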
Re: Tika metadata extracted per supported document format?
Hi Chris,

java -jar tika-app-0.9.jar --list-met-models
TikaMetadataKeys
PROTECTED
RESOURCE_NAME_KEY
TikaMimeKeys
MIME_TYPE_MAGIC
TIKA_MIME_FILE

Both 0.8 and 0.9 give me the same list. Is that a configuration issue?

I'm a bit unclear if that gets me to what I was looking for - metadata like "content_type" or "last_modified". Or am I confusing Tika metadata with SolrCell metadata?

I thought SolrCell metadata comes from Tika, or does it not?

Regards,

Andreas

From: "Mattmann, Chris A (388J)"
To: "solr-user@lucene.apache.org"
Cc: "u...@tika.apache.org"
Sent: Fri, February 25, 2011 1:21:33 PM
Subject: Re: Tika metadata extracted per supported document format?

Hi Andreas,

In Tika 0.8+, you can run the --list-met-models command from tika-app:

java -jar tika-app-.jar --list-met-models

And get a print out of the met keys that Tika supports. Some parsers add their own that aren't part of this met listing, but this is a relatively comprehensive list.

Cheers,
Chris

On Feb 25, 2011, at 12:10 PM, Andreas Kemkes wrote:

> Hello,
>
> I've asked this on the Tika mailing list w/o an answer, so apologies for
> cross-posting.
>
> I'm trying to find information that tells me specifically what metadata is
> provided for the different supported document formats. Unfortunately all I
> was able to find so far is "The Metadata produced depends on the type of
> document submitted."
>
> Currently, I'm using ExtractingRequestHandler from Solr 1.4 (with Tika 0.4),
> so I'm particularly interested in that version, but also in changes that are
> provided in newer versions of Tika.
>
> Where are the best places to look for such information?
>
> Thanks in advance,
>
> Andreas
Re: Tika metadata extracted per supported document format?
Hi Chris,

Thank you so much - that's a great start.

Andreas

From: "Mattmann, Chris A (388J)"
To: "solr-user@lucene.apache.org"
Cc: "u...@tika.apache.org"
Sent: Fri, February 25, 2011 1:21:33 PM
Subject: Re: Tika metadata extracted per supported document format?

Hi Andreas,

In Tika 0.8+, you can run the --list-met-models command from tika-app:

java -jar tika-app-.jar --list-met-models

And get a print out of the met keys that Tika supports. Some parsers add their own that aren't part of this met listing, but this is a relatively comprehensive list.

Cheers,
Chris

On Feb 25, 2011, at 12:10 PM, Andreas Kemkes wrote:

> Hello,
>
> I've asked this on the Tika mailing list w/o an answer, so apologies for
> cross-posting.
>
> I'm trying to find information that tells me specifically what metadata is
> provided for the different supported document formats. Unfortunately all I
> was able to find so far is "The Metadata produced depends on the type of
> document submitted."
>
> Currently, I'm using ExtractingRequestHandler from Solr 1.4 (with Tika 0.4),
> so I'm particularly interested in that version, but also in changes that are
> provided in newer versions of Tika.
>
> Where are the best places to look for such information?
>
> Thanks in advance,
>
> Andreas
Re: Tika metadata extracted per supported document format?
Hi Andreas,

In Tika 0.8+, you can run the --list-met-models command from tika-app:

java -jar tika-app-.jar --list-met-models

And get a print out of the met keys that Tika supports. Some parsers add their own that aren't part of this met listing, but this is a relatively comprehensive list.

Cheers,
Chris

On Feb 25, 2011, at 12:10 PM, Andreas Kemkes wrote:

> Hello,
>
> I've asked this on the Tika mailing list w/o an answer, so apologies for
> cross-posting.
>
> I'm trying to find information that tells me specifically what metadata is
> provided for the different supported document formats. Unfortunately all I
> was able to find so far is "The Metadata produced depends on the type of
> document submitted."
>
> Currently, I'm using ExtractingRequestHandler from Solr 1.4 (with Tika 0.4),
> so I'm particularly interested in that version, but also in changes that are
> provided in newer versions of Tika.
>
> Where are the best places to look for such information?
>
> Thanks in advance,
>
> Andreas
Tika metadata extracted per supported document format?
Hello,

I've asked this on the Tika mailing list w/o an answer, so apologies for cross-posting.

I'm trying to find information that tells me specifically what metadata is provided for the different supported document formats. Unfortunately all I was able to find so far is "The Metadata produced depends on the type of document submitted."

Currently, I'm using ExtractingRequestHandler from Solr 1.4 (with Tika 0.4), so I'm particularly interested in that version, but also in changes that are provided in newer versions of Tika.

Where are the best places to look for such information?

Thanks in advance,

Andreas