Re: Tika metadata extracted per supported document format?

2011-02-28 Thread Andreas Kemkes
Chris:

Yes, I only see the output below.

I'm familiar with the information in 
 http://wiki.apache.org/solr/ExtractingRequestHandler, except for 
the tika.config part, which I haven't touched.

Even when running documents through Tika directly, the metadata output depends 
heavily on what metadata the document contains (obviously).  I haven't found the 
right place in the Tika source code yet either.  Would digging into POI, PDFBox, 
... help me any further in my pursuit?  A matrix that lists the complete set of 
metadata for the most popular formats would sure be helpful to me.  I would be 
glad to help provide it, if properly directed.
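
One way such a matrix could be assembled is sketched below. This is only an 
illustration: the function name build_matrix is made up, and the key names in 
the usage note are placeholders, not actual Tika output. The real input would 
be the metadata keys observed when running sample documents of each format 
through Tika.

```python
def build_matrix(observed):
    """Build a key-by-format matrix.

    observed: dict mapping a format name to the set of metadata keys
    that were seen for documents of that format.
    Returns (header, rows); each row is [key, mark_for_format_1, ...].
    """
    formats = sorted(observed)
    # Union of every key seen in any format, in a stable order.
    keys = sorted({key for seen in observed.values() for key in seen})
    header = ["key"] + formats
    rows = [
        [key] + ["x" if key in observed[fmt] else "" for fmt in formats]
        for key in keys
    ]
    return header, rows
```

For example, with placeholder input {"doc": {"Content-Type", "Last-Author"}, 
"pdf": {"Content-Type", "created"}}, the row for "Content-Type" is marked for 
both formats and the other two rows for one format each.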

Thanks,

Andreas

PS: I've also noticed some differences in the date formats being used (using 
version 0.9).  Is that something I should be concerned about when using it 
through SolrCell?

(from a Word document)
(from a PDF)
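
If values like these end up in a Solr date field, they would need to be in 
Solr's canonical UTC form (for example 1995-12-31T23:59:59Z). Below is a 
minimal sketch of normalizing mixed inputs before indexing. The candidate 
patterns are assumptions chosen for illustration, since the two formats 
actually observed are not shown here, and to_solr_date is a made-up helper 
name.

```python
from datetime import datetime, timezone

# Hypothetical candidate patterns; replace with the formats actually observed.
CANDIDATE_FORMATS = (
    "%Y-%m-%dT%H:%M:%SZ",    # e.g. 2010-11-12T17:30:00Z
    "%Y-%m-%dT%H:%M:%S",     # same, without the trailing Z
    "%a %b %d %H:%M:%S %Y",  # e.g. Fri Nov 12 17:30:00 2010
)


def to_solr_date(raw):
    """Return raw as a Solr-style UTC timestamp, or None if nothing matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next pattern
        # Assumption: values without zone information are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")
    return None
```

Returning None for unrecognized input lets the caller decide whether to skip 
the field or keep the raw string in a non-date field.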





From: "Mattmann, Chris A (388J)" 
To: "solr-user@lucene.apache.org" 
Sent: Fri, February 25, 2011 4:11:00 PM
Subject: Re: Tika metadata extracted per supported document format?

Hi Andreas,

> java -jar tika-app-0.9.jar --list-met-models
> TikaMetadataKeys
> PROTECTED
> RESOURCE_NAME_KEY
> TikaMimeKeys
> MIME_TYPE_MAGIC
> TIKA_MIME_FILE
> 
> Both 0.8 and 0.9 give me the same list.  Is that a configuration issue?

Strange -- those are the only met models you're seeing listed?

> 
> I'm a bit unclear if that gets me to what I was looking for - metadata 
> like "content_type" or "last_modified".  Or am I confusing Tika metadata 
> with SolrCell metadata?
> 
> I thought SolrCell metadata comes from Tika, or does it not?

It does come from Tika, that's for sure, but in SolrCell there is a 
configuration for the ExtractingRequestHandler that remaps the field names 
from Tika to Solr. So that's probably where it's coming from. Check this out:

http://wiki.apache.org/solr/ExtractingRequestHandler

HTH!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++


  

Re: Tika metadata extracted per supported document format?

2011-02-25 Thread Mattmann, Chris A (388J)
Hi Andreas,

> java -jar tika-app-0.9.jar --list-met-models
> TikaMetadataKeys
> PROTECTED
> RESOURCE_NAME_KEY
> TikaMimeKeys
> MIME_TYPE_MAGIC
> TIKA_MIME_FILE
> 
> Both 0.8 and 0.9 give me the same list.  Is that a configuration issue?

Strange -- those are the only met models you're seeing listed?

> 
> I'm a bit unclear if that gets me to what I was looking for - metadata 
> like "content_type" or "last_modified".  Or am I confusing Tika metadata 
> with SolrCell metadata?
> 
> I thought SolrCell metadata comes from Tika, or does it not?

It does come from Tika, that's for sure, but in SolrCell there is a 
configuration for the ExtractingRequestHandler that remaps the field names 
from Tika to Solr. So that's probably where it's coming from. Check this out:

http://wiki.apache.org/solr/ExtractingRequestHandler

HTH!

Cheers,
Chris
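
As a rough illustration of the remapping described above: the wiki page 
documents the ExtractingRequestHandler parameters lowernames, fmap.<source> 
and uprefix, applied in that order. The sketch below mimics that documented 
order in plain Python. It is not Solr's code, and the function name and 
argument names are made up for the example.

```python
def solr_field_name(tika_name, fmap=None, known_fields=(), uprefix=None,
                    lowernames=True):
    """Approximate how a Tika metadata key becomes a Solr field name."""
    name = tika_name
    if lowernames:
        # Lowercase, and replace anything that is not a letter or digit
        # with an underscore: "Content-Type" -> "content_type".
        name = "".join(ch.lower() if ch.isalnum() else "_" for ch in name)
    if fmap and name in fmap:
        # Explicit rename, corresponding to fmap.<source>=<target>.
        name = fmap[name]
    if uprefix is not None and name not in known_fields:
        # Fields unknown to the schema get the configured prefix.
        name = uprefix + name
    return name
```

With lowernames enabled, a Tika key such as Content-Type comes out as 
content_type, which would explain seeing underscore-style names in SolrCell 
that do not appear in Tika's own listing.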




Re: Tika metadata extracted per supported document format?

2011-02-25 Thread Andreas Kemkes
Hi Chris,

java -jar tika-app-0.9.jar --list-met-models
TikaMetadataKeys
 PROTECTED
 RESOURCE_NAME_KEY
TikaMimeKeys
 MIME_TYPE_MAGIC
 TIKA_MIME_FILE
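
For comparing listings across Tika versions, the output above can be turned 
into a dictionary. This is a small sketch that assumes the layout shown here 
(model names flush left, their keys indented beneath them); parse_met_models 
is a made-up helper name, not part of Tika.

```python
def parse_met_models(output):
    """Map each met model name to the list of keys indented beneath it."""
    models = {}
    current = None
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            # Indented line: a key belonging to the current model.
            if current is not None:
                models[current].append(line.strip())
        else:
            # Flush-left line: start of a new model.
            current = line.strip()
            models[current] = []
    return models


SAMPLE = """\
TikaMetadataKeys
 PROTECTED
 RESOURCE_NAME_KEY
TikaMimeKeys
 MIME_TYPE_MAGIC
 TIKA_MIME_FILE
"""
```

Calling parse_met_models(SAMPLE) gives two models with two keys each, so the 
results from 0.8 and 0.9 can be compared with a plain equality check.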

Both 0.8 and 0.9 give me the same list.  Is that a configuration issue?

I'm a bit unclear if that gets me to what I was looking for - metadata 
like "content_type" or "last_modified".  Or am I confusing Tika metadata 
with SolrCell metadata?

I thought SolrCell metadata comes from Tika, or does it not?

Regards,

Andreas




From: "Mattmann, Chris A (388J)" 
To: "solr-user@lucene.apache.org" 
Cc: "u...@tika.apache.org" 
Sent: Fri, February 25, 2011 1:21:33 PM
Subject: Re: Tika metadata extracted per supported document format?

Hi Andreas,

In Tika 0.8+, you can run the --list-met-models command from tika-app:

java -jar tika-app-.jar --list-met-models

And get a printout of the met keys that Tika supports. Some parsers add their 
own keys that aren't part of this met listing, but this is a relatively 
comprehensive list.

Cheers,
Chris

On Feb 25, 2011, at 12:10 PM, Andreas Kemkes wrote:

> Hello,
> 
> I've asked this on the Tika mailing list w/o an answer, so apologies for 
> cross-posting.
> 
> I'm trying to find information that tells me specifically what metadata is 
> provided for the different supported document formats.  Unfortunately all I 
> was able to find so far is "The Metadata produced depends on the type of 
> document submitted."
> 
> Currently, I'm using ExtractingRequestHandler from Solr 1.4 (with Tika 0.4), 
> so I'm particularly interested in that version, but also in changes that are 
> provided in newer versions of Tika.
> 
> Where are the best places to look for such information?
> 
> Thanks in advance,
> 
> Andreas
> 
> 




  

Re: Tika metadata extracted per supported document format?

2011-02-25 Thread Andreas Kemkes
Hi Chris,

Thank you so much - that's a great start.

Andreas






  

Re: Tika metadata extracted per supported document format?

2011-02-25 Thread Mattmann, Chris A (388J)
Hi Andreas,

In Tika 0.8+, you can run the --list-met-models command from tika-app:

java -jar tika-app-.jar --list-met-models

And get a printout of the met keys that Tika supports. Some parsers add their 
own keys that aren't part of this met listing, but this is a relatively 
comprehensive list.

Cheers,
Chris

On Feb 25, 2011, at 12:10 PM, Andreas Kemkes wrote:

> Hello,
> 
> I've asked this on the Tika mailing list w/o an answer, so apologies for 
> cross-posting.
> 
> I'm trying to find information that tells me specifically what metadata is 
> provided for the different supported document formats.  Unfortunately all I 
> was able to find so far is "The Metadata produced depends on the type of 
> document submitted."
> 
> Currently, I'm using ExtractingRequestHandler from Solr 1.4 (with Tika 0.4), 
> so I'm particularly interested in that version, but also in changes that are 
> provided in newer versions of Tika.
> 
> Where are the best places to look for such information?
> 
> Thanks in advance,
> 
> Andreas
> 
> 





Tika metadata extracted per supported document format?

2011-02-25 Thread Andreas Kemkes
Hello,

I've asked this on the Tika mailing list w/o an answer, so apologies for 
cross-posting.

I'm trying to find information that tells me specifically what metadata is 
provided for the different supported document formats.  Unfortunately all I was 
able to find so far is "The Metadata produced depends on the type of document 
submitted."

Currently, I'm using ExtractingRequestHandler from Solr 1.4 (with Tika 0.4), so 
I'm particularly interested in that version, but also in changes that are 
provided in newer versions of Tika.

Where are the best places to look for such information?

Thanks in advance,

Andreas