Solr ExtractingRequestHandler with Compressed files

Joey Hanzel Fri, 22 Oct 2010 15:02:57 -0700

Hi,

Has anyone had success using ExtractingRequestHandler and Tika with any of
the compressed or archive file formats (zip, tar, gz, etc.)?


I am sending Solr the archived.tar file using curl:

  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" \
    -H 'Content-type:application/octet-stream' \
    --data-binary "@/home/archived.tar"

When I query the document, the filenames inside the archive are indexed as
"body_texts", but the content of those files is not extracted or included.
This is not the behavior I expected. Ref:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example.
When I send one of the actual documents inside the archive using the same
curl command, the extracted content is stored in the "body_texts" field.  Am
I missing a step for the compressed files?
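
Since posting a single document works, one workaround I could fall back on
is to unpack the archive on the client and post each member separately. A
minimal sketch of that idea, assuming Python's standard library only; the
helper names and the id scheme (archive id plus member path) are
hypothetical and just for illustration, while the endpoint and parameters
are the same ones as in my curl command:

```python
import tarfile
import urllib.parse
import urllib.request

# Same endpoint as in the curl command.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def iter_member_requests(archive, base_id, solr_url=SOLR_EXTRACT_URL):
    """Yield one POST request per regular file inside an open tarfile."""
    for member in archive.getmembers():
        if not member.isfile():
            continue  # skip directories, links and other non-file entries
        data = archive.extractfile(member).read()
        params = urllib.parse.urlencode({
            # Hypothetical id scheme: archive id plus the member's path.
            "literal.id": f"{base_id}/{member.name}",
            "fmap.content": "body_texts",
            "commit": "true",
        })
        yield urllib.request.Request(
            f"{solr_url}?{params}",
            data=data,
            headers={"Content-Type": "application/octet-stream"},
            method="POST",
        )


def post_archive(path, base_id):
    """Send every file in the tar archive at `path` as its own document."""
    with tarfile.open(path) as archive:  # compression is auto-detected
        for request in iter_member_requests(archive, base_id):
            with urllib.request.urlopen(request) as response:
                response.read()
```

Calling post_archive("/home/archived.tar", "doc1") would then index each
file as its own document instead of one document for the whole archive.
Committing on every request mirrors my curl command; a single commit at
the end would probably be cheaper.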

I have added all the extraction dependencies as indicated by mat in
http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and
am able to successfully extract data from MS Word, PDF, and HTML documents.

I'm using the following library versions:
  Solr 1.40, Solr Cell 1.4.1, with Tika Core 0.4

Given everything I have read, this version of Tika should support extracting
data from all files within a compressed file.  Any help or suggestions would
be appreciated.
