Earlier versions of Solr had an issue where only the file names inside a
zip archive were indexed, not the contents of the files.
We faced the same issue and ended up using the Solr trunk, which has an
upgraded Tika version and works fine.

Solr 1.4.1 should also include the fix, so try using it.
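
If upgrading is not an option, one possible workaround follows from the
observation in the quoted message that single documents extract correctly:
unpack the archive on the client and post each file on its own. The sketch
below is only an illustration under that assumption, not a tested fix. The
endpoint and parameters are copied from the curl command in the quoted
message; the function names and the "<archive id>/<file name>" id scheme
are hypothetical.

```python
import tarfile
import urllib.parse
import urllib.request

# Endpoint taken from the curl command in the quoted message (an assumption
# about the local setup; adjust host, port and path as needed).
EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, base_url=EXTRACT_URL):
    """Build the extract URL with the same parameters as the curl example."""
    params = {
        "literal.id": doc_id,
        "fmap.content": "body_texts",
        "commit": "true",
    }
    return base_url + "?" + urllib.parse.urlencode(params)


def iter_tar_members(tar_path):
    """Yield (name, raw_bytes) for every regular file in the tar archive."""
    with tarfile.open(tar_path) as archive:
        for member in archive:
            if member.isfile():  # skip directories, links, devices
                yield member.name, archive.extractfile(member).read()


def post_member(name, data, archive_id):
    """POST one extracted file to Solr and return the HTTP status code.

    The document id "<archive_id>/<name>" is a hypothetical naming scheme.
    """
    request = urllib.request.Request(
        build_extract_url(archive_id + "/" + name),
        data=data,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_archive(tar_path, archive_id):
    """Send every regular file in the archive as its own Solr document."""
    return [post_member(name, data, archive_id)
            for name, data in iter_tar_members(tar_path)]
```

Calling index_archive("/home/archived.tar", "doc1") would then send one
request per file, so each file becomes a separate document instead of one
document for the whole archive.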

Regards,
Jayendra

On Fri, Oct 22, 2010 at 6:02 PM, Joey Hanzel <phan...@nearinfinity.com> wrote:

> Hi,
>
> Has anyone had success using ExtractingRequestHandler and Tika with any of
> the compressed file formats (zip, tar, gz, etc) ?
>
> I am sending solr the archived.tar file using curl.
>
> curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" \
>   -H 'Content-type:application/octet-stream' \
>   --data-binary "@/home/archived.tar"
>
> The result I get when I query the document is that the filenames inside the
> archive are indexed as the "body_texts", but the content of those files is
> not extracted or included.  This is not the behavior I expected. Ref:
>
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example
> .
> When I send one of the actual documents inside the archive using the same
> curl command, the extracted content is stored in the "body_texts" field.
> Am I missing a step for the compressed files?
>
> I have added all the extraction dependencies as indicated by mat in
> http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and
> am able to successfully extract data from MS Word, PDF, HTML documents.
>
> I'm using the following library versions.
>  Solr 1.4.0, Solr Cell 1.4.1, with Tika Core 0.4
>
> Given everything I have read, this version of Tika should support
> extracting data from all files within a compressed file.  Any help or
> suggestions would be appreciated.
>
