Re: Solr ExtractingRequestHandler with Compressed files

2010-10-26 Thread Joey Hanzel
Hi Jayendra,

Thanks for the suggestion. I updated to Solr 1.4.1 and Solr Cell 1.4.1 and
tried sending a zip file that contained several HTML documents.
Unfortunately, that did not solve the problem.

Here's the curl command I used:
curl "http://localhost:8983/solr/update/extract?literal.id=d...@uprefix=attr_&fmap.content=attr_content&commit=true" \
  -F "file=@data.zip"

When I query for id:doc1, the attr_content field lists each filename within
the zip archive. Solr also indexed stream_size, stream_source and
content_type, but it does not appear to be opening the individual files
within the zip.

Did you have to make any other configuration changes to your solrconfig.xml
or schema.xml to read the contents of the individual files?  Would it help
to pass the specific MIME type on the curl line?
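
On the MIME type question, a sketch only: a type hint could be passed in two places, the stream.type request parameter of the extracting handler and the ;type= suffix of curl's -F option. The build_extract_url helper below is hypothetical, the id doc1 and the field names are taken from the messages in this thread, and nothing here establishes that a hint changes how Tika treats the files inside the archive.

```shell
#!/bin/sh
# Base URL as used in the thread.
SOLR_URL="http://localhost:8983/solr"

# Hypothetical helper: print an extract URL that carries a MIME type hint.
# $1 = document id, $2 = MIME type (only the slash is percent-encoded here).
build_extract_url() {
    doc_id=$1
    mime=$(printf '%s' "$2" | sed 's|/|%2F|g')
    printf '%s/update/extract?literal.id=%s&uprefix=attr_&fmap.content=attr_content&stream.type=%s&commit=true\n' \
        "$SOLR_URL" "$doc_id" "$mime"
}

# Usage (needs a running Solr, so it is left commented out):
# curl "$(build_extract_url doc1 application/zip)" \
#   -F "file=@data.zip;type=application/zip"
```

Quoting the URL matters in either case, since an unquoted & would be interpreted by the shell instead of being sent to Solr.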




Re: Solr ExtractingRequestHandler with Compressed files

2010-10-25 Thread Jayendra Patil
Previous versions of Solr had this issue: only the file names from the zip
were indexed.
We faced the same issue and ended up using Solr trunk, which has an upgraded
Tika version and works fine.

Solr 1.4.1 should also include the fix. Try using it.

Regards,
Jayendra




Solr ExtractingRequestHandler with Compressed files

2010-10-22 Thread Joey Hanzel
Hi,

Has anyone had success using ExtractingRequestHandler and Tika with any of
the compressed file formats (zip, tar, gz, etc.)?

I am sending Solr the archived.tar file using curl:
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" \
  -H 'Content-type:application/octet-stream' --data-binary \
  @/home/archived.tar
The result I get when I query the document is that the filenames inside the
archive are indexed as the body_texts, but the content of those files is
not extracted or included.  This is not the behavior I expected. Ref:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example
When I send one of the actual documents inside the archive using the same curl
command, the extracted content is then stored in the body_texts field.  Am
I missing a step for the compressed files?

I have added all the extraction dependencies as indicated by mat in
http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and
am able to successfully extract data from MS Word, PDF, and HTML documents.

I'm using the following library versions:
  Solr 1.4.0, Solr Cell 1.4.1, with Tika Core 0.4

Given everything I have read, this version of Tika should support extracting
data from all files within a compressed file.  Any help or suggestions would
be appreciated.
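
Since sending a single document with the same curl command does store its content, one possible workaround is to unpack the archive on the client and post each file separately. The sketch below only prints the curl commands so they can be reviewed first. The plan_posts helper and the id scheme (a prefix plus a counter) are assumptions for illustration, not something from the Solr documentation, and file names containing double quotes or dollar signs would need extra escaping.

```shell
#!/bin/sh
# Base URL as used in the message above.
SOLR_URL="http://localhost:8983/solr"

# Hypothetical helper: print one curl command per regular file under a
# directory. $1 = directory holding the unpacked archive, $2 = id prefix.
plan_posts() {
    dir=$1
    prefix=$2
    n=0
    # Sort the listing so ids are assigned in a stable order.
    find "$dir" -type f | sort | while IFS= read -r f; do
        n=$((n + 1))
        printf 'curl "%s/update/extract?literal.id=%s-%d&fmap.content=body_texts&commit=true" -H "Content-type:application/octet-stream" --data-binary "@%s"\n' \
            "$SOLR_URL" "$prefix" "$n" "$f"
    done
}

# Usage (left commented out because it needs the archive and a running Solr):
# tmp=$(mktemp -d) && tar -xf /home/archived.tar -C "$tmp"
# plan_posts "$tmp" doc1
```

Each file becomes its own Solr document in this approach, which differs from indexing the whole archive under a single id.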