Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-11 Thread Joey Hanzel
Awesome. Thanks Jayendra.  I hadn't caught these patches yet.

I applied the SOLR-2416 patch to the solr-3.1 release tag. This resolved the
problem of archive files not being unpacked and indexed with Solr CELL.
Thanks for the FYI.
https://issues.apache.org/jira/browse/SOLR-2416

On Mon, Apr 11, 2011 at 12:02 AM, Jayendra Patil <
jayendra.patil@gmail.com> wrote:

> The migration of Tika to the latest 0.8 version seems to have
> reintroduced the issue.
>
> I was able to get this working again with the following patches. (Solr
> Cell and Data Import handler)
>
> https://issues.apache.org/jira/browse/SOLR-2416
> https://issues.apache.org/jira/browse/SOLR-2332
>
> You can try these.
>
> Regards,
> Jayendra
>
> On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel wrote:
> > Hi Gary,
> >
> > I have been experiencing the same problem... Unable to extract content from
> > archive file formats.  I just tried again with a clean install of Solr 3.1.0
> > (using Tika 0.8) and continue to experience the same results.  Did you have
> > any success with this problem with Solr 1.4.1 or 3.1.0 ?
> >
> > I'm using this curl command to send data to Solr.
> > curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true"
> > -H "application/octet-stream" -F  "myfile=@data.zip"
> >
> > No problem extracting single rich text documents, but archive files only
> > result in the file names within the archive being indexed. Am I missing
> > something else in my configuration? Solr doesn't seem to be unpacking the
> > archive files. Based on the email chain associated with your first message,
> > some people have been able to get this functionality to work as desired.
> >
> > On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor wrote:
> >
> >> Can anyone shed any light on this, and whether it could be a config issue?
> >>  I'm now using the latest SVN trunk, which includes the Tika 0.8 jars.
> >>
> >> When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to
> >> the ExtractingRequestHandler, I get the following log entry (formatted for
> >> ease of reading) :
> >>
> >> SolrInputDocument[
> >>{
> >>ignored_meta=ignored_meta(1.0)={
> >>[stream_source_info, file, stream_content_type,
> >> application/octet-stream, stream_size, 260, stream_name, solr1.zip,
> >> Content-Type, application/zip]
> >>},
> >>ignored_=ignored_(1.0)={
> >>[package-entry, package-entry]
> >>},
> >>ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
> >>ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
> >>ignored_stream_size=ignored_stream_size(1.0)={260},
> >>ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
> >>ignored_content_type=ignored_content_type(1.0)={application/zip},
> >>docid=docid(1.0)={74},
> >>type=type(1.0)={5},
> >>text=text(1.0)={  doc2.txtdoc1.txt}
> >>}
> >> ]
> >>
> >> So, the data coming back from Tika when parsing a ZIP file does not include
> >> the file contents, only the names of the files contained therein.  I've
> >> tried forcing stream.type=application/zip in the CURL string, but that makes
> >> no difference.  If I specify an invalid stream.type then I get an exception
> >> response, so I know it's being used.
> >>
> >> When I send one of those txt files individually to the
> >> ExtractingRequestHandler, I get:
> >>
> >> SolrInputDocument[
> >>{
> >>ignored_meta=ignored_meta(1.0)={
> >>[stream_source_info, file, stream_content_type, text/plain,
> >> stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
> >>},
> >>ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
> >>ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
> >>ignored_stream_size=ignored_stream_size(1.0)={30},
> >>ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
> >>ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
> >>docid=docid(1.0)={74},
> >>type=type(1.0)={5},
> >>text=text(1.0)={The quick brown fo

Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-10 Thread Joey Hanzel
Hi Gary,

I have been experiencing the same problem... Unable to extract content from
archive file formats.  I just tried again with a clean install of Solr 3.1.0
(using Tika 0.8) and continue to experience the same results.  Did you have
any success with this problem with Solr 1.4.1 or 3.1.0 ?

I'm using this curl command to send data to Solr.
curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" \
-H "application/octet-stream" -F  "myfile=@data.zip"

No problem extracting single rich text documents, but archive files only
result in the file names within the archive being indexed. Am I missing
something else in my configuration? Solr doesn't seem to be unpacking the
archive files. Based on the email chain associated with your first message,
some people have been able to get this functionality to work as desired.
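
For reference, a minimal sketch of the same upload with the content type made explicit. As written above, -H "application/octet-stream" carries no header name, so it presumably has no effect; with -F, curl generates the multipart/form-data request header on its own, and ";type=" sets the type of the uploaded part. The helper names and the SOLR_URL variable are hypothetical; host, port, and field names are taken from the command above. This only tidies the request and is not expected to change whether Solr unpacks the archive.

```shell
#!/bin/sh
# Assumption: Solr base URL as used in the command above.
SOLR_URL="${SOLR_URL:-http://localhost:8080/solr}"

# Hypothetical helper: build the Solr Cell extract URL.
#   $1 = document id, $2 = field that receives the extracted body
extract_url() {
  printf '%s/update/extract?literal.id=%s&fmap.content=%s&commit=true' \
    "$SOLR_URL" "$1" "$2"
}

# Hypothetical helper: upload one file as a multipart form part.
#   $1 = document id, $2 = path to the file
post_archive() {
  # ";type=" labels the uploaded part; no -H option is needed with -F.
  curl "$(extract_url "$1" attr_content)" \
    -F "myfile=@$2;type=application/zip"
}

# Usage (needs a running Solr):
#   post_archive doc1 data.zip
```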

On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor wrote:

> Can anyone shed any light on this, and whether it could be a config issue?
>  I'm now using the latest SVN trunk, which includes the Tika 0.8 jars.
>
> When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to
> the ExtractingRequestHandler, I get the following log entry (formatted for
> ease of reading) :
>
> SolrInputDocument[
>{
>ignored_meta=ignored_meta(1.0)={
>[stream_source_info, file, stream_content_type,
> application/octet-stream, stream_size, 260, stream_name, solr1.zip,
> Content-Type, application/zip]
>},
>ignored_=ignored_(1.0)={
>[package-entry, package-entry]
>},
>ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
>ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
>ignored_stream_size=ignored_stream_size(1.0)={260},
>ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
>ignored_content_type=ignored_content_type(1.0)={application/zip},
>docid=docid(1.0)={74},
>type=type(1.0)={5},
>text=text(1.0)={  doc2.txtdoc1.txt}
>}
> ]
>
> So, the data coming back from Tika when parsing a ZIP file does not include
> the file contents, only the names of the files contained therein.  I've
> tried forcing stream.type=application/zip in the CURL string, but that makes
> no difference.  If I specify an invalid stream.type then I get an exception
> response, so I know it's being used.
>
> When I send one of those txt files individually to the
> ExtractingRequestHandler, I get:
>
> SolrInputDocument[
>{
>ignored_meta=ignored_meta(1.0)={
>[stream_source_info, file, stream_content_type, text/plain,
> stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]
>},
>ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
>ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
>ignored_stream_size=ignored_stream_size(1.0)={30},
>ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
>ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
>docid=docid(1.0)={74},
>type=type(1.0)={5},
>text=text(1.0)={The quick brown fox  }
>}
> ]
>
> and we see the file contents in the "text" field.
>
> I'm using the following requestHandler definition in solrconfig.xml:
>
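> <requestHandler name="/update/extract"
>  class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
> startup="lazy">
> <lst name="defaults">
> <str name="fmap.content">text</str>
> <str name="lowernames">true</str>
> <str name="uprefix">ignored_</str>
>
> <str name="captureAttr">true</str>
> <str name="fmap.a">links</str>
> <str name="fmap.div">ignored_</str>
> </lst>
> </requestHandler>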
>
> Is there any further debug or diagnostic I can get out of Tika to help me
> work out why it's only returning the file names and not the file contents
> when parsing a ZIP file?
>
>
> Thanks and kind regards,
> Gary.
>
>
>
> On 25/01/2011 16:48, Jayendra Patil wrote:
>
>> Hi Gary,
>>
>> The latest Solr Trunk was able to extract and index the contents of the zip
>> file using the ExtractingRequestHandler.
>> The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
>> worked pretty well.
>>
>> Tested again with sample url and works fine -
>> curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"
>>
>> You would probably need to drill down to the Tika Jars and
>> the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.
>>
>> Regards,
>> Jayendra
>>
>>
>


Re: Solr ExtractingRequestHandler with Compressed files

2010-10-26 Thread Joey Hanzel
Hi Jayendra,

Thanks for the suggestion, I updated to Solr 1.4.1 and Solr Cell 1.4.1 and
tried sending a zip file that contained several html documents.
Unfortunately, that did not solve the problem.

Here's the curl command I used:
curl "http://localhost:8983/solr/update/extract?literla.id=d...@uprefix=attr_&fmap.content=attri_content&commit=true" \
-F "file=data.zip"

When I query for id:doc1, the attr_content lists each filename within the
zip archive. It also indexed the stream_size, stream_source and
content_type.  It does not appear to be opening up the individual files
within the zip.

Did you have to make any other configuration changes to your solrconfig.xml
or schema.xml to read the contents of the individual files?  Would it help
to pass the specific mime type on the curl line ?
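
A note on the command as it appears above: with curl, -F "file=data.zip" sends the literal text "data.zip" as the form value, while -F "file=@data.zip" uploads the bytes of the file. The archived copy of the command is visibly mangled, so the original may well have been correct. Below is a minimal sketch of the upload argument and of the follow-up query for id:doc1, assuming the default port 8983 and the attr_content field; the helper names and SOLR_URL are hypothetical.

```shell
#!/bin/sh
# Assumption: Solr base URL on the default example port.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Hypothetical helper: form argument that uploads the file itself.
# The "@" makes curl read the file; without it the path is sent as text.
form_arg() {
  printf 'file=@%s' "$1"
}

# Hypothetical helper: query URL returning one field of one document.
#   $1 = document id, $2 = field to return
select_url() {
  printf '%s/select?q=id:%s&fl=%s' "$SOLR_URL" "$1" "$2"
}

# Usage (needs a running Solr):
#   curl "$SOLR_URL/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" \
#     -F "$(form_arg data.zip)"
#   curl "$(select_url doc1 attr_content)"
```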

On Mon, Oct 25, 2010 at 3:27 PM, Jayendra Patil <
jayendra.patil@gmail.com> wrote:

> There was this issue with the previous version of Solr, wherein only the
> file names from the zip used to get indexed.
> We had faced the same issue and ended up using the Solr trunk which has the
> Tika version upgraded and works fine.
>
> The Solr version 1.4.1 should also have the fix included. Try using it.
>
> Regards,
> Jayendra
>
> On Fri, Oct 22, 2010 at 6:02 PM, Joey Hanzel wrote:
>
> > Hi,
> >
> > Has anyone had success using ExtractingRequestHandler and Tika with any of
> > the compressed file formats (zip, tar, gz, etc) ?
> >
> > I am sending solr the archived.tar file using curl.
> > curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true"
> > -H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar"
> > The result I get when I query the document is that the filenames inside the
> > archive are indexed as the "body_texts", but the content of those files is
> > not extracted or included.  This is not the behavior I expected. Ref:
> > http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example.
> > When I send 1 of the actual documents inside the archive using the same curl
> > command the extracted content is then stored in the "body_texts" field.  Am
> > I missing a step for the compressed files?
> >
> > I have added all the extraction dependencies as indicated by mat in
> > http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and
> > am able to successfully extract data from MS Word, PDF, HTML documents.
> >
> > I'm using the following library versions.
> >  Solr 1.4.0,  Solr Cell 1.4.1, with Tika Core 0.4
> >
> > Given everything I have read this version of Tika should support extracting
> > data from all files within a compressed file.  Any help or suggestions would
> > be appreciated.
> >
>


Solr ExtractingRequestHandler with Compressed files

2010-10-22 Thread Joey Hanzel
Hi,

Has anyone had success using ExtractingRequestHandler and Tika with any of
the compressed file formats (zip, tar, gz, etc) ?

I am sending solr the archived.tar file using curl.
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" \
-H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar"
The result I get when I query the document is that the filenames inside the
archive are indexed as the "body_texts", but the content of those files is
not extracted or included.  This is not the behavior I expected. Ref:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example.
When I send 1 of the actual documents inside the archive using the same curl
command the extracted content is then stored in the "body_texts" field.  Am
I missing a step for the compressed files?

I have added all the extraction dependencies as indicated by mat in
http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and
am able to successfully extract data from MS Word, PDF, HTML documents.

I'm using the following library versions.
  Solr 1.4.0,  Solr Cell 1.4.1, with Tika Core 0.4

Given everything I have read this version of Tika should support extracting
data from all files within a compressed file.  Any help or suggestions would
be appreciated.
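
Since sending one of the member documents on its own does get its content indexed, one possible stopgap is to unpack the archive on the client and post each member separately. The sketch below is only an illustration under stated assumptions, not a fix for the extraction itself: the function name, the id scheme (prefix-N), the SOLR_URL variable, and the DRY_RUN switch are all hypothetical, while the URL parameters and curl options mirror the command above.

```shell
#!/bin/sh
# Assumption: Solr base URL on the default example port.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Hypothetical helper: unpack a tar archive and post each member on its own.
#   $1 = tar archive, $2 = id prefix; ids become <prefix>-1, <prefix>-2, ...
#   Set DRY_RUN=1 to print what would be sent instead of calling curl.
post_members() {
  archive="$1"
  prefix="$2"
  workdir="$(mktemp -d)" || return 1
  if ! tar -xf "$archive" -C "$workdir"; then
    rm -rf "$workdir"
    return 1
  fi
  n=0
  # Sort so that the generated ids are stable across runs.
  find "$workdir" -type f | sort | while IFS= read -r f; do
    n=$((n + 1))
    url="$SOLR_URL/update/extract?literal.id=$prefix-$n&fmap.content=body_texts&commit=true"
    if [ -n "${DRY_RUN:-}" ]; then
      printf '%s -> %s\n' "${f#"$workdir"/}" "$url"
    else
      curl "$url" -H 'Content-type:application/octet-stream' \
        --data-binary "@$f"
    fi
  done
  rm -rf "$workdir"
}

# Usage:
#   DRY_RUN=1 post_members /home/archived.tar doc1   # preview only
#   post_members /home/archived.tar doc1             # needs a running Solr
```

Committing on every request follows the original command; for larger archives a single commit after the last member would likely be preferable.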