
Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

2011-05-23 Thread Gary Taylor


Jayendra,

I cleared out my local repository and replayed all of my steps from
Friday, and now it works.  The only difference (or the only one that's
obvious to me) was that I applied the patch before doing a full
compile/test/dist.  But I assumed that, given I was seeing my new log
entries (from ExtractingDocumentLoader.java), I was running the correct
code anyway.


However, I'm very pleased that it's working now - I get the full 
contents of the zipped files indexed and not just the file names.


Thank you again for your assistance, and the patch!

Kind regards,
Gary.


On 21/05/2011 03:12, Jayendra Patil wrote:

Hi Gary,

I tried the patch on the 3.1 source code (@
http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/)
as well and it worked fine.
@Patch - https://issues.apache.org/jira/browse/SOLR-2416, which deals
with the Solr Cell module.

You may want to verify the contents from the results by enabling the
stored attribute on the text field.

e.g. URL:
curl "http://localhost:8983/solr/update/extract?stream.file=C:/Test.zip&literal.id=777045&literal.title=Test&commit=true"
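
One way to do the verification suggested above is to fetch the indexed document back and look at its stored fields. The following is only a minimal sketch: it assumes the standard /select handler is available on the same host, that the id, title and text fields are stored, and that the JSON response writer is enabled. The helper name build_select_url is hypothetical.

```python
from urllib.parse import urlencode


def build_select_url(base, doc_id, fields=("id", "title", "text")):
    """Return a select URL asking for the stored fields of one document.

    Assumption: `base` points at a Solr instance exposing /select, and the
    listed fields are marked stored="true" in the schema.
    """
    params = {
        "q": f"id:{doc_id}",      # look the document up by its unique key
        "fl": ",".join(fields),   # only return the fields we want to inspect
        "wt": "json",             # ask for a JSON response
    }
    # urlencode joins the parameters with '&' and escapes ':' and ','
    return f"{base}/select?{urlencode(params)}"


url = build_select_url("http://localhost:8983/solr", "777045")
print(url)
```

If the text field comes back containing only the names of the archive entries, the contents of the zipped files were not extracted.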

Let me know if it works. I would be happy to share the generated
artifact you can test on.

Regards,
Jayendra

Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

2011-05-20 Thread Gary Taylor

Hello again. Unfortunately, I'm still getting nowhere with this. I
have checked out the 3.1 source and applied Jayendra's patches (see
below) and it still appears that the contents of the files in the
zipfile are not being indexed, only the filenames of those contained files.

I'm using a simple CURL invocation to test this:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" \
  -F commit=true -F file=@solr1.zip
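
The parameters in a request like this have to be joined with & and the whole URL quoted, otherwise the shell treats each & as a background operator and Solr never sees the later parameters. A minimal sketch of assembling such a command safely is shown here; the helper name build_extract_command is hypothetical, and the parameter names are simply the ones used in this thread.

```python
import shlex
from urllib.parse import urlencode


def build_extract_command(base, params, upload):
    """Return a curl command line that posts `upload` to /update/extract.

    `params` become the query string; `upload` is sent as a multipart
    form field named "file", as in the command quoted in this thread.
    """
    url = f"{base}/update/extract?{urlencode(params)}"
    return " ".join([
        "curl", shlex.quote(url),              # quoting protects '&' and '?'
        "-F", "commit=true",
        "-F", shlex.quote(f"file=@{upload}"),
    ])


cmd = build_extract_command(
    "http://localhost:8983/solr/core0",
    {"literal.docid": "74", "fmap.content": "text", "literal.type": "5"},
    "solr1.zip",
)
print(cmd)
```

Printing the command before running it makes it easy to confirm that all three query parameters are present and separated.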

solr1.zip contains two simple txt files (doc1.txt and doc2.txt). I'm
expecting the contents of those txt files to be extracted from the zip
and indexed, but this isn't happening - or at least, I don't get the
desired result when I do a query afterwards. I do get a match if I
search for either doc1.txt or doc2.txt, but not if I search for a
word that appears in their contents.
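
To make a test like this easy to repeat, the archive can be generated with a distinctive marker word in each file, so that a hit on the marker can only come from extracted content and not from a file name. This is a rough sketch using only the Python standard library; the marker words and helper names are made up for illustration.

```python
import io
import zipfile


def make_test_zip(files):
    """Build a zip archive in memory from a {name: text} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


def read_back(data):
    """Return {name: text} for every entry, to confirm what was packed."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}


# Hypothetical marker words that do not occur in the file names.
data = make_test_zip({
    "doc1.txt": "first document quokka",
    "doc2.txt": "second document zebrafinch",
})
contents = read_back(data)
for name, text in contents.items():
    print(name, "->", text)

# Write the archive to disk so it can be posted with curl.
with open("solr1.zip", "wb") as fh:
    fh.write(data)
```

After indexing the generated solr1.zip, a search for one of the marker words should match only if the contents of the archive entries were indexed.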

If I index one of the txt files (instead of the zipfile), I can query
the content OK, so I'm assuming my query is sensible and matches the
field specified on the CURL string (i.e. text). I'm also happy that
the Solr Cell content extraction is working because I can successfully
index PDF, Word, etc. files.

In a fit of desperation I have added log.info statements into the files
referenced by Jayendra's patches (SOLR-2416 and SOLR-2332) and I see
those in the log when I submit the zipfile with CURL, so I know I'm
running those patched files in the build.

If anyone can shed any light on what's happening here, I'd be very grateful.

Thanks and kind regards,
Gary.

On 11/04/2011 11:12, Gary Taylor wrote:

Jayendra,

Thanks for the info - been keeping an eye on this list in case this
topic cropped up again. It's currently a background task for me, so
I'll try and take a look at the patches and re-test soon.

Joey - glad you brought this issue up again. I haven't progressed any
further with it. I've not yet moved to Solr 3.1 but it's on my to-do
list, as is testing out the patches referenced by Jayendra. I'll post
my findings on this thread - if you manage to test the patches before
me, let me know how you get on.

Thanks and kind regards,
Gary.

On 11/04/2011 05:02, Jayendra Patil wrote:

The migration of Tika to the latest 0.8 version seems to have
reintroduced the issue.

I was able to get this working again with the following patches (for
Solr Cell and the Data Import Handler):

https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

You can try these.

Regards,
Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel phan...@nearinfinity.com wrote:

Hi Gary,

I have been experiencing the same problem... Unable to extract content
from archive file formats. I just tried again with a clean install of
Solr 3.1.0 (using Tika 0.8) and continue to experience the same results.
Did you have any success with this problem with Solr 1.4.1 or 3.1.0?

I'm using this curl command to send data to Solr.
curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" \
  -H application/octet-stream -F myfile=@data.zip

No problem extracting single rich text documents, but archive files only
result in the file names within the archive being indexed. Am I missing
something else in my configuration? Solr doesn't seem to be unpacking the
archive files. Based on the email chain associated with your first
message, some people have been able to get this functionality to work as
desired.

--
Gary Taylor
INOVEM

Tel +44 (0)1488 648 480
Fax +44 (0)7092 115 933
gary.tay...@inovem.com
www.inovem.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE

Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

2011-05-20 Thread Jayendra Patil

Hi Gary,

I tried the patch on the 3.1 source code (@
http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/)
as well and it worked fine.
@Patch - https://issues.apache.org/jira/browse/SOLR-2416, which deals
with the Solr Cell module.

You may want to verify the contents from the results by enabling the
stored attribute on the text field.

e.g. URL:
curl "http://localhost:8983/solr/update/extract?stream.file=C:/Test.zip&literal.id=777045&literal.title=Test&commit=true"

Let me know if it works. I would be happy to share the generated
artifact you can test on.

Regards,
Jayendra

