Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-27 Thread Gary Taylor

Alex,

I've created JIRA ticket: https://issues.apache.org/jira/browse/SOLR-7174

In response to your suggestions below:

1. No exceptions are reported, even with onError removed.
2. ProcessMonitor shows only the very first epub file is being read (repeatedly).

3. I can repeat this on Ubuntu (14.04) by following the same steps.
4. Ticket raised (https://issues.apache.org/jira/browse/SOLR-7174)

Additionally (and I've added this on the ticket), if I change the 
dataConfig to use FileDataSource and PlainTextEntityProcessor, and just 
list *.txt files, it works!


<dataConfig>
    <dataSource type="FileDataSource" name="bin" />
    <document>
        <entity name="files" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="c:/Users/gt/Documents/HackerMonthly/epub"
                fileName=".*txt">

            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport"
                    processor="PlainTextEntityProcessor"
                    url="${files.fileAbsolutePath}" format="text"
                    dataSource="bin">

                <field column="plainText" name="content" />
            </entity>
        </entity>
    </document>
</dataConfig>

So it's something related to BinFileDataSource and TikaEntityProcessor.

Thanks,
Gary.

On 26/02/2015 14:24, Gary Taylor wrote:

Alex,

That's great.  Thanks for the pointers.  I'll try and get more info on 
this and file a JIRA issue.


Kind regards,
Gary.

On 26/02/2015 14:16, Alexandre Rafalovitch wrote:

On 26 February 2015 at 08:32, Gary Taylor g...@inovem.com wrote:

Alex,

Same results on recursive=true / recursive=false.

I also tried importing plain text files instead of epub (still using
TikaEntityProcessor though) and get exactly the same result - ie. all files
fetched, but only one document indexed in Solr.

To me, this would indicate that something is a problem with the inner
DIH entity then. As a next set of steps, I would probably
1) remove both onError statements and see if there is an exception
that is being swallowed.
2) run the import under ProcessMonitor and see if the other files are
actually being read
https://technet.microsoft.com/en-us/library/bb896645.aspx
3) Assume a Windows bug and test this on Mac/Linux
4) File a JIRA with a replication case. If there is a full replication
setup, I'll test it machines I have access to with full debugger
step-through

For example, I wonder if BinFileDataSource is somehow not cleaning up
after the first file properly on Windows and fails to open the second
one.

Regards,
Alex.


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/





--
Gary Taylor | www.inovem.com | www.kahootz.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.



Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-26 Thread Gary Taylor

Alex,

Same results on recursive=true / recursive=false.

I also tried importing plain text files instead of epub (still using
TikaEntityProcessor though) and get exactly the same result - ie. all
files fetched, but only one document indexed in Solr.


With verbose output, I get a row for each file in the directory, but
only the first one has a non-empty documentImport entity. All
subsequent documentImport entities just have an empty document#2 entry, e.g.:


 
  verbose-output: [
entity:files,
[
  null,
  --- row #1-,
  fileSize,
  2609004,
  fileLastModified,
  2015-02-25T11:37:25.217Z,
  fileAbsolutePath,
  c:\\Users\\gt\\Documents\\epub\\issue018.epub,
  fileDir,
  c:\\Users\\gt\\Documents\\epub,
  file,
  issue018.epub,
  null,
  -,
  entity:documentImport,
  [
document#1,
[
  query,
  c:\\Users\\gt\\Documents\\epub\\issue018.epub,
  time-taken,
  0:0:0.0,
  null,
  --- row #1-,
  text,
   ... parsed epub text - snip ... 
  title,
  Issue 18 title,
  Author,
  Author text,
  null,
  -
],
document#2,
[]
  ],
  null,
  --- row #2-,
  fileSize,
  4428804,
  fileLastModified,
  2015-02-25T11:37:36.399Z,
  fileAbsolutePath,
  c:\\Users\\gt\\Documents\\epub\\issue019.epub,
  fileDir,
  c:\\Users\\gt\\Documents\\epub,
  file,
  issue019.epub,
  null,
  -,
  entity:documentImport,
  [
document#2,
[]
  ],
  null,
  --- row #3-,
  fileSize,
  2580266,
  fileLastModified,
  2015-02-25T11:37:41.188Z,
  fileAbsolutePath,
  c:\\Users\\gt\\Documents\\epub\\issue020.epub,
  fileDir,
  c:\\Users\\gt\\Documents\\epub,
  file,
  issue020.epub,
  null,
  -,
  entity:documentImport,
  [
document#2,
[]
  ],
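
For reference, debug/verbose output like the listing above can be requested straight from the DIH handler; a sketch, assuming the hn2 core set up earlier in the thread:

curl "http://localhost:8983/solr/hn2/dataimport?command=full-import&debug=true&verbose=true&commit=true&wt=json&indent=true"

With debug=true the commit has to be requested explicitly, hence the commit=true parameter.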






Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-26 Thread Gary Taylor

Alex,

That's great.  Thanks for the pointers.  I'll try and get more info on 
this and file a JIRA issue.


Kind regards,
Gary.

On 26/02/2015 14:16, Alexandre Rafalovitch wrote:

On 26 February 2015 at 08:32, Gary Taylor g...@inovem.com wrote:

Alex,

Same results on recursive=true / recursive=false.

I also tried importing plain text files instead of epub (still using
TikaEntityProcessor though) and get exactly the same result - ie. all files
fetched, but only one document indexed in Solr.

To me, this would indicate that something is a problem with the inner
DIH entity then. As a next set of steps, I would probably
1) remove both onError statements and see if there is an exception
that is being swallowed.
2) run the import under ProcessMonitor and see if the other files are
actually being read
https://technet.microsoft.com/en-us/library/bb896645.aspx
3) Assume a Windows bug and test this on Mac/Linux
4) File a JIRA with a replication case. If there is a full replication
setup, I'll test it machines I have access to with full debugger
step-through

For example, I wonder if BinFileDataSource is somehow not cleaning up
after the first file properly on Windows and fails to open the second
one.

Regards,
Alex.


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/



--
Gary Taylor | www.inovem.com | www.kahootz.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.



Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Gary Taylor
I can't get the FileListEntityProcessor and TikaEntityProcessor to 
correctly add a Solr document for each epub file in my local directory.


I've just downloaded Solr 5.0.0, on a Windows 7 PC.   I ran solr start 
and then solr create -c hn2 to create a new core.


I want to index a load of epub files that I've got in a directory. So I 
created a data-import.xml (in solr\hn2\conf):


<dataConfig>
    <dataSource type="BinFileDataSource" name="bin" />
    <document>
        <entity name="files" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
                onError="skip"
                recursive="true">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport" processor="TikaEntityProcessor"
                    url="${files.fileAbsolutePath}" format="text"
                    dataSource="bin" onError="skip">

                <field column="file" name="fileName" />
                <field column="Author" name="author" meta="true" />
                <field column="title" name="title" meta="true" />
                <field column="text" name="content" />
            </entity>
        </entity>
    </document>
</dataConfig>

In my solrconfig.xml, I added a requestHandler entry to reference my 
data-import.xml:


  <requestHandler name="/dataimport"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
          <str name="config">data-import.xml</str>
      </lst>
  </requestHandler>

I renamed managed-schema to schema.xml, and ensured the following doc 
fields were setup:


  <field name="id" type="string" indexed="true" stored="true"
      required="true" multiValued="false" />

  <field name="fileName" type="string" indexed="true" stored="true" />
  <field name="author" type="string" indexed="true" stored="true" />
  <field name="title" type="string" indexed="true" stored="true" />

  <field name="size" type="long" indexed="true" stored="true" />
  <field name="lastModified" type="date" indexed="true" stored="true" />

  <field name="content" type="text_en" indexed="false" stored="true"
      multiValued="false" />
  <field name="text" type="text_en" indexed="true" stored="false"
      multiValued="true" />

  <copyField source="content" dest="text" />

I copied all the jars from dist and contrib\* into server\solr\lib.
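
As an aside, instead of copying the jars around, the stock Solr 5 solrconfig.xml can load them in place with lib directives; a sketch along those lines (the relative paths are assumptions that depend on where the core lives):

  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />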

Stopping and restarting solr then creates a new managed-schema file and 
renames schema.xml to schema.xml.back
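
That regeneration happens because Solr 5 uses the managed schema by default; to keep a hand-edited schema.xml, the schema factory can be switched in solrconfig.xml. A minimal sketch:

  <schemaFactory class="ClassicIndexSchemaFactory"/>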


All good so far.

Now I go to the web admin for dataimport 
(http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and 
execute a full import.
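
The same import can also be kicked off from the command line; a sketch, assuming the hn2 core and the /dataimport handler defined above:

curl "http://localhost:8983/solr/hn2/dataimport?command=full-import&clean=true&commit=true"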


But, the results show Requests: 0, Fetched: 58, Skipped: 0, 
Processed:1 - ie. it only adds one document (the very first one) even 
though it's iterated over 58!


No errors are reported in the logs.

I can search on the contents of that first epub document, so it's 
extracting OK in Tika, but there's a problem somewhere in my config 
that's causing only 1 document to be indexed in Solr.


Thanks for any assistance / pointers.

Regards,
Gary

--
Gary Taylor | www.inovem.com | www.kahootz.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.



Re: Can't index all docs in a local folder with DIH in Solr 5.0.0

2015-02-25 Thread Gary Taylor

Alex,

Thanks for the suggestions.  It always just indexes 1 doc, regardless of 
the first epub file it sees.  Debug / verbose don't show anything 
obvious to me.  I can include the output here if you think it would help.


I tried using the SimplePostTool first (java -Dtype=application/epub+zip 
-Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar 
\Users\gt\Documents\epub\*.epub) to index the docs and check the Tika 
parsing, and that works OK, so I don't think it's the epubs.


I was trying to use DIH so that I could more easily specify the schema 
fields and store content in the index, in preparation for trying out the 
search highlighting. I couldn't work out how to do that with post.jar.
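
(One way it can be approximated with post.jar is via -Dparams, which passes literals and field mappings through to /update/extract; a rough sketch, with a made-up id value, and options that vary by Solr version:

java -Dtype=application/epub+zip -Dparams="literal.id=issue018&fmap.content=content" -Durl=http://localhost:8983/solr/hn2/update/extract -jar post.jar issue018.epub

The catch is that literal.id is fixed per invocation, so a wildcard batch would need one call per file.)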


Thanks,
Gary

On 25/02/2015 17:09, Alexandre Rafalovitch wrote:

Try removing that first epub from the directory and rerunning. If you
now index 0 documents, then there is something unexpected about them
and DIH skips. If it indexes 1 document again but a different one,
then it is definitely something about the repeat logic.

Also, try running with debug and verbose modes and see if something
specific shows up.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 25 February 2015 at 11:14, Gary Taylor g...@inovem.com wrote:

I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly
add a Solr document for each epub file in my local directory.

I've just downloaded Solr 5.0.0, on a Windows 7 PC.   I ran solr start and
then solr create -c hn2 to create a new core.

I want to index a load of epub files that I've got in a directory. So I
created a data-import.xml (in solr\hn2\conf):

<dataConfig>
    <dataSource type="BinFileDataSource" name="bin" />
    <document>
        <entity name="files" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
                onError="skip"
                recursive="true">
            <field column="fileAbsolutePath" name="id" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <entity name="documentImport" processor="TikaEntityProcessor"
                    url="${files.fileAbsolutePath}" format="text"
                    dataSource="bin" onError="skip">
                <field column="file" name="fileName" />
                <field column="Author" name="author" meta="true" />
                <field column="title" name="title" meta="true" />
                <field column="text" name="content" />
            </entity>
        </entity>
    </document>
</dataConfig>

In my solrconfig.xml, I added a requestHandler entry to reference my
data-import.xml:

   <requestHandler name="/dataimport"
       class="org.apache.solr.handler.dataimport.DataImportHandler">
     <lst name="defaults">
       <str name="config">data-import.xml</str>
     </lst>
   </requestHandler>

I renamed managed-schema to schema.xml, and ensured the following doc fields
were setup:

   <field name="id" type="string" indexed="true" stored="true"
       required="true" multiValued="false" />
   <field name="fileName" type="string" indexed="true" stored="true" />
   <field name="author" type="string" indexed="true" stored="true" />
   <field name="title" type="string" indexed="true" stored="true" />

   <field name="size" type="long" indexed="true" stored="true" />
   <field name="lastModified" type="date" indexed="true" stored="true" />

   <field name="content" type="text_en" indexed="false" stored="true"
       multiValued="false" />
   <field name="text" type="text_en" indexed="true" stored="false"
       multiValued="true" />

   <copyField source="content" dest="text" />

I copied all the jars from dist and contrib\* into server\solr\lib.

Stopping and restarting solr then creates a new managed-schema file and
renames schema.xml to schema.xml.back

All good so far.

Now I go to the web admin for dataimport
(http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try and
execute a full import.

But, the results show Requests: 0, Fetched: 58, Skipped: 0, Processed:1 -
ie. it only adds one document (the very first one) even though it's iterated
over 58!

No errors are reported in the logs.

I can search on the contents of that first epub document, so it's extracting
OK in Tika, but there's a problem somewhere in my config that's causing only
1 document to be indexed in Solr.

Thanks for any assistance / pointers.

Regards,
Gary

--
Gary Taylor | www.inovem.com | www.kahootz.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.



--
Gary Taylor | www.inovem.com | www.kahootz.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.



Re: tika integration exception and other related queries

2011-06-09 Thread Gary Taylor

Naveen,

Not sure our requirement matches yours, but one of the things we index 
is a comment item that can have one or more files attached to it.  To 
index the whole thing as a single Solr document we create a zipfile 
containing a file with the comment details in it and any additional 
attached files.  This is submitted to Solr as a TEXT field in an XML 
doc, along with other meta-data fields from the comment.  In our schema 
the TEXT field is indexed but not stored, so when we search and get a 
match back it doesn't contain all of the contents from the attached 
files etc., only the stored fields in our schema.   Admittedly, the user 
can therefore get back a comment match with no indication as to WHERE 
the match occurred (ie. was it in the meta-data or the contents of the 
attached files), but at the moment we're only interested in getting 
appropriate matches, not explaining where the match is.
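
As a rough illustration of that bundling step (class and file names here are made up, and this is just one way to build the zip), the bundle is put together along these lines before being sent to Solr:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class CommentBundler {

    // Bundle the comment text plus its attached files into one zip so the
    // whole lot can be indexed as a single Solr document.
    public static File bundle(String commentText, List<File> attachments) throws IOException {
        File zip = File.createTempFile("comment-", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip))) {
            out.putNextEntry(new ZipEntry("comment.txt"));
            out.write(commentText.getBytes("UTF-8"));
            out.closeEntry();

            byte[] buf = new byte[8192];
            for (File attachment : attachments) {
                out.putNextEntry(new ZipEntry(attachment.getName()));
                try (FileInputStream in = new FileInputStream(attachment)) {
                    int n;
                    while ((n = in.read(buf)) > 0) {
                        out.write(buf, 0, n);   // copy the attachment bytes into the zip entry
                    }
                }
                out.closeEntry();
            }
        }
        return zip;
    }
}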


Hope that helps.

Kind regards,
Gary.



On 09/06/2011 03:00, Naveen Gupta wrote:

Hi Gary

It started working .. though i did not test for Zip files, but for rar
files, it is working fine ..

only thing what i wanted to do is to index the metadata (text mapped to
content) not store the data  Also in search result, i want to filter the
stuffs ... and it started working fine .. i don't want to show the content
stuffs to the end user, since the way it extracts the information is not
very helpful to the user .. although we can apply few of the analyzers and
filters to remove the unnecessary tags ..still the information would not be
of much help .. looking for your opinion ... what you did in order to filter
out the content or are you showing the content extracted to the end user?

Even in case, we are showing the text part to the end user, how can i limit
the number of characters while querying the search results ... is there any
feature where we can achieve this ... the concept of snippet kind of thing
...

Thanks
Naveen
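
What's being described here is essentially Solr's highlighting component, which returns bounded snippets rather than the whole stored value; a sketch of such a query (field names are assumptions, and the highlighted field must be stored):

curl "http://localhost:8983/solr/select?q=content:lucene&fl=id,title&hl=true&hl.fl=content&hl.snippets=1&hl.fragsize=100"

hl.fragsize controls the approximate snippet length in characters.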

On Wed, Jun 8, 2011 at 1:45 PM, Gary Taylor <g...@inovem.com> wrote:


Naveen,

For indexing Zip files with Tika, take a look at the following thread :


http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html

I got it to work with the 3.1 source and a couple of patches.

Hope this helps.

Regards,
Gary.



On 08/06/2011 04:12, Naveen Gupta wrote:


Hi Can somebody answer this ...

3. can somebody tell me an idea how to do indexing for a zip file ?

1. while sending docx, we are getting following error.





Re: tika integration exception and other related queries

2011-06-08 Thread Gary Taylor

Naveen,

For indexing Zip files with Tika, take a look at the following thread :

http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html

I got it to work with the 3.1 source and a couple of patches.

Hope this helps.

Regards,
Gary.


On 08/06/2011 04:12, Naveen Gupta wrote:

Hi Can somebody answer this ...

3. can somebody tell me an idea how to do indexing for a zip file ?

1. while sending docx, we are getting following error.




Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

2011-05-23 Thread Gary Taylor

Jayendra,

I cleared out my local repository, and replayed all of my steps from 
Friday and it now it works.  The only difference (or the only one that's 
obvious to me) was that I applied the patch before doing a full 
compile/test/dist.  But I assumed that given I was seeing my new log 
entries (from ExtractingDocumentLoader.java) I was running the correct 
code anyway.


However, I'm very pleased that it's working now - I get the full 
contents of the zipped files indexed and not just the file names.


Thank you again for your assistance, and the patch!

Kind regards,
Gary.


On 21/05/2011 03:12, Jayendra Patil wrote:

Hi Gary,

I tried the patch on the the 3.1 source code (@
http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/)
as well and it worked fine.
@Patch - https://issues.apache.org/jira/browse/SOLR-2416, which deals
with the Solr Cell module.

You may want to verify the contents from the results by enabling the
stored attribute on the text field.

e.g. curl "http://localhost:8983/solr/update/extract?stream.file=C:/Test.zip&literal.id=777045&literal.title=Test&commit=true"

Let me know if it works. I would be happy to share the generated
artifact you can test on.

Regards,
Jayendra




Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)

2011-05-20 Thread Gary Taylor
Hello again.  Unfortunately, I'm still getting nowhere with this.  I 
have checked-out the 3.1 source and applied Jayendra's patches (see 
below) and it still appears that the contents of the files in the 
zipfile are not being indexed, only the filenames of those contained files.


I'm using a simple CURL invocation to test this:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" 
-F commit=true -F file=@solr1.zip


solr1.zip contains two simple txt files (doc1.txt and doc2.txt).  I'm 
expecting the contents of those txt files to be extracted from the zip 
and indexed, but this isn't happening - or at least, I don't get the 
desired result when I do a query afterwards.  I do get a match if I 
search for either doc1.txt or doc2.txt, but not if I search for a 
word that appears in their contents.


If I index one of the txt files (instead of the zipfile), I can query 
the content OK, so I'm assuming my query is sensible and matches the 
field specified on the CURL string (ie. text).  I'm also happy that 
the Solr Cell content extraction is working because I can successfully 
index PDF, Word, etc. files.


In a fit of desperation I have added log.info statements into the files 
referenced by Jayendra's patches (SOLR-2416 and SOLR-2332) and I see 
those in the log when I submit the zipfile with CURL, so I know I'm 
running those patched files in the build.


If anyone can shed any light on what's happening here, I'd be very grateful.

Thanks and kind regards,
Gary.


On 11/04/2011 11:12, Gary Taylor wrote:

Jayendra,

Thanks for the info - been keeping an eye on this list in case this 
topic cropped up again.  It's currently a background task for me, so 
I'll try and take a look at the patches and re-test soon.


Joey - glad you brought this issue up again.  I haven't progressed any 
further with it.  I've not yet moved to Solr 3.1 but it's on my to-do 
list, as is testing out the patches referenced by Jayendra.  I'll post 
my findings on this thread - if you manage to test the patches before 
me, let me know how you get on.


Thanks and kind regards,
Gary.


On 11/04/2011 05:02, Jayendra Patil wrote:

The migration of Tika to the latest 0.8 version seems to have
reintroduced the issue.

I was able to get this working again with the following patches. (Solr
Cell and Data Import handler)

https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

You can try these.

Regards,
Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel <phan...@nearinfinity.com> wrote:

Hi Gary,

I have been experiencing the same problem... Unable to extract 
content from
archive file formats.  I just tried again with a clean install of 
Solr 3.1.0
(using Tika 0.8) and continue to experience the same results.  Did 
you have

any success with this problem with Solr 1.4.1 or 3.1.0 ?

I'm using this curl command to send data to Solr.
curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" 
-H "application/octet-stream" -F myfile=@data.zip

No problem extracting single rich text documents, but archive files 
only

result in the file names within the archive being indexed. Am I missing
something else in my configuration? Solr doesn't seem to be 
unpacking the
archive files. Based on the email chain associated with your first 
message,
some people have been able to get this functionality to work as 
desired.










--
Gary Taylor
INOVEM

Tel +44 (0)1488 648 480
Fax +44 (0)7092 115 933
gary.tay...@inovem.com
www.inovem.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE



Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-04-11 Thread Gary Taylor

Jayendra,

Thanks for the info - been keeping an eye on this list in case this 
topic cropped up again.  It's currently a background task for me, so 
I'll try and take a look at the patches and re-test soon.


Joey - glad you brought this issue up again.  I haven't progressed any 
further with it.  I've not yet moved to Solr 3.1 but it's on my to-do 
list, as is testing out the patches referenced by Jayendra.  I'll post 
my findings on this thread - if you manage to test the patches before 
me, let me know how you get on.


Thanks and kind regards,
Gary.


On 11/04/2011 05:02, Jayendra Patil wrote:

The migration of Tika to the latest 0.8 version seems to have
reintroduced the issue.

I was able to get this working again with the following patches. (Solr
Cell and Data Import handler)

https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

You can try these.

Regards,
Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel <phan...@nearinfinity.com> wrote:

Hi Gary,

I have been experiencing the same problem... Unable to extract content from
archive file formats.  I just tried again with a clean install of Solr 3.1.0
(using Tika 0.8) and continue to experience the same results.  Did you have
any success with this problem with Solr 1.4.1 or 3.1.0 ?

I'm using this curl command to send data to Solr.
curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true"
-H "application/octet-stream" -F myfile=@data.zip

No problem extracting single rich text documents, but archive files only
result in the file names within the archive being indexed. Am I missing
something else in my configuration? Solr doesn't seem to be unpacking the
archive files. Based on the email chain associated with your first message,
some people have been able to get this functionality to work as desired.






--
Gary Taylor
INOVEM

Tel +44 (0)1488 648 480
Fax +44 (0)7092 115 933
gary.tay...@inovem.com
www.inovem.com

INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE



Re: adding a document using curl

2011-03-03 Thread Gary Taylor

As an example, I run this in the same directory as the msword1.doc file:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&literal.type=5" 
-F file=@msword1.doc


The type literal is just part of my schema.

Gary.


On 03/03/2011 11:45, Ken Foskey wrote:

On Thu, 2011-03-03 at 12:36 +0100, Markus Jelsma wrote:

Here's a complete example
http://wiki.apache.org/solr/UpdateXmlMessages#Passing_commit_parameters_as_part_of_the_URL

I should have been clearer.  I meant a rich text document; XML I can make work,
and a script is in the example docs folder

http://wiki.apache.org/solr/ExtractingRequestHandler

I also read the solr 1.4 book and tried samples in there,   could not
make them work.

Ta






Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-31 Thread Gary Taylor
Can anyone shed any light on this, and whether it could be a config 
issue?  I'm now using the latest SVN trunk, which includes the Tika 0.8 
jars.


When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) 
to the ExtractingRequestHandler, I get the following log entry 
(formatted for ease of reading) :


SolrInputDocument[
{
ignored_meta=ignored_meta(1.0)={
[stream_source_info, file, stream_content_type, 
application/octet-stream, stream_size, 260, stream_name, solr1.zip, 
Content-Type, application/zip]

},
ignored_=ignored_(1.0)={
[package-entry, package-entry]
},
ignored_stream_source_info=ignored_stream_source_info(1.0)={file},

ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream}, 


ignored_stream_size=ignored_stream_size(1.0)={260},
ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
ignored_content_type=ignored_content_type(1.0)={application/zip},
docid=docid(1.0)={74},
type=type(1.0)={5},
text=text(1.0)={  doc2.txtdoc1.txt}
}
]

So, the data coming back from Tika when parsing a ZIP file does not 
include the file contents, only the names of the files contained 
therein.  I've tried forcing stream.type=application/zip in the CURL 
string, but that makes no difference.  If I specify an invalid 
stream.type then I get an exception response, so I know it's being used.


When I send one of those txt files individually to the 
ExtractingRequestHandler, I get:


SolrInputDocument[
{
ignored_meta=ignored_meta(1.0)={
[stream_source_info, file, stream_content_type, text/plain, 
stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]

},
ignored_stream_source_info=ignored_stream_source_info(1.0)={file},

ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},

ignored_stream_size=ignored_stream_size(1.0)={30},
ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
docid=docid(1.0)={74},
type=type(1.0)={5},
text=text(1.0)={The quick brown fox  }
}
]

and we see the file contents in the text field.

I'm using the following requestHandler definition in solrconfig.xml:

<!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
<requestHandler name="/update/extract"
    class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
    startup="lazy">
    <lst name="defaults">
        <!-- All the main content goes into "text"... if you need to return
             the extracted text or do highlighting, use a stored field. -->
        <str name="fmap.content">text</str>
        <str name="lowernames">true</str>
        <str name="uprefix">ignored_</str>

        <!-- capture link hrefs but ignore div attributes -->
        <str name="captureAttr">true</str>
        <str name="fmap.a">links</str>
        <str name="fmap.div">ignored_</str>
    </lst>
</requestHandler>

Is there any further debug or diagnostic I can get out of Tika to help 
me work out why it's only returning the file names and not the file 
contents when parsing a ZIP file?


Thanks and kind regards,
Gary.



On 25/01/2011 16:48, Jayendra Patil wrote:

Hi Gary,

The latest Solr Trunk was able to extract and index the contents of the zip
file using the ExtractingRequestHandler.
The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
worked pretty well.

Tested again with sample url and works fine -
curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"


You would probably need to drill down to the Tika Jars and
the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.

Regards,
Jayendra





Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor

Hi,

I posted a question in November last year about indexing content from 
multiple binary files into a single Solr document and Jayendra responded 
with a simple solution to zip them up and send that single file to Solr.


I understand that the Tika 0.4 JARs supplied with Solr 1.4.1 don't 
currently allow this to work and only the file names of the zipped files 
are indexed (and not their contents).


I've tried downloading and building the latest Tika (0.8) and replacing 
the tika-parsers and tika-core JARs in 
<solr-root>\contrib\extraction\lib, but this still isn't indexing the 
file contents, and now doesn't even index the file names!


Is there a version of Tika that works with the Solr 1.4.1 released 
distribution which does index the contents of the zipped files?


Thanks and kind regards,
Gary



Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor

Thanks Erlend.

Not used SVN before, but have managed to download and build latest trunk 
code.


Now I'm getting an error when trying to access the admin page (via 
Jetty) because I specify HTMLStripStandardTokenizerFactory in my 
schema.xml, but this appears to be no-longer supplied as part of the 
build so I get an exception cos it can't find that class.  I've checked 
the CHANGES.txt and found the following in the change list to 1.4.0 (!?) :


66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, 
HTMLStripWhitespaceTokenizerFactory and
HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, 
HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)


Unfortunately, I can't seem to get that to work correctly.  Does anyone 
have an example fieldType stanza (for schema.xml) for stripping out HTML ?
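
For reference, a minimal fieldType along the lines that CHANGES entry suggests would be something like:

    <fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>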


Thanks and kind regards,
Gary.



On 25/01/2011 14:17, Erlend Garåsen wrote:

On 25.01.11 11.30, Erlend Garåsen wrote:


Tika version 0.8 is not included in the latest release/trunk from SVN.


Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.

And to clarify, by content I mean the main content of a Word file. 
Title and other kinds of metadata are successfully extracted by the 
old 0.4 version of Tika, but you need a newer Tika version (0.8) in 
order to fetch the main content as well. So try the newest Solr 
version from trunk.


Erlend






Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor

OK, got past the schema.xml problem, but now I'm back to square one.

I can index the contents of binary files (Word, PDF etc...), as well as 
text files, but it won't index the content of files inside a zip.


As an example, I have two txt files - doc1.txt and doc2.txt.  If I index 
either of them individually using:


curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" 
-F file=@doc1.txt


and commit, Solr will index the contents and searches will match.

If I zip those two files up into solr1.zip, and index that using:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" 
-F file=@solr1.zip


and commit, the file names are indexed, but not their contents.

I have checked that Tika can correctly process the zip file when used 
standalone with the tika-app jar - it outputs both the filenames and 
contents.  Should I be able to index the contents of files stored in a 
zip by using extract ?
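
For reference, the standalone check can be done with the tika-app command line; a sketch (jar name and version are whatever was built locally):

java -jar tika-app-0.8.jar --text solr1.zip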


Thanks and kind regards,
Gary.


On 25/01/2011 15:32, Gary Taylor wrote:

Thanks Erlend.

Not used SVN before, but have managed to download and build latest 
trunk code.


Now I'm getting an error when trying to access the admin page (via 
Jetty) because I specify HTMLStripStandardTokenizerFactory in my 
schema.xml, but this appears to be no-longer supplied as part of the 
build so I get an exception cos it can't find that class.  I've 
checked the CHANGES.txt and found the following in the change list to 
1.4.0 (!?) :


66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, 
HTMLStripWhitespaceTokenizerFactory and
HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, 
HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)


Unfortunately, I can't seem to get that to work correctly.  Does 
anyone have an example fieldType stanza (for schema.xml) for stripping 
out HTML ?


Thanks and kind regards,
Gary.



On 25/01/2011 14:17, Erlend Garåsen wrote:

On 25.01.11 11.30, Erlend Garåsen wrote:


Tika version 0.8 is not included in the latest release/trunk from SVN.


Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.

And to clarify, by content I mean the main content of a Word file. 
Title and other kinds of metadata are successfully extracted by the 
old 0.4 version of Tika, but you need a newer Tika version (0.8) in 
order to fetch the main content as well. So try the newest Solr 
version from trunk.


Erlend








Extracting and indexing content from multiple binary files into a single Solr document

2010-11-17 Thread Gary Taylor

Hi,

We're trying to use Solr to replace a custom Lucene server.  One 
requirement we have is to be able to index the content of multiple 
binary files into a single Solr document.  For example, a uniquely named 
object in our app can have multiple attached-files (eg. Word, PDF etc.), 
and we want to index (but not store) the contents of those files in the 
single Solr doc for that named object.


At the moment, we're issuing HTTP requests direct from ColdFusion and 
using the /update/extract servlet, but can only specify a single file on 
each request.


Is the best way to achieve this to extend ExtractingRequestHandler to 
allow multiple binary files and thus specify our own RequestHandler, or 
would using the SolrJ interface directly be a better bet, or am I 
missing something fundamental?
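
For the SolrJ option, a rough sketch of driving /update/extract from Java is below (class and method names as in the 1.4/3.x-era SolrJ API; the literal values are just examples). Note it has the same one-stream-per-request shape as the HTTP calls from ColdFusion, which is exactly the limitation in question:

import java.io.File;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractOneFile {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/core0");

        // One /update/extract request per binary file; literals carry the
        // metadata fields, fmap.content maps Tika's output into our text field.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("msword1.doc"));   // later SolrJ versions also take a content type here
        req.setParam("literal.docid", "74");
        req.setParam("literal.type", "5");
        req.setParam("fmap.content", "text");
        req.setParam("commit", "true");
        server.request(req);
    }
}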


Thanks and regards,
Gary.


Re: Extracting and indexing content from multiple binary files into a single Solr document

2010-11-17 Thread Gary Taylor
Jayendra,

Brilliant! A very simple solution. Thank you for your help.

Kind regards,
Gary


On 17 Nov 2010 22:09, Jayendra Patil <jayendra.patil@gmail.com> wrote:

The way we implemented the same scenario is zipping all the attachments into
a single zip file which can be passed to the ExtractingRequestHandler for
indexing and included as a part of single Solr document.

Regards,
Jayendra

On Wed, Nov 17, 2010 at 6:27 AM, Gary Taylor <g...@inovem.com> wrote:

> Hi,
>
> We're trying to use Solr to replace a custom Lucene server.  One
> requirement we have is to be able to index the content of multiple binary
> files into a single Solr document.  For example, a uniquely named object in
> our app can have multiple attached-files (eg. Word, PDF etc.), and we want
> to index (but not store) the contents of those files in the single Solr doc
> for that named object.
>
> At the moment, we're issuing HTTP requests direct from ColdFusion and using
> the /update/extract servlet, but can only specify a single file on each
> request.
>
> Is the best way to achieve this to extend ExtractingRequestHandler to allow
> multiple binary files and thus specify our own RequestHandler, or would
> using the SolrJ interface directly be a better bet, or am I missing
> something fundamental?
>
> Thanks and regards,
> Gary.