Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
Alex, I've created JIRA ticket: https://issues.apache.org/jira/browse/SOLR-7174

In response to your suggestions below:
1. No exceptions are reported, even with onError removed.
2. ProcessMonitor shows only the very first epub file is being read (repeatedly).
3. I can reproduce this on Ubuntu (14.04) by following the same steps.
4. Ticket raised (https://issues.apache.org/jira/browse/SOLR-7174).

Additionally (and I've added this on the ticket), if I change the dataConfig to use FileDataSource and PlainTextEntityProcessor, and just list *.txt files, it works!

<dataConfig>
  <dataSource type="FileDataSource" name="bin" />
  <document>
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/HackerMonthly/epub"
            fileName=".*txt">
      <field column="fileAbsolutePath" name="id" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />
      <entity name="documentImport" processor="PlainTextEntityProcessor"
              url="${files.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="plainText" name="content" />
      </entity>
    </entity>
  </document>
</dataConfig>

So it's something related to BinFileDataSource and TikaEntityProcessor.

Thanks, Gary.

On 26/02/2015 14:24, Gary Taylor wrote:
Alex, That's great. Thanks for the pointers. I'll try and get more info on this and file a JIRA issue. Kind regards, Gary.

On 26/02/2015 14:16, Alexandre Rafalovitch wrote:
On 26 February 2015 at 08:32, Gary Taylor g...@inovem.com wrote:
Alex, Same results on recursive=true / recursive=false. I also tried importing plain text files instead of epub (still using TikaEntityProcessor though) and get exactly the same result - i.e. all files fetched, but only one document indexed in Solr.

To me, this would indicate that something is a problem with the inner DIH entity then. As a next set of steps, I would probably:
1) remove both onError statements and see if there is an exception that is being swallowed.
2) run the import under ProcessMonitor and see if the other files are actually being read: https://technet.microsoft.com/en-us/library/bb896645.aspx
3) Assume a Windows bug and test this on Mac/Linux.
4) File a JIRA with a replication case. If there is a full replication setup, I'll test it on machines I have access to with full debugger step-through.

For example, I wonder if BinFileDataSource is somehow not cleaning up after the first file properly on Windows and fails to open the second one.

Regards, Alex.
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

--
Gary Taylor | www.inovem.com | www.kahootz.com
INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.
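Alex's closing hypothesis (a data source that doesn't clean up and re-open between files) matches a classic bug pattern: a component that caches its first stream and hands the same, already-consumed stream back for every subsequent file. A purely illustrative standalone Python sketch of that pattern follows; this is not Solr's actual code, and the class and file names are made up:

```python
# Hypothetical illustration of the suspected bug pattern: a "data source"
# that caches the first stream it opens and returns the same exhausted
# stream for every later file. Not Solr code; names are invented.
import io

class CachingDataSource:
    def __init__(self):
        self._stream = None

    def get_stream(self, data):
        if self._stream is None:   # bug: stream is never re-opened per file
            self._stream = io.BytesIO(data)
        return self._stream

files = {"issue018.epub": b"first book", "issue019.epub": b"second book"}
source = CachingDataSource()
indexed = {}
for name, data in files.items():
    content = source.get_stream(data).read()  # second call reads an exhausted stream
    if content:
        indexed[name] = content

print(indexed)  # only the first file yields any content
```

This would produce exactly the observed symptom: every file is "fetched" (iterated), but only the first yields content to index.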
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
Alex, Same results on recursive=true / recursive=false. I also tried importing plain text files instead of epub (still using TikaEntityProcessor though) and get exactly the same result - i.e. all files fetched, but only one document indexed in Solr. With verbose output, I get a row for each file in the directory, but only the first one has a non-empty documentImport entity. All subsequent documentImport entities just have an empty document#2 entry. E.g.:

verbose-output: [
  entity:files, [
    null, --- row #1 ---,
    fileSize, 2609004,
    fileLastModified, 2015-02-25T11:37:25.217Z,
    fileAbsolutePath, c:\\Users\\gt\\Documents\\epub\\issue018.epub,
    fileDir, c:\\Users\\gt\\Documents\\epub,
    file, issue018.epub,
    null, -,
    entity:documentImport, [
      document#1, [
        query, c:\\Users\\gt\\Documents\\epub\\issue018.epub,
        time-taken, 0:0:0.0,
        null, --- row #1 ---,
        text, ... parsed epub text - snip ...,
        title, Issue 18 title,
        Author, Author text,
        null, -
      ],
      document#2, []
    ],
    null, --- row #2 ---,
    fileSize, 4428804,
    fileLastModified, 2015-02-25T11:37:36.399Z,
    fileAbsolutePath, c:\\Users\\gt\\Documents\\epub\\issue019.epub,
    fileDir, c:\\Users\\gt\\Documents\\epub,
    file, issue019.epub,
    null, -,
    entity:documentImport, [
      document#2, []
    ],
    null, --- row #3 ---,
    fileSize, 2580266,
    fileLastModified, 2015-02-25T11:37:41.188Z,
    fileAbsolutePath, c:\\Users\\gt\\Documents\\epub\\issue020.epub,
    fileDir, c:\\Users\\gt\\Documents\\epub,
    file, issue020.epub,
    null, -,
    entity:documentImport, [
      document#2, []
    ],
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
Alex, That's great. Thanks for the pointers. I'll try and get more info on this and file a JIRA issue. Kind regards, Gary.

On 26/02/2015 14:16, Alexandre Rafalovitch wrote:
On 26 February 2015 at 08:32, Gary Taylor g...@inovem.com wrote:
Alex, Same results on recursive=true / recursive=false. I also tried importing plain text files instead of epub (still using TikaEntityProcessor though) and get exactly the same result - i.e. all files fetched, but only one document indexed in Solr.

To me, this would indicate that something is a problem with the inner DIH entity then. As a next set of steps, I would probably:
1) remove both onError statements and see if there is an exception that is being swallowed.
2) run the import under ProcessMonitor and see if the other files are actually being read: https://technet.microsoft.com/en-us/library/bb896645.aspx
3) Assume a Windows bug and test this on Mac/Linux.
4) File a JIRA with a replication case. If there is a full replication setup, I'll test it on machines I have access to with full debugger step-through.

For example, I wonder if BinFileDataSource is somehow not cleaning up after the first file properly on Windows and fails to open the second one.

Regards, Alex.
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

--
Gary Taylor | www.inovem.com | www.kahootz.com
INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.
Can't index all docs in a local folder with DIH in Solr 5.0.0
I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly add a Solr document for each epub file in my local directory. I've just downloaded Solr 5.0.0, on a Windows 7 PC. I ran solr start and then solr create -c hn2 to create a new core. I want to index a load of epub files that I've got in a directory. So I created a data-import.xml (in solr\hn2\conf):

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin" />
  <document>
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
            onError="skip" recursive="true">
      <field column="fileAbsolutePath" name="id" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />
      <entity name="documentImport" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              dataSource="bin" onError="skip">
        <field column="file" name="fileName" />
        <field column="Author" name="author" meta="true" />
        <field column="title" name="title" meta="true" />
        <field column="text" name="content" />
      </entity>
    </entity>
  </document>
</dataConfig>

In my solrconfig.xml, I added a requestHandler entry to reference my data-import.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-import.xml</str>
  </lst>
</requestHandler>

I renamed managed-schema to schema.xml, and ensured the following doc fields were set up:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="fileName" type="string" indexed="true" stored="true" />
<field name="author" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="size" type="long" indexed="true" stored="true" />
<field name="lastModified" type="date" indexed="true" stored="true" />
<field name="content" type="text_en" indexed="false" stored="true" multiValued="false" />
<field name="text" type="text_en" indexed="true" stored="false" multiValued="true" />
<copyField source="content" dest="text" />

I copied all the jars from dist and contrib\* into server\solr\lib. Stopping and restarting Solr then creates a new managed-schema file and renames schema.xml to schema.xml.back. All good so far.

Now I go to the web admin for dataimport (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to execute a full import. But the results show Requests: 0, Fetched: 58, Skipped: 0, Processed: 1 - i.e. it only adds one document (the very first one) even though it's iterated over 58! No errors are reported in the logs. I can search on the contents of that first epub document, so it's extracting OK in Tika, but there's a problem somewhere in my config that's causing only 1 document to be indexed in Solr. Thanks for any assistance / pointers.

Regards, Gary

--
Gary Taylor | www.inovem.com | www.kahootz.com
INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.
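One detail worth noting in the config above: FileListEntityProcessor's fileName attribute is a regular expression, not a shell glob, so `.*epub` matches any name ending in "epub". A quick standalone check of that pattern (plain Python re, with made-up file names, just to illustrate the matching):

```python
import re

# fileName in the DIH config is a regex; ".*epub" matches any name
# ending in "epub" (the literal dot in ".epub" is consumed by ".*").
pattern = re.compile(r".*epub")

names = ["issue018.epub", "issue019.epub", "notes.txt"]
matched = [n for n in names if pattern.fullmatch(n)]
print(matched)
```

So the 58 fetched rows are consistent with the pattern matching all the epub files; the filtering step is not the problem.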
Re: Can't index all docs in a local folder with DIH in Solr 5.0.0
Alex, Thanks for the suggestions. It always just indexes 1 doc, regardless of the first epub file it sees. Debug / verbose don't show anything obvious to me. I can include the output here if you think it would help. I tried using the SimplePostTool first (java -Dtype=application/epub+zip -Durl=http://localhost:8983/solr/hn1/update/extract -jar post.jar \Users\gt\Documents\epub\*.epub) to index the docs and check the Tika parsing, and that works OK, so I don't think it's the epubs. I was trying to use DIH so that I could more easily specify the schema fields and store content in the index in preparation for trying out the search highlighting. Couldn't work out how to do that with post.jar. Thanks, Gary

On 25/02/2015 17:09, Alexandre Rafalovitch wrote:
Try removing that first epub from the directory and rerunning. If you now index 0 documents, then there is something unexpected about them and DIH skips. If it indexes 1 document again but a different one, then it is definitely something about the repeat logic. Also, try running with debug and verbose modes and see if something specific shows up.

Regards, Alex.
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 25 February 2015 at 11:14, Gary Taylor g...@inovem.com wrote:
I can't get the FileListEntityProcessor and TikaEntityProcessor to correctly add a Solr document for each epub file in my local directory. I've just downloaded Solr 5.0.0, on a Windows 7 PC. I ran solr start and then solr create -c hn2 to create a new core. I want to index a load of epub files that I've got in a directory.
So I created a data-import.xml (in solr\hn2\conf):

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin" />
  <document>
    <entity name="files" dataSource="null" rootEntity="false"
            processor="FileListEntityProcessor"
            baseDir="c:/Users/gt/Documents/epub" fileName=".*epub"
            onError="skip" recursive="true">
      <field column="fileAbsolutePath" name="id" />
      <field column="fileSize" name="size" />
      <field column="fileLastModified" name="lastModified" />
      <entity name="documentImport" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              dataSource="bin" onError="skip">
        <field column="file" name="fileName" />
        <field column="Author" name="author" meta="true" />
        <field column="title" name="title" meta="true" />
        <field column="text" name="content" />
      </entity>
    </entity>
  </document>
</dataConfig>

In my solrconfig.xml, I added a requestHandler entry to reference my data-import.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-import.xml</str>
  </lst>
</requestHandler>

I renamed managed-schema to schema.xml, and ensured the following doc fields were set up:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="fileName" type="string" indexed="true" stored="true" />
<field name="author" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="size" type="long" indexed="true" stored="true" />
<field name="lastModified" type="date" indexed="true" stored="true" />
<field name="content" type="text_en" indexed="false" stored="true" multiValued="false" />
<field name="text" type="text_en" indexed="true" stored="false" multiValued="true" />
<copyField source="content" dest="text" />

I copied all the jars from dist and contrib\* into server\solr\lib. Stopping and restarting Solr then creates a new managed-schema file and renames schema.xml to schema.xml.back. All good so far. Now I go to the web admin for dataimport (http://localhost:8983/solr/#/hn2/dataimport//dataimport) and try to execute a full import. But the results show Requests: 0, Fetched: 58, Skipped: 0, Processed: 1 - i.e. it only adds one document (the very first one) even though it's iterated over 58! No errors are reported in the logs. I can search on the contents of that first epub document, so it's extracting OK in Tika, but there's a problem somewhere in my config that's causing only 1 document to be indexed in Solr. Thanks for any assistance / pointers. Regards, Gary

--
Gary Taylor | www.inovem.com | www.kahootz.com
INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
kahootz.com is a trading name of INOVEM Ltd.
Re: tika integration exception and other related queries
Naveen, Not sure our requirement matches yours, but one of the things we index is a comment item that can have one or more files attached to it. To index the whole thing as a single Solr document we create a zipfile containing a file with the comment details in it and any additional attached files. This is submitted to Solr as a TEXT field in an XML doc, along with other meta-data fields from the comment. In our schema the TEXT field is indexed but not stored, so when we search and get a match back it doesn't contain all of the contents from the attached files etc., only the stored fields in our schema. Admittedly, the user can therefore get back a comment match with no indication as to WHERE the match occurred (ie. was it in the meta-data or the contents of the attached files), but at the moment we're only interested in getting appropriate matches, not explaining where the match is. Hope that helps. Kind regards, Gary. On 09/06/2011 03:00, Naveen Gupta wrote: Hi Gary It started working .. though i did not test for Zip files, but for rar files, it is working fine .. only thing what i wanted to do is to index the metadata (text mapped to content) not store the data Also in search result, i want to filter the stuffs ... and it started working fine .. i don't want to show the content stuffs to the end user, since the way it extracts the information is not very helpful to the user .. although we can apply few of the analyzers and filters to remove the unnecessary tags ..still the information would not be of much help .. looking for your opinion ... what you did in order to filter out the content or are you showing the content extracted to the end user? Even in case, we are showing the text part to the end user, how can i limit the number of characters while querying the search results ... is there any feature where we can achieve this ... the concept of snippet kind of thing ... 
Thanks Naveen

On Wed, Jun 8, 2011 at 1:45 PM, Gary Taylor g...@inovem.com wrote:
Naveen, For indexing Zip files with Tika, take a look at the following thread: http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html I got it to work with the 3.1 source and a couple of patches. Hope this helps. Regards, Gary.

On 08/06/2011 04:12, Naveen Gupta wrote:
Hi Can somebody answer this ... 3. can somebody tell me an idea how to do indexing for a zip file ? 1. while sending docx, we are getting following error.
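Naveen's question about limiting the number of characters shown for a match (the "concept of snippet") is usually handled by Solr's highlighting component rather than by trimming the stored field. A sketch of the sort of defaults involved, as they might appear on a search handler in solrconfig.xml; the field name and sizes here are assumptions for illustration, not taken from this thread:

```xml
<!-- Illustrative only: highlighting defaults producing short snippets.
     Field name ("text") and fragment size are assumed values. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">text</str>
    <str name="hl.fragsize">100</str>
    <str name="hl.snippets">1</str>
  </lst>
</requestHandler>
```

Note that highlighting requires the field being highlighted to be stored, which is the trade-off discussed above.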
Re: tika integration exception and other related queries
Naveen, For indexing Zip files with Tika, take a look at the following thread : http://lucene.472066.n3.nabble.com/Extracting-contents-of-zipped-files-with-Tika-and-Solr-1-4-1-td2327933.html I got it to work with the 3.1 source and a couple of patches. Hope this helps. Regards, Gary. On 08/06/2011 04:12, Naveen Gupta wrote: Hi Can somebody answer this ... 3. can somebody tell me an idea how to do indexing for a zip file ? 1. while sending docx, we are getting following error.
Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)
Jayendra, I cleared out my local repository and replayed all of my steps from Friday, and now it works. The only difference (or the only one that's obvious to me) was that I applied the patch before doing a full compile/test/dist. But I assumed that given I was seeing my new log entries (from ExtractingDocumentLoader.java) I was running the correct code anyway. However, I'm very pleased that it's working now - I get the full contents of the zipped files indexed and not just the file names. Thank you again for your assistance, and the patch! Kind regards, Gary.

On 21/05/2011 03:12, Jayendra Patil wrote:
Hi Gary, I tried the patch on the 3.1 source code (@ http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_1/) as well and it worked fine. @Patch - https://issues.apache.org/jira/browse/SOLR-2416, which deals with the Solr Cell module. You may want to verify the contents from the results by enabling the stored attribute on the text field. e.g.

curl "http://localhost:8983/solr/update/extract?stream.file=C:/Test.zip&literal.id=777045&literal.title=Test&commit=true"

Let me know if it works. I would be happy to share the generated artifact you can test on. Regards, Jayendra
Re: Extracting contents of zipped files with Tika and Solr 1.4.1 (now Solr 3.1)
Hello again. Unfortunately, I'm still getting nowhere with this. I have checked out the 3.1 source and applied Jayendra's patches (see below) and it still appears that the contents of the files in the zipfile are not being indexed, only the filenames of those contained files. I'm using a simple CURL invocation to test this:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F commit=true -F file=@solr1.zip

solr1.zip contains two simple txt files (doc1.txt and doc2.txt). I'm expecting the contents of those txt files to be extracted from the zip and indexed, but this isn't happening - or at least, I don't get the desired result when I do a query afterwards. I do get a match if I search for either doc1.txt or doc2.txt, but not if I search for a word that appears in their contents. If I index one of the txt files (instead of the zipfile), I can query the content OK, so I'm assuming my query is sensible and matches the field specified on the CURL string (i.e. text). I'm also happy that the Solr Cell content extraction is working because I can successfully index PDF, Word, etc. files. In a fit of desperation I have added log.info statements into the files referenced by Jayendra's patches (SOLR-2416 and SOLR-2332) and I see those in the log when I submit the zipfile with CURL, so I know I'm running those patched files in the build. If anyone can shed any light on what's happening here, I'd be very grateful. Thanks and kind regards, Gary.

On 11/04/2011 11:12, Gary Taylor wrote:
Jayendra, Thanks for the info - been keeping an eye on this list in case this topic cropped up again. It's currently a background task for me, so I'll try and take a look at the patches and re-test soon. Joey - glad you brought this issue up again. I haven't progressed any further with it. I've not yet moved to Solr 3.1 but it's on my to-do list, as is testing out the patches referenced by Jayendra. I'll post my findings on this thread - if you manage to test the patches before me, let me know how you get on. Thanks and kind regards, Gary.

On 11/04/2011 05:02, Jayendra Patil wrote:
The migration of Tika to the latest 0.8 version seems to have reintroduced the issue. I was able to get this working again with the following patches (Solr Cell and Data Import handler):
https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332
You can try these. Regards, Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel phan...@nearinfinity.com wrote:
Hi Gary, I have been experiencing the same problem... Unable to extract content from archive file formats. I just tried again with a clean install of Solr 3.1.0 (using Tika 0.8) and continue to experience the same results. Did you have any success with this problem with Solr 1.4.1 or 3.1.0? I'm using this curl command to send data to Solr:

curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" -H "Content-Type: application/octet-stream" -F myfile=@data.zip

No problem extracting single rich text documents, but archive files only result in the file names within the archive being indexed. Am I missing something else in my configuration? Solr doesn't seem to be unpacking the archive files. Based on the email chain associated with your first message, some people have been able to get this functionality to work as desired.

--
Gary Taylor
INOVEM Tel +44 (0)1488 648 480 Fax +44 (0)7092 115 933
gary.tay...@inovem.com www.inovem.com
INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
Jayendra, Thanks for the info - been keeping an eye on this list in case this topic cropped up again. It's currently a background task for me, so I'll try and take a look at the patches and re-test soon. Joey - glad you brought this issue up again. I haven't progressed any further with it. I've not yet moved to Solr 3.1 but it's on my to-do list, as is testing out the patches referenced by Jayendra. I'll post my findings on this thread - if you manage to test the patches before me, let me know how you get on. Thanks and kind regards, Gary.

On 11/04/2011 05:02, Jayendra Patil wrote:
The migration of Tika to the latest 0.8 version seems to have reintroduced the issue. I was able to get this working again with the following patches (Solr Cell and Data Import handler):
https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332
You can try these. Regards, Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel phan...@nearinfinity.com wrote:
Hi Gary, I have been experiencing the same problem... Unable to extract content from archive file formats. I just tried again with a clean install of Solr 3.1.0 (using Tika 0.8) and continue to experience the same results. Did you have any success with this problem with Solr 1.4.1 or 3.1.0? I'm using this curl command to send data to Solr:

curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" -H "Content-Type: application/octet-stream" -F myfile=@data.zip

No problem extracting single rich text documents, but archive files only result in the file names within the archive being indexed. Am I missing something else in my configuration? Solr doesn't seem to be unpacking the archive files. Based on the email chain associated with your first message, some people have been able to get this functionality to work as desired.

--
Gary Taylor
INOVEM Tel +44 (0)1488 648 480 Fax +44 (0)7092 115 933
gary.tay...@inovem.com www.inovem.com
INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
Re: adding a document using curl
As an example, I run this in the same directory as the msword1.doc file:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&literal.type=5" -F file=@msword1.doc

The type literal is just part of my schema. Gary.

On 03/03/2011 11:45, Ken Foskey wrote:
On Thu, 2011-03-03 at 12:36 +0100, Markus Jelsma wrote:
Here's a complete example http://wiki.apache.org/solr/UpdateXmlMessages#Passing_commit_parameters_as_part_of_the_URL

I should have been clearer. A rich text document, XML I can make work and a script is in the example docs folder http://wiki.apache.org/solr/ExtractingRequestHandler I also read the solr 1.4 book and tried samples in there, could not make them work. Ta
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
Can anyone shed any light on this, and whether it could be a config issue? I'm now using the latest SVN trunk, which includes the Tika 0.8 jars. When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to the ExtractingRequestHandler, I get the following log entry (formatted for ease of reading):

SolrInputDocument[
  {
    ignored_meta=ignored_meta(1.0)={[stream_source_info, file, stream_content_type, application/octet-stream, stream_size, 260, stream_name, solr1.zip, Content-Type, application/zip]},
    ignored_=ignored_(1.0)={[package-entry, package-entry]},
    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
    ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream},
    ignored_stream_size=ignored_stream_size(1.0)={260},
    ignored_stream_name=ignored_stream_name(1.0)={solr1.zip},
    ignored_content_type=ignored_content_type(1.0)={application/zip},
    docid=docid(1.0)={74},
    type=type(1.0)={5},
    text=text(1.0)={ doc2.txtdoc1.txt}
  }
]

So, the data coming back from Tika when parsing a ZIP file does not include the file contents, only the names of the files contained therein. I've tried forcing stream.type=application/zip in the CURL string, but that makes no difference. If I specify an invalid stream.type then I get an exception response, so I know it's being used.
When I send one of those txt files individually to the ExtractingRequestHandler, I get:

SolrInputDocument[
  {
    ignored_meta=ignored_meta(1.0)={[stream_source_info, file, stream_content_type, text/plain, stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt]},
    ignored_stream_source_info=ignored_stream_source_info(1.0)={file},
    ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain},
    ignored_stream_size=ignored_stream_size(1.0)={30},
    ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1},
    ignored_stream_name=ignored_stream_name(1.0)={doc1.txt},
    docid=docid(1.0)={74},
    type=type(1.0)={5},
    text=text(1.0)={The quick brown fox }
  }
]

and we see the file contents in the text field. I'm using the following requestHandler definition in solrconfig.xml:

<!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler -->
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
  <lst name="defaults">
    <!-- All the main content goes into "text"... if you need to return
         the extracted text or do highlighting, use a stored field. -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

Is there any further debug or diagnostic I can get out of Tika to help me work out why it's only returning the file names and not the file contents when parsing a ZIP file? Thanks and kind regards, Gary.

On 25/01/2011 16:48, Jayendra Patil wrote:
Hi Gary, The latest Solr Trunk was able to extract and index the contents of the zip file using the ExtractingRequestHandler. The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and worked pretty well.
Tested again with sample url and works fine:

curl "http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true"

You would probably need to drill down to the Tika Jars and the apache-solr-cell-4.0-dev.jar used for rich documents indexing. Regards, Jayendra
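The two behaviours in this thread - file names only versus full contents - mirror the difference between merely listing a zip's entries and actually reading each entry's stream. A small standalone Python sketch of that distinction (purely illustrative, not Solr or Tika code), using an in-memory zip shaped like solr1.zip:

```python
import io
import zipfile

# Build an in-memory zip like solr1.zip: two small text files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("doc1.txt", "The quick brown fox")
    zf.writestr("doc2.txt", "jumped over the lazy dog")

with zipfile.ZipFile(buf) as zf:
    names_only = zf.namelist()  # entry names: what the broken path indexed
    contents = {n: zf.read(n).decode() for n in names_only}  # what the fix adds

print(names_only)
print(contents["doc1.txt"])
```

Indexing only `names_only` gives exactly the symptom described above (matches on doc1.txt/doc2.txt but not on words in their contents).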
Extracting contents of zipped files with Tika and Solr 1.4.1
Hi, I posted a question in November last year about indexing content from multiple binary files into a single Solr document, and Jayendra responded with a simple solution: zip them up and send that single file to Solr. I understand that the Tika 0.4 JARs supplied with Solr 1.4.1 don't currently allow this to work, and only the file names of the zipped files are indexed (and not their contents). I've tried downloading and building the latest Tika (0.8) and replacing the tika-parsers and tika-core JARs in solr-root\contrib\extraction\lib, but this still isn't indexing the file contents, and now doesn't even index the file names! Is there a version of Tika that works with the Solr 1.4.1 released distribution which does index the contents of the zipped files? Thanks and kind regards, Gary
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
Thanks Erlend. Not used SVN before, but have managed to download and build the latest trunk code. Now I'm getting an error when trying to access the admin page (via Jetty) because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but this appears to be no longer supplied as part of the build, so I get an exception because it can't find that class. I've checked the CHANGES.txt and found the following in the change list for 1.4.0 (!?):

66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, HTMLStripWhitespaceTokenizerFactory and HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)

Unfortunately, I can't seem to get that to work correctly. Does anyone have an example fieldType stanza (for schema.xml) for stripping out HTML? Thanks and kind regards, Gary.

On 25/01/2011 14:17, Erlend Garåsen wrote:
On 25.01.11 11.30, Erlend Garåsen wrote:
Tika version 0.8 is not included in the latest release/trunk from SVN.

Ouch, I wrote not instead of now. Sorry, I replied in a hurry. And to clarify, by content I mean the main content of a Word file. Title and other kinds of metadata are successfully extracted by the old 0.4 version of Tika, but you need a newer Tika version (0.8) in order to fetch the main content as well. So try the newest Solr version from trunk. Erlend
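For reference, a fieldType of the shape the SOLR-1343 deprecation note describes - HTMLStripCharFilterFactory placed ahead of an ordinary tokenizer - would look roughly like this. This is a sketch, not from the thread: the fieldType name and the filter chain after the tokenizer are assumptions to be adapted to the schema in question:

```xml
<!-- Sketch only: HTMLStripCharFilter ahead of an arbitrary tokenizer,
     per the SOLR-1343 note. Name and filter chain are assumed. -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The key point is that char filters run before tokenization, so the HTML is stripped from the character stream no matter which tokenizer follows.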
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
OK, got past the schema.xml problem, but now I'm back to square one. I can index the contents of binary files (Word, PDF etc...), as well as text files, but it won't index the content of files inside a zip. As an example, I have two txt files - doc1.txt and doc2.txt. If I index either of them individually using:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F file=@doc1.txt

and commit, Solr will index the contents and searches will match. If I zip those two files up into solr1.zip, and index that using:

curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5" -F file=@solr1.zip

and commit, the file names are indexed, but not their contents. I have checked that Tika can correctly process the zip file when used standalone with the tika-app jar - it outputs both the filenames and contents. Should I be able to index the contents of files stored in a zip by using extract? Thanks and kind regards, Gary.

On 25/01/2011 15:32, Gary Taylor wrote:
Thanks Erlend. Not used SVN before, but have managed to download and build the latest trunk code. Now I'm getting an error when trying to access the admin page (via Jetty) because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but this appears to be no longer supplied as part of the build, so I get an exception because it can't find that class. I've checked the CHANGES.txt and found the following in the change list for 1.4.0 (!?):

66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, HTMLStripWhitespaceTokenizerFactory and HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)

Unfortunately, I can't seem to get that to work correctly. Does anyone have an example fieldType stanza (for schema.xml) for stripping out HTML? Thanks and kind regards, Gary.
On 25/01/2011 14:17, Erlend Garåsen wrote: On 25.01.11 11.30, Erlend Garåsen wrote: Tika version 0.8 is not included in the latest release/trunk from SVN. Ouch, I wrote not instead of now. Sorry, I replied in a hurry. And to clarify, by content I mean the main content of a Word file. Title and other kinds of metadata are successfully extracted by the old 0.4 version of Tika, but you need a newer Tika version (0.8) in order to fetch the main content as well. So try the newest Solr version from trunk. Erlend
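The `&` separators in these long /update/extract URLs are easy to lose when pasting between shells and mail clients; building the query string programmatically sidesteps that. A short sketch using only the Python standard library - the endpoint and parameter values are simply the ones from the curl commands in this thread, and the actual POST is out of scope:

```python
from urllib.parse import urlencode

# Build the /update/extract query string; urlencode inserts the "&"
# separators and escapes values that would confuse a shell.
params = {
    "literal.docid": 74,
    "fmap.content": "text",
    "literal.type": 5,
}
url = "http://localhost:8983/solr/core0/update/extract?" + urlencode(params)
print(url)
```

The resulting URL can then be quoted once and handed to curl (or an HTTP client) without worrying about the shell interpreting `&` as a job separator.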
Extracting and indexing content from multiple binary files into a single Solr document
Hi, We're trying to use Solr to replace a custom Lucene server. One requirement we have is to be able to index the content of multiple binary files into a single Solr document. For example, a uniquely named object in our app can have multiple attached-files (eg. Word, PDF etc.), and we want to index (but not store) the contents of those files in the single Solr doc for that named object. At the moment, we're issuing HTTP requests direct from ColdFusion and using the /update/extract servlet, but can only specify a single file on each request. Is the best way to achieve this to extend ExtractingRequestHandler to allow multiple binary files and thus specify our own RequestHandler, or would using the SolrJ interface directly be a better bet, or am I missing something fundamental? Thanks and regards, Gary.
Re: Extracting and indexing content from multiple binary files into a single Solr document
Jayendra, Brilliant! A very simple solution. Thank you for your help. Kind regards, Gary

On 17 Nov 2010 22:09, Jayendra Patil <jayendra.patil@gmail.com> wrote:
The way we implemented the same scenario is zipping all the attachments into a single zip file which can be passed to the ExtractingRequestHandler for indexing and included as a part of a single Solr document. Regards, Jayendra

On Wed, Nov 17, 2010 at 6:27 AM, Gary Taylor <g...@inovem.com> wrote:
> Hi,
>
> We're trying to use Solr to replace a custom Lucene server. One requirement we have is to be able to index the content of multiple binary files into a single Solr document. For example, a uniquely named object in our app can have multiple attached files (eg. Word, PDF etc.), and we want to index (but not store) the contents of those files in the single Solr doc for that named object.
>
> At the moment, we're issuing HTTP requests direct from ColdFusion and using the /update/extract servlet, but can only specify a single file on each request.
>
> Is the best way to achieve this to extend ExtractingRequestHandler to allow multiple binary files and thus specify our own RequestHandler, or would using the SolrJ interface directly be a better bet, or am I missing something fundamental?
>
> Thanks and regards,
> Gary.
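Jayendra's suggestion - bundle the attachments into one zip and submit that as a single document - can be sketched as follows. This is an illustration, not code from the thread: the attachment names and contents are made up, and the commented-out posting step assumes the third-party `requests` package and the /update/extract endpoint discussed elsewhere in this thread:

```python
import io
import zipfile

# Bundle several attachments into one zip so they can be submitted to
# /update/extract as a single Solr document. Names/contents are made up.
attachments = {
    "spec.docx": b"word bytes...",
    "design.pdf": b"pdf bytes...",
}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, data in attachments.items():
        zf.writestr(name, data)

zip_bytes = buf.getvalue()
print(sorted(zipfile.ZipFile(io.BytesIO(zip_bytes)).namelist()))

# Posting step (sketch only; requires the `requests` package):
# import requests
# requests.post("http://localhost:8983/solr/update/extract",
#               params={"literal.id": "obj1", "commit": "true"},
#               files={"file": ("attachments.zip", zip_bytes)})
```

The zip is built entirely in memory, so nothing needs to touch disk before the single HTTP request is made.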