RE: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">id:doc4</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="attr_character_count"><str>3177</str></arr>
      <arr name="attr_creation_date"><str>2010-09-16T15:50:00Z</str></arr>
      <arr name="attr_creator"><str>droy</str></arr>
      <arr name="attr_date"><str>2010-09-16T15:51:00Z</str></arr>
      <arr name="attr_edit_time"><str>PT60S</str></arr>
      <arr name="attr_editing_cycles"><str>1</str></arr>
      <arr name="attr_generator"><str>MicrosoftOffice/12.0 MicrosoftWord</str></arr>
      <arr name="attr_initial_creator"><str>droy</str></arr>
      <arr name="attr_nbcharacter"><str>3177</str></arr>
      <arr name="attr_nbpage"><str>2</str></arr>
      <arr name="attr_nbpara"><str>6</str></arr>
      <arr name="attr_nbword"><str>475</str></arr>
      <arr name="attr_page_count"><str>2</str></arr>
      <arr name="attr_paragraph_count"><str>6</str></arr>
      <arr name="attr_stream_content_type"><str>application/xml</str></arr>
      <arr name="attr_stream_size"><str>9130</str></arr>
      <arr name="attr_word_count"><str>475</str></arr>
      <arr name="attr_xmptpg_npages"><str>2</str></arr>
      <arr name="content_type"><str>application/vnd.oasis.opendocument.text</str></arr>
      <str name="id">doc4</str>
      <arr name="text"><str></str></arr>
    </doc>
  </result>
</response>

Brett.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, November 27, 2012 7:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

Not an issue that I know of. I expect you've got some obscure problem in your definitions, but I'm guessing. Try modifying your schema so the glob pattern maps to a stored field, something like:

    <dynamicField name="*" type="string" multiValued="true" stored="true" />

remove all other fields except id, remove your mapping, and try it again. If you query with fl=* you should see everything that was extracted. That'll tell you whether it is a problem with Solr/Tika or something in how you're using them.

Best
Erick
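With this many attr_* fields in one response, an empty value is easy to overlook. As a rough sketch (not something posted in the thread), the Python below scans a Solr XML response and lists the fields that came back empty. The SAMPLE string is a trimmed copy of the response above, and the function name empty_fields is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response from the message above. Assumption: the
# usual Solr XML layout, with <doc> elements nested under <result>.
SAMPLE = """<response>
<result name="response" numFound="1" start="0">
  <doc>
    <arr name="attr_word_count"><str>475</str></arr>
    <arr name="content_type"><str>application/vnd.oasis.opendocument.text</str></arr>
    <str name="id">doc4</str>
    <arr name="text"><str></str></arr>
  </doc>
</result>
</response>"""


def empty_fields(response_xml):
    """Return the names of fields whose values are all empty or whitespace."""
    root = ET.fromstring(response_xml)
    empty = []
    for doc in root.iter("doc"):
        for field in doc:
            # Multi-valued fields are <arr> wrappers; single values are leaf elements.
            values = list(field) if field.tag == "arr" else [field]
            if all(not (v.text or "").strip() for v in values):
                empty.append(field.get("name"))
    return empty


print(empty_fields(SAMPLE))
```

For the sample, only the text field is reported, which matches what the response shows: the metadata was extracted and stored, but the body text was not.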
Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler
Not an issue that I know of. I expect you've got some obscure problem in your definitions, but I'm guessing. Try modifying your schema so the glob pattern maps to a stored field, something like:

    <dynamicField name="*" type="string" multiValued="true" stored="true" />

remove all other fields except id, remove your mapping, and try it again. If you query with fl=* you should see everything that was extracted. That'll tell you whether it is a problem with Solr/Tika or something in how you're using them.

Best
Erick

On Mon, Nov 26, 2012 at 10:19 AM, Brett Melbourne <bmelbou...@halogensoftware.com> wrote:

> Hi Erick,
>
> The document is committed successfully... it is just missing all the extracted content from Tika when I query for that document. That is, the mapped content field attr_content is empty (fmap.content=attr_content):
>
>     <result name="response" numFound="1" start="0" maxScore="1.9162908">
>       <doc>
>         <float name="score">1.9162908</float>
>         <arr name="attr_character_count"><str>24</str></arr>
>         <arr name="attr_content"><str></str></arr>
>         <arr name="attr_creation_date"><str>2009-04-16T11:32:00</str></arr>
>         <arr name="attr_date"><str>2012-11-23T00:29:39.73</str></arr>
>         ...
>     </result>
>
> Brett.
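The fl=* check suggested above comes down to a single select request. A minimal Python sketch of building that URL follows, assuming the host and port used elsewhere in the thread and the default /select handler; the helper name select_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: same host and port as the curl example in the thread.
SOLR_BASE = "http://localhost:8983/solr"


def select_url(doc_id, fields="*"):
    """Build a select URL that returns the given stored fields for one document."""
    params = {"q": "id:" + doc_id, "fl": fields}
    # urlencode percent-encodes ':' and '*', which Solr decodes on arrival.
    return SOLR_BASE + "/select?" + urlencode(params)


print(select_url("doc4"))
```

Opening the printed URL in a browser, or passing it to curl, should list every stored field of the document, so an empty content field stands out immediately.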
RE: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler
Hi Erick,

The document is committed successfully... it is just missing all the extracted content from Tika when I query for that document. That is, the mapped content field attr_content is empty (fmap.content=attr_content):

    <result name="response" numFound="1" start="0" maxScore="1.9162908">
      <doc>
        <float name="score">1.9162908</float>
        <arr name="attr_character_count"><str>24</str></arr>
        <arr name="attr_content"><str></str></arr>
        <arr name="attr_creation_date"><str>2009-04-16T11:32:00</str></arr>
        <arr name="attr_date"><str>2012-11-23T00:29:39.73</str></arr>
        ...
    </result>

Brett.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Sunday, November 25, 2012 9:27 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

Did you commit after you added the document but before you tried the search?

Best
Erick
Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler
Did you commit after you added the document but before you tried the search?

Best
Erick

On Fri, Nov 23, 2012 at 6:25 PM, Brett Melbourne <bmelbou...@halogensoftware.com> wrote:

> Hi all,
>
> I am encountering a problem where Solr 3.6.1 is not able to extract the text content from ODT (OpenDocument Text) files submitted to the ExtractingRequestHandler. I can reproduce this issue against the example schema running with Jetty.
>
> Executing a simple index request (based on the example in the wiki):
>
>     curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@testfile.odt"
>
> returns no errors, and does not generate any exceptions in the log/console. A query for doc1 returns an empty attr_content field:
>
>     <arr name="attr_content"><str></str></arr>
>
> Oddly enough, executing an extractOnly=true request against the ExtractingRequestHandler with the same ODT file correctly returns the text of the file.
>
> I am wondering:
>
> * Is this a known issue? (I couldn't find any mention of this particular issue anywhere...)
> * Are there any workarounds or does anyone have any suggestions?
>
> Thanks,
> Brett.
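The two requests compared in the original post differ only in their query parameters. The Python sketch below builds both URLs side by side. It assumes the ampersand-separated parameters from the wiki example, and it assumes that the extractOnly variant simply swaps commit=true for extractOnly=true, which the post does not spell out. The helper name extract_url is hypothetical, and the file upload itself would still be done with curl -F as in the post.

```python
from urllib.parse import urlencode

# Assumption: same host and port as the curl example in the original post.
SOLR_BASE = "http://localhost:8983/solr"


def extract_url(doc_id, extract_only=False):
    """Build an ExtractingRequestHandler URL for indexing or for an extractOnly preview."""
    params = [
        ("literal.id", doc_id),             # unique key for the new document
        ("uprefix", "attr_"),               # prefix applied to fields not in the schema
        ("fmap.content", "attr_content"),   # map Tika's content to attr_content
    ]
    if extract_only:
        # Return the extracted text in the response instead of indexing it.
        params.append(("extractOnly", "true"))
    else:
        params.append(("commit", "true"))
    return SOLR_BASE + "/update/extract?" + urlencode(params)


index_url = extract_url("doc1")
preview_url = extract_url("doc1", extract_only=True)
print(index_url)
print(preview_url)
```

Posting the same ODT file to both URLs and comparing the results separates a Tika extraction failure from a field-mapping or schema problem: if the preview returns text but the indexed document has an empty attr_content, the mapping is the more likely culprit.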