Hi Erick,

Thanks for the reply!

I don't think there is a problem with my schema, because I can successfully extract text from other file types. For example, Tika is able to extract the content from a docx:

FINEST: Trying class name org.apache.solr.handler.extraction.ExtractingRequestHandler
Dec 7, 2012 3:18:35 PM org.apache.solr.handler.extraction.SolrContentHandler newDocument
FINE: Doc: SolrInputDocument[{attr_meta=attr_meta(1.0)={[stream_content_type, application/xml, stream_size, 9935, Content-Type, application/vnd.openxmlformats-officedocument.wordprocessingml.document]}, attr_revision_number=attr_revision_number(1.0)={1}, attr_template=attr_template(1.0)={Normal.dotm}, attr_last_author=attr_last_author(1.0)={Brett Melbourne}, attr_page_count=attr_page_count(1.0)={1}, attr_application_name=attr_application_name(1.0)={Microsoft Office Word}, author=author(1.0)={Brett Melbourne}, last_modified=last_modified(1.0)={2012-12-07T19:18:00.000Z}, attr_application_version=attr_application_version(1.0)={12.0000}, attr_character_count_with_spaces=attr_character_count_with_spaces(1.0)={60}, attr_date=attr_date(1.0)={2012-12-07T19:17:00Z}, attr_total_time=attr_total_time(1.0)={1}, attr_publisher=attr_publisher(1.0)={}, attr_creator=attr_creator(1.0)={Brett Melbourne}, attr_word_count=attr_word_count(1.0)={9}, attr_xmptpg_npages=attr_xmptpg_npages(1.0)={1}, attr_creation_date=attr_creation_date(1.0)={2012-12-07T19:17:00Z}, attr_stream_content_type=attr_stream_content_type(1.0)={application/xml}, attr_line_count=attr_line_count(1.0)={1}, attr_character_count=attr_character_count(1.0)={52}, attr_stream_size=attr_stream_size(1.0)={9935}, content_type=content_type(1.0)={application/vnd.openxmlformats-officedocument.wordprocessingml.document}, attr_paragraph_count=attr_paragraph_count(1.0)={1}, id=id(1.0)={doc3}, text=text(1.0)={ This is some text content that Solr should be able to parse.
}}]

The docx content in Solr is:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">id:doc3</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="attr_application_name"><str>Microsoft Office Word</str></arr>
      <arr name="attr_application_version"><str>12.0000</str></arr>
      <arr name="attr_character_count"><str>52</str></arr>
      <arr name="attr_character_count_with_spaces"><str>60</str></arr>
      <arr name="attr_creation_date"><str>2012-12-07T19:17:00Z</str></arr>
      <arr name="attr_creator"><str>Brett Melbourne</str></arr>
      <arr name="attr_date"><str>2012-12-07T19:17:00Z</str></arr>
      <arr name="attr_last_author"><str>Brett Melbourne</str></arr>
      <arr name="attr_line_count"><str>1</str></arr>
      <arr name="attr_meta">
        <str>stream_content_type</str>
        <str>application/xml</str>
        <str>stream_size</str>
        <str>9935</str>
        <str>Content-Type</str>
        <str>
          application/vnd.openxmlformats-officedocument.wordprocessingml.document
        </str>
      </arr>
      <arr name="attr_page_count"><str>1</str></arr>
      <arr name="attr_paragraph_count"><str>1</str></arr>
      <arr name="attr_publisher"><str/></arr>
      <arr name="attr_revision_number"><str>1</str></arr>
      <arr name="attr_stream_content_type"><str>application/xml</str></arr>
      <arr name="attr_stream_size"><str>9935</str></arr>
      <arr name="attr_template"><str>Normal.dotm</str></arr>
      <arr name="attr_total_time"><str>1</str></arr>
      <arr name="attr_word_count"><str>9</str></arr>
      <arr name="attr_xmptpg_npages"><str>1</str></arr>
      <str name="author">Brett Melbourne</str>
      <arr name="content_type">
        <str>
          application/vnd.openxmlformats-officedocument.wordprocessingml.document
        </str>
      </arr>
      <str name="id">doc3</str>
      <date name="last_modified">2012-12-07T19:18:00Z</date>
      <arr name="text">
        <str> This is some text content that Solr should be able to parse. </str>
      </arr>
    </doc>
  </result>
</response>

When I attempt to index an ODT, the request appears to succeed, but notice that the text field is empty:

Dec 7, 2012 4:18:43 PM org.apache.solr.handler.extraction.SolrContentHandler newDocument
FINE: Doc: SolrInputDocument[{attr_editing_cycles=attr_editing_cycles(1.0)={1}, attr_page_count=attr_page_count(1.0)={2}, attr_date=attr_date(1.0)={2010-09-16T15:51:00Z}, attr_creator=attr_creator(1.0)={droy}, attr_word_count=attr_word_count(1.0)={475}, attr_xmptpg_npages=attr_xmptpg_npages(1.0)={2}, attr_edit_time=attr_edit_time(1.0)={PT60S}, attr_creation_date=attr_creation_date(1.0)={2010-09-16T15:50:00Z}, attr_nbpara=attr_nbpara(1.0)={6}, attr_stream_content_type=attr_stream_content_type(1.0)={application/xml}, attr_initial_creator=attr_initial_creator(1.0)={droy}, attr_character_count=attr_character_count(1.0)={3177}, attr_stream_size=attr_stream_size(1.0)={9130}, attr_generator=attr_generator(1.0)={MicrosoftOffice/12.0 MicrosoftWord}, attr_nbword=attr_nbword(1.0)={475}, attr_nbpage=attr_nbpage(1.0)={2}, content_type=content_type(1.0)={application/vnd.oasis.opendocument.text}, attr_nbcharacter=attr_nbcharacter(1.0)={3177}, attr_paragraph_count=attr_paragraph_count(1.0)={6}, id=id(1.0)={doc4}, text=text(1.0)={ }}]

The corresponding document in Solr looks like this:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">id:doc4</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="attr_character_count"><str>3177</str></arr>
      <arr name="attr_creation_date"><str>2010-09-16T15:50:00Z</str></arr>
      <arr name="attr_creator"><str>droy</str></arr>
      <arr name="attr_date"><str>2010-09-16T15:51:00Z</str></arr>
      <arr name="attr_edit_time"><str>PT60S</str></arr>
      <arr name="attr_editing_cycles"><str>1</str></arr>
      <arr name="attr_generator"><str>MicrosoftOffice/12.0 MicrosoftWord</str></arr>
      <arr name="attr_initial_creator"><str>droy</str></arr>
      <arr name="attr_nbcharacter"><str>3177</str></arr>
      <arr name="attr_nbpage"><str>2</str></arr>
      <arr name="attr_nbpara"><str>6</str></arr>
      <arr name="attr_nbword"><str>475</str></arr>
      <arr name="attr_page_count"><str>2</str></arr>
      <arr name="attr_paragraph_count"><str>6</str></arr>
      <arr name="attr_stream_content_type"><str>application/xml</str></arr>
      <arr name="attr_stream_size"><str>9130</str></arr>
      <arr name="attr_word_count"><str>475</str></arr>
      <arr name="attr_xmptpg_npages"><str>2</str></arr>
      <arr name="content_type"><str>application/vnd.oasis.opendocument.text</str></arr>
      <str name="id">doc4</str>
      <arr name="text"><str></str></arr>
    </doc>
  </result>
</response>

Brett.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, November 27, 2012 7:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

Not an issue that I know of. I expect you've got some obscure problem in your definitions, but I'm guessing. Try modifying your schema so the glob pattern maps to a stored field, something like:

<dynamicField name="*" type="string" multiValued="true" stored="true" />

remove all other fields except id, remove your mapping, and try it again. If you query with fl=* you should see everything that was extracted. That'll tell you whether it is a problem with Solr/Tika or something in how you're using them.

Best
Erick

On Mon, Nov 26, 2012 at 10:19 AM, Brett Melbourne <bmelbou...@halogensoftware.com> wrote:

> Hi Erick,
>
> The document is committed successfully... it is just missing all the
> extracted content from Tika when I query for that document.
>
> i.e. the mapped content field attr_content is empty
> (fmap.content=attr_content)
>
> <result name="response" numFound="1" start="0" maxScore="1.9162908">
>   <doc>
>     <float name="score">1.9162908</float>
>     <arr name="attr_character_count"> <str>24</str> </arr>
>     <arr name="attr_content"> <str></str> </arr>
>     <arr name="attr_creation_date"> <str>2009-04-16T11:32:00</str> </arr>
>     <arr name="attr_date"> <str>2012-11-23T00:29:39.73</str> </arr>
>
> ...
>
> </result>
>
>
> Brett.
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Sunday, November 25, 2012 9:27 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Problem with Solr 3.6.1 extracting ODT content using
> SolrCell's ExtractingRequestHandler
>
> Did you commit after you added the document but before you tried the
> search?
>
> Best
> Erick
>
>
> On Fri, Nov 23, 2012 at 6:25 PM, Brett Melbourne <
> bmelbou...@halogensoftware.com> wrote:
>
> > Hi all,
> >
> > I am encountering a problem where Solr 3.6.1 is not able to extract
> > the text content from ODT (OpenDocument Text) files submitted to
> > the ExtractingRequestHandler. I can reproduce this issue against the
> > example schema running with jetty.
> >
> > Executing a simple index request (based on the example in the wiki):
> > curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@testfile.odt"
> > returns no errors, and does not generate any exceptions in the
> > log/console.
> >
> > A query for doc1 returns an empty attr_content field:
> > <arr name="attr_content"> <str></str> </arr>
> >
> > Oddly enough, executing an "extractOnly=true" request against the
> > ExtractingRequestHandler with the same ODT file correctly returns
> > the text of the file.
> >
> > I am wondering:
> >
> > * Is this a known issue? (I couldn't find any mention of this
> > particular issue anywhere...)
> >
> > * Are there any workarounds or does anyone have any suggestions?
> >
> > Thanks,
> >
> > Brett.
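
For anyone who wants to repeat the comparison described in the original message (an indexing request versus an "extractOnly=true" request for the same file), here is a minimal Python sketch. It only builds the two request URLs from the parameters quoted in this thread and does not contact Solr. The helper name build_extract_url is hypothetical and not part of Solr, and the sketch assumes the extractOnly request needs no parameters beyond extractOnly=true.

```python
from urllib.parse import urlencode

# Endpoint taken from the curl command quoted in this thread.
EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, extract_only=False):
    """Build an ExtractingRequestHandler URL (hypothetical helper, not a Solr API)."""
    if extract_only:
        # Assumption: extractOnly=true alone is enough to get the extracted text back.
        params = {"extractOnly": "true"}
    else:
        # Same parameters as the indexing request in the original message.
        params = {
            "literal.id": doc_id,
            "uprefix": "attr_",
            "fmap.content": "attr_content",
            "commit": "true",
        }
    return EXTRACT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_extract_url("doc1"))
    print(build_extract_url("doc1", extract_only=True))
```

Posting the same ODT file to each URL (for example with curl -F "myfile=@testfile.odt") should show whether the text is lost at extraction time or only when the document is indexed.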