RE: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

Brett Melbourne Fri, 07 Dec 2012 13:41:04 -0800

Hi Erick,

Thanks for the reply!

I don't think there is a problem with my schema, because I can successfully 
extract text from other file types.

For example, Tika is able to extract the content from a docx:

FINEST: Trying class name 
org.apache.solr.handler.extraction.ExtractingRequestHandler
Dec 7, 2012 3:18:35 PM org.apache.solr.handler.extraction.SolrContentHandler 
newDocument
FINE: Doc: SolrInputDocument[{attr_meta=attr_meta(1.0)={[stream_content_type, 
application/xml, stream_size, 9935, Content-Type, 
application/vnd.openxmlformats-officedocument.wordprocessingml.document]}, 
attr_revision_number=attr_revision_number(1.0)={1}, 
attr_template=attr_template(1.0)={Normal.dotm}, 
attr_last_author=attr_last_author(1.0)={Brett Melbourne}, 
attr_page_count=attr_page_count(1.0)={1}, 
attr_application_name=attr_application_name(1.0)={Microsoft Office Word}, 
author=author(1.0)={Brett Melbourne}, 
last_modified=last_modified(1.0)={2012-12-07T19:18:00.000Z}, 
attr_application_version=attr_application_version(1.0)={12.0000}, 
attr_character_count_with_spaces=attr_character_count_with_spaces(1.0)={60}, 
attr_date=attr_date(1.0)={2012-12-07T19:17:00Z}, 
attr_total_time=attr_total_time(1.0)={1}, 
attr_publisher=attr_publisher(1.0)={}, attr_creator=attr_creator(1.0)={Brett 
Melbourne}, attr_word_count=attr_word_count(1.0)={9}, 
attr_xmptpg_npages=attr_xmptpg_npages(1.0)={1}, 
attr_creation_date=attr_creation_date(1.0)={2012-12-07T19:17:00Z}, 
attr_stream_content_type=attr_stream_content_type(1.0)={application/xml}, 
attr_line_count=attr_line_count(1.0)={1}, 
attr_character_count=attr_character_count(1.0)={52}, 
attr_stream_size=attr_stream_size(1.0)={9935}, 
content_type=content_type(1.0)={application/vnd.openxmlformats-officedocument.wordprocessingml.document},
 attr_paragraph_count=attr_paragraph_count(1.0)={1}, id=id(1.0)={doc3}, 
text=text(1.0)={             This is some text content that Solr should be able 
to parse.   }}]

The docx content in Solr is:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">id:doc3</str>
<str name="version">2.2</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<arr name="attr_application_name">
<str>Microsoft Office Word</str>
</arr>
<arr name="attr_application_version">
<str>12.0000</str>
</arr>
<arr name="attr_character_count">
<str>52</str>
</arr>
<arr name="attr_character_count_with_spaces">
<str>60</str>
</arr>
<arr name="attr_creation_date">
<str>2012-12-07T19:17:00Z</str>
</arr>
<arr name="attr_creator">
<str>Brett Melbourne</str>
</arr>
<arr name="attr_date">
<str>2012-12-07T19:17:00Z</str>
</arr>
<arr name="attr_last_author">
<str>Brett Melbourne</str>
</arr>
<arr name="attr_line_count">
<str>1</str>
</arr>
<arr name="attr_meta">
<str>stream_content_type</str>
<str>application/xml</str>
<str>stream_size</str>
<str>9935</str>
<str>Content-Type</str>
<str>
application/vnd.openxmlformats-officedocument.wordprocessingml.document
</str>
</arr>
<arr name="attr_page_count">
<str>1</str>
</arr>
<arr name="attr_paragraph_count">
<str>1</str>
</arr>
<arr name="attr_publisher">
<str/>
</arr>
<arr name="attr_revision_number">
<str>1</str>
</arr>
<arr name="attr_stream_content_type">
<str>application/xml</str>
</arr>
<arr name="attr_stream_size">
<str>9935</str>
</arr>
<arr name="attr_template">
<str>Normal.dotm</str>
</arr>
<arr name="attr_total_time">
<str>1</str>
</arr>
<arr name="attr_word_count">
<str>9</str>
</arr>
<arr name="attr_xmptpg_npages">
<str>1</str>
</arr>
<str name="author">Brett Melbourne</str>
<arr name="content_type">
<str>
application/vnd.openxmlformats-officedocument.wordprocessingml.document
</str>
</arr>
<str name="id">doc3</str>
<date name="last_modified">2012-12-07T19:18:00Z</date>
<arr name="text">
<str>
This is some text content that Solr should be able to parse.
</str>
</arr>
</doc>
</result>
</response>

When I attempt to index an ODT, the request appears to succeed. However, notice 
that the text field is empty:

Dec 7, 2012 4:18:43 PM org.apache.solr.handler.extraction.SolrContentHandler 
newDocument
FINE: Doc: SolrInputDocument[{attr_editing_cycles=attr_editing_cycles(1.0)={1}, 
attr_page_count=attr_page_count(1.0)={2}, 
attr_date=attr_date(1.0)={2010-09-16T15:51:00Z}, 
attr_creator=attr_creator(1.0)={droy}, 
attr_word_count=attr_word_count(1.0)={475}, 
attr_xmptpg_npages=attr_xmptpg_npages(1.0)={2}, 
attr_edit_time=attr_edit_time(1.0)={PT60S}, 
attr_creation_date=attr_creation_date(1.0)={2010-09-16T15:50:00Z}, 
attr_nbpara=attr_nbpara(1.0)={6}, 
attr_stream_content_type=attr_stream_content_type(1.0)={application/xml}, 
attr_initial_creator=attr_initial_creator(1.0)={droy}, 
attr_character_count=attr_character_count(1.0)={3177}, 
attr_stream_size=attr_stream_size(1.0)={9130}, 
attr_generator=attr_generator(1.0)={MicrosoftOffice/12.0 MicrosoftWord}, 
attr_nbword=attr_nbword(1.0)={475}, attr_nbpage=attr_nbpage(1.0)={2}, 
content_type=content_type(1.0)={application/vnd.oasis.opendocument.text}, 
attr_nbcharacter=attr_nbcharacter(1.0)={3177}, 
attr_paragraph_count=attr_paragraph_count(1.0)={6}, id=id(1.0)={doc4}, 
text=text(1.0)={  }}]

The corresponding document in Solr looks like this:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">id:doc4</str>
<str name="version">2.2</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<arr name="attr_character_count">
<str>3177</str>
</arr>
<arr name="attr_creation_date">
<str>2010-09-16T15:50:00Z</str>
</arr>
<arr name="attr_creator">
<str>droy</str>
</arr>
<arr name="attr_date">
<str>2010-09-16T15:51:00Z</str>
</arr>
<arr name="attr_edit_time">
<str>PT60S</str>
</arr>
<arr name="attr_editing_cycles">
<str>1</str>
</arr>
<arr name="attr_generator">
<str>MicrosoftOffice/12.0 MicrosoftWord</str>
</arr>
<arr name="attr_initial_creator">
<str>droy</str>
</arr>
<arr name="attr_nbcharacter">
<str>3177</str>
</arr>
<arr name="attr_nbpage">
<str>2</str>
</arr>
<arr name="attr_nbpara">
<str>6</str>
</arr>
<arr name="attr_nbword">
<str>475</str>
</arr>
<arr name="attr_page_count">
<str>2</str>
</arr>
<arr name="attr_paragraph_count">
<str>6</str>
</arr>
<arr name="attr_stream_content_type">
<str>application/xml</str>
</arr>
<arr name="attr_stream_size">
<str>9130</str>
</arr>
<arr name="attr_word_count">
<str>475</str>
</arr>
<arr name="attr_xmptpg_npages">
<str>2</str>
</arr>
<arr name="content_type">
<str>application/vnd.oasis.opendocument.text</str>
</arr>
<str name="id">doc4</str>
<arr name="text">
<str></str>
</arr>
</doc>
</result>
</response>
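
For anyone who wants to check for this symptom across many documents, here is 
a minimal sketch that parses a response like the one above and lists the fields 
whose values are all blank. The function name empty_fields and the trimmed 
SAMPLE response are illustrative assumptions, not part of Solr or SolrCell.

```python
import xml.etree.ElementTree as ET


def empty_fields(response_xml):
    """Return names of fields in the first <doc> whose values are all blank.

    Hypothetical helper for illustration; it only inspects the XML text.
    """
    root = ET.fromstring(response_xml)
    doc = root.find("./result/doc")
    if doc is None:
        return []
    empty = []
    for field in doc:
        if field.tag == "arr":
            # Multi-valued field: collect the text of every child value.
            values = [(child.text or "") for child in field]
        else:
            # Single-valued field such as <str name="id">.
            values = [field.text or ""]
        if all(not value.strip() for value in values):
            empty.append(field.get("name"))
    return empty


# Trimmed copy of the doc4 response shown above.
SAMPLE = """<response>
<result name="response" numFound="1" start="0">
<doc>
<arr name="content_type">
<str>application/vnd.oasis.opendocument.text</str>
</arr>
<str name="id">doc4</str>
<arr name="text">
<str></str>
</arr>
</doc>
</result>
</response>"""

print(empty_fields(SAMPLE))  # prints ['text']
```

Note that a legitimately empty metadata field (attr_publisher in the docx 
response, for example) would be reported as well, so the result needs a quick 
manual look.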

Brett.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, November 27, 2012 7:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's 
ExtractingRequestHandler

Not an issue that I know of. I expect you've got some obscure problem in your 
definitions, but I'm guessing. Try modifying your schema so the glob pattern 
maps to a stored field, something like:
<dynamicField name="*" type="string" multiValued="true" stored="true" />
Then remove all other fields except id, remove your mapping, and try it again.
If you query with fl=* you should see everything that was extracted. That'll 
tell you whether it is a problem with Solr/Tika or something in how you're 
using them.
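
As a rough sketch of the fl=* check, the query URL could be built like this. 
The /select path, the base URL, and the helper name build_select_url are 
assumptions based on the stock example setup, so adjust them to your install.

```python
from urllib.parse import urlencode


def build_select_url(doc_id, base="http://localhost:8983/solr"):
    """Build a query URL that asks for every stored field of one document.

    Hypothetical helper; base and the /select path are assumed defaults.
    """
    params = {
        "q": f"id:{doc_id}",  # look up a single document by its id
        "fl": "*",            # return all stored fields
        "indent": "on",       # pretty-print the XML response
    }
    return f"{base}/select?{urlencode(params)}"


print(build_select_url("doc4"))
```

urlencode percent-encodes the colon and the asterisk, which Solr decodes back 
to id:doc4 and * on its side.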

Best
Erick

On Mon, Nov 26, 2012 at 10:19 AM, Brett Melbourne < 
bmelbou...@halogensoftware.com> wrote:

> Hi Erick,
>
> The document is committed successfully... it is just missing all the 
> extracted content from Tika when I query for that document.
>
> i.e. the mapped content field attr_content is empty
> (fmap.content=attr_content)
>
> <result name="response" numFound="1" start="0" maxScore="1.9162908">
> <doc>
> <float name="score">1.9162908</float>
> <arr name="attr_character_count">
> <str>24</str>
> </arr>
> <arr name="attr_content">
> <str></str>
> </arr>
> <arr name="attr_creation_date">
> <str>2009-04-16T11:32:00</str>
> </arr>
> <arr name="attr_date">
> <str>2012-11-23T00:29:39.73</str>
> </arr>
>
> ...
>
> </result>
>
>
> Brett.
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Sunday, November 25, 2012 9:27 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Problem with Solr 3.6.1 extracting ODT content using 
> SolrCell's ExtractingRequestHandler
>
> Did you commit after you added the document but before you tried the 
> search?
>
> Best
> Erick
>
>
> On Fri, Nov 23, 2012 at 6:25 PM, Brett Melbourne < 
> bmelbou...@halogensoftware.com> wrote:
>
> > Hi all,
> >
> > I am encountering a problem where Solr 3.6.1 is not able to extract 
> > the text content from ODT (OpenDocument Text) files submitted to 
> > the ExtractingRequestHandler. I can reproduce this issue against the 
> > example schema running with jetty.
> >
> > Executing a simple index request (based on the example in the wiki):
> > curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@testfile.odt"
> > returns no errors, and does not generate any exceptions in the
> > log/console.
> >
> > A query for doc1 returns an empty attr_content field:
> > <arr name="attr_content"> <str></str> </arr>
> >
> > Oddly enough, executing an "extractOnly=true" request against the 
> > ExtractingRequestHandler with the same ODT file correctly returns 
> > the text of the file.
> >
> > I am wondering:
> >
> > * Is this a known issue? (I couldn't find any mention of this
> > particular issue anywhere...)
> >
> > * Are there any workarounds or does anyone have any suggestions?
> >
> > Thanks,
> >
> > Brett.
> >
> >
>
