RE: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

2012-12-07 Thread Brett Melbourne
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">id:doc4</str>
<str name="version">2.2</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<arr name="attr_character_count">
<str>3177</str>
</arr>
<arr name="attr_creation_date">
<str>2010-09-16T15:50:00Z</str>
</arr>
<arr name="attr_creator">
<str>droy</str>
</arr>
<arr name="attr_date">
<str>2010-09-16T15:51:00Z</str>
</arr>
<arr name="attr_edit_time">
<str>PT60S</str>
</arr>
<arr name="attr_editing_cycles">
<str>1</str>
</arr>
<arr name="attr_generator">
<str>MicrosoftOffice/12.0 MicrosoftWord</str>
</arr>
<arr name="attr_initial_creator">
<str>droy</str>
</arr>
<arr name="attr_nbcharacter">
<str>3177</str>
</arr>
<arr name="attr_nbpage">
<str>2</str>
</arr>
<arr name="attr_nbpara">
<str>6</str>
</arr>
<arr name="attr_nbword">
<str>475</str>
</arr>
<arr name="attr_page_count">
<str>2</str>
</arr>
<arr name="attr_paragraph_count">
<str>6</str>
</arr>
<arr name="attr_stream_content_type">
<str>application/xml</str>
</arr>
<arr name="attr_stream_size">
<str>9130</str>
</arr>
<arr name="attr_word_count">
<str>475</str>
</arr>
<arr name="attr_xmptpg_npages">
<str>2</str>
</arr>
<arr name="content_type">
<str>application/vnd.oasis.opendocument.text</str>
</arr>
<str name="id">doc4</str>
<arr name="text">
<str></str>
</arr>
</doc>
</result>
</response>


Brett.
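
Editorial note: the response above has metadata fields populated but an empty text field. A minimal sketch, not from the thread, of how such a response could be scanned for fields whose values are all empty. The helper name empty_fields and the shortened SAMPLE document are hypothetical; only the element layout (arr/str inside doc) is taken from the response shown above.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample in the same layout as the response above.
SAMPLE = """<response>
<result name="response" numFound="1" start="0">
<doc>
<arr name="attr_word_count"><str>475</str></arr>
<arr name="content_type"><str>application/vnd.oasis.opendocument.text</str></arr>
<str name="id">doc4</str>
<arr name="text"><str></str></arr>
</doc>
</result>
</response>"""


def empty_fields(response_xml):
    """Return the names of fields in each <doc> whose values are all blank."""
    root = ET.fromstring(response_xml)
    empty = []
    for doc in root.iter("doc"):
        for field in doc:
            # Multi-valued fields are <arr> elements wrapping one child per value;
            # single-valued fields carry their text directly.
            values = list(field) if field.tag == "arr" else [field]
            if all(not (v.text or "").strip() for v in values):
                empty.append(field.get("name"))
    return empty


print(empty_fields(SAMPLE))
```

Run against a saved copy of a real response, this would point at the field that received no extracted text without reading the whole dump by eye.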




-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, November 27, 2012 7:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's 
ExtractingRequestHandler

Not an issue that I know of. I expect you've got some obscure problem in your 
definitions, but I'm guessing. Try modifying your schema so the glob pattern 
maps to a stored field, something like:
<dynamicField name="*" type="string" multiValued="true" stored="true" />
Remove all other fields except id, remove your mapping, and try it again.
If you query with fl=* you should see everything that was extracted. That'll 
tell you whether it is a problem with Solr/Tika or something in how you're 
using them.

Best
Erick


On Mon, Nov 26, 2012 at 10:19 AM, Brett Melbourne  
bmelbou...@halogensoftware.com wrote:

 Hi Erick,

 The document is committed successfully... it is just missing all the 
 extracted content from Tika when I query for that document.

 i.e. the mapped content field attr_content is empty
 (fmap.content=attr_content)

 <result name="response" numFound="1" start="0" maxScore="1.9162908">
 <doc>
 <float name="score">1.9162908</float>
 <arr name="attr_character_count"><str>24</str></arr>
 <arr name="attr_content"><str></str></arr>
 <arr name="attr_creation_date"><str>2009-04-16T11:32:00</str></arr>
 <arr name="attr_date"><str>2012-11-23T00:29:39.73</str></arr>

 ...

 </result>


 Brett.

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Sunday, November 25, 2012 9:27 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Problem with Solr 3.6.1 extracting ODT content using 
 SolrCell's ExtractingRequestHandler

 Did you commit after you added the document but before you tried the 
 search?

 Best
 Erick


 On Fri, Nov 23, 2012 at 6:25 PM, Brett Melbourne  
 bmelbou...@halogensoftware.com wrote:

  Hi all,
 
  I am encountering a problem where Solr 3.6.1 is not able to extract 
  the text content from ODT (Open Office Document) files submitted to 
  the ExtractingRequestHandler. I can reproduce this issue against the 
  example schema running with jetty.
 
  Executing a simple index request (based on the example in the wiki):
  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@testfile.odt"
  returns no errors, and does not generate any exceptions in the
 log/console.
 
  A query for doc1 returns an empty attr_content field:
  <arr name="attr_content"><str></str></arr>
 
  Oddly enough, executing an extractOnly=true request against the 
  ExtractingRequestHandler with the same ODT file correctly returns 
  the text of the file.
 
  I am wondering:
 
  * Is this a known issue? (I couldn't find any mention of this
  particular issue anywhere...)
 
  * Are there any workarounds or does anyone have any suggestions?
 
  Thanks,
 
  Brett.
 
 



Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

2012-11-27 Thread Erick Erickson
Not an issue that I know of. I expect you've got some obscure problem in
your definitions, but I'm guessing. Try modifying your schema so the glob
pattern maps to a stored field, something like:
<dynamicField name="*" type="string" multiValued="true" stored="true" />
Remove all other fields except id, remove your mapping, and try it again.
If you query with fl=* you should see everything that was extracted. That'll
tell you whether it is a problem with Solr/Tika or something in how you're
using them.

Best
Erick
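
Editorial note: a minimal sketch, not from the thread, of the fl=* check suggested above. It builds the query URL and lists the field names present in a returned document. The /select path, the helper names select_url and field_names, and the sample XML are assumptions made for illustration; host and port are the ones used elsewhere in the thread.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def select_url(doc_id, base="http://localhost:8983/solr"):
    """Build a query URL that asks for every stored field (fl=*) of one document."""
    # The /select path is an assumption; adjust it to your handler configuration.
    params = {"q": f"id:{doc_id}", "fl": "*", "indent": "on"}
    return f"{base}/select?{urlencode(params)}"


def field_names(response_xml):
    """Return the sorted names of all fields found in the returned documents."""
    root = ET.fromstring(response_xml)
    names = []
    for doc in root.iter("doc"):
        for field in doc:
            name = field.get("name")
            if name is not None:
                names.append(name)
    return sorted(names)


# Hypothetical response used only to exercise field_names().
sample = (
    '<response><result name="response" numFound="1" start="0"><doc>'
    '<str name="id">doc1</str>'
    '<arr name="content"><str>some text</str></arr>'
    "</doc></result></response>"
)
print(select_url("doc1"))
print(field_names(sample))
```

With the catch-all stored dynamicField in place, any field Tika produced should show up in this list, which separates an extraction problem from a mapping problem.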


On Mon, Nov 26, 2012 at 10:19 AM, Brett Melbourne 
bmelbou...@halogensoftware.com wrote:

 Hi Erick,

 The document is committed successfully... it is just missing all the
 extracted content from Tika when I query for that document.

 i.e. the mapped content field attr_content is empty
 (fmap.content=attr_content)

 <result name="response" numFound="1" start="0" maxScore="1.9162908">
 <doc>
 <float name="score">1.9162908</float>
 <arr name="attr_character_count">
 <str>24</str>
 </arr>
 <arr name="attr_content">
 <str></str>
 </arr>
 <arr name="attr_creation_date">
 <str>2009-04-16T11:32:00</str>
 </arr>
 <arr name="attr_date">
 <str>2012-11-23T00:29:39.73</str>
 </arr>

 ...

 </result>


 Brett.

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Sunday, November 25, 2012 9:27 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Problem with Solr 3.6.1 extracting ODT content using
 SolrCell's ExtractingRequestHandler

 Did you commit after you added the document but before you tried the
 search?

 Best
 Erick


 On Fri, Nov 23, 2012 at 6:25 PM, Brett Melbourne 
 bmelbou...@halogensoftware.com wrote:

  Hi all,
 
  I am encountering a problem where Solr 3.6.1 is not able to extract
  the text content from ODT (Open Office Document) files submitted to
  the ExtractingRequestHandler. I can reproduce this issue against the
  example schema running with jetty.
 
  Executing a simple index request (based on the example in the wiki):
  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@testfile.odt"
  returns no errors, and does not generate any exceptions in the
 log/console.
 
  A query for doc1 returns an empty attr_content field:
  <arr name="attr_content"><str></str></arr>
 
  Oddly enough, executing an extractOnly=true request against the
  ExtractingRequestHandler with the same ODT file correctly returns the
  text of the file.
 
  I am wondering:
 
  * Is this a known issue? (I couldn't find any mention of this
  particular issue anywhere...)
 
  * Are there any workarounds or does anyone have any suggestions?
 
  Thanks,
 
  Brett.
 
 



RE: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

2012-11-26 Thread Brett Melbourne
Hi Erick,

The document is committed successfully... it is just missing all the extracted 
content from Tika when I query for that document.

i.e. the mapped content field attr_content is empty (fmap.content=attr_content)

<result name="response" numFound="1" start="0" maxScore="1.9162908">
<doc>
<float name="score">1.9162908</float>
<arr name="attr_character_count">
<str>24</str>
</arr>
<arr name="attr_content">
<str></str>
</arr>
<arr name="attr_creation_date">
<str>2009-04-16T11:32:00</str>
</arr>
<arr name="attr_date">
<str>2012-11-23T00:29:39.73</str>
</arr>

...

</result>


Brett.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Sunday, November 25, 2012 9:27 PM
To: solr-user@lucene.apache.org
Subject: Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's 
ExtractingRequestHandler

Did you commit after you added the document but before you tried the search?

Best
Erick


On Fri, Nov 23, 2012 at 6:25 PM, Brett Melbourne  
bmelbou...@halogensoftware.com wrote:

 Hi all,

 I am encountering a problem where Solr 3.6.1 is not able to extract 
 the text content from ODT (Open Office Document) files submitted to 
 the ExtractingRequestHandler. I can reproduce this issue against the 
 example schema running with jetty.

 Executing a simple index request (based on the example in the wiki):
 curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@testfile.odt"
 returns no errors, and does not generate any exceptions in the log/console.

 A query for doc1 returns an empty attr_content field:
 <arr name="attr_content"><str></str></arr>

 Oddly enough, executing an extractOnly=true request against the 
 ExtractingRequestHandler with the same ODT file correctly returns the 
 text of the file.

 I am wondering:

 * Is this a known issue? (I couldn't find any mention of this
 particular issue anywhere...)

 * Are there any workarounds or does anyone have any suggestions?

 Thanks,

 Brett.




Re: Problem with Solr 3.6.1 extracting ODT content using SolrCell's ExtractingRequestHandler

2012-11-25 Thread Erick Erickson
Did you commit after you added the document but before you tried the search?

Best
Erick


On Fri, Nov 23, 2012 at 6:25 PM, Brett Melbourne 
bmelbou...@halogensoftware.com wrote:

 Hi all,

 I am encountering a problem where Solr 3.6.1 is not able to extract the
 text content from ODT (Open Office Document) files submitted to the
 ExtractingRequestHandler. I can reproduce this issue against the example
 schema running with jetty.

 Executing a simple index request (based on the example in the wiki):
 curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@testfile.odt"
 returns no errors, and does not generate any exceptions in the log/console.

 A query for doc1 returns an empty attr_content field:
 <arr name="attr_content"><str></str></arr>

 Oddly enough, executing an extractOnly=true request against the
 ExtractingRequestHandler with the same ODT file correctly returns the text
 of the file.

 I am wondering:

 * Is this a known issue? (I couldn't find any mention of this
 particular issue anywhere...)

 * Are there any workarounds or does anyone have any suggestions?

 Thanks,

 Brett.
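
Editorial note: the index request in the original post depends on each parameter being separated by "&", which is easy to lose when a URL is pasted or wrapped. A minimal sketch, not from the thread, that builds both the indexing URL and the extractOnly=true variant from a parameter dictionary. The helper name extract_url is hypothetical; the parameter names and values are the ones quoted in the post.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # host and port as quoted in the thread


def extract_url(doc_id, extract_only=False):
    """Build an /update/extract URL with every parameter joined by '&'."""
    if extract_only:
        # extractOnly=true returns the extracted text instead of indexing it,
        # so the id and field-mapping parameters are left out here.
        params = {"extractOnly": "true"}
    else:
        params = {
            "literal.id": doc_id,
            "uprefix": "attr_",
            "fmap.content": "attr_content",
            "commit": "true",
        }
    return f"{SOLR_BASE}/update/extract?{urlencode(params)}"


index_url = extract_url("doc1")
probe_url = extract_url("doc1", extract_only=True)
print(index_url)
print(probe_url)
```

Either URL can then be passed to curl in quotes together with -F "myfile=@testfile.odt", so that the same file goes through both the indexing and the extract-only path and the two results can be compared.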