Re: Solr 1.4.1 and Tika 0.9 - some tests not passing

2011-04-01 Thread Andreas Kemkes
Thank you.  That is valuable guidance.

In light of the recent release of Solr 3.1, I decided to first try that 
distribution, as it already uses Tika 0.8, which is much closer to my target.

Out of the box (i.e., w/o replacing the Tika and PDFBox libraries) the tests 
pass, yet I see the error below.  When I change

ignoreException("unknown field 'a'");

to 

ignoreException("unknown field 'meta'");

in the testDefaultField test, the error output goes away.

I am wondering whether that particular error is expected, or whether the error 
should in fact be "unknown field 'a'" and I am only masking an issue with the 
change.
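
For context, here is a minimal sketch of how such an ignore list behaves, 
assuming ignoreException registers a regular expression that is searched for 
in each logged message. The class and method names are illustrative stand-ins, 
not Solr's actual code. Under that assumption, the pattern for field 'a' never 
matches a message about field 'meta', which would explain why the error is 
printed with the original pattern and suppressed with the changed one. It does 
not settle which field name the test ought to expect.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Simplified stand-in for a test framework's ignore list; not Solr's code.
class IgnoreListSketch {
    private final Set<String> ignorePatterns = new LinkedHashSet<>();

    // Register a regular expression; matching messages count as expected.
    void ignoreException(String pattern) {
        ignorePatterns.add(pattern);
    }

    // True when any registered pattern is found somewhere in the message.
    boolean isIgnored(String message) {
        for (String p : ignorePatterns) {
            if (Pattern.compile(p).matcher(message).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String logged = "ERROR:unknown field 'meta'";

        IgnoreListSketch original = new IgnoreListSketch();
        original.ignoreException("unknown field 'a'");
        System.out.println(original.isIgnored(logged)); // false

        IgnoreListSketch changed = new IgnoreListSketch();
        changed.ignoreException("unknown field 'meta'");
        System.out.println(changed.isIgnored(logged)); // true
    }
}
```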

All extraction tests also pass after I replace the Tika and PDFBox libraries 
with the newer versions.

-- Andreas

test:
[junit] Testsuite: org.apache.solr.handler.ExtractingRequestHandlerTest
[junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 6.424 sec
[junit] 
[junit] - Standard Error -
[junit] 01/04/2011 22:49:59 org.apache.solr.common.SolrException log
[junit] SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field 'meta'
[junit] at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:321)
[junit] at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
[junit] at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:121)
[junit] at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:126)
[junit] at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:198)
[junit] at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55)
[junit] at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
[junit] at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
[junit] at org.apache.solr.util.TestHarness.queryAndResponse(TestHarness.java:337)
[junit] at org.apache.solr.handler.ExtractingRequestHandlerTest.loadLocal(ExtractingRequestHandlerTest.java:373)
[junit] at org.apache.solr.handler.ExtractingRequestHandlerTest.testDefaultField(ExtractingRequestHandlerTest.java:156)








Re: Solr 1.4.1 and Tika 0.9 - some tests not passing

2011-03-31 Thread Chris Hostetter

: I'm still interested in what steps I could take to get to the bottom of the 
: failing tests.  Is there additional information that I should provide?

i'm not really up to speed on what might have changed in Tika 0.9 to cause 
this, but the best thing to do would probably be to look at what *does* 
work compared to what doesn't work.

if *none* of the asserts for dealing with an html doc work, that suggests 
that fundamentally something is just completely broken about the html 
parsing.

Consider this first assertion failure...

: assertQ(req("title:Welcome"), "//*[@numFound='1']");

...in the context of what you said tika 0.9 gives you for that doc on the 
command line...

: $ java -jar tika-app-0.9.jar ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html
...
: Welcome to Solr

...if that basic little bit of info can't be extracted, then i'm guessing 
nothing is being extracted.

I would suggest you run the example (with the 0.9 tika jars) and manually 
attempt to index one document, and then use the schema browser to see 
exactly what gets indexed.

you may need to experiment with tweaking the config options for the 
extraction handler.
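
A minimal sketch of what such a manual request might look like, assuming the 
stock example server at http://localhost:8983/solr and the extraction handler 
mounted at /update/extract. The host, port, path, document id, and the attr_ 
prefix are assumptions for illustration. The literal.id, uprefix, and commit 
parameters are taken from the handler's documentation, where uprefix is 
described as prefixing fields that the schema does not define. The helper 
class itself is hypothetical and only assembles the URL to post the file to.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that assembles an extract request URL; it sends nothing.
class ExtractUrlSketch {
    // Append the parameters to the base URL in insertion order.
    static String buildUrl(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(enc(e.getKey())).append('=').append(enc(e.getValue()));
            sep = '&';
        }
        return sb.toString();
    }

    // URL-encode one key or value as UTF-8.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 is always available", ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "doc1"); // assumed id for the test document
        params.put("uprefix", "attr_");   // assumed prefix for fields not in the schema
        params.put("commit", "true");
        // Assumed host, port, and handler path of the example server.
        System.out.println(buildUrl("http://localhost:8983/solr/update/extract", params));
    }
}
```

The file (for example simple.html) would then be posted to the printed URL, 
and the schema browser would show which fields the document ended up in.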

-Hoss


Re: Solr 1.4.1 and Tika 0.9 - some tests not passing

2011-03-28 Thread Andreas Kemkes
I'm still interested in what steps I could take to get to the bottom of the 
failing tests.  Is there additional information that I should provide?

Some of the output below got mangled in the email - here are the (hopefully) 
complete lines:

This has a http://www.apache.org">link</a>. (Tika 0.9)
This has a <a href="http://www.apache.org">link</a>. (Tika 0.4)





Solr 1.4.1 and Tika 0.9 - some tests not passing

2011-03-22 Thread Andreas Kemkes
Due to some PDF indexing issues with the Solr 1.4.1 distribution, we would like 
to upgrade it to Tika 0.9, where these issues do not occur.

With the changes we made to Solr 1.4.1, we can successfully index the 
previously failing PDF documents.

Unfortunately we cannot get the HTML-related tests to pass.

The following asserts in ExtractingRequestHandlerTest.java are failing:

assertQ(req("title:Welcome"), "//*[@numFound='1']");
assertQ(req("+id:simple2 +t_href:[* TO *]"), "//*[@numFound='1']");
assertQ(req("t_href:http"), "//*[@numFound='2']");
assertQ(req("t_href:http"), "//doc[1]/str[.='simple3']");
assertQ(req("+id:simple4 +t_content:Solr"), "//*[@numFound='1']");
assertQ(req("defaultExtr:http\\://www.apache.org"), "//*[@numFound='1']");
assertQ(req("+id:simple2 +t_href:[* TO *]"), "//*[@numFound='1']");
assertTrue(val + " is not equal to " + "linkNews", val.equals("linkNews") == true);//there are two  tags, and they get collapsed
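
As a rough illustration of what these checks amount to, the sketch below 
evaluates the same kind of XPath expression against a response document, 
assuming assertQ runs the query and requires the XPath to match the XML 
response. The sample XML is hand-written for illustration and was not 
captured from a Solr run; the class name is hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Illustrative only: checks an XPath against a hand-written response.
class XPathCheckSketch {
    // Hypothetical response; not the output of an actual query.
    static final String SAMPLE =
        "<response><result name=\"response\" numFound=\"1\" start=\"0\">"
        + "<doc><str name=\"id\">simple2</str></doc></result></response>";

    // True when the XPath selects at least one node in the XML.
    static boolean matches(String xml, String xpath) {
        XPath xp = XPathFactory.newInstance().newXPath();
        try {
            return (Boolean) xp.evaluate(
                xpath, new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
        } catch (XPathExpressionException ex) {
            throw new IllegalArgumentException("bad XPath or XML", ex);
        }
    }

    public static void main(String[] args) {
        // One hit: the assertion would hold.
        System.out.println(matches(SAMPLE, "//*[@numFound='1']"));
        // Zero hits, as when nothing was extracted: the assertion would fail.
        String empty = SAMPLE.replace("numFound=\"1\"", "numFound=\"0\"");
        System.out.println(matches(empty, "//*[@numFound='1']"));
    }
}
```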

Below is the output from Tika 0.4 and Tika 0.9 for simple.html.

Tika 0.9 has additional meta tags, a shape attribute, and some additional 
whitespace.  Is this what throws it off?  

What do we need to consider so that Solr 1.4.1 will process the Tika 0.9 output 
correctly?

Do we need to configure different filters and tokenizers?  Which ones?

Or is it something else entirely?

Thanks in advance for any help,

Andreas

$ java -jar tika-app-0.4.jar ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html



Welcome to Solr



  Here is some text


Here is some text in a div
This has a link'>http://www.apache.org";>link.





$ java -jar tika-app-0.9.jar ../../../apache-solr-1.4.1-with-tika-0.9/contrib/extraction/src/test/resources/simple.html
 







Welcome to Solr



  Here is some text


Here is some text in a div

This has a link'>http://www.apache.org";>link.