My setup is:

Nutch 2.2.1
Solr 4.4
HBase 0.90.6

I am trying to enable boilerpipe support and have done the following:

1) https://issues.apache.org/jira/browse/NUTCH-961

I applied the patch and manually edited TikaParser.java where the patch
did not apply cleanly. Now, when I run "ant runtime" to compile Nutch
again, I get the following error:

    [javac] /srv/apache-nutch-2.2.1/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java:176: error: no suitable constructor found for ByteArrayInputStream(ByteBuffer)
    [javac]     parser.parse(new ByteArrayInputStream(raw), (ContentHandler)domHandler, tikamd, context);
    [javac]                  ^
    [javac]     constructor ByteArrayInputStream.ByteArrayInputStream(byte[],int,int) is not applicable
    [javac]       (actual and formal argument lists differ in length)
    [javac]     constructor ByteArrayInputStream.ByteArrayInputStream(byte[]) is not applicable
    [javac]       (actual argument ByteBuffer cannot be converted to byte[] by method invocation conversion)

I looked at TikaParser.java again, at the line that produces the error:

    parser.parse(new ByteArrayInputStream(raw), domHandler, tikamd, context);

and I tried replacing (raw) with several alternatives, e.g.:

    raw.array(), raw.arrayOffset() + raw.position(), raw.remaining()

but still no luck. Although it compiles fine with the edit above, when
I run Nutch my content is not boilerpiped :(
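
To show the conversion in isolation, here is a minimal standalone sketch of
the replacement I tried. The class name ByteBufferStreams and the helper
toInputStream are hypothetical names for this example only; they are not part
of Nutch or of the NUTCH-961 patch. The fallback copy for buffers without an
accessible backing array is an assumption I added for completeness. This only
covers the compile error, not why boilerpipe is not applied at parse time.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferStreams {

    /**
     * Hypothetical helper: exposes the readable bytes of a ByteBuffer
     * (from position() to limit()) as a ByteArrayInputStream.
     */
    public static ByteArrayInputStream toInputStream(ByteBuffer raw) {
        if (raw.hasArray()) {
            // Heap buffer: reuse the backing array without copying.
            return new ByteArrayInputStream(
                raw.array(),
                raw.arrayOffset() + raw.position(),
                raw.remaining());
        }
        // Direct or read-only buffer: copy the remaining bytes.
        // duplicate() leaves the position of the original buffer untouched.
        byte[] copy = new byte[raw.remaining()];
        raw.duplicate().get(copy);
        return new ByteArrayInputStream(copy);
    }

    public static void main(String[] args) {
        byte[] page = "<html><body>hello</body></html>"
            .getBytes(StandardCharsets.UTF_8);
        ByteBuffer raw = ByteBuffer.wrap(page);
        raw.position(6); // skip the leading "<html>"

        ByteArrayInputStream in = toInputStream(raw);
        byte[] out = new byte[raw.remaining()];
        int n = in.read(out, 0, out.length);
        System.out.println(new String(out, 0, n, StandardCharsets.UTF_8));
    }
}
```

In TikaParser.java the equivalent call would pass toInputStream(raw) as the
first argument of parser.parse(...), which is the same three-argument
constructor I used inline above.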

I have also changed parse-plugins.xml like this:

    <mimeType name="*">
      <plugin id="parse-tika" />
    </mimeType>
    <mimeType name="text/html">
      <plugin id="parse-tika" />
    </mimeType>
    <mimeType name="application/xhtml+xml">
      <plugin id="parse-tika" />
    </mimeType>

and edited nutch-site.xml:

    <property>
      <name>tika.use_boilerpipe</name>
      <value>true</value>
    </property>
    <property>
      <name>tika.boilerpipe.extractor</name>
      <value>ArticleExtractor</value>
    </property>

Any help or advice would be welcome. I have been working on this for
10+ hours and have studied every tutorial and hint I could find on the
mailing list and the web. I am quite sure the problem is in
TikaParser.java, at this line:

    parser.parse(new ByteArrayInputStream(raw), domHandler, tikamd, context);

Thank you!












