Hi Robert,

why not switch on Boilerpipe in parse-tika? These are the relevant properties, shown with their default values:

<property>
  <name>tika.extractor</name>
  <value>none</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
  ArticleExtractor or CanolaExtractor.
  </description>
</property>
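
To actually enable Boilerpipe, tika.extractor has to be changed from none to boilerpipe. A minimal sketch, assuming the override goes into conf/nutch-site.xml and that the default ArticleExtractor algorithm is acceptable:

<property>
  <name>tika.extractor</name>
  <!-- overrides the default value "none" -->
  <value>boilerpipe</value>
</property>

The second property only needs to be overridden if a different algorithm is wanted.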



Regarding the NoClassDefFoundError:
- the plugin.xml of your plugin must list all required dependency jars,
  as parse-tika does:
   https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/plugin.xml
- see also
   https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/howto_upgrade_tika.txt
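
As a rough sketch, the runtime section of a plugin.xml looks like the fragment below. The plugin jar name "my-plugin.jar" is hypothetical, and the library names are assumptions for Tika 1.17: they must match the jar files that ivy actually places in your plugin's folder, and the real list (see the parse-tika plugin.xml linked above) is much longer.

<runtime>
   <!-- the plugin's own jar (hypothetical name) -->
   <library name="my-plugin.jar">
      <export name="*"/>
   </library>
   <!-- every dependency jar needed at runtime must be listed;
        names and versions here are assumptions, check your build output -->
   <library name="tika-parsers-1.17.jar"/>
   <library name="boilerpipe-1.1.0.jar"/>
</runtime>

A jar that is present in the plugin folder but missing from this list is not on the plugin's classpath, which would explain a NoClassDefFoundError at runtime even though the code compiles.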


Best,
Sebastian


On 06/25/2018 02:33 PM, Robert Scavilla wrote:
> Hello and thank you. I'm working on a plugin and need to use Tika to
> extract the boilerpipe content. The code compiles fine but I'm getting a
> runtime error. The problem is outlined below:
> 
>             Tika tika = new Tika();
>             input = new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8.name()));
>             String mediaType = tika.detect(input);
>             LOG.info("RSS: MediaType= " + mediaType);
> 
>             org.apache.tika.metadata.Metadata md = new org.apache.tika.metadata.Metadata();
>             AutoDetectParser parser = new AutoDetectParser();
>             ParseContext pContext = new ParseContext();
>             BodyContentHandler textHandler = new BodyContentHandler();
> 
> *The detection and creation of new objects work well.*
> 
> The problem is when I try to create a new BoilerpipeContentHandler:
> 
>             parser.parse(input, new BoilerpipeContentHandler(textHandler), md, pContext);
> 
> I get the error: *NoClassDefFoundError: org/apache/tika/parser/html/BoilerpipeContentHandler*
> 
> I tried adding a dependency to ivy.xml similar to parse-tika but got the
> same result:
> 
> <dependency org="org.apache.tika" name="tika-parsers" rev="1.17" conf="*->default">
>   <exclude org="org.apache.tika" name="tika-core" />
>   <exclude org="org.apache.httpcomponents" name="httpclient" />
>   <exclude org="org.apache.httpcomponents" name="httpcore" />
>   <exclude org="org.slf4j" name="slf4j-log4j12" />
>   <exclude org="org.slf4j" name="slf4j-api" />
>   <exclude org="commons-lang" name="commons-lang" />
>   <exclude org="com.google.protobuf" name="protobuf-java" />
> </dependency>
> 
> 
> *Thank you for your help,*
> 
> *...bob*
> 
