Hi Robert,

why not switch on boilerpipe for parse-tika?
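For illustration, a minimal sketch of such an override. It assumes the setting is placed in conf/nutch-site.xml (the file is not named in this thread); the property names and values are taken from the default definitions quoted below.

```xml
<?xml version="1.0"?>
<!-- Assumption: this is a site-level override file such as conf/nutch-site.xml. -->
<configuration>
  <!-- Enable Boilerpipe extraction in parse-tika (the default value is "none"). -->
  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
  </property>
  <!-- Optional: choose one of DefaultExtractor, ArticleExtractor or CanolaExtractor. -->
  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
  </property>
</configuration>
```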
<property>
  <name>tika.extractor</name>
  <value>none</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
  ArticleExtractor or CanolaExtractor.
  </description>
</property>

Regarding the NoClassDefFoundError:
- the plugin.xml must list all required dependencies:
  https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/plugin.xml
- see also
  https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/howto_upgrade_tika.txt

Best,
Sebastian

On 06/25/2018 02:33 PM, Robert Scavilla wrote:
> Hello and thank you. I'm working on a plugin and need to use Tika to
> extract the boilerpipe content. The code compiles fine but I'm getting a
> runtime error. The problem is outlined below:
>
>     Tika tika = new Tika();
>     input = new ByteArrayInputStream(content.getBytes(
>         StandardCharsets.UTF_8.name()));
>     String mediaType = tika.detect(input);
>     LOG.info("RSS: MediaType= " + mediaType);
>
>     org.apache.tika.metadata.Metadata md = new
>         org.apache.tika.metadata.Metadata();
>     AutoDetectParser parser = new AutoDetectParser();
>     ParseContext pContext = new ParseContext();
>     BodyContentHandler textHandler = new BodyContentHandler();
>
> *The detection and creation of new objects works well.*
>
> The problem is when I try to create a new BoilerpipeContentHandler:
>
>     parser.parse(input, new
>         BoilerpipeContentHandler(textHandler), md, pContext);
>
> I get the error: *NoClassDefFoundError:
> org/apache/tika/parser/html/BoilerpipeContentHandler*
>
> I tried adding a dependency to ivy.xml similar to parse-tika but got same
> results:
>
>     <dependency org="org.apache.tika" name="tika-parsers" rev="1.17"
>                 conf="*->default">
>       <exclude org="org.apache.tika" name="tika-core" />
>       <exclude org="org.apache.httpcomponents" name="httpclient" />
>       <exclude org="org.apache.httpcomponents" name="httpcore" />
>       <exclude org="org.slf4j" name="slf4j-log4j12" />
>       <exclude org="org.slf4j" name="slf4j-api" />
>       <exclude org="commons-lang" name="commons-lang" />
>       <exclude org="com.google.protobuf" name="protobuf-java" />
>     </dependency>
>
>
> *Thank you for your help,*
>
> *...bob*
>