Thanks Lewis and Jorge for all the pointers.
They were very helpful, and I feel I am almost there in getting it working.
When I run it in local mode the dictionary loads fine,
but on Hadoop it still fails with an NPE:
java.lang.NullPointerException
at java.io.FilterInputStream.available(FilterInputStream.java:168)
at sun.nio.cs.StreamDecoder.inReady(StreamDecoder.java:362)
at sun.nio.cs.StreamDecoder.implReady(StreamDecoder.java:370)
at sun.nio.cs.StreamDecoder.ready(StreamDecoder.java:184)
at java.io.InputStreamReader.ready(InputStreamReader.java:195)
at java.io.BufferedReader.ready(BufferedReader.java:456)
at org.apache.nutch.parse.html.db.docscience.JarFileProvider.open(JarFileProvider.java:214)
Line where it fails:
BufferedReader br = new BufferedReader(conf.getConfResourceAsReader("data"));
"data" is the name of a directory under the conf folder.
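A minimal sketch of one possible workaround follows. It rests on an assumption not confirmed in this thread: that the NPE arises because "data" names a directory rather than a single file, and that on the cluster the resource is resolved from inside a jar, where a directory entry does not yield a readable stream. The class name DictionaryLoader and the resource name data/terms.txt are hypothetical and not part of Nutch. The sketch uses only the JDK and reads each dictionary file by its own name, with an explicit null check.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): loads dictionary terms from individual classpath files. */
final class DictionaryLoader {

  private DictionaryLoader() {
  }

  /**
   * Reads every named resource line by line and collects the trimmed, non-empty lines.
   * Each name must point to a regular file (for example "data/terms.txt"),
   * never to a directory such as "data".
   */
  static Set<String> load(ClassLoader loader, List<String> resourceNames) throws IOException {
    Set<String> terms = new LinkedHashSet<>();
    for (String name : resourceNames) {
      InputStream in = loader.getResourceAsStream(name);
      if (in == null) {
        // Fail with a clear message instead of an NPE deep inside the reader.
        throw new IOException("Resource not found on classpath: " + name);
      }
      try (BufferedReader br =
          new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
        String line;
        while ((line = br.readLine()) != null) {
          line = line.trim();
          if (!line.isEmpty()) {
            terms.add(line);
          }
        }
      }
    }
    return terms;
  }
}
```

Under the same assumption, the list of file names could itself be kept in a small index file under conf, so that nothing needs to enumerate a directory at runtime.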
best
Dave
On Thu, Jun 29, 2017 at 9:26 AM, lewis john mcgibbney <[email protected]>
wrote:
> Hi Dave,
> Does this need to be done in parsing phase? Parsing is already an IO
> intensive process... could you possibly do it at another phase?
> Right now, the only plugin I can think of which ships with Nutch source,
> and which consults an external resource (not packaged with Nutch) is the
> index-geoip plugin [0]. This works in distributed mode.
> Please also consider looking into the parsefilter-naivebayes [1] which
> loads in a prebuilt model [2] as a resource which is then used for
> the filtering.
> hth
> Lewis
>
> [0] https://github.com/apache/nutch/tree/master/src/plugin/index-geoip
> [1] https://github.com/apache/nutch/tree/master/src/plugin/parsefilter-naivebayes
> [2] https://github.com/apache/nutch/blob/master/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java#L132-L137
>
> On Thu, Jun 29, 2017 at 8:29 AM, <[email protected]>
> wrote:
>
> >
> >
> > From: SJC Multimedia <[email protected]>
> > To: [email protected]
> > Cc:
> > Bcc:
> > Date: Thu, 29 Jun 2017 08:28:54 -0700
> > Subject: Custom Plugin Resources Files
> > I am building a custom plugin in Nutch 2.3.1 on Hadoop/HBase. In the plugin
> > code, I need to pull in a dictionary of files and run some comparisons
> > while parsing the document.
> >
> > Is there a way to include a directory of files through the custom plugin ant
> > build framework that will work in both local and cluster (Hadoop MR) mode?
> >
> > Any pointers will be helpful.
> >
> > Thanks
> > Dave
> >
> >
>
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>