Re: Custom Plugin Resources Files

2017-06-29 Thread SJC Multimedia
Here is how I am getting conf:

Configuration conf = NutchConfiguration.create();

BufferedReader br = new BufferedReader(conf.getConfResourceAsReader("data"));

System.out.println(br.ready());



Re: Custom Plugin Resources Files

2017-06-29 Thread SJC Multimedia
Thanks Lewis and Jorge for all the pointers.

They were very helpful, and I feel I am almost there.

In local mode the dictionary loads and works, but on Hadoop it still fails
with a NullPointerException:

java.lang.NullPointerException
  at java.io.FilterInputStream.available(FilterInputStream.java:168)
  at sun.nio.cs.StreamDecoder.inReady(StreamDecoder.java:362)
  at sun.nio.cs.StreamDecoder.implReady(StreamDecoder.java:370)
  at sun.nio.cs.StreamDecoder.ready(StreamDecoder.java:184)
  at java.io.InputStreamReader.ready(InputStreamReader.java:195)
  at java.io.BufferedReader.ready(BufferedReader.java:456)
  at org.apache.nutch.parse.html.db.docscience.JarFileProvider.open(JarFileProvider.java:214)

The line where it fails:

BufferedReader br = new BufferedReader(conf.getConfResourceAsReader("data"));

Here "data" is the name of a directory under the conf folder.

best
Dave
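
One plausible reading of the trace above, offered as an assumption rather than a confirmed diagnosis: the NPE is raised inside FilterInputStream.available, which suggests the reader wraps a stream with no usable underlying input. That would fit "data" being a directory: on a local filesystem a directory can still be opened, but once the conf folder is packed into the job jar a directory entry is not a readable stream. A minimal sketch of the alternative, reading one named file and checking for a missing resource, using only the JDK. The class name DictionaryLoader and any file names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: loads dictionary terms from a single file, never a directory. */
final class DictionaryLoader {

    /** Reads trimmed, non-empty lines; a null reader (resource not found) yields an empty set. */
    public static Set<String> readTerms(Reader reader) {
        Set<String> terms = new LinkedHashSet<>();
        if (reader == null) {
            return terms; // nothing to read; the caller decides how to react
        }
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    terms.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return terms;
    }

    /** Opens one classpath file by name and reads its terms. */
    public static Set<String> loadFromClasspath(String resourceName) {
        InputStream in = ClassLoader.getSystemResourceAsStream(resourceName);
        Reader reader = (in == null) ? null : new InputStreamReader(in, StandardCharsets.UTF_8);
        return readTerms(reader);
    }
}
```

Inside a Nutch plugin, the same readTerms method could be fed from conf.getConfResourceAsReader with the name of a single file inside the directory (for example a hypothetical "data/terms.txt"), since that call returns null when the resource is not found.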



Re: Custom Plugin Resources Files

2017-06-29 Thread lewis john mcgibbney
Hi Dave,
Does this need to be done in the parsing phase? Parsing is already an
I/O-intensive process... could you possibly do it in another phase?
Right now, the only plugin I can think of which ships with the Nutch source
and which consults an external resource (not packaged with Nutch) is the
index-geoip plugin [0]. This works in distributed mode.
Please also consider looking into parsefilter-naivebayes [1], which
loads a prebuilt model [2] as a resource that is then used for the
filtering.
hth
Lewis

[0] https://github.com/apache/nutch/tree/master/src/plugin/index-geoip
[1]
https://github.com/apache/nutch/tree/master/src/plugin/parsefilter-naivebayes
[2]
https://github.com/apache/nutch/blob/master/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java#L132-L137
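
On the I/O concern raised above: if the lookup does have to stay in the parsing phase, one way to limit the cost is to load the dictionary once and reuse it for every document. The following is only a sketch under that assumption, not how any of the referenced plugins are implemented. CachedDictionary and its matches method are hypothetical names.

```java
import java.util.Collections;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical cache: the dictionary is read on first use, not once per parsed document. */
final class CachedDictionary {
    private final Supplier<Set<String>> loader;
    private volatile Set<String> terms; // stays null until the first call to get()

    public CachedDictionary(Supplier<Set<String>> loader) {
        this.loader = loader;
    }

    /** Returns the dictionary, invoking the loader only on the first call. */
    public Set<String> get() {
        Set<String> local = terms;
        if (local == null) {
            synchronized (this) {
                local = terms;
                if (local == null) {
                    local = Collections.unmodifiableSet(loader.get());
                    terms = local;
                }
            }
        }
        return local;
    }

    /** Illustrative per-document check: does the text contain any dictionary term? */
    public boolean matches(String text) {
        for (String term : get()) {
            if (text.contains(term)) {
                return true;
            }
        }
        return false;
    }
}
```

The loader passed to the constructor is where the actual file reading would happen, so each task pays the I/O cost once, however many documents it parses.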



-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: Custom Plugin Resources Files

2017-06-29 Thread Jorge Betancourt
Sure, no problem.

It's not specifically for Nutch 2.x, but on master you can take a look at the
scoring-similarity [1] plugin. The resource there is just a text file, but it's
used internally by the plugin. Many plugins define their own additional conf
files this way, so it's not uncommon or very different from your use case, I
think.

[1]
https://github.com/apache/nutch/blob/master/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/Model.java#L65-L71




Re: Custom Plugin Resources Files

2017-06-29 Thread SJC Multimedia
Okay, makes sense.

If you don't mind, can you point me to a specific plugin that does something
similar?



Re: Custom Plugin Resources Files

2017-06-29 Thread Jorge Betancourt
Hi Dave,

My advice would be to leave your resources out of the plugins. If there is
a configuration file (or additional files), just load what you need from
the conf directory. If the dictionary files can change, make their location
configurable in nutch-site.xml.

Best Regards,
Jorge

PS: You can take a look at how additional files are loaded in the different
Nutch plugins.
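
To make the advice above concrete, here is a small sketch of turning a configurable, comma-separated property value into a list of file names to load from the conf directory. The property name "myplugin.dictionary.files", the default file name, and the DictionaryConfig class are all hypothetical and not defined by Nutch. In a plugin, the value would be read from the Hadoop Configuration with conf.get(name), and the property itself would be set in nutch-site.xml.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for a plugin-specific "which files to load" setting. */
final class DictionaryConfig {
    /** Assumed property name, to be defined by the plugin author in nutch-site.xml. */
    public static final String FILES_PROPERTY = "myplugin.dictionary.files";
    /** Assumed fallback when the property is unset or blank. */
    public static final String DEFAULT_FILES = "dictionary.txt";

    /** Splits a comma-separated value into trimmed, non-empty file names. */
    public static List<String> parseFileList(String value) {
        String effective = (value == null || value.trim().isEmpty()) ? DEFAULT_FILES : value;
        List<String> names = new ArrayList<>();
        for (String part : effective.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }
}
```

Listing the files explicitly avoids having to enumerate a directory at runtime, which does not work the same way on a local filesystem and inside a job jar.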



Re: Custom Plugin Resources Files

2017-06-29 Thread SJC Multimedia
One thing I have already tried is bundling these resources in the job jar and
loading them from the classpath, but that didn't work. I also tried copying
them to HDFS and loading them from there, but that failed too.

What is the best way to bundle such static resources and reference them in
the custom plugin?


Thanks

Akshar





Custom Plugin Resources Files

2017-06-29 Thread SJC Multimedia
I am building a custom plugin in Nutch 2.3.1 on Hadoop/HBase. In the plugin
code, I need to pull in a dictionary of files and run some comparisons
while parsing the document.

Is there a way to include a directory of files through the custom plugin ant
build framework that will work in both local and cluster (Hadoop MR) mode?

Any pointers will be helpful.

Thanks
Dave