RE: RE: Dependency between plugins

2018-03-15 Thread Yossi Tamari
If you look at the code of the HTML parser, you'll see that the parameter is 
passed the variable "root", the same variable that is passed to the methods 
that extract the outlinks, the title, and the text. So it simply can’t be null. 
It may be an issue with what toString is printing for this element (for example 
it may be printing the name of the root element, and it happens to not have a 
name).
Again, I strongly recommend debugging, so you can see the real value there.
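
To illustrate the point about toString, here is a minimal sketch that uses only the JDK's DOM API (no Nutch classes). It builds a non-empty DocumentFragment and shows that its node value is null even though it has children. As far as I know, the Xerces-derived DOM implementations build toString() from the node name and node value, which would explain a printout ending in "null". The names FragmentDebug, buildFragment and describe are made up for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class FragmentDebug {

    /** Builds a small fragment: html > head > title("Hello"). */
    static DocumentFragment buildFragment() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        DocumentFragment root = doc.createDocumentFragment();
        Element html = doc.createElement("html");
        Element head = doc.createElement("head");
        Element title = doc.createElement("title");
        title.setTextContent("Hello");
        head.appendChild(title);
        html.appendChild(head);
        root.appendChild(html);
        return root;
    }

    /** Describes a node by name, value and child count instead of relying on toString(). */
    static String describe(Node node) {
        NodeList children = node.getChildNodes();
        return node.getNodeName() + " value=" + node.getNodeValue()
                + " children=" + children.getLength();
    }

    public static void main(String[] args) throws Exception {
        DocumentFragment root = buildFragment();
        System.out.println(root);                   // implementation-dependent text
        System.out.println(describe(root));         // name, value and child count
        System.out.println(root.getTextContent());  // text of all descendants
    }
}
```

Printing the child count or the text content, rather than the fragment itself, is a quick way to check whether the fragment is really empty.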


RE: RE: Dependency between plugins

2018-03-15 Thread Yash Thenuan Thenuan
Yes, I am using the HTML parser, and yes, the document is getting parsed, but
the document fragment prints as null.


RE: RE: Dependency between plugins

2018-03-15 Thread Yossi Tamari
Is your parser the HTML parser? I can say from experience that the document is 
passed.
I really recommend debugging in local mode rather than using sysout.


RE: RE: Dependency between plugins

2018-03-15 Thread Yash Thenuan Thenuan
I tried printing the contents of the document fragment in parsefilter-regex by
writing System.out.println(doc), but it prints null, even though the document
is getting parsed!


RE: RE: Dependency between plugins

2018-03-15 Thread Yossi Tamari
Parse filters receive a DocumentFragment as their fourth parameter.
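
As a rough sketch of what a filter could do with that fragment, the example below walks a DOM tree and collects the text of every element with a given name. It uses only the JDK's DOM API so that it runs on its own; in Nutch 1.x the fragment would arrive as the fourth argument of HtmlParseFilter.filter(Content, ParseResult, HTMLMetaTags, DocumentFragment). The names DomExtract, textOf and sampleFragment are hypothetical, and the upper-case element names in the sample are an assumption about how the HTML parser names elements, which is why the match is case-insensitive.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class DomExtract {

    /** Collects the trimmed text of every element named tagName (case-insensitive) under root. */
    static List<String> textOf(Node root, String tagName) {
        List<String> out = new ArrayList<>();
        collect(root, tagName, out);
        return out;
    }

    /** Depth-first walk over the node and all of its descendants. */
    private static void collect(Node node, String tagName, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(tagName)) {
            out.add(node.getTextContent().trim());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, tagName, out);
        }
    }

    /** Builds a stand-in fragment; in a real filter the fragment comes from the parser. */
    static DocumentFragment sampleFragment() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        DocumentFragment root = doc.createDocumentFragment();
        Element body = doc.createElement("BODY");
        for (String text : new String[] {"First heading", "Second heading"}) {
            Element h2 = doc.createElement("H2");
            h2.setTextContent(text);
            body.appendChild(h2);
        }
        root.appendChild(body);
        return root;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textOf(sampleFragment(), "h2"));
    }
}
```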


Re: RE: Dependency between plugins

2018-03-15 Thread Yash Thenuan Thenuan
Hi Jorge and Yossi,
The reason I am trying to do it is exactly what Yossi said, "removing
nutch overhead". I didn't think it would be that complicated. All I am
trying to do is call the existing parsers from my own parser, but I am not
able to do it correctly. Maybe the chain approach is a better idea, but
*does a parse filter receive any DOM object* as a parameter, so that by
accessing it I can extract the data I want?



RE: RE: Dependency between plugins

2018-03-14 Thread Yossi Tamari
There is no built-in mechanism for this. However, are you sure you really want 
a parser for each website, rather than a parse-filter for each website (which 
will take the results of the HTML parser and apply some domain-specific 
customizations)?
In both cases you can use a dispatcher approach, which your custom parser is, 
or a chain approach (every parser that is not intended for this domain returns 
null, or each parse-filter that is not intended for this domain returns the 
ParseResult that it received).
The advantage of the chain approach is that each new website parser is a 
first-class, reusable Nutch object. The advantage of the dispatcher approach is 
that you don't need to deal with a lot of the Nutch overhead, but it is more 
monolithic (You can end up with one huge plugin that needs to be constantly 
modified whenever one of the websites is modified). 
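
To make the chain approach for parse-filters concrete, here is a minimal sketch with stand-in types. SiteFilter, forHost and runChain are hypothetical names, and a plain Map stands in for the ParseResult; a real Nutch filter would implement the parse-filter extension point instead. The point is only that each filter returns what it received when the URL is not on its domain.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ChainSketch {

    /** Hypothetical stand-in for a parse filter; not a Nutch interface. */
    interface SiteFilter {
        Map<String, String> filter(String url, Map<String, String> result);
    }

    /** A filter that only acts on one host and otherwise returns what it received. */
    static SiteFilter forHost(String host, String key, String value) {
        return (url, result) -> {
            if (!host.equalsIgnoreCase(URI.create(url).getHost())) {
                return result; // not our domain: pass the result through unchanged
            }
            Map<String, String> enriched = new LinkedHashMap<>(result);
            enriched.put(key, value); // domain-specific customization
            return enriched;
        };
    }

    /** Runs every filter in order, feeding each one the previous filter's output. */
    static Map<String, String> runChain(List<SiteFilter> chain, String url,
                                        Map<String, String> parsed) {
        Map<String, String> current = parsed;
        for (SiteFilter f : chain) {
            current = f.filter(url, current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<SiteFilter> chain = List.of(
                forHost("example.com", "site", "example"),
                forHost("example.org", "site", "other"));
        System.out.println(runChain(chain, "http://example.org/page", Map.of("title", "Hi")));
    }
}
```

Adding a website then means adding one more small filter to the chain, rather than editing a single dispatcher.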


Re: RE: Dependency between plugins

2018-03-14 Thread Jorge Betancourt
Is there any reason why writing a `HtmlParseFilter` would not be enough?
The HTML parser will execute its own logic and provide a DOM representation
to all the filters and you can extract your own data from the DOM tree.

At the moment individual parsers are matched by mimetype (see
https://github.com/apache/nutch/blob/master/conf/parse-plugins.xml).

Regards,

On Wed, Mar 14, 2018 at 2:27 PM, Yash Thenuan Thenuan <
rit2014...@iiita.ac.in> wrote:

> Is there a way in Nutch to use a different parser for different
> websites?
> I am trying to do this by writing a custom parser which will call
> different parsers for different websites.
>
> On 14 Mar 2018 14:19, "Semyon Semyonov"  wrote:
>
> > As a side note,
> >
> > I had to implement my own parser with extra functionality; a simple
> > copy/paste of the HtmlParser code did the job.
> >
> > Inheriting instead of copy/pasting can be a bad idea. HtmlParser is a
> > concrete, non-abstract class, so inheritance will not be as smooth as
> > with contract implementations (the plugins are contracts, i.e.
> > interfaces) and can easily break some OOP rules.
> >
> >
> > Sent: Wednesday, March 14, 2018 at 9:18 AM
> > From: "Yossi Tamari" 
> > To: user@nutch.apache.org
> > Subject: RE: Dependency between plugins
> > One suggestion I can make is to ensure that the html-parse plugin is
> > built before your plugin (since you are including the jars that are
> > generated in its build).
> >
> > > -Original Message-
> > > From: Yash Thenuan Thenuan 
> > > Sent: 14 March 2018 09:55
> > > To: user@nutch.apache.org
> > > Subject: Re: Dependency between plugins
> > >
> > > Hi,
> > > It didn't work in ant runtime.
> > > I included "import org.apache.nutch.parse.html;" in my custom parser
> > > code, but it throws an error when I run ant runtime.
> > >
> > > [javac] /Users/yasht/Downloads/apache-nutch-1.14/src/plugin/parse-custom/src/java/org/apache/nutch/parse/custom/CustomParser.java:41: error: cannot find symbol
> > > [javac] import org.apache.nutch.parse.html;
> > > [javac] ^
> > > [javac] symbol: class html
> > > [javac] location: package org.apache.nutch.parse
> > >
> > >
> > > below are the xml files of my parser
> > >
> > >
> > > My ivy.xml
> > >
> > > [XML markup stripped by the list archive; surviving fragments:]
> > > http://nutch.apache.org"/>
> > > Apache Nutch
> > >
> > > build.xml
> > >
> > > [XML markup stripped by the list archive]
> > >
> > > plugin.xml
> > >
> > > [XML markup partially stripped by the list archive; surviving attributes:]
> > > id="parse-custom"
> > > name="Custom Parse Plug-in"
> > > version="1.0.0"
> > > provider-name="nutch.org">
> > > name="CustomParse"
> > > point="org.apache.nutch.parse.Parser">
> > > class="org.apache.nutch.parse.custom.CustomParser">
> > > value="text/html|application/xhtml+xml"/>
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Mar 14, 2018 at 1:02 PM, Yossi Tamari 
> > > wrote:
> > >
> > > > Hi Yash,
> > > >
> > > > I don't know how to do it. I never tried, but if I had to, it would
> > > > be a trial-and-error thing.
> > > >
> > > > If you want to increase the chances that someone will answer your
> > > > question, I suggest you provide as much information as possible:
> > > > Where did it not work? In "ant runtime", or when running in Hadoop?
> > > > What was the error message?
> > > > What is the content of your build.xml, plugin.xml, and ivy.xml?
> > > > Is parse-html configured in your plugin-includes?
> > > >
> > > > If it's a problem during execution, I would suggest looking at or
> > > > debugging the code of PluginClassLoader.
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Yash Thenuan Thenuan 
> > > > > Sent: 14 March 2018 08:34
> > > > > To: user@nutch.apache.org
> > > > > Subject: Re: Dependency between plugins
> > > > >
> > > > > Anybody please help me out regarding this.
> > > > >
> > > > > On Tue, Mar 13, 2018 at 6:51 PM, Yash Thenuan Thenuan <
> > > > > rit2014...@iiita.ac.in> wrote:
> > > > >
> > > > > > > I am trying to import HtmlParser in my custom parser.
> > > > > > > I did it the same way HtmlParser imports lib-nekohtml,
> > > > > > > but it didn't work.
> > > > > > > Can anybody please tell me how to do it?
> > > > > >
> > > >
> > > >
> >
> >
>


Re: RE: Dependency between plugins

2018-03-14 Thread Yash Thenuan Thenuan
Is there a way in Nutch to use a different parser for different
websites?
I am trying to do this by writing a custom parser that calls a different
parser depending on the website.
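
Roughly what I have in mind, as a plain-Java sketch. HostDispatch and the
Function-based "parsers" are hypothetical stand-ins, not Nutch classes; in a
real plugin the map values would be whatever parser objects are available.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class HostDispatch {
    // Hypothetical registry: host name -> parser. A "parser" here is just a
    // function from raw content to extracted text, not a Nutch Parser.
    private final Map<String, Function<String, String>> byHost = new HashMap<>();
    private final Function<String, String> fallback;

    public HostDispatch(Function<String, String> fallback) {
        this.fallback = fallback;
    }

    public void register(String host, Function<String, String> parser) {
        byHost.put(host.toLowerCase(Locale.ROOT), parser);
    }

    // Picks the parser registered for the URL's host, or the fallback if
    // the host is unknown or the URL has no host part.
    public String parse(String url, String content) {
        String host = URI.create(url).getHost();
        Function<String, String> parser = (host == null)
                ? fallback
                : byHost.getOrDefault(host.toLowerCase(Locale.ROOT), fallback);
        return parser.apply(content);
    }

    public static void main(String[] args) {
        HostDispatch dispatch = new HostDispatch(c -> "default:" + c);
        dispatch.register("example.com", c -> "custom:" + c);
        System.out.println(dispatch.parse("http://example.com/a", "x"));
        System.out.println(dispatch.parse("http://example.org/a", "x"));
    }
}
```

The first call goes to the parser registered for example.com, the second
falls through to the default one.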

On 14 Mar 2018 14:19, "Semyon Semyonov"  wrote:

> As a side note,
>
> I had to implement my own parser with extra functionality; a simple
> copy/paste of the HtmlParser code did the job.
>
> Inheriting instead of copy-pasting may be a bad idea. HtmlParser is a
> concrete, non-abstract class, so inheritance will not be as smooth as
> with contract implementations (the plugins are contracts, i.e.
> interfaces) and can easily break some OOP rules.
>
>
> Sent: Wednesday, March 14, 2018 at 9:18 AM
> From: "Yossi Tamari" 
> To: user@nutch.apache.org
> Subject: RE: Dependency between plugins
> One suggestion I can make is to ensure that the parse-html plugin is built
> before your plugin (since you are including the jars that are generated in
> its build).
>
> > -----Original Message-----
> > From: Yash Thenuan Thenuan 
> > Sent: 14 March 2018 09:55
> > To: user@nutch.apache.org
> > Subject: Re: Dependency between plugins
> >
> > Hi,
> > It didn't work in ant runtime.
> > I included "import org.apache.nutch.parse.html;" in my custom parser code,
> > but it throws an error when I run ant runtime.
> >
> > [javac]
> > /Users/yasht/Downloads/apache-nutch-1.14/src/plugin/parse-
> > custom/src/java/org/apache/nutch/parse/custom/CustomParser.java:41:
> > error: cannot find symbol
> >
> > [javac] import org.apache.nutch.parse.html;
> >
> > [javac] ^
> >
> > [javac] symbol: class html
> >
> > [javac] location: package org.apache.nutch.parse
> >
> >
> > Below are the XML files of my parser.
> >
> > My ivy.xml
> >
> > http://nutch.apache.org"/>
> >
> > Apache Nutch
> >
> > build.xml
> >
> > plugin.xml
> >
> > id="parse-custom"
> > name="Custom Parse Plug-in"
> > version="1.0.0"
> > provider-name="nutch.org">
> >
> > name="CustomParse"
> > point="org.apache.nutch.parse.Parser">
> >
> > class="org.apache.nutch.parse.custom.CustomParser">
> > value="text/html|application/xhtml+xml"/>
> >
> > On Wed, Mar 14, 2018 at 1:02 PM, Yossi Tamari 
> > wrote:
> >
> > > Hi Yash,
> > >
> > > I don't know how to do it, I never tried, but if I had to it would be
> > > a trial and error thing
> > >
> > > If you want to increase the chances that someone will answer your
> > > question, I suggest you provide as much information as possible:
> > > Where did it not work? In "ant runtime", or when running in Hadoop?
> > > What was the error message?
> > > What is the content of your build.xml, plugin.xml, and ivy.xml?
> > > Is parse-html configured in your plugin-includes?
> > >
> > > If it's a problem during execution, I would suggest looking at or
> > > debugging the code of PluginClassLoader.
> > >
> > >
> > > > -----Original Message-----
> > > > From: Yash Thenuan Thenuan 
> > > > Sent: 14 March 2018 08:34
> > > > To: user@nutch.apache.org
> > > > Subject: Re: Dependency between plugins
> > > >
> > > > Anybody please help me out regarding this.
> > > >
> > > > On Tue, Mar 13, 2018 at 6:51 PM, Yash Thenuan Thenuan <
> > > > rit2014...@iiita.ac.in> wrote:
> > > >
> > > > > > I am trying to import HtmlParser in my custom parser.
> > > > > > I did it the same way HtmlParser imports lib-nekohtml,
> > > > > > but it didn't work.
> > > > > > Can anybody please tell me how to do it?
> > > > >
> > >
> > >
>
>


Re: RE: Dependency between plugins

2018-03-14 Thread Semyon Semyonov
As a side note,

I had to implement my own parser with extra functionality; a simple copy/paste
of the HtmlParser code did the job.

Inheriting instead of copy-pasting may be a bad idea. HtmlParser is a concrete,
non-abstract class, so inheritance will not be as smooth as with contract
implementations (the plugins are contracts, i.e. interfaces) and can easily
break some OOP rules.
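
If the goal is to reuse the behaviour without inheriting, wrapping the
concrete class behind the interface is one option. A minimal sketch with
made-up names: PageParser, BasicHtmlParser and UpperCasingParser are
hypothetical stand-ins, not Nutch classes, and the tag stripping is only
there to give the wrapper something to delegate to.

```java
import java.util.Locale;

public class DelegationSketch {
    /** Hypothetical stand-in for a plugin contract; not the Nutch Parser interface. */
    interface PageParser {
        String parse(String content);
    }

    /** Hypothetical stand-in for a concrete parser such as the HTML parser. */
    static final class BasicHtmlParser implements PageParser {
        @Override
        public String parse(String content) {
            // Crude tag stripping, just to have some behaviour to reuse.
            return content.replaceAll("<[^>]*>", "").trim();
        }
    }

    /** Adds functionality by wrapping the concrete parser instead of extending it. */
    static final class UpperCasingParser implements PageParser {
        private final PageParser delegate;

        UpperCasingParser(PageParser delegate) {
            this.delegate = delegate;
        }

        @Override
        public String parse(String content) {
            // Reuse the wrapped parser, then apply the extra step on top.
            return delegate.parse(content).toUpperCase(Locale.ROOT);
        }
    }

    public static void main(String[] args) {
        PageParser parser = new UpperCasingParser(new BasicHtmlParser());
        System.out.println(parser.parse("<p>hello</p>"));
    }
}
```

The wrapper depends only on the interface, so it does not care how the
wrapped parser is implemented internally.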
 

Sent: Wednesday, March 14, 2018 at 9:18 AM
From: "Yossi Tamari" 
To: user@nutch.apache.org
Subject: RE: Dependency between plugins
One suggestion I can make is to ensure that the parse-html plugin is built 
before your plugin (since you are including the jars that are generated in its 
build).

> -----Original Message-----
> From: Yash Thenuan Thenuan 
> Sent: 14 March 2018 09:55
> To: user@nutch.apache.org
> Subject: Re: Dependency between plugins
>
> Hi,
> It didn't work in ant runtime.
> I included "import org.apache.nutch.parse.html;" in my custom parser code,
> but it throws an error when I run ant runtime.
>
> [javac]
> /Users/yasht/Downloads/apache-nutch-1.14/src/plugin/parse-
> custom/src/java/org/apache/nutch/parse/custom/CustomParser.java:41:
> error: cannot find symbol
>
> [javac] import org.apache.nutch.parse.html;
>
> [javac] ^
>
> [javac] symbol: class html
>
> [javac] location: package org.apache.nutch.parse
>
>
> Below are the XML files of my parser.
>
> My ivy.xml
>
> http://nutch.apache.org"/>
>
> Apache Nutch
>
> build.xml
>
> plugin.xml
>
> id="parse-custom"
> name="Custom Parse Plug-in"
> version="1.0.0"
> provider-name="nutch.org">
>
> name="CustomParse"
> point="org.apache.nutch.parse.Parser">
>
> class="org.apache.nutch.parse.custom.CustomParser">
> value="text/html|application/xhtml+xml"/>
>
> On Wed, Mar 14, 2018 at 1:02 PM, Yossi Tamari 
> wrote:
>
> > Hi Yash,
> >
> > I don't know how to do it, I never tried, but if I had to it would be
> > a trial and error thing
> >
> > If you want to increase the chances that someone will answer your
> > question, I suggest you provide as much information as possible:
> > Where did it not work? In "ant runtime", or when running in Hadoop?
> > What was the error message?
> > What is the content of your build.xml, plugin.xml, and ivy.xml?
> > Is parse-html configured in your plugin-includes?
> >
> > If it's a problem during execution, I would suggest looking at or
> > debugging the code of PluginClassLoader.
> >
> >
> > > -----Original Message-----
> > > From: Yash Thenuan Thenuan 
> > > Sent: 14 March 2018 08:34
> > > To: user@nutch.apache.org
> > > Subject: Re: Dependency between plugins
> > >
> > > Anybody please help me out regarding this.
> > >
> > > On Tue, Mar 13, 2018 at 6:51 PM, Yash Thenuan Thenuan <
> > > rit2014...@iiita.ac.in> wrote:
> > >
> > > > > I am trying to import HtmlParser in my custom parser.
> > > > > I did it the same way HtmlParser imports lib-nekohtml,
> > > > > but it didn't work.
> > > > > Can anybody please tell me how to do it?
> > > >
> >
> >