If you look at the code of the HTML parser, you'll see that the value passed
for that parameter is the variable "root", the same variable that is passed to
the methods that extract the outlinks, the title, and the text. So it simply
can't be null. The problem may be in what toString prints for this object (for
example, it may be printing the name of the root element, which happens not to
have one).
Again, I strongly recommend debugging, so you can see the real value there.
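
To illustrate the toString point, here is a minimal sketch using only the JDK's built-in DOM classes, not Nutch itself. The class name FragmentDump and the helpers sampleFragment and describe are made up for this example, and the fragment is built by hand rather than taken from a real parse. With the JDK's bundled DOM implementation, printing the fragment directly may show something like "[#document-fragment: null]", where the "null" is the node's value and not the object; the exact text is implementation-specific. An explicit null check plus a walk over the child nodes is more informative:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class FragmentDump {

    // Hypothetical stand-in for the fragment a parser would hand to a filter.
    static DocumentFragment sampleFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            DocumentFragment root = doc.createDocumentFragment();
            Element html = doc.createElement("html");
            Element title = doc.createElement("title");
            title.setTextContent("Example page");
            html.appendChild(title);
            root.appendChild(html);
            return root;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Render the node tree as indented lines: one node name per line,
    // with the text shown for text nodes.
    static String describe(Node node, int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(node.getNodeName());
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(" \"").append(node.getNodeValue()).append('"');
        }
        sb.append('\n');
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            sb.append(describe(children.item(i), depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        DocumentFragment root = sampleFragment();
        // What System.out.println(doc) shows; the format depends on the DOM implementation.
        System.out.println("toString(): " + root);
        // The check that actually answers "is it null?".
        System.out.println("is null:    " + (root == null));
        // The real contents of the fragment.
        System.out.print(describe(root, 0));
    }
}
```

Inside a parse filter, the same describe(doc, 0) call on the DocumentFragment parameter would show whether the tree is really empty, although stepping through in a debugger in local mode remains the more reliable route.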

> -----Original Message-----
> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> Sent: 15 March 2018 10:26
> To: user@nutch.apache.org
> Subject: RE: RE: Dependency between plugins
> 
> Yes, I am using the HTML parser, and yes, the document is getting parsed, but
> the document fragment is printing null.
> 
> On 15 Mar 2018 13:52, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> 
> > Is your parser the HTML parser? I can say from experience that the
> > document is passed.
> > I really recommend debugging in local mode rather than using sysout.
> >
> > > -----Original Message-----
> > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > Sent: 15 March 2018 10:13
> > > To: user@nutch.apache.org
> > > Subject: RE: RE: Dependency between plugins
> > >
> > > > I tried printing the contents of the document fragment in
> > > > parsefilter-regex by writing System.out.println(doc), but it's
> > > > printing null, even though the document is getting parsed!
> > >
> > > On 15 Mar 2018 13:15, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> > >
> > > > Parse filters receive a DocumentFragment as their fourth parameter.
> > > >
> > > > > -----Original Message-----
> > > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > > Sent: 15 March 2018 08:50
> > > > > To: user@nutch.apache.org
> > > > > Subject: Re: RE: Dependency between plugins
> > > > >
> > > > > > Hi Jorge and Yossi,
> > > > > > The reason I am trying to do it is exactly what Yossi said,
> > > > > > "removing Nutch overhead". I didn't think it would be that
> > > > > > complicated. All I am trying to do is call the existing parsers
> > > > > > from my own parser, but I am not able to do it correctly. Maybe
> > > > > > the chain approach is a better way to do that, but *does a parse
> > > > > > filter receive any DOM object* as a parameter, so that by
> > > > > > accessing it I can extract the data I want?
> > > > >
> > > > >
> > > > > On Wed, Mar 14, 2018 at 7:36 PM, Yossi Tamari
> > > > > <yossi.tam...@pipl.com>
> > > > > wrote:
> > > > >
> > > > > > There is no built-in mechanism for this. However, are you sure
> > > > > > you really want a parser for each website, rather than a
> > > > > > parse-filter for each website (which will take the results of
> > > > > > the HTML parser and apply some domain specific customizations)?
> > > > > > In both cases you can use a dispatcher approach, which your
> > > > > > custom parser is, or a chain approach (every parser that is
> > > > > > not intended for this domain returns null, or each
> > > > > > > parse-filter that is not intended for this domain returns the
> > > > > > > ParseResult that it received).
> > > > > > The advantage of the chain approach is that each new website
> > > > > > parser is a first-class, reusable Nutch object. The advantage
> > > > > > of the dispatcher approach is that you don't need to deal with
> > > > > > a lot of the Nutch overhead, but it is more monolithic (You
> > > > > > can end up with one huge plugin that needs to be constantly
> > > > > > > modified whenever one of the websites is modified).
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > > > > Sent: 14 March 2018 15:28
> > > > > > > To: user@nutch.apache.org
> > > > > > > Subject: Re: RE: Dependency between plugins
> > > > > > >
> > > > > > > > Is there a way in Nutch to use different parsers for
> > > > > > > > different websites?
> > > > > > > > I am trying to do this by writing a custom parser which will
> > > > > > > > call different parsers for different websites.
> > > > > > >
> > > > > > > On 14 Mar 2018 14:19, "Semyon Semyonov"
> > > > > <semyon.semyo...@mail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > As a side note,
> > > > > > > >
> > > > > > > > > I had to implement my own parser with extra functionality;
> > > > > > > > > a simple copy/paste of the HTML parser code did the job.
> > > > > > > > >
> > > > > > > > > Inheriting instead of copy/pasting may be a bad idea
> > > > > > > > > altogether. The HTML parser is a concrete, non-abstract
> > > > > > > > > class, so inheritance will not be as smooth as with
> > > > > > > > > contract implementations (the plugins are contracts, i.e.
> > > > > > > > > interfaces) and can easily break some OOP rules.
> > > > > > > >
> > > > > > > >
> > > > > > > > Sent: Wednesday, March 14, 2018 at 9:18 AM
> > > > > > > > From: "Yossi Tamari" <yossi.tam...@pipl.com>
> > > > > > > > To: user@nutch.apache.org
> > > > > > > > > Subject: RE: Dependency between plugins
> > > > > > > > >
> > > > > > > > > One suggestion I can make is to ensure that the parse-html
> > > > > > > > > plugin is built before your plugin (since you are including
> > > > > > > > > the jars that are generated in its build).
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > > > > > > Sent: 14 March 2018 09:55
> > > > > > > > > To: user@nutch.apache.org
> > > > > > > > > Subject: Re: Dependency between plugins
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > > It didn't work in ant runtime.
> > > > > > > > > > I included "import org.apache.nutch.parse.html;" in my
> > > > > > > > > > custom parser code,
> > > > > > > > > > but it throws an error when I run ant runtime.
> > > > > > > > >
> > > > > > > > > > [javac] /Users/yasht/Downloads/apache-nutch-1.14/src/plugin/parse-custom/src/java/org/apache/nutch/parse/custom/CustomParser.java:41:
> > > > > > > > > > error: cannot find symbol
> > > > > > > > >
> > > > > > > > > [javac] import org.apache.nutch.parse.html;
> > > > > > > > >
> > > > > > > > > [javac] ^
> > > > > > > > >
> > > > > > > > > [javac] symbol: class html
> > > > > > > > >
> > > > > > > > > [javac] location: package org.apache.nutch.parse
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > below are the xml files of my parser
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > My ivy.xml
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > <ivy-module version="1.0">
> > > > > > > > >
> > > > > > > > > <info organisation="org.apache.nutch"
> > > > > > > > > module="${ant.project.name}">
> > > > > > > > >
> > > > > > > > > <license name="Apache 2.0"/>
> > > > > > > > >
> > > > > > > > > <ivyauthor name="Apache Nutch Team"
> > > > > > > > > url="http://nutch.apache.org"/>
> > > > > > > > >
> > > > > > > > > <description>
> > > > > > > > >
> > > > > > > > > Apache Nutch
> > > > > > > > >
> > > > > > > > > </description>
> > > > > > > > >
> > > > > > > > > </info>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > <configurations>
> > > > > > > > >
> > > > > > > > > <include file="../../../ivy/ivy-configurations.xml"/>
> > > > > > > > >
> > > > > > > > > </configurations>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > <publications>
> > > > > > > > >
> > > > > > > > > <!--get the artifact from our module name-->
> > > > > > > > >
> > > > > > > > > <artifact conf="master"/>
> > > > > > > > >
> > > > > > > > > </publications>
> > > > > > > > >
> > > > > > > > > </ivy-module>
> > > > > > > > >
> > > > > > > > > build.xml
> > > > > > > > >
> > > > > > > > > <project name="parse-custom" default="jar-core">
> > > > > > > > >
> > > > > > > > > <import file="../build-plugin.xml"/>
> > > > > > > > >
> > > > > > > > > > <!-- Build compilation dependencies -->
> > > > > > > > > > <target name="deps-jar">
> > > > > > > > > >   <ant target="compile-test" inheritall="false" dir="../parse-html"/>
> > > > > > > > > > </target>
> > > > > > > > > >
> > > > > > > > > > <path id="plugin.deps">
> > > > > > > > > >   <fileset dir="${nutch.root}/build">
> > > > > > > > > >     <include name="**/parse-html/*.jar" />
> > > > > > > > > >   </fileset>
> > > > > > > > > > </path>
> > > > > > > > > >
> > > > > > > > > > <!-- Deploy Unit test dependencies -->
> > > > > > > > > > <target name="deps-test">
> > > > > > > > > >   <ant target="deploy" inheritall="false" dir="../parse-html"/>
> > > > > > > > > >   <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
> > > > > > > > > > </target>
> > > > > > > > >
> > > > > > > > > </project>
> > > > > > > > >
> > > > > > > > > plugin.xml
> > > > > > > > >
> > > > > > > > > <plugin
> > > > > > > > > id="parse-custom"
> > > > > > > > > name="Custom Parse Plug-in"
> > > > > > > > > version="1.0.0"
> > > > > > > > > provider-name="nutch.org">
> > > > > > > > >
> > > > > > > > > > <runtime>
> > > > > > > > > >   <library name="parse-custom.jar">
> > > > > > > > > >     <export name="*"/>
> > > > > > > > > >   </library>
> > > > > > > > > > </runtime>
> > > > > > > > > >
> > > > > > > > > > <requires>
> > > > > > > > > >   <import plugin="parse-html"/>
> > > > > > > > > >   <import plugin="nutch-extensionpoints"/>
> > > > > > > > > > </requires>
> > > > > > > > > >
> > > > > > > > > > <extension
> > > > > > > > > > id="org.apache.nutch.parse.custom"
> > > > > > > > > > name="CustomParse"
> > > > > > > > > > point="org.apache.nutch.parse.Parser">
> > > > > > > > > >
> > > > > > > > > > <implementation id="org.apache.nutch.parse.custom.CustomParser"
> > > > > > > > > > class="org.apache.nutch.parse.custom.CustomParser">
> > > > > > > > > <parameter name="contentType"
> > > > > > > > > value="text/html|application/xhtml+xml"/>
> > > > > > > > > <parameter name="pathSuffix" value=""/>
> > > > > > > > > </implementation>
> > > > > > > > >
> > > > > > > > > </extension>
> > > > > > > > >
> > > > > > > > > </plugin>
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Mar 14, 2018 at 1:02 PM, Yossi Tamari
> > > > > > > > > <yossi.tam...@pipl.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Yash,
> > > > > > > > > >
> > > > > > > > > > I don't know how to do it, I never tried, but if I had
> > > > > > > > > > to it would be a trial and error thing....
> > > > > > > > > >
> > > > > > > > > > > If you want to increase the chances that someone will
> > > > > > > > > > > answer your question, I suggest you provide as much
> > > > > > > > > > > information as possible:
> > > > > > > > > > > Where did it not work? In "ant runtime", or when
> > > > > > > > > > > running in Hadoop?
> > > > > > > > > > > What was the error message?
> > > > > > > > > > > What is the content of your build.xml, plugin.xml, and
> > > > > > > > > > > ivy.xml?
> > > > > > > > > > > Is parse-html configured in your plugin-includes?
> > > > > > > > > >
> > > > > > > > > > If it's a problem during execution, I would suggest
> > > > > > > > > > looking at or debugging the code of PluginClassLoader.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > > > > > > > > Sent: 14 March 2018 08:34
> > > > > > > > > > > To: user@nutch.apache.org
> > > > > > > > > > > Subject: Re: Dependency between plugins
> > > > > > > > > > >
> > > > > > > > > > > Anybody please help me out regarding this.
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Mar 13, 2018 at 6:51 PM, Yash Thenuan
> > > > > > > > > > > Thenuan < rit2014...@iiita.ac.in> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > > I am trying to import HtmlParser in my custom parser.
> > > > > > > > > > > > > I did it the same way HtmlParser imports
> > > > > > > > > > > > > lib-nekohtml, but it didn't work.
> > > > > > > > > > > > > Can anybody please tell me how to do it?
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > >
> > > >
> >
> >
