Is the parser you are using the HTML parser (parse-html)? I can say from experience that the DocumentFragment is passed to the parse filters. I really recommend debugging in local mode with a debugger rather than relying on System.out.println.
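
In case it helps, here is a small sketch of one thing that could explain the output. It is plain JDK DOM code, not Nutch code, and the class and method names (FragmentDump, sampleFragment, textOf, childCount) are made up for illustration. My assumption is that what you saw was something like "[#document-fragment: null]" rather than a real null reference: per the DOM spec the node value of a DocumentFragment is always null, even when it has children, and some DOM implementations include that value in toString(). Looking at the children or the text content tells you more than printing the object itself.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Nutch: inspects a DocumentFragment
// without relying on its toString().
class FragmentDump {

    // Builds a tiny fragment standing in for the one a parse filter receives.
    static DocumentFragment sampleFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            DocumentFragment frag = doc.createDocumentFragment();
            Element p = doc.createElement("p");
            p.appendChild(doc.createTextNode("Hello Nutch"));
            frag.appendChild(p);
            return frag;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Concatenated text of all descendants; empty string if the reference is null.
    static String textOf(Node node) {
        return node == null ? "" : node.getTextContent();
    }

    // Number of direct children, or -1 if the reference really is null.
    static int childCount(Node node) {
        return node == null ? -1 : node.getChildNodes().getLength();
    }

    public static void main(String[] args) {
        DocumentFragment frag = sampleFragment();
        // The node value is null by specification, even though the fragment has content.
        System.out.println("nodeValue = " + frag.getNodeValue());
        System.out.println("children  = " + childCount(frag));
        System.out.println("text      = " + textOf(frag));
    }
}
```

If childCount reports -1 inside your filter, the reference really is null and the question becomes which parser ran before the filter. Otherwise the fragment is there and only the printout was misleading.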

> -----Original Message-----
> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> Sent: 15 March 2018 10:13
> To: user@nutch.apache.org
> Subject: RE: RE: Dependency between plugins
>
> I tried printing the contents of document fragment in parsefilter-regex by
> writing System.out.println(doc) but its printing null!! And document is
> getting parsed!!
>
> On 15 Mar 2018 13:15, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
>
> > Parse filters receive a DocumentFragment as their fourth parameter.
> >
> > > -----Original Message-----
> > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > Sent: 15 March 2018 08:50
> > > To: user@nutch.apache.org
> > > Subject: Re: RE: Dependency between plugins
> > >
> > > Hi Jorge and Yossi,
> > > The reason why I am trying to do it is exactly what yossi said
> > > "removing nutch overhead", I didn't thought that it would be that
> > > complicated, All I am trying is to call the existing parsers from my
> > > own parser, but I am not able to do it correctly, may be chain
> > > approach is a better idea to do that but *do parse filter receives
> > > any DOM object?* as a parameter so by accessing that I can extract
> > > the data I want??
> > >
> > > On Wed, Mar 14, 2018 at 7:36 PM, Yossi Tamari
> > > <yossi.tam...@pipl.com> wrote:
> > >
> > > > There is no built-in mechanism for this. However, are you sure you
> > > > really want a parser for each website, rather than a parse-filter
> > > > for each website (which will take the results of the HTML parser
> > > > and apply some domain specific customizations)?
> > > > In both cases you can use a dispatcher approach, which your custom
> > > > parser is, or a chain approach (every parser that is not intended
> > > > for this domain returns null, or each parse-filter that is not
> > > > intended for this domain returns the ParseResult that it received).
> > > > The advantage of the chain approach is that each new website
> > > > parser is a first-class, reusable Nutch object. The advantage of
> > > > the dispatcher approach is that you don't need to deal with a lot
> > > > of the Nutch overhead, but it is more monolithic (You can end up
> > > > with one huge plugin that needs to be constantly modified whenever
> > > > one of the websites is modified).
> > > >
> > > > > -----Original Message-----
> > > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > > Sent: 14 March 2018 15:28
> > > > > To: user@nutch.apache.org
> > > > > Subject: Re: RE: Dependency between plugins
> > > > >
> > > > > Is there a way in nutch by which we can use different parser for
> > > > > different websites?
> > > > > I am trying to do this by writing a custom parser which will call
> > > > > different parsers for different websites?
> > > > >
> > > > > On 14 Mar 2018 14:19, "Semyon Semyonov"
> > > > > <semyon.semyo...@mail.com> wrote:
> > > > >
> > > > > > As a side note,
> > > > > >
> > > > > > I had to implement my own parser with extra functionality,
> > > > > > simple copy/past of the code of HTMLparser did the job.
> > > > > >
> > > > > > If you want to inherit instead of copy paste it can be a bad
> > > > > > idea at all.
> > > > > > HTML parser is a concrete non abstract class, therefore the
> > > > > > inheritance will not be so smooth as in case of contract
> > > > > > implementations(the plugins are contracts, ie interfaces) and
> > > > > > can easily break some OOP rules.
> > > > > >
> > > > > > Sent: Wednesday, March 14, 2018 at 9:18 AM
> > > > > > From: "Yossi Tamari" <yossi.tam...@pipl.com>
> > > > > > To: user@nutch.apache.org
> > > > > > Subject: RE: Dependency between plugins
> > > > > >
> > > > > > One suggestion I can make is to ensure that the html-parse
> > > > > > plugin is built before your plugin (since you are including
> > > > > > the jars that are generated in its build).
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > > > > Sent: 14 March 2018 09:55
> > > > > > > To: user@nutch.apache.org
> > > > > > > Subject: Re: Dependency between plugins
> > > > > > >
> > > > > > > Hi,
> > > > > > > It didn't worked in ant runtime.
> > > > > > > I included "import org.apache.nutch.parse.html;" in my
> > > > > > > custom parser code.
> > > > > > > but it is throwing errror while i am doing ant runtime.
> > > > > > >
> > > > > > > [javac] /Users/yasht/Downloads/apache-nutch-1.14/src/plugin/parse-custom/src/java/org/apache/nutch/parse/custom/CustomParser.java:41: error: cannot find symbol
> > > > > > > [javac] import org.apache.nutch.parse.html;
> > > > > > > [javac] ^
> > > > > > > [javac] symbol: class html
> > > > > > > [javac] location: package org.apache.nutch.parse
> > > > > > >
> > > > > > > below are the xml files of my parser
> > > > > > >
> > > > > > > My ivy.xml
> > > > > > >
> > > > > > > <ivy-module version="1.0">
> > > > > > >   <info organisation="org.apache.nutch"
> > > > > > >         module="${ant.project.name}">
> > > > > > >     <license name="Apache 2.0"/>
> > > > > > >     <ivyauthor name="Apache Nutch Team"
> > > > > > >                url="http://nutch.apache.org"/>
> > > > > > >     <description>
> > > > > > >       Apache Nutch
> > > > > > >     </description>
> > > > > > >   </info>
> > > > > > >
> > > > > > >   <configurations>
> > > > > > >     <include file="../../../ivy/ivy-configurations.xml"/>
> > > > > > >   </configurations>
> > > > > > >
> > > > > > >   <publications>
> > > > > > >     <!--get the artifact from our module name-->
> > > > > > >     <artifact conf="master"/>
> > > > > > >   </publications>
> > > > > > > </ivy-module>
> > > > > > >
> > > > > > > build.xml
> > > > > > >
> > > > > > > <project name="parse-custom" default="jar-core">
> > > > > > >   <import file="../build-plugin.xml"/>
> > > > > > >
> > > > > > >   <!-- Build compilation dependencies -->
> > > > > > >   <target name="deps-jar">
> > > > > > >     <ant target="compile-test" inheritall="false"
> > > > > > >          dir="../parse-html"/>
> > > > > > >   </target>
> > > > > > >
> > > > > > >   <path id="plugin.deps">
> > > > > > >     <fileset dir="${nutch.root}/build">
> > > > > > >       <include name="**/parse-html/*.jar" />
> > > > > > >     </fileset>
> > > > > > >   </path>
> > > > > > >
> > > > > > >   <!-- Deploy Unit test dependencies -->
> > > > > > >   <target name="deps-test">
> > > > > > >     <ant target="deploy" inheritall="false"
> > > > > > >          dir="../parse-html"/>
> > > > > > >     <ant target="deploy" inheritall="false"
> > > > > > >          dir="../nutch-extensionpoints"/>
> > > > > > >   </target>
> > > > > > > </project>
> > > > > > >
> > > > > > > plugin.xml
> > > > > > >
> > > > > > > <plugin
> > > > > > >    id="parse-custom"
> > > > > > >    name="Custom Parse Plug-in"
> > > > > > >    version="1.0.0"
> > > > > > >    provider-name="nutch.org">
> > > > > > >
> > > > > > >   <runtime>
> > > > > > >     <library name="parse-custom.jar">
> > > > > > >       <export name="*"/>
> > > > > > >     </library>
> > > > > > >   </runtime>
> > > > > > >
> > > > > > >   <requires>
> > > > > > >     <import plugin="parse-html"/>
> > > > > > >     <import plugin="nutch-extensionpoints"/>
> > > > > > >   </requires>
> > > > > > >
> > > > > > >   <extension id="org.apache.nutch.parse.custom"
> > > > > > >              name="CustomParse"
> > > > > > >              point="org.apache.nutch.parse.Parser">
> > > > > > >
> > > > > > >     <implementation id="org.apache.nutch.parse.custom.CustomParser"
> > > > > > >                     class="org.apache.nutch.parse.custom.CustomParser">
> > > > > > >       <parameter name="contentType"
> > > > > > >                  value="text/html|application/xhtml+xml"/>
> > > > > > >       <parameter name="pathSuffix" value=""/>
> > > > > > >     </implementation>
> > > > > > >
> > > > > > >   </extension>
> > > > > > >
> > > > > > > </plugin>
> > > > > > >
> > > > > > > On Wed, Mar 14, 2018 at 1:02 PM, Yossi Tamari
> > > > > > > <yossi.tam...@pipl.com> wrote:
> > > > > > >
> > > > > > > > Hi Yash,
> > > > > > > >
> > > > > > > > I don't know how to do it, I never tried, but if I had to
> > > > > > > > it would be a trial and error thing....
> > > > > > > >
> > > > > > > > If you want to increase the chances that someone will
> > > > > > > > answer your question, I suggest you provide as much
> > > > > > > > information as possible:
> > > > > > > > Where did it not work? In "ant runtime", or when running
> > > > > > > > in Hadoop?
> > > > > > > > What was the error message?
> > > > > > > > What is the content of your build.xml, plugin.xml, and ivy.xml?
> > > > > > > > Is parse-html configured in your plugin-includes?
> > > > > > > >
> > > > > > > > If it's a problem during execution, I would suggest
> > > > > > > > looking at or debugging the code of PluginClassLoader.
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > > > > > > Sent: 14 March 2018 08:34
> > > > > > > > > To: user@nutch.apache.org
> > > > > > > > > Subject: Re: Dependency between plugins
> > > > > > > > >
> > > > > > > > > Anybody please help me out regarding this.
> > > > > > > > >
> > > > > > > > > On Tue, Mar 13, 2018 at 6:51 PM, Yash Thenuan Thenuan
> > > > > > > > > <rit2014...@iiita.ac.in> wrote:
> > > > > > > > >
> > > > > > > > > > I am trying to import Htmlparser in my custom parser.
> > > > > > > > > > I did it in the same way by which Htmlparser imports
> > > > > > > > > > lib-nekohtml but it didn't worked.
> > > > > > > > > > Can anybody please tell me how to do it?