Parse filters receive a DocumentFragment as their fourth parameter.
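
For illustration, here is a rough sketch of pulling data out of such a fragment. It uses only the JDK DOM API so it can run on its own; in Nutch the same traversal would sit inside HtmlParseFilter.filter(Content, ParseResult, HTMLMetaTags, DocumentFragment). The class name FragmentWalk, the helpers firstText, headlineFor and toFragment, and the sample markup are all made up for this example, and toFragment merely stands in for the HTML parser that Nutch runs before the filters.

```java
import java.io.StringReader;
import java.net.URI;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class FragmentWalk {

  /** Depth-first search for the first element named tag; returns its trimmed text, or null. */
  static String firstText(Node node, String tag) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && tag.equalsIgnoreCase(node.getNodeName())) {
      return node.getTextContent().trim();
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      String found = firstText(c, tag);
      if (found != null) {
        return found;
      }
    }
    return null;
  }

  /**
   * Chain-style guard (hypothetical helper): a filter that is not meant for
   * this host does nothing. In a real parse filter the equivalent would be
   * returning the ParseResult it received, unchanged.
   */
  static String headlineFor(String url, String host, DocumentFragment doc) {
    if (!host.equals(URI.create(url).getHost())) {
      return null;
    }
    return firstText(doc, "h1");
  }

  /** Stand-in for the HTML parser: builds a DocumentFragment from well-formed markup. */
  static DocumentFragment toFragment(String xml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
    DocumentFragment frag = doc.createDocumentFragment();
    frag.appendChild(doc.getDocumentElement()); // moves the root element into the fragment
    return frag;
  }

  public static void main(String[] args) throws Exception {
    DocumentFragment doc =
        toFragment("<html><body><h1> Product X </h1><p>text</p></body></html>");
    System.out.println(headlineFor("http://example.com/item/1", "example.com", doc));
  }
}
```

In an actual filter you would then store the extracted value, for example in the parse metadata, before returning the ParseResult.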

> -----Original Message-----
> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> Sent: 15 March 2018 08:50
> To: user@nutch.apache.org
> Subject: Re: RE: Dependency between plugins
> 
> Hi Jorge and Yossi,
> The reason I am trying to do this is exactly what Yossi said, "removing Nutch
> overhead". I didn't think it would be that complicated. All I am trying to do
> is call the existing parsers from my own parser, but I am not able to do it
> correctly. Maybe the chain approach is a better idea, but *does a parse filter
> receive any DOM object* as a parameter, so that by accessing it I can extract
> the data I want?
> 
> 
> On Wed, Mar 14, 2018 at 7:36 PM, Yossi Tamari <yossi.tam...@pipl.com>
> wrote:
> 
> > There is no built-in mechanism for this. However, are you sure you
> > really want a parser for each website, rather than a parse-filter for
> > each website (which will take the results of the HTML parser and apply
> > some domain specific customizations)?
> > In both cases you can use a dispatcher approach, which your custom
> > parser is, or a chain approach (every parser that is not intended for
> > this domain returns null, or each parse-filter that is not intended
> > for this domain returns the ParseResult that it received).
> > The advantage of the chain approach is that each new website parser is
> > a first-class, reusable Nutch object. The advantage of the dispatcher
> > approach is that you don't need to deal with a lot of the Nutch
> > overhead, but it is more monolithic (You can end up with one huge
> > plugin that needs to be constantly modified whenever one of the websites
> > is modified).
> >
> > > -----Original Message-----
> > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > Sent: 14 March 2018 15:28
> > > To: user@nutch.apache.org
> > > Subject: Re: RE: Dependency between plugins
> > >
> > > Is there a way in Nutch to use a different parser for different
> > > websites?
> > > I am trying to do this by writing a custom parser which will call
> > > different parsers for different websites.
> > >
> > > On 14 Mar 2018 14:19, "Semyon Semyonov" <semyon.semyo...@mail.com>
> > > wrote:
> > >
> > > > As a side note,
> > > >
> > > > I had to implement my own parser with extra functionality; a simple
> > > > copy/paste of the HtmlParser code did the job.
> > > >
> > > > If you want to inherit instead of copy-pasting, that can be a bad
> > > > idea. HtmlParser is a concrete, non-abstract class, so inheritance
> > > > will not be as smooth as with contract implementations (the plugins
> > > > are contracts, i.e. interfaces) and can easily break some OOP rules.
> > > >
> > > >
> > > > Sent: Wednesday, March 14, 2018 at 9:18 AM
> > > > From: "Yossi Tamari" <yossi.tam...@pipl.com>
> > > > To: user@nutch.apache.org
> > > > Subject: RE: Dependency between plugins
> > > >
> > > > One suggestion I can make is to ensure that the parse-html plugin is
> > > > built before your plugin (since you are including the jars that are
> > > > generated in its build).
> > > >
> > > > > -----Original Message-----
> > > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > > Sent: 14 March 2018 09:55
> > > > > To: user@nutch.apache.org
> > > > > Subject: Re: Dependency between plugins
> > > > >
> > > > > Hi,
> > > > > > It didn't work in "ant runtime".
> > > > > > I included "import org.apache.nutch.parse.html;" in my custom
> > > > > > parser code, but it throws an error when I run "ant runtime".
> > > > >
> > > > > > [javac] /Users/yasht/Downloads/apache-nutch-1.14/src/plugin/parse-custom/src/java/org/apache/nutch/parse/custom/CustomParser.java:41: error: cannot find symbol
> > > > > > [javac] import org.apache.nutch.parse.html;
> > > > > > [javac] ^
> > > > > > [javac] symbol: class html
> > > > > > [javac] location: package org.apache.nutch.parse
> > > > > >
> > > > >
> > > > > below are the xml files of my parser
> > > > >
> > > > >
> > > > > My ivy.xml
> > > > >
> > > > >
> > > > > > <ivy-module version="1.0">
> > > > > >   <info organisation="org.apache.nutch" module="${ant.project.name}">
> > > > > >     <license name="Apache 2.0"/>
> > > > > >     <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
> > > > > >     <description>
> > > > > >       Apache Nutch
> > > > > >     </description>
> > > > > >   </info>
> > > > > >
> > > > > >   <configurations>
> > > > > >     <include file="../../../ivy/ivy-configurations.xml"/>
> > > > > >   </configurations>
> > > > > >
> > > > > >   <publications>
> > > > > >     <!--get the artifact from our module name-->
> > > > > >     <artifact conf="master"/>
> > > > > >   </publications>
> > > > > > </ivy-module>
> > > > >
> > > > > build.xml
> > > > >
> > > > > > <project name="parse-custom" default="jar-core">
> > > > > >
> > > > > >   <import file="../build-plugin.xml"/>
> > > > > >
> > > > > >   <!-- Build compilation dependencies -->
> > > > > >   <target name="deps-jar">
> > > > > >     <ant target="compile-test" inheritall="false" dir="../parse-html"/>
> > > > > >   </target>
> > > > > >
> > > > > >   <path id="plugin.deps">
> > > > > >     <fileset dir="${nutch.root}/build">
> > > > > >       <include name="**/parse-html/*.jar" />
> > > > > >     </fileset>
> > > > > >   </path>
> > > > > >
> > > > > >   <!-- Deploy Unit test dependencies -->
> > > > > >   <target name="deps-test">
> > > > > >     <ant target="deploy" inheritall="false" dir="../parse-html"/>
> > > > > >     <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
> > > > > >   </target>
> > > > > >
> > > > > > </project>
> > > > >
> > > > > plugin.xml
> > > > >
> > > > > > <plugin
> > > > > >    id="parse-custom"
> > > > > >    name="Custom Parse Plug-in"
> > > > > >    version="1.0.0"
> > > > > >    provider-name="nutch.org">
> > > > > >
> > > > > >   <runtime>
> > > > > >     <library name="parse-custom.jar">
> > > > > >       <export name="*"/>
> > > > > >     </library>
> > > > > >   </runtime>
> > > > > >
> > > > > >   <requires>
> > > > > >     <import plugin="parse-html"/>
> > > > > >     <import plugin="nutch-extensionpoints"/>
> > > > > >   </requires>
> > > > > >
> > > > > >   <extension id="org.apache.nutch.parse.custom"
> > > > > >              name="CustomParse"
> > > > > >              point="org.apache.nutch.parse.Parser">
> > > > > >
> > > > > >     <implementation id="org.apache.nutch.parse.custom.CustomParser"
> > > > > >                     class="org.apache.nutch.parse.custom.CustomParser">
> > > > > >       <parameter name="contentType" value="text/html|application/xhtml+xml"/>
> > > > > >       <parameter name="pathSuffix" value=""/>
> > > > > >     </implementation>
> > > > > >
> > > > > >   </extension>
> > > > > >
> > > > > > </plugin>
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Mar 14, 2018 at 1:02 PM, Yossi Tamari
> > > > > <yossi.tam...@pipl.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Yash,
> > > > > >
> > > > > > > I don't know how to do it; I have never tried, and if I had to,
> > > > > > > it would be a trial-and-error thing.
> > > > > > >
> > > > > > > If you want to increase the chances that someone will answer
> > > > > > > your question, I suggest you provide as much information as
> > > > > > > possible:
> > > > > > > Where did it not work? In "ant runtime", or when running in Hadoop?
> > > > > > > What was the error message?
> > > > > > > What is the content of your build.xml, plugin.xml, and ivy.xml?
> > > > > > > Is parse-html configured in your plugin.includes?
> > > > > >
> > > > > > If it's a problem during execution, I would suggest looking at
> > > > > > or debugging the code of PluginClassLoader.
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > > > > Sent: 14 March 2018 08:34
> > > > > > > To: user@nutch.apache.org
> > > > > > > Subject: Re: Dependency between plugins
> > > > > > >
> > > > > > > Anybody please help me out regarding this.
> > > > > > >
> > > > > > > On Tue, Mar 13, 2018 at 6:51 PM, Yash Thenuan Thenuan <
> > > > > > > rit2014...@iiita.ac.in> wrote:
> > > > > > >
> > > > > > > > > I am trying to import HtmlParser in my custom parser.
> > > > > > > > > I did it the same way HtmlParser imports lib-nekohtml,
> > > > > > > > > but it didn't work.
> > > > > > > > > Can anybody please tell me how to do it?
> > > > > > > >
> > > > > >
> > > > > >
> > > >
> > > >
> >
> >
