RE: Regarding Internal Links

Yash Thenuan Thenuan Tue, 06 Mar 2018 10:17:54 -0800

I am able to build the ParseText data structure.
But I am having trouble with ParseData, as its constructor asks for a
ParseStatus, outlinks, contentMeta and parseMeta.
I can get the outlinks from OutlinkExtractor, but what about the other
parameters? Also, getOutlinks asks for a Configuration, and I don't know
where I can get it from.


On 6 Mar 2018 18:32, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:

> You should go over each segment, and for each one produce a ParseText and
> a ParseData. This is basically what the HTML Parser does for the whole
> document, which is why I suggested you should dive into its code.
> A ParseText is basically just a String containing the actual content of
> the segment (after stripping the HTML tags). This is usually the document
> you want to index.
> The ParseData structure is a little more complex, but the main things it
> contains are the title of this segment, and the outlinks from the segment
> (for further crawling). Take a look at the code of both classes and it
> should be relatively clear.
> Finally, you need to build one ParseResult object, with the original URL,
> and for each of the ParseText/ParseData pairs, call the put method, with
> the internal URL of the segment as the key.
>
> > -----Original Message-----
> > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > Sent: 06 March 2018 14:45
> > To: user@nutch.apache.org
> > Subject: RE: Regarding Internal Links
> >
> > > I am able to get the content corresponding to each internal link by
> > > writing a parse filter plugin. Now I am not sure how to proceed
> > > further. How can I parse them as separate documents, and what should
> > > my ParseResult filter return?
>
>
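
To make the shape of the result described above concrete, here is a
minimal, self-contained Java sketch. It does not use the Nutch classes:
SegmentParse and PageResult are hypothetical stand-ins for a
ParseText/ParseData pair and for ParseResult, the "#anchor" form of the
segment URL is an assumption, and extractOutlinks is a naive placeholder
rather than Nutch's OutlinkExtractor. The sketch also leaves out the
ParseStatus and the two metadata arguments that the real ParseData
constructor asks for. As an assumption to verify against the HTML parser
source: in Nutch 1.x these are typically ParseStatus.STATUS_SUCCESS,
content.getMetadata() and a new Metadata(), and the Configuration usually
comes from the plugin's own getConf().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SegmentParseSketch {

    // Hypothetical stand-in for one ParseText/ParseData pair.
    static final class SegmentParse {
        final String text;            // plain text of the segment (role of ParseText)
        final String title;           // segment title (kept in ParseData)
        final List<String> outlinks;  // links found in the segment (kept in ParseData)

        SegmentParse(String text, String title, List<String> outlinks) {
            this.text = text;
            this.title = title;
            this.outlinks = outlinks;
        }
    }

    // Hypothetical stand-in for ParseResult: the original URL plus one
    // entry per segment, keyed by the segment's internal URL.
    static final class PageResult {
        final String originalUrl;
        final Map<String, SegmentParse> entries = new LinkedHashMap<>();

        PageResult(String originalUrl) {
            this.originalUrl = originalUrl;
        }

        void put(String segmentUrl, SegmentParse parse) {
            entries.put(segmentUrl, parse);
        }
    }

    // Naive placeholder: treats whitespace-separated tokens that start
    // with http:// or https:// as outlinks.
    static List<String> extractOutlinks(String text) {
        List<String> links = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (token.startsWith("http://") || token.startsWith("https://")) {
                links.add(token);
            }
        }
        return links;
    }

    // segments maps an anchor name to {title, plain text}.
    static PageResult build(String pageUrl, Map<String, String[]> segments) {
        PageResult result = new PageResult(pageUrl);
        for (Map.Entry<String, String[]> e : segments.entrySet()) {
            String title = e.getValue()[0];
            String text = e.getValue()[1];
            String segmentUrl = pageUrl + "#" + e.getKey(); // assumed URL form
            result.put(segmentUrl, new SegmentParse(text, title, extractOutlinks(text)));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String[]> segments = new LinkedHashMap<>();
        segments.put("intro", new String[] {"Intro", "Welcome, see http://example.com/a for more"});
        segments.put("usage", new String[] {"Usage", "No links here"});

        PageResult result = build("http://example.com/page", segments);
        for (Map.Entry<String, SegmentParse> e : result.entries.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().title
                    + " " + e.getValue().outlinks);
        }
    }
}
```

In an actual parse filter, the same loop would create a ParseText and a
ParseData for each segment and call put on the single ParseResult, with
the segment's internal URL as the key.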
