RE: Regarding Internal Links

Yossi Tamari Tue, 06 Mar 2018 05:03:08 -0800

Go over each segment and produce a ParseText and a ParseData for it. This is 
essentially what the HTML parser does for the whole document, which is why I 
suggested diving into its code.
A ParseText is just a String holding the actual content of the segment (after 
stripping the HTML tags). This is usually the text you want to index.
The ParseData structure is a little more complex, but the main things it 
contains are the title of the segment and the outlinks from the segment (for 
further crawling). Take a look at the code of both classes and it should be 
relatively clear.
Finally, build a single ParseResult object with the original URL and, for each 
ParseText/ParseData pair, call its put method with the internal URL of the 
segment as the key.
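
To make the flow concrete, here is a rough, self-contained sketch. It does not 
use the Nutch jars: the ParseText, ParseData and ParseResult classes below are 
simplified stand-ins that only mirror the names (the real ParseData, for 
example, also carries a parse status and metadata). The Segment class, the 
regex-based tag stripping and link extraction, and the "page URL + # + anchor 
id" form of the internal URL are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SegmentParseSketch {

    // Stand-in for Nutch's ParseText: the tag-free text of one segment.
    static final class ParseText {
        final String text;
        ParseText(String text) { this.text = text; }
    }

    // Stand-in for Nutch's ParseData, reduced to title and outlinks.
    static final class ParseData {
        final String title;
        final List<String> outlinks;
        ParseData(String title, List<String> outlinks) {
            this.title = title;
            this.outlinks = outlinks;
        }
    }

    // Stand-in for Nutch's ParseResult: one original URL, many keyed parses.
    static final class ParseResult {
        final String originalUrl;
        final Map<String, ParseText> texts = new LinkedHashMap<>();
        final Map<String, ParseData> data = new LinkedHashMap<>();
        ParseResult(String originalUrl) { this.originalUrl = originalUrl; }
        void put(String key, ParseText text, ParseData parseData) {
            texts.put(key, text);
            data.put(key, parseData);
        }
    }

    // Hypothetical holder for what the parse filter already extracted.
    static final class Segment {
        final String anchorId;
        final String title;
        final String html;
        Segment(String anchorId, String title, String html) {
            this.anchorId = anchorId;
            this.title = title;
            this.html = html;
        }
    }

    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Naive tag stripping; a real plugin would walk the DOM instead.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Collects the href values of links found inside the segment.
    static List<String> extractOutlinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // One ParseResult for the page, one put() per segment.
    static ParseResult parseSegments(String pageUrl, List<Segment> segments) {
        ParseResult result = new ParseResult(pageUrl);
        for (Segment s : segments) {
            ParseText text = new ParseText(stripTags(s.html));
            ParseData parseData = new ParseData(s.title, extractOutlinks(s.html));
            // Assumed key format: the internal URL is the page URL plus fragment.
            result.put(pageUrl + "#" + s.anchorId, text, parseData);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment("intro", "Intro",
                "<p>Hello <a href=\"http://example.com/a\">a</a></p>"));
        segments.add(new Segment("usage", "Usage", "<p>No links here</p>"));
        ParseResult result = parseSegments("http://example.com/page", segments);
        for (String key : result.texts.keySet()) {
            System.out.println(key + " -> " + result.texts.get(key).text);
        }
    }
}
```

In a real plugin the same loop would run inside the parse filter, using the 
actual Nutch classes and their constructors as found in the source.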


> -----Original Message-----
> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> Sent: 06 March 2018 14:45
> To: user@nutch.apache.org
> Subject: RE: Regarding Internal Links
> 
> > I am able to get the content corresponding to each internal link by
> > writing a parse filter plugin. Now I am not sure how to proceed
> > further. How can I parse them as separate documents, and what should
> > my ParseResult filter return?
