You should go over each segment and, for each one, produce a ParseText and a ParseData. This is essentially what the HTML parser does for the whole document, which is why I suggested you dive into its code.

A ParseText is just a String holding the actual content of the segment (after stripping the HTML tags). This is usually the text you want to index. The ParseData structure is a little more complex, but the main things it contains are the title of the segment and the outlinks from the segment (for further crawling). Take a look at the code of both classes and it should be relatively clear.

Finally, you need to build one ParseResult object with the original URL. Then, for each ParseText/ParseData pair, call its put method with the internal URL of the segment as the key.
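
To make the flow above more concrete, here is a rough, self-contained sketch in plain Java. It does not use the Nutch classes at all: Segment, SegmentParse, SegmentResult and parseSegments are hypothetical stand-ins I made up to mirror the roles of ParseText, ParseData and ParseResult. Building the key as the original URL plus "#" plus an anchor id is also just an assumption about what your internal URLs look like, and the regex-based tag stripping is a crude placeholder for real HTML parsing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch: these classes only mimic the roles of Nutch's
// ParseText, ParseData and ParseResult; they are NOT the Nutch classes.
class SegmentParseSketch {

    /** Hypothetical input: one segment of the page, identified by an anchor id. */
    static final class Segment {
        final String anchorId;
        final String title;
        final String html;
        Segment(String anchorId, String title, String html) {
            this.anchorId = anchorId;
            this.title = title;
            this.html = html;
        }
    }

    /** Plays the role of one ParseText/ParseData pair. */
    static final class SegmentParse {
        final String text;           // like ParseText: tag-free content to index
        final String title;          // like ParseData: title of the segment
        final List<String> outlinks; // like ParseData: links for further crawling
        SegmentParse(String text, String title, List<String> outlinks) {
            this.text = text;
            this.title = title;
            this.outlinks = outlinks;
        }
    }

    /** Plays the role of ParseResult: one per original URL, keyed by segment URL. */
    static final class SegmentResult {
        final String originalUrl;
        private final Map<String, SegmentParse> parses = new LinkedHashMap<>();
        SegmentResult(String originalUrl) { this.originalUrl = originalUrl; }
        void put(String key, SegmentParse parse) { parses.put(key, parse); }
        SegmentParse get(String key) { return parses.get(key); }
        int size() { return parses.size(); }
    }

    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Crude tag stripping: replace tags with spaces, then collapse whitespace. */
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    /** Collects the values of double-quoted href attributes, in document order. */
    static List<String> extractOutlinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** One result for the original URL, with one entry per segment. */
    static SegmentResult parseSegments(String originalUrl, List<Segment> segments) {
        SegmentResult result = new SegmentResult(originalUrl);
        for (Segment s : segments) {
            String key = originalUrl + "#" + s.anchorId; // assumed internal URL form
            result.put(key,
                new SegmentParse(stripTags(s.html), s.title, extractOutlinks(s.html)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment("intro", "Intro",
            "<p>Hello <a href=\"http://a.example/x\">link</a> world</p>"));
        SegmentResult result = parseSegments("http://example.com/doc.html", segments);
        SegmentParse p = result.get("http://example.com/doc.html#intro");
        System.out.println(p.title + " | " + p.text + " | " + p.outlinks);
    }
}
```

In a real plugin you would build the actual ParseText and ParseData objects in place of SegmentParse and hand them to ParseResult's put method; check the constructors in the Nutch source for the exact arguments, since I have deliberately not reproduced them here.
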
> -----Original Message-----
> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> Sent: 06 March 2018 14:45
> To: user@nutch.apache.org
> Subject: RE: Regarding Internal Links
>
> > I am able to get the content corresponding to each Internal link by
> > writing a parse filter plugin. Now I am not getting how to proceed
> > further. How can I parse them as separate document and what should
> > my ParseResult filter return??