Regarding the configuration parameter: your parse filter should expose a setConf method that receives a conf parameter. Keep that as a member variable and pass it wherever it is needed (the outlink extraction you mention is one such place).

Regarding parsestatus, contentmeta and parsemeta: you will have to look at them yourself (probably in a debugger), but as a baseline you can probably just reuse the values from the inbound ParseResult (the one for the whole document).

More specifically, parsestatus indicates whether parsing was successful. Unless parsing of a segment can fail even when parsing of the whole document succeeded, you don't need to change it.

contentmeta is all the information that was gathered about the page before parsing, so again, you probably just want to keep it.

Finally, parsemeta is the metadata that was gathered during parsing and may be useful for indexing. Passing the metadata from the original ParseResult makes sense, or, if you don't care about the metadata, you can use the constructor that does not require it.

This should all be easier to understand if you look at what the HTML Parser does with each of these fields.
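
To make the idea concrete, here is a minimal, self-contained Java sketch of the pattern: store the configuration handed to setConf, then build one data object per segment that reuses the whole document's status and metadata. It deliberately does not use the Nutch classes. Conf, ParseData and SegmentFilter are simplified stand-ins defined in the block itself, and the key sketch.max.outlinks and the whitespace-based link extraction are made up for illustration, so check the real constructors and method signatures against the Nutch source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration only; these are NOT the Nutch classes.
final class SegmentParseSketch {

    /** Stand-in for the configuration object passed to setConf. */
    static final class Conf {
        final Map<String, String> values = new LinkedHashMap<>();
    }

    /** Stand-in holding the fields discussed in the thread. */
    static final class ParseData {
        final String status;                    // parsestatus: did parsing succeed?
        final String title;
        final String[] outlinks;
        final Map<String, String> contentMeta;  // gathered before parsing
        final Map<String, String> parseMeta;    // gathered during parsing

        ParseData(String status, String title, String[] outlinks,
                  Map<String, String> contentMeta, Map<String, String> parseMeta) {
            this.status = status;
            this.title = title;
            this.outlinks = outlinks;
            this.contentMeta = contentMeta;
            this.parseMeta = parseMeta;
        }
    }

    /** Hypothetical filter: keeps conf as a member and passes it where needed. */
    static final class SegmentFilter {
        private Conf conf;

        void setConf(Conf conf) {
            this.conf = conf;
        }

        /**
         * Builds one ParseData per segment (internal URL -> segment text).
         * Status, title and both metadata maps are taken from the whole document;
         * a real filter would normally supply a per-segment title.
         */
        Map<String, ParseData> split(ParseData whole, Map<String, String> segments) {
            Map<String, ParseData> result = new LinkedHashMap<>();
            for (Map.Entry<String, String> segment : segments.entrySet()) {
                String[] outlinks = extractOutlinks(segment.getValue(), conf);
                result.put(segment.getKey(), new ParseData(
                        whole.status, whole.title, outlinks,
                        whole.contentMeta, whole.parseMeta));
            }
            return result;
        }

        /** Toy extractor: whitespace-separated http(s) tokens, capped by a made-up conf key. */
        private static String[] extractOutlinks(String text, Conf conf) {
            String configured = conf.values.get("sketch.max.outlinks");
            int max = configured == null ? 100 : Integer.parseInt(configured);
            List<String> links = new ArrayList<>();
            for (String token : text.split("\\s+")) {
                if (links.size() >= max) {
                    break;
                }
                if (token.startsWith("http://") || token.startsWith("https://")) {
                    links.add(token);
                }
            }
            return links.toArray(new String[0]);
        }
    }
}
```

In this sketch only the outlinks differ per segment. If a segment could fail to parse on its own, the status passed to the constructor is the one field you would want to compute per segment instead of copying.
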
> -----Original Message-----
> From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> Sent: 06 March 2018 20:17
> To: email@example.com
> Subject: RE: Regarding Internal Links
>
> I am able to get parsetext data structure.
> But having trouble with parseData as it's constructor is asking for
> parsestatus, outlinks, contentmeta and parsemeta.
> Outlinks I can get from outlinkExtractor but what about other parameters?
> And again getoutlinks is asking for configuration and i don't know, from
> where I can get it?
>
> On 6 Mar 2018 18:32, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
>
> > You should go over each segment, and for each one produce a ParseText
> > and a ParseData. This is basically what the HTML Parser does for the
> > whole document, which is why I suggested you should dive into its code.
> > A ParseText is basically just a String containing the actual content
> > of the segment (after stripping the HTML tags). This is usually the
> > document you want to index.
> > The ParseData structure is a little more complex, but the main things
> > it contains are the title of this segment, and the outlinks from the
> > segment (for further crawling). Take a look at the code of both
> > classes and it should be relatively clear.
> > Finally, you need to build one ParseResult object, with the original
> > URL, and for each of the ParseText/ParseData pairs, call the put
> > method, with the internal URL of the segment as the key.
> >
> > > -----Original Message-----
> > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > Sent: 06 March 2018 14:45
> > > To: firstname.lastname@example.org
> > > Subject: RE: Regarding Internal Links
> > >
> > > > I am able to get the content corresponding to each Internal link
> > > > by writing a parse filter plugin. Now I am not getting how to
> > > > proceed further. How can I parse them as separate document and
> > > > what should my ParseResult filter return??