Hi All,

I am using ManifoldCF 2.14 to crawl data from a website, with the Web repository connector and the Elasticsearch output connector. I would like to understand the crawling framework/hierarchy used by the web crawler. As far as I understand, URLs are crawled in a tree structure.
I want to know whether ManifoldCF currently supports storing the parent URL of a document. For example, suppose the seed URL is www.example.com and the 80th document in the document queue has the document identifier www.example.com/education/univeristy/234.html. Does ManifoldCF store the back-traced URLs, that is, the hierarchy of links that was followed from the seed document to reach the 80th document? For instance, does it store the 79th, 78th, and 77th levels of the crawl that led to the 80th document?

Is this crawl hierarchy (even if only the level) stored anywhere in the ManifoldCF code at present? If yes, is this framework code shipped in the form of a jar? If it is not in a jar, any clue as to which Java file implements this logic would be really helpful.

Any kind of clue or help will be really appreciated.

Many Thanks
Ritika