Extraction and storing parent URL while crawling

ritika jain Fri, 03 Apr 2020 01:59:16 -0700

Hi All,
I am using ManifoldCF 2.14 to crawl a website, with the Web repository
connector and Elasticsearch as the output connector.
I would like to understand the crawling framework/hierarchy used by the
web crawler.
As far as I understand, URLs are crawled in a tree-like structure.


I want to know whether ManifoldCF currently supports storing the parent
URL of a document.
For example, suppose the seed URL is www.example.com and the 80th document
in the queue has the document identifier
www.example.com/education/univeristy/234.html.

Is there any way ManifoldCF stores the back-traced URLs, that is, the chain
of links that was followed to reach the 80th document?
For example, storing the 79th, 78th, 77th ... steps of the crawl that lead
from the seed document to the 80th document.
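
To make the question concrete, here is a minimal sketch of the kind of
bookkeeping I have in mind. It is plain Python and not ManifoldCF code: the
LINKS graph, crawl() and trace_back() are all made up for illustration, and
the intermediate pages are assumed. Each URL remembers the URL it was first
discovered from, so the path back to the seed can be rebuilt afterwards.

```python
from collections import deque

# Hypothetical link graph standing in for a real site: page -> links found on it.
LINKS = {
    "www.example.com": [
        "www.example.com/education",
        "www.example.com/about",
    ],
    "www.example.com/education": [
        "www.example.com/education/univeristy",
    ],
    "www.example.com/education/univeristy": [
        "www.example.com/education/univeristy/234.html",
    ],
}


def crawl(seed):
    """Breadth-first crawl that records, for each URL, the URL it was found on."""
    parent = {seed: None}          # the seed has no parent
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        for link in LINKS.get(url, []):
            if link not in parent:  # keep only the first discovery
                parent[link] = url
                queue.append(link)
    return parent


def trace_back(parent, url):
    """Follow parent pointers from url up to the seed; return the path seed-first."""
    path = []
    while url is not None:
        path.append(url)
        url = parent[url]
    return path[::-1]


parents = crawl("www.example.com")
for hop in trace_back(parents, "www.example.com/education/univeristy/234.html"):
    print(hop)
```

What I am looking for is whether ManifoldCF already keeps something
equivalent to the parent mapping above, or at least the depth of each
document.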

Is this crawling hierarchy (even if only the level) stored anywhere in
ManifoldCF yet? If so, is that code shipped as a jar? If not, any clue as
to which Java file implements this logic would be really helpful.

Any kind of clue or help will be really appreciated.

Many Thanks
Ritika
