Hi Shreemoyee,

Each page is stored by key in the various Nutch files, and that key is the URL, so stripping the URL down to just its domain wouldn't work unless you only had a single page per domain: every other page on that domain would end up under the same key. All Nutch programs, including the generator and the fetcher, work off the URL as the key.

You can still extract the host from the key whenever you need it, using something like this (URLUtil lives in org.apache.nutch.util):

String host = URLUtil.getHost(key.toString());
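
To make the point about keys concrete, here is a rough standalone sketch. It uses only the JDK, so hostOf below is a hypothetical stand-in for URLUtil.getHost and not the Nutch code itself, and the two maps are just an illustration of keyed storage, not how Nutch actually lays out its files. The second URL is made up for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;

public class HostFromKey {

    // Hypothetical stand-in for URLUtil.getHost: returns the lower-cased
    // host of the URL, or null if the URL cannot be parsed or has no host.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String[] keys = {
            "http://example.com/foo/bar/page.html",
            "http://example.com/other.html"   // made-up second page
        };

        Map<String, String> byUrl = new LinkedHashMap<>();
        Map<String, String> byHost = new LinkedHashMap<>();
        for (String key : keys) {
            byUrl.put(key, "data for " + key);
            // Same host for both pages, so the second put overwrites the first.
            byHost.put(hostOf(key), "data for " + key);
        }

        System.out.println("keyed by URL:  " + byUrl.size() + " entries");
        System.out.println("keyed by host: " + byHost.size() + " entries");
    }
}
```

Keyed by full URL both pages survive; keyed by host only one entry is left, which is why the URL should stay as the key and the host be derived from it when needed.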

If you want to store it during the fetch/parse, I would suggest putting it in the crawl or parse metadata. To do this, though, you may have to modify the Fetcher job.

Dennis

On 05/30/2010 02:22 AM, Shreemoyee Sarkar wrote:
Hi,

Is it possible to store the website url instead of complete url without
affecting the crawl?

e.g. store http://example.com instead of
http://example.com/foo/bar/page.html

would the generate URLs and fetch for the subsequent depths go smoothly?

Thanks
Shreemoyee
