Re: Storing website urls instead of complete urls in index

Dennis Kubes Tue, 01 Jun 2010 06:34:39 -0700

Hi Shreemoyee,

As each page is stored by key in the different Nutch files and this key is the URL, stripping down the URL to just its domain wouldn't work unless you only had a single page per domain. All Nutch programs, including generator and fetcher, work off of the URL as key.
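
To illustrate the collision, here is a minimal sketch in plain Java. It uses no Nutch classes; the in-memory map only stands in for a keyed store, and siteOf and distinctKeys are hypothetical helpers made up for this example.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class KeyCollisionSketch {
    // Hypothetical helper: reduce a URL to scheme://host, as proposed in the question.
    static String siteOf(String url) {
        URI u = URI.create(url);
        return u.getScheme() + "://" + u.getHost();
    }

    // Count the distinct keys left when pages are keyed by full URL or by site only.
    static int distinctKeys(String[] urls, boolean stripToSite) {
        Map<String, String> store = new LinkedHashMap<>();
        for (String url : urls) {
            String key = stripToSite ? siteOf(url) : url;
            store.put(key, url); // a later page with the same key replaces the earlier one
        }
        return store.size();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/foo/bar/page.html",
            "http://example.com/other.html"
        };
        System.out.println("full URL keys:  " + distinctKeys(urls, false));
        System.out.println("site-only keys: " + distinctKeys(urls, true));
    }
}
```

With site-only keys the two pages end up under one key, so only one entry per site survives.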


You can extract the domain from the key using something like this:

String host = URLUtil.getHost(key.toString());
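
If you want to try the idea outside of Nutch, a rough stand-in using only java.net.URI could look like the following. The hostOf helper is hypothetical; lowercasing the host and returning null for unparseable input are assumptions of this sketch, not a statement about what URLUtil does.

```java
import java.net.URI;
import java.net.URISyntaxException;

class HostFromKey {
    // Hypothetical stand-in for the host lookup: returns the lowercased host,
    // or null when the string cannot be parsed or carries no host.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The key stays the full URL; the host is derived from it on demand.
        String key = "http://example.com/foo/bar/page.html";
        System.out.println(hostOf(key));
    }
}
```

The point is that the full URL stays the key and the host is computed from it whenever needed.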

And if you are looking to store it during the fetch/parse I would suggest looking at storing it in the crawl or parse metadata. To do this though you may have to modify the Fetcher job.


Dennis

On 05/30/2010 02:22 AM, Shreemoyee Sarkar wrote:

Hi,

Is it possible to store the website url instead of complete url without
affecting the crawl?

e.g. store http://example.com instead of
http://example.com/foo/bar/page.html

Would URL generation and fetching at the subsequent depths still go smoothly?

Thanks
Shreemoyee
