Nguyen Ngoc Giang wrote:
I'm writing a small program that uses Nutch only as a crawler,
with no search functionality. The program should be able to return page
content given a URL as input.
In the mapred branch this is directly supported by NutchBean.
Doug
Hi everyone,
I'm writing a small program that uses Nutch only as a crawler,
with no search functionality. The program should be able to return page
content given a URL as input. I would like to ask how we can get the page
content given only the URL, since webdb only provides a
Take a look at the cache page; it returns the content from the segment.
On 09.12.2005 at 09:24, Nguyen Ngoc Giang wrote:
Hi everyone,
I'm writing a small program which just utilizes Nutch as a crawler only,
with no search functionality. The program should be able to return page
Hi,
Thanks Stefan and Piotr for your suggestions. My doubt is the same as
Thomas's: in the segments we store only the RecNo, which can be
retrieved only via searching, which in turn requires indexing.
Can we add the URL of the page during fetching, so that the segment also
stores the URL?
Hi Nguyen
I am going to face this problem too. Here are my thoughts. One field,
say uid, will be added to the index, and its value will be
generated from the URL. Say the URL is http://www.a.com/x/y/z.html:
uid = md5_hash("http://www.a.com").append(md5_hash("/x/y/z.html"));
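
A minimal sketch of that uid scheme in plain Java, for illustration only. The names UrlUid, md5Hex and uid are hypothetical and not part of Nutch. Hex-encoding the digests and ignoring the query string are assumptions that the message does not specify.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not a Nutch class.
class UrlUid {
    // Hex-encoded MD5 digest of a string (32 lowercase hex characters).
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // uid = md5(scheme://authority) followed by md5(path).
    // Assumption: the query string and fragment are ignored.
    static String uid(String url) {
        URI u = URI.create(url);
        String site = u.getScheme() + "://" + u.getAuthority();
        String path = u.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        return md5Hex(site) + md5Hex(path);
    }

    public static void main(String[] args) {
        // Prints a 64-character hex string for the example URL.
        System.out.println(uid("http://www.a.com/x/y/z.html"));
    }
}
```

With this layout, all pages of one site share the first 32 characters of the uid, so a prefix match on the field would select a whole site, while the full 64 characters identify a single page.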
Is that OK? When I query