Re: How to get page content given URL only?

2005-12-12 Thread Doug Cutting
Nguyen Ngoc Giang wrote:
> I'm writing a small program which just utilizes Nutch as a crawler only, with no search functionality. The program should be able to return page content given a URL input.

In the mapred branch this is directly supported by NutchBean.

Doug

How to get page content given URL only?

2005-12-09 Thread Nguyen Ngoc Giang
Hi everyone, I'm writing a small program which just utilizes Nutch as a crawler only, with no search functionality. The program should be able to return page content given a URL input. I would like to ask how we can get the page content given only the URL, since webdb only provides a

Re: How to get page content given URL only?

2005-12-09 Thread Stefan Groschupf
Take a look at the cached page; it returns the content from the segment.

On 09.12.2005 at 09:24, Nguyen Ngoc Giang wrote:
> Hi everyone, I'm writing a small program which just utilizes Nutch as a crawler only, with no search functionality. The program should be able to return page

Re: How to get page content given URL only?

2005-12-09 Thread Nguyen Ngoc Giang
Hi, Thanks Stefan and Piotr for your suggestions. I have the same doubt as Thomas: in the segments we store only the RecNo, which can be retrieved only via searching, which in turn requires indexing. Can we add the URL of the page during fetching, so that the segment also stores the URL?

Re: How to get page content given URL only?

2005-12-09 Thread Jack Tang
Hi Nguyen, I am going to face this problem too. Here are my thoughts. One field, say uid, will be added to the index, and the value of uid will be generated from the URL. Say the URL is http://www.a.com/x/y/z.html: uid = md5_hash("http://www.a.com").append(md5_hash("/x/y/z.html")); Is that OK? When I query
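
For illustration, the uid scheme described above could be sketched in plain Java roughly as follows. This is only a sketch under assumptions, not Nutch code: the class name UrlUid and the helpers md5Hex and uid are hypothetical, the digests are hex-encoded (the message does not say how they are encoded), and any query string in the URL is ignored because the message hashes only the site and the path.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: builds a uid from a URL by
// concatenating md5(site) and md5(path), as proposed in the message.
class UrlUid {

    // Returns the MD5 digest of s as 32 lowercase hex characters.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Splits the URL into "scheme://authority" and the raw path,
    // hashes each part separately, and concatenates the two digests.
    // Assumption: an empty path is treated as "/".
    static String uid(String url) {
        URI u = URI.create(url);
        String site = u.getScheme() + "://" + u.getAuthority();
        String rawPath = u.getRawPath();
        String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;
        return md5Hex(site) + md5Hex(path);
    }

    public static void main(String[] args) {
        // Prints a 64-character uid for the example URL from the message.
        System.out.println(uid("http://www.a.com/x/y/z.html"));
    }
}
```

Because the first 32 characters depend only on the site, all pages of one host would share a uid prefix. Whether that property is useful depends on how the field is indexed and queried.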