Re: Nutch Crawler, Page Redirection and Pagination

2005-09-25 Thread Transbuerg Tian
I have run into the same conditions you describe. I think crawling a dynamic page is a black hole for a crawler: we cannot obtain all the necessary parameters that must be posted to a form, and when fetching dynamic pages we need to identify duplicate pages. 2005/9/26, Jack Tang [EMAIL PROTECTED]: Hi Guys
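The duplicate-page problem mentioned above can be sketched with a simple content fingerprint: hash a normalized copy of each fetched page and skip any URL whose body hashes to something already seen. This is a minimal illustration, not Nutch's own deduplication; the function names are hypothetical.

```python
import hashlib

def content_fingerprint(html: str) -> str:
    """Hash the page body after collapsing whitespace and lowercasing,
    so pages that differ only in formatting get the same fingerprint."""
    normalized = " ".join(html.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def is_duplicate(html: str, seen: set) -> bool:
    """Return True if an equivalent page was already fetched;
    otherwise record its fingerprint and return False."""
    fp = content_fingerprint(html)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```

A real crawler would use a more robust fingerprint (e.g. after stripping boilerplate markup), but the idea of keying on normalized content rather than on the URL is the same.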

Re: Nutch Crawler, Page Redirection and Pagination

2005-09-25 Thread EM
To crawl dynamic pages you need to know the dynamic structure of each website separately. Or, as I used to do it, just crawl everything in small enough chunks, and when something goes wrong, look at the website in question, determine why it happened, modify the urlfilter, and repeat the process. This
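The iterate-and-filter workflow above revolves around Nutch's regex urlfilter (conf/regex-urlfilter.txt), where each line is a `+` (accept) or `-` (reject) prefix followed by a regex, applied in order. A hedged sketch of the kind of entries one might add after a bad crawl; the host is hypothetical:

```
# reject URLs with query-like characters, a common guard against
# dynamic pages and crawler traps
-[?*!@=]

# reject an endless calendar archive discovered on a problem site
# (hypothetical path)
-^http://www\.example\.com/calendar/

# accept everything else on the target host
+^http://www\.example\.com/

# reject anything not matched above
-.
```

After editing the filter, the crawl is simply re-run in the next chunk, as the post describes.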

Re: Nutch Crawler, Page Redirection and Pagination

2005-09-25 Thread EM
I know that if you are a big user (several dedicated machines in a data center with a fast connection...) you probably don't care about this: your crawler will run over any website with 50-500 threads and the default three retry times, and the problem will sort itself out. But can something be done

Re: Nutch Crawler, Page Redirection and Pagination

2005-09-25 Thread EM
Say there is only one entry point that lists all content of the website: http://a.com/search?city=YourCity. (It acts as the search engine for the website, of course.) If I set YourCity's value to NewYork, it will list all content related to NewYork, across many pages. And the pagination URL