I have run into the same conditions you describe.
I think crawling a dynamic page is a black hole for a crawler:
we cannot get all of the parameters that need to be posted to a form,
and when we do fetch dynamic pages, we need to identify duplicate pages.
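One common way to spot duplicates among dynamic pages is to canonicalize the URL before comparing, since the same page is often reachable through query strings that differ only in parameter order or host casing. A minimal sketch (the URLs are made up for illustration, and real deduplication would also compare content checksums):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical(url):
    """Normalize a URL: lowercase scheme/host, sort query params, drop fragment."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    params = sorted(parse_qsl(query, keep_blank_values=True))
    return urlunsplit((scheme.lower(), netloc.lower(), path, urlencode(params), ""))

seen = set()

def is_duplicate(url):
    """True if a URL canonicalizing to the same key was already fetched."""
    key = canonical(url)
    if key in seen:
        return True
    seen.add(key)
    return False
```

This only catches duplicates that differ at the URL level; two distinct URLs serving identical content still need a content-based check.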
2005/9/26, Jack Tang [EMAIL PROTECTED]:
Hi Guys
To crawl dynamic pages you need to know the dynamic structure of each
website separately.
Or, as I used to do it: just crawl everything in small enough chunks, and
when something goes wrong, look at the website in question, determine why it
happened, modify the urlfilter, and repeat the process.
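The urlfilter mentioned here is essentially an ordered list of accept/reject regex rules, first match wins (Nutch's regex-urlfilter works this way). A standalone sketch of that idea, with purely illustrative patterns rather than rules from any real crawl config:

```python
import re

# Ordered (accept, pattern) rules; the first pattern that matches decides.
# Unmatched URLs are rejected. Patterns below are examples only.
RULES = [
    (False, re.compile(r"\.(gif|jpg|png|css|js)$", re.I)),  # skip static assets
    (False, re.compile(r"[?&]sessionid=")),                 # skip session-id URLs
    (True,  re.compile(r"^http://a\.com/search\?")),        # allow one search entry
    (False, re.compile(r"[?&]")),                           # reject other dynamic URLs
    (True,  re.compile(r".")),                              # accept everything else
]

def accept(url):
    """Apply the rules in order; default to reject."""
    for ok, pattern in RULES:
        if pattern.search(url):
            return ok
    return False
```

When a crawl goes wrong on one site, you add or reorder a rule like the ones above and rerun, which is the modify-and-repeat loop described here.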
I know that if you are a big user (several dedicated machines in a data
center with a fast connection...) you probably don't care about this: your
crawler will run over any website, with 50-500 threads and the default three
retry attempts, and the problem will solve itself. But can something
be done?
Say there is only one entry point that lists all the content of the website:
http://a.com/search?city=YourCity. (We can take it as the search engine on
the website, of course.)
If I give YourCity the value NewYork, it will list all the
content related to NewYork, spread over many pages. And the
pagination URL
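For a single search entry like this, the crawler can enumerate the paginated result URLs directly instead of discovering them by link extraction. A hypothetical sketch; the "page" parameter name is an assumption, since real sites use names like p, start, or offset:

```python
from urllib.parse import urlencode

def pagination_urls(base, city, pages):
    """Yield result-page URLs for one city query.

    Assumes a 1-based "page" query parameter, which is a guess about
    the site's pagination scheme.
    """
    for page in range(1, pages + 1):
        yield base + "?" + urlencode({"city": city, "page": page})

urls = list(pagination_urls("http://a.com/search", "NewYork", 3))
```

In practice you would keep requesting pages until a result page comes back empty rather than fixing the page count in advance.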