Hi
Use case:
* Scrape a given URL, e.g. mydomain.com/movies/general

* Parse this page, extract the URLs that match a certain pattern, and
download the pages for those matched URLs. Let's say the pages I want
to download are in the mydomain.com/movies/general?id=123 format (see
the sketch just below).
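
To make the fetch-and-extract step concrete, here is a minimal sketch
in Python (requests plus a regex); the URLs and the id pattern are just
the placeholders from above, and the real pattern may well differ:

import re
import requests

# placeholder seed URL and assumed detail-page pattern, for illustration only
LISTING_URL = "http://mydomain.com/movies/general"
DETAIL_RE = re.compile(r"/movies/general\?id=\d+")

def fetch(url):
    # download a page and return its HTML
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text

def extract_detail_urls(html):
    # crude href extraction; a real crawler would use an HTML parser
    hrefs = re.findall(r'href="([^"]+)"', html)
    return [h for h in hrefs if DETAIL_RE.search(h)]

for url in extract_detail_urls(fetch(LISTING_URL)):
    print("would download:", url)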

Now, the problems I am facing are:
* Pagination: mydomain.com/movies/general/2 and so on
* Links on this page whose URLs match the same pattern as this page's
URL: mydomain.com/movies/kids, mydomain.com/movies/english, etc.

So when I fetch mydomain.com/movies/general, and that page has links
both to the next page and to mydomain.com/movies/kids, my next fetch
now has two variations of pages to follow.
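
To illustrate why these get mixed together, here is a rough sketch of
the matching I mean, with made-up regexes; both the pagination link and
the other-category link satisfy the same section-level pattern, so the
pattern alone cannot tell me which batch they belong to:

import re

# assumed patterns, just for illustration
SECTION_RE = re.compile(r"mydomain\.com/movies/[a-z]+(/\d+)?$")  # listing pages, incl. pagination
DETAIL_RE = re.compile(r"mydomain\.com/movies/[a-z]+\?id=\d+$")  # movie detail pages

links_on_general_page = [
    "http://mydomain.com/movies/general/2",       # pagination -> still 'general'
    "http://mydomain.com/movies/kids",            # different category -> should be 'kids'
    "http://mydomain.com/movies/general?id=123",  # detail page to download
]

for link in links_on_general_page:
    if DETAIL_RE.search(link):
        print("download now:", link)
    elif SECTION_RE.search(link):
        # both the pagination link and the kids link end up here
        print("crawl later:", link)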

One way I thought I could deal with this is by using a batch_id. So
when I fetch mydomain.com/movies/general, I use a batchId, say
'general'. After a few iterations of these fetches, I end up fetching
pages that are the result of crawling the link
mydomain.com/movies/kids, which was on the mydomain.com/movies/general
page.
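
Concretely, what happens is roughly this (just a sketch, with a
hypothetical outlink table standing in for real fetching and parsing):
every link discovered while processing the 'general' seed inherits the
'general' batchId, including the mydomain.com/movies/kids link.

from collections import deque

# hypothetical outlink table, standing in for real fetching and parsing
OUTLINKS = {
    "http://mydomain.com/movies/general": [
        "http://mydomain.com/movies/general/2",  # pagination
        "http://mydomain.com/movies/kids",       # other category
    ],
}

# frontier of (url, batch_id) pairs; the batch_id comes from the seed
frontier = deque([("http://mydomain.com/movies/general", "general")])
seen = set()

while frontier:
    url, batch_id = frontier.popleft()
    if url in seen:
        continue
    seen.add(url)
    print("fetching", url, "under batch", batch_id)
    for link in OUTLINKS.get(url, []):
        # every outlink inherits the seed's batch_id, so
        # mydomain.com/movies/kids is queued under 'general' here
        frontier.append((link, batch_id))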

At a later point I crawl mydomain.com/movies/kids as a separate
batchId, say 'kids'.

Now, if the 'general' batch has fetched a movie 123 that is also a
'kids' movie, then the fetch with the 'kids' batch_id won't have this
movie 123. So if I want a list of the movies fetched under 'kids', I
have missed this entry.
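
In other words, the bookkeeping looks roughly like this (a plain dict
standing in for whatever store is actually used): the fetched list is
keyed by URL only, so the 'kids' run skips movie 123 and never records
it under its own batch_id.

# fetched maps each movie URL to the batch_id it was first fetched under
fetched = {}

def fetch_if_new(url, batch_id):
    # skip URLs that some earlier batch already downloaded
    if url in fetched:
        return
    fetched[url] = batch_id
    # ... the actual download would happen here ...

# the 'general' batch reaches movie 123 first
fetch_if_new("http://mydomain.com/movies/general?id=123", "general")

# later, the 'kids' batch reaches the same movie and it is skipped
fetch_if_new("http://mydomain.com/movies/general?id=123", "kids")

kids_movies = [u for u, b in fetched.items() if b == "kids"]
print(kids_movies)  # [] -> movie 123 is missing from the 'kids' list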

Sorry for the long email, but I hope this explains my problem.
