This isn't what batch IDs are for. If you're crawling only that one server and only want that specific section, use the regex-urlfilter to accept only the pages you want.
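As a rough sketch, something like the following in conf/regex-urlfilter.txt would do it. The exact patterns here are guesses based on the URLs in your mail (and I'm assuming plain http), so adjust them to your real URL structure; rules are checked top to bottom and the first match wins:

   # accept the paginated listing pages: /movies/general, /movies/general/2, ...
   +^http://mydomain\.com/movies/general(/[0-9]+)?$

   # accept the movie detail pages: /movies/general?id=123
   +^http://mydomain\.com/movies/general\?id=[0-9]+$

   # reject everything else (including /movies/kids, /movies/english, ...)
   -.

Since the URL filters are applied when outlinks are processed, links to mydomain.com/movies/kids should never enter the crawl db in the first place, and you no longer need batch IDs to separate the sections.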
On Tue, Jul 9, 2013 at 3:36 PM, h b <[email protected]> wrote:
> Hi,
> Use case:
> * Scrape a given URL, e.g. mydomain.com/movies/general
> * Parse this page and extract URLs that match a certain pattern, then
>   download the pages for those matched URLs. Let's say the pages I want to
>   download are in the mydomain.com/movies/general?id=123 format.
>
> The problems I am facing are:
> * Pagination: mydomain.com/movies/general/2 and so on
> * Links on this page whose URLs match the same regex as this page's URL:
>   mydomain.com/movies/kids, mydomain.com/movies/english, etc.
>
> So when I fetch mydomain.com/movies/general and this page has links to the
> next page as well as to mydomain.com/movies/kids, then for my next fetch I
> have two variations of pages.
>
> One way I thought I could deal with this is by using batch IDs. When I
> fetch mydomain.com/movies/general, I use a batchId, say 'general'. After a
> few iterations of these fetches, I end up fetching pages reached through a
> link to mydomain.com/movies/kids that was on the
> mydomain.com/movies/general page.
>
> At a later point I crawl mydomain.com/movies/kids as a separate batchId,
> say 'kids'.
>
> Now, if 'general' has fetched a movie 123 that is also a 'kids' movie,
> then the fetch with the 'kids' batchId won't include movie 123. So if I
> want a list of movies fetched under 'kids', I have missed this entry.
>
> Sorry for the long email, but I hope this explains my problem.

