My understanding is that when I specify a batch_id with generate, generate
marks a set of URLs to be fetched. So there should be some relation between
the URLs fetched (or marked to be fetched) and the batch_id, is that not so?

In the same context, with Solr, I added the following field
    <field name="batchId" type="string" stored="true" indexed="true"/>

to my schema.xml, hoping that I could query Solr by batchId. However,
even after reindexing and restarting Solr, I do not see batchId in the
response. I added fl=batchId to my Solr query and got back nothing.
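
For reference, this is roughly how I am checking it, as a minimal Python
sketch. The host, port, and /solr/select path are assumptions about a default
local setup, and build_query_url and docs_with_field are just illustrative
helper names, not part of Solr or Nutch.

```python
from urllib.parse import urlencode

# Assumed values: host, port, and the /solr/select handler path are
# placeholders for a default local install, not my confirmed setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(batch_id=None, fields=("batchId",), rows=10):
    """Build a Solr select URL that asks for only the given fields."""
    # q=*:* matches every document; otherwise restrict to one batchId.
    # Simplification: batch_id is assumed to contain no double quotes.
    query = "*:*" if batch_id is None else 'batchId:"{}"'.format(batch_id)
    params = {"q": query, "fl": ",".join(fields), "wt": "json", "rows": rows}
    return SOLR_SELECT_URL + "?" + urlencode(params)


def docs_with_field(response, field="batchId"):
    """Return the docs in a parsed Solr JSON response that carry `field`."""
    return [doc for doc in response["response"]["docs"] if field in doc]
```

With this, build_query_url("general") gives the URL for a query restricted to
one batch, and docs_with_field(...) on the parsed JSON response comes back
empty in my case, which is what I mean by "get back nothing".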



On Thu, Jul 11, 2013 at 4:25 AM, Bai Shen <[email protected]> wrote:

> This isn't what Batch ID is for.  If you're crawling on only the one server
> and only want that specific section, use the regex-urlfilter to accept only
> the specific pages you want.
>
>
> On Tue, Jul 9, 2013 at 3:36 PM, h b <[email protected]> wrote:
>
> > Hi
> > Use case:
> > * Scrape a given url. e.g. mydomain.com/movies/general
> >
> > * Parse this page and extract urls that match a certain pattern and
> > download the pages for these matched urls. Let's say the pages I want to
> > download have the format mydomain.com/movies/general?id=123
> >
> > Now the problem I am facing is,
> > * Pagination mydomain.com/movies/general/2 and so on
> > * Links on this page whose urls match the same regex as this page's url,
> > e.g. mydomain.com/movies/kids, mydomain.com/movies/english etc
> >
> > So when I fetch mydomain.com/movies/general, and this page has links to the
> > next page as well as to mydomain.com/movies/kids, then for my next fetch I
> > now have 2 variations of pages
> >
> > So one way I thought I can deal with this is by using batch_id. So when I
> > fetch
> > mydomain.com/movies/general, I use batchId, say 'general'
> > Over a few iterations of these fetches, I end up fetching pages that are
> > the result of following a link to mydomain.com/movies/kids which was on
> > the mydomain.com/movies/general page.
> >
> > At a later point I crawl mydomain.com/movies/kids as a separate batchId,
> > say 'kids'
> >
> > Now, if 'general' has fetched a movie 123 which is also a 'kids' movie,
> > then the fetch with 'kids' batch_id won't have this movie 123. So if I want
> > a list of movies fetched under 'kids' I have missed this entry.
> >
> > Sorry for the long email, but I hope this explains my problem.
> >
>
