Yes, generate marks the URLs with the specified batch id.  However, the
next time those URLs are generated, a new batch id will be set, and
updatedb removes the generate batch id marker from each URL.

Nutch does not send the batch id to Solr, which is why you are not able
to query it.

If you want to group URLs so they can be queried later in Solr, you need
to write an indexing filter that sets a separate field, which you can then
query in Solr.  Alternatively, you can tell Solr to look in the URL for
your general/kids/etc keyword and do searches that way.
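
As a rough sketch of the second idea, here is the kind of logic that could
derive a category from the URL.  It is plain Python for illustration only:
a real Nutch indexing filter would be a Java plugin, and the function name
category_from_url and the "movies" section default are hypothetical, not
part of Nutch or Solr.

```python
from urllib.parse import urlparse


def category_from_url(url, section="movies"):
    """Return the path segment right after `section`, or None if absent.

    Hypothetical helper: the names here are illustrative assumptions.
    """
    # urlparse only separates host from path when a scheme is present,
    # so assume http for bare URLs like "mydomain.com/movies/kids".
    if "://" not in url:
        url = "http://" + url
    segments = [s for s in urlparse(url).path.split("/") if s]
    if section in segments:
        idx = segments.index(section)
        if idx + 1 < len(segments):
            return segments[idx + 1]
    return None


if __name__ == "__main__":
    for u in ("mydomain.com/movies/general?id=123",
              "mydomain.com/movies/kids",
              "mydomain.com/movies/general/2"):
        print(u, "->", category_from_url(u))
```

Note that this only reflects the URL a page was fetched from, so a movie
reached through the general listing would still be labelled 'general' even
if it is also a kids movie.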

Make sense?


On Thu, Jul 11, 2013 at 1:13 PM, h b <[email protected]> wrote:

> My understanding is when I specify a batch_id with generate, generate marks
> a set of urls to be fetched. So there should be some relation between the
> urls fetched (or marked to be fetched) with the batch_id, is that not so?
>
> In the same context, with SOLR, I set the
>     <field name="batchId" type="string" stored="true" indexed="true"/>
>
> in my schema.xml, hoping that I can query solr by the batchId, however,
> even after reindexing, and restarting solr, I do not see the batchId in the
> response. I added fl=batchId to my solr query and get back nothing.
>
>
>
> On Thu, Jul 11, 2013 at 4:25 AM, Bai Shen <[email protected]> wrote:
>
> > This isn't what Batch ID is for.  If you're crawling on only the one
> > server and only want that specific section, use the regex-urlfilter to
> > accept only the specific pages you want.
> >
> >
> > On Tue, Jul 9, 2013 at 3:36 PM, h b <[email protected]> wrote:
> >
> > > Hi
> > > Use case:
> > > * Scrape a given url. e.g. mydomain.com/movies/general
> > >
> > > * Parse this page and extract urls that match a certain pattern and
> > > download the pages for these matched urls. lets say the pages I want to
> > > download are mydomain.com/movies/general?id=123 format
> > >
> > > Now the problem I am facing is,
> > > * Pagination mydomain.com/movies/general/2 and so on
> > > * links on this page with regex that matches the regex of this page's
> > > url, e.g. mydomain.com/movies/kids, mydomain.com/movies/english etc
> > >
> > > So when I fetch mydomain.com/movies/general and if this page has links
> > > to the next page as well as to mydomain.com/movies/kids, then for my
> > > next fetch I now have 2 variations of pages
> > >
> > > So one way I thought I can deal with this is by using batch_id. So
> > > when I fetch mydomain.com/movies/general, I use batchId, say 'general'.
> > > On a few iterations of these fetches, I end up fetching pages that are
> > > a result of a crawl from a link mydomain.com/movies/kids which was on
> > > the mydomain.com/movies/general page.
> > >
> > > At a later point I crawl mydomain.com/movies/kids as a separate
> > > batchId, say 'kids'
> > >
> > > Now, if 'general' has fetched a movie 123 which is also a 'kids' movie,
> > > then the fetch with 'kids' batch_id won't have this movie 123. So if I
> > > want a list of movies fetched under 'kids' I have missed this entry.
> > >
> > > Sorry for the long email, but I hope this explains my problem.
> > >
> >
>
