You'll have to write a plugin that does that.  Look at the parse and index
plugins.
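
To make the idea concrete, here is a minimal sketch of the bookkeeping such a plugin would need. It is plain Python, not Nutch code: a real plugin would be written in Java against Nutch's plugin interfaces, and the names seeds_per_page, seeds and outlinks are made up for illustration. It assumes you can see which pages each fetched page links to, and it records every seed a page is reachable from, so a page linked from two seeds is tagged with both.

```python
from collections import deque


def seeds_per_page(seeds, outlinks):
    """Return {page: set of seed urls the page is reachable from}.

    seeds    -- list of seed urls (the contents of seed.txt)
    outlinks -- dict mapping a page to the pages it links to (hypothetical
                stand-in for the outlinks found while parsing)
    """
    origins = {}
    for seed in seeds:
        # Breadth-first walk of the link graph starting at this seed.
        seen = {seed}
        queue = deque([seed])
        while queue:
            page = queue.popleft()
            origins.setdefault(page, set()).add(seed)
            for link in outlinks.get(page, ()):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return origins


# Example data modelled on the seed.txt in the question; page3 is
# deliberately linked from both seeds.
seeds = ["url1.com", "url2.com"]
outlinks = {
    "url1.com": ["page1", "page2", "page3"],
    "url2.com": ["page4", "page5", "page6", "page3"],
}
origins = seeds_per_page(seeds, outlinks)
```

If the set of seeds were written into a separate multi-valued field at index time, you could query Solr on that field, and a page reached from two seeds would show up under both.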


On Thu, Jul 11, 2013 at 2:12 PM, h b <[email protected]> wrote:

> It kinda does.
> But then what is the best way to tie a seed url to the url list that gets
> generated?
>
> So let's say my seed.txt has
> url1.com
> url2.com
>
> So when fetch has fetched say page1, page2, page3 from url1 and
> page4,page5,page6 from url2, after the crawl, how do I tell that page4 is
> from url2.com and page1 is from url1.com?
>
>
>
>
> On Thu, Jul 11, 2013 at 10:54 AM, Bai Shen <[email protected]> wrote:
>
> > Yes, generate marks the urls with the specified batch id.  However, the
> > next time those urls are generated, a new batch id will be set.  And
> > updatedb removes the generate batch id marker from the url.
> >
> > Nutch does not send the batch id to solr, so that is why you are not able
> > to query it.
> >
> > If you want to batch urls to be queried later by solr then you need to
> > write an indexing filter to set a separate field that you can then later
> > query with solr.  Also, you can tell solr to look in the url for your
> > general/kids/etc keyword and do searches that way.
> >
> > Make sense?
> >
> >
> > On Thu, Jul 11, 2013 at 1:13 PM, h b <[email protected]> wrote:
> >
> > > My understanding is when I specify a batch_id with generate, generate
> > > marks a set of urls to be fetched. So there should be some relation
> > > between the urls fetched (or marked to be fetched) with the batch_id,
> > > is that not so?
> > >
> > > In the same context, with SOLR, I set the
> > >     <field name="batchId" type="string" stored="true" indexed="true"/>
> > >
> > > in my schema.xml, hoping that I can query solr by the batchId, however,
> > > even after reindexing, and restarting solr, I do not see the batchId in
> > > the response. I added fl=batchId to my solr query and get back nothing.
> > >
> > >
> > >
> > > On Thu, Jul 11, 2013 at 4:25 AM, Bai Shen <[email protected]> wrote:
> > >
> > > > This isn't what Batch ID is for.  If you're crawling on only the one
> > > > server and only want that specific section, use the regex-urlfilter to
> > > > accept only the specific pages you want.
> > > >
> > > >
> > > > On Tue, Jul 9, 2013 at 3:36 PM, h b <[email protected]> wrote:
> > > >
> > > > > Hi
> > > > > Use case:
> > > > > * Scrape a given url. e.g. mydomain.com/movies/general
> > > > >
> > > > > * Parse this page and extract urls that match a certain pattern and
> > > > > download the pages for these matched urls. Let's say the pages I want
> > > > > to download are in mydomain.com/movies/general?id=123 format
> > > > >
> > > > > Now the problem I am facing is,
> > > > > * Pagination mydomain.com/movies/general/2 and so on
> > > > > * links on this page whose urls match the same regex as this page's
> > > > > url: mydomain.com/movies/kids, mydomain.com/movies/english etc
> > > > >
> > > > > So when I fetch mydomain.com/movies/general and if this page has links
> > > > > to the next page as well as to mydomain.com/movies/kids, then for my
> > > > > next fetch I now have 2 variations of pages
> > > > >
> > > > > So one way I thought I can deal with this is by using batch_id. So when
> > > > > I fetch mydomain.com/movies/general, I use batchId, say 'general'
> > > > > On a few iterations of these fetches, I end up fetching pages that are
> > > > > a result of a crawl from a link mydomain.com/movies/kids which was on
> > > > > mydomain.com/movies/general page.
> > > > >
> > > > > At a later point I crawl mydomain.com/movies/kids as a separate
> > > > > batchId, say 'kids'
> > > > >
> > > > > Now, if 'general' has fetched a movie 123 which is also a 'kids' movie,
> > > > > then the fetch with 'kids' batch_id won't have this movie 123. So if I
> > > > > want a list of movies fetched under 'kids' I have missed this entry.
> > > > >
> > > > > Sorry for the long email, but I hope this explains my problem.
> > > > >
> > > >
> > >
> >
>
