Hi,

what about the -crawlId <crawlId> option? It is available with all bin/nutch tools (inject, fetch, parse, etc.) and also with bin/crawl. It should create a new table (keyspace, schema, or whatever it is called in the chosen store) named <crawlId>_webpage.

Best,
Sebastian

On 12/16/2014 09:17 PM, Tamer Yousef wrote:
> Hi All:
> I have Nutch 2.2.1 with Cassandra on the backend and Hadoop 1.2.1, running
> in local mode (runtime/local) on a CentOS box.
>
> I'm trying to do a very simple test: crawl a dataset1 of URLs and, once done,
> crawl another dataset2 of URLs without touching the results of dataset1. I
> want to avoid going through depth 2 of the first dataset, and I want all data
> to live within the same database (or keyspace in Cassandra).
> Here is what I did:
>
> 1- Dropped the keyspace "webpage" from Cassandra.
>
> 2- Changed the value of the property "storage.schema.webpage" from
>    "webpage" to "database1" in both nutch-default and nutch-site.
>
> 3- Reran "ant runtime" just to make sure these changes are reflected in
>    my local deployment.
>
> 4- Ran the Nutch crawl script.
>
> But fetch results are still written to a newly created keyspace "webpage",
> not the one I specified in the conf files; I'm unable to change the
> destination database the data is written to.
>
> I tried to google the "crawlId" parameter. I can pass it to the crawl
> script, but I'm unable to figure out how to read it back from Cassandra;
> the Gora schema mapping does not mention it anywhere, it only mentions
> "bas", which is the batchId. I thought I could use two different crawl IDs
> for the two different URL datasets I have, so later I can query each set
> separately...
>
> So either solution will help me a lot: either figuring out how to change
> the destination DB in Cassandra, or whether the crawlId can help in
> identifying results from two distinct crawls.
>
> Any hints will be appreciated!
>
> Thanks.
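A minimal sketch of Sebastian's suggestion, assuming a Nutch 2.x local runtime (runtime/local); the seed directories and crawl IDs below are hypothetical:

```shell
# Sketch, assuming a Nutch 2.x local runtime with a configured
# Cassandra backend. Seed paths and crawl IDs are hypothetical.

# Inject each dataset under its own crawl ID (shown for illustration;
# these commands need a working Nutch/Cassandra setup to run):
#   bin/nutch inject urls/dataset1 -crawlId dataset1
#   bin/nutch inject urls/dataset2 -crawlId dataset2
#
# Each crawl then writes to its own store, so the results can later be
# read back independently, e.g.:
#   bin/nutch readdb -dump dump1 -crawlId dataset1

# The store name is the crawl ID prefixed to the schema name:
CRAWL_ID=dataset1
SCHEMA=webpage
echo "${CRAWL_ID}_${SCHEMA}"    # -> dataset1_webpage
```

With two crawl IDs, the two URL datasets stay in separate keyspaces/tables and neither crawl touches the other's results.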

