Yes, I tried that (explained after point 4 below), but it did NOT write to a new table; all data are forced into the same table. Unfortunately, for now I will crawl, get the data out of the DB using the "nutch readdb" command, drop the table, and repeat... it sounds awkward, but it works.
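For anyone following along, the dump/drop/repeat loop looks roughly like the sketch below. This is a dry run only (it prints the commands instead of executing them); the dump directory name and keyspace are placeholders for my local setup, and the exact readdb flags may differ between Nutch 2.x releases.

```shell
#!/bin/sh
# Dry-run sketch of the dump/drop/repeat workaround described above.
# Remove the leading "echo" inside the functions to run the real commands.

dump_cmd() {
  # export the current crawl's records with the readdb tool ($1 = output dir)
  echo "bin/nutch readdb -dump $1"
}

drop_cmd() {
  # drop the Cassandra keyspace so the next crawl starts clean ($1 = keyspace)
  echo "cqlsh -e 'DROP KEYSPACE $1;'"
}

dump_cmd dump_dataset1
drop_cmd webpage
```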
Tamer

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Wednesday, December 17, 2014 4:01 PM
To: [email protected]
Subject: Re: Identifying results from two distinct crawls in Nutch 2.2.1

Hi,

what about the -crawlId <crawlId> option available with all bin/nutch tools (inject, fetch, parse, etc.) and also for bin/crawl? This should start a new table (keyspace, schema, or however it's called) <crawlId>_webpage.

Best,
Sebastian

On 12/16/2014 09:17 PM, Tamer Yousef wrote:
> Hi All:
> I have Nutch 2.2.1 with Cassandra on the backend and Hadoop 1.2.1, running in local mode (runtime/local) on a CentOS box.
>
> I'm trying to do a very simple test: crawl a dataset1 of urls and, once done, crawl another dataset2 of urls without touching the results of dataset1. I want to avoid going through depth 2 of the first dataset, and I want all data to live within the same database (or keyspace in Cassandra).
> Here is what I did:
>
> 1- Dropped the keyspace "webpage" from Cassandra.
>
> 2- Changed the value of the property "storage.schema.webpage" from "webpage" to "database1" in both nutch-default and nutch-site.
>
> 3- Reran "ant runtime" just to make sure these changes are reflected in my local deployment.
>
> 4- Ran the nutch crawl script.
>
> But the fetch results are still written to a newly created keyspace "webpage", not the one I specified in the conf file; I'm unable to change the db destination of where the data is going.
>
> I tried to google the "crawlId" parameter. I can pass it to the crawl script, but I'm unable to figure out how to read it back from Cassandra; the Gora schema mapping does not mention it anywhere, it only mentions "bas", which is the batchId. I thought I could use two different crawl ids for the two different url datasets I have, so later I can query each set separately...
>
> So either solution will help me a lot: either figure out how to change the destination db in Cassandra, or whether the crawlId can help in identifying results from two distinct crawls.
>
> Any hints will be appreciated!
>
> Thanks.
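If the -crawlId route Sebastian describes works out, the two datasets could be kept apart along the lines sketched below. This is a dry run only (it prints the commands): the seed directories, crawl ids, and round count are placeholders, and the exact argument list of bin/crawl varies between Nutch 2.x releases (some require a Solr URL), so check your version's usage message first.

```shell
#!/bin/sh
# Dry-run sketch: separate crawls via -crawlId, so each crawl's data lands
# in its own <crawlId>_webpage table (per Sebastian's note). Seed dirs,
# ids, and round counts are placeholders; remove "echo" to run for real.

crawl_cmd() {
  # $1 = seed dir, $2 = crawlId, $3 = number of rounds
  echo "bin/crawl $1 $2 $3"
}

read_cmd() {
  # read back one crawl's records by its id with the readdb tool
  echo "bin/nutch readdb -dump dump_$1 -crawlId $1"
}

crawl_cmd urls/dataset1 crawl1 2
crawl_cmd urls/dataset2 crawl2 2
read_cmd crawl1
```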

