Yes, I tried that (explained after point 4 below), but it did NOT write to a new table; all data are forced into the same table. Unfortunately, for now I will crawl, get the data out of the DB using the "nutch readdb" command, drop the table, and repeat... it sounds awkward, but it works.
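For anyone following along, the dump/drop/repeat loop looks roughly like the sketch below. This is a dry run only (it prints the commands instead of executing them); the dump directory name and keyspace are placeholders for my local setup, and the exact readdb flags may differ between Nutch 2.x releases.

```shell
#!/bin/sh
# Dry-run sketch of the dump/drop/repeat workaround described above.
# Remove the leading "echo" inside the functions to run the real commands.

dump_cmd() {
  # export the current crawl's records with the readdb tool ($1 = output dir)
  echo "bin/nutch readdb -dump $1"
}

drop_cmd() {
  # drop the Cassandra keyspace so the next crawl starts clean ($1 = keyspace)
  echo "cqlsh -e 'DROP KEYSPACE $1;'"
}

dump_cmd dump_dataset1
drop_cmd webpage
```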
Tamer

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Wednesday, December 17, 2014 4:01 PM
To: [email protected]
Subject: Re: Identifying results from two distinct crawls in Nutch 2.2.1

Hi,

what about the -crawlId <crawlId> option available with all bin/nutch tools (inject, fetch, parse, etc.) and also for bin/crawl? This should start a new table (keyspace, schema, or however it's called) <crawlId>_webpage.

Best,
Sebastian

On 12/16/2014 09:17 PM, Tamer Yousef wrote:
> Hi All:
> I have Nutch 2.2.1 with Cassandra on the backend and Hadoop 1.2.1, running in local mode (runtime/local) on a CentOS box.
>
> I'm trying to do a very simple test: crawl a dataset1 of urls and, once done, crawl another dataset2 of urls without touching the results of dataset1. I want to avoid going through depth 2 of the first dataset, and I want all data to live within the same database (or keyspace in Cassandra).
> Here is what I did:
>
> 1- Dropped the keyspace "webpage" from Cassandra.
>
> 2- Changed the value of the property "storage.schema.webpage" from "webpage" to "database1" in both nutch-default and nutch-site.
>
> 3- Reran "ant runtime" just to make sure these changes are reflected in my local deployment.
>
> 4- Ran the nutch crawl script.
>
> But the fetch results are still written to a newly created keyspace "webpage", not the one I specified in the conf file; I'm unable to change the db destination of where the data is going.
>
> I tried to google the "crawlId" parameter. I can pass it to the crawl script, but I'm unable to figure out how to read it back from Cassandra; the Gora schema mapping does not mention it anywhere, it only mentions "bas", which is the batchId. I thought I could use two different crawl ids for the two different url datasets I have, so later I can query each set separately...
>
> So either solution will help me a lot: either figure out how to change the destination db in Cassandra, or whether the crawlId can help in identifying results from two distinct crawls.
>
> Any hints will be appreciated!
>
> Thanks.
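If the -crawlId route Sebastian describes works out, the two datasets could be kept apart along the lines sketched below. This is a dry run only (it prints the commands): the seed directories, crawl ids, and round count are placeholders, and the exact argument list of bin/crawl varies between Nutch 2.x releases (some require a Solr URL), so check your version's usage message first.

```shell
#!/bin/sh
# Dry-run sketch: separate crawls via -crawlId, so each crawl's data lands
# in its own <crawlId>_webpage table (per Sebastian's note). Seed dirs,
# ids, and round counts are placeholders; remove "echo" to run for real.

crawl_cmd() {
  # $1 = seed dir, $2 = crawlId, $3 = number of rounds
  echo "bin/crawl $1 $2 $3"
}

read_cmd() {
  # read back one crawl's records by its id with the readdb tool
  echo "bin/nutch readdb -dump dump_$1 -crawlId $1"
}

crawl_cmd urls/dataset1 crawl1 2
crawl_cmd urls/dataset2 crawl2 2
read_cmd crawl1
```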

