Hi,

what about the -crawlId <crawlId> option? It is available with all bin/nutch tools (inject, fetch, parse, etc.) and also with bin/crawl. It should create a new table (keyspace, schema, or whatever it is called in the chosen store) named <crawlId>_webpage.

Best,
Sebastian

On 12/16/2014 09:17 PM, Tamer Yousef wrote:
> Hi All:
> I have Nutch 2.2.1 with Cassandra on the backend and Hadoop 1.2.1, running
> in local mode (runtime/local) on a CentOS box.
>
> I'm trying to do a very simple test: crawl a dataset1 of URLs and, once done,
> crawl another dataset2 of URLs without touching the results of dataset1. I
> want to avoid going through depth 2 of the first dataset, and I want all data
> to live within the same database (or keyspace in Cassandra).
> Here is what I did:
>
> 1- Dropped the keyspace "webpage" from Cassandra.
>
> 2- Changed the value of the property "storage.schema.webpage" from
>    "webpage" to "database1" in both nutch-default and nutch-site.
>
> 3- Reran "ant runtime" just to make sure these changes are reflected in
>    my local deployment.
>
> 4- Ran the Nutch crawl script.
>
> But fetch results are still written to a newly created keyspace "webpage",
> not the one I specified in the conf files; I'm unable to change the
> destination database the data is written to.
>
> I tried to google the "crawlId" parameter. I can pass it to the crawl
> script, but I'm unable to figure out how to read it back from Cassandra;
> the Gora schema mapping does not mention it anywhere, it only mentions
> "bas", which is the batchId. I thought I could use two different crawl IDs
> for the two different URL datasets I have, so later I can query each set
> separately...
>
> So either solution will help me a lot: either figuring out how to change
> the destination DB in Cassandra, or whether the crawlId can help in
> identifying results from two distinct crawls.
>
> Any hints will be appreciated!
>
> Thanks.
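A minimal sketch of Sebastian's suggestion, assuming a Nutch 2.x local runtime (runtime/local); the seed directories and crawl IDs below are hypothetical:

```shell
# Sketch, assuming a Nutch 2.x local runtime with a configured
# Cassandra backend. Seed paths and crawl IDs are hypothetical.

# Inject each dataset under its own crawl ID (shown for illustration;
# these commands need a working Nutch/Cassandra setup to run):
#   bin/nutch inject urls/dataset1 -crawlId dataset1
#   bin/nutch inject urls/dataset2 -crawlId dataset2
#
# Each crawl then writes to its own store, so the results can later be
# read back independently, e.g.:
#   bin/nutch readdb -dump dump1 -crawlId dataset1

# The store name is the crawl ID prefixed to the schema name:
CRAWL_ID=dataset1
SCHEMA=webpage
echo "${CRAWL_ID}_${SCHEMA}"    # -> dataset1_webpage
```

With two crawl IDs, the two URL datasets stay in separate keyspaces/tables and neither crawl touches the other's results.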

