I have debugged this issue further and found something strange: my web page HTML is all mixed up, meaning url1's HTML contains chunks of url2's HTML.
But if I look into Cassandra, 'column3' of column family 'f' shows all the correct HTML content (I am using wso2carbon to visualize Cassandra's db).

In my ParseFilter I am using webPage.getContent().array() to get the complete HTML of the current parse job's URL. Is this the correct way to get the HTML of the current parser's job?

Thanks,
Tony.

On Tue, Jun 18, 2013 at 12:48 AM, Tony Mullins <tonymullins...@gmail.com> wrote:

> I have 3 urls
>
> url1
> url2
> url3
>
> And lets say I want to extract some data from these urls in my ParseFilter
> and then index it using my IndexingFilter and that data is
>
> url1 => data1, data2, data3
> url2 => data1, data2
> url3 => data1, data2, data3, data4, data5
>
> Now when I am in ParseFilter I query webPage.getBaseUrl() and if its url1
> I extract data1, data2, data3 and add them to my
> webPage.putToMetadata(key1, data1)
> webPage.putToMetadata(key2, data2)
> webPage.putToMetadata(key3, data3)
>
> And similarly for url2 and url3.
>
> Now I was expecting that when Nutch will execute my Parse URL levelFilter
> and when I will query webPage.getFromMetadata(key1) and if its in for url1
> it will return me url1's key1 data i.e data1 and so on... but its mixing up
> things. In my Solr I get mix results for url1 document, like data1 is of
> url1 but data2 is of url3 and data3 is of url2 etc.
>
> How can I make sure that when I am in my IndexingFilter and I query for
> key (which is unique at URL level, not at current crawl level) I get
> consistent data for that particular url only.
>
> Thanks,
> Tony.
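
A note on the webPage.getContent().array() call mentioned above. Assuming getContent() returns a java.nio.ByteBuffer (an assumption about the Nutch version in use), array() returns the buffer's entire backing array and ignores the buffer's position and limit. If that backing array is larger than the page, or is shared with other data, the result can contain bytes that do not belong to the current page. Whether this is the actual cause of the mixing described here is not confirmed. The sketch below uses only java.nio; the class name ContentBytes, the helper readableBytes and the sample HTML strings are hypothetical and exist only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ContentBytes {

    /**
     * Copies only the readable bytes (position up to limit) of a buffer.
     * Works on a duplicate, so the caller's position is left unchanged.
     * Hypothetical helper, not part of Nutch.
     */
    static byte[] readableBytes(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();      // shares content, own position/limit
        byte[] out = new byte[view.remaining()];
        view.get(out);                          // reads position..limit only
        return out;
    }

    public static void main(String[] args) {
        // Simulate one backing array that holds the content of two pages.
        String first = "<html>page-one</html>";
        String second = "<html>page-two</html>";
        byte[] shared = (first + second).getBytes(StandardCharsets.UTF_8);

        // A buffer that covers only the second page inside the shared array.
        ByteBuffer pageTwo = ByteBuffer.wrap(shared, first.length(), second.length());

        // array() exposes the whole backing array, including the first page.
        String viaArray = new String(pageTwo.array(), StandardCharsets.UTF_8);

        // readableBytes() respects position and limit, so only the second page is read.
        String viaView = new String(readableBytes(pageTwo), StandardCharsets.UTF_8);

        System.out.println("array():       " + viaArray);
        System.out.println("readableBytes: " + viaView);
    }
}
```

In a ParseFilter the same idea would apply to the buffer returned by webPage.getContent(), again assuming it is a ByteBuffer: read from position to limit instead of taking the raw backing array.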