OK, then I'll give HBase a try... but this is strange, as I think Lewis once said he is also using Cassandra in his Nutch setup.
Thanks,
Tony

On Fri, Jun 21, 2013 at 1:19 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> HBase seems to be the one most widely used. I haven't followed GORA lately,
> but the MySQL one was unusable.
>
> On 21 June 2013 07:17, Tony Mullins <tonymullins...@gmail.com> wrote:
>
> > Then which backend is more stable and consistent with Gora/Nutch 2.x?
> > How about MySQL and HBase?
> >
> > Thanks,
> > Tony
> >
> > On Fri, Jun 21, 2013 at 9:15 AM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> >
> > > Hi Tony,
> > > You are using the Cassandra backend, right?
> > > I think it's safe to say that there are lingering bugs in gora-cassandra.
> > > I am getting some dodgy behaviour using Cassandra 1.1.2 during large
> > > crawls.
> > >
> > > On Tue, Jun 18, 2013 at 12:40 AM, Tony Mullins <tonymullins...@gmail.com> wrote:
> > >
> > > > I have debugged this issue further and found the strange thing that my
> > > > webpage's HTML is all mixed up, meaning url1's HTML has some chunks of
> > > > url2's HTML...
> > > >
> > > > But if I look into Cassandra at column family 'f', its 'column3' shows
> > > > me all the correct HTML content (I am using WSO2 Carbon to visualize
> > > > Cassandra's DB).
> > > >
> > > > In my ParseFilter I am using webPage.getContent().array() to get the
> > > > complete HTML of the current parse job's URL.
> > > >
> > > > Is this the correct way to get the HTML of the current parse job?
> > > >
> > > > Thanks,
> > > > Tony.
> > > > On Tue, Jun 18, 2013 at 12:48 AM, Tony Mullins <tonymullins...@gmail.com> wrote:
> > > >
> > > > > I have 3 URLs:
> > > > >
> > > > > url1
> > > > > url2
> > > > > url3
> > > > >
> > > > > And let's say I want to extract some data from these URLs in my
> > > > > ParseFilter and then index it using my IndexingFilter, and that data is:
> > > > >
> > > > > url1 => data1, data2, data3
> > > > > url2 => data1, data2
> > > > > url3 => data1, data2, data3, data4, data5
> > > > >
> > > > > Now, when I am in my ParseFilter, I query webPage.getBaseUrl(), and if
> > > > > it's url1 I extract data1, data2, data3 and add them to my metadata:
> > > > >
> > > > > webPage.putToMetadata(key1, data1)
> > > > > webPage.putToMetadata(key2, data2)
> > > > > webPage.putToMetadata(key3, data3)
> > > > >
> > > > > And similarly for url2 and url3.
> > > > >
> > > > > Now I was expecting that when Nutch executes my URL-level ParseFilter
> > > > > and I query webPage.getFromMetadata(key1) for url1, it would return me
> > > > > url1's key1 data, i.e. data1, and so on... but it's mixing things up.
> > > > > In my Solr I get mixed results for url1's document: e.g. data1 is
> > > > > url1's, but data2 is url3's and data3 is url2's, etc.
> > > > >
> > > > > How can I make sure that when I am in my IndexingFilter and I query
> > > > > for a key (which is unique at the URL level, not at the current-crawl
> > > > > level), I get consistent data for that particular URL only?
> > > > >
> > > > > Thanks,
> > > > > Tony.
> > >
> > > --
> > > *Lewis*

> --
> *Open Source Solutions for Text Engineering*
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
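[Editor's note] The webPage.getContent().array() call questioned in the thread is a plausible culprit for the mixed-up HTML: ByteBuffer.array() returns the *entire* backing array, ignoring the buffer's position and limit, and a serialization layer such as Avro/Gora may reuse or share that array across records. Whether that is what happened in Tony's setup is an assumption; the ByteBuffer behaviour itself is standard Java. A minimal sketch with plain java.nio (no Nutch dependency) showing the difference between reading the raw backing array and copying only the buffer's valid window:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentWindow {

    // Safe extraction: copy only the bytes between position and limit,
    // using duplicate() so the original buffer's position is untouched.
    static String toHtml(ByteBuffer content) {
        byte[] bytes = new byte[content.remaining()];
        content.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a reused backing array that still holds stale bytes
        // from a previously parsed page after the limit.
        byte[] backing = "<html>page1</html>STALE_FROM_URL2"
                .getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.wrap(backing, 0, "<html>page1</html>".length());

        // Copies only the valid window:
        System.out.println(toHtml(buf));
        // array() leaks the stale trailing bytes as well:
        System.out.println(new String(buf.array(), StandardCharsets.UTF_8));
    }
}
```

The first line printed contains only page1's HTML; the second also contains the leftover bytes, which is exactly the "chunks of url2's html" symptom described above.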
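[Editor's note] The same mechanism would explain the metadata mixing: if putToMetadata stores a ByteBuffer that is merely a *view* over a buffer the parser reuses for the next page, every earlier entry silently changes when the buffer is overwritten. The sketch below uses a plain HashMap<String, ByteBuffer> as a stand-in for the WebPage metadata map (an assumption; the real Nutch 2.x map is keyed by Utf8) and contrasts storing a view with storing a copy:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class MetadataMixing {

    // Read a ByteBuffer's valid window without disturbing its position.
    static String read(ByteBuffer b) {
        byte[] out = new byte[b.remaining()];
        b.duplicate().get(out);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> unsafe = new HashMap<>();
        Map<String, ByteBuffer> safe = new HashMap<>();

        byte[] scratch = new byte[16]; // one buffer reused across pages

        // url1's ParseFilter pass writes its value...
        byte[] d1 = "data1".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(d1, 0, scratch, 0, d1.length);
        unsafe.put("key1", ByteBuffer.wrap(scratch, 0, d1.length));       // a view
        safe.put("key1", ByteBuffer.wrap(Arrays.copyOf(scratch, d1.length))); // a copy

        // ...then url3's pass reuses the same scratch buffer.
        byte[] d2 = "data9".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(d2, 0, scratch, 0, d2.length);

        System.out.println(read(unsafe.get("key1"))); // prints "data9" -- mixed up
        System.out.println(read(safe.get("key1")));   // prints "data1" -- correct
    }
}
```

So a defensive fix in the ParseFilter, regardless of where the sharing happens, is to copy the bytes into a fresh array before putting them into the metadata map.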