OK, then I'll give HBase a try... but this is strange, as I think Lewis once said he is also using Cassandra in his Nutch setup.
Thanks,
Tony

On Fri, Jun 21, 2013 at 1:19 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> HBase seems to be the one most widely used. I haven't followed GORA lately,
> but the MySQL one was unusable.
>
> On 21 June 2013 07:17, Tony Mullins <tonymullins...@gmail.com> wrote:
>
> > Then which backend is more stable and consistent with Gora/Nutch 2.x?
> > How about MySQL and HBase?
> >
> > Thanks,
> > Tony
> >
> > On Fri, Jun 21, 2013 at 9:15 AM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> >
> > > Hi Tony,
> > > You are using the Cassandra backend, right?
> > > I think it's safe to say that there are lingering bugs in gora-cassandra.
> > > I am getting some dodgy behaviour using Cassandra 1.1.2 during large
> > > crawls.
> > >
> > > On Tue, Jun 18, 2013 at 12:40 AM, Tony Mullins <tonymullins...@gmail.com> wrote:
> > >
> > > > I have debugged this issue further and found the strange thing that my
> > > > webpage's HTML is all mixed up, meaning url1's HTML has some chunks of
> > > > url2's HTML...
> > > >
> > > > But if I look into Cassandra at column family 'f', its 'column3' shows
> > > > me all the correct HTML content (I am using WSO2 Carbon to visualize
> > > > Cassandra's DB).
> > > >
> > > > In my ParseFilter I am using webPage.getContent().array() to get the
> > > > complete HTML of the current parse job's URL.
> > > >
> > > > Is this the correct way to get the HTML of the current parse job?
> > > >
> > > > Thanks,
> > > > Tony.
> > > > On Tue, Jun 18, 2013 at 12:48 AM, Tony Mullins <tonymullins...@gmail.com> wrote:
> > > >
> > > > > I have 3 URLs:
> > > > >
> > > > > url1
> > > > > url2
> > > > > url3
> > > > >
> > > > > And let's say I want to extract some data from these URLs in my
> > > > > ParseFilter and then index it using my IndexingFilter, and that data is:
> > > > >
> > > > > url1 => data1, data2, data3
> > > > > url2 => data1, data2
> > > > > url3 => data1, data2, data3, data4, data5
> > > > >
> > > > > Now, when I am in my ParseFilter, I query webPage.getBaseUrl(), and if
> > > > > it's url1 I extract data1, data2, data3 and add them to my metadata:
> > > > >
> > > > > webPage.putToMetadata(key1, data1)
> > > > > webPage.putToMetadata(key2, data2)
> > > > > webPage.putToMetadata(key3, data3)
> > > > >
> > > > > And similarly for url2 and url3.
> > > > >
> > > > > Now I was expecting that when Nutch executes my URL-level ParseFilter
> > > > > and I query webPage.getFromMetadata(key1) for url1, it would return me
> > > > > url1's key1 data, i.e. data1, and so on... but it's mixing things up.
> > > > > In my Solr I get mixed results for url1's document: e.g. data1 is
> > > > > url1's, but data2 is url3's and data3 is url2's, etc.
> > > > >
> > > > > How can I make sure that when I am in my IndexingFilter and I query
> > > > > for a key (which is unique at the URL level, not at the current-crawl
> > > > > level), I get consistent data for that particular URL only?
> > > > >
> > > > > Thanks,
> > > > > Tony.
> > >
> > > --
> > > *Lewis*

> --
> *Open Source Solutions for Text Engineering*
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
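[Editor's note] The webPage.getContent().array() call questioned in the thread is a plausible culprit for the mixed-up HTML: ByteBuffer.array() returns the *entire* backing array, ignoring the buffer's position and limit, and a serialization layer such as Avro/Gora may reuse or share that array across records. Whether that is what happened in Tony's setup is an assumption; the ByteBuffer behaviour itself is standard Java. A minimal sketch with plain java.nio (no Nutch dependency) showing the difference between reading the raw backing array and copying only the buffer's valid window:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentWindow {

    // Safe extraction: copy only the bytes between position and limit,
    // using duplicate() so the original buffer's position is untouched.
    static String toHtml(ByteBuffer content) {
        byte[] bytes = new byte[content.remaining()];
        content.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a reused backing array that still holds stale bytes
        // from a previously parsed page after the limit.
        byte[] backing = "<html>page1</html>STALE_FROM_URL2"
                .getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.wrap(backing, 0, "<html>page1</html>".length());

        // Copies only the valid window:
        System.out.println(toHtml(buf));
        // array() leaks the stale trailing bytes as well:
        System.out.println(new String(buf.array(), StandardCharsets.UTF_8));
    }
}
```

The first line printed contains only page1's HTML; the second also contains the leftover bytes, which is exactly the "chunks of url2's html" symptom described above.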
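[Editor's note] The same mechanism would explain the metadata mixing: if putToMetadata stores a ByteBuffer that is merely a *view* over a buffer the parser reuses for the next page, every earlier entry silently changes when the buffer is overwritten. The sketch below uses a plain HashMap<String, ByteBuffer> as a stand-in for the WebPage metadata map (an assumption; the real Nutch 2.x map is keyed by Utf8) and contrasts storing a view with storing a copy:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class MetadataMixing {

    // Read a ByteBuffer's valid window without disturbing its position.
    static String read(ByteBuffer b) {
        byte[] out = new byte[b.remaining()];
        b.duplicate().get(out);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> unsafe = new HashMap<>();
        Map<String, ByteBuffer> safe = new HashMap<>();

        byte[] scratch = new byte[16]; // one buffer reused across pages

        // url1's ParseFilter pass writes its value...
        byte[] d1 = "data1".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(d1, 0, scratch, 0, d1.length);
        unsafe.put("key1", ByteBuffer.wrap(scratch, 0, d1.length));       // a view
        safe.put("key1", ByteBuffer.wrap(Arrays.copyOf(scratch, d1.length))); // a copy

        // ...then url3's pass reuses the same scratch buffer.
        byte[] d2 = "data9".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(d2, 0, scratch, 0, d2.length);

        System.out.println(read(unsafe.get("key1"))); // prints "data9" -- mixed up
        System.out.println(read(safe.get("key1")));   // prints "data1" -- correct
    }
}
```

So a defensive fix in the ParseFilter, regardless of where the sharing happens, is to copy the bytes into a fresh array before putting them into the metadata map.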