Hi Tony,
You are using the Cassandra backend, right?
I think it's safe to say that there are lingering bugs in gora-cassandra.
I am getting some dodgy behaviour using Cassandra 1.1.2 during large crawls.
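
On the webPage.getContent().array() question quoted below: one thing that may be worth ruling out (I have not confirmed it is the cause here) is that java.nio.ByteBuffer.array() returns the whole backing array, ignoring the buffer's position and limit. If the backing array is larger than the page, or is reused between pages, array() can expose bytes that belong to another page. Whether Nutch or Gora actually hands your filter such a buffer is an assumption on my part. A minimal sketch in plain Java (the class ContentBytes and the helper visibleBytes are made-up names, and the sample HTML is invented):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ContentBytes {
    /** Copies only the bytes between position and limit; the caller's buffer is left untouched. */
    static byte[] visibleBytes(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();      // own position/limit, same underlying bytes
        byte[] out = new byte[view.remaining()];
        view.get(out);                          // advances only the duplicate
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical scenario: a backing array that still holds a longer, older page.
        byte[] backing = "<html>url2 page, longer</html>".getBytes(StandardCharsets.UTF_8);
        byte[] page1 = "<html>url1</html>".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(page1, 0, backing, 0, page1.length);

        // The buffer only "sees" the first page1.length bytes.
        ByteBuffer content = ByteBuffer.wrap(backing, 0, page1.length);

        // array() ignores the limit and returns everything, including stale bytes.
        String viaArray = new String(content.array(), StandardCharsets.UTF_8);
        // Copying the visible range yields just the current page.
        String viaView = new String(visibleBytes(content), StandardCharsets.UTF_8);

        System.out.println("array():      " + viaArray);
        System.out.println("visibleBytes: " + viaView);
    }
}
```

If the two strings differ in your ParseFilter as well, then reading only the range between position() and limit() would be the thing to try first.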



On Tue, Jun 18, 2013 at 12:40 AM, Tony Mullins <tonymullins...@gmail.com> wrote:

> I have debugged this issue further and found something strange: my webpage
> HTML is all mixed up, meaning url1's HTML has some chunks of url2's
> HTML.
>
> But if I look in Cassandra at column family 'f', its 'column3' shows
> me all the correct HTML content (I am using wso2carbon to visualize
> Cassandra's DB).
>
> In my ParseFilter I am using webPage.getContent().array() to get the complete
> HTML of the current parse job's URL.
>
> Is this the correct way to get the HTML for the current parse job?
>
>
> Thanks,
> Tony.
>
>
> On Tue, Jun 18, 2013 at 12:48 AM, Tony Mullins <tonymullins...@gmail.com> wrote:
>
> > I have 3 urls
> >
> > url1
> > url2
> > url3
> >
> > And let's say I want to extract some data from these URLs in my ParseFilter
> > and then index it using my IndexingFilter, and that data is:
> >
> > url1 => data1, data2, data3
> > url2 => data1, data2
> > url3 => data1, data2, data3, data4, data5
> >
> > Now when I am in the ParseFilter I query webPage.getBaseUrl(), and if it's url1
> > I extract data1, data2, data3 and add them to the page metadata with:
> > webPage.putToMetadata(key1, data1)
> > webPage.putToMetadata(key2, data2)
> > webPage.putToMetadata(key3, data3)
> >
> > And similarly for url2 and url3.
> >
> > Now I was expecting that when Nutch executes my Parse URL levelFilter
> > and I query webPage.getFromMetadata(key1), and it is for url1,
> > it will return url1's key1 data, i.e. data1, and so on... but it is mixing
> > things up. In Solr I get mixed results for the url1 document: data1 is from
> > url1 but data2 is from url3, data3 is from url2, etc.
> >
> > How can I make sure that when I am in my IndexingFilter and I query for a
> > key (which is unique at the URL level, not at the current crawl level) I get
> > consistent data for that particular URL only?
> >
> > Thanks,
> > Tony.
> >
>



-- 
*Lewis*
