I have debugged this issue further and found something strange: the HTML of
my web pages is mixed up. For example, url1's HTML contains chunks of url2's
HTML.

But when I look in Cassandra, 'column3' of column family 'f' shows the
correct HTML content for each URL (I am using WSO2 Carbon to browse the
Cassandra database).

In my ParseFilter I am using webPage.getContent().array() to get the complete
HTML of the URL for the current parse job.

Is this the correct way to get the HTML for the current parse job?
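
One thing worth checking here, as a possible cause rather than a confirmed
one: java.nio.ByteBuffer.array() returns the whole backing array and ignores
the buffer's position and limit. If the buffer returned by getContent() is
only a window onto a larger or reused array, array() would also hand back
bytes that belong to something else. The sketch below uses plain java.nio
only. The class name ContentBytes, the helper contentBytes() and the sample
HTML strings are made up for illustration and are not part of Nutch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ContentBytes {

    /** Copies only the readable bytes (position..limit) of the buffer. */
    static byte[] contentBytes(ByteBuffer buf) {
        // duplicate() shares the data but has its own position and limit,
        // so reading from the copy leaves the caller's buffer untouched.
        ByteBuffer view = buf.duplicate();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical situation: one backing array holding two pages.
        byte[] backing = "<html>url1</html><html>url2</html>"
                .getBytes(StandardCharsets.UTF_8);

        // A buffer that exposes only the first page (17 bytes).
        ByteBuffer page1 = ByteBuffer.wrap(backing, 0, 17);

        // array() ignores position/limit and returns the whole backing array.
        String viaArray = new String(page1.array(), StandardCharsets.UTF_8);

        // Respecting position/limit returns just this buffer's content.
        String viaCopy = new String(contentBytes(page1), StandardCharsets.UTF_8);

        System.out.println("array(): " + viaArray);
        System.out.println("copy   : " + viaCopy);
    }
}
```

In a ParseFilter the same helper could be called as
contentBytes(webPage.getContent()) in place of webPage.getContent().array(),
assuming getContent() returns a java.nio.ByteBuffer as it does in my setup.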


Thanks,
Tony.


On Tue, Jun 18, 2013 at 12:48 AM, Tony Mullins <tonymullins...@gmail.com> wrote:

> I have 3 urls
>
> url1
> url2
> url3
>
> And let's say I want to extract some data from these urls in my ParseFilter
> and then index it using my IndexingFilter, and that data is
>
> url1 => data1 , data2,data3
> url2 => data1 , data2
> url3 => data1, data2, data3, data4,data5
>
> Now when I am in the ParseFilter I query webPage.getBaseUrl(), and if it is
> url1 I extract data1, data2 and data3 and add them to the page metadata:
>
> webPage.putToMetadata(key1, data1)
> webPage.putToMetadata(key2, data2)
> webPage.putToMetadata(key3, data3)
>
> And similarly for url2 and url3.
>
> Now I was expecting that when Nutch executes my Parse URL levelFilter
> and I query webPage.getFromMetadata(key1) for url1, it would return url1's
> key1 data, i.e. data1, and so on. But it is mixing things up: in Solr I get
> mixed results for the url1 document, e.g. data1 is from url1 but data2 is
> from url3 and data3 is from url2.
>
> How can I make sure that when I am in my IndexingFilter and I query for a
> key (which is unique at the URL level, not at the current crawl level), I get
> consistent data for that particular URL only?
>
> Thanks,
> Tony.
>
