Hi,

When a page is re-crawled in Nutch and new outlink URLs are identified along
with the existing ones, the old outlinks are removed and only the new URLs
are written to HBase.

For example, crawl cycle 1 for www.123.com identifies the outlinks:

abc.com
pqr.com

Crawl cycle 2 of the same www.123.com identifies (note that abc.com is gone
and xyz.com has been added):

pqr.com
xyz.com

At the end of crawl cycle 2, HBase has only xyz.com (expected: both pqr.com
and xyz.com). Looking at the code in ParseUtil.java, it appears to remove
the old links and insert only the new ones:

if (page.getOutlinks() != null) {
  page.getOutlinks().clear();
}

Has anyone faced this issue, and is there a fix for it?

Details of our cluster:
10-node EC2 cluster on hadoop-0.20.205
Nutch - 2.1
HBase - 0.90.6

Thanks,
Senthil
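[Editor's note: one possible direction, not the actual Nutch patch. In Nutch 2.x the outlinks are held in a map on the Gora-backed WebPage; instead of clearing that map before adding the freshly parsed links, the old entries could be merged with the new ones. A minimal, hypothetical sketch, using a plain `java.util.Map` as a stand-in for the Avro/Gora map:]

```java
import java.util.HashMap;
import java.util.Map;

public class OutlinkMerge {
    // Merge newly parsed outlinks into the existing set instead of
    // clearing it first; for a URL seen in both cycles, the newly
    // parsed anchor text wins.
    static Map<String, String> merge(Map<String, String> existing,
                                     Map<String, String> parsed) {
        Map<String, String> merged = new HashMap<>(existing);
        merged.putAll(parsed);
        return merged;
    }

    public static void main(String[] args) {
        // Crawl cycle 1 outlinks of www.123.com (example from above).
        Map<String, String> cycle1 = new HashMap<>();
        cycle1.put("http://abc.com", "abc");
        cycle1.put("http://pqr.com", "pqr");

        // Crawl cycle 2 outlinks: abc.com dropped, xyz.com added.
        Map<String, String> cycle2 = new HashMap<>();
        cycle2.put("http://pqr.com", "pqr");
        cycle2.put("http://xyz.com", "xyz");

        Map<String, String> merged = merge(cycle1, cycle2);
        // abc.com from cycle 1 survives alongside pqr.com and xyz.com.
        System.out.println(merged.keySet());
    }
}
```

Whether stale outlinks should be kept forever is a separate question; a real fix would likely need an expiry policy rather than an unconditional merge.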
--
View this message in context:
http://lucene.472066.n3.nabble.com/Nutch-New-outlinks-removes-old-valid-outlinks-tp4146676.html
Sent from the Nutch - User mailing list archive at Nabble.com.