[ https://issues.apache.org/jira/browse/NUTCH-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1282:
----------------------------------------
    Fix Version/s: 1.7

> linkdb scalability
> ------------------
>
>                 Key: NUTCH-1282
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1282
>             Project: Nutch
>          Issue Type: Improvement
>          Components: linkdb
>    Affects Versions: 1.4
>            Reporter: behnam nikbakht
>             Fix For: 1.7
>
>
> As described in NUTCH-1054, the linkdb is optional in solrindex; it is
> used only for anchors and has no impact on scoring.
> It appears that the linkdb grows very quickly during incremental crawls,
> which makes it unscalable for very large web sites.
> So there are two choices: one, drop invertlinks and the linkdb from the
> crawl; or two, make the linkdb scalable.
> invertlinks runs two jobs: the first constructs a new linkdb from the newly
> parsed segments, and the second merges the new linkdb with the old linkdb.
> The second job is the unscalable one, and we can skip it with this change
> in solrindex:
> in the reduce method of the class IndexerMapReduce, if fetchDatum == null or
> dbDatum == null or parseText == null or parseData == null, then add the
> anchors to the doc and update Solr (no insert).
> Some changes to NutchDocument are also required here.
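
The decision proposed for the reduce method can be sketched roughly as follows. This is a minimal, self-contained illustration and not Nutch code: the class AnchorOnlyUpdateSketch, the Collected holder, and the Action enum are hypothetical names, and skipping records that carry no anchors is an added assumption. Only the null checks on fetchDatum, dbDatum, parseText, and parseData come from the issue description.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class AnchorOnlyUpdateSketch {

    /** What the indexer would do with the values collected for one URL. */
    enum Action { FULL_INSERT, ANCHOR_UPDATE, SKIP }

    /** Hypothetical stand-in for the values a reducer collects for one URL. */
    static final class Collected {
        final Object fetchDatum;
        final Object dbDatum;
        final Object parseText;
        final Object parseData;
        final List<String> anchors;

        Collected(Object fetchDatum, Object dbDatum, Object parseText,
                  Object parseData, List<String> anchors) {
            this.fetchDatum = fetchDatum;
            this.dbDatum = dbDatum;
            this.parseText = parseText;
            this.parseData = parseData;
            // Treat a missing anchor list as empty so decide() never sees null.
            this.anchors = (anchors == null)
                    ? Collections.<String>emptyList()
                    : anchors;
        }
    }

    /**
     * Complete records are inserted as full documents. Incomplete records
     * are sent as anchor-only updates instead of being dropped.
     * Assumption (not in the issue): with no anchors there is nothing to send.
     */
    static Action decide(Collected c) {
        boolean incomplete = c.fetchDatum == null || c.dbDatum == null
                || c.parseText == null || c.parseData == null;
        if (!incomplete) {
            return Action.FULL_INSERT;
        }
        return c.anchors.isEmpty() ? Action.SKIP : Action.ANCHOR_UPDATE;
    }

    public static void main(String[] args) {
        // A URL seen only through new inlinks: no fetch or parse data yet.
        Collected onlyAnchors =
                new Collected(null, null, null, null, Arrays.asList("Apache Nutch"));
        System.out.println(decide(onlyAnchors));
    }
}
```

With this split, new anchors from a fresh segment could reach existing Solr documents without first merging the new linkdb into the old one, which is the step the issue identifies as unscalable.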