Moreno Feltscher created NUTCH-2496:
---------------------------------------

             Summary: Speed up link inversion step in crawling script
                 Key: NUTCH-2496
                 URL: https://issues.apache.org/jira/browse/NUTCH-2496
             Project: Nutch
          Issue Type: Improvement
            Reporter: Moreno Feltscher

While working on a project where I have to index a huge number of URLs, I encountered an issue with the link inversion step of the crawling script.

A while ago Ian Lopata stumbled upon the same issue, as described here: http://lucene.472066.n3.nabble.com/InvertLinks-Performance-Nutch-1-6-td4183004.html

{quote}
I am running the invertlinks step in my Nutch 1.6 based crawl process on a single node. I run invertlinks only because I need the Inlinks in the indexer step so as to store them with the document. I do not need the anchor text and I am not scoring. I am finding that invertlinks (and more specifically the merge of the linkdb) takes a long time - about 30 minutes for a crawl of around 150K documents. I am looking for ways that I might shorten this processing time. Any suggestions?
{quote}

Back then [~wastl-nagel] suggested turning off the normalizers and filters during the inversion step, which speeds up the process considerably. In my case, however, I depend on the normalizers and filters, so this is not a real solution.

I opened this issue to get feedback on how we could improve the crawl script and speed up the link inversion step.
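For reference, the workaround mentioned above (turning off normalizers and filters during link inversion) could be sketched roughly as follows. This is only an illustration, not the actual crawl script: the helper name build_invertlinks_args and the paths crawl/linkdb and crawl/segments are hypothetical, and the sketch assumes the -noNormalize and -noFilter options of the Nutch 1.x invertlinks command. It prints the command line instead of running it.

```shell
#!/bin/sh
# Sketch only: assemble an invertlinks command line, optionally skipping
# URL normalization and filtering. Helper name and paths are hypothetical;
# the paths are assumed to contain no spaces.

# Usage: build_invertlinks_args LINKDB SEGMENTS_DIR SKIP_NORMALIZE_AND_FILTER
build_invertlinks_args() {
  args="invertlinks $1 -dir $2"
  if [ "$3" = "true" ]; then
    # Assumed Nutch 1.x options: skip normalizers and filters during inversion
    args="$args -noNormalize -noFilter"
  fi
  printf '%s\n' "$args"
}

# Print the command rather than executing it
echo "bin/nutch $(build_invertlinks_args crawl/linkdb crawl/segments true)"
```

Passing "false" as the third argument leaves normalization and filtering enabled, which is the case this issue is about.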