[jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains
[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702326#action_12702326 ]

Julien Nioche commented on NUTCH-477:
-------------------------------------

Having a scope for the URL filters could be useful in cases where we want to do a focused crawl. If, for instance, we want to parse a limited number of domains, we could have different filters to use in ParseOutputFormat (so that we keep some of the outgoing links, using the usual prefix and suffix filters for instance) and in CrawlDBFilter (so that we keep only the URLs matching our limited set of domains). Another way of doing it would be to have a different set of filters for the generation step, to fetch only within the domains of interest but keep all URLs in the crawldb. Of course we could have custom scorers give a low score to the URLs we don't want to fetch and set a threshold in the generation step, but IMHO being able to do that with the filters would be more elegant.

> Extend URLFilters to support different filtering chains
> -------------------------------------------------------
>
>                 Key: NUTCH-477
>                 URL: https://issues.apache.org/jira/browse/NUTCH-477
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.1
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Minor
>             Fix For: 1.1
>         Attachments: urlfilters.patch
>
>
> I propose to make the following changes to URLFilters:
> * extend URLFilters so that they support different filtering rules depending on the context where they are executed. This functionality mirrors the one that URLNormalizers already support.
> * change their return value to an int code, in order to support early termination of long filtering chains.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
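A minimal sketch of how the two proposed changes could fit together: a scope argument selects per-context rules (mirroring URLNormalizers), and an int return code lets a long chain terminate early. The interface, the constant names, and the "crawldb" scope below are illustrative assumptions, not Nutch's actual plugin API.

```java
import java.util.List;

public class ScopedFilterSketch {
    // Illustrative return codes enabling early termination of a chain.
    static final int PASS = 0;   // keep the URL, continue down the chain
    static final int REJECT = 1; // drop the URL, stop immediately
    static final int ACCEPT = 2; // keep the URL, skip the remaining filters

    // Hypothetical scope-aware variant of the URLFilter extension point.
    interface ScopedURLFilter {
        int filter(String url, String scope);
    }

    // A filter that restricts URLs to one domain, but only when it runs
    // in the (hypothetical) "crawldb" scope; in other scopes it passes.
    static ScopedURLFilter domainFilter(String allowedDomain) {
        return (url, scope) -> {
            if (!"crawldb".equals(scope)) return PASS;
            return url.contains(allowedDomain) ? PASS : REJECT;
        };
    }

    static boolean runChain(List<ScopedURLFilter> chain, String url, String scope) {
        for (ScopedURLFilter f : chain) {
            int code = f.filter(url, scope);
            if (code == REJECT) return false; // early termination: drop
            if (code == ACCEPT) return true;  // early termination: keep
        }
        return true; // every filter said PASS
    }

    public static void main(String[] args) {
        List<ScopedURLFilter> chain = List.of(domainFilter("example.org"));
        System.out.println(runChain(chain, "http://example.org/page", "crawldb"));  // true
        System.out.println(runChain(chain, "http://other.com/page", "crawldb"));    // false
        System.out.println(runChain(chain, "http://other.com/page", "outlinks"));   // true
    }
}
```

With the boolean API, every filter in the chain runs even after one has definitively rejected a URL; the int codes make the common "reject early" case cheap for long chains.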
[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702412#action_12702412 ]

Julien Nioche commented on NUTCH-692:
-------------------------------------

OK, I had the same problem again on my main cluster: one of the nodes lost contact with the master during a parse, and the subsequent attempts failed with AlreadyBeingCreatedException.

I managed to reproduce the problem locally using a fresh copy from SVN by hacking the BasicURLNormalizer to make it sleep for 5 minutes every time it gets a URL, which gave me plenty of time to fail a reduce task with:

./hadoop job -fail-task attempt_200904241525_0007_r_00_0

As expected, the following attempts failed with AlreadyBeingCreatedException. I did the same experiment using your patch and can confirm that it solves the problem. Thanks. J.

> AlreadyBeingCreatedException with Hadoop 0.19
> ---------------------------------------------
>
>                 Key: NUTCH-692
>                 URL: https://issues.apache.org/jira/browse/NUTCH-692
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Julien Nioche
>         Attachments: NUTCH-692.patch
>
>
> I have been using the SVN version of Nutch on an EC2 cluster and got some AlreadyBeingCreatedException errors during the reduce phase of a parse. For some reason one of my tasks crashed, and then I ran into this AlreadyBeingCreatedException when other nodes tried to pick it up.
> There was recently a discussion on the Hadoop user list about similar issues with Hadoop 0.19 (see http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried using 0.18.2 yet but will do if the problems persist with 0.19.
> I was wondering whether anyone else had experienced the same problem. Do you think 0.19 is stable enough to use for Nutch 1.0? I will be running a crawl on a super large cluster in the next couple of weeks and will confirm this issue. J.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
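The stalling hack described in the comment can be sketched roughly as below. This is an illustrative stand-alone class, not Nutch's actual BasicURLNormalizer: the normalization body is a trivial stand-in, and the delay is shortened from the 5 minutes used in the actual experiment so the sketch finishes quickly; the point is only that an artificial per-URL stall keeps the reduce task alive long enough to fail it by hand with `hadoop job -fail-task`.

```java
public class SlowNormalizerSketch {
    // The actual experiment used 5 * 60 * 1000 ms; shortened here.
    static final long DELAY_MS = 200;

    // Stand-in for a normalize() method: stall, then do trivial cleanup.
    static String normalize(String url) {
        try {
            Thread.sleep(DELAY_MS); // artificial stall on every URL
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return url.trim().toLowerCase(); // trivial stand-in normalization
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        String out = normalize("  HTTP://Example.org/Page  ");
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(out);                 // http://example.org/page
        System.out.println(elapsed >= DELAY_MS); // true: the stall happened
    }
}
```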
Hudson build is back to normal: Nutch-trunk #794
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/794/