[jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains

2009-04-24 Thread Julien Nioche (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702326#action_12702326 ]

Julien Nioche commented on NUTCH-477:
-

Having a scope for the URL filters could be useful in cases where we want to do 
a focused crawl. If, for instance, we want to parse a limited number of domains, 
we could use different filters in ParseOutputFormat (so that we keep some of 
the outgoing links, using the usual prefix and suffix filters for instance) 
and in CrawlDBFilter (so that we keep only the URLs matching our limited set 
of domains).

Another way of doing this would be to have a different set of filters for the 
Generation, so that we fetch only within the domains of interest but keep all 
URLs in the crawlDB.

Of course we can have custom scorers to give a low score to URLs we don't want 
to fetch and set a threshold in the Generation, but IMHO being able to do that 
with the filters would be more elegant.
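
As a rough illustration of the idea, here is a minimal, self-contained sketch 
of per-scope filter chains. It is not the Nutch API: SimpleUrlFilter, 
ScopedUrlFilters, the scope names "outlink" and "crawldb", and the domain 
example.org are all hypothetical names made up for this example. The only 
convention borrowed from Nutch is that a filter returns the URL to keep it and 
null to reject it.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a URL filter: returns the URL to keep it, null to reject it.
interface SimpleUrlFilter {
    String filter(String url);
}

// Hypothetical registry holding one filter chain per scope name.
class ScopedUrlFilters {
    private final Map<String, List<SimpleUrlFilter>> chains = new HashMap<>();

    void add(String scope, SimpleUrlFilter f) {
        chains.computeIfAbsent(scope, k -> new ArrayList<>()).add(f);
    }

    // Runs the chain registered for the scope; an unknown scope keeps the URL unchanged.
    String filter(String url, String scope) {
        for (SimpleUrlFilter f : chains.getOrDefault(scope, List.of())) {
            url = f.filter(url);
            if (url == null) {
                return null; // rejected by one filter in this scope
            }
        }
        return url;
    }

    // Builds the two chains described above (all values are assumptions).
    static ScopedUrlFilters example() {
        ScopedUrlFilters filters = new ScopedUrlFilters();
        // Parsing scope: loose prefix and suffix checks, outgoing links mostly kept.
        filters.add("outlink", u -> u.startsWith("http") ? u : null);
        filters.add("outlink", u -> u.endsWith(".jpg") ? null : u);
        // CrawlDB scope: keep only URLs whose host is in the limited set of domains.
        Set<String> domains = Set.of("example.org");
        filters.add("crawldb", u -> {
            String host = URI.create(u).getHost();
            return host != null && domains.contains(host) ? u : null;
        });
        return filters;
    }
}
```

With this setup, a link to another domain passes the "outlink" chain but is 
rejected by the "crawldb" chain, which is the focused-crawl behaviour 
described above.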

 Extend URLFilters to support different filtering chains
 ---

 Key: NUTCH-477
 URL: https://issues.apache.org/jira/browse/NUTCH-477
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.1

 Attachments: urlfilters.patch


 I propose to make the following changes to URLFilters:
 * extend URLFilters so that they support different filtering rules depending 
 on the context where they are executed. This functionality mirrors the one 
 that URLNormalizers already support.
 * change their return value to an int code, in order to support early 
 termination of long filtering chains.
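
 The second point could look roughly like the sketch below. The proposal does 
 not spell out the codes, so the constants PASS, ACCEPT and REJECT, the 
 CodedUrlFilter interface and the CodedFilterChain class are assumptions made 
 for illustration only, not the contents of the attached patch.

```java
import java.util.List;

// Hypothetical filter returning an int code instead of the URL string.
interface CodedUrlFilter {
    int PASS = 0;   // no decision, continue with the next filter
    int ACCEPT = 1; // accept the URL and stop the chain
    int REJECT = 2; // reject the URL and stop the chain

    int filter(String url);
}

// Hypothetical chain that stops as soon as one filter makes a final decision.
class CodedFilterChain {
    private final List<CodedUrlFilter> filters;
    int evaluated; // number of filters run by the last call, for illustration only

    CodedFilterChain(List<CodedUrlFilter> filters) {
        this.filters = filters;
    }

    // Returns true if the URL is kept. Stops at the first ACCEPT or REJECT.
    boolean accepts(String url) {
        evaluated = 0;
        for (CodedUrlFilter f : filters) {
            evaluated++;
            int code = f.filter(url);
            if (code == CodedUrlFilter.ACCEPT) {
                return true;
            }
            if (code == CodedUrlFilter.REJECT) {
                return false;
            }
        }
        return true; // every filter passed: keep the URL (an assumption of this sketch)
    }

    // Example chain: the third filter is only reached when the first two pass.
    static CodedFilterChain example() {
        return new CodedFilterChain(List.<CodedUrlFilter>of(
            u -> u.endsWith(".exe") ? CodedUrlFilter.REJECT : CodedUrlFilter.PASS,
            u -> u.startsWith("http://example.org/") ? CodedUrlFilter.ACCEPT : CodedUrlFilter.PASS,
            u -> CodedUrlFilter.REJECT));
    }
}
```

 In this sketch a URL ending in .exe is rejected after a single filter call, 
 so the rest of a long chain is never evaluated for it.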

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-692) AlreadyBeingCreatedException with Hadoop 0.19

2009-04-24 Thread Julien Nioche (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702412#action_12702412 ]

Julien Nioche commented on NUTCH-692:
-

OK, I had the same problem again on my main cluster: one of the nodes lost 
contact with the master during a parse, and the subsequent attempts failed 
with AlreadyBeingCreatedException.

I managed to reproduce the problem locally using a fresh copy from SVN by 
hacking the BasicURLNormalizer to make it sleep for 5 minutes every time it 
gets a URL, which gave me plenty of time to fail a reduce task with 

./hadoop job -fail-task attempt_200904241525_0007_r_00_0

As expected, the following attempts failed with AlreadyBeingCreatedException.

I did the same experiment using your patch and can confirm that it solves the 
problem. 

Thanks

J.

 AlreadyBeingCreatedException with Hadoop 0.19
 -

 Key: NUTCH-692
 URL: https://issues.apache.org/jira/browse/NUTCH-692
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Julien Nioche
 Attachments: NUTCH-692.patch


 I have been using the SVN version of Nutch on an EC2 cluster and got some 
 AlreadyBeingCreatedException during the reduce phase of a parse. For some 
 reason one of my tasks crashed and then I ran into this 
 AlreadyBeingCreatedException when other nodes tried to pick it up.
 There was recently a discussion on the Hadoop user list about similar issues 
 with Hadoop 0.19 (see 
 http://markmail.org/search/after+upgrade+to+0%2E19%2E0). I have not tried 
 using 0.18.2 yet but will do so if the problems persist with 0.19.
 I was wondering whether anyone else had experienced the same problem. Do you 
 think 0.19 is stable enough to use it for Nutch 1.0?
 I will be running a crawl on a much larger cluster in the next couple of 
 weeks and will confirm whether the issue occurs there.
 J.  




Hudson build is back to normal: Nutch-trunk #794

2009-04-24 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/794/