[jira] Commented: (NUTCH-888) Remove parse-rss

2010-08-16 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898828#action_12898828 ] Doğacan Güney commented on NUTCH-888: +1 > Remove parse-rss

[jira] Commented: (NUTCH-887) Delegate parsing of feeds to Tika

2010-08-16 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898827#action_12898827 ] Julien Nioche commented on NUTCH-887: - Have created https://issues.apache.org/jira/brows

[jira] Created: (NUTCH-888) Remove parse-rss

2010-08-16 Thread Julien Nioche (JIRA)
Remove parse-rss Key: NUTCH-888 URL: https://issues.apache.org/jira/browse/NUTCH-888 Project: Nutch Issue Type: Task Components: parser Affects Versions: 2.0 Reporter: Julien Nioche Assi

Re: When a crawl goes bad...

2010-08-16 Thread Julien Nioche
It's probably more an issue with DNS resolution than robots.txt. Even if you respect the robots.txt instructions you can still have N host or even domain names pointing to a single server. This can be avoided in Nutch by setting 'partition.url.mode' and 'fetcher.queue.mode' to 'byIP'. On 16 Augus
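A rough illustration of the point in the message above: when several host names resolve to one server, per-host fetch queues treat them as separate targets, while per-IP queues collapse them into one. This is only a sketch and not Nutch code. The DNS table, the URLs, and the function names are made up for the example. Only the property names and the 'byIP' value come from the message, and 'byHost' is used here as the per-host counterpart.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical DNS table (assumption): three host names share one server.
FAKE_DNS = {
    "www.example.com": "192.0.2.10",
    "blog.example.com": "192.0.2.10",
    "shop.example.net": "192.0.2.10",
    "other.example.org": "192.0.2.20",
}

# Hypothetical URLs to be fetched.
URLS = [
    "http://www.example.com/a",
    "http://blog.example.com/b",
    "http://shop.example.net/c",
    "http://other.example.org/d",
]


def queue_key(url, mode):
    """Return the key under which a URL is queued for the given mode."""
    host = urlparse(url).hostname
    if mode == "byHost":
        return host
    if mode == "byIP":
        # A real crawler would do a DNS lookup here; the table stands in for it.
        return FAKE_DNS[host]
    raise ValueError("unknown mode: " + mode)


def build_queues(urls, mode):
    """Group URLs into fetch queues keyed by host name or by IP address."""
    queues = defaultdict(list)
    for url in urls:
        queues[queue_key(url, mode)].append(url)
    return dict(queues)


if __name__ == "__main__":
    for mode in ("byHost", "byIP"):
        print(mode, build_queues(URLS, mode))
```

With per-host grouping the three names on 192.0.2.10 each get their own queue, so politeness delays applied per queue would not stop them from hitting the same machine at once. Grouping by IP puts all three in a single queue.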

Re: When a crawl goes bad...

2010-08-16 Thread CatOs Mandros
Rather amusing :) Something similar was what made Grub gain a bit of bad reputation... thank god we have the robots.txt file. On Sat, Aug 14, 2010 at 7:48 PM, Mattmann, Chris A (388J) wrote: > LOL... > > > On 8/14/10 8:57 AM, "Ken Krugler" wrote: > > Dear @80legs stop crushing metafilter.com fr