[jira] Created: (NUTCH-242) Add optional -urlFiltering to updatedb

2006-03-30 Thread Rod Taylor (JIRA)
Add optional -urlFiltering to updatedb -- Key: NUTCH-242 URL: http://issues.apache.org/jira/browse/NUTCH-242 Project: Nutch Type: New Feature Versions: 0.8-dev Reporter: Rod Taylor Allow filtering the URLs completely

[jira] Updated: (NUTCH-242) Add optional -urlFiltering to updatedb

2006-03-30 Thread Rod Taylor (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-242?page=all ] Rod Taylor updated NUTCH-242: - Attachment: nutch_crawldb_filtering.patch Add optional -urlFiltering to updatedb -- Key: NUTCH-242 URL:

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-03-30 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372556 ] Doug Cutting commented on NUTCH-171: Ideally we could overlap segment2 map with segment1 reduce to keep bandwidth usage constant. Overlapping map2 with reduce1 should

[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-30 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372574 ] Doug Cutting commented on NUTCH-240: First, I hope my critical remarks were not taken personally. I am thankful for this and all of your contributions. Initially, I did

[jira] Commented: (NUTCH-242) Add optional -urlFiltering to updatedb

2006-03-30 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-242?page=comments#action_12372581 ] Doug Cutting commented on NUTCH-242: Shouldn't you use the returned value of the filter? If so, then this should be done in a mapper, not in the reducer. Add optional
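
The point in the comment above, using the filter's returned value and doing so on the map side, can be illustrated with a small sketch. This is not the attached patch and does not use Nutch or Hadoop classes: `UrlFilterMapSketch`, `mapUrls`, `exampleFilter`, and `groupCounts` are hypothetical names, and the only assumption carried over is that a filter returns either a (possibly rewritten) URL or null to reject it. The likely reasoning is that if the filter can rewrite a URL, the rewrite has to happen before records are grouped by key, so that entries for the same rewritten URL reach the reduce step together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

// Hypothetical, self-contained sketch; none of these names come from Nutch.
final class UrlFilterMapSketch {

    // Map-side step: pass each URL through the filter and emit the RETURNED
    // value as the key. A null return means the URL is dropped.
    static List<String> mapUrls(List<String> urls, UnaryOperator<String> filter) {
        List<String> keys = new ArrayList<>();
        for (String url : urls) {
            String filtered = filter.apply(url);
            if (filtered != null) {
                keys.add(filtered); // use the returned value, not the input
            }
        }
        return keys;
    }

    // Toy filter (an assumption for illustration): rejects anything that is
    // not http:// and strips a trailing #fragment from what it accepts.
    static String exampleFilter(String url) {
        if (!url.startsWith("http://")) {
            return null;
        }
        int hash = url.indexOf('#');
        return hash >= 0 ? url.substring(0, hash) : url;
    }

    // Stand-in for the shuffle/reduce grouping: counts records per key.
    static Map<String, Integer> groupCounts(List<String> keys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

With the inputs `http://a.example/x#top`, `ftp://a.example/y`, and `http://a.example/x`, the map step emits the key `http://a.example/x` twice and drops the ftp URL, so the grouping step sees a single key with two records. Had the filter run only after grouping, the two http entries would already have been treated as separate keys.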

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-03-30 Thread Rod Taylor (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372588 ] Rod Taylor commented on NUTCH-171: -- One thing that's needed is the ability to mark urls as being fetched, which was in 0.7 but has not yet made it into 0.8. In addition, we

[jira] Commented: (NUTCH-242) Add optional -urlFiltering to updatedb

2006-03-30 Thread Rod Taylor (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-242?page=comments#action_12372592 ] Rod Taylor commented on NUTCH-242: -- Shouldn't you use the returned value of the filter? I forgot about URL Normalization (focused on expunging the data only). I suppose

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-03-30 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372597 ] Doug Cutting commented on NUTCH-171: Generate for 20 Segments of 10M in size is almost as fast as 1 segment that is 10M in size. A single 200M URL segment is unwieldy

[jira] Commented: (NUTCH-241) Non-informative error message

2006-03-30 Thread Rod Taylor (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-241?page=comments#action_12372599 ] Rod Taylor commented on NUTCH-241: -- I couldn't say about the accuracy of the rest, but this phrasing specifically is very confusing: mapred.child.heap.size is deprecated. Use

[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-03-30 Thread Rod Taylor (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372602 ] Rod Taylor commented on NUTCH-171: -- How is a 200M url segment unwieldy? There are two reasons why I have found this. First, Nutch still has a bad habit of not completing a

[jira] Updated: (NUTCH-242) Add optional -urlFiltering to updatedb

2006-03-30 Thread Rod Taylor (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-242?page=all ] Rod Taylor updated NUTCH-242: - Attachment: nutch_urlfilter.patch How about this one instead? It creates a CrawlDbMapper class which does the filtering when requested. Add optional -urlFiltering

[jira] Commented: (NUTCH-242) Add optional -urlFiltering to updatedb

2006-03-30 Thread Rod Taylor (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-242?page=comments#action_12372620 ] Rod Taylor commented on NUTCH-242: -- Sorry. I cannot mark the first patch as being invalid and neglected to use a version number. v2 is named nutch_urlfilter.patch Add