Add optional -urlFiltering to updatedb
--
Key: NUTCH-242
URL: http://issues.apache.org/jira/browse/NUTCH-242
Project: Nutch
Type: New Feature
Versions: 0.8-dev
Reporter: Rod Taylor
Allow filtering the URLs completely
[ http://issues.apache.org/jira/browse/NUTCH-242?page=all ]
Rod Taylor updated NUTCH-242:
-
Attachment: nutch_crawldb_filtering.patch
Add optional -urlFiltering to updatedb
--
Key: NUTCH-242
URL:
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372556 ]
Doug Cutting commented on NUTCH-171:
Ideally we could overlap segment2 map with segment1 reduce to keep bandwidth
usage constant.
Overlapping map2 with reduce1 should
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372574 ]
Doug Cutting commented on NUTCH-240:
First, I hope my critical remarks were not taken personally. I am thankful for
this and all of your contributions.
Initially, I did
[ http://issues.apache.org/jira/browse/NUTCH-242?page=comments#action_12372581 ]
Doug Cutting commented on NUTCH-242:
Shouldn't you use the returned value of the filter? If so, then this should be
done in a mapper, not in the reducer.
Add optional
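
One possible reading of the comment above, as a minimal sketch rather than the code in either attached patch: if the filter can return a rewritten URL (or null to reject it), applying it in the map step lets the returned value become the key that the reduce step later groups on. The names SketchUrlFilter, FilterInMapSketch, mapUrls and HTTP_ONLY are hypothetical, and the example uses no Hadoop or Nutch classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a URL filter: returns the URL to keep
// (possibly rewritten), or null to drop it.
interface SketchUrlFilter {
    String filter(String url);
}

final class FilterInMapSketch {

    // Map step: the filter's RETURN VALUE, not the input URL, becomes the
    // key passed on, so later grouping sees the rewritten form.
    static List<String> mapUrls(List<String> urls, SketchUrlFilter filter) {
        List<String> keys = new ArrayList<>();
        for (String url : urls) {
            String filtered = filter.filter(url);
            if (filtered != null) {      // null means "rejected"
                keys.add(filtered);
            }
        }
        return keys;
    }

    // Example filter (an assumption for illustration only): strips a
    // trailing #fragment and rejects anything that is not HTTP or HTTPS.
    static final SketchUrlFilter HTTP_ONLY = url -> {
        int hash = url.indexOf('#');
        String stripped = hash >= 0 ? url.substring(0, hash) : url;
        boolean web = stripped.startsWith("http://") || stripped.startsWith("https://");
        return web ? stripped : null;
    };

    public static void main(String[] args) {
        List<String> in = Arrays.asList(
            "http://example.com/a#top",
            "ftp://example.com/b");
        System.out.println(mapUrls(in, HTTP_ONLY));
    }
}
```

Doing the same work in a reduce step would be too late to change the key, since grouping by key has already happened by then.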
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372588 ]
Rod Taylor commented on NUTCH-171:
--
One thing that's needed is the ability to mark URLs as being fetched, which
was in 0.7 but has not yet made it into 0.8. In addition, we
[ http://issues.apache.org/jira/browse/NUTCH-242?page=comments#action_12372592 ]
Rod Taylor commented on NUTCH-242:
--
Shouldn't you use the returned value of the filter?
I forgot about URL Normalization (focused on expunging the data only). I
suppose
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372597 ]
Doug Cutting commented on NUTCH-171:
Generate for 20 segments of 10M in size is almost as fast as 1 segment that
is 10M in size. A single 200M URL segment is unwieldy
[ http://issues.apache.org/jira/browse/NUTCH-241?page=comments#action_12372599 ]
Rod Taylor commented on NUTCH-241:
--
I couldn't say about the accuracy of the rest, but this phrasing specifically
is very confusing: mapred.child.heap.size is deprecated. Use
[ http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372602 ]
Rod Taylor commented on NUTCH-171:
--
How is a 200M URL segment unwieldy?
There are two reasons why I have found this. First, Nutch still has a bad habit
of not completing a
[ http://issues.apache.org/jira/browse/NUTCH-242?page=all ]
Rod Taylor updated NUTCH-242:
-
Attachment: nutch_urlfilter.patch
How about this one instead? It creates a CrawlDbMapper class which does the
filtering when requested.
Add optional -urlFiltering
[ http://issues.apache.org/jira/browse/NUTCH-242?page=comments#action_12372620 ]
Rod Taylor commented on NUTCH-242:
--
Sorry. I cannot mark the first patch as being invalid and neglected to use a
version number.
v2 is named nutch_urlfilter.patch
Add