[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12413959 ]
Matt Kangas commented on NUTCH-272:
---
Thanks Doug, that makes more sense now. Running URLFilters.filter() during
Generate seems very handy, albeit costly for large crawls.
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412845 ]
Matt Kangas commented on NUTCH-272:
---
Scratch my last comment. :-) I assumed that URLFilters.filter() was applied
while traversing the segment, as it was in 0.7. Not true in
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412601 ]
Matt Kangas commented on NUTCH-272:
---
I've been thinking about this after hitting several sites that explode into
1.5M URLs (or more). I could sleep easier at night if I
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412614 ]
Matt Kangas commented on NUTCH-272:
---
btw, I'd love to be proven wrong, because if the generate.max.per.host
parameter works as a hard URL cap per site, I could be sleeping
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412620 ]
Stefan Neufeind commented on NUTCH-272:
---
Oh, I just discovered this new parameter was added in 0.8-dev :-)
But to my understanding of the description in