[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-30 Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12413959 ]
Matt Kangas commented on NUTCH-272:
Thanks Doug, that makes more sense now. Running URLFilters.filter() during Generate seems very handy, albeit costly for large crawls.

[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-22 Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412845 ]
Matt Kangas commented on NUTCH-272:
Scratch my last comment. :-) I assumed that URLFilters.filter() was applied while traversing the segment, as it was in 0.7. Not true in

[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-19 Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412601 ]
Matt Kangas commented on NUTCH-272:
I've been thinking about this after hitting several sites that explode into 1.5 M URLs (or more). I could sleep easier at night if I

[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-19 Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412614 ]
Matt Kangas commented on NUTCH-272:
btw, I'd love to be proven wrong, because if the generate.max.per.host parameter works as a hard URL cap per site, I could be sleeping

[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-19 Stefan Neufeind (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412620 ]
Stefan Neufeind commented on NUTCH-272:
Oh, I just discovered this new parameter was added in 0.8-dev :-) But to my understanding of the description in