[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2005-09-10 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12323158 ] Matt Kangas commented on NUTCH-87: -- Sample plugin.xml file for use with WhitelistURLFilter: <?xml version="1.0" encoding="UTF-8"?> <plugin id="epile-whitelisturlfilter" name="Epile
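The plugin.xml above registers the filter with Nutch's extension-point machinery. For context, here is a minimal Java sketch of the core whitelist check such a filter performs; the class body, host list, and loading strategy are illustrative, not the contents of the attached tarball. The real plugin would implement org.apache.nutch.net.URLFilter, whose contract is: return the URL to accept it, null to reject it.

    // Illustrative sketch only -- not the code in the attached tarball.
    import java.net.URL;
    import java.util.HashSet;
    import java.util.Set;

    public class WhitelistURLFilter {
      private final Set<String> allowedHosts = new HashSet<String>();

      public WhitelistURLFilter() {
        // In practice the host list would be loaded from a whitelist file.
        allowedHosts.add("example.com");
      }

      // Nutch URLFilter contract: return the URL to accept it, null to reject.
      public String filter(String urlString) {
        try {
          String host = new URL(urlString).getHost().toLowerCase();
          return allowedHosts.contains(host) ? urlString : null;
        } catch (Exception e) {
          return null; // malformed URLs are rejected outright
        }
      }
    }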

[jira] Commented: (NUTCH-82) Nutch Commands should run on Windows without external tools

2005-10-20 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-82?page=comments#action_12332660 ] Matt Kangas commented on NUTCH-82: -- Another pure Java solution is to rewrite the nutch bash script in BeanShell (http://www.beanshell.org). I just took a quick (~1 hr) stab
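For context, the bash script mostly maps a command name to a Nutch class and execs java with the right classpath. A rough pure-Java sketch of that dispatch follows; the command table is a small illustrative subset, and a BeanShell version would look much the same, minus the boilerplate.

    // Rough pure-Java equivalent of the bash script's dispatch: map a command
    // name to a Nutch class and invoke its main() reflectively.
    import java.lang.reflect.Method;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public class NutchLauncher {
      private static final Map<String, String> COMMANDS = new HashMap<String, String>();
      static {
        COMMANDS.put("crawl", "org.apache.nutch.crawl.Crawl");
        COMMANDS.put("readdb", "org.apache.nutch.crawl.CrawlDbReader");
      }

      public static void main(String[] args) throws Exception {
        if (args.length == 0 || !COMMANDS.containsKey(args[0])) {
          System.err.println("usage: nutch <command> [args...]");
          System.exit(1);
        }
        Method main = Class.forName(COMMANDS.get(args[0])).getMethod("main", String[].class);
        main.invoke(null, (Object) Arrays.copyOfRange(args, 1, args.length));
      }
    }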

[jira] Commented: (NUTCH-143) Improper error numbers returned on exit

2005-12-17 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-143?page=comments#action_12360689 ] Matt Kangas commented on NUTCH-143: --- I'd like to see this fixed too. It would make error-checking in wrapper scripts much simpler to implement. A fix would have to touch
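The pattern such a fix would need in each command-line tool is simple: catch failures in main() and exit with a nonzero status. A hedged sketch of the pattern, not the actual fix:

    // Sketch of the pattern each tool's main() would need; not the actual fix.
    public class ExitCodeExample {
      public static void main(String[] args) {
        try {
          run(args);
        } catch (Exception e) {
          e.printStackTrace();
          System.exit(1); // nonzero status so wrapper scripts can test $?
        }
      }

      private static void run(String[] args) throws Exception {
        if (args.length == 0) throw new IllegalArgumentException("missing argument");
        // ... the tool's real work would go here ...
      }
    }

With that in place, a wrapper script can simply test the exit status, e.g. bin/nutch inject crawldb urls || exit 1.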

[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2006-01-12 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ] Matt Kangas updated NUTCH-87: - Attachment: build.xml.patch, urlfilter-whitelist.tar.gz. THIS REPLACES THE PREVIOUS TARBALL. SEE THE INCLUDED README.txt FOR USAGE GUIDELINES. Place both

[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2006-01-12 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12362584 ] Matt Kangas commented on NUTCH-87: -- JIRA-87-whitelistfilter.tar.gz is OBSOLETE. Use the newer tarball + patch file instead.

[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2006-01-19 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ] Matt Kangas updated NUTCH-87: - Version: 0.7.2-dev, 0.8-dev

[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites

2006-01-19 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ] Matt Kangas updated NUTCH-87: - Attachment: build.xml.patch-0.8 The previous patch file is valid for 0.7. Here is one that works for 0.8-dev (trunk). (It's three separate one-line additions, to

[jira] Created: (NUTCH-182) Log when db.max configuration limits reached

2006-01-19 Thread Matt Kangas (JIRA)
Log when db.max configuration limits reached Key: NUTCH-182 URL: http://issues.apache.org/jira/browse/NUTCH-182 Project: Nutch Type: Improvement Components: fetcher Versions: 0.8-dev Reporter: Matt Kangas

[jira] Updated: (NUTCH-182) Log when db.max configuration limits reached

2006-01-19 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-182?page=all ] Matt Kangas updated NUTCH-182: -- Attachment: ParseData.java.patch, LinkDb.java.patch Two patches are attached for nutch/trunk (0.8-dev). LinkDb.java.patch adds two new LOG.info()
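To illustrate the kind of change the patches describe (not the patches themselves), here is a sketch of logging when a configured cap truncates a list; the method, property handling, and message wording are all made up for illustration:

    // Illustrative only -- not the attached patches. Logs once per record when
    // a configured cap truncates a list, so the limit firing is visible.
    import java.util.List;
    import java.util.logging.Logger;

    public class MaxLimitLogging {
      private static final Logger LOG = Logger.getLogger(MaxLimitLogging.class.getName());

      static <T> List<T> capList(String url, List<T> items, int max, String property) {
        if (max >= 0 && items.size() > max) {
          LOG.info(property + " limit hit for " + url + ": kept " + max
              + " of " + items.size());
          return items.subList(0, max);
        }
        return items;
      }
    }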

[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-19 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412601 ] Matt Kangas commented on NUTCH-272: --- I've been thinking about this after hitting several sites that explode into 1.5 M URLs (or more). I could sleep easier at night if I

[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-19 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412614 ] Matt Kangas commented on NUTCH-272: --- btw, I'd love to be proven wrong, because if the generate.max.per.host parameter works as a hard URL cap per site, I could be sleeping
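For what a hard cap would mean in practice, here is a hypothetical sketch of per-host counting during URL selection; it shows the behavior being asked about, not how generate.max.per.host is actually implemented:

    // Hypothetical illustration of a hard per-host cap during URL selection.
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;

    public class PerHostCap {
      private final int maxPerHost;
      private final Map<String, Integer> counts = new HashMap<String, Integer>();

      public PerHostCap(int maxPerHost) { this.maxPerHost = maxPerHost; }

      // Returns true if the URL can be emitted without exceeding the cap.
      public boolean admit(String urlString) throws MalformedURLException {
        String host = new URL(urlString).getHost();
        int seen = counts.containsKey(host) ? counts.get(host) : 0;
        if (seen >= maxPerHost) return false; // hard stop for exploding sites
        counts.put(host, seen + 1);
        return true;
      }
    }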

[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-22 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412845 ] Matt Kangas commented on NUTCH-272: --- Scratch my last comment. :-) I assumed that URLFilters.filter() was applied while traversing the segment, as it was in 0.7. Not true in

[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-05-30 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413939 ] Matt Kangas commented on NUTCH-289: --- +1 to saving the IP address in CrawlDatum, wherever the value comes from (Fetcher or otherwise).
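One plausible shape for this, sketched against the 0.8-era CrawlDatum metadata map; the "_ip_" key name is invented here for illustration, and the MapWritable class behind getMetaData() lived in different packages across Nutch versions:

    // Sketch only: one possible way to stash the IP, not the proposed patch.
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;

    public class IpInDatum {
      static void recordIp(CrawlDatum datum, String ip) {
        datum.getMetaData().put(new Text("_ip_"), new Text(ip));
      }

      static String readIp(CrawlDatum datum) {
        Text ip = (Text) datum.getMetaData().get(new Text("_ip_"));
        return ip == null ? null : ip.toString();
      }
    }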

[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-30 Thread Matt Kangas (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12413959 ] Matt Kangas commented on NUTCH-272: --- Thanks Doug, that makes more sense now. Running URLFilters.filter() during Generate seems very handy, albeit costly for large crawls.
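The mechanism under discussion, roughly: Generate runs each candidate URL through the configured filter chain, so rejected URLs never reach the fetch list. A small sketch against the 0.8-dev API; NutchConfiguration and URLFilters are real classes, but treat the surrounding main() as illustrative glue:

    // Sketch: apply the configured URL filter chain to a candidate URL,
    // as Generate does, dropping anything the filters reject.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilters;
    import org.apache.nutch.util.NutchConfiguration;

    public class FilterDuringGenerate {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        URLFilters filters = new URLFilters(conf);
        String candidate = "http://example.com/page";
        if (filters.filter(candidate) != null) {
          System.out.println("keep " + candidate);
        } else {
          System.out.println("drop " + candidate); // filtered before fetching
        }
      }
    }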

[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2007-12-04 Thread Matt Kangas (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548420 ] Matt Kangas commented on NUTCH-585: --- Simplest path forward that I can think of: 1) Add a new indexing plugin
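A sketch of what step 1's plugin could do at its core: strip author-marked spans before the text reaches the index. The <!--noindex--> marker convention is assumed here for illustration; it is not a Nutch standard:

    // Sketch of step 1's core: remove marked spans before indexing.
    import java.util.regex.Pattern;

    public class NoIndexStripper {
      private static final Pattern BLOCKED = Pattern.compile(
          "<!--\\s*noindex\\s*-->.*?<!--\\s*/noindex\\s*-->",
          Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

      public static String strip(String html) {
        return BLOCKED.matcher(html).replaceAll("");
      }

      public static void main(String[] args) {
        String html = "keep <!--noindex--> hide this <!--/noindex--> keep too";
        System.out.println(strip(html)); // prints: keep  keep too
      }
    }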