[ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12323158 ]
Matt Kangas commented on NUTCH-87:
--
Sample plugin.xml file for use with WhitelistURLFilter
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;plugin
   id="epile-whitelisturlfilter"
   name="Epile
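For context, a complete plugin.xml for a URLFilter extension typically follows the shape below. This is a sketch modeled on Nutch's stock plugins: everything past the id line (the display name, jar name, and especially the implementation class) is an assumption, not the contents of the attached file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="epile-whitelisturlfilter"
   name="Epile Whitelist URL Filter"
   version="1.0.0"
   provider-name="epile">

   <runtime>
      <!-- jar built from the plugin's source tree -->
      <library name="urlfilter-whitelist.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <!-- Hypothetical class name; see the attached tarball for the real one -->
   <extension id="epile-whitelisturlfilter"
              name="Epile Whitelist URL Filter"
              point="org.apache.nutch.net.URLFilter">
      <implementation id="WhitelistURLFilter"
                      class="epile.urlfilter.WhitelistURLFilter"/>
   </extension>
</plugin>
```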
[ http://issues.apache.org/jira/browse/NUTCH-82?page=comments#action_12332660 ]
Matt Kangas commented on NUTCH-82:
--
Another pure Java solution is to rewrite the nutch bash script in BeanShell
(http://www.beanshell.org).
I just took a quick (~1 hr) stab
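The "pure Java" idea boils down to replacing the bash script's class dispatch with a small launcher that finds a class by name and invokes its main(String[]). The sketch below illustrates that mechanism only; the class names are stand-ins, where the real script would dispatch to e.g. org.apache.nutch.crawl.Crawl.

```java
import java.lang.reflect.Method;

// Minimal launcher sketch: look up a class by name and call its main().
public class NutchLauncher {

    public static void launch(String className, String[] args) throws Exception {
        Method main = Class.forName(className).getMethod("main", String[].class);
        // Cast to Object so the array is passed as the single String[] argument.
        main.invoke(null, (Object) args);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("Usage: NutchLauncher <class> [args...]");
            System.exit(1);
        }
        launch(args[0], java.util.Arrays.copyOfRange(args, 1, args.length));
    }

    // Stand-in command so the sketch is runnable on its own.
    public static class Demo {
        public static String lastArg;
        public static void main(String[] args) {
            lastArg = args.length > 0 ? args[0] : "";
            System.out.println("launched Demo with: " + lastArg);
        }
    }
}
```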
[ http://issues.apache.org/jira/browse/NUTCH-143?page=comments#action_12360689 ]
Matt Kangas commented on NUTCH-143:
---
I'd like to see this fixed too. It would make error-checking in wrapper scripts
much simpler to implement.
A fix would have to touch
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]
Matt Kangas updated NUTCH-87:
-
Attachment: build.xml.patch
urlfilter-whitelist.tar.gz
THIS REPLACES THE PREVIOUS TARBALL
SEE THE INCLUDED README.txt FOR USAGE GUIDELINES
Place both
[ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12362584 ]
Matt Kangas commented on NUTCH-87:
--
JIRA-87-whitelistfilter.tar.gz is OBSOLETE. Use the newer tarball + patch file
instead.
Efficient site-specific crawling for a large
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]
Matt Kangas updated NUTCH-87:
-
Version: 0.7.2-dev
0.8-dev
Efficient site-specific crawling for a large number of sites
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]
Matt Kangas updated NUTCH-87:
-
Attachment: build.xml.patch-0.8
The previous patch file is valid for 0.7. Here is one that works for 0.8-dev
(trunk).
(It's three separate one-line additions, to
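Registering a plugin in Nutch's build typically means one `<ant dir="..."/>` line per build target in src/plugin/build.xml, which would account for three one-line additions. The fragment below is a plausible sketch modeled on how other plugins are wired in, not the attached patch itself:

```xml
<!-- Hypothetical sketch of src/plugin/build.xml additions;
     target names mirror Nutch's other plugins. -->
<target name="deploy">
   <ant dir="urlfilter-whitelist" target="deploy"/>
</target>

<target name="test">
   <ant dir="urlfilter-whitelist" target="test"/>
</target>

<target name="clean">
   <ant dir="urlfilter-whitelist" target="clean"/>
</target>
```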
Log when db.max configuration limits reached
Key: NUTCH-182
URL: http://issues.apache.org/jira/browse/NUTCH-182
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.8-dev
Reporter: Matt Kangas
[ http://issues.apache.org/jira/browse/NUTCH-182?page=all ]
Matt Kangas updated NUTCH-182:
--
Attachment: ParseData.java.patch
LinkDb.java.patch
Two patches are attached for nutch/trunk (0.8-dev).
LinkDb.java.patch adds two new LOG.info()
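The gist of NUTCH-182 is to log when a configured db.max.* cap actually truncates data, so operators can tell the limit was hit rather than silently losing links. A self-contained sketch of that pattern (names are illustrative, not the real patch):

```java
import java.util.logging.Logger;

// Sketch of the kind of logging NUTCH-182 proposes: note when a configured
// cap (e.g. db.max.outlinks.per.page) truncates what was found.
public class MaxLimitLogger {
    private static final Logger LOG = Logger.getLogger("linkdb");

    // Returns how many outlinks to keep, logging when the cap applies.
    public static int capOutlinks(int found, int max) {
        if (found > max) {
            LOG.info("outlink limit reached: found " + found + ", keeping " + max);
            return max;
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(capOutlinks(250, 100)); // cap applies, logs
        System.out.println(capOutlinks(40, 100));  // under the limit, silent
    }
}
```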
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412601 ]
Matt Kangas commented on NUTCH-272:
---
I've been thinking about this after hitting several sites that explode into 1.5M URLs (or more). I could sleep easier at night if I
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412614 ]
Matt Kangas commented on NUTCH-272:
---
btw, I'd love to be proven wrong, because if the generate.max.per.host parameter works as a hard URL cap per site, I could be sleeping
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412845 ]
Matt Kangas commented on NUTCH-272:
---
Scratch my last comment. :-) I assumed that URLFilters.filter() was applied
while traversing the segment, as it was in 0.7. Not true in
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413939 ]
Matt Kangas commented on NUTCH-289:
---
+1 to saving IP address in CrawlDatum, wherever the value comes from. (Fetcher
or otherwise)
CrawlDatum should store IP address
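The point of keeping the IP in CrawlDatum regardless of who resolved it can be illustrated with a toy metadata-map sketch (this is not the real CrawlDatum API, just the shape of the idea):

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch: keep the IP in the datum's metadata map so any producer
// (Fetcher or otherwise) can set it and later stages can read it.
public class CrawlDatumSketch {
    private final Map<String, String> metaData = new HashMap<>();

    public void setIp(String ip) { metaData.put("_ip_", ip); }
    public String getIp() { return metaData.get("_ip_"); }

    public static void main(String[] args) {
        CrawlDatumSketch d = new CrawlDatumSketch();
        d.setIp("10.0.0.5");
        System.out.println(d.getIp());
    }
}
```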
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12413959 ]
Matt Kangas commented on NUTCH-272:
---
Thanks Doug, that makes more sense now. Running URLFilters.filter() during
Generate seems very handy, albeit costly for large crawls.
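The two ideas in this thread, filtering URLs at generate time and enforcing a hard per-host cap like generate.max.per.host, can be sketched together. This is an illustrative stand-in, not Nutch's Generator code:

```java
import java.util.*;

// Illustrative sketch: select candidate URLs for fetching, applying a
// URL filter (URLFilters.filter() analogue) and a hard per-host cap.
public class GenerateSketch {

    public static List<String> generate(List<String> candidates, int maxPerHost) {
        Map<String, Integer> perHost = new HashMap<>();
        List<String> selected = new ArrayList<>();
        for (String url : candidates) {
            if (filter(url) == null) continue;   // rejected by the filter
            String host = hostOf(url);
            int n = perHost.getOrDefault(host, 0);
            if (n >= maxPerHost) continue;       // hard cap per site
            perHost.put(host, n + 1);
            selected.add(url);
        }
        return selected;
    }

    // Toy filter: accept only http:// URLs.
    static String filter(String url) {
        return url.startsWith("http://") ? url : null;
    }

    static String hostOf(String url) {
        String rest = url.substring("http://".length());
        int slash = rest.indexOf('/');
        return slash < 0 ? rest : rest.substring(0, slash);
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://a.com/1", "http://a.com/2", "http://a.com/3",
            "http://b.com/1", "ftp://c.com/1");
        System.out.println(generate(urls, 2));
        // a.com is capped at 2 URLs; the ftp URL is filtered out.
    }
}
```

The cost Doug's reply alludes to is visible even in the sketch: the filter runs over every candidate URL at generate time, which for a multi-million-URL crawldb is a substantial per-cycle expense.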
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548420 ]
Matt Kangas commented on NUTCH-585:
---
Simplest path forward... that I can think of:
1) Add a new indexing plugin