[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites
[ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12323158 ]

Matt Kangas commented on NUTCH-87:
----------------------------------

Sample plugin.xml file for use with WhitelistURLFilter:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="epile-whitelisturlfilter" name="Epile whitelist URL filter"
        version="1.0.0" provider-name="teamgigabyte.com">
   <extension-point id="org.apache.nutch.net.URLFilter" name="Nutch URL Filter"/>
   <runtime></runtime>
   <extension id="org.apache.nutch.net.urlfilter" name="Epile Whitelist URL Filter"
              point="org.apache.nutch.net.URLFilter">
      <implementation id="WhitelistURLFilter" class="epile.crawl.plugin.WhitelistURLFilter"/>
   </extension>
</plugin>

Efficient site-specific crawling for a large number of sites
-------------------------------------------------------------

         Key: NUTCH-87
         URL: http://issues.apache.org/jira/browse/NUTCH-87
     Project: Nutch
        Type: New Feature
  Components: fetcher
 Environment: cross-platform
    Reporter: AJ Chen
 Attachments: JIRA-87-whitelistfilter.tar.gz

There is a gap between whole-web crawling and single-site (or handful-of-sites) crawling. Many applications actually fall into this gap and usually require crawling a large number of selected sites, say 10 domains. The current CrawlTool is designed for a handful of sites, so this request calls for a new feature or improvement to CrawlTool so that the nutch crawl command can efficiently deal with a large number of sites. One requirement is to add or change the smallest amount of code so that this feature can be implemented sooner rather than later.

There is a discussion about adding a URLFilter to implement the requested feature; see the following thread: http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00726.html

The idea is to use a hashtable in a URLFilter to look up the regexes for any given domain. A hashtable lookup will be much faster than the list implementation currently used in RegexURLFilter. Fortunately, Matt Kangas has implemented such an idea before for his own application and is willing to make it available for adaptation to Nutch. I'll be happy to help him in this regard. But before we do it, we would like to hear more discussion or comments about this approach or other approaches. In particular, let us know what the potential downsides of a hashtable lookup in a new URLFilter plugin would be.

AJ Chen
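Regarding the hashtable-per-host idea referenced above: a minimal sketch of it, assuming the usual URLFilter contract of returning the URL to accept it or null to reject it, might look like the following. Class and method names are illustrative only; this is not the attached WhitelistURLFilter implementation.

// Illustrative sketch only, not the attached WhitelistURLFilter.
// Assumes a filter() contract like org.apache.nutch.net.URLFilter:
// return the URL to accept it, or null to reject it.
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class HostWhitelistFilterSketch {

  // host -> regex that URLs on that host must match
  private final Map<String, Pattern> whitelist = new HashMap<String, Pattern>();

  public void allow(String host, String regex) {
    whitelist.put(host.toLowerCase(), Pattern.compile(regex));
  }

  /** Returns the URL if its host is whitelisted and matches that host's regex, else null. */
  public String filter(String urlString) {
    try {
      String host = new URL(urlString).getHost().toLowerCase();
      Pattern p = whitelist.get(host);        // O(1) host lookup instead of a linear regex scan
      if (p != null && p.matcher(urlString).find()) {
        return urlString;
      }
      return null;
    } catch (Exception e) {
      return null;                            // malformed URLs are rejected
    }
  }
}

The point of the hashtable is that the cost per URL stays constant no matter how many domains are whitelisted, whereas RegexURLFilter's list must be scanned for every URL.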
[jira] Commented: (NUTCH-82) Nutch Commands should run on Windows without external tools
[ http://issues.apache.org/jira/browse/NUTCH-82?page=comments#action_12332660 ]

Matt Kangas commented on NUTCH-82:
----------------------------------

Another pure-Java solution is to rewrite the nutch bash script in BeanShell (http://www.beanshell.org). I just took a quick (~1 hr) stab at this. The syntax seems quite agreeable, with many built-in versions of standard unix commands (cd(), cat(), etc). However, I quickly hit two barriers:

1) Reading environment variables. System.getenv() works on 1.5, but is nonfunctional on Java 1.3 and 1.4. The only workaround on 1.4 is what ant does: run a native command, read the output, and set system properties.

2) Setting -Xmx et al. My sense is that it's simply not possible.

Other than these issues, it would be quite easy to rewrite all of the usage/command/path-building logic into a BeanShell script. Then there could be two *small* scripts (bash and .bat) to handle the stuff that can't be done in Java, and one BeanShell script for the rest. Does that seem useful? FYI, the core BeanShell interpreter is ~143k.

Nutch Commands should run on Windows without external tools
------------------------------------------------------------

         Key: NUTCH-82
         URL: http://issues.apache.org/jira/browse/NUTCH-82
     Project: Nutch
        Type: New Feature
 Environment: Windows 2000
    Reporter: AJ Banck
 Attachments: nutch.bat, nutch.bat, nutch.pl

Currently there is only a shell script to run the Nutch commands. This should be platform independent. Best would be Ant tools, or scripts generated by a template tool to avoid replication.
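For reference, the pre-1.5 workaround mentioned in point 1) above, shelling out to a native command and copying its output into system properties, amounts to roughly the following. The command names ("env", "cmd /c set") and the "env." property prefix are the conventional choices, not anything Nutch-specific; this is a sketch, not a drop-in patch.

// Illustrative sketch of the ant-style workaround for reading environment
// variables on JVMs where System.getenv() is unavailable: run a native
// command, parse its output, and expose each variable as a system property.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class EnvSnapshot {
  public static void loadEnvIntoSystemProperties() throws Exception {
    String os = System.getProperty("os.name").toLowerCase();
    String[] cmd = os.indexOf("windows") >= 0
        ? new String[] {"cmd", "/c", "set"}
        : new String[] {"env"};
    Process p = Runtime.getRuntime().exec(cmd);
    BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
      int eq = line.indexOf('=');
      if (eq > 0) {
        // e.g. NUTCH_HEAPSIZE becomes the system property "env.NUTCH_HEAPSIZE"
        System.setProperty("env." + line.substring(0, eq), line.substring(eq + 1));
      }
    }
    in.close();
    p.waitFor();
  }
}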
[jira] Commented: (NUTCH-143) Improper error numbers returned on exit
[ http://issues.apache.org/jira/browse/NUTCH-143?page=comments#action_12360689 ]

Matt Kangas commented on NUTCH-143:
-----------------------------------

I'd like to see this fixed too. It would make error-checking in wrapper scripts much simpler to implement.

A fix would have to touch every .java file that has a main() method, because the problem is that the JVM returns status=0 from main(), since main() has a _void_ return type, after all. To solve this, I recommend renaming each existing main() method to doMain() with a boolean return value, and adding the following to each affected file:

/**
 * main() wrapper that returns a proper exit status
 */
public static void main(String[] args) {
  Runtime rt = Runtime.getRuntime();
  try {
    boolean status = doMain(args);
    rt.exit(status ? 0 : 1);
  } catch (Exception e) {
    LOG.log(Level.SEVERE, LOGPREFIX + "error, caught Exception in main()", e);
    rt.exit(1);
  }
}

Improper error numbers returned on exit
---------------------------------------

         Key: NUTCH-143
         URL: http://issues.apache.org/jira/browse/NUTCH-143
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev
    Reporter: Rod Taylor

Nutch does not obey standard command-line exit codes, which can make it difficult to script around commands. Both of the commands below should have exited with a status greater than 0, causing the shell to take the 'Failed' branch:

bash-3.00$ /opt/nutch/bin/nutch updatedb && echo ==Success || echo ==Failed
Usage: <crawldb> <segment>
==Success

bash-3.00$ /opt/nutch/bin/nutch readdb && echo ==Success || echo ==Failed
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -url <url>)
  <crawldb>        directory name where crawldb is located
  -stats           print overall statistics to System.out
  -dump <out_dir>  dump the whole db to a text file in <out_dir>
  -url <url>       print information on <url> to System.out
==Success

Note that the nutch shell script itself functions as expected:

bash-3.00$ /opt/nutch/bin/nutch && echo ==Success || echo ==Failed
Usage: nutch COMMAND
where COMMAND is one of:
  crawl        one-step crawler for intranets
  readdb       read / dump crawl db
  readlinkdb   read / dump link db
  admin        database administration, including creation
  inject       inject new urls into the database
  generate     generate new segments to fetch
  fetch        fetch a segment's pages
  parse        parse a segment's pages
  updatedb     update crawl db from segments after fetching
  invertlinks  create a linkdb from parsed segments
  index        run the indexer on parsed segments and linkdb
  merge        merge several segment indexes
  dedup        remove duplicates from a set of segment indexes
  server       run a search server
  namenode     run the NDFS namenode
  datanode     run an NDFS datanode
  ndfs         run an NDFS admin client
  jobtracker   run the MapReduce job Tracker node
  tasktracker  run a MapReduce task Tracker node
  job          manipulate MapReduce jobs
 or
  CLASSNAME    run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
==Failed
[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]

Matt Kangas updated NUTCH-87:
-----------------------------

    Attachment: build.xml.patch
                urlfilter-whitelist.tar.gz

THIS REPLACES THE PREVIOUS TARBALL. SEE THE INCLUDED README.txt FOR USAGE GUIDELINES.

Place both of these files into ~nutch/src/plugin, then:
- untar the tarball
- apply the patch to ~nutch/src/plugin/build.xml to permit urlfilter-whitelist to be built

Next, cd to ~nutch and build (ant). A JUnit test is included; it will be run automatically by "ant test-plugins". Then follow the instructions in ~nutch/src/plugin/urlfilter-whitelist/README.txt.

Efficient site-specific crawling for a large number of sites
-------------------------------------------------------------

         Key: NUTCH-87
         URL: http://issues.apache.org/jira/browse/NUTCH-87
     Project: Nutch
        Type: New Feature
  Components: fetcher
 Environment: cross-platform
    Reporter: AJ Chen
 Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, urlfilter-whitelist.tar.gz
[jira] Commented: (NUTCH-87) Efficient site-specific crawling for a large number of sites
[ http://issues.apache.org/jira/browse/NUTCH-87?page=comments#action_12362584 ]

Matt Kangas commented on NUTCH-87:
----------------------------------

JIRA-87-whitelistfilter.tar.gz is OBSOLETE. Use the newer tarball + patch file instead.

Efficient site-specific crawling for a large number of sites
-------------------------------------------------------------

         Key: NUTCH-87
         URL: http://issues.apache.org/jira/browse/NUTCH-87
     Project: Nutch
        Type: New Feature
  Components: fetcher
 Environment: cross-platform
    Reporter: AJ Chen
 Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, urlfilter-whitelist.tar.gz
[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]

Matt Kangas updated NUTCH-87:
-----------------------------

    Version: 0.7.2-dev
             0.8-dev

Efficient site-specific crawling for a large number of sites
-------------------------------------------------------------

         Key: NUTCH-87
         URL: http://issues.apache.org/jira/browse/NUTCH-87
     Project: Nutch
        Type: New Feature
  Components: fetcher
    Versions: 0.8-dev, 0.7.2-dev
 Environment: cross-platform
    Reporter: AJ Chen
 Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, urlfilter-whitelist.tar.gz
[jira] Updated: (NUTCH-87) Efficient site-specific crawling for a large number of sites
[ http://issues.apache.org/jira/browse/NUTCH-87?page=all ]

Matt Kangas updated NUTCH-87:
-----------------------------

    Attachment: build.xml.patch-0.8

The previous patch file is valid for 0.7. Here is one that works for 0.8-dev (trunk). (It's three separate one-line additions, to include the plugin in the deploy, test, and clean targets.)

Efficient site-specific crawling for a large number of sites
-------------------------------------------------------------

         Key: NUTCH-87
         URL: http://issues.apache.org/jira/browse/NUTCH-87
     Project: Nutch
        Type: New Feature
  Components: fetcher
    Versions: 0.8-dev, 0.7.2-dev
 Environment: cross-platform
    Reporter: AJ Chen
 Attachments: JIRA-87-whitelistfilter.tar.gz, build.xml.patch, build.xml.patch-0.8, urlfilter-whitelist.tar.gz
[jira] Created: (NUTCH-182) Log when db.max configuration limits reached
Log when db.max configuration limits reached
---------------------------------------------

         Key: NUTCH-182
         URL: http://issues.apache.org/jira/browse/NUTCH-182
     Project: Nutch
        Type: Improvement
  Components: fetcher
    Versions: 0.8-dev
    Reporter: Matt Kangas
    Priority: Trivial

Followup to http://www.nabble.com/Re%3A-Can%27t-index-some-pages-p2480833.html

There are three db.max parameters currently in nutch-default.xml:
* db.max.outlinks.per.page
* db.max.anchor.length
* db.max.inlinks

Having values that are too low can result in a site being under-crawled. However, currently there is nothing written to the log when these limits are hit, so users have to guess when they need to raise these values. I suggest that we add three new log messages at the appropriate points:
* Exceeded db.max.outlinks.per.page for URL <url>
* Exceeded db.max.anchor.length for URL <url>
* Exceeded db.max.inlinks for URL <url>
[jira] Updated: (NUTCH-182) Log when db.max configuration limits reached
[ http://issues.apache.org/jira/browse/NUTCH-182?page=all ]

Matt Kangas updated NUTCH-182:
------------------------------

    Attachment: ParseData.java.patch
                LinkDb.java.patch

Two patches are attached for nutch/trunk (0.8-dev).

LinkDb.java.patch adds two new LOG.info() statements:
* Exceeded db.max.anchor.length for URL <url>
* Exceeded db.max.inlinks for URL <url>

ParseData.java.patch adds a private static LOG variable, plus one LOG.info() statement:
* Exceeded db.max.outlinks.per.page

I would have preferred to print the URL too on the latter, but it's not available in the method where the cutoff is performed (afaik).

Log when db.max configuration limits reached
---------------------------------------------

         Key: NUTCH-182
         URL: http://issues.apache.org/jira/browse/NUTCH-182
     Project: Nutch
        Type: Improvement
  Components: fetcher
    Versions: 0.8-dev
    Reporter: Matt Kangas
    Priority: Trivial
 Attachments: LinkDb.java.patch, ParseData.java.patch
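The patches themselves are attached to the issue; purely as an illustration of the kind of change involved, a check-and-log at a db.max.* cutoff has roughly the following shape. The method, array type, and limit handling are placeholders, not the patched Nutch code, and unlike the real ParseData patch this sketch assumes the URL is available.

// Illustrative sketch only, not the attached patches. Shows the general shape
// of "log when a db.max.* limit truncates something".
import java.util.logging.Logger;

public class LimitLoggingSketch {
  private static final Logger LOG = Logger.getLogger(LimitLoggingSketch.class.getName());

  /** Truncates outlinks to maxOutlinks, logging when the limit is hit. */
  public static String[] truncateOutlinks(String url, String[] outlinks, int maxOutlinks) {
    if (maxOutlinks < 0 || outlinks.length <= maxOutlinks) {
      return outlinks;                       // under the limit, nothing to do
    }
    LOG.info("Exceeded db.max.outlinks.per.page for URL " + url
        + " (" + outlinks.length + " > " + maxOutlinks + ")");
    String[] kept = new String[maxOutlinks];
    System.arraycopy(outlinks, 0, kept, 0, maxOutlinks);
    return kept;
  }
}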
[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412601 ]

Matt Kangas commented on NUTCH-272:
-----------------------------------

I've been thinking about this after hitting several sites that explode into 1.5M URLs (or more). I could sleep easier at night if I could set a cap at 50k URLs/site and just check my log files in the morning.

Counting total URLs/domain needs to happen in one of the places where Nutch already traverses the crawldb. For Nutch 0.8 this is "nutch generate" and "nutch updatedb". URLs are added by both "nutch inject" and "nutch updatedb". These tools use the URLFilter plugin extension point to determine which URLs to keep and which to reject. But note that updatedb could only compute URLs/domain _after_ traversing the crawldb, during which time it merges the new URLs.

So, one way to approach it is:
* Count URLs/domain during update. If a domain exceeds the limit, write it to a file.
* Read this file at the start of update (next cycle) and block further additions.
* Or: read it in a new URLFilter plugin, and block the URLs in URLFilter.filter() (see the sketch below).

If you do it all in update, you won't catch URLs added via inject, but it would still halt runaway crawls, and it would be simpler because it would be a one-file patch.

Max. pages to crawl/fetch per site (emergency limit)
----------------------------------------------------

         Key: NUTCH-272
         URL: http://issues.apache.org/jira/browse/NUTCH-272
     Project: Nutch
        Type: Improvement
    Reporter: Stefan Neufeind

If I'm right, there is no way in place right now to set an emergency limit on the max. number of pages fetched per site. Is there an easy way to implement such a limit, maybe as a plugin?
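A rough sketch of the filter-side piece of the third bullet in the comment above, assuming updatedb writes one over-limit host per line to a file. The file format, class name, and constructor are assumptions rather than an existing Nutch plugin; the filter() contract of returning the URL to keep it or null to drop it mirrors the URLFilter extension point.

// Hypothetical sketch of "block over-limit hosts in a URLFilter".
import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class BlockedHostFilterSketch {

  private final Set<String> blockedHosts = new HashSet<String>();

  /** Loads hosts that exceeded the per-site cap, as written out by the update step. */
  public BlockedHostFilterSketch(String blockedHostsFile) throws Exception {
    BufferedReader in = new BufferedReader(new FileReader(blockedHostsFile));
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.length() > 0) {
        blockedHosts.add(line.toLowerCase());
      }
    }
    in.close();
  }

  /** URLFilter-style contract: return the URL to keep it, null to drop it. */
  public String filter(String urlString) {
    try {
      String host = new URL(urlString).getHost().toLowerCase();
      return blockedHosts.contains(host) ? null : urlString;
    } catch (Exception e) {
      return null;                 // malformed URLs are dropped
    }
  }
}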
[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412614 ]

Matt Kangas commented on NUTCH-272:
-----------------------------------

btw, I'd love to be proven wrong, because if the generate.max.per.host parameter works as a hard URL cap per site, I could be sleeping better quite soon. :)

Max. pages to crawl/fetch per site (emergency limit)
----------------------------------------------------

         Key: NUTCH-272
         URL: http://issues.apache.org/jira/browse/NUTCH-272
     Project: Nutch
        Type: Improvement
    Reporter: Stefan Neufeind
[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12412845 ]

Matt Kangas commented on NUTCH-272:
-----------------------------------

Scratch my last comment. :-) I assumed that URLFilters.filter() was applied while traversing the segment, as it was in 0.7. Not true in 0.8... it's applied during Generate. (Wow. This means the crawldb will accumulate lots of junk URLs over time. Is this a feature or a bug?)

Max. pages to crawl/fetch per site (emergency limit)
----------------------------------------------------

         Key: NUTCH-272
         URL: http://issues.apache.org/jira/browse/NUTCH-272
     Project: Nutch
        Type: Improvement
    Reporter: Stefan Neufeind
[jira] Commented: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413939 ]

Matt Kangas commented on NUTCH-289:
-----------------------------------

+1 to saving the IP address in CrawlDatum, wherever the value comes from (Fetcher or otherwise).

CrawlDatum should store IP address
----------------------------------

         Key: NUTCH-289
         URL: http://issues.apache.org/jira/browse/NUTCH-289
     Project: Nutch
        Type: Bug
  Components: fetcher
    Versions: 0.8-dev
    Reporter: Doug Cutting

If the CrawlDatum stored the IP address of the host of its URL, then one could:
- partition fetch lists on the basis of IP address, for better politeness;
- truncate pages to fetch per IP address, rather than just per hostname. This would be a good way to limit the impact of domain spammers.

The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.
[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)
[ http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12413959 ]

Matt Kangas commented on NUTCH-272:
-----------------------------------

Thanks Doug, that makes more sense now. Running URLFilters.filter() during Generate seems very handy, albeit costly for large crawls. (Should there be an option to turn it off?) I also see that URLFilters.filter() is applied in Fetcher (for redirects) and in ParseOutputFormat, plus other tools.

Another possible choke-point: CrawlDbMerger.Merger.reduce(). The key is the URL, and the keys are sorted. You can veto crawldb additions here. Could you effectively count URLs/host here? (Not sure when distributed.) Would it require setting a Partitioner, like crawl.PartitionUrlByHost?

Max. pages to crawl/fetch per site (emergency limit)
----------------------------------------------------

         Key: NUTCH-272
         URL: http://issues.apache.org/jira/browse/NUTCH-272
     Project: Nutch
        Type: Improvement
    Reporter: Stefan Neufeind
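Setting aside the Hadoop plumbing, the idea in the last paragraph of the comment above, partition by host so that one reducer sees all of a host's URLs and can count them, boils down to something like the following. This is purely conceptual: getPartition() mirrors what a by-host Partitioner would do, countAndCap() plays the role of a reduce over one host's URLs, and the real MapReduce types are deliberately omitted.

// Conceptual sketch of per-host counting via host-based partitioning.
import java.net.URL;
import java.util.Iterator;

public class PerHostCapSketch {

  /** Route all URLs of the same host to the same partition (reducer). */
  public static int getPartition(String urlString, int numPartitions) throws Exception {
    String host = new URL(urlString).getHost().toLowerCase();
    return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  /** With all of a host's URLs in one place, keeping at most maxPerHost is a simple count. */
  public static int countAndCap(String host, Iterator<String> urls, int maxPerHost) {
    int kept = 0;
    while (urls.hasNext()) {
      String url = urls.next();
      if (kept < maxPerHost) {
        kept++;                    // in a real reducer: emit this url
      } else {
        // over the cap: drop the url, or log "Exceeded cap for host " + host
      }
    }
    return kept;
  }
}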
[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548420 ]

Matt Kangas commented on NUTCH-585:
-----------------------------------

Simplest path forward... that I can think of:

1) Add a new indexing-plugin extension point for filtering page content.
2) Put your a-priori marked-up content-exclusion logic into a plugin.
3) Someone else figures out a more general-purpose solution later, and swaps out your plugin at that time.

Ergo, you generalize the interface, and lazy-load the more general implementation. :-)

[PARSE-HTML plugin] Block certain parts of HTML code from being indexed
------------------------------------------------------------------------

                 Key: NUTCH-585
                 URL: https://issues.apache.org/jira/browse/NUTCH-585
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
         Environment: All operating systems
            Reporter: Andrea Spinelli
            Priority: Minor

We are using Nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and they generate spurious matches. We have modified the plugin so that it ignores HTML code between certain HTML comments, like

<!-- START-IGNORE --> ... ignored part ... <!-- STOP-IGNORE -->

We feel this might be useful to someone else, maybe factoring out the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml). We are almost ready to contribute our code snippet. Looking forward to any expression of interest - or to an explanation of why what we are doing is plain wrong!
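For anyone wanting to experiment before a patch is contributed, the core of the marker-stripping idea can be prototyped as a regular-expression pass over the raw HTML before it is parsed or indexed. The marker strings below mirror the ones proposed in the issue; everything else is a sketch, not the reporters' actual code, and a real integration point inside parse-html would look different.

// Sketch: drop everything between <!-- START-IGNORE --> and <!-- STOP-IGNORE -->.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IgnoreBlockStripper {

  private static final Pattern IGNORED_BLOCK = Pattern.compile(
      "<!--\\s*START-IGNORE\\s*-->.*?<!--\\s*STOP-IGNORE\\s*-->",
      Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

  /** Returns the HTML with all START-IGNORE/STOP-IGNORE blocks removed. */
  public static String strip(String html) {
    Matcher m = IGNORED_BLOCK.matcher(html);
    return m.replaceAll("");
  }

  public static void main(String[] args) {
    String html = "<p>keep</p><!-- START-IGNORE --><p>drop</p><!-- STOP-IGNORE --><p>keep too</p>";
    System.out.println(strip(html));   // prints: <p>keep</p><p>keep too</p>
  }
}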