Re: Automating workflow using ndfs
The goal is to avoid entering 100,000 regex in the crawl-urlfilter.xml and checking ALL these regex for each URL. Any comment? Sure seems like just some hash lookup table could handle it. I am having a hard time seeing when you really need a regex and a fixed list wouldn't do. Especially if you have a forward and maybe a backwards lookup as well in a multi-level hash, to include/exclude at a certain subdomain level, like include: com-site-good (for good.site.com stuff) and exclude: com-site-bad (for bad.site.com), and kind of walk backwards, kind of like DNS. Then you could just do a few hash lookups instead of 100,000 regexes. I realize I am talking about host and not page level filtering, but if you want to include everything from your 100,000 sites, I think such a strategy could work. Hope this makes sense. Maybe I could write some code and see if it works in practice. If nothing else, maybe the hash stuff could just be another filter option in conf/crawl-urlfilter.txt. Earl
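As a minimal sketch of this reversed-host idea (hypothetical code, not existing Nutch code; class, method, and rule names are made up for illustration): include/exclude rules are keyed by reversed host prefixes, so accepting or rejecting a host costs a few hash lookups instead of a pass over 100,000 regexes.

import java.util.HashMap;
import java.util.Map;

public class HostFilterSketch {

  // Reversed host prefix ("com.site" or "com.site.bad") -> include (true)
  // or exclude (false). A missing key means no rule at that level.
  private final Map<String, Boolean> rules = new HashMap<String, Boolean>();

  public void include(String reversedPrefix) { rules.put(reversedPrefix, Boolean.TRUE); }
  public void exclude(String reversedPrefix) { rules.put(reversedPrefix, Boolean.FALSE); }

  // Turns "good.site.com" into "com.site.good".
  private static String reverseHost(String host) {
    String[] parts = host.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {
      if (sb.length() > 0) sb.append('.');
      sb.append(parts[i]);
    }
    return sb.toString();
  }

  // Walks the reversed host from most general ("com") to most specific
  // ("com.site.good"), like a DNS lookup in reverse; the most specific
  // rule seen wins. Hosts with no matching rule are excluded by default.
  public boolean accept(String host) {
    String reversed = reverseHost(host);
    boolean verdict = false;
    int dot = -1;
    do {
      dot = reversed.indexOf('.', dot + 1);
      String prefix = (dot == -1) ? reversed : reversed.substring(0, dot);
      Boolean rule = rules.get(prefix);
      if (rule != null) verdict = rule.booleanValue();
    } while (dot != -1);
    return verdict;
  }
}

With include("com.site") and exclude("com.site.bad"), accept("good.site.com") returns true and accept("bad.site.com") returns false, each after only three hash lookups.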
Re: regex-normalize.xml
to get regex-normalize.xml to work i must put urlnormalizer.class = org.apache.nutch.net.RegexUrlNormalizer in nutch-site.xml. In nutch-default.xml there is set urlnormalizer.class = org.apache.nutch.net.BasicUrlNormalizer. Is this a bug or a feature? =) nutch-site.xml overrides properties defined in nutch-default. So:
* If you remove the urlnormalizer.class property from nutch-default, it must still use the one defined in nutch-site.
* If you remove the urlnormalizer.class property from nutch-site, it must use the one defined in nutch-default.
* ... (if it works another way it is a bug; otherwise, the feature is to use nutch-site first, then nutch-default for any properties not defined in nutch-site). Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
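For concreteness, the property in question would look roughly like this in the two files, assuming the standard <property><name/><value/></property> layout of the Nutch configuration files (the class names are the ones quoted later in this thread):

<!-- in nutch-default.xml (the shipped default): -->
<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.BasicUrlNormalizer</value>
</property>

<!-- in nutch-site.xml, which takes precedence over nutch-default.xml: -->
<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.RegexUrlNormalizer</value>
</property>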
Re: regex-normalize.xml
Hi Jérôme, i think i expressed it wrong. The question was whether it is a feature or a bug that regex-normalize.xml is used only after these changes. Regards Michael Jérôme Charron schrieb: nutch-site.xml overrides properties defined in nutch-default. [snip]
Re: regex-normalize.xml
i think i expressed it wrong. The question was whether it is a feature or a bug that regex-normalize.xml is used only after these changes. regex-normalize.xml is used only after you specify that you want to use the RegexUrlNormalizer implementation. So it is used only if you set urlnormalizer.class=org.apache.nutch.net.RegexUrlNormalizer. But it must also work if you remove the urlnormalizer.class = org.apache.nutch.net.BasicUrlNormalizer setting in nutch-default. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
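Once RegexUrlNormalizer is active, it rewrites each URL using the pattern/substitution pairs from regex-normalize.xml. If memory serves, the stock file groups rules as <regex><pattern/><substitution/></regex> entries; the sessionid rule below is purely illustrative, not one of the shipped rules:

<regex-normalize>
  <!-- Illustrative rule: strip a sessionid query parameter. -->
  <regex>
    <pattern>sessionid=[^&amp;]*&amp;?</pattern>
    <substitution></substitution>
  </regex>
</regex-normalize>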
[jira] Closed: (NUTCH-21) parser plugin for MS PowerPoint slides
[ http://issues.apache.org/jira/browse/NUTCH-21?page=all ] Jerome Charron closed NUTCH-21: --- Fix Version: 0.8-dev Resolution: Fixed Committed to trunk (http://svn.apache.org/viewcvs.cgi?rev=267226&view=rev). Thanks to Stephan Strittmatter. Note: take care with the patches attached to this issue, since their unit tests are platform dependent (they only succeed on Windows). The committed code is platform independent (I hope). I tested it on Linux, so if someone can test it on other platforms it would be a good idea. parser plugin for MS PowerPoint slides -- Key: NUTCH-21 URL: http://issues.apache.org/jira/browse/NUTCH-21 Project: Nutch Type: Improvement Components: fetcher Reporter: Stefan Groschupf Priority: Trivial Fix For: 0.8-dev Attachments: MSPowerPointParser.java, build.xml.patch.txt, parse-mspowerpoint.zip, parse-mspowerpoint.zip Transferred from: http://sourceforge.net/tracker/index.php?func=detail&aid=1109321&group_id=59548&atid=491356 Submitted by: Stephan Strittmatter
Re: Automating workflow using ndfs
Matt, This is great! It would be very useful to Nutch developers if your code can be shared. I'm sure quite a few applications will benefit from it, because it fills a gap between whole-web crawling and single-site (or a handful of sites) crawling. I'll be interested in adapting your plugin to Nutch conventions. Thanks, -AJ

Matt Kangas wrote: AJ and Earl, I've implemented URLFilters before. In fact, I have a WhitelistURLFilter that implements just what you describe: a hashtable of regex-lists. We implemented it specifically because we want to be able to crawl a large number of known-good paths through sites, including paths through CGIs. The hash is a Nutch ArrayFile, which provides low runtime overhead. We've tested it on 200+ sites thus far, and haven't seen any indication that it will have problems scaling further. The filter and its supporting WhitelistWriter currently rely on a few custom classes, but it should be straightforward to adapt to Nutch naming conventions, etc. If you're interested in doing this work, I can see if it's ok to publish our code. BTW, we're currently alpha-testing the site that uses this plugin, and preparing for a public beta. I'll be sure to post here when we're finally open for business. :) --Matt

On Sep 2, 2005, at 11:43 AM, AJ Chen wrote: From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler, it seems that a new urlfilter is a good place to extend the inclusion regex capability. The new urlfilter will be defined by the urlfilter.class property, which gets loaded by the URLFilterFactory. Regex is necessary because you want to include urls matching certain patterns. Can anybody who has implemented a URLFilter plugin before share some thoughts about this approach? I expect the new filter must have all the capabilities that the current RegexURLFilter.java has, so that it won't require changes in any other classes. The difference is that the new filter uses a hash table for efficiently looking up the regex for included domains (a large number!). BTW, I can't find the urlfilter.class property in any of the configuration files in Nutch-0.7. Does the 0.7 version still support urlfilter extension? Any difference relative to what's described in the DissectingTheNutchCrawler doc cited above? Thanks, AJ

Earl Cahill wrote: The goal is to avoid entering 100,000 regex in the crawl-urlfilter.xml and checking ALL these regex for each URL. [snip]

-- Matt Kangas / [EMAIL PROTECTED]
-- AJ (Anjun) Chen, Ph.D. Canova Bioconsulting Marketing * BD * Software Development 748 Matadero Ave., Palo Alto, CA 94306, USA Cell 650-283-4091, [EMAIL PROTECTED]
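To make the "hashtable of regex-lists" idea concrete, here is a minimal Java sketch. It is not Matt's code: his filter stores the table in a Nutch ArrayFile, while this sketch uses a plain in-memory map to stay self-contained, and all names are hypothetical. It follows the URLFilter contract as described in this thread: return the URL to accept it, null to reject it.

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// One hash lookup on the host narrows the candidates to a few patterns,
// so only those regexes ever run against the URL, no matter how many
// sites are whitelisted.
public class WhitelistUrlFilterSketch {

  // host -> compiled regexes for the paths allowed on that host
  private final Map<String, List<Pattern>> whitelist =
      new HashMap<String, List<Pattern>>();

  public void addRule(String host, String pathRegex) {
    List<Pattern> patterns = whitelist.get(host);
    if (patterns == null) {
      patterns = new ArrayList<Pattern>();
      whitelist.put(host, patterns);
    }
    patterns.add(Pattern.compile(pathRegex));
  }

  // Mirrors the URLFilter contract described above: return the URL to
  // accept it, null to reject it.
  public String filter(String urlString) {
    try {
      URL url = new URL(urlString);
      List<Pattern> patterns = whitelist.get(url.getHost());
      if (patterns == null) return null;    // host not whitelisted at all
      for (Pattern p : patterns) {
        // getFile() is the path plus the query string, so rules can
        // whitelist specific paths through CGIs as well.
        if (p.matcher(url.getFile()).matches()) return urlString;
      }
      return null;                          // no path pattern matched
    } catch (MalformedURLException e) {
      return null;                          // unparseable URLs are rejected
    }
  }
}

For example, addRule("good.site.com", "/articles/.*") would accept http://good.site.com/articles/42 after a single hash lookup and one regex match, and reject everything else on that host.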
Re: Automating workflow using ndfs
I'm going to make a request in Jira now. -AJ --- Matt Kangas [EMAIL PROTECTED] wrote: Great! Is there a ticket in JIRA requesting this feature? If not, we should file one and get a few votes in favor of it. AFAIK, that's the process for getting new features into Nutch. On Sep 2, 2005, at 1:30 PM, AJ Chen wrote: Matt, This is great! It would be very useful to Nutch developers if your code can be shared. [snip] -- Matt Kangas / [EMAIL PROTECTED]