Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting
Ken Krugler wrote: For what it's worth, below is the filter list we're using for doing an HTML-centric crawl (no Word docs, for example). Using (?i) means we don't need to list both upper- and lower-case versions of the suffixes.
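
For illustration, such a case-insensitive suffix rule can be written once with (?i) rather than listing every suffix in both cases; the suffix list and class name below are placeholders, not Ken's actual list:

    import java.util.regex.Pattern;

    public class SuffixFilterExample {
        // One (?i) pattern covers .GIF, .gif, .Gif, etc., so the list
        // does not need separate upper- and lower-case variants.
        private static final Pattern SKIP =
            Pattern.compile("(?i)\\.(gif|jpg|png|ico|css|zip|gz|exe)$");

        public static void main(String[] args) {
            System.out.println(SKIP.matcher("http://example.com/logo.GIF").find());  // true
            System.out.println(SKIP.matcher("http://example.com/page.html").find()); // false
        }
    }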

Re: Urlfilter Patch

2005-12-01 Thread Rod Taylor
On Thu, 2005-12-01 at 18:53 +, Howie Wang wrote: . And .xhtml seem like they would be parsable by the default HTML parser. Ditto for .xml. It is a valid (though seldom used) xhtml extension. Howie From: Doug Cutting [EMAIL PROTECTED] Ken Krugler wrote: For what it's worth, below is

Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
Suggestion: For consistency purposes, and ease of Nutch management, why not filter the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting the file

Re: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Jerome, I think that this is a great idea and ensures that there isn't replication of so-called management information across the system. It could be easily implemented as a utility method because we have utility Java classes that represent the ParsePluginList, from which you could get the mimeTypes
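
A rough sketch of the kind of utility Chris describes, kept deliberately generic: it takes the mime-types that the activated parse plugins can handle (which a real implementation would obtain from parse-plugins.xml via the ParsePluginList classes) plus a mime-type-to-extension map, and returns the extensions worth fetching. The class, method, and parameter names are hypothetical, not existing Nutch API:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class ParsableExtensions {
        /**
         * Build the set of file extensions worth fetching, given the
         * mime-types the activated parse plugins can handle and a
         * mapping from mime-type to known file extensions.
         */
        public static Set<String> forMimeTypes(List<String> parsedMimeTypes,
                                               Map<String, List<String>> mimeToExtensions) {
            Set<String> extensions = new HashSet<String>();
            for (String mimeType : parsedMimeTypes) {
                List<String> exts = mimeToExtensions.get(mimeType);
                if (exts != null) {
                    extensions.addAll(exts);
                }
            }
            return extensions;
        }
    }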

Re: Urlfilter Patch

2005-12-01 Thread Piotr Kosiorowski
Jérôme Charron wrote: [...] build a list of file extensions to include (other ones will be excluded) in the fetch process. [...] I would not like to exclude all others - as for example many extensions are valid for HTML - especially dynamically generated pages (jsp, asp, cgi just to name the easy

Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting
Jérôme Charron wrote: For consistency purposes, and ease of Nutch management, why not filter the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting

Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting
Chris Mattmann wrote: In principle, the mimeType system should give us some guidance on determining the appropriate mimeType for the content, regardless of whether it ends in .foo, .bar or the like. Right, but the URL filters run long before we know the mime type, in order to try to keep us
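
To make the ordering concrete: a URL filter is handed nothing but the URL string, long before any content or Content-Type header exists, so a suffix pattern is the only signal available at that stage. A minimal sketch in the shape of Nutch's URLFilter contract (roughly: return the URL to keep it, null to reject it); the pattern is illustrative, not the shipped regex-urlfilter rules:

    import java.util.regex.Pattern;

    public class SuffixUrlFilter {
        // Illustrative list of suffixes we assume cannot be parsed.
        private static final Pattern UNPARSEABLE =
            Pattern.compile("(?i)\\.(exe|zip|iso|dmg)$");

        /** Return the URL to keep it, or null to drop it before fetching. */
        public String filter(String urlString) {
            return UNPARSEABLE.matcher(urlString).find() ? null : urlString;
        }
    }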

Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
Right, but the URL filters run long before we know the mime type, in order to try to keep us from fetching lots of stuff we can't process. The mime type is not known until we've fetched it. Yes, the fetcher can't rely on the document mime-type. The only thing we can use for filtering is the

Re: Urlfilter Patch

2005-12-01 Thread Matt Kangas
The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content. It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes... [EMAIL PROTECTED]:~$ telnet localhost
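
The same check Matt demonstrates with telnet could be done programmatically. A minimal Java sketch (the URL is a placeholder) that issues a HEAD request and reads the Content-Type header without downloading the body:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HeadCheck {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.com/somefile");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("HEAD");               // headers only, no body
            String contentType = conn.getContentType();  // e.g. "text/html"
            System.out.println("Content-Type: " + contentType);
            conn.disconnect();
            // A crawler could skip the full GET here if contentType
            // is not one of the parseable types.
        }
    }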

Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting
Matt Kangas wrote: The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content. It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes... This could be a

Re: Urlfilter Patch

2005-12-01 Thread Ken Krugler
Suggestion: For consistency purposes, and ease of Nutch management, why not filter the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting the file

RE: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Hi Doug, Chris Mattmann wrote: In principle, the mimeType system should give us some guidance on determining the appropriate mimeType for the content, regardless of whether it ends in .foo, .bar or the like. Right, but the URL filters run long before we know the mime type, in

Re: Urlfilter Patch

2005-12-01 Thread Matt Kangas
Totally agreed. Neither approach replaces the other. I just wanted to mention the possibility so people don't over-focus on trying to build a hyper-optimized regex list. :) For the content provider, an HTTP HEAD request saves them bandwidth if we don't do a GET. That's some cost savings for
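
As a rough, illustrative calculation (numbers assumed, not taken from the thread): if a HEAD response is on the order of 300 bytes of headers while a typical fetched document runs 20-30 KB, skipping the GET for a document Nutch cannot parse avoids roughly 99% of the transfer for that URL, while URLs that are fetched anyway pay only one extra small round trip.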