Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting
Ken Krugler wrote: For what it's worth, below is the filter list we're using for doing an HTML-centric crawl (no Word docs, for example). Using (?i) means we don't need to list both upper- and lower-case versions of the suffixes.
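
For illustration, such a case-insensitive suffix rule can be written once with (?i) rather than listing every suffix in both cases; the suffix list and class name below are placeholders, not Ken's actual list:

    import java.util.regex.Pattern;

    public class SuffixFilterExample {
        // One (?i) pattern covers .GIF, .gif, .Gif, etc., so the list
        // does not need separate upper- and lower-case variants.
        private static final Pattern SKIP =
            Pattern.compile("(?i)\\.(gif|jpg|png|ico|css|zip|gz|exe)$");

        public static void main(String[] args) {
            System.out.println(SKIP.matcher("http://example.com/logo.GIF").find());  // true
            System.out.println(SKIP.matcher("http://example.com/page.html").find()); // false
        }
    }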

Re: Urlfilter Patch

2005-12-01 Thread Rod Taylor
On Thu, 2005-12-01 at 18:53 +, Howie Wang wrote: . And .xhtml seem like they would be parsable by the default HTML parser. Ditto for .xml. It is a valid (though seldom used) xhtml extension. Howie From: Doug Cutting [EMAIL PROTECTED] Ken Krugler wrote: For what it's worth, below is

Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
Suggestion: For consistency purposes, and ease of Nutch management, why not filter the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting the file

Re: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Jerome, I think that this is a great idea and ensures that there isn't replication of so-called management information across the system. It could be easily implemented as a utility method because we have utility Java classes that represent the ParsePluginList, from which you could get the mimeTypes
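
A rough sketch of the kind of utility Chris describes, kept deliberately generic: it takes the mime-types that the activated parse plugins can handle (which a real implementation would obtain from parse-plugins.xml via the ParsePluginList classes) plus a mime-type-to-extension map, and returns the extensions worth fetching. The class, method, and parameter names are hypothetical, not existing Nutch API:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class ParsableExtensions {
        /**
         * Build the set of file extensions worth fetching, given the
         * mime-types the activated parse plugins can handle and a
         * mapping from mime-type to known file extensions.
         */
        public static Set<String> forMimeTypes(List<String> parsedMimeTypes,
                                               Map<String, List<String>> mimeToExtensions) {
            Set<String> extensions = new HashSet<String>();
            for (String mimeType : parsedMimeTypes) {
                List<String> exts = mimeToExtensions.get(mimeType);
                if (exts != null) {
                    extensions.addAll(exts);
                }
            }
            return extensions;
        }
    }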

Re: Urlfilter Patch

2005-12-01 Thread Piotr Kosiorowski
Jérôme Charron wrote: [...] build a list of file extensions to include (other ones will be excluded) in the fetch process. [...] I would not like to exclude all others - as for example many extensions are valid for HTML - especially dynamically generated pages (jsp, asp, cgi just to name the easy

Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting
Jérôme Charron wrote: For consistency purposes, and ease of Nutch management, why not filter the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting

Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting
Chris Mattmann wrote: In principle, the mimeType system should give us some guidance on determining the appropriate mimeType for the content, regardless of whether it ends in .foo, .bar or the like. Right, but the URL filters run long before we know the mime type, in order to try to keep us
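
To make the ordering concrete: a URL filter is handed nothing but the URL string, long before any content or Content-Type header exists, so a suffix pattern is the only signal available at that stage. A minimal sketch in the shape of Nutch's URLFilter contract (roughly: return the URL to keep it, null to reject it); the pattern is illustrative, not the shipped regex-urlfilter rules:

    import java.util.regex.Pattern;

    public class SuffixUrlFilter {
        // Illustrative list of suffixes we assume cannot be parsed.
        private static final Pattern UNPARSEABLE =
            Pattern.compile("(?i)\\.(exe|zip|iso|dmg)$");

        /** Return the URL to keep it, or null to drop it before fetching. */
        public String filter(String urlString) {
            return UNPARSEABLE.matcher(urlString).find() ? null : urlString;
        }
    }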

Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
Right, but the URL filters run long before we know the mime type, in order to try to keep us from fetching lots of stuff we can't process. The mime type is not known until we've fetched it. Yes, the fetcher can't rely on the document mime-type. The only thing we can use for filtering is the

Re: Urlfilter Patch

2005-12-01 Thread Matt Kangas
The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content. It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes... [EMAIL PROTECTED]:~$ telnet localhost
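
The same check Matt demonstrates with telnet could be done programmatically. A minimal Java sketch (the URL is a placeholder) that issues a HEAD request and reads the Content-Type header without downloading the body:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HeadCheck {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.com/somefile");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("HEAD");               // headers only, no body
            String contentType = conn.getContentType();  // e.g. "text/html"
            System.out.println("Content-Type: " + contentType);
            conn.disconnect();
            // A crawler could skip the full GET here if contentType
            // is not one of the parseable types.
        }
    }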

Re: Urlfilter Patch

2005-12-01 Thread Doug Cutting
Matt Kangas wrote: The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content. It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes... This could be a

Re: Urlfilter Patch

2005-12-01 Thread Ken Krugler
Suggestion: For consistency purposes, and ease of Nutch management, why not filter the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting the file

RE: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Hi Doug, Chris Mattmann wrote: In principle, the mimeType system should give us some guidance on determining the appropriate mimeType for the content, regardless of whether it ends in .foo, .bar or the like. Right, but the URL filters run long before we know the mime type, in

Re: Urlfilter Patch

2005-12-01 Thread Matt Kangas
Totally agreed. Neither approach replaces the other. I just wanted to mention the possibility so people don't over-focus on trying to build a hyper-optimized regex list. :) For the content provider, an HTTP HEAD request saves them bandwidth if we don't do a GET. That's some cost savings for
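
As a rough, illustrative calculation (numbers assumed, not taken from the thread): if a HEAD response is on the order of 300 bytes of headers while a typical fetched document runs 20-30 KB, skipping the GET for a document Nutch cannot parse avoids roughly 99% of the transfer for that URL, while URLs that are fetched anyway pay only one extra small round trip.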