Ken Krugler wrote:
For what it's worth, below is the filter list we're using for doing an
html-centric crawl (no Word docs, for example). Using the (?i) means we
don't need to have upper- and lower-case versions of the suffixes.
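The effect Ken describes can be sketched as follows. This is not his actual suffix list (which is truncated here); the suffixes below are hypothetical, but the mechanism is the same: an exclusion regex anchored at the end of the URL, with the inline `(?i)` flag making one rule cover both cases.

```python
import re

# Hypothetical suffix list (Ken's real list is not shown in this thread).
# In Nutch's regex-urlfilter.txt a leading "-" marks an exclusion rule;
# here we just model the regex itself. (?i) makes the match case-insensitive,
# so a single rule rejects both ".pdf" and ".PDF".
SUFFIX_RULE = r"(?i)\.(pdf|doc|xls|ppt|zip|gz|exe|mpg|mov)$"

def accepts(url: str) -> bool:
    """Return True if the URL survives the exclusion rule."""
    return re.search(SUFFIX_RULE, url) is None

print(accepts("http://example.com/page.html"))   # True
print(accepts("http://example.com/report.PDF"))  # False: (?i) catches uppercase
```

Without `(?i)`, the suffix list would need every case variant spelled out (`pdf|PDF|Pdf|…`), which is exactly the duplication Ken is avoiding.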
On Thu, 2005-12-01 at 18:53 +, Howie Wang wrote:
[...] and .xhtml seem like they
would be parsable by the default HTML parser.
Ditto for .xml. It is a valid (though seldom used) xhtml extension.
Howie
From: Doug Cutting [EMAIL PROTECTED]
Ken Krugler wrote:
For what it's worth, below is
Suggestion:
For consistency purposes, and for ease of Nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and the
activated plugins, we know which content-types will be parsed.
So, by getting the file
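Jérôme's suggestion (cut off above) could be sketched like this. The mapping and plugin names below are hypothetical stand-ins for what parse-plugins.xml and the activated-plugins setting would provide; the point is only the derivation: mime types with an active parser imply the extensions worth fetching.

```python
import mimetypes

# Hypothetical stand-in for the mime-type -> plugin mapping defined in
# parse-plugins.xml, and for the plugins activated in the Nutch config.
PARSE_PLUGINS = {
    "text/html": "parse-html",
    "application/pdf": "parse-pdf",
    "text/plain": "parse-text",
}
ACTIVATED = {"parse-html", "parse-text"}

def parseable_extensions():
    """Extensions whose mime type maps to an activated parse plugin."""
    exts = set()
    for mime, plugin in PARSE_PLUGINS.items():
        if plugin in ACTIVATED:
            # Standard-library guess at extensions for this mime type.
            exts.update(mimetypes.guess_all_extensions(mime, strict=True))
    return exts

print(sorted(parseable_extensions()))
```

With parse-pdf left out of the activated set, no PDF extension appears in the result, so the URL filter derived from this list would skip fetching PDFs that Nutch could not parse anyway.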
Jerome,
I think that this is a great idea and ensures that there isn't replication
of so-called management information across the system. It could be easily
implemented as a utility method because we have utility java classes that
represent the ParsePluginList, that you could get the mimeTypes
Jérôme Charron wrote:
[...]
build a list of file extensions to include (other ones will be excluded) in
the fetch process.
[...]
I would not like to exclude all the others - for example, many extensions
are valid for HTML, especially dynamically generated pages (jsp, asp, cgi,
just to name the easy
Jérôme Charron wrote:
For consistency purposes, and for ease of Nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and the
activated plugins, we know which content-types will be parsed.
So, by getting
Chris Mattmann wrote:
In principle, the mimeType system should give us some guidance on
determining the appropriate mimeType for the content, regardless of whether
it ends in .foo, .bar or the like.
Right, but the URL filters run long before we know the mime type, in
order to try to keep us
Right, but the URL filters run long before we know the mime type, in
order to try to keep us from fetching lots of stuff we can't process.
The mime type is not known until we've fetched it.
Yes, the fetcher can't rely on the document mime-type.
The only thing we can use for filtering is the
The latter is not strictly true. Nutch could issue an HTTP HEAD
before the HTTP GET, and determine the mime-type before actually
grabbing the content.
It's not how Nutch works now, but this might be more useful than a
super-detailed set of regexes...
[EMAIL PROTECTED]:~$ telnet localhost
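Matt's idea can be shown without telnet. This is a sketch, not how Nutch actually works (as he says himself): issue a HEAD request and read the Content-Type header, so the mime type is known before committing to a full GET. The function name and signature are made up for illustration.

```python
from http.client import HTTPConnection
from typing import Optional

def head_content_type(host: str, port: int = 80, path: str = "/") -> Optional[str]:
    """Issue an HTTP HEAD and return the Content-Type header, if any.

    Sketch of the HEAD-before-GET idea: learn the mime type without
    downloading the body.
    """
    conn = HTTPConnection(host, port, timeout=10)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.getheader("Content-Type")
    finally:
        conn.close()
```

The trade-off, as the thread notes below, is an extra round trip per URL in exchange for skipping GETs on content no activated parser can handle.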
Matt Kangas wrote:
The latter is not strictly true. Nutch could issue an HTTP HEAD before
the HTTP GET, and determine the mime-type before actually grabbing the
content.
It's not how Nutch works now, but this might be more useful than a
super-detailed set of regexes...
This could be a
Suggestion:
For consistency purposes, and for ease of Nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and the
activated plugins, we know which content-types will be parsed.
So, by getting the file
Hi Doug,
Chris Mattmann wrote:
In principle, the mimeType system should give us some guidance on
determining the appropriate mimeType for the content, regardless of
whether
it ends in .foo, .bar or the like.
Right, but the URL filters run long before we know the mime type, in
Totally agreed. Neither approach replaces the other. I just wanted to
mention the possibility so people don't over-focus on trying to build a
hyper-optimized regex list. :)
For the content provider, an HTTP HEAD request saves them bandwidth
if we don't do a GET. That's some cost savings for