Re: [proposal] Generic Markup Language Parser

2005-11-28 Thread Doug Cutting
Andrzej Bialecki wrote: Gentlemen, please let's keep a civilized tone to this exchange, or take it off the list. +1 Doug

Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

2005-11-25 Thread Erik Hatcher
On 24 Nov 2005, at 23:49, Chris Mattmann wrote: Dublin Core may be good for the semantic web, but not for content storage. I completely disagree with that. Me too. In fact, I think many people would disagree with that. Dublin Core is a standard metadata model for electronic

Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

2005-11-25 Thread Stefan Groschupf
On 25.11.2005 at 11:30, Erik Hatcher wrote: On 24 Nov 2005, at 23:49, Chris Mattmann wrote: Dublin Core may be good for the semantic web, but not for content storage. I completely disagree with that. Me too. Are we talking about parsing RDF, or are we discussing storing parsed HTML text in RDF

Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

2005-11-25 Thread Jérôme Charron
Are we talking about parsing RDF, or are we discussing storing parsed HTML text in RDF and converting it via XSLT to plain text? I may be misunderstanding something. I very much like the idea of a general RDF parser. Back in the day I played around with jena.sf.net. Parsing yes, replace nutch sequence file and the

Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Stefan Groschupf
Jérôme, A mail archive is an amazing source of information, isn't it?! :-) To answer your question, just ask yourself how many pages per second you plan to fetch and parse, and how many queries per second a Lucene index is able to handle and you can deliver in the UI. I have here

RE: [proposal] Generic Markup Language Parser

2005-11-24 Thread Chris Mattmann
Hi Stefan, -1! XSL is terribly slow! You have to consider what the XSL will be used for. Our proposal suggests XSL as a means of intermediate transformation of markup content on the backend, as Jerome suggested in his reply. This means that whenever markup content is encountered,

RE: [proposal] Generic Markup Language Parser

2005-11-24 Thread Chris Mattmann
Hi Stefan and Jerome, A mail archive is an amazing source of information, isn't it?! :-) To answer your question, just ask yourself how many pages per second you plan to fetch and parse, and how many queries per second a Lucene index is able to handle and you can deliver in the UI. I

Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Stefan Groschupf
Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-) No, Nutch uses Java logging; only some plugins use JARs that depend on log4j. Stefan