Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser
On 24 Nov 2005, at 23:49, Chris Mattmann wrote:
>> Dublin core may be good for the semantic web, but not for content storage.
> I completely disagree with that.

Me too. In fact, I think many people would disagree with that.

> Dublin Core is a standard metadata model for electronic resources. It is by no means the entire spectrum of metadata that could be stored for electronic content. However, rather than creating your own author field, or content creator, or document creator, or whatever you want to call it, I think it would be nice to provide the DC metadata because at least it is well known and provides interoperability with other content storage systems. Check out DSpace from MIT. Check out ISO-11179 registry systems. Check out the ISO-standard OAIS reference model for archiving systems. Each of these systems has recognized that standard metadata is an important concern in any content management system.

Further along these lines... Nutch's instigation had a bit to do with Google's dominance, and look where Google is headed now! Semantic web, oh my! Google Base is currently just scratching the surface of where they'll head. Nutch could certainly be used in this sort of space. I was, but have currently backed off to something much simpler to begin with: using Nutch to crawl library archives with RDF data backing the web pages, pointed to by link tags in the head section. That RDF is dumped into a powerful triplestore (Kowari), with the goal of blending structured RDF queries with full-text queries. I strongly suspect that there will be more efforts to tweak Nutch into the semantic web space. I'd be surprised otherwise.

> The magic word is minimalism. So I vote against this suggestion! Stefan

In general, this proposal represents a step forward in being able to parse generic XML content in Nutch, which is a very challenging problem.
> Thanks for your suggestions; however, I think that our proposal would help Nutch move forward in being able to handle generic forms of XML markup content.

Stefan - please don't inhibit innovation. Just because you don't agree with the approach, let them have the freedom to prove it out with encouragement, not negativity. Plugins can be turned off, and if it isn't acceptable to be in the core then so be it; it doesn't even have to be an officially supported plugin. But I, for one, would like to encourage them to continue on with their XML efforts and see where it leads. RDF, microformats, triplestores, structured querying, faceted browsing - these are the things I need, along with, of course, full-text search, and this is the direction Google is headed in a major way. Full-text is great and all, but it's only part of the story, and a crude one in many respects. :) Scraping HTML for meaning... insanity.

Erik
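Erik's setup - web pages that point to backing RDF via link tags in the head section - can be sketched in a few lines. This is an illustrative, regex-based sketch only (class and method names are hypothetical, not Nutch or Kowari APIs); a real crawler would use a proper HTML parser rather than a regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: find the href of an RDF <link> tag in an HTML head section,
// the discovery mechanism Erik describes. Names here are illustrative.
public class RdfLinkExtractor {

    // Matches e.g. <link rel="meta" type="application/rdf+xml" href="...">
    // (assumes the type attribute appears before href, as is common)
    private static final Pattern RDF_LINK = Pattern.compile(
            "<link[^>]*type=\"application/rdf\\+xml\"[^>]*href=\"([^\"]+)\"",
            Pattern.CASE_INSENSITIVE);

    public static String extractRdfLink(String html) {
        Matcher m = RDF_LINK.matcher(html);
        return m.find() ? m.group(1) : null; // null when no RDF link is advertised
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<link rel=\"meta\" type=\"application/rdf+xml\""
                + " href=\"http://example.org/record.rdf\"/>"
                + "</head><body/></html>";
        System.out.println(extractRdfLink(html)); // prints the RDF document URL
    }
}
```

The URL found this way would then be fetched separately and loaded into the triplestore; that second step is outside this sketch.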
Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser
On 25 Nov 2005, at 11:30, Erik Hatcher wrote:
>> On 24 Nov 2005, at 23:49, Chris Mattmann wrote:
>>> Dublin core may be good for the semantic web, but not for content storage.
>> I completely disagree with that.
> Me too.

Do we talk about parsing RDF, or do we discuss storing parsed HTML text in RDF and converting it via XSLT to pure text? I may be misunderstanding something. I very much like the idea of a general RDF parser. Back in the day I played around with jena.sf.net. Parsing, yes; replacing the Nutch sequence file and the concept of Writables with XML is, from my point of view, a bad idea.

> Stefan - please don't inhibit innovation.

:-) I'm the last one to inhibit innovation, but I would love to see Nutch able to parse billions of pages. As you can read in my last posting, I contributed the plugin system back in the day precisely to give freedom to all developers.

Stefan
Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser
> Do we talk about parsing RDF, or do we discuss storing parsed HTML text in RDF and converting it via XSLT to pure text? I may be misunderstanding something. I very much like the idea of a general RDF parser. Back in the day I played around with jena.sf.net. Parsing, yes; replacing the Nutch sequence file and the concept of Writables with XML is, from my point of view, a bad idea.

One more time: please read the proposal one more time, and my responses. The proposal doesn't suggest replacing the way data are stored in Nutch. It is just a proposal for a generic XML parser (as the title suggests). :-)

> I'm the last one to inhibit innovation, but I would love to see Nutch able to parse billions of pages.

Today, parsing billions of pages is not the only challenge for search engines (look at Google, which no longer displays the number of indexed pages). The parsing of many content types, and the language technologies (language-specific stemming, analysis, querying, summarization, ...), are some of the other new challenges. The low-level challenges are important, but they must not be a brake on high-level processes.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
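To make the "generic XML parser" idea concrete: the JDK alone can already parse namespace-qualified content such as Dublin Core elements inside RDF/XML, without any per-format parsing code. The sketch below is an assumption of mine about what a minimal field extraction could look like (class and method names are hypothetical, not part of the proposal), using only `javax.xml.parsers`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Sketch: namespace-aware extraction of Dublin Core fields from RDF/XML.
public class DublinCoreReader {

    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Returns the text of the first dc:<local> element, or null if absent.
    public static String firstDcElement(String rdfXml, String local) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true); // required to resolve the dc: prefix
        Document doc = f.newDocumentBuilder()
                .parse(new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
        NodeList hits = doc.getElementsByTagNameNS(DC_NS, local);
        return hits.getLength() > 0 ? hits.item(0).getTextContent() : null;
    }

    public static void main(String[] args) throws Exception {
        String rdf = "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
                + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<rdf:Description rdf:about=\"http://example.org/doc\">"
                + "<dc:title>A sample record</dc:title>"
                + "<dc:creator>Jane Doe</dc:creator>"
                + "</rdf:Description></rdf:RDF>";
        System.out.println(firstDcElement(rdf, "title"));   // prints: A sample record
        System.out.println(firstDcElement(rdf, "creator")); // prints: Jane Doe
    }
}
```

Note that nothing here touches Nutch's storage layer - the extracted strings would simply be handed to the existing indexing APIs, which is the distinction Jérôme is making.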
Re: [proposal] Generic Markup Language Parser
Jérôme,

A mail archive is an amazing source of information, isn't it?! :-)

To answer your question, just ask yourself how many pages per second you plan to fetch and parse, and how many queries per second a Lucene index is able to handle - and deliver in the UI. I see here something like 200+ pages per second against a maximum of 20 queries per second. http://wiki.apache.org/nutch/HardwareRequirements

Speed improvement in the UI can be done by caching the components you use to assemble the UI; there are some ways to improve speed. But seriously, I don't think there will be any pages that contain 'cacheable' items during parsing. Over the last years there is one thing I have noticed that matters in a search engine - minimalism. There is no usage in Nutch of a logging library, no RMI, and no metadata in the web db. Why? Minimalism. Minimalism == speed, speed == scalability, scalability == serious enterprise search engine projects. I don't think it would be a good move to slow down HTML parsing (the most used parser) to make RSS parser writing easier for developers. BTW, we already have an HTML and a feed parser that work, as far as I know. I guess 90% of the Nutch users use the HTML parser but only 10% the feed parser (since blogs are mostly HTML as well). From my perspective we have much more general things to solve in Nutch (manageability, monitoring, NDFS block-based task routing, more dynamic search servers) than improving things we already have. Anyway, as you may know, we have a plugin system, and one goal of the plugin system is to give developers the freedom to develop custom plugins. :-)

Cheers,
Stefan B-)

P.S. Do you think it makes sense to run another public nutch mailing list, since 'THE nutch [...]' (mailing list is nutch-[EMAIL PROTECTED]), 'Isn't it?' http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01513.html

On 24.11.2005, at 19:28, Jérôme Charron wrote:
> Hi Stefan,
> And thanks for taking the time to read the doc and giving us your feedback.
>> -1! Xsl is terribly slow!
>> Xml will blow up memory and storage usage.
>
> But there is still something I don't understand... Regarding a previous discussion we had about the use of the OpenSearch API to replace Servlet -> HTML by Servlet -> XML -> HTML (using XSL), here is a copy of one of my comments:
>
>> In my opinion, it is the dreamed front-end architecture. But more pragmatically, I'm not sure it's a good idea. XSL transformation is a rather slow process!! And the Nutch front-end must be very responsive.
>
> and then your response, and Doug's response too:
>
> Stefan: We have already done experiments using XSLT. There are some ways to improve speed; however, it is 20+% slower than JSP.
> Doug: I don't think this would make a significant impact on overall Nutch search performance.
>
> (the complete thread is available at http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg03811.html )
>
> I'm a little bit confused... why must the use of XSL be considered too time- and memory-expensive in the back-end process, but not in the front-end?
>
>> Dublin core may be good for the semantic web, but not for content storage.
>
> It is not used as content storage, but just as an intermediate step: the output of the XSL transformation, which will then be indexed using the standard Nutch APIs. (Notice that this XML file schema maps perfectly to the Parse and ParseData objects.)
>
>> In general the goal must be to minimize memory usage and improve performance; such a parser would increase memory usage and definitely slow down parsing.
>
> Not improve flexibility, extensibility and features?
>
> Jérôme
> --
> http://motrech.free.fr/
> http://www.frutch.org/
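On the "some ways to improve speed" point: a large part of XSLT overhead comes from recompiling the stylesheet on every use. The JAXP `Templates` interface lets you compile once and reuse the result across threads and documents. The sketch below is illustrative only (the class name and the toy stylesheet are my own assumptions, not anything from the thread), using only the JDK's `javax.xml.transform`.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Sketch: compile a stylesheet once into a Templates object and reuse it,
// instead of paying the compilation cost on every transformation.
public class CachedStylesheet {

    // A toy stylesheet (an assumption for this sketch): emit <title> as text.
    public static final String SAMPLE_XSL =
            "<xsl:stylesheet version=\"1.0\" "
          + "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"text\"/>"
          + "<xsl:template match=\"/doc\"><xsl:value-of select=\"title\"/></xsl:template>"
          + "</xsl:stylesheet>";

    private final Templates compiled; // thread-safe, built exactly once

    public CachedStylesheet(String xsl) throws Exception {
        compiled = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xsl)));
    }

    public String transform(String xml) throws Exception {
        StringWriter out = new StringWriter();
        // newTransformer() is cheap; the expensive compilation happened above
        compiled.newTransformer().transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        CachedStylesheet sheet = new CachedStylesheet(SAMPLE_XSL);
        System.out.println(sheet.transform("<doc><title>hello</title></doc>")); // prints: hello
    }
}
```

This mirrors the UI-side caching Stefan mentions: the cacheable item in back-end parsing is not the page but the compiled stylesheet itself.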
RE: [proposal] Generic Markup Language Parser
Hi Stefan,

> -1! Xsl is terribly slow!

You have to consider what the XSL will be used for. Our proposal suggests XSL as a means of intermediate transformation of markup content on the back end, as Jerome suggested in his reply. This means that whenever markup content is encountered - specifically, XML-based content - XSL will be used to create an intermediary parse-out XML file containing the fields to index. I don't think, given the percentage of XML-based markup content out there (of course excluding HTML) compared to regular content, that this would significantly degrade performance.

> Xml will blow up memory and storage usage.

Possibly, but I would think that we would do it in a clever fashion. For instance, the parse-out XML files would most likely be small (~kB) files that could be deleted if space is a concern. It could be a parameterized option.

> Dublin core may be good for the semantic web, but not for content storage.

I completely disagree with that. In fact, I think many people would disagree with that. Dublin Core is a standard metadata model for electronic resources. It is by no means the entire spectrum of metadata that could be stored for electronic content. However, rather than creating your own author field, or content creator, or document creator, or whatever you want to call it, I think it would be nice to provide the DC metadata because at least it is well known and provides interoperability with other content storage systems. Check out DSpace from MIT. Check out ISO-11179 registry systems. Check out the ISO-standard OAIS reference model for archiving systems. Each of these systems has recognized that standard metadata is an important concern in any content management system.

> In general the goal must be to minimize memory usage and improve performance; such a parser would increase memory usage and definitely slow down parsing.
I don't think it would slow down parsing significantly; as I mentioned above, markup content represents a small portion of the amount of content out there.

> The magic word is minimalism. So I vote against this suggestion! Stefan

In general, this proposal represents a step forward in being able to parse generic XML content in Nutch, which is a very challenging problem. Thanks for your suggestions; however, I think that our proposal would help Nutch move forward in being able to handle generic forms of XML markup content.

Cheers,
Chris Mattmann

On 24.11.2005, at 00:01, Jérôme Charron wrote:
> Hi,
> We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and me) just added a new proposal on the Nutch Wiki: http://wiki.apache.org/nutch/MarkupLanguageParserProposal
> Here is the summary of the issue: Currently, Nutch provides some specific markup language parsing plugins - one for handling HTML, another one for RSS - but no generic XML parsing plugin. This is extremely cumbersome, as adding support for a new markup language implies that you have to develop the whole XML parsing code from scratch. This methodology causes: (1) code duplication, with little or no reuse of common pieces of XML parsing code, and (2) dependency library duplication, where many XML parsing plugins may rely on similar XML parsing libraries, such as jaxen, or jdom, or dom4j, etc., but each parsing plugin keeps its own local copy of these libraries. It is also very difficult to identify precisely the type of XML content encountered during a parse. That difficult issue is outside the scope of this proposal, and will be addressed in a future proposal.
> Thanks for your feedback, comments, suggestions (and votes).
> Regards
> Jérôme
> --
> http://motrech.free.fr/
> http://www.frutch.org/
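The "intermediary parse-out XML file" idea can be sketched end to end with the JDK's own XSLT engine. Everything below is an assumption of mine for illustration - the `<article>` input format, the `<parseout>` output shape, and the class name are all hypothetical, not the proposal's actual schema - but it shows the mechanism: one stylesheet per source schema maps arbitrary XML onto flat Dublin Core fields ready for indexing.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Sketch: transform a custom XML format into a flat "parse-out" document
// whose Dublin Core elements are the fields to index.
public class ParseOutDemo {

    // Hypothetical per-schema stylesheet: maps <article> onto DC fields.
    static final String XSL =
            "<xsl:stylesheet version=\"1.0\""
          + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
          + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
          + "<xsl:template match=\"/article\">"
          + "<parseout>"
          + "<dc:title><xsl:value-of select=\"headline\"/></dc:title>"
          + "<dc:creator><xsl:value-of select=\"byline\"/></dc:creator>"
          + "</parseout>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    public static String toParseOut(String xml) throws Exception {
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)))
                .transform(new StreamSource(new StringReader(xml)),
                           new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String article = "<article><headline>Nutch news</headline>"
                + "<byline>J. Charron</byline></article>";
        System.out.println(toParseOut(article));
    }
}
```

A new markup format then only needs a new (small) stylesheet, not a new parsing plugin; the parse-out document stays tiny, consistent with the "~kB files" Chris describes.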
RE: [proposal] Generic Markup Language Parser
Hi Stefan and Jerome,

> A mail archive is an amazing source of information, isn't it?! :-)
> To answer your question, just ask yourself how many pages per second you plan to fetch and parse, and how many queries per second a Lucene index is able to handle - and deliver in the UI. I see here something like 200+ pages per second against a maximum of 20 queries per second. http://wiki.apache.org/nutch/HardwareRequirements

I'm not sure that our proposal affects the UI at all, really. Parsing occurs only during a fetch, which creates the index for the UI, no? So why mention the number of queries per second that the UI can handle?

> Speed improvement in the UI can be done by caching the components you use to assemble the UI; there are some ways to improve speed. But seriously, I don't think there will be any pages that contain 'cacheable' items during parsing. Over the last years there is one thing I have noticed that matters in a search engine - minimalism. There is no usage in Nutch of a logging library,

Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-)

> no RMI, and no metadata in the web db. Why? Minimalism. Minimalism == speed, speed == scalability, scalability == serious enterprise search engine projects. I don't think it would be a good move to slow down HTML parsing (the most used parser) to make RSS parser writing easier for developers.

This proposal isn't meant for RSS; that seriously constrains the scope. The proposal is meant to make writing *XML* parsers easier. Note the XML. RSS is a significantly small subset of XML as a whole. And there currently exists no default support for generic XML documents in Nutch.

> BTW, we already have an HTML and a feed parser that work, as far as I know. I guess 90% of the Nutch users use the HTML parser but only 10% the feed parser (since blogs are mostly HTML as well).
This may or may not be true; however, I wouldn't be surprised if it were, because it is representative of the division of content on the web -- HTML is definitely orders of magnitude more pervasive than RSS.

> From my perspective we have much more general things to solve in Nutch (manageability, monitoring, NDFS block-based task routing, more dynamic search servers) than improving things we already have.

I would tend to agree with Jerome on this one -- these seem to be the items on your agenda: a representative set indeed, but by no means an exhaustive set of what's needed to improve and benefit Nutch. One of the motivations behind our proposal was several emails posted to the Nutch list by users interested in crawling blogs and RSS: http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/2369417.html

One of my replies to this thread was a message on October 19th, 2005, which really identified the main problem: http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/2369576.html

There is a lack of a general XML parser in Nutch that would allow it to deal with general XML content based on user-defined schemas and DTDs. Our proposal would be the initial step towards a solution to this overall problem. At least, that's part of its intention.

> Anyway, as you may know, we have a plugin system, and one goal of the plugin system is to give developers the freedom to develop custom plugins. :-)

Indeed. And our goal is to help developers in their endeavors by providing a starting point and a generic solution for XML-based parsing plugins. :-)

Cheers,
Chris
Re: [proposal] Generic Markup Language Parser
> Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-)

No, Nutch uses Java logging; only some plugins use jars that depend on log4j.

Stefan