Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

2005-11-25 Thread Erik Hatcher


On 24 Nov 2005, at 23:49, Chris Mattmann wrote:

  Dublin core may be good for the semantic web, but not for content
  storage.

 I completely disagree with that.

Me too.

 In fact, I think many people would disagree with it. Dublin Core is a
 standard metadata model for electronic resources. It is by no means the
 entire spectrum of metadata that could be stored for electronic content.
 However, rather than creating your own author field, or content creator,
 or document creator, or whatever you want to call it, I think it would
 be nice to provide the DC metadata, because at least it is well known
 and provides interoperability with other content storage systems. Check
 out DSpace from MIT. Check out ISO-11179 registry systems. Check out the
 ISO standard OAIS reference model for archiving systems. Each of these
 systems has recognized that standard metadata is an important concern in
 any content management system.


Further along these lines... Nutch's instigation had a bit to do with
Google's dominance, and look where Google is headed now!  Semantic
web, oh my!  Google Base is currently just scratching the surface of
where they'll head.  Nutch could certainly be used in this sort of
space.  I was using it that way, but have currently backed off to
something much simpler to begin with: using Nutch to crawl library
archives with RDF data backing the web pages, pointed to by <link>
tags in the <head> section.  That RDF is dumped into a powerful
triplestore (Kowari), with the goal of blending structured RDF queries
with full-text queries.
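
Roughly, the harvesting step looks like this - a sketch only, not my
actual code, and the record URL and class name are made up:

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Statement;
    import com.hp.hpl.jena.rdf.model.StmtIterator;

    // The page's <head> advertises its backing RDF with something like
    //   <link rel="alternate" type="application/rdf+xml" href="record.rdf"/>
    // and the crawler resolves that href to the URL below (hypothetical).
    public class RdfLinkHarvester {
        public static void main(String[] args) {
            String rdfUrl = "http://archive.example.edu/records/record.rdf";

            // Parse the RDF/XML document into an in-memory Jena model.
            Model model = ModelFactory.createDefaultModel();
            model.read(rdfUrl);

            // Print the harvested triples; the real setup loads them
            // into the triplestore instead.
            for (StmtIterator it = model.listStatements(); it.hasNext();) {
                Statement stmt = it.nextStatement();
                System.out.println(stmt.getSubject() + " "
                    + stmt.getPredicate() + " " + stmt.getObject());
            }
        }
    }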


I strongly suspect that there will be more efforts to tweak Nutch  
into the semantic web space.  I'd be surprised otherwise.



  The magic word is minimalism.
  So I vote against this suggestion!
  Stefan

 In general, this proposal represents a step forward in being able to
 parse generic XML content in Nutch, which is a very challenging problem.
 Thanks for your suggestions; however, I think that our proposal would
 help Nutch move forward in being able to handle generic forms of XML
 markup content.


Stefan - please don't inhibit innovation.  Just because you don't
agree with the approach, let them have the freedom to prove it out
with encouragement, not negativity.  Plugins can be turned off, and
if it isn't acceptable in the core then so be it; it doesn't even
have to be an officially supported plugin.  But I, for one, would
like to encourage them to continue with their XML efforts and see
where it leads.


RDF, microformats, triplestores, structured querying, faceted
browsing: these are the things I need, along with full-text search,
of course, and this is the direction Google is headed in a major way.
Full-text is great and all, but it's only part of the story, and a
crude one in many respects. :)  Scraping HTML for meaning... insanity.


Erik




Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

2005-11-25 Thread Stefan Groschupf


On 25 Nov 2005, at 11:30, Erik Hatcher wrote:



 On 24 Nov 2005, at 23:49, Chris Mattmann wrote:

   Dublin core may be good for the semantic web, but not for content
   storage.

  I completely disagree with that.

 Me too.

Are we talking about parsing RDF, or about storing parsed HTML text in
RDF and converting it via XSLT to plain text? I may be misunderstanding
something. I very much like the idea of a general RDF parser; back in
the day I played around with jena.sf.net.
Parsing, yes; but replacing the Nutch sequence file and the concept of
Writables with XML is, from my point of view, a bad idea.




 Stefan - please don't inhibit innovation.
:-) I'm the last one to inhibit innovation, but I would love to see
nutch able to parse billions of pages.
As you can read in my last posting, it was to give freedom to all
developers that I contributed the plugin system back in the day.


Stefan



Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

2005-11-25 Thread Jérôme Charron
 Are we talking about parsing RDF, or about storing parsed HTML text in
 RDF and converting it via XSLT to plain text? I may be misunderstanding
 something. I very much like the idea of a general RDF parser; back in
 the day I played around with jena.sf.net.
 Parsing, yes; but replacing the Nutch sequence file and the concept of
 Writables with XML is, from my point of view, a bad idea.

Once more: please read the proposal one more time, along with my
responses. The proposal doesn't suggest replacing the way data is stored
in Nutch. It is just a proposal for a generic XML parser (as the title
suggests).


 :-) I'm the last one to inhibit innovation, but I would love to see
 nutch able to parse billions of pages.

Today, parsing billions of pages is not the only challenge for search
engines (look at Google, which no longer displays its number of indexed
pages). Parsing many content types, and language technologies
(language-specific stemming, analysis, querying, summarization, ...),
are some of the other new challenges...
The low-level challenges are important, but they must not be a brake on
high-level processes.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Stefan Groschupf

Jérôme,

A mail archive is an amazing source of information, isn't it?! :-)
To answer your question, just ask yourself how many pages per second
you plan to fetch and parse, and how many queries per second a Lucene
index is able to handle - and can deliver in the UI.
I see here something like 200++ pages per second against a maximum of
20 queries per second.
http://wiki.apache.org/nutch/HardwareRequirements

Speed improvements in the UI can be achieved by caching the components
you use to assemble the UI; there are some ways to improve speed.
But seriously, I don't think there will be any pages that contain
'cacheable' items before parsing.
Over the last years, there is one thing I have noticed that matters in
a search engine - minimalism.
Nutch makes no use of a logging library, no RMI, and no metadata in the
web db. Why?

Minimalism.
Minimalism == speed, speed == scalability, scalability == serious
enterprise search engine projects.


I don't think it would be a good move to slow down HTML parsing (the
most used parser) to make RSS parser writing easier for developers.

BTW, we already have an HTML and a feed parser that work, as far as I
know. I guess 90% of Nutch users use the HTML parser but only 10% the
feed parser (since blogs are mostly HTML as well).


From my perspective we have much more general things to solve in
nutch (manageability, monitoring, ndfs block-based task-routing, more
dynamic search servers) than improving things we already have.
Anyway, as you may know, we have a plugin system, and one goal of the
plugin system is to give developers the freedom to develop custom
plugins. :-)


Cheers,
Stefan
B-)

P.S. Do you think it makes sense to run another public nutch mailing
list, since 'THE nutch [...]' mailing list is
nutch-[EMAIL PROTECTED]? 'Isn't it?'

http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01513.html



On 24 Nov 2005, at 19:28, Jérôme Charron wrote:


Hi Stefan,

And thanks for taking the time to read the doc and giving us your
feedback.


-1!

Xsl is terribly slow!
Xml will blow up memory and storage usage.


But there is still something I don't understand...
Regarding a previous discussion we had about the use of the OpenSearch
API to replace Servlet -> HTML by Servlet -> XML -> HTML (using xsl),
here is a copy of one of my comments:

In my opinion, it is the dream front-end architecture. But more
pragmatically, I'm not sure it's a good idea. XSL transformation is a
rather slow process!! And the Nutch front-end must be very responsive.

and then your response, and Doug's response too:
Stefan:
We have already done experiments using XSLT.
There are some ways to improve speed; however, it is 20++% slower
than jsp.

Doug:
I don't think this would make a significant impact on overall Nutch
search performance.
(the complete thread is available at
http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg03811.html
)

I'm a little bit confused... why must the use of xsl be considered too
time and memory expensive in the back-end process, but not in the
front-end?

Dublin core may be good for the semantic web, but not for content
storage.

It is not used as content storage, but just as an intermediate step:
the output of the xsl transformation will then be indexed using
standard nutch APIs.
(notice that this xml file schema maps perfectly to the Parse and
ParseData objects)



In general the goal must be to minimize memory usage and improve
performance; such a parser would increase memory usage and definitely
slow down parsing.

Not to improve flexibility, extensibility, and features?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/




RE: [proposal] Generic Markup Language Parser

2005-11-24 Thread Chris Mattmann
Hi Stefan,


 -1!
 Xsl is terribly slow!

You have to consider what the XSL will be used for. Our proposal suggests
XSL as a means of intermediate transformation of markup content on the
backend, as Jerome suggested in his reply. This means that whenever markup
content is encountered, specifically XML-based content, XSL will be used
to create an intermediary parse-out xml file containing the fields to
index. Given the small percentage of xml-based markup content out there
(of course excluding html) compared to regular content, I don't think this
would significantly degrade performance.
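
Something along these lines, using the standard JAXP transformation API
(the stylesheet and file names are of course just placeholders):

    import java.io.File;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class ParseOutTransform {
        public static void main(String[] args) throws Exception {
            // Stylesheet mapping the fetched markup to the fields to index.
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("rss-to-parseout.xsl")));

            // Produce the small intermediary parse-out file; it can be
            // deleted once its fields have been indexed.
            transformer.transform(new StreamSource(new File("fetched.xml")),
                                  new StreamResult(new File("parse-out.xml")));
        }
    }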

 Xml will blow up memory and storage usage.

Possibly, but I would think that we would do it in a clever fashion. For
instance, the parse-out xml files would most likely be small (~kb) files
that could be deleted if space is a concern. It could be a parameterized
option. 

 Dublin core may be good for the semantic web, but not for content storage.

I completely disagree with that. In fact, I think many people would
disagree with it. Dublin Core is a standard metadata model for electronic
resources. It is by no means the entire spectrum of metadata that could be
stored for electronic content. However, rather than creating your own
author field, or content creator, or document creator, or whatever you
want to call it, I think it would be nice to provide the DC metadata
because at least it is well known and provides interoperability with other
content storage systems. Check out DSpace from MIT. Check out ISO-11179
registry systems. Check out the ISO standard OAIS reference model for
archiving systems. Each of these systems has recognized that standard
metadata is an important concern in any content management system.
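
To make the point concrete - this is only a sketch, and the
Properties-based storage is an assumption, not Nutch's actual metadata
API - using the well-known DC element names as metadata keys is all the
interoperability costs:

    import java.util.Properties;

    public class DublinCoreExample {
        // Standard Dublin Core element names used as metadata keys,
        // instead of home-grown names like "author" or "document creator".
        public static final String TITLE   = "dc:title";
        public static final String CREATOR = "dc:creator";
        public static final String DATE    = "dc:date";

        public static void main(String[] args) {
            Properties meta = new Properties();
            meta.setProperty(TITLE, "Generic Markup Language Parser Proposal");
            meta.setProperty(CREATOR, "Chris Mattmann");
            meta.setProperty(DATE, "2005-11-24");
            // Any DC-aware system (DSpace, an OAIS archive, ...) can read these.
            meta.list(System.out);
        }
    }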

 In general the goal must be to minimize memory usage and improve
 performance; such a parser would increase memory usage and definitely
 slow down parsing.

I don't think it would slow down parsing significantly; as I mentioned
above, markup content represents a small portion of the amount of content
out there.

 The magic word is minimalism.
 So I vote against this suggestion!
 Stefan

In general, this proposal represents a step forward in being able to parse
generic XML content in Nutch, which is a very challenging problem. Thanks
for your suggestions; however, I think that our proposal would help Nutch
move forward in being able to handle generic forms of XML markup content.


Cheers,
   Chris Mattmann

 
 
 
 
 
 On 24 Nov 2005, at 00:01, Jérôme Charron wrote:
 
  Hi,
 
  We (Chris Mattmann, François Martelet, Sébastien Le Callonnec, and
  I) just added a new proposal on the nutch Wiki:
  http://wiki.apache.org/nutch/MarkupLanguageParserProposal
 
  Here is the Summary of Issue:
  Currently, Nutch provides some specific markup language parsing
  plugins:
  one for handling HTML, another one for RSS, but no generic XML parsing
  plugin. This is extremely cumbersome as adding support for a new
  markup
  language implies that you have to develop the whole XML parsing
  code from
  scratch. This methodology causes: (1) code duplication, with little
  or no
  reuse of common pieces of XML parsing code, and (2) dependency library
  duplication, where many XML parsing plugins may rely on similar xml
  parsing
  libraries, such as jaxen, or jdom, or dom4j, etc., but each parsing
  plugin
  keeps its own local copy of these libraries. It is also very
  difficult to
  identify precisely the type of XML content encountered during a
  parse. That
  difficult issue is outside the scope of this proposal, and will be
  identified in a future proposal.
 
  Thanks for your feedback, comments, suggestions (and votes).
 
  Regards
 
  Jérôme
 
  --
  http://motrech.free.fr/
  http://www.frutch.org/



RE: [proposal] Generic Markup Language Parser

2005-11-24 Thread Chris Mattmann
Hi Stefan, and Jerome,

 A mail archive is an amazing source of information, isn't it?! :-)
 To answer your question, just ask yourself how many pages per second
 you plan to fetch and parse, and how many queries per second a Lucene
 index is able to handle - and can deliver in the UI.
 I see here something like 200++ pages per second against a maximum of
 20 queries per second.
 http://wiki.apache.org/nutch/HardwareRequirements

I'm not sure that our proposal affects the UI at all, really. Parsing
occurs only during a fetch, which creates the index for the UI, no? So
why mention the number of queries per second that the UI can handle?

 
 Speed improvements in the UI can be achieved by caching the components
 you use to assemble the UI; there are some ways to improve speed.
 But seriously, I don't think there will be any pages that contain
 'cacheable' items before parsing.
 Over the last years, there is one thing I have noticed that matters in
 a search engine - minimalism.
 Nutch makes no use of a logging library,

Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-)

 no RMI, and no metadata in the web db. Why?
 Minimalism.
 Minimalism == speed, speed == scalability, scalability == serious
 enterprise search engine projects.

 I don't think it would be a good move to slow down HTML parsing (the
 most used parser) to make RSS parser writing easier for developers.

This proposal isn't meant just for RSS; framing it that way seriously
constrains its scope. The proposal is meant to make writing * XML *
parsers easier. Note the XML. RSS is only a small subset of XML as a
whole. And there currently exists no default support for generic XML
documents in Nutch.


 BTW, we already have an HTML and a feed parser that work, as far as I
 know. I guess 90% of Nutch users use the HTML parser but only 10% the
 feed parser (since blogs are mostly HTML as well).

This may or may not be true; however, I wouldn't be surprised if it were,
because it is representative of the division of content on the web -- HTML
is definitely orders of magnitude more pervasive than RSS.

 
 From my perspective we have much more general things to solve in
 nutch (manageability, monitoring, ndfs block-based task-routing, more
 dynamic search servers) than improving things we already have.

I would tend to agree with Jerome on this one -- these seem to be the items
on your agenda: a representative set indeed, but by no means an exhaustive
set of what's needed to improve and benefit Nutch. One of the motivations
behind our proposal was several emails posted to the Nutch list by users
interested in crawling blogs and RSS:

http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/2369417.html

One of my replies to this thread was a message on October 19th, 2005, which
really identified the main problem:

http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/2369576.html

There is a lack of a general XML parser in Nutch that would allow it to
deal with general XML content based on user-defined schemas and DTDs. Our
proposal would be the initial step towards a solution to this overall
problem. At least, that's part of its intention.
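
As a rough sketch of the shared machinery we have in mind (the field
names, XPath expressions, and file name below are hypothetical, not part
of the proposal), a per-schema plugin could reduce to a table mapping
index fields to XPath expressions, evaluated by common code:

    import java.io.File;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;

    public class XPathFieldExtractor {
        public static void main(String[] args) throws Exception {
            // Per-schema configuration: index field -> XPath expression.
            Map<String, String> fields = new LinkedHashMap<String, String>();
            fields.put("title",   "/record/title/text()");
            fields.put("creator", "/record/author/text()");

            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("record.xml"));
            XPath xpath = XPathFactory.newInstance().newXPath();

            for (Map.Entry<String, String> e : fields.entrySet()) {
                // The shared code evaluates each expression against the document.
                System.out.println(e.getKey() + " = "
                    + xpath.evaluate(e.getValue(), doc));
            }
        }
    }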


 Anyway, as you may know, we have a plugin system, and one goal of the
 plugin system is to give developers the freedom to develop custom
 plugins. :-)

Indeed. And our goal is to help developers in their endeavors by providing
a starting point and a generic solution for XML-based parsing plugins :-)

Cheers,
  Chris


 
 Cheers,
 Stefan
 B-)
 
 P.S. Do you think it makes sense to run another public nutch mailing
 list, since 'THE nutch [...]' mailing list is
 nutch-[EMAIL PROTECTED]? 'Isn't it?'
 http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01513.html
 
 
 
 On 24 Nov 2005, at 19:28, Jérôme Charron wrote:
 
  Hi Stefan,
 
  And thanks for taking the time to read the doc and giving us your
  feedback.
 
  -1!
  Xsl is terribly slow!
  Xml will blow up memory and storage usage.
 
  But there is still something I don't understand...
  Regarding a previous discussion we had about the use of the OpenSearch
  API to replace Servlet -> HTML by Servlet -> XML -> HTML (using xsl),
  here is a copy of one of my comments:

  In my opinion, it is the dream front-end architecture. But more
  pragmatically, I'm not sure it's a good idea. XSL transformation is a
  rather slow process!! And the Nutch front-end must be very responsive.
 
  and then your response, and Doug's response too:
  Stefan:
  We have already done experiments using XSLT.
  There are some ways to improve speed; however, it is 20++% slower
  than jsp.
  Doug:
  I don't think this would make a significant impact on overall Nutch
  search performance.
  (the complete thread is available at
  http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg03811.html
  )

  I'm a little bit confused... why must the use of xsl be considered too
  time and memory expensive in the back-end process, but not in the
  front-end?
 
  

Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Stefan Groschupf

 Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-)

No, Nutch uses Java logging; only some plugins use jars that depend on
log4j.
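
For illustration - the logger name is just an example - core code needs
nothing beyond the JDK:

    import java.util.logging.Logger;

    public class LoggingExample {
        // Plain java.util.logging: no log4j jar on the classpath.
        private static final Logger LOG =
            Logger.getLogger("org.apache.nutch.example");

        public static void main(String[] args) {
            LOG.info("logging through the JDK only");
        }
    }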


Stefan