Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Stefan Groschupf
speed == scalability Oh, damn, is that a new theory, Stefan? No? How will you run a search engine that scales up to billions of pages if it can only parse 20 pages per second? Do you have unlimited hardware resources? Let me know, I would be interested in joining your project. Yes
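
To put the 20 pages per second figure in perspective, here is a rough back-of-the-envelope sketch. The one-billion-page corpus comes from the message above; the 30-day target and the function names are assumptions made only for illustration, and the estimate ignores fetching, indexing, and any parallelism inside a single machine.

```python
import math

SECONDS_PER_DAY = 86_400


def days_to_parse(total_pages, pages_per_second):
    """Days one parser needs at a constant parse rate."""
    return total_pages / pages_per_second / SECONDS_PER_DAY


def machines_needed(total_pages, pages_per_second, target_days):
    """Parsers required to finish within target_days (hypothetical target)."""
    return math.ceil(days_to_parse(total_pages, pages_per_second) / target_days)


single = days_to_parse(1_000_000_000, 20)
print(f"one parser: about {single:.0f} days")  # about 579 days
print("parsers for a 30-day pass:", machines_needed(1_000_000_000, 20, 30))
```

At that rate a single parser needs well over a year for one pass, which is why per-page parse cost is central to this thread.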

Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Stefan Groschupf
Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-) No, Nutch uses Java logging; only some plugins use JARs that depend on log4j. Stefan

RE: [proposal] Generic Markup Language Parser

2005-11-24 Thread Chris Mattmann
Hi Stefan and Jerome, > A mail archive is an amazing source of information, isn't it?! :-) > To answer your question, just ask yourself how many pages per second > you plan to fetch and parse and how many queries per second a Lucene > index is able to handle - and you can deliver in the UI. > I

RE: [proposal] Generic Markup Language Parser

2005-11-24 Thread Chris Mattmann
Hi Stefan, > -1! > XSL is terribly slow! You have to consider what the XSL will be used for. Our proposal suggests XSL as a means of intermediate transformation of markup content on the "backend", as Jerome suggested in his reply. This means that whenever markup content is encountered, specifica

Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Jérôme Charron
> Over the last few years, the one thing I have noticed that matters in a search > engine is minimalism. If you are honest, Stefan, take a closer look at the end of the proposal (here is a copy): Issues Create performance benchmarks and ensure that the new implementation gives at least the same performance

Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Stefan Groschupf
Jérôme, A mail archive is an amazing source of information, isn't it?! :-) To answer your question, just ask yourself how many pages per second you plan to fetch and parse and how many queries per second a Lucene index is able to handle - and you can deliver in the UI. I have here somethin

[jira] Commented: (NUTCH-120) one "bad" link on a page kills parsing

2005-11-24 Thread Earl Cahill (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-120?page=comments#action_12358466 ] Earl Cahill commented on NUTCH-120: --- I can't really explain what was happening, but for a time, many valid links would throw an exception. Then it just stopped. I think we

Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Jérôme Charron
Hi Stefan, And thanks for taking the time to read the doc and give us your feedback. > -1! > XSL is terribly slow! > XML will blow up memory and storage usage. But there is still something I don't understand... Regarding a previous discussion we had about the use of OpenSearch API to replace Servlet =>

Re: [jira] Created: (NUTCH-128) second configuration nodes overwrites first node

2005-11-24 Thread Andrzej Bialecki
Stefan Groschupf wrote: Sorry for my terrible English. Sure, I know the concept of nutch-default.xml and nutch-site.xml. I tried to say that if you set plugin.includes in nutch-site.xml at the beginning of the file and then, perhaps by mistake, a second time at the end of

Re: [jira] Created: (NUTCH-128) second configuration nodes overwrites first node

2005-11-24 Thread Stefan Groschupf
Sorry for my terrible English. Sure, I know the concept of nutch-default.xml and nutch-site.xml. I tried to say that if you set plugin.includes in nutch-site.xml at the beginning of the file and then, perhaps by mistake, a second time at the end of the same file, the last
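
The situation described in NUTCH-128 can be sketched as follows. This is a minimal Python illustration, not Nutch's actual Java configuration loader: the `nutch-conf` root element, the property values, and the `load_properties` helper are all assumptions made for the example. It shows how a naive loader lets a second definition in the same file silently overwrite the first, and how duplicates could be collected for a warning instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml fragment: the same property is defined twice.
SITE_XML = """
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|parse-html</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http</value>
  </property>
</nutch-conf>
"""


def load_properties(xml_text):
    """Return (properties, duplicates); a later definition overwrites an earlier one."""
    properties = {}
    duplicates = []
    for prop in ET.fromstring(xml_text.strip()).iter("property"):
        name = prop.findtext("name")
        if name in properties:
            duplicates.append(name)  # record the repeat so it can be reported
        properties[name] = prop.findtext("value")
    return properties, duplicates


props, dups = load_properties(SITE_XML)
print(props["plugin.includes"])  # value of the second definition
print("defined more than once:", dups)
```

Reporting the duplicate names, rather than silently keeping the last value, is one way such a mistake could be surfaced to the user.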

Re: problem with ndfs

2005-11-24 Thread Stefan Groschupf
Sounds like a problem with the hostnames of your datanodes. Check that you are able to ping all the datanodes using the hostnames they sent to the namenode. Run bin/nutch ndfs -report to see the hostnames. Stefan On 24.11.2005 at 16:04, Anton Potehin wrote: When we start namenode and
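
A quick way to run the hostname check suggested above from the client machine is sketched below. The helper name and the placeholder host list are assumptions; the real names would be the ones shown by `bin/nutch ndfs -report`. The sketch only tests name resolution, which is a weaker check than ping, since a name can resolve while the host is still unreachable.

```python
import socket


def unresolved_hosts(hostnames):
    """Return the hostnames that cannot be resolved from this machine."""
    failed = []
    for host in hostnames:
        try:
            socket.gethostbyname(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed


# Placeholder list: replace with the datanode hostnames from the ndfs report.
print("unresolved:", unresolved_hosts(["localhost"]))
```

Any name that appears in the result is one the client cannot map to an address, which would be consistent with the connection failure reported in this thread.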

Re: [jira] Created: (NUTCH-128) second configuration nodes overwrites first node

2005-11-24 Thread Andrzej Bialecki
Stefan Groschupf (JIRA) wrote: second configuration nodes overwrites first node Key: NUTCH-128 URL: http://issues.apache.org/jira/browse/NUTCH-128 Project: Nutch Type: Bug Versions: 0.7.1 Reporter: Stefan Gros

Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Stefan Groschupf
-1! XSL is terribly slow! XML will blow up memory and storage usage. Dublin Core may be good for the semantic web, but not for content storage. In general the goal must be to minimize memory usage and improve performance; such a parser would increase memory usage and definitely slow down parsin

[jira] Created: (NUTCH-128) second configuration nodes overwrites first node

2005-11-24 Thread Stefan Groschupf (JIRA)
second configuration nodes overwrites first node Key: NUTCH-128 URL: http://issues.apache.org/jira/browse/NUTCH-128 Project: Nutch Type: Bug Versions: 0.7.1 Reporter: Stefan Groschupf Priority: Trivial

problem with ndfs

2005-11-24 Thread Anton Potehin
When we start a namenode and a datanode on one host and then try to get a file from ndfs from another host, we get an error: Exception in thread "main" java.lang.NullPointerException at java.net.Socket.<init>(Socket.java:301) at java.net.Socket.<init>(Socket.java:153) at org.apache.nutch.nd

RE: Incremental crawling

2005-11-24 Thread anton
We implemented this scheme, and (to our surprise! :)) it works! But after each iteration we need to restart Tomcat; otherwise the search page shows no results at all. How can we solve this problem?

Incremental crawling

2005-11-24 Thread Anton Potehin
We came up with the following scheme for incremental crawling: 1. depth = 1, topN = big enough (for example 10) 2. clear partial indexes from the previous iteration 3. copy the global index to indexes 4. crawl a new segment 5. create an index for the new segment 6. deldup (working for