speed == scalability
Oh, damn, is that a new theory, Stefan?
No? How will you run a search engine that is supposed to scale up to a
billion pages when it can only parse 20 pages per second?
Do you have unlimited hardware resources? Let me know, I would be
interested to join your project.
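(To put rough numbers on that argument, a back-of-the-envelope sketch; the
page count and throughput are just the figures mentioned in this thread:)

// At 20 pages/s, one node needs roughly 579 days for 10^9 pages.
public class CrawlMath {
    public static void main(String[] args) {
        long pages = 1_000_000_000L; // target crawl size from the thread
        double pagesPerSec = 20.0;   // single-node parse throughput from the thread
        double days = pages / pagesPerSec / 86_400.0;
        System.out.printf("one node: %.0f days, 100 nodes: %.1f days%n",
                days, days / 100.0);
    }
}

Even with 100 machines that is nearly six days per full crawl, which is why
per-node parse speed matters for scalability.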
Yes
Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-)
No, Nutch uses Java logging; only some plugins use jars that depend
on log4j.
Stefan
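(For illustration, core code of this era obtained its logger through
java.util.logging; a minimal sketch, not copied from the Nutch source. A
plugin whose jar depends on log4j would call
org.apache.log4j.Logger.getLogger instead.)

import java.util.logging.Logger;

public class FetcherExample {
    // java.util.logging, as used by the Nutch core
    private static final Logger LOG =
        Logger.getLogger(FetcherExample.class.getName());

    public static void main(String[] args) {
        LOG.info("fetch and parse started");
    }
}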
Hi Stefan, and Jerome,
> A mail archive is an amazing source of information, isn't it?! :-)
> To answer your question, just ask yourself how many pages per second
> you plan to fetch and parse and how many queries per second a Lucene
> index is able to handle - and how many you can deliver in the UI.
> I
Hi Stefan,
> -1!
> XSL is terribly slow!
You have to consider what the XSL will be used for. Our proposal suggests
XSL as a means of intermediate transformation of markup content on the
"backend", as Jerome suggested in his reply. This means that whenever markup
content is encountered, specifica
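(For readers following along: the standard Java API such a backend
transformation step would go through is JAXP. A minimal hedged sketch of the
kind of intermediate XSL transform being discussed; the file names are
hypothetical:)

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class MarkupTransform {
    public static void main(String[] args) throws Exception {
        // Compile the stylesheet once, then apply it to incoming markup.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("markup-to-internal.xsl"));
        t.transform(new StreamSource("page.xhtml"),
                    new StreamResult("page-internal.xml"));
    }
}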
> Over the last years there is one thing I have noticed that matters in a
> search engine - minimalism.
If you are honest, Stefan, take a closer look at the end of the proposal
(here is a copy):
Issues
Create performance benchmarks and ensure that the new implementation gives
at least the same performance
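(One way to implement that benchmark item, my sketch rather than anything in
the proposal: a crude timing loop around the parse step, reporting pages per
second so the old and new parsers can be compared directly.)

public class ParseBench {
    // Stand-in for the parser or XSL transform under test (hypothetical).
    static void parsePage() {
    }

    public static void main(String[] args) {
        int n = 10_000;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            parsePage();
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d pages in %.2f s = %.1f pages/s%n",
                n, secs, n / secs);
    }
}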
Jérôme,
A mail archive is an amazing source of information, isn't it?! :-)
To answer your question, just ask yourself how many pages per second
you plan to fetch and parse and how many queries per second a Lucene
index is able to handle - and how many you can deliver in the UI.
I have here something
[ http://issues.apache.org/jira/browse/NUTCH-120?page=comments#action_12358466 ]
Earl Cahill commented on NUTCH-120:
---
I can't really explain what was happening, but for a time, many valid links
would throw an exception. Then it just stopped. I think we
Hi Stefan,
And thanks for taking the time to read the doc and give us your feedback.
-1!
> XSL is terribly slow!
> XML will blow up memory and storage usage.
But there is still something I don't understand...
Regarding a previous discussion we had about the use of the OpenSearch API to
replace the servlet =>
Stefan Groschupf wrote:
Sorry for my terrible English.
Sure, I know the concept of nutch-default.xml and nutch-site.xml.
I tried to say that in case you have a setting for plugin.includes in
nutch-site.xml at the beginning of the file and, maybe because you made a
mistake, a second time at the end of
Sorry for my terrible English.
Sure, I know the concept of nutch-default.xml and nutch-site.xml.
I tried to say that in case you have a setting for plugin.includes in
nutch-site.xml at the beginning of the file and, maybe because you made a
mistake, a second time at the end of the same file, the last
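(For what it's worth, the behaviour Stefan describes follows from how these
config files are read: property entries are loaded in document order into one
flat table, so a later duplicate silently overwrites an earlier one. A
minimal sketch of that mechanism, my reconstruction rather than the actual
NutchConf code:)

import java.io.File;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ConfigOrderDemo {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("nutch-site.xml"));
        Properties props = new Properties();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent();
            String value = p.getElementsByTagName("value").item(0).getTextContent();
            // A duplicate name later in the file overwrites the earlier value.
            props.setProperty(name, value);
        }
        System.out.println(props.getProperty("plugin.includes"));
    }
}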
Sounds like a problem with the hostnames of your datanodes.
Check that you are able to ping all the datanodes with the hostnames
they sent to the namenode.
Check:
bin/nutch ndfs -report to see the hostnames.
Stefan
On 24.11.2005 at 16:04, Anton Potehin wrote:
When we start namenode and
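(A small programmatic version of Stefan's ping check; the hostnames are
placeholders. An unresolvable datanode name is also one way to end up with
the NullPointerException in the NUTCH-128 report below.)

import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostCheck {
    public static void main(String[] args) {
        // Placeholder names: use the hostnames from "bin/nutch ndfs -report".
        String[] datanodes = { "datanode1.example.com" };
        for (String host : datanodes) {
            try {
                System.out.println(host + " -> "
                        + InetAddress.getByName(host).getHostAddress());
            } catch (UnknownHostException e) {
                System.out.println(host + " does NOT resolve");
            }
        }
    }
}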
Stefan Groschupf (JIRA) wrote:
second configuration node overwrites first node
Key: NUTCH-128
URL: http://issues.apache.org/jira/browse/NUTCH-128
Project: Nutch
Type: Bug
Versions: 0.7.1
Reporter: Stefan Groschupf
-1!
XSL is terribly slow!
XML will blow up memory and storage usage.
Dublin Core may be good for the Semantic Web, but not for content storage.
In general the goal must be to minimize memory usage and improve
performance; such a parser would increase memory usage and definitely
slow down parsing
second configuration node overwrites first node
Key: NUTCH-128
URL: http://issues.apache.org/jira/browse/NUTCH-128
Project: Nutch
Type: Bug
Versions: 0.7.1
Reporter: Stefan Groschupf
Priority: Trivial
When we start the namenode and datanode on one host and then try to get a file
from NDFS from another host, we get this error:
Exception in thread "main" java.lang.NullPointerException
        at java.net.Socket.<init>(Socket.java:301)
        at java.net.Socket.<init>(Socket.java:153)
        at org.apache.nutch.nd
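(Context on that trace, my reading rather than anything confirmed in the
report: the Socket(InetAddress, int) constructor is documented to throw
NullPointerException when the address is null, which is what you get if a
datanode address fails to resolve and is passed through unchecked. A two-line
reproduction:)

import java.net.InetAddress;
import java.net.Socket;

public class NpeDemo {
    public static void main(String[] args) throws Exception {
        // e.g. a datanode hostname that failed to resolve
        InetAddress addr = null;
        new Socket(addr, 9000); // throws java.lang.NullPointerException
    }
}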
We implemented this scheme, and (to our surprise! :)) it works!
But after each iteration we need to restart Tomcat, otherwise the search
page shows no results at all!
How can we work around this problem?
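(One likely explanation, my guess rather than anything stated in the thread:
the search webapp opens its Lucene searcher once at startup and never notices
the replaced index. A common Lucene 1.4-era fix was to reopen the searcher
whenever the on-disk index version changes, roughly like this; the class and
field names are mine:)

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherCache {
    private final String indexDir;
    private IndexSearcher searcher;
    private long version = -1;

    public SearcherCache(String indexDir) {
        this.indexDir = indexDir;
    }

    // Called before each query: reopen only if the index was replaced.
    public synchronized IndexSearcher getSearcher() throws java.io.IOException {
        long current = IndexReader.getCurrentVersion(indexDir);
        if (searcher == null || current != version) {
            if (searcher != null) {
                searcher.close();
            }
            searcher = new IndexSearcher(indexDir);
            version = current;
        }
        return searcher;
    }
}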
We thought out the following scheme for incremental crawling:
1. Depth = 1, topN = big enough (for example 10)
2. Clear partial indexes from the previous iteration
3. Copy the global index to indexes
4. Crawl a new segment
5. Create an index for the new segment
6. Dedup (working for