Thank you, Sean Dean, for your quick reply ...
Well, I am running Nutch on Ubuntu 5.01 with JDK 1.5.
There are some apps running in the background, but they don't take up
that much memory.
Secondly, I can understand the first search being slow, but the other
searches following it also take time even
It looks like you don't have enough RAM to maintain the quick speeds you were
seeing when the index was only around 3000 pages.
Nutch scales very well, but the hardware behind it must scale as well. Using quick
calculations and common sense, if your total system RAM is only 512MB and all
of that is
Thank you, Sean ..
I will do the same and let you know if the performance is not up to the
mark ..
Thanks a lot.
Thanks & Regards,
Shrinivas
Hi guys
I have my Nutch system working pretty reasonably, I think, and I am
quite happy with the way it is fetching, crawling and indexing. I do
have a problem, however, in that I cannot figure out how to make the
HTTP searches pull data from the index.
Running the searcher command[1] brings up a
Otto, Frank wrote:
Hi,
I'm new to Nutch. I have crawled my website, but how can I recrawl/refresh the index without deleting the crawl folder?
Kind regards,
Frank
Well, Google is your friend, but if you can't use it, try this link:
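One common way to refresh an existing index without deleting the crawl folder is to run another generate/fetch/updatedb cycle against the existing crawldb, then rebuild the linkdb and the index. A minimal sketch, assuming Nutch 0.8/0.9-style subcommands; the `bin/nutch` path, the `crawl/` layout, and the depth are assumptions, not taken from this thread:

```sh
# Hypothetical recrawl sketch for Nutch 0.8/0.9-style commands.
# Re-fetches into the existing crawl directory instead of deleting it.
recrawl() {
  NUTCH=${NUTCH:-bin/nutch}   # path to the nutch launcher (assumption)
  CRAWL=${CRAWL:-crawl}       # existing crawl directory (assumption)
  DEPTH=${DEPTH:-2}           # generate/fetch/update rounds (assumption)

  i=0
  while [ "$i" -lt "$DEPTH" ]; do
    # generate a new fetch list from the existing crawldb
    "$NUTCH" generate "$CRAWL/crawldb" "$CRAWL/segments"
    # the newest segment directory is the one just generated
    SEGMENT=$(ls -d "$CRAWL/segments/"* | tail -1)
    "$NUTCH" fetch "$SEGMENT"
    "$NUTCH" updatedb "$CRAWL/crawldb" "$SEGMENT"
    i=$((i + 1))
  done
  # rebuild the link database and the index over all segments
  "$NUTCH" invertlinks "$CRAWL/linkdb" "$CRAWL/segments"/*
  "$NUTCH" index "$CRAWL/indexes" "$CRAWL/crawldb" "$CRAWL/linkdb" \
    "$CRAWL/segments"/*
}
```

Note that `index` may refuse to write into an existing `indexes/` directory, so in practice you would index into a fresh directory and then swap or merge it in.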
When you run a search from Tomcat, what is written to your logs? Do you see
something like what's below, but pointing to a different path (your correct
path)?
NutchBean - opening segments in
/usr/local/nutch/build/nutch-0.9-dev/crawl/segments
NutchBean - opening linkdb in
Thanks for your answer and for the hint.
Has someone done this as a Java main class?
-Original Message-
From: Damian Florczyk [mailto:[EMAIL PROTECTED]
Sent: Friday, 29 December 2006 14:22
To: nutch-user@lucene.apache.org
Subject: Re: recrawl index
Otto, Frank
I've got 500k URLs indexed on an old 700MHz P3 clunker with only 384MB
of RAM, and my searches take sub-seconds. Something is funny here.
I've got my JVM at 64MB for this as well, so be careful, as it sounds
like you just caused the box to thrash a bit with swapping. Set the JVM
down to
Make sure you don't have any empty or bad segments. We had some
serious speed issues for a long time until we realized we had some empty
segments that had been generated as we tested. Nutch would then sit and
spin on these bad segments for a few seconds on every search. Simply
deleting the
Insurance Squared Inc. wrote:
Make sure you don't have any empty or bad segments. We had some
serious speed issues for a long time until we realized we had some
empty segments that had been generated as we tested. Nutch would then
sit and spin on these bad segments for a few seconds on
If I recall correctly, we just checked the segment directories for their
size. The bad ones had files of only 32K or something like that.
g.
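A quick way to spot such undersized segments is to compare each segment directory's total size against a threshold. A rough sketch; the directory layout and the 64KB default threshold are assumptions (the thread only says the bad ones were around 32K):

```sh
# Print segment directories whose total size is below a threshold (in KB).
# Usage: find_small_segments <segments-dir> [limit_kb]
find_small_segments() {
  segdir=$1
  limit_kb=${2:-64}   # threshold is an assumption; tune to your data
  for seg in "$segdir"/*/; do
    kb=$(du -sk "$seg" | cut -f1)      # total size of the segment in KB
    [ "$kb" -lt "$limit_kb" ] && echo "$seg"
  done
  return 0
}
```

Anything this prints is a candidate for inspection and, per the advice above, deletion before searching.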
Michael Wechner wrote:
Insurance Squared Inc. wrote:
Make sure you don't have any empty or bad segments. We had some
serious speed issues for a
Justin Hartman wrote:
Hi guys
I have my Nutch system working pretty reasonably, I think, and I am
quite happy with the way it is fetching, crawling and indexing. I do
have a problem, however, in that I cannot figure out how to make the
HTTP searches pull data from the index.
Hi Justin,
Nitin Borwankar wrote:
Justin Hartman wrote:
Hi guys
I have my Nutch system working pretty reasonably, I think, and I am
quite happy with the way it is fetching, crawling and indexing. I do
have a problem, however, in that I cannot figure out how to make the
HTTP searches pull data from the
Some of it is happening behind the scenes. A hash of the text of the
index document is created when the documents are read. The MapReduce
process uses the InputFormat static inner class to read documents into a
Map class. Here the Map class is not specified so it is an
IdentityMapper which
Hi Nitin
IIRC, the tutorial requires you to start the Tomcat instance so it knows
where your index is.
Are you starting Tomcat from the directory that has your index (the
way suggested in the tutorial)?
Or are you indicating the location of your index to the search servlet
in some other way?
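For reference, the search webapp can also be pointed at the index explicitly through the `searcher.dir` property in `nutch-site.xml` under the deployed webapp's WEB-INF/classes. The value below is only a placeholder path, not one from this thread; the property goes inside the `<configuration>` element:

```xml
<property>
  <name>searcher.dir</name>
  <value>/usr/local/nutch/crawl</value>
  <description>Crawl directory holding the index, linkdb and segments.</description>
</property>
```

With this set, Tomcat no longer has to be started from the directory that contains the index.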
Insurance Squared Inc. wrote:
If I recall correctly, we just checked the segment directories for their
size. The bad ones had files of only 32K or something like that.
Thanks. Any idea why these were being created in the first place, and
why they are not being created anymore?
Thanks
Yeah, I think it happened when we restarted either Tomcat or Apache
whilst in the middle of crawling or indexing (crawling if I had to
guess). Now we're careful to let our crawls and indexing finish before
we restart anything. Haven't had any problems since.
Michael Wechner wrote:
Insurance
Thanks guys for all your help and support with this issue. I have
managed to get it working.
Those sneaky pests at SWsoft hid the /WEB-INF/classes/ folder away
from normal viewing but on looking at the catalina.out log file
(located at /var/log/tomcat5/catalina.out) I was able to see where and
Insurance Squared Inc. wrote:
Yeah, I think it happens when we restarted either Tomcat or Apache
whilst in the middle of crawling or indexing (crawling if I had to
guess). Now we're careful to let our crawls and indexing finish before
we restart anything. Haven't had any problems since.
The JavaScript parser will often add the discovered URL as its anchor
text (see the linkdb dump below for examples). These URLs-as-anchor-text
are tokenized when indexing, and then, because anchors by default get a
hefty boost at query time, the URL found by the parse-js plugin can show
high in