Re: search performance

2006-12-29 Thread shrinivas patwardhan
Thank you, Sean Dean, for your quick reply. Well, I am running Nutch on Ubuntu 5.01 with JDK 1.5; there are some apps running in the background, but they don't take up that much memory. Secondly, I can understand about the first search, but the other searches following it also take time even

Re: search performance

2006-12-29 Thread Sean Dean
It looks like you don't have enough RAM to maintain the quick speeds you were seeing when the index was only around 3000 pages. Nutch scales very well, but the hardware behind it must scale too. Using quick calculations and common sense, if your total system RAM is only 512MB and all of that is

Re: search performance

2006-12-29 Thread shrinivas patwardhan
Thank you, Sean. I will do the same and let you know if the performance is not up to the mark. Thanks a lot. Thanks & Regards, Shrinivas

Searching via http statistical data

2006-12-29 Thread Justin Hartman
Hi guys, I have my Nutch system working pretty reasonably, I think, and I am quite happy with the way it is fetching, crawling and indexing. I do have a problem, however, in that I cannot figure out how to make the HTTP searches pull data from the index. Running the searcher command[1] brings up a

Re: recrawl index

2006-12-29 Thread Damian Florczyk
Otto, Frank wrote: hi, I'm new to Nutch. I have crawled my website. But how can I recrawl/refresh the index without deleting the crawl folder? kind regards frank Well, Google is your friend, but if you can't use it try this link:
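The link Damian points to is cut off above. As a rough, hedged sketch of what a recrawl loop looked like in that era of Nutch (the `bin/nutch` sub-commands match the 0.8-style CLI, but the paths and the segment name below are placeholders, not taken from this thread), a dry run might be:

```shell
#!/bin/sh
# Dry-run sketch of recrawling without deleting the crawl folder.
# RUN=echo only prints the commands; set RUN= to execute them for real.
RUN=echo
CRAWL=crawl

# 1. Generate a new fetch list from the existing crawldb.
$RUN bin/nutch generate "$CRAWL/crawldb" "$CRAWL/segments"

# 2. Fetch the newest segment (placeholder name; normally picked with
#    something like: ls "$CRAWL/segments" | sort | tail -1).
SEGMENT="$CRAWL/segments/20061229120000"
$RUN bin/nutch fetch "$SEGMENT"

# 3. Fold the fetched pages back into the crawldb and rebuild the
#    link database and index over all segments.
$RUN bin/nutch updatedb "$CRAWL/crawldb" "$SEGMENT"
$RUN bin/nutch invertlinks "$CRAWL/linkdb" "$CRAWL/segments"
$RUN bin/nutch index "$CRAWL/indexes" "$CRAWL/crawldb" "$CRAWL/linkdb" "$CRAWL/segments"
```

Running it as-is only prints the commands, which is a safe way to review the sequence before pointing it at a real crawl directory.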

Re: Searching via http statistical data

2006-12-29 Thread Sean Dean
When you run a search from Tomcat, what is written to your logs? Do you see something like what's below, but pointing to a different path (your correct path)? NutchBean - opening segments in /usr/local/nutch/build/nutch-0.9-dev/crawl/segments NutchBean - opening linkdb in
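If the logged paths are wrong, the usual fix in this era of Nutch is to point `searcher.dir` at the crawl directory via `nutch-site.xml` under the webapp's WEB-INF/classes. The property name is standard Nutch configuration; the path in this fragment is only an example:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <!-- Directory containing the crawl's index/, segments/ and linkdb/.
         /usr/local/nutch/crawl is an example path; adjust to yours. -->
    <value>/usr/local/nutch/crawl</value>
  </property>
</configuration>
```

Tomcat needs a restart after changing this so NutchBean reopens the index from the new location.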

Re: recrawl index

2006-12-29 Thread Otto, Frank
Thanks for your answer and for the hint. Has someone done this as a Java main class? -Original Message- From: Damian Florczyk [mailto:[EMAIL PROTECTED] Sent: Friday, 29 December 2006 14:22 To: nutch-user@lucene.apache.org Subject: Re: recrawl index Otto, Frank

Re: search performance

2006-12-29 Thread RP
I've got 500k URLs indexed on an old 700MHz P3 clunker with only 384MB of RAM, and my searches take sub-seconds. Something is funny here. I've got my JVM at 64MB for this as well, so be careful, as it sounds like you just caused the box to thrash a bit with swapping. Set the JVM down to
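RP's 64MB heap cap can be expressed as a Tomcat environment setting. `JAVA_OPTS` is the standard Tomcat knob for JVM flags; the exact values here simply mirror the 64MB figure from his message:

```shell
#!/bin/sh
# Pin the search JVM's heap so it cannot balloon and push the box into
# swap; -Xmx64m matches the 64MB figure mentioned above.
JAVA_OPTS="-Xms64m -Xmx64m"
export JAVA_OPTS
# Restart Tomcat afterwards, e.g. $CATALINA_HOME/bin/catalina.sh start
echo "Tomcat will start with: $JAVA_OPTS"
```

Setting -Xms equal to -Xmx keeps the heap fixed, which makes memory use predictable on a small box.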

Re: search performance

2006-12-29 Thread Insurance Squared Inc.
Make sure you don't have any empty or bad segments. We had some serious speed issues for a long time until we realized we had some empty segments that had been generated as we tested. Nutch would then sit and spin on these bad segments for a few seconds on every search. Simply deleting the

Re: search performance

2006-12-29 Thread Michael Wechner
Insurance Squared Inc. wrote: Make sure you don't have any empty or bad segments. We had some serious speed issues for a long time until we realized we had some empty segments that had been generated as we tested. Nutch would then sit and spin on these bad segments for a few seconds on

Re: search performance

2006-12-29 Thread Insurance Squared Inc.
If I recall correctly, we just checked the segment directories for their size on disk. The bad ones had files of only 32K or something like that. g. Michael Wechner wrote: Insurance Squared Inc. wrote: Make sure you don't have any empty or bad segments. We had some serious speed issues for a
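The size check described above can be scripted. This is a self-contained demonstration, not Nutch tooling: it builds a throwaway segments directory with one healthy-looking and one empty segment, then flags any segment directory under roughly 100 KB (the bad ones in this thread were around 32K):

```shell
#!/bin/sh
# Demo: flag suspiciously small (likely empty/bad) Nutch segments by size.
# Builds a throwaway fixture so the check can be shown end to end.
SEGMENTS=$(mktemp -d)/segments
mkdir -p "$SEGMENTS/20061229010101" "$SEGMENTS/20061229020202"
dd if=/dev/zero of="$SEGMENTS/20061229010101/content" bs=1024 count=200 2>/dev/null
: > "$SEGMENTS/20061229020202/content"   # zero-byte data: a "bad" segment

# Any segment directory under 100 KB is suspect; review before deleting.
for seg in "$SEGMENTS"/*; do
  kb=$(du -sk "$seg" | cut -f1)
  if [ "$kb" -lt 100 ]; then
    echo "suspect segment: $seg (${kb}K)"
  fi
done
```

On a real install, `SEGMENTS` would point at the crawl's segments directory instead of a fixture, and flagged directories should be inspected by hand before removal.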

Re: Searching via http statistical data

2006-12-29 Thread Nitin Borwankar
Justin Hartman wrote: Hi guys I have my nutch system working pretty reasonably I think and I am quite happy with the way it is fetching, crawling and indexing. I do have a problem however in that I can not figure out how to make the http searches pull data from the index. [] Hi Justin,

Re: Searching via http statistical data

2006-12-29 Thread Nitin Borwankar
Nitin Borwankar wrote: Justin Hartman wrote: Hi guys I have my nutch system working pretty reasonably I think and I am quite happy with the way it is fetching, crawling and indexing. I do have a problem however in that I can not figure out how to make the http searches pull data from the

Re: Need help with deleteduplicates

2006-12-29 Thread Dennis Kubes
Some of it is happening behind the scenes. A hash of the text of the index document is created when the documents are read. The MapReduce process uses the InputFormat static inner class to read documents into a Map class. Here the Map class is not specified so it is an IdentityMapper which
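The core idea Dennis describes, hashing each document's text so identical documents collapse onto the same key, can be illustrated outside Hadoop with a tiny shell sketch. The file names and contents here are made up, and `cksum` stands in for the real MD5 signature:

```shell
#!/bin/sh
# Toy illustration of dedup-by-content-hash: documents with identical
# text produce identical hashes, so grouping by hash exposes duplicates.
DIR=$(mktemp -d)
printf 'the quick brown fox' > "$DIR/doc1.txt"
printf 'the quick brown fox' > "$DIR/doc2.txt"   # exact duplicate of doc1
printf 'something else'      > "$DIR/doc3.txt"

# Emit "hash filename" pairs, sort by hash, and keep only the first
# file seen for each hash -- the map/group/reduce shape in miniature.
for f in "$DIR"/*.txt; do
  printf '%s %s\n' "$(cksum < "$f" | cut -d' ' -f1)" "$f"
done | sort | awk '!seen[$1]++ { print $2 }'
```

In the real job the grouping is done by MapReduce shuffling on the hash key rather than by `sort`, but the effect of dropping same-content documents is the same.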

Re: Searching via http statistical data

2006-12-29 Thread Justin Hartman
Hi Nitin IIRC, the tutorial requires you to start the tomcat instance so it knows where your index is. Are you starting tomcat from the directory that has your index (the suggested way in the tutorial) ? Or are you indicating to the search servlet the location of your index in some other way?

Re: search performance

2006-12-29 Thread Michael Wechner
Insurance Squared Inc. wrote: If I recall correctly, we just checked the segment directories for their size on disk. The bad ones had files of only 32K or something like that. thanks. Any idea why these were being created in the first place, and why they are not being created anymore? Thanks

Re: search performance

2006-12-29 Thread Insurance Squared Inc.
Yeah, I think it happened when we restarted either Tomcat or Apache while in the middle of crawling or indexing (crawling, if I had to guess). Now we're careful to let our crawls and indexing finish before we restart anything. Haven't had any problems since. Michael Wechner wrote: Insurance

(SOLVED) Searching via http statistical data

2006-12-29 Thread Justin Hartman
Thanks, guys, for all your help and support with this issue. I have managed to get it working. Those sneaky pests at SWsoft hid the /WEB-INF/classes/ folder away from normal viewing, but on looking at the catalina.out log file (located at /var/log/tomcat5/catalina.out) I was able to see where and

Re: search performance

2006-12-29 Thread Michael Wechner
Insurance Squared Inc. wrote: Yeah, I think it happens when we restarted either Tomcat or Apache whilst in the middle of crawling or indexing (crawling if I had to guess). Now we're careful to let our crawls and indexing finish before we restart anything. Haven't had any problems since.

parse-js as a HtmlParseFilter

2006-12-29 Thread Michael Stack
The javascript parser will often add the discovered URL as its anchor text (See below linkdb dump for examples). These urls-as-anchor text are tokenized when indexing and then, because anchors by default get a hefty boost at query time, the URL-found-by-the-parse-js-plugin can show high in
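If those boosted URL-as-anchor hits are unwanted, one workaround is to leave parse-js out of `plugin.includes` in `nutch-site.xml` and re-index. The property name is standard Nutch configuration, but the plugin list below is only an illustrative default-style value, not taken from this thread:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- A default-style plugin list with parse-js removed from the
         parse-(...) group; adjust to the plugins you actually use. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>
</configuration>
```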