Re: how can i go deep?

2006-03-05 Thread Steven Yelton
Yes! I have abandoned the 'crawl' command for even my single site searches. I wrote shell scripts that accomplish (generally) the same tasks the crawl does. The only piece I had to watch out for is: one of the first things the 'crawl' class does is load 'crawl-tool.xml'. So to get the

Unable to complete fetch

2006-03-05 Thread Gal Nitzan
Hi, 1. I have disabled speculative tasks by setting it to false in hadoop-site.xml 2. Now I notice that the fetcher does not complete the whole fetchlist. 3. By adding additional logging info in generate I see 2 links being generated, but the fetcher, without any indication of an error, just

Re: project vitality?

2006-03-05 Thread Byron Miller
I like to think of it as a framework. Building blocks to build what you ultimately need. If you're after the one-stop shop, plug and play, no development necessary, then perhaps some other commercial systems may be your best bet. The mailing list is very active; most people get responses fairly quickly.

Re: language-identifier and language filter

2006-03-05 Thread Byron Miller
Make sure you have language-identifier enabled in your web deployment as well. WEB-INF/classes/nutch-site.xml or nutch-default.xml and restart your app server. -byron --- Teruhiko Kurosaka [EMAIL PROTECTED] wrote: Hello, I enabled language-identifier plugin and indexed some documents. But

Re: [Nutch-general] Re: project vitality?

2006-03-05 Thread Greg Boulter
Hi, I think that this is my first post. I follow the mailing list and read as many of the emails as I can. I'm going to make a few proposals. I have obtained some money to spend on them. I use and get paid for my nutch expertise. I have some experience. I don't just speak for myself but also for

RE: project vitality?

2006-03-05 Thread David Wallace
Hello all, I think Nutch is a fantastic product. I used 0.6 initially, then 0.7. My 0.7 installation is in production, and mostly works really well. I haven't made the move to 0.8 yet, because the direction that Nutch has gone for 0.8 is quite different from what my organisation requires from

Re: project vitality?

2006-03-05 Thread Chris Lamprecht
I think of the Nutch project as a marathon, not a sprint. Nutch's stated goals include: * Scale to entire web - pages on millions of different servers - billions of pages * Support high traffic - thousands of searches per second * State-of-the-art search quality (see

Re: [Nutch-general] Re: project vitality?

2006-03-05 Thread Greg Boulter
Hello again. OK - first of all I hate mailing lists. I don't consider them to be a valid form of communication for anything but the people doing the coding and don't really consider them of much use at all unless there is no other alternative. Except one - and that is when there needs to be

Re: Normal search speeds

2006-03-05 Thread Stefan Groschupf
This is very slow! You can expect results in less than a second from my experience. + check memory settings of tomcat. + you do not use ndfs, right? On 06.03.2006 at 00:23, Insurance Squared Inc. wrote: Asking again for the patience of the list, we're still working on speed. I guess what I

going deeper, lost segment

2006-03-05 Thread Richard Braman
In my hacking, I inadvertently lost a segment. What happened? How in the heck did I manage to do something so stupid? Well, when I started another round of fetching I did a generate command and specified /crawldb/segements/* as the segment, and it made my new segment under the directory of the

Re: Normal search speeds

2006-03-05 Thread Insurance Squared Inc.
That's correct, we're not using ndfs. As far as I know it's an out of the box installation of Mandrake 2006, tomcat, and nutch. Byron's suggestion of merging to one index cut speeds by about 1/3 or 1/2. I think we've already looked at the tomcat memory settings but I'll ask our developer to

Re: NullPointerException

2006-03-05 Thread Stefan Groschupf
Hi, http or www are very good test queries. Double-check that the nutch-default.xml inside the nutch.war points to the correct folder in <name>searcher.dir</name>. Stefan On 06.03.2006 at 02:31, Hasan Diwan wrote: I've followed the nutch tutorial for crawling and started tomcat from the

Re: NullPointerException

2006-03-05 Thread Stefan Groschupf
If none are being fetched, something is definitely wrong with your filter or url file. Yes, since it is a blog it may have dynamic pages like foo.com?entry=23, which are definitely filtered by default. - blog: http://www.find23.org company:
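
To illustrate why a URL such as foo.com?entry=23 can be dropped before it is ever fetched, here is a minimal Python sketch of a first-match-wins regex filter. It is not Nutch code: the rule list is an assumption modelled on the kind of "skip dynamic-looking URLs" rule described above, and the helper name `accepts` is hypothetical.

```python
import re

# Hypothetical rule list: '-' rejects, '+' accepts, first matching rule wins.
RULES = [
    ("-", re.compile(r"[?*!@=]")),  # assumed rule: skip URLs that look dynamic
    ("+", re.compile(r".")),        # accept anything else that is non-empty
]

def accepts(url: str) -> bool:
    """Return True if the first rule matching the URL is an accept rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched at all: reject

print(accepts("http://foo.com?entry=23"))    # rejected by the first rule
print(accepts("http://foo.com/about.html"))  # falls through to the accept rule
```

If a blog's entry pages all carry query strings, a rule like the first one would have to be relaxed for them to be crawled.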

Re: NullPointerException

2006-03-05 Thread Hasan Diwan
Gentlemen: On 05/03/06, Richard Braman [EMAIL PROTECTED] wrote: This sounds like your crawl didn't get anything. I have seen that happen when the url wasn't added right, or the filter was bad. Pipe the crawl to crawl.log and look in there. It should show some pages being fetched. If none

nutch 0.7.0 search performance measurement

2006-03-05 Thread Stefan Groschupf
Hi, for people who found that interesting, I have published some measurement values I took a long time ago. http://www.find23.net/Web-Site/blog/A712F01B-4BB1-4FC6-AE95-E64988FBCC79.html All time related values are in milliseconds. Don't take the values too seriously, however at least they

RE: NullPointerException

2006-03-05 Thread Richard Braman
It did fetch some urls: -Original Message- From: Jack Tang [mailto:[EMAIL PROTECTED] Sent: Sunday, March 05, 2006 9:35 PM To: nutch-user@lucene.apache.org Subject: Re: NullPointerException Hey Hasan Crawling seems ok. Can you pls try org.apache.nutch.searcher.NutchBean

find duplicate urls in webdb

2006-03-05 Thread Elwin
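When I read pages out of a webdb and printed out the url of each page, I found two urls that are exactly the same. Is it possible for two pages to have the same url? -- "The Final Combat" (《盖世豪侠》) won rave reviews and kept TVB's ratings high, yet TVB, pleased as it was, still did not give him major roles. Stephen Chow was no ordinary talent; once his gift for comedy had shown itself, he would not accept being sidelined, so he moved to the film industry to show his talent on the big screen. TVB gained a prize horse and then lost it, and of course regretted it too late.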
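
One way to confirm which urls repeat is to count them after printing them out. The Python below is only a sketch under the assumption that the urls have already been dumped one per line; the function name `duplicate_urls` and the sample list are hypothetical, and nothing here reads a webdb directly.

```python
from collections import Counter

def duplicate_urls(urls):
    """Return {url: count} for every url that occurs more than once."""
    # Strip whitespace and ignore blank lines before counting.
    counts = Counter(u.strip() for u in urls if u.strip())
    return {u: n for u, n in counts.items() if n > 1}

# Hypothetical dump: one url per line, as printed while reading pages.
dump = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/a",
]
print(duplicate_urls(dump))
```

Comparing the printed strings this way only shows that the urls are textually identical; it says nothing about why two page entries exist.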

Re: NullPointerException

2006-03-05 Thread Hasan Diwan
Mr Tang: Crawling seems ok. Can you pls try org.apache.nutch.searcher.NutchBean [your-query-string] in shell/cmd? server: 7:20pm % ./bin/nutch org.apache.nutch.searcher.NutchBean hasan 060305 192042 10 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml 060305 192042 10 parsing

RE: [Nutch-general] Re: project vitality?

2006-03-05 Thread Richard Braman
I'll take part in your forum. Just added first post. -Original Message- From: Greg Boulter [mailto:[EMAIL PROTECTED] Sent: Sunday, March 05, 2006 6:33 PM To: nutch-user@lucene.apache.org Subject: Re: [Nutch-general] Re: project vitality? Hello again. OK - first of all I hate mailing

Re: NullPointerException

2006-03-05 Thread Hasan Diwan
Mr Tang: On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote: Weird! You are running nutch on local file system or distributed file system? Local file system And can you find the same query hasan via luke? Nope -- Cheers, Hasan Diwan [EMAIL PROTECTED]

Re: NullPointerException

2006-03-05 Thread Jack Tang
On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote: Mr Tang: On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote: Weird! You are running nutch on local file system or distributed file system? Local file system And can you find the same query hasan via luke? Nope ok. As Stefan said, can you get

RE: how can i go deep?

2006-03-05 Thread Richard Braman
Steven, could you share those shell scripts? -Original Message- From: Steven Yelton [mailto:[EMAIL PROTECTED] Sent: Sunday, March 05, 2006 10:22 AM To: nutch-user@lucene.apache.org Subject: Re: how can i go deep? Yes! I have abandoned the 'crawl' command for even my single site

Re: NullPointerException

2006-03-05 Thread Jack Tang
On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote: On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote: ok. As Stefan said, can you get any hit when you try to search http or www? No Hey, can you zip the index and send it to me directly? -- Cheers, Hasan Diwan [EMAIL PROTECTED] -- Keep

Re: NullPointerException

2006-03-05 Thread Jack Tang
Hasan, it seems your index is not complete. A whole (correct) index dir should include 1. segments file 2. deletable file 3. other files. I am not sure what's wrong in nutch-0.7.1 indexing, but is it possible to upgrade to nutch 0.8 (svn version) now? /Jack On 3/6/06, Jack
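
The checklist above can be turned into a quick sanity check. This Python sketch assumes the two file names from the message ("segments" and "deletable") are the ones to look for; the function name `missing_index_files` and the example path are hypothetical, and the check does not validate the index contents in any way.

```python
from pathlib import Path

# File names taken from the checklist in the message above.
REQUIRED = ("segments", "deletable")

def missing_index_files(index_dir):
    """Return the required file names that are absent from index_dir."""
    d = Path(index_dir)
    return [name for name in REQUIRED if not (d / name).is_file()]

if __name__ == "__main__":
    # Hypothetical location; point this at your own index directory.
    print(missing_index_files("crawl/index"))
```

An empty list means both files are present; any names returned are the ones that were not found.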

Re: NullPointerException

2006-03-05 Thread Hasan Diwan
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote: I am not sure what's wrong in nutch-0.7.1 indexing, but now it is possible to upgrade to nutch 0.8(svn version)? It is possible, but I was under the assumption that 0.8 required NDFS? -- Cheers, Hasan Diwan [EMAIL PROTECTED]

Re: NullPointerException

2006-03-05 Thread Hasan Diwan
On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote: You can still build it on local file system:) Build, yes, but what of deployment? Can I use it in the same way? At present, I don't have enough resources to run a distributed crawl. -- Cheers, Hasan Diwan [EMAIL PROTECTED]

Re: NullPointerException

2006-03-05 Thread Jack Tang
On 3/6/06, Hasan Diwan [EMAIL PROTECTED] wrote: On 05/03/06, Jack Tang [EMAIL PROTECTED] wrote: You can still build it on local file system:) Build, yes, but what of deployment? Can I use it in the same way? Of course yes. At present, I don't have enough resources to run a distributed

Re: Normal search speeds

2006-03-05 Thread Howie Wang
If you want to narrow down whether it's a Tomcat issue, maybe you could try running Nutch on another app server like Resin to see if there's a difference. It's been a while since I used Tomcat, but I did find the performance to be kind of slow. I think things are supposed to be better now, but

Re: NullPointerException

2006-03-05 Thread Hasan Diwan
Right then.. compiled the svn version of nutch. Tried running the crawl with it and this is the log: server: 11:32pm % ./bin/nutch crawl ../SpectraSearch/urls -dir ../SpectraSearch/crawl -depth 2 -threads 20 060305 233255 parsing