Re: Search Results?

2006-01-30 Thread Saravanaraj Duraisamy
Yeah, copy the segments folder to the 'bin' directory of tomcat and it will work. On 1/31/06, Raghavendra Prabhu <[EMAIL PROTECTED]> wrote: > > Then you have to copy the segments folder to the tomcat directory > > When I tried it out, I had to copy it into the directory above webapps > > > Do copy the

Re: benchmark and performance

2006-01-30 Thread Raghavendra Prabhu
Hi, Thanks. I got the explanation. So with mapreduce we will be able to process the crawldb efficiently. Rgds Prabhu On 1/31/06, Byron Miller <[EMAIL PROTECTED]> wrote: > > Prabhu, > > For nutch .7x the upper limit of webdb isn't > necessarily file size but hardware/computation size. > You basicall

Use Nutch to collect web statistics

2006-01-30 Thread Meryl Silverburgh
Hi, Is it possible to use Nutch to collect web statistics like the ones Google collected about web HTML/CSS usage: http://code.google.com/webstats/index.html If not, is there a better way to do it? Thank you.

Recovering from Socket closed

2006-01-30 Thread Chris Schneider
Dear Nutch Users, We've been using Nutch 0.8 (MapReduce) to perform some internet crawling. Things seemed to be going well on our 11 machines (1 master with JobTracker/NameNode, 10 slaves with TaskTrackers/DataNodes) until... 060129 222409 Lost tracker 'tracker_56288' 060129 222409 Task 'tas
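
(A minimal sketch, assuming the usual 0.8 MapReduce setup: on a cluster like the one described above, conf/nutch-site.xml on every node typically points the filesystem and job tracker at the master. The hostnames and ports below are placeholders, not values from this thread.)

  <!-- hypothetical conf/nutch-site.xml overrides for one master, N slaves -->
  <property>
    <name>fs.default.name</name>        <!-- NameNode runs on the master -->
    <value>master-host:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>     <!-- JobTracker runs on the master -->
    <value>master-host:9001</value>
  </property>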

Re: Problems with MapRed-

2006-01-30 Thread Mike Smith
I have a huge disk too and the /tmp folder was fine, with almost 200G free space on that partition, but it still fails. I am going to do the same and look for the bad URL that causes the problem. But how come Nutch is sensitive to a particular URL and fails!? It might be because of the parser plugins

Nutch nightly build-crawl-search

2006-01-30 Thread Andy Morris
Okay, I think I have done a good crawl and when I do a stats command I get this... [EMAIL PROTECTED] nutch-nightly]# bin/nutch readdb crawl/crawldb -stats 060130 175317 CrawlDb statistics start: crawl/crawldb 060130 175317 parsing file:/nutch_binaries/nutch-nightly/conf/nutch-default.xml 060130
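
(A hedged aside: besides -stats, the crawldb reader in recent nightlies can usually also dump the database or look up a single URL; exact options differ between builds, so run bin/nutch readdb with no arguments to see the usage message of your build. Paths below are placeholders.)

  bin/nutch readdb crawl/crawldb -stats                     # summary counts by status
  bin/nutch readdb crawl/crawldb -dump crawldb_dump         # dump all entries as text
  bin/nutch readdb crawl/crawldb -url http://example.com/   # show a single entry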

Re: Problems with MapRed-

2006-01-30 Thread Rafit Izhak_Ratzin
Hi, I don't think it's a problem of disk capacity, since I am working on a huge disk and only 10% is used. What I decided to do was split the seed into two parts and see if I still get this problem; one half ended successfully but the second had the same problem, so I continued with the splitting,
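
(A minimal sketch of one way to split a seed list in half with standard tools, so each part can be injected and fetched separately; the file names are placeholders, not taken from the post.)

  lines=$(wc -l < urls/seeds.txt)
  split -l $(( (lines + 1) / 2 )) urls/seeds.txt seeds_part_
  # produces seeds_part_aa and seeds_part_ab, each with roughly half the URLs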

Re: benchmark and performance

2006-01-30 Thread Byron Miller
Prabhu, For nutch .7x the upper limit of webdb isn't necessarily file size but hardware/computation size. You basically need 210% of your webdb size to do any processing of it, so if you have 100 million urls and a 1.5 terabyte webdb you need (on the same server) 3.7 terabytes of disk space to
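
(A rough back-of-the-envelope sketch of that 210% rule of thumb, purely illustrative; the webdb size is an assumed figure.)

  webdb_gb=1500                                   # e.g. a 1.5 terabyte webdb
  needed_gb=$(( webdb_gb * 210 / 100 ))
  echo "working space needed: ~${needed_gb} GB"   # ~3150 GB of working space for this example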

Re: Configuring for multiple sites indexing

2006-01-30 Thread Raghavendra Prabhu
If I am not mistaken, you can also enter it like this: +^http://lucene.apache.org/nutch/ Any outgoing link which meets the above condition will work. On 1/31/06, Lakshman, Madhusudhan <[EMAIL PROTECTED]> wrote: > > Hi, > > > > I am trying to configure for multiple site indexing using intr
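
(A minimal sketch of what the seed file and filter entries could look like for an intranet crawl limited to two sites; the second site is a placeholder because the original question is truncated, and the filter file's shipped skip rules, e.g. the image-suffix and final "-." lines, are left as they are.)

  urls/seeds.txt:
    http://lucene.apache.org/nutch/
    http://www.example.com/

  conf/crawl-urlfilter.txt (accept lines, replacing the default MY.DOMAIN.NAME line):
    +^http://lucene.apache.org/nutch/
    +^http://([a-z0-9]*\.)*example\.com/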

Re: benchmark and performance

2006-01-30 Thread Raghavendra Prabhu
Hi Stefan, Thanks for your mail. What I would like to know is (since I am using nutch-0.7), what is the upper limit on the webdb size, if any such limit exists in nutch-0.7? Will the generate for a webdb formed from one TB of data (just an example) work? And what is the difference between webdb

Re: Search Results?

2006-01-30 Thread Raghavendra Prabhu
Then you have to copy the segments folder to the tomcat directory. When I tried it out, I had to copy it into the directory above webapps. Do copy the segments folder to the parent directory of webapps. On 1/30/06, Sameer Tamsekar <[EMAIL PROTECTED]> wrote: > > yes, i do have 0.7.1 version but proble

Re: searcher memory question

2006-01-30 Thread Piotr Kosiorowski
On Unix you can delete the file, but its contents will not be removed until the application that is keeping the file open closes it. Just look at the disk usage change after deletion and after shutting down the searchers. Regards Piotr Sunnyvale Fl wrote: My question was buried in another thread so I
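
(A small demonstration of that Unix behaviour, assuming lsof is installed; paths and sizes are placeholders.)

  dd if=/dev/zero of=/tmp/big.seg bs=1M count=100   # create a 100 MB dummy file
  tail -f /tmp/big.seg &                            # a process keeps it open
  rm /tmp/big.seg                                   # delete it while still open
  df -h /tmp                                        # the space is NOT freed yet
  lsof +L1 | grep big.seg                           # lists the open but unlinked file
  kill %1                                           # stop the process holding it
  df -h /tmp                                        # now the space is released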

Configuring for multiple sites indexing

2006-01-30 Thread Lakshman, Madhusudhan
Hi, I am trying to configure multiple site indexing using intranet crawling. I need help on how to keep the entries in the "urls" flat file and the crawl-urlfilter.txt file. For example, I want to configure the 2 URLs mentioned below: 1. http://lucene.apache.org/nutch/ 2. http

searcher memory question

2006-01-30 Thread Sunnyvale Fl
My question was buried in another thread so I'll pull it out here. Can someone help clarify? Does the searcher load everything into memory, including segments, on startup? Because it seems that if I delete segments or replace them while the searcher is running, it doesn't affect the search result

RE: Searching

2006-01-30 Thread Vanderdray, Jacob
Thanks Stefan. It turns out my problem/confusion was because I was using fields="[my_field_name]" instead of fields="DEFAULT" in the plugin.xml definition of my query filter. If I understand it correctly, that was causing my filter to only get used when I did a search for "[my_field_name]:
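
(For readers hitting the same thing, a hedged sketch of where that parameter lives; the plugin id, class name and extension id are made-up placeholders, and the runtime/requires sections of plugin.xml are omitted. Only the fields="DEFAULT" parameter reflects the fix described above.)

  <extension id="org.example.myfilter"
             name="My Query Filter"
             point="org.apache.nutch.searcher.QueryFilter">
    <implementation id="MyQueryFilter"
                    class="org.example.nutch.MyQueryFilter">
      <!-- DEFAULT: apply the filter to plain query terms,
           not only to "my_field_name:term" queries -->
      <parameter name="fields" value="DEFAULT"/>
    </implementation>
  </extension>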

Re: Search Results?

2006-01-30 Thread Sameer Tamsekar
Yes, I do have the 0.7.1 version but the problem persists.

Differences between intranet crawl and whole-web crawl

2006-01-30 Thread 盖世豪侠
Is there any difference between the two situations: 1) use several entry urls in a flat file and url patterns in crawl-urlfilter.txt when doing an intranet crawl 2) inject only a few urls and use url patterns in regex-urlfilter.txt when doing a whole-web crawl -- 《盖世豪侠》(The Final Combat) drew rave reviews and kept TVB's ratings high; even so, TVB still
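
(A hedged sketch of the two workflows being compared, with 0.8-style commands; directory names are placeholders and exact arguments can differ between nightly builds. The one-shot crawl command applies conf/crawl-urlfilter.txt, while the step-by-step tools apply conf/regex-urlfilter.txt.)

  # 1) intranet crawl: seed urls in a flat file, one command drives the whole cycle
  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

  # 2) whole-web crawl: inject a few seeds, then run the cycle by hand
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  s=$(ls -d crawl/segments/* | tail -1)
  bin/nutch fetch "$s"
  bin/nutch updatedb crawl/crawldb "$s"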

Re: puzzle about regx ofurl pattern

2006-01-30 Thread 盖世豪侠
Thank you. On 06-1-30, Steve Betts <[EMAIL PROTECTED]> wrote: > > Actually, the ^ means start of line. This character is used as a negative > indicator only within the context of sets, e.g., [^0-9]. > > Thanks, > > Steve Betts > [EMAIL PROTECTED] > 937-477-1797 > > -Original Message- > From: 盖世豪侠

Re: puzzle about regx ofurl pattern

2006-01-30 Thread Gal Nitzan
^ is a negation in character classes, if I remember correctly; however, in this regex it means the beginning of the line, just like $ means the end of the line (input). On Mon, 2006-01-30 at 21:57 +0800, 盖世豪侠 wrote: > +^http://([a-z0-9]*\.)*pilat.free.fr/ > As far as I know, '^' means matching the characters not within

RE: puzzle about regx ofurl pattern

2006-01-30 Thread Steve Betts
Actually, the ^ means start of line. This character is used as a negative indicator only within the context of sets, e.g., [^0-9]. Thanks, Steve Betts [EMAIL PROTECTED] 937-477-1797 -Original Message- From: 盖世豪侠 [mailto:[EMAIL PROTECTED] Sent: Monday, January 30, 2006 8:58 AM To: nutch
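
(A quick illustration of both meanings using grep, nothing Nutch-specific:)

  printf 'http://pilat.free.fr/\nftp://example.com/\n' | grep -E '^http'   # ^ anchors: matches only the first line
  printf 'abc\n123\n' | grep -E '^[^0-9]'                                  # [^0-9] negates: matches "abc", whose first char is not a digit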

RE: download/mirror

2006-01-30 Thread Vanderdray, Jacob
If you just want to save a copy of the site locally, you'd probably be better off using wget. Jake. -Original Message- From: Stefan Groschupf [mailto:[EMAIL PROTECTED] Sent: Sunday, January 29, 2006 7:21 PM To: nutch-user@lucene.apache.org Subject: Re: download/mirror Nutch al
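
(For example, a typical wget invocation for mirroring a site locally; the URL is a placeholder.)

  wget --mirror --convert-links --page-requisites --no-parent http://www.example.com/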

Re: Search setup

2006-01-30 Thread Andrzej Bialecki
Dominik Friedrich wrote: in the searcher log it clearly says it opens the linkdb for some reason: Well, I haven't looked at the searcher code yet so I might be wrong and the linkdb is used for searching. Maybe somebody can explain what data is used for searching. NutchBean uses linkDB to r

puzzle about regx ofurl pattern

2006-01-30 Thread 盖世豪侠
+^http://([a-z0-9]*\.)*pilat.free.fr/ As far as I know, '^' means matching the characters not within a range by *complementing* the set, so why is it an accepted pattern for crawl urls? The same goes for -^(file|ftp|mailto). Any differences? -- 《盖世豪侠》(The Final Combat) drew rave reviews and kept TVB's ratings high, yet TVB still did not make much use of him. Stephen Chow (周星驰) was hardly one to stay in a small pond; with his comic talent now showing, naturally

Re: Search setup

2006-01-30 Thread Dominik Friedrich
Gal Nitzan wrote: 1. fetch/parse - ndfs 2. decide how many segments (datasize?) you want on each searcher machine 3. invert, index, dedup, merge selected indexes to some ndfs folder 4. copy the ndfs folder to the searcher machine Sounds ok to me. When updating the linkdb you should also include p
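
(A hedged sketch of steps 3 and 4 with 0.8-style commands; paths are placeholders and exact arguments may differ between nightly builds, so check each tool's usage message.)

  bin/nutch invertlinks crawl/linkdb crawl/segments                          # step 3: build the linkdb
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
  bin/nutch dedup crawl/indexes
  bin/nutch merge crawl/index crawl/indexes
  # step 4: copy the finished data out of NDFS to the searcher box;
  # the exact ndfs shell option (-get or -copyToLocal) depends on the build
  bin/nutch ndfs -get /user/nutch/crawl /local/search/crawl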

Re: Search setup

2006-01-30 Thread Gal Nitzan
Dominik, thank you so much for your answers, you have been very helpful. Just one more :) if I understand correctly... the way to go about the whole process: 1. fetch/parse - ndfs 2. decide how many segments (datasize?) you want on each searcher machine 3. invert, index, dedup, merge selected inde

Nutch irc channel

2006-01-30 Thread Dominik Friedrich
I looked through the mailing list archive for a Nutch IRC channel and searched with Google but didn't find any. So I opened #nutch on EFnet and will be there while I'm online. If there is already a Nutch IRC channel, please let me know where. best regards, Dominik

Re: distributed computing doubt

2006-01-30 Thread Dominik Friedrich
Yes, this is correct. For each job a new job.xml is generated. best regards, Dominik Raghavendra Prabhu wrote: Hi, Thanks for helping me out. But I have some additional doubts. So essentially while running a nutch indexing process, does the jobtracker distribute this file job.xml (after gene

Re: Search setup

2006-01-30 Thread Dominik Friedrich
Gal Nitzan wrote: I have copied only the segments directory but the searcher returns 0 hits. You have to put the index and segments dir into a directory named "crawl" and start tomcat from the directory that contains crawl. The nutch.war file contains a nutch-default.xml with searcher.

Re: Search setup

2006-01-30 Thread Gal Nitzan
Thanks for your reply. I have copied only the segments directory but the searcher returns 0 hits. Do I need to copy the linkdb and the index folders as well? Thanks. On Sun, 2006-01-29 at 23:18 +0100, Dominik Friedrich wrote: > Gal Nitzan wrote: > > 1. If NDFS is too slow and all data must be

Re: benchmark and performance

2006-01-30 Thread Stefan Groschupf
You can already use ndfs in 0.7; however, if the webdb is too large it takes too much time to generate segments. So the problem is the webdb size, not the hdd limit. On 30.01.2006 at 07:31, Raghavendra Prabhu wrote: Hi Stefan, So can I assume that hard disk space is the only constraint in

Re: Search Results?

2006-01-30 Thread Raghavendra Prabhu
Sameer, Can you tell me which version of nutch you have? If you are using nutch-0.8, the search index has to be in a folder called crawl under the tomcat directory (assuming that you are using tomcat), e.g. /usr/bin/tomcat/crawl. The crawl folder should have the index and segments. If you are using nutch-0.7, copy
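
(Putting the above together, a minimal sketch for an 0.8-style nightly; the tomcat path is a placeholder. The point is that the crawl folder, with index, segments and linkdb inside, sits in the directory tomcat is started from.)

  mkdir -p /usr/local/tomcat/crawl
  cp -r crawl/index crawl/segments crawl/linkdb /usr/local/tomcat/crawl/
  cd /usr/local/tomcat && ./bin/startup.sh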

Search Results?

2006-01-30 Thread Sameer Tamsekar
Hi all, I crawled one website successfully using whole-web crawling, but when I try to search the index I don't get any results. When I perform the same search using Luke (on the same index), I get some results. Can somebody help me tackle the problem? Regards, Sameer Tamsekar

Re: distributed computing doubt

2006-01-30 Thread Raghavendra Prabhu
Hi, Thanks for helping me out. But I have some additional doubts. So essentially while running a nutch indexing process, does the jobtracker distribute this file job.xml (after generating it) to the clients (task trackers)? And if I run a new indexing process with the mapred-default.xml, will t

Re: distributed computing doubt

2006-01-30 Thread Dominik Friedrich
When the tasktracker starts a task it reads the config files in the following order: nutch-default.xml, mapred-default.xml, job.xml, nutch-site.xml. Except for job.xml, all files are the local ones in the tasktracker's conf directory. The job.xml is generated for each job by the tool you use, e.g. G