Fetching/Indexing process is taking a lot of time

2012-03-17 Thread George
Hello, I'm using a default installation of Nutch 9.0 on a single machine: 2x2.5 quad core, 16 GB RAM, 6 x 1 TB SATA in RAID 1, 1 Gbps network. I'm not using any distributed file system. Of course I have it configured: all headers, threads: 100. I'm trying to crawl 3 URLs with generate per site -1, fetching with:

Re: nutch crawling file system SOLVED

2012-03-17 Thread alessio crisantemi
This is the result after crawling with Nutch and indexing in Solr: <doc> <float name="boost">0.298293</float> <str name="content">Index of C:\Documents and Settings\Alessio\Documenti Index of C:\Documents and Settings\Alessio\Documenti ../ - - - 003_C_001_Alessio_2004_08_13.dvf Tue, 17 Aug 2004 20:09:52

Re: Fetching/Indexing process is taking a lot of time

2012-03-17 Thread Mathijs Homminga
Hi, your hardware looks okay. Moving data from 30,000 URLs takes a week at 500 kB/s? That would mean ~10 MB per URL. Could that be right? Anyway, can you tell us at what stage your crawl script is when this kicks in? Mathijs On 17 Mar 2012, at 07:40, George wrote: Hello I'm using nutch
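The estimate in Mathijs's reply can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes "500kb/s" means 500 kilobytes per second sustained for a full seven days, and that "Mb" means decimal megabytes; the function name is made up for illustration.

```python
# Back-of-the-envelope check of the "~10 MB per URL" figure.
# Assumptions (not stated in the thread): 500 kB/s is a sustained rate in
# kilobytes per second, the crawl ran for exactly 7 days, 1 MB = 1000 kB.

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def avg_size_per_url_mb(rate_kb_per_s, duration_s, url_count):
    """Average data moved per URL, in MB, for a constant transfer rate."""
    total_kb = rate_kb_per_s * duration_s
    return total_kb / url_count / 1000


avg_mb = avg_size_per_url_mb(500, SECONDS_PER_WEEK, 30_000)
print(f"{avg_mb:.2f} MB per URL")
```

If the rate were instead 500 kilobits per second, the result would be eight times smaller, which is why the units matter when judging whether the figure is plausible.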

Re: nutch crawling file system SOLVED

2012-03-17 Thread Lewis John Mcgibbney
Hi Alessio, On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi alessio.crisant...@gmail.com wrote: suggestions? For what?

Re: nutch crawling file system SOLVED

2012-03-17 Thread alessio crisantemi
I would like the result of my search to be the text of my PDF file, not the list of documents in the directory and their path. On 17 March 2012 at 21:11, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Alessio, On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi

Re: Fetching/Indexing process is taking a lot of time

2012-03-17 Thread George
No. For example, if I run with depth 3, it fetches data to the Hadoop temporary directory, then moves the data to a new segment, and repeats this cycle 3 times. All data is fetched to hadoop-root (the temporary Hadoop directory), and then Nutch moves this data to the segment dir in the segment folder. And for example