Hello
I'm using Nutch 0.9, default installation, on a single machine:
2 x 2.5 GHz quad core
16 GB RAM
6 x 1 TB SATA, RAID 1
Network: 1 Gbps
Not using any distributed file system.
Of course I have it configured:
All headers
Threads: 100
Trying to crawl 3 URLs with a per-site generate limit of -1 (unlimited).
fetching with :
this is the return after crawling with Nutch and indexing on Solr:

<doc>
  <float name="boost">0.298293</float>
  <str name="content">Index of C:\Documents and Settings\Alessio\Documenti Index of C:\Documents
  and Settings\Alessio\Documenti ../ - - - 003_C_001_Alessio_2004_08_13.dvf
  Tue, 17 Aug 2004 20:09:52
Hi,
Your hardware looks okay.
Moving data from 30,000 URLs takes a week at 500 kB/s?
That would mean ~10 MB per URL. Could that be right?
Anyway, can you tell us at what stage your crawl script is when this kicks in?
Mathijs
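A quick sanity check of that arithmetic (the 500 kB/s rate and the 30,000-URL count are the figures quoted in the thread, not measured values):

```shell
# Sanity check: 30,000 URLs moved over one week at a sustained 500 kB/s.
rate_kb_s=500
seconds_per_week=$((7 * 24 * 3600))                  # 604800 s
urls=30000
kb_per_url=$((rate_kb_s * seconds_per_week / urls))
echo "$kb_per_url"                                   # → 10080 kB, i.e. roughly 10 MB per URL
```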
On 17 Mar 2012, at 07:40, George wrote:
> Hello
> I'm using Nutch
Hi Alessio,

On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi
<alessio.crisant...@gmail.com> wrote:
> suggestions?

For what?
I would like the result of my search to be the text of my PDF file, not the
list of documents in the directory and the path address.
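One common cause of this symptom is that the PDF parser plugin is not enabled, so only the HTML directory listing gets indexed. A minimal sketch, assuming Nutch's `plugin.includes` property in `conf/nutch-site.xml` (the plugin list below is illustrative, not your actual configuration; merge it with the value shipped in `nutch-default.xml`):

```xml
<!-- conf/nutch-site.xml: enable PDF parsing (sketch only; adapt the
     list to the plugins actually present in your installation) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```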
On 17 March 2012 at 21:11, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
> Hi Alessio,
> On Sat, Mar 17, 2012 at 5:31 PM, alessio crisantemi
no
For example, if I run with depth 3,
it fetches data to the Hadoop temporary directory, then moves the data to a new
segment,
and does this cycle 3 times.
All data is fetched to hadoop-root (the temporary Hadoop directory),
and then Nutch moves this data to the segment dir in the segments folder.
and for example
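The per-depth generate/fetch/update cycle described above maps onto the standard Nutch command sequence. A sketch only (the `crawl` and `urls` paths and the depth value are assumptions, not taken from this thread):

```shell
#!/bin/sh
# Sketch of the per-depth generate/fetch/update cycle described above.
# The "crawl" and "urls" paths are assumptions, not from the thread.
DEPTH=3
bin/nutch inject crawl/crawldb urls            # seed the crawl db once
i=1
while [ "$i" -le "$DEPTH" ]; do
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=$(ls -d crawl/segments/* | tail -1)  # newest segment
  bin/nutch fetch "$SEGMENT"                   # writes via the hadoop tmp dir
  bin/nutch updatedb crawl/crawldb "$SEGMENT"  # then merges into the crawldb
  i=$((i + 1))
done
```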