Hi there,

I need to read some pages from segments to get the raw HTML.

I currently do it like this:

nutch-1.2/bin/nutch readseg -get /path/to/segment http://key.value.html \
  -nofetch -nogenerate -noparse -noparsedata -noparsetext

That works fine, but it takes 2 to 3 full seconds per page! My very small test
environment has about 20 crawled and indexed pages and runs on a single machine.
By comparison, a search over the Lucene index takes only milliseconds.

Is there a way to read segments faster?
Is using the SegmentReader class the right way to get the original HTML?

Best Regards
Thomas


