Hi Thomas,

It's not so much that readseg is too slow; it's just that it is probably not the right tool. readseg is meant primarily for debugging and inspecting the content of a segment. What is it that you are trying to achieve?
If you need to put the original content into a set of files or a DB, you should do that with a custom map-reduce job, as is done e.g. in the Nutch module of Behemoth (https://github.com/jnioche/behemoth/blob/master/modules/io/src/main/java/com/digitalpebble/behemoth/io/nutch/NutchSegmentConverterJob.java).

HTH

Julien

On 28 February 2011 17:34, Eggebrecht, Thomas (GfK Marktforschung) <[email protected]> wrote:

> Hi there,
>
> I need to read some pages from segments to get the raw HTML.
>
> I do it like:
>
> nutch-1.2/bin/nutch readseg -get /path/to/segment http://key.value.html -nofetch -nogenerate -noparse -noparsedata -noparsetext
>
> That works fine, but it takes 2 or 3 full seconds per page! My very small
> test environment has about 20 crawled and indexed pages and runs on a single
> machine. A search over the Lucene index takes only milliseconds.
>
> Is there a way to read segments faster?
> Is it the right way to implement SegmentReader.class to get the original HTML?
>
> Best Regards
> Thomas

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
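For reference, much of the 2-3 seconds per page likely comes from JVM and job startup, which every `bin/nutch` invocation pays in full; one `-get` call per URL repeats that cost for each page. If a one-off dump is enough (rather than the custom map-reduce job suggested above), `readseg -dump` reads the whole segment in a single invocation. A minimal sketch, assuming a Nutch 1.2 layout; the segment path and output directory are placeholders:

```shell
# Dump the fetched content of every page in the segment in one run,
# instead of invoking "readseg -get" once per URL.
# /path/to/segment and dump_out are placeholder paths.
nutch-1.2/bin/nutch readseg -dump /path/to/segment dump_out \
  -nofetch -nogenerate -noparse -noparsedata -noparsetext

# dump_out/dump is a plain-text file containing the raw content records;
# individual pages can then be split out with standard text tools.
```

The flags mirror the ones used with `-get` above: they suppress the fetch, generate, and parse data so only the raw content is written.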

