Hi Thomas,

It's not so much that readseg is too slow; it's that it is probably not
the right tool. Readseg is meant primarily for debugging and checking the
content of a segment. What is it that you are trying to achieve?

If you need to put the original content into a set of files or a DB, you
should do that with a custom map-reduce job, as is done e.g. in the Nutch
module of Behemoth (
https://github.com/jnioche/behemoth/blob/master/modules/io/src/main/java/com/digitalpebble/behemoth/io/nutch/NutchSegmentConverterJob.java).
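As a rough illustration of what such a job looks like, here is a minimal
sketch modeled on the Behemoth converter linked above. The class name,
output format, and paths are mine, not from Behemoth; it assumes Nutch 1.2
and the old Hadoop "mapred" API on the classpath, and simply reads the
<Text, Content> records under a segment's content directory and emits the
raw fetched bytes:

```java
// Hypothetical sketch (class name and output layout are my own invention),
// loosely following the Behemoth NutchSegmentConverterJob approach.
// Assumes Nutch 1.2 and the old org.apache.hadoop.mapred API.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.nutch.protocol.Content;

public class RawHtmlExtractor extends MapReduceBase
    implements Mapper<Text, Content, Text, Text> {

  public void map(Text url, Content content,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Content.getContent() returns the raw fetched bytes, i.e. the HTML
    output.collect(url, new Text(content.getContent()));
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(RawHtmlExtractor.class);
    // the raw pages live under <segment>/content as <Text, Content> records
    FileInputFormat.addInputPath(job, new Path(args[0], Content.DIR_NAME));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(RawHtmlExtractor.class);
    job.setNumReduceTasks(0); // map-only: just dump the content
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}
```

Because this reads the segment files directly in one pass, it avoids the
per-lookup start-up cost you see with repeated readseg invocations.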


HTH

Julien

On 28 February 2011 17:34, Eggebrecht, Thomas (GfK Marktforschung) <
[email protected]> wrote:

> Hi there,
>
> I need to read some pages from segments to get the raw HTML.
>
> I do it like:
>
> nutch-1.2/bin/nutch readseg -get /path/to/segment
> http://key.value.html -nofetch -nogenerate -noparse -noparsedata -noparsetext
>
> That works fine, but it takes 2 or 3 full seconds per page! My very small
> test environment has about 20 crawled and indexed pages and runs on a single
> machine. A search over the Lucene index takes only milliseconds.
>
> Is there a way to read segments faster?
> Is implementing something based on SegmentReader the right way to get the
> original HTML?
>
> Best Regards
> Thomas
>
>
>
> GfK SE, Nuremberg, Germany, commercial register Nuremberg HRB 25014;
> Management Board: Professor Dr. Klaus L. Wübbenhorst (CEO), Pamela Knapp
> (CFO), Dr. Gerhard Hausruckinger, Petra Heinlein, Debra A. Pruent, Wilhelm
> R. Wessels; Chairman of the Supervisory Board: Dr. Arno Mahlert
> This email and any attachments may contain confidential or privileged
> information. Please note that unauthorized copying, disclosure or
> distribution of the material in this email is not permitted.
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
