Hi Thomas,

It's not so much that readseg is too slow; it's just that it is probably not the right tool. readseg is meant primarily for debugging and inspecting the content of a segment. What is it that you are trying to achieve?
If you need to put the original content into a set of files or a DB, you should do that with a custom map-reduce job, as is done e.g. in the Nutch module of Behemoth (https://github.com/jnioche/behemoth/blob/master/modules/io/src/main/java/com/digitalpebble/behemoth/io/nutch/NutchSegmentConverterJob.java).

HTH

Julien

On 28 February 2011 17:34, Eggebrecht, Thomas (GfK Marktforschung) <[email protected]> wrote:

> Hi there,
>
> I need to read some pages from segments to get the raw HTML.
>
> I do it like:
>
> nutch-1.2/bin/nutch readseg -get /path/to/segment http://key.value.html -nofetch -nogenerate -noparse -noparsedata -noparsetext
>
> That works fine, but it takes 2 or 3 full seconds per page! My very small
> test environment has about 20 crawled and indexed pages and runs on a single
> machine. A search over the Lucene index takes only milliseconds.
>
> Is there a way to read segments faster?
> Is it the right way to implement SegmentReader.class to get the original HTML?
>
> Best Regards
> Thomas

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
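For reference, much of the 2-3 seconds per page likely comes from JVM and job startup, which every `bin/nutch` invocation pays in full; one `-get` call per URL repeats that cost for each page. If a one-off dump is enough (rather than the custom map-reduce job suggested above), `readseg -dump` reads the whole segment in a single invocation. A minimal sketch, assuming a Nutch 1.2 layout; the segment path and output directory are placeholders:

```shell
# Dump the fetched content of every page in the segment in one run,
# instead of invoking "readseg -get" once per URL.
# /path/to/segment and dump_out are placeholder paths.
nutch-1.2/bin/nutch readseg -dump /path/to/segment dump_out \
  -nofetch -nogenerate -noparse -noparsedata -noparsetext

# dump_out/dump is a plain-text file containing the raw content records;
# individual pages can then be split out with standard text tools.
```

The flags mirror the ones used with `-get` above: they suppress the fetch, generate, and parse data so only the raw content is written.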

