RE: how to force nutch to do a recrawl

2009-12-11 Thread Peters, Vijaya
Adam, I'm using cygwin to run the scripts. I use EditPlus to edit the files. But EditPlus won't allow me to edit the crc file. I'll see if I can ftp the file to a unix machine. Vijaya Peters SRA International, Inc. 12500 Fair Lakes Circle Room 3507 Fairfax, VA 22033 Tel: 703-222-9207

Nutch with hadoop 0.20.x

2009-12-11 Thread Tom Landvoigt
Hello, does anyone know when Nutch will use the new Hadoop version? Thanks a lot, Tom

Re: Nutch with hadoop 0.20.x

2009-12-11 Thread Dennis Kubes
It has already been committed to SVN. You can pull and build an SVN release or we will be doing a 1.1 release shortly. Dennis Tom Landvoigt wrote: Hello, does anyone know when Nutch will use the new Hadoop version? Thanks a lot, Tom

RE: how to force nutch to do a recrawl

2009-12-11 Thread BELLINI ADAM
Hi, you shouldn't open the crc file; you have to open the other one, which is part-0. Use vi to edit part-. If you don't find this file, your dump failed... just check the logs/hadoop.log file. Subject: RE: how to force nutch to do a recrawl Date: Fri, 11 Dec 2009 09:14:26
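
The check described above (skip the checksum file, look for the part file, fall back to the log) can be sketched roughly as follows. This is only an illustration: the helper names `find_dump_parts` and `tail` are hypothetical, and it assumes the dump was written to a local directory where Hadoop keeps checksums in hidden files ending in `.crc`.

```python
from pathlib import Path


def find_dump_parts(dump_dir):
    """Return the readable part files in a dump directory, skipping checksum files."""
    parts = []
    for path in sorted(Path(dump_dir).iterdir()):
        # Checksum files (names ending in ".crc") are binary and not meant to be edited.
        if path.name.endswith(".crc"):
            continue
        # The readable dump output lives in files whose names start with "part-".
        if path.is_file() and path.name.startswith("part-"):
            parts.append(path)
    return parts


def tail(log_path, lines=20):
    """Return the last few lines of a log file such as logs/hadoop.log."""
    text = Path(log_path).read_text(errors="replace")
    return text.splitlines()[-lines:]
```

If `find_dump_parts` returns an empty list, the dump most likely failed, and `tail("logs/hadoop.log")` is a quick way to see the most recent log lines.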

RE: NOINDEX, NOFOLLOW

2009-12-11 Thread BELLINI ADAM
Hi, since I have a custom plugin which parses and indexes DC meta, I was filling in dc.description and dc.keywords... and since in Solr I was also searching in description and keywords and displaying the title and the first 4 lines of content, this made the noindexed page be displayed in the

Luke reading index in hdfs

2009-12-11 Thread MilleBii
Guys, is there a way to get Luke to read the index from hdfs://? Or do you have to copy it out to the local filesystem? -- -MilleBii-

Re: Luke reading index in hdfs

2009-12-11 Thread Andrzej Bialecki
On 2009-12-11 22:21, MilleBii wrote: Guys is there a way you can get Luke to read the index from hdfs:// ??? Or you have to copy it out to the local filesystem? Luke 0.9.9 can open indexes directly from HDFS hosted on Hadoop 0.19.x. Luke 0.9.9.1 can do the same, but uses Hadoop 0.20.1. Start
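
For Luke versions that cannot open HDFS directly, the fallback mentioned in the question (copying the index out to the local filesystem) might look roughly like this. It is a sketch only: the function name `copy_index_to_local` and the example paths are hypothetical, and it assumes the `hadoop` launcher is on the PATH.

```python
import subprocess


def copy_index_to_local(hdfs_path, local_path, run=False):
    """Build (and optionally run) a `hadoop fs -copyToLocal` command."""
    cmd = ["hadoop", "fs", "-copyToLocal", hdfs_path, local_path]
    if run:
        # Raises CalledProcessError if the copy fails, FileNotFoundError if hadoop is missing.
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical paths, shown for illustration only; run=False just builds the command.
command = copy_index_to_local("hdfs://namenode/crawl/indexes", "/tmp/indexes")
```

Once the copy finishes, Luke can be pointed at the local directory like any other Lucene index.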

stripping irrelevant contents

2009-12-11 Thread Ted Yu
Hi, we want to strip irrelevant content from the web pages we crawl. Examples of irrelevant content are display ads that surround the main body of an article on a web page. Please share your experience. Thanks