Re: nutch's design document

2009-12-14 Thread MilleBii
Welcome! Nutch is different from anything else I have seen before; it's great but also difficult, so expect to spend some time. The best way to learn is to practice until you understand what it does. 1. Front-end (search): a web site which wraps a Lucene-based index. If you are not familiar with

Optimization in crawling and indexing

2009-12-14 Thread Rupesh Mankar
I want to see if there is any possible bandwidth optimization while using Nutch. a) Crawling: after the initial crawl, can Nutch fetch ONLY updated documents? Running the re-crawl command every 6 hours crawls and fetches all documents ['db.fetch.interval.default' is 6 hours]. It should just bring updated

Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Adilson Oliveira Cruz
Hi all, Has anyone successfully used Nutch to index Office 2007 documents? I know that this question has already been asked, but considering the number of e-mails asking the same question, it looks like Nutch does not support Office 2007 documents. Best, Adilson On Wed, Dec 9, 2009 at 2:27

Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Julien Nioche
Hi, There is a Tika plugin in JIRA (https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's page, support for Office 2007 was imminent in POI (which Tika uses internally). The plan for Nutch is to progressively delegate the parsing to Tika; NUTCH-766 has been implemented for

Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Adilson Oliveira Cruz
Hi, Thanks for the reply. I will try to use Tika with Nutch to parse the documents. My current Nutch setup is working quite nicely and I don't want to configure another Nutch instance. If I manage to put it to work I will write here a mini how-to. Best, Adilson On Mon, Dec 14, 2009 at

Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Julien Nioche
"If I manage to put it to work I will write here a mini how-to." The Nutch Wiki would be the right place for doing that. It would be nice to have a page there listing the differences between the capabilities of the Tika plugin and the existing Nutch parsing plugins as there might be differences

Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Julien Nioche
I have created a page, http://wiki.apache.org/nutch/TikaPlugin ; feel free to use it for your how-to. J. 2009/12/14 Julien Nioche lists.digitalpeb...@gmail.com: "If I manage to put it to work I will write here a mini how-to." The Nutch Wiki would be the right place for doing that. It would be nice

Re: Distributed Search problem

2009-12-14 Thread Dennis Kubes
Yes, index and segments are the minimum. You only need the segments for the indexes that you are serving on the local box. Dennis. MilleBii wrote: OK, I don't per se need distributed search. I was trying to avoid a copy to the local file system, to optimize resources by working off HDFS. What is

Re: OR support

2009-12-14 Thread BrunoWL
Nobody? Please, any answer would be good.

Re: OR support

2009-12-14 Thread Andrzej Bialecki
On 2009-12-14 16:05, BrunoWL wrote: "Nobody? Please, any answer would be good." Please check issue NUTCH-479 (https://issues.apache.org/jira/browse/NUTCH-479); that's the current status, i.e. this functionality is available only as a patch. -- Best regards, Andrzej Bialecki

RE: how to force nutch to do a recrawl

2009-12-14 Thread Peters, Vijaya
Adam, I finally got the command to work on another server (see below). To change the retry interval, should I just add the two properties to nutch-site.xml (though I tried this before and it didn't work): http://mysite/ Version: 7 Status: 2 (db_fetched) Fetch time: Fri Jan 08 15:42:33 EST 2010

RE: how to force nutch to do a recrawl

2009-12-14 Thread BELLINI ADAM
Yes, just add those settings to nutch-site.xml and it should work. But are you going to recrawl every hour? I see 3600 seconds! Another thing: you have to do an initial clean crawl with the new fetch time, because the fetch time already stored in the crawldb will not change automatically.

RE: how to force nutch to do a recrawl

2009-12-14 Thread Peters, Vijaya
Thanks. I'm on a development system, so every hour is okay. I guess that's why the last time I changed the properties file it didn't take effect (because the crawldb won't change the fetch time automatically). I'll give this a try - thanks much. Vijaya Peters

RE: how to force nutch to do a recrawl

2009-12-14 Thread BELLINI ADAM
But just think about one thing: if you are recrawling too many URLs and the crawl takes more than 1 hour, your crawl will never finish, because every time it finds a URL it will see that the fetch time is due and fetch it again. To set your fetch time well, you have to crawl
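The concern above can be sketched with a deliberately simplified model. This is not Nutch's actual scheduling code: `CrawlEntry`, `is_due`, `mark_fetched` and `due_after_cycle` are hypothetical names made up for illustration, and the model assumes only that a URL becomes due once the current time reaches its stored fetch time, and that a fetch pushes that time forward by the configured interval.

```python
from dataclasses import dataclass


@dataclass
class CrawlEntry:
    """Hypothetical stand-in for a crawldb record (not Nutch's own class)."""
    url: str
    fetch_time: int  # seconds; the moment the URL becomes due again


def is_due(entry: CrawlEntry, now: int) -> bool:
    # Assumption: a URL is selected for fetching once its fetch time has passed.
    return now >= entry.fetch_time


def mark_fetched(entry: CrawlEntry, fetched_at: int, interval: int) -> None:
    # Assumption: after a fetch, the next fetch time is fetch moment + interval.
    entry.fetch_time = fetched_at + interval


def due_after_cycle(interval: int, cycle_duration: int) -> bool:
    """Fetch a URL at t=0, let the crawl cycle run, and report whether
    the URL is already due again when the cycle ends."""
    entry = CrawlEntry("http://mysite/", fetch_time=0)
    mark_fetched(entry, fetched_at=0, interval=interval)
    return is_due(entry, now=cycle_duration)


# Interval of 1 hour, crawl cycle of 1.5 hours: the URL is due again immediately.
print(due_after_cycle(interval=3600, cycle_duration=5400))
# Interval of 2 hours, crawl cycle of 10 minutes: the URL is not due yet.
print(due_after_cycle(interval=7200, cycle_duration=600))
```

Under these assumptions, an interval shorter than the crawl duration means every URL is due again by the time the cycle ends, which is the never-finishing recrawl described above.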

RE: how to force nutch to do a recrawl

2009-12-14 Thread Peters, Vijaya
Okay. Our fetch finishes in less than 10 minutes (just intranet). But, I'll set it to 2 hours. Vijaya Peters

converting nutch crawl output to human readable content

2009-12-14 Thread Ted Yu
Hi, I used the crawl command of bin/nutch and obtained the following: ls crawl/crawldb/current/part-0/ data .data.crc index .index.crc How do I convert the output to a human-readable format? Thanks

Why readdb and readseg shows different figures?

2009-12-14 Thread bhavin pandya
Hi, I am using Nutch 1.0. As a simple exercise I crawled a single domain, and after that I tried both the readdb and readseg commands. They show different figures. Which one should I consider? Did something go wrong while crawling? Here is the output of both commands. OUTPUT FROM

Re: Why readdb and readseg shows different figures?

2009-12-14 Thread MilleBii
Everything seems right. Both stats are interesting, and it all depends on what you are looking for. readdb gives you global stats, whereas readseg is about each segment, i.e. each fetch/parse run. 2009/12/15, bhavin pandya bvnpan...@gmail.com: Hi, I am using Nutch 1.0. For simple excercise i have
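
The difference between global and per-segment figures can be illustrated with toy data. This is only a sketch and does not read real Nutch output: the `crawldb` and `segments` structures, the URLs, and the helper names `crawldb_stats` and `segment_stats` are all hypothetical. It assumes the crawldb holds one status per known URL, while each segment lists only what one fetch/parse run handled.

```python
from collections import Counter

# Hypothetical toy crawldb: one status per URL known to the crawler.
crawldb = {
    "http://example.com/": "db_fetched",
    "http://example.com/a": "db_fetched",
    "http://example.com/b": "db_unfetched",
    "http://example.com/c": "db_unfetched",
}

# Hypothetical toy segments: each lists the URLs handled in one fetch run.
segments = {
    "segment-1": ["http://example.com/"],
    "segment-2": ["http://example.com/a"],
}


def crawldb_stats(db: dict) -> Counter:
    """Global view: count every known URL by status."""
    return Counter(db.values())


def segment_stats(segs: dict) -> dict:
    """Per-run view: count the URLs handled in each segment."""
    return {name: len(urls) for name, urls in segs.items()}


print("crawldb total:", len(crawldb), dict(crawldb_stats(crawldb)))
print("per segment:", segment_stats(segments))
```

In this toy model the global count includes URLs that were discovered but never fetched, so it is larger than any single segment's count, and neither figure is wrong.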