Welcome!
Nutch is different from anything else I have seen before; it's great but also difficult, so expect to spend some time on it. The best way to learn is to practice and understand what it does.
1. Front-End (search): a web site which wraps a Lucene-based
index. If you are not familiar with
I want to see if there is any possible bandwidth optimization while using Nutch.
a) Crawling: after the initial crawl, can it fetch ONLY updated documents? Running the
re-crawl command every 6 hours will crawl and fetch all documents
['db.fetch.interval.default' is 6 hours]. It should just bring updated
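One lever that may help with this (a sketch, assuming a Nutch 1.x setup; verify the class is present in your version) is switching the fetch schedule so that pages which rarely change get re-fetched less often:

```xml
<!-- nutch-site.xml: sketch, assuming Nutch 1.x ships AdaptiveFetchSchedule -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  <description>Adjusts each page's fetch interval up or down depending on
  whether its content changed since the last fetch, instead of re-fetching
  everything at the fixed db.fetch.interval.default.</description>
</property>
```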
Hi all,
Has anyone successfully used Nutch to index Office 2007 documents? I know that
this question has already been asked, but considering the number of e-mails
asking the same question, it looks like Nutch does not support Office 2007
documents.
Best,
Adilson
On Wed, Dec 9, 2009 at 2:27
Hi,
There is a Tika plugin in JIRA (
https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's page,
support for Office 2007 was imminent in POI (which Tika uses
internally). The plan for Nutch is to progressively delegate the parsing to
Tika; NUTCH-766 has been implemented for
Hi,
Thanks for the reply. I will try to use Tika with Nutch to parse the
documents. My current Nutch setup is working quite nicely and I don't want to
configure another Nutch instance.
If I manage to put it to work I will write here a mini how-to.
Best,
Adilson
On Mon, Dec 14, 2009 at
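For anyone attempting the same, wiring the plugin in would presumably come down to adding it to `plugin.includes` in nutch-site.xml (a sketch; `parse-tika` is the plugin id assumed here from NUTCH-766, and the rest of the value should match whatever plugins your existing setup already enables):

```xml
<!-- nutch-site.xml: sketch; adjust the plugin list to your own setup -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming the plugins to load;
  parse-tika is assumed to be the Tika parsing plugin from NUTCH-766.</description>
</property>
```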
If I manage to put it to work I will write here a mini how-to.
The Nutch Wiki would be the right place for doing that. It would be nice to
have a page there listing the differences between the capabilities of the
Tika plugin and the existing Nutch parsing plugins, as there might be
differences.
I have created a page at http://wiki.apache.org/nutch/TikaPlugin; feel free to use
it for your how-to.
J.
2009/12/14 Julien Nioche lists.digitalpeb...@gmail.com
Index and segments are the minimum, yes. You only need the segments for
the indexes that you are serving on the local box.
Dennis
MilleBii wrote:
OK, I don't per se need distributed search.
I was trying to avoid a copy to the local file system, to optimize
resources by working off HDFS.
What is
Nobody?
Please, any answer would be good.
On 2009-12-14 16:05, BrunoWL wrote:
Nobody?
Please, any answer would be good.
Please check this issue:
https://issues.apache.org/jira/browse/NUTCH-479
That's the current status, i.e. this functionality is available only as
a patch.
--
Best regards,
Andrzej Bialecki
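For reference, applying such a JIRA patch to a Nutch source checkout is typically along these lines (a sketch; the patch file name is whatever the JIRA attachment is called, and the `-p` strip level may need adjusting):

```shell
# Sketch: apply a JIRA patch to a Nutch source tree, then rebuild.
cd nutch-1.0
patch -p0 < NUTCH-479.patch   # -p0 is typical for svn-era patches; adjust if needed
ant                           # rebuild Nutch with the patch applied
```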
Adam,
I finally got the command to work on another server (see below). To
change the retry interval, should I just add the two properties into
nutch-site.xml? (Though I tried this before and it didn't work.)
http://mysite/ Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Jan 08 15:42:33 EST 2010
Yes, just add those configs in nutch-site.xml and it should work. But are
you going to recrawl every hour??? I see 3600 seconds!!
Another thing: you have to make an initial clean crawl with the new
fetch time, because the crawldb will not change the fetch time
automatically.
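The properties being discussed are presumably along these lines (a sketch for nutch-site.xml; `db.fetch.interval.default` is named earlier in the thread and 3600 seconds matches the value mentioned above, while `db.fetch.interval.max` is an assumed companion property):

```xml
<!-- nutch-site.xml: sketch of an hourly re-fetch interval -->
<property>
  <name>db.fetch.interval.default</name>
  <value>3600</value>
  <description>Default interval (in seconds) between re-fetches of a page.</description>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <value>7200</value>
  <description>Assumption: upper bound on how far the fetch interval can grow.</description>
</property>
```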
Thanks.
I'm on a development system, so every hour is okay.
I guess that's why the last time I changed the properties file it didn't
take effect (because the crawldb won't change the fetch time
automatically).
I'll give this a try - thanks much.
Vijaya Peters
SRA International, Inc.
4350 Fair
But just think about one thing... if you are recrawling too many URLs and the
crawl time is more than 1 hour, your crawl will never finish, because
every time it finds a URL it will see that the fetch time is due and
fetch it again.
To set your fetch time well, you have to crawl
Okay. Our fetch finishes in less than 10 minutes (just intranet). But
I'll set it to 2 hours.
Vijaya Peters
SRA International, Inc.
4350 Fair Lakes Court North
Room 4004
Fairfax, VA 22033
Tel: 703-502-1184
www.sra.com
Named to FORTUNE's 100 Best Companies to Work For list for 10
Hi,
I used the crawl command of bin/nutch and obtained the following:
ls crawl/crawldb/current/part-0/
data  .data.crc  index  .index.crc
How do I convert the output to a human-readable format?
Thanks
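The usual answer (a sketch, assuming Nutch 1.0's `readdb` tool; paths follow the `crawl` directory above) is to dump the crawldb as plain text:

```shell
# Sketch: dump the crawldb to plain text with Nutch's readdb tool
bin/nutch readdb crawl/crawldb -dump crawldb-dump   # writes text files under crawldb-dump/
bin/nutch readdb crawl/crawldb -stats               # prints summary statistics instead
head crawldb-dump/part-00000                        # inspect the first few entries
```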
Hi,
I am using Nutch 1.0.
As a simple exercise I crawled a single domain, and after that I
tried both the readdb and readseg commands...
Both show different figures. Which one should I consider? Did
something go wrong while crawling?
Here is the output of both commands.
OUTPUT FROM
Everything seems right.
Both stats are interesting, and it all depends on what you are looking for.
Readdb gives you global stats, whereas readseg is about each segment, i.e.
one fetch/parse run.
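Concretely, the two commands being compared are presumably these (a sketch, assuming the standard `crawl` output layout with a `crawldb` and a `segments` directory):

```shell
# Sketch: global stats from the crawldb vs. per-segment stats
bin/nutch readdb crawl/crawldb -stats            # totals across the whole crawl db
bin/nutch readseg -list -dir crawl/segments      # one line of counts per segment
```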
2009/12/15, bhavin pandya bvnpan...@gmail.com: