difference in time between an initial crawl and recrawl with a full crawldb

2009-12-16 Thread BELLINI ADAM
Hi, I just want to know the difference between an initial crawl and a recrawl using the fetch, generate, update commands. Is there a difference in time between running an initial crawl every time (by deleting the crawl_folder) and running a recrawl without deleting the initial crawl_folder?
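For reference, one round of the step-by-step cycle referred to here looks roughly like this in Nutch 1.0 (a sketch assuming the default crawl/crawldb and crawl/segments layout; the urls/ seed directory is a placeholder):

    bin/nutch inject crawl/crawldb urls              # first round only: seed the crawldb
    bin/nutch generate crawl/crawldb crawl/segments
    s=`ls -d crawl/segments/2* | tail -1`            # the segment just generated
    bin/nutch fetch $s
    bin/nutch parse $s                               # only if fetcher.parse is false
    bin/nutch updatedb crawl/crawldb $s

Deleting crawl/ and starting over repeats the whole discovery process from the seeds; keeping the crawldb lets generate pick up where the last round left off.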

Extracting Essence of Page and Indexing only when Changed

2009-12-16 Thread Avni, Itamar
Hi all, it's my first project with Nutch, so be gentle with me :-) 1) I want Nutch (1.0) to index only the essence of the current URL. I plugged in a new implementation of org.apache.nutch.parse.Parser, which calls Parse.setText with the essence content of the page reviewed. This Parse is set
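For context, a parser plugin of the kind described reduces to something like the following sketch against the Nutch 1.0 parse API; the EssenceExtractor helper is hypothetical and the plugin.xml wiring is not shown:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.metadata.Metadata;
    import org.apache.nutch.parse.*;
    import org.apache.nutch.protocol.Content;

    public class EssenceParser implements Parser {
      private Configuration conf;

      public ParseResult getParse(Content content) {
        String html = new String(content.getContent());   // assumes default charset, for brevity
        String essence = EssenceExtractor.extract(html);  // hypothetical helper
        ParseData data = new ParseData(ParseStatus.STATUS_SUCCESS, "",
            new Outlink[0], new Metadata());
        // the "setText" step: the essence becomes the text that gets indexed
        return ParseResult.createParseResult(content.getUrl(),
            new ParseImpl(essence, data));
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }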

RE: Extracting Essence of Page and Indexing only when Changed

2009-12-16 Thread BELLINI ADAM
Hi, you know that you can extract the content of the page by reading the segment. Type readseg to see the options; to dump only content you can use this command, it displays only content: ./bin/nutch readseg -dump crawl_folder/segments/20091001145126/ dump_folder -nofetch -nogenerate
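To narrow the dump down further, readseg accepts a -no* flag for each part of the segment (a sketch per the Nutch 1.0 readseg usage output; run ./bin/nutch readseg with no arguments to confirm the flags). For example, to dump only the parsed text:

    ./bin/nutch readseg -dump crawl_folder/segments/20091001145126/ dump_folder \
        -nocontent -nofetch -nogenerate -noparse -noparsedata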

Accessing crawled data

2009-12-16 Thread Claudio Martella
Hello list, I'm using Nutch 1.0 to crawl some intranet sites and I want to later put the crawled data into my Solr server. Though Nutch 1.0 comes with Solr support out of the box, I think that solution doesn't fit me. First, I need to run my own code on the crawled data (particularly what comes

RE: Extracting Essence of Page and Indexing only when Changed

2009-12-16 Thread Avni, Itamar
Thanks BELLINI ADAM. Is there a way to do it in Java? Itamar Avni
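One way to do it in Java is to read the segment's content part directly as a Hadoop SequenceFile. This is a sketch against the Nutch 1.0 era APIs, not the only route; the segment path is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;

    public class DumpContent {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // the raw content of a segment lives in <segment>/content/part-00000/data
        Path data = new Path("crawl_folder/segments/20091001145126/content/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        Content content = new Content();
        while (reader.next(url, content)) {    // key = URL, value = fetched content
          System.out.println(url + " : " + content.getContentType());
        }
        reader.close();
      }
    }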

Activating Parsing Plugins

2009-12-16 Thread Claudio Martella
Hello folks, I'd like to activate as many parser plugins as possible to extract text. I'm using vanilla Nutch 1.0 but I get this error: Error parsing: http://www.tis.bz.it/doc-bereiche/dt_doc/files/0technologiesreflector/20090929_Financial_BzSmarterTown.pdf: org.apache.nutch.parse.ParseException:

RE: Activating Parsing Plugins

2009-12-16 Thread Avni, Itamar
Check http://www.mail-archive.com/nutch-user@lucene.apache.org/msg00183.html. Thanks, Itamar Avni

RE: Extracting Essence of Page and Indexing only when Changed

2009-12-16 Thread BELLINI ADAM
I suggest you crawl only one page without your plugin; after that, plug in your plugin, which will create a new root variable containing only your important tags. And when you extract outlinks, just use the original root; but for extracting the text that will be indexed, use at this

RE: Activating Parsing Plugins

2009-12-16 Thread BELLINI ADAM
parse-(text|html|msword|pdf): this will parse doc files and pdf.
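In practice that value goes into the plugin.includes property, e.g. in conf/nutch-site.xml. A sketch follows; the surrounding plugin list mirrors the Nutch 1.0 default, so double-check it against your own nutch-default.xml:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html|js|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>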

Re: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-16 Thread xiao yang
It depends on your crawldb size and the number of urls you fetch. The crawldb stores the urls fetched and to be fetched. When you recrawl with the separate commands, first you read data from the crawldb and generate the urls that will be fetched this round. An initial crawl first injects seed urls into
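A quick way to watch this between rounds is the crawldb stats dump (the crawl/crawldb path is a placeholder):

    bin/nutch readdb crawl/crawldb -stats    # prints url totals by status: fetched, unfetched, etc.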

RE: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-16 Thread Peters, Vijaya
My experience has been that, when I delete the crawldb and do a crawl again, it seems to concatenate the urls so the same file gets fetched over and over again. Vijaya Peters

RE: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-16 Thread BELLINI ADAM
Thanks for the explanation. So if I understood well, using the separate commands I don't have to run as many rounds as I did in the initial crawl (with depth 10). In my recrawl I'm also doing it in a loop of 10! Am I wrong looping 10 times (generating, fetching, parsing, updating)? Maybe I
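The loop of 10 being described would look roughly like this (a sketch; whether 10 rounds are needed depends on how deep new links go, and -topN is optional):

    for i in 1 2 3 4 5 6 7 8 9 10
    do
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000
      # generate creates no new segment when nothing is due; a real script should check for that
      s=`ls -d crawl/segments/2* | tail -1`
      bin/nutch fetch $s
      bin/nutch updatedb crawl/crawldb $s
    done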

RE: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-16 Thread BELLINI ADAM
In my case I didn't notice that, but recrawling with a full crawldb seems to be quicker than the initial crawl... but I need someone to tell me whether I'm right or not, maybe with some metrics.

Multiple Nutch instances for crawling?

2009-12-16 Thread Felix Zimmermann
Hi, I would like to run at least two instances of Nutch ONLY for crawling at one time; one for very frequently updated sites and one for other sites. Will the Nutch instances get in trouble when running several crawl scripts, especially with the Nutch conf dir variable? Thanks! Felix.
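One way to keep two crawl instances apart is to give each its own conf directory and working directories; the bin/nutch script honors the NUTCH_CONF_DIR environment variable. A sketch, with placeholder directory names:

    # instance for very frequently updated sites
    NUTCH_CONF_DIR=/opt/nutch/conf-frequent bin/nutch generate crawl-frequent/crawldb crawl-frequent/segments

    # instance for all other sites
    NUTCH_CONF_DIR=/opt/nutch/conf-slow bin/nutch generate crawl-slow/crawldb crawl-slow/segments

As long as the two instances never share a crawldb or segments directory, they should not step on each other.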

Re: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-16 Thread MilleBii
Well, doing a crawl of depth 10 or a loop of 10 rounds of individual commands will give you essentially the same results (bear in mind the crawl command does not use the same file for url filtering). I don't know what you guys call an initial crawl; I suspect that you want to say I start with a crawl command and
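The url-filtering detail mentioned above refers to the two filter files Nutch 1.0 ships in conf/ (worth checking in your own installation):

    conf/crawl-urlfilter.txt    # used by the one-shot "crawl" command
    conf/regex-urlfilter.txt    # used by the individual generate/fetch/updatedb commands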

Re: Extracting Essence of Page and Indexing only when Changed

2009-12-16 Thread Ted Yu
Hi, For this page: http://online.wsj.com/article/BT-CO-20091216-711161.html I wonder if the nutch parser can remove the following javascript entirely: <script type="text/javascript">(function(){djcs=function(){var _url={decode:function(str){var string="";var i=0;var c=0;var c1=0;var c2=0;var utftext=null

RE: difference in time between an initial crawl and recrawl with a full crawldb

2009-12-16 Thread BELLINI ADAM
Hi, I will answer some of your questions; just tell me if I'm on the right track. You said: 1. I suspect that you want to say I start with a crawl command and later on do incremental steps by hand. ... Yes, it's exactly what I mean. 2. ... although it depends on your steps. And I dropped

Re: Accessing crawled data

2009-12-16 Thread reinhard schwab
If you don't want to refetch already fetched pages, I can think of 3 possibilities: a/ set a very high fetch interval; b/ use a customized fetch schedule class instead of DefaultFetchSchedule and implement there a method public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) which returns
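Option b/ could look like this sketch against the Nutch 1.0 API (the class name is made up; it would be activated by pointing the db.fetch.schedule.class property at it):

    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.DefaultFetchSchedule;

    public class FetchOnceSchedule extends DefaultFetchSchedule {
      @Override
      public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
        // only let never-fetched urls through; anything already in the db is skipped
        if (datum.getStatus() == CrawlDatum.STATUS_DB_UNFETCHED) {
          return super.shouldFetch(url, datum, curTime);
        }
        return false;
      }
    }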

Nutch search works, but no results in Tomcat

2009-12-16 Thread Noah Silverman
Hi, I just installed Nutch 1.0 and Tomcat and am starting to play around with things. I've managed to execute a crawl using: nutch crawl. It appears as if the crawl worked. I can do a test search from the command line with: bin/nutch org.apache.nutch.searcher.NutchBean foobar. It returns 10 results
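A common cause of this symptom (command-line NutchBean finds results, the webapp finds nothing) is that the web application does not know where the crawl directory is. It reads the searcher.dir property, which can be set in the webapp's WEB-INF/classes/nutch-site.xml; a sketch, with a placeholder path:

    <property>
      <name>searcher.dir</name>
      <value>/path/to/your/crawl</value>
    </property>

searcher.dir defaults to "crawl", resolved relative to the directory Tomcat was started from.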

Customize crawl

2009-12-16 Thread Noah Silverman
Hi, more questions about Nutch. I have a list of 1000 URLs that I want to crawl and index. Our plan is to check the same sites often for updates and/or new content. How would you suggest configuring Nutch for this? Or, more generally, is there a good source of documentation for all of
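For the check-often scenario, the knob to start with is the default fetch interval; in Nutch 1.0 this is the db.fetch.interval.default property, in seconds (a sketch for a daily recheck; tune it to how often the sites actually change):

    <property>
      <name>db.fetch.interval.default</name>
      <value>86400</value>
    </property>

Then rerun the generate/fetch/updatedb cycle on a matching schedule, e.g. from cron.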

RE: Extracting Essence of Page and Indexing only when Changed

2009-12-16 Thread Avni, Itamar
For this page: http://online.wsj.com/article/BT-CO-20091216-711161.html I wonder if the nutch parser can remove the following javascript entirely: <script type="text/javascript">(function(){djcs=function(){var _url={decode:function(str){var string="";var i=0;var c=0;var c1=0;var c2=0;var utftext=null;if(!str