Recover fetching process

2006-09-05 Thread Sergey Levickiy
Hi. How can I resume the fetch process? I want to continue fetching without downloading again the pages that have already been fetched. -- Best regards, Sergey Levickiy ICQ: 283616567 tel: +38(067)6483250

searching more than one specific url

2006-09-05 Thread David Podunavac
Hi there, I wonder if there is a way, after I have crawled, to specify more than one URL to look for. E.g., I have in my crawl-urlfilter.txt: http://www.firstUrl.com http://www.secondUrl.com http://*.thirdUrl.com. But I don't want results from all of these when I enter a search term in the web interface, so my ...

Caching the search results

2006-09-05 Thread Marco Vanossi
Hi, does anybody know how I can set up Nutch to cache the results of searches? I've heard about this feature but I am not finding the information. Thanks, Marco

Re: Caching the search results

2006-09-05 Thread Andrzej Bialecki
Marco Vanossi wrote: Hi, does anybody know how I can set up Nutch to cache the results of searches? I've heard about this feature but I am not finding the information.
Trivial web-level caching is easy to implement: just download osCache and modify your web application settings according to ...
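
A minimal illustration of the web application settings in question, assuming OSCache's stock servlet filter; the URL pattern and cache duration below are made-up values, not something Andrzej specified:

    <!-- web.xml: cache rendered search pages with OSCache's CacheFilter -->
    <filter>
      <filter-name>CacheFilter</filter-name>
      <filter-class>com.opensymphony.oscache.web.filter.CacheFilter</filter-class>
      <init-param>
        <!-- keep each cached page for 10 minutes (600 seconds) -->
        <param-name>time</param-name>
        <param-value>600</param-value>
      </init-param>
    </filter>
    <filter-mapping>
      <filter-name>CacheFilter</filter-name>
      <url-pattern>/search.jsp</url-pattern>
    </filter-mapping>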

Setting mapred.tasktracker.tasks.maximum doesn't change # of tasks executed in parallel

2006-09-05 Thread Vishal Shah
Hi, I am using Nutch 0.9 for crawling. I recall that mapred.tasktracker.tasks.maximum can be used to control the maximum number of tasks executed in parallel by a tasktracker. I am running a fetch with the following config: 3 machines; my mapred-default.xml contains: mapred.map.tasks=13 ...
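
For reference, the property Vishal mentions would be set like this (the value 4 is only an illustrative choice; whether it takes effect can depend on which config file wins on the classpath):

    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <value>4</value>
      <description>Maximum number of tasks a tasktracker runs in
      parallel (value here is illustrative).</description>
    </property>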

ignore content between tags? crawl only between tags?

2006-09-05 Thread Philip Brown
Is it possible on some pages to crawl only between tags, or have it not crawl between tags? I.e.: <nocrawl>blah blah blah</nocrawl> <crawlhere>the content only that I want to crawl</crawlhere> <nocrawl>blah blah blah</nocrawl>. Appreciate any input. Kind regards

RE: Caching the search results

2006-09-05 Thread Chirag Chaman
Marco, we use a search caching system at Filangy -- it uses Lucene to save the search string, count, date, and top 20 IDs of the pages. So all you have to do is search for those IDs. Yes, it still involves a search, but we have a distributed system with the ID as the hash key for specifying on which ...
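
A rough sketch of that idea against the Lucene API of the time; the class name, path, and field names are illustrative, not Filangy's actual code:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class QueryCacheWriter {
      // Store one cached-search entry: query string, hit count, date,
      // and the top-20 page IDs, so later lookups can fetch pages by ID.
      public static void cache(String query, int count, String date,
                               String[] topIds) throws Exception {
        // false = append to an existing cache index (create it once with true)
        IndexWriter writer =
            new IndexWriter("/data/querycache", new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(new Field("query", query, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("count", String.valueOf(count), Field.Store.YES, Field.Index.NO));
        doc.add(new Field("date", date, Field.Store.YES, Field.Index.NO));
        for (int i = 0; i < topIds.length; i++) {
          doc.add(new Field("pageId", topIds[i], Field.Store.YES, Field.Index.UN_TOKENIZED));
        }
        writer.addDocument(doc);
        writer.close();
      }
    }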

Re: ignore content between tags? crawl only between tags?

2006-09-05 Thread Andrzej Bialecki
Philip Brown wrote: Is it possible on some pages to crawl only between tags, or have it not crawl between tags? I.e.: <nocrawl>blah blah blah</nocrawl> <crawlhere>the content only that I want to crawl</crawlhere> <nocrawl>blah blah blah</nocrawl>.
You can modify ...
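
One way to carry out the modification Andrzej is pointing at: strip the marked regions from the raw HTML before it reaches the parser. A minimal sketch; note that <nocrawl> is the poster's own convention, not a standard tag or a Nutch feature:

    import java.util.regex.Pattern;

    public class NoCrawlStripper {
      // (?i) = case-insensitive, (?s) = let .*? span line breaks
      private static final Pattern NOCRAWL =
          Pattern.compile("(?is)<nocrawl>.*?</nocrawl>");

      // Drop everything between <nocrawl> markers before parsing.
      public static String strip(String html) {
        return NOCRAWL.matcher(html).replaceAll("");
      }
    }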

Re: how to combine two run's result for search

2006-09-05 Thread Renaud Richardet
@Dennis, can you explain how to set up distributed search while storing the 2 indexes on the same local machine (if possible)? @Feng, we created a shell script to merge 2 runs; let us know if that works for you: http://wiki.apache.org/nutch/MergeCrawl -- Renaud
Dennis Kubes wrote: You can ...
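
For orientation, the linked script boils down to a handful of Nutch commands along these lines (paths are placeholders; check the wiki page for the exact invocations and flags):

    bin/nutch mergedb crawl-merged/crawldb crawl1/crawldb crawl2/crawldb
    bin/nutch mergesegs crawl-merged/segments -dir crawl1/segments -dir crawl2/segments
    bin/nutch invertlinks crawl-merged/linkdb -dir crawl-merged/segments
    bin/nutch index crawl-merged/indexes crawl-merged/crawldb crawl-merged/linkdb crawl-merged/segments/*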

ZIP parser in Nutch 0.7.2

2006-09-05 Thread Lourival Júnior
Hi all! Has anyone successfully implemented the ZIP parser plugin in Nutch version 0.7.2? How can I do this? Regards, -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]

Re: how to combine two run's result for search

2006-09-05 Thread Zaheed Haque
Hi: Assuming you have index 1 at /data/crawl1 and index 2 at /data/crawl2, set searcher.dir = /data in nutch-site.xml. Under /data you have a text file called search-servers.txt (I think; do check the searcher.dir description in nutch-site.xml, please). In the text file you will have the following: hostname1 ...
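
Spelled out, that setup would look roughly like this (host names and port are illustrative):

    <!-- nutch-site.xml -->
    <property>
      <name>searcher.dir</name>
      <value>/data</value>
    </property>

    # /data/search-servers.txt -- one "host port" pair per line
    hostname1 9999
    hostname2 9999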

crawling frequently changing data on an intranet - how?

2006-09-05 Thread Tomi NA
The task: I have less than 100 GB of diverse documents (.doc, .pdf, .ppt, .txt, .xls, etc.) to index. Dozens, hundreds, or even thousands of documents can change their content, be created, or be deleted every day. The crawler will run on an HP DL380 G4 server; I don't know the exact specs ...

Re: how to combine two run's result for search

2006-09-05 Thread Renaud Richardet
Zaheed, thank you, that works well. Do you know if there is a big performance overhead in starting 2 servers? As an alternative, could we use Lucene's MultiSearcher? -- Renaud
Zaheed Haque wrote: Hi: Assuming you have index 1 at /data/crawl1 and index 2 at /data/crawl2, in nutch-site.xml ...
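
For the MultiSearcher alternative, a minimal Lucene sketch (the index paths are assumptions, and searching the Lucene indexes directly bypasses Nutch's own query translation and summaries):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.TermQuery;

    public class TwoIndexSearch {
      public static void main(String[] args) throws Exception {
        // One IndexSearcher per crawl's Lucene index (paths assumed)
        Searchable[] shards = {
            new IndexSearcher("/data/crawl1/index"),
            new IndexSearcher("/data/crawl2/index")
        };
        MultiSearcher searcher = new MultiSearcher(shards);
        Query q = new TermQuery(new Term("content", "nutch"));
        Hits hits = searcher.search(q);
        System.out.println("total hits: " + hits.length());
        searcher.close();
      }
    }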

Re: how to combine two run's result for search

2006-09-05 Thread Zaheed Haque
Renaud: Yes and no! I have done some testing, as Dennis Kubes suggested, and got results similar to his. In short: 4 Nutch search servers on one box, but on 4 different disks, with (in my case) 0.75 million docs per disk. I had about 4 GB of memory and 1 AMD64 processor, and it worked out ...
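
Concretely, a one-box layout like that amounts to starting one search server per disk and listing them all in search-servers.txt; the ports and paths below are illustrative:

    # one search server per disk
    bin/nutch server 9991 /disk1/crawl
    bin/nutch server 9992 /disk2/crawl
    bin/nutch server 9993 /disk3/crawl
    bin/nutch server 9994 /disk4/crawl

    # search-servers.txt seen by the web app
    localhost 9991
    localhost 9992
    localhost 9993
    localhost 9994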

Re: how to combine two run's result for search

2006-09-05 Thread Feng Ji
Thanks, Renaud: I figured out the same scenario as your script; it works well. Michael
On 9/5/06, Renaud Richardet [EMAIL PROTECTED] wrote: @Dennis, can you explain how to set up distributed search while storing the 2 indexes on the same local machine (if possible)? @Feng, we created a shell ...

filter urls from search result

2006-09-05 Thread Feng Ji
Hi there, I want to filter out particular URLs from the search results, and I am trying to use the segment merger to do it. First, I put the target URLs in regex-urlfilter.txt and automaton-urlfilter.txt, as -http://abc.com/. Then I run nutch mergesegs and nutch index, but the search page still shows the URLs I ...
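
For reference, an entry in regex-urlfilter.txt is a regular expression, so the exclusion would look something like the line below; also, if memory serves, mergesegs only applies URL filters when given its -filter switch:

    # '-' means exclude anything matching the regex (note the escaped dot)
    -^http://abc\.com/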

writing plugin in nutch 0.8

2006-09-05 Thread [EMAIL PROTECTED]
Are there any changes to writing plugins between Nutch 0.8 and 0.7? I have some problems following the plugin guide for Nutch 0.7.

Nutch Cannot Find Indexed Pages?

2006-09-05 Thread victor_emailbox
Hi, I followed all the steps in the 0.8 tutorial, except that I have only 2 URLs in the crawl list. When I do a search in Nutch in my browser, it can't find anything, as if there were nothing in the db or index. Does anyone know why? Thanks.