Recover fetching process

2006-09-05 Thread Sergey Levickiy
Hi. How can I resume the fetch process? I want to continue fetching without downloading again the pages that have already been fetched. -- Best regards, Sergey Levickiy ICQ: 283616567 tel: +38(067)6483250

searching more than one specific url

2006-09-05 Thread David Podunavac
Hi there, I wonder if there is a way, after I have crawled, to specify more than one URL to look for. E.g., I have in my crawl-urlfilter.txt: http://www.firstUrl.com http://www.secondUrl.com http://*.thirdUrl.com. But I don't want results from all of these when I enter a search term in the web interface, so my ...

Caching the search results

2006-09-05 Thread Marco Vanossi
Hi, does anybody know how I can set up Nutch to cache the results of searches? I've heard about this feature but I am not finding the information. Thanks, Marco

Re: Caching the search results

2006-09-05 Thread Andrzej Bialecki
Marco Vanossi wrote: Hi, does anybody know how I can set up Nutch to cache the results of searches? I've heard about this feature but I am not finding the information.
Trivial web-level caching is easy to implement: just download osCache and modify your web application settings according to ...
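
A minimal illustration of the web application settings in question, assuming OSCache's stock servlet filter; the URL pattern and cache duration below are made-up values, not something Andrzej specified:

    <!-- web.xml: cache rendered search pages with OSCache's CacheFilter -->
    <filter>
      <filter-name>CacheFilter</filter-name>
      <filter-class>com.opensymphony.oscache.web.filter.CacheFilter</filter-class>
      <init-param>
        <!-- keep each cached page for 10 minutes (600 seconds) -->
        <param-name>time</param-name>
        <param-value>600</param-value>
      </init-param>
    </filter>
    <filter-mapping>
      <filter-name>CacheFilter</filter-name>
      <url-pattern>/search.jsp</url-pattern>
    </filter-mapping>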

Setting mapred.tasktracker.tasks.maximum doesn't change # of tasks executed in parallel

2006-09-05 Thread Vishal Shah
Hi, I am using Nutch 0.9 for crawling. I recall that mapred.tasktracker.tasks.maximum can be used to control the maximum number of tasks executed in parallel by a tasktracker. I am running a fetch with the following config: 3 machines; my mapred-default.xml contains: mapred.map.tasks=13 ...
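
For reference, the property Vishal mentions would be set like this (the value 4 is only an illustrative choice; whether it takes effect can depend on which config file wins on the classpath):

    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <value>4</value>
      <description>Maximum number of tasks a tasktracker runs in
      parallel (value here is illustrative).</description>
    </property>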

ignore content between tags? crawl only between tags?

2006-09-05 Thread Philip Brown
Is it possible on some pages to crawl only between tags, or have it not crawl between tags? I.e.: <nocrawl>blah blah blah</nocrawl> <crawlhere>the content only that I want to crawl</crawlhere> <nocrawl>blah blah blah</nocrawl>. Appreciate any input. Kind regards

RE: Caching the search results

2006-09-05 Thread Chirag Chaman
Marco, we use a search caching system at Filangy -- it uses Lucene to save the search string, count, date, and top 20 IDs of the pages. So all you have to do is search for those IDs. Yes, it still involves a search, but we have a distributed system with the ID as the hash key for specifying on which ...
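
A rough sketch of that idea against the Lucene API of the time; the class name, path, and field names are illustrative, not Filangy's actual code:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class QueryCacheWriter {
      // Store one cached-search entry: query string, hit count, date,
      // and the top-20 page IDs, so later lookups can fetch pages by ID.
      public static void cache(String query, int count, String date,
                               String[] topIds) throws Exception {
        // false = append to an existing cache index (create it once with true)
        IndexWriter writer =
            new IndexWriter("/data/querycache", new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(new Field("query", query, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("count", String.valueOf(count), Field.Store.YES, Field.Index.NO));
        doc.add(new Field("date", date, Field.Store.YES, Field.Index.NO));
        for (int i = 0; i < topIds.length; i++) {
          doc.add(new Field("pageId", topIds[i], Field.Store.YES, Field.Index.UN_TOKENIZED));
        }
        writer.addDocument(doc);
        writer.close();
      }
    }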

Re: ignore content between tags? crawl only between tags?

2006-09-05 Thread Andrzej Bialecki
Philip Brown wrote: Is it possible on some pages to crawl only between tags, or have it not crawl between tags? I.e.: <nocrawl>blah blah blah</nocrawl> <crawlhere>the content only that I want to crawl</crawlhere> <nocrawl>blah blah blah</nocrawl>.
You can modify ...
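
One way to carry out the modification Andrzej is pointing at: strip the marked regions from the raw HTML before it reaches the parser. A minimal sketch; note that <nocrawl> is the poster's own convention, not a standard tag or a Nutch feature:

    import java.util.regex.Pattern;

    public class NoCrawlStripper {
      // (?i) = case-insensitive, (?s) = let .*? span line breaks
      private static final Pattern NOCRAWL =
          Pattern.compile("(?is)<nocrawl>.*?</nocrawl>");

      // Drop everything between <nocrawl> markers before parsing.
      public static String strip(String html) {
        return NOCRAWL.matcher(html).replaceAll("");
      }
    }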

Re: how to combine two run's result for search

2006-09-05 Thread Renaud Richardet
@Dennis, can you explain how to set up distributed search while storing the 2 indexes on the same local machine (if possible)? @Feng, we created a shell script to merge 2 runs; let us know if that works for you: http://wiki.apache.org/nutch/MergeCrawl -- Renaud
Dennis Kubes wrote: You can ...
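
For orientation, the linked script boils down to a handful of Nutch commands along these lines (paths are placeholders; check the wiki page for the exact invocations and flags):

    bin/nutch mergedb crawl-merged/crawldb crawl1/crawldb crawl2/crawldb
    bin/nutch mergesegs crawl-merged/segments -dir crawl1/segments -dir crawl2/segments
    bin/nutch invertlinks crawl-merged/linkdb -dir crawl-merged/segments
    bin/nutch index crawl-merged/indexes crawl-merged/crawldb crawl-merged/linkdb crawl-merged/segments/*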

ZIP parser in Nutch 0.7.2

2006-09-05 Thread Lourival Júnior
Hi all! Has anyone successfully implemented the ZIP parser plugin in Nutch version 0.7.2? How can I do this? Regards, -- Lourival Junior Universidade Federal do Pará Curso de Bacharelado em Sistemas de Informação http://www.ufpa.br/cbsi Msn: [EMAIL PROTECTED]

Re: how to combine two run's result for search

2006-09-05 Thread Zaheed Haque
Hi: Assuming you have index 1 at /data/crawl1 and index 2 at /data/crawl2, set searcher.dir = /data in nutch-site.xml. Under /data you have a text file called search-servers.txt (I think; do check the searcher.dir description in nutch-site.xml, please). In the text file you will have the following: hostname1 ...
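
Spelled out, that setup would look roughly like this (host names and port are illustrative):

    <!-- nutch-site.xml -->
    <property>
      <name>searcher.dir</name>
      <value>/data</value>
    </property>

    # /data/search-servers.txt -- one "host port" pair per line
    hostname1 9999
    hostname2 9999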

crawling frequently changing data on an intranet - how?

2006-09-05 Thread Tomi NA
The task: I have less than 100 GB of diverse documents (.doc, .pdf, .ppt, .txt, .xls, etc.) to index. Dozens, hundreds, or even thousands of documents can change their content, be created, or be deleted every day. The crawler will run on an HP DL380 G4 server; I don't know the exact specs ...

Re: how to combine two run's result for search

2006-09-05 Thread Renaud Richardet
Zaheed, thank you, that works well. Do you know if there is a big performance overhead in starting 2 servers? As an alternative, could we use Lucene's MultiSearcher? -- Renaud
Zaheed Haque wrote: Hi: Assuming you have index 1 at /data/crawl1 and index 2 at /data/crawl2, in nutch-site.xml ...
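
For the MultiSearcher alternative, a minimal Lucene sketch (the index paths are assumptions, and searching the Lucene indexes directly bypasses Nutch's own query translation and summaries):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;
    import org.apache.lucene.search.TermQuery;

    public class TwoIndexSearch {
      public static void main(String[] args) throws Exception {
        // One IndexSearcher per crawl's Lucene index (paths assumed)
        Searchable[] shards = {
            new IndexSearcher("/data/crawl1/index"),
            new IndexSearcher("/data/crawl2/index")
        };
        MultiSearcher searcher = new MultiSearcher(shards);
        Query q = new TermQuery(new Term("content", "nutch"));
        Hits hits = searcher.search(q);
        System.out.println("total hits: " + hits.length());
        searcher.close();
      }
    }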

Re: how to combine two run's result for search

2006-09-05 Thread Zaheed Haque
Renaud: Yes and no! I have done some testing, as Dennis Kubes suggested, and got results similar to his. In short: 4 Nutch search servers on one box, but on 4 different disks, with (in my case) 0.75 million docs per disk. I had about 4 GB of memory and 1 AMD64 processor, and it worked out ...
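
Concretely, a one-box layout like that amounts to starting one search server per disk and listing them all in search-servers.txt; the ports and paths below are illustrative:

    # one search server per disk
    bin/nutch server 9991 /disk1/crawl
    bin/nutch server 9992 /disk2/crawl
    bin/nutch server 9993 /disk3/crawl
    bin/nutch server 9994 /disk4/crawl

    # search-servers.txt seen by the web app
    localhost 9991
    localhost 9992
    localhost 9993
    localhost 9994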

Re: how to combine two run's result for search

2006-09-05 Thread Feng Ji
Thanks, Renaud: I figured out the same scenario as your script; it works well. Michael
On 9/5/06, Renaud Richardet [EMAIL PROTECTED] wrote: @Dennis, can you explain how to set up distributed search while storing the 2 indexes on the same local machine (if possible)? @Feng, we created a shell ...

filter urls from search result

2006-09-05 Thread Feng Ji
Hi there, I want to filter out particular URLs from the search results, and I am trying to use the segment merger to do it. First, I put the target URLs in regex-urlfilter.txt and automaton-urlfilter.txt, as -http://abc.com/. Then I run nutch mergesegs and nutch index, but the search page still shows the URLs I ...
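
For reference, an entry in regex-urlfilter.txt is a regular expression, so the exclusion would look something like the line below; also, if memory serves, mergesegs only applies URL filters when given its -filter switch:

    # '-' means exclude anything matching the regex (note the escaped dot)
    -^http://abc\.com/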

writing plugin in nutch 0.8

2006-09-05 Thread [EMAIL PROTECTED]
Are there any changes to writing plugins between Nutch 0.8 and 0.7? I have some problems following the plugin guide for Nutch 0.7.

Nutch Cannot Find Indexed Pages?

2006-09-05 Thread victor_emailbox
Hi, I followed all the steps in the 0.8 tutorial, except that I have only 2 URLs in the crawl list. When I do a search in Nutch in my browser, it can't find anything, as if there were nothing in the db or index. Does anyone know why? Thanks.