Nutch Hadoop question

2009-11-11 Thread Eran Zinman
Hi all, I'm using Nutch with Hadoop with great pleasure. It works great and really increases crawling performance on multiple machines. I have two strong machines and two older machines which I would like to use. So far I've been using only the two strong machines with Hadoop. Now I would like

Issue with with scoring and new webcolums with latest nutchbase

2009-11-11 Thread MilleBii
Hi all, I recently downloaded the latest nutchbase ... and am just finding out that many things have changed for my plugins, notably the addition of HBase columns. The changes are not really difficult to make, although plenty of plugins now don't work. Still, I face a problem with the new scoring plug-in because

Re: Issue with with scoring and new webcolums with latest nutchbase

2009-11-11 Thread MilleBii
Moreover, the class Crawl does not exist after building, so you cannot run nutch crawl... I'm going to revert to the reference code; it does not work. 2009/11/11 MilleBii mille...@gmail.com Hi all, I recently downloaded the latest nutchbase ... just finding out that many things have changed for

Re: How do I block/ban a specific domain name or a tld?

2009-11-11 Thread opsec
Hello, Thanks for the reply, but this doesn't seem to work either. I removed the crawl dir, added the regex you posted, removed the one I had in regex-urlfilter.txt and crawl-urlfilter.txt and restarted the crawl. My crawls spend about 90% of their time on who.int .. I have no idea how to

Re: How do I block/ban a specific domain name or a tld?

2009-11-11 Thread reinhard schwab
Hello, the first matching rule wins. Maybe you have an earlier rule that matches. Can you send me your filter files by private mail? Regards, reinhard. opsec wrote: Hello, Thanks for the reply, but this doesn't seem to work either. I removed the crawl dir, added the regex you posted,
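
The "first matching rule wins" behaviour described in this reply can be illustrated with a minimal Python sketch. It is not Nutch's own code: the helper names `parse_rules` and `accepts` and the two rule sets are hypothetical, and the sketch assumes that a `+` prefix accepts, a `-` prefix rejects, patterns are searched anywhere in the URL, and a URL matching no rule is rejected.

```python
import re

# Hypothetical rule sets, written in the +regex / -regex style of the filter files.
BROAD_FIRST = r"""
# accept everything first; the ban below is never reached
+.
-^http://([a-z0-9]*\.)*who\.int/
"""

BAN_FIRST = r"""
# ban who.int first, then accept everything else
-^http://([a-z0-9]*\.)*who\.int/
+.
"""

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: a URL that matches no rule is rejected

url = "http://www.who.int/en/"
print(accepts(parse_rules(BROAD_FIRST), url))
print(accepts(parse_rules(BAN_FIRST), url))
```

Under these assumptions the same ban is ineffective in the first rule set and effective in the second, which is why the order of lines in the filter file matters.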

Problems with Hadoop source

2009-11-11 Thread Pablo Aragón
Hi, I am developing a project based on Nutch. It works great (in Eclipse), but due to new requirements I have to replace the library hadoop-0.12.2-core.jar with the original source code. I successfully downloaded that code in:

Re: Problems with Hadoop source

2009-11-11 Thread Andrzej Bialecki
Pablo Aragón wrote: Hi, I am developing a project based on Nutch. It works great (in Eclipse), but due to new requirements I have to replace the library hadoop-0.12.2-core.jar with the original source code. I successfully downloaded that code in:

Re: Nutch/Solr question

2009-11-11 Thread Otis Gospodnetic
Solr is just a search and indexing server. It doesn't do crawling. Nutch does the crawling and page parsing, and can index into Lucene or into a Solr server. Nutch is a biggish beast, and if you just need to index a site or even a small set of them, you may have an easier time with Droids.

Stopping at depth=0 - no more URLs to fetch

2009-11-11 Thread kvorion
Hi all, I have been trying to run a crawl on a couple of different domains using Nutch: bin/nutch crawl urls -dir crawled -depth 3 Every time I get the response: Stopping at depth=x - no more URLs to fetch. Sometimes a page or two at the first level gets crawled and in most other cases, nothing

Nutch does not crawl pages starting with ~

2009-11-11 Thread Varish Mulwad
Hi, I have set up Nutch on a multinode cluster and the crawl is working fine. However, it seems that Nutch cannot crawl any pages with ~ in the URL. I set up Nutch to crawl http://www.cs.umbc.edu . Within this website it did not crawl pages like: - www.cs.umbc.edu/~varish1 - www.cs.umbc.edu/~relan1

re-fetch interval

2009-11-11 Thread fadzi
Hi, I hope someone can find me an answer to this: why is Nutch re-fetching pages that we fetched and indexed already, over and over, regardless of the db.fetch properties? Here are my settings: db.fetch.interval.default = 2592000 db.default.fetch.interval = 30 db.fetch.interval.max = 7776000
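
To put the posted values in perspective, here is a small Python sketch that converts them to days. It assumes, as the property descriptions in Nutch's default configuration of that era indicate, that db.fetch.interval.default and db.fetch.interval.max are given in seconds, while the older db.default.fetch.interval is given in days. The helper `to_days` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Values as posted in the message; the unit (seconds) is an assumption.
settings_seconds = {
    "db.fetch.interval.default": 2592000,
    "db.fetch.interval.max": 7776000,
}

def to_days(seconds):
    """Convert an interval in seconds to days."""
    return seconds / SECONDS_PER_DAY

for name, value in settings_seconds.items():
    print(f"{name} = {value} s = {to_days(value):g} days")
```

Under that assumption the posted settings amount to a 30-day default interval and a 90-day maximum, so the configured values themselves do not explain constant re-fetching.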

Re: Nutch does not crawl pages starting with ~

2009-11-11 Thread John Whelan
Maybe try using '%7e' instead of '~'? For example: www.cs.umbc.edu/%7evarish1 -- View this message in context: http://old.nabble.com/Nutch-does-not-crawl-pages-starting-with-%7E-tp26312379p26313265.html Sent from the Nutch - User mailing list archive at Nabble.com.
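
As a quick check of the suggestion above, this Python sketch shows that '%7e' is simply the percent-encoded form of '~', so both spellings name the same path once decoded. The helper `same_after_decoding` is hypothetical, and whether Nutch's own URL normalizers and filters treat the two forms alike is not shown in the thread.

```python
from urllib.parse import unquote

def same_after_decoding(a, b):
    """Compare two URL strings after percent-decoding both."""
    return unquote(a) == unquote(b)

encoded = "www.cs.umbc.edu/%7evarish1"
plain = "www.cs.umbc.edu/~varish1"

print(unquote(encoded))
print(same_after_decoding(encoded, plain))
```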

Re: Stopping at depth=0 - no more URLs to fetch

2009-11-11 Thread John Whelan
Are there any other rules in your filter that precede that one? (+^http://([a-z0-9]*\.)*blogspot.com/) -- View this message in context: http://old.nabble.com/Stopping-at-depth%3D0---no-more-URLs-to-fetch-tp26310955p26313305.html
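
The quoted rule can be tried against sample URLs outside of Nutch. The Python sketch below uses the regex exactly as quoted (without the leading '+'); the sample URLs and the helper `rule_matches` are hypothetical, and Python's regex engine is assumed to behave like the one Nutch uses for a pattern this simple.

```python
import re

# The pattern from the quoted rule, minus the leading '+' accept marker.
BLOGSPOT_RULE = re.compile(r"^http://([a-z0-9]*\.)*blogspot.com/")

def rule_matches(url):
    """Return True if the quoted rule's pattern is found in the URL."""
    return BLOGSPOT_RULE.search(url) is not None

samples = [
    "http://example.blogspot.com/",    # lowercase letters and digits only
    "http://blogspot.com/",            # no subdomain at all
    "https://example.blogspot.com/",   # https is not covered by the pattern
    "http://my-blog.blogspot.com/",    # '-' is not in the [a-z0-9] class
]
for url in samples:
    print(url, rule_matches(url))
```

Seed URLs that use https or contain a hyphen in the host name would fall through this rule, which is one more thing to check besides the order of the rules.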