Nutch Hadoop question
Hi All, I'm using Nutch with Hadoop with great pleasure - it works great and really increases crawling performance across multiple machines. I have two strong machines and two older machines that I would like to use. So far I've been using only the two strong machines with Hadoop. Now I would like to add the two less powerful machines to do some processing as well.

My question is: right now HDFS is shared between the two powerful computers. I don't want the two other computers to store any content, as they have slow and unreliable hard disks. I just want the two other machines to do processing (i.e. MapReduce) and not store any content. Is that possible, or do I have to run HDFS on all machines that do processing? If it is possible to use a machine only for MapReduce - how is this done? Thank you for your help, Eran
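[For what it's worth: in Hadoop the DataNode (storage) and TaskTracker (compute) are separate daemons, so one possible setup - sketched here under the assumption that the cluster is managed with the stock start-up scripts - is to list only the two strong machines in the slaves file used by start-dfs.sh, and start just a TaskTracker by hand on the older boxes:]

```shell
# On each compute-only machine (do NOT list it in the slaves
# file that start-dfs.sh reads, so no DataNode runs there).
# hadoop-site.xml must point at the existing namenode/jobtracker.
# Then start only the TaskTracker daemon:
bin/hadoop-daemon.sh start tasktracker
```

One caveat worth knowing: tasks on a machine without a local DataNode lose data locality, so every map input is read over the network, which may eat part of the gain from the extra CPUs.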
Issue with scoring and new webcolumns with latest nutchbase
Hi all, I recently downloaded the latest nutchbase... only to find that many things have changed for my plugins, notably the addition of HBase columns. Not really difficult to change, although plenty of plugins don't work now. Still, I face a problem with the new scoring plugin, because the afterParsingScoring that existed before is now gone... so my application will not work any more. Is this planned to change, or do I need to revert to an older revision? -- -MilleBii-
Re: Issue with scoring and new webcolumns with latest nutchbase
Moreover, the class Crawl does not exist after building, so you cannot run nutch crawl... I'm going to revert to the reference code, since this one does not work. 2009/11/11 MilleBii mille...@gmail.com Hi all, I recently downloaded the latest nutchbase... only to find that many things have changed for my plugins, notably the addition of HBase columns. Not really difficult to change, although plenty of plugins don't work now. Still, I face a problem with the new scoring plugin, because the afterParsingScoring that existed before is now gone... so my application will not work any more. Is this planned to change, or do I need to revert to an older revision? -- -MilleBii- -- -MilleBii-
Re: How do I block/ban a specific domain name or a tld?
Hello, Thanks for the reply, but this doesn't seem to work either. I removed the crawl dir, added the regex you posted, removed the one I had in regex-urlfilter.txt and crawl-urlfilter.txt, and restarted the crawl. My crawls still spend about 90% of their time on who.int... I have no idea how to keep this domain, or all .int domains, from being crawled. Do I have the regex in the wrong conf file? Thanks, -Warren

reinhard schwab wrote: opsec schrieb: I've added this to my conf/crawl-urlfilter.txt and conf/regex-urlfilter.txt, yet when I start a crawl this domain is heavily spidered. I would like to remove it from my search results entirely and prevent it from being crawled in the future, and possibly all *.int TLDs - how can I accomplish this? -^http://([a-z0-9]*\.)*who.int/ why not -^http://[^/]*\.int/ Thanks for your time and any assistance, -Warren
Re: How do I block/ban a specific domain name or a tld?
Hello, the first matching rule wins. Maybe you have an earlier rule that matches first. Can you send me your filter files by private mail? regards, reinhard

opsec schrieb: Hello, Thanks for the reply, but this doesn't seem to work either. I removed the crawl dir, added the regex you posted, removed the one I had in regex-urlfilter.txt and crawl-urlfilter.txt, and restarted the crawl. My crawls spend about 90% of their time on who.int... I have no idea how to keep this domain, or all .int domains, from being crawled. Do I have the regex in the wrong conf file? Thanks, -Warren reinhard schwab wrote: opsec schrieb: I've added this to my conf/crawl-urlfilter.txt and conf/regex-urlfilter.txt, yet when I start a crawl this domain is heavily spidered. I would like to remove it from my search results entirely and prevent it from being crawled in the future, and possibly all *.int TLDs - how can I accomplish this? -^http://([a-z0-9]*\.)*who.int/ why not -^http://[^/]*\.int/ Thanks for your time and any assistance, -Warren
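[To make the ordering point concrete, here is a sketch of how the relevant part of regex-urlfilter.txt could look - the .int rule is the one from this thread; the surrounding rules are the stock defaults, quoted from memory, so check them against your own file:]

```
# skip file:, ftp:, and mailto: urls (stock default rule)
-^(file|ftp|mailto):

# the exclusion has to come BEFORE any accept rule that could
# match the same URL, because the first matching rule wins
-^http://[^/]*\.int/

# the stock catch-all accept rule; if this appeared above the
# exclusion, .int URLs would be accepted before ever reaching it
+.
```

The same ordering applies in crawl-urlfilter.txt when using the crawl command, since that command reads its own filter file.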
Problems with Hadoop source
Hej, I am developing a project based on Nutch. It works great (in Eclipse), but due to new requirements I have to replace the library hadoop-0.12.2-core.jar with the original source code. I successfully downloaded that code from http://archive.apache.org/dist/hadoop/core/hadoop-0.12.2/hadoop-0.12.2.tar.gz. After adding it to the project in Eclipse everything seems correct, but the execution shows:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: file
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:157)
        at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:119)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:91)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:103)

Any idea? Thanks
Re: Problems with Hadoop source
Pablo Aragón wrote: Hej, I am developing a project based on Nutch. It works great (in Eclipse), but due to new requirements I have to replace the library hadoop-0.12.2-core.jar with the original source code. I successfully downloaded that code from http://archive.apache.org/dist/hadoop/core/hadoop-0.12.2/hadoop-0.12.2.tar.gz. After adding it to the project in Eclipse everything seems correct, but the execution shows:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: file
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:157)
        at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:119)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:91)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:103)

Any idea?

Yes - when you worked with the pre-built jar, it contained an embedded hadoop-default.xml that defines the implementation of the file:// scheme FileSystem. Now that you build from source, you probably forgot to put hadoop-default.xml on your classpath. Go to Build Path and add this file to your classpath, and all should be OK.

-- Best regards, Andrzej Bialecki - Information Retrieval, Semantic Web, Embedded Unix, System Integration - http://www.sigram.com - Contact: info at sigram dot com
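[In case it helps anyone hitting the same trace outside Eclipse: the point is that Hadoop's Configuration class loads hadoop-default.xml (and hadoop-site.xml) as classpath resources, so a conf directory containing them must be on the classpath ahead of the class files. A hedged sketch of a manual launch - the paths here are just examples for a source build:]

```shell
# conf/ must contain hadoop-default.xml; putting the directory
# first on the classpath lets Configuration find it as a resource
java -cp "conf:build/classes:lib/*" \
    org.apache.nutch.crawl.Crawl urls -dir crawled -depth 3
```

In Eclipse the equivalent is exactly what is described above: add the conf directory (or the file itself) via Build Path so it ends up on the runtime classpath.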
Re: Nutch/Solr question
Solr is just a search and indexing server; it doesn't do crawling. Nutch does the crawling and page parsing, and can index into Lucene or into a Solr server. Nutch is a biggish beast, and if you just need to index a site or even a small set of them, you may have an easier time with Droids. Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

- Original Message From: Bartosz Gadzimski bartek...@o2.pl To: nutch-user@lucene.apache.org Sent: Wed, November 4, 2009 10:41:14 AM Subject: Nutch/Solr question

Hi, I want to build site search for a few of my (and friends') websites, but without access to the database data. So after crawling with Nutch I have two options: 1. index the data into Solr, or 2. leave it in the Nutch index. I need help in finding the advantages/disadvantages of Solr vs. Nutch searching, because I don't know Solr (it's hard to get the big picture). Each site is quite small, so Solr can hold it with no problems. In Solr I probably can't use faceted search or range queries etc., because I don't have the necessary data in the schema? In Nutch I can have one search server and use site:domain to limit results (like Google site search), or use multiple indexes (mentioned on the mailing list) - but what about Solr? Any input highly appreciated. Thanks, Bartosz
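[To make the "index into Solr" option concrete: Nutch 1.0 and later ship a solrindex command that pushes an existing crawl into a running Solr instance. A sketch, assuming a local Solr on the default port and the usual crawl directory layout produced by the crawl command:]

```shell
# push the crawled/parsed data into Solr;
# arguments are: <solr url> <crawldb> <linkdb> <segment...>
bin/nutch solrindex http://localhost:8983/solr \
    crawled/crawldb crawled/linkdb crawled/segments/*
```

After that, queries (including site:domain style filtering, given a suitable schema field) are served by Solr rather than by the Nutch search webapp.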
Stopping at depth=0 - no more URLs to fetch
Hi all, I have been trying to run a crawl on a couple of different domains using Nutch: bin/nutch crawl urls -dir crawled -depth 3. Every time I get the response: Stopping at depth=x - no more URLs to fetch. Sometimes a page or two at the first level get crawled, and in most other cases nothing gets crawled at all. I don't know if I have made a mistake in the crawl-urlfilter.txt file. Here is how it looks for me:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*blogspot.com/

(all other sections in the file have the default values) My urllist.txt file has only one URL: http://gmailblog.blogspot.com The only website where the crawl seems to work properly is http://lucene.apache.org Any suggestions are appreciated.
Nutch does not crawl pages starting with ~
Hi, I have set up Nutch on a multi-node cluster and the crawl is working fine. However, it seems that Nutch cannot crawl any pages containing '~'. I set up Nutch to crawl http://www.cs.umbc.edu . Within this website it did not crawl pages like:

- www.cs.umbc.edu/~varish1
- www.cs.umbc.edu/~relan1

and so on. Any idea how to fix this issue? Thanks, Regards, Varish Mulwad
re-fetch interval
Hi, I hope someone can give me an answer on this: why is Nutch re-fetching pages that were fetched and indexed already - over and over, regardless of the db.fetch properties? Here are my settings:

db.fetch.interval.default = 2592000
db.default.fetch.interval = 30
db.fetch.interval.max = 7776000

We are NOT specifying topN or adddays. I have tried all sorts of settings, but nothing seems to work. I have gone through the forums and tried all sorts of different suggestions, to no avail. I have looked through the code and can't see anything unusual; I would appreciate any suggestions on this. What am I missing? Thanks.
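[A note that may be relevant to the settings quoted above: db.default.fetch.interval is the older name for this property and is specified in days, while db.fetch.interval.default is the newer name and is specified in seconds, so setting both (30 days vs. 2592000 seconds happen to agree here, but they need not) is an easy source of confusion. A sketch of a 30-day override in nutch-site.xml, assuming a version that reads the seconds-based property:]

```xml
<property>
  <name>db.fetch.interval.default</name>
  <!-- 30 days, expressed in seconds -->
  <value>2592000</value>
</property>
```

Overrides belong in nutch-site.xml (not nutch-default.xml), and the relevant jobs need to be re-run with the new configuration before the change takes effect.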
Re: Nutch does not crawl pages starting with ~
Maybe try using '%7e' instead of '~'? For example: www.cs.umbc.edu/%7evarish1
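[A side note on the '%7e' suggestion: per RFC 3986 the tilde is an unreserved character, so '%7e' is simply its percent-encoded spelling and both forms name the same path. This is easy to check with a quick script - Python here, purely for illustration:]

```python
from urllib.parse import unquote

# '%7e' is the percent-encoded form of '~', so decoding the
# encoded path yields the plain-tilde path exactly
path_encoded = "www.cs.umbc.edu/%7evarish1"
path_plain = "www.cs.umbc.edu/~varish1"
print(unquote(path_encoded) == path_plain)  # True
```

So if one spelling is fetched and the other is not, the difference is in how the crawler's filters or normalizers treat the two forms, not in the pages themselves.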
Re: Stopping at depth=0 - no more URLs to fetch
Any other rules in your filter that precede that one? (+^http://([a-z0-9]*\.)*blogspot.com/)
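[One way to debug this kind of problem offline is to run the filter pattern against the seed URL directly; a quick sketch in Python, with the pattern copied from the filter quoted in this thread. It also shows one easy pitfall: the pattern only matches when the URL ends with a trailing slash, and the seed list in the original message has http://gmailblog.blogspot.com without one (whether Nutch's URL normalizer adds the slash before filtering depends on the version, so it is worth checking the seed list):]

```python
import re

# the accept rule from crawl-urlfilter.txt, minus the leading '+'
pattern = re.compile(r"^http://([a-z0-9]*\.)*blogspot.com/")

print(bool(pattern.match("http://gmailblog.blogspot.com/")))  # True
print(bool(pattern.match("http://gmailblog.blogspot.com")))   # False: no trailing slash
```

If the seed passes this check, the next thing to inspect is the rules above it in the file, since the first matching rule wins.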