Nutch Hadoop question

2009-11-11 Thread Eran Zinman
Hi All,

I'm using Nutch with Hadoop with great pleasure - it works great and really
increases crawling performance across multiple machines.

I have two strong machines and two older machines which I would like to use.

So far I've been using only the two strong machines with Hadoop.

Now I would like to add the two less powerful machines to do some processing
as well.

My question is: right now HDFS is shared between the two powerful
computers. I don't want the two other computers to store any content,
as they have slow and unreliable hard disks. I just want the two other
machines to do processing (i.e. MapReduce) and not store any content on
them.

Is that possible - or do I have to use HDFS on all machines that do
processing?

If it's possible to use a machine only for MapReduce - how is this done?

Thank you for your help,
Eran
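
For context, a minimal sketch of the kind of setup being asked about, assuming a Hadoop 0.19/0.20-style installation where daemons can be started individually (this is only one common approach, not something taken from the thread itself):

# On each of the two weaker, compute-only nodes, with the same Hadoop install
# and conf/ (fs.default.name and mapred.job.tracker pointing at the masters),
# start only a TaskTracker and never a DataNode:
bin/hadoop-daemon.sh start tasktracker

# The two strong machines keep running both daemons as before:
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker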


Issue with scoring and new web columns with latest nutchbase

2009-11-11 Thread MilleBii
Hi all ,

I recently downloaded the latest nutchbase ... just finding out that many
things have changed for my plugins, notably the addition of HBase columns.
Not really difficult to change, although plenty of plugins now don't work.

Still, I face a problem with the new scoring plug-in: the afterParsingScoring
that existed before is now gone ... so my application would not work any more.

Is it planned to be changed, or do I need to revert to an older revision?

-- 
-MilleBii-


Re: Issue with scoring and new web columns with latest nutchbase

2009-11-11 Thread MilleBii
Moreover, the Crawl class does not exist after building, so you cannot run
nutch crawl... I'm going to revert to the reference code, since this one does
not work.

2009/11/11 MilleBii mille...@gmail.com

 Hi all ,

 I recently dowloaded the latest nutchbase ... just finding out that many
 things have changed for my plugins notably the addition on hhbase columns.
 Not really difficult to change although plenty of plugins now don't work.

 Still I face a problem with the new scoring plug-in because before now the
 afterParsingScoring is gone ... so my application would not work any more.

 Is it planned to be changed or do I need to revert to an older reference.

 --
 -MilleBii-




-- 
-MilleBii-


Re: How do I block/ban a specific domain name or a tld?

2009-11-11 Thread opsec

Hello,

 Thanks for the reply, but this doesn't seem to work either. I removed the
crawl dir, added the regex you posted, removed the one I had in
regex-urlfilter.txt and crawl-urlfilter.txt, and restarted the crawl. My
crawls spend about 90% of their time on who.int... I have no idea how to
keep this domain, or all .int domains, from being crawled. Do I have the
regex in the wrong conf file?

Thanks, 

-Warren

reinhard schwab wrote:
 
 opsec schrieb:
 I've added this to my conf/crawl-urlfilter.txt and
 conf/regex-urlfilter.txt
 yet when I start a crawl this domain is heavily spidered. I would like to
 remove it from my search results entirely and prevent it from being crawled
 in the future, and possibly all *.int TLDs as well. How can I accomplish this?

 -^http://([a-z0-9]*\.)*who.int/
   
 why not
 
 -^http://[^/]*\.int/
 
 
 
 Thanks for your time and any assistance, 

 -Warren
   
 
 
 




Re: How do I block/ban a specific domain name or a tld?

2009-11-11 Thread reinhard schwab
hello,

The first matching rule wins; maybe you have an earlier rule that matches.
Can you send me your filter files by private mail?

regards
reinhard
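
To illustrate the ordering point, a hypothetical regex-urlfilter.txt fragment (not taken from the poster's actual files):

# Broken ordering: the catch-all accept rule comes first, so the .int rule is never reached.
+.
-^http://[^/]*\.int/

# Working ordering: the block rule appears before any rule that would accept those URLs.
-^http://[^/]*\.int/
+.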

opsec schrieb:
 Hello,

  Thanks for the reply, but this doesn't seem to work either. I removed the
 crawl dir, added the regex you posted, removed the one I had in
 regex-urlfilter.txt and crawl-urlfilter.txt and restarted the crawl. My
 crawls spend about 90% of their time on who.int .. I have no idea how to
 remove this domain or all .int domains from being crawled. Do I have the
 regex in the wrong conf file?

 Thanks, 

 -Warren

 reinhard schwab wrote:
   
 opsec schrieb:
 
 I've added this to my conf/crawl-urlfilter.txt and
 conf/regex-urlfilter.txt
 yet when I start a crawl this domain is heavily spidered. I would like to
 remove it from my search results entirely and prevent it from being
 crawled
 in the future and possibly all *.int tlds, how can i accomplish this?

 -^http://([a-z0-9]*\.)*who.int/
   
   
 why not

 -^http://[^/]*\.int/



 
 Thanks for your time and any assistance, 

 -Warren
   
   

 

   



Problems with Hadoop source

2009-11-11 Thread Pablo Aragón

Hej,

I am developing a project based on Nutch. It works great (in Eclipse), but
due to new requirements I have to replace the hadoop-0.12.2-core.jar library
with the original source code.

I successfully downloaded that code from:
http://archive.apache.org/dist/hadoop/core/hadoop-0.12.2/hadoop-0.12.2.tar.gz

After adding it to the project in Eclipse everything seems correct, but the
execution shows:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: file
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:157)
at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:119)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:91)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:103)

Any idea?

Thanks





Re: Problems with Hadoop source

2009-11-11 Thread Andrzej Bialecki

Pablo Aragón wrote:

Hej,

I am developing a project based on Nutch. It works great (in Eclipse) but
due to new requirements I have to change the library hadoop-0.12.2-core.jar
to the original source code.

I download succesfully that code in:
http://archive.apache.org/dist/hadoop/core/hadoop-0.12.2/hadoop-0.12.2.tar.gz. 


After adding it to the project in Eclipse everything seems correct but the
execution shows:

Exception in thread main java.io.IOException: No FileSystem for scheme:
file
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:157)
at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:119)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:91)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:103)

Any idea?


Yes - when you worked with a pre-built jar, it contained an embedded 
hadoop-default.xml that defines the FileSystem implementation for the 
file:// scheme. Now you probably forgot to put hadoop-default.xml on 
your classpath. Go to Build Path and add this file to your classpath, 
and all should be OK.
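
For reference, the relevant entry in hadoop-default.xml looks roughly like this (quoted from memory of the 0.12.x defaults, so treat it as a sketch rather than the exact file contents):

<property>
  <name>fs.file.impl</name>
  <value>org.apache.hadoop.fs.LocalFileSystem</value>
  <description>The FileSystem implementation used for file: URIs.</description>
</property>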



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch/Solr question

2009-11-11 Thread Otis Gospodnetic
Solr is just a search and indexing server.  It doesn't do crawling.  Nutch does 
the crawling and page parsing, and can index into Lucene or into a Solr server.

Nutch is a biggish beast, and if you just need to index a site or even a small 
set of them, you may have an easier time with Droids.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 From: Bartosz Gadzimski bartek...@o2.pl
 To: nutch-user@lucene.apache.org
 Sent: Wed, November 4, 2009 10:41:14 AM
 Subject: Nutch/Solr question
 
 Hi,
 
 I want to build site search for a few of my (and my friends') websites, but 
 without access to the database data. So I'm crawling with Nutch, and then I have 2 options:
 1. index the data into Solr
 2. leave it with the Nutch index
 
 I need help weighing the advantages/disadvantages of Solr vs. Nutch searching, 
 because I don't know Solr (it's hard to get the big picture).
 
 Each site is quite small, so it can be held by Solr with no problems.
 In Solr I probably can't use faceted search or range queries etc., because I 
 don't have the necessary data in the schema?
 
 In Nutch I can have one search server and use site:domain to limit results (like 
 Google site search) or use multiple indexes (mentioned on the mailing list), but 
 what about Solr?
 
 Any input highly appreciated.
 
 Thanks,
 Bartosz



Stopping at depth=0 - no more URLs to fetch

2009-11-11 Thread kvorion

Hi all,

I have been trying to run a crawl on a couple of different domains using
nutch:

bin/nutch crawl urls -dir crawled -depth 3

 Every time I get the response
"Stopping at depth=x - no more URLs to fetch." Sometimes a page or two at the
first level get crawled, and in most other cases nothing gets crawled. I
don't know if I have made a mistake in the crawl-urlfilter.txt file.
Here is how it looks for me:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*blogspot.com/

(all other sections in the file have default values)

My urllist.txt file has only one url:
http://gmailblog.blogspot.com

The only website where the crawl seems to be working properly is
http://lucene.apache.org

Any suggestions are appreciated.
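
One way to check whether the seed URL actually passes the configured filters is the URL filter checker that ships with Nutch (a sketch; option names may differ slightly between versions):

# Prints the URL prefixed with '+' if the filters accept it, or '-' if they reject it.
echo "http://gmailblog.blogspot.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined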






Nutch does not crawl pages starting with ~

2009-11-11 Thread Varish Mulwad
Hi,

I have set up Nutch on a multi-node cluster and the crawl is working fine.
However, it seems that Nutch cannot crawl any pages with ~ in the URL.

I set up Nutch to crawl http://www.cs.umbc.edu . Within this website it did
not crawl pages like -


   - www.cs.umbc.edu/~varish1
   - www.cs.umbc.edu/~relan1


and so on. Any idea how to fix this issue?

Thanks,

Regards,
Varish Mulwad


re-fetch interval

2009-11-11 Thread fadzi
hi,

I hope someone can help me find an answer to this:

Why is Nutch re-fetching pages that we fetched and indexed already - over
and over, regardless of the db.fetch properties? Here are my settings:

db.fetch.interval.default = 2592000
db.default.fetch.interval = 30
db.fetch.interval.max = 7776000

we are NOT specifying topN or adddays

I have tried all sorts of settings but nothing seems to work. I have gone
through the forums and tried all sorts of different suggestions, to no
avail.

I have looked through the code and can't see anything unusual.

I would appreciate any suggestions on this. What am I missing?

thanks..
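
For what it's worth, a minimal sketch of how these intervals would normally be set in conf/nutch-site.xml, assuming a Nutch 1.0-style configuration where db.fetch.interval.default and db.fetch.interval.max are in seconds and the older db.default.fetch.interval name (in days) is kept only for backwards compatibility:

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>Default re-fetch interval in seconds (30 days).</description>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>Maximum re-fetch interval in seconds (90 days).</description>
</property>

It may also be worth confirming which of the two property names the running version actually reads, and that nothing in the crawl script overrides these values on the command line.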





Re: Nutch does not crawl pages starting with ~

2009-11-11 Thread John Whelan

Maybe try using '%7e' instead of '~'? For example: www.cs.umbc.edu/%7evarish1




Re: Stopping at depth=0 - no more URLs to fetch

2009-11-11 Thread John Whelan

Any other rules in your filter that precede that one?
(+^http://([a-z0-9]*\.)*blogspot.com/)