Re: Nutch Hadoop question

2009-11-13 Thread Eran Zinman
Hi All,

Don't want to bother you guys too much... I've tried searching for this
topic and doing some testing myself, but so far I've been quite unsuccessful.

Basically - I wish to use some computers only for map-reduce processing and
not for HDFS. Does anyone know how this can be done?

Thanks,
Eran

On Wed, Nov 11, 2009 at 12:19 PM, Eran Zinman zze...@gmail.com wrote:

 Hi All,

 I'm using Nutch with Hadoop with great pleasure - it works great and really
 increases crawling performance across multiple machines.

 I have two strong machines and two older machines which I would like to
 use.

 So far I've been using only the two strong machines with Hadoop.

 Now I would like to add the two less powerful machines to do some
 processing as well.

 My question is - right now HDFS is shared between the two powerful
 computers. I don't want the two other computers to store any content,
 as they have slow and unreliable hard disks. I just want the two other
 machines to do processing (i.e. MapReduce) and not store any content on
 them.

 Is that possible - or do I have to use HDFS on all machines that do
 processing?

 If it's possible to use a machine only for MapReduce - how is this done?

 Thank you for your help,
 Eran



Re: Nutch Hadoop question

2009-11-13 Thread TuxRacer69

Hi Eran,

MapReduce has to store its data on the HDFS filesystem.
But if you want to separate the two groups of servers, you could build
two separate HDFS filesystems. To separate the two setups, you will need
to make sure there is no cross-communication between the two parts.
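
As a rough sketch only (hostnames below are made up, and this assumes the
stock Hadoop 0.19/0.20 scripts that ship with Nutch 1.0): give each group its
own HADOOP_CONF_DIR, list only that group's machines in its conf/slaves file,
and point fs.default.name / mapred.job.tracker in each hadoop-site.xml at that
group's own master, so the daemons of one group never register with the other:

  strong-cluster/conf/slaves:
    strong-node1
    strong-node2

  old-cluster/conf/slaves:
    old-node1
    old-node2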


Cheers,
Alex


Re: Nutch Hadoop question

2009-11-13 Thread Andrzej Bialecki

TuxRacer69 wrote:

Hi Eran,

MapReduce has to store its data on the HDFS filesystem.


More specifically, it needs read/write access to a shared filesystem. If 
you are brave enough you can use NFS, too, or any other type of 
filesystem that can be mounted locally on each node (e.g. a NetApp).
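
As an untested sketch of that option (the /mnt/shared mount point is a made-up
example): if every node mounts the same NFS export, hadoop-site.xml could point
the default filesystem at the plain local FS and keep the MapReduce system
directory on the shared mount, e.g.:

  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
    <!-- plain local filesystem; the NFS mount is what makes it shared -->
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/mnt/shared/hadoop/mapred/system</value>
    <!-- /mnt/shared is a hypothetical NFS mount present on every node -->
  </property>

The job input/output paths (e.g. the Nutch crawl directories) would then also
have to live under the shared mount.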


But if you want to separate the two groups of servers, you could build
two separate HDFS filesystems. To separate the two setups, you will need
to make sure there is no cross-communication between the two parts.


You can run two separate clusters even on the same set of machines, just
configure them to use different ports AND different local paths.
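
For illustration, a hedged sketch of how the second cluster's hadoop-site.xml
could differ from the first one ("master" is a placeholder hostname, and the
ports and paths are arbitrary examples, not required values):

  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9100</value>
    <!-- a namenode port different from the first cluster's -->
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9101</value>
    <!-- a jobtracker port different from the first cluster's -->
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/cluster2/tmp</value>
    <!-- separate local paths so the two clusters never share state -->
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/cluster2/dfs/data</value>
  </property>

The web UI and datanode/tasktracker ports (the various *.http.address and
related settings) would also have to be moved so the two clusters don't clash.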


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Synonym Filter with Nutch

2009-11-13 Thread Andrzej Bialecki

Dharan Althuru wrote:

Hi,


We are trying to incorporate a synonym filter during indexing with Nutch. As
per my understanding, Nutch doesn't have a synonym indexing plug-in by default.
Can we extend IndexFilter in Nutch to incorporate the synonym filter plug-in
available in Lucene (using WordNet or a custom synonym plug-in) without any
negative impact on existing Nutch indexing (e.g., considering bigrams etc.)?


Synonym expansion should be done when the text is analyzed (using 
Analyzers), so you can reuse Lucene's synonym filter.


Unfortunately, this happens at different stages depending on whether you 
use the built-in Lucene indexer, or the Solr indexer.


If you use the Lucene indexer, this happens in LuceneWriter, and the 
only way to affect it is to implement an analysis plugin, so that it's 
returned from AnalyzerFactory, and to use your analysis plugin instead of 
the default one. See e.g. analysis-fr for an example of how to implement 
such a plugin.
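
As a reminder of the wiring only (not a tested config - "analysis-synonym" is a
hypothetical plugin id, and the plugin.includes value below is an abbreviated
typical set, not the exact default): whatever analysis plugin you write also
has to be activated in nutch-site.xml, e.g.:

  <property>
    <name>plugin.includes</name>
    <!-- keep your existing value and add the new (hypothetical) analysis plugin -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-synonym|index-basic|query-(basic|site|url)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

Note that AnalyzerFactory picks analyzers per document language, so check how
an existing plugin such as analysis-fr declares itself in its plugin.xml.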


However, when you index to Solr, you need to configure Solr's 
analysis chain instead, i.e. in your schema.xml you need to define a 
fieldType that has the synonym filter in its indexing analysis chain.
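
For example, a minimal sketch of such a fieldType (the type name, tokenizer
choice and synonyms.txt location are illustrative, not required values):

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With expand="true" the synonyms are expanded at index time, so the query-side
analyzer does not need the filter again.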


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



How to configure Nutch to crawl in parallel

2009-11-13 Thread xiao yang
Hi, All

I'm using Nutch-1.0 on a 12-node cluster, and configured
conf/hadoop-site.xml as follows:
  ...
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>20</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>20</value>
  </property>
  ...
but the Running Jobs section on the page
http://cluster0:50030/jobtracker.jsp never has more than one item.

Thanks!
Xiao


can't deploy nutch-1.0.war ???

2009-11-13 Thread MilleBii
I'm stuck and not able to deploy nutch-1.0.war.

I get the following error in catalina.log:

Exception when processing TLD indicated by the ressource path
 /WEB-INF/taglibs-i18n.tld in the context /nutch-1.0


What could it be? The taglib is there, and the *.properties files are there.
ANY HELP on where to look is very welcome.

-- 
-MilleBii-


Re: How to configure Nutch to crawl in parallel

2009-11-13 Thread Otis Gospodnetic
I don't recall off the top of my head what that jobtracker.jsp shows, but 
judging by the name, it shows your job. Each job is composed of multiple map and 
reduce tasks. Drill into your job and you should see multiple tasks running.
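
If the goal is more parallelism inside that single job rather than more jobs:
the two ...tasktracker...maximum settings only cap the task slots per node; how
many tasks a job actually gets is configured per job. A hedged sketch for the
same hadoop-site.xml (the values are arbitrary examples):

  <property>
    <name>mapred.map.tasks</name>
    <value>24</value>
    <!-- only a hint; the real number of maps follows the input splits -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>24</value>
    <!-- this one is honored as given -->
  </property>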

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR






Re: Nutch Hadoop question

2009-11-13 Thread Eran Zinman
Thanks for the help guys.
