Maximum number of files in hadoop

2008-06-07 Thread karthik raman
Hi,
   What is the maximum number of files that can be stored on HDFS? Is it
dependent on the namenode memory configuration? Also, does this impact the
performance of the namenode in any way?
thanks in advance
Karthik
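
As a rough illustration of why the answer is tied to namenode memory: each file, directory, and block is held as an object in the namenode's heap, and ~150 bytes per object is the commonly cited rule of thumb, so the practical ceiling scales with the heap given to the namenode (and a very large number of small files also adds namenode work). A back-of-the-envelope sketch, with all numbers illustrative rather than measured:

// Rule-of-thumb estimate only; real usage depends on path lengths,
// blocks per file, replication metadata, and the Hadoop version.
public class NamenodeCapacityEstimate {
    public static void main(String[] args) {
        long heapBytes = 1L * 1024 * 1024 * 1024; // assume a 1 GB namenode heap
        long bytesPerObject = 150;                // ~150 bytes per file/dir/block object
        long objectsPerFile = 2;                  // assume one file entry plus one block
        long approxMaxFiles = heapBytes / (bytesPerObject * objectsPerFile);
        // prints roughly 3.5 million under these assumptions
        System.out.println("Approximate file capacity: " + approxMaxFiles);
    }
}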



Couple of basic hdfs starter issues

2008-06-07 Thread chris collins
Sorry in advance if these challenges are covered in a document somewhere.

I have set up Hadoop on a 64-bit CentOS Linux box. I have verified that it is
up and running only by seeing the Java processes running and by being able to
access it from the admin UI.

The Hadoop version is 0.17.0, but I also tried 0.16.4 for the following issue:

From a Mac OS X box using Java 1.5, I am trying to run the following:

String home = "hdfs://linuxbox:9000";
URI uri = new URI(home);
Configuration conf = new Configuration();

FileSystem fs = FileSystem.get(uri, conf);

The call to FileSystem.get throws an IOException stating that there is a login
error, with the message "whoami".

When I single-step through the code, there is an attempt to figure out which
user is running the process by creating a ProcessBuilder that runs whoami. This
fails with a "not found" error. I believe this is because ProcessBuilder needs
a fully qualified path on the Mac?

I also verified that my hadoop-default.xml and hadoop-site.xml are in fact found
on the classpath.

All of this is being attempted via a debug session in the IntelliJ IDE.

Any ideas on what I am doing wrong? I am sure it is a configuration blunder on
my part.
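
For what it's worth, here is a minimal sketch of one possible workaround, assuming the client honors the hadoop.job.ugi property that the Unix user/group lookup used in the 0.16/0.17 timeframe; supplying the user up front should let the client skip the whoami shell-out. The user and group names below are placeholders:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Supply the user and group explicitly so the client does not have to
// shell out to whoami (values are placeholders, not a recommendation).
Configuration conf = new Configuration();
conf.set("hadoop.job.ugi", "hadoopuser,hadoopgroup");
FileSystem fs = FileSystem.get(new URI("hdfs://linuxbox:9000"), conf);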

Further, we used to use an old copy of Nutch. The Hadoop part of Nutch is now
its own jar file, of course, so I upgraded the Nutch jars too. We were using a
few things within the Nutch project that seem to have gone away:

the net.sf incarnation of the Snowball stemmer (I fixed this by pulling the
source directly from the author)
language identification: any idea where it went?
carrot2 clustering: any idea where that went?

Thanks in advance.

Chris


contrib EC2 with hadoop 0.17

2008-06-07 Thread Chris Anderson
First of all, thanks to whoever maintains the hadoop-ec2 scripts.
They've saved us untold time and frustration getting started with a
small testing cluster (5 instances).

A question: when we log into the newly created cluster and run jobs from the
example jar (pi, etc.), everything works great. We expect our custom jobs will
run just as smoothly.

However, when we restart the namenodes and tasktrackers by running
bin/stop-all.sh on the master, it tries to stop activity only on localhost.
Running start-all.sh then boots up a localhost-only cluster (on which jobs run
just fine).

The only way we've been able to recover from this situation is to use
bin/terminate-hadoop-cluster and bin/destroy-hadoop-cluster and then
start again from scratch with a new cluster.

There must be a simple way to restart the namenodes and jobtrackers
across all machines from the master. Also, I think understanding the
answer to this question might put a lot more into perspective for me,
so I can go on to do more advanced things on my own.

Thanks for any assistance / insight!

Chris


output from stop-all.sh
==

stopping jobtracker
localhost: Warning: Permanently added 'localhost' (RSA) to the list of
known hosts.
localhost: no tasktracker to stop
stopping namenode
localhost: no datanode to stop
localhost: no secondarynamenode to stop


conf files in /usr/local/hadoop-0.17.0
==

# cat conf/slaves
localhost
# cat conf/masters
localhost
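
For context, a likely explanation (hedged; not verified against the 0.17 EC2
images): bin/stop-all.sh and bin/start-all.sh simply ssh to every host listed
in conf/masters and conf/slaves, so with both files containing only localhost
they never reach the other instances. Listing the slaves' internal hostnames
on the master, for example:

# cat conf/slaves
domU-12-31-XX-XX-XX-01.compute-1.internal
domU-12-31-XX-XX-XX-02.compute-1.internal
...

(the hostnames above are placeholders) should let the stock scripts manage the
whole cluster, provided the master can ssh to each slave without a password.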




-- 
Chris Anderson
http://jchris.mfdz.com


RE: Couple of basic hdfs starter issues

2008-06-07 Thread chris collins
I should chalk this up to stupidity on my part (though the hidden shell
execution within the client, whose error gets masked, is somewhat fickle). Of
course, if I start the thing up from the command line instead of from the IDE,
it gets past this problem (there is then a security issue, but that one is
probably a more obvious thing).

Still, if anyone has an idea what happened to the language identification and
carrot2 stuff inside Nutch, that would be appreciated.

C





Re: contrib EC2 with hadoop 0.17

2008-06-07 Thread Chris K Wensel
The new scripts do not use the start/stop-all.sh scripts, and thus do not
maintain the slaves file. This is so that cluster startup is much faster and a
bit more reliable (keys do not need to be pushed to the slaves). Also, we can
grow the cluster lazily just by starting slave nodes. That is, the scripts are
mostly optimized for booting a large cluster fast, doing work, and then
shutting down (allowing for huge short-lived clusters, vs. a smaller/cheaper
long-lived one).


But it probably would be wise to provide scripts to build/refresh the slaves
file and push keys to the slaves, so the cluster can be maintained in the
traditional way instead of just re-instantiated with new parameters, etc.
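
As a rough sketch of what "traditional" maintenance could look like once
conf/slaves is populated and keys are pushed (the host list itself would still
have to be gathered from EC2, by hand or by such a refresh script):

# bin/slaves.sh uptime     (runs a command on every host listed in conf/slaves)
# bin/stop-all.sh
# bin/start-all.sh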


I wonder if these scripts would make sense in general, instead of being
EC2-specific?


ckw



Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/