Re: distcp question

2012-10-12 Thread J. Rottinghuis
Rita,

Are you doing a push from the source cluster or a pull from the target
cluster?

Doing a pull with distcp using hftp (to accommodate version differences)
has the advantage of slightly fewer transfers of blocks over the top-of-rack
switches (TORs). Each block is read from exactly the datanode where it is
located, and on the target side (where the mappers run) the first write is to
the local datanode. With RF=3, each block transfers out of the source TOR,
into the target TOR, then out of the first target-cluster TOR into a different
target-cluster TOR for replicas 2 & 3. Overall: 2 times out, and 2 times in.

Doing a pull with webhdfs://, the proxy server has to collect all blocks
from the source DNs, and then they get pulled to the target machine. The
situation is similar to the above, with one extra transfer of all data
going through the proxy server.

Doing a push with webhdfs:// on the target cluster side, the mapper has to
collect all blocks from one or more files (depending on the # of mappers used)
and send them to the proxy server, which then writes the blocks to the target
cluster. The advantage on the target cluster is that the blocks of a large
multi-block file get spread over different datanodes on the target side.
But if I'm counting correctly, you'll have the most data transfer: out of
each source DN, through the source-cluster mapper DN, through the target proxy
server, to the target DN, and out/in again for replicas 2 & 3.

So convenience and setup aside, I think the first option would involve the
fewest network transfers.
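
Purely to illustrate the counting above, here is a small Python tally of the
per-block network legs for the three options (assuming RF=3 and rack-aware
replica placement on the target side; it just restates the reasoning above,
it is not a measurement):

# Illustrative tally of per-block copy legs for the three distcp options
# discussed above. Assumptions: RF=3 on the target, replica 1 written to
# the local datanode, replicas 2 & 3 pipelined to a different target rack.
options = {
    "pull with hftp://": [
        "source DN -> target mapper node (replica 1, local write)",
        "target DN -> off-rack target DN (replicas 2 & 3 pipeline)",
    ],
    "pull with webhdfs://": [
        "source DN -> proxy server",
        "proxy server -> target mapper node (replica 1, local write)",
        "target DN -> off-rack target DN (replicas 2 & 3 pipeline)",
    ],
    "push with webhdfs://": [
        "source DN -> source-cluster mapper DN",
        "source mapper DN -> target proxy server",
        "proxy server -> target DN (replica 1)",
        "target DN -> off-rack target DN (replicas 2 & 3 pipeline)",
    ],
}
for name, legs in options.items():
    print("%s: %d network legs per block" % (name, len(legs)))
    for leg in legs:
        print("  - " + leg)
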
Now if your clusters are separated over a WAN, then this may not matter at
all.

Just something to think about.

Cheers,

Joep


On Fri, Oct 12, 2012 at 8:37 AM, Harsh J ha...@cloudera.com wrote:

 Rita,

 I believe, per the implementation, that webhdfs:// URIs should work
 fine. Please give it a try and let us know.

 On Fri, Oct 12, 2012 at 7:14 PM, Rita rmorgan...@gmail.com wrote:
  I have 2 different versions of Hadoop running. I need to copy a significant
  amount of data (100 TB) from one cluster to another. I know distcp is the
  way to do it. On the target cluster I have webhdfs running. Would that work?
 
  The DistCp manual says I need to use HftpFileSystem. Is that necessary,
  or will webhdfs do the task?
 
 
 
  --
  --- Get your facts first, then you can distort them as you please.--



 --
 Harsh J



Re: Which hardware to choose

2012-10-03 Thread J. Rottinghuis
Of course it all depends...
But something like this could work:

Leave 1-2 GB for the kernel, pagecache, tools, overhead, etc.
Plan 3-4 GB each for the Datanode and Tasktracker daemons.

Plan 2.5-3 GB per slot. Depending on the kinds of jobs, you may need more
or less memory per slot.
Have 2-3 times as many mappers as reducers (depending on the kinds of jobs
you run).

As Michael pointed out, the ratio of cores (hyperthreads) to disks matters.

With those initial rules of thumb you'd arrive somewhere between
10 mappers + 5 reducers
and
9 mappers + 4 reducers
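
For what it's worth, the arithmetic behind those numbers works out roughly
like this (a small, illustrative Python sketch using the midpoints of the
ranges above and a 48 GB node like the one in the specs quoted below; the
exact values are assumptions, not recommendations):

# Back-of-the-envelope slot budget for a 48 GB data node, using midpoints
# of the rules of thumb above. Illustrative only.
total_ram_gb = 48.0
kernel_gb = 1.5   # kernel, pagecache, tools, overhead (1-2 GB)
daemon_gb = 3.5   # per daemon: Datanode and Tasktracker (3-4 GB each)

for per_slot_gb in (2.5, 3.0):
    available = total_ram_gb - kernel_gb - 2 * daemon_gb
    slots = int(available // per_slot_gb)
    reducers = slots // 3          # roughly 2-3 mappers per reducer
    mappers = slots - reducers
    print("%.1f GB/slot -> %d mappers + %d reducers"
          % (per_slot_gb, mappers, reducers))

With those inputs this prints 10 mappers + 5 reducers and 9 mappers + 4
reducers, i.e. the same range as above.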

Try, test, measure, adjust, rinse, repeat.

Cheers,

Joep

On Tue, Oct 2, 2012 at 8:42 PM, Alexander Pivovarov apivova...@gmail.com wrote:

 All configs are per node.
 No HBase, only Hive and Pig installed

 On Tue, Oct 2, 2012 at 9:40 PM, Michael Segel michael_se...@hotmail.com
 wrote:

  I think he's saying that it's 24 mappers / 8 reducers per node, and at 48 GB
  that could be too many mappers.
  Especially if they want to run HBase.
 
  On Oct 2, 2012, at 8:14 PM, hadoopman hadoop...@gmail.com wrote:
 
   Only 24 map and 8 reduce tasks for 38 data nodes? Are you sure that's
  right? That sounds VERY low for a cluster that size.
  
   We have only 10 C2100s and are running, I believe, 140 map and 70 reduce
  slots so far, with pretty decent performance.
  
  
  
   On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:
   38 data nodes + 2 Name Nodes
 
   Data Node:
   Dell PowerEdge C2100 series
   2 x XEON x5670
   48 GB RAM ECC  (12x4GB 1333MHz)
   12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
   Intel Gigabit ET Dual port PCIe x4
   Redundant Power Supply
   Hadoop CDH3
   max map tasks 24
   max reduce tasks 8
  
  
 
 



Re: Small question

2012-10-03 Thread J. Rottinghuis
moved common-user@hadoop.apache.org to bcc and added u...@pig.apache.org

Best asked on the Pig users list.

Cheers,

Joep

On Wed, Oct 3, 2012 at 7:04 AM, Abhishek abhishek.dod...@gmail.com wrote:

 Hi all,

 How can the Hive query below be written in Pig Latin?

 select t2.col1, t3.col2
 from table2 t2
 join table3 t3
 WHERE t3.col2 IS NOT NULL
 AND t2.col1 LIKE CONCAT(CONCAT('%',t3.col2),'%')

 Regards
 Abhi





Re: Does hadoop installations need to be at same locations in cluster ?

2011-12-23 Thread J. Rottinghuis
Agreed that using different locations is not a good idea.
However, the question was: can it be done? Yes, with some hacking, I suppose.
Do I recommend hacking? No.

But if you cannot help yourself: to have the datanode data in a different
location per slave, create an hdfs-site.xml per node (enjoy).
For the hadoop installation itself it is a bit more tricky.
Look at bin/hadoop-daemons.sh. It finds the location where it is running
from and assumes that the slaves have it in the same location.
For further hackery and confusion, look at the HADOOP_SSH_OPTS environment
variable set in hadoop-env.sh. Note that passing HADOOP_CONF_DIR requires
support from the server. The ssh daemon may not accept client-side SendEnv
(to avoid LD_* types of environment variables, as these open a security hole).
See the settings in /etc/sshd_config on the slaves.
Alternatively, you can have a symlink on each slave in the same location as
on the master, pointing to your different location.
Finally, you may be able to start the hadoop daemons by hand.

Have the correct amount of fun!

Joep

On Fri, Dec 23, 2011 at 9:55 AM, Michael Segel michael_se...@hotmail.com wrote:


 Ok,

 Here's the thing...

 1) When building the cluster, you want to be consistent.
 2) Location of $HADOOP_HOME is configurable. So you can place it anywhere.

 Putting the software in two different locations isn't a good idea because
 you now have to set it up with a unique configuration per node.

 It would be faster and make your life a lot easier to put the software
 in the same location on *all* machines.
 So my suggestion would be to bite the bullet and rebuild your cluster.

 HTH

 -Mike


  Date: Fri, 23 Dec 2011 19:47:45 +0530
  Subject: Re: Does hadoop installations need to be at same locations in
 cluster ?
  From: praveen...@gmail.com
  To: common-user@hadoop.apache.org
 
  What I mean to say is: does Hadoop internally assume that the
  installations on all nodes need to be in the same location?
  I had hadoop installed in a different location on each of 2 different nodes.
  I configured the hadoop config files so they are part of the same cluster.
  But when I started hadoop on the master, I saw it was also searching for
  the hadoop start scripts in the same location as on the master.
  Is there any workaround for this kind of situation, or do I have to
  reinstall hadoop in the same location as on the master?
 
  Thanks,
  Praveenesh
 
  On Fri, Dec 23, 2011 at 6:26 PM, Michael Segel
  michael_se...@hotmail.com wrote:
   Sure,
   You could do that, but in doing so, you will make your life a living
 hell.
   Literally.
  
   Think about it... You will have to manually manage each node's config
  files...
  
   So if something goes wrong you will have a hard time diagnosing the
 issue.
  
   Why make life harder?
  
   Why not just do the simple thing and make all of your DNs the same?
  
   Sent from my iPhone
  
   On Dec 23, 2011, at 6:51 AM, praveenesh kumar praveen...@gmail.com
 wrote:
  
   When installing hadoop on slave machines, do we have to install hadoop
   in the same location on each machine?
   Can we have the hadoop installation in different locations on different
   machines in the same cluster?
   If yes, what things do we have to take care of in that case?
  
   Thanks,
   Praveenesh




Re: Hadoop and hardware

2011-12-16 Thread J. Rottinghuis
Pierre,

As discussed in recent other threads, it depends.
The most sensible thing for Hadoop nodes is to find a sweet spot for
price/performance.
In general that will mean keeping a balance between compute power, disks,
and network bandwidth, and factoring in racks, space, operating costs, etc.

How much storage capacity are you thinking of when you target about 120
data nodes?

If you had, for example, 60 quad-socket nodes with 12 * 2 TB disks (or more)
each, I would suspect you would be bottlenecked on your 1 Gb/s network
connections.
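
A rough calculation shows why (illustrative Python; the ~100 MB/s per-disk
sequential throughput is an assumed ballpark, not a measured figure):

# Why 12 x 2 TB disks per node can swamp a single 1 Gb/s link.
disks_per_node = 12
disk_mb_per_s = 100        # assumed sequential throughput per 7200 RPM SATA disk
nic_mb_per_s = 1000 / 8.0  # 1 Gb/s Ethernet is roughly 125 MB/s

aggregate_disk = disks_per_node * disk_mb_per_s
print("disk: ~%d MB/s vs network: ~%d MB/s (about %dx gap)"
      % (aggregate_disk, nic_mb_per_s, aggregate_disk / nic_mb_per_s))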

Another thing to consider is how many nodes per rack. If these 60 nodes
were 2U and you'd fit 20 nodes in a rack, then losing one top-of-rack
switch means losing 1/3 of the capacity of your cluster.

Yet another consideration is how easily you want to be able to expand your
cluster incrementally. Until you run Hadoop 0.23 you probably want all your
nodes to be roughly similar in capacity.

Cheers,

Joep

On Fri, Dec 16, 2011 at 3:50 AM, Cussol pierre.cus...@cnes.fr wrote:



 In my company, we intend to set up a Hadoop cluster to run analytics
 applications. This cluster would have about 120 data nodes with dual-socket
 servers and a gigabit interconnect. We are also exploring a solution with 60
 quad-socket servers. How do quad-socket and dual-socket servers compare in a
 Hadoop cluster?

 any help ?

 pierre
 --
 View this message in context:
 http://old.nabble.com/Hadoop-and-hardware-tp32987374p32987374.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: mapreduce matrix multiplication on hadoop

2011-11-30 Thread J. Rottinghuis
The error is that you cannot open /tmp/MatrixMultiply/out/_logs.
Does the directory exist?
Do you have proper access rights set?

Joep

On Wed, Nov 30, 2011 at 3:23 AM, ChWaqas waqas...@gmail.com wrote:


 Hi, I am trying to run the matrix multiplication example mentioned (with
 source code) at the following link:

 http://www.norstad.org/matrix-multiply/index.html

 I have hadoop set up in pseudo-distributed mode, and I configured it using
 this tutorial:


 http://hadoop-tutorial.blogspot.com/2010/11/running-hadoop-in-pseudo-distributed.html?showComment=1321528406255#c3661776111033973764

 When I run my jar file, I get the following error:

 Identity test
 11/11/30 10:37:34 INFO input.FileInputFormat: Total input paths to process : 2
 11/11/30 10:37:34 INFO mapred.JobClient: Running job: job_20291041_0010
 11/11/30 10:37:35 INFO mapred.JobClient:  map 0% reduce 0%
 11/11/30 10:37:44 INFO mapred.JobClient:  map 100% reduce 0%
 11/11/30 10:37:56 INFO mapred.JobClient:  map 100% reduce 100%
 11/11/30 10:37:58 INFO mapred.JobClient: Job complete: job_20291041_0010
 11/11/30 10:37:58 INFO mapred.JobClient: Counters: 17
 11/11/30 10:37:58 INFO mapred.JobClient:   Job Counters
 11/11/30 10:37:58 INFO mapred.JobClient: Launched reduce tasks=1
 11/11/30 10:37:58 INFO mapred.JobClient: Launched map tasks=2
 11/11/30 10:37:58 INFO mapred.JobClient: Data-local map tasks=2
 11/11/30 10:37:58 INFO mapred.JobClient:   FileSystemCounters
 11/11/30 10:37:58 INFO mapred.JobClient: FILE_BYTES_READ=114
 11/11/30 10:37:58 INFO mapred.JobClient: HDFS_BYTES_READ=248
 11/11/30 10:37:58 INFO mapred.JobClient: FILE_BYTES_WRITTEN=298
 11/11/30 10:37:58 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=124
 11/11/30 10:37:58 INFO mapred.JobClient:   Map-Reduce Framework
 11/11/30 10:37:58 INFO mapred.JobClient: Reduce input groups=2
 11/11/30 10:37:58 INFO mapred.JobClient: Combine output records=0
 11/11/30 10:37:58 INFO mapred.JobClient: Map input records=4
 11/11/30 10:37:58 INFO mapred.JobClient: Reduce shuffle bytes=60
 11/11/30 10:37:58 INFO mapred.JobClient: Reduce output records=2
 11/11/30 10:37:58 INFO mapred.JobClient: Spilled Records=8
 11/11/30 10:37:58 INFO mapred.JobClient: Map output bytes=100
 11/11/30 10:37:58 INFO mapred.JobClient: Combine input records=0
 11/11/30 10:37:58 INFO mapred.JobClient: Map output records=4
 11/11/30 10:37:58 INFO mapred.JobClient: Reduce input records=4
 11/11/30 10:37:58 INFO input.FileInputFormat: Total input paths to process : 1
 11/11/30 10:37:59 INFO mapred.JobClient: Running job: job_20291041_0011
 11/11/30 10:38:00 INFO mapred.JobClient:  map 0% reduce 0%
 11/11/30 10:38:09 INFO mapred.JobClient:  map 100% reduce 0%
 11/11/30 10:38:21 INFO mapred.JobClient:  map 100% reduce 100%
 11/11/30 10:38:23 INFO mapred.JobClient: Job complete: job_20291041_0011
 11/11/30 10:38:23 INFO mapred.JobClient: Counters: 17
 11/11/30 10:38:23 INFO mapred.JobClient:   Job Counters
 11/11/30 10:38:23 INFO mapred.JobClient: Launched reduce tasks=1
 11/11/30 10:38:23 INFO mapred.JobClient: Launched map tasks=1
 11/11/30 10:38:23 INFO mapred.JobClient: Data-local map tasks=1
 11/11/30 10:38:23 INFO mapred.JobClient:   FileSystemCounters
 11/11/30 10:38:23 INFO mapred.JobClient: FILE_BYTES_READ=34
 11/11/30 10:38:23 INFO mapred.JobClient: HDFS_BYTES_READ=124
 11/11/30 10:38:23 INFO mapred.JobClient: FILE_BYTES_WRITTEN=100
 11/11/30 10:38:23 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=124
 11/11/30 10:38:23 INFO mapred.JobClient:   Map-Reduce Framework
 11/11/30 10:38:23 INFO mapred.JobClient: Reduce input groups=2
 11/11/30 10:38:23 INFO mapred.JobClient: Combine output records=2
 11/11/30 10:38:23 INFO mapred.JobClient: Map input records=2
 11/11/30 10:38:23 INFO mapred.JobClient: Reduce shuffle bytes=0
 11/11/30 10:38:23 INFO mapred.JobClient: Reduce output records=2
 11/11/30 10:38:23 INFO mapred.JobClient: Spilled Records=4
 11/11/30 10:38:23 INFO mapred.JobClient: Map output bytes=24
 11/11/30 10:38:23 INFO mapred.JobClient: Combine input records=2
 11/11/30 10:38:23 INFO mapred.JobClient: Map output records=2
 11/11/30 10:38:23 INFO mapred.JobClient: Reduce input records=2
 Exception in thread "main" java.io.IOException: Cannot open filename
 /tmp/MatrixMultiply/out/_logs
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1488)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:376)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
        at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
        at