Re: distcp question
Rita,

Are you doing a push from the source cluster or a pull from the target cluster?

Doing a pull with distcp using hftp (to accommodate version differences) has the advantage of slightly fewer block transfers over the top-of-rack (TOR) switches. Each block is read from exactly the datanode where it is located, and on the target side (where the mappers run) the first write goes to the local datanode. With RF=3, each block transfers out of the source TOR, into the target TOR, then out of the first target-cluster TOR into a different target-cluster TOR for replicas 2 and 3. Overall: two times out, and two times in.

Doing a pull with webhdfs://, the proxy server has to collect all blocks from the source datanodes, and then they get pulled to the target machine. The situation is similar to the above, with the one extra transfer of all data going through the proxy server.

Doing a push with webhdfs:// on the target cluster side, the mapper has to collect all blocks from one or more files (depending on the number of mappers used) and send them to the proxy server, which then writes blocks to the target cluster. The advantage on the target side is that the blocks of a large multi-block file get spread over different datanodes. But if I'm counting correctly, you'll have the most data transfer: out of each source datanode, through a source-cluster mapper node, through the target proxy server, to a target datanode, and out/in again for replicas 2 and 3.

So, convenience and setup aside, I think the first option involves the fewest network transfers (an example command follows the quoted thread below). Now, if your clusters are separated over a WAN, this may not matter at all. Just something to think about.

Cheers,

Joep

On Fri, Oct 12, 2012 at 8:37 AM, Harsh J ha...@cloudera.com wrote:

Rita, I believe, per the implementation, that webhdfs:// URIs should work fine. Please give it a try and let us know.

On Fri, Oct 12, 2012 at 7:14 PM, Rita rmorgan...@gmail.com wrote:

I have 2 different versions of Hadoop running. I need to copy a significant amount of data (100 TB) from one cluster to another. I know distcp is the way to do it. On the target cluster I have webhdfs running. Would that work? The DistCp manual says I need to use HftpFileSystem. Is that necessary, or will webhdfs do the task?

--
--- Get your facts first, then you can distort them as you please.

--
Harsh J
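For concreteness, a minimal sketch of that first option: a pull over hftp, run on the target cluster so the mappers write to their local datanodes. Host names and paths here are placeholders, and 50070 is just the usual default for the namenode HTTP port that hftp reads through:

    # Run on the (newer) target cluster; hftp is read-only, so it can
    # only ever be the source side of the copy.
    hadoop distcp \
        hftp://source-nn.example.com:50070/user/rita/data \
        hdfs://target-nn.example.com:8020/user/rita/data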
Re: Which hardware to choose
Of course it all depends... But something like this could work:

- Leave 1-2 GB for the kernel, page cache, tools, overhead, etc.
- Plan 3-4 GB each for the DataNode and the TaskTracker.
- Plan 2.5-3 GB per slot. Depending on the kinds of jobs, you may need more or less memory per slot.
- Have 2-3 times as many mappers as reducers (again depending on the kinds of jobs you run).
- As Michael pointed out, the ratio of cores (hyperthreads) per disk matters.

With those initial rules of thumb you'd arrive somewhere between 10 mappers + 5 reducers and 9 mappers + 4 reducers (the arithmetic is spelled out after the quoted thread below). Try, test, measure, adjust, rinse, repeat.

Cheers,

Joep

On Tue, Oct 2, 2012 at 8:42 PM, Alexander Pivovarov apivova...@gmail.com wrote:

All configs are per node. No HBase, only Hive and Pig installed.

On Tue, Oct 2, 2012 at 9:40 PM, Michael Segel michael_se...@hotmail.com wrote:

I think he's saying that it's 24 maps and 8 reducers per node, and at 48 GB that could be too many mappers. Especially if they want to run HBase.

On Oct 2, 2012, at 8:14 PM, hadoopman hadoop...@gmail.com wrote:

Only 24 map and 8 reduce tasks for 38 data nodes? Are you sure that's right? Sounds VERY low for a cluster that size. We have only 10 C2100s and are running, I believe, 140 map and 70 reduce slots so far with pretty decent performance.

On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:

38 data nodes + 2 name nodes.

Data node: Dell PowerEdge C2100 series, 2 x Xeon X5670, 48 GB ECC RAM (12 x 4 GB 1333 MHz), 12 x 2 TB 7200 RPM SATA HDD (with hot swap) in JBOD, Intel Gigabit ET dual-port PCIe x4, redundant power supply.

Hadoop CDH3, max map tasks 24, max reduce tasks 8.
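Spelling out how those slot counts fall out of the rules of thumb, applied to the 48 GB nodes in the quoted specs (the exact splits within the ranges are my own assumption):

    # Generous case:     48 - 1 (OS etc.) - 2 x 3 (DN + TT) = 41 GB for slots
    #                    41 / ~2.75 GB per slot ~= 15 slots -> 10 map + 5 reduce
    # Conservative case: 48 - 2 (OS etc.) - 2 x 4 (DN + TT) = 38 GB for slots
    #                    38 / 3 GB per slot ~= 13 slots -> 9 map + 4 reduce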
Re: Small question
Moved common-user@hadoop.apache.org to bcc and added u...@pig.apache.org.

This is best asked on the Pig users list.

Cheers,

Joep

On Wed, Oct 3, 2012 at 7:04 AM, Abhishek abhishek.dod...@gmail.com wrote:

Hi all,

How would I write the Hive query below in Pig Latin?

    SELECT t2.col1, t3.col2
    FROM table2 t2 JOIN table3 t3
    WHERE t3.col2 IS NOT NULL
      AND t2.col1 LIKE CONCAT(CONCAT('%', t3.col2), '%')

Regards,
Abhi
Re: Do hadoop installations need to be in the same location across the cluster?
Agreed that different locations is not a good idea. However, the question was: can it be done? Yes, with some hacking, I suppose. Do I recommend hacking? No. But if you cannot help yourself:

To have datanodes use a different data location per slave, create an hdfs-site.xml per node (enjoy).

For the Hadoop installation itself it is a bit more tricky. Look at bin/hadoop-daemons.sh: it finds the location it is running from and assumes the remote nodes have Hadoop in the same location. For further hackery and confusion, look at the HADOOP_SSH_OPTS environment variable set in hadoop-env.sh. Note that passing HADOOP_CONF_DIR over ssh requires support from the server side: the ssh daemon may not accept a client-side SendEnv, to avoid LD_* types of environment variables, as this opens a security hole. See the settings in /etc/sshd_config on the slaves.

Alternatively, you can have a symlink on each slave in the same location as on the master, pointing to your different location. Finally, you may be able to start the Hadoop daemons by hand. (Both of the last two options are sketched after the quoted thread below.)

Have the correct amount of fun!

Joep

On Fri, Dec 23, 2011 at 9:55 AM, Michael Segel michael_se...@hotmail.com wrote:

Ok, here's the thing...

1) When building the cluster, you want to be consistent.
2) The location of $HADOOP_HOME is configurable, so you can place it anywhere.

Putting the software in two different locations isn't a good idea because you now have to set it up with a unique configuration per node. It would be faster, and would make your life a lot easier, to put the software in the same location on *all* machines. So my suggestion would be to bite the bullet and rebuild your cluster.

HTH
-Mike

Date: Fri, 23 Dec 2011 19:47:45 +0530
Subject: Re: Does hadoop installations need to be at same locations in cluster ?
From: praveen...@gmail.com
To: common-user@hadoop.apache.org

What I mean to say is: does Hadoop internally assume that the installations on all nodes are in the same location? I had Hadoop installed in different locations on 2 different nodes, and I configured the Hadoop config files so they would be part of the same cluster. But when I started Hadoop on the master, I saw it was also searching for the Hadoop start-up scripts in the same location as on the master. Is there any workaround for this kind of situation, or do I have to reinstall Hadoop in the same location as on the master?

Thanks,
Praveenesh

On Fri, Dec 23, 2011 at 6:26 PM, Michael Segel michael_se...@hotmail.com wrote:

Sure, you could do that, but in doing so you will make your life a living hell. Literally. Think about it... you will have to manually manage each node's config files, so if something goes wrong you will have a hard time diagnosing the issue. Why make life harder? Why not just do the simple thing and make all of your DNs the same?

Sent from my iPhone

On Dec 23, 2011, at 6:51 AM, praveenesh kumar praveen...@gmail.com wrote:

When installing Hadoop on slave machines, do we have to install Hadoop in the same location on each machine? Can we have the Hadoop installation in different locations on different machines in the same cluster? If yes, what do we have to take care of in that case?

Thanks,
Praveenesh
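A minimal sketch of those last two workarounds, with made-up paths standing in for whatever actually differs per slave:

    # Option 1: on each slave, symlink the path the master expects to the
    # real install location, so the ssh'd start-up scripts still resolve.
    ln -s /data/custom/hadoop-0.20.2 /usr/local/hadoop

    # Option 2: skip the ssh-based scripts entirely and start the daemons
    # by hand on each slave, pointing at that node's own config directory.
    $HADOOP_HOME/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode
    $HADOOP_HOME/bin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start tasktracker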
Re: Hadoop and hardware
Pierre,

As discussed in recent other threads, it depends. The most sensible thing for Hadoop nodes is to find a sweet spot for price/performance. In general that means keeping a balance between compute power, disks, and network bandwidth, and factoring in racks, space, operating costs, etc.

How much storage capacity are you thinking of when you target about 120 data nodes? If you had, for example, 60 quad-socket nodes with 12 x 2 TB disks (or more), I would suspect you would be bottlenecked on your 1 Gb network connections (rough numbers below).

Another thing to consider is how many nodes go in a rack. If these 60 nodes were 2U and you fit 20 nodes in a rack, then losing one top-of-rack switch means losing 1/3 of the capacity of your cluster.

Yet another consideration is how easily you want to be able to expand your cluster incrementally. Until you run Hadoop 0.23, you probably want all your nodes to be roughly similar in capacity.

Cheers,

Joep

On Fri, Dec 16, 2011 at 3:50 AM, Cussol pierre.cus...@cnes.fr wrote:

In my company, we intend to set up a Hadoop cluster to run analytics applications. This cluster would have about 120 data nodes: dual-socket servers with a Gb interconnect. We are also exploring a solution with 60 quad-socket servers. How do quad-socket and dual-socket servers compare in a Hadoop cluster? Any help?

pierre

--
View this message in context: http://old.nabble.com/Hadoop-and-hardware-tp32987374p32987374.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
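To make the bottleneck claim concrete, a rough back-of-the-envelope; the per-disk and NIC throughput figures are assumptions typical of that hardware generation, not measurements:

    # 12 disks x ~100 MB/s sequential ~= 1200 MB/s of local disk bandwidth per node
    # 1 GbE NIC ~= 125 MB/s theoretical (more like ~110 MB/s in practice)
    # => the single 1 Gb link can move roughly a tenth of what the spindles
    #    can stream, so the network, not the disks, becomes the limit.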
Re: mapreduce matrix multiplication on hadoop
The error is that you cannot open /tmp/MatrixMultiply/out/_logs.

Does the directory exist? Do you have the proper access rights set? (Two quick checks are sketched at the end of this thread.)

Joep

On Wed, Nov 30, 2011 at 3:23 AM, ChWaqas waqas...@gmail.com wrote:

Hi, I am trying to run the matrix multiplication example (with source code) from the following link:

http://www.norstad.org/matrix-multiply/index.html

I have Hadoop set up in pseudo-distributed mode, and I configured it using this tutorial:

http://hadoop-tutorial.blogspot.com/2010/11/running-hadoop-in-pseudo-distributed.html?showComment=1321528406255#c3661776111033973764

When I run my jar file, I get the following error:

Identity test
11/11/30 10:37:34 INFO input.FileInputFormat: Total input paths to process : 2
11/11/30 10:37:34 INFO mapred.JobClient: Running job: job_20291041_0010
11/11/30 10:37:35 INFO mapred.JobClient:  map 0% reduce 0%
11/11/30 10:37:44 INFO mapred.JobClient:  map 100% reduce 0%
11/11/30 10:37:56 INFO mapred.JobClient:  map 100% reduce 100%
11/11/30 10:37:58 INFO mapred.JobClient: Job complete: job_20291041_0010
11/11/30 10:37:58 INFO mapred.JobClient: Counters: 17
11/11/30 10:37:58 INFO mapred.JobClient:   Job Counters
11/11/30 10:37:58 INFO mapred.JobClient:     Launched reduce tasks=1
11/11/30 10:37:58 INFO mapred.JobClient:     Launched map tasks=2
11/11/30 10:37:58 INFO mapred.JobClient:     Data-local map tasks=2
11/11/30 10:37:58 INFO mapred.JobClient:   FileSystemCounters
11/11/30 10:37:58 INFO mapred.JobClient:     FILE_BYTES_READ=114
11/11/30 10:37:58 INFO mapred.JobClient:     HDFS_BYTES_READ=248
11/11/30 10:37:58 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=298
11/11/30 10:37:58 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=124
11/11/30 10:37:58 INFO mapred.JobClient:   Map-Reduce Framework
11/11/30 10:37:58 INFO mapred.JobClient:     Reduce input groups=2
11/11/30 10:37:58 INFO mapred.JobClient:     Combine output records=0
11/11/30 10:37:58 INFO mapred.JobClient:     Map input records=4
11/11/30 10:37:58 INFO mapred.JobClient:     Reduce shuffle bytes=60
11/11/30 10:37:58 INFO mapred.JobClient:     Reduce output records=2
11/11/30 10:37:58 INFO mapred.JobClient:     Spilled Records=8
11/11/30 10:37:58 INFO mapred.JobClient:     Map output bytes=100
11/11/30 10:37:58 INFO mapred.JobClient:     Combine input records=0
11/11/30 10:37:58 INFO mapred.JobClient:     Map output records=4
11/11/30 10:37:58 INFO mapred.JobClient:     Reduce input records=4
11/11/30 10:37:58 INFO input.FileInputFormat: Total input paths to process : 1
11/11/30 10:37:59 INFO mapred.JobClient: Running job: job_20291041_0011
11/11/30 10:38:00 INFO mapred.JobClient:  map 0% reduce 0%
11/11/30 10:38:09 INFO mapred.JobClient:  map 100% reduce 0%
11/11/30 10:38:21 INFO mapred.JobClient:  map 100% reduce 100%
11/11/30 10:38:23 INFO mapred.JobClient: Job complete: job_20291041_0011
11/11/30 10:38:23 INFO mapred.JobClient: Counters: 17
11/11/30 10:38:23 INFO mapred.JobClient:   Job Counters
11/11/30 10:38:23 INFO mapred.JobClient:     Launched reduce tasks=1
11/11/30 10:38:23 INFO mapred.JobClient:     Launched map tasks=1
11/11/30 10:38:23 INFO mapred.JobClient:     Data-local map tasks=1
11/11/30 10:38:23 INFO mapred.JobClient:   FileSystemCounters
11/11/30 10:38:23 INFO mapred.JobClient:     FILE_BYTES_READ=34
11/11/30 10:38:23 INFO mapred.JobClient:     HDFS_BYTES_READ=124
11/11/30 10:38:23 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=100
11/11/30 10:38:23 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=124
11/11/30 10:38:23 INFO mapred.JobClient:   Map-Reduce Framework
11/11/30 10:38:23 INFO mapred.JobClient:     Reduce input groups=2
11/11/30 10:38:23 INFO mapred.JobClient:     Combine output records=2
11/11/30 10:38:23 INFO mapred.JobClient:     Map input records=2
11/11/30 10:38:23 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/11/30 10:38:23 INFO mapred.JobClient:     Reduce output records=2
11/11/30 10:38:23 INFO mapred.JobClient:     Spilled Records=4
11/11/30 10:38:23 INFO mapred.JobClient:     Map output bytes=24
11/11/30 10:38:23 INFO mapred.JobClient:     Combine input records=2
11/11/30 10:38:23 INFO mapred.JobClient:     Map output records=2
11/11/30 10:38:23 INFO mapred.JobClient:     Reduce input records=2
Exception in thread "main" java.io.IOException: Cannot open filename /tmp/MatrixMultiply/out/_logs
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1488)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:376)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
        at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
        at
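For the two questions in the reply above, a couple of quick checks from the shell (run them as the user that submitted the job; the paths are taken from the error message):

    # Does the output directory exist, and what entries/permissions does it hold?
    hadoop fs -ls /tmp/MatrixMultiply/out
    hadoop fs -ls /tmp/MatrixMultiply/out/_logs

A likely explanation, judging by the stack trace: _logs is a directory that the framework drops into the job output directory, so code that blindly opens every entry under the output path as a SequenceFile will trip over it.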