Re: copy data from one hadoop cluster to another hadoop cluster + can't use distcp
It really depends on your requirements for the format of the data. The easiest way I can think of is to stream batches of data into a pub/sub system that the target cluster can access and consume from. Verify each batch, then discard it. You can throttle the size of the intermediary infrastructure based on your batch size. That seems like the most efficient approach.

On Thursday, June 18, 2015, Divya Gehlot divya.htco...@gmail.com wrote: Hi, I need to copy data from a first Hadoop cluster to a second Hadoop cluster. I can't access the second cluster from the first due to a security issue. Can anyone point me to how I can do this apart from the distcp command? For instance: Cluster 1 (secured zone) - copy HDFS data to - Cluster 2 (non-secured zone). Thanks, Divya
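A hedged sketch of such a staged hand-off, assuming an intermediary host that can reach both clusters and has valid credentials for the secured one; the hostnames (cluster1-nn, cluster2-nn) and paths are placeholders:

  # On the intermediary host: pull one batch out of the secured cluster
  hdfs dfs -get hdfs://cluster1-nn:8020/data/export/batch_0001 /tmp/batch_0001
  # Verify the batch (record counts, checksums), then push it into the target cluster
  hdfs dfs -put /tmp/batch_0001 hdfs://cluster2-nn:8020/data/import/
  # Ditch the local copy once the batch is confirmed on the target
  rm -r /tmp/batch_0001

Driving this loop one batch at a time means the intermediary never needs more scratch space than a single batch.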
Hadoop / HBase hotspotting / overloading specific nodes
I'm not sure if this is an HBase issue or a Hadoop issue, so if this is off-topic please forgive me. I am having a problem with Hadoop maxing out drive space on a select few nodes when I am running an HBase job. The scenario is this:
- The job is a data import using MapReduce / HBase.
- The data is being imported into one table.
- The table only has a couple of regions.
- As the job runs, HBase (or Hadoop?) begins placing the data in HDFS on the datanodes / regionservers that are hosting the regions.
- As the job progresses (and more data is imported), the two datanodes hosting the regions start to fill up, and eventually drive space hits 100% utilization while the other nodes in the cluster are at 40% or less.
- The job then begins to hang with multiple out-of-space errors and eventually fails.
I have tried running hadoop balancer during the job run; this helped, but only really succeeded in prolonging the eventual job failure. How can I get Hadoop / HBase to distribute the data across HDFS more evenly when it favors the nodes that the regions are on? Am I missing something here? Thanks for any help.
Re: Hadoop / HBase hotspotting / overloading specific nodes
This doesn't help because that space is simply reserved for the OS. Hadoop still maxes out its own quota and spits out out-of-space errors. Thanks.

On Wednesday, October 8, 2014, Bing Jiang jiangbinglo...@gmail.com wrote: Could you set some reserved room for non-DFS usage, just to avoid the disk getting full? hdfs-site.xml:

  <property>
    <name>dfs.datanode.du.reserved</name>
    <value></value>
    <description>Reserved space in bytes per volume. Always leave this much space free for non dfs use.</description>
  </property>

2014-10-09 14:01 GMT+08:00 SF Hadoop sfhad...@gmail.com: I'm not sure if this is an HBase issue or a Hadoop issue so if this is off-topic please forgive. I am having a problem with Hadoop maxing out drive space on a select few nodes when I am running an HBase job. The scenario is this: - The job is a data import using Map/Reduce / HBase - The data is being imported to one table - The table only has a couple of regions - As the job runs, HBase? / Hadoop? begins placing the data in HDFS on the datanode / regionserver that is hosting the regions - As the job progresses (and more data is imported) the two datanodes hosting the regions start to get full and eventually drive space hits 100% utilization whilst the other nodes in the cluster are at 40% or less drive space utilization - The job in Hadoop then begins to hang with multiple out of space errors and eventually fails. I have tried running hadoop balancer during the job run and this helped but only really succeeded in prolonging the eventual job failure. How can I get Hadoop / HBase to distribute the data to HDFS more evenly when it is favoring the nodes that the regions are on? Am I missing something here? Thanks for any help. -- Bing Jiang
Re: Hadoop / HBase hotspotting / overloading specific nodes
Haven't tried this. I'll give it a shot. Thanks.

On Thursday, October 9, 2014, Ted Yu yuzhih...@gmail.com wrote: Looks like the number of regions is lower than the number of nodes in the cluster. Can you split the table such that, after the hbase balancer is run, there is a region hosted by every node? Cheers

On Oct 8, 2014, at 11:01 PM, SF Hadoop sfhad...@gmail.com wrote: I'm not sure if this is an HBase issue or a Hadoop issue so if this is off-topic please forgive. I am having a problem with Hadoop maxing out drive space on a select few nodes when I am running an HBase job. The scenario is this: - The job is a data import using Map/Reduce / HBase - The data is being imported to one table - The table only has a couple of regions - As the job runs, HBase? / Hadoop? begins placing the data in HDFS on the datanode / regionserver that is hosting the regions - As the job progresses (and more data is imported) the two datanodes hosting the regions start to get full and eventually drive space hits 100% utilization whilst the other nodes in the cluster are at 40% or less drive space utilization - The job in Hadoop then begins to hang with multiple out of space errors and eventually fails. I have tried running hadoop balancer during the job run and this helped but only really succeeded in prolonging the eventual job failure. How can I get Hadoop / HBase to distribute the data to HDFS more evenly when it is favoring the nodes that the regions are on? Am I missing something here? Thanks for any help.
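For reference, a rough HBase shell sketch of the suggestion above, i.e. getting more regions so writes spread across nodes; the table name 'mytable', column family 'cf', and split points are made up for illustration:

  # Pre-split at creation time so load is spread from the first write
  create 'mytable', 'cf', {SPLITS => ['row2000000', 'row4000000', 'row6000000']}
  # Or split an existing table's regions and then trigger the HBase balancer
  split 'mytable'
  balancer

Whether the resulting regions actually land on different regionservers still depends on the balancer run afterwards.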
Re: Hadoop configuration for cluster machines with different memory capacity / # of cores etc.
Yes, you are correct. Just keep in mind that for every spec-X machine you have to have version X of the Hadoop configs (residing only on the spec-X machines), version Y configs residing only on the spec-Y machines, and so on. But yes, it is possible.

On Thu, Oct 9, 2014 at 9:40 AM, Manoj Samel manojsamelt...@gmail.com wrote: So, in that case, the resource manager will allocate containers of different capacity based on node capacity? Thanks,

On Wed, Oct 8, 2014 at 9:42 PM, Nitin Pawar nitinpawar...@gmail.com wrote: you can have different values on different nodes

On Thu, Oct 9, 2014 at 4:15 AM, Manoj Samel manojsamelt...@gmail.com wrote: In a Hadoop cluster where different machines have different memory capacity and/or a different number of cores, is it required that memory/core-related parameters be set to the SAME values on all nodes? Or is it possible to set different values for different nodes? E.g., can yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores have different values on different nodes? Thanks, -- Nitin Pawar
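As a concrete sketch (the numbers are illustrative only), the per-node yarn-site.xml simply carries different values on differently sized machines:

  <!-- yarn-site.xml on a 64 GB / 16-core node -->
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>57344</value></property>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>16</value></property>

  <!-- yarn-site.xml on a 16 GB / 4-core node -->
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>12288</value></property>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>4</value></property>

Each NodeManager advertises its own capacity to the ResourceManager, which then schedules containers within whatever each node reports.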
Re: Standby Namenode and Datanode coexistence
You can run any of the daemons on any machine you want; you just have to be aware of the trade-offs you are making with RAM allocation. I am hoping this is a DEV cluster. This is definitely not a configuration you would want to use in production. If you are asking in regard to a production cluster, the NNs should live apart from the datanodes, though it is perfectly fine to run the journal node and ZooKeeper instances on the NNs. But again, you should NEVER have the NN and DN on the same machine (unless you are in a DEV cluster and experimenting).

On Thu, Oct 9, 2014 at 4:19 AM, oc tsdb oc.t...@gmail.com wrote: Hi, We have a cluster with 3 nodes (1 namenode + 2 datanodes). The cluster is running Hadoop version 2.4.0. We would like to add High Availability (HA) to the Namenode using the Quorum Journal Manager. As per the link below, we need two NN machines with the same configuration. http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Hardware resources Our query is: as we have an existing cluster with 3 nodes (1 namenode + 2 datanodes), can we configure the standby namenode on one of the datanodes? Will there be any issues if we run the standby namenode and a datanode together? Or should we add one more machine and configure it as the standby namenode? Regarding the Journal node, can we run it on any machine (datanode or namenode)? Thanks in advance. Thanks oc.tsdb
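For what it's worth, a minimal hdfs-site.xml sketch of the QJM layout being discussed; the nameservice name 'mycluster' and the hostnames nn1, nn2, jn1, jn2 and jn3 are hypothetical:

  <property><name>dfs.nameservices</name><value>mycluster</value></property>
  <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>nn1:8020</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>nn2:8020</value></property>
  <property><name>dfs.namenode.shared.edits.dir</name><value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value></property>

Nothing in the configuration itself stops nn2 from also hosting a datanode; the caveat above is about RAM contention and failure domains, not about what HDFS will accept.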
Re: MapReduce jobs start only on the PC they are typed on
What is in /etc/hadoop/conf/slaves? Something tells me it just says 'localhost'. You need to specify your slaves in that file (a minimal example is sketched after the quoted configuration below).

On Thu, Oct 9, 2014 at 2:24 PM, Piotr Kubaj pku...@riseup.net wrote: Hi. I'm trying to run Hadoop on a 2-PC cluster (I need to do some benchmarks for my bachelor thesis) and it works, but jobs start only on the PC I typed the command on (it doesn't matter whether it has better specs or where the data physically is, since I'm computing Pi). My mapred-site.xml is:

<configuration>
  <property><name>mapred.job.tracker</name><value>10.0.0.1:54311</value><description>The host and port that the MapReduce job tracker runs at. If local, then jobs are run in-process as a single map and reduce task.</description></property>
  <property><name>mapred.framework.name</name><value>yarn</value></property>
  <property><name>mapred.map.tasks</name><value>20</value></property>
  <property><name>mapred.reduce.tasks</name><value>20</value></property>
  <property><name>mapreduce.tasktracker.map.tasks.maximum</name><value>20</value></property>
  <property><name>mapreduce.tasktracker.reduce.tasks.maximum</name><value>20</value></property>
  <property><name>mapreduce.tasktracker.map.tasks.maximum</name><value>30</value><final>true</final></property>
  <property><name>mapreduce.tasktracker.reduce.tasks.maximum</name><value>30</value></property>
  <property><name>mapreduce.job.maps</name><value>3500</value></property>
  <property><name>mapreduce.job.reduces</name><value>3500</value></property>
  <property><name>mapred.child.java.opts</name><value>-Xmx2048m</value></property>
  <property><name>mapreduce.reduce.shuffle.parallelcopies</name><value>10</value></property>
  <property><name>mapreduce.jobhistory.address</name><value>DESKTOP1:10020</value></property>
  <property><name>mapreduce.jobhistory.webapp.address</name><value>DESKTOP1:19888</value></property>
</configuration>

And yarn-site.xml:

<configuration>
  <property><name>yarn.nodemanager.local-dirs</name><value>/var/cache/hadoop-hdfs/hdfs</value><description>Comma separated list of paths. Use the list of directories from $YARN_LOCAL_DIR. For example, /grid/hadoop/hdfs/yarn,/grid1/hadoop/hdfs/yarn.</description></property>
  <property><name>yarn.nodemanager.log-dirs</name><value>/var/log/hadoop/yarn</value><description>Use the list of directories from $YARN_LOG_DIR. For example, /var/log/hadoop/yarn.</description></property>
  <property><name>yarn.resourcemanager.hostname</name><value>10.0.0.1</value></property>
  <property><name>yarn.resourcemanager.address</name><value>${yarn.resourcemanager.hostname}:8032</value></property>
  <property><name>yarn.resourcemanager.scheduler.address</name><value>${yarn.resourcemanager.hostname}:8030</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address</name><value>${yarn.resourcemanager.hostname}:8031</value></property>
  <property><name>yarn.resourcemanager.admin.address</name><value>${yarn.resourcemanager.hostname}:8033</value></property>
  <property><description>The address of the RM web application.</description><name>yarn.resourcemanager.webapp.address</name><value>${yarn.resourcemanager.hostname}:8088</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>131072</value></property>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>131072</value></property>
  <property><description>Number of CPU cores that can be allocated for containers.</description><name>yarn.nodemanager.resource.cpu-vcores</name><value>8</value></property>
  <property><name>yarn.resourcemanager.am.max-attempts</name><value>3</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
  <property><name>yarn.log-aggregation.retain-seconds</name><value>604800</value></property>
</configuration>
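As a hedged sketch of the fix being suggested (the hostnames are placeholders), /etc/hadoop/conf/slaves should simply list every worker machine, one hostname per line:

  desktop1
  desktop2

The slaves file is what the start-dfs.sh / start-yarn.sh helper scripts read, so the daemons on the second machine actually get started; both machines also need the same yarn-site.xml (with yarn.resourcemanager.hostname pointing at 10.0.0.1, as above) so the second NodeManager registers with the ResourceManager instead of only the local one running containers.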
Block placement without rack aware
What is the block placement policy Hadoop follows when rack awareness is not enabled? Does it just round-robin? Thanks.
Re: Block placement without rack aware
Thanks for the info. Exactly what I needed. Cheers.

On Thu, Oct 2, 2014 at 4:21 PM, Pradeep Gollakota pradeep...@gmail.com wrote: It appears to be randomly chosen. I just came across this blog post from Lars George about HBase file locality in HDFS: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html

On Thu, Oct 2, 2014 at 4:12 PM, SF Hadoop sfhad...@gmail.com wrote: What is the block placement policy hadoop follows when rack aware is not enabled? Does it just round robin? Thanks.
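If rack-aware placement is wanted later, it is switched on by pointing core-site.xml at a topology script that maps hostnames or IPs to rack names; the script path below is only an example (the property was named topology.script.file.name on older 1.x releases):

  <property><name>net.topology.script.file.name</name><value>/etc/hadoop/conf/topology.sh</value></property>

With no script configured, every node reports the same default rack, which is why placement degenerates to a random choice of datanodes.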
Re: Data node with multiple disks
Just set your replication factor to 1 and you will be fine.

On Tue, May 13, 2014 at 8:12 AM, Marcos Sousa falecom...@marcossousa.com wrote: Yes, I don't want to replicate, just use them as one disk. Isn't it possible to make this work? Best regards, Marcos

On Tue, May 13, 2014 at 6:55 AM, Rahul Chaudhari rahulchaudhari0...@gmail.com wrote: Marcos, While configuring Hadoop, the dfs.datanode.data.dir property in hdfs-default.xml should have this list of disks specified on separate lines. If you specify a comma-separated list, it will replicate on all those disks/partitions. _Rahul Sent from my iPad

On 13-May-2014, at 12:22 am, Marcos Sousa falecom...@marcossousa.com wrote: Hi, I have 20 servers with 10 HDs of 400GB SATA each. I'd like to use them as my datanodes: /vol1/hadoop/data /vol2/hadoop/data /vol3/hadoop/data /volN/hadoop/data How do I use those distinct disks without replicating? Best regards, -- Marcos Sousa -- Marcos Sousa www.marcossousa.com Enjoy it!
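A hedged sketch of what that looks like in hdfs-site.xml, reusing the paths from the original mail (the directory property is dfs.datanode.data.dir on Hadoop 2, dfs.data.dir on older releases):

  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.datanode.data.dir</name><value>/vol1/hadoop/data,/vol2/hadoop/data,/vol3/hadoop/data</value></property>

Note that a comma-separated dfs.datanode.data.dir makes the datanode spread blocks across the listed directories rather than copying each block to every disk; the number of copies kept in the cluster is controlled only by dfs.replication.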
Re: Data node with multiple disks
Your question is unclear. Please restate and describe what you are attempting to do. Thanks.

On Monday, May 12, 2014, Marcos Sousa falecom...@marcossousa.com wrote: Hi, I have 20 servers with 10 HD with 400GB SATA. I'd like to use them to be my datanode: /vol1/hadoop/data /vol2/hadoop/data /vol3/hadoop/data /volN/hadoop/data How do user those distinct discs not to replicate? Best regards, -- Marcos Sousa
Re: No information in Job History UI
That explains a lot. Thanks for the information. I appreciate your help.

On Mon, Mar 3, 2014 at 7:47 PM, Jian He j...@hortonworks.com wrote: You said "there are no job logs generated on the server that is running the job." - that was quoting your previous sentence and answers your question. "If I were to run a job and I wanted to tail the job log as it was running, where would I find that log?" 1) Set yarn.nodemanager.delete.debug-delay-sec to a larger value, and look for logs in the local dirs specified by yarn.nodemanager.log-dirs. Or 2) enable log aggregation with yarn.log-aggregation-enable. Log aggregation aggregates those NM-local logs and uploads them to HDFS once the application is finished. Then you can use the yarn logs command or simply go to the history UI to see the logs. You can find a good explanation at http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/ Thanks.

On Mon, Mar 3, 2014 at 4:29 PM, SF Hadoop sfhad...@gmail.com wrote: Thanks for that info, Jian. You said "there are no job logs generated on the server that is running the job." So am I correct in assuming the logs will be in the dir specified by yarn.nodemanager.log-dirs on the datanodes? I am quite confused as to where the logs for each specific part of the ecosystem reside. If I were to run a job and I wanted to tail the job log as it was running, where would I find that log? Thanks for your help.

On Mon, Mar 3, 2014 at 11:46 AM, Jian He j...@hortonworks.com wrote: Note that the node manager will not keep finished applications and only shows running apps, so its UI won't show the finished apps. Conversely, the job history server UI will only show the finished apps but not the running apps. bq. there are no job logs generated on the server that is running the job. By default, the local logs will be deleted after the job finishes. You can configure yarn.nodemanager.delete.debug-delay-sec to delay the deletion of the logs. Jian

On Mon, Mar 3, 2014 at 10:45 AM, SF Hadoop sfhad...@gmail.com wrote: Hadoop 2.2.0, CentOS 6.4, viewing the UI in various browsers. I am having a problem where no information is visible in my Job History UI. I run test jobs, they complete without error, but no information ever populates the nodemanager or jobhistory server UI. Also, there are no job logs generated on the server that is running the job. I have the following settings configured: yarn.nodemanager.local-dirs, yarn.nodemanager.log-dirs, yarn.log.server.url, ...plus the basic yarn log dir. I get output in regards to the daemons but very little in regards to the job. All I get that refers to the jobhistory server is the following (so it appears to be functioning properly):

2014-02-18 11:43:06,824 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 19888
2014-02-18 11:43:06,824 INFO org.mortbay.log: jetty-6.1.26
2014-02-18 11:43:06,847 INFO org.mortbay.log: Extract jar:file:/usr/lib/hadoop-yarn/hadoop-yarn-common-2.1.0.2.0.5.0-67.jar!/webapps/jobhistory to /tmp/Jetty_server_19888_jobhistoryv7gnnv/webapp
2014-02-18 11:43:07,085 INFO org.mortbay.log: Started SelectChannelConnector@server:19888
2014-02-18 11:43:07,085 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 19888
2014-02-18 11:43:07,477 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules

I have a feeling this is a misconfiguration but I cannot figure out what setting is missing or wrong. Other than not being able to see any of the jobs in the UIs, everything appears to be working correctly, so this is quite confusing.
Any help is appreciated.
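For reference, a minimal yarn-site.xml sketch of the two options Jian describes; the one-hour delay value is arbitrary:

  <!-- Option 1: keep NodeManager-local logs around for an hour after the job finishes -->
  <property><name>yarn.nodemanager.delete.debug-delay-sec</name><value>3600</value></property>
  <!-- Option 2: aggregate per-container logs to HDFS when the application finishes -->
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>

With aggregation enabled, logs for a finished job can be pulled with yarn logs -applicationId application_1393877688000_0001 (the id here is a placeholder) or browsed from the job history UI.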
No information in Job History UI
Hadoop 2.2.0, CentOS 6.4, viewing the UI in various browsers.

I am having a problem where no information is visible in my Job History UI. I run test jobs, they complete without error, but no information ever populates the nodemanager or jobhistory server UI. Also, there are no job logs generated on the server that is running the job. I have the following settings configured: yarn.nodemanager.local-dirs, yarn.nodemanager.log-dirs, yarn.log.server.url, ...plus the basic yarn log dir. I get output in regards to the daemons but very little in regards to the job. All I get that refers to the jobhistory server is the following (so it appears to be functioning properly):

2014-02-18 11:43:06,824 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 19888
2014-02-18 11:43:06,824 INFO org.mortbay.log: jetty-6.1.26
2014-02-18 11:43:06,847 INFO org.mortbay.log: Extract jar:file:/usr/lib/hadoop-yarn/hadoop-yarn-common-2.1.0.2.0.5.0-67.jar!/webapps/jobhistory to /tmp/Jetty_server_19888_jobhistoryv7gnnv/webapp
2014-02-18 11:43:07,085 INFO org.mortbay.log: Started SelectChannelConnector@server:19888
2014-02-18 11:43:07,085 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 19888
2014-02-18 11:43:07,477 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules

I have a feeling this is a misconfiguration but I cannot figure out what setting is missing or wrong. Other than not being able to see any of the jobs in the UIs, everything appears to be working correctly, so this is quite confusing. Any help is appreciated.
Re: No information in Job History UI
Thanks for that info, Jian. You said "there are no job logs generated on the server that is running the job." So am I correct in assuming the logs will be in the dir specified by yarn.nodemanager.log-dirs on the datanodes? I am quite confused as to where the logs for each specific part of the ecosystem reside. If I were to run a job and I wanted to tail the job log as it was running, where would I find that log? Thanks for your help.

On Mon, Mar 3, 2014 at 11:46 AM, Jian He j...@hortonworks.com wrote: Note that the node manager will not keep finished applications and only shows running apps, so its UI won't show the finished apps. Conversely, the job history server UI will only show the finished apps but not the running apps. bq. there are no job logs generated on the server that is running the job. By default, the local logs will be deleted after the job finishes. You can configure yarn.nodemanager.delete.debug-delay-sec to delay the deletion of the logs. Jian

On Mon, Mar 3, 2014 at 10:45 AM, SF Hadoop sfhad...@gmail.com wrote: Hadoop 2.2.0, CentOS 6.4, viewing the UI in various browsers. I am having a problem where no information is visible in my Job History UI. I run test jobs, they complete without error, but no information ever populates the nodemanager or jobhistory server UI. Also, there are no job logs generated on the server that is running the job. I have the following settings configured: yarn.nodemanager.local-dirs, yarn.nodemanager.log-dirs, yarn.log.server.url, ...plus the basic yarn log dir. I get output in regards to the daemons but very little in regards to the job. All I get that refers to the jobhistory server is the following (so it appears to be functioning properly):

2014-02-18 11:43:06,824 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 19888
2014-02-18 11:43:06,824 INFO org.mortbay.log: jetty-6.1.26
2014-02-18 11:43:06,847 INFO org.mortbay.log: Extract jar:file:/usr/lib/hadoop-yarn/hadoop-yarn-common-2.1.0.2.0.5.0-67.jar!/webapps/jobhistory to /tmp/Jetty_server_19888_jobhistoryv7gnnv/webapp
2014-02-18 11:43:07,085 INFO org.mortbay.log: Started SelectChannelConnector@server:19888
2014-02-18 11:43:07,085 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 19888
2014-02-18 11:43:07,477 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules

I have a feeling this is a misconfiguration but I cannot figure out what setting is missing or wrong. Other than not being able to see any of the jobs in the UIs, everything appears to be working correctly, so this is quite confusing. Any help is appreciated.
Java version with Hadoop 2.0
I am preparing to deploy multiple clusters / distros of Hadoop for testing / benchmarking. In my research I have noticed discrepancies in the version of the JDK that various groups are using. For example: Hortonworks suggests JDK 6u31; CDH recommends either 6 or 7, provided you stick to some guidelines for each; and Apache Hadoop seems to be somewhat of a no man's land, with a lot of people using a lot of different versions. Does anyone have any insight they could share about how to approach choosing the best JDK release? (I'm a total Java newb, so any info / further reading you can provide is appreciated.) Thanks. sf
Re: Java version with Hadoop 2.0
I hadn't. Thank you!!! Very helpful. Andy

On Wed, Oct 9, 2013 at 2:25 PM, Patai Sangbutsarakum patai.sangbutsara...@turn.com wrote: maybe you've already seen this. http://wiki.apache.org/hadoop/HadoopJavaVersions

On Oct 9, 2013, at 2:16 PM, SF Hadoop sfhad...@gmail.com wrote: I am preparing to deploy multiple cluster / distros of Hadoop for testing / benchmarking. In my research I have noticed discrepancies in the version of the JDK that various groups are using. Example: Hortonworks is suggesting JDK6u31, CDH recommends either 6 or 7 providing you stick to some guidelines for each and Apache Hadoop seems to be somewhat of a no mans land; a lot of people using a lot of different versions. Does anyone have any insight they could share about how to approach choosing the best JDK release? (I'm a total Java newb, so any info / further reading you guys can provide is appreciated.) Thanks. sf