Re: copy data from one hadoop cluster to another hadoop cluster + can't use distcp
It really depends on your requirements for the format of the data. The easiest way I can think of is to stream batches of data into a pub/sub system that the target cluster can access and consume from. Verify each batch, then discard it. You can throttle the size of the intermediary infrastructure based on your batch size. That seems like the most efficient approach.

On Thursday, June 18, 2015, Divya Gehlot divya.htco...@gmail.com wrote: Hi, I need to copy data from a first Hadoop cluster to a second Hadoop cluster. I can't access the second cluster from the first due to a security issue. Can anyone point me to how I can do this apart from the distcp command? For instance: Cluster 1 (secured zone) - copy HDFS data to - Cluster 2 (non-secured zone). Thanks, Divya
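A hedged sketch of such a staged hand-off, assuming an intermediary host that can reach both clusters and has valid credentials for the secured one; the hostnames (cluster1-nn, cluster2-nn) and paths are placeholders:

  # On the intermediary host: pull one batch out of the secured cluster
  hdfs dfs -get hdfs://cluster1-nn:8020/data/export/batch_0001 /tmp/batch_0001
  # Verify the batch (record counts, checksums), then push it into the target cluster
  hdfs dfs -put /tmp/batch_0001 hdfs://cluster2-nn:8020/data/import/
  # Ditch the local copy once the batch is confirmed on the target
  rm -r /tmp/batch_0001

Driving this loop one batch at a time means the intermediary never needs more scratch space than a single batch.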
Hadoop / HBase hotspotting / overloading specific nodes
I'm not sure if this is an HBase issue or a Hadoop issue, so if this is off-topic please forgive me. I am having a problem with Hadoop maxing out drive space on a select few nodes when I am running an HBase job. The scenario is this:
- The job is a data import using MapReduce / HBase.
- The data is being imported into one table.
- The table only has a couple of regions.
- As the job runs, HBase (or Hadoop?) begins placing the data in HDFS on the datanodes / regionservers that are hosting the regions.
- As the job progresses (and more data is imported), the two datanodes hosting the regions start to fill up, and eventually drive space hits 100% utilization while the other nodes in the cluster are at 40% or less.
- The job then begins to hang with multiple out-of-space errors and eventually fails.
I have tried running hadoop balancer during the job run; this helped, but only really succeeded in prolonging the eventual job failure. How can I get Hadoop / HBase to distribute the data across HDFS more evenly when it favors the nodes that the regions are on? Am I missing something here? Thanks for any help.
Re: Hadoop / HBase hotspotting / overloading specific nodes
This doesn't help because that space is simply reserved for the OS. Hadoop still maxes out its own quota and spits out out-of-space errors. Thanks.

On Wednesday, October 8, 2014, Bing Jiang jiangbinglo...@gmail.com wrote: Could you set some reserved room for non-DFS usage, just to avoid the disk getting full? hdfs-site.xml:

  <property>
    <name>dfs.datanode.du.reserved</name>
    <value></value>
    <description>Reserved space in bytes per volume. Always leave this much space free for non dfs use.</description>
  </property>

2014-10-09 14:01 GMT+08:00 SF Hadoop sfhad...@gmail.com: I'm not sure if this is an HBase issue or a Hadoop issue so if this is off-topic please forgive. I am having a problem with Hadoop maxing out drive space on a select few nodes when I am running an HBase job. The scenario is this: - The job is a data import using Map/Reduce / HBase - The data is being imported to one table - The table only has a couple of regions - As the job runs, HBase? / Hadoop? begins placing the data in HDFS on the datanode / regionserver that is hosting the regions - As the job progresses (and more data is imported) the two datanodes hosting the regions start to get full and eventually drive space hits 100% utilization whilst the other nodes in the cluster are at 40% or less drive space utilization - The job in Hadoop then begins to hang with multiple out of space errors and eventually fails. I have tried running hadoop balancer during the job run and this helped but only really succeeded in prolonging the eventual job failure. How can I get Hadoop / HBase to distribute the data to HDFS more evenly when it is favoring the nodes that the regions are on? Am I missing something here? Thanks for any help. -- Bing Jiang
Re: Hadoop / HBase hotspotting / overloading specific nodes
Haven't tried this. I'll give it a shot. Thanks.

On Thursday, October 9, 2014, Ted Yu yuzhih...@gmail.com wrote: Looks like the number of regions is lower than the number of nodes in the cluster. Can you split the table such that, after the hbase balancer is run, there is a region hosted by every node? Cheers

On Oct 8, 2014, at 11:01 PM, SF Hadoop sfhad...@gmail.com wrote: I'm not sure if this is an HBase issue or a Hadoop issue so if this is off-topic please forgive. I am having a problem with Hadoop maxing out drive space on a select few nodes when I am running an HBase job. The scenario is this: - The job is a data import using Map/Reduce / HBase - The data is being imported to one table - The table only has a couple of regions - As the job runs, HBase? / Hadoop? begins placing the data in HDFS on the datanode / regionserver that is hosting the regions - As the job progresses (and more data is imported) the two datanodes hosting the regions start to get full and eventually drive space hits 100% utilization whilst the other nodes in the cluster are at 40% or less drive space utilization - The job in Hadoop then begins to hang with multiple out of space errors and eventually fails. I have tried running hadoop balancer during the job run and this helped but only really succeeded in prolonging the eventual job failure. How can I get Hadoop / HBase to distribute the data to HDFS more evenly when it is favoring the nodes that the regions are on? Am I missing something here? Thanks for any help.
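For reference, a rough HBase shell sketch of the suggestion above, i.e. getting more regions so writes spread across nodes; the table name 'mytable', column family 'cf', and split points are made up for illustration:

  # Pre-split at creation time so load is spread from the first write
  create 'mytable', 'cf', {SPLITS => ['row2000000', 'row4000000', 'row6000000']}
  # Or split an existing table's regions and then trigger the HBase balancer
  split 'mytable'
  balancer

Whether the resulting regions actually land on different regionservers still depends on the balancer run afterwards.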
Re: Hadoop configuration for cluster machines with different memory capacity / # of cores etc.
Yes, you are correct. Just keep in mind that for every spec-X machine you have to have version X of the Hadoop configs (residing only on the spec-X machines), version Y configs residing only on the spec-Y machines, and so on. But yes, it is possible.

On Thu, Oct 9, 2014 at 9:40 AM, Manoj Samel manojsamelt...@gmail.com wrote: So, in that case, the resource manager will allocate containers of different capacity based on node capacity? Thanks,

On Wed, Oct 8, 2014 at 9:42 PM, Nitin Pawar nitinpawar...@gmail.com wrote: you can have different values on different nodes

On Thu, Oct 9, 2014 at 4:15 AM, Manoj Samel manojsamelt...@gmail.com wrote: In a Hadoop cluster where different machines have different memory capacity and/or a different number of cores, is it required that memory/core-related parameters be set to the SAME values on all nodes? Or is it possible to set different values for different nodes? E.g., can yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores have different values on different nodes? Thanks, -- Nitin Pawar
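As a concrete sketch (the numbers are illustrative only), the per-node yarn-site.xml simply carries different values on differently sized machines:

  <!-- yarn-site.xml on a 64 GB / 16-core node -->
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>57344</value></property>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>16</value></property>

  <!-- yarn-site.xml on a 16 GB / 4-core node -->
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>12288</value></property>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>4</value></property>

Each NodeManager advertises its own capacity to the ResourceManager, which then schedules containers within whatever each node reports.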
Re: Standby Namenode and Datanode coexistence
You can run any of the daemons on any machine you want; you just have to be aware of the trade-offs you are making with RAM allocation. I am hoping this is a DEV cluster. This is definitely not a configuration you would want to use in production. If you are asking in regard to a production cluster, the NNs should live apart from the datanodes, though it is perfectly fine to run the journal node and ZooKeeper instances on the NNs. But again, you should NEVER have the NN and DN on the same machine (unless you are in a DEV cluster and experimenting).

On Thu, Oct 9, 2014 at 4:19 AM, oc tsdb oc.t...@gmail.com wrote: Hi, We have a cluster with 3 nodes (1 namenode + 2 datanodes). The cluster is running Hadoop version 2.4.0. We would like to add High Availability (HA) to the Namenode using the Quorum Journal Manager. As per the link below, we need two NN machines with the same configuration. http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Hardware resources Our query is: as we have an existing cluster with 3 nodes (1 namenode + 2 datanodes), can we configure the standby namenode on one of the datanodes? Will there be any issues if we run the standby namenode and a datanode together? Or should we add one more machine and configure it as the standby namenode? Regarding the Journal node, can we run it on any machine (datanode or namenode)? Thanks in advance. Thanks oc.tsdb
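For what it's worth, a minimal hdfs-site.xml sketch of the QJM layout being discussed; the nameservice name 'mycluster' and the hostnames nn1, nn2, jn1, jn2 and jn3 are hypothetical:

  <property><name>dfs.nameservices</name><value>mycluster</value></property>
  <property><name>dfs.ha.namenodes.mycluster</name><value>nn1,nn2</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn1</name><value>nn1:8020</value></property>
  <property><name>dfs.namenode.rpc-address.mycluster.nn2</name><value>nn2:8020</value></property>
  <property><name>dfs.namenode.shared.edits.dir</name><value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value></property>

Nothing in the configuration itself stops nn2 from also hosting a datanode; the caveat above is about RAM contention and failure domains, not about what HDFS will accept.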
Re: MapReduce jobs start only on the PC they are typed on
What is in /etc/hadoop/conf/slaves? Something tells me it just says 'localhost'. You need to specify your slaves in that file (a minimal example is sketched after the quoted configuration below).

On Thu, Oct 9, 2014 at 2:24 PM, Piotr Kubaj pku...@riseup.net wrote: Hi. I'm trying to run Hadoop on a 2-PC cluster (I need to do some benchmarks for my bachelor thesis) and it works, but jobs start only on the PC I typed the command on (it doesn't matter whether it has better specs or where the data physically is, since I'm computing Pi). My mapred-site.xml is:

<configuration>
  <property><name>mapred.job.tracker</name><value>10.0.0.1:54311</value><description>The host and port that the MapReduce job tracker runs at. If local, then jobs are run in-process as a single map and reduce task.</description></property>
  <property><name>mapred.framework.name</name><value>yarn</value></property>
  <property><name>mapred.map.tasks</name><value>20</value></property>
  <property><name>mapred.reduce.tasks</name><value>20</value></property>
  <property><name>mapreduce.tasktracker.map.tasks.maximum</name><value>20</value></property>
  <property><name>mapreduce.tasktracker.reduce.tasks.maximum</name><value>20</value></property>
  <property><name>mapreduce.tasktracker.map.tasks.maximum</name><value>30</value><final>true</final></property>
  <property><name>mapreduce.tasktracker.reduce.tasks.maximum</name><value>30</value></property>
  <property><name>mapreduce.job.maps</name><value>3500</value></property>
  <property><name>mapreduce.job.reduces</name><value>3500</value></property>
  <property><name>mapred.child.java.opts</name><value>-Xmx2048m</value></property>
  <property><name>mapreduce.reduce.shuffle.parallelcopies</name><value>10</value></property>
  <property><name>mapreduce.jobhistory.address</name><value>DESKTOP1:10020</value></property>
  <property><name>mapreduce.jobhistory.webapp.address</name><value>DESKTOP1:19888</value></property>
</configuration>

And yarn-site.xml:

<configuration>
  <property><name>yarn.nodemanager.local-dirs</name><value>/var/cache/hadoop-hdfs/hdfs</value><description>Comma separated list of paths. Use the list of directories from $YARN_LOCAL_DIR. For example, /grid/hadoop/hdfs/yarn,/grid1/hadoop/hdfs/yarn.</description></property>
  <property><name>yarn.nodemanager.log-dirs</name><value>/var/log/hadoop/yarn</value><description>Use the list of directories from $YARN_LOG_DIR. For example, /var/log/hadoop/yarn.</description></property>
  <property><name>yarn.resourcemanager.hostname</name><value>10.0.0.1</value></property>
  <property><name>yarn.resourcemanager.address</name><value>${yarn.resourcemanager.hostname}:8032</value></property>
  <property><name>yarn.resourcemanager.scheduler.address</name><value>${yarn.resourcemanager.hostname}:8030</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address</name><value>${yarn.resourcemanager.hostname}:8031</value></property>
  <property><name>yarn.resourcemanager.admin.address</name><value>${yarn.resourcemanager.hostname}:8033</value></property>
  <property><description>The address of the RM web application.</description><name>yarn.resourcemanager.webapp.address</name><value>${yarn.resourcemanager.hostname}:8088</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>131072</value></property>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>131072</value></property>
  <property><description>Number of CPU cores that can be allocated for containers.</description><name>yarn.nodemanager.resource.cpu-vcores</name><value>8</value></property>
  <property><name>yarn.resourcemanager.am.max-attempts</name><value>3</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
  <property><name>yarn.log-aggregation.retain-seconds</name><value>604800</value></property>
</configuration>
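As a hedged sketch of the fix being suggested (the hostnames are placeholders), /etc/hadoop/conf/slaves should simply list every worker machine, one hostname per line:

  desktop1
  desktop2

The slaves file is what the start-dfs.sh / start-yarn.sh helper scripts read, so the daemons on the second machine actually get started; both machines also need the same yarn-site.xml (with yarn.resourcemanager.hostname pointing at 10.0.0.1, as above) so the second NodeManager registers with the ResourceManager instead of only the local one running containers.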
Block placement without rack aware
What is the block placement policy Hadoop follows when rack awareness is not enabled? Does it just round-robin? Thanks.
Re: Block placement without rack aware
Thanks for the info. Exactly what I needed. Cheers.

On Thu, Oct 2, 2014 at 4:21 PM, Pradeep Gollakota pradeep...@gmail.com wrote: It appears to be randomly chosen. I just came across this blog post from Lars George about HBase file locality in HDFS: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html

On Thu, Oct 2, 2014 at 4:12 PM, SF Hadoop sfhad...@gmail.com wrote: What is the block placement policy hadoop follows when rack aware is not enabled? Does it just round robin? Thanks.
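If rack-aware placement is wanted later, it is switched on by pointing core-site.xml at a topology script that maps hostnames or IPs to rack names; the script path below is only an example (the property was named topology.script.file.name on older 1.x releases):

  <property><name>net.topology.script.file.name</name><value>/etc/hadoop/conf/topology.sh</value></property>

With no script configured, every node reports the same default rack, which is why placement degenerates to a random choice of datanodes.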
Re: Data node with multiple disks
Just set your replication factor to 1 and you will be fine.

On Tue, May 13, 2014 at 8:12 AM, Marcos Sousa falecom...@marcossousa.com wrote: Yes, I don't want to replicate, just use them as one disk. Isn't it possible to make this work? Best regards, Marcos

On Tue, May 13, 2014 at 6:55 AM, Rahul Chaudhari rahulchaudhari0...@gmail.com wrote: Marcos, While configuring Hadoop, the dfs.datanode.data.dir property in hdfs-default.xml should have this list of disks specified on separate lines. If you specify a comma-separated list, it will replicate on all those disks/partitions. _Rahul Sent from my iPad

On 13-May-2014, at 12:22 am, Marcos Sousa falecom...@marcossousa.com wrote: Hi, I have 20 servers with 10 HDs of 400GB SATA each. I'd like to use them as my datanodes: /vol1/hadoop/data /vol2/hadoop/data /vol3/hadoop/data /volN/hadoop/data How do I use those distinct disks without replicating? Best regards, -- Marcos Sousa -- Marcos Sousa www.marcossousa.com Enjoy it!
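A hedged sketch of what that looks like in hdfs-site.xml, reusing the paths from the original mail (the directory property is dfs.datanode.data.dir on Hadoop 2, dfs.data.dir on older releases):

  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.datanode.data.dir</name><value>/vol1/hadoop/data,/vol2/hadoop/data,/vol3/hadoop/data</value></property>

Note that a comma-separated dfs.datanode.data.dir makes the datanode spread blocks across the listed directories rather than copying each block to every disk; the number of copies kept in the cluster is controlled only by dfs.replication.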
Re: Data node with multiple disks
Your question is unclear. Please restate and describe what you are attempting to do. Thanks.

On Monday, May 12, 2014, Marcos Sousa falecom...@marcossousa.com wrote: Hi, I have 20 servers with 10 HD with 400GB SATA. I'd like to use them to be my datanode: /vol1/hadoop/data /vol2/hadoop/data /vol3/hadoop/data /volN/hadoop/data How do user those distinct discs not to replicate? Best regards, -- Marcos Sousa
Re: No information in Job History UI
That explains a lot. Thanks for the information. I appreciate your help.

On Mon, Mar 3, 2014 at 7:47 PM, Jian He j...@hortonworks.com wrote: You said "there are no job logs generated on the server that is running the job." - that was quoting your previous sentence and answers your question. "If I were to run a job and I wanted to tail the job log as it was running, where would I find that log?" 1) Set yarn.nodemanager.delete.debug-delay-sec to a larger value, and look for logs in the local dirs specified by yarn.nodemanager.log-dirs. Or 2) enable log aggregation with yarn.log-aggregation-enable. Log aggregation aggregates those NM-local logs and uploads them to HDFS once the application is finished. Then you can use the yarn logs command or simply go to the history UI to see the logs. You can find a good explanation at http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/ Thanks.

On Mon, Mar 3, 2014 at 4:29 PM, SF Hadoop sfhad...@gmail.com wrote: Thanks for that info, Jian. You said "there are no job logs generated on the server that is running the job." So am I correct in assuming the logs will be in the dir specified by yarn.nodemanager.log-dirs on the datanodes? I am quite confused as to where the logs for each specific part of the ecosystem reside. If I were to run a job and I wanted to tail the job log as it was running, where would I find that log? Thanks for your help.

On Mon, Mar 3, 2014 at 11:46 AM, Jian He j...@hortonworks.com wrote: Note that the node manager will not keep finished applications and only shows running apps, so its UI won't show the finished apps. Conversely, the job history server UI will only show the finished apps but not the running apps. bq. there are no job logs generated on the server that is running the job. By default, the local logs will be deleted after the job finishes. You can configure yarn.nodemanager.delete.debug-delay-sec to delay the deletion of the logs. Jian

On Mon, Mar 3, 2014 at 10:45 AM, SF Hadoop sfhad...@gmail.com wrote: Hadoop 2.2.0, CentOS 6.4, viewing the UI in various browsers. I am having a problem where no information is visible in my Job History UI. I run test jobs, they complete without error, but no information ever populates the nodemanager or jobhistory server UI. Also, there are no job logs generated on the server that is running the job. I have the following settings configured: yarn.nodemanager.local-dirs, yarn.nodemanager.log-dirs, yarn.log.server.url, ...plus the basic yarn log dir. I get output in regards to the daemons but very little in regards to the job. All I get that refers to the jobhistory server is the following (so it appears to be functioning properly):

2014-02-18 11:43:06,824 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 19888
2014-02-18 11:43:06,824 INFO org.mortbay.log: jetty-6.1.26
2014-02-18 11:43:06,847 INFO org.mortbay.log: Extract jar:file:/usr/lib/hadoop-yarn/hadoop-yarn-common-2.1.0.2.0.5.0-67.jar!/webapps/jobhistory to /tmp/Jetty_server_19888_jobhistoryv7gnnv/webapp
2014-02-18 11:43:07,085 INFO org.mortbay.log: Started SelectChannelConnector@server:19888
2014-02-18 11:43:07,085 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 19888
2014-02-18 11:43:07,477 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules

I have a feeling this is a misconfiguration but I cannot figure out what setting is missing or wrong. Other than not being able to see any of the jobs in the UIs, everything appears to be working correctly, so this is quite confusing.
Any help is appreciated.
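For reference, a minimal yarn-site.xml sketch of the two options Jian describes; the one-hour delay value is arbitrary:

  <!-- Option 1: keep NodeManager-local logs around for an hour after the job finishes -->
  <property><name>yarn.nodemanager.delete.debug-delay-sec</name><value>3600</value></property>
  <!-- Option 2: aggregate per-container logs to HDFS when the application finishes -->
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>

With aggregation enabled, logs for a finished job can be pulled with yarn logs -applicationId application_1393877688000_0001 (the id here is a placeholder) or browsed from the job history UI.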
No information in Job History UI
Hadoop 2.2.0, CentOS 6.4, viewing the UI in various browsers.

I am having a problem where no information is visible in my Job History UI. I run test jobs, they complete without error, but no information ever populates the nodemanager or jobhistory server UI. Also, there are no job logs generated on the server that is running the job. I have the following settings configured: yarn.nodemanager.local-dirs, yarn.nodemanager.log-dirs, yarn.log.server.url, ...plus the basic yarn log dir. I get output in regards to the daemons but very little in regards to the job. All I get that refers to the jobhistory server is the following (so it appears to be functioning properly):

2014-02-18 11:43:06,824 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 19888
2014-02-18 11:43:06,824 INFO org.mortbay.log: jetty-6.1.26
2014-02-18 11:43:06,847 INFO org.mortbay.log: Extract jar:file:/usr/lib/hadoop-yarn/hadoop-yarn-common-2.1.0.2.0.5.0-67.jar!/webapps/jobhistory to /tmp/Jetty_server_19888_jobhistoryv7gnnv/webapp
2014-02-18 11:43:07,085 INFO org.mortbay.log: Started SelectChannelConnector@server:19888
2014-02-18 11:43:07,085 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 19888
2014-02-18 11:43:07,477 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules

I have a feeling this is a misconfiguration but I cannot figure out what setting is missing or wrong. Other than not being able to see any of the jobs in the UIs, everything appears to be working correctly, so this is quite confusing. Any help is appreciated.
Re: No information in Job History UI
Thanks for that info, Jian. You said "there are no job logs generated on the server that is running the job." So am I correct in assuming the logs will be in the dir specified by yarn.nodemanager.log-dirs on the datanodes? I am quite confused as to where the logs for each specific part of the ecosystem reside. If I were to run a job and I wanted to tail the job log as it was running, where would I find that log? Thanks for your help.

On Mon, Mar 3, 2014 at 11:46 AM, Jian He j...@hortonworks.com wrote: Note that the node manager will not keep finished applications and only shows running apps, so its UI won't show the finished apps. Conversely, the job history server UI will only show the finished apps but not the running apps. bq. there are no job logs generated on the server that is running the job. By default, the local logs will be deleted after the job finishes. You can configure yarn.nodemanager.delete.debug-delay-sec to delay the deletion of the logs. Jian

On Mon, Mar 3, 2014 at 10:45 AM, SF Hadoop sfhad...@gmail.com wrote: Hadoop 2.2.0, CentOS 6.4, viewing the UI in various browsers. I am having a problem where no information is visible in my Job History UI. I run test jobs, they complete without error, but no information ever populates the nodemanager or jobhistory server UI. Also, there are no job logs generated on the server that is running the job. I have the following settings configured: yarn.nodemanager.local-dirs, yarn.nodemanager.log-dirs, yarn.log.server.url, ...plus the basic yarn log dir. I get output in regards to the daemons but very little in regards to the job. All I get that refers to the jobhistory server is the following (so it appears to be functioning properly):

2014-02-18 11:43:06,824 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 19888
2014-02-18 11:43:06,824 INFO org.mortbay.log: jetty-6.1.26
2014-02-18 11:43:06,847 INFO org.mortbay.log: Extract jar:file:/usr/lib/hadoop-yarn/hadoop-yarn-common-2.1.0.2.0.5.0-67.jar!/webapps/jobhistory to /tmp/Jetty_server_19888_jobhistoryv7gnnv/webapp
2014-02-18 11:43:07,085 INFO org.mortbay.log: Started SelectChannelConnector@server:19888
2014-02-18 11:43:07,085 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /jobhistory started at 19888
2014-02-18 11:43:07,477 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules

I have a feeling this is a misconfiguration but I cannot figure out what setting is missing or wrong. Other than not being able to see any of the jobs in the UIs, everything appears to be working correctly, so this is quite confusing. Any help is appreciated.
Java version with Hadoop 2.0
I am preparing to deploy multiple clusters / distros of Hadoop for testing / benchmarking. In my research I have noticed discrepancies in the version of the JDK that various groups are using. For example: Hortonworks suggests JDK 6u31; CDH recommends either 6 or 7, provided you stick to some guidelines for each; and Apache Hadoop seems to be somewhat of a no man's land, with a lot of people using a lot of different versions. Does anyone have any insight they could share about how to approach choosing the best JDK release? (I'm a total Java newb, so any info / further reading you can provide is appreciated.) Thanks. sf
Re: Java version with Hadoop 2.0
I hadn't. Thank you!!! Very helpful. Andy

On Wed, Oct 9, 2013 at 2:25 PM, Patai Sangbutsarakum patai.sangbutsara...@turn.com wrote: maybe you've already seen this. http://wiki.apache.org/hadoop/HadoopJavaVersions

On Oct 9, 2013, at 2:16 PM, SF Hadoop sfhad...@gmail.com wrote: I am preparing to deploy multiple cluster / distros of Hadoop for testing / benchmarking. In my research I have noticed discrepancies in the version of the JDK that various groups are using. Example: Hortonworks is suggesting JDK6u31, CDH recommends either 6 or 7 providing you stick to some guidelines for each and Apache Hadoop seems to be somewhat of a no mans land; a lot of people using a lot of different versions. Does anyone have any insight they could share about how to approach choosing the best JDK release? (I'm a total Java newb, so any info / further reading you guys can provide is appreciated.) Thanks. sf