RE: [Classpath Issue] ClassNotFoundException occurs when depending on the 3rd jar

2015-12-22 Thread Frank Luo
Make sure you call job.setJarByClass with the right parameter.
http://stackoverflow.com/questions/3912267/hadoop-query-regarding-setjarbyclass-method-of-job-class

Other than that, try doing 2 and 3 together just to test it out. There is no
reason it shouldn't work.
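
For reference, a minimal driver sketch (an illustration under assumptions, not the actual code from this thread: DriverA, MapperB, ReducerB, and the com.example package are made-up names). The Tool/ToolRunner wrapper matters because the -libjars option in attempt 3 below only takes effect when the driver parses its arguments through GenericOptionsParser/ToolRunner:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Hypothetical driver living in A.jar; MapperB/ReducerB live in B.jar.
    public class DriverA extends Configured implements Tool {

        @Override
        @SuppressWarnings("unchecked")
        public int run(String[] args) throws Exception {
            // getConf() carries whatever ToolRunner parsed (-libjars included).
            Job job = Job.getInstance(getConf(), "a-depends-on-b");

            // Tells Hadoop which jar to ship: the one containing this class (A.jar).
            // B.jar still has to reach the cluster via -libjars or HADOOP_CLASSPATH.
            job.setJarByClass(DriverA.class);

            // Resolve B's classes by reflection, as in the original setup.
            job.setMapperClass((Class<? extends Mapper>) Class.forName("com.example.MapperB"));
            job.setReducerClass((Class<? extends Reducer>) Class.forName("com.example.ReducerB"));

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner/GenericOptionsParser is what makes "-libjars B.jar" work.
            System.exit(ToolRunner.run(new Configuration(), new DriverA(), args));
        }
    }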

From: Todd [mailto:bit1...@163.com]
Sent: Tuesday, December 22, 2015 12:01 AM
To: user@hadoop.apache.org
Subject: [Classpath Issue] ClassNotFoundException occurs when depending on the 3rd jar

Hi,
I have two jars, A and B. A contains the class that has the main method; B
contains the mapper and reducer. A loads B's mapper and reducer through
reflection.
I am using the following commands to submit the job, but a ClassNotFoundException
for B's mapper class is thrown.
1. HADOOP_CLASSPATH=B.jar:other jars
   hadoop jar A.jar

2. HADOOP_CLASSPATH=B.jar:other jars;hadoop jar A.jar

3. hadoop jar A.jar -libjars B.jar

None of the three approaches works. Can someone help me with this? Thanks!



Kerberos authentication using username/password

2016-01-12 Thread Frank Luo
Everywhere I have searched, there are plenty of samples using a keytab file for
Kerberos authentication, but I haven't yet found any samples doing it with a
username/password. I thought it was impossible until "Toad for Hadoop" started
doing exactly that.

So does anyone know how it is implemented? In particular, I'd like to run a
Java program on Windows, authenticate against Kerberos using a username/password,
and interact with Hadoop.

The how-to for Toad for Hadoop can be found at the link below. BTW, it is a
pretty neat tool for interacting with the Hadoop world.

http://www.toadworld.com/products/toad-for-hadoop/b/weblog/archive/2015/08/10/connecting-to-your-kerberized-hadoop-environment-with-toad-for-hadoop
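
One plausible way to implement it (a sketch under assumptions, not necessarily how Toad does it: the class names below are made up, and it presumes a reachable KDC plus a valid krb5.conf) is to drive the JDK's Krb5LoginModule through JAAS with a callback handler that supplies the username/password, then hand the authenticated Subject to Hadoop's UserGroupInformation:

    import java.util.HashMap;
    import java.util.Map;
    import javax.security.auth.Subject;
    import javax.security.auth.callback.Callback;
    import javax.security.auth.callback.CallbackHandler;
    import javax.security.auth.callback.NameCallback;
    import javax.security.auth.callback.PasswordCallback;
    import javax.security.auth.login.AppConfigurationEntry;
    import javax.security.auth.login.Configuration;
    import javax.security.auth.login.LoginContext;
    import org.apache.hadoop.security.UserGroupInformation;

    public class PasswordKerberosLogin {

        // Programmatic JAAS config: one entry using the JDK's Krb5LoginModule,
        // prompting through our callback handler instead of reading a keytab.
        static class PasswordJaasConf extends Configuration {
            @Override
            public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
                Map<String, String> opts = new HashMap<>();
                opts.put("useKeyTab", "false");
                opts.put("useTicketCache", "false");
                opts.put("doNotPrompt", "false");
                return new AppConfigurationEntry[] {
                    new AppConfigurationEntry(
                        "com.sun.security.auth.module.Krb5LoginModule",
                        AppConfigurationEntry.LoginModuleControlFlag.REQUIRED,
                        opts)
                };
            }
        }

        public static void main(String[] args) throws Exception {
            final String user = args[0];               // e.g. "frank@EXAMPLE.COM"
            final char[] pass = args[1].toCharArray(); // demo only; don't pass secrets via argv

            // Feed the username/password to Krb5LoginModule when it asks.
            CallbackHandler handler = callbacks -> {
                for (Callback cb : callbacks) {
                    if (cb instanceof NameCallback) {
                        ((NameCallback) cb).setName(user);
                    } else if (cb instanceof PasswordCallback) {
                        ((PasswordCallback) cb).setPassword(pass);
                    }
                }
            };

            LoginContext lc = new LoginContext("pw-login", new Subject(), handler,
                    new PasswordJaasConf());
            lc.login(); // obtains a TGT from the KDC using the password

            // Hand the authenticated Subject to Hadoop's security layer.
            org.apache.hadoop.conf.Configuration hadoopConf =
                    new org.apache.hadoop.conf.Configuration();
            hadoopConf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(hadoopConf);
            UserGroupInformation.loginUserFromSubject(lc.getSubject());

            System.out.println("Logged in as " + UserGroupInformation.getLoginUser());
        }
    }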



RE: how to use Yarn API to find task/attempt status

2016-03-09 Thread Frank Luo
Let's say there are 10 standard M/R jobs running. How do I find out how many
tasks are done/running/pending?

From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Wednesday, March 09, 2016 9:33 PM
To: Frank Luo
Cc: user@hadoop.apache.org
Subject: Re: how to use Yarn API to find task/attempt status

I don't think it is related to YARN. YARN doesn't know about tasks/task attempts;
it only knows about containers. So it would be up to your application to provide
such a function.
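
For standard M/R jobs, one way to do that counting from the application side is through the MapReduce client API rather than YarnClient. A rough sketch: the class name is made up, and it assumes the cluster's mapred-site.xml/yarn-site.xml are on the client classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Cluster;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobStatus;
    import org.apache.hadoop.mapreduce.TaskReport;
    import org.apache.hadoop.mapreduce.TaskType;

    // Hypothetical helper: counts map/reduce tasks per state across all jobs
    // currently known to the cluster.
    public class TaskStateCounter {
        public static void main(String[] args) throws Exception {
            Cluster cluster = new Cluster(new Configuration());
            for (JobStatus status : cluster.getAllJobStatuses()) {
                Job job = cluster.getJob(status.getJobID());
                if (job == null) continue; // job may have been retired
                int pending = 0, running = 0, done = 0;
                for (TaskType type : new TaskType[] {TaskType.MAP, TaskType.REDUCE}) {
                    for (TaskReport report : job.getTaskReports(type)) {
                        switch (report.getCurrentStatus()) {
                            case PENDING:  pending++; break;
                            case RUNNING:  running++; break;
                            case COMPLETE: done++;    break;
                            default:       break;     // KILLED / FAILED
                        }
                    }
                }
                System.out.printf("%s: pending=%d running=%d done=%d%n",
                        status.getJobID(), pending, running, done);
            }
        }
    }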

On Thu, Mar 10, 2016 at 11:29 AM, Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Anyone had a similar issue and knows the answer?

From: Frank Luo
Sent: Wednesday, March 09, 2016 1:59 PM
To: 'user@hadoop.apache.org<mailto:user@hadoop.apache.org>'
Subject: how to use Yarn API to find task/attempt status

I have a need to programmatically find out how many tasks are pending in YARN.
Is there a way to do it through a Java API?

I looked at YarnClient, but I was not able to find what I need.

Thx in advance.

Frank Luo




--
Best Regards

Jeff Zhang



how to use Yarn API to find task/attempt status

2016-03-09 Thread Frank Luo
I have a need to programmatically find out how many tasks are pending in YARN.
Is there a way to do it through a Java API?

I looked at YarnClient, but I was not able to find what I need.

Thx in advance.

Frank Luo



anyone seen this weird "setXIncludeAware is not supported" error?

2016-07-29 Thread Frank Luo
OK, this is driving me nuts.

I have a JUnit test case as simple as the one below:

    @Before
    public void setup() throws IOException {
        Job job = Job.getInstance();
        Configuration config = job.getConfiguration();

And I get an exception at Job.getInstance():

    java.lang.UnsupportedOperationException: setXIncludeAware is not supported on this JAXP implementation or earlier: class org.apache.xerces.jaxp.DocumentBuilderFactoryImpl
        at javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:614)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2523)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:981)
        at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:2069)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:447)
        at org.apache.hadoop.mapreduce.Job.getInstance(Job.java:175)
        at org.apache.hadoop.mapreduce.Job.getInstance(Job.java:156)
        at com.merkleinc.crkb.match.keymatching.KeyMatchingTest.setup(KeyMatchingTest.java:52)

What is strange is that the same code works on Windows but not on Linux. And
even on Linux, only one class has the problem; other classes running the exact
same code are fine. In the problematic class there are four test methods: one
succeeds and three fail.

Has anyone had a similar experience?


I have Hadoop 2.7.1, Hive 1.2.1, and HBase 1.1.2, and I am building with Maven
3.3.3 and/or 3.3.9.
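
For what it is worth, this error usually means an old standalone Xerces jar on the test classpath is shadowing the JDK's JAXP implementation, which would also explain why only some environments and classes are affected. One hedged workaround (an assumption, not a confirmed fix for this case) is to pin the factory to the JDK's implementation before Hadoop's Configuration parses any XML; the cleaner fix may be excluding the offending xercesImpl dependency in Maven:

    import org.junit.BeforeClass;

    public class KeyMatchingTest {
        // Workaround sketch: force the JDK's built-in JAXP factory so an old
        // xercesImpl jar on the classpath cannot be picked up by Hadoop's
        // Configuration XML parsing. Must run before the first Job.getInstance().
        @BeforeClass
        public static void pinJaxpImplementation() {
            System.setProperty(
                "javax.xml.parsers.DocumentBuilderFactory",
                "com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl");
        }
    }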



RE: how to add a shareable node label?

2016-10-05 Thread Frank Luo
Sunil, thanks for responding.

So is there any way to dedicate one kind of job to certain machines, and then
have those machines be shared when no dedicated job is running?

From: Sunil Govind [mailto:sunil.gov...@gmail.com]
Sent: Wednesday, October 05, 2016 12:50 AM
To: Frank Luo <j...@merkleinc.com>; user@hadoop.apache.org; u...@yarn.apache.org
Subject: Re: how to add a shareable node label?

Hi Frank,

As far as I checked, all labels are "exclusive" in 2.7. The upcoming 2.8
release will bring "non-exclusive", i.e. sharable, node labels.

Thanks
Sunil

On Wed, Oct 5, 2016 at 8:40 AM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
I am using Hadoop 2.7.3. When I run:
$ yarn rmadmin -addToClusterNodeLabels "Label1(exclusive=false)"

I got an error as:

… addToClusterNodeLabels: java.io.IOException: label name should only contains 
{0-9, a-z, A-Z, -, _} and should not started with {-,_}

If I just use “Label1”, it will work fine, but I want a shareable one.

Anyone knows a better way to do it?



RE: how to add a shareable node label?

2016-10-07 Thread Frank Luo
Sunil,

Your description pretty much matches my understanding, except for "Job_A will
have to run as per its schedule w/o any delay". My situation is that Job_A can
be delayed; as long as it runs in queueA, I am happy.

Just as you said, processes normally running in queueB might not be
preemptable. So if they overflow to queueA and then get preempted, that is not
good.

From: Sunil Govind [mailto:sunil.gov...@gmail.com]
Sent: Friday, October 07, 2016 10:50 AM
To: Frank Luo <j...@merkleinc.com>; user@hadoop.apache.org
Subject: Re: how to add a shareable node label?

HI Frank

Thanks for the details.

I am not quite sure I understood your problem correctly. I think you are
looking for a solution that ensures Job_A runs on its schedule w/o any delay,
while also not wasting resources on those high-end machines where Job_A runs.

I think you still need node label exclusivity here, since there is a h/w
dependency. But if you have two queues which can both use "labelA", then
"Job_A" can always be planned to run in one of them, say "queueA", and other
jobs can run in "queueB". So if you tune capacities, and preemption is enabled
per queue level, over-utilized resources used by "queueB" could be preempted
for "Job_A".

But if your sharable jobs are Linux jobs which should not be preempted, then
this may be only half a solution.

Thanks
Sunil

On Fri, Oct 7, 2016 at 7:36 AM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Sunil,

You confirmed my understanding. I got that understanding from reading the docs;
I haven't actually tried 2.8 or 3.0-alpha1.

My situation is that I am in a multi-tenant env and have several very powerful
machines with expensive licenses to run a particular Linux job, let's say
Job_A. But the job runs infrequently, so I want to let other jobs use those
machines when Job_A is not running. In the meantime, I am not powerful enough
to force all other jobs to be preemptable. As a matter of fact, I know they
have Hadoop jobs inserting into SQL Server, or just pure Linux jobs, that are
not preemptable in nature. So preempting jobs is not an option for me.

I hope it makes sense.

Frank

From: Sunil Govind 
[mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Thursday, October 06, 2016 2:15 PM

To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; 
user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: how to add a shareable node label?

HI Frank

Ideally those containers will be preempted if there is unsatisfied demand for
the "configured label".

Let me explain: "labelA" has a few free resources, while all nodes under the
"default" label are in use. A new application submitted to the "default" label
would normally have to wait; but if "labelA" is non-exclusive and has free
resources, the new application can run on "labelA". However, if more new apps
are then submitted to "labelA" and no more resources are available there,
containers may be preempted from the app that was borrowing them earlier.

Maybe you could share some more information so that it becomes clearer. Also,
I suppose you are running this on the Hadoop 3 alpha1 release; please correct
me if I'm wrong.

Thanks
Sunil

On Thu, Oct 6, 2016 at 9:44 PM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Thanks Sunil.


> 3. If there is any future ask for those resources, we will preempt the
> non-labeled apps and give them back to labeled apps.

Unfortunately, I am still not able to use it because of the preemptive
behavior. The jobs that steal labelled resources are not preemptable, and I'd
rather wait than kill.

From: Sunil Govind 
[mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Thursday, October 06, 2016 1:59 AM

To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; 
user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: how to add a shareable node label?

Hi Frank
I think as of today this is not possible. You could try the "non-exclusive"
feature of node labels, which will officially come in 2.8 soon, or you can try
it in the "Hadoop 3 alpha1" release if that is fine to check.
YARN-3214<https://issues.apache.org/jira/browse/YARN-3214> has the details of
the node-label sharing concept.

Thanks
Sunil

On Wed, Oct 5, 2016 at 8:14 PM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Sunil, thanks for responding.

So is there any way to dedicate one kind of job to certain machines, and then
have those machines be shared when no dedicated job is running?

From: Sunil Govind 
[mailto:sunil.gov...@gmail.com&

RE: how to add a shareable node label?

2016-10-07 Thread Frank Luo
That is correct, Sunil.

Just to confirm,  the Node Labeling feature on 2.8 or 3.0 alpha won’t satisfy 
my need, right?

From: Sunil Govind [mailto:sunil.gov...@gmail.com]
Sent: Friday, October 07, 2016 12:09 PM
To: Frank Luo <j...@merkleinc.com>; user@hadoop.apache.org
Subject: Re: how to add a shareable node label?

HI Frank

In that case, preemption may not be needed. The over-utilizing workloads from
queueB will keep running until they complete. Since queueA is underserved, any
next free container could go to queueA, which is for Job_A.

Thanks
Sunil

On Fri, Oct 7, 2016 at 9:58 PM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Sunil,

Your description pretty much matches my understanding, except for "Job_A will
have to run as per its schedule w/o any delay". My situation is that Job_A can
be delayed; as long as it runs in queueA, I am happy.

Just as you said, processes normally running in queueB might not be
preemptable. So if they overflow to queueA and then get preempted, that is not
good.

From: Sunil Govind 
[mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Friday, October 07, 2016 10:50 AM

To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; 
user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: how to add a shareable node label?

HI Frank

Thanks for the details.

I am not quite sure I understood your problem correctly. I think you are
looking for a solution that ensures Job_A runs on its schedule w/o any delay,
while also not wasting resources on those high-end machines where Job_A runs.

I think you still need node label exclusivity here, since there is a h/w
dependency. But if you have two queues which can both use "labelA", then
"Job_A" can always be planned to run in one of them, say "queueA", and other
jobs can run in "queueB". So if you tune capacities, and preemption is enabled
per queue level, over-utilized resources used by "queueB" could be preempted
for "Job_A".

But if your sharable jobs are Linux jobs which should not be preempted, then
this may be only half a solution.

Thanks
Sunil

On Fri, Oct 7, 2016 at 7:36 AM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Sunil,

You confirmed my understanding. I got that understanding from reading the docs;
I haven't actually tried 2.8 or 3.0-alpha1.

My situation is that I am in a multi-tenant env and have several very powerful
machines with expensive licenses to run a particular Linux job, let's say
Job_A. But the job runs infrequently, so I want to let other jobs use those
machines when Job_A is not running. In the meantime, I am not powerful enough
to force all other jobs to be preemptable. As a matter of fact, I know they
have Hadoop jobs inserting into SQL Server, or just pure Linux jobs, that are
not preemptable in nature. So preempting jobs is not an option for me.

I hope it makes sense.

Frank

From: Sunil Govind 
[mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Thursday, October 06, 2016 2:15 PM

To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; 
user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: how to add a shareable node label?

HI Frank

Ideally those containers will be preempted if there is unsatisfied demand for
the "configured label".

Let me explain: "labelA" has a few free resources, while all nodes under the
"default" label are in use. A new application submitted to the "default" label
would normally have to wait; but if "labelA" is non-exclusive and has free
resources, the new application can run on "labelA". However, if more new apps
are then submitted to "labelA" and no more resources are available there,
containers may be preempted from the app that was borrowing them earlier.

Maybe you could share some more information so that it becomes clearer. Also,
I suppose you are running this on the Hadoop 3 alpha1 release; please correct
me if I'm wrong.

Thanks
Sunil

On Thu, Oct 6, 2016 at 9:44 PM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Thanks Sunil.


> 3. If there is any future ask for those resources, we will preempt the
> non-labeled apps and give them back to labeled apps.

Unfortunately, I am still not able to use it because of the preemptive
behavior. The jobs that steal labelled resources are not preemptable, and I'd
rather wait than kill.

From: Sunil Govind 
[mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Thursday, October 06, 2016 1:59 AM

To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; 
user@hadoop.apache.org<mailto:user@hado

RE: how to add a shareable node label?

2016-10-06 Thread Frank Luo
Sunil,

You confirmed my understanding. I got that understanding from reading the docs;
I haven't actually tried 2.8 or 3.0-alpha1.

My situation is that I am in a multi-tenant env and have several very powerful
machines with expensive licenses to run a particular Linux job, let's say
Job_A. But the job runs infrequently, so I want to let other jobs use those
machines when Job_A is not running. In the meantime, I am not powerful enough
to force all other jobs to be preemptable. As a matter of fact, I know they
have Hadoop jobs inserting into SQL Server, or just pure Linux jobs, that are
not preemptable in nature. So preempting jobs is not an option for me.

I hope it makes sense.

Frank

From: Sunil Govind [mailto:sunil.gov...@gmail.com]
Sent: Thursday, October 06, 2016 2:15 PM
To: Frank Luo <j...@merkleinc.com>; user@hadoop.apache.org
Subject: Re: how to add a shareable node label?

HI Frank

Ideally those containers will be preempted if there is unsatisfied demand for
the "configured label".

Let me explain: "labelA" has a few free resources, while all nodes under the
"default" label are in use. A new application submitted to the "default" label
would normally have to wait; but if "labelA" is non-exclusive and has free
resources, the new application can run on "labelA". However, if more new apps
are then submitted to "labelA" and no more resources are available there,
containers may be preempted from the app that was borrowing them earlier.

Maybe you could share some more information so that it becomes clearer. Also,
I suppose you are running this on the Hadoop 3 alpha1 release; please correct
me if I'm wrong.

Thanks
Sunil

On Thu, Oct 6, 2016 at 9:44 PM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Thanks Sunil.


> 3. If there is any future ask for those resources, we will preempt the
> non-labeled apps and give them back to labeled apps.

Unfortunately, I am still not able to use it because of the preemptive
behavior. The jobs that steal labelled resources are not preemptable, and I'd
rather wait than kill.

From: Sunil Govind 
[mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Thursday, October 06, 2016 1:59 AM

To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; 
user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: how to add a shareable node label?

Hi Frank
I think as of today this is not possible. You could try the "non-exclusive"
feature of node labels, which will officially come in 2.8 soon, or you can try
it in the "Hadoop 3 alpha1" release if that is fine to check.
YARN-3214<https://issues.apache.org/jira/browse/YARN-3214> has the details of
the node-label sharing concept.

Thanks
Sunil

On Wed, Oct 5, 2016 at 8:14 PM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Sunil, thanks for responding.

So is there any way to dedicate one kind of job to certain machines, and then
have those machines be shared when no dedicated job is running?

From: Sunil Govind 
[mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Wednesday, October 05, 2016 12:50 AM
To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; 
user@hadoop.apache.org<mailto:user@hadoop.apache.org>; 
u...@yarn.apache.org<mailto:u...@yarn.apache.org>

Subject: Re: how to add a shareable node label?

Hi Frank,

As far as I checked, all labels are "exclusive" in 2.7. The upcoming 2.8
release will bring "non-exclusive", i.e. sharable, node labels.

Thanks
Sunil

On Wed, Oct 5, 2016 at 8:40 AM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
I am using Hadoop 2.7.3. When I run:
$ yarn rmadmin -addToClusterNodeLabels "Label1(exclusive=false)"

I got an error as:

… addToClusterNodeLabels: java.io.IOException: label name should only contains 
{0-9, a-z, A-Z, -, _} and should not started with {-,_}

If I just use “Label1”, it will work fine, but I want a shareable one.

Anyone knows a better way to do it?


how to add a shareable node label?

2016-10-04 Thread Frank Luo
I am using Hadoop 2.7.3. When I run:
$ yarn rmadmin -addToClusterNodeLabels "Label1(exclusive=false)"

I got an error as:

… addToClusterNodeLabels: java.io.IOException: label name should only contains 
{0-9, a-z, A-Z, -, _} and should not started with {-,_}

If I just use “Label1”, it will work fine, but I want a shareable one.

Anyone knows a better way to do it?



RE: how to add a shareable node label?

2016-10-06 Thread Frank Luo
Thanks Sunil.


> 3. If there is any future ask for those resources, we will preempt the
> non-labeled apps and give them back to labeled apps.

Unfortunately, I am still not able to use it because of the preemptive
behavior. The jobs that steal labelled resources are not preemptable, and I'd
rather wait than kill.

From: Sunil Govind [mailto:sunil.gov...@gmail.com]
Sent: Thursday, October 06, 2016 1:59 AM
To: Frank Luo <j...@merkleinc.com>; user@hadoop.apache.org
Subject: Re: how to add a shareable node label?

Hi Frank
I think as of today this is not possible. You could try the "non-exclusive"
feature of node labels, which will officially come in 2.8 soon, or you can try
it in the "Hadoop 3 alpha1" release if that is fine to check.
YARN-3214<https://issues.apache.org/jira/browse/YARN-3214> has the details of
the node-label sharing concept.

Thanks
Sunil

On Wed, Oct 5, 2016 at 8:14 PM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Sunil, thanks for responding.

So is there any way to dedicate one kind of job to certain machines, and then
have those machines be shared when no dedicated job is running?

From: Sunil Govind 
[mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Wednesday, October 05, 2016 12:50 AM
To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; 
user@hadoop.apache.org<mailto:user@hadoop.apache.org>; 
u...@yarn.apache.org<mailto:u...@yarn.apache.org>

Subject: Re: how to add a shareable node label?

Hi Frank,

As far as I checked, all labels are "exclusive" in 2.7. The upcoming 2.8
release will bring "non-exclusive", i.e. sharable, node labels.

Thanks
Sunil

On Wed, Oct 5, 2016 at 8:40 AM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
I am using Hadoop 2.7.3. When I run:
$ yarn rmadmin -addToClusterNodeLabels "Label1(exclusive=false)"

I got an error as:

… addToClusterNodeLabels: java.io.IOException: label name should only contains 
{0-9, a-z, A-Z, -, _} and should not started with {-,_}

If I just use “Label1”, it will work fine, but I want a shareable one.

Anyone knows a better way to do it?


RE: how to add a shareable node label?

2016-10-12 Thread Frank Luo
Thanks, Sunil. It makes a lot of sense. I will try it out.

From: Sunil Govind [mailto:sunil.gov...@gmail.com]
Sent: Wednesday, October 12, 2016 9:21 AM
To: Frank Luo <j...@merkleinc.com>; user@hadoop.apache.org
Subject: Re: how to add a shareable node label?

Hi Frank

Thanks for sharing more details. Let me try this combination (I might be
wrong, so please correct me). I think a sharable node label could help here.

Labels:
* node1-4 = "default" label
* node8-9 = "special" label

Queues:
* "ProdQ": accessible-labels is "" (only the default label)
* "TestQ": accessible-labels is "" (only the default label)
* "LabeledQ": accessible-labels is "special"

Capacity per queue:
* "ProdQ": capacity=50%, max-capacity=100%
* "TestQ": capacity=50%, max-capacity=50%
* "LabeledQ": special.capacity=100%, special.max-capacity=100%

Behavior:
* Jobs in ProdQ are assured 50% of the default label's resources and can go up
to 100% if nothing is running in TestQ.
* Jobs in TestQ can only get 50% of the default label's resources.
* Jobs in ProdQ or TestQ can use the "special" label machines only when there
are free resources under the "special" label; "special" is a non-exclusive
label which can share its resources with the "default" label.
* Any job submitted to "LabeledQ" is assured 100% of the special resources and
can use all of them if nothing else is there. I think preemption could be made
optional here.

If inter-queue preemption is enabled, we can enforce a normalization faster
for the default label; otherwise apps might need to wait. We could also try
another approach, as I shared in an earlier mail, but it reserves some % of
resources for ProdQ and TestQ inside LabeledQ, which may not be suitable.

Thanks
Sunil
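
Expressed as Capacity Scheduler properties, the layout above might look like the following sketch (normally these keys live in capacity-scheduler.xml; they are shown here as plain Configuration keys for brevity, and the queue/label names come from this thread, so treat it as an assumption-laden illustration):

    import org.apache.hadoop.conf.Configuration;

    // A sketch of the capacity-scheduler keys Sunil's layout implies.
    public class QueueLayoutSketch {
        public static Configuration build() {
            Configuration conf = new Configuration(false);
            conf.set("yarn.scheduler.capacity.root.queues", "ProdQ,TestQ,LabeledQ");
            // Default-partition capacities (must sum to 100 across root's children).
            conf.set("yarn.scheduler.capacity.root.ProdQ.capacity", "50");
            conf.set("yarn.scheduler.capacity.root.ProdQ.maximum-capacity", "100");
            conf.set("yarn.scheduler.capacity.root.TestQ.capacity", "50");
            conf.set("yarn.scheduler.capacity.root.TestQ.maximum-capacity", "50");
            conf.set("yarn.scheduler.capacity.root.LabeledQ.capacity", "0");
            // Only LabeledQ may use the "special" label, and it gets all of it.
            conf.set("yarn.scheduler.capacity.root.LabeledQ.accessible-node-labels", "special");
            conf.set("yarn.scheduler.capacity.root.LabeledQ.accessible-node-labels.special.capacity", "100");
            conf.set("yarn.scheduler.capacity.root.LabeledQ.accessible-node-labels.special.maximum-capacity", "100");
            return conf;
        }
    }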

On Tue, Oct 11, 2016 at 10:38 PM Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Hah, how so? I am confused, as I was under the impression that I needed
sharing but not preemption.

Let's model this out.

Assume I have 4 "normal" machines, node1-4, and two special machines, node8
and node9, where JobA can be executed.

And I need two queues, ProdQ and TestQ, equally sharing node1-4, and a
"LabeledQ" with node8/9.

When ProdQ is full, it can overflow to TestQ and further to LabeledQ. If TestQ
is full, the tasks stay in TestQ, or optionally overflow to LabeledQ (either
way is fine as long as it doesn't go to ProdQ). And when JobA is running, it
can only go to LabeledQ. If something else is on LabeledQ, JobA waits.

Do you mind illustrating how to configure the queues to achieve what I am
looking for?

Thank you Sunil.

From: Sunil Govind [mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Tuesday, October 11, 2016 11:44 AM
To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: how to add a shareable node label?

Hi Frank

Extremely sorry for the delay.

Yes, you are correct: the sharing feature of node labels is not needed in your
case. Existing node labels and a queue model could solve the problem.

Thanks
Sunil

On Fri, Oct 7, 2016 at 11:59 PM Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
That is correct, Sunil.

Just to confirm, the Node Labeling feature in 2.8 or 3.0 alpha won't satisfy
my need, right?

From: Sunil Govind [mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Friday, October 07, 2016 12:09 PM
To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: how to add a shareable node label?

HI Frank

In that case, preemption may not be needed. The over-utilizing workloads from
queueB will keep running until they complete. Since queueA is underserved, any
next free container could go to queueA, which is for Job_A.

Thanks
Sunil

On Fri, Oct 7, 2016 at 9:58 PM Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Sunil,

Your description pretty much matches my understanding, except for "Job_A will
have to run as per its schedule w/o any delay". My situation is that Job_A can
be delayed; as long as it runs in queueA, I am happy.

Just as you said, processes normally running in queueB might not be
preemptable. So if they overflow to queueA and then get preempted, that is not
good.

From: Sunil Govind [mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Friday, October 07, 2016 10:50 AM
To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: how to add a shareable node label?

HI 

RE: how to add a shareable node label?

2016-10-11 Thread Frank Luo
Hah, how so? I am confused, as I was under the impression that I needed
sharing but not preemption.

Let's model this out.

Assume I have 4 "normal" machines, node1-4, and two special machines, node8
and node9, where JobA can be executed.

And I need two queues, ProdQ and TestQ, equally sharing node1-4, and a
"LabeledQ" with node8/9.

When ProdQ is full, it can overflow to TestQ and further to LabeledQ. If TestQ
is full, the tasks stay in TestQ, or optionally overflow to LabeledQ (either
way is fine as long as it doesn't go to ProdQ). And when JobA is running, it
can only go to LabeledQ. If something else is on LabeledQ, JobA waits.

Do you mind illustrating how to configure the queues to achieve what I am
looking for?

Thank you Sunil.

From: Sunil Govind [mailto:sunil.gov...@gmail.com]
Sent: Tuesday, October 11, 2016 11:44 AM
To: Frank Luo <j...@merkleinc.com>; user@hadoop.apache.org
Subject: Re: how to add a shareable node label?

Hi Frank

Extremely sorry for the delay.

Yes, you are correct: the sharing feature of node labels is not needed in your
case. Existing node labels and a queue model could solve the problem.

Thanks
Sunil

On Fri, Oct 7, 2016 at 11:59 PM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
That is correct, Sunil.

Just to confirm,  the Node Labeling feature on 2.8 or 3.0 alpha won’t satisfy 
my need, right?

From: Sunil Govind 
[mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Friday, October 07, 2016 12:09 PM

To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; 
user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: how to add a shareable node label?

HI Frank

In that case, preemption may not be needed. The over-utilizing workloads from
queueB will keep running until they complete. Since queueA is underserved, any
next free container could go to queueA, which is for Job_A.

Thanks
Sunil

On Fri, Oct 7, 2016 at 9:58 PM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Sunil,

Your description pretty much matches my understanding, except for "Job_A will
have to run as per its schedule w/o any delay". My situation is that Job_A can
be delayed; as long as it runs in queueA, I am happy.

Just as you said, processes normally running in queueB might not be
preemptable. So if they overflow to queueA and then get preempted, that is not
good.

From: Sunil Govind 
[mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Friday, October 07, 2016 10:50 AM

To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; 
user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: how to add a shareable node label?

HI Frank

Thanks for the details.

I am not quite sure I understood your problem correctly. I think you are
looking for a solution that ensures Job_A runs on its schedule w/o any delay,
while also not wasting resources on those high-end machines where Job_A runs.

I think you still need node label exclusivity here, since there is a h/w
dependency. But if you have two queues which can both use "labelA", then
"Job_A" can always be planned to run in one of them, say "queueA", and other
jobs can run in "queueB". So if you tune capacities, and preemption is enabled
per queue level, over-utilized resources used by "queueB" could be preempted
for "Job_A".

But if your sharable jobs are Linux jobs which should not be preempted, then
this may be only half a solution.

Thanks
Sunil

On Fri, Oct 7, 2016 at 7:36 AM Frank Luo 
<j...@merkleinc.com<mailto:j...@merkleinc.com>> wrote:
Sunil,

You confirmed my understanding. I got that understanding from reading the docs;
I haven't actually tried 2.8 or 3.0-alpha1.

My situation is that I am in a multi-tenant env and have several very powerful
machines with expensive licenses to run a particular Linux job, let's say
Job_A. But the job runs infrequently, so I want to let other jobs use those
machines when Job_A is not running. In the meantime, I am not powerful enough
to force all other jobs to be preemptable. As a matter of fact, I know they
have Hadoop jobs inserting into SQL Server, or just pure Linux jobs, that are
not preemptable in nature. So preempting jobs is not an option for me.

I hope it makes sense.

Frank

From: Sunil Govind 
[mailto:sunil.gov...@gmail.com<mailto:sunil.gov...@gmail.com>]
Sent: Thursday, October 06, 2016 2:15 PM

To: Frank Luo <j...@merkleinc.com<mailto:j...@merkleinc.com>>; 
user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: how to add a shareable node label?

HI Frank

Ideally those containers will be preempted if there is unsatisfied demand for
the "configured label".

Let me explain: "labelA" has a few free resources. All nodes under "

RE: Help me understand hadoop caching behavior

2017-12-27 Thread Frank Luo
First, Hadoop itself doesn't do any caching by default.

Secondly, if it is a mapper-only job, the data doesn't go through the network,
because map tasks are scheduled on the nodes that hold the blocks.

So look somewhere else; the operating system's page cache on the datanodes is
the usual suspect.
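
If you want to see where the bytes are actually coming from, the Java client exposes per-stream read statistics that split total, local, and short-circuit bytes (a sketch under assumptions: the class name is made up, it presumes an HDFS path plus Hadoop 2.x client libraries, and libhdfs has a similar hdfsFileGetReadStatistics call on the C side):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DFSInputStream;
    import org.apache.hadoop.hdfs.client.HdfsDataInputStream;

    public class ReadStatsProbe {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            byte[] buf = new byte[1 << 20];
            try (FSDataInputStream in = fs.open(new Path(args[0]))) {
                while (in.read(buf) > 0) { /* drain the file */ }
                if (in instanceof HdfsDataInputStream) {
                    DFSInputStream.ReadStatistics stats =
                            ((HdfsDataInputStream) in).getReadStatistics();
                    // Local + short-circuit bytes would explain "no network traffic".
                    System.out.println("total bytes:         " + stats.getTotalBytesRead());
                    System.out.println("local bytes:         " + stats.getTotalLocalBytesRead());
                    System.out.println("short-circuit bytes: " + stats.getTotalShortCircuitBytesRead());
                }
            }
        }
    }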

From: Avery, John [mailto:jav...@akamai.com]
Sent: Wednesday, December 27, 2017 3:20 PM
To: user@hadoop.apache.org
Subject: Help me understand hadoop caching behavior

I'm writing a program using the C API for Hadoop. I have a 4-node cluster (set
up according to https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm).
Of the 4 nodes, one is the namenode and a datanode; the others are datanodes
(one of which is also the secondary namenode).

I've already managed to write about 1.5TB of data to the cluster. My issue is
reading data back; specifically, it's too fast. *Way* too fast, and I don't
understand how or why. The 1.5TB is stored in about 20,000 60-80MB files. When
I read the files back (7 files in parallel) I get read speeds in excess of
75GB/s. Obviously this is DRAM speed, and here's the problem: each of the 4
nodes only has 32GB of RAM, and I'm asking Hadoop to re-read over 400GB of
data. I am using the read-back data, so it isn't the compiler optimizing
something out; even with optimization flags turned off, it still runs 10x
faster than the network/disks on this box could deliver.

Specifically: 2x10Gb network ports, bonded; maximum network input 2.5GB/s
(test verified).
16x 4TB hard drives: 2GB/s maximum throughput (test verified, outside of
Hadoop).

As for how I’m reading my data, hdfsOpenFile(…,O_RDONLY) and hdfsRead().

So, at best, I should get 4.5GB/s, and that's in a perfect world. But during
my tests I see no network traffic and very little (~30-70MB/s) disk IO. Yet it
manages to return 300GB of unique data to me (the data is real, not a pattern,
and not something particularly compressible or dedupable).

I'm at a complete loss as to how 300GB of data is getting sent to me so
quickly?! I feel like I'm overlooking something trivial. I'm specifically
asking for 10x the system's memory (and over 2x the cluster's memory!) in
order to *prevent* caching from polluting my numbers, yet it's doing something
that should be impossible. I fully expect to facepalm at the end of this.

Oh, and here's the really weird part (to me): if I request all 20,000 files,
it zooms past the 5,000 I have cached from my 400GB read test and then slows
down to a more realistic 2GB/s for the rest of the files. Until I re-run the
program a second time... then it returns a result in something like 35 seconds
instead of 5 minutes!!!
