Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Diwakar Dhanuskodi
Are you using YARN to run Spark jobs only? Are you configuring Spark properties in spark-submit parameters? If so, did you try with --num-executors x*53 (where x is the number of nodes), --executor-memory 1g and --driver-memory 1g? You might see YARN allocating
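For reference, a minimal sketch of the equivalent programmatic settings (the flag-to-property mapping is standard spark-submit behavior; the node count x = 3 is a made-up placeholder):

    import org.apache.spark.SparkConf

    val x = 3 // hypothetical number of nodes
    val conf = new SparkConf()
      .set("spark.executor.instances", (x * 53).toString) // --num-executors
      .set("spark.executor.memory", "1g")                 // --executor-memory
      .set("spark.driver.memory", "1g")                   // --driver-memory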

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
I meant Jonathan. On Tue, Feb 9, 2016 at 10:41 AM, Alexander Pivovarov wrote: > I decided to do YARN over-commit and add 896 > to yarn.nodemanager.resource.memory-mb > it was 54,272 > now I set it to 54,272+896 = 55,168 > > Kelly, can I ask you a couple of questions > 1. it is

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
I decided to do YARN over-commit and add 896 to yarn.nodemanager.resource.memory-mb: it was 54,272; now I set it to 54,272 + 896 = 55,168. Kelly, can I ask you a couple of questions? 1. Is it possible to add a YARN label to particular instance group boxes on EMR? 2. In addition to maximizeResourceAllocation

Re: Preserving partitioning with dataframe select

2016-02-09 Thread Michael Armbrust
RDD level partitioning information is not used to decide when to shuffle for queries planned using Catalyst (since we have better information about distribution from the query plan itself). Instead you should be looking at the logic in EnsureRequirements

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
Thanks, Jonathan. Actually I'd like to use maximizeResourceAllocation. Ideal for me would be to add a new instance group having a single small box labelled as AM. I'm not sure "aws emr create-cluster" supports setting custom LABELS; the only settings available are:

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
Interesting, I was not aware of spark.yarn.am.nodeLabelExpression. We do use YARN labels on EMR; each node is automatically labeled with its type (MASTER, CORE, or TASK). And we do set yarn.app.mapreduce.am.labels=CORE in yarn-site.xml, but we do not set spark.yarn.am.nodeLabelExpression. Does

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Sean Owen
If it's too small to run an executor, I'd think it would be chosen for the AM as the only way to satisfy the request. On Tue, Feb 9, 2016 at 8:35 AM, Alexander Pivovarov wrote: > If I add additional small box to the cluster can I configure yarn to select > small box to run

Re: Long running Spark job on YARN throws "No AMRMToken"

2016-02-09 Thread Steve Loughran
On 9 Feb 2016, at 05:55, Prabhu Joseph wrote: + Spark-Dev On Tue, Feb 9, 2016 at 10:04 AM, Prabhu Joseph wrote: Hi All, A long running Spark job on YARN throws

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
If I add an additional small box to the cluster, can I configure YARN to select the small box to run the AM container? On Mon, Feb 8, 2016 at 10:53 PM, Sean Owen wrote: > Typically YARN is there because you're mediating resource requests > from things besides Spark, so yeah using every

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread praveen S
How about running in client mode, so that the client from which it is run becomes the driver. Regards, Praveen On 9 Feb 2016 16:59, "Steve Loughran" wrote: > > > On 9 Feb 2016, at 06:53, Sean Owen wrote: > > > > > > I think you can let YARN

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Steve Loughran
> On 9 Feb 2016, at 06:53, Sean Owen wrote: > > > I think you can let YARN over-commit RAM though, and allocate more > memory than it actually has. It may be beneficial to let them all > think they have an extra GB, and let one node running the AM > technically be

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
Sean, I'm not sure if that's actually the case, since the AM would be allocated before the executors are even requested (by the driver through the AM), right? This must at least be the case with dynamicAllocation enabled, but I would expect that it's true regardless. However, Alex, yes, this

Re: Long running Spark job on YARN throws "No AMRMToken"

2016-02-09 Thread Steve Loughran
On 9 Feb 2016, at 11:26, Steve Loughran wrote: On 9 Feb 2016, at 05:55, Prabhu Joseph wrote: + Spark-Dev On Tue, Feb 9, 2016 at 10:04 AM, Prabhu Joseph

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
Praveen, You mean cluster mode, right? That would still in a sense cause one box to be "wasted", but at least it would be used a bit more to its full potential, especially if you set spark.driver.memory to higher than its 1g default. Also, cluster mode is not an option for some applications, such

Re: Error aliasing an array column.

2016-02-09 Thread Ted Yu
Do you mind pastebin'ning code snippet and exception one more time - I couldn't see them in your original email. Which Spark release are you using ? On Tue, Feb 9, 2016 at 11:55 AM, rakeshchalasani wrote: > Hi All: > > I am getting an "UnsupportedOperationException" when

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Marcelo Vanzin
On Tue, Feb 9, 2016 at 12:16 PM, Jonathan Kelly wrote: > And we do set yarn.app.mapreduce.am.labels=CORE That sounds very mapreduce-specific, so I doubt Spark (or anything non-MR) would honor it. -- Marcelo

Re: Error aliasing an array column.

2016-02-09 Thread Ted Yu
How about changing the last line to:

    scala> val df2 = df.select(functions.array(df("a"), df("b")).alias("arrayCol"))
    df2: org.apache.spark.sql.DataFrame = [arrayCol: array<int>]

    scala> df2.show()
    +--------+
    |arrayCol|
    +--------+
    |  [0, 1]|
    |  [1, 2]|
    |  [2, 3]|
    |  [3, 4]|
    |  [4, 5]|
    |  [5, 6]|
    |  [6,

Re: Error aliasing an array column.

2016-02-09 Thread Rakesh Chalasani
Sorry, didn't realize the mail didn't show the code. Using Spark release 1.6.0. Below is an example to reproduce it:

    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sparkContext)
    import sqlContext.implicits._
    import org.apache.spark.sql.functions
    case class Test(a:Int,
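The snippet above is truncated by the archive. A self-contained sketch that reproduces the behavior described in this thread on Spark 1.6.x (the data values and local setup are assumptions, not the original code):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.{SQLContext, functions}

    case class Test(a: Int, b: Int)

    val sparkContext = new SparkContext("local[2]", "alias-repro") // assumed local setup
    val sqlContext = new SQLContext(sparkContext)
    import sqlContext.implicits._

    val df = sparkContext.parallelize(0 until 7).map(i => Test(i, i + 1)).toDF()

    // Aliasing inside select works fine.
    df.select(functions.array(df("a"), df("b")).as("arrayCol")).show()

    // Printing the standalone aliased column (as the REPL does automatically)
    // throws UnsupportedOperationException on 1.6.0, per the reports in this thread.
    val arrayCol = functions.array(df("a"), df("b")).alias("arrayCol")
    println(arrayCol)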

Re: Error aliasing an array column.

2016-02-09 Thread Rakesh Chalasani
Do you mean using "alias" instead of "as"? Unfortunately, that didn't help: > val arrayCol = functions.array(df("a"), df("b")).alias("arrayCol") still throws the error. Surprisingly, doing the same thing inside a select works: > df.select(functions.array(df("a"), df("b")).as("arrayCol")).show()

Re: Error aliasing an array column.

2016-02-09 Thread Ted Yu
What's your plan for using the arrayCol? It would be part of some query, right? On Tue, Feb 9, 2016 at 2:27 PM, Rakesh Chalasani wrote: > Do you mean using "alias" instead of "as"? Unfortunately, that didn't help > > > val arrayCol = functions.array(df("a"),

Re: Error aliasing an array column.

2016-02-09 Thread Michael Armbrust
That looks like a bug in toString for columns. Can you open a JIRA? On Tue, Feb 9, 2016 at 1:38 PM, Rakesh Chalasani wrote: > Sorry, didn't realize the mail didn't show the code. Using Spark release > 1.6.0 > > Below is an example to reproduce it. > > import

Re: Error aliasing an array column.

2016-02-09 Thread Rakesh Chalasani
We are trying to dynamically create the query, with columns coming from different places. We can overcome this with a few more lines of code, but it would be nice to pass the `alias` along (given that we can do so for all the rest of the frame operations). Created the JIRA here

Re: Long running Spark job on YARN throws "No AMRMToken"

2016-02-09 Thread Hari Shreedharan
The credentials file approach (using keytab for spark apps) will only update HDFS tokens. YARN's AMRM tokens should be taken care of by YARN internally. Steve - correct me if I am wrong here: If the AMRM tokens are disappearing it might be a YARN bug (does the AMRM token have a 7 day limit as
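For context, the keytab-based login mentioned above is configured through two Spark-on-YARN properties, which correspond to the --principal and --keytab flags of spark-submit. A minimal sketch (the principal and keytab path are placeholders):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.yarn.principal", "user@EXAMPLE.COM")        // placeholder principal
      .set("spark.yarn.keytab", "/etc/security/user.keytab")  // placeholder path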

Error aliasing an array column.

2016-02-09 Thread rakeshchalasani
Hi All: I am getting an "UnsupportedOperationException" when trying to alias an array column. The issue seems to be at "CreateArray" expression -> dataType, which checks for nullability of its children, while aliasing is creating a PrettyAttribute that does not implement nullability. Below is

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
You can set custom per-instance-group configurations (e.g., [{"classification":"yarn-site","properties":{"yarn.nodemanager.labels":"SPARKAM"}}]) using the Configurations parameter of http://docs.aws.amazon.com/ElasticMapReduce/latest/API/API_InstanceGroupConfig.html. Unfortunately, it's not currently

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Jonathan Kelly
Oh, sheesh, how silly of me. I copied and pasted that setting name without even noticing the "mapreduce" in it. Yes, I guess that would mean that Spark AMs are probably running even on TASK instances currently, which is OK but not consistent with what we do for MapReduce. I'll make sure we set

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
Can you add the ability to set custom YARN labels instead of, or in addition to, that? On Feb 9, 2016 3:28 PM, "Jonathan Kelly" wrote: > Oh, sheesh, how silly of me. I copied and pasted that setting name without > even noticing the "mapreduce" in it. Yes, I guess that would mean that >

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
Great! Thank you! On Tue, Feb 9, 2016 at 4:02 PM, Jonathan Kelly wrote: > You can set custom per-instance-group configurations (e.g., > ["classification":"yarn-site",properties:{"yarn.nodemanager.labels":"SPARKAM"}]) > using the Configurations parameter of >

map-side-combine in Spark SQL

2016-02-09 Thread Rishitesh Mishra
Can anybody confirm whether ANY operator in Spark SQL uses map-side combine? If not, is it safe to assume SortShuffleManager will always use serialized sorting in the case of queries from Spark SQL?
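For background on what is being asked: at the RDD level, map-side combine is an explicit flag on combineByKey. A minimal sketch (the data and partitioner choice are illustrative only; the open question above is whether any Spark SQL operator ever enables this flag):

    import org.apache.spark.{HashPartitioner, SparkContext}

    val sc = new SparkContext("local[2]", "map-side-combine-example") // assumed local setup
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // mapSideCombine = true asks the shuffle to pre-aggregate values on the
    // map side before they are written out, reducing shuffle volume.
    val summed = pairs.combineByKey(
      (v: Int) => v,                 // createCombiner
      (acc: Int, v: Int) => acc + v, // mergeValue
      (a: Int, b: Int) => a + b,     // mergeCombiners
      new HashPartitioner(2),
      mapSideCombine = true)

    summed.collect().foreach(println)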

Re: Kmeans++ using 1 core only Was: Slowness in Kmeans calculating fastSquaredDistance

2016-02-09 Thread Li Ming Tsai
Forwarding to the dev list, hoping someone can chime in. @mengxr?

From: Li Ming Tsai
Sent: Wednesday, February 10, 2016 12:43 PM
To: u...@spark.apache.org
Subject: Re: Slowness in Kmeans calculating fastSquaredDistance

Hi, It looks

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Alexander Pivovarov
The AM container starts first, and YARN selects a random computer to run it. Is it possible to configure YARN so that it selects the small computer for the AM container? On Feb 9, 2016 12:40 AM, "Sean Owen" wrote: > If it's too small to run an executor, I'd think it would be chosen for >

Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-09 Thread Marcelo Vanzin
You should be able to use spark.yarn.am.nodeLabelExpression if your version of YARN supports node labels (and you've added a label to the node where you want the AM to run). On Tue, Feb 9, 2016 at 9:51 AM, Alexander Pivovarov wrote: > Am container starts first and yarn
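A minimal sketch of the property Marcelo describes (the label name SPARKAM matches the one used elsewhere in this thread; it must correspond to a node label actually defined in your YARN cluster):

    import org.apache.spark.SparkConf

    // Pin the YARN application master to nodes carrying the given label.
    // Requires a YARN version with node-label support.
    val conf = new SparkConf()
      .set("spark.yarn.am.nodeLabelExpression", "SPARKAM")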