[jira] [Updated] (SPARK-24456) Spark submit - server environment variables are overwritten by client environment variables

2018-06-03 Thread Alon Shoham (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alon Shoham updated SPARK-24456:

Description: 
When submitting a Spark application with --deploy-mode cluster to a Spark standalone 
cluster, environment variables from the client machine overwrite the server's 
environment variables. 

 

We use the *SPARK_DIST_CLASSPATH* environment variable to add extra required 
dependencies to the application. We observed that the client machine's 
SPARK_DIST_CLASSPATH overwrites the remote server machine's value, resulting in 
application submission failure. 

 

We have inspected the code and found:

1. In org.apache.spark.deploy.Client line 86:
{code:java}
val command = new Command(mainClass,
  Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
  sys.env, classPathEntries, libraryPathEntries, javaOpts){code}
2. In org.apache.spark.launcher.WorkerCommandBuilder line 35:
{code:java}
childEnv.putAll(command.environment.asJava)
childEnv.put(CommandBuilderUtils.ENV_SPARK_HOME, sparkHome){code}
Line 35 shows that the environment on the server machine is overwritten with the 
client's values, but in line 36 SPARK_HOME is restored to the server value.

We think the bug can be fixed by adding a line that restores SPARK_DIST_CLASSPATH 
to its server value, similar to what is done for SPARK_HOME.
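
A minimal sketch of that idea in WorkerCommandBuilder (illustrative only; the exact 
placement and the way the server-side value is looked up are assumptions, not the 
actual patch):
{code:java}
childEnv.putAll(command.environment.asJava)
childEnv.put(CommandBuilderUtils.ENV_SPARK_HOME, sparkHome)
// Sketch: restore the worker-side SPARK_DIST_CLASSPATH, mirroring the SPARK_HOME handling above
sys.env.get("SPARK_DIST_CLASSPATH").foreach { serverValue =>
  childEnv.put("SPARK_DIST_CLASSPATH", serverValue)
}{code}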

 

  was:
When submitting a Spark application with --deploy-mode cluster to a Spark standalone 
cluster, environment variables from the client machine overwrite the server's 
environment variables. 

 

We use the *SPARK_DIST_CLASSPATH* environment variable to add extra required 
dependencies to the application. We observed that the client machine's 
SPARK_DIST_CLASSPATH overwrites the remote server machine's value, resulting in 
application submission failure. 

 

We have inspected the code and found:

1. In org.apache.spark.deploy.Client line 86:
{code:java}
val command = new Command(mainClass,
  Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
  sys.env, classPathEntries, libraryPathEntries, javaOpts){code}
2. In org.apache.spark.launcher.WorkerCommandBuilder line 35:
{code:java}
childEnv.putAll(command.environment.asJava)
childEnv.put(CommandBuilderUtils.ENV_SPARK_HOME, sparkHome){code}
Line 35 shows that the environment on the server machine is overwritten with the 
client's values, but in line 36 SPARK_HOME is restored to the server value.

We think the bug can be fixed by adding a line that restores SPARK_DIST_CLASSPATH 
to its server value, similar to what is done for SPARK_HOME.

 


> Spark submit - server environment variables are overwritten by client 
> environment variables 
> 
>
> Key: SPARK-24456
> URL: https://issues.apache.org/jira/browse/SPARK-24456
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Alon Shoham
>Priority: Minor
>
> When submitting a Spark application with --deploy-mode cluster to a Spark 
> standalone cluster, environment variables from the client machine overwrite the 
> server's environment variables. 
>  
> We use the *SPARK_DIST_CLASSPATH* environment variable to add extra required 
> dependencies to the application. We observed that the client machine's 
> SPARK_DIST_CLASSPATH overwrites the remote server machine's value, resulting in 
> application submission failure. 
>  
> We have inspected the code and found:
> 1. In org.apache.spark.deploy.Client line 86:
> {code:java}
> val command = new Command(mainClass,
>   Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
>   sys.env, classPathEntries, libraryPathEntries, javaOpts){code}
> 2. In org.apache.spark.launcher.WorkerCommandBuilder line 35:
> {code:java}
> childEnv.putAll(command.environment.asJava)
> childEnv.put(CommandBuilderUtils.ENV_SPARK_HOME, sparkHome){code}
> Line 35 shows that the environment on the server machine is overwritten with the 
> client's values, but in line 36 SPARK_HOME is restored to the server value.
> We think the bug can be fixed by adding a line that restores SPARK_DIST_CLASSPATH 
> to its server value, similar to what is done for SPARK_HOME.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24456) Spark submit - server environment variables are overwritten by client environment variables

2018-06-03 Thread Alon Shoham (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alon Shoham updated SPARK-24456:

Description: 
When submitting a Spark application with --deploy-mode cluster to a Spark standalone 
cluster, environment variables from the client machine overwrite the server's 
environment variables. 

 

We use the *SPARK_DIST_CLASSPATH* environment variable to add extra required 
dependencies to the application. We observed that the client machine's 
SPARK_DIST_CLASSPATH overwrites the remote server machine's value, resulting in 
application submission failure. 

 

We have inspected the code and found:

1. In org.apache.spark.deploy.Client line 86:
{code:java}
val command = new Command(mainClass,
  Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
  sys.env, classPathEntries, libraryPathEntries, javaOpts){code}
2. In org.apache.spark.launcher.WorkerCommandBuilder line 35:
{code:java}
childEnv.putAll(command.environment.asJava)
childEnv.put(CommandBuilderUtils.ENV_SPARK_HOME, sparkHome){code}
Line 35 shows that the environment on the server machine is overwritten with the 
client's values, but in line 36 SPARK_HOME is restored to the server value.

We think the bug can be fixed by adding a line that restores SPARK_DIST_CLASSPATH 
to its server value, similar to what is done for SPARK_HOME.

 

  was:
When submitting a Spark application with --deploy-mode cluster to a Spark standalone 
cluster, environment variables from the client machine overwrite the server's 
environment variables. 

 

We use the *SPARK_DIST_CLASSPATH* environment variable to add extra required 
dependencies to the application. We observed that the client machine's 
SPARK_DIST_CLASSPATH overwrites the remote server machine's value, resulting in 
application submission failure. 

 

We have inspected the code and found:

1. In org.apache.spark.deploy.Client line 86:
{code:java}
val command = new Command(mainClass,
  Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
  sys.env, classPathEntries, libraryPathEntries, javaOpts){code}
2. In org.apache.spark.launcher.WorkerCommandBuilder line 35:
{code:java}
childEnv.putAll(command.environment.asJava)
childEnv.put(CommandBuilderUtils.ENV_SPARK_HOME, sparkHome){code}
Line 35 shows that the environment on the server machine is overwritten with the 
client's values, but in line 36 SPARK_HOME is restored to the server value.

We think the bug can be fixed by adding a line that restores SPARK_DIST_CLASSPATH 
to its server value, similar to what is done for SPARK_HOME.

 

Summary: Spark submit - server environment variables are overwritten by 
client environment variables   (was: Spark submit - server environment 
variables are overwritten by client environment variable )

> Spark submit - server environment variables are overwritten by client 
> environment variables 
> 
>
> Key: SPARK-24456
> URL: https://issues.apache.org/jira/browse/SPARK-24456
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Alon Shoham
>Priority: Minor
>
> When submitting a Spark application with --deploy-mode cluster to a Spark 
> standalone cluster, environment variables from the client machine overwrite the 
> server's environment variables. 
>  
> We use the *SPARK_DIST_CLASSPATH* environment variable to add extra required 
> dependencies to the application. We observed that the client machine's 
> SPARK_DIST_CLASSPATH overwrites the remote server machine's value, resulting in 
> application submission failure. 
>  
> We have inspected the code and found:
> 1. In org.apache.spark.deploy.Client line 86:
> {code:java}
> val command = new Command(mainClass,
>   Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
>   sys.env, classPathEntries, libraryPathEntries, javaOpts){code}
> 2. In org.apache.spark.launcher.WorkerCommandBuilder line 35:
> {code:java}
> childEnv.putAll(command.environment.asJava)
> childEnv.put(CommandBuilderUtils.ENV_SPARK_HOME, sparkHome){code}
> Line 35 shows that the environment on the server machine is overwritten with the 
> client's values, but in line 36 SPARK_HOME is restored to the server value.
> We think the bug can be fixed by adding a line that restores SPARK_DIST_CLASSPATH 
> to its server value, similar to what is done for SPARK_HOME.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24456) Spark submit - server environment variables are overwritten by client environment variables

2018-06-03 Thread Alon Shoham (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alon Shoham updated SPARK-24456:

Description: 
When submitting a Spark application with --deploy-mode cluster to a Spark standalone 
cluster, environment variables from the client machine overwrite the server's 
environment variables. 

 

We use the *SPARK_DIST_CLASSPATH* environment variable to add extra required 
dependencies to the application. We observed that the client machine's 
SPARK_DIST_CLASSPATH overwrites the remote server machine's value, resulting in 
application submission failure. 

 

We have inspected the code and found:

1. In org.apache.spark.deploy.Client line 86:
{code:java}
val command = new Command(mainClass,
  Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
  sys.env, classPathEntries, libraryPathEntries, javaOpts){code}
2. In org.apache.spark.launcher.WorkerCommandBuilder line 35:
{code:java}
childEnv.putAll(command.environment.asJava)
childEnv.put(CommandBuilderUtils.ENV_SPARK_HOME, sparkHome){code}
Line 35 shows that the environment on the server machine is overwritten with the 
client's values, but in line 36 SPARK_HOME is restored to the server value.

We think the bug can be fixed by adding a line that restores SPARK_DIST_CLASSPATH 
to its server value, similar to what is done for SPARK_HOME.

 

  was:
When submitting a Spark application with --deploy-mode cluster to a Spark standalone 
cluster, environment variables from the client machine overwrite the server's 
environment variables. 

 

We use the *SPARK_DIST_CLASSPATH* environment variable to add extra required 
dependencies to the application. We observed that the client machine's 
SPARK_DIST_CLASSPATH overwrites the remote server machine's value, resulting in 
application submission failure. 

 

We have inspected the code and found:

1. In org.apache.spark.deploy.Client line 86:
{code:java}
val command = new Command(mainClass,
  Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
  sys.env, classPathEntries, libraryPathEntries, javaOpts){code}
2. In org.apache.spark.launcher.WorkerCommandBuilder line 35:
{code:java}
childEnv.putAll(command.environment.asJava)
childEnv.put(CommandBuilderUtils.ENV_SPARK_HOME, sparkHome){code}
Line 35 shows that the environment on the server machine is overwritten with the 
client's values, but in line 36 SPARK_HOME is restored to the server value.

We think the bug can be fixed by adding a line that restores SPARK_DIST_CLASSPATH 
to its server value, similar to what is done for SPARK_HOME.

 


> Spark submit - server environment variables are overwritten by client 
> environment variables 
> 
>
> Key: SPARK-24456
> URL: https://issues.apache.org/jira/browse/SPARK-24456
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Alon Shoham
>Priority: Minor
>
> When submitting a Spark application with --deploy-mode cluster to a Spark 
> standalone cluster, environment variables from the client machine overwrite the 
> server's environment variables. 
>  
> We use the *SPARK_DIST_CLASSPATH* environment variable to add extra required 
> dependencies to the application. We observed that the client machine's 
> SPARK_DIST_CLASSPATH overwrites the remote server machine's value, resulting in 
> application submission failure. 
>  
> We have inspected the code and found:
> 1. In org.apache.spark.deploy.Client line 86:
> {code:java}
> val command = new Command(mainClass,
>   Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
>   sys.env, classPathEntries, libraryPathEntries, javaOpts){code}
> 2. In org.apache.spark.launcher.WorkerCommandBuilder line 35:
> {code:java}
> childEnv.putAll(command.environment.asJava)
> childEnv.put(CommandBuilderUtils.ENV_SPARK_HOME, sparkHome){code}
> Line 35 shows that the environment on the server machine is overwritten with the 
> client's values, but in line 36 SPARK_HOME is restored to the server value.
> We think the bug can be fixed by adding a line that restores SPARK_DIST_CLASSPATH 
> to its server value, similar to what is done for SPARK_HOME.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24456) Spark submit - server environment variables are overwritten by client environment variable

2018-06-03 Thread Alon Shoham (JIRA)
Alon Shoham created SPARK-24456:
---

 Summary: Spark submit - server environment variables are 
overwritten by client environment variable 
 Key: SPARK-24456
 URL: https://issues.apache.org/jira/browse/SPARK-24456
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.3.0
Reporter: Alon Shoham


When submitting a Spark application with --deploy-mode cluster to a Spark standalone 
cluster, environment variables from the client machine overwrite the server's 
environment variables. 

 

We use the *SPARK_DIST_CLASSPATH* environment variable to add extra required 
dependencies to the application. We observed that the client machine's 
SPARK_DIST_CLASSPATH overwrites the remote server machine's value, resulting in 
application submission failure. 

 

We have inspected the code and found:

1. In org.apache.spark.deploy.Client line 86:
{code:java}
val command = new Command(mainClass,
  Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
  sys.env, classPathEntries, libraryPathEntries, javaOpts){code}
2. In org.apache.spark.launcher.WorkerCommandBuilder line 35:
{code:java}
childEnv.putAll(command.environment.asJava)
childEnv.put(CommandBuilderUtils.ENV_SPARK_HOME, sparkHome){code}
Line 35 shows that the environment on the server machine is overwritten with the 
client's values, but in line 36 SPARK_HOME is restored to the server value.

We think the bug can be fixed by adding a line that restores SPARK_DIST_CLASSPATH 
to its server value, similar to what is done for SPARK_HOME.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23786) CSV schema validation - column names are not checked

2018-06-03 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23786.
-
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 2.4.0

> CSV schema validation - column names are not checked
> 
>
> Key: SPARK-23786
> URL: https://issues.apache.org/jira/browse/SPARK-23786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Here is a CSV file that contains two columns of the same type:
> {code}
> $cat marina.csv
> depth, temperature
> 10.2, 9.0
> 5.5, 12.3
> {code}
> If we define the schema with correct types but wrong column names (reversed 
> order):
> {code:scala}
> val schema = new StructType().add("temperature", DoubleType).add("depth", 
> DoubleType)
> {code}
> Spark reads the csv file without any errors:
> {code:scala}
> val ds = spark.read.schema(schema).option("header", "true").csv("marina.csv")
> ds.show
> {code}
> and outputs a wrong result:
> {code}
> +---+-+
> |temperature|depth|
> +---+-+
> |   10.2|  9.0|
> |5.5| 12.3|
> +---+-+
> {code}
> The correct behavior would be to either raise an error or to read the columns 
> according to their names in the schema.
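> A minimal sketch of the kind of name check meant here (illustrative only, not a 
> proposed patch; it simply compares the CSV header against the user-supplied schema 
> before reading):
> {code:scala}
> import scala.io.Source
> 
> // Read the header row and compare it, in order, to the schema's field names.
> val header = Source.fromFile("marina.csv").getLines().next().split(",").map(_.trim)
> val expected = schema.fieldNames
> // Fail fast instead of silently mapping columns by position.
> require(header.sameElements(expected),
>   s"CSV header [${header.mkString(", ")}] does not match schema [${expected.mkString(", ")}]")
> {code}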



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24369) A bug when having multiple distinct aggregations

2018-06-03 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24369.
-
   Resolution: Fixed
 Assignee: Wenchen Fan  (was: Takeshi Yamamuro)
Fix Version/s: 2.4.0
   2.3.2

> A bug when having multiple distinct aggregations
> 
>
> Key: SPARK-24369
> URL: https://issues.apache.org/jira/browse/SPARK-24369
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> {code}
> SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) FROM
> (VALUES
>(1, 1),
>(2, 2),
>(2, 2)
> ) t(x, y)
> {code}
> It returns 
> {code}
> java.lang.RuntimeException
> You hit a query analyzer bug. Please report your query to Spark user mailing 
> list.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2018-06-03 Thread gagan taneja (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499733#comment-16499733
 ] 

gagan taneja commented on SPARK-24437:
--

I have already set the value of spark.cleaner.periodicGC.interval to 5 min.
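
For reference, a minimal illustration of how that setting is applied (the builder-style 
configuration below is an example, not the actual job setup; the app name is hypothetical):
{code:java}
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("sts-example")  // hypothetical application name
  .config("spark.cleaner.periodicGC.interval", "5min")  // trigger periodic GC every 5 minutes
  .getOrCreate()
{code}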

Which branch did you try to reproduce this issue on? There are many changes made in 
this area to address some other leaks. Can you try it out with the 2.2 branch? If 
it's reproducible in 2.2, then we know for sure that we have a way to reproduce it 
and that the issue is resolved by post-2.2 fixes.

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png
>
>
> There seems to be a memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation.
> We have a long-running instance of STS (Spark Thrift Server).
> With each query execution requiring a Broadcast Join, UnsafeHashedRelation is 
> added for cleanup in ContextCleaner. This reference to UnsafeHashedRelation is 
> being held by some other collection and does not become eligible for GC, and 
> because of this ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2018-06-03 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499707#comment-16499707
 ] 

Saisai Shao commented on SPARK-20202:
-

What is our plan to fix this issue: are we going to use a new Hive version, or are 
we still sticking to 1.2?

If we're still sticking to 1.2, [~ste...@apache.org] and I will take this issue 
and get the ball rolling in the Hive community.

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Major
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24455) fix typo in TaskSchedulerImpl's comments

2018-06-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24455:


Assignee: xueyu

> fix typo in TaskSchedulerImpl's comments
> 
>
> Key: SPARK-24455
> URL: https://issues.apache.org/jira/browse/SPARK-24455
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: xueyu
>Assignee: xueyu
>Priority: Trivial
> Fix For: 2.3.2, 2.4.0
>
>
> fix the method name in TaskSchedulerImpl.scala's comments



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24455) fix typo in TaskSchedulerImpl's comments

2018-06-03 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24455.
--
   Resolution: Fixed
Fix Version/s: 2.3.2
   2.4.0

Issue resolved by pull request 21485
[https://github.com/apache/spark/pull/21485]

> fix typo in TaskSchedulerImpl's comments
> 
>
> Key: SPARK-24455
> URL: https://issues.apache.org/jira/browse/SPARK-24455
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: xueyu
>Priority: Trivial
> Fix For: 2.4.0, 2.3.2
>
>
> fix the method name in TaskSchedulerImpl.scala's comments



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-03 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499612#comment-16499612
 ] 

Shivaram Venkataraman commented on SPARK-24359:
---

I think where the code sits matters if we want to make SparkML releases more 
frequently than Spark releases. If we have a separate repo, then it is much easier 
/ cleaner to create releases more frequently.

[~josephkb] it will not be a separate project – just a new repo in apache/ – 
similar to how `spark-website` is right now. It will be maintained by the same 
set of committers and have the same JIRA etc.

I'd just like us to understand the pros/cons of this approach vs. the current 
approach of tying releases to Spark releases, and list them out to make sure we 
are making the right call.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will chose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity. 
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-06-03 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499535#comment-16499535
 ] 

Ted Yu commented on SPARK-18057:


I have created a PR, shown above.

Kafka 2.0.0 is used since that would be the release in which KIP-266 is integrated.

If there is no objection, the JIRA title should be modified.

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-06-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18057:


Assignee: Apache Spark

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Assignee: Apache Spark
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-06-03 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499534#comment-16499534
 ] 

Apache Spark commented on SPARK-18057:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/21488

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0

2018-06-03 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18057:


Assignee: (was: Apache Spark)

> Update structured streaming kafka from 0.10.0.1 to 1.1.0
> 
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-06-03 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499528#comment-16499528
 ] 

Felix Cheung commented on SPARK-23206:
--

[~irashid] sorry I thought I had replied -

on our end, there is some debate on whether metrics collected at the 
NodeManager (YARN) level are sufficient. IMO we definitely need some breakdown 
of disk IO per app_id (and that will be hard to separate out at the NM level), so 
that we can identify the heavy-shuffle apps.

I don't think we should increase the payload significantly - so this shouldn't 
affect the design much.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
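> As a worked example with assumed numbers (not measurements from the clusters above): 
> with 100 executors, spark.executor.memory = 8 GB and a peak JVM used memory of 
> 2.8 GB (35%), the unused memory per application is 100 * (8 GB - 2.8 GB) = 520 GB.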
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2018-06-03 Thread Ruben Berenguel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499476#comment-16499476
 ] 

Ruben Berenguel commented on SPARK-23904:
-

Yes [~igreenfi], I'm using that setting to reproduce it.

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I created a question on 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark creates the text representation of the query in any case, even if I don't 
> need it.
> That causes many garbage objects and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org