[jira] [Commented] (SPARK-10884) Support prediction on single instance for regression and classification related models

2018-04-09 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431765#comment-16431765
 ] 

zhengruifeng commented on SPARK-10884:
--

Is there any plan to expose {{predictRaw}} and {{predictProbability}}? If so, I 
have time to work on this.
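
For context, a minimal usage sketch of the single-instance predict this ticket covers 
(assuming Spark 2.4+, where this ticket's public predict(features) is available, and a 
local SparkSession; predictRaw/predictProbability are the methods asked about above and 
are not assumed to be exposed here):

{code:java}
// Hedged sketch: single-instance prediction as targeted by this ticket.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("single-instance-predict").getOrCreate()
import spark.implicits._

// Tiny training set, just enough to fit a model.
val train = Seq(
  (0.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0))
).toDF("label", "features")

val model = new LogisticRegression().fit(train)

// Predict on a single feature vector, without building a DataFrame.
val prediction = model.predict(Vectors.dense(1.0, 0.0))
println(s"prediction = $prediction")
{code}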

> Support prediction on single instance for regression and classification 
> related models
> --
>
> Key: SPARK-10884
> URL: https://issues.apache.org/jira/browse/SPARK-10884
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Weichen Xu
>Priority: Major
>  Labels: 2.2.0
> Fix For: 2.4.0
>
>
> Support prediction on single instance for regression and classification 
> related models (i.e., PredictionModel, ClassificationModel, and their 
> subclasses).
> Add corresponding test cases.
> See parent issue for more details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23898) Simplify code generation for Add/Subtract with CalendarIntervals

2018-04-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23898.
-
Resolution: Fixed

> Simplify code generation for Add/Subtract with CalendarIntervals
> 
>
> Key: SPARK-23898
> URL: https://issues.apache.org/jira/browse/SPARK-23898
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23947) Add hashUTF8String convenience method to hasher classes

2018-04-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-23947:
---

Assignee: Kris Mok

> Add hashUTF8String convenience method to hasher classes
> ---
>
> Key: SPARK-23947
> URL: https://issues.apache.org/jira/browse/SPARK-23947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Minor
> Fix For: 2.4.0
>
>
> Add {{hashUTF8String()}} to the hasher classes to allow Spark SQL codegen to 
> generate cleaner code for hashing {{UTF8String}}. No change in behavior 
> otherwise.
> Although with the introduction of SPARK-10399, the code size for hashing 
> {{UTF8String}} is already smaller, it's still good to extract a separate 
> function in the hasher classes so that the generated code can stay clean.
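
For illustration only, a hedged sketch of the shape such a helper could take, assuming 
the existing Murmur3_x86_32.hashUnsafeBytes utility and the UTF8String base/offset/length 
accessors (this is not the actual patch):

{code:java}
// Hedged sketch of a hashUTF8String-style convenience method (not the real change).
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String

object HashUTF8StringSketch {
  // Hash the raw bytes backing a UTF8String in one call, so generated code can
  // emit a single method invocation instead of spelling out base/offset/length.
  def hashUTF8String(s: UTF8String, seed: Int): Int =
    Murmur3_x86_32.hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes, seed)
}
{code}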



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23947) Add hashUTF8String convenience method to hasher classes

2018-04-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23947.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Add hashUTF8String convenience method to hasher classes
> ---
>
> Key: SPARK-23947
> URL: https://issues.apache.org/jira/browse/SPARK-23947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
> Fix For: 2.4.0
>
>
> Add {{hashUTF8String()}} to the hasher classes to allow Spark SQL codegen to 
> generate cleaner code for hashing {{UTF8String}}. No change in behavior 
> otherwise.
> Although with the introduction of SPARK-10399, the code size for hashing 
> {{UTF8String}} is already smaller, it's still good to extract a separate 
> function in the hasher classes so that the generated code can stay clean.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23772) Provide an option to ignore column of all null values or empty map/array during JSON schema inference

2018-04-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-23772:
-

Assignee: Takeshi Yamamuro

> Provide an option to ignore column of all null values or empty map/array 
> during JSON schema inference
> -
>
> Key: SPARK-23772
> URL: https://issues.apache.org/jira/browse/SPARK-23772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> It is common to convert data from a JSON source to a structured format 
> periodically. If a field's values are always null in the initial batch of 
> JSON data, Spark infers that field as StringType. However, if a non-null 
> value appears in that field in the second batch and its type turns out not 
> to be StringType, schema merging then fails because of the inconsistency.
> This also applies to empty arrays and empty objects. My proposal is to 
> provide an option in the Spark JSON source to omit such fields until we see 
> a non-null value.
> This is similar to SPARK-12436, but the proposed solution is different.
> cc: [~rxin] [~smilegator]
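
For concreteness, a hedged repro sketch of the inference behavior described above 
(assumes a local SparkSession; the name of the proposed option is intentionally not 
assumed here):

{code:java}
// Hedged repro sketch of the schema-inference behavior described in this issue.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("json-null-inference").getOrCreate()
import spark.implicits._

// First batch: "b" is always null, so it is inferred as StringType.
val batch1 = spark.read.json(Seq("""{"a": 1, "b": null}""").toDS())
batch1.printSchema()  // b: string

// Second batch: "b" turns out to be a struct, so the two inferred schemas
// are inconsistent and merging them fails.
val batch2 = spark.read.json(Seq("""{"a": 2, "b": {"c": 3}}""").toDS())
batch2.printSchema()  // b: struct<c: bigint>
{code}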



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks

2018-04-09 Thread jin xing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jin xing updated SPARK-23948:
-
Description: 
SparkContext submits a map stage via "submitMapStage" to DAGScheduler, and 
"markMapStageJobAsFinished" is called only in 
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933
 and 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314);

But consider the scenario below:
1. stage0 and stage1 are both "ShuffleMapStage"s, and stage1 depends on stage0;
2. We submit stage1 via "submitMapStage"; there are 10 missing tasks in stage1;
3. While stage1 is running, a "FetchFailed" occurs, and stage0 and stage1 get 
resubmitted as stage0_1 and stage1_1;
4. While stage0_1 is running, speculated tasks from the old stage1 come back as 
succeeded, but stage1 is not in "runningStages". So even though all splits 
(including the speculated tasks) in stage1 succeeded, stage1's job listener is 
not called;
5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" there 
are no missing tasks, but in the current code the job listener is still not 
triggered.

  was:
SparkContext submits a map stage via "submitMapStage" to DAGScheduler, and 
"markMapStageJobAsFinished" is called only in ();

But consider the scenario below:
1. stage0 and stage1 are both "ShuffleMapStage"s, and stage1 depends on stage0;
2. We submit stage1 via "submitMapStage"; there are 10 missing tasks in stage1;
3. While stage1 is running, a "FetchFailed" occurs, and stage0 and stage1 get 
resubmitted as stage0_1 and stage1_1;
4. While stage0_1 is running, speculated tasks from the old stage1 come back as 
succeeded, but stage1 is not in "runningStages". So even though all splits 
(including the speculated tasks) in stage1 succeeded, stage1's job listener is 
not called;
5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" there 
are no missing tasks, but in the current code the job listener is still not 
triggered.


> Trigger mapstage's job listener in submitMissingTasks
> -
>
> Key: SPARK-23948
> URL: https://issues.apache.org/jira/browse/SPARK-23948
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: jin xing
>Priority: Major
>
> SparkContext submits a map stage via "submitMapStage" to DAGScheduler, and 
> "markMapStageJobAsFinished" is called only in 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933
>  and 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314);
> But consider the scenario below:
> 1. stage0 and stage1 are both "ShuffleMapStage"s, and stage1 depends on stage0;
> 2. We submit stage1 via "submitMapStage"; there are 10 missing tasks in stage1;
> 3. While stage1 is running, a "FetchFailed" occurs, and stage0 and stage1 get 
> resubmitted as stage0_1 and stage1_1;
> 4. While stage0_1 is running, speculated tasks from the old stage1 come back 
> as succeeded, but stage1 is not in "runningStages". So even though all splits 
> (including the speculated tasks) in stage1 succeeded, stage1's job listener is 
> not called;
> 5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" 
> there are no missing tasks, but in the current code the job listener is still 
> not triggered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23948:


Assignee: (was: Apache Spark)

> Trigger mapstage's job listener in submitMissingTasks
> -
>
> Key: SPARK-23948
> URL: https://issues.apache.org/jira/browse/SPARK-23948
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: jin xing
>Priority: Major
>
> SparkContext submits a map stage via "submitMapStage" to DAGScheduler, and 
> "markMapStageJobAsFinished" is called only in ();
> But consider the scenario below:
> 1. stage0 and stage1 are both "ShuffleMapStage"s, and stage1 depends on stage0;
> 2. We submit stage1 via "submitMapStage"; there are 10 missing tasks in stage1;
> 3. While stage1 is running, a "FetchFailed" occurs, and stage0 and stage1 get 
> resubmitted as stage0_1 and stage1_1;
> 4. While stage0_1 is running, speculated tasks from the old stage1 come back 
> as succeeded, but stage1 is not in "runningStages". So even though all splits 
> (including the speculated tasks) in stage1 succeeded, stage1's job listener is 
> not called;
> 5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" 
> there are no missing tasks, but in the current code the job listener is still 
> not triggered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks

2018-04-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431665#comment-16431665
 ] 

Apache Spark commented on SPARK-23948:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/21019

> Trigger mapstage's job listener in submitMissingTasks
> -
>
> Key: SPARK-23948
> URL: https://issues.apache.org/jira/browse/SPARK-23948
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: jin xing
>Priority: Major
>
> SparkContext submits a map stage via "submitMapStage" to DAGScheduler, and 
> "markMapStageJobAsFinished" is called only in ();
> But consider the scenario below:
> 1. stage0 and stage1 are both "ShuffleMapStage"s, and stage1 depends on stage0;
> 2. We submit stage1 via "submitMapStage"; there are 10 missing tasks in stage1;
> 3. While stage1 is running, a "FetchFailed" occurs, and stage0 and stage1 get 
> resubmitted as stage0_1 and stage1_1;
> 4. While stage0_1 is running, speculated tasks from the old stage1 come back 
> as succeeded, but stage1 is not in "runningStages". So even though all splits 
> (including the speculated tasks) in stage1 succeeded, stage1's job listener is 
> not called;
> 5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" 
> there are no missing tasks, but in the current code the job listener is still 
> not triggered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23948:


Assignee: Apache Spark

> Trigger mapstage's job listener in submitMissingTasks
> -
>
> Key: SPARK-23948
> URL: https://issues.apache.org/jira/browse/SPARK-23948
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: jin xing
>Assignee: Apache Spark
>Priority: Major
>
> SparkContext submits a map stage via "submitMapStage" to DAGScheduler, and 
> "markMapStageJobAsFinished" is called only in ();
> But consider the scenario below:
> 1. stage0 and stage1 are both "ShuffleMapStage"s, and stage1 depends on stage0;
> 2. We submit stage1 via "submitMapStage"; there are 10 missing tasks in stage1;
> 3. While stage1 is running, a "FetchFailed" occurs, and stage0 and stage1 get 
> resubmitted as stage0_1 and stage1_1;
> 4. While stage0_1 is running, speculated tasks from the old stage1 come back 
> as succeeded, but stage1 is not in "runningStages". So even though all splits 
> (including the speculated tasks) in stage1 succeeded, stage1's job listener is 
> not called;
> 5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" 
> there are no missing tasks, but in the current code the job listener is still 
> not triggered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks

2018-04-09 Thread jin xing (JIRA)
jin xing created SPARK-23948:


 Summary: Trigger mapstage's job listener in submitMissingTasks
 Key: SPARK-23948
 URL: https://issues.apache.org/jira/browse/SPARK-23948
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: jin xing


SparkContext submits a map stage via "submitMapStage" to DAGScheduler, and 
"markMapStageJobAsFinished" is called only in ();

But consider the scenario below:
1. stage0 and stage1 are both "ShuffleMapStage"s, and stage1 depends on stage0;
2. We submit stage1 via "submitMapStage"; there are 10 missing tasks in stage1;
3. While stage1 is running, a "FetchFailed" occurs, and stage0 and stage1 get 
resubmitted as stage0_1 and stage1_1;
4. While stage0_1 is running, speculated tasks from the old stage1 come back as 
succeeded, but stage1 is not in "runningStages". So even though all splits 
(including the speculated tasks) in stage1 succeeded, stage1's job listener is 
not called;
5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks" there 
are no missing tasks, but in the current code the job listener is still not 
triggered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23880) table cache should be lazy and don't trigger any jobs.

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23880:


Assignee: (was: Apache Spark)

> table cache should be lazy and don't trigger any jobs.
> --
>
> Key: SPARK-23880
> URL: https://issues.apache.org/jira/browse/SPARK-23880
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
> {code}
> val df = spark.range(100L)
>   .filter('id > 1000)
>   .orderBy('id.desc)
>   .cache()
> {code}
> This triggers a job even though the cache should be lazy. The problem is 
> that, when creating `InMemoryRelation`, we build the RDD, which calls 
> `SparkPlan.execute` and may trigger jobs, such as a sampling job for the 
> range partitioner or a broadcast job.
> We should create the RDD in the physical phase instead.
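
As a usage sketch of the expected lazy behavior (assuming a local SparkSession; whether 
any job runs at cache() is exactly what this issue is about):

{code:java}
// Hedged sketch: cache() should only mark the plan for caching;
// the heavy work should happen at the first action.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("lazy-cache").getOrCreate()
import spark.implicits._

val df = spark.range(100L)
  .filter('id > 50)
  .orderBy('id.desc)
  .cache()   // ideally no job here (today this triggers sampling/broadcast jobs)

df.count()   // first action: this is where those jobs should run instead
{code}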



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23880) table cache should be lazy and don't trigger any jobs.

2018-04-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431638#comment-16431638
 ] 

Apache Spark commented on SPARK-23880:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/21018

> table cache should be lazy and don't trigger any jobs.
> --
>
> Key: SPARK-23880
> URL: https://issues.apache.org/jira/browse/SPARK-23880
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
> {code}
> val df = spark.range(100L)
>   .filter('id > 1000)
>   .orderBy('id.desc)
>   .cache()
> {code}
> This triggers a job even though the cache should be lazy. The problem is 
> that, when creating `InMemoryRelation`, we build the RDD, which calls 
> `SparkPlan.execute` and may trigger jobs, such as a sampling job for the 
> range partitioner or a broadcast job.
> We should create the RDD in the physical phase instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23880) table cache should be lazy and don't trigger any jobs.

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23880:


Assignee: Apache Spark

> table cache should be lazy and don't trigger any jobs.
> --
>
> Key: SPARK-23880
> URL: https://issues.apache.org/jira/browse/SPARK-23880
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> val df = spark.range(100L)
>   .filter('id > 1000)
>   .orderBy('id.desc)
>   .cache()
> {code}
> This triggers a job even though the cache should be lazy. The problem is 
> that, when creating `InMemoryRelation`, we build the RDD, which calls 
> `SparkPlan.execute` and may trigger jobs, such as a sampling job for the 
> range partitioner or a broadcast job.
> We should create the RDD in the physical phase instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23748) Support select from temp tables

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23748:


Assignee: Apache Spark

> Support select from temp tables
> ---
>
> Key: SPARK-23748
> URL: https://issues.apache.org/jira/browse/SPARK-23748
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>
> As reported in the dev list, the following currently fails:
>  
> val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", 
> "localhost:9092").option("subscribe", "join_test").option("startingOffsets", 
> "earliest").load();
> jdf.createOrReplaceTempView("table")
>  
> val resultdf = spark.sql("select * from table")
> resultdf.writeStream.outputMode("append").format("console").option("truncate",
>  false).trigger(Trigger.Continuous("1 second")).start()
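
For readability, the same repro restated as a hedged, self-contained sketch (assumes a 
local Kafka broker at localhost:9092 with a "join_test" topic and the 
spark-sql-kafka-0-10 package on the classpath):

{code:java}
// Restatement of the repro quoted above.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().master("local[*]").appName("continuous-temp-view").getOrCreate()

val jdf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "join_test")
  .option("startingOffsets", "earliest")
  .load()

jdf.createOrReplaceTempView("table")

// Selecting from the temp view and writing with a continuous trigger is what currently fails.
val resultdf = spark.sql("select * from table")
resultdf.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", "false")
  .trigger(Trigger.Continuous("1 second"))
  .start()
{code}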



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23748) Support select from temp tables

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23748:


Assignee: (was: Apache Spark)

> Support select from temp tables
> ---
>
> Key: SPARK-23748
> URL: https://issues.apache.org/jira/browse/SPARK-23748
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> As reported in the dev list, the following currently fails:
>  
> val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", 
> "localhost:9092").option("subscribe", "join_test").option("startingOffsets", 
> "earliest").load();
> jdf.createOrReplaceTempView("table")
>  
> val resultdf = spark.sql("select * from table")
> resultdf.writeStream.outputMode("append").format("console").option("truncate",
>  false).trigger(Trigger.Continuous("1 second")).start()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23748) Support select from temp tables

2018-04-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431609#comment-16431609
 ] 

Apache Spark commented on SPARK-23748:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21017

> Support select from temp tables
> ---
>
> Key: SPARK-23748
> URL: https://issues.apache.org/jira/browse/SPARK-23748
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> As reported in the dev list, the following currently fails:
>  
> val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", 
> "localhost:9092").option("subscribe", "join_test").option("startingOffsets", 
> "earliest").load();
> jdf.createOrReplaceTempView("table")
>  
> val resultdf = spark.sql("select * from table")
> resultdf.writeStream.outputMode("append").format("console").option("truncate",
>  false).trigger(Trigger.Continuous("1 second")).start()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23733) Broken link to java source code in Spark Scala api Scaladoc

2018-04-09 Thread Yogesh Tewari (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431600#comment-16431600
 ] 

Yogesh Tewari commented on SPARK-23733:
---

SPARK-23732 and SPARK-23733 are similar in nature. Will leave it to your better 
judgement. (y)

Cheers.

> Broken link to java source code in Spark Scala api Scaladoc
> ---
>
> Key: SPARK-23733
> URL: https://issues.apache.org/jira/browse/SPARK-23733
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation, Project Infra
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0
> Environment: {code:java}
> ~/spark/docs$ cat /etc/*release*
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=16.04
> DISTRIB_CODENAME=xenial
> DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
> NAME="Ubuntu"
> VERSION="16.04.4 LTS (Xenial Xerus)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 16.04.4 LTS"
> VERSION_ID="16.04"
> HOME_URL="http://www.ubuntu.com/"
> SUPPORT_URL="http://help.ubuntu.com/"
> BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
> VERSION_CODENAME=xenial
> UBUNTU_CODENAME=xenial
> {code}
> Using spark packaged sbt.
> Other versions:
> {code:java}
> ~/spark/docs$ ruby -v 
> ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu] 
> ~/spark/docs$ gem -v 
> 2.5.2.1 
> ~/spark/docs$ jekyll -v 
> jekyll 3.7.3  
> ~/spark/docs$ java -version 
> java version "1.8.0_112" Java(TM) SE Runtime Environment (build 
> 1.8.0_112-b15) Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed 
> mode)
> {code}
>Reporter: Yogesh Tewari
>Priority: Trivial
>  Labels: build, documentation, scaladocs
>
> Java source code link in Spark api scaladoc is broken.
> The relative path expression "€\{FILE_PATH}.scala" in 
> [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] has 
> ".scala" hardcoded in the end. If I try to access the source link on 
> [https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.api.java.function.Function2],
>  it tries to take me to 
> [https://github.com/apache/spark/tree/v2.2.0/core/src/main/java/org/apache/spark/api/java/function/Function2.java.scala]
> This is coming from /project/SparkBuild.scala :
> Line # 720
> {code:java}
> // Use GitHub repository for Scaladoc source links
> unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
> scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
> "-groups", // Group similar methods together based on the @group annotation.
> "-skip-packages", "org.apache.hadoop"
> ) ++ (
> // Add links to sources when generating Scaladoc for a non-snapshot release
> if (!isSnapshot.value) {
> Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
> } else {
> Seq()
> }
> ){code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23732) Broken link to scala source code in Spark Scala api Scaladoc

2018-04-09 Thread Yogesh Tewari (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431599#comment-16431599
 ] 

Yogesh Tewari commented on SPARK-23732:
---

SPARK-23732 and SPARK-23733 are similar in nature. Will leave it to your better 
judgement. (y)

Cheers.

> Broken link to scala source code in Spark Scala api Scaladoc
> 
>
> Key: SPARK-23732
> URL: https://issues.apache.org/jira/browse/SPARK-23732
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation, Project Infra
>Affects Versions: 2.3.0, 2.3.1
> Environment: {code:java}
> ~/spark/docs$ cat /etc/*release*
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=16.04
> DISTRIB_CODENAME=xenial
> DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
> NAME="Ubuntu"
> VERSION="16.04.4 LTS (Xenial Xerus)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 16.04.4 LTS"
> VERSION_ID="16.04"
> HOME_URL="http://www.ubuntu.com/"
> SUPPORT_URL="http://help.ubuntu.com/"
> BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
> VERSION_CODENAME=xenial
> UBUNTU_CODENAME=xenial
> {code}
> Using spark packaged sbt.
> Other versions:
> {code:java}
> ~/spark/docs$ ruby -v 
> ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu] 
> ~/spark/docs$ gem -v 
> 2.5.2.1 
> ~/spark/docs$ jekyll -v 
> jekyll 3.7.3  
> ~/spark/docs$ java -version 
> java version "1.8.0_112" Java(TM) SE Runtime Environment (build 
> 1.8.0_112-b15) Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed 
> mode)
> {code}
>Reporter: Yogesh Tewari
>Priority: Trivial
>  Labels: build, documentation, scaladocs
>
> Scala source code link in Spark api scaladoc is broken.
> Turns out instead of the relative path to the scala files the 
> "€\{FILE_PATH}.scala" expression in 
> [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] is 
> generating the absolute path from the developers computer. In this case, if I 
> try to access the source link on 
> [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable],
>  it tries to take me to 
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala]
> where "/Users/sameera/dev/spark" portion of the URL is coming from the 
> developers macos home folder.
> There seems to be no change in the code responsible for generating this path 
> during the build in /project/SparkBuild.scala :
> Line # 252:
> {code:java}
> scalacOptions in Compile ++= Seq(
> s"-target:jvm-${scalacJVMVersion.value}",
> "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required 
> for relative source links in scaladoc
> ),
> {code}
> Line # 726
> {code:java}
> // Use GitHub repository for Scaladoc source links
> unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
> scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
> "-groups", // Group similar methods together based on the @group annotation.
> "-skip-packages", "org.apache.hadoop"
> ) ++ (
> // Add links to sources when generating Scaladoc for a non-snapshot release
> if (!isSnapshot.value) {
> Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
> } else {
> Seq()
> }
> ){code}
>  
> It seems more like a developers dev environment issue.
> I was successfully able to reproduce this in my dev environment. Environment 
> details attached. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Christopher Hoshino-Fish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431590#comment-16431590
 ] 

Christopher Hoshino-Fish commented on SPARK-23946:
--

[~hyukjin.kwon] thanks for the feedback!

> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --
>
> Key: SPARK-23946
> URL: https://issues.apache.org/jira/browse/SPARK-23946
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Christopher Hoshino-Fish
>Priority: Major
>  Labels: doc-impacting, docs-missing
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for 
> the source code
>  
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
>  click on the Source link and it goes to:
>  
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431585#comment-16431585
 ] 

Hyukjin Kwon commented on SPARK-23929:
--

Unless there's a strong reason for going ahead with mapping by name, I would 
propose to fix the documentation for now.

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently that is not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431584#comment-16431584
 ] 

Hyukjin Kwon commented on SPARK-23929:
--

I think we already use the position-based approach, and it has already been released...

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently that is not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23733) Broken link to java source code in Spark Scala api Scaladoc

2018-04-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431582#comment-16431582
 ] 

Hyukjin Kwon commented on SPARK-23733:
--

Let me leave this resolved as a duplicate but please let me know if I 
misunderstood.

> Broken link to java source code in Spark Scala api Scaladoc
> ---
>
> Key: SPARK-23733
> URL: https://issues.apache.org/jira/browse/SPARK-23733
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation, Project Infra
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0
> Environment: {code:java}
> ~/spark/docs$ cat /etc/*release*
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=16.04
> DISTRIB_CODENAME=xenial
> DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
> NAME="Ubuntu"
> VERSION="16.04.4 LTS (Xenial Xerus)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 16.04.4 LTS"
> VERSION_ID="16.04"
> HOME_URL="http://www.ubuntu.com/"
> SUPPORT_URL="http://help.ubuntu.com/"
> BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
> VERSION_CODENAME=xenial
> UBUNTU_CODENAME=xenial
> {code}
> Using spark packaged sbt.
> Other versions:
> {code:java}
> ~/spark/docs$ ruby -v 
> ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu] 
> ~/spark/docs$ gem -v 
> 2.5.2.1 
> ~/spark/docs$ jekyll -v 
> jekyll 3.7.3  
> ~/spark/docs$ java -version 
> java version "1.8.0_112" Java(TM) SE Runtime Environment (build 
> 1.8.0_112-b15) Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed 
> mode)
> {code}
>Reporter: Yogesh Tewari
>Priority: Trivial
>  Labels: build, documentation, scaladocs
>
> Java source code link in Spark api scaladoc is broken.
> The relative path expression "€\{FILE_PATH}.scala" in 
> [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] has 
> ".scala" hardcoded in the end. If I try to access the source link on 
> [https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.api.java.function.Function2],
>  it tries to take me to 
> [https://github.com/apache/spark/tree/v2.2.0/core/src/main/java/org/apache/spark/api/java/function/Function2.java.scala]
> This is coming from /project/SparkBuild.scala :
> Line # 720
> {code:java}
> // Use GitHub repository for Scaladoc source links
> unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
> scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
> "-groups", // Group similar methods together based on the @group annotation.
> "-skip-packages", "org.apache.hadoop"
> ) ++ (
> // Add links to sources when generating Scaladoc for a non-snapshot release
> if (!isSnapshot.value) {
> Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
> } else {
> Seq()
> }
> ){code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23737) Scala API documentation leads to nonexistent pages for sources

2018-04-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23737.
--
Resolution: Duplicate

Please don't reopen this. It's resolved as a duplicate.

> Scala API documentation leads to nonexistent pages for sources
> --
>
> Key: SPARK-23737
> URL: https://issues.apache.org/jira/browse/SPARK-23737
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Alexander Bessonov
>Priority: Minor
>
> h3. Steps to reproduce:
>  # Go to [Scala API 
> homepage|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package].
>  # Click "Source: package.scala"
> h3. Result:
> The link leads to nonexistent page: 
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/package.scala]
> h3. Expected result:
> The link leads to proper page:
> [https://github.com/apache/spark/tree/v2.3.0/core/src/main/scala/org/apache/spark/package.scala]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23946.
--
Resolution: Duplicate

Let me leave this resolved as a duplicate but please let me know if I 
misunderstood.

> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --
>
> Key: SPARK-23946
> URL: https://issues.apache.org/jira/browse/SPARK-23946
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Christopher Hoshino-Fish
>Priority: Major
>  Labels: doc-impacting, docs-missing
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for 
> the source code
>  
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
>  click on the Source link and it goes to:
>  
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431578#comment-16431578
 ] 

Hyukjin Kwon commented on SPARK-23946:
--

and the fix version too, which is usually set when the issue is actually fixed.

> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --
>
> Key: SPARK-23946
> URL: https://issues.apache.org/jira/browse/SPARK-23946
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Christopher Hoshino-Fish
>Priority: Major
>  Labels: doc-impacting, docs-missing
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for 
> the source code
>  
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
>  click on the Source link and it goes to:
>  
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23733) Broken link to java source code in Spark Scala api Scaladoc

2018-04-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23733.
--
Resolution: Duplicate

> Broken link to java source code in Spark Scala api Scaladoc
> ---
>
> Key: SPARK-23733
> URL: https://issues.apache.org/jira/browse/SPARK-23733
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation, Project Infra
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0
> Environment: {code:java}
> ~/spark/docs$ cat /etc/*release*
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=16.04
> DISTRIB_CODENAME=xenial
> DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
> NAME="Ubuntu"
> VERSION="16.04.4 LTS (Xenial Xerus)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 16.04.4 LTS"
> VERSION_ID="16.04"
> HOME_URL="http://www.ubuntu.com/"
> SUPPORT_URL="http://help.ubuntu.com/"
> BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
> VERSION_CODENAME=xenial
> UBUNTU_CODENAME=xenial
> {code}
> Using spark packaged sbt.
> Other versions:
> {code:java}
> ~/spark/docs$ ruby -v 
> ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu] 
> ~/spark/docs$ gem -v 
> 2.5.2.1 
> ~/spark/docs$ jekyll -v 
> jekyll 3.7.3  
> ~/spark/docs$ java -version 
> java version "1.8.0_112" Java(TM) SE Runtime Environment (build 
> 1.8.0_112-b15) Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed 
> mode)
> {code}
>Reporter: Yogesh Tewari
>Priority: Trivial
>  Labels: build, documentation, scaladocs
>
> Java source code link in Spark api scaladoc is broken.
> The relative path expression "€\{FILE_PATH}.scala" in 
> [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] has 
> ".scala" hardcoded in the end. If I try to access the source link on 
> [https://spark.apache.org/docs/1.6.3/api/scala/index.html#org.apache.spark.api.java.function.Function2],
>  it tries to take me to 
> [https://github.com/apache/spark/tree/v2.2.0/core/src/main/java/org/apache/spark/api/java/function/Function2.java.scala]
> This is coming from /project/SparkBuild.scala :
> Line # 720
> {code:java}
> // Use GitHub repository for Scaladoc source links
> unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
> scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
> "-groups", // Group similar methods together based on the @group annotation.
> "-skip-packages", "org.apache.hadoop"
> ) ++ (
> // Add links to sources when generating Scaladoc for a non-snapshot release
> if (!isSnapshot.value) {
> Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
> } else {
> Seq()
> }
> ){code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431576#comment-16431576
 ] 

Hyukjin Kwon commented on SPARK-23946:
--

and the fix version too, which is usually set when the issue is actually fixed.

> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --
>
> Key: SPARK-23946
> URL: https://issues.apache.org/jira/browse/SPARK-23946
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Christopher Hoshino-Fish
>Priority: Major
>  Labels: doc-impacting, docs-missing
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for 
> the source code
>  
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
>  click on the Source link and it goes to:
>  
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23946:
-
Fix Version/s: (was: 2.3.1)

> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --
>
> Key: SPARK-23946
> URL: https://issues.apache.org/jira/browse/SPARK-23946
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Christopher Hoshino-Fish
>Priority: Major
>  Labels: doc-impacting, docs-missing
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for 
> the source code
>  
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
>  click on the Source link and it goes to:
>  
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23732) Broken link to scala source code in Spark Scala api Scaladoc

2018-04-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431575#comment-16431575
 ] 

Hyukjin Kwon commented on SPARK-23732:
--

Let's leave SPARK-23733 resolved as a duplicate. It seems the root cause is the same.

> Broken link to scala source code in Spark Scala api Scaladoc
> 
>
> Key: SPARK-23732
> URL: https://issues.apache.org/jira/browse/SPARK-23732
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation, Project Infra
>Affects Versions: 2.3.0, 2.3.1
> Environment: {code:java}
> ~/spark/docs$ cat /etc/*release*
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=16.04
> DISTRIB_CODENAME=xenial
> DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
> NAME="Ubuntu"
> VERSION="16.04.4 LTS (Xenial Xerus)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 16.04.4 LTS"
> VERSION_ID="16.04"
> HOME_URL="http://www.ubuntu.com/"
> SUPPORT_URL="http://help.ubuntu.com/"
> BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
> VERSION_CODENAME=xenial
> UBUNTU_CODENAME=xenial
> {code}
> Using spark packaged sbt.
> Other versions:
> {code:java}
> ~/spark/docs$ ruby -v 
> ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu] 
> ~/spark/docs$ gem -v 
> 2.5.2.1 
> ~/spark/docs$ jekyll -v 
> jekyll 3.7.3  
> ~/spark/docs$ java -version 
> java version "1.8.0_112" Java(TM) SE Runtime Environment (build 
> 1.8.0_112-b15) Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed 
> mode)
> {code}
>Reporter: Yogesh Tewari
>Priority: Trivial
>  Labels: build, documentation, scaladocs
>
> Scala source code link in Spark api scaladoc is broken.
> Turns out instead of the relative path to the scala files the 
> "€\{FILE_PATH}.scala" expression in 
> [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] is 
> generating the absolute path from the developers computer. In this case, if I 
> try to access the source link on 
> [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable],
>  it tries to take me to 
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala]
> where "/Users/sameera/dev/spark" portion of the URL is coming from the 
> developers macos home folder.
> There seems to be no change in the code responsible for generating this path 
> during the build in /project/SparkBuild.scala :
> Line # 252:
> {code:java}
> scalacOptions in Compile ++= Seq(
> s"-target:jvm-${scalacJVMVersion.value}",
> "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required 
> for relative source links in scaladoc
> ),
> {code}
> Line # 726
> {code:java}
> // Use GitHub repository for Scaladoc source links
> unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
> scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
> "-groups", // Group similar methods together based on the @group annotation.
> "-skip-packages", "org.apache.hadoop"
> ) ++ (
> // Add links to sources when generating Scaladoc for a non-snapshot release
> if (!isSnapshot.value) {
> Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
> } else {
> Seq()
> }
> ){code}
>  
> It seems more like a developers dev environment issue.
> I was successfully able to reproduce this in my dev environment. Environment 
> details attached. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23946:
-
Target Version/s:   (was: 2.3.1)

> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --
>
> Key: SPARK-23946
> URL: https://issues.apache.org/jira/browse/SPARK-23946
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Christopher Hoshino-Fish
>Priority: Major
>  Labels: doc-impacting, docs-missing
> Fix For: 2.3.1
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for 
> the source code
>  
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
>  click on the Source link and it goes to:
>  
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431573#comment-16431573
 ] 

Hyukjin Kwon commented on SPARK-23946:
--

Please don't set the target version; it is usually reserved for committers. I 
think this is a duplicate of SPARK-23732.

> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --
>
> Key: SPARK-23946
> URL: https://issues.apache.org/jira/browse/SPARK-23946
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Christopher Hoshino-Fish
>Priority: Major
>  Labels: doc-impacting, docs-missing
> Fix For: 2.3.1
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for 
> the source code
>  
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
>  click on the Source link and it goes to:
>  
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23942) PySpark's collect doesn't trigger QueryExecutionListener

2018-04-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23942:
-
Component/s: PySpark

> PySpark's collect doesn't trigger QueryExecutionListener
> 
>
> Key: SPARK-23942
> URL: https://issues.apache.org/jira/browse/SPARK-23942
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> For example, if you have a custom query execution listener:
> {code}
> package org.apache.spark.sql
> import org.apache.spark.internal.Logging
> import org.apache.spark.sql.execution.QueryExecution
> import org.apache.spark.sql.util.QueryExecutionListener
> class TestQueryExecutionListener extends QueryExecutionListener with Logging {
>   override def onSuccess(funcName: String, qe: QueryExecution, durationNs: 
> Long): Unit = {
> logError("Look at me! I'm 'onSuccess'")
>   }
>   override def onFailure(funcName: String, qe: QueryExecution, exception: 
> Exception): Unit = { }
> }
> {code}
> and set "spark.sql.queryExecutionListeners  
> org.apache.spark.sql.TestQueryExecutionListener",
> {code}
> >>> sql("SELECT * FROM range(1)").collect()
> [Row(id=0)]
> {code}
> {code}
> >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> >>> sql("SELECT * FROM range(1)").toPandas()
>id
> 0   0
> {code}
> Other actions such as show() seem fine, and the Scala side works as expected too:
> {code}
> >>> sql("SELECT * FROM range(1)").show()
> 18/04/09 17:02:04 ERROR TestQueryExecutionListener: Look at me! I'm 
> 'onSuccess'
> +---+
> | id|
> +---+
> |  0|
> +---+
> {code}
> {code}
> scala> sql("SELECT * FROM range(1)").collect()
> 18/04/09 16:58:41 ERROR TestQueryExecutionListener: Look at me! I'm 
> 'onSuccess'
> res1: Array[org.apache.spark.sql.Row] = Array([0])
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431555#comment-16431555
 ] 

Yuming Wang commented on SPARK-23946:
-

You are right. I'm working on this.

> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --
>
> Key: SPARK-23946
> URL: https://issues.apache.org/jira/browse/SPARK-23946
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Christopher Hoshino-Fish
>Priority: Major
>  Labels: doc-impacting, docs-missing
> Fix For: 2.3.1
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for 
> the source code
>  
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
>  click on the Source link and it goes to:
>  
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23527) Error with spark-submit and kerberos with TLS-enabled Hadoop cluster

2018-04-09 Thread Ron Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431514#comment-16431514
 ] 

Ron Gonzalez commented on SPARK-23527:
--

My admin team has resolved it, but now I get this same problem when I wrap 
org.apache.spark.deploy.SparkSubmit and invoke it directly from Java.

Is there some kind of jar file or configuration that I need to add to the 
classpath?

> Error with spark-submit and kerberos with TLS-enabled Hadoop cluster
> 
>
> Key: SPARK-23527
> URL: https://issues.apache.org/jira/browse/SPARK-23527
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.1
> Environment: core-site.xml
> 
>     hadoop.security.key.provider.path
>     kms://ht...@host1.domain.com;host2.domain.com:16000/kms
> 
> hdfs-site.xml
> 
>     dfs.encryption.key.provider.uri
>     kms://ht...@host1.domain.com;host2.domain.com:16000/kms
> 
>Reporter: Ron Gonzalez
>Priority: Critical
>
> For current configuration of our enterprise cluster, I submit using 
> spark-submit:
> ./spark-submit --master yarn --deploy-mode cluster --class 
> org.apache.spark.examples.SparkPi --conf 
> spark.yarn.jars=hdfs:/user/user1/spark/lib/*.jar 
> ../examples/jars/spark-examples_2.11-2.2.1.jar 10
> I am getting the following problem:
>  
> 18/02/27 21:03:48 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 
> 3351181 for svchdc236d on ha-hdfs:nameservice1
> Exception in thread "main" java.lang.IllegalArgumentException: 
> java.net.UnknownHostException: host1.domain.com;host2.domain.com
>  at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
>  at 
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.getDelegationTokenService(KMSClientProvider.java:825)
>  at 
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:781)
>  at 
> org.apache.hadoop.crypto.key.KeyProviderDelegationTokenExtension.addDelegationTokens(KeyProviderDelegationTokenExtension.java:86)
>  at 
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2046)
>  at 
> org.apache.spark.deploy.yarn.security.HadoopFSCredentialProvider$$anonfun$obtainCredentials$1.apply(HadoopFSCredentialProvider.scala:52)
>  
> If I get rid of the other host for the properties so instead of 
> kms://ht...@host1.domain.com;host2.domain.com:16000/kms, I convert it to:
> kms://ht...@host1.domain.com:16000/kms
> it fails with a different error:
> java.io.IOException: javax.net.ssl.SSLHandshakeException: 
> sun.security.validator.ValidatorException: PKIX path building failed: 
> sun.security.provider.certpath.SunCertPathBuilderException: unable to find 
> valid certification path to requested target
> If I do the same thing using spark 1.6, it works so it seems like a 
> regression...
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23944) Add Param set functions to LSHModel types

2018-04-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23944:
--
Fix Version/s: (was: 2.4.0)

> Add Param set functions to LSHModel types
> -
>
> Key: SPARK-23944
> URL: https://issues.apache.org/jira/browse/SPARK-23944
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Priority: Major
>
> 2 param set methods ( setInputCol, setOutputCol) are added to the two 
> LSHModel types for min hash and random projections.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23947) Add hashUTF8String convenience method to hasher classes

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23947:


Assignee: (was: Apache Spark)

> Add hashUTF8String convenience method to hasher classes
> ---
>
> Key: SPARK-23947
> URL: https://issues.apache.org/jira/browse/SPARK-23947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Add {{hashUTF8String()}} to the hasher classes to allow Spark SQL codegen to 
> generate cleaner code for hashing {{UTF8String}}. No change in behavior 
> otherwise.
> Although with the introduction of SPARK-10399, the code size for hashing 
> {{UTF8String}} is already smaller, it's still good to extract a separate 
> function in the hasher classes so that the generated code can stay clean.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23947) Add hashUTF8String convenience method to hasher classes

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23947:


Assignee: Apache Spark

> Add hashUTF8String convenience method to hasher classes
> ---
>
> Key: SPARK-23947
> URL: https://issues.apache.org/jira/browse/SPARK-23947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Assignee: Apache Spark
>Priority: Minor
>
> Add {{hashUTF8String()}} to the hasher classes to allow Spark SQL codegen to 
> generate cleaner code for hashing {{UTF8String}}. No change in behavior 
> otherwise.
> Although with the introduction of SPARK-10399, the code size for hashing 
> {{UTF8String}} is already smaller, it's still good to extract a separate 
> function in the hasher classes so that the generated code can stay clean.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23947) Add hashUTF8String convenience method to hasher classes

2018-04-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431431#comment-16431431
 ] 

Apache Spark commented on SPARK-23947:
--

User 'rednaxelafx' has created a pull request for this issue:
https://github.com/apache/spark/pull/21016

> Add hashUTF8String convenience method to hasher classes
> ---
>
> Key: SPARK-23947
> URL: https://issues.apache.org/jira/browse/SPARK-23947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Add {{hashUTF8String()}} to the hasher classes to allow Spark SQL codegen to 
> generate cleaner code for hashing {{UTF8String}}. No change in behavior 
> otherwise.
> Although with the introduction of SPARK-10399, the code size for hashing 
> {{UTF8String}} is already smaller, it's still good to extract a separate 
> function in the hasher classes so that the generated code can stay clean.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23947) Add hashUTF8String convenience method to hasher classes

2018-04-09 Thread Kris Mok (JIRA)
Kris Mok created SPARK-23947:


 Summary: Add hashUTF8String convenience method to hasher classes
 Key: SPARK-23947
 URL: https://issues.apache.org/jira/browse/SPARK-23947
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kris Mok


Add {{hashUTF8String()}} to the hasher classes to allow Spark SQL codegen to 
generate cleaner code for hashing {{UTF8String}}. No change in behavior 
otherwise.

Although with the introduction of SPARK-10399, the code size for hashing 
{{UTF8String}} is already smaller, it's still good to extract a separate 
function in the hasher classes so that the generated code can stay clean.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-04-09 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-23945:
-
Description: 
In SQL you can filter rows based on the result of a subquery:
{code:java}
SELECT *
FROM table1
WHERE name NOT IN (
SELECT name
FROM table2
);{code}
In the Spark DataFrame API, the equivalent would probably look like this:
{code:java}
(table1
.where(
~col('name').isin(
table2.select('name')
)
)
){code}
However, .isin() currently [only accepts a local list of 
values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].

I imagine making this enhancement would happen as part of a larger effort to 
support correlated subqueries in the DataFrame API.

Or perhaps there is no plan to support this style of query in the DataFrame 
API, and queries like this should instead be written in a different way? How 
would we write a query like the one I have above in the DataFrame API, without 
needing to collect values locally for the NOT IN filter?

 

  was:
In SQL you can filter rows based on the result of a subquery:

 
{code:java}
SELECT *
FROM table1
WHERE name NOT IN (
SELECT name
FROM table2
);{code}
In the Spark DataFrame API, the equivalent would probably look like this:
{code:java}
(table1
.where(
~col('name').isin(
table2.select('name')
)
)
){code}
However, .isin() currently [only accepts a local list of 
values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].

I imagine making this enhancement would happen as part of a larger effort to 
support correlated subqueries in the DataFrame API.

Or perhaps there is no plan to support this style of query in the DataFrame 
API, and queries like this should instead be written in a different way? How 
would we write a query like the one I have above in the DataFrame API, without 
needing to collect values locally for the NOT IN filter?

 


> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:java}
> SELECT *
> FROM table1
> WHERE name NOT IN (
> SELECT name
> FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:java}
> (table1
> .where(
> ~col('name').isin(
> table2.select('name')
> )
> )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  
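
For reference, a minimal PySpark sketch (toy data, not from this ticket) of the usual 
workaround today: express the NOT IN filter as a left anti join instead of collecting 
values locally. Note that this does not reproduce the NULL-handling subtleties of SQL's 
NOT IN.
{code:python}
# Workaround sketch: NOT IN expressed as a left anti join (toy data).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table1 = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["name", "x"])
table2 = spark.createDataFrame([("b",), ("c",)], ["name"])

# Keep only the rows of table1 whose name does not appear in table2.
result = table1.join(table2.select("name"), on="name", how="left_anti")
result.show()  # only ("a", 1) remains
{code}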



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23946:

Shepherd:   (was: Sameer Agarwal)
Target Version/s: 2.3.1

> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --
>
> Key: SPARK-23946
> URL: https://issues.apache.org/jira/browse/SPARK-23946
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Christopher Hoshino-Fish
>Priority: Major
>  Labels: doc-impacting, docs-missing
> Fix For: 2.3.1
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for 
> the source code
>  
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
>  click on the Source link and it goes to:
>  
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Christopher Hoshino-Fish (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Hoshino-Fish updated SPARK-23946:
-
Description: 
Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for the 
source code
 
[https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
 click on the Source link and it goes to:
 
[https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]

  was:
Currently the 2.3.0 scaladocs point towards Sameer's github for the source code
https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$
click on the Source link and it goes to:
https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala


> 2.3.0 and Latest ScalaDocs are linked to the wrong source code
> --
>
> Key: SPARK-23946
> URL: https://issues.apache.org/jira/browse/SPARK-23946
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: Christopher Hoshino-Fish
>Priority: Major
>  Labels: doc-impacting, docs-missing
> Fix For: 2.3.1
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the 2.3.0 and Latest scaladocs point towards Sameer's github for 
> the source code
>  
> [https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$]
>  click on the Source link and it goes to:
>  
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values

2018-04-09 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431403#comment-16431403
 ] 

Bruce Robbins commented on SPARK-23715:
---

I've been convinced this is worth fixing, at least for String input values, 
since a user was actually seeing wrong results despite specifying a datetime 
value with a UTC timezone.

One way to fix this is to create a new expression type for converting string 
values to timestamp values. The Analyzer would place this expression as a left 
child of FromUTCTimestamp, if needed. This new expression type would be more 
aware of FromUTCTimestamp's expectations than a general purpose Cast expression 
(for example, it could reject string datetime values that contain an explicit 
timezone).

Any opinions?

> from_utc_timestamp returns incorrect results for some UTC date/time values
> --
>
> Key: SPARK-23715
> URL: https://issues.apache.org/jira/browse/SPARK-23715
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This produces the expected answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 07:18:23|
> +---+
> {noformat}
> However, the equivalent UTC input (but with an explicit timezone) produces a 
> wrong answer:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 00:18:23|
> +---+
> {noformat}
> Additionally, the equivalent Unix time (1520921903, which is also 
> "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
> {noformat}
> df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
> ).as("dt")).show
> +---+
> | dt|
> +---+
> |2018-03-13 00:18:23|
> +---+
> {noformat}
> These issues stem from the fact that the FromUTCTimestamp expression, despite 
> its name, expects the input to be in the user's local timezone. There is some 
> magic under the covers to make things work (mostly) as the user expects.
> As an example, let's say a user in Los Angeles issues the following:
> {noformat}
> df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
> ).as("dt")).show
> {noformat}
> FromUTCTimestamp gets as input a Timestamp (long) value representing
> {noformat}
> 2018-03-13T06:18:23-07:00 (long value 152094710300)
> {noformat}
> What FromUTCTimestamp needs instead is
> {noformat}
> 2018-03-13T06:18:23+00:00 (long value 152092190300)
> {noformat}
> So, it applies the local timezone's offset to the input timestamp to get the 
> correct value (152094710300 minus 7 hours is 152092190300). Then it 
> can process the value and produce the expected output.
> When the user explicitly specifies a time zone, FromUTCTimestamp's 
> assumptions break down. The input is no longer in the local time zone. 
> Because of the way input data is implicitly casted, FromUTCTimestamp never 
> knows whether the input data had an explicit timezone.
> Here are some gory details:
> There is sometimes a mismatch in expectations between the (string => 
> timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp 
> expression never sees the actual input string (the cast "intercepts" the 
> input and converts it to a long timestamp before FromUTCTimestamp uses the 
> value), FromUTCTimestamp cannot reject any input value that would exercise 
> this mismatch in expectations.
> There is a similar mismatch in expectations in the (integer => timestamp) 
> cast and FromUTCTimestamp. As a result, Unix time input almost always 
> produces incorrect output.
> h3. When things work as expected for String input:
> When from_utc_timestamp is passed a string time value with no time zone, 
> DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the 
> datetime string as though it's in the user's local time zone. Because 
> DateTimeUtils.stringToTimestamp is a general function, this is reasonable.
> As a result, FromUTCTimestamp's input is a timestamp shifted by the local 
> time zone's offset. FromUTCTimestamp assumes this (or more accurately, a 
> utility function called by FromUTCTimestamp assumes this), so the first thing 
> it does is reverse-shift to get it back the correct value. Now that the long 
> value has been shifted back to the correct timestamp value, it can now 
> process it (by shifting it again based on the specified time zone).
> h3. When things go wrong with String input:
> When from_utc_timestamp is passed a 
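
To make the shift described in the quoted description concrete, here is a toy arithmetic 
sketch (plain Python, not Spark code). It assumes microsecond-precision timestamp values 
and the -07:00 Los Angeles offset from the example; the exact long values are 
reconstructed from the Unix time 1520921903 quoted above.
{code:python}
# Toy illustration of the reverse shift FromUTCTimestamp applies to its input.
MICROS_PER_HOUR = 3600 * 1000 * 1000
local_offset_hours = -7  # Los Angeles during DST, as in the example

# "2018-03-13T06:18:23" parsed in the local (-07:00) time zone by the implicit cast:
as_parsed_in_local_tz = 1520947103000000
# FromUTCTimestamp first undoes the local offset to recover the intended UTC instant:
as_utc = as_parsed_in_local_tz + local_offset_hours * MICROS_PER_HOUR
print(as_utc)  # 1520921903000000, i.e. 2018-03-13T06:18:23+00:00
{code}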

[jira] [Created] (SPARK-23946) 2.3.0 and Latest ScalaDocs are linked to the wrong source code

2018-04-09 Thread Christopher Hoshino-Fish (JIRA)
Christopher Hoshino-Fish created SPARK-23946:


 Summary: 2.3.0 and Latest ScalaDocs are linked to the wrong source 
code
 Key: SPARK-23946
 URL: https://issues.apache.org/jira/browse/SPARK-23946
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.3.0
Reporter: Christopher Hoshino-Fish
 Fix For: 2.3.1


Currently the 2.3.0 scaladocs point towards Sameer's github for the source code
https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.functions$
click on the Source link and it goes to:
https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23944) Add Param set functions to LSHModel types

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23944:


Assignee: Apache Spark

> Add Param set functions to LSHModel types
> -
>
> Key: SPARK-23944
> URL: https://issues.apache.org/jira/browse/SPARK-23944
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> 2 param set methods ( setInputCol, setOutputCol) are added to the two 
> LSHModel types for min hash and random projections.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23944) Add Param set functions to LSHModel types

2018-04-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431388#comment-16431388
 ] 

Apache Spark commented on SPARK-23944:
--

User 'ludatabricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/21015

> Add Param set functions to LSHModel types
> -
>
> Key: SPARK-23944
> URL: https://issues.apache.org/jira/browse/SPARK-23944
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> 2 param set methods ( setInputCol, setOutputCol) are added to the two 
> LSHModel types for min hash and random projections.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23944) Add Param set functions to LSHModel types

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23944:


Assignee: (was: Apache Spark)

> Add Param set functions to LSHModel types
> -
>
> Key: SPARK-23944
> URL: https://issues.apache.org/jira/browse/SPARK-23944
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> 2 param set methods ( setInputCol, setOutputCol) are added to the two 
> LSHModel types for min hash and random projections.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2018-04-09 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-23945:


 Summary: Column.isin() should accept a single-column DataFrame as 
input
 Key: SPARK-23945
 URL: https://issues.apache.org/jira/browse/SPARK-23945
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Nicholas Chammas


In SQL you can filter rows based on the result of a subquery:

 
{code:java}
SELECT *
FROM table1
WHERE name NOT IN (
SELECT name
FROM table2
);{code}
In the Spark DataFrame API, the equivalent would probably look like this:
{code:java}
(table1
.where(
~col('name').isin(
table2.select('name')
)
)
){code}
However, .isin() currently [only accepts a local list of 
values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].

I imagine making this enhancement would happen as part of a larger effort to 
support correlated subqueries in the DataFrame API.

Or perhaps there is no plan to support this style of query in the DataFrame 
API, and queries like this should instead be written in a different way? How 
would we write a query like the one I have above in the DataFrame API, without 
needing to collect values locally for the NOT IN filter?

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23941) Mesos task failed on specific spark app name

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23941:


Assignee: (was: Apache Spark)

> Mesos task failed on specific spark app name
> 
>
> Key: SPARK-23941
> URL: https://issues.apache.org/jira/browse/SPARK-23941
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.1, 2.3.0
> Environment: OS: Ubuntu 16.0.4
> Spark: 2.3.0
> Mesos: 1.5.0
>Reporter: bounkong khamphousone
>Priority: Major
>
> It seems to be a bug related to spark's MesosClusterDispatcher. In order to 
> reproduce the bug, you need to have mesos and mesos dispatcher running.
> I'm currently running mesos 1.5 and spark 2.3.0 (tried with 2.2.1 as well).
> If you launch the following program:
>  
> {code:java}
> spark-submit --master mesos://127.0.1.1:7077 --deploy-mode cluster --class 
> org.apache.spark.examples.SparkPi --name "my favorite task (myId = 123-456)" 
> /home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar 100
> {code}
> , then the task fails with the following output :
>  
> {code:java}
> I0409 11:00:35.360352 22726 fetcher.cpp:551] Fetcher Info: 
> {"cache_directory":"\/tmp\/mesos\/fetch\/tiboun","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"\/home\/tiboun\/tools\/spark\/examples\/jars\/spark-examples_2.11-2.3.0.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/0262246c-14a3-4408-9b74-5e3b65dc1344-S0\/frameworks\/edff1a6f-38c6-46e0-a3c1-62a8fbfc2b5d-0014\/executors\/driver-20180409110035-0004\/runs\/8ac20902-74e1-45c4-9ab6-c52a79940189","user":"tiboun"}
> I0409 11:00:35.363119 22726 fetcher.cpp:450] Fetching URI 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> I0409 11:00:35.363143 22726 fetcher.cpp:291] Fetching directly into the 
> sandbox directory
> I0409 11:00:35.363168 22726 fetcher.cpp:225] Fetching URI 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> W0409 11:00:35.366839 22726 fetcher.cpp:330] Copying instead of extracting 
> resource from URI with 'extract' flag, because it does not seem to be an 
> archive: /home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar
> I0409 11:00:35.366873 22726 fetcher.cpp:603] Fetched 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar' to 
> '/var/lib/mesos/slaves/0262246c-14a3-4408-9b74-5e3b65dc1344-S0/frameworks/edff1a6f-38c6-46e0-a3c1-62a8fbfc2b5d-0014/executors/driver-20180409110035-0004/runs/8ac20902-74e1-45c4-9ab6-c52a79940189/spark-examples_2.11-2.3.0.jar'
> I0409 11:00:35.366878 22726 fetcher.cpp:608] Successfully fetched all URIs 
> into 
> '/var/lib/mesos/slaves/0262246c-14a3-4408-9b74-5e3b65dc1344-S0/frameworks/edff1a6f-38c6-46e0-a3c1-62a8fbfc2b5d-0014/executors/driver-20180409110035-0004/runs/8ac20902-74e1-45c4-9ab6-c52a79940189'
> I0409 11:00:35.438725 22733 exec.cpp:162] Version: 1.5.0
> I0409 11:00:35.440770 22734 exec.cpp:236] Executor registered on agent 
> 0262246c-14a3-4408-9b74-5e3b65dc1344-S0
> I0409 11:00:35.441388 22733 executor.cpp:171] Received SUBSCRIBED event
> I0409 11:00:35.441586 22733 executor.cpp:175] Subscribed executor on 
> tiboun-Dell-Precision-M3800
> I0409 11:00:35.441643 22733 executor.cpp:171] Received LAUNCH event
> I0409 11:00:35.441767 22733 executor.cpp:638] Starting task 
> driver-20180409110035-0004
> I0409 11:00:35.445050 22733 executor.cpp:478] Running 
> '/usr/libexec/mesos/mesos-containerizer launch '
> I0409 11:00:35.445770 22733 executor.cpp:651] Forked command at 22743
> sh: 1: Syntax error: "(" unexpected
> I0409 11:00:35.538661 22736 executor.cpp:938] Command exited with status 2 
> (pid: 22743)
> I0409 11:00:36.541016 22739 process.cpp:887] Failed to accept socket: future 
> discarded
> {code}
> If you remove the parentheses, you get the following result:
>  
> {code:java}
> I0409 11:03:02.023701 23085 fetcher.cpp:551] Fetcher Info: 
> {"cache_directory":"\/tmp\/mesos\/fetch\/tiboun","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"\/home\/tiboun\/tools\/spark\/examples\/jars\/spark-examples_2.11-2.3.0.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/0262246c-14a3-4408-9b74-5e3b65dc1344-S0\/frameworks\/edff1a6f-38c6-46e0-a3c1-62a8fbfc2b5d-0014\/executors\/driver-20180409110301-0006\/runs\/f887c0ab-b48f-4382-850c-383c1c944269","user":"tiboun"}
> I0409 11:03:02.028268 23085 fetcher.cpp:450] Fetching URI 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> I0409 11:03:02.028302 23085 fetcher.cpp:291] Fetching directly into the 
> sandbox directory
> I0409 11:03:02.028336 23085 fetcher.cpp:225] Fetching URI 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> W0409 11:03:02.031209 

[jira] [Assigned] (SPARK-23941) Mesos task failed on specific spark app name

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23941:


Assignee: Apache Spark

> Mesos task failed on specific spark app name
> 
>
> Key: SPARK-23941
> URL: https://issues.apache.org/jira/browse/SPARK-23941
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.1, 2.3.0
> Environment: OS: Ubuntu 16.0.4
> Spark: 2.3.0
> Mesos: 1.5.0
>Reporter: bounkong khamphousone
>Assignee: Apache Spark
>Priority: Major
>
> It seems to be a bug related to spark's MesosClusterDispatcher. In order to 
> reproduce the bug, you need to have mesos and mesos dispatcher running.
> I'm currently running mesos 1.5 and spark 2.3.0 (tried with 2.2.1 as well).
> If you launch the following program:
>  
> {code:java}
> spark-submit --master mesos://127.0.1.1:7077 --deploy-mode cluster --class 
> org.apache.spark.examples.SparkPi --name "my favorite task (myId = 123-456)" 
> /home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar 100
> {code}
> , then the task fails with the following output :
>  
> {code:java}
> I0409 11:00:35.360352 22726 fetcher.cpp:551] Fetcher Info: 
> {"cache_directory":"\/tmp\/mesos\/fetch\/tiboun","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"\/home\/tiboun\/tools\/spark\/examples\/jars\/spark-examples_2.11-2.3.0.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/0262246c-14a3-4408-9b74-5e3b65dc1344-S0\/frameworks\/edff1a6f-38c6-46e0-a3c1-62a8fbfc2b5d-0014\/executors\/driver-20180409110035-0004\/runs\/8ac20902-74e1-45c4-9ab6-c52a79940189","user":"tiboun"}
> I0409 11:00:35.363119 22726 fetcher.cpp:450] Fetching URI 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> I0409 11:00:35.363143 22726 fetcher.cpp:291] Fetching directly into the 
> sandbox directory
> I0409 11:00:35.363168 22726 fetcher.cpp:225] Fetching URI 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> W0409 11:00:35.366839 22726 fetcher.cpp:330] Copying instead of extracting 
> resource from URI with 'extract' flag, because it does not seem to be an 
> archive: /home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar
> I0409 11:00:35.366873 22726 fetcher.cpp:603] Fetched 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar' to 
> '/var/lib/mesos/slaves/0262246c-14a3-4408-9b74-5e3b65dc1344-S0/frameworks/edff1a6f-38c6-46e0-a3c1-62a8fbfc2b5d-0014/executors/driver-20180409110035-0004/runs/8ac20902-74e1-45c4-9ab6-c52a79940189/spark-examples_2.11-2.3.0.jar'
> I0409 11:00:35.366878 22726 fetcher.cpp:608] Successfully fetched all URIs 
> into 
> '/var/lib/mesos/slaves/0262246c-14a3-4408-9b74-5e3b65dc1344-S0/frameworks/edff1a6f-38c6-46e0-a3c1-62a8fbfc2b5d-0014/executors/driver-20180409110035-0004/runs/8ac20902-74e1-45c4-9ab6-c52a79940189'
> I0409 11:00:35.438725 22733 exec.cpp:162] Version: 1.5.0
> I0409 11:00:35.440770 22734 exec.cpp:236] Executor registered on agent 
> 0262246c-14a3-4408-9b74-5e3b65dc1344-S0
> I0409 11:00:35.441388 22733 executor.cpp:171] Received SUBSCRIBED event
> I0409 11:00:35.441586 22733 executor.cpp:175] Subscribed executor on 
> tiboun-Dell-Precision-M3800
> I0409 11:00:35.441643 22733 executor.cpp:171] Received LAUNCH event
> I0409 11:00:35.441767 22733 executor.cpp:638] Starting task 
> driver-20180409110035-0004
> I0409 11:00:35.445050 22733 executor.cpp:478] Running 
> '/usr/libexec/mesos/mesos-containerizer launch '
> I0409 11:00:35.445770 22733 executor.cpp:651] Forked command at 22743
> sh: 1: Syntax error: "(" unexpected
> I0409 11:00:35.538661 22736 executor.cpp:938] Command exited with status 2 
> (pid: 22743)
> I0409 11:00:36.541016 22739 process.cpp:887] Failed to accept socket: future 
> discarded
> {code}
> If you remove the parentheses, you get the following result:
>  
> {code:java}
> I0409 11:03:02.023701 23085 fetcher.cpp:551] Fetcher Info: 
> {"cache_directory":"\/tmp\/mesos\/fetch\/tiboun","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"\/home\/tiboun\/tools\/spark\/examples\/jars\/spark-examples_2.11-2.3.0.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/0262246c-14a3-4408-9b74-5e3b65dc1344-S0\/frameworks\/edff1a6f-38c6-46e0-a3c1-62a8fbfc2b5d-0014\/executors\/driver-20180409110301-0006\/runs\/f887c0ab-b48f-4382-850c-383c1c944269","user":"tiboun"}
> I0409 11:03:02.028268 23085 fetcher.cpp:450] Fetching URI 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> I0409 11:03:02.028302 23085 fetcher.cpp:291] Fetching directly into the 
> sandbox directory
> I0409 11:03:02.028336 23085 fetcher.cpp:225] Fetching URI 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> 

[jira] [Commented] (SPARK-23941) Mesos task failed on specific spark app name

2018-04-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431337#comment-16431337
 ] 

Apache Spark commented on SPARK-23941:
--

User 'tiboun' has created a pull request for this issue:
https://github.com/apache/spark/pull/21014

> Mesos task failed on specific spark app name
> 
>
> Key: SPARK-23941
> URL: https://issues.apache.org/jira/browse/SPARK-23941
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Submit
>Affects Versions: 2.2.1, 2.3.0
> Environment: OS: Ubuntu 16.0.4
> Spark: 2.3.0
> Mesos: 1.5.0
>Reporter: bounkong khamphousone
>Priority: Major
>
> It seems to be a bug related to spark's MesosClusterDispatcher. In order to 
> reproduce the bug, you need to have mesos and mesos dispatcher running.
> I'm currently running mesos 1.5 and spark 2.3.0 (tried with 2.2.1 as well).
> If you launch the following program:
>  
> {code:java}
> spark-submit --master mesos://127.0.1.1:7077 --deploy-mode cluster --class 
> org.apache.spark.examples.SparkPi --name "my favorite task (myId = 123-456)" 
> /home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar 100
> {code}
> , then the task fails with the following output :
>  
> {code:java}
> I0409 11:00:35.360352 22726 fetcher.cpp:551] Fetcher Info: 
> {"cache_directory":"\/tmp\/mesos\/fetch\/tiboun","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"\/home\/tiboun\/tools\/spark\/examples\/jars\/spark-examples_2.11-2.3.0.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/0262246c-14a3-4408-9b74-5e3b65dc1344-S0\/frameworks\/edff1a6f-38c6-46e0-a3c1-62a8fbfc2b5d-0014\/executors\/driver-20180409110035-0004\/runs\/8ac20902-74e1-45c4-9ab6-c52a79940189","user":"tiboun"}
> I0409 11:00:35.363119 22726 fetcher.cpp:450] Fetching URI 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> I0409 11:00:35.363143 22726 fetcher.cpp:291] Fetching directly into the 
> sandbox directory
> I0409 11:00:35.363168 22726 fetcher.cpp:225] Fetching URI 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> W0409 11:00:35.366839 22726 fetcher.cpp:330] Copying instead of extracting 
> resource from URI with 'extract' flag, because it does not seem to be an 
> archive: /home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar
> I0409 11:00:35.366873 22726 fetcher.cpp:603] Fetched 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar' to 
> '/var/lib/mesos/slaves/0262246c-14a3-4408-9b74-5e3b65dc1344-S0/frameworks/edff1a6f-38c6-46e0-a3c1-62a8fbfc2b5d-0014/executors/driver-20180409110035-0004/runs/8ac20902-74e1-45c4-9ab6-c52a79940189/spark-examples_2.11-2.3.0.jar'
> I0409 11:00:35.366878 22726 fetcher.cpp:608] Successfully fetched all URIs 
> into 
> '/var/lib/mesos/slaves/0262246c-14a3-4408-9b74-5e3b65dc1344-S0/frameworks/edff1a6f-38c6-46e0-a3c1-62a8fbfc2b5d-0014/executors/driver-20180409110035-0004/runs/8ac20902-74e1-45c4-9ab6-c52a79940189'
> I0409 11:00:35.438725 22733 exec.cpp:162] Version: 1.5.0
> I0409 11:00:35.440770 22734 exec.cpp:236] Executor registered on agent 
> 0262246c-14a3-4408-9b74-5e3b65dc1344-S0
> I0409 11:00:35.441388 22733 executor.cpp:171] Received SUBSCRIBED event
> I0409 11:00:35.441586 22733 executor.cpp:175] Subscribed executor on 
> tiboun-Dell-Precision-M3800
> I0409 11:00:35.441643 22733 executor.cpp:171] Received LAUNCH event
> I0409 11:00:35.441767 22733 executor.cpp:638] Starting task 
> driver-20180409110035-0004
> I0409 11:00:35.445050 22733 executor.cpp:478] Running 
> '/usr/libexec/mesos/mesos-containerizer launch '
> I0409 11:00:35.445770 22733 executor.cpp:651] Forked command at 22743
> sh: 1: Syntax error: "(" unexpected
> I0409 11:00:35.538661 22736 executor.cpp:938] Command exited with status 2 
> (pid: 22743)
> I0409 11:00:36.541016 22739 process.cpp:887] Failed to accept socket: future 
> discarded
> {code}
> If you remove the parentheses, you get the following result:
>  
> {code:java}
> I0409 11:03:02.023701 23085 fetcher.cpp:551] Fetcher Info: 
> {"cache_directory":"\/tmp\/mesos\/fetch\/tiboun","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"extract":true,"value":"\/home\/tiboun\/tools\/spark\/examples\/jars\/spark-examples_2.11-2.3.0.jar"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/0262246c-14a3-4408-9b74-5e3b65dc1344-S0\/frameworks\/edff1a6f-38c6-46e0-a3c1-62a8fbfc2b5d-0014\/executors\/driver-20180409110301-0006\/runs\/f887c0ab-b48f-4382-850c-383c1c944269","user":"tiboun"}
> I0409 11:03:02.028268 23085 fetcher.cpp:450] Fetching URI 
> '/home/tiboun/tools/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> I0409 11:03:02.028302 23085 fetcher.cpp:291] Fetching directly into the 
> sandbox directory
> I0409 11:03:02.028336 23085 fetcher.cpp:225] Fetching URI 
> 
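
As an illustration of the failure mode (a toy sketch, not the dispatcher code): the app 
name ends up interpolated into a /bin/sh command line, and an unquoted "(" is a shell 
syntax error, which proper quoting would avoid.
{code:python}
# Toy sketch of why the parentheses break the launch command (hypothetical command
# strings, not the actual MesosClusterDispatcher command line).
import shlex
import subprocess

app_name = "my favorite task (myId = 123-456)"

unsafe = "echo launching %s" % app_name              # "(" reaches the shell unquoted
safe = "echo launching %s" % shlex.quote(app_name)   # name wrapped in single quotes

print(safe)                           # echo launching 'my favorite task (myId = 123-456)'
subprocess.call(safe, shell=True)     # prints the full name
subprocess.call(unsafe, shell=True)   # fails: shell syntax error at "("
{code}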

[jira] [Assigned] (SPARK-23874) Upgrade apache/arrow to 0.9.0

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23874:


Assignee: Apache Spark  (was: Bryan Cutler)

> Upgrade apache/arrow to 0.9.0
> -
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Version 0.9.0 of apache arrow comes with a bug fix related to array 
> serialization. 
> https://issues.apache.org/jira/browse/ARROW-1973



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.9.0

2018-04-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431287#comment-16431287
 ] 

Apache Spark commented on SPARK-23874:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/21013

> Upgrade apache/arrow to 0.9.0
> -
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.9.0 of apache arrow comes with a bug fix related to array 
> serialization. 
> https://issues.apache.org/jira/browse/ARROW-1973



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23874) Upgrade apache/arrow to 0.9.0

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23874:


Assignee: Bryan Cutler  (was: Apache Spark)

> Upgrade apache/arrow to 0.9.0
> -
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.9.0 of apache arrow comes with a bug fix related to array 
> serialization. 
> https://issues.apache.org/jira/browse/ARROW-1973



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22919) Bump Apache httpclient versions

2018-04-09 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431198#comment-16431198
 ] 

Steve Loughran commented on SPARK-22919:


Going to highlight that this appears to break Spark with hadoop-aws 2.8; related to this 
[AWS SDK issue|https://github.com/aws/aws-sdk-java/issues/1032]. Not ideal; it means you 
need to stay on Hadoop 2.7.x or move to Hadoop 2.9+ for the s3a stuff to work.

> Bump Apache httpclient versions
> ---
>
> Key: SPARK-22919
> URL: https://issues.apache.org/jira/browse/SPARK-22919
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Minor
> Fix For: 2.3.0
>
>
> I would like to bump the PATCH versions of both the Apache httpclient Apache 
> httpcore. I use the SparkTC Stocator library for connecting to an object 
> store, and I would align the versions to reduce java version mismatches. 
> Furthermore it is good to bump these versions since they fix stability and 
> performance issues:
> https://archive.apache.org/dist/httpcomponents/httpclient/RELEASE_NOTES-4.5.x.txt
> https://www.apache.org/dist/httpcomponents/httpcore/RELEASE_NOTES-4.4.x.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-09 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431188#comment-16431188
 ] 

Li Jin edited comment on SPARK-23929 at 4/9/18 8:39 PM:


I think there are pros and cons for both matching by position and by name.

Matching by position gives the user the flexibility of not needing to spell out 
column names in the UDF, e.g.
{code:java}
@pandas_udf("id long, v double, v1 double", PandasUDFType.GROUPED_MAP)  
def normalize(pdf):
id = pdf.id
vs = # 
    return pd.DataFrame([[id] + vs])
{code}
Matching by name gives the user the flexibility of reordering columns. Admittedly, the 
choice is somewhat arbitrary now, but I am also not sure that one is strictly 
better. [~omri374], in what case would you have out-of-order return values in 
your UDF? I am trying to see how common that is.


was (Author: icexelloss):
I think there are pros and cons for both matching by position and by name.

Matching by position gives the user the flexibility of not needing to spell out 
column names in the UDF, e.g.
{code:java}
@pandas_udf("id long, v double, v1 double", PandasUDFType.GROUPED_MAP)  
def normalize(pdf):
id = pdf.id
vs = # 
    return pd.DataFrame([id] + vs])
{code}
Matching by name gives the user the flexibility of reordering columns. Admittedly, the 
choice is somewhat arbitrary now, but I am also not sure that one is strictly 
better. [~omri374], in what case would you have out-of-order return values in 
your UDF? I am trying to see how common that is.

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+-------------------+
> | id|                  v|
> +---+-------------------+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+-------------------+
> {code}
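
A minimal sketch of the current workaround (it just works with the position-based 
behaviour, it is not a fix): explicitly reorder the returned pandas DataFrame so that 
its column order matches the declared schema string.
{code:python}
# Workaround sketch for the first example above: make the returned columns line up
# with the declared schema ("v double, id long") by position.
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("v double, id long", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    v = pdf.v
    out = pdf.assign(v=(v - v.mean()) / v.std())
    return out[["v", "id"]]   # reorder to follow the declared schema

df.groupby("id").apply(normalize).show()
{code}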



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-09 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431188#comment-16431188
 ] 

Li Jin edited comment on SPARK-23929 at 4/9/18 8:39 PM:


I think there are pros and cons for both matching by position and by name.

Matching by position gives the user the flexibility of not needing to spell out 
column names in the UDF, e.g.
{code:java}
@pandas_udf("id long, v double, v1 double", PandasUDFType.GROUPED_MAP)  
def normalize(pdf):
id = pdf.id
vs = # 
    return pd.DataFrame([id] + vs])
{code}
Matching by name gives the user the flexibility of reordering columns. Admittedly, the 
choice is somewhat arbitrary now, but I am also not sure that one is strictly 
better. [~omri374], in what case would you have out-of-order return values in 
your UDF? I am trying to see how common that is.


was (Author: icexelloss):
I think there are pros and cons for both matching by position and by name.

Matching by position gives the user the flexibility of not needing to spell out 
column names in the UDF, e.g.
{code:java}
@pandas_udf("id long, v double, v1 double", PandasUDFType.GROUPED_MAP)  
def normalize(pdf):
id = pdf.id
vs = # 
    return pd.DataFrame([id + vs])
{code}
Matching by name gives the user the flexibility of reordering columns. Admittedly, the 
choice is somewhat arbitrary now, but I am also not sure that one is strictly 
better. [~omri374], in what case would you have out-of-order return values in 
your UDF? I am trying to see how common that is.

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+-------------------+
> | id|                  v|
> +---+-------------------+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+-------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23944) Add Param set functions to LSHModel types

2018-04-09 Thread Lu Wang (JIRA)
Lu Wang created SPARK-23944:
---

 Summary: Add Param set functions to LSHModel types
 Key: SPARK-23944
 URL: https://issues.apache.org/jira/browse/SPARK-23944
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.3.0
Reporter: Lu Wang
 Fix For: 2.4.0


2 param set methods ( setInputCol, setOutputCol) are added to the two LSHModel 
types for min hash and random projections.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-09 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431188#comment-16431188
 ] 

Li Jin commented on SPARK-23929:


I think there are pros and cons for both matching by position and by name.

Matching by position gives the user the flexibility of not needing to spell out 
column names in the UDF, e.g.
{code:java}
@pandas_udf("id long, v double, v1 double", PandasUDFType.GROUPED_MAP)  
def normalize(pdf):
id = pdf.id
vs = # 
    return pd.DataFrame([id + vs])
{code}
Matching by name gives the user the flexibility of reordering columns. Admittedly, the 
choice is somewhat arbitrary now, but I am also not sure that one is strictly 
better. [~omri374], in what case would you have out-of-order return values in 
your UDF? I am trying to see how common that is.

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Andrew Otto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099
 ] 

Andrew Otto edited comment on SPARK-23890 at 4/9/18 7:32 PM:
-

Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 


was (Author: ottomata):
Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]|http://example.com/]
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  
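
For illustration, the kind of statement being blocked looks roughly like the sketch 
below (table and field names are hypothetical). In Spark 2.x this is rejected as a 
column type change, even though the struct only gains a field:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical table/column names. The struct column 'meta' only gains the
# field 'new_field', but Spark 2.x rejects the statement because the new type
# string differs from the current one.
spark.sql("""
    ALTER TABLE web_events
    CHANGE COLUMN meta meta
    STRUCT<request_id: STRING, dt: STRING, new_field: STRING>
""")
{code}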



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Andrew Otto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099
 ] 

Andrew Otto commented on SPARK-23890:
-

Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|http://example.com]https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Andrew Otto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099
 ] 

Andrew Otto edited comment on SPARK-23890 at 4/9/18 7:31 PM:
-

Hah! As a temporary workaround, we are [[instantiating a JDBC connection to 
Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]|http://example.com/]
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 


was (Author: ottomata):
Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|http://example.com/] to get around Spark 2's restriction...halp!  Don't 
make us do this!  :)

 

 

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16996) Hive ACID delta files not seen

2018-04-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431100#comment-16431100
 ] 

Maciej Bryński commented on SPARK-16996:


[~ste...@apache.org]
Are you prepared for a lot of problems in HDP3?
{quote}
ACID-Based Tables Enabled by Default

ACID properties of Hive facilitate database transactions. ACID (which stands for 
Atomicity, Consistency, Isolation, and Durability) is turned on for Hive tables 
by default starting with this HDP release, which means Hive tables do not 
require special flags or configurations to accept updates (in particular 
configurations and bucketing).
{quote}
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.0/bk_hive-performance-tuning/content/ch_wn-hptg.html
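
If the HDP3 default is a concern, one commonly mentioned mitigation (a rough sketch 
only, assuming the data can live outside Hive-managed storage) is to create the 
table as an external, non-transactional table, so no ACID delta files are involved 
and Spark SQL can read it directly:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical table name and location. External tables are not transactional,
# so reads from Spark SQL do not depend on ACID delta files or compaction.
spark.sql("""
    CREATE EXTERNAL TABLE deltas_ext (cle STRING, valeur STRING)
    STORED AS ORC
    LOCATION '/apps/hive/warehouse/deltas_ext'
""")
{code}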

> Hive ACID delta files not seen
> --
>
> Key: SPARK-16996
> URL: https://issues.apache.org/jira/browse/SPARK-16996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.3, 2.1.2, 2.2.0
> Environment: Hive 1.2.1, Spark 1.5.2
>Reporter: Benjamin BONNET
>Priority: Critical
>
> spark-sql seems not to see data stored as delta files in an ACID Hive table.
> Actually I encountered the same problem as described here: 
> http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp
> For example, create an ACID table with HiveCLI and insert a row :
> {code}
> set hive.support.concurrency=true;
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.compactor.initiator.on=true;
> set hive.compactor.worker.threads=1;
>  CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 
> BUCKETS
> ROW FORMAT SERDE  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS 
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> TBLPROPERTIES ('transactional'='true');
> INSERT INTO deltas VALUES("a","a");
> {code}
> Then make a query with spark-sql CLI :
> {code}
> SELECT * FROM deltas;
> {code}
> That query gets no result and there are no errors in logs.
> If you go to HDFS to inspect table files, you find only deltas
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxr-x---   - me hdfs  0 2016-08-10 14:03 
> /apps/hive/warehouse/deltas/delta_0020943_0020943
> {code}
> Then if you run compaction on that table (in HiveCLI) :
> {code}
> ALTER TABLE deltas COMPACT 'MAJOR';
> {code}
> As a result, the delta will be compacted into a base file:
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxrwxrwx   - me hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> {code}
> Go back to spark-sql and the same query gets a result :
> {code}
> SELECT * FROM deltas;
> a   a
> Time taken: 0.477 seconds, Fetched 1 row(s)
> {code}
> But the next time you make an insert into the Hive table: 
> {code}
> INSERT INTO deltas VALUES("b","b");
> {code}
> spark-sql will immediately see changes : 
> {code}
> SELECT * FROM deltas;
> a   a
> b   b
> Time taken: 0.122 seconds, Fetched 2 row(s)
> {code}
> Yet there was no other compaction, but spark-sql "sees" the base AND the 
> delta file :
> {code}
> ~> hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 2 items
> drwxrwxrwx   - valdata hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> drwxr-x---   - valdata hdfs  0 2016-08-10 15:31 
> /apps/hive/warehouse/deltas/delta_0020956_0020956
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Andrew Otto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099
 ] 

Andrew Otto edited comment on SPARK-23890 at 4/9/18 7:31 PM:
-

Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]|http://example.com/]
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 


was (Author: ottomata):
Hah! As a temporary workaround, we are [[instantiating a JDBC connection to 
Hive|https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala]|http://example.com/]
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Andrew Otto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431099#comment-16431099
 ] 

Andrew Otto edited comment on SPARK-23890 at 4/9/18 7:31 PM:
-

Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|http://example.com/] to get around Spark 2's restriction...halp!  Don't 
make us do this!  :)

 

 


was (Author: ottomata):
Hah! As a temporary workaround, we are [instantiating a JDBC connection to 
Hive|http://example.com]https://gerrit.wikimedia.org/r/#/c/425084/2/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/DataFrameToHive.scala
 to get around Spark 2's restriction...halp!  Don't make us do this!  :)

 

 

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14681) Provide label/impurity stats for spark.ml decision tree nodes

2018-04-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14681.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20786
[https://github.com/apache/spark/pull/20786]

> Provide label/impurity stats for spark.ml decision tree nodes
> -
>
> Key: SPARK-14681
> URL: https://issues.apache.org/jira/browse/SPARK-14681
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, spark.ml decision trees provide all node info except for the 
> aggregated stats about labels and impurities.  This task is to provide those 
> publicly.  We need to choose a good API for it, so we should discuss the 
> design on this issue before implementing it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14681) Provide label/impurity stats for spark.ml decision tree nodes

2018-04-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-14681:
-

Assignee: Weichen Xu

> Provide label/impurity stats for spark.ml decision tree nodes
> -
>
> Key: SPARK-14681
> URL: https://issues.apache.org/jira/browse/SPARK-14681
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
>
> Currently, spark.ml decision trees provide all node info except for the 
> aggregated stats about labels and impurities.  This task is to provide those 
> publicly.  We need to choose a good API for it, so we should discuss the 
> design on this issue before implementing it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23890:


Assignee: Apache Spark

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Assignee: Apache Spark
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431082#comment-16431082
 ] 

Apache Spark commented on SPARK-23890:
--

User 'ottomata' has created a pull request for this issue:
https://github.com/apache/spark/pull/21012

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23890:


Assignee: (was: Apache Spark)

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23890) Hive ALTER TABLE CHANGE COLUMN for struct type no longer works

2018-04-09 Thread Andrew Otto (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Otto updated SPARK-23890:

External issue URL: https://github.com/apache/spark/pull/21012

> Hive ALTER TABLE CHANGE COLUMN for struct type no longer works
> --
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Otto
>Priority: Major
>
> As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE 
> CHANGE COLUMN commands to Hive.  This restriction was loosened in 
> [https://github.com/apache/spark/pull/12714] to allow for those commands if 
> they only change the column comment.
> Wikimedia has been evolving Parquet backed Hive tables with data originally 
> from JSON events by adding newly found columns to the Hive table schema, via 
> a Spark job we call 'Refine'.  We do this by recursively merging an input 
> DataFrame schema with a Hive table DataFrame schema, finding new fields, and 
> then issuing an ALTER TABLE statement to add the columns.  However, because 
> we allow for nested data types in the incoming JSON data, we make extensive 
> use of struct type fields.  In order to add newly detected fields in a nested 
> data type, we must alter the struct column and append the nested struct 
> field.  This requires CHANGE COLUMN that alters the column type.  In reality, 
> the 'type' of the column is not changing, it is just a new field being 
> added to the struct, but to SQL, this looks like a type change.
> We were about to upgrade to Spark 2 but this new restriction in SQL DDL that 
> can be sent to Hive will block us.  I believe this is fixable by adding an 
> exception in 
> [command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
>  to allow ALTER TABLE CHANGE COLUMN with a new type, if the original type and 
> destination type are both struct types, and the destination type only adds 
> new fields.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21005) VectorIndexerModel does not prepare output column field correctly

2018-04-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431079#comment-16431079
 ] 

Joseph K. Bradley commented on SPARK-21005:
---

I don't actually see why this is a problem: If a feature is categorical, we 
should not silently convert it to continuous.  To use a high-arity categorical 
feature in a decision tree, one should convert it to a different representation 
first, such as hashing to a set of bins with HashingTF.

That said, I do think we should clarify this behavior in the VectorIndexer 
docstring.  I know it's been a long time since you sent your PR, but would you 
want to update it to simply update the docs?  If you're busy now, I'd be happy 
to take it over though.  Thanks!
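
For reference, a rough PySpark sketch of the hashing approach mentioned above, 
bucketing a high-arity string feature into a fixed number of bins before handing it 
to a tree model (the column names and numFeatures value are made up for 
illustration):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import HashingTF

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("user_123",), ("user_456",)], ["user_id"])

# HashingTF expects an array column, so wrap the categorical value first,
# then hash it into a small, fixed number of buckets.
wrapped = df.withColumn("user_id_arr", F.array("user_id"))
hasher = HashingTF(inputCol="user_id_arr", outputCol="user_id_hashed", numFeatures=32)
hasher.transform(wrapped).show(truncate=False)
{code}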

> VectorIndexerModel does not prepare output column field correctly
> -
>
> Key: SPARK-21005
> URL: https://issues.apache.org/jira/browse/SPARK-21005
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Chen Lin
>Priority: Major
>
> From my understanding through reading the documentation,  VectorIndexer 
> decides which features should be categorical based on the number of distinct 
> values, where features with at most maxCategories are declared categorical. 
> Meanwhile, those features which exceed maxCategories are declared continuous. 
> Currently, VectorIndexerModel works all right with a dataset which has an empty 
> schema. However, when VectorIndexerModel is transforming a dataset with 
> `ML_ATTR` metadata, it may not output the expected result. For example, a 
> feature with a nominal attribute whose number of distinct values exceeds 
> maxCategories will not be treated as a continuous feature as we expected, but 
> still as a categorical feature. Thus, it may cause all the tree-based algorithms 
> (like Decision Tree, Random Forest, GBDT, etc.) to throw errors such as "DecisionTree 
> requires maxBins (= $maxPossibleBins) to be at least as large as the number 
> of values in each categorical feature, but categorical feature $maxCategory 
> has $maxCategoriesPerFeature values. Considering remove this and other 
> categorical features with a large number of values, or add more training 
> examples.".
> Correct me if my understanding is wrong.
> I will submit a PR soon to resolve this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22856) Add wrapper for codegen output and nullability

2018-04-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22856.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.4.0

> Add wrapper for codegen output and nullability
> --
>
> Key: SPARK-22856
> URL: https://issues.apache.org/jira/browse/SPARK-22856
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> The codegen output of {{Expression}}, {{ExprCode}}, now encapsulates only 
> strings of output value ({{value}}) and nullability ({{isNull}}). This makes 
> it difficult for us to know what the output really is. I think it would be better 
> if we could add wrappers for the value and nullability that let us easily know 
> that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23923) High-order function: cardinality(x) → bigint

2018-04-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431009#comment-16431009
 ] 

Kazuaki Ishizaki edited comment on SPARK-23923 at 4/9/18 6:36 PM:
--

I will work on this.


was (Author: kiszk):
I am working on this.

> High-order function: cardinality(x) → bigint
> 
>
> Key: SPARK-23923
> URL: https://issues.apache.org/jira/browse/SPARK-23923
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html and  
> https://prestodb.io/docs/current/functions/map.html.
> Returns the cardinality (size) of the array/map x.
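
For comparison, Spark SQL already exposes {{size}}, which reports the number of 
elements of an array or map; a quick PySpark illustration of that existing behavior 
(cardinality would add the Presto-style name for the same idea):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1, 2, 3], {"a": 1}), ([], {})], ["arr", "m"])

# size() already counts the elements of arrays and maps.
df.select(F.size("arr").alias("arr_size"), F.size("m").alias("map_size")).show()
{code}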



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23816) FetchFailedException when killing speculative task

2018-04-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23816.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0
   2.2.2

Issue resolved by pull request 20987
[https://github.com/apache/spark/pull/20987]

> FetchFailedException when killing speculative task
> --
>
> Key: SPARK-23816
> URL: https://issues.apache.org/jira/browse/SPARK-23816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: chen xiao
>Assignee: Imran Rashid
>Priority: Major
>  Labels: speculation
> Fix For: 2.2.2, 2.4.0, 2.3.1
>
>
> When Spark tries to kill speculative tasks because another attempt has already 
> succeeded, the task sometimes throws 
> "org.apache.spark.shuffle.FetchFailedException: Error in opening 
> FileSegmentManagedBuffer" and the whole stage fails.
> Other active stages will also fail with the error 
> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle". I then checked the log in the failed executor; there is 
> no error like "MetadataFetchFailedException", so they just failed with no error.
> {code:java}
> 18/03/26 23:12:09 INFO Executor: Executor is trying to kill task 2879.1 in 
> stage 4.0 (TID 13023), reason: another attempt succeeded
> 18/03/26 23:12:09 ERROR ShuffleBlockFetcherIterator: Failed to create input 
> stream from local block
> java.io.IOException: Error in opening 
> FileSegmentManagedBuffer{file=/hadoop02/yarn/local/usercache/pp_risk_grs_datamart_batch/appcache/application_1521504416249_116088/blockmgr-754a22fd-e8d6-4478-bcf8-f1d95f07f4a2/0c/shuffle_24_10_0.data,
>  offset=263687568, length=87231}
>   at 
> org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:114)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:401)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:104)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:103)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:164)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
>   at 

[jira] [Commented] (SPARK-23921) High-order function: array_sort(x) → array

2018-04-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431037#comment-16431037
 ] 

Kazuaki Ishizaki commented on SPARK-23921:
--

I am working on this.

> High-order function: array_sort(x) → array
> --
>
> Key: SPARK-23921
> URL: https://issues.apache.org/jira/browse/SPARK-23921
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Sorts and returns the array x. The elements of x must be orderable. Null 
> elements will be placed at the end of the returned array.
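
A plain-Python reference for the documented semantics (ascending sort with nulls 
placed at the end), purely illustrative and not the eventual Spark implementation:

{code:python}
def array_sort(xs):
    """Reference semantics only: sort ascending, placing None values at the end."""
    return sorted(xs, key=lambda x: (x is None, x))

print(array_sort([3, None, 1, 2]))    # [1, 2, 3, None]
print(array_sort(["b", None, "a"]))   # ['a', 'b', None]
{code}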



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23816) FetchFailedException when killing speculative task

2018-04-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23816:
--

Assignee: Imran Rashid

> FetchFailedException when killing speculative task
> --
>
> Key: SPARK-23816
> URL: https://issues.apache.org/jira/browse/SPARK-23816
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: chen xiao
>Assignee: Imran Rashid
>Priority: Major
>  Labels: speculation
> Fix For: 2.2.2, 2.3.1, 2.4.0
>
>
> When Spark tries to kill speculative tasks because another attempt has already 
> succeeded, the task sometimes throws 
> "org.apache.spark.shuffle.FetchFailedException: Error in opening 
> FileSegmentManagedBuffer" and the whole stage fails.
> Other active stages will also fail with the error 
> "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle". I then checked the log in the failed executor; there is 
> no error like "MetadataFetchFailedException", so they just failed with no error.
> {code:java}
> 18/03/26 23:12:09 INFO Executor: Executor is trying to kill task 2879.1 in 
> stage 4.0 (TID 13023), reason: another attempt succeeded
> 18/03/26 23:12:09 ERROR ShuffleBlockFetcherIterator: Failed to create input 
> stream from local block
> java.io.IOException: Error in opening 
> FileSegmentManagedBuffer{file=/hadoop02/yarn/local/usercache/pp_risk_grs_datamart_batch/appcache/application_1521504416249_116088/blockmgr-754a22fd-e8d6-4478-bcf8-f1d95f07f4a2/0c/shuffle_24_10_0.data,
>  offset=263687568, length=87231}
>   at 
> org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:114)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:401)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:104)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:103)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:164)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
>   at java.io.InputStream.skip(InputStream.java:224)
>   at 
> org.spark_project.guava.io.ByteStreams.skipFully(ByteStreams.java:755)
>  

[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name

2018-04-09 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431014#comment-16431014
 ] 

Bryan Cutler commented on SPARK-23929:
--

cc [~icexelloss]

> pandas_udf schema mapped by position and not by name
> 
>
> Key: SPARK-23929
> URL: https://issues.apache.org/jira/browse/SPARK-23929
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: PySpark
> Spark 2.3.0
>  
>Reporter: Omri
>Priority: Major
>
> The return struct of a pandas_udf should be mapped to the provided schema by 
> name. Currently it's not the case.
> Consider these two examples, where the only change is the order of the fields 
> in the provided schema struct:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show() 
> {code}
> and this one:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))  
> @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)  
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> df.groupby("id").apply(normalize).show()
> {code}
> The results should be the same but they are different:
> For the first code:
> {code:java}
> +---+---+
> |  v| id|
> +---+---+
> |1.0|  0|
> |1.0|  0|
> |2.0|  0|
> |2.0|  0|
> |2.0|  1|
> +---+---+
> {code}
> For the second code:
> {code:java}
> +---+---+
> | id|  v|
> +---+---+
> |  1|-0.7071067811865475|
> |  1| 0.7071067811865475|
> |  2|-0.8320502943378437|
> |  2|-0.2773500981126146|
> |  2| 1.1094003924504583|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23919) High-order function: array_position(x, element) → bigint

2018-04-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431004#comment-16431004
 ] 

Kazuaki Ishizaki edited comment on SPARK-23919 at 4/9/18 6:19 PM:
--

I will work on this.


was (Author: kiszk):
I am working on this.

> High-order function: array_position(x, element) → bigint
> 
>
> Key: SPARK-23919
> URL: https://issues.apache.org/jira/browse/SPARK-23919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the position of the first occurrence of the element in array x (or 0 
> if not found).
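
A plain-Python reference for the documented behavior (1-based position of the first 
occurrence, 0 if the element is absent), purely illustrative:

{code:python}
def array_position(xs, element):
    """Reference semantics only: 1-based index of the first occurrence, 0 if not found."""
    for i, x in enumerate(xs, start=1):
        if x == element:
            return i
    return 0

print(array_position([10, 20, 30, 20], 20))  # 2
print(array_position([10, 20, 30], 99))      # 0
{code}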



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23923) High-order function: cardinality(x) → bigint

2018-04-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431009#comment-16431009
 ] 

Kazuaki Ishizaki commented on SPARK-23923:
--

I am working on this.

> High-order function: cardinality(x) → bigint
> 
>
> Key: SPARK-23923
> URL: https://issues.apache.org/jira/browse/SPARK-23923
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html and  
> https://prestodb.io/docs/current/functions/map.html.
> Returns the cardinality (size) of the array/map x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23919) High-order function: array_position(x, element) → bigint

2018-04-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431004#comment-16431004
 ] 

Kazuaki Ishizaki commented on SPARK-23919:
--

I am working on this.

> High-order function: array_position(x, element) → bigint
> 
>
> Key: SPARK-23919
> URL: https://issues.apache.org/jira/browse/SPARK-23919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the position of the first occurrence of the element in array x (or 0 
> if not found).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23924) High-order function: element_at

2018-04-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431002#comment-16431002
 ] 

Kazuaki Ishizaki commented on SPARK-23924:
--

I will work on this.

> High-order function: element_at
> ---
>
> Key: SPARK-23924
> URL: https://issues.apache.org/jira/browse/SPARK-23924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html and 
> https://prestodb.io/docs/current/functions/map.html 
> * element_at(array, index) → E
> Returns element of array at given index. If index > 0, this function provides 
> the same functionality as the SQL-standard subscript operator ([]). If index 
> < 0, element_at accesses elements from the last to the first.
> * element_at(map, key) → V
> Returns value for given key, or NULL if the key is not contained in the map.
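
A plain-Python reference for the two variants described above, purely illustrative; 
the behavior for index 0 and out-of-range indices is not specified in the 
description, so it is only an assumption here:

{code:python}
def element_at(container, key):
    """Reference semantics only, following the description above."""
    if isinstance(container, dict):
        # element_at(map, key): value for the key, or None (SQL NULL) if absent.
        return container.get(key)
    if key > 0:
        return container[key - 1]   # 1-based access from the front
    if key < 0:
        return container[key]       # negative indices count back from the end
    raise ValueError("index 0 is not covered by the description")

print(element_at([10, 20, 30], 1))    # 10
print(element_at([10, 20, 30], -1))   # 30
print(element_at({"a": 1}, "b"))      # None
{code}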



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23881) Flaky test: JobCancellationSuite."interruptible iterator of shuffle reader"

2018-04-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23881.
-
   Resolution: Fixed
 Assignee: Jiang Xingbo
Fix Version/s: 2.4.0
   2.3.1

> Flaky test: JobCancellationSuite."interruptible iterator of shuffle reader"
> ---
>
> Key: SPARK-23881
> URL: https://issues.apache.org/jira/browse/SPARK-23881
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> The test JobCancellationSuite."interruptible iterator of shuffle reader" has 
> been flaky:
> *branch-2.3*
>  * 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark/JobCancellationSuite/interruptible_iterator_of_shuffle_reader/]
> *master*
>  * 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4301/testReport/junit/org.apache.spark/JobCancellationSuite/interruptible_iterator_of_shuffle_reader/]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-04-09 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430922#comment-16430922
 ] 

Edwina Lu commented on SPARK-23206:
---

The doc wasn't very clear about what the quantile values were for. Thanks for asking.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
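
For concreteness, a rough Scala sketch of the wasted-memory arithmetic described above; the executor count and memory values below are made-up examples, and only the two formulas (used fraction and unused memory) come from the proposal:
{code:scala}
// Hypothetical numbers; only the formulas come from the description above.
val executorMemoryGb = 8.0   // spark.executor.memory (assumed)
val maxJvmUsedGb     = 2.8   // peak JVM used memory across executors (assumed)
val numExecutors     = 100   // executors in the application (assumed)

val usedFraction = maxJvmUsedGb / executorMemoryGb                  // 0.35 -> "35% used"
val unusedGb     = numExecutors * (executorMemoryGb - maxJvmUsedGb) // 520 GB unused

println(f"used executor memory: ${usedFraction * 100}%.0f%%")
println(f"unused memory per application: $unusedGb%.0f GB")
{code}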



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23906) Add UDF trunc(numeric)

2018-04-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-23906:
---

Assignee: Yuming Wang

> Add UDF trunc(numeric)
> --
>
> Key: SPARK-23906
> URL: https://issues.apache.org/jira/browse/SPARK-23906
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuming Wang
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-14582
> We already have {{date_trunc}} and {{trunc}}. Need to discuss whether we 
> should introduce a new name or reuse {{trunc}} for truncating numbers.
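
For discussion, a minimal Scala sketch of one possible trunc(numeric, scale) semantics, assuming truncation towards zero as in HIVE-14582; the helper name and the negative-scale behavior are assumptions, not Spark's implementation:
{code:scala}
// Hypothetical helper; assumes truncation towards zero, per HIVE-14582.
def truncNumeric(n: BigDecimal, scale: Int = 0): BigDecimal =
  n.setScale(scale, BigDecimal.RoundingMode.DOWN)

truncNumeric(BigDecimal("1234.567"), 2)   // ==  1234.56
truncNumeric(BigDecimal("-1234.567"))     // == -1234
truncNumeric(BigDecimal("1234.567"), -2)  // ==  1200   (assumed negative-scale behavior)
{code}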



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-04-09 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430836#comment-16430836
 ] 

Imran Rashid commented on SPARK-23206:
--

Ah, of course, that makes sense -- quantiles for the distribution across 
executors. Sorry, stupid question from me -- I was thinking about the 
time series of values from one (executor, stage) pair.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23916) High-order function: array_join(x, delimiter, null_replacement) → varchar

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23916:


Assignee: Apache Spark

> High-order function: array_join(x, delimiter, null_replacement) → varchar
> -
>
> Key: SPARK-23916
> URL: https://issues.apache.org/jira/browse/SPARK-23916
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Concatenates the elements of the given array using the delimiter and an 
> optional string to replace nulls.
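
A minimal Scala sketch of the semantics described above, assuming Presto's behavior of dropping nulls when no replacement is given; the helper is illustrative only, and the actual behavior is defined by the Spark implementation:
{code:scala}
// Illustrative only; assumes nulls are dropped when no replacement is given (Presto behavior).
def arrayJoin(xs: Seq[String], delimiter: String,
              nullReplacement: Option[String] = None): String =
  xs.flatMap {
    case null => nullReplacement
    case s    => Some(s)
  }.mkString(delimiter)

arrayJoin(Seq("a", null, "b"), ",")                  // == "a,b"
arrayJoin(Seq("a", null, "b"), ",", Some("NULL"))    // == "a,NULL,b"
{code}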



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23916) High-order function: array_join(x, delimiter, null_replacement) → varchar

2018-04-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430808#comment-16430808
 ] 

Apache Spark commented on SPARK-23916:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21011

> High-order function: array_join(x, delimiter, null_replacement) → varchar
> -
>
> Key: SPARK-23916
> URL: https://issues.apache.org/jira/browse/SPARK-23916
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Concatenates the elements of the given array using the delimiter and an 
> optional string to replace nulls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23916) High-order function: array_join(x, delimiter, null_replacement) → varchar

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23916:


Assignee: (was: Apache Spark)

> High-order function: array_join(x, delimiter, null_replacement) → varchar
> -
>
> Key: SPARK-23916
> URL: https://issues.apache.org/jira/browse/SPARK-23916
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Concatenates the elements of the given array using the delimiter and an 
> optional string to replace nulls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-23906) Add UDF trunc(numeric)

2018-04-09 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-23906:

Comment: was deleted

(was: Duplicate with SPARK-20754's {{TRUNC}}?)

> Add UDF trunc(numeric)
> --
>
> Key: SPARK-23906
> URL: https://issues.apache.org/jira/browse/SPARK-23906
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-14582
> We already have {{date_trunc}} and {{trunc}}. Need to discuss whether we 
> should introduce a new name or reuse {{trunc}} for truncating numbers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23925) High-order function: repeat(element, count) → array

2018-04-09 Thread Florent Pepin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430771#comment-16430771
 ] 

Florent Pepin commented on SPARK-23925:
---

Hi, I am new to Spark and I am interested in taking this one, if that's okay.

> High-order function: repeat(element, count) → array
> ---
>
> Key: SPARK-23925
> URL: https://issues.apache.org/jira/browse/SPARK-23925
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Repeat element for count times.
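
A minimal Scala sketch of the intended semantics; the helper name and the handling of non-positive counts are assumptions:
{code:scala}
// Hypothetical helper; assumes a non-positive count yields an empty array.
def repeatElement[T](element: T, count: Int): Seq[T] =
  Seq.fill(math.max(count, 0))(element)

repeatElement("x", 3)   // == Seq("x", "x", "x")
repeatElement(1.2, 0)   // == Seq()
{code}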



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-04-09 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430765#comment-16430765
 ] 

Edwina Lu commented on SPARK-23206:
---

[~irashid], this would be quantile values for peak executor memory usage (JVM 
used, execution, etc.) and other executor metrics across executors for a stage. 
It would give some idea of differences in memory usage for executors (for 
example if most executors are using 2G, but there are a couple using 10G). 
There is already information about skew at the task level with the taskSummary 
REST API (input, output and shuffle read/write), but an executor summary would 
show the effects of skew at the executor level.
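
As an illustration of the executor-level summary described above, a small Scala sketch that computes nearest-rank quantiles over hypothetical per-executor peak memory values for one stage (the numbers are made up):
{code:scala}
// Made-up per-executor peak JVM used memory (bytes) for one stage: most at ~2 GB, two at ~10 GB.
val gb = 1024L * 1024 * 1024
val peakJvmUsedPerExecutor = Seq(2, 2, 2, 2, 2, 2, 2, 2, 10, 10).map(_ * gb)

// Nearest-rank quantile over the sorted per-executor values.
def quantile(sorted: Seq[Long], q: Double): Long =
  sorted((q * (sorted.size - 1)).round.toInt)

val sorted  = peakJvmUsedPerExecutor.sorted
val summary = Seq(0.0, 0.25, 0.5, 0.75, 1.0).map(q => q -> quantile(sorted, q) / gb)
// e.g. List((0.0,2), (0.25,2), (0.5,2), (0.75,2), (1.0,10)) -- the max exposes the skew
{code}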

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-04-09 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430763#comment-16430763
 ] 

Xiao Li commented on SPARK-23206:
-

cc [~jiangxb1987] [~Gengliang.Wang] 

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23906) Add UDF trunc(numeric)

2018-04-09 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430761#comment-16430761
 ] 

Xiao Li commented on SPARK-23906:
-

[~q79969786] Could you link your PR to this JIRA?

> Add UDF trunc(numeric)
> --
>
> Key: SPARK-23906
> URL: https://issues.apache.org/jira/browse/SPARK-23906
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-14582
> We already have {{date_trunc}} and {{trunc}}. Need to discuss whether we 
> should introduce a new name or reuse {{trunc}} for truncating numbers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23943) Improve observability of MesosRestServer/MesosClusterDispatcher

2018-04-09 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-23943:
-
Description: 
Two changes:

First, a more robust 
[health-check|http://mesosphere.github.io/marathon/docs/health-checks.html] 
for anyone who runs MesosClusterDispatcher as a marathon app. Specifically, 
this check verifies that the MesosSchedulerDriver is still running as we have 
seen certain cases where it stops (rather quietly) and the only way to revive 
it is a restart. With this health check, marathon will restart the dispatcher 
if the MesosSchedulerDriver stops running. The health check lives at the url 
"/health" and returns a 204 when the server is healthy and a 503 when it is not 
(e.g. the MesosSchedulerDriver stopped running).

Second, a server status endpoint that replies with some basic metrics about the 
server. The status endpoint resides at the url "/status" and responds with:
{code:java}
{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "server OK",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true
}{code}
As you can see, it includes a snapshot of the metrics/health of the scheduler. 
Useful for quick debugging/troubleshooting/monitoring. 

  was:
Add a more robust health-check to MesosRestServer so that anyone who runs 
MesosClusterDispatcher as a marathon app can use it to check the health of the 
server:

[http://mesosphere.github.io/marathon/docs/health-checks.html]

Specifically, this check verifies that the MesosSchedulerDriver is still 
running as we have seen certain cases where it dies (rather quietly) and the 
only way to revive it is a restart. With this health check, marathon will 
restart the dispatcher if the MesosSchedulerDriver stops running. 

The health check lives at the url "/health" and returns a 204 when the server 
is healthy and a 503 when it is not (e.g. the MesosSchedulerDriver stopped 
running).


> Improve observability of MesosRestServer/MesosClusterDispatcher
> ---
>
> Key: SPARK-23943
> URL: https://issues.apache.org/jira/browse/SPARK-23943
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Mesos
>Affects Versions: 2.2.1, 2.3.0
> Environment:  
>  
>Reporter: paul mackles
>Priority: Minor
> Fix For: 2.4.0
>
>
> Two changes:
> First, a more robust 
> [health-check|http://mesosphere.github.io/marathon/docs/health-checks.html] 
> for anyone who runs MesosClusterDispatcher as a marathon app. Specifically, 
> this check verifies that the MesosSchedulerDriver is still running as we have 
> seen certain cases where it stops (rather quietly) and the only way to revive 
> it is a restart. With this health check, marathon will restart the dispatcher 
> if the MesosSchedulerDriver stops running. The health check lives at the url 
> "/health" and returns a 204 when the server is healthy and a 503 when it is 
> not (e.g. the MesosSchedulerDriver stopped running).
> Second, a server status endpoint that replies with some basic metrics about 
> the server. The status endpoint resides at the url "/status" and responds 
> with:
> {code:java}
> {
>   "action" : "ServerStatusResponse",
>   "launchedDrivers" : 0,
>   "message" : "server OK",
>   "queuedDrivers" : 0,
>   "schedulerDriverStopped" : false,
>   "serverSparkVersion" : "2.3.1-SNAPSHOT",
>   "success" : true
> }{code}
> As you can see, it includes a snapshot of the metrics/health of the 
> scheduler. Useful for quick debugging/troubleshooting/monitoring. 
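
For illustration, a small Scala probe of the two endpoints described above; the dispatcher address is an assumption, and only the /health and /status paths and the 204/503 codes come from this proposal:
{code:scala}
import java.net.{HttpURLConnection, URL}

// Returns the HTTP status code for a GET on the given URL (throws if unreachable).
def statusCodeOf(url: String): Int = {
  val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("GET")
  try conn.getResponseCode finally conn.disconnect()
}

val dispatcher = "http://localhost:7077"   // assumption: where MesosClusterDispatcher listens
val healthy    = statusCodeOf(dispatcher + "/health") == 204   // 503 => MesosSchedulerDriver stopped
val statusRc   = statusCodeOf(dispatcher + "/status")          // body is JSON like the example above
println(s"healthy=$healthy, /status returned HTTP $statusRc")
{code}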



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23906) Add UDF trunc(numeric)

2018-04-09 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430632#comment-16430632
 ] 

Yuming Wang commented on SPARK-23906:
-

Duplicate with SPARK-20754's {{TRUNC}}?

> Add UDF trunc(numeric)
> --
>
> Key: SPARK-23906
> URL: https://issues.apache.org/jira/browse/SPARK-23906
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-14582
> We already have {{date_trunc}} and {{trunc}}. Need to discuss whether we 
> should introduce a new name or reuse {{trunc}} for truncating numbers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23943) Improve observability of MesosRestServer

2018-04-09 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-23943:
-
Summary: Improve observability of MesosRestServer  (was: Add more specific 
health check to MesosRestServer)

> Improve observability of MesosRestServer
> 
>
> Key: SPARK-23943
> URL: https://issues.apache.org/jira/browse/SPARK-23943
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Mesos
>Affects Versions: 2.2.1, 2.3.0
> Environment:  
>  
>Reporter: paul mackles
>Priority: Minor
> Fix For: 2.4.0
>
>
> Add a more robust health-check to MesosRestServer so that anyone who runs 
> MesosClusterDispatcher as a marathon app can use it to check the health of 
> the server:
> [http://mesosphere.github.io/marathon/docs/health-checks.html]
> Specifically, this check verifies that the MesosSchedulerDriver is still 
> running as we have seen certain cases where it dies (rather quietly) and the 
> only way to revive it is a restart. With this health check, marathon will 
> restart the dispatcher if the MesosSchedulerDriver stops running. 
> The health check lives at the url "/health" and returns a 204 when the server 
> is healthy and a 503 when it is not (e.g. the MesosSchedulerDriver stopped 
> running).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23943) Improve observability of MesosRestServer/MesosClusterDispatcher

2018-04-09 Thread paul mackles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

paul mackles updated SPARK-23943:
-
Summary: Improve observability of MesosRestServer/MesosClusterDispatcher  
(was: Improve observability of MesosRestServer)

> Improve observability of MesosRestServer/MesosClusterDispatcher
> ---
>
> Key: SPARK-23943
> URL: https://issues.apache.org/jira/browse/SPARK-23943
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Mesos
>Affects Versions: 2.2.1, 2.3.0
> Environment:  
>  
>Reporter: paul mackles
>Priority: Minor
> Fix For: 2.4.0
>
>
> Add a more robust health-check to MesosRestServer so that anyone who runs 
> MesosClusterDispatcher as a marathon app can use it to check the health of 
> the server:
> [http://mesosphere.github.io/marathon/docs/health-checks.html]
> Specifically, this check verifies that the MesosSchedulerDriver is still 
> running as we have seen certain cases where it dies (rather quietly) and the 
> only way to revive it is a restart. With this health check, marathon will 
> restart the dispatcher if the MesosSchedulerDriver stops running. 
> The health check lives at the url "/health" and returns a 204 when the server 
> is healthy and a 503 when it is not (e.g. the MesosSchedulerDriver stopped 
> running).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23900) format_number udf should take user specified format as argument

2018-04-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23900:


Assignee: Apache Spark

> format_number udf should take user specified format as argument
> --
>
> Key: SPARK-23900
> URL: https://issues.apache.org/jira/browse/SPARK-23900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-5370
> {noformat}
> Currently, format_number udf formats the number to #,###,###.##, but it 
> should also take a user specified format as optional input.
> {noformat}
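
As a sketch of how the optional format argument could behave, using java.text.DecimalFormat patterns; the helper and the exact pattern dialect Spark would accept are assumptions, while the default pattern is the one quoted above:
{code:scala}
import java.text.DecimalFormat

// Hypothetical helper: the default pattern is today's behavior,
// the second argument is the proposed user-specified format.
def formatNumber(x: Double, pattern: String = "#,###,###.##"): String =
  new DecimalFormat(pattern).format(x)

formatNumber(1234567.891)               // == "1,234,567.89"
formatNumber(1234567.891, "##.####")    // == "1234567.891"
formatNumber(0.5, "00.00%")             // == "50.00%"  (assumes any DecimalFormat pattern is allowed)
{code}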



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23900) format_number udf should take user specified format as argument

2018-04-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430620#comment-16430620
 ] 

Apache Spark commented on SPARK-23900:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/21010

> format_number udf should take user specified format as argument
> --
>
> Key: SPARK-23900
> URL: https://issues.apache.org/jira/browse/SPARK-23900
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> https://issues.apache.org/jira/browse/HIVE-5370
> {noformat}
> Currently, format_number udf formats the number to #,###,###.##, but it 
> should also take a user specified format as optional input.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


