[jira] [Assigned] (SPARK-24386) implement continuous processing coalesce(1)

2018-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24386:


Assignee: Apache Spark

> implement continuous processing coalesce(1)
> ---
>
> Key: SPARK-24386
> URL: https://issues.apache.org/jira/browse/SPARK-24386
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
>
> [~marmbrus] suggested this as a good implementation checkpoint. If we do the 
> shuffle reader and writer correctly, it should be easy to make a custom 
> coalesce(1) execution for continuous processing using them, without having to 
> implement the logic for shuffle writers finding out where shuffle readers are 
> located. (The coalesce(1) can just get the RpcEndpointRef directly from the 
> reader and pass it to the writers.)
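
For context, a hedged usage-level sketch of what this feature would enable once implemented (assumes a SparkSession named spark; the rate source and console sink are placeholders, not part of the ticket):
{code:java}
// Continuous-trigger query whose single output partition comes from coalesce(1).
// This is the user-facing shape the custom coalesce(1) execution described above would serve.
import org.apache.spark.sql.streaming.Trigger

val query = spark.readStream
  .format("rate")                          // toy source emitting (timestamp, value) rows
  .load()
  .coalesce(1)                             // collapse to a single partition
  .writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second")) // continuous processing, 1s checkpoint interval
  .start()
{code}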






[jira] [Commented] (SPARK-24386) implement continuous processing coalesce(1)

2018-06-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511994#comment-16511994
 ] 

Apache Spark commented on SPARK-24386:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/21560

> implement continuous processing coalesce(1)
> ---
>
> Key: SPARK-24386
> URL: https://issues.apache.org/jira/browse/SPARK-24386
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> [~marmbrus] suggested this as a good implementation checkpoint. If we do the 
> shuffle reader and writer correctly, it should be easy to make a custom 
> coalesce(1) execution for continuous processing using them, without having to 
> implement the logic for shuffle writers finding out where shuffle readers are 
> located. (The coalesce(1) can just get the RpcEndpointRef directly from the 
> reader and pass it to the writers.)






[jira] [Assigned] (SPARK-24386) implement continuous processing coalesce(1)

2018-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24386:


Assignee: (was: Apache Spark)

> implement continuous processing coalesce(1)
> ---
>
> Key: SPARK-24386
> URL: https://issues.apache.org/jira/browse/SPARK-24386
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> [~marmbrus] suggested this as a good implementation checkpoint. If we do the 
> shuffle reader and writer correctly, it should be easy to make a custom 
> coalesce(1) execution for continuous processing using them, without having to 
> implement the logic for shuffle writers finding out where shuffle readers are 
> located. (The coalesce(1) can just get the RpcEndpointRef directly from the 
> reader and pass it to the writers.)






[jira] [Comment Edited] (SPARK-5152) Let metrics.properties file take an hdfs:// path

2018-06-13 Thread John Zhuge (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511724#comment-16511724
 ] 

John Zhuge edited comment on SPARK-5152 at 6/14/18 3:21 AM:


SPARK-7169 alleviated this issue; however, I still find the approach 
*spark.metrics.conf=s3://bucket/spark-metrics/graphite.properties* a little 
simpler and more convenient. Compared to *spark.metrics.conf.** in SparkConf, a 
metrics config file can group the properties together, separate from the rest 
of the properties. In my case, there are 10 properties to configure. It is also 
easy to swap out the config file for different users or different purposes, 
especially in self-serve environments. I wish spark-submit could accept 
multiple '--properties-file' options.

Pretty simple change. Let me know whether I can post a PR.
{code:java}
- case Some(f) => new FileInputStream(f)
+ case Some(f) =>
+   val hadoopPath = new Path(Utils.resolveURI(f))
+   Utils.getHadoopFileSystem(hadoopPath.toUri, new 
Configuration()).open(hadoopPath)
{code}
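
For reference, a self-contained sketch of the same idea (assuming hadoop-client is on the classpath; the method name is illustrative, not part of Spark):
{code:java}
import java.io.InputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Open a metrics config from any Hadoop-supported filesystem (hdfs://, s3://, file://)
// instead of only the local filesystem that FileInputStream can reach.
def openMetricsConfig(uri: String): InputStream = {
  val path = new Path(uri)                                  // e.g. "s3://bucket/spark-metrics/graphite.properties"
  val fs = FileSystem.get(path.toUri, new Configuration())  // FS implementation chosen from the URI scheme
  fs.open(path)
}
{code}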
 


was (Author: jzhuge):
SPARK-7169 alleviated this issue, however, still find this approach 
*spark.metrics.conf=s3://bucket/spark-metrics/graphite.properties* a little 
more convenient and clean. Compared to *spark.metrics.conf.** in SparkConf, a 
metrics config file groups the properties together, separate from the rest of 
the Spark properties. In my case, there are 10 properties. It is easy to swap 
out the config file by different users or for different purposes, especially in 
a self-serving environment. I wish spark-submit can accept multiple 
'--properties-file' options.

The downside is this will add one more dependency on hadoop-client in 
spark-core, besides history server.

Pretty simple change. Let me know whether I can post an PR.
{code:java}
- case Some(f) => new FileInputStream(f)
+ case Some(f) =>
+   val hadoopPath = new Path(Utils.resolveURI(f))
+   Utils.getHadoopFileSystem(hadoopPath.toUri, new 
Configuration()).open(hadoopPath)
{code}
 

> Let metrics.properties file take an hdfs:// path
> 
>
> Key: SPARK-5152
> URL: https://issues.apache.org/jira/browse/SPARK-5152
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Priority: Major
>
> From my reading of [the 
> code|https://github.com/apache/spark/blob/06dc4b5206a578065ebbb6bb8d54246ca007397f/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L53],
>  the {{spark.metrics.conf}} property must be a path that is resolvable on the 
> local filesystem of each executor.
> Running a Spark job with {{--conf 
> spark.metrics.conf=hdfs://host1.domain.com/path/metrics.properties}} logs 
> many errors (~1 per executor, presumably?) like:
> {code}
> 15/01/08 13:20:57 ERROR metrics.MetricsConfig: Error loading configure file
> java.io.FileNotFoundException: hdfs:/host1.domain.com/path/metrics.properties 
> (No such file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.<init>(FileInputStream.java:146)
> at java.io.FileInputStream.<init>(FileInputStream.java:101)
> at 
> org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:53)
> at 
> org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:92)
> at 
> org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:218)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:329)
> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:181)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:131)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:60)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> {code}
> which seems consistent with the idea that it's looking on the local 
> filesystem and not parsing the "scheme" portion of the URL.
> Letting all executors get their {{metrics.properties}} files from one 
> 

[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2018-06-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511924#comment-16511924
 ] 

Liang-Chi Hsieh commented on SPARK-24528:
-

Btw, I think the complete and reproducible examples should be: 

{code}
spark.sql("set spark.sql.shuffle.partitions=3")
val N = 1000
spark.range(N).selectExpr("id as key", "id % 2 as t1", "id % 3 as 
t2").repartition(col("key")).write.mode("overwrite").bucketBy(3, 
"key").sortBy("key", "t1").saveAsTable("a1")
spark.sql("select max(struct(t1, *)) from a1 group by key").explain

== Physical Plan ==
SortAggregate(key=[key#156L], functions=[max(named_struct(t1, t1#157L, key, 
key#156L, t1, t1#157L, t2, t2#158L))])
+- SortAggregate(key=[key#156L], functions=[partial_max(named_struct(t1, 
t1#157L, key, key#156L, t1, t1#157L, t2, t2#158L))])
   +- *(1) FileScan parquet default.a1[key#156L,t1#157L,t2#158L] Batched: true, 
Format: Parquet, Location: 
InMemoryFileIndex[file:/root/repos/spark-1/spark-warehouse/a1], 
PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct<key:bigint,t1:bigint,t2:bigint>, SelectedBucketsCount: 3 out of 3

{code}

{code}
spark.sql("set spark.sql.shuffle.partitions=2")
val N = 1000
spark.range(N).selectExpr("id as key", "id % 2 as t1", "id % 3 as 
t2").repartition(col("key")).write.mode("overwrite").bucketBy(3, 
"key").sortBy("key", "t1").saveAsTable("a1")
spark.sql("select max(struct(t1, *)) from a1 group by key").explain

== Physical Plan ==
SortAggregate(key=[key#126L], functions=[max(named_struct(t1, t1#127L, key, 
key#126L, t1, t1#127L, t2, t2#128L))])
+- SortAggregate(key=[key#126L], functions=[partial_max(named_struct(t1, 
t1#127L, key, key#126L, t1, t1#127L, t2, t2#128L))])
   +- *(1) Sort [key#126L ASC NULLS FIRST], false, 0
  +- *(1) FileScan parquet default.a1[key#126L,t1#127L,t2#128L] Batched: 
true, Format: Parquet, Location: 
InMemoryFileIndex[file:/root/repos/spark-1/spark-warehouse/a1], 
PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct<key:bigint,t1:bigint,t2:bigint>, SelectedBucketsCount: 3 out of 3
{code}

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to SPARK-24410, we're trying to optimize a very common use 
> case we have: getting the most recently updated row by id from a fact table.
> We save the table bucketed to skip the shuffle stage, but we still waste time 
> on the Sort operator even though the data is already sorted.
> Here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> And here's a bad, but more realistic, example:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to 
> the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort 
> columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
> files.map(_.getPath.getName).groupBy(file => 
> BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p 

[jira] [Resolved] (SPARK-24546) InsertIntoDataSourceCommand make dataframe with wrong schema

2018-06-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24546.
--
  Resolution: Won't Fix
Target Version/s:   (was: 2.3.1)

> InsertIntoDataSourceCommand make dataframe with wrong schema
> 
>
> Key: SPARK-24546
> URL: https://issues.apache.org/jira/browse/SPARK-24546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0, 2.3.1, 3.0.0
>Reporter: yangz
>Priority: Major
>
> I have an HDFS table with the schema 
> {code:java}
> hdfs_table(a int, b int, c int){code}
>  
> and a kudu table with schema
>  
> {code:java}
> kudu_table(b int primary key, a int,c int)
> {code}
>  
> When I run a SQL insert, the data ends up in the wrong columns.
> {code:java}
> insert into kudu_table from hdfs_table
> {code}
> The cause of the problem is that the data frame is created with the wrong schema.
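
A hedged illustration of the position-versus-name mismatch (using the tables above; not taken from the ticket): INSERT INTO ... SELECT matches columns by position, so one way to sidestep the mismatch is to reorder the source columns to the target schema explicitly:
{code:java}
// kudu_table is (b, a, c), so select the hdfs_table columns in that order.
spark.sql("INSERT INTO kudu_table SELECT b, a, c FROM hdfs_table")
{code}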






[jira] [Updated] (SPARK-24540) Support for multiple delimiter in Spark CSV read

2018-06-13 Thread Ashwin K (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin K updated SPARK-24540:
-
Description: 
Currently, the delimiter option used by Spark 2.0 to read and split CSV files/data 
only supports a single-character delimiter. If we try to provide a delimiter of more 
than one character, we observe the following error message.

e.g.: Dataset<Row> df = spark.read().option("inferSchema", "true")
                                    .option("header", "false")
                                    .option("delimiter", ", ")
                                    .csv("C:\test.txt");

Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot 
be more than one character: , 

at 
org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at scala.Option.orElse(Option.scala:289)
 at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
 at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)

 

Generally, the data to be processed uses delimiters of more than one character, and 
presently we need to do a manual data clean-up on the source/input file, which doesn't 
work well in large applications that consume numerous files.

There is a workaround of reading the data as text and splitting it manually (a sketch 
follows below), but in my opinion this defeats the purpose, advantage and efficiency 
of a direct read from a CSV file.
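
A minimal sketch of that text-and-split workaround (file path and column names are assumptions, not from this report):
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-char-delimiter").getOrCreate()
import spark.implicits._

// Read each line as a plain string, split on the two-character delimiter ", ",
// and assume exactly three fields per line.
val df = spark.read.textFile("C:/test.txt")
  .map { line => val a = line.split(", ", -1); (a(0), a(1), a(2)) }
  .toDF("c0", "c1", "c2")
{code}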

 

  was:
Currently, the delimiter option Spark 2.0 to read and split CSV files/data only 
support a single character delimiter. If we try to provide multiple delimiters, 
we observer the following error message.

eg: Dataset df = spark.read().option("inferSchema", "true")
                                                          .option("header", 
"false").option("delimiter", ", ")
                                                          .csv("C:test.txt");

Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot 
be more than one character: , 

at 
org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39)
 at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at scala.Option.orElse(Option.scala:289)
 at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
 at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)

 

Generally, the data to be processed contains multiple delimiters and presently 
we need to do a manual data clean up on the source/input file, which doesn't 
work well in large applications which consumes numerous files.

There seems to be work-around like reading data as text and using the split 
option, but this in my opinion defeats the purpose, advantage and efficiency of 
a direct read from CSV file.

 


> Support for multiple delimiter in Spark CSV read
> 
>
> Key: SPARK-24540
> URL: https://issues.apache.org/jira/browse/SPARK-24540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Ashwin K
>Priority: Major
>
> Currently, the delimiter option Spark 2.0 to read and split CSV files/data 
> only support a single character delimiter. If we try to provide multiple 

[jira] [Commented] (SPARK-22239) User-defined window functions with pandas udf

2018-06-13 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511876#comment-16511876
 ] 

Hyukjin Kwon commented on SPARK-22239:
--

Sure, please go ahead.

> User-defined window functions with pandas udf
> -
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Window function is another place we can benefit from vectored udf and add 
> another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)
> @pandas_udf(DoubleType())
> def ema(v1):
>     return v1.ewm(alpha=0.5).mean().iloc[-1]
> df.withColumn('v1_ema', ema(df.v1).over(w))
> {code}






[jira] [Updated] (SPARK-24540) Support for multiple delimiter in Spark CSV read

2018-06-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24540:
-
Component/s: (was: Spark Core)
 SQL

> Support for multiple delimiter in Spark CSV read
> 
>
> Key: SPARK-24540
> URL: https://issues.apache.org/jira/browse/SPARK-24540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Ashwin K
>Priority: Major
>
> Currently, the delimiter option used by Spark 2.0 to read and split CSV files/data 
> only supports a single-character delimiter. If we try to provide a delimiter of 
> more than one character, we observe the following error message.
> eg: Dataset df = spark.read().option("inferSchema", "true")
>                                                           .option("header", 
> "false").option("delimiter", ", ")
>                                                           .csv("C:test.txt");
> Exception in thread "main" java.lang.IllegalArgumentException: Delimiter 
> cannot be more than one character: , 
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at scala.Option.orElse(Option.scala:289)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
>  
> Generally, the data to be processed uses delimiters of more than one character, and 
> presently we need to do a manual data clean-up on the source/input file, 
> which doesn't work well in large applications that consume numerous files.
> There is a workaround of reading the data as text and splitting it manually, 
> but in my opinion this defeats the purpose, advantage and efficiency 
> of a direct read from a CSV file.
>  






[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator

2018-06-13 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511852#comment-16511852
 ] 

Joseph K. Bradley commented on SPARK-24467:
---

True, we would have to make VectorAssembler inherit from Model. Since 
VectorAssembler has public constructors and is not a final class, that would 
technically be a breaking change, so this might not work :( ...at least until 
Spark 3.0. It seems like a benign enough breaking change to put in a major 
Spark release.

> VectorAssemblerEstimator
> 
>
> Key: SPARK-24467
> URL: https://issues.apache.org/jira/browse/SPARK-24467
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended adding 
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since 
> I thought the latter option would break most workflows.  However, I should 
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
> inputCols.  This Param can be optional.  If not given, then VectorAssembler 
> will behave as it does now.  If given, then VectorAssembler can use that info 
> instead of figuring out the Vector sizes via metadata or examining Rows in 
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows.  Migrating to 
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it 
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other 
> things in the future which require vector length metadata, so we could 
> consider keeping it rather than deprecating it.
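
To make the second bullet concrete, a rough sketch (illustrative names only, not a proposed implementation) of how an estimator could learn the vector sizes from the data before configuring the assembler:
{code:java}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Look at one row and record how many slots each input column contributes.
def inferVectorSizes(df: DataFrame, inputCols: Seq[String]): Map[String, Int] = {
  val first = df.select(inputCols.map(col): _*).head()
  inputCols.zipWithIndex.map { case (c, i) =>
    first.get(i) match {
      case v: Vector => c -> v.size  // vector column: use its length
      case _         => c -> 1       // numeric column: one slot
    }
  }.toMap
}
{code}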






[jira] [Commented] (SPARK-15882) Discuss distributed linear algebra in spark.ml package

2018-06-13 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511845#comment-16511845
 ] 

Joseph K. Bradley commented on SPARK-15882:
---

I'm afraid I don't have time to prioritize this work now.  It'd be awesome if 
others could pick it up, but I'm having to step back a bit from this work at 
this time.

> Discuss distributed linear algebra in spark.ml package
> --
>
> Key: SPARK-15882
> URL: https://issues.apache.org/jira/browse/SPARK-15882
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* 
> should be migrated to org.apache.spark.ml.
> Initial questions:
> * Should we use Datasets or RDDs underneath?
> * If Datasets, are there missing features needed for the migration?
> * Do we want to redesign any aspects of the distributed matrices during this 
> move?






[jira] [Updated] (SPARK-24549) 32BitDecimalType and 64BitDecimalType support push down to the data sources

2018-06-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24549:

Issue Type: Improvement  (was: New Feature)

> 32BitDecimalType and 64BitDecimalType support push down to the data sources
> ---
>
> Key: SPARK-24549
> URL: https://issues.apache.org/jira/browse/SPARK-24549
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Updated] (SPARK-24538) ByteArrayDecimalType support push down to the data sources

2018-06-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24538:

Issue Type: Improvement  (was: New Feature)

> ByteArrayDecimalType support push down to the data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> The latest Parquet supports decimal type statistics, so we can push filters down 
> to the data sources:
> {noformat}
> LM-SHC-16502798:parquet-mr yumwang$ java -jar 
> ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
> /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> file:         
> file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> creator:      parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:        org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}
> file schema:  spark_schema
> 
> id:           REQUIRED INT64 R:0 D:0
> d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:  RC:241867 TS:15480513 OFFSET:4
> 
> id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
> SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
> 241866, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
> SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, 
> max: 241866.00, num_nulls: 0]
> row group 2:  RC:241867 TS:15480513 OFFSET:8584904
> 
> id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
> SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
> max: 483733, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
> SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
> 241867.00, max: 483733.00, num_nulls: 
> 0]{noformat}
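
As a hedged illustration (the query and path are assumptions based on the dump above, and spark is an assumed SparkSession), this is the kind of filter those per-row-group decimal statistics would let Parquet skip data for once pushdown is supported:
{code:java}
import org.apache.spark.sql.functions.col

// With min/max stats on d1, a pushed-down predicate like this can skip entire row groups.
spark.read.parquet("/tmp/spark/parquet/decimal")
  .filter(col("d1") > 400000)
  .explain()
{code}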




[jira] [Commented] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed

2018-06-13 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511804#comment-16511804
 ] 

Erik Erlandson commented on SPARK-24534:


I think this has potential use for customization beyond the openshift 
downstream. It allows derived images to leverage the apache spark base images 
in contexts outside of directly running the driver and executor processes.

> Add a way to bypass entrypoint.sh script if no spark cmd is passed
> --
>
> Key: SPARK-24534
> URL: https://issues.apache.org/jira/browse/SPARK-24534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Minor
>
> As an improvement to the entrypoint.sh script, I'd like to propose that the Spark 
> entrypoint do a passthrough if driver/executor/init is not the command 
> passed. Currently it raises an error.
> To be more specific, I'm talking about these lines:
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]
> This allows the openshift-spark image to continue to function as a Spark 
> Standalone component, with custom configuration support etc., without 
> compromising the previous method of configuring the cluster inside a Kubernetes 
> environment.






[jira] [Commented] (SPARK-23997) Configurable max number of buckets

2018-06-13 Thread Fernando Pereira (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511800#comment-16511800
 ] 

Fernando Pereira commented on SPARK-23997:
--

cc [~cloud_fan] [~tejasp]

This is a pretty straightforward patch that has been sitting for 2 months... Could 
you please check?

Thanks

> Configurable max number of buckets
> --
>
> Key: SPARK-23997
> URL: https://issues.apache.org/jira/browse/SPARK-23997
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Fernando Pereira
>Priority: Major
>
> When exporting data as a table the user can choose to split data in buckets 
> by choosing the columns and the number of buckets. Currently there is a 
> hard-coded limit of 99'999 buckets.
> However, for heavy workloads this limit might be too restrictive, a situation 
> that will eventually become more common as workloads grow.
> As per the comments in SPARK-19618 this limit could be made configurable.






[jira] [Updated] (SPARK-23732) Broken link to scala source code in Spark Scala api Scaladoc

2018-06-13 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23732:
---
Fix Version/s: 2.2.2
   2.1.3

> Broken link to scala source code in Spark Scala api Scaladoc
> 
>
> Key: SPARK-23732
> URL: https://issues.apache.org/jira/browse/SPARK-23732
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation, Project Infra
>Affects Versions: 2.3.0, 2.3.1
> Environment: {code:java}
> ~/spark/docs$ cat /etc/*release*
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=16.04
> DISTRIB_CODENAME=xenial
> DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
> NAME="Ubuntu"
> VERSION="16.04.4 LTS (Xenial Xerus)"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 16.04.4 LTS"
> VERSION_ID="16.04"
> HOME_URL="http://www.ubuntu.com/"
> SUPPORT_URL="http://help.ubuntu.com/"
> BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
> VERSION_CODENAME=xenial
> UBUNTU_CODENAME=xenial
> {code}
> Using spark packaged sbt.
> Other versions:
> {code:java}
> ~/spark/docs$ ruby -v 
> ruby 2.3.1p112 (2016-04-26) [x86_64-linux-gnu] 
> ~/spark/docs$ gem -v 
> 2.5.2.1 
> ~/spark/docs$ jekyll -v 
> jekyll 3.7.3  
> ~/spark/docs$ java -version 
> java version "1.8.0_112" Java(TM) SE Runtime Environment (build 
> 1.8.0_112-b15) Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed 
> mode)
> {code}
>Reporter: Yogesh Tewari
>Assignee: Marcelo Vanzin
>Priority: Trivial
>  Labels: build, documentation, scaladocs
> Fix For: 2.1.3, 2.2.2, 2.3.2, 2.4.0
>
>
> The Scala source code link in the Spark API Scaladoc is broken.
> It turns out that instead of the relative path to the Scala files, the 
> "€{FILE_PATH}.scala" expression in 
> [https://github.com/apache/spark/blob/master/project/SparkBuild.scala] is 
> generating an absolute path from the developer's computer. In this case, if I 
> try to access the source link on 
> [https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulable],
>  it tries to take me to 
> [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/Accumulable.scala]
> where the "/Users/sameera/dev/spark" portion of the URL comes from the 
> developer's macOS home folder.
> There seems to be no change in the code responsible for generating this path 
> during the build in /project/SparkBuild.scala :
> Line # 252:
> {code:java}
> scalacOptions in Compile ++= Seq(
> s"-target:jvm-${scalacJVMVersion.value}",
> "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath // Required 
> for relative source links in scaladoc
> ),
> {code}
> Line # 726
> {code:java}
> // Use GitHub repository for Scaladoc source links
> unidocSourceBase := s"https://github.com/apache/spark/tree/v${version.value}",
> scalacOptions in (ScalaUnidoc, unidoc) ++= Seq(
> "-groups", // Group similar methods together based on the @group annotation.
> "-skip-packages", "org.apache.hadoop"
> ) ++ (
> // Add links to sources when generating Scaladoc for a non-snapshot release
> if (!isSnapshot.value) {
> Opts.doc.sourceUrl(unidocSourceBase.value + "€{FILE_PATH}.scala")
> } else {
> Seq()
> }
> ){code}
>  
> It seems more like a developer's dev environment issue.
> I was successfully able to reproduce this in my dev environment. Environment 
> details attached. 
>  






[jira] [Updated] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3723:
-
Shepherd:   (was: Joseph K. Bradley)

> DecisionTree, RandomForest: Add more instrumentation
> 
>
> Key: SPARK-3723
> URL: https://issues.apache.org/jira/browse/SPARK-3723
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Some simple instrumentation would help advanced users understand performance, 
> and to check whether parameters (such as maxMemoryInMB) need to be tuned.
> Most important instrumentation (simple):
> * min, avg, max nodes per group
> * number of groups (passes over data)
> More advanced instrumentation:
> * For each tree (or averaged over trees), training set accuracy after 
> training each level.  This would be useful for visualizing learning behavior 
> (to convince oneself that model selection was being done correctly).






[jira] [Updated] (SPARK-3727) Trees and ensembles: More prediction functionality

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3727:
-
Shepherd:   (was: Joseph K. Bradley)

> Trees and ensembles: More prediction functionality
> --
>
> Key: SPARK-3727
> URL: https://issues.apache.org/jira/browse/SPARK-3727
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Major
>
> DecisionTree and RandomForest currently predict the most likely label for 
> classification and the mean for regression.  Other info about predictions 
> would be useful.
> For classification: estimated probability of each possible label
> For regression: variance of estimate
> RandomForest could also create aggregate predictions in multiple ways:
> * Predict mean or median value for regression.
> * Compute variance of estimates (across all trees) for both classification 
> and regression.






[jira] [Updated] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5362:
-
Shepherd:   (was: Joseph K. Bradley)

> Gradient and Optimizer to support generic output (instead of label) and data 
> batches
> 
>
> Key: SPARK-5362
> URL: https://issues.apache.org/jira/browse/SPARK-5362
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, the Gradient and Optimizer interfaces support data in the form of 
> RDD[(Double, Vector)], which refers to a label and features. This limits their 
> application to classification problems. For example, an artificial neural 
> network demands a Vector as output (instead of label: Double). Moreover, the 
> current interface does not support data batches. I propose to replace label: 
> Double with output: Vector. This enables passing a generic output instead of a 
> label, and also passing data and output batches stored in the corresponding 
> vectors.
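
A hedged, illustrative-only sketch of the proposed shape (the trait name is invented here, not part of MLlib):
{code:java}
import org.apache.spark.mllib.linalg.Vector

// Gradient computed against a generic output vector rather than a Double label.
trait VectorGradient extends Serializable {
  /** Returns (gradient, loss) for one example with features `data` and target `output`. */
  def compute(data: Vector, output: Vector, weights: Vector): (Vector, Double)
}
{code}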






[jira] [Updated] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5556:
-
Shepherd:   (was: Joseph K. Bradley)

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
>Priority: Major
> Attachments: LDA_test.xlsx, spark-summit.pptx
>
>







[jira] [Updated] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4591:
-
Shepherd:   (was: Joseph K. Bradley)

> Algorithm/model parity for spark.ml (Scala)
> ---
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve critical 
> feature parity for the next release.
> h3. Instructions for 3 subtask types
> *Review tasks*: detailed review of a subpackage to identify feature gaps 
> between spark.mllib and spark.ml.
> * Should be listed as a subtask of this umbrella.
> * Review subtasks cover major algorithm groups.  To pick up a review subtask, 
> please:
> ** Comment that you are working on it.
> ** Compare the public APIs of spark.ml vs. spark.mllib.
> ** Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> ** Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> *Critical tasks*: higher priority missing features which are required for 
> this umbrella JIRA.
> * Should be linked as "requires" links.
> *Other tasks*: lower priority missing features which can be completed after 
> the critical tasks.
> * Should be linked as "contains" links.
> h4. Excluded items
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * Moving linalg to spark.ml: [SPARK-13944]
> * Streaming ML: Requires stabilizing some internal APIs of structured 
> streaming first
> h3. TODO list
> *Critical issues*
> * [SPARK-14501]: Frequent Pattern Mining
> * [SPARK-14709]: linear SVM
> * [SPARK-15784]: Power Iteration Clustering (PIC)
> *Lower priority issues*
> * Missing methods within algorithms (see Issue Links below)
> * evaluation submodule
> * stat submodule (should probably be covered in DataFrames)
> * Developer-facing submodules:
> ** optimization (including [SPARK-17136])
> ** random, rdd
> ** util
> *To be prioritized*
> * single-instance prediction: [SPARK-10413]
> * pmml [SPARK-11171]






[jira] [Updated] (SPARK-8799) OneVsRestModel should extend ClassificationModel

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8799:
-
Shepherd:   (was: Joseph K. Bradley)

> OneVsRestModel should extend ClassificationModel
> 
>
> Key: SPARK-8799
> URL: https://issues.apache.org/jira/browse/SPARK-8799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>
> Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`. 
> For example:
>  * `accColName` can be used to populate `ClassificationModel#predictRaw` and 
> share implementations of `transform`
>  * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol` which could be 
> gotten for free through subclassing
> `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`) 
> because the labels for a `OneVsRest` will always be discrete and finite.






[jira] [Updated] (SPARK-9120) Add multivariate regression (or prediction) interface

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9120:
-
Shepherd:   (was: Joseph K. Bradley)

> Add multivariate regression (or prediction) interface
> -
>
> Key: SPARK-9120
> URL: https://issues.apache.org/jira/browse/SPARK-9120
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
>Priority: Major
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
> single variable with a method "predict:Double" by extending the Predictor. 
> There is a need for multivariate prediction, at least for regression. I 
> propose to modify "RegressionModel" interface similarly to how it is done in 
> "ClassificationModel", which supports multiclass classification. It has 
> "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" 
> should have something like "predictMultivariate:Vector".
> Update: After reading the design docs, adding "predictMultivariate" to 
> RegressionModel does not seem reasonable to me anymore. The issue is as 
> follows. RegressionModel has "predict:Double". Its "train" method uses 
> "predict:Double" for prediction, i.e. PredictionModel (and RegressionModel) 
> is hard-coded to have only one output. A similar problem exists in MLlib 
> (https://issues.apache.org/jira/browse/SPARK-5362). 
> A possible solution for this problem might require redesigning the class 
> hierarchy or adding a separate interface that extends Model, though the 
> latter means code duplication.
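
For illustration only (these names are not Spark APIs), the shape of the originally proposed interface before the update above walked it back:
{code:java}
import org.apache.spark.ml.linalg.Vector

trait MultivariateRegressionModel {
  def predict(features: Vector): Double              // existing single-output prediction
  def predictMultivariate(features: Vector): Vector  // proposed vector-valued prediction
}
{code}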






[jira] [Updated] (SPARK-8767) Abstractions for InputColParam, OutputColParam

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8767:
-
Shepherd:   (was: Joseph K. Bradley)

> Abstractions for InputColParam, OutputColParam
> --
>
> Key: SPARK-8767
> URL: https://issues.apache.org/jira/browse/SPARK-8767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I'd like to create Param subclasses for output and input columns.  These will 
> provide easier schema checking, which could even be done automatically in an 
> abstraction rather than in each class.  That should simplify things for 
> developers.
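
A rough sketch of what such a subclass might look like (the class and method names are assumptions, not a committed design):
{code:java}
import org.apache.spark.ml.param.{Param, Params}
import org.apache.spark.sql.types.{DataType, StructType}

// A column Param that knows the data type it expects, so schema checks can live in one place.
class InputColParam(parent: Params, name: String, doc: String, expectedType: DataType)
  extends Param[String](parent, name, doc) {

  /** Fails fast if the column is missing or has the wrong type. */
  def check(schema: StructType, colName: String): Unit = {
    require(schema.fieldNames.contains(colName), s"Input column '$colName' not found")
    require(schema(colName).dataType == expectedType,
      s"Column '$colName' must be of type $expectedType but is ${schema(colName).dataType}")
  }
}
{code}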






[jira] [Updated] (SPARK-8799) OneVsRestModel should extend ClassificationModel

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8799:
-
Target Version/s:   (was: 3.0.0)

> OneVsRestModel should extend ClassificationModel
> 
>
> Key: SPARK-8799
> URL: https://issues.apache.org/jira/browse/SPARK-8799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>
> Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`. 
> For example:
>  * `accColName` can be used to populate `ClassificationModel#predictRaw` and 
> share implementations of `transform`
>  * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol` which could be 
> gotten for free through subclassing
> `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`) 
> because the labels for a `OneVsRest` will always be discrete and finite.






[jira] [Updated] (SPARK-7424) spark.ml classification, regression abstractions should add metadata to output column

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7424:
-
Shepherd:   (was: Joseph K. Bradley)

> spark.ml classification, regression abstractions should add metadata to 
> output column
> -
>
> Key: SPARK-7424
> URL: https://issues.apache.org/jira/browse/SPARK-7424
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Major
>
> Update ClassificationModel, ProbabilisticClassificationModel prediction to 
> include numClasses in output column metadata.
> Update RegressionModel to specify output column metadata as well.
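
A small sketch of the kind of change described (assuming a classifier's output DataFrame df and a known numClasses; not the actual patch):
{code:java}
import org.apache.spark.ml.attribute.NominalAttribute

// Attach numClasses to the prediction column as ML attribute metadata.
val numClasses = 3  // assumed to be known by the model
val predictionMeta = NominalAttribute.defaultAttr
  .withName("prediction")
  .withNumValues(numClasses)
  .toMetadata()
val withMeta = df.withColumn("prediction", df("prediction").as("prediction", predictionMeta))
{code}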






[jira] [Updated] (SPARK-21166) Automated ML persistence

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21166:
--
Shepherd:   (was: Joseph K. Bradley)

> Automated ML persistence
> 
>
> Key: SPARK-21166
> URL: https://issues.apache.org/jira/browse/SPARK-21166
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This JIRA is for discussing the possibility of automating ML persistence.  
> Currently, custom save/load methods are written for every Model.  However, we 
> could design a mixin which provides automated persistence, inspecting model 
> data and Params and reading/writing (known types) automatically.  This was 
> brought up in discussions with developers behind 
> https://github.com/azure/mmlspark
> Some issues we will need to consider:
> * Providing generic mixin usable in most or all cases
> * Handling corner cases (strange Param types, etc.)
> * Backwards compatibility (loading models saved by old Spark versions)
> Because of backwards compatibility in particular, it may make sense to 
> implement testing for that first, before we try to address automated 
> persistence: [SPARK-15573]
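
As a minimal sketch of the kind of generic Param inspection such a mixin would rely on (illustrative only; real persistence would also need type-aware serialization and versioning):
{code:java}
import org.apache.spark.ml.param.Params

// Walk a model's Params (defaults plus explicitly set values) and dump them as name -> value strings.
def paramsToMap(model: Params): Map[String, String] =
  model.extractParamMap().toSeq.map(p => p.param.name -> p.value.toString).toMap
{code}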






[jira] [Updated] (SPARK-14585) Provide accessor methods for Pipeline stages

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14585:
--
Shepherd:   (was: Joseph K. Bradley)

> Provide accessor methods for Pipeline stages
> 
>
> Key: SPARK-14585
> URL: https://issues.apache.org/jira/browse/SPARK-14585
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> It is currently hard to access particular stages in a Pipeline or 
> PipelineModel.  Some accessor methods would help.
> Scala:
> {code}
> class Pipeline {
>   /** Returns stage at index i in Pipeline */
>   def getStage[T <: PipelineStage](i: Int): T
>   /** Returns all stages of this type */
>   def getStagesOfType[T <: PipelineStage]: Array[T]
> }
> class PipelineModel {
>   /** Returns stage at index i in PipelineModel */
>   def getStage[T <: Transformer](i: Int): T
>   /**
>* Returns stage given its parent or generating instance in PipelineModel.
>* E.g., if this PipelineModel was created from a Pipeline containing a 
> stage 
>* {{myStage}}, then passing {{myStage}} to this method will return the
>* corresponding stage in this PipelineModel.
>*/
>   def getStage[T <: Transformer][implicit E <: PipelineStage](stage: E): T
>   /** Returns all stages of this type */
>   def getStagesOfType[T <: Transformer]: Array[T]
> }
> {code}
> These methods should not be recursive for now.  I.e., if a Pipeline A 
> contains another Pipeline B, then calling {{getStage}} on the outer Pipeline 
> A should not search for the stage within Pipeline B.






[jira] [Updated] (SPARK-19591) Add sample weights to decision trees

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19591:
--
Shepherd:   (was: Joseph K. Bradley)

> Add sample weights to decision trees
> 
>
> Key: SPARK-19591
> URL: https://issues.apache.org/jira/browse/SPARK-19591
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Major
>
> Add sample weights to decision trees.  See [SPARK-9478] for details on the 
> design.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9140) Replace TimeTracker by Stopwatch

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-9140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9140:
-
Shepherd:   (was: Joseph K. Bradley)

> Replace TimeTracker by Stopwatch
> 
>
> Key: SPARK-9140
> URL: https://issues.apache.org/jira/browse/SPARK-9140
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> We can replace TimeTracker in the tree implementations with Stopwatch. The initial 
> PR could use local stopwatches only.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15573) Backwards-compatible persistence for spark.ml

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15573:
--
Shepherd:   (was: Joseph K. Bradley)

> Backwards-compatible persistence for spark.ml
> -
>
> Key: SPARK-15573
> URL: https://issues.apache.org/jira/browse/SPARK-15573
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This JIRA is for imposing backwards-compatible persistence for the 
> DataFrames-based API for MLlib.  I.e., we want to be able to load models 
> saved in previous versions of Spark.  We will not require loading models 
> saved in later versions of Spark.
> This requires:
> * Putting unit tests in place to check loading models from previous versions
> * Notifying all committers active on MLlib to be aware of this requirement in 
> the future
> The unit tests could be written as in spark.mllib, where we essentially 
> copied and pasted the save() code every time it changed.  This happens 
> rarely, so it should be acceptable, though other designs are fine.
> Subtasks of this JIRA should cover checking and adding tests for existing 
> cases, such as KMeansModel (whose format changed between 1.6 and 2.0).
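
As a sketch of what such a test could look like (the fixture path is hypothetical; the idea is to check in small model directories produced by older releases and load them with current code):

{code}
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.sql.SparkSession

// Sketch only: load a model directory that was produced by an older Spark
// release and checked into the test resources, then assert basic sanity.
object LoadOldKMeansModelCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LoadOldKMeansModelCheck").getOrCreate()
    // Hypothetical fixture written by Spark 1.6 (path is an assumption).
    val model = KMeansModel.load("src/test/resources/ml-models/kmeans-1.6.0")
    assert(model.clusterCenters.nonEmpty, "loaded model should have cluster centers")
    spark.stop()
  }
}
{code}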



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19498:
--
Shepherd:   (was: Joseph K. Bradley)

> Discussion: Making MLlib APIs extensible for 3rd party libraries
> 
>
> Key: SPARK-19498
> URL: https://issues.apache.org/jira/browse/SPARK-19498
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Per the recent discussion on the dev list, this JIRA is for discussing how we 
> can make MLlib DataFrame-based APIs more extensible, especially for the 
> purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs 
> (for custom Transformers, Estimators, etc.).
> * For people who have written such libraries, what issues have you run into?
> * What APIs are not public or extensible enough?  Do they require changes 
> before being made more public?
> * Are APIs for non-Scala languages such as Java and Python friendly or 
> extensive enough?
> The easy answer is to make everything public, but that would of course be 
> terrible in the long term.  Let's discuss what is needed and how we can present 
> stable, sufficient, and easy-to-use APIs for 3rd-party developers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24359:
--
Shepherd: Xiangrui Meng  (was: Joseph K. Bradley)

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> The SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering API trade-offs, we will choose based on the following 
> list of priorities. The API choice that addresses a higher-priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* The ultimate users of this package are R users, 
> and they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates 
> nicely to a natural pipeline style with help from the popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 
> > lr{code}
> h2. Namespace
> All new API will be under a new CRAN package, named SparkML. The package 
> should be usable without needing SparkR in the namespace. The package will 
> introduce a number of S4 classes that inherit from four basic classes. Here 
> we will list the basic types with a few 

[jira] [Updated] (SPARK-24097) Instruments improvements - RandomForest and GradientBoostedTree

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24097:
--
Shepherd:   (was: Joseph K. Bradley)

> Instruments improvements - RandomForest and GradientBoostedTree
> ---
>
> Key: SPARK-24097
> URL: https://issues.apache.org/jira/browse/SPARK-24097
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> Instruments improvements - RandomForest and GradientBoostedTree



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21926:
--
Shepherd:   (was: Joseph K. Bradley)

> Compatibility between ML Transformers and Structured Streaming
> --
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> More details here SPARK-22346.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> -b) Allow user to set the cardinality of OneHotEncoder.-
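
For context, the pattern these cases break is scoring a streaming DataFrame with an already-fitted PipelineModel, along these lines (the paths and schema below are assumptions for illustration):

{code}
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StructType}

val spark = SparkSession.builder().appName("StreamingScoring").getOrCreate()

// Schema of the incoming JSON records (assumed for illustration).
val schema = new StructType()
  .add("feature1", DoubleType)
  .add("feature2", DoubleType)

// Pipeline fitted offline on a static DataFrame (path is hypothetical).
val model = PipelineModel.load("/models/my_pipeline")

val input = spark.readStream.schema(schema).json("/data/incoming")

// transform() should work on streaming DataFrames as long as every stage acts
// as a pure Transformer at prediction time; the failing cases above violate that.
val scored = model.transform(input)

scored.writeStream.format("console").start().awaitTermination()
{code}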



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24212) PrefixSpan in spark.ml: user guide section

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24212:
--
Shepherd:   (was: Joseph K. Bradley)

> PrefixSpan in spark.ml: user guide section
> --
>
> Key: SPARK-24212
> URL: https://issues.apache.org/jira/browse/SPARK-24212
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> See linked JIRA for the PrefixSpan API for which we need to write a user 
> guide page.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14376) spark.ml parity for trees

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14376.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

> spark.ml parity for trees
> -
>
> Key: SPARK-14376
> URL: https://issues.apache.org/jira/browse/SPARK-14376
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0
>
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10817) ML abstraction umbrella

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-10817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-10817:
-

Assignee: (was: Joseph K. Bradley)

> ML abstraction umbrella
> ---
>
> Key: SPARK-10817
> URL: https://issues.apache.org/jira/browse/SPARK-10817
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella for discussing and creating ML abstractions.  This was 
> originally handled under [SPARK-1856] and [SPARK-3702], under which we 
> created the Pipelines API and some Developer APIs for classification and 
> regression.
> This umbrella is for future work, including:
> * Stabilizing the classification and regression APIs
> * Discussing traits vs. abstract classes for abstraction APIs
> * Creating other abstractions not yet covered (clustering, multilabel 
> prediction, etc.)
> Note that [SPARK-3702] still has useful discussion and design docs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5572) LDA improvement listing

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-5572:


Assignee: (was: Joseph K. Bradley)

> LDA improvement listing
> ---
>
> Key: SPARK-5572
> URL: https://issues.apache.org/jira/browse/SPARK-5572
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella JIRA for listing planned improvements to Latent Dirichlet 
> Allocation (LDA).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4285) Transpose RDD[Vector] to column store for ML

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-4285:


Assignee: (was: Joseph K. Bradley)

> Transpose RDD[Vector] to column store for ML
> 
>
> Key: SPARK-4285
> URL: https://issues.apache.org/jira/browse/SPARK-4285
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> For certain ML algorithms, a column store is more efficient than a row store 
> (which is currently used everywhere).  E.g., deep decision trees can be 
> faster to train when partitioning by features.
> Proposal: Provide a method with the following API (probably in util/):
> ```
> def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
> ```
> The input Vectors will be data rows/instances, and the output Vectors will be 
> columns/features paired with column/feature indices.
> **Question**: Is it important to maintain matrix structure?  That is, should 
> output Vectors in the same partition be adjacent columns in the matrix?
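
For illustration, a naive version of the proposed method could look like the sketch below. This is not the intended implementation; it ignores partitioning and locality, which are the interesting parts of the task.

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Naive sketch: explode each row into (columnIndex, (rowIndex, value)) entries,
// group by column, and rebuild each column as a dense Vector.
def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)] = {
  val numRows = data.count().toInt
  data.zipWithIndex().flatMap { case (row, rowIdx) =>
    row.toArray.zipWithIndex.map { case (value, colIdx) =>
      (colIdx, (rowIdx.toInt, value))
    }
  }.groupByKey().mapValues { entries =>
    val col = new Array[Double](numRows)
    entries.foreach { case (rowIdx, value) => col(rowIdx) = value }
    Vectors.dense(col)
  }
}
{code}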



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5572) LDA improvement listing

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5572:
-
Shepherd:   (was: Joseph K. Bradley)

> LDA improvement listing
> ---
>
> Key: SPARK-5572
> URL: https://issues.apache.org/jira/browse/SPARK-5572
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella JIRA for listing planned improvements to Latent Dirichlet 
> Allocation (LDA).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7206) Gaussian Mixture Model (GMM) improvements

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-7206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-7206:


Assignee: (was: Joseph K. Bradley)

> Gaussian Mixture Model (GMM) improvements
> -
>
> Key: SPARK-7206
> URL: https://issues.apache.org/jira/browse/SPARK-7206
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella JIRA for listing improvements for GMMs:
> * planned improvements
> * optional/experimental work
> * tests for verifying scalability



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14376) spark.ml parity for trees

2018-06-13 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511751#comment-16511751
 ] 

Joseph K. Bradley commented on SPARK-14376:
---

Thanks!  I'll close it.

> spark.ml parity for trees
> -
>
> Key: SPARK-14376
> URL: https://issues.apache.org/jira/browse/SPARK-14376
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0
>
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14604) Modify design of ML model summaries

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-14604:
-

Assignee: (was: Joseph K. Bradley)

> Modify design of ML model summaries
> ---
>
> Key: SPARK-14604
> URL: https://issues.apache.org/jira/browse/SPARK-14604
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Several spark.ml models now have summaries containing evaluation metrics and 
> training info:
> * LinearRegressionModel
> * LogisticRegressionModel
> * GeneralizedLinearRegressionModel
> These summaries have unfortunately been added in an inconsistent way.  I 
> propose to reorganize them to have:
> * For each model, 1 summary (without training info) and 1 training summary 
> (with info from training).  The non-training summary can be produced for a 
> new dataset via {{evaluate}}.
> * A summary should not store the model itself as a public field.
> * A summary should provide a transient reference to the dataset used to 
> produce the summary.
> This task will involve reorganizing the GLM summary (which lacks a 
> training/non-training distinction) and deprecating the model method in the 
> LinearRegressionSummary.
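
To make the proposal concrete, the reorganized hierarchy might look roughly like this (names are placeholders, not a committed API):

{code}
import org.apache.spark.sql.DataFrame

// Placeholder names only: evaluation-only summary vs. training summary.
trait ModelSummary {
  /** Predictions the metrics were computed on (kept as a transient reference). */
  def predictions: DataFrame
  def rootMeanSquaredError: Double
}

trait ModelTrainingSummary extends ModelSummary {
  /** Information that is only available right after fitting. */
  def objectiveHistory: Array[Double]
  def totalIterations: Int
}

// A model would then expose:
//   def summary: ModelTrainingSummary               // from the last fit
//   def evaluate(dataset: DataFrame): ModelSummary  // for any new dataset
{code}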



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19609) Broadcast joins should pushdown join constraints as Filter to the larger relation

2018-06-13 Thread David McLennan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511746#comment-16511746
 ] 

David McLennan commented on SPARK-19609:


This feature would be extremely useful in making external data lookups more 
efficient.  For example, you have a stream of data coming in with a window of 
10,000 messages.  You need to join each message with reference data on external 
services to enrich it (for example accounts and products).  Today, you would 
either have to pull the entire external data sources into the executors 
(expensive on all sides; even the small datasets are many tens of gigabytes), 
or look up the external datasets key by key on a per-message basis, which is 
very chatty from a communication perspective.  If this feature were implemented, 
it could reduce the amount of data transfer significantly when the cardinality 
of the join keys is low (e.g., you might have 10,000 messages, but they 
reference only 15 unique accounts and 50 unique products).  It would also 
relieve authors of the burden of implementing something like this themselves; 
they could just register the DataFrames, run a SQL context on top of them, and go.

> Broadcast joins should pushdown join constraints as Filter to the larger 
> relation
> -
>
> Key: SPARK-19609
> URL: https://issues.apache.org/jira/browse/SPARK-19609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>Priority: Major
>
> For broadcast inner-joins, where the smaller relation is known to be small 
> enough to materialize on a worker, the set of values for all join columns is 
> known and fits in memory. Spark should translate these values into a 
> {{Filter}} pushed down to the datasource. The common join condition of 
> equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} clause. 
> An example of pushing such filters is already present in the form of 
> {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks.
> This optimization could even work when the smaller relation does not fit 
> entirely in memory. This could be done by partitioning the smaller relation 
> into N pieces, applying this predicate pushdown for each piece, and unioning 
> the results.
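
The optimization can be emulated by hand today, which also illustrates the intended effect. This is a sketch only, not an optimizer rule; {{joinWithKeyPushdown}} is a made-up helper.

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// Sketch: collect the join keys of the small, broadcastable side and push them
// into the scan of the large relation as an IN filter before joining.
def joinWithKeyPushdown(large: DataFrame, small: DataFrame, key: String): DataFrame = {
  // The small side is known to fit in memory, so collecting its keys is cheap.
  val keys = small.select(key).distinct().collect().map(_.get(0))
  // isin() becomes an In filter that many datasources can push down.
  val prunedLarge = large.filter(large(key).isin(keys: _*))
  prunedLarge.join(broadcast(small), key)
}
{code}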



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22666) Spark datasource for image format

2018-06-13 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511745#comment-16511745
 ] 

Joseph K. Bradley commented on SPARK-22666:
---

Side note: the Java library we use for reading images does not support as many 
image types as PIL, which we used in the image loader in 
https://github.com/databricks/spark-deep-learning. This should not block this 
task, but it would be worth poking around to see if there are alternative 
libraries we could use from Java.

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a DataFrame that contains images (and, for example, the file 
> names), following the semantics already discussed in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which seems to have caused some 
> issues for some users. If this issue is too complex to handle in this ticket, 
> it can be dealt with separately.
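
A usage sketch of the first interface above, assuming it lands as described (the path is hypothetical and the column names follow the SPARK-21866 schema):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ImageReadExample").getOrCreate()

// Load all images under a directory into a DataFrame with an "image" struct column.
val images = spark.read.format("image").load("/data/images")

images.printSchema()
images.select("image.origin", "image.width", "image.height").show(truncate = false)
{code}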



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-13 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511730#comment-16511730
 ] 

Dongjoon Hyun edited comment on SPARK-24530 at 6/13/18 10:28 PM:
-

Hi, [~mengxr] .

I got the following result locally; it generated correctly.  
!image-2018-06-13-15-15-51-025.png!

However, as you pointed out, I found the following status: some of the released 
docs are broken.

2.1.x

(O) 
[https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
(O) 
[https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
(X) 
[https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]

2.2.x
(O) 
[https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
(X) 
[https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]

2.3.x
(O) 
[https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
(X) 
[https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]


was (Author: dongjoon):
Hi, [~mengxr] .

I got the following locally. It generated correctly. 
!image-2018-06-13-15-15-51-025.png!

However, as you pointed out, I found that the following status. Some latest 
docs are broken.

2.1.x

(O) 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
(O) 
https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
(X) 
https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression

2.2.x
(O) 
https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
(X) 
https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression

2.3.x
(O) 
https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
(X) 
https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshot from the Spark 2.3 docs and master docs generated on my machine. Not 
> sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24530:
--
Attachment: image-2018-06-13-15-15-51-025.png

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshot from the Spark 2.3 docs and master docs generated on my machine. Not 
> sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5152) Let metrics.properties file take an hdfs:// path

2018-06-13 Thread John Zhuge (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511724#comment-16511724
 ] 

John Zhuge commented on SPARK-5152:
---

SPARK-7169 alleviated this issue; however, I still find the approach 
*spark.metrics.conf=s3://bucket/spark-metrics/graphite.properties* a little 
more convenient and clean. Compared to *spark.metrics.conf.** in SparkConf, a 
metrics config file groups the properties together, separate from the rest of 
the Spark properties. In my case, there are 10 properties. It is easy to swap 
out the config file for different users or purposes, especially in a 
self-service environment. I wish spark-submit could accept multiple 
'--properties-file' options.

The downside is that this adds one more dependency on hadoop-client in 
spark-core, besides the history server.

It is a pretty simple change. Let me know whether I can post a PR.
{code:java}
- case Some(f) => new FileInputStream(f)
+ case Some(f) =>
+   val hadoopPath = new Path(Utils.resolveURI(f))
+   Utils.getHadoopFileSystem(hadoopPath.toUri, new Configuration()).open(hadoopPath)
{code}
 

> Let metrics.properties file take an hdfs:// path
> 
>
> Key: SPARK-5152
> URL: https://issues.apache.org/jira/browse/SPARK-5152
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Ryan Williams
>Priority: Major
>
> From my reading of [the 
> code|https://github.com/apache/spark/blob/06dc4b5206a578065ebbb6bb8d54246ca007397f/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L53],
>  the {{spark.metrics.conf}} property must be a path that is resolvable on the 
> local filesystem of each executor.
> Running a Spark job with {{--conf 
> spark.metrics.conf=hdfs://host1.domain.com/path/metrics.properties}} logs 
> many errors (~1 per executor, presumably?) like:
> {code}
> 15/01/08 13:20:57 ERROR metrics.MetricsConfig: Error loading configure file
> java.io.FileNotFoundException: hdfs:/host1.domain.com/path/metrics.properties 
> (No such file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.<init>(FileInputStream.java:146)
> at java.io.FileInputStream.<init>(FileInputStream.java:101)
> at 
> org.apache.spark.metrics.MetricsConfig.initialize(MetricsConfig.scala:53)
> at 
> org.apache.spark.metrics.MetricsSystem.<init>(MetricsSystem.scala:92)
> at 
> org.apache.spark.metrics.MetricsSystem$.createMetricsSystem(MetricsSystem.scala:218)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:329)
> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:181)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:131)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:60)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> {code}
> which seems consistent with the idea that it's looking on the local 
> filesystem and not parsing the "scheme" portion of the URL.
> Letting all executors get their {{metrics.properties}} files from one 
> location on HDFS would be an improvement, right?
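
For what it's worth, the change amounts to resolving the configured path through the Hadoop FileSystem API instead of {{FileInputStream}}. A standalone sketch of that resolution logic:

{code}
import java.util.Properties
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch: load a properties file from any Hadoop-supported filesystem
// (hdfs://, s3://, file://) rather than only the local filesystem.
def loadMetricsProperties(uri: String): Properties = {
  val path = new Path(uri)
  val fs = path.getFileSystem(new Configuration())
  val props = new Properties()
  val in = fs.open(path)
  try props.load(in) finally in.close()
  props
}
{code}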



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-13 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24530:
--
Description: 
I generated python docs from master locally using `make html`. However, the 
generated html doc doesn't render class docs correctly. I attached the 
screenshot from the Spark 2.3 docs and master docs generated on my machine. Not sure 
if this is because of my local setup.

cc: [~dongjoon] Could you help verify?

 

The following is the status of our released docs. Some recent docs seem to be broken.

*2.1.x*

(O) 
[https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
(O) 
[https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
(X) 
[https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]

*2.2.x*
(O) 
[https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
(X) 
[https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]

*2.3.x*
(O) 
[https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
(X) 
[https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]

  was:
I generated python docs from master locally using `make html`. However, the 
generated html doc doesn't render class docs correctly. I attached the 
screenshot from Spark 2.3 docs and master docs generated on my local. Not sure 
if this is because my local setup.

cc: [~dongjoon] Could you help verify?



> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshot from the Spark 2.3 docs and master docs generated on my machine. Not 
> sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-13 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511730#comment-16511730
 ] 

Dongjoon Hyun commented on SPARK-24530:
---

Hi, [~mengxr] .

I got the following result locally; it generated correctly. 
!image-2018-06-13-15-15-51-025.png!

However, as you pointed out, I found the following status: some of the latest 
docs are broken.

2.1.x

(O) 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
(O) 
https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
(X) 
https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression

2.2.x
(O) 
https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
(X) 
https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression

2.3.x
(O) 
https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression
(X) 
https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached the 
> screenshot from the Spark 2.3 docs and master docs generated on my machine. Not 
> sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24531) HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version

2018-06-13 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24531.
-
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 2.4.0

> HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
> -
>
> Key: SPARK-24531
> URL: https://issues.apache.org/jira/browse/SPARK-24531
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.4.0
>
>
> We have many build failures caused by HiveExternalCatalogVersionsSuite 
> failing because Spark 2.2.0 is no longer present in the mirrors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-13 Thread Jiang Xingbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511695#comment-16511695
 ] 

Jiang Xingbo edited comment on SPARK-24552 at 6/13/18 9:47 PM:
---

IIUC stageAttemptId + taskAttemptNumber shall probably define a unique task 
attempt, and it carries enough information to know how many failed attempts you 
had previously.


was (Author: jiangxb1987):
IIUC stageAttemptId + taskAttemptId shall probably define a unique task 
attempt, and it carries enough information to know how many failed attempts you 
had previously.

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Major
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-13 Thread Jiang Xingbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511695#comment-16511695
 ] 

Jiang Xingbo commented on SPARK-24552:
--

IIUC stageAttemptId + taskAttemptId shall probably define a unique task 
attempt, and it carries enough information to know how many failed attempts you 
had previously.
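
For reference, the pieces a writer can combine today look like the sketch below. taskAttemptId() is already unique across the application, while stageAttemptNumber() was only exposed on TaskContext in Spark 2.3, so it is not available in the 2.1.1 line this report is against.

{code}
import org.apache.spark.TaskContext

// Inside a running task: different ways to identify the current attempt.
val ctx = TaskContext.get()
val globallyUniqueId = ctx.taskAttemptId()                              // unique per application
val uniqueWithinStage = (ctx.stageAttemptNumber(), ctx.attemptNumber()) // unique within a stage
val partition = ctx.partitionId()
{code}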

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Major
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-06-13 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23874:
-
Description: 
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 

  was:
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported in as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645

 

 


> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24554) Add MapType Support for Arrow in PySpark

2018-06-13 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511691#comment-16511691
 ] 

Bryan Cutler commented on SPARK-24554:
--

There is still work to be done to add a Map logical type to Arrow C++/Python 
and Java.  Then the Arrow integration tests (linked) can be completed.  At 
that point we can add this to Spark.

> Add MapType Support for Arrow in PySpark
> 
>
> Key: SPARK-24554
> URL: https://issues.apache.org/jira/browse/SPARK-24554
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>
> Add support for MapType in Arrow related classes in Scala/Java and pyarrow 
> functionality in Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24554) Add MapType Support for Arrow in PySpark

2018-06-13 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-24554:


 Summary: Add MapType Support for Arrow in PySpark
 Key: SPARK-24554
 URL: https://issues.apache.org/jira/browse/SPARK-24554
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 2.3.1
Reporter: Bryan Cutler


Add support for MapType in Arrow related classes in Scala/Java and pyarrow 
functionality in Python.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511674#comment-16511674
 ] 

Imran Rashid commented on SPARK-24552:
--

I wouldn't call this a bug in the scheduler, though I agree it's definitely 
confusing and not clearly documented at all.  I think this has always been the 
meaning of the attempt number.  It might be confusing for other cases to change 
its meaning now; e.g., it's useful to know how many failed attempts you had 
within one taskset, and some code out there might be using max(attemptNumber) 
as a proxy for that now.

We should definitely improve the documentation.

[~markhamstra] [~tgraves] [~jiangxb1987] for some more opinions on the 
scheduler side.

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Major
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24525) Provide an option to limit MemorySink memory usage

2018-06-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511655#comment-16511655
 ] 

Apache Spark commented on SPARK-24525:
--

User 'mukulmurthy' has created a pull request for this issue:
https://github.com/apache/spark/pull/21559

> Provide an option to limit MemorySink memory usage
> --
>
> Key: SPARK-24525
> URL: https://issues.apache.org/jira/browse/SPARK-24525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mukul Murthy
>Priority: Major
>
> MemorySink stores stream results in memory and is mostly used for testing and 
> displaying streams, but for large streams, this can OOM the driver. We should 
> add an option to limit the number of rows and the total size of a memory sink 
> and not add any new data once either limit is hit. 
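
For context, a minimal memory-sink query looks like the sketch below; the proposed limit would be an option on the sink (the option name in the comment is hypothetical, not an existing setting).

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MemorySinkExample").getOrCreate()

// The built-in "rate" test source emits (timestamp, value) rows continuously.
val counts = spark.readStream.format("rate").load().groupBy().count()

val query = counts.writeStream
  .format("memory")          // keeps every result row on the driver today
  .queryName("counts")
  .outputMode("complete")
  // .option("maxRows", 10000)  // hypothetical cap along the lines proposed here
  .start()

spark.sql("SELECT * FROM counts").show()
{code}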



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24525) Provide an option to limit MemorySink memory usage

2018-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24525:


Assignee: (was: Apache Spark)

> Provide an option to limit MemorySink memory usage
> --
>
> Key: SPARK-24525
> URL: https://issues.apache.org/jira/browse/SPARK-24525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mukul Murthy
>Priority: Major
>
> MemorySink stores stream results in memory and is mostly used for testing and 
> displaying streams, but for large streams, this can OOM the driver. We should 
> add an option to limit the number of rows and the total size of a memory sink 
> and not add any new data once either limit is hit. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24525) Provide an option to limit MemorySink memory usage

2018-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24525:


Assignee: Apache Spark

> Provide an option to limit MemorySink memory usage
> --
>
> Key: SPARK-24525
> URL: https://issues.apache.org/jira/browse/SPARK-24525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mukul Murthy
>Assignee: Apache Spark
>Priority: Major
>
> MemorySink stores stream results in memory and is mostly used for testing and 
> displaying streams, but for large streams, this can OOM the driver. We should 
> add an option to limit the number of rows and the total size of a memory sink 
> and not add any new data once either limit is hit. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24553) Job UI redirect causing http 302 error

2018-06-13 Thread Steven Kallman (JIRA)
Steven Kallman created SPARK-24553:
--

 Summary: Job UI redirect causing http 302 error
 Key: SPARK-24553
 URL: https://issues.apache.org/jira/browse/SPARK-24553
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.1, 2.3.0, 2.2.1
Reporter: Steven Kallman


When on the Spark UI (port 4040) Jobs or Stages tab, the href links for the 
individual jobs or stages are missing a '/' before the '?id'. This causes a 
redirect to the address with a '/', which breaks the use of a reverse proxy.

localhost:4040/jobs/job?id=2 --> localhost:4040/jobs/job/?id=2

localhost:4040/stages/stage?id=3=0 --> 
localhost:4040/stages/stage/?id=3=0

I will submit a pull request with the proposed fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24415) Stage page aggregated executor metrics wrong when failures

2018-06-13 Thread Ankur Gupta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511618#comment-16511618
 ] 

Ankur Gupta commented on SPARK-24415:
-

I am planning to work on this JIRA

> Stage page aggregated executor metrics wrong when failures 
> ---
>
> Key: SPARK-24415
> URL: https://issues.apache.org/jira/browse/SPARK-24415
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Critical
> Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png
>
>
> Running with Spark 2.3 on YARN and having task failures and blacklisting, the 
> aggregated metrics by executor are not correct. In my example it should show 
> 2 failed tasks but it only shows one. Note I tested with the master branch to 
> verify it is not fixed.
> I will attach a screenshot.
> To reproduce:
> $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client 
> --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" 
> --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf 
> "spark.blacklist.stage.maxFailedExecutorsPerNode=1"  --conf 
> "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf 
> "spark.blacklist.killBlacklistedExecutors=true"
> import org.apache.spark.SparkEnv 
> sc.parallelize(1 to 1, 10).map { x => if (SparkEnv.get.executorId.toInt 
> >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad 
> executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect()



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24235) create the top-of-task RDD sending rows to the remote buffer

2018-06-13 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-24235:


Assignee: Jose Torres

> create the top-of-task RDD sending rows to the remote buffer
> 
>
> Key: SPARK-24235
> URL: https://issues.apache.org/jira/browse/SPARK-24235
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 2.4.0
>
>
> [https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii]
>  
> Note that after 
> [https://github.com/apache/spark/pull/21239], this will need to be 
> responsible for incrementing its task's EpochTracker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24235) create the top-of-task RDD sending rows to the remote buffer

2018-06-13 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-24235.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21428
[https://github.com/apache/spark/pull/21428]

> create the top-of-task RDD sending rows to the remote buffer
> 
>
> Key: SPARK-24235
> URL: https://issues.apache.org/jira/browse/SPARK-24235
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 2.4.0
>
>
> [https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii]
>  
> Note that after 
> [https://github.com/apache/spark/pull/21239|https://github.com/apache/spark/pull/21239],this
>  will need to be responsible for incrementing its task's EpochTracker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24552:


Assignee: Apache Spark

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is also deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511593#comment-16511593
 ] 

Apache Spark commented on SPARK-24552:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21558

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Major
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is also deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24552:


Assignee: (was: Apache Spark)

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Major
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is also deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-13 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511590#comment-16511590
 ] 

Ryan Blue commented on SPARK-24552:
---

cc [~vanzin], [~henryr], [~cloud_fan]

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Major
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is also deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24552) Task attempt numbers are reused when stages are retried

2018-06-13 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-24552:
--
Summary: Task attempt numbers are reused when stages are retried  (was: 
Task attempt ids are reused when stages are retried)

> Task attempt numbers are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Major
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is also deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24552) Task attempt ids are reused when stages are retried

2018-06-13 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-24552:
--
Description: 
When stages are retried due to shuffle failures, task attempt numbers are 
reused. This causes a correctness bug in the v2 data sources write path.

Data sources (both the original and v2) pass the task attempt to writers so 
that writers can use the attempt number to track and clean up data from failed 
or speculative attempts. In the v2 docs for DataWriterFactory, the attempt 
number's javadoc states that "Implementations can use this attempt number to 
distinguish writers of different task attempts."

When two attempts of a stage use the same (partition, attempt) pair, two tasks 
can create the same data and attempt to commit. The commit coordinator prevents 
both from committing and will abort the attempt that finishes last. When using 
the (partition, attempt) pair to track data, the aborted task may delete data 
associated with the (partition, attempt) pair. If that happens, the data for 
the task that committed is also deleted as well, which is a correctness bug.

For a concrete example, I have a data source that creates files in place named 
with {{part---.}}. Because these files are 
written in place, both tasks create the same file and the one that is aborted 
deletes the file, leading to data corruption when the file is added to the 
table.

  was:
When stages are retried due to shuffle failures, task attempt ids are reused. 
This causes a correctness bug in the v2 data sources write path.

Data sources (both the original and v2) pass the task attempt to writers so 
that writers can use the attempt number to track and clean up data from failed 
or speculative attempts. In the v2 docs for DataWriterFactory, the attempt 
number's javadoc states that "Implementations can use this attempt number to 
distinguish writers of different task attempts."

When two attempts of a stage use the same (partition, attempt) pair, two tasks 
can create the same data and attempt to commit. The commit coordinator prevents 
both from committing and will abort the attempt that finishes last. When using 
the (partition, attempt) pair to track data, the aborted task may delete data 
associated with the (partition, attempt) pair. If that happens, the data for 
the task that committed is also deleted as well, which is a correctness bug.

For a concrete example, I have a data source that creates files in place named 
with {{part---.}}. Because these files are 
written in place, both tasks create the same file and the one that is aborted 
deletes the file, leading to data corruption when the file is added to the 
table.


> Task attempt ids are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Major
>
> When stages are retried due to shuffle failures, task attempt numbers are 
> reused. This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is also deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24552) Task attempt ids are reused when stages are retried

2018-06-13 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-24552:
-

 Summary: Task attempt ids are reused when stages are retried
 Key: SPARK-24552
 URL: https://issues.apache.org/jira/browse/SPARK-24552
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Ryan Blue


When stages are retried due to shuffle failures, task attempt ids are reused. 
This causes a correctness bug in the v2 data sources write path.

Data sources (both the original and v2) pass the task attempt to writers so 
that writers can use the attempt number to track and clean up data from failed 
or speculative attempts. In the v2 docs for DataWriterFactory, the attempt 
number's javadoc states that "Implementations can use this attempt number to 
distinguish writers of different task attempts."

When two attempts of a stage use the same (partition, attempt) pair, two tasks 
can create the same data and attempt to commit. The commit coordinator prevents 
both from committing and will abort the attempt that finishes last. When using 
the (partition, attempt) pair to track data, the aborted task may delete data 
associated with the (partition, attempt) pair. If that happens, the data for 
the task that committed is also deleted as well, which is a correctness bug.

For a concrete example, I have a data source that creates files in place named 
with part---.. Because these files are 
written in place, both tasks create the same file and the one that is aborted 
deletes the file, leading to data corruption when the file is added to the 
table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24552) Task attempt ids are reused when stages are retried

2018-06-13 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-24552:
--
Description: 
When stages are retried due to shuffle failures, task attempt ids are reused. 
This causes a correctness bug in the v2 data sources write path.

Data sources (both the original and v2) pass the task attempt to writers so 
that writers can use the attempt number to track and clean up data from failed 
or speculative attempts. In the v2 docs for DataWriterFactory, the attempt 
number's javadoc states that "Implementations can use this attempt number to 
distinguish writers of different task attempts."

When two attempts of a stage use the same (partition, attempt) pair, two tasks 
can create the same data and attempt to commit. The commit coordinator prevents 
both from committing and will abort the attempt that finishes last. When using 
the (partition, attempt) pair to track data, the aborted task may delete data 
associated with the (partition, attempt) pair. If that happens, the data for 
the task that committed is also deleted as well, which is a correctness bug.

For a concrete example, I have a data source that creates files in place named 
with {{part---.}}. Because these files are 
written in place, both tasks create the same file and the one that is aborted 
deletes the file, leading to data corruption when the file is added to the 
table.

  was:
When stages are retried due to shuffle failures, task attempt ids are reused. 
This causes a correctness bug in the v2 data sources write path.

Data sources (both the original and v2) pass the task attempt to writers so 
that writers can use the attempt number to track and clean up data from failed 
or speculative attempts. In the v2 docs for DataWriterFactory, the attempt 
number's javadoc states that "Implementations can use this attempt number to 
distinguish writers of different task attempts."

When two attempts of a stage use the same (partition, attempt) pair, two tasks 
can create the same data and attempt to commit. The commit coordinator prevents 
both from committing and will abort the attempt that finishes last. When using 
the (partition, attempt) pair to track data, the aborted task may delete data 
associated with the (partition, attempt) pair. If that happens, the data for 
the task that committed is also deleted as well, which is a correctness bug.

For a concrete example, I have a data source that creates files in place named 
with part---.. Because these files are 
written in place, both tasks create the same file and the one that is aborted 
deletes the file, leading to data corruption when the file is added to the 
table.


> Task attempt ids are reused when stages are retried
> ---
>
> Key: SPARK-24552
> URL: https://issues.apache.org/jira/browse/SPARK-24552
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Blue
>Priority: Major
>
> When stages are retried due to shuffle failures, task attempt ids are reused. 
> This causes a correctness bug in the v2 data sources write path.
> Data sources (both the original and v2) pass the task attempt to writers so 
> that writers can use the attempt number to track and clean up data from 
> failed or speculative attempts. In the v2 docs for DataWriterFactory, the 
> attempt number's javadoc states that "Implementations can use this attempt 
> number to distinguish writers of different task attempts."
> When two attempts of a stage use the same (partition, attempt) pair, two 
> tasks can create the same data and attempt to commit. The commit coordinator 
> prevents both from committing and will abort the attempt that finishes last. 
> When using the (partition, attempt) pair to track data, the aborted task may 
> delete data associated with the (partition, attempt) pair. If that happens, 
> the data for the task that committed is also deleted as well, which is a 
> correctness bug.
> For a concrete example, I have a data source that creates files in place 
> named with {{part---.}}. Because these 
> files are written in place, both tasks create the same file and the one that 
> is aborted deletes the file, leading to data corruption when the file is 
> added to the table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2018-06-13 Thread Ohad Raviv (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511578#comment-16511578
 ] 

Ohad Raviv commented on SPARK-24528:


I think the 2nd point better suits my use case. I'll try to look into it.

Thanks.

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to SPARK-24410, we're trying to optimize a very common use 
> case we have of getting the most recently updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we still 
> "waste" time on the Sort operator even though the data is already sorted.
> Here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> and here's a bad example, but more realistic:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to 
> the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort 
> columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
> files.map(_.getPath.getName).groupBy(file => 
> BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 
> 1){code}
> So obviously the code avoids dealing with this situation for now.
> Could you think of a way to solve this or bypass it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2018-06-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511533#comment-16511533
 ] 

Wenchen Fan commented on SPARK-24528:
-

I have 2 ideas:
1. Provide an option to let Spark produce only one file per bucket, if users 
are willing to write slower but read faster. But we need to carefully design 
what we should do for appending data.
2. Implement a special merge-sort operator to sort the files in one bucket 
faster (see the sketch below).

Neither of them is trivial; feel free to write a design doc if you want to 
drive this project.
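
As a rough illustration of idea 2, the read side could k-way merge the already 
locally sorted files of a bucket instead of re-sorting them. A minimal sketch 
over plain iterators; Spark's real operator would work on InternalRow with the 
table's SortOrder, so everything below is illustrative only:

{code}
import scala.collection.mutable

// Merge several locally sorted iterators (one per file of a bucket) into a
// single globally sorted iterator, the core of a "merge-sort operator".
def mergeSorted[T](iters: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] = {
  // Min-heap of (current head, source iterator), ordered by the head element.
  implicit val byHead: Ordering[(T, Iterator[T])] =
    Ordering.by[(T, Iterator[T]), T](_._1).reverse
  val heap = mutable.PriorityQueue.empty[(T, Iterator[T])]
  iters.filter(_.hasNext).foreach(it => heap.enqueue((it.next(), it)))

  new Iterator[T] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): T = {
      val (value, it) = heap.dequeue()
      if (it.hasNext) heap.enqueue((it.next(), it))   // refill from the same file
      value
    }
  }
}

// Three locally sorted "files" of one bucket merge into one sorted stream:
val merged = mergeSorted(Seq(Iterator(1, 4, 7), Iterator(2, 5, 8), Iterator(3, 6, 9)))
println(merged.toList)   // List(1, 2, 3, 4, 5, 6, 7, 8, 9)
{code}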

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to  SPARK-24410, we're trying to optimize a very common use 
> case we have of getting the most updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we're still 
> "waste" time on the Sort operator evethough the data is already sorted.
> here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> and here's a bad example, but more realistic:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to 
> the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort 
> columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
> files.map(_.getPath.getName).groupBy(file => 
> BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 
> 1){code}
> so obviously the code avoids dealing with this situation now..
> could you think of a way to solve this or bypass it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22239) User-defined window functions with pandas udf

2018-06-13 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511495#comment-16511495
 ] 

Li Jin commented on SPARK-22239:


[~hyukjin.kwon] I actually don't think this Jira is done. The PR only resolves 
the unbounded window case. Do you mind if I edit the Jira to reflect what's 
done and create a new Jira for the rolling window case?

> User-defined window functions with pandas udf
> -
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Window functions are another place we can benefit from vectorized udfs and add 
> another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)
> @pandas_udf(DoubleType())
> def ema(v1):
>     return v1.ewm(alpha=0.5).mean().iloc[-1]
> df.withColumn('v1_ema', ema(df.v1).over(w))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2018-06-13 Thread Ohad Raviv (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511483#comment-16511483
 ] 

Ohad Raviv commented on SPARK-24528:


I understand the tradeoff; the question is how we could leverage the local file 
sorting. I'm sure the extra sort adds significant overhead: we still have to 
read all the data into memory and spill, etc.

If we could push the sorting down to DataSourceScanExec - instead of reading 
the files one after another, we could merge-stream them in the right order - 
I'm sure it would be much more effective.

With that I'm trying to imitate HBase and the way it dedupes by key.

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to  SPARK-24410, we're trying to optimize a very common use 
> case we have of getting the most updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we're still 
> "waste" time on the Sort operator evethough the data is already sorted.
> here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> and here's a bad example, but more realistic:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to 
> the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort 
> columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
> files.map(_.getPath.getName).groupBy(file => 
> BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 
> 1){code}
> so obviously the code avoids dealing with this situation now..
> could you think of a way to solve this or bypass it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark

2018-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24439:


Assignee: Apache Spark

> Add distanceMeasure to BisectingKMeans in PySpark
> -
>
> Key: SPARK-24439
> URL: https://issues.apache.org/jira/browse/SPARK-24439
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to 
> BisectingKMeans. We will do the same for PySpark.
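
For reference, a rough sketch of the Scala-side API that the PySpark wrapper 
would mirror; the toy dataset and the SparkSession setup below are assumptions 
for illustration only:

{code}
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("bkm-sketch").getOrCreate()
import spark.implicits._

// A tiny toy dataset; "features" is the default input column name.
val df = Seq(
  Vectors.dense(1.0, 0.1), Vectors.dense(1.1, 0.2),
  Vectors.dense(0.1, 1.0), Vectors.dense(0.2, 1.1)
).map(Tuple1.apply).toDF("features")

val model = new BisectingKMeans()
  .setK(2)
  .setDistanceMeasure("cosine")   // param added on the Scala side by SPARK-23412
  .fit(df)

model.clusterCenters.foreach(println)
{code}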



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark

2018-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24439:


Assignee: (was: Apache Spark)

> Add distanceMeasure to BisectingKMeans in PySpark
> -
>
> Key: SPARK-24439
> URL: https://issues.apache.org/jira/browse/SPARK-24439
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to 
> BisectingKMeans. We will do the same for PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark

2018-06-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511482#comment-16511482
 ] 

Apache Spark commented on SPARK-24439:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21557

> Add distanceMeasure to BisectingKMeans in PySpark
> -
>
> Key: SPARK-24439
> URL: https://issues.apache.org/jira/browse/SPARK-24439
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to 
> BisectingKMeans. We will do the same for PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24439) Add distanceMeasure to BisectingKMeans in PySpark

2018-06-13 Thread Huaxin Gao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-24439:
---
Priority: Minor  (was: Major)

> Add distanceMeasure to BisectingKMeans in PySpark
> -
>
> Key: SPARK-24439
> URL: https://issues.apache.org/jira/browse/SPARK-24439
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-23412 added distanceMeasure to 
> BisectingKMeans. We will do the same for PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2018-06-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511452#comment-16511452
 ] 

Wenchen Fan commented on SPARK-24528:
-

It's a different problem.

Spark makes a tradeoff for bucketed tables: it allows multiple files for one 
bucket to speed up the write side (no shuffle needed during writing). The 
drawback is that we can't guarantee the ordering during reading.

BTW it's not that bad to have an extra sort. Each file is already sorted, so 
the sort is faster than a normal one.

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to  SPARK-24410, we're trying to optimize a very common use 
> case we have of getting the most updated row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we're still 
> "waste" time on the Sort operator evethough the data is already sorted.
> here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> and here's a bad example, but more realistic:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to 
> the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort 
> columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
> files.map(_.getPath.getName).groupBy(file => 
> BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 
> 1){code}
> so obviously the code avoids dealing with this situation now..
> could you think of a way to solve this or bypass it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24539) HistoryServer does not display metrics from tasks that complete after stage failure

2018-06-13 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511365#comment-16511365
 ] 

Marcelo Vanzin commented on SPARK-24539:


From a previous chat with [~tgraves] this sounds like SPARK-24415.

> HistoryServer does not display metrics from tasks that complete after stage 
> failure
> ---
>
> Key: SPARK-24539
> URL: https://issues.apache.org/jira/browse/SPARK-24539
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Imran Rashid
>Priority: Major
>
> I noticed that task metrics for completed tasks with a stage failure do not 
> show up in the new history server.  I have a feeling this is because all of 
> the tasks succeeded *after* the stage had been failed (so they were 
> completions from a "zombie" taskset).  The task metrics (eg. the shuffle read 
> size & shuffle write size) do not show up at all, either in the task table, 
> the executor table, or the overall stage summary metrics.  (they might not 
> show up in the job summary page either, but in the event logs I have, there 
> is another successful stage attempt after this one, and that is the only 
> thing which shows up in the jobs page.)  If you get task details from the api 
> endpoint (eg. 
> http://[host]:[port]/api/v1/applications/[app-id]/stages/[stage-id]/[stage-attempt])
>  then you can see the successful tasks and all the metrics
> Unfortunately the event logs I have are huge and I don't have a small repro 
> handy, but I hope that description is enough to go on.
> I loaded the event logs I have in the SHS from spark 2.2 and they appear fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24548) JavaPairRDD to Dataset in SPARK generates ambiguous results

2018-06-13 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-24548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Gawęda updated SPARK-24548:
--
Comment: was deleted

(was: IMHO names should be distinct, in other cases it's hard to query for 
nested field)

> JavaPairRDD to Dataset in SPARK generates ambiguous results
> 
>
> Key: SPARK-24548
> URL: https://issues.apache.org/jira/browse/SPARK-24548
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.3.0
> Environment: Using Windows 10, on 64bit machine with 16G of ram.
>Reporter: Jackson
>Priority: Major
>
> I have data in the JavaPairRDD below:
> {quote}JavaPairRDD<String, Tuple2<String, String>> MY_RDD;
> {quote}
> I tried using the following code:
> {quote}Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 =
> Encoders.tuple(Encoders.STRING(), 
> Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
> Dataset<Row> newDataSet = 
> spark.createDataset(JavaPairRDD.toRDD(MY_RDD), encoder2).toDF("value1", "value2");
> newDataSet.printSchema();
> {quote}
> {{root}}
> {{ |-- value1: string (nullable = true)}}
> {{ |-- value2: struct (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> But after creating a StackOverflow question 
> (https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark),
> I learned that the values in a tuple should have distinct field names, whereas 
> in this case the same name is generated for both. Because of this I cannot 
> select a specific column under value2.
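
Until the encoder itself generates distinct names, one workaround is to describe 
the nested value with named fields instead of a nested tuple. A minimal Scala 
sketch (the class and field names are made up for illustration; the Java API 
equivalent would use a bean class):

{code}
import org.apache.spark.sql.SparkSession

// Named fields instead of Tuple2, so the struct columns are distinguishable.
case class Inner(first: String, second: String)
case class Record(value1: String, value2: Inner)

val spark = SparkSession.builder().master("local[*]").appName("encoder-sketch").getOrCreate()
import spark.implicits._

val ds = Seq(
  Record("k1", Inner("a", "b")),
  Record("k2", Inner("c", "d"))
).toDS()

ds.printSchema()
// root
//  |-- value1: string (nullable = true)
//  |-- value2: struct (nullable = true)
//  |    |-- first: string (nullable = true)
//  |    |-- second: string (nullable = true)

ds.select("value2.first").show()   // now unambiguous
{code}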



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24500) UnsupportedOperationException when trying to execute Union plan with Stream of children

2018-06-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24500.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> UnsupportedOperationException when trying to execute Union plan with Stream 
> of children
> ---
>
> Key: SPARK-24500
> URL: https://issues.apache.org/jira/browse/SPARK-24500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>Assignee: Herman van Hovell
>Priority: Major
> Fix For: 2.4.0
>
>
> To reproduce:
> {code}
> import org.apache.spark.sql.catalyst.plans.logical._
> def range(i: Int) = Range(1, i, 1, 1)
> val union = Union(Stream(range(3), range(5), range(7)))
> spark.sessionState.planner.plan(union).next().execute()
> {code}
> produces
> {code}
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.execution.PlanLater.doExecute(SparkStrategies.scala:55)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> {code}
> The SparkPlan looks like this:
> {code}
> :- Range (1, 3, step=1, splits=1)
> :- PlanLater Range (1, 5, step=1, splits=Some(1))
> +- PlanLater Range (1, 7, step=1, splits=Some(1))
> {code}
> So not all of it was planned (some PlanLater still in there).
> This appears to be a longstanding issue.
> I traced it to the use of var in TreeNode.
> For example in mapChildren:
> {code}
> case args: Traversable[_] => args.map {
>   case arg: TreeNode[_] if containsChild(arg) =>
> val newChild = f(arg.asInstanceOf[BaseType])
> if (!(newChild fastEquals arg)) {
>   changed = true
> {code}
> If args is a Stream then changed will never be set here, ultimately causing 
> the method to return the original plan.
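
The mechanism can be reproduced outside Spark. A tiny sketch showing that, for a 
scala.collection.immutable.Stream (Scala 2.11/2.12), the side effect inside map 
runs eagerly only for the head and is deferred for the tail until the stream is 
forced, which is why the changed flag stays false for the unplanned children:

{code}
// Illustrative sketch only, not Spark code.
var sideEffects = 0
val mapped = Stream(1, 2, 3).map { x => sideEffects += 1; x * 2 }

println(sideEffects)   // 1: only the head was transformed when map returned
mapped.force           // materialize the rest of the stream
println(sideEffects)   // 3: the remaining elements were mapped only now
{code}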



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24551) Add Integration tests for Secrets

2018-06-13 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-24551:
---

 Summary: Add Integration tests for Secrets
 Key: SPARK-24551
 URL: https://issues.apache.org/jira/browse/SPARK-24551
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.1
Reporter: Stavros Kontopoulos


The current 
[suite|https://github.com/apache/spark/blob/7703b46d2843db99e28110c4c7ccf60934412504/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala]
 needs to be expanded to cover secrets.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24550) Registration of Kubernetes specific metrics

2018-06-13 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24550:

Summary: Registration of Kubernetes specific metrics  (was: Registration of 
K8s specific metrics)

> Registration of Kubernetes specific metrics
> ---
>
> Key: SPARK-24550
> URL: https://issues.apache.org/jira/browse/SPARK-24550
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark by default offers a specific set of metrics for monitoring. It is 
> possible to add platform-specific metrics or enhance the existing ones if it 
> makes sense. Here is an example for 
> [mesos|https://github.com/apache/spark/pull/21516]. Some examples of metrics 
> that could be added are: number of blacklisted nodes, utilization of 
> executors per node, evicted pods and driver restarts, back-off time when 
> requesting executors, etc.
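
As a rough sketch of the shape such metrics could take, here is the Dropwizard 
registry style that Spark's metrics system is built on; the metric names 
(k8s.evictedPods, k8s.driverRestarts) and the idea of wiring them into the 
Kubernetes scheduler backend are assumptions for illustration, not an agreed 
design:

{code}
import java.util.concurrent.atomic.AtomicLong
import com.codahale.metrics.{Gauge, MetricRegistry}

// Illustrative sketch only: a gauge and a counter a Kubernetes scheduler
// backend could expose through Spark's metrics system.
val registry = new MetricRegistry()

val evictedPods = new AtomicLong(0)
registry.register("k8s.evictedPods", new Gauge[Long] {
  override def getValue: Long = evictedPods.get()
})

val driverRestarts = registry.counter("k8s.driverRestarts")

// The backend would update these as it observes cluster events:
evictedPods.incrementAndGet()
driverRestarts.inc()
{code}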



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24550) Add support for Kubernetes specific metrics

2018-06-13 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24550:

Summary: Add support for Kubernetes specific metrics  (was: Registration of 
Kubernetes specific metrics)

> Add support for Kubernetes specific metrics
> ---
>
> Key: SPARK-24550
> URL: https://issues.apache.org/jira/browse/SPARK-24550
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark by default offers a specific set of metrics for monitoring. It is 
> possible to add platform-specific metrics or enhance the existing ones if it 
> makes sense. Here is an example for 
> [mesos|https://github.com/apache/spark/pull/21516]. Some examples of metrics 
> that could be added are: number of blacklisted nodes, utilization of 
> executors per node, evicted pods and driver restarts, and back-off time when 
> requesting executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24550) Registration of K8s specific metrics

2018-06-13 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24550:

Description: Spark by default offers a specific set of metrics for 
monitoring. It is possible to add platform specific metrics or enhance the 
existing ones if it makes sense. Here is an example for 
[mesos|https://github.com/apache/spark/pull/21516]. Some example of metrics 
that could be added are: number of blacklisted nodes, utilization of executors 
per node, evicted pods and driver restarts, back-off time when requesting 
executors etc...  (was: Spark by default offers a specific set of metrics for 
monitoring. It is possible to add platform specific metrics or enhance the 
existing ones if it makes sense. Here is an example for 
[mesos|https://github.com/apache/spark/pull/21516]. Some example of metrics 
that could be added are: number of blacklisted nodes, utilization of executors 
per node, evicted pods and driver restarts etc...)

> Registration of K8s specific metrics
> 
>
> Key: SPARK-24550
> URL: https://issues.apache.org/jira/browse/SPARK-24550
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark by default offers a specific set of metrics for monitoring. It is 
> possible to add platform-specific metrics or enhance the existing ones if it 
> makes sense. Here is an example for 
> [mesos|https://github.com/apache/spark/pull/21516]. Some examples of metrics 
> that could be added are: number of blacklisted nodes, utilization of 
> executors per node, evicted pods and driver restarts, and back-off time when 
> requesting executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24550) Registration of K8s specific metrics

2018-06-13 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24550:

Issue Type: New Feature  (was: Bug)

> Registration of K8s specific metrics
> 
>
> Key: SPARK-24550
> URL: https://issues.apache.org/jira/browse/SPARK-24550
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark by default offers a specific set of metrics for monitoring. It is 
> possible to add platform-specific metrics or enhance the existing ones if it 
> makes sense. Here is an example for 
> [mesos|https://github.com/apache/spark/pull/21516]. Some examples of metrics 
> that could be added are: number of blacklisted nodes, utilization of 
> executors per node, evicted pods and driver restarts, and back-off time when 
> requesting executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24550) Registration of K8s specific metrics

2018-06-13 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24550:

Description: Spark by default offers a specific set of metrics for 
monitoring. It is possible to add platform specific metrics or enhance the 
existing ones if it makes sense. Here is an example for 
[mesos|https://github.com/apache/spark/pull/21516] Some example of metrics that 
could be added are: number of blacklisted nodes, utilization of executors per 
node, evicted pods and driver restarts etc...  (was: Spark by default offers a 
specific set of metrics for monitoring. It is possible to add platform specific 
metrics or enhance the existing ones if it makes sense. Here is an example for 
[mesos|[https://github.com/apache/spark/pull/21516]] Some example of metrics 
that could be added are: number of blacklisted nodes, utilization of executors 
per node, evicted pods and driver restarts etc...)

> Registration of K8s specific metrics
> 
>
> Key: SPARK-24550
> URL: https://issues.apache.org/jira/browse/SPARK-24550
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark by default offers a specific set of metrics for monitoring. It is 
> possible to add platform-specific metrics or enhance the existing ones if it 
> makes sense. Here is an example for 
> [mesos|https://github.com/apache/spark/pull/21516]. Some examples of metrics 
> that could be added are: number of blacklisted nodes, utilization of 
> executors per node, evicted pods, and driver restarts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24550) Registration of K8s specific metrics

2018-06-13 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24550:

Description: Spark by default offers a specific set of metrics for 
monitoring. It is possible to add platform specific metrics or enhance the 
existing ones if it makes sense. Here is an example for 
[mesos|https://github.com/apache/spark/pull/21516]. Some example of metrics 
that could be added are: number of blacklisted nodes, utilization of executors 
per node, evicted pods and driver restarts etc...  (was: Spark by default 
offers a specific set of metrics for monitoring. It is possible to add platform 
specific metrics or enhance the existing ones if it makes sense. Here is an 
example for [mesos|https://github.com/apache/spark/pull/21516] Some example of 
metrics that could be added are: number of blacklisted nodes, utilization of 
executors per node, evicted pods and driver restarts etc...)

> Registration of K8s specific metrics
> 
>
> Key: SPARK-24550
> URL: https://issues.apache.org/jira/browse/SPARK-24550
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark by default offers a specific set of metrics for monitoring. It is 
> possible to add platform-specific metrics or enhance the existing ones if it 
> makes sense. Here is an example for 
> [mesos|https://github.com/apache/spark/pull/21516]. Some examples of metrics 
> that could be added are: number of blacklisted nodes, utilization of 
> executors per node, evicted pods, and driver restarts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24550) Registration of K8s specific metrics

2018-06-13 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24550:

Description: Spark by default offers a specific set of metrics for 
monitoring. It is possible to add platform specific metrics or enhance the 
existing ones if it makes sense. Here is an example for 
[mesos|[https://github.com/apache/spark/pull/21516]] Some example of metrics 
that could be added are: number of blacklisted nodes, utilization of executors 
per node, evicted pods and driver restarts etc...  (was: Spark by default 
offers a specific set of metrics for monitoring. It is possible to add platform 
specific metrics or enhance the existing ones if it makes sense. Here is an 
example for [mesos|[https://github.com/apache/spark/pull/21516].] Some example 
of metrics that could be added are: number of blacklisted nodes, utilization of 
executors per node, evicted pods and driver restarts etc...)

> Registration of K8s specific metrics
> 
>
> Key: SPARK-24550
> URL: https://issues.apache.org/jira/browse/SPARK-24550
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark by default offers a specific set of metrics for monitoring. It is 
> possible to add platform-specific metrics or enhance the existing ones if it 
> makes sense. Here is an example for 
> [mesos|https://github.com/apache/spark/pull/21516]. Some examples of metrics 
> that could be added are: number of blacklisted nodes, utilization of 
> executors per node, evicted pods, and driver restarts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24550) Registration of K8s specific metrics

2018-06-13 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-24550:
---

 Summary: Registration of K8s specific metrics
 Key: SPARK-24550
 URL: https://issues.apache.org/jira/browse/SPARK-24550
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.1
Reporter: Stavros Kontopoulos


Spark by default offers a specific set of metrics for monitoring. It is 
possible to add platform-specific metrics or enhance the existing ones if it 
makes sense. Here is an example for 
[mesos|https://github.com/apache/spark/pull/21516]. Some examples of metrics 
that could be added are: number of blacklisted nodes, utilization of executors 
per node, evicted pods, and driver restarts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24479) Register StreamingQueryListener in Spark Conf

2018-06-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24479:


Assignee: Arun Mahadevan

> Register StreamingQueryListener in Spark Conf 
> --
>
> Key: SPARK-24479
> URL: https://issues.apache.org/jira/browse/SPARK-24479
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Mingjie Tang
>Assignee: Arun Mahadevan
>Priority: Major
>  Labels: feature
> Fix For: 2.4.0
>
>
> Users need to register their own StreamingQueryListener with the 
> StreamingQueryManager; similar functionality already exists via 
> EXTRA_LISTENERS and QUERY_EXECUTION_LISTENERS. 
> We propose adding a STREAMING_QUERY_LISTENER conf so users can register 
> their own listeners. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24479) Register StreamingQueryListener in Spark Conf

2018-06-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24479.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21504
[https://github.com/apache/spark/pull/21504]

> Register StreamingQueryListener in Spark Conf 
> --
>
> Key: SPARK-24479
> URL: https://issues.apache.org/jira/browse/SPARK-24479
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Mingjie Tang
>Assignee: Arun Mahadevan
>Priority: Major
>  Labels: feature
> Fix For: 2.4.0
>
>
> Users need to register their own StreamingQueryListener with the 
> StreamingQueryManager; similar functionality already exists via 
> EXTRA_LISTENERS and QUERY_EXECUTION_LISTENERS. 
> We propose adding a STREAMING_QUERY_LISTENER conf so users can register 
> their own listeners. 
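A sketch of how a listener is registered today (programmatically) and how conf-based registration might look; the conf key is an assumption modeled on the EXTRA_LISTENERS / QUERY_EXECUTION_LISTENERS pattern, and the listener class is hypothetical:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Hypothetical listener used for illustration only.
class LoggingListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"query started: ${event.id}")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"progress: ${event.progress.numInputRows} input rows")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"query terminated: ${event.id}")
}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("listener-registration-sketch")
  // Proposed conf-based registration (key name is an assumption).
  .config("spark.sql.streaming.streamingQueryListeners",
          "com.example.LoggingListener")
  .getOrCreate()

// Existing programmatic registration via StreamingQueryManager.
spark.streams.addListener(new LoggingListener)
{code}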



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24548) JavaPairRDD to Dataset in SPARK generates ambiguous results

2018-06-13 Thread Tomasz Gawęda (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511056#comment-16511056
 ] 

Tomasz Gawęda commented on SPARK-24548:
---

IMHO names should be distinct; otherwise it's hard to query for a nested 
field.

> JavaPairRDD to Dataset in SPARK generates ambiguous results
> 
>
> Key: SPARK-24548
> URL: https://issues.apache.org/jira/browse/SPARK-24548
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.3.0
> Environment: Windows 10, 64-bit machine with 16 GB of RAM.
>Reporter: Jackson
>Priority: Major
>
> I have data in the JavaPairRDD below:
> {quote}JavaPairRDD<String, Tuple2<String, String>> MY_RDD;
> {quote}
> I tried the following code:
> {quote}Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 =
> Encoders.tuple(Encoders.STRING(), 
> Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
> Dataset<Row> newDataSet = 
> spark.createDataset(JavaPairRDD.toRDD(MY_RDD), encoder2).toDF("value1","value2");
> newDataSet.printSchema();
> {quote}
> {{root}}
> {{ |-- value1: string (nullable = true)}}
> {{ |-- value2: struct (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> {{ | |-- value: string (nullable = true)}}
> But after asking a StackOverflow question 
> (https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark),
>  I learned that the values in the tuple should have distinct field names, 
> whereas in this case the same name is generated for both. Because of this I 
> cannot select a specific column under value2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24549) 32BitDecimalType and 64BitDecimalType support push down to the data sources

2018-06-13 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24549:

Comment: was deleted

(was: I'm working on this)

> 32BitDecimalType and 64BitDecimalType support push down to the data sources
> ---
>
> Key: SPARK-24549
> URL: https://issues.apache.org/jira/browse/SPARK-24549
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24549) 32BitDecimalType and 64BitDecimalType support push down to the data sources

2018-06-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24549:


Assignee: (was: Apache Spark)

> 32BitDecimalType and 64BitDecimalType support push down to the data sources
> ---
>
> Key: SPARK-24549
> URL: https://issues.apache.org/jira/browse/SPARK-24549
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24549) 32BitDecimalType and 64BitDecimalType support push down to the data sources

2018-06-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511049#comment-16511049
 ] 

Apache Spark commented on SPARK-24549:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/21556

> 32BitDecimalType and 64BitDecimalType support push down to the data sources
> ---
>
> Key: SPARK-24549
> URL: https://issues.apache.org/jira/browse/SPARK-24549
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


