[jira] [Created] (SPARK-32585) Support scala enumeration in ScalaReflection

2020-08-10 Thread ulysses you (Jira)
ulysses you created SPARK-32585:
---

 Summary: Support scala enumeration in ScalaReflection
 Key: SPARK-32585
 URL: https://issues.apache.org/jira/browse/SPARK-32585
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you
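
No description was provided. As a rough illustration of what this improvement
targets, here is a minimal Scala sketch (all names made up) of the kind of
Dataset usage that currently fails because ScalaReflection has no encoder
mapping for scala.Enumeration values:

{code:scala}
import org.apache.spark.sql.SparkSession

object Color extends Enumeration {
  val Red, Green, Blue = Value
}

case class Fruit(name: String, color: Color.Value)

object EnumEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("enum-sketch").getOrCreate()
    import spark.implicits._

    // Without enumeration support in ScalaReflection, deriving an encoder for
    // Fruit fails because Color.Value has no Catalyst mapping; with the proposed
    // support the enumeration value could be stored, e.g., as a string column.
    val ds = spark.createDataset(Seq(Fruit("apple", Color.Red)))
    ds.show()

    spark.stop()
  }
}
{code}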









[jira] [Commented] (SPARK-32584) Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175237#comment-17175237
 ] 

Apache Spark commented on SPARK-32584:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29402

> Exclude _images and _sources that are generated by Sphinx in Jekyll build
> -
>
> Key: SPARK-32584
> URL: https://issues.apache.org/jira/browse/SPARK-32584
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Attachments: Screen Shot 2020-08-11 at 1.52.46 PM.png
>
>
> For the _images directory: after SPARK-31851, we now add some images for use
> within the pages built by Sphinx. Sphinx copies the images into the _images
> directory. Later, when Jekyll builds, the underscore directories are ignored
> by default, which ends up with missing images in the main doc. I attached a
> screenshot to the JIRA.
> For the _sources directory, see
> https://github.com/sphinx-contrib/sphinx-pretty-searchresults#source-links.






[jira] [Assigned] (SPARK-32584) Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32584:


Assignee: Apache Spark

> Exclude _images and _sources that are generated by Sphinx in Jekyll build
> -
>
> Key: SPARK-32584
> URL: https://issues.apache.org/jira/browse/SPARK-32584
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screen Shot 2020-08-11 at 1.52.46 PM.png
>
>
> For the _images directory: after SPARK-31851, we now add some images for use
> within the pages built by Sphinx. Sphinx copies the images into the _images
> directory. Later, when Jekyll builds, the underscore directories are ignored
> by default, which ends up with missing images in the main doc. I attached a
> screenshot to the JIRA.
> For the _sources directory, see
> https://github.com/sphinx-contrib/sphinx-pretty-searchresults#source-links.






[jira] [Assigned] (SPARK-32584) Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32584:


Assignee: (was: Apache Spark)

> Exclude _images and _sources that are generated by Sphinx in Jekyll build
> -
>
> Key: SPARK-32584
> URL: https://issues.apache.org/jira/browse/SPARK-32584
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Attachments: Screen Shot 2020-08-11 at 1.52.46 PM.png
>
>
> For the _images directory: after SPARK-31851, we now add some images for use
> within the pages built by Sphinx. Sphinx copies the images into the _images
> directory. Later, when Jekyll builds, the underscore directories are ignored
> by default, which ends up with missing images in the main doc. I attached a
> screenshot to the JIRA.
> For the _sources directory, see
> https://github.com/sphinx-contrib/sphinx-pretty-searchresults#source-links.






[jira] [Updated] (SPARK-32584) Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32584:
-
Description: 
For the _images directory: after SPARK-31851, we now add some images for use
within the pages built by Sphinx. Sphinx copies the images into the _images
directory. Later, when Jekyll builds, the underscore directories are ignored by
default, which ends up with missing images in the main doc. I attached a
screenshot to the JIRA.

For the _sources directory, see
https://github.com/sphinx-contrib/sphinx-pretty-searchresults#source-links.

  was:
After SPARK-31851, we now add some images to use within the pages built by
Sphinx. Sphinx copies the images into the _images directory. Later, when Jekyll
builds, the underscore directories are ignored by default, which ends up with
missing images in the main doc.

I attached a screenshot to the JIRA.


> Exclude _images and _sources that are generated by Sphinx in Jekyll build
> -
>
> Key: SPARK-32584
> URL: https://issues.apache.org/jira/browse/SPARK-32584
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Attachments: Screen Shot 2020-08-11 at 1.52.46 PM.png
>
>
> For the _images directory: after SPARK-31851, we now add some images for use
> within the pages built by Sphinx. Sphinx copies the images into the _images
> directory. Later, when Jekyll builds, the underscore directories are ignored
> by default, which ends up with missing images in the main doc. I attached a
> screenshot to the JIRA.
> For the _sources directory, see
> https://github.com/sphinx-contrib/sphinx-pretty-searchresults#source-links.






[jira] [Updated] (SPARK-32584) Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32584:
-
Attachment: Screen Shot 2020-08-11 at 1.52.46 PM.png

> Exclude _images and _sources that are generated by Sphinx in Jekyll build
> -
>
> Key: SPARK-32584
> URL: https://issues.apache.org/jira/browse/SPARK-32584
> Project: Spark
>  Issue Type: Documentation
>  Components: docs
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Attachments: Screen Shot 2020-08-11 at 1.52.46 PM.png
>
>
> After SPARK-31851, we now add some images to use within the pages built by
> Sphinx. Sphinx copies the images into the _images directory. Later, when
> Jekyll builds, the underscore directories are ignored by default, which ends
> up with missing images in the main doc.
> I attached a screenshot to the JIRA.






[jira] [Created] (SPARK-32584) Exclude _images and _sources that are generated by Sphinx in Jekyll build

2020-08-10 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32584:


 Summary: Exclude _images and _sources that are generated by Sphinx 
in Jekyll build
 Key: SPARK-32584
 URL: https://issues.apache.org/jira/browse/SPARK-32584
 Project: Spark
  Issue Type: Documentation
  Components: docs
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


After SPARK-31851, we now add some images to use within the pages built by
Sphinx. Sphinx copies the images into the _images directory. Later, when Jekyll
builds, the underscore directories are ignored by default, which ends up with
missing images in the main doc.

I attached a screenshot to the JIRA.






[jira] [Commented] (SPARK-32580) Issue accessing a column values after 'explode' function

2020-08-10 Thread Ayrat Sadreev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175219#comment-17175219
 ] 

Ayrat Sadreev commented on SPARK-32580:
---

If you select values from one of the arrays, it works

{code:scala}
scala> val df5 = df3.select(df3.col("value.sample"))
df5: org.apache.spark.sql.DataFrame = [sample: decimal(1,1)]

scala> df5.show()
+------+
|sample|
+------+
|   0.1|
+------+


scala> val df6 = df3.select(df3.col("item.item_id"))
df6: org.apache.spark.sql.DataFrame = [item_id: int]

scala> df6.show()
+-------+
|item_id|
+-------+
|      1|
+-------+

{code}

> Issue accessing a column values after 'explode' function
> 
>
> Key: SPARK-32580
> URL: https://issues.apache.org/jira/browse/SPARK-32580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ayrat Sadreev
>Priority: Major
> Attachments: ExplodeTest.java, data.json
>
>
> An exception occurs when trying to flatten double nested arrays
> The schema is
> {code:none}
> root
>  |-- data: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- item_id: string (nullable = true)
>  |    |    |-- timestamp: string (nullable = true)
>  |    |    |-- values: array (nullable = true)
>  |    |    |    |-- element: struct (containsNull = true)
>  |    |    |    |    |-- sample: double (nullable = true)
> {code}
> The target schema is
> {code:none}
> root
>  |-- item_id: string (nullable = true)
>  |-- timestamp: string (nullable = true)
>  |-- sample: double (nullable = true)
> {code}
> The code (in Java)
> {code:java}
> package com.skf.streamer.spark;
> import java.util.concurrent.TimeoutException;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class ExplodeTest {
>public static void main(String[] args) throws TimeoutException {
>   SparkConf conf = new SparkConf()
>  .setAppName("SimpleApp")
>  .set("spark.scheduler.mode", "FAIR")
>  .set("spark.master", "local[1]")
>  .set("spark.sql.streaming.checkpointLocation", "checkpoint");
>   SparkSession spark = SparkSession.builder()
>  .config(conf)
>  .getOrCreate();
>   Dataset<Row> d0 = spark
>  .read()
>  .format("json")
>  .option("multiLine", "true")
>  .schema(getSchema())
>  .load("src/test/resources/explode/data.json");
>   d0.printSchema();
>   d0 = d0.withColumn("item", functions.explode(d0.col("data")));
>   d0 = d0.withColumn("value", functions.explode(d0.col("item.values")));
>   d0.printSchema();
>   d0 = d0.select(
>  d0.col("item.item_id"),
>  d0.col("item.timestamp"),
>  d0.col("value.sample")
>   );
>   d0.printSchema();
>   d0.show(); // Fails
>   spark.stop();
>}
>private static StructType getSchema() {
>   StructField[] level2Fields = {
>  DataTypes.createStructField("sample", DataTypes.DoubleType, false),
>   };
>   StructField[] level1Fields = {
>  DataTypes.createStructField("item_id", DataTypes.StringType, false),
>  DataTypes.createStructField("timestamp", DataTypes.StringType, 
> false),
>  DataTypes.createStructField("values", 
> DataTypes.createArrayType(DataTypes.createStructType(level2Fields)), false)
>   };
>   StructField[] fields = {
>  DataTypes.createStructField("data", 
> DataTypes.createArrayType(DataTypes.createStructType(level1Fields)), false)
>   };
>   return DataTypes.createStructType(fields);
>}
> }
> {code}
> The data file
> {code:json}
> {
>   "data": [
> {
>   "item_id": "item_1",
>   "timestamp": "2020-07-01 12:34:89",
>   "values": [
> {
>   "sample": 1.1
> },
> {
>   "sample": 1.2
> }
>   ]
> },
> {
>   "item_id": "item_2",
>   "timestamp": "2020-07-02 12:34:89",
>   "values": [
> {
>   "sample": 2.2
> }
>   ]
> }
>   ]
> }
> {code}
> Dataset.show() method fails with an exception
> {code:none}
> Caused by: java.lang.RuntimeException: Couldn't find _gen_alias_30#30 in 
> [_gen_alias_28#28,_gen_alias_29#29]
>   at scala.sys.package$.error(package.scala:30)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.$anonfun$applyOrElse$1(BoundAttribute.scala:81)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> 

[jira] [Resolved] (SPARK-32583) PySpark Structured Streaming Testing Support

2020-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32583.
--
Resolution: Invalid

> PySpark Structured Streaming Testing Support
> 
>
> Key: SPARK-32583
> URL: https://issues.apache.org/jira/browse/SPARK-32583
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hello,
> I have investigated a lot but couldn't get any help or resource on how to 
> +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ 
> (ingesting from Kafka topics to S3) and how to build  Continuous Integration 
> (CI)/Continuous Deployment (CD).
> 1. Is it possible to test (unit test, integration test) pyspark structured 
> stream?
> 2. How to build  Continuous Integration (CI)/Continuous Deployment (CD)?






[jira] [Reopened] (SPARK-32583) PySpark Structured Streaming Testing Support

2020-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-32583:
--

> PySpark Structured Streaming Testing Support
> 
>
> Key: SPARK-32583
> URL: https://issues.apache.org/jira/browse/SPARK-32583
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hello,
> I have investigated a lot but couldn't get any help or resource on how to 
> +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ 
> (ingesting from Kafka topics to S3) and how to build  Continuous Integration 
> (CI)/Continuous Deployment (CD).
> 1. Is it possible to test (unit test, integration test) pyspark structured 
> stream?
> 2. How to build  Continuous Integration (CI)/Continuous Deployment (CD)?






[jira] [Commented] (SPARK-32583) PySpark Structured Streaming Testing Support

2020-08-10 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175212#comment-17175212
 ] 

Hyukjin Kwon commented on SPARK-32583:
--

[~FelixKJose], let's interact on the mailing list or Stack Overflow before
filing this as an issue. I am sure you would get a better answer there.
This is definitely not an issue in Spark itself. You can take a look at Spark's
main codebase to see how the tests are written.
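
For reference, here is a minimal Scala sketch of the pattern Spark's own test
suites use: feed rows through MemoryStream (an internal test source) and read
the results back from the memory sink. This is an illustrative sketch, not
official guidance, and a PySpark test would follow the same structure.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

object StreamingTestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("test").getOrCreate()
    import spark.implicits._
    implicit val sqlCtx = spark.sqlContext

    // Feed test rows through an in-memory source instead of Kafka.
    val input = MemoryStream[Int]
    val transformed = input.toDS().map(_ * 2) // the logic under test

    // Collect results with the in-memory sink and assert on them.
    val query = transformed.writeStream
      .format("memory")
      .queryName("result")
      .outputMode("append")
      .start()

    input.addData(1, 2, 3)
    query.processAllAvailable()
    assert(spark.table("result").as[Int].collect().sorted.sameElements(Array(2, 4, 6)))

    query.stop()
    spark.stop()
  }
}
{code}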

> PySpark Structured Streaming Testing Support
> 
>
> Key: SPARK-32583
> URL: https://issues.apache.org/jira/browse/SPARK-32583
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hello,
> I have investigated a lot but couldn't get any help or resource on how to 
> +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ 
> (ingesting from Kafka topics to S3) and how to build  Continuous Integration 
> (CI)/Continuous Deployment (CD).
> 1. Is it possible to test (unit test, integration test) pyspark structured 
> stream?
> 2. How to build  Continuous Integration (CI)/Continuous Deployment (CD)?






[jira] [Commented] (SPARK-32583) PySpark Structured Streaming Testing Support

2020-08-10 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175205#comment-17175205
 ] 

Rohit Mishra commented on SPARK-32583:
--

cc [~hyukjin.kwon]. 

> PySpark Structured Streaming Testing Support
> 
>
> Key: SPARK-32583
> URL: https://issues.apache.org/jira/browse/SPARK-32583
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hello,
> I have investigated a lot but couldn't get any help or resource on how to 
> +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ 
> (ingesting from Kafka topics to S3) and how to build  Continuous Integration 
> (CI)/Continuous Deployment (CD).
> 1. Is it possible to test (unit test, integration test) pyspark structured 
> stream?
> 2. How to build  Continuous Integration (CI)/Continuous Deployment (CD)?






[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175203#comment-17175203
 ] 

Lantao Jin commented on SPARK-32582:


Maybe we could offer a new interface that breaks out after one iteration when
mergeSchema is false. I am not sure.
{code}
  def inferSchema(
  sparkSession: SparkSession,
  options: Map[String, String],
  f: (FileIndex) => Seq[FileStatus]): Option[StructType]
{code}

Do you already have a fix? A PR is welcome.

> Spark SQL Infer Schema Performance
> --
>
> Key: SPARK-32582
> URL: https://issues.apache.org/jira/browse/SPARK-32582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Jarred Li
>Priority: Major
>
> When schema inference is enabled, Spark lists all the files in the table even
> though only one of the files is used to read the schema information.
> Performance suffers because listing all the files in the table is expensive
> when the number of partitions is large.
>  
> See the code in
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]:
> all the files in the table are passed in, but only one file's schema is used
> to infer the schema.
>  






[jira] [Commented] (SPARK-32583) PySpark Structured Streaming Testing Support

2020-08-10 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175202#comment-17175202
 ] 

Felix Kizhakkel Jose commented on SPARK-32583:
--

[~rohitmishr1484] It would have been nice if you could have pointed me to a 
solution.

> PySpark Structured Streaming Testing Support
> 
>
> Key: SPARK-32583
> URL: https://issues.apache.org/jira/browse/SPARK-32583
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hello,
> I have investigated a lot but couldn't get any help or resource on how to 
> +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ 
> (ingesting from Kafka topics to S3) and how to build  Continuous Integration 
> (CI)/Continuous Deployment (CD).
> 1. Is it possible to test (unit test, integration test) pyspark structured 
> stream?
> 2. How to build  Continuous Integration (CI)/Continuous Deployment (CD)?






[jira] [Resolved] (SPARK-32583) PySpark Structured Streaming Testing Support

2020-08-10 Thread Rohit Mishra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohit Mishra resolved SPARK-32583.
--
Resolution: Not A Problem

> PySpark Structured Streaming Testing Support
> 
>
> Key: SPARK-32583
> URL: https://issues.apache.org/jira/browse/SPARK-32583
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hello,
> I have investigated a lot but couldn't get any help or resource on how to 
> +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ 
> (ingesting from Kafka topics to S3) and how to build  Continuous Integration 
> (CI)/Continuous Deployment (CD).
> 1. Is it possible to test (unit test, integration test) pyspark structured 
> stream?
> 2. How to build  Continuous Integration (CI)/Continuous Deployment (CD)?






[jira] [Commented] (SPARK-32583) PySpark Structured Streaming Testing Support

2020-08-10 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175200#comment-17175200
 ] 

Rohit Mishra commented on SPARK-32583:
--

[~FelixKJose], questions are usually answered on either Stack Overflow or the
user mailing list. Please see this link for more
info: [https://spark.apache.org/community.html]. This task will be marked as
resolved here for now, but you can reopen it if you want. Thanks.

> PySpark Structured Streaming Testing Support
> 
>
> Key: SPARK-32583
> URL: https://issues.apache.org/jira/browse/SPARK-32583
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hello,
> I have investigated a lot but couldn't get any help or resource on how to 
> +{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ 
> (ingesting from Kafka topics to S3) and how to build  Continuous Integration 
> (CI)/Continuous Deployment (CD).
> 1. Is it possible to test (unit test, integration test) pyspark structured 
> stream?
> 2. How to build  Continuous Integration (CI)/Continuous Deployment (CD)?






[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175197#comment-17175197
 ] 

Lantao Jin commented on SPARK-32582:


I see. The implementation of the {{inferSchema}} method depends on the
underlying file format. Even for ORC, we still need all the files, since the
given ORC files can have different schemas and we want to get a merged schema.

> Spark SQL Infer Schema Performance
> --
>
> Key: SPARK-32582
> URL: https://issues.apache.org/jira/browse/SPARK-32582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Jarred Li
>Priority: Major
>
> When schema inference is enabled, Spark lists all the files in the table even
> though only one of the files is used to read the schema information.
> Performance suffers because listing all the files in the table is expensive
> when the number of partitions is large.
>  
> See the code in
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]:
> all the files in the table are passed in, but only one file's schema is used
> to infer the schema.
>  






[jira] [Assigned] (SPARK-32216) Remove redundant ProjectExec

2020-08-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32216:
---

Assignee: Allison Wang

> Remove redundant ProjectExec
> 
>
> Key: SPARK-32216
> URL: https://issues.apache.org/jira/browse/SPARK-32216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently Spark executed plan can have redundant `ProjectExec` node. For 
> example: 
> After Filter: 
> {code:java}
> == Physical Plan ==
> *(1) Project [a#14L, b#15L, c#16, key#17] 
> +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
>    +- *(1) ColumnarToRow
>       +- FileScan parquet [a#14L,b#15L,c#16,key#17] {code}
> The `Project [a#14L, b#15L, c#16, key#17]` is redundant because its output is 
> exactly the same as filter's output.
> Before Aggregate:
> {code:java}
> == Physical Plan ==
> *(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], 
> output=[sum_a#39L, key#17, last_b#41L])
> +- Exchange hashpartitioning(key#17, 5), true, [id=#77]
>+- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), 
> partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
>   +- *(1) Project [key#17, a#14L, b#15L]
>  +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
> +- *(1) ColumnarToRow
>+- FileScan parquet [a#14L,b#15L,key#17] {code}
> The `Project [key#17, a#14L, b#15L]` is redundant because hash aggregate 
> doesn't require child plan's output to be in a specific order.
>  
> In general, a project is redundant when
>  # It has the same output attributes and order as its child's output when 
> ordering of these attributes is required.
>  # It has the same output attributes as its child's output when attribute 
> output ordering is not required.
>  
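
To make the two conditions above concrete, here is a hypothetical helper (a
sketch against Spark's physical-plan classes, not the actual rule merged for
this ticket):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Attribute, NamedExpression}
import org.apache.spark.sql.execution.ProjectExec

// A ProjectExec is redundant when it only forwards its child's attributes:
// same attributes and order if ordering matters, same attribute set otherwise.
def isRedundantProject(project: ProjectExec, orderingRequired: Boolean): Boolean = {
  val projected: Seq[NamedExpression] = project.projectList
  val childOutput: Seq[Attribute] = project.child.output
  // Every projected expression must be a plain attribute (no computation or alias).
  val attrsOnly = projected.forall(_.isInstanceOf[Attribute])
  attrsOnly && {
    if (orderingRequired) projected == childOutput
    else projected.toSet == childOutput.toSet
  }
}
{code}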






[jira] [Resolved] (SPARK-32216) Remove redundant ProjectExec

2020-08-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32216.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29031
[https://github.com/apache/spark/pull/29031]

> Remove redundant ProjectExec
> 
>
> Key: SPARK-32216
> URL: https://issues.apache.org/jira/browse/SPARK-32216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Allison Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently Spark executed plan can have redundant `ProjectExec` node. For 
> example: 
> After Filter: 
> {code:java}
> == Physical Plan ==
> *(1) Project [a#14L, b#15L, c#16, key#17] 
> +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
>    +- *(1) ColumnarToRow
>       +- FileScan parquet [a#14L,b#15L,c#16,key#17] {code}
> The `Project [a#14L, b#15L, c#16, key#17]` is redundant because its output is 
> exactly the same as filter's output.
> Before Aggregate:
> {code:java}
> == Physical Plan ==
> *(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], 
> output=[sum_a#39L, key#17, last_b#41L])
> +- Exchange hashpartitioning(key#17, 5), true, [id=#77]
>+- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), 
> partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
>   +- *(1) Project [key#17, a#14L, b#15L]
>  +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
> +- *(1) ColumnarToRow
>+- FileScan parquet [a#14L,b#15L,key#17] {code}
> The `Project [key#17, a#14L, b#15L]` is redundant because hash aggregate 
> doesn't require child plan's output to be in a specific order.
>  
> In general, a project is redundant when
>  # It has the same output attributes and order as its child's output when 
> ordering of these attributes is required.
>  # It has the same output attributes as its child's output when attribute 
> output ordering is not required.
>  






[jira] [Assigned] (SPARK-32400) Test coverage of HiveScripTransformationExec

2020-08-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32400:


Assignee: (was: Apache Spark)

> Test coverage of HiveScripTransformationExec
> 
>
> Key: SPARK-32400
> URL: https://issues.apache.org/jira/browse/SPARK-32400
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>







[jira] [Assigned] (SPARK-32400) Test coverage of HiveScripTransformationExec

2020-08-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32400:


Assignee: Apache Spark

> Test coverage of HiveScripTransformationExec
> 
>
> Key: SPARK-32400
> URL: https://issues.apache.org/jira/browse/SPARK-32400
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-32400) Test coverage of HiveScripTransformationExec

2020-08-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175167#comment-17175167
 ] 

Apache Spark commented on SPARK-32400:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29401

> Test coverage of HiveScripTransformationExec
> 
>
> Key: SPARK-32400
> URL: https://issues.apache.org/jira/browse/SPARK-32400
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>







[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-10 Thread Jarred Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175165#comment-17175165
 ] 

Jarred Li commented on SPARK-32582:
---

The performance issue I mentioned here is not reading a file but "LIST"-ing the
files. For example, if one table has 1000 partitions, the files in those 1000
partitions are listed first, yet only one file is read for schema inference. The
"LIST" operation is time consuming, especially for object stores such as S3.

See the file-listing code:
[https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#300]

> Spark SQL Infer Schema Performance
> --
>
> Key: SPARK-32582
> URL: https://issues.apache.org/jira/browse/SPARK-32582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Jarred Li
>Priority: Major
>
> When schema inference is enabled, Spark lists all the files in the table even
> though only one of the files is used to read the schema information.
> Performance suffers because listing all the files in the table is expensive
> when the number of partitions is large.
>  
> See the code in
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]:
> all the files in the table are passed in, but only one file's schema is used
> to infer the schema.
>  






[jira] [Updated] (SPARK-32543) Remove arrow::as_tibble usage in SparkR

2020-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32543:
-
Fix Version/s: 3.0.1

> Remove arrow::as_tibble usage in SparkR
> ---
>
> Key: SPARK-32543
> URL: https://issues.apache.org/jira/browse/SPARK-32543
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 3.0.1, 3.1.0
>
>
> We increased the minimal version of Arrow R version to 1.0.0 at SPARK-32452, 
> and Arrow R 0.14 dropped {{as_tibble}}. We can remove the usage in SparkR:
> {code}
> pkg/R/DataFrame.R:  # Arrow drops `as_tibble` since 0.14.0, 
> see ARROW-5190.
> pkg/R/DataFrame.R:  if (exists("as_tibble", envir = 
> asNamespace("arrow"))) {
> pkg/R/DataFrame.R:
> as.data.frame(arrow::as_tibble(arrowTable), stringsAsFactors = 
> stringsAsFactors)
> pkg/R/deserialize.R:# Arrow drops `as_tibble` since 0.14.0, see 
> ARROW-5190.
> pkg/R/deserialize.R:useAsTibble <- exists("as_tibble", envir = 
> asNamespace("arrow"))
> pkg/R/deserialize.R:  as_tibble <- get("as_tibble", envir = 
> asNamespace("arrow"))
> pkg/R/deserialize.R:  lapply(batches, function(batch) 
> as.data.frame(as_tibble(batch)))
> {code}






[jira] [Commented] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-10 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175143#comment-17175143
 ] 

Lantao Jin commented on SPARK-32582:


{code}
files.toIterator.map(file => readSchema(file.getPath, conf, 
ignoreCorruptFiles)).collectFirst
{code}
{{collectFirst()}} will break out as soon as its iterator finds a matching value.
{code}
  def collectFirst[B](pf: PartialFunction[A, B]): Option[B] = {
    for (x <- self.toIterator) { // make sure to use an iterator or `seq`
      if (pf isDefinedAt x)
        return Some(pf(x))
    }
    None
  }
{code}
So in most cases, it just reads only one file.
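
To make the short-circuit concrete, here is a minimal, self-contained Scala
sketch (plain Scala, not Spark code; the file names are made up) showing that
{{collectFirst}} over an iterator stops evaluating once the partial function
matches:

{code:scala}
object CollectFirstDemo {
  def main(args: Array[String]): Unit = {
    val files = Seq("corrupt-1", "good-file", "never-touched")
    var reads = 0

    val schema = files.toIterator
      .map { f => reads += 1; if (f.startsWith("good")) Some(s"schema-of-$f") else None }
      .collectFirst { case Some(s) => s }

    // Only the first two "files" were read; the third was never evaluated.
    assert(reads == 2)
    assert(schema.contains("schema-of-good-file"))
    println(s"reads = $reads, schema = $schema")
  }
}
{code}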

> Spark SQL Infer Schema Performance
> --
>
> Key: SPARK-32582
> URL: https://issues.apache.org/jira/browse/SPARK-32582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Jarred Li
>Priority: Major
>
> When schema inference is enabled, Spark lists all the files in the table even
> though only one of the files is used to read the schema information.
> Performance suffers because listing all the files in the table is expensive
> when the number of partitions is large.
>  
> See the code in
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]:
> all the files in the table are passed in, but only one file's schema is used
> to infer the schema.
>  






[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2020-08-10 Thread Tianchen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175117#comment-17175117
 ] 

Tianchen Zhang commented on SPARK-25299:


Hi [~mcheah], by following the history of this remote storage project, we know 
that the community's initial attempt was to resolve the limitations of the 
shuffle service. Does it imply that it's recommended to have a custom remote
storage implementation to avoid any limitations that a shuffle service may
cause? For Spark on K8s, we also notice the experimental option of running the
shuffle service as a DaemonSet. We want to make sure that what we are doing is
in line with the community's direction. Thanks.

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.
> Edit June 28 2019: Our SPIP is here: 
> [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]






[jira] [Created] (SPARK-32583) PySpark Structured Streaming Testing Support

2020-08-10 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created SPARK-32583:


 Summary: PySpark Structured Streaming Testing Support
 Key: SPARK-32583
 URL: https://issues.apache.org/jira/browse/SPARK-32583
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 2.4.4
Reporter: Felix Kizhakkel Jose


Hello,

I have investigated a lot but couldn't get any help or resource on how to 
+{color:#172b4d}*test my pyspark Structured Streaming pipeline job*{color}+ 
(ingesting from Kafka topics to S3) and how to build  Continuous Integration 
(CI)/Continuous Deployment (CD).

1. Is it possible to test (unit test, integration test) pyspark structured 
stream?

2. How to build  Continuous Integration (CI)/Continuous Deployment (CD)?






[jira] [Updated] (SPARK-32528) The analyze method should make sure the plan is analyzed

2020-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32528:
--
Fix Version/s: 3.0.1

> The analyze method should make sure the plan is analyzed
> 
>
> Key: SPARK-32528
> URL: https://issues.apache.org/jira/browse/SPARK-32528
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> In tests, we usually call `plan.analyze` to get the analyzed plan and test 
> analyzer/optimizer rules. However, `plan.analyze` doesn't guarantee the plan 
> is resolved, which may surprise the test writers.
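
As a rough illustration of the test pattern referred to above (a sketch
assuming the catalyst test DSL, not code from the ticket):

{code:scala}
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

// `analyze` from the DSL runs the plan through SimpleAnalyzer; the point of
// this ticket is that it should also assert the result is fully resolved.
val plan = LocalRelation('a.int, 'b.string).select('a)
val analyzed = plan.analyze
assert(analyzed.resolved)
{code}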






[jira] [Commented] (SPARK-32571) yarnClient.killApplication(appId) is never called

2020-08-10 Thread A Tester (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175096#comment-17175096
 ] 

A Tester commented on SPARK-32571:
--

[~viirya], if the application is a spark-streaming application, I can
understand that... but for batch-job style applications, it is typically
desirable for the life-cycle of the Spark application on the cluster to be
synchronous with the life-cycle of the batch job itself (i.e. a job being
scheduled by Apache Airflow, Autosys, etc.). The existing (although
unreachable) code leads me to believe the same was the original intent.

While Client mode would work, in my use case, it's a major benefit to run in 
cluster mode since the resources for the Driver will come from the cluster 
instead of the box the job is launched from.

However, an implementation triggered via some spark option would also be 
acceptable.  What do you think?

I have only ever used spark with yarn, not sure if the same signal propagation 
would be useful for other options (kubernetes, standalone, etc.) 

Something like 
||Property Name||Default||Meaning||Since Version||
|spark.yarn.propagateKillSignal|{{false}}|If enabled and the job is running in
cluster mode, then when the spark-submit process is killed gracefully, the job
running in YARN will also be killed.|...|
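
For illustration only, a rough Scala sketch of what such an opt-in hook could
look like on the client side. The config name and the helper are hypothetical,
not existing Spark options:

{code:scala}
import org.apache.hadoop.yarn.api.records.ApplicationId
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.spark.SparkConf

// Hypothetical helper: register a JVM shutdown hook that asks YARN to kill the
// application when spark-submit itself is terminated (Ctrl-C, SIGTERM, ...).
def registerKillOnShutdown(
    conf: SparkConf,
    yarnClient: YarnClient,
    appId: ApplicationId): Unit = {
  // "spark.yarn.propagateKillSignal" is a made-up name for the proposed option.
  if (conf.getBoolean("spark.yarn.propagateKillSignal", defaultValue = false)) {
    sys.addShutdownHook {
      yarnClient.killApplication(appId)
    }
  }
}
{code}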

 

> yarnClient.killApplication(appId) is never called
> -
>
> Key: SPARK-32571
> URL: https://issues.apache.org/jira/browse/SPARK-32571
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.4.0, 3.0.0
>Reporter: A Tester
>Priority: Major
>
> *Problem Statement:* 
> When an application is submitted using spark-submit in cluster mode using 
> yarn, the spark application continues to run on the cluster, even if 
> spark-submit itself has been requested to shutdown (Ctrl-C/SIGTERM/etc.)
> While there is code inside org.apache.spark.deploy.yarn.Client.scala that 
> would lead you to believe the spark application on the cluster will shut 
> down, this code is not currently reachable.
> Example of behavior:
> # run spark-submit ...
> # Ctrl-C or kill -15 the spark-submit process
> # spark-submit itself dies
> # the job can still be found running on the cluster
>  
> *Expectation:*
> When spark-submit is monitoring a YARN app and spark-submit itself is
> requested to shut down (SIGTERM, HUP, etc.), it should call
> yarnClient.killApplication(appId) so that the actual spark application 
> running on the cluster is killed.
>  
>  
> *Proposal*
> There is already a shutdown hook registered which cleans up temp files.  
> Could this be extended to call yarnClient.killApplication? 
> I believe the default behavior should be to request yarn to kill the 
> application, however I can imagine use cases where you may still want it to 
> run. To facilitate these use cases, an option should be provided to skip
> this hook.
>  






[jira] [Updated] (SPARK-32577) Fix the config value for shuffled hash join in test in-joins.sql

2020-08-10 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-32577:
-
Parent: SPARK-32461
Issue Type: Sub-task  (was: Improvement)

> Fix the config value for shuffled hash join in test in-joins.sql
> 
>
> Key: SPARK-32577
> URL: https://issues.apache.org/jira/browse/SPARK-32577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> The config value for enabling shuffled hash join is wrong in test 
> [https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/in-joins.sql#L8-L10],
>  which should not be spark.sql.autoBroadcastJoinThreshold=-1. Created this 
> Jira to fix this. This is a very minor change.






[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark

2020-08-10 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175052#comment-17175052
 ] 

Sean R. Owen commented on SPARK-32530:
--

I speak for myself, but, I think the cost is viewed as very high. Which is 
valid; maintaining every change across 3 languages has proved to be a lot of 
work. As I say, I'm not even sure of the future of the R bindings; despite 
hopes that these would be equally maintained, they just aren't at parity, even. 

I think the argument would have to be: this opens up Spark to a class of users 
that can't really otherwise use it. JVM users use Spark directly (or Java, more 
rarely), Python users can use Pyspark; R users for now can use SparkR or 
sparklyr, which is a separate project. I am not sure how many users use Kotlin, 
but can't use Java or Scala?

I fully trust the intentions of any group that says "we will maintain our 
contribution" despite the inevitable fact that it doesn't always work out that 
way in OSS. If the Kotlin bindings are in Spark, notionally, they have to be 
updated by anyone touching any APIs, etc. That just will never only fall on one 
group of maintainers.

OK, so you say, it's fine if they lag a bit, differ a bit. If some degree of 
'lag' is OK, then I think that neuters the argument, in the short term, for 
pushing it into Spark now. Just let it succeed as-is, as a third-party project. 
As you say, .NET is 'harder' to run this way but its bindings have done OK 
separately.

What goes wrong if it stays in its current form?

> SPIP: Kotlin support for Apache Spark
> -
>
> Key: SPARK-32530
> URL: https://issues.apache.org/jira/browse/SPARK-32530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Pasha Finkeshteyn
>Priority: Major
>
> h2. Background and motivation
> Kotlin is a cross-platform, statically typed, general-purpose JVM language. 
> In the last year more than 5 million developers have used Kotlin in mobile, 
> backend, frontend and scientific development. The number of Kotlin developers 
> grows rapidly every year. 
>  * [According to 
> redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: 
> "Kotlin, the second fastest growing language we’ve seen outside of Swift, 
> made a big splash a year ago at this time when it vaulted eight full spots up 
> the list."
>  * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], 
> Kotlin is the second most popular language on the JVM
>  * [According to 
> StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share 
> increased by 7.8% in 2020.
> We notice the increasing usage of Kotlin in data analysis ([6% of users in 
> 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to 
> 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in 
> 2019), and we expect these numbers to continue to grow. 
> We, authors of this SPIP, strongly believe that making Kotlin API officially 
> available to developers can bring new users to Apache Spark and help some of 
> the existing users.
> h2. Goals
> The goal of this project is to bring first-class support for Kotlin language 
> into the Apache Spark project. We’re going to achieve this by adding one more 
> module to the current Apache Spark distribution.
> h2. Non-goals
> There is no goal to replace any existing language support or to change any 
> existing Apache Spark API.
> At this time, there is no goal to support non-core APIs of Apache Spark like 
> Spark ML and Spark structured streaming. This may change in the future based 
> on community feedback.
> There is no goal to provide CLI for Kotlin for Apache Spark, this will be a 
> separate SPIP.
> There is no goal to provide support for Apache Spark < 3.0.0.
> h2. Current implementation
> A working prototype is available at 
> [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside 
> JetBrains and by early adopters.
> h2. What are the risks?
> There is always a risk that this product won’t get enough popularity and will 
> bring more costs than benefits. It can be mitigated by the fact that we don't 
> need to change any existing API and support can be potentially dropped at any 
> time.
> We also believe that existing API is rather low maintenance. It does not 
> bring anything more complex than already exists in the Spark codebase. 
> Furthermore, the implementation is compact - less than 2000 lines of code.
> We are committed to maintaining, improving and evolving the API based on 
> feedback from both Spark and Kotlin communities. As the Kotlin data community 
> continues to grow, we see Kotlin API for Apache Spark as an important part in 
> the evolving Kotlin ecosystem, and intend to fully support it. 
> h2. How 

[jira] [Commented] (SPARK-32530) SPIP: Kotlin support for Apache Spark

2020-08-10 Thread Pasha Finkeshteyn (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175010#comment-17175010
 ] 

Pasha Finkeshteyn commented on SPARK-32530:
---

Hi [~srowen], we agree with you that it’s good to have more first-class 
language bindings for Spark. We also agree with you that if Kotlin is part of 
Apache Spark, it becomes more 'official' and trusted.
Of course, as with anything, there’s a maintenance cost, but it isn’t fair to 
compare it to .NET bindings, Python or R. After all, Kotlin is a JVM language 
and compatibility comes way easier.
Still, there’s maintenance. As we say in the ticket, we are committed to 
maintaining, improving and evolving the API based on feedback from both Spark 
and Kotlin communities, and we intend to fully support it. I can understand 
that this may not be enough, but we hope that with time and growing community, 
the value of integrating Kotlin bindings into Apache Spark becomes more 
apparent. 
It would be great if you could shed some light on how the Spark project's
management committee measures the value versus the cost, and what the tipping
point would be.



> SPIP: Kotlin support for Apache Spark
> -
>
> Key: SPARK-32530
> URL: https://issues.apache.org/jira/browse/SPARK-32530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Pasha Finkeshteyn
>Priority: Major
>
> h2. Background and motivation
> Kotlin is a cross-platform, statically typed, general-purpose JVM language. 
> In the last year more than 5 million developers have used Kotlin in mobile, 
> backend, frontend and scientific development. The number of Kotlin developers 
> grows rapidly every year. 
>  * [According to 
> redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: 
> "Kotlin, the second fastest growing language we’ve seen outside of Swift, 
> made a big splash a year ago at this time when it vaulted eight full spots up 
> the list."
>  * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], 
> Kotlin is the second most popular language on the JVM
>  * [According to 
> StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share 
> increased by 7.8% in 2020.
> We notice the increasing usage of Kotlin in data analysis ([6% of users in 
> 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to 
> 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in 
> 2019), and we expect these numbers to continue to grow. 
> We, authors of this SPIP, strongly believe that making Kotlin API officially 
> available to developers can bring new users to Apache Spark and help some of 
> the existing users.
> h2. Goals
> The goal of this project is to bring first-class support for Kotlin language 
> into the Apache Spark project. We’re going to achieve this by adding one more 
> module to the current Apache Spark distribution.
> h2. Non-goals
> There is no goal to replace any existing language support or to change any 
> existing Apache Spark API.
> At this time, there is no goal to support non-core APIs of Apache Spark like 
> Spark ML and Spark structured streaming. This may change in the future based 
> on community feedback.
> There is no goal to provide CLI for Kotlin for Apache Spark, this will be a 
> separate SPIP.
> There is no goal to provide support for Apache Spark < 3.0.0.
> h2. Current implementation
> A working prototype is available at 
> [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside 
> JetBrains and by early adopters.
> h2. What are the risks?
> There is always a risk that this product won’t get enough popularity and will 
> bring more costs than benefits. It can be mitigated by the fact that we don't 
> need to change any existing API and support can be potentially dropped at any 
> time.
> We also believe that existing API is rather low maintenance. It does not 
> bring anything more complex than already exists in the Spark codebase. 
> Furthermore, the implementation is compact - less than 2000 lines of code.
> We are committed to maintaining, improving and evolving the API based on 
> feedback from both Spark and Kotlin communities. As the Kotlin data community 
> continues to grow, we see Kotlin API for Apache Spark as an important part in 
> the evolving Kotlin ecosystem, and intend to fully support it. 
> h2. How long will it take?
> A working implementation is already available, and if the community has any 
> proposals for improving this implementation, the changes can be made quickly, 
> in weeks if not days.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-32528) The analyze method should make sure the plan is analyzed

2020-08-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174989#comment-17174989
 ] 

Apache Spark commented on SPARK-32528:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29400

> The analyze method should make sure the plan is analyzed
> 
>
> Key: SPARK-32528
> URL: https://issues.apache.org/jira/browse/SPARK-32528
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 3.1.0
>
>
> In tests, we usually call `plan.analyze` to get the analyzed plan and test 
> analyzer/optimizer rules. However, `plan.analyze` doesn't guarantee the plan 
> is resolved, which may surprise the test writers.
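A minimal sketch of the guard this issue suggests (the helper below is illustrative only, not Spark's actual test utility): wrap whatever analysis helper a test uses and fail fast if the returned plan is not fully resolved.

{code}
// Illustrative sketch: verify that analysis really produced a resolved plan
// before a test starts checking analyzer/optimizer rules against it.
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def analyzeChecked(analyze: LogicalPlan => LogicalPlan)(plan: LogicalPlan): LogicalPlan = {
  val analyzed = analyze(plan)
  assert(analyzed.resolved, s"Plan is not fully resolved after analysis:\n$analyzed")
  analyzed
}
{code}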



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32528) The analyze method should make sure the plan is analyzed

2020-08-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174988#comment-17174988
 ] 

Apache Spark commented on SPARK-32528:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29400

> The analyze method should make sure the plan is analyzed
> 
>
> Key: SPARK-32528
> URL: https://issues.apache.org/jira/browse/SPARK-32528
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 3.1.0
>
>
> In tests, we usually call `plan.analyze` to get the analyzed plan and test 
> analyzer/optimizer rules. However, `plan.analyze` doesn't guarantee the plan 
> is resolved, which may surprise the test writers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32469) ApplyColumnarRulesAndInsertTransitions should be idempotent

2020-08-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32469.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29273
[https://github.com/apache/spark/pull/29273]

> ApplyColumnarRulesAndInsertTransitions should be idempotent
> ---
>
> Key: SPARK-32469
> URL: https://issues.apache.org/jira/browse/SPARK-32469
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0
>
>
> It's good hygiene to keep rules idempotent, even if we know the rule is going 
> to be run only once. This is for future-proof.
> ApplyColumnarRulesAndInsertTransitions can add columnar-to-row and 
> row-to-columnar operators repeatedly, we should fix it.
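The property being asked for can be illustrated with a small, generic sketch (not Spark's actual rule or test harness): applying a rule a second time must not change the result of the first application.

{code}
// Generic idempotence check: rule(rule(x)) must equal rule(x).
def assertIdempotent[A](rule: A => A)(input: A): Unit = {
  val once  = rule(input)
  val twice = rule(once)
  assert(once == twice, "rule is not idempotent: a second application changed the result")
}

// Example with a trivially idempotent function:
assertIdempotent((s: String) => s.trim)("  hello  ")
{code}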



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32403) SCRIPT TRANSFORM Extract common method from process row to avoid repeated judgement

2020-08-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32403.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29199
[https://github.com/apache/spark/pull/29199]

> SCRIPT TRANSFORM Extract common method from process row to avoid repeated 
> judgement
> --
>
> Key: SPARK-32403
> URL: https://issues.apache.org/jira/browse/SPARK-32403
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32403) SCRIPT TRANSFORM Extract common method from process row to avoid repeated judgement

2020-08-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32403:
---

Assignee: angerszhu

> SCRIPT TRANSFORM Extract common method from process row to avoid repeated 
> judgement
> --
>
> Key: SPARK-32403
> URL: https://issues.apache.org/jira/browse/SPARK-32403
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32409) Document the dependency between spark.metrics.staticSources.enabled and JVMSource registration

2020-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32409:
--
Issue Type: Documentation  (was: Bug)

> Document the dependency between spark.metrics.staticSources.enabled and 
> JVMSource registration
> --
>
> Key: SPARK-32409
> URL: https://issues.apache.org/jira/browse/SPARK-32409
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> SPARK-29654 has introduced the configuration 
> `spark.metrics.register.static.sources.enabled`.
>  The current implementations of SPARK-29654 and SPARK-25277 have, as a side 
> effect, that when {{spark.metrics.register.static.sources.enabled}} is set to 
> false, the registration of the JVM source is also ignored, even if it is requested with 
> {{spark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource}}
>  A PR is proposed to fix this.
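For illustration, a hedged configuration sketch of the combination described above (key spellings follow the description; the issue summary uses spark.metrics.staticSources.enabled):

{code}
// Sketch only: with static sources disabled, the explicit JVM source request
// below is currently ignored as well, which is the behavior to be documented.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.metrics.register.static.sources.enabled", "false")
  .set("spark.metrics.conf.*.source.jvm.class",
    "org.apache.spark.metrics.source.JvmSource")
{code}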



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32409) Document the dependency between spark.metrics.staticSources.enabled and JVMSource registration

2020-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32409:
-

Assignee: Luca Canali

> Document the dependency between spark.metrics.staticSources.enabled and 
> JVMSource registration
> --
>
> Key: SPARK-32409
> URL: https://issues.apache.org/jira/browse/SPARK-32409
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> SPARK-29654 has introduced the configuration 
> `spark.metrics.register.static.sources.enabled`.
>  The current implementations of SPARK-29654 and SPARK-25277 have, as a side 
> effect, that when {{spark.metrics.register.static.sources.enabled}} is set to 
> false, the registration of the JVM source is also ignored, even if it is requested with 
> {{spark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource}}
>  A PR is proposed to fix this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32409) Document the dependency between spark.metrics.staticSources.enabled and JVMSource registration

2020-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32409.
---
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29203
[https://github.com/apache/spark/pull/29203]

> Document the dependency between spark.metrics.staticSources.enabled and 
> JVMSource registration
> --
>
> Key: SPARK-32409
> URL: https://issues.apache.org/jira/browse/SPARK-32409
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> SPARK-29654 has introduced the configuration 
> `spark.metrics.register.static.sources.enabled`.
>  The current implementations of SPARK-29654 and SPARK-25277 have, as a side 
> effect, that when {{spark.metrics.register.static.sources.enabled}} is set to 
> false, the registration of the JVM source is also ignored, even if it is requested with 
> {{spark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource}}
>  A PR is proposed to fix this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32235) Kubernetes Configuration to set Service Account to Executors

2020-08-10 Thread Pedro Rossi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pedro Rossi resolved SPARK-32235.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue was solved by SPARK-30122, but that change has not been released yet.

> Kubernetes Configuration to set Service Account to Executors
> 
>
> Key: SPARK-32235
> URL: https://issues.apache.org/jira/browse/SPARK-32235
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Pedro Rossi
>Priority: Minor
> Fix For: 3.1.0
>
>
> Some cloud providers use Service Accounts to provide resource authorization 
> (one example is described here 
> [https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/)]
>  and for this we need to be able to set Service Accounts to the executors.
> My idea for development of this feature would be to have a configuration like 
> "spark.kubernetes.authenticate.executor.serviceAccountName" in order to set 
> the executors Service Account, this way it could be possible to allow only 
> certain accesses to the driver and others to the executors or the same access 
> (user's choice).
> I am creating this issue so the maintainers can write opinions first, but I 
> intend to create a pull request to address this issue also.
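A hedged sketch of what the proposal could look like next to the existing driver-side key (the executor key below is the one proposed here, not yet an established configuration):

{code}
// Sketch only: separate service accounts for the driver and the executors.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-driver")
  .set("spark.kubernetes.authenticate.executor.serviceAccountName", "spark-executor")
{code}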



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32580) Issue accessing a column values after 'explode' function

2020-08-10 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174361#comment-17174361
 ] 

Takeshi Yamamuro commented on SPARK-32580:
--

I made the given query simpler;
{code}
scala> import org.apache.spark.sql.functions
scala> val df1 = spark.range(1).selectExpr("array(named_struct('item_id', 1, 
'values', array(named_struct('sample', 0.1)))) data")
scala> val df2 = df1.withColumn("item", functions.explode(df1.col("data")))
scala> val df3 = df2.withColumn("value", 
functions.explode(df2.col("item.values")))
scala> val df4 = df3.select(df3.col("item.item_id"), df3.col("value.sample"))
scala> df4.show()
20/08/10 22:53:26 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: _gen_alias_26#26
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.
{code}

Looks like this issue only happens in branch-3.0. It passed in 
master/branch-2.4.

> Issue accessing a column values after 'explode' function
> 
>
> Key: SPARK-32580
> URL: https://issues.apache.org/jira/browse/SPARK-32580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ayrat Sadreev
>Priority: Major
> Attachments: ExplodeTest.java, data.json
>
>
> An exception occurs when trying to flatten double nested arrays
> The schema is
> {code:none}
> root
>  |-- data: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- item_id: string (nullable = true)
>  |||-- timestamp: string (nullable = true)
>  |||-- values: array (nullable = true)
>  ||||-- element: struct (containsNull = true)
>  |||||-- sample: double (nullable = true)
> {code}
> The target schema is
> {code:none}
> root
>  |-- item_id: string (nullable = true)
>  |-- timestamp: string (nullable = true)
>  |-- sample: double (nullable = true)
> {code}
> The code (in Java)
> {code:java}
> package com.skf.streamer.spark;
> import java.util.concurrent.TimeoutException;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class ExplodeTest {
>public static void main(String[] args) throws TimeoutException {
>   SparkConf conf = new SparkConf()
>  .setAppName("SimpleApp")
>  .set("spark.scheduler.mode", "FAIR")
>  .set("spark.master", "local[1]")
>  .set("spark.sql.streaming.checkpointLocation", "checkpoint");
>   SparkSession spark = SparkSession.builder()
>  .config(conf)
>  .getOrCreate();
>   Dataset d0 = spark
>  .read()
>  .format("json")
>  .option("multiLine", "true")
>  .schema(getSchema())
>  .load("src/test/resources/explode/data.json");
>   d0.printSchema();
>   d0 = d0.withColumn("item", functions.explode(d0.col("data")));
>   d0 = d0.withColumn("value", functions.explode(d0.col("item.values")));
>   d0.printSchema();
>   d0 = d0.select(
>  d0.col("item.item_id"),
>  d0.col("item.timestamp"),
>  d0.col("value.sample")
>   );
>   d0.printSchema();
>   d0.show(); // Fails
>   spark.stop();
>}
>private static StructType getSchema() {
>   StructField[] level2Fields = {
>  DataTypes.createStructField("sample", DataTypes.DoubleType, false),
>   };
>   StructField[] level1Fields = {
>  DataTypes.createStructField("item_id", DataTypes.StringType, false),
>  DataTypes.createStructField("timestamp", DataTypes.StringType, 
> false),
>  DataTypes.createStructField("values", 
> DataTypes.createArrayType(DataTypes.createStructType(level2Fields)), false)
>   };
>   StructField[] fields = {
>  DataTypes.createStructField("data", 
> DataTypes.createArrayType(DataTypes.createStructType(level1Fields)), false)
>   };
>   return DataTypes.createStructType(fields);
>}
> }
> {code}
> The data file
> 

[jira] [Comment Edited] (SPARK-32580) Issue accessing a column values after 'explode' function

2020-08-10 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174361#comment-17174361
 ] 

Takeshi Yamamuro edited comment on SPARK-32580 at 8/10/20, 2:38 PM:


I made the given query simpler;
{code}
scala> import org.apache.spark.sql.functions
scala> val df1 = spark.range(1).selectExpr("array(named_struct('item_id', 1, 
'values', array(named_struct('sample', 0.1)))) data")
scala> val df2 = df1.withColumn("item", functions.explode(df1.col("data")))
scala> val df3 = df2.withColumn("value", 
functions.explode(df2.col("item.values")))
scala> val df4 = df3.select(df3.col("item.item_id"), df3.col("value.sample"))
scala> df4.show()
20/08/10 22:53:26 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: _gen_alias_26#26
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.
{code}

Looks like this issue only happens in v3.0.0. It passed in master/v2.4.6.


was (Author: maropu):
I made the given query simpler;
{code}
scala> import org.apache.spark.sql.functions
scala> val df1 = spark.range(1).selectExpr("array(named_struct('item_id', 1, 
'values', array(named_struct('sample', 0.1)))) data")
scala> val df2 = df1.withColumn("item", functions.explode(df1.col("data")))
scala> val df3 = df2.withColumn("value", 
functions.explode(df2.col("item.values")))
scala> val df4 = df3.select(df3.col("item.item_id"), df3.col("value.sample"))
scala> df4.show()
20/08/10 22:53:26 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: _gen_alias_26#26
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.
{code}

Looks like this issue only happens in branch-3.0. It passed in 
master/branch-2.4.

> Issue accessing a column values after 'explode' function
> 
>
> Key: SPARK-32580
> URL: https://issues.apache.org/jira/browse/SPARK-32580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ayrat Sadreev
>Priority: Major
> Attachments: ExplodeTest.java, data.json
>
>
> An exception occurs when trying to flatten double nested arrays
> The schema is
> {code:none}
> root
>  |-- data: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- item_id: string (nullable = true)
>  |||-- timestamp: string (nullable = true)
>  |||-- values: array (nullable = true)
>  ||||-- element: struct (containsNull = true)
>  |||||-- sample: double (nullable = true)
> {code}
> The target schema is
> {code:none}
> root
>  |-- item_id: string (nullable = true)
>  |-- timestamp: string (nullable = true)
>  |-- sample: double (nullable = true)
> {code}
> The code (in Java)
> {code:java}
> package com.skf.streamer.spark;
> import java.util.concurrent.TimeoutException;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class ExplodeTest {
>public static void main(String[] args) throws TimeoutException {
>   SparkConf conf = new SparkConf()
>  .setAppName("SimpleApp")
>  .set("spark.scheduler.mode", "FAIR")
>  .set("spark.master", "local[1]")
>  .set("spark.sql.streaming.checkpointLocation", "checkpoint");
>   SparkSession spark = SparkSession.builder()
>  .config(conf)
>  .getOrCreate();
>   Dataset d0 = 

[jira] [Assigned] (SPARK-32517) Add StorageLevel.DISK_ONLY_3

2020-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32517:
-

Assignee: Dongjoon Hyun

> Add StorageLevel.DISK_ONLY_3
> 
>
> Key: SPARK-32517
> URL: https://issues.apache.org/jira/browse/SPARK-32517
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This issue aims to add `StorageLevel.DISK_ONLY_3` as a built-in StorageLevel.
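A short, hedged usage sketch, assuming the new level lands under org.apache.spark.storage.StorageLevel with the name given in the summary:

{code}
// Sketch only: persist an RDD on disk with three replicas.
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = SparkContext.getOrCreate()
val rdd = sc.parallelize(1 to 1000)
rdd.persist(StorageLevel.DISK_ONLY_3)
rdd.count()
{code}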



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32517) Add StorageLevel.DISK_ONLY_3

2020-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32517.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29331
[https://github.com/apache/spark/pull/29331]

> Add StorageLevel.DISK_ONLY_3
> 
>
> Key: SPARK-32517
> URL: https://issues.apache.org/jira/browse/SPARK-32517
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> This issue aims to add `StorageLevel.DISK_ONLY_3` as a built-in StorageLevel.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32582) Spark SQL Infer Schema Performance

2020-08-10 Thread Jarred Li (Jira)
Jarred Li created SPARK-32582:
-

 Summary: Spark SQL Infer Schema Performance
 Key: SPARK-32582
 URL: https://issues.apache.org/jira/browse/SPARK-32582
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0, 2.4.6
Reporter: Jarred Li


When schema inference is enabled, Spark lists all the files in the table, but 
only one of the files is actually used to read the schema information. 
Performance suffers because every file in the table is listed, which is costly 
when the number of partitions is large.

 

See the code at 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]:
 all the files in the table are passed in, but only one file's schema is used 
to infer the schema.
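As a hedged illustration (the paths below are hypothetical), a common user-side workaround is to read the schema from a single file and pass it explicitly, which skips inference over the full file listing:

{code}
// Sketch only: infer the schema from one file, then apply it to the whole table.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val oneFile   = "/warehouse/my_table/part-00000.orc"  // hypothetical single file
val tablePath = "/warehouse/my_table"                 // hypothetical table root

val schema = spark.read.orc(oneFile).schema
val df = spark.read.schema(schema).orc(tablePath)
{code}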

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32554) Remove the words "experimental" in the k8s document

2020-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32554:
--
Description: 
To give users a more accurate understanding of the current development status of 
the k8s scheduler, this ticket targets updating the k8s document in the 
primary branch/branch-3.0:

BEFORE:
{code:java}
 The Kubernetes scheduler is currently experimental. In future versions, there 
may be behavioral changes around 
configuration, container images and entrypoints.{code}
AFTER:
{code:java}
{code}
This comes from a thread in the spark-dev mailing list: 
[http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-k8s-is-still-experimental-td29942.html]

  was:
To give users a more accurate understanding of the current development status of 
the k8s scheduler, this ticket targets updating the k8s document in the 
primary branch/branch-3.0:

BEFORE:
{code:java}
 The Kubernetes scheduler is currently experimental. In future versions, there 
may be behavioral changes around 
configuration, container images and entrypoints.{code}
AFTER:
{code:java}
 The Kubernetes scheduler is currently experimental. The most basic parts are 
getting stable, but Dynamic
Resource Allocation and External Shuffle Service need to be available before we 
officially announce GA for it.{code}
This comes from a thread in the spark-dev mailing list: 
[http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-k8s-is-still-experimental-td29942.html]


> Remove the words "experimental" in the k8s document
> ---
>
> Key: SPARK-32554
> URL: https://issues.apache.org/jira/browse/SPARK-32554
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes
>Affects Versions: 3.0.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.1.0
>
>
> To give users a more accurate understanding of the current development status 
> of the k8s scheduler, this ticket targets updating the k8s document in the 
> primary branch/branch-3.0:
> BEFORE:
> {code:java}
>  The Kubernetes scheduler is currently experimental. In future versions, 
> there may be behavioral changes around 
> configuration, container images and entrypoints.{code}
> AFTER:
> {code:java}
> {code}
> This comes from a thread in the spark-dev mailing list: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-k8s-is-still-experimental-td29942.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32554) Remove the words "experimental" in the k8s document

2020-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32554:
--
Summary: Remove the words "experimental" in the k8s document  (was: Update 
the k8s document according to the current development status)

> Remove the words "experimental" in the k8s document
> ---
>
> Key: SPARK-32554
> URL: https://issues.apache.org/jira/browse/SPARK-32554
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes
>Affects Versions: 3.0.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.1.0
>
>
> To give users a more accurate understanding of the current development status 
> of the k8s scheduler, this ticket targets updating the k8s document in the 
> primary branch/branch-3.0:
> BEFORE:
> {code:java}
>  The Kubernetes scheduler is currently experimental. In future versions, 
> there may be behavioral changes around 
> configuration, container images and entrypoints.{code}
> AFTER:
> {code:java}
>  The Kubernetes scheduler is currently experimental. The most basic parts are 
> getting stable, but Dynamic
> Resource Allocation and External Shuffle Service need to be available before 
> we officially announce GA for it.{code}
> This comes from a thread in the spark-dev mailing list: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-k8s-is-still-experimental-td29942.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32554) Update the k8s document according to the current development status

2020-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32554:
-

Assignee: Takeshi Yamamuro

> Update the k8s document according to the current development status
> ---
>
> Key: SPARK-32554
> URL: https://issues.apache.org/jira/browse/SPARK-32554
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes
>Affects Versions: 3.0.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> To give users a more accurate understanding of the current development status 
> of the k8s scheduler, this ticket targets updating the k8s document in the 
> primary branch/branch-3.0:
> BEFORE:
> {code:java}
>  The Kubernetes scheduler is currently experimental. In future versions, 
> there may be behavioral changes around 
> configuration, container images and entrypoints.{code}
> AFTER:
> {code:java}
>  The Kubernetes scheduler is currently experimental. The most basic parts are 
> getting stable, but Dynamic
> Resource Allocation and External Shuffle Service need to be available before 
> we officially announce GA for it.{code}
> This comes from a thread in the spark-dev mailing list: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-k8s-is-still-experimental-td29942.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32554) Update the k8s document according to the current development status

2020-08-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32554.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29368
[https://github.com/apache/spark/pull/29368]

> Update the k8s document according to the current development status
> ---
>
> Key: SPARK-32554
> URL: https://issues.apache.org/jira/browse/SPARK-32554
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes
>Affects Versions: 3.0.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.1.0
>
>
> To give users a more accurate understanding of the current development status 
> of the k8s scheduler, this ticket targets updating the k8s document in the 
> primary branch/branch-3.0:
> BEFORE:
> {code:java}
>  The Kubernetes scheduler is currently experimental. In future versions, 
> there may be behavioral changes around 
> configuration, container images and entrypoints.{code}
> AFTER:
> {code:java}
>  The Kubernetes scheduler is currently experimental. The most basic parts are 
> getting stable, but Dynamic
> Resource Allocation and External Shuffle Service need to be available before 
> we officially announce GA for it.{code}
> This comes from a thread in the spark-dev mailing list: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-k8s-is-still-experimental-td29942.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32580) Issue accessing a column values after 'explode' function

2020-08-10 Thread Ayrat Sadreev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayrat Sadreev updated SPARK-32580:
--
Attachment: ExplodeTest.java

> Issue accessing a column values after 'explode' function
> 
>
> Key: SPARK-32580
> URL: https://issues.apache.org/jira/browse/SPARK-32580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ayrat Sadreev
>Priority: Major
> Attachments: ExplodeTest.java, data.json
>
>
> An exception occurs when trying to flatten double nested arrays
> The schema is
> {code:none}
> root
>  |-- data: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- item_id: string (nullable = true)
>  |||-- timestamp: string (nullable = true)
>  |||-- values: array (nullable = true)
>  ||||-- element: struct (containsNull = true)
>  |||||-- sample: double (nullable = true)
> {code}
> The target schema is
> {code:none}
> root
>  |-- item_id: string (nullable = true)
>  |-- timestamp: string (nullable = true)
>  |-- sample: double (nullable = true)
> {code}
> The code (in Java)
> {code:java}
> package com.skf.streamer.spark;
> import java.util.concurrent.TimeoutException;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class ExplodeTest {
>public static void main(String[] args) throws TimeoutException {
>   SparkConf conf = new SparkConf()
>  .setAppName("SimpleApp")
>  .set("spark.scheduler.mode", "FAIR")
>  .set("spark.master", "local[1]")
>  .set("spark.sql.streaming.checkpointLocation", "checkpoint");
>   SparkSession spark = SparkSession.builder()
>  .config(conf)
>  .getOrCreate();
>   Dataset d0 = spark
>  .read()
>  .format("json")
>  .option("multiLine", "true")
>  .schema(getSchema())
>  .load("src/test/resources/explode/data.json");
>   d0.printSchema();
>   d0 = d0.withColumn("item", functions.explode(d0.col("data")));
>   d0 = d0.withColumn("value", functions.explode(d0.col("item.values")));
>   d0.printSchema();
>   d0 = d0.select(
>  d0.col("item.item_id"),
>  d0.col("item.timestamp"),
>  d0.col("value.sample")
>   );
>   d0.printSchema();
>   d0.show(); // Fails
>   spark.stop();
>}
>private static StructType getSchema() {
>   StructField[] level2Fields = {
>  DataTypes.createStructField("sample", DataTypes.DoubleType, false),
>   };
>   StructField[] level1Fields = {
>  DataTypes.createStructField("item_id", DataTypes.StringType, false),
>  DataTypes.createStructField("timestamp", DataTypes.StringType, 
> false),
>  DataTypes.createStructField("values", 
> DataTypes.createArrayType(DataTypes.createStructType(level2Fields)), false)
>   };
>   StructField[] fields = {
>  DataTypes.createStructField("data", 
> DataTypes.createArrayType(DataTypes.createStructType(level1Fields)), false)
>   };
>   return DataTypes.createStructType(fields);
>}
> }
> {code}
> The data file
> {code:json}
> {
>   "data": [
> {
>   "item_id": "item_1",
>   "timestamp": "2020-07-01 12:34:89",
>   "values": [
> {
>   "sample": 1.1
> },
> {
>   "sample": 1.2
> }
>   ]
> },
> {
>   "item_id": "item_2",
>   "timestamp": "2020-07-02 12:34:89",
>   "values": [
> {
>   "sample": 2.2
> }
>   ]
> }
>   ]
> }
> {code}
> Dataset.show() method fails with an exception
> {code:none}
> Caused by: java.lang.RuntimeException: Couldn't find _gen_alias_30#30 in 
> [_gen_alias_28#28,_gen_alias_29#29]
>   at scala.sys.package$.error(package.scala:30)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.$anonfun$applyOrElse$1(BoundAttribute.scala:81)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 37 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32580) Issue accessing a column values after 'explode' function

2020-08-10 Thread Ayrat Sadreev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayrat Sadreev updated SPARK-32580:
--
Attachment: data.json

> Issue accessing a column values after 'explode' function
> 
>
> Key: SPARK-32580
> URL: https://issues.apache.org/jira/browse/SPARK-32580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ayrat Sadreev
>Priority: Major
> Attachments: ExplodeTest.java, data.json
>
>
> An exception occurs when trying to flatten double nested arrays
> The schema is
> {code:none}
> root
>  |-- data: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- item_id: string (nullable = true)
>  |||-- timestamp: string (nullable = true)
>  |||-- values: array (nullable = true)
>  ||||-- element: struct (containsNull = true)
>  |||||-- sample: double (nullable = true)
> {code}
> The target schema is
> {code:none}
> root
>  |-- item_id: string (nullable = true)
>  |-- timestamp: string (nullable = true)
>  |-- sample: double (nullable = true)
> {code}
> The code (in Java)
> {code:java}
> package com.skf.streamer.spark;
> import java.util.concurrent.TimeoutException;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class ExplodeTest {
>public static void main(String[] args) throws TimeoutException {
>   SparkConf conf = new SparkConf()
>  .setAppName("SimpleApp")
>  .set("spark.scheduler.mode", "FAIR")
>  .set("spark.master", "local[1]")
>  .set("spark.sql.streaming.checkpointLocation", "checkpoint");
>   SparkSession spark = SparkSession.builder()
>  .config(conf)
>  .getOrCreate();
>   Dataset d0 = spark
>  .read()
>  .format("json")
>  .option("multiLine", "true")
>  .schema(getSchema())
>  .load("src/test/resources/explode/data.json");
>   d0.printSchema();
>   d0 = d0.withColumn("item", functions.explode(d0.col("data")));
>   d0 = d0.withColumn("value", functions.explode(d0.col("item.values")));
>   d0.printSchema();
>   d0 = d0.select(
>  d0.col("item.item_id"),
>  d0.col("item.timestamp"),
>  d0.col("value.sample")
>   );
>   d0.printSchema();
>   d0.show(); // Fails
>   spark.stop();
>}
>private static StructType getSchema() {
>   StructField[] level2Fields = {
>  DataTypes.createStructField("sample", DataTypes.DoubleType, false),
>   };
>   StructField[] level1Fields = {
>  DataTypes.createStructField("item_id", DataTypes.StringType, false),
>  DataTypes.createStructField("timestamp", DataTypes.StringType, 
> false),
>  DataTypes.createStructField("values", 
> DataTypes.createArrayType(DataTypes.createStructType(level2Fields)), false)
>   };
>   StructField[] fields = {
>  DataTypes.createStructField("data", 
> DataTypes.createArrayType(DataTypes.createStructType(level1Fields)), false)
>   };
>   return DataTypes.createStructType(fields);
>}
> }
> {code}
> The data file
> {code:json}
> {
>   "data": [
> {
>   "item_id": "item_1",
>   "timestamp": "2020-07-01 12:34:89",
>   "values": [
> {
>   "sample": 1.1
> },
> {
>   "sample": 1.2
> }
>   ]
> },
> {
>   "item_id": "item_2",
>   "timestamp": "2020-07-02 12:34:89",
>   "values": [
> {
>   "sample": 2.2
> }
>   ]
> }
>   ]
> }
> {code}
> Dataset.show() method fails with an exception
> {code:none}
> Caused by: java.lang.RuntimeException: Couldn't find _gen_alias_30#30 in 
> [_gen_alias_28#28,_gen_alias_29#29]
>   at scala.sys.package$.error(package.scala:30)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.$anonfun$applyOrElse$1(BoundAttribute.scala:81)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 37 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32580) Issue accessing a column values after 'explode' function

2020-08-10 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174303#comment-17174303
 ] 

Takeshi Yamamuro edited comment on SPARK-32580 at 8/10/20, 1:26 PM:


Please do not use "Blocker" in the priority and it is reserved for committers. 
Anyway, thanks for the report.


was (Author: maropu):
Please do not use "Blocker" in the priority and it is reserved by committers. 
Anyway, thanks for the report.

> Issue accessing a column values after 'explode' function
> 
>
> Key: SPARK-32580
> URL: https://issues.apache.org/jira/browse/SPARK-32580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ayrat Sadreev
>Priority: Major
>
> An exception occurs when trying to flatten double nested arrays
> The schema is
> {code:none}
> root
>  |-- data: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- item_id: string (nullable = true)
>  |||-- timestamp: string (nullable = true)
>  |||-- values: array (nullable = true)
>  ||||-- element: struct (containsNull = true)
>  |||||-- sample: double (nullable = true)
> {code}
> The target schema is
> {code:none}
> root
>  |-- item_id: string (nullable = true)
>  |-- timestamp: string (nullable = true)
>  |-- sample: double (nullable = true)
> {code}
> The code (in Java)
> {code:java}
> package com.skf.streamer.spark;
> import java.util.concurrent.TimeoutException;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class ExplodeTest {
>public static void main(String[] args) throws TimeoutException {
>   SparkConf conf = new SparkConf()
>  .setAppName("SimpleApp")
>  .set("spark.scheduler.mode", "FAIR")
>  .set("spark.master", "local[1]")
>  .set("spark.sql.streaming.checkpointLocation", "checkpoint");
>   SparkSession spark = SparkSession.builder()
>  .config(conf)
>  .getOrCreate();
>   Dataset d0 = spark
>  .read()
>  .format("json")
>  .option("multiLine", "true")
>  .schema(getSchema())
>  .load("src/test/resources/explode/data.json");
>   d0.printSchema();
>   d0 = d0.withColumn("item", functions.explode(d0.col("data")));
>   d0 = d0.withColumn("value", functions.explode(d0.col("item.values")));
>   d0.printSchema();
>   d0 = d0.select(
>  d0.col("item.item_id"),
>  d0.col("item.timestamp"),
>  d0.col("value.sample")
>   );
>   d0.printSchema();
>   d0.show(); // Fails
>   spark.stop();
>}
>private static StructType getSchema() {
>   StructField[] level2Fields = {
>  DataTypes.createStructField("sample", DataTypes.DoubleType, false),
>   };
>   StructField[] level1Fields = {
>  DataTypes.createStructField("item_id", DataTypes.StringType, false),
>  DataTypes.createStructField("timestamp", DataTypes.StringType, 
> false),
>  DataTypes.createStructField("values", 
> DataTypes.createArrayType(DataTypes.createStructType(level2Fields)), false)
>   };
>   StructField[] fields = {
>  DataTypes.createStructField("data", 
> DataTypes.createArrayType(DataTypes.createStructType(level1Fields)), false)
>   };
>   return DataTypes.createStructType(fields);
>}
> }
> {code}
> The data file
> {code:json}
> {
>   "data": [
> {
>   "item_id": "item_1",
>   "timestamp": "2020-07-01 12:34:89",
>   "values": [
> {
>   "sample": 1.1
> },
> {
>   "sample": 1.2
> }
>   ]
> },
> {
>   "item_id": "item_2",
>   "timestamp": "2020-07-02 12:34:89",
>   "values": [
> {
>   "sample": 2.2
> }
>   ]
> }
>   ]
> }
> {code}
> Dataset.show() method fails with an exception
> {code:none}
> Caused by: java.lang.RuntimeException: Couldn't find _gen_alias_30#30 in 
> [_gen_alias_28#28,_gen_alias_29#29]
>   at scala.sys.package$.error(package.scala:30)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.$anonfun$applyOrElse$1(BoundAttribute.scala:81)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 37 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-32580) Issue accessing a column values after 'explode' function

2020-08-10 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174303#comment-17174303
 ] 

Takeshi Yamamuro commented on SPARK-32580:
--

Please do not use "Blocker" in the priority and it is reserved by committers. 
Anyway, thanks for the report.

> Issue accessing a column values after 'explode' function
> 
>
> Key: SPARK-32580
> URL: https://issues.apache.org/jira/browse/SPARK-32580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ayrat Sadreev
>Priority: Major
>
> An exception occurs when trying to flatten double nested arrays
> The schema is
> {code:none}
> root
>  |-- data: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- item_id: string (nullable = true)
>  |||-- timestamp: string (nullable = true)
>  |||-- values: array (nullable = true)
>  ||||-- element: struct (containsNull = true)
>  |||||-- sample: double (nullable = true)
> {code}
> The target schema is
> {code:none}
> root
>  |-- item_id: string (nullable = true)
>  |-- timestamp: string (nullable = true)
>  |-- sample: double (nullable = true)
> {code}
> The code (in Java)
> {code:java}
> package com.skf.streamer.spark;
> import java.util.concurrent.TimeoutException;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class ExplodeTest {
>public static void main(String[] args) throws TimeoutException {
>   SparkConf conf = new SparkConf()
>  .setAppName("SimpleApp")
>  .set("spark.scheduler.mode", "FAIR")
>  .set("spark.master", "local[1]")
>  .set("spark.sql.streaming.checkpointLocation", "checkpoint");
>   SparkSession spark = SparkSession.builder()
>  .config(conf)
>  .getOrCreate();
>   Dataset d0 = spark
>  .read()
>  .format("json")
>  .option("multiLine", "true")
>  .schema(getSchema())
>  .load("src/test/resources/explode/data.json");
>   d0.printSchema();
>   d0 = d0.withColumn("item", functions.explode(d0.col("data")));
>   d0 = d0.withColumn("value", functions.explode(d0.col("item.values")));
>   d0.printSchema();
>   d0 = d0.select(
>  d0.col("item.item_id"),
>  d0.col("item.timestamp"),
>  d0.col("value.sample")
>   );
>   d0.printSchema();
>   d0.show(); // Fails
>   spark.stop();
>}
>private static StructType getSchema() {
>   StructField[] level2Fields = {
>  DataTypes.createStructField("sample", DataTypes.DoubleType, false),
>   };
>   StructField[] level1Fields = {
>  DataTypes.createStructField("item_id", DataTypes.StringType, false),
>  DataTypes.createStructField("timestamp", DataTypes.StringType, 
> false),
>  DataTypes.createStructField("values", 
> DataTypes.createArrayType(DataTypes.createStructType(level2Fields)), false)
>   };
>   StructField[] fields = {
>  DataTypes.createStructField("data", 
> DataTypes.createArrayType(DataTypes.createStructType(level1Fields)), false)
>   };
>   return DataTypes.createStructType(fields);
>}
> }
> {code}
> The data file
> {code:json}
> {
>   "data": [
> {
>   "item_id": "item_1",
>   "timestamp": "2020-07-01 12:34:89",
>   "values": [
> {
>   "sample": 1.1
> },
> {
>   "sample": 1.2
> }
>   ]
> },
> {
>   "item_id": "item_2",
>   "timestamp": "2020-07-02 12:34:89",
>   "values": [
> {
>   "sample": 2.2
> }
>   ]
> }
>   ]
> }
> {code}
> Dataset.show() method fails with an exception
> {code:none}
> Caused by: java.lang.RuntimeException: Couldn't find _gen_alias_30#30 in 
> [_gen_alias_28#28,_gen_alias_29#29]
>   at scala.sys.package$.error(package.scala:30)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.$anonfun$applyOrElse$1(BoundAttribute.scala:81)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 37 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32580) Issue accessing a column values after 'explode' function

2020-08-10 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32580:
-
Component/s: (was: Spark Core)

> Issue accessing a column values after 'explode' function
> 
>
> Key: SPARK-32580
> URL: https://issues.apache.org/jira/browse/SPARK-32580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ayrat Sadreev
>Priority: Major
>
> An exception occurs when trying to flatten double nested arrays
> The schema is
> {code:none}
> root
>  |-- data: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- item_id: string (nullable = true)
>  |||-- timestamp: string (nullable = true)
>  |||-- values: array (nullable = true)
>  ||||-- element: struct (containsNull = true)
>  |||||-- sample: double (nullable = true)
> {code}
> The target schema is
> {code:none}
> root
>  |-- item_id: string (nullable = true)
>  |-- timestamp: string (nullable = true)
>  |-- sample: double (nullable = true)
> {code}
> The code (in Java)
> {code:java}
> package com.skf.streamer.spark;
> import java.util.concurrent.TimeoutException;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class ExplodeTest {
>public static void main(String[] args) throws TimeoutException {
>   SparkConf conf = new SparkConf()
>  .setAppName("SimpleApp")
>  .set("spark.scheduler.mode", "FAIR")
>  .set("spark.master", "local[1]")
>  .set("spark.sql.streaming.checkpointLocation", "checkpoint");
>   SparkSession spark = SparkSession.builder()
>  .config(conf)
>  .getOrCreate();
>   Dataset d0 = spark
>  .read()
>  .format("json")
>  .option("multiLine", "true")
>  .schema(getSchema())
>  .load("src/test/resources/explode/data.json");
>   d0.printSchema();
>   d0 = d0.withColumn("item", functions.explode(d0.col("data")));
>   d0 = d0.withColumn("value", functions.explode(d0.col("item.values")));
>   d0.printSchema();
>   d0 = d0.select(
>  d0.col("item.item_id"),
>  d0.col("item.timestamp"),
>  d0.col("value.sample")
>   );
>   d0.printSchema();
>   d0.show(); // Fails
>   spark.stop();
>}
>private static StructType getSchema() {
>   StructField[] level2Fields = {
>  DataTypes.createStructField("sample", DataTypes.DoubleType, false),
>   };
>   StructField[] level1Fields = {
>  DataTypes.createStructField("item_id", DataTypes.StringType, false),
>  DataTypes.createStructField("timestamp", DataTypes.StringType, 
> false),
>  DataTypes.createStructField("values", 
> DataTypes.createArrayType(DataTypes.createStructType(level2Fields)), false)
>   };
>   StructField[] fields = {
>  DataTypes.createStructField("data", 
> DataTypes.createArrayType(DataTypes.createStructType(level1Fields)), false)
>   };
>   return DataTypes.createStructType(fields);
>}
> }
> {code}
> The data file
> {code:json}
> {
>   "data": [
> {
>   "item_id": "item_1",
>   "timestamp": "2020-07-01 12:34:89",
>   "values": [
> {
>   "sample": 1.1
> },
> {
>   "sample": 1.2
> }
>   ]
> },
> {
>   "item_id": "item_2",
>   "timestamp": "2020-07-02 12:34:89",
>   "values": [
> {
>   "sample": 2.2
> }
>   ]
> }
>   ]
> }
> {code}
> Dataset.show() method fails with an exception
> {code:none}
> Caused by: java.lang.RuntimeException: Couldn't find _gen_alias_30#30 in 
> [_gen_alias_28#28,_gen_alias_29#29]
>   at scala.sys.package$.error(package.scala:30)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.$anonfun$applyOrElse$1(BoundAttribute.scala:81)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 37 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32580) Issue accessing a column values after 'explode' function

2020-08-10 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32580:
-
Priority: Major  (was: Blocker)

> Issue accessing a column values after 'explode' function
> 
>
> Key: SPARK-32580
> URL: https://issues.apache.org/jira/browse/SPARK-32580
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Ayrat Sadreev
>Priority: Major
>
> An exception occurs when trying to flatten double nested arrays
> The schema is
> {code:none}
> root
>  |-- data: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- item_id: string (nullable = true)
>  |||-- timestamp: string (nullable = true)
>  |||-- values: array (nullable = true)
>  ||||-- element: struct (containsNull = true)
>  |||||-- sample: double (nullable = true)
> {code}
> The target schema is
> {code:none}
> root
>  |-- item_id: string (nullable = true)
>  |-- timestamp: string (nullable = true)
>  |-- sample: double (nullable = true)
> {code}
> The code (in Java)
> {code:java}
> package com.skf.streamer.spark;
> import java.util.concurrent.TimeoutException;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class ExplodeTest {
>public static void main(String[] args) throws TimeoutException {
>   SparkConf conf = new SparkConf()
>  .setAppName("SimpleApp")
>  .set("spark.scheduler.mode", "FAIR")
>  .set("spark.master", "local[1]")
>  .set("spark.sql.streaming.checkpointLocation", "checkpoint");
>   SparkSession spark = SparkSession.builder()
>  .config(conf)
>  .getOrCreate();
>   Dataset d0 = spark
>  .read()
>  .format("json")
>  .option("multiLine", "true")
>  .schema(getSchema())
>  .load("src/test/resources/explode/data.json");
>   d0.printSchema();
>   d0 = d0.withColumn("item", functions.explode(d0.col("data")));
>   d0 = d0.withColumn("value", functions.explode(d0.col("item.values")));
>   d0.printSchema();
>   d0 = d0.select(
>  d0.col("item.item_id"),
>  d0.col("item.timestamp"),
>  d0.col("value.sample")
>   );
>   d0.printSchema();
>   d0.show(); // Fails
>   spark.stop();
>}
>private static StructType getSchema() {
>   StructField[] level2Fields = {
>  DataTypes.createStructField("sample", DataTypes.DoubleType, false),
>   };
>   StructField[] level1Fields = {
>  DataTypes.createStructField("item_id", DataTypes.StringType, false),
>  DataTypes.createStructField("timestamp", DataTypes.StringType, 
> false),
>  DataTypes.createStructField("values", 
> DataTypes.createArrayType(DataTypes.createStructType(level2Fields)), false)
>   };
>   StructField[] fields = {
>  DataTypes.createStructField("data", 
> DataTypes.createArrayType(DataTypes.createStructType(level1Fields)), false)
>   };
>   return DataTypes.createStructType(fields);
>}
> }
> {code}
> The data file
> {code:json}
> {
>   "data": [
> {
>   "item_id": "item_1",
>   "timestamp": "2020-07-01 12:34:89",
>   "values": [
> {
>   "sample": 1.1
> },
> {
>   "sample": 1.2
> }
>   ]
> },
> {
>   "item_id": "item_2",
>   "timestamp": "2020-07-02 12:34:89",
>   "values": [
> {
>   "sample": 2.2
> }
>   ]
> }
>   ]
> }
> {code}
> Dataset.show() method fails with an exception
> {code:none}
> Caused by: java.lang.RuntimeException: Couldn't find _gen_alias_30#30 in 
> [_gen_alias_28#28,_gen_alias_29#29]
>   at scala.sys.package$.error(package.scala:30)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.$anonfun$applyOrElse$1(BoundAttribute.scala:81)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 37 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32581) update duration property for live ui application list and application apis

2020-08-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32581:


Assignee: Apache Spark

> update duration property for live ui application list and application apis
> --
>
> Key: SPARK-32581
> URL: https://issues.apache.org/jira/browse/SPARK-32581
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Zhen Li
>Assignee: Apache Spark
>Priority: Trivial
> Attachments: oldapi.JPG, updatedapiJPG.JPG
>
>
> "duration" property in response from application list and application APIs of 
> live UI is always "0". we want to let these two APIs return correct value, 
> same with "*Total Uptime*" in live UI's job page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32581) update duration property for live ui application list and application apis

2020-08-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174283#comment-17174283
 ] 

Apache Spark commented on SPARK-32581:
--

User 'zhli1142015' has created a pull request for this issue:
https://github.com/apache/spark/pull/29399

> update duration property for live ui application list and application apis
> --
>
> Key: SPARK-32581
> URL: https://issues.apache.org/jira/browse/SPARK-32581
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Zhen Li
>Priority: Trivial
> Attachments: oldapi.JPG, updatedapiJPG.JPG
>
>
> "duration" property in response from application list and application APIs of 
> live UI is always "0". we want to let these two APIs return correct value, 
> same with "*Total Uptime*" in live UI's job page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32581) update duration property for live ui application list and application apis

2020-08-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32581:


Assignee: (was: Apache Spark)

> update duration property for live ui application list and application apis
> --
>
> Key: SPARK-32581
> URL: https://issues.apache.org/jira/browse/SPARK-32581
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Zhen Li
>Priority: Trivial
> Attachments: oldapi.JPG, updatedapiJPG.JPG
>
>
> "duration" property in response from application list and application APIs of 
> live UI is always "0". we want to let these two APIs return correct value, 
> same with "*Total Uptime*" in live UI's job page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32581) update duration property for live ui application list and application apis

2020-08-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174286#comment-17174286
 ] 

Apache Spark commented on SPARK-32581:
--

User 'zhli1142015' has created a pull request for this issue:
https://github.com/apache/spark/pull/29399

> update duration property for live ui application list and application apis
> --
>
> Key: SPARK-32581
> URL: https://issues.apache.org/jira/browse/SPARK-32581
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Zhen Li
>Priority: Trivial
> Attachments: oldapi.JPG, updatedapiJPG.JPG
>
>
> "duration" property in response from application list and application APIs of 
> live UI is always "0". we want to let these two APIs return correct value, 
> same with "*Total Uptime*" in live UI's job page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32581) update duration property for live ui application list and application apis

2020-08-10 Thread Zhen Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhen Li updated SPARK-32581:

Attachment: updatedapiJPG.JPG
oldapi.JPG

> update duration property for live ui application list and application apis
> --
>
> Key: SPARK-32581
> URL: https://issues.apache.org/jira/browse/SPARK-32581
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Zhen Li
>Priority: Trivial
> Attachments: oldapi.JPG, updatedapiJPG.JPG
>
>
> "duration" property in response from application list and application APIs of 
> live UI is always "0". we want to let these two APIs return correct value, 
> same with "*Total Uptime*" in live UI's job page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32581) update duration property for live ui application list and application apis

2020-08-10 Thread Zhen Li (Jira)
Zhen Li created SPARK-32581:
---

 Summary: update duration property for live ui application list and 
application apis
 Key: SPARK-32581
 URL: https://issues.apache.org/jira/browse/SPARK-32581
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Zhen Li


"duration" property in response from application list and application APIs of 
live UI is always "0". we want to let these two APIs return correct value, same 
with "*Total Uptime*" in live UI's job page



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2020-08-10 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174263#comment-17174263
 ] 

Takeshi Yamamuro edited comment on SPARK-29767 at 8/10/20, 11:48 AM:
-

I checked the given query and I think your union query looks incorrect. All the 
branches (master/branch-3.0/branch-2.4) have the same error;
{code:java}
: org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
with the same number of columns, but the first table has 5 columns and the 
second table has 20 columns;; {code}
Actually, the schemas are different;
{code:java}
>>> base_df.printSchema()
root
 |-- hour: long (nullable = true)
 |-- title: string (nullable = true)
 |-- __deleted: string (nullable = true)
 |-- status: string (nullable = true)
 |-- transformationid: string (nullable = true)
 |-- roomid: string (nullable = true)
 |-- day: long (nullable = true)
 |-- notes: string (nullable = true)
 |-- nunitsfromaudit: long (nullable = true)
 |-- ts_ms: long (nullable = true)
 |-- liability: string (nullable = true)
 |-- _class: string (nullable = true)
 |-- month: long (nullable = true)
 |-- updatedate: struct (nullable = true)
 |    |-- date: long (nullable = true)
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- year: long (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- brandname: string (nullable = true)
 |    |-- perunitpricefromaudit: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- amount: string (nullable = true)
 |    |-- actualperunitprice: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- amount: string (nullable = true)
 |    |-- category: string (nullable = true)
 |    |-- itemtype: string (nullable = true)
 |    |-- roomamenityid: long (nullable = true)
 |-- createddate: struct (nullable = true)
 |    |-- date: long (nullable = true)
 |-- actualunits: long (nullable = true)
 |-- description: string (nullable = true)


>>> inc_df.printSchema()
root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- _class: string (nullable = true)
 |-- roomid: string (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- brandname: string (nullable = true)
 |    |-- perunitpricefromaudit: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- amount: string (nullable = true)
 |    |-- actualperunitprice: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- amount: string (nullable = true)
 |    |-- category: string (nullable = true)
 |    |-- itemtype: string (nullable = true)
 |    |-- roomamenityid: long (nullable = true)
 |-- inserted_at: string (nullable = true)

{code}

NOTE: It is best to show us a simpler query that reproduces the issue. So, if 
possible, it's better to remove unnecessary parts, e.g., columns, to save 
developers' time.


was (Author: maropu):
I checked the given query and I think your union query looks incorrect. All the 
branches (master/branch-3.0/branch-2.4) have the same error;
{code:java}
: org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
with the same number of columns, but the first table has 5 columns and the 
second table has 20 columns;; {code}
Actually, the schema are different;
{code:java}
>>> base_df.printSchema()
root
 |-- hour: long (nullable = true)
 |-- title: string (nullable = true)
 |-- __deleted: string (nullable = true)
 |-- status: string (nullable = true)
 |-- transformationid: string (nullable = true)
 |-- roomid: string (nullable = true)
 |-- day: long (nullable = true)
 |-- notes: string (nullable = true)
 |-- nunitsfromaudit: long (nullable = true)
 |-- ts_ms: long (nullable = true)
 |-- liability: string (nullable = true)
 |-- _class: string (nullable = true)
 |-- month: long (nullable = true)
 |-- updatedate: struct (nullable = true)
 |    |-- date: long (nullable = true)
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- year: long (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- brandname: string (nullable = true)
 |    |-- perunitpricefromaudit: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- amount: string (nullable = true)
 |    |-- actualperunitprice: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- amount: string (nullable = true)
 |    |-- category: string (nullable = true)
 |    |-- itemtype: string (nullable = true)
 |    |-- roomamenityid: long (nullable = true)
 |-- createddate: struct (nullable = true)
 |    |-- date: long (nullable = true)
 |-- actualunits: long (nullable = true)
 |-- description: string 

[jira] [Commented] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

2020-08-10 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174263#comment-17174263
 ] 

Takeshi Yamamuro commented on SPARK-29767:
--

I checked the given query and I think your union query looks incorrect. All the 
branches (master/branch-3.0/branch-2.4) have the same error;
{code:java}
: org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
with the same number of columns, but the first table has 5 columns and the 
second table has 20 columns;; {code}
Actually, the schemas are different;
{code:java}
>>> base_df.printSchema()
root
 |-- hour: long (nullable = true)
 |-- title: string (nullable = true)
 |-- __deleted: string (nullable = true)
 |-- status: string (nullable = true)
 |-- transformationid: string (nullable = true)
 |-- roomid: string (nullable = true)
 |-- day: long (nullable = true)
 |-- notes: string (nullable = true)
 |-- nunitsfromaudit: long (nullable = true)
 |-- ts_ms: long (nullable = true)
 |-- liability: string (nullable = true)
 |-- _class: string (nullable = true)
 |-- month: long (nullable = true)
 |-- updatedate: struct (nullable = true)
 |    |-- date: long (nullable = true)
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- year: long (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- brandname: string (nullable = true)
 |    |-- perunitpricefromaudit: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- amount: string (nullable = true)
 |    |-- actualperunitprice: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- amount: string (nullable = true)
 |    |-- category: string (nullable = true)
 |    |-- itemtype: string (nullable = true)
 |    |-- roomamenityid: long (nullable = true)
 |-- createddate: struct (nullable = true)
 |    |-- date: long (nullable = true)
 |-- actualunits: long (nullable = true)
 |-- description: string (nullable = true)


>>> inc_df.printSchema()
root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- _class: string (nullable = true)
 |-- roomid: string (nullable = true)
 |-- item: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- brandname: string (nullable = true)
 |    |-- perunitpricefromaudit: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- amount: string (nullable = true)
 |    |-- actualperunitprice: struct (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |    |-- amount: string (nullable = true)
 |    |-- category: string (nullable = true)
 |    |-- itemtype: string (nullable = true)
 |    |-- roomamenityid: long (nullable = true)
 |-- inserted_at: string (nullable = true)

{code}

NOTE: It is best to show us a simpler query that reproduces the issue. So, if 
possible, it's better to remove unnecessary parts, e.g., columns, to save 
developers' time.
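
As a purely illustrative sketch (not taken from the reporter's job; the stand-in 
frames only mirror the column-count mismatch shown above), one way to align the two 
sides before the union:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("union-sketch").getOrCreate()
import spark.implicits._

// Stand-ins with different column sets, like base_df and inc_df above.
val base_df = Seq((1L, "a", "room1")).toDF("hour", "title", "roomid")
val inc_df  = Seq(("room1", "2020-08-10")).toDF("roomid", "inserted_at")

// Select the shared columns, in one fixed order, on both sides before the union;
// otherwise Union fails with "can only be performed on tables with the same
// number of columns". Struct-typed columns would also need matching field types.
val commonCols = base_df.columns.intersect(inc_df.columns).sorted
val unioned = base_df.select(commonCols.map(col): _*)
  .union(inc_df.select(commonCols.map(col): _*))
unioned.show()
{code}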

> Core dump happening on executors while doing simple union of Data Frames
> 
>
> Key: SPARK-29767
> URL: https://issues.apache.org/jira/browse/SPARK-29767
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.4
> Environment: AWS EMR 5.27.0, Spark 2.4.4
>Reporter: Udit Mehrotra
>Priority: Major
> Attachments: coredump.zip, hs_err_pid13885.log, 
> part-0-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet, test.py
>
>
> Running a union operation on two DataFrames through both the Scala Spark shell 
> and PySpark results in executor containers doing a *core dump* and exiting 
> with exit code 134.
> The trace from the *Driver*:
> {noformat}
> Container exited with a non-zero exit code 134
> .
> 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
> (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure 
> (executor 11 exited caused by one of the running tasks) Reason: Container 
> from a bad node: container_1572981097605_0021_01_77 on host: 
> ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_1572981097605_0021_01_77
> Exit code: 134
> Exception message: /bin/bash: line 1: 12611 Aborted 
> LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native"
>  

[jira] [Created] (SPARK-32580) Issue accessing a column values after 'explode' function

2020-08-10 Thread Ayrat Sadreev (Jira)
Ayrat Sadreev created SPARK-32580:
-

 Summary: Issue accessing a column values after 'explode' function
 Key: SPARK-32580
 URL: https://issues.apache.org/jira/browse/SPARK-32580
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Ayrat Sadreev


An exception occurs when trying to flatten double nested arrays

The schema is
{code:none}
root
 |-- data: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- item_id: string (nullable = true)
 |||-- timestamp: string (nullable = true)
 |||-- values: array (nullable = true)
 ||||-- element: struct (containsNull = true)
 |||||-- sample: double (nullable = true)
{code}
The target schema is
{code:none}
root
 |-- item_id: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- sample: double (nullable = true)
{code}
The code (in Java)
{code:java}
package com.skf.streamer.spark;

import java.util.concurrent.TimeoutException;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ExplodeTest {

   public static void main(String[] args) throws TimeoutException {

  SparkConf conf = new SparkConf()
 .setAppName("SimpleApp")
 .set("spark.scheduler.mode", "FAIR")
 .set("spark.master", "local[1]")
 .set("spark.sql.streaming.checkpointLocation", "checkpoint");

  SparkSession spark = SparkSession.builder()
 .config(conf)
 .getOrCreate();

  Dataset d0 = spark
 .read()
 .format("json")
 .option("multiLine", "true")
 .schema(getSchema())
 .load("src/test/resources/explode/data.json");

  d0.printSchema();

  d0 = d0.withColumn("item", functions.explode(d0.col("data")));
  d0 = d0.withColumn("value", functions.explode(d0.col("item.values")));
  d0.printSchema();
  d0 = d0.select(
 d0.col("item.item_id"),
 d0.col("item.timestamp"),
 d0.col("value.sample")
  );

  d0.printSchema();

  d0.show(); // Fails

  spark.stop();
   }

   private static StructType getSchema() {
  StructField[] level2Fields = {
 DataTypes.createStructField("sample", DataTypes.DoubleType, false),
  };

  StructField[] level1Fields = {
 DataTypes.createStructField("item_id", DataTypes.StringType, false),
 DataTypes.createStructField("timestamp", DataTypes.StringType, false),
 DataTypes.createStructField("values", 
DataTypes.createArrayType(DataTypes.createStructType(level2Fields)), false)
  };

  StructField[] fields = {
 DataTypes.createStructField("data", 
DataTypes.createArrayType(DataTypes.createStructType(level1Fields)), false)
  };

  return DataTypes.createStructType(fields);
   }
}
{code}
The data file
{code:json}
{
  "data": [
{
  "item_id": "item_1",
  "timestamp": "2020-07-01 12:34:89",
  "values": [
{
  "sample": 1.1
},
{
  "sample": 1.2
}
  ]
},
{
  "item_id": "item_2",
  "timestamp": "2020-07-02 12:34:89",
  "values": [
{
  "sample": 2.2
}
  ]
}
  ]
}
{code}
Dataset.show() method fails with an exception
{code:none}
Caused by: java.lang.RuntimeException: Couldn't find _gen_alias_30#30 in 
[_gen_alias_28#28,_gen_alias_29#29]
at scala.sys.package$.error(package.scala:30)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.$anonfun$applyOrElse$1(BoundAttribute.scala:81)
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
... 37 more
{code}
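
As a side note, a Scala sketch (the report's code is Java, but the structure is the 
same) of an alternative formulation of the same flattening, which selects the first 
explode's fields before applying the second explode. It is offered only as something 
to try, not as a verified workaround for this report, and it infers the schema 
instead of passing the explicit one above:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().master("local[1]").appName("explode-sketch").getOrCreate()

val d0 = spark.read.format("json").option("multiLine", "true")
  .load("src/test/resources/explode/data.json")

val flat = d0
  .select(explode(col("data")).alias("item"))                      // first level
  .select(col("item.item_id"), col("item.timestamp"),
          explode(col("item.values")).alias("value"))              // second level
  .select(col("item_id"), col("timestamp"), col("value.sample"))   // leaf columns
flat.show()
{code}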



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32290) NotInSubquery SingleColumn Optimize

2020-08-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32290:

Priority: Major  (was: Minor)

> NotInSubquery SingleColumn Optimize
> ---
>
> Key: SPARK-32290
> URL: https://issues.apache.org/jira/browse/SPARK-32290
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
> Fix For: 3.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Normally, a NotInSubquery is planned as a BroadcastNestedLoopJoinExec, which is 
> very time consuming. For example, in a recent TPCH benchmark, Query 16 
> took almost half of the execution time of the entire 22-query TPCH run. So I 
> proposed the following optimization.
> Inside BroadcastNestedLoopJoinExec, we can identify a single-column not-in subquery 
> by the following pattern.
> {code:java}
> case _@Or(
> _@EqualTo(leftAttr: AttributeReference, rightAttr: 
> AttributeReference),
> _@IsNull(
>   _@EqualTo(_: AttributeReference, _: AttributeReference)
> )
>   )
> {code}
> If the build side is small enough, we can turn the build-side data into a 
> HashMap, so the M*N calculation can be optimized into M*log(N).
> I ran a benchmark on 1TB TPCH: before applying the optimization, Query 16 
> took around 18 minutes to finish; after applying the M*log(N) optimization, it 
> takes only 30s to finish.
> But this optimization only works for single-column not-in subqueries, so I am 
> seeking advice on whether the community needs this change. I will open the 
> pull request first; if the community thinks it is too much of a hack, it is fine 
> to just ignore this request.
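
As an illustration (not part of the original description) of why single-column NOT IN 
needs this null-aware handling in the first place:
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("naaj-sketch").getOrCreate()

// Without a NULL on the build side, NOT IN behaves like an ordinary anti join.
spark.sql(
  "SELECT a FROM VALUES (1), (2) AS t(a) WHERE a NOT IN (SELECT b FROM VALUES (1) AS s(b))"
).show() // returns 2

// A NULL on the build side makes the NOT IN predicate unknown for every row, so the
// result is empty; this is the case the Or/EqualTo/IsNull pattern above has to encode.
spark.sql(
  "SELECT a FROM VALUES (1), (2) AS t(a) WHERE a NOT IN (SELECT b FROM VALUES (1), (CAST(NULL AS INT)) AS s(b))"
).show() // returns no rows
{code}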



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32409) Document the dependency between spark.metrics.staticSources.enabled and JVMSource registration

2020-08-10 Thread Luca Canali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-32409:

Summary: Document the dependency between 
spark.metrics.staticSources.enabled and JVMSource registration  (was: Remove 
dependency between spark.metrics.staticSources.enabled and JVMSource 
registration)

> Document the dependency between spark.metrics.staticSources.enabled and 
> JVMSource registration
> --
>
> Key: SPARK-32409
> URL: https://issues.apache.org/jira/browse/SPARK-32409
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
>
> SPARK-29654 introduced the configuration 
> `spark.metrics.register.static.sources.enabled`.
>  The current implementations of SPARK-29654 and SPARK-25277 have, as a side 
> effect, that when {{spark.metrics.register.static.sources.enabled}} is set to 
> false, the registration of the JVM source is also ignored, even if it is requested with 
> {{spark.metrics.conf.*.source.jvm.class=org.apache.spark.metrics.source.JvmSource}}.
>  A PR is proposed to fix this.
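
A minimal sketch of the configuration combination described above (illustrative only; 
the property names are the ones quoted in the summary and description):
{code:java}
import org.apache.spark.SparkConf

// With static sources disabled, the explicitly requested JvmSource is,
// as reported above, currently not registered either.
val conf = new SparkConf()
  .set("spark.metrics.staticSources.enabled", "false")
  .set("spark.metrics.conf.*.source.jvm.class",
       "org.apache.spark.metrics.source.JvmSource")
{code}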



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32573) Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys

2020-08-10 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32573:

Description: 
In SPARK-32290, we introduced several new types of HashedRelation
 * EmptyHashedRelation
 * EmptyHashedRelationWithAllNullKeys

They were previously limited to the NAAJ scenario. These new HashedRelations 
could be applied to other scenarios for performance improvements.
 * EmptyHashedRelation could also be used in a normal anti join for a fast stop
 * While AQE is on and the build side is EmptyHashedRelationWithAllNullKeys, the NAAJ can 
be converted to an empty LocalRelation to skip meaningless data iteration, since 
in single-key NAAJ a null key on the build side drops all records on the 
streamed side.

This patch includes two changes.
 * use EmptyHashedRelation to do a fast stop for the common anti join as well
 * in AQE, eliminate a BroadcastHashJoin (NAAJ) if the build side is an 
EmptyHashedRelationWithAllNullKeys

  was:
In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we 
introduced several new types of HashedRelation
 * EmptyHashedRelation
 * EmptyHashedRelationWithAllNullKeys

They were all limited to used only in NAAJ scenario. But as for a improvement, 
EmptyHashedRelation could also be used in Normal AntiJoin for fast stop, and as 
for in AQE, we can even eliminate anti join when we knew that buildSide is 
empty.

 

This Patch including two changes.

In Non-AQE, using EmptyHashedRelation to do fast stop for common anti join as 
well

In AQE, eliminate anti join if buildSide is a EmptyHashedRelation of 
ShuffleWriteRecord is 0

 

Summary: Anti Join Improvement with EmptyHashedRelation and 
EmptyHashedRelationWithAllNullKeys  (was: Eliminate Anti Join when BuildSide is 
Empty)

> Anti Join Improvement with EmptyHashedRelation and 
> EmptyHashedRelationWithAllNullKeys
> -
>
> Key: SPARK-32573
> URL: https://issues.apache.org/jira/browse/SPARK-32573
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> In SPARK-32290, we introduced several new types of HashedRelation
>  * EmptyHashedRelation
>  * EmptyHashedRelationWithAllNullKeys
> They were previously limited to the NAAJ scenario. These new HashedRelations 
> could be applied to other scenarios for performance improvements.
>  * EmptyHashedRelation could also be used in a normal anti join for a fast stop
>  * While AQE is on and the build side is EmptyHashedRelationWithAllNullKeys, the 
> NAAJ can be converted to an empty LocalRelation to skip meaningless data iteration, 
> since in single-key NAAJ a null key on the build side drops all 
> records on the streamed side.
> This patch includes two changes.
>  * use EmptyHashedRelation to do a fast stop for the common anti join as well
>  * in AQE, eliminate a BroadcastHashJoin (NAAJ) if the build side is an 
> EmptyHashedRelationWithAllNullKeys
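
An illustrative sketch (not the patch itself) of the two behaviours the description 
relies on, as seen through plain DataFrames and SQL:
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("antijoin-sketch").getOrCreate()
import spark.implicits._

val streamed = Seq(1, 2, 3).toDF("k")

// 1) An anti join against an empty build side keeps every streamed row, so an
//    EmptyHashedRelation allows the join to stop probing early ("fast stop").
streamed.join(Seq.empty[Int].toDF("k"), Seq("k"), "left_anti").show() // 1, 2, 3

// 2) A single-column NOT IN whose build side contains a NULL key returns no rows,
//    so an EmptyHashedRelationWithAllNullKeys build side can be replaced by an
//    empty LocalRelation under AQE.
streamed.createOrReplaceTempView("t")
spark.sql("SELECT k FROM t WHERE k NOT IN (SELECT CAST(NULL AS INT))").show() // no rows
{code}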



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32576) Support PostgreSQL `bpchar` type and array of char type

2020-08-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174152#comment-17174152
 ] 

Apache Spark commented on SPARK-32576:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29397

> Support PostgreSQL `bpchar` type and array of char type
> ---
>
> Key: SPARK-32576
> URL: https://issues.apache.org/jira/browse/SPARK-32576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jakub Korzeniowski
>Assignee: Jakub Korzeniowski
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> Attempting to read the following table:
> {code:java}
> CREATE TABLE test_table (
>   test_column char(64)[]
> )
> {code}
> results in the following exception:
> {code:java}
> java.sql.SQLException: Unsupported type ARRAY
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:256)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
> {code}
> Non-array and varchar equivalents are fine.
>  
> I've tracked it down to an internal function of the Postgres dialect that 
> accounts for the special 1-byte char but doesn't deal with different-length 
> ones, which Postgres represents as bpchar: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L60-L61].
> Relevant driver code can be found here: 
> [https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/TypeInfoCache.java#L85-L87]
>  
> I'll submit a fix shortly.
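
An illustrative reproduction sketch (the connection details are placeholders; the 
PostgreSQL JDBC driver is assumed to be on the classpath, and test_table is the 
table defined above):
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("bpchar-sketch").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/testdb")
  .option("dbtable", "test_table")
  .option("user", "postgres")
  .load() // fails with "java.sql.SQLException: Unsupported type ARRAY" before the fix
{code}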



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32576) Support PostgreSQL `bpchar` type and array of char type

2020-08-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174151#comment-17174151
 ] 

Apache Spark commented on SPARK-32576:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29397

> Support PostgreSQL `bpchar` type and array of char type
> ---
>
> Key: SPARK-32576
> URL: https://issues.apache.org/jira/browse/SPARK-32576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jakub Korzeniowski
>Assignee: Jakub Korzeniowski
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> Attempting to read the following table:
> {code:java}
> CREATE TABLE test_table (
>   test_column char(64)[]
> )
> {code}
> results in the following exception:
> {code:java}
> java.sql.SQLException: Unsupported type ARRAY
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getCatalystType(JdbcUtils.scala:256)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$getSchema$1(JdbcUtils.scala:321)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
> {code}
> Non-array and varchar equivalents are fine.
>  
> I've tracked it down to an internal function of the Postgres dialect that 
> accounts for the special 1-byte char but doesn't deal with different-length 
> ones, which Postgres represents as bpchar: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L60-L61].
> Relevant driver code can be found here: 
> [https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/TypeInfoCache.java#L85-L87]
>  
> I'll submit a fix shortly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32579) Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174129#comment-17174129
 ] 

Apache Spark commented on SPARK-32579:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29396

> Implement JDBCScan/ScanBuilder/WriteBuilder
> ---
>
> Key: SPARK-32579
> URL: https://issues.apache.org/jira/browse/SPARK-32579
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add JDBCScan, JDBCScanBuilder and JDBCWriteBuilder to Datasource V2 JDBC 
> implementation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32579) Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174128#comment-17174128
 ] 

Apache Spark commented on SPARK-32579:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29396

> Implement JDBCScan/ScanBuilder/WriteBuilder
> ---
>
> Key: SPARK-32579
> URL: https://issues.apache.org/jira/browse/SPARK-32579
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add JDBCScan, JDBCScanBuilder and JDBCWriteBuilder to Datasource V2 JDBC 
> implementation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32579) Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32579:


Assignee: Apache Spark

> Implement JDBCScan/ScanBuilder/WriteBuilder
> ---
>
> Key: SPARK-32579
> URL: https://issues.apache.org/jira/browse/SPARK-32579
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> Add JDBCScan, JDBCScanBuilder and JDBCWriteBuilder to Datasource V2 JDBC 
> implementation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32579) Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32579:


Assignee: (was: Apache Spark)

> Implement JDBCScan/ScanBuilder/WriteBuilder
> ---
>
> Key: SPARK-32579
> URL: https://issues.apache.org/jira/browse/SPARK-32579
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add JDBCScan, JDBCScanBuilder and JDBCWriteBuilder to Datasource V2 JDBC 
> implementation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32579) Implement JDBCScan/ScanBuilder/WriteBuilder

2020-08-10 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-32579:
--

 Summary: Implement JDBCScan/ScanBuilder/WriteBuilder
 Key: SPARK-32579
 URL: https://issues.apache.org/jira/browse/SPARK-32579
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Huaxin Gao


Add JDBCScan, JDBCScanBuilder and JDBCWriteBuilder to Datasource V2 JDBC 
implementation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32518) CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds of resources

2020-08-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174113#comment-17174113
 ] 

Apache Spark commented on SPARK-32518:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/29395

> CoarseGrainedSchedulerBackend.maxNumConcurrentTasks should consider all kinds 
> of resources
> --
>
> Key: SPARK-32518
> URL: https://issues.apache.org/jira/browse/SPARK-32518
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, CoarseGrainedSchedulerBackend.maxNumConcurrentTasks only considers 
> the CPU for the max concurrent tasks. This can cause the application to hang 
> when a barrier stage requires extra custom resources but the cluster doesn't 
> have enough of them. Without a check for the other 
> custom resources in maxNumConcurrentTasks, the barrier stage can be submitted 
> to the TaskSchedulerImpl, but the TaskSchedulerImpl cannot launch tasks for 
> the barrier stage due to the insufficient task slots calculated by 
> calculateAvailableSlots (which does check all kinds of resources). 
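
An illustrative sketch (resource amounts and the job body are hypothetical, and the 
job is assumed to be submitted to a cluster without GPUs) of the kind of barrier job 
the description refers to:
{code:java}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("barrier-sketch")
  .set("spark.executor.resource.gpu.amount", "1")
  .set("spark.task.resource.gpu.amount", "1")
val sc = new SparkContext(conf)

// With CPU-only accounting in maxNumConcurrentTasks, this barrier stage is admitted,
// but if no GPU task slots actually exist the job hangs instead of failing fast.
sc.parallelize(1 to 4, 4).barrier().mapPartitions(iter => iter).collect()
{code}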



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32191) Migration Guide

2020-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32191.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29385
[https://github.com/apache/spark/pull/29385]

> Migration Guide
> ---
>
> Key: SPARK-32191
> URL: https://issues.apache.org/jira/browse/SPARK-32191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> Port http://spark.apache.org/docs/latest/pyspark-migration-guide.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32191) Migration Guide

2020-08-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32191:


Assignee: L. C. Hsieh

> Migration Guide
> ---
>
> Key: SPARK-32191
> URL: https://issues.apache.org/jira/browse/SPARK-32191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: L. C. Hsieh
>Priority: Major
>
> Port http://spark.apache.org/docs/latest/pyspark-migration-guide.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32578) PageRank not sending the correct values in Pergel sendMessage

2020-08-10 Thread Shay Elbaz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shay Elbaz updated SPARK-32578:
---
Description: 
The core sendMessage method is incorrect:
{code:java}
def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
 if (edge.srcAttr._2 > tol) {
   Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
  // *** THIS ^ ***
 } else {
   Iterator.empty
 }
}{code}
 

Instead of using the source PR value, it's using the PR delta (2nd tuple arg). 
This is not the documented behavior, nor a valid PR algorithm AFAIK.

This code is 7 years old; all versions are affected.

 

 

  was:
The core sendMessage method is incorrect:
{code:java}
def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
 if (edge.srcAttr._2 > tol) {
   Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
 //  THIS ^
 } else {
   Iterator.empty
 }
}{code}
 

Instead of sending the source PR value, it sends the PR delta. This is not the 
documented behavior, nor a valid PR algorithm AFAIK.

 

This is a 7 years old code, all versions affected.

 

 


> PageRank not sending the correct values in Pergel sendMessage
> -
>
> Key: SPARK-32578
> URL: https://issues.apache.org/jira/browse/SPARK-32578
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: Shay Elbaz
>Priority: Major
>
> The core sendMessage method is incorrect:
> {code:java}
> def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
>  if (edge.srcAttr._2 > tol) {
>Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
>   // *** THIS ^ ***
>  } else {
>Iterator.empty
>  }
> }{code}
>  
> Instead of using the source PR value, it's using the PR delta (2nd tuple 
> arg). This is not the documented behavior, nor a valid PR algorithm AFAIK.
> This code is 7 years old; all versions are affected.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32578) PageRank not sending the correct values in Pergel sendMessage

2020-08-10 Thread Shay Elbaz (Jira)
Shay Elbaz created SPARK-32578:
--

 Summary: PageRank not sending the correct values in Pergel 
sendMessage
 Key: SPARK-32578
 URL: https://issues.apache.org/jira/browse/SPARK-32578
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 3.0.0, 2.4.0, 2.3.0
Reporter: Shay Elbaz


The core sendMessage method is incorrect:
{code:java}
def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
 if (edge.srcAttr._2 > tol) {
   Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
 //  THIS ^
 } else {
   Iterator.empty
 }
}{code}
 

Instead of sending the source PR value, it sends the PR delta. This is not the 
documented behavior, nor a valid PR algorithm AFAIK.

 

This code is 7 years old; all versions are affected.
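
For illustration only, the change the report appears to point at: sending the source 
vertex's PageRank value (the first tuple element) rather than its delta (the second). 
Here tol is the tolerance threshold from the surrounding PageRank code, as in the 
snippet above, and this is not a verified fix:
{code:java}
import org.apache.spark.graphx.EdgeTriplet

def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
  if (edge.srcAttr._2 > tol) {
    Iterator((edge.dstId, edge.srcAttr._1 * edge.attr)) // PR value, not the delta
  } else {
    Iterator.empty
  }
}
{code}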

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32337) Show initial plan in AQE plan tree string

2020-08-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32337:
---

Assignee: Allison Wang

> Show initial plan in AQE plan tree string
> -
>
> Key: SPARK-32337
> URL: https://issues.apache.org/jira/browse/SPARK-32337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently the tree string for {{AdaptiveSparkPlanExec}} only shows the 
> current physical plan. It would be helpful to also show the initial plan to 
> see the plan change. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org