[jira] [Commented] (SPARK-26093) Read Avro: ClassNotFoundException: org.apache.spark.sql.avro.AvroFileFormat.DefaultSource

2018-11-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690413#comment-16690413
 ] 

Hyukjin Kwon commented on SPARK-26093:
--

Please follow https://spark.apache.org/docs/latest/sql-data-sources-avro.html
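For reference, a minimal sketch of the documented approach with the Avro module that ships with Spark 2.4 (the file path is a placeholder; the package coordinates are the Spark 2.4.0 / Scala 2.11 ones):

{code:scala}
// Launch spark-shell with Spark's own Avro module instead of com.databricks:spark-avro:
//   $ bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0

// Read an Avro file through the built-in source; no com.databricks.spark.avro import needed.
val df = spark.read.format("avro").load("file:///path/to/foo.avro")
df.printSchema()
{code}

The stack trace below comes from Spark 2.4 mapping the old com.databricks.spark.avro source onto the new built-in org.apache.spark.sql.avro.AvroFileFormat by default, which is not on the classpath when only the Databricks package is supplied.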

> Read Avro: ClassNotFoundException: 
> org.apache.spark.sql.avro.AvroFileFormat.DefaultSource
> -
>
> Key: SPARK-26093
> URL: https://issues.apache.org/jira/browse/SPARK-26093
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0
> Scala 2.11.12
> Java 1.8.0_181
>Reporter: Dagang Wei
>Priority: Major
>
> I downloaded and unpacked spark-2.4.0-bin-hadoop2.7.tgz on my Linux machine, 
> then followed [Read Avro 
> files|https://docs.databricks.com/spark/latest/data-sources/read-avro.html] 
> to read a local Avro file in spark-shell:
> $ bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0
> ...
> version 2.4.0
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_181)
> ...
> scala> 
> import com.databricks.spark.avro._
> scala> 
> val df = spark.read.avro("file:///.../foo.avro")
> java.lang.ClassNotFoundException: Failed to find data source: 
> org.apache.spark.sql.avro.AvroFileFormat. Please find packages at 
> http://spark.apache.org/third-party-projects.html
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>  at 
> com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
>  at 
> com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
>  ... 51 elided
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.sql.avro.AvroFileFormat.DefaultSource
>  at 
> scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
>  at scala.util.Try$.apply(Try.scala:192)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
>  at scala.util.Try.orElse(Try.scala:84)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
>  ... 55 more
>  






[jira] [Resolved] (SPARK-26093) Read Avro: ClassNotFoundException: org.apache.spark.sql.avro.AvroFileFormat.DefaultSource

2018-11-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26093.
--
Resolution: Not A Problem

> Read Avro: ClassNotFoundException: 
> org.apache.spark.sql.avro.AvroFileFormat.DefaultSource
> -
>
> Key: SPARK-26093
> URL: https://issues.apache.org/jira/browse/SPARK-26093
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0
> Scala 2.11.12
> Java 1.8.0_181
>Reporter: Dagang Wei
>Priority: Major
>
> I downloaded and unpacked spark-2.4.0-bin-hadoop2.7.tgz on my Linux machine, 
> then followed [Read Avro 
> files|https://docs.databricks.com/spark/latest/data-sources/read-avro.html] 
> to read a local Avro file in spark-shell:
> $ bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0
> ...
> version 2.4.0
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_181)
> ...
> scala> 
> import com.databricks.spark.avro._
> scala> 
> val df = spark.read.avro("file:///.../foo.avro")
> java.lang.ClassNotFoundException: Failed to find data source: 
> org.apache.spark.sql.avro.AvroFileFormat. Please find packages at 
> http://spark.apache.org/third-party-projects.html
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>  at 
> com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
>  at 
> com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
>  ... 51 elided
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.sql.avro.AvroFileFormat.DefaultSource
>  at 
> scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
>  at scala.util.Try$.apply(Try.scala:192)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
>  at scala.util.Try.orElse(Try.scala:84)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
>  ... 55 more
>  






[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690399#comment-16690399
 ] 

Hyukjin Kwon commented on SPARK-26019:
--

It would be great to know what conditions and code reproduce this error; the issue 
can be reopened at any point once we're clear on that.
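For reference, a minimal sketch of the kind of job described below (binaryRecords, then a DataFrame count); the path and record length are made up, and it is not confirmed that this actually triggers the error:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-repro-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical input: fixed-width binary records, 512 bytes each.
record_length = 512
binary_rdd = sc.binaryRecords("/tmp/records.bin", record_length) \
    .map(lambda rec: (len(rec),))

# Convert to a DataFrame and run a count, as in the report.
df = spark.createDataFrame(binary_rdd, ["record_size"])
print(df.count())
{code}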

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky: on the next rerun it didn't happen.






[jira] [Assigned] (SPARK-26097) Show partitioning details in DAG UI

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26097:


Assignee: (was: Apache Spark)

> Show partitioning details in DAG UI
> ---
>
> Key: SPARK-26097
> URL: https://issues.apache.org/jira/browse/SPARK-26097
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Idan Zalzberg
>Priority: Major
> Attachments: image (8).png
>
>
> We run complex SQL queries using Spark SQL, and we often have to tackle join 
> skew or an incorrect partition count. The problem is that while the Spark UI 
> shows the existence of the problem and what *stage* it is part of, it's hard 
> to trace it back to the original SQL query (e.g. which specific join operation 
> is actually skewed).
> One way to resolve this is to relate the Exchange nodes in the DAG to the 
> partitioning that they represent. This is actually a trivial change in code 
> (less than one line) that we believe can greatly help the investigation of 
> performance issues.






[jira] [Commented] (SPARK-26097) Show partitioning details in DAG UI

2018-11-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690382#comment-16690382
 ] 

Apache Spark commented on SPARK-26097:
--

User 'idanz' has created a pull request for this issue:
https://github.com/apache/spark/pull/23067

> Show partitioning details in DAG UI
> ---
>
> Key: SPARK-26097
> URL: https://issues.apache.org/jira/browse/SPARK-26097
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Idan Zalzberg
>Priority: Major
> Attachments: image (8).png
>
>
> We run complex SQL queries using Spark SQL, and we often have to tackle join 
> skew or an incorrect partition count. The problem is that while the Spark UI 
> shows the existence of the problem and what *stage* it is part of, it's hard 
> to trace it back to the original SQL query (e.g. which specific join operation 
> is actually skewed).
> One way to resolve this is to relate the Exchange nodes in the DAG to the 
> partitioning that they represent. This is actually a trivial change in code 
> (less than one line) that we believe can greatly help the investigation of 
> performance issues.






[jira] [Assigned] (SPARK-26097) Show partitioning details in DAG UI

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26097:


Assignee: Apache Spark

> Show partitioning details in DAG UI
> ---
>
> Key: SPARK-26097
> URL: https://issues.apache.org/jira/browse/SPARK-26097
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Idan Zalzberg
>Assignee: Apache Spark
>Priority: Major
> Attachments: image (8).png
>
>
> We run complex SQL queries using Spark SQL, and we often have to tackle join 
> skew or an incorrect partition count. The problem is that while the Spark UI 
> shows the existence of the problem and what *stage* it is part of, it's hard 
> to trace it back to the original SQL query (e.g. which specific join operation 
> is actually skewed).
> One way to resolve this is to relate the Exchange nodes in the DAG to the 
> partitioning that they represent. This is actually a trivial change in code 
> (less than one line) that we believe can greatly help the investigation of 
> performance issues.






[jira] [Updated] (SPARK-26097) Show partitioning details in DAG UI

2018-11-16 Thread Idan Zalzberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Idan Zalzberg updated SPARK-26097:
--
Attachment: image (8).png

> Show partitioning details in DAG UI
> ---
>
> Key: SPARK-26097
> URL: https://issues.apache.org/jira/browse/SPARK-26097
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Idan Zalzberg
>Priority: Major
> Attachments: image (8).png
>
>
> We run complex SQL queries using Spark SQL, and we often have to tackle join 
> skew or an incorrect partition count. The problem is that while the Spark UI 
> shows the existence of the problem and what *stage* it is part of, it's hard 
> to trace it back to the original SQL query (e.g. which specific join operation 
> is actually skewed).
> One way to resolve this is to relate the Exchange nodes in the DAG to the 
> partitioning that they represent. This is actually a trivial change in code 
> (less than one line) that we believe can greatly help the investigation of 
> performance issues.






[jira] [Created] (SPARK-26097) Show partitioning details in DAG UI

2018-11-16 Thread Idan Zalzberg (JIRA)
Idan Zalzberg created SPARK-26097:
-

 Summary: Show partitioning details in DAG UI
 Key: SPARK-26097
 URL: https://issues.apache.org/jira/browse/SPARK-26097
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0
Reporter: Idan Zalzberg


We run complex SQL queries using Spark SQL, and we often have to tackle join skew 
or an incorrect partition count. The problem is that while the Spark UI shows the 
existence of the problem and what *stage* it is part of, it's hard to trace it 
back to the original SQL query (e.g. which specific join operation is actually 
skewed).
One way to resolve this is to relate the Exchange nodes in the DAG to the 
partitioning that they represent. This is actually a trivial change in code 
(less than one line) that we believe can greatly help the investigation of 
performance issues.
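For illustration, the partitioning behind each Exchange is already visible in the physical plan; the proposal is to surface the same detail in the DAG visualization (a sketch, not the actual patch):

{code:scala}
// In spark-shell: the plan shows e.g. "Exchange hashpartitioning(id#0L, 200)",
// which is the information this ticket wants shown on the DAG UI as well.
val joined = spark.range(1000000).join(spark.range(1000), "id")
joined.explain()
{code}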






[jira] [Assigned] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26043:


Assignee: Apache Spark

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>  Labels: release-notes
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and Kerberos.
> It has historically been marked as "DeveloperApi", but in reality it's not very 
> useful for others and changes too often to be considered a stable API. Better to 
> just make it private to Spark.
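For callers outside Spark that only need a Hadoop Configuration, a sketch of the public-API route (not taken from this ticket; shown only to indicate that the common use does not need SparkHadoopUtil):

{code:scala}
// In spark-shell, where `spark` is the active SparkSession: the Hadoop Configuration
// is reachable through public API, without SparkHadoopUtil.
val hadoopConf = spark.sparkContext.hadoopConfiguration
println(hadoopConf.get("fs.defaultFS"))
{code}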






[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-16 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690373#comment-16690373
 ] 

Sean Owen commented on SPARK-26026:
---

Disregard my previous comment. I posted it because the 2.12 build failed, but the 
error was really:

{code}
/Users/seanowen/Documents/spark_2.12/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java:37:
 error: inner classes cannot be classfile annotations
  public @interface Stable {};
^
/Users/seanowen/Documents/spark_2.12/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java:47:
 error: inner classes cannot be classfile annotations
  public @interface Evolving {};
^
/Users/seanowen/Documents/spark_2.12/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java:57:
 error: inner classes cannot be classfile annotations
  public @interface Unstable {};
{code}

OK, I guess we need to make those annotations top-level classes, which should 
be OK in Spark 3.
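A sketch of what "top-level" means here, with an illustrative annotation only (not the actual Spark 3 change): each annotation gets its own source file instead of being nested inside InterfaceStability.

{code:java}
// Stable.java: a top-level annotation, nothing nested inside another class,
// which avoids the "inner classes cannot be classfile annotations" error above.
package org.apache.spark.annotation;

import java.lang.annotation.Documented;

@Documented
public @interface Stable {}
{code}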

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Priority: Minor
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.






[jira] [Issue Comment Deleted] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26026:
--
Comment: was deleted

(was: Ah, I think it was this:
{code}
/Users/seanowen/Documents/spark_2.11/common/tags/src/main/java/org/apache/spark/annotation/AlphaComponent.java:33:
 warning: Implementation restriction: subclassing Classfile does not
make your annotation visible at runtime.  If that is what
you want, you must write the annotation class in Java.
public @interface AlphaComponent {}
  ^
/Users/seanowen/Documents/spark_2.11/common/tags/src/main/java/org/apache/spark/annotation/DeveloperApi.java:36:
 warning: Implementation restriction: subclassing Classfile does not
make your annotation visible at runtime.  If that is what
you want, you must write the annotation class in Java.
public @interface DeveloperApi {}
  ^
...
{code}

It may be that we have to port the annotations to make it work.)

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Priority: Minor
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.






[jira] [Comment Edited] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-16 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688997#comment-16688997
 ] 

Sean Owen edited comment on SPARK-26026 at 11/17/18 4:31 AM:
-

Ah, I think it was this:
{code}
/Users/seanowen/Documents/spark_2.11/common/tags/src/main/java/org/apache/spark/annotation/AlphaComponent.java:33:
 warning: Implementation restriction: subclassing Classfile does not
make your annotation visible at runtime.  If that is what
you want, you must write the annotation class in Java.
public @interface AlphaComponent {}
  ^
/Users/seanowen/Documents/spark_2.11/common/tags/src/main/java/org/apache/spark/annotation/DeveloperApi.java:36:
 warning: Implementation restriction: subclassing Classfile does not
make your annotation visible at runtime.  If that is what
you want, you must write the annotation class in Java.
public @interface DeveloperApi {}
  ^
...
{code}

It may be that we have to port the annotations to make it work.


was (Author: srowen):
Ah, I think it was this:
{code}
/Users/seanowen/Documents/spark_2.11/common/tags/src/main/java/org/apache/spark/annotation/AlphaComponent.java:33:
 warning: Implementation restriction: subclassing Classfile does not
make your annotation visible at runtime.  If that is what
you want, you must write the annotation class in Java.
public @interface AlphaComponent {}
  ^
/Users/seanowen/Documents/spark_2.11/common/tags/src/main/java/org/apache/spark/annotation/DeveloperApi.java:36:
 warning: Implementation restriction: subclassing Classfile does not
make your annotation visible at runtime.  If that is what
you want, you must write the annotation class in Java.
public @interface DeveloperApi {}
  ^
...
{code}

It may be that we have to port the annotations to make it work.

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Priority: Minor
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.






[jira] [Assigned] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26043:


Assignee: (was: Apache Spark)

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>  Labels: release-notes
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and Kerberos.
> It has historically been marked as "DeveloperApi", but in reality it's not very 
> useful for others and changes too often to be considered a stable API. Better to 
> just make it private to Spark.






[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690370#comment-16690370
 ] 

Apache Spark commented on SPARK-26043:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/23066

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>  Labels: release-notes
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and Kerberos.
> It has historically been marked as "DeveloperApi", but in reality it's not very 
> useful for others and changes too often to be considered a stable API. Better to 
> just make it private to Spark.






[jira] [Updated] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26043:
--
Priority: Minor  (was: Major)

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>  Labels: release-notes
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and Kerberos.
> It has historically been marked as "DeveloperApi", but in reality it's not very 
> useful for others and changes too often to be considered a stable API. Better to 
> just make it private to Spark.






[jira] [Resolved] (SPARK-16647) Spark SQL 1.6.2 on YARN with Hive metastore 1.0.0 throws "alter_table_with_cascade" exception

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16647.
---
Resolution: Won't Fix

Spark 1.x has long been unsupported, so I'd only pursue this if it is 
reproducible on 2.2+, ideally master.
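For anyone who does hit this on a supported release, a hedged sketch of the usual mitigation for this kind of Thrift mismatch: point Spark's Hive client at the metastore's real version instead of the built-in 1.2.1 client. This is an assumption about the root cause, not a confirmed fix for this ticket.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch for Spark 2.2+: keep the embedded Hive client at the metastore's version (1.0.0 here)
// so it does not issue RPCs, such as alter_table_with_cascade, that the old server lacks.
val spark = SparkSession.builder()
  .appName("metastore-version-sketch")
  .config("spark.sql.hive.metastore.version", "1.0.0")
  .config("spark.sql.hive.metastore.jars", "maven") // or a classpath containing Hive 1.0.0 jars
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE tbl1 AS SELECT * FROM tbl2")
{code}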

> Spark SQL 1.6.2 on YARN with Hive metastore 1.0.0 throws 
> "alter_table_with_cascade" exception
> -
>
> Key: SPARK-16647
> URL: https://issues.apache.org/jira/browse/SPARK-16647
> Project: Spark
>  Issue Type: Bug
>Reporter: zhangshuxin
>Priority: Major
>
> My Spark version is 1.6.2 (also 1.5.2 and 1.5.0) and my Hive version is 1.0.0.
> When I execute SQL like 'create table tbl1 as select * from tbl2' or 
> 'insert overwrite table tbl1 select * from tbl2', I get the following 
> exception:
> 16/07/20 10:14:13 WARN metastore.RetryingMetaStoreClient: MetaStoreClient 
> lost connection. Attempting to reconnect.
> org.apache.thrift.TApplicationException: Invalid method name: 
> 'alter_table_with_cascade'
> at 
> org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
> at 
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_alter_table_with_cascade(ThriftHiveMetastore.java:1374)
> at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.alter_table_with_cascade(ThriftHiveMetastore.java:1358)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.alter_table(HiveMetaStoreClient.java:340)
> at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.alter_table(SessionHiveMetaStoreClient.java:251)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
> at com.sun.proxy.$Proxy27.alter_table(Unknown Source)
> at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:496)
> at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:441)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply$mcV$sp(ClientWrapper.scala:489)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.loadTable(ClientWrapper.scala:488)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:243)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:263)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at 
> org.apache.spark.sql.hive.execution.CreateTableAsSelect.run(CreateTableAsSelect.scala:89)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> 

[jira] [Assigned] (SPARK-26090) Resolve most miscellaneous deprecation and build warnings for Spark 3

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26090:


Assignee: Apache Spark  (was: Sean Owen)

> Resolve most miscellaneous deprecation and build warnings for Spark 3
> -
>
> Key: SPARK-26090
> URL: https://issues.apache.org/jira/browse/SPARK-26090
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>  Labels: release-notes
>
> The build has a lot of deprecation warnings. Some are new in Scala 2.12 and 
> Java 11. We've fixed some, but I wanted to take a pass at fixing lots of easy 
> miscellaneous ones here.
> They're too numerous and small to list here; see the pull request. Some 
> highlights:
> - @BeanInfo is deprecated in 2.12, and BeanInfo classes are pretty ancient in 
> Java. Instead, case classes can explicitly declare getters
> - Lots of work in the Kinesis examples to update and avoid deprecation
> - Eta expansion of zero-arg methods; foo() becomes () => foo() in many cases
> - Floating-point Range is inexact and deprecated, like 0.0 to 100.0 by 1.0
> - finalize() is finally deprecated (just needs to be suppressed)
> - StageInfo.attemptId was deprecated and easiest to remove here
> I'm not going to touch some chunks of deprecation warnings for now:
> - Parquet deprecations
> - Hive deprecations (particularly serde2 classes)
> - Deprecations in generated code (mostly Thriftserver CLI)
> - ProcessingTime deprecations (we may need to revive this class as internal)
> - many MLlib deprecations because they concern methods that may be removed 
> anyway
> - a few Kinesis deprecations I couldn't figure out
> - Mesos get/setRole, which I don't know well
> - Kafka/ZK deprecations (e.g. poll())
> - a few other ones that will probably resolve by deleting a deprecated method
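Two of the smaller items above, sketched with made-up names purely for illustration (the actual changes are in the pull request):

{code:scala}
// Floating-point Range is deprecated in Scala 2.12: step over BigDecimal instead.
val ticks = BigDecimal(0.0) to BigDecimal(100.0) by BigDecimal(1.0)

// Eta-expansion of a zero-arg method is deprecated: pass an explicit function value.
def reload(): Unit = println("reloading")
val hook: () => Unit = () => reload()
{code}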






[jira] [Commented] (SPARK-26090) Resolve most miscellaneous deprecation and build warnings for Spark 3

2018-11-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690213#comment-16690213
 ] 

Apache Spark commented on SPARK-26090:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/23065

> Resolve most miscellaneous deprecation and build warnings for Spark 3
> -
>
> Key: SPARK-26090
> URL: https://issues.apache.org/jira/browse/SPARK-26090
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
>
> The build has a lot of deprecation warnings. Some are new in Scala 2.12 and 
> Java 11. We've fixed some, but I wanted to take a pass at fixing lots of easy 
> miscellaneous ones here.
> They're too numerous and small to list here; see the pull request. Some 
> highlights:
> - @BeanInfo is deprecated in 2.12, and BeanInfo classes are pretty ancient in 
> Java. Instead, case classes can explicitly declare getters
> - Lots of work in the Kinesis examples to update and avoid deprecation
> - Eta expansion of zero-arg methods; foo() becomes () => foo() in many cases
> - Floating-point Range is inexact and deprecated, like 0.0 to 100.0 by 1.0
> - finalize() is finally deprecated (just needs to be suppressed)
> - StageInfo.attemptId was deprecated and easiest to remove here
> I'm not going to touch some chunks of deprecation warnings for now:
> - Parquet deprecations
> - Hive deprecations (particularly serde2 classes)
> - Deprecations in generated code (mostly Thriftserver CLI)
> - ProcessingTime deprecations (we may need to revive this class as internal)
> - many MLlib deprecations because they concern methods that may be removed 
> anyway
> - a few Kinesis deprecations I couldn't figure out
> - Mesos get/setRole, which I don't know well
> - Kafka/ZK deprecations (e.g. poll())
> - a few other ones that will probably resolve by deleting a deprecated method






[jira] [Assigned] (SPARK-26090) Resolve most miscellaneous deprecation and build warnings for Spark 3

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26090:


Assignee: Sean Owen  (was: Apache Spark)

> Resolve most miscellaneous deprecation and build warnings for Spark 3
> -
>
> Key: SPARK-26090
> URL: https://issues.apache.org/jira/browse/SPARK-26090
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
>
> The build has a lot of deprecation warnings. Some are new in Scala 2.12 and 
> Java 11. We've fixed some, but I wanted to take a pass at fixing lots of easy 
> miscellaneous ones here.
> They're too numerous and small to list here; see the pull request. Some 
> highlights:
> - @BeanInfo is deprecated in 2.12, and BeanInfo classes are pretty ancient in 
> Java. Instead, case classes can explicitly declare getters
> - Lots of work in the Kinesis examples to update and avoid deprecation
> - Eta expansion of zero-arg methods; foo() becomes () => foo() in many cases
> - Floating-point Range is inexact and deprecated, like 0.0 to 100.0 by 1.0
> - finalize() is finally deprecated (just needs to be suppressed)
> - StageInfo.attemptId was deprecated and easiest to remove here
> I'm not going to touch some chunks of deprecation warnings for now:
> - Parquet deprecations
> - Hive deprecations (particularly serde2 classes)
> - Deprecations in generated code (mostly Thriftserver CLI)
> - ProcessingTime deprecations (we may need to revive this class as internal)
> - many MLlib deprecations because they concern methods that may be removed 
> anyway
> - a few Kinesis deprecations I couldn't figure out
> - Mesos get/setRole, which I don't know well
> - Kafka/ZK deprecations (e.g. poll())
> - a few other ones that will probably resolve by deleting a deprecated method






[jira] [Commented] (SPARK-26090) Resolve most miscellaneous deprecation and build warnings for Spark 3

2018-11-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690212#comment-16690212
 ] 

Apache Spark commented on SPARK-26090:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/23065

> Resolve most miscellaneous deprecation and build warnings for Spark 3
> -
>
> Key: SPARK-26090
> URL: https://issues.apache.org/jira/browse/SPARK-26090
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
>
> The build has a lot of deprecation warnings. Some are new in Scala 2.12 and 
> Java 11. We've fixed some, but I wanted to take a pass at fixing lots of easy 
> miscellaneous ones here.
> They're too numerous and small to list here; see the pull request. Some 
> highlights:
> - @BeanInfo is deprecated in 2.12, and BeanInfo classes are pretty ancient in 
> Java. Instead, case classes can explicitly declare getters
> - Lots of work in the Kinesis examples to update and avoid deprecation
> - Eta expansion of zero-arg methods; foo() becomes () => foo() in many cases
> - Floating-point Range is inexact and deprecated, like 0.0 to 100.0 by 1.0
> - finalize() is finally deprecated (just needs to be suppressed)
> - StageInfo.attemptId was deprecated and easiest to remove here
> I'm not going to touch some chunks of deprecation warnings for now:
> - Parquet deprecations
> - Hive deprecations (particularly serde2 classes)
> - Deprecations in generated code (mostly Thriftserver CLI)
> - ProcessingTime deprecations (we may need to revive this class as internal)
> - many MLlib deprecations because they concern methods that may be removed 
> anyway
> - a few Kinesis deprecations I couldn't figure out
> - Mesos get/setRole, which I don't know well
> - Kafka/ZK deprecations (e.g. poll())
> - a few other ones that will probably resolve by deleting a deprecated method






[jira] [Commented] (SPARK-26033) Break large ml/tests.py files into smaller files

2018-11-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690187#comment-16690187
 ] 

Apache Spark commented on SPARK-26033:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/23063

> Break large ml/tests.py files into smaller files
> 
>
> Key: SPARK-26033
> URL: https://issues.apache.org/jira/browse/SPARK-26033
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
>







[jira] [Assigned] (SPARK-26033) Break large ml/tests.py files into smaller files

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26033:


Assignee: Bryan Cutler  (was: Apache Spark)

> Break large ml/tests.py files into smaller files
> 
>
> Key: SPARK-26033
> URL: https://issues.apache.org/jira/browse/SPARK-26033
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
>







[jira] [Assigned] (SPARK-26033) Break large ml/tests.py files into smaller files

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26033:


Assignee: Apache Spark  (was: Bryan Cutler)

> Break large ml/tests.py files into smaller files
> 
>
> Key: SPARK-26033
> URL: https://issues.apache.org/jira/browse/SPARK-26033
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-8288) ScalaReflection should also try apply methods defined in companion objects when inferring schema from a Product type

2018-11-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690141#comment-16690141
 ] 

Apache Spark commented on SPARK-8288:
-

User 'drewrobb' has created a pull request for this issue:
https://github.com/apache/spark/pull/23062

> ScalaReflection should also try apply methods defined in companion objects 
> when inferring schema from a Product type
> 
>
> Key: SPARK-8288
> URL: https://issues.apache.org/jira/browse/SPARK-8288
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Priority: Major
>
> This ticket is derived from PARQUET-293 (which actually describes a Spark SQL 
> issue).
> My comment on that issue quoted below:
> {quote}
> ...  The reason for this exception is that the Scala code Scrooge generates 
> is actually a trait extending {{Product}}:
> {code}
> trait Junk
>   extends ThriftStruct
>   with scala.Product2[Long, String]
>   with java.io.Serializable
> {code}
> while Spark expects a case class, something like:
> {code}
> case class Junk(junkID: Long, junkString: String)
> {code}
> The key difference here is that the latter case class version has a 
> constructor whose arguments can be transformed into fields of the DataFrame 
> schema.  The exception was thrown because Spark can't find such a constructor 
> from trait {{Junk}}.
> {quote}
> We can make {{ScalaReflection}} try {{apply}} methods in companion objects, 
> so that trait types generated by Scrooge can also be used for Spark SQL 
> schema inference.
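A sketch of the shape this would support, with a hand-written stand-in for the Scrooge output (illustrative only): the companion's apply carries the field names and types that the trait itself cannot provide through a constructor.

{code:scala}
// Stand-in for a Scrooge-generated trait: no usable constructor, but a companion apply
// whose parameters (junkID: Long, junkString: String) describe the schema.
trait Junk extends Product2[Long, String] with Serializable {
  def junkID: Long
  def junkString: String
  override def _1: Long = junkID
  override def _2: String = junkString
  override def canEqual(other: Any): Boolean = other.isInstanceOf[Junk]
}

object Junk {
  def apply(junkID: Long, junkString: String): Junk = {
    val (id, s) = (junkID, junkString)
    new Junk {
      override val junkID: Long = id
      override val junkString: String = s
    }
  }
}
{code}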






[jira] [Resolved] (SPARK-26095) make-distribution.sh is hanging in jenkins

2018-11-16 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26095.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23061
[https://github.com/apache/spark/pull/23061]

> make-distribution.sh is hanging in jenkins
> --
>
> Key: SPARK-26095
> URL: https://issues.apache.org/jira/browse/SPARK-26095
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Critical
> Fix For: 3.0.0
>
>
> See https://github.com/apache/spark/pull/23017 for further discussion.
> maven seems to get stuck here:
> {noformat}
> "BuilderThread 5" #80 prio=5 os_prio=0 tid=0x7f16b850 nid=0x7bcf 
> runnable [0x7f16882fd000]
>java.lang.Thread.State: RUNNABLE
> at org.jdom2.Element.isAncestor(Element.java:1052)
> at org.jdom2.ContentList.checkPreConditions(ContentList.java:222)
> at org.jdom2.ContentList.add(ContentList.java:244)
> at org.jdom2.Element.addContent(Element.java:950)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.insertAtPreferredLocation(MavenJDOMWriter.java:292)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateExclusion(MavenJDOMWriter.java:488)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateDependency(MavenJDOMWriter.java:1335)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateDependency(MavenJDOMWriter.java:386)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateModel(MavenJDOMWriter.java:1623)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.write(MavenJDOMWriter.java:2156)
> at 
> org.apache.maven.plugins.shade.pom.PomWriter.write(PomWriter.java:75)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.rewriteDependencyReducedPomIfWeHaveReduction(ShadeMojo.java:1049)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.createDependencyReducedPom(ShadeMojo.java:978)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.execute(ShadeMojo.java:538)
> at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
> at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:200)
> at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:196)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> And in fact I see a bunch of threads stuck there. Trying a few different 
> things.
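For context, a dump like the one above can be taken from the stuck build with stock JDK tooling; the process lookup below is just one assumed way to find the Maven JVM.

{code:bash}
# Dump the threads of the running Maven JVM (Maven's main class is the classworlds Launcher).
MVN_PID=$(pgrep -f org.codehaus.plexus.classworlds.launcher.Launcher | head -n 1)
jstack "$MVN_PID" > maven-threads.txt
{code}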






[jira] [Assigned] (SPARK-26095) make-distribution.sh is hanging in jenkins

2018-11-16 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26095:
--

Assignee: Marcelo Vanzin

> make-distribution.sh is hanging in jenkins
> --
>
> Key: SPARK-26095
> URL: https://issues.apache.org/jira/browse/SPARK-26095
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Critical
> Fix For: 3.0.0
>
>
> See https://github.com/apache/spark/pull/23017 for further discussion.
> maven seems to get stuck here:
> {noformat}
> "BuilderThread 5" #80 prio=5 os_prio=0 tid=0x7f16b850 nid=0x7bcf 
> runnable [0x7f16882fd000]
>java.lang.Thread.State: RUNNABLE
> at org.jdom2.Element.isAncestor(Element.java:1052)
> at org.jdom2.ContentList.checkPreConditions(ContentList.java:222)
> at org.jdom2.ContentList.add(ContentList.java:244)
> at org.jdom2.Element.addContent(Element.java:950)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.insertAtPreferredLocation(MavenJDOMWriter.java:292)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateExclusion(MavenJDOMWriter.java:488)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateDependency(MavenJDOMWriter.java:1335)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateDependency(MavenJDOMWriter.java:386)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateModel(MavenJDOMWriter.java:1623)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.write(MavenJDOMWriter.java:2156)
> at 
> org.apache.maven.plugins.shade.pom.PomWriter.write(PomWriter.java:75)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.rewriteDependencyReducedPomIfWeHaveReduction(ShadeMojo.java:1049)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.createDependencyReducedPom(ShadeMojo.java:978)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.execute(ShadeMojo.java:538)
> at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
> at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:200)
> at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:196)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> And in fact I see a bunch of threads stuck there. Trying a few different 
> things.






[jira] [Created] (SPARK-26096) k8s integration tests should run R tests

2018-11-16 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-26096:
--

 Summary: k8s integration tests should run R tests
 Key: SPARK-26096
 URL: https://issues.apache.org/jira/browse/SPARK-26096
 Project: Spark
  Issue Type: Task
  Components: Kubernetes, Tests
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


Noticed while debugging a completely separate thing.

- the Jenkins job doesn't enable the SparkR profile
- KubernetesSuite doesn't include the RTestsSuite trait

Even if you fix those two, it seems the tests are broken:

{noformat}
[info] - Run SparkR on simple dataframe.R example *** FAILED *** (2 minutes, 3 
seconds)
[info]   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:308)
[info]   at 
org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:307)
[info]   at 
org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
[info]   at 
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.runSparkApplicationAndVerifyCompletion(KubernetesSuite.scala:274)
[info]   at 
org.apache.spark.deploy.k8s.integrationtest.RTestsSuite.$anonfun$$init$$1(RTestsSuite.scala:26)
[info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
{noformat}
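A hedged sketch of the build-side half of the fix (the flags and script paths are assumed from the Spark repo, and the actual Jenkins wiring is not shown); mixing RTestsSuite into KubernetesSuite is still needed on top of this.

{code:bash}
# Build a distribution with SparkR included (the profile the Jenkins job does not enable).
./dev/make-distribution.sh --name k8s-r --tgz --r -Pkubernetes -Psparkr

# Run the k8s integration tests against that tarball.
./resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh \
  --spark-tgz "$(pwd)"/spark-*-bin-k8s-r.tgz
{code}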




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26092) Use CheckpointFileManager to write the streaming metadata file

2018-11-16 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26092.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 3.0.0
   2.4.1

> Use CheckpointFileManager to write the streaming metadata file
> --
>
> Key: SPARK-26092
> URL: https://issues.apache.org/jira/browse/SPARK-26092
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> We should use CheckpointFileManager to write the streaming metadata file to 
> avoid potential partial file issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-16 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690123#comment-16690123
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

That user said he has seen this error 4-5 times, and just rerunning the same 
code makes it disappear.



> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky - on the next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26094) Streaming WAL should create parent dirs

2018-11-16 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690075#comment-16690075
 ] 

Imran Rashid commented on SPARK-26094:
--

When playing around with this, I noticed another difference -- {{fs.create()}} 
accepts relative paths, while {{fs.createFile()}} requires absolute paths. 
When I tried with a relative path, I got:

{noformat}
java.lang.IllegalArgumentException: Pathname floop/blah from floop/blah is not 
a valid DFS filename.
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:233)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$10.doCall(DistributedFileSystem.java:563)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$10.doCall(DistributedFileSystem.java:560)
  at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(DistributedFileSystem.java:581)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$800(DistributedFileSystem.java:121)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$HdfsDataOutputStreamBuilder.build(DistributedFileSystem.java:3026)
  ... 53 elided
{noformat}
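
A small sketch of the difference being described, assuming a Hadoop 3 client 
on the classpath; the relative path is borrowed from the log above and the 
/tmp path is purely illustrative:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CreateVsCreateFile {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // fs.create() resolves a relative path against the filesystem's working
    // directory and creates any missing parent directories.
    fs.create(new Path("floop/blah")).close()

    // fs.createFile() uses the builder API; against HDFS it goes through
    // createNonRecursive(), which rejects relative paths (the
    // IllegalArgumentException above) and, without recursive(), does not
    // create parent directories either.
    fs.createFile(new Path("/tmp/floop/blah")).recursive().build().close()
  }
}
{code}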

> Streaming WAL should create parent dirs
> ---
>
> Key: SPARK-26094
> URL: https://issues.apache.org/jira/browse/SPARK-26094
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
>
> SPARK-25871 introduced a regression in the streaming WAL -- it no longer 
> makes all the parent dirs, so you may see an exception like this in cases 
> that used to work:
> {noformat}
> 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: 
> Failed to write to write ahead log after 3 failures
> ...
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
> ...
> Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: 
> /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26094) Streaming WAL should create parent dirs

2018-11-16 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690076#comment-16690076
 ] 

Imran Rashid commented on SPARK-26094:
--

cc [~xiaochen]

> Streaming WAL should create parent dirs
> ---
>
> Key: SPARK-26094
> URL: https://issues.apache.org/jira/browse/SPARK-26094
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
>
> SPARK-25871 introduced a regression in the streaming WAL -- it no longer 
> makes all the parent dirs, so you may see an exception like this in cases 
> that used to work:
> {noformat}
> 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: 
> Failed to write to write ahead log after 3 failures
> ...
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
> ...
> Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: 
> /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25871) Streaming WAL should not use hdfs erasure coding, regardless of FS defaults

2018-11-16 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690039#comment-16690039
 ] 

Imran Rashid commented on SPARK-25871:
--

this introduced a regression, SPARK-26094

> Streaming WAL should not use hdfs erasure coding, regardless of FS defaults
> ---
>
> Key: SPARK-25871
> URL: https://issues.apache.org/jira/browse/SPARK-25871
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 3.0.0
>
>
> The {{FileBasedWriteAheadLogWriter}} expects the output stream for the WAL to 
> support {{hflush()}}, but hdfs erasure coded files do not support that.
> https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html#Limitations
> otherwise you get exceptions like:
> {noformat}
> 17/10/17 17:31:34 ERROR executor.Executor: Exception in task 0.2 in stage 6.0 
> (TID 85)
> org.apache.spark.SparkException: Could not read data from write ahead log 
> record 
> FileBasedWriteAheadLogSegment(hdfs://quasar-yxckyb-1.vpc.cloudera.com:8020/tmp/__spark__a10be3a3-85ec-4d4f-8782-a4760df4cc4c/88657/checkpoints/receivedData/0/log-1508286672978-1508286732978,1321921,189000)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.EOFException: Cannot seek after EOF
>   at 
> org.apache.hadoop.hdfs.DFSStripedInputStream.seek(DFSStripedInputStream.java:331)
>   at 
> org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)
>   at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLogRandomReader.read(FileBasedWriteAheadLogRandomReader.scala:37)
>   at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLog.read(FileBasedWriteAheadLog.scala:120)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:142)
>   ... 18 more
> {noformat}
> HDFS allows you to force a file to be replicated, regardless of the FS 
> defaults -- we should do that for the WAL.
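
For reference, a minimal sketch of forcing plain replication for the WAL file, 
assuming Hadoop 3's builder API (replicate() lives on 
HdfsDataOutputStreamBuilder); the method name and paths are illustrative, not 
the actual Spark change:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem

// Sketch only: open the WAL output stream with replication forced, so that
// hflush()/hsync() keep working even if the target directory carries an
// erasure-coding policy.
def openWalStream(path: Path, conf: Configuration): FSDataOutputStream = {
  FileSystem.get(path.toUri, conf) match {
    case dfs: DistributedFileSystem =>
      // replicate() asks HDFS for plain replication regardless of FS defaults.
      dfs.createFile(path).replicate().recursive().build()
    case fs =>
      fs.create(path) // non-HDFS file systems have no EC policy to bypass
  }
}
{code}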



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26095) make-distribution.sh is hanging in jenkins

2018-11-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690045#comment-16690045
 ] 

Apache Spark commented on SPARK-26095:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23061

> make-distribution.sh is hanging in jenkins
> --
>
> Key: SPARK-26095
> URL: https://issues.apache.org/jira/browse/SPARK-26095
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> See https://github.com/apache/spark/pull/23017 for further discussion.
> maven seems to get stuck here:
> {noformat}
> "BuilderThread 5" #80 prio=5 os_prio=0 tid=0x7f16b850 nid=0x7bcf 
> runnable [0x7f16882fd000]
>java.lang.Thread.State: RUNNABLE
> at org.jdom2.Element.isAncestor(Element.java:1052)
> at org.jdom2.ContentList.checkPreConditions(ContentList.java:222)
> at org.jdom2.ContentList.add(ContentList.java:244)
> at org.jdom2.Element.addContent(Element.java:950)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.insertAtPreferredLocation(MavenJDOMWriter.java:292)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateExclusion(MavenJDOMWriter.java:488)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateDependency(MavenJDOMWriter.java:1335)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateDependency(MavenJDOMWriter.java:386)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateModel(MavenJDOMWriter.java:1623)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.write(MavenJDOMWriter.java:2156)
> at 
> org.apache.maven.plugins.shade.pom.PomWriter.write(PomWriter.java:75)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.rewriteDependencyReducedPomIfWeHaveReduction(ShadeMojo.java:1049)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.createDependencyReducedPom(ShadeMojo.java:978)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.execute(ShadeMojo.java:538)
> at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
> at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:200)
> at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:196)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> And in fact I see a bunch of threads stuck there. Trying a few different 
> things.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26095) make-distribution.sh is hanging in jenkins

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26095:


Assignee: Apache Spark

> make-distribution.sh is hanging in jenkins
> --
>
> Key: SPARK-26095
> URL: https://issues.apache.org/jira/browse/SPARK-26095
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Critical
>
> See https://github.com/apache/spark/pull/23017 for further discussion.
> maven seems to get stuck here:
> {noformat}
> "BuilderThread 5" #80 prio=5 os_prio=0 tid=0x7f16b850 nid=0x7bcf 
> runnable [0x7f16882fd000]
>java.lang.Thread.State: RUNNABLE
> at org.jdom2.Element.isAncestor(Element.java:1052)
> at org.jdom2.ContentList.checkPreConditions(ContentList.java:222)
> at org.jdom2.ContentList.add(ContentList.java:244)
> at org.jdom2.Element.addContent(Element.java:950)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.insertAtPreferredLocation(MavenJDOMWriter.java:292)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateExclusion(MavenJDOMWriter.java:488)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateDependency(MavenJDOMWriter.java:1335)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateDependency(MavenJDOMWriter.java:386)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateModel(MavenJDOMWriter.java:1623)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.write(MavenJDOMWriter.java:2156)
> at 
> org.apache.maven.plugins.shade.pom.PomWriter.write(PomWriter.java:75)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.rewriteDependencyReducedPomIfWeHaveReduction(ShadeMojo.java:1049)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.createDependencyReducedPom(ShadeMojo.java:978)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.execute(ShadeMojo.java:538)
> at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
> at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:200)
> at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:196)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> And in fact I see a bunch of threads stuck there. Trying a few different 
> things.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26095) make-distribution.sh is hanging in jenkins

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26095:


Assignee: (was: Apache Spark)

> make-distribution.sh is hanging in jenkins
> --
>
> Key: SPARK-26095
> URL: https://issues.apache.org/jira/browse/SPARK-26095
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> See https://github.com/apache/spark/pull/23017 for further discussion.
> maven seems to get stuck here:
> {noformat}
> "BuilderThread 5" #80 prio=5 os_prio=0 tid=0x7f16b850 nid=0x7bcf 
> runnable [0x7f16882fd000]
>java.lang.Thread.State: RUNNABLE
> at org.jdom2.Element.isAncestor(Element.java:1052)
> at org.jdom2.ContentList.checkPreConditions(ContentList.java:222)
> at org.jdom2.ContentList.add(ContentList.java:244)
> at org.jdom2.Element.addContent(Element.java:950)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.insertAtPreferredLocation(MavenJDOMWriter.java:292)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateExclusion(MavenJDOMWriter.java:488)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateDependency(MavenJDOMWriter.java:1335)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateDependency(MavenJDOMWriter.java:386)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateModel(MavenJDOMWriter.java:1623)
> at 
> org.apache.maven.plugins.shade.pom.MavenJDOMWriter.write(MavenJDOMWriter.java:2156)
> at 
> org.apache.maven.plugins.shade.pom.PomWriter.write(PomWriter.java:75)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.rewriteDependencyReducedPomIfWeHaveReduction(ShadeMojo.java:1049)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.createDependencyReducedPom(ShadeMojo.java:978)
> at 
> org.apache.maven.plugins.shade.mojo.ShadeMojo.execute(ShadeMojo.java:538)
> at 
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
> at 
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
> at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:200)
> at 
> org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:196)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> And in fact I see a bunch of threads stuck there. Trying a few different 
> things.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26095) make-distribution.sh is hanging in jenkins

2018-11-16 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-26095:
--

 Summary: make-distribution.sh is hanging in jenkins
 Key: SPARK-26095
 URL: https://issues.apache.org/jira/browse/SPARK-26095
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


See https://github.com/apache/spark/pull/23017 for further discussion.

maven seems to get stuck here:

{noformat}
"BuilderThread 5" #80 prio=5 os_prio=0 tid=0x7f16b850 nid=0x7bcf 
runnable [0x7f16882fd000]
   java.lang.Thread.State: RUNNABLE
at org.jdom2.Element.isAncestor(Element.java:1052)
at org.jdom2.ContentList.checkPreConditions(ContentList.java:222)
at org.jdom2.ContentList.add(ContentList.java:244)
at org.jdom2.Element.addContent(Element.java:950)
at 
org.apache.maven.plugins.shade.pom.MavenJDOMWriter.insertAtPreferredLocation(MavenJDOMWriter.java:292)
at 
org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateExclusion(MavenJDOMWriter.java:488)
at 
org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateDependency(MavenJDOMWriter.java:1335)
at 
org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateDependency(MavenJDOMWriter.java:386)
at 
org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateModel(MavenJDOMWriter.java:1623)
at 
org.apache.maven.plugins.shade.pom.MavenJDOMWriter.write(MavenJDOMWriter.java:2156)
at org.apache.maven.plugins.shade.pom.PomWriter.write(PomWriter.java:75)
at 
org.apache.maven.plugins.shade.mojo.ShadeMojo.rewriteDependencyReducedPomIfWeHaveReduction(ShadeMojo.java:1049)
at 
org.apache.maven.plugins.shade.mojo.ShadeMojo.createDependencyReducedPom(ShadeMojo.java:978)
at 
org.apache.maven.plugins.shade.mojo.ShadeMojo.execute(ShadeMojo.java:538)
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
at 
org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:200)
at 
org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call(MultiThreadedBuilder.java:196)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}

And in fact I see a bunch of threads stuck there. Trying a few different things.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26094) Streaming WAL should create parent dirs

2018-11-16 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-26094:


 Summary: Streaming WAL should create parent dirs
 Key: SPARK-26094
 URL: https://issues.apache.org/jira/browse/SPARK-26094
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 3.0.0
Reporter: Imran Rashid
Assignee: Imran Rashid


SPARK-25871 introduced a regression in the streaming WAL -- it no longer makes 
all the parent dirs, so you may see an exception like this in cases that used 
to work:

{noformat}
18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: 
Failed to write to write ahead log after 3 failures
...
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at 
org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
...
Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: 
/tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923)
{noformat}
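
A minimal sketch of the kind of fix being asked for here, i.e. making sure the 
parent directories exist before the WAL file is opened; the helper name is 
illustrative and the actual change may look different:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, Path}

// Sketch only: fs.create() used to create missing parents implicitly; the
// builder-based createFile() does not, so create them explicitly first.
def openWalStreamWithParents(path: Path, conf: Configuration): FSDataOutputStream = {
  val fs = FileSystem.get(path.toUri, conf)
  val parent = path.getParent
  if (parent != null && !fs.exists(parent)) {
    fs.mkdirs(parent) // restores the pre-SPARK-25871 behavior
  }
  fs.createFile(path).build()
}
{code}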



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26093) Read Avro: ClassNotFoundException: org.apache.spark.sql.avro.AvroFileFormat.DefaultSource

2018-11-16 Thread Dagang Wei (JIRA)
Dagang Wei created SPARK-26093:
--

 Summary: Read Avro: ClassNotFoundException: 
org.apache.spark.sql.avro.AvroFileFormat.DefaultSource
 Key: SPARK-26093
 URL: https://issues.apache.org/jira/browse/SPARK-26093
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
 Environment: Spark 2.4.0

Scala 2.11.12

Java 1.8.0_181
Reporter: Dagang Wei


I downloaded and unpacked spark-2.4.0-bin-hadoop2.7.tgz to my Linux, then I 
followed [Read Avro 
files|https://docs.databricks.com/spark/latest/data-sources/read-avro.html] to 
read a local Avro file in spark-shell:

$ bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0

...

version 2.4.0
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_181)

...

scala> 

import com.databricks.spark.avro._

scala> 

val df = spark.read.avro("file:///.../foo.avro")

java.lang.ClassNotFoundException: Failed to find data source: 
org.apache.spark.sql.avro.AvroFileFormat. Please find packages at 
http://spark.apache.org/third-party-projects.html
 at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
 at 
com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
 at 
com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
 ... 51 elided
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.sql.avro.AvroFileFormat.DefaultSource
 at 
scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
 at scala.util.Try$.apply(Try.scala:192)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
 at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
 at scala.util.Try.orElse(Try.scala:84)
 at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
 ... 55 more
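
For reference, Spark 2.4 bundles its own Avro data source module; a minimal 
sketch of the built-in usage (the package coordinate assumes Scala 2.11, and 
the file path is a placeholder):

{code:scala}
// Start the shell with the built-in module instead of the old Databricks package:
//   bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0

// Then read through the "avro" format; no com.databricks.spark.avro import is needed.
val df = spark.read.format("avro").load("file:///path/to/foo.avro")
df.printSchema()
{code}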

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26092) Use CheckpointFileManager to write the streaming metadata file

2018-11-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689919#comment-16689919
 ] 

Apache Spark commented on SPARK-26092:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/23060

> Use CheckpointFileManager to write the streaming metadata file
> --
>
> Key: SPARK-26092
> URL: https://issues.apache.org/jira/browse/SPARK-26092
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> We should use CheckpointFileManager to write the streaming metadata file to 
> avoid potential partial file issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26092) Use CheckpointFileManager to write the streaming metadata file

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26092:


Assignee: (was: Apache Spark)

> Use CheckpointFileManager to write the streaming metadata file
> --
>
> Key: SPARK-26092
> URL: https://issues.apache.org/jira/browse/SPARK-26092
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> We should use CheckpointFileManager to write the streaming metadata file to 
> avoid potential partial file issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26092) Use CheckpointFileManager to write the streaming metadata file

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26092:


Assignee: Apache Spark

> Use CheckpointFileManager to write the streaming metadata file
> --
>
> Key: SPARK-26092
> URL: https://issues.apache.org/jira/browse/SPARK-26092
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> We should use CheckpointFileManager to write the streaming metadata file to 
> avoid potential partial file issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26092) Use CheckpointFileManager to write the streaming metadata file

2018-11-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689920#comment-16689920
 ] 

Apache Spark commented on SPARK-26092:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/23060

> Use CheckpointFileManager to write the streaming metadata file
> --
>
> Key: SPARK-26092
> URL: https://issues.apache.org/jira/browse/SPARK-26092
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> We should use CheckpointFileManager to write the streaming metadata file to 
> avoid potential partial file issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26092) Use CheckpointFileManager to write the streaming metadata file

2018-11-16 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26092:
-
Issue Type: Bug  (was: Test)

> Use CheckpointFileManager to write the streaming metadata file
> --
>
> Key: SPARK-26092
> URL: https://issues.apache.org/jira/browse/SPARK-26092
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> We should use CheckpointFileManager to write the streaming metadata file to 
> avoid potential partial file issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26092) Use CheckpointFileManager to write the streaming metadata file

2018-11-16 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26092:


 Summary: Use CheckpointFileManager to write the streaming metadata 
file
 Key: SPARK-26092
 URL: https://issues.apache.org/jira/browse/SPARK-26092
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Shixiong Zhu


We should use CheckpointFileManager to write the streaming metadata file to 
avoid potential partial file issue.
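
A rough sketch of the atomic-write pattern this refers to, assuming Spark's 
internal CheckpointFileManager API; the payload, file name, and helper are 
illustrative, not the actual patch:

{code:scala}
import java.nio.charset.StandardCharsets
import scala.util.control.NonFatal
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.execution.streaming.CheckpointFileManager

// Sketch only: write the metadata through CheckpointFileManager so the file
// either appears fully written or not at all (no partial file on failure).
def writeMetadata(checkpointDir: Path, json: String, conf: Configuration): Unit = {
  val fm = CheckpointFileManager.create(checkpointDir, conf)
  val out = fm.createAtomic(new Path(checkpointDir, "metadata"), overwriteIfPossible = false)
  try {
    out.write(json.getBytes(StandardCharsets.UTF_8))
    out.close()   // commit: the temp file is renamed into place
  } catch {
    case NonFatal(e) =>
      out.cancel() // abort: the temp file is cleaned up, nothing partial remains
      throw e
  }
}
{code}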



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26091:


Assignee: (was: Apache Spark)

> Upgrade to 2.3.4 for Hive Metastore Client 2.3
> --
>
> Key: SPARK-26091
> URL: https://issues.apache.org/jira/browse/SPARK-26091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3

2018-11-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689894#comment-16689894
 ] 

Apache Spark commented on SPARK-26091:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/23059

> Upgrade to 2.3.4 for Hive Metastore Client 2.3
> --
>
> Key: SPARK-26091
> URL: https://issues.apache.org/jira/browse/SPARK-26091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26091:


Assignee: Apache Spark

> Upgrade to 2.3.4 for Hive Metastore Client 2.3
> --
>
> Key: SPARK-26091
> URL: https://issues.apache.org/jira/browse/SPARK-26091
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26091) Upgrade to 2.3.4 for Hive Metastore Client 2.3

2018-11-16 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-26091:
-

 Summary: Upgrade to 2.3.4 for Hive Metastore Client 2.3
 Key: SPARK-26091
 URL: https://issues.apache.org/jira/browse/SPARK-26091
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25905) BlockManager should expose getRemoteManagedBuffer to avoid creating bytebuffers

2018-11-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689860#comment-16689860
 ] 

Apache Spark commented on SPARK-25905:
--

User 'wypoon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23058

> BlockManager should expose getRemoteManagedBuffer to avoid creating 
> bytebuffers
> ---
>
> Key: SPARK-25905
> URL: https://issues.apache.org/jira/browse/SPARK-25905
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> The block manager currently only lets you get a handle on remote data as a 
> {{ChunkedByteBuffer}}.  But with remote reads of cached data, you really only 
> need an input stream view of the data, which is already available with the 
> {{ManagedBuffer}} which is fetched.  By forcing conversion to a 
> {{ChunkedByteBuffer}}, we end up using more memory than necessary.
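
A hypothetical sketch of the API shape being proposed; getRemoteManagedBuffer 
comes from the issue title, while the trait, signature, and String block id 
are illustrative assumptions:

{code:scala}
import java.io.InputStream
import org.apache.spark.network.buffer.ManagedBuffer

// Hypothetical sketch of the proposed addition: hand callers the fetched
// ManagedBuffer (a stream view) instead of materializing a ChunkedByteBuffer.
trait RemoteBlockReads {
  /** Fetch a remote block and expose it as a ManagedBuffer, if it exists. */
  def getRemoteManagedBuffer(blockId: String): Option[ManagedBuffer]

  /** Callers that only need to stream the bytes avoid the extra copy. */
  def readRemoteBlock(blockId: String): Option[InputStream] =
    getRemoteManagedBuffer(blockId).map(_.createInputStream())
}
{code}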



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25905) BlockManager should expose getRemoteManagedBuffer to avoid creating bytebuffers

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25905:


Assignee: (was: Apache Spark)

> BlockManager should expose getRemoteManagedBuffer to avoid creating 
> bytebuffers
> ---
>
> Key: SPARK-25905
> URL: https://issues.apache.org/jira/browse/SPARK-25905
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> The block manager currently only lets you get a handle on remote data as a 
> {{ChunkedByteBuffer}}.  But with remote reads of cached data, you really only 
> need an input stream view of the data, which is already available with the 
> {{ManagedBuffer}} which is fetched.  By forcing conversion to a 
> {{ChunkedByteBuffer}}, we end up using more memory than necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25905) BlockManager should expose getRemoteManagedBuffer to avoid creating bytebuffers

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25905:


Assignee: Apache Spark

> BlockManager should expose getRemoteManagedBuffer to avoid creating 
> bytebuffers
> ---
>
> Key: SPARK-25905
> URL: https://issues.apache.org/jira/browse/SPARK-25905
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Major
>  Labels: memory-analysis
>
> The block manager currently only lets you get a handle on remote data as a 
> {{ChunkedByteBuffer}}.  But with remote reads of cached data, you really only 
> need an input stream view of the data, which is already available with the 
> {{ManagedBuffer}} which is fetched.  By forcing conversion to a 
> {{ChunkedByteBuffer}}, we end up using more memory than necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21559) Remove Mesos fine-grained mode

2018-11-16 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-21559:

Target Version/s: 3.0.0
  Labels: release-notes  (was: )

> Remove Mesos fine-grained mode
> --
>
> Key: SPARK-21559
> URL: https://issues.apache.org/jira/browse/SPARK-21559
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>  Labels: release-notes
>
> After discussing this with people from Mesosphere we agreed that it is time 
> to remove fine-grained mode. Plans are to improve cluster mode to cover any 
> benefits that may have existed when using fine-grained mode.
>  [~susanxhuynh]
> Previous status of this can be found here:
> https://issues.apache.org/jira/browse/SPARK-11857



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26069) Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures

2018-11-16 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26069.
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

> Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures
> -
>
> Key: SPARK-26069
> URL: https://issues.apache.org/jira/browse/SPARK-26069
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> {code}
> sbt.ForkMain$ForkError: java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.spark.network.RpcIntegrationSuite.assertErrorAndClosed(RpcIntegrationSuite.java:386)
>   at 
> org.apache.spark.network.RpcIntegrationSuite.sendRpcWithStreamFailures(RpcIntegrationSuite.java:347)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
>   at com.novocode.junit.JUnitRunner$1.execute(JUnitRunner.java:132)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22865) Publish Official Apache Spark Docker images

2018-11-16 Thread Luciano Resende (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689695#comment-16689695
 ] 

Luciano Resende commented on SPARK-22865:
-

Any update on this issue?  [~akorzhuev] seems to have things automated and 
maybe we should enhance it and make it part of the release process? If folks 
are ok I might take a quick look into it. 

> Publish Official Apache Spark Docker images
> ---
>
> Key: SPARK-22865
> URL: https://issues.apache.org/jira/browse/SPARK-22865
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22865) Publish Official Apache Spark Docker images

2018-11-16 Thread Luciano Resende (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689695#comment-16689695
 ] 

Luciano Resende edited comment on SPARK-22865 at 11/16/18 5:23 PM:
---

Any update on this issue?  [~akorzhuev] seems to have things automated and 
maybe we should enhance it and make it part of the release process? If folks 
are ok I might take a quick look into it as I need these images for a related 
project.


was (Author: luciano resende):
Any update on this issue?  [~akorzhuev] seems to have things automated and 
maybe we should enhance it and make it part of the release process? If folks 
are ok I might take a quick look into it. 

> Publish Official Apache Spark Docker images
> ---
>
> Key: SPARK-22865
> URL: https://issues.apache.org/jira/browse/SPARK-22865
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26045) Error in the spark 2.4 release package with the spark-avro_2.11 depdency

2018-11-16 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689641#comment-16689641
 ] 

Sean Owen commented on SPARK-26045:
---

This looks like you are using a different version of Avro than Spark uses? I am 
not sure how Spark could compile with this but then fail in this way.
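
One quick way to check which Avro jar actually got loaded, e.g. from 
spark-shell (standard JDK reflection, nothing Spark-specific):

{code:scala}
// Prints the jar that the Avro Schema class was loaded from; if it is not the
// avro-1.8.2.jar shipped with Spark 2.4, a classpath conflict is the likely
// cause of the NoSuchMethodError below.
println(classOf[org.apache.avro.Schema].getProtectionDomain.getCodeSource.getLocation)
{code}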

> Error in the spark 2.4 release package with the spark-avro_2.11 depdency
> 
>
> Key: SPARK-26045
> URL: https://issues.apache.org/jira/browse/SPARK-26045
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
> Environment: 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 
> 2018 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Oscar garcía 
>Priority: Major
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Hello, I have been having problems with the latest Spark 2.4 release: the 
> read-Avro-file feature does not seem to be working. I have fixed it locally 
> by building the source code and updating the *avro-1.8.2.jar* in the 
> *$SPARK_HOME*/jars/ dependencies.
> With the default Spark 2.4 release, when I try to read an Avro file, Spark 
> raises the following exception:
> {code:java}
> spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
> scala> spark.read.format("avro").load("file.avro")
> java.lang.NoSuchMethodError: 
> org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:51)
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105
> {code}
> Checksum:  spark-2.4.0-bin-without-hadoop.tgz: 7670E29B 59EAE7A8 5DBC9350 
> 085DD1E0 F056CA13 11365306 7A6A32E9 B607C68E A8DAA666 EF053350 008D0254 
> 318B70FB DE8A8B97 6586CA19 D65BA2B3 FD7F919E
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26090) Resolve most miscellaneous deprecation and build warnings for Spark 3

2018-11-16 Thread Sean Owen (JIRA)
Sean Owen created SPARK-26090:
-

 Summary: Resolve most miscellaneous deprecation and build warnings 
for Spark 3
 Key: SPARK-26090
 URL: https://issues.apache.org/jira/browse/SPARK-26090
 Project: Spark
  Issue Type: Improvement
  Components: ML, Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Sean Owen
Assignee: Sean Owen


The build has a lot of deprecation warnings. Some are new in Scala 2.12 and 
Java 11. We've fixed some, but I wanted to take a pass at fixing lots of easy 
miscellaneous ones here.

They're too numerous and small to list here; see the pull request. Some 
highlights:

- @BeanInfo is deprecated in 2.12, and BeanInfo classes are pretty ancient in 
Java. Instead, case classes can explicitly declare getters (see the sketch 
after this list)
- Lots of work in the Kinesis examples to update and avoid deprecation
- Eta expansion of zero-arg methods; foo() becomes () => foo() in many cases
- Floating-point Range is inexact and deprecated, like 0.0 to 100.0 by 1.0
- finalize() is finally deprecated (just needs to be suppressed)
- StageInfo.attemptId was deprecated and easiest to remove here
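
A rough sketch, for illustration only, of two of the mechanical changes 
mentioned above (the class, field, and method names are made up):

{code:scala}
// Instead of scala.beans.BeanInfo, declare the bean-style getters explicitly
// on the case class:
case class Point(x: Double, y: Double) {
  def getX: Double = x
  def getY: Double = y
}

// Eta-expansion of zero-arg methods: where a () => T is expected, wrap the
// call explicitly instead of relying on the implicit lift of foo().
def currentCount(): Long = System.nanoTime()
val supplier: () => Long = () => currentCount()
{code}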

I'm not now going to touch some chunks of deprecation warnings:

- Parquet deprecations
- Hive deprecations (particularly serde2 classes)
- Deprecations in generated code (mostly Thriftserver CLI)
- ProcessingTime deprecations (we may need to revive this class as internal)
- many MLlib deprecations because they concern methods that may be removed 
anyway
- a few Kinesis deprecations I couldn't figure out
- Mesos get/setRole, which I don't know well
- Kafka/ZK deprecations (e.g. poll())
- a few other ones that will probably resolve by deleting a deprecated method




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26052) Spark should output a _SUCCESS file for every partition correctly written

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26052:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Spark should output a _SUCCESS file for every partition correctly written
> -
>
> Key: SPARK-26052
> URL: https://issues.apache.org/jira/browse/SPARK-26052
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.3.0
>Reporter: Matt Matolcsi
>Priority: Minor
>
> When writing a set of partitioned Parquet files to HDFS using 
> dataframe.write.parquet(), a _SUCCESS file is written to hdfs://path/to/table 
> after successful completion, though the actual Parquet files will end up in 
> hdfs://path/to/table/partition_key1=val1/partition_key2=val2/ 
> If partitions are written out one at a time (e.g., an hourly ETL), the 
> _SUCCESS file is overwritten by each subsequent run and information on what 
> partitions were correctly written is lost.
> I would like to be able to keep track of what partitions were successfully 
> written in HDFS. I think this could be done by writing the _SUCCESS files to 
> the same partition directories where the Parquet files reside, i.e., 
> hdfs://path/to/table/partition_key1=val1/partition_key2=val2/
> Since https://issues.apache.org/jira/browse/SPARK-13207 has been resolved, I 
> don't think this should break partition discovery.
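
A user-side workaround sketch of the idea, writing a marker per partition 
directory after the job finishes; the helper and its arguments are 
illustrative, not an existing Spark API:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Sketch only: after writing one partition (e.g. the hourly ETL case above),
// drop an empty _SUCCESS marker into that partition's directory.
def markPartitionWritten(spark: SparkSession, tablePath: String,
                         partitionSpec: Seq[(String, String)]): Unit = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val partitionDir = partitionSpec.foldLeft(new Path(tablePath)) {
    case (dir, (key, value)) => new Path(dir, s"$key=$value")
  }
  // Overwrite an existing marker so reruns of the same partition stay idempotent.
  fs.create(new Path(partitionDir, "_SUCCESS"), true).close()
}
{code}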



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26073) remove invalid comment as we don't use it anymore

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26073.
---
   Resolution: Fixed
 Assignee: caoxuewen
Fix Version/s: 3.0.0

Resolved by https://github.com/apache/spark/pull/23044

> remove invalid comment as we don't use it anymore
> -
>
> Key: SPARK-26073
> URL: https://issues.apache.org/jira/browse/SPARK-26073
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Assignee: caoxuewen
>Priority: Major
> Fix For: 3.0.0
>
>
> remove invalid comment as we don't use it anymore
> More details: [https://github.com/apache/spark/pull/22976] (comment)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1517.
--
Resolution: Won't Fix

Ancient JIRA here ... I think we did publish these at one point, but I don't know 
that we still do. Partly there was concern in the past about 'featuring' unreleased 
code. I don't think we're going to do anything more here.

> Publish nightly snapshots of documentation, maven artifacts, and binary builds
> --
>
> Key: SPARK-1517
> URL: https://issues.apache.org/jira/browse/SPARK-1517
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Critical
>
> Should be pretty easy to do with Jenkins. The only thing I can think of that 
> would be tricky is to set up credentials so that jenkins can publish this 
> stuff somewhere on apache infra.
> Ideally we don't want to have to put a private key on every jenkins box 
> (since they are otherwise pretty stateless). One idea is to encrypt these 
> credentials with a passphrase and post them somewhere publicly visible. Then 
> the jenkins build can download the credentials provided we set a passphrase 
> in an environment variable in jenkins. There may be simpler solutions as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23709) BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the sum of weights

2018-11-16 Thread Olivier Sannier (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689633#comment-16689633
 ] 

Olivier Sannier edited comment on SPARK-23709 at 11/16/18 4:30 PM:
---

Well, sure, but nobody answered on the mailing list and I'm convinced that the 
documentation should be updated to provide the answer to this question.

So I'm left with no viable solution here, and that's pretty annoying.


was (Author: obones):
Well, sure, but nobody answered on the mailing list and I'm convinced that the 
documentation should be update to provide the answer to this question.

So I'm left with no viable solution here, and that's pretty annoying.

> BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the 
> sum of weights
> ---
>
> Key: SPARK-23709
> URL: https://issues.apache.org/jira/browse/SPARK-23709
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Olivier Sannier
>Priority: Critical
>
> When using a bagging method like RandomForest, the theory dictates that the 
> source dataset is copied over with a subsample of rows.
> To avoid excessive memory usage, Spark uses the BaggedPoint concept, where 
> each row is associated with a weight for the final dataset, i.e. one weight 
> per tree requested of the RandomForest.
> RandomForest requires that the dataset for each tree is a random draw with 
> replacement from the source data that has the same size as the source data.
> However, during investigations, we found out that the count value used to 
> compute the variance is not always equal to the source data count; it is 
> sometimes less, sometimes more.
> I went digging in the source and found the 
> BaggedPoint.convertToBaggedRDDSamplingWithReplacement method, which uses a 
> Poisson distribution to assign a weight to each row. And this distribution 
> does not guarantee that the total of the weights for a given tree is equal to 
> the source dataset count.
> Looking around in here, it seems this is done for performance reasons because 
> the approximation it gives is good enough, especially when dealing with very 
> large datasets.
> However, I could not find any documentation that clearly explains this. Would 
> you have any link on the subject?
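
A small sketch of the effect being described, using the commons-math3 Poisson 
distribution that BaggedPoint draws from (the row count and seed are arbitrary):

{code:scala}
import org.apache.commons.math3.distribution.PoissonDistribution

// Each row's weight for a given tree is drawn from Poisson(subsamplingRate).
// With rate 1.0 the sum of weights over n rows has mean n but variance n,
// so it is almost never exactly n.
val n = 100000
val poisson = new PoissonDistribution(1.0)
poisson.reseedRandomGenerator(42L)
val totalWeight = (1 to n).map(_ => poisson.sample()).sum
println(s"rows = $n, sum of weights = $totalWeight")  // typically off by a few hundred
{code}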



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23709) BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the sum of weights

2018-11-16 Thread Olivier Sannier (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689633#comment-16689633
 ] 

Olivier Sannier commented on SPARK-23709:
-

Well, sure, but nobody answered on the mailing list and I'm convinced that the 
documentation should be updated to provide the answer to this question.

So I'm left with no viable solution here, and that's pretty annoying.

> BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the 
> sum of weights
> ---
>
> Key: SPARK-23709
> URL: https://issues.apache.org/jira/browse/SPARK-23709
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Olivier Sannier
>Priority: Critical
>
> When using a bagging method like RandomForest, the theory dictates that the 
> source dataset is copied over with a subsample of rows.
> To avoid excessive memory usage, Spark uses the BaggedPoint concept, where 
> each row is associated with a weight for the final dataset, i.e. one weight 
> per tree requested of the RandomForest.
> RandomForest requires that the dataset for each tree is a random draw with 
> replacement from the source data that has the same size as the source data.
> However, during investigations, we found out that the count value used to 
> compute the variance is not always equal to the source data count; it is 
> sometimes less, sometimes more.
> I went digging in the source and found the 
> BaggedPoint.convertToBaggedRDDSamplingWithReplacement method, which uses a 
> Poisson distribution to assign a weight to each row. And this distribution 
> does not guarantee that the total of the weights for a given tree is equal to 
> the source dataset count.
> Looking around in here, it seems this is done for performance reasons because 
> the approximation it gives is good enough, especially when dealing with very 
> large datasets.
> However, I could not find any documentation that clearly explains this. Would 
> you have any link on the subject?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22481) CatalogImpl.refreshTable is slow

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22481:
--
  Priority: Major  (was: Critical)
Issue Type: Improvement  (was: Bug)

> CatalogImpl.refreshTable is slow
> 
>
> Key: SPARK-22481
> URL: https://issues.apache.org/jira/browse/SPARK-22481
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0
>Reporter: Ran Haim
>Priority: Major
>
> CatalogImpl.refreshTable was updated in 2.1.1 and since then it has become 
> really slow.
> The cause of the issue is that it now *always* creates a Dataset, and this 
> is redundant most of the time; we only need the Dataset if the table is 
> cached.
> before 2.1.1:
>   override def refreshTable(tableName: String): Unit = {
> val tableIdent = 
> sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
> // Temp tables: refresh (or invalidate) any metadata/data cached in the 
> plan recursively.
> // Non-temp tables: refresh the metadata cache.
> sessionCatalog.refreshTable(tableIdent)
> // If this table is cached as an InMemoryRelation, drop the original
> // cached version and make the new version cached lazily.
> val logicalPlan = 
> sparkSession.sessionState.catalog.lookupRelation(tableIdent)
> // Use lookupCachedData directly since RefreshTable also takes 
> databaseName.
> val isCached = 
> sparkSession.sharedState.cacheManager.lookupCachedData(logicalPlan).nonEmpty
> if (isCached) {
>   // Create a data frame to represent the table.
>   // TODO: Use uncacheTable once it supports database name.
>  {color:red} val df = Dataset.ofRows(sparkSession, logicalPlan){color}
>   // Uncache the logicalPlan.
>   sparkSession.sharedState.cacheManager.uncacheQuery(df, blocking = true)
>   // Cache it again.
>   sparkSession.sharedState.cacheManager.cacheQuery(df, 
> Some(tableIdent.table))
> }
>   }
> after 2.1.1:
>override def refreshTable(tableName: String): Unit = {
> val tableIdent = 
> sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName)
> // Temp tables: refresh (or invalidate) any metadata/data cached in the 
> plan recursively.
> // Non-temp tables: refresh the metadata cache.
> sessionCatalog.refreshTable(tableIdent)
> // If this table is cached as an InMemoryRelation, drop the original
> // cached version and make the new version cached lazily.
> {color:red}   val table = sparkSession.table(tableIdent){color}
> if (isCached(table)) {
>   // Uncache the logicalPlan.
>   sparkSession.sharedState.cacheManager.uncacheQuery(table, blocking = 
> true)
>   // Cache it again.
>   sparkSession.sharedState.cacheManager.cacheQuery(table, 
> Some(tableIdent.table))
> }
>   }



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12963.
---
Resolution: Not A Problem

> In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' 
> failed  after 16 retries!
> -
>
> Key: SPARK-12963
> URL: https://issues.apache.org/jira/browse/SPARK-12963
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0
>Reporter: lichenglin
>Priority: Critical
>
> I have a 3-node cluster: namenode, second and data1.
> I use this shell command to submit the job on namenode:
> bin/spark-submit   --deploy-mode cluster --class com.bjdv.spark.job.Abc  
> --total-executor-cores 5  --master spark://namenode:6066
> hdfs://namenode:9000/sparkjars/spark.jar
> The driver may be started on another node, such as data1.
> The problem is:
> when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode,
> the driver will be started with this parameter, i.e.
> SPARK_LOCAL_IP=namenode,
> but the driver may start on data1,
> where it will try to bind the IP 'namenode',
> so the driver will throw an exception like this:
>  Service 'Driver' failed  after 16 retries!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20697:
--
Priority: Major  (was: Critical)

This sounds like Hive functionality though; is it even resolvable in Spark?

> MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
> --
>
> Key: SPARK-20697
> URL: https://issues.apache.org/jira/browse/SPARK-20697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.2.1, 2.3.0
>Reporter: Abhishek Madav
>Priority: Major
>
> MSCK REPAIR TABLE, when used to recover partitions for a partitioned+bucketed 
> table, does not restore the bucketing information to the storage descriptor in 
> the metastore. 
> Steps to reproduce:
> 1) Create a paritioned+bucketed table in hive: CREATE TABLE partbucket(a int) 
> PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED 
> FIELDS TERMINATED BY ',';
> 2) In Hive-CLI issue a desc formatted for the table.
> # col_name  data_type   comment 
>
> a int 
>
> # Partition Information
> # col_name  data_type   comment 
>
> b int 
>
> # Detailed Table Information   
> Database: sparkhivebucket  
> Owner:devbld   
> CreateTime:   Wed May 10 10:31:07 PDT 2017 
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: hdfs://localhost:8020/user/hive/warehouse/partbucket 
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   transient_lastDdlTime   1494437467  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
>  
> InputFormat:  org.apache.hadoop.mapred.TextInputFormat 
> OutputFormat: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
> Compressed:   No   
> Num Buckets:  10   
> Bucket Columns:   [a]  
> Sort Columns: []   
> Storage Desc Params:   
>   field.delim ,   
>   serialization.format, 
> 3) In spark-shell, 
> scala> spark.sql("MSCK REPAIR TABLE partbucket")
> 4) Back to Hive-CLI 
> desc formatted partbucket;
> # col_name  data_type   comment 
>
> a int 
>
> # Partition Information
> # col_name  data_type   comment 
>
> b int 
>
> # Detailed Table Information   
> Database: sparkhivebucket  
> Owner:devbld   
> CreateTime:   Wed May 10 10:31:07 PDT 2017 
> LastAccessTime:   UNKNOWN  
> Protect Mode: None 
> Retention:0
> Location: 
> hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket 
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   spark.sql.partitionProvider catalog 
>   transient_lastDdlTime   1494437647  
>
> # Storage Information  
> SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
>  
> InputFormat:  org.apache.hadoop.mapred.TextInputFormat 
> OutputFormat: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
> Compressed:   No   
> Num Buckets:  -1   
> Bucket Columns:   []   
> Sort Columns: []   
> Storage Desc Params:   
>   field.delim ,   
>   serialization.format, 
> Further inserts to this table cannot be made in bucketed fashion through 
> Hive. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23530) It's not appropriate to let the original master exit while the leader of zookeeper shutdown

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23530:
--
Priority: Major  (was: Critical)

What is the consequence though?

> It's not appropriate to let the original master exit while the leader of 
> zookeeper shutdown
> ---
>
> Key: SPARK-23530
> URL: https://issues.apache.org/jira/browse/SPARK-23530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.3.0
>Reporter: liuxianjiao
>Priority: Major
>
> When the ZooKeeper leader shuts down, Spark's current behavior is to let the 
> master exit in order to revoke its leadership. However, this sacrifices a 
> master node. Following the approach of Hadoop and Storm, the previously active 
> master should instead become standby, or a re-election should be held for the 
> Spark master, or leadership should be revoked gracefully in some other way.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5594.
--
Resolution: Cannot Reproduce

Closing a very old issue here; reopen if it happens on master

> SparkException: Failed to get broadcast (TorrentBroadcast)
> --
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: John Sandiford
>Priority: Critical
>
> I am uncertain whether this is a bug; however, I am getting the error below 
> when running on a cluster (it works locally) and have no idea what is causing 
> it or where to look for more information.
> Any help is appreciated.  Others appear to experience the same issue, but I 
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable; all 
> my other Spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: 
> Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of 
> broadcast_6
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 
> of broadcast_6
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
> ... 11 more
> {noformat}
> Driver stacktrace:
> {noformat}
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
> at scala.Option.foreach(Option.scala:236)
> at 
> 

[jira] [Resolved] (SPARK-23709) BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the sum of weights

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23709.
---
Resolution: Not A Problem

This is a question for the mailing list rather than JIRA at this stage.

> BaggedPoint.convertToBaggedRDDSamplingWithReplacement does not guarantee the 
> sum of weights
> ---
>
> Key: SPARK-23709
> URL: https://issues.apache.org/jira/browse/SPARK-23709
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Olivier Sannier
>Priority: Critical
>
> When using a bagging method like RandomForest, the theory dictates that the 
> source dataset is copied over with a subsample of rows.
> To avoid excessive memory usage, Spark uses the BaggedPoint concept, where 
> each row is associated with a weight for the final dataset, i.e. one weight 
> per tree requested of the RandomForest.
> RandomForest requires that the dataset for each tree is a random draw with 
> replacement from the source data that has the same size as the source data.
> However, during investigations, we found out that the count value used to 
> compute the variance is not always equal to the source data count; it is 
> sometimes less, sometimes more.
> I went digging in the source and found the 
> BaggedPoint.convertToBaggedRDDSamplingWithReplacement method, which uses a 
> Poisson distribution to assign a weight to each row. And this distribution 
> does not guarantee that the total of the weights for a given tree is equal to 
> the source dataset count.
> Looking around in here, it seems this is done for performance reasons because 
> the approximation it gives is good enough, especially when dealing with very 
> large datasets.
> However, I could not find any documentation that clearly explains this. Would 
> you have any link on the subject?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24985) Executing SQL with "Full Outer Join" on top of large tables when there is data skew met OOM

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24985:
--
Priority: Major  (was: Critical)

> Executing SQL with "Full Outer Join" on top of large tables when there is 
> data skew met OOM
> ---
>
> Key: SPARK-24985
> URL: https://issues.apache.org/jira/browse/SPARK-24985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: sheperd huang
>Priority: Major
>
> When we run SQL with a "Full Outer Join" on large tables where there is data 
> skew, we found it is quite easy to hit OOM. We first thought we had hit 
> https://issues.apache.org/jira/browse/SPARK-13450, but taking a look at the fix 
> in [https://github.com/apache/spark/pull/16909], we found that PR hasn't 
> handled the "Full Outer Join" case.
> The root cause of the OOM is that there are a lot of rows with the same key.
> See the code below:
> {code:java}
> private def findMatchingRows(matchingKey: InternalRow): Unit = {
>   leftMatches.clear()
>   rightMatches.clear()
>   leftIndex = 0
>   rightIndex = 0
>   while (leftRowKey != null && keyOrdering.compare(leftRowKey, matchingKey) 
> == 0){
>   leftMatches += leftRow.copy()
>   advancedLeft()
> }
>   while (rightRowKey != null && keyOrdering.compare(rightRowKey, matchingKey) 
> == 0) {
>  rightMatches += rightRow.copy()
>  advancedRight()
> }
> {code}
> It seems we haven't limited the data added to leftMatches and rightMatches.
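
For concreteness, a hypothetical skew scenario of the shape described (the sizes 
and column names are made up, not taken from the reporter's workload; this assumes 
the implicit `spark` session of spark-shell):

{code:scala}
// One hot key on both sides: the sort-merge full outer join must buffer every
// matching row for that key in leftMatches/rightMatches before emitting pairs.
val hotLeft  = spark.range(0, 2000000).selectExpr("1 AS k", "id AS l")
val hotRight = spark.range(0, 2000000).selectExpr("1 AS k", "id AS r")
val joined = hotLeft.join(hotRight, Seq("k"), "full_outer")
joined.explain()
// Executing this plan buffers ~2M rows per side for key 1 in each task,
// regardless of executor memory settings, which is what blows up the heap.
{code}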



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24906) Adaptively set split size for columnar file to ensure the task read data size fit expectation

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24906:
--
Priority: Major  (was: Critical)

> Adaptively set split size for columnar file to ensure the task read data size 
> fit expectation
> -
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Major
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, 
> image-2018-07-24-20-30-24-552.png
>
>
> For a columnar file format, when Spark SQL reads a table, each split will be 
> 128 MB by default since spark.sql.files.maxPartitionBytes defaults to 128 MB. 
> Even when the user sets it to a larger value, such as 512 MB, a task may read 
> only a few MB or even hundreds of KB, because the table (Parquet) may consist 
> of dozens of columns while the SQL only needs a few of them, and Spark will 
> prune the unnecessary columns.
>  
> In this case, Spark's DataSourceScanExec can enlarge maxPartitionBytes 
> adaptively. 
> For example, suppose there are 40 columns: 20 are integers and the other 20 
> are longs. When a query uses one integer column and one long column, 
> maxPartitionBytes should be 20 times larger: (20*4 + 20*8) / (4 + 8) = 20 
> (see the sketch after this description).
>  
> With this optimization, the number of tasks will be smaller and the job will 
> run faster. More importantly, for a very large cluster (more than 10 thousand 
> nodes), it will relieve the resource manager's scheduling pressure.
>  
> Here is the test
>  
> The table named test2 has more than 40 columns and there are more than 5 TB 
> of data each hour.
> When we issue a very simple query 
>  
> {code:java}
> select count(device_id) from test2 where date=20180708 and hour='23'{code}
>  
> There are 72176 tasks and the duration of the job is 4.8 minutes
> !image-2018-07-24-20-26-32-441.png!
>  
> Most tasks last less than 1 second and read less than 1.5 MB of data
> !image-2018-07-24-20-28-06-269.png!
>  
> After the optimization, there are only 1615 tasks and the job lasts only 30 
> seconds. It is almost 10 times faster.
> !image-2018-07-24-20-29-24-797.png!
>  
> The median amount of data read is 44.2 MB. 
> !image-2018-07-24-20-30-24-552.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25309) Sci-kit Learn like Auto Pipeline Parallelization in Spark

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25309:
--
Priority: Minor  (was: Critical)

> Sci-kit Learn like Auto Pipeline Parallelization in Spark 
> --
>
> Key: SPARK-25309
> URL: https://issues.apache.org/jira/browse/SPARK-25309
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.1
>Reporter: Ravi
>Priority: Minor
>
> SPARK-19357 and SPARK-21911 have helped parallelize Pipelines in Spark. 
> However, instead of setting the parallelism parameter on the CrossValidator, 
> it would be good to have something like n_jobs=-1 (as in scikit-learn), where 
> the Pipeline DAG could be automatically parallelized and scheduled based on 
> the resources allocated to the Spark session, instead of having the user pick 
> an integer value for this parameter (see the sketch below). 
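
For context, a minimal sketch of the API as it stands today (the estimator and 
parameter values are illustrative):

{code:scala}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array[PipelineStage](lr))
val grid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.01, 0.1)).build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setParallelism(4)  // today: a hand-picked integer; the request is an "auto"
                      // mode (n_jobs = -1 style) derived from session resources
{code}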



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25917) Spark UI's executors page loads forever when memoryMetrics in None. Fix is to JSON ignore memorymetrics when it is None.

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25917:
--
  Priority: Major  (was: Critical)
Issue Type: Improvement  (was: Bug)

> Spark UI's executors page loads forever when memoryMetrics in None. Fix is to 
> JSON ignore memorymetrics when it is None.
> 
>
> Key: SPARK-25917
> URL: https://issues.apache.org/jira/browse/SPARK-25917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: Rong Tang
>Priority: Major
>
> Spark UI's executors page loads forever when memoryMetrics is None. The fix is 
> to JSON-ignore memoryMetrics when it is None.
> ## How was this patch tested?
> Before fix: (loads forever)
> ![image](https://user-images.githubusercontent.com/1785565/47875681-64dfe480-ddd4-11e8-8d15-5ed1457bc24f.png)
> After fix:
> ![image](https://user-images.githubusercontent.com/1785565/47875691-6b6e5c00-ddd4-11e8-9895-db8dd9730ee1.png)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25548) In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field with true in the And(partitionOps, nonPartitionOps) to make the partition can be pruned

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25548:
--
Priority: Major  (was: Critical)

> In the PruneFileSourcePartitions optimizer, replace the nonPartitionOps field 
> with true in the And(partitionOps, nonPartitionOps) to make the partition can 
> be pruned
> -
>
> Key: SPARK-25548
> URL: https://issues.apache.org/jira/browse/SPARK-25548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: eaton
>Assignee: Apache Spark
>Priority: Major
>
> In the PruneFileSourcePartitions optimizer, the partition files will not be 
> pruned if we use a partition filter and a non-partition filter together, for 
> example:
> sql("CREATE TABLE IF NOT EXISTS src_par (key INT, value STRING) partitioned 
> by(p_d int) stored as parquet ")
>  sql("insert overwrite table src_par partition(p_d=2) select 2 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=3) select 3 as key, '4' as 
> value")
>  sql("insert overwrite table src_par partition(p_d=4) select 4 as key, '4' as 
> value")
> The SQL below will scan all the partition files, even though the partition 
> **p_d=4** should be pruned (see the sketch after this description).
>  **sql("select * from src_par where (p_d=2 and key=2) or (p_d=3 and 
> key=3)").show**



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24703) Unable to multiply calendar interval

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24703:
--
  Priority: Minor  (was: Critical)
Issue Type: Improvement  (was: Bug)

Unless it works in Hive or is standard SQL, I don't think this can be 
considered a bug. SQL syntax varies across RDBMSes.

> Unable to multiply calendar interval
> 
>
> Key: SPARK-24703
> URL: https://issues.apache.org/jira/browse/SPARK-24703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Priyanka Garg
>Priority: Minor
>
> When I try to multiply a calendar interval by a long/int, I get the error 
> below. The same syntax is supported in Postgres.
> spark.sql("select  interval '1' day * 3").show()
> org.apache.spark.sql.AnalysisException: cannot resolve '(3 * interval 1 
> days)' due to data type mismatch: differing types in '(interval 1 days) * 3' 
> (int and calendarinterval).; line 1 pos 7;
> 'Project [unresolvedalias((interval 1 days * 3) , None)]
> +- OneRowRelation
>  
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
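
One possible workaround sketch until interval arithmetic is supported: compute the 
multiplier outside of SQL and interpolate it into the interval literal 
(illustrative only):

{code:scala}
val days = 1 * 3
spark.sql(s"SELECT current_timestamp() + INTERVAL $days DAYS AS three_days_out").show()
{code}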



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-14492.
-

> Spark SQL 1.6.0 does not work with external Hive metastore version lower than 
> 1.2.0; its not backwards compatible with earlier version
> --
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL, when configured with a Hive version lower than 1.2.0, throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0, so it is not possible to use 
> a Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14492.
---
Resolution: Not A Problem

This should stay closed. The reports here are about building Spark against varying 
versions of Hive, not about connecting to different Hive metastore versions. The 
former does not necessarily work.

> Spark SQL 1.6.0 does not work with external Hive metastore version lower than 
> 1.2.0; its not backwards compatible with earlier version
> --
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL, when configured with a Hive version lower than 1.2.0, throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0, so it is not possible to use 
> a Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26034) Break large mllib/tests.py files into smaller files

2018-11-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26034.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23056
[https://github.com/apache/spark/pull/23056]

> Break large mllib/tests.py files into smaller files
> ---
>
> Key: SPARK-26034
> URL: https://issues.apache.org/jira/browse/SPARK-26034
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25934) Mesos: SPARK_CONF_DIR should not be propogated by spark submit

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25934:
-

Assignee: Matt Molek

> Mesos: SPARK_CONF_DIR should not be propogated by spark submit
> --
>
> Key: SPARK-25934
> URL: https://issues.apache.org/jira/browse/SPARK-25934
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.2
>Reporter: Matt Molek
>Assignee: Matt Molek
>Priority: Major
> Fix For: 2.3.3, 2.4.1, 3.0.0
>
>
> This is very similar to how SPARK_HOME caused problems for spark on Mesos in 
> SPARK-12345
> The `spark-submit` command is setting spark.mesos.driverEnv.SPARK_CONF_DIR to 
> whatever the SPARK_CONF_DIR was for the command that submitted the job.
> This doesn't make sense for most Mesos situations, and it broke Spark for 
> my team when we upgraded from 2.2.0 to 2.3.2. I haven't tested it, but I think 
> 2.4.0 will have the same issue.
> It prevents spark-env.sh from running because SPARK_CONF_DIR now points 
> to some non-existent directory, instead of the unpacked Spark binary in the 
> Mesos sandbox like it should.
> I'm not that familiar with the Spark code base, but I think this could be 
> fixed by simply adding a `&& k != "SPARK_CONF_DIR"` clause (sketched below) to 
> this filter statement: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionClient.scala#L421
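
A sketch of the kind of change being suggested (the method below is a simplified 
stand-in, not the actual RestSubmissionClient code):

{code:scala}
// Only propagate SPARK_* variables that make sense on the remote side; the
// reporter's suggestion is the added SPARK_CONF_DIR exclusion.
def filterSubmitEnv(env: Map[String, String]): Map[String, String] = {
  env.filter { case (k, _) =>
    k.startsWith("SPARK_") && k != "SPARK_HOME" && k != "SPARK_CONF_DIR"
  }
}
{code}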



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25934) Mesos: SPARK_CONF_DIR should not be propogated by spark submit

2018-11-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25934.
---
   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0
   2.3.3

Issue resolved by pull request 22937
[https://github.com/apache/spark/pull/22937]

> Mesos: SPARK_CONF_DIR should not be propogated by spark submit
> --
>
> Key: SPARK-25934
> URL: https://issues.apache.org/jira/browse/SPARK-25934
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.2
>Reporter: Matt Molek
>Assignee: Matt Molek
>Priority: Major
> Fix For: 2.3.3, 3.0.0, 2.4.1
>
>
> This is very similar to how SPARK_HOME caused problems for spark on Mesos in 
> SPARK-12345
> The `spark-submit` command is setting spark.mesos.driverEnv.SPARK_CONF_DIR to 
> whatever the SPARK_CONF_DIR was for the command that submitted the job.
> This doesn't make sense for most Mesos situations, and it broke Spark for 
> my team when we upgraded from 2.2.0 to 2.3.2. I haven't tested it, but I think 
> 2.4.0 will have the same issue.
> It prevents spark-env.sh from running because SPARK_CONF_DIR now points 
> to some non-existent directory, instead of the unpacked Spark binary in the 
> Mesos sandbox like it should.
> I'm not that familiar with the spark code base, but I think this could be 
> fixed by simply adding a `&& k != "SPARK_CONF_DIR"` clause to this filter 
> statement: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionClient.scala#L421



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2018-11-16 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689556#comment-16689556
 ] 

Imran Rashid commented on SPARK-4105:
-

FYI I opened SPARK-26089 for handling large corrupt shuffle blocks, as we 
recently ran into that and I don't see any other issue tracking it.

> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1, 1.5.1, 1.6.1, 2.0.0
>Reporter: Josh Rosen
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 2.2.0
>
> Attachments: JavaObjectToSerialize.java, 
> SparkFailedToUncompressGenerator.scala
>
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> 

[jira] [Commented] (SPARK-26089) Handle large corrupt shuffle blocks

2018-11-16 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689554#comment-16689554
 ] 

Imran Rashid commented on SPARK-26089:
--

cc [~joshrosen] [~zsxwing] as you looked at SPARK-4105

also [~tgraves] [~Dhruve Ashar] [~saisai_shao] as you've been looking at other 
aspects of dealing w/ hardware failures.

> Handle large corrupt shuffle blocks
> ---
>
> Key: SPARK-26089
> URL: https://issues.apache.org/jira/browse/SPARK-26089
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> We've seen a bad disk lead to corruption in a shuffle block, which led to 
> tasks repeatedly failing with an IOException after fetching the data.  The 
> tasks get retried, but the same corrupt data gets fetched again, and the 
> tasks keep failing.  As there isn't a fetch failure, the jobs eventually 
> fail and Spark never tries to regenerate the shuffle data.
> This is the same as SPARK-4105, but that fix only covered small blocks.  
> There was some discussion during that change about this limitation 
> (https://github.com/apache/spark/pull/15923#discussion_r88756017) and 
> followups to cover larger blocks (which would involve spilling to disk to 
> avoid OOM), but it looks like that never happened.
> I can think of a few approaches to this:
> 1) wrap the shuffle block input stream with another input stream that 
> converts all exceptions into FetchFailures.  This is similar to the fix for 
> SPARK-4105, but that reads the entire input stream up front; instead I'm 
> proposing to do it within the InputStream itself so it is streaming and does 
> not have a large memory overhead.
> 2) Add checksums to shuffle blocks.  This was proposed 
> [here|https://github.com/apache/spark/pull/15894] and abandoned as being too 
> complex.
> 3) Try to tackle this with blacklisting instead: when there is any failure in 
> a task that is reading shuffle data, assign some "blame" to the source of the 
> shuffle data, and eventually blacklist the source.  It seems really tricky to 
> get sensible heuristics for this, though.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26089) Handle large corrupt shuffle blocks

2018-11-16 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-26089:


 Summary: Handle large corrupt shuffle blocks
 Key: SPARK-26089
 URL: https://issues.apache.org/jira/browse/SPARK-26089
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, Shuffle, Spark Core
Affects Versions: 2.4.0
Reporter: Imran Rashid


We've seen a bad disk lead to corruption in a shuffle block, which led to 
tasks repeatedly failing with an IOException after fetching the data.  The 
tasks get retried, but the same corrupt data gets fetched again, and the tasks 
keep failing.  As there isn't a fetch failure, the jobs eventually fail and 
Spark never tries to regenerate the shuffle data.

This is the same as SPARK-4105, but that fix only covered small blocks.  There 
was some discussion during that change about this limitation 
(https://github.com/apache/spark/pull/15923#discussion_r88756017) and followups 
to cover larger blocks (which would involve spilling to disk to avoid OOM), but 
it looks like that never happened.

I can think of a few approaches to this:

1) wrap the shuffle block input stream with another input stream that converts 
all exceptions into FetchFailures (see the sketch after this list).  This is 
similar to the fix for SPARK-4105, but that reads the entire input stream up 
front; instead I'm proposing to do it within the InputStream itself so it is 
streaming and does not have a large memory overhead.

2) Add checksums to shuffle blocks.  This was proposed 
[here|https://github.com/apache/spark/pull/15894] and abandoned as being too 
complex.

3) Try to tackle this with blacklisting instead: when there is any failure in a 
task that is reading shuffle data, assign some "blame" to the source of the 
shuffle data, and eventually blacklist the source.  It seems really tricky to 
get sensible heuristics for this, though.
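
A minimal sketch of approach (1), assuming the wrapper would live in the shuffle 
fetch path and report failures via Spark's FetchFailedException (the class and 
parameter names here are illustrative):

{code:scala}
import java.io.{IOException, InputStream}

// Streams through the fetched block; any mid-read IOException (e.g. a decompression
// failure on a corrupt block) is reported as a fetch failure instead of a plain
// task failure, so the upstream shuffle output gets regenerated.
class FetchFailureOnErrorStream(
    in: InputStream,
    reportFetchFailure: IOException => Nothing) extends InputStream {

  override def read(): Int =
    try in.read() catch { case e: IOException => reportFetchFailure(e) }

  override def read(b: Array[Byte], off: Int, len: Int): Int =
    try in.read(b, off, len) catch { case e: IOException => reportFetchFailure(e) }

  override def close(): Unit = in.close()
}
{code}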



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26088) DataSourceV2 should expose row count and attribute statistics

2018-11-16 Thread Martin Junghanns (JIRA)
Martin Junghanns created SPARK-26088:


 Summary: DataSourceV2 should expose row count and attribute 
statistics
 Key: SPARK-26088
 URL: https://issues.apache.org/jira/browse/SPARK-26088
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Martin Junghanns


During investigation of CBO and DataSourceV2, we found that
{code}
org.apache.spark.sql.sources.v2.reader.Statistics
{code}
misses attribute/column statistics and that
{code}
DataSourceV2Relation#computeStats
{code}
wraps
{code}
org.apache.spark.sql.sources.v2.reader.Statistics
{code}
into
{code}
org.apache.spark.sql.catalyst.plans.logical.Statistics
{code}
without forwarding the optional {{rowCount}} if present.

However, {{rowCount}} and {{attributeStats}} are used during CBO, e.g. in 
{{JoinEstimation}} and {{AggregateEstimation}}.

We propose that:
* {{org.apache.spark.sql.sources.v2.reader.Statistics}} mirrors 
{{org.apache.spark.sql.catalyst.plans.logical.Statistics}} 
* {{DataSourceV2Relation}} forwards all the information to be available during 
CBO



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25799) DataSourceApiV2 scan reuse does not respect options

2018-11-16 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Kießling resolved SPARK-25799.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

> DataSourceApiV2 scan reuse does not respect options
> ---
>
> Key: SPARK-25799
> URL: https://issues.apache.org/jira/browse/SPARK-25799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Max Kießling
>Priority: Major
> Fix For: 2.4.0
>
>
> When creating a custom data source with the Data Source API V2 it seems that 
> the computation of possible scan reuses is broken when the same data source 
> is used but configured with different configuration options. In the case when 
> both scans produce the same schema (which is always the case for count 
> queries with column pruning enabled) the optimizer will reuse the scan 
> produced by one of the data source instances for both branches of the query. 
> This can lead to wrong results if the configuration option somehow influences 
> the returned data.
> The behavior can be reproduced with the following example:
> {code:scala}
> import org.apache.spark.sql.sources.v2.reader._
> import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, 
> ReadSupport}
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.{Row, SparkSession}
> import scala.tools.nsc.interpreter.JList
> class AdvancedDataSourceV2 extends DataSourceV2 with ReadSupport {
>   class Reader(rowCount: Int) extends DataSourceReader
> with SupportsPushDownRequiredColumns {
> var requiredSchema = new StructType().add("i", "int").add("j", "int")
> override def pruneColumns(requiredSchema: StructType): Unit = {
>   this.requiredSchema = requiredSchema
> }
> override def readSchema(): StructType = {
>   requiredSchema
> }
> override def createDataReaderFactories(): JList[DataReaderFactory[Row]] = 
> {
>   val res = new java.util.ArrayList[DataReaderFactory[Row]]
>   res.add(new AdvancedDataReaderFactory(0, 5, requiredSchema))
>   res.add(new AdvancedDataReaderFactory(5, rowCount, requiredSchema))
>   res
> }
>   }
>   override def createReader(options: DataSourceOptions): DataSourceReader =
> new Reader(options.get("rows").orElse("10").toInt)
> }
> class AdvancedDataReaderFactory(start: Int, end: Int, requiredSchema: 
> StructType)
>   extends DataReaderFactory[Row] with DataReader[Row] {
>   private var current = start - 1
>   override def createDataReader(): DataReader[Row] = {
> new AdvancedDataReaderFactory(start, end, requiredSchema)
>   }
>   override def close(): Unit = {}
>   override def next(): Boolean = {
> current += 1
> current < end
>   }
>   override def get(): Row = {
> val values = requiredSchema.map(_.name).map {
>   case "i" => current
>   case "j" => -current
> }
> Row.fromSeq(values)
>   }
> }
> object DataSourceTest extends App {
>   val spark = SparkSession.builder().master("local[*]").getOrCreate()
>   val cls = classOf[AdvancedDataSourceV2]
>   val with100 = spark.read.format(cls.getName).option("rows", 100).load()
>   val with10 = spark.read.format(cls.getName).option("rows", 10).load()
>   assert(with100.union(with10).count == 110)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25799) DataSourceApiV2 scan reuse does not respect options

2018-11-16 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-25799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689409#comment-16689409
 ] 

Max Kießling commented on SPARK-25799:
--

Apparently this issue is already resolved with the new 2.4.0 release. Thanks!

> DataSourceApiV2 scan reuse does not respect options
> ---
>
> Key: SPARK-25799
> URL: https://issues.apache.org/jira/browse/SPARK-25799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Max Kießling
>Priority: Major
>
> When creating a custom data source with the Data Source API V2 it seems that 
> the computation of possible scan reuses is broken when the same data source 
> is used but configured with different configuration options. In the case when 
> both scans produce the same schema (which is always the case for count 
> queries with column pruning enabled) the optimizer will reuse the scan 
> produced by one of the data source instances for both branches of the query. 
> This can lead to wrong results if the configuration option somehow influences 
> the returned data.
> The behavior can be reproduced with the following example:
> {code:scala}
> import org.apache.spark.sql.sources.v2.reader._
> import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, 
> ReadSupport}
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.{Row, SparkSession}
> import scala.tools.nsc.interpreter.JList
> class AdvancedDataSourceV2 extends DataSourceV2 with ReadSupport {
>   class Reader(rowCount: Int) extends DataSourceReader
> with SupportsPushDownRequiredColumns {
> var requiredSchema = new StructType().add("i", "int").add("j", "int")
> override def pruneColumns(requiredSchema: StructType): Unit = {
>   this.requiredSchema = requiredSchema
> }
> override def readSchema(): StructType = {
>   requiredSchema
> }
> override def createDataReaderFactories(): JList[DataReaderFactory[Row]] = 
> {
>   val res = new java.util.ArrayList[DataReaderFactory[Row]]
>   res.add(new AdvancedDataReaderFactory(0, 5, requiredSchema))
>   res.add(new AdvancedDataReaderFactory(5, rowCount, requiredSchema))
>   res
> }
>   }
>   override def createReader(options: DataSourceOptions): DataSourceReader =
> new Reader(options.get("rows").orElse("10").toInt)
> }
> class AdvancedDataReaderFactory(start: Int, end: Int, requiredSchema: 
> StructType)
>   extends DataReaderFactory[Row] with DataReader[Row] {
>   private var current = start - 1
>   override def createDataReader(): DataReader[Row] = {
> new AdvancedDataReaderFactory(start, end, requiredSchema)
>   }
>   override def close(): Unit = {}
>   override def next(): Boolean = {
> current += 1
> current < end
>   }
>   override def get(): Row = {
> val values = requiredSchema.map(_.name).map {
>   case "i" => current
>   case "j" => -current
> }
> Row.fromSeq(values)
>   }
> }
> object DataSourceTest extends App {
>   val spark = SparkSession.builder().master("local[*]").getOrCreate()
>   val cls = classOf[AdvancedDataSourceV2]
>   val with100 = spark.read.format(cls.getName).option("rows", 100).load()
>   val with10 = spark.read.format(cls.getName).option("rows", 10).load()
>   assert(with100.union(with10).count == 110)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION

2018-11-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689368#comment-16689368
 ] 

Apache Spark commented on SPARK-26078:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/23057

> WHERE .. IN fails to filter rows when used in combination with UNION
> 
>
> Key: SPARK-26078
> URL: https://issues.apache.org/jira/browse/SPARK-26078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Arttu Voutilainen
>Priority: Blocker
>  Labels: correctness
>
> Hey,
> We encountered a case where Spark SQL does not seem to handle WHERE .. IN 
> correctly when used in combination with UNION, but instead also returns rows 
> that do not fulfill the condition. Swapping the order of the datasets in the 
> UNION makes the problem go away. Repro below:
>  
> {code}
> sql = SQLContext(sc)
> a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> a.registerTempTable('a')
> b.registerTempTable('b')
> bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'a' as source FROM a
> UNION ALL
> SELECT id, num, 'b' as source FROM b
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> no_bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'b' as source FROM b
> UNION ALL
> SELECT id, num, 'a' as source FROM a
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> bug.show()
> no_bug.show()
> bug.explain(True)
> no_bug.explain(True)
> {code}
> This results in one extra row in the "bug" DF coming from DF "b", which should 
> not be there as it does not satisfy the WHERE .. IN condition:
> {code:java}
> >>> bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| a|
> |  a|  2| b|
> |  b|  1| b|
> +---+---+--+
> >>> no_bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| b|
> |  a|  2| a|
> +---+---+--+
> {code}
>  The reason can be seen in the query plans:
> {code:java}
> >>> bug.explain(True)
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#0, num#1L, a AS source#136]
> :  +- Join LeftSemi, (id#0 = id#4)
> : :- LogicalRDD [id#0, num#1L], false
> : +- Project [id#4]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Join LeftSemi, (id#4#172 = id#4#172)
>:- Project [id#4, num#5L, b AS source#137]
>:  +- LogicalRDD [id#4, num#5L], false
>+- Project [id#4 AS id#4#172]
>   +- Filter (isnotnull(num#5L) && (num#5L = 2))
>  +- LogicalRDD [id#4, num#5L], false
> {code}
> Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition 
> seems wrong, and I believe it causes the LeftSemi to return true for all rows 
> in the left-hand-side table, thus failing to filter as the WHERE .. IN 
> should. Compare with the non-buggy version, where both LeftSemi joins have 
> distinct #-things on both sides:
> {code:java}
> >>> no_bug.explain()
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#4, num#5L, b AS source#142]
> :  +- Join LeftSemi, (id#4 = id#4#173)
> : :- LogicalRDD [id#4, num#5L], false
> : +- Project [id#4 AS id#4#173]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Project [id#0, num#1L, a AS source#143]
>+- Join LeftSemi, (id#0 = id#4#173)
>   :- LogicalRDD [id#0, num#1L], false
>   +- Project [id#4 AS id#4#173]
>  +- Filter (isnotnull(num#5L) && (num#5L = 2))
> +- LogicalRDD [id#4, num#5L], false
> {code}
>  
> Best,
> -Arttu 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26078:


Assignee: Apache Spark

> WHERE .. IN fails to filter rows when used in combination with UNION
> 
>
> Key: SPARK-26078
> URL: https://issues.apache.org/jira/browse/SPARK-26078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Arttu Voutilainen
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
>
> Hey,
> We encountered a case where Spark SQL does not seem to handle WHERE .. IN 
> correctly when used in combination with UNION, but instead also returns rows 
> that do not fulfill the condition. Swapping the order of the datasets in the 
> UNION makes the problem go away. Repro below:
>  
> {code}
> sql = SQLContext(sc)
> a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> a.registerTempTable('a')
> b.registerTempTable('b')
> bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'a' as source FROM a
> UNION ALL
> SELECT id, num, 'b' as source FROM b
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> no_bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'b' as source FROM b
> UNION ALL
> SELECT id, num, 'a' as source FROM a
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> bug.show()
> no_bug.show()
> bug.explain(True)
> no_bug.explain(True)
> {code}
> This results in one extra row in the "bug" DF coming from DF "b", which should 
> not be there as it does not satisfy the WHERE .. IN condition:
> {code:java}
> >>> bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| a|
> |  a|  2| b|
> |  b|  1| b|
> +---+---+--+
> >>> no_bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| b|
> |  a|  2| a|
> +---+---+--+
> {code}
>  The reason can be seen in the query plans:
> {code:java}
> >>> bug.explain(True)
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#0, num#1L, a AS source#136]
> :  +- Join LeftSemi, (id#0 = id#4)
> : :- LogicalRDD [id#0, num#1L], false
> : +- Project [id#4]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Join LeftSemi, (id#4#172 = id#4#172)
>:- Project [id#4, num#5L, b AS source#137]
>:  +- LogicalRDD [id#4, num#5L], false
>+- Project [id#4 AS id#4#172]
>   +- Filter (isnotnull(num#5L) && (num#5L = 2))
>  +- LogicalRDD [id#4, num#5L], false
> {code}
> Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition 
> seems wrong, and I believe it causes the LeftSemi to return true for all rows 
> in the left-hand-side table, thus failing to filter as the WHERE .. IN 
> should. Compare with the non-buggy version, where both LeftSemi joins have 
> distinct #-things on both sides:
> {code:java}
> >>> no_bug.explain()
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#4, num#5L, b AS source#142]
> :  +- Join LeftSemi, (id#4 = id#4#173)
> : :- LogicalRDD [id#4, num#5L], false
> : +- Project [id#4 AS id#4#173]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Project [id#0, num#1L, a AS source#143]
>+- Join LeftSemi, (id#0 = id#4#173)
>   :- LogicalRDD [id#0, num#1L], false
>   +- Project [id#4 AS id#4#173]
>  +- Filter (isnotnull(num#5L) && (num#5L = 2))
> +- LogicalRDD [id#4, num#5L], false
> {code}
>  
> Best,
> -Arttu 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION

2018-11-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689370#comment-16689370
 ] 

Apache Spark commented on SPARK-26078:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/23057

> WHERE .. IN fails to filter rows when used in combination with UNION
> 
>
> Key: SPARK-26078
> URL: https://issues.apache.org/jira/browse/SPARK-26078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Arttu Voutilainen
>Priority: Blocker
>  Labels: correctness
>
> Hey,
> We encountered a case where Spark SQL does not seem to handle WHERE .. IN 
> correctly when used in combination with UNION, but instead also returns rows 
> that do not fulfill the condition. Swapping the order of the datasets in the 
> UNION makes the problem go away. Repro below:
>  
> {code}
> sql = SQLContext(sc)
> a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> a.registerTempTable('a')
> b.registerTempTable('b')
> bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'a' as source FROM a
> UNION ALL
> SELECT id, num, 'b' as source FROM b
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> no_bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'b' as source FROM b
> UNION ALL
> SELECT id, num, 'a' as source FROM a
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> bug.show()
> no_bug.show()
> bug.explain(True)
> no_bug.explain(True)
> {code}
> This results in one extra row in the "bug" DF coming from DF "b", which should 
> not be there as it does not satisfy the WHERE .. IN condition:
> {code:java}
> >>> bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| a|
> |  a|  2| b|
> |  b|  1| b|
> +---+---+--+
> >>> no_bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| b|
> |  a|  2| a|
> +---+---+--+
> {code}
>  The reason can be seen in the query plans:
> {code:java}
> >>> bug.explain(True)
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#0, num#1L, a AS source#136]
> :  +- Join LeftSemi, (id#0 = id#4)
> : :- LogicalRDD [id#0, num#1L], false
> : +- Project [id#4]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Join LeftSemi, (id#4#172 = id#4#172)
>:- Project [id#4, num#5L, b AS source#137]
>:  +- LogicalRDD [id#4, num#5L], false
>+- Project [id#4 AS id#4#172]
>   +- Filter (isnotnull(num#5L) && (num#5L = 2))
>  +- LogicalRDD [id#4, num#5L], false
> {code}
> Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition 
> seems wrong, and I believe it causes the LeftSemi to return true for all rows 
> in the left-hand-side table, thus failing to filter as the WHERE .. IN 
> should. Compare with the non-buggy version, where both LeftSemi joins have 
> distinct #-things on both sides:
> {code:java}
> >>> no_bug.explain()
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#4, num#5L, b AS source#142]
> :  +- Join LeftSemi, (id#4 = id#4#173)
> : :- LogicalRDD [id#4, num#5L], false
> : +- Project [id#4 AS id#4#173]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Project [id#0, num#1L, a AS source#143]
>+- Join LeftSemi, (id#0 = id#4#173)
>   :- LogicalRDD [id#0, num#1L], false
>   +- Project [id#4 AS id#4#173]
>  +- Filter (isnotnull(num#5L) && (num#5L = 2))
> +- LogicalRDD [id#4, num#5L], false
> {code}
>  
> Best,
> -Arttu 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION

2018-11-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26078:


Assignee: (was: Apache Spark)

> WHERE .. IN fails to filter rows when used in combination with UNION
> 
>
> Key: SPARK-26078
> URL: https://issues.apache.org/jira/browse/SPARK-26078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Arttu Voutilainen
>Priority: Blocker
>  Labels: correctness
>
> Hey,
> We encountered a case where Spark SQL does not seem to handle WHERE .. IN 
> correctly when used in combination with UNION, but instead also returns rows 
> that do not fulfill the condition. Swapping the order of the datasets in the 
> UNION makes the problem go away. Repro below:
>  
> {code}
> sql = SQLContext(sc)
> a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> a.registerTempTable('a')
> b.registerTempTable('b')
> bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'a' as source FROM a
> UNION ALL
> SELECT id, num, 'b' as source FROM b
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> no_bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'b' as source FROM b
> UNION ALL
> SELECT id, num, 'a' as source FROM a
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> bug.show()
> no_bug.show()
> bug.explain(True)
> no_bug.explain(True)
> {code}
> This results in one extra row in the "bug" DF coming from DF "b", which should 
> not be there as it does not satisfy the WHERE .. IN condition:
> {code:java}
> >>> bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| a|
> |  a|  2| b|
> |  b|  1| b|
> +---+---+--+
> >>> no_bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| b|
> |  a|  2| a|
> +---+---+--+
> {code}
>  The reason can be seen in the query plans:
> {code:java}
> >>> bug.explain(True)
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#0, num#1L, a AS source#136]
> :  +- Join LeftSemi, (id#0 = id#4)
> : :- LogicalRDD [id#0, num#1L], false
> : +- Project [id#4]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Join LeftSemi, (id#4#172 = id#4#172)
>:- Project [id#4, num#5L, b AS source#137]
>:  +- LogicalRDD [id#4, num#5L], false
>+- Project [id#4 AS id#4#172]
>   +- Filter (isnotnull(num#5L) && (num#5L = 2))
>  +- LogicalRDD [id#4, num#5L], false
> {code}
> Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition 
> seems wrong, and I believe it causes the LeftSemi to return true for all rows 
> in the left-hand-side table, thus failing to filter as the WHERE .. IN 
> should. Compare with the non-buggy version, where both LeftSemi joins have 
> distinct #-things on both sides:
> {code:java}
> >>> no_bug.explain()
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#4, num#5L, b AS source#142]
> :  +- Join LeftSemi, (id#4 = id#4#173)
> : :- LogicalRDD [id#4, num#5L], false
> : +- Project [id#4 AS id#4#173]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Project [id#0, num#1L, a AS source#143]
>+- Join LeftSemi, (id#0 = id#4#173)
>   :- LogicalRDD [id#0, num#1L], false
>   +- Project [id#4 AS id#4#173]
>  +- Filter (isnotnull(num#5L) && (num#5L = 2))
> +- LogicalRDD [id#4, num#5L], false
> {code}
>  
> Best,
> -Arttu 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26087) Spark on K8S jobs fail when returning large result sets in client mode under certain conditions

2018-11-16 Thread Luca Canali (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-26087:

Description: 
*Short description*: Jobs on Spark on k8s fail when certain conditions are met: 
when using client mode with driver on a host external to K8s cluster and the 
task result size is greater than the value "maxDirectResultSize" and the 
executor blockmanagers in the k8s cluster are not routable from the driver.

*Background:* when the result of a task has to be serialized back to the driver 
three cases are possible: (1) failure if the result size is greater than 
MAX_RESULT_SIZE, (2) the result is serialized directly to the driver if the 
size is equal to or less than "maxDirectResultSize", (3) pointers to the task 
results in the block manager are sent back to the driver in the case of result 
size between "maxDirectResultSize" and MAX_RESULT_SIZE.

For behavior 3 to successfully complete (i.e. the driver is sent pointers into 
the block manager rather than data), the driver needs to be able to access the 
block manager address in the executor JVM. This currently fails in our 
configuration when using a Spark driver outside the K8S cluster, as the block 
manager addresses are private K8S pods addresses and thus not routable from the 
driver, at least in our configuration, but we think others may be affected.

How to *reproduce* the issue:
 * Use Spark on K8S with a driver located externally to the K8S cluster
 * Start the Spark session using --conf *spark.task.maxDirectResultSize*=<size>, 
for example with <size> = 100, to make this problem appear for most tasks, even 
those returning small results.

*Error* stack:

 
{code:java}
ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Connecting to /10.100.112.3:35973 timed out (3 ms)
{code}
Existing *workaround*:
 * Set *spark.task.maxDirectResultSize* to the same value as 
spark.driver.maxResultSize when running in the configuration affected by this 
bug (K8S with driver external to the cluster); see the sketch below.
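
For illustration only (the 1g value and the master URL are placeholders, not 
recommendations), aligning the two limits in a client-mode session could look 
like this:

{code:scala}
import org.apache.spark.sql.SparkSession

// Workaround sketch: keep every task result on the direct RPC path by raising
// spark.task.maxDirectResultSize to the driver-side cap, so the driver never has
// to fetch results from executor block managers it cannot reach.
val spark = SparkSession.builder()
  .master("k8s://https://kubernetes.example.com:6443")  // placeholder master URL
  .config("spark.driver.maxResultSize", "1g")
  .config("spark.task.maxDirectResultSize", "1g")       // same value as the driver cap
  .getOrCreate()
{code}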

Additional *notes*:
 * The executors are registered with routable public IPs but the executor 
block managers are registered with IPs private to K8S; that is why task 
assignment works but the driver cannot connect to the executor block manager to 
pull the final result (if the result is bigger than maxDirectResultSize).
 * Spark documentation correctly states "spark executors must be able to 
connect to the Spark driver over a hostname and a port that is routable from 
the Spark executors"; however, this case is about the Spark driver needing to 
connect to the executors' block managers.

  was:
*Short description*: Jobs on Spark on k8s fail when certain conditions are met: 
when using client mode with driver on a host external to K8s cluster and the 
task result size is greater than the value "maxDirectResultSize" and the 
executor blockmanagers in the k8s cluster are not routable from the driver.

*Background:* when the result of a task has to be serialized back to the driver 
three cases are possible: (1) failure if the result size is greater than 
MAX_RESULT_SIZE, (2) the result is serialized directly to the driver if the 
size is equal to or less than "maxDirectResultSize", (3) pointers to the task 
results in the block manager are sent back to the driver in the case of result 
size between "maxDirectResultSize" and MAX_RESULT_SIZE.

For behavior 3 to successfully complete (i.e. the driver is sent pointers into 
the block manager rather than data), the driver needs to be able to access the 
block manager address in the executor JVM. This currently fails in our 
configuration when using a Spark driver outside the K8S cluster, as the block 
manager addresses are private K8S pods addresses and thus not routable from the 
driver, at least in our configuration, but we think others may be affected.

How to *reproduce* the issue:
 * Use Spark on K8S with a driver located externally to the K8S cluster
 * Start the Spark session using --conf *task.maxDirectResultSize*=<size>, 
for example with <size> = 100, to make this problem appear for most tasks, even 
those returning small results.

Existing *workaround*:
 * Set *task.maxDirectResultSize* to the same value as 
spark.driver.maxResultSize when running in the configuration affected by this 
bug (K8S with driver external to the cluster).

Additional *notes*:
 * The executors are registered with routable public IPs but the executor 
block managers are registered with IPs private to K8S; that is why task 
assignment works but the driver cannot connect to the executor block manager to 
pull the final result (if the result is bigger than maxDirectResultSize).
 * Spark documentation correctly states "spark executors must be able to 
connect to the Spark driver over a hostname and a port that is routable from 
the Spark executors"; however, this case is about the Spark driver needing to 
connect to the executors' block managers.

[jira] [Commented] (SPARK-26087) Spark on K8S jobs fail when returning large result sets in client mode under certain conditions

2018-11-16 Thread Piotr Mrowczynski (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689358#comment-16689358
 ] 

Piotr Mrowczynski commented on SPARK-26087:
---

Corresponding error message is:
{code:java}
18/11/12 12:40:31 ERROR RetryingBlockFetcher: Exception while beginning fetch 
of 1 outstanding blocks
java.io.IOException: Connecting to /10.100.112.3:35973 timed out (3 ms)
{code}

> Spark on K8S jobs fail when returning large result sets in client mode under 
> certain conditions
> ---
>
> Key: SPARK-26087
> URL: https://issues.apache.org/jira/browse/SPARK-26087
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Luca Canali
>Priority: Minor
>
> *Short description*: Jobs on Spark on k8s fail when certain conditions are 
> met: when using client mode with driver on a host external to K8s cluster and 
> the task result size is greater than the value "maxDirectResultSize" and the 
> executor blockmanagers in the k8s cluster are not routable from the driver.
> *Background:* when the result of a task has to be serialized back to the 
> driver three cases are possible: (1) failure if the result size is greater 
> than MAX_RESULT_SIZE, (2) the result is serialized directly to the driver if 
> the size is equal to or less than "maxDirectResultSize", (3) pointers to the 
> task results in the block manager are sent back to the driver in the case of 
> result size between "maxDirectResultSize" and MAX_RESULT_SIZE.
> For behavior 3 to successfully complete (i.e. the driver is sent pointers 
> into the block manager rather than data), the driver needs to be able to 
> access the block manager address in the executor JVM. This currently 
> fails in our configuration when using a Spark driver outside the K8S cluster, 
> as the block manager addresses are private K8S pods addresses and thus not 
> routable from the driver, at least in our configuration, but we think others 
> may be affected.
> How to *reproduce* the issue:
>  * Use Spark on K8S with a driver located externally to the K8S cluster
> * Start the Spark session using --conf *task.maxDirectResultSize*=<size>, 
> for example with <size> = 100, to make this problem appear for most tasks, 
> even those returning small results.
> Existing *workaround*:
>  * Set *task.maxDirectResultSize* to the same value as 
> spark.driver.maxResultSize when running in the configuration affected by this 
> bug (K8S with driver external to the cluster).
> Additional *notes*:
> * The executors are registered with routable public IPs but the executor 
> block managers are registered with IPs private to K8S; that is why task 
> assignment works but the driver cannot connect to the executor block manager to 
> pull the final result (if the result is bigger than maxDirectResultSize).
>  * Spark documentation correctly states "spark executors must be able to 
> connect to the Spark driver over a hostname and a port that is routable from 
> the Spark executors"; however, this case is about the Spark driver needing to 
> connect to the executors’ block managers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26087) Spark on K8S jobs fail when returning large result sets in client mode under certain conditions

2018-11-16 Thread Piotr Mrowczynski (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689358#comment-16689358
 ] 

Piotr Mrowczynski edited comment on SPARK-26087 at 11/16/18 12:30 PM:
--

Corresponding error message on the driver is:
{code:java}
18/11/12 12:40:31 ERROR RetryingBlockFetcher: Exception while beginning fetch 
of 1 outstanding blocks
java.io.IOException: Connecting to /10.100.112.3:35973 timed out (3 ms)
{code}


was (Author: mrow4a):
Corresponding error message is:
{code:java}
18/11/12 12:40:31 ERROR RetryingBlockFetcher: Exception while beginning fetch 
of 1 outstanding blocks
java.io.IOException: Connecting to /10.100.112.3:35973 timed out (3 ms)
{code}

> Spark on K8S jobs fail when returning large result sets in client mode under 
> certain conditions
> ---
>
> Key: SPARK-26087
> URL: https://issues.apache.org/jira/browse/SPARK-26087
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Luca Canali
>Priority: Minor
>
> *Short description*: Jobs on Spark on k8s fail when certain conditions are 
> met: when using client mode with driver on a host external to K8s cluster and 
> the task result size is greater than the value "maxDirectResultSize" and the 
> executor blockmanagers in the k8s cluster are not routable from the driver.
> *Background:* when the result of a task has to be serialized back to the 
> driver three cases are possible: (1) failure if the result size is greater 
> than MAX_RESULT_SIZE, (2) the result is serialized directly to the driver if 
> the size is equal to or less than "maxDirectResultSize", (3) pointers to the 
> task results in the block manager are sent back to the driver in the case of 
> result size between "maxDirectResultSize" and MAX_RESULT_SIZE.
> For behavior 3 to successfully complete (i.e. the driver is sent pointers 
> into the block manager rather than data), the driver needs to be able to 
> access the block manager address in the executor JVM. This currently 
> fails in our configuration when using a Spark driver outside the K8S cluster, 
> as the block manager addresses are private K8S pods addresses and thus not 
> routable from the driver, at least in our configuration, but we think others 
> may be affected.
> How to *reproduce* the issue:
>  * Use Spark on K8S with a driver located externally to the K8S cluster
> * Start the Spark session using --conf *task.maxDirectResultSize*=<size>, 
> for example with <size> = 100, to make this problem appear for most tasks, 
> even those returning small results.
> Existing *workaround*:
>  * Set *task.maxDirectResultSize* to the same value as 
> spark.driver.maxResultSize when running in the configuration affected by this 
> bug (K8S with driver external to the cluster).
> Additional *notes*:
> * The executors are registered with routable public IPs but the executor 
> block managers are registered with IPs private to K8S; that is why task 
> assignment works but the driver cannot connect to the executor block manager to 
> pull the final result (if the result is bigger than maxDirectResultSize).
>  * Spark documentation correctly states "spark executors must be able to 
> connect to the Spark driver over a hostname and a port that is routable from 
> the Spark executors"; however, this case is about the Spark driver needing to 
> connect to the executors’ block managers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26087) Spark on K8S jobs fail when returning large result sets in client mode under certain conditions

2018-11-16 Thread Luca Canali (JIRA)
Luca Canali created SPARK-26087:
---

 Summary: Spark on K8S jobs fail when returning large result sets 
in client mode under certain conditions
 Key: SPARK-26087
 URL: https://issues.apache.org/jira/browse/SPARK-26087
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Luca Canali


*Short description*: Jobs on Spark on k8s fail when certain conditions are met: 
when using client mode with driver on a host external to K8s cluster and the 
task result size is greater than the value "maxDirectResultSize" and the 
executor blockmanagers in the k8s cluster are not routable from the driver.

*Background:* when the result of a task has to be serialized back to the driver 
three cases are possible: (1) failure if the result size is greater than 
MAX_RESULT_SIZE, (2) the result is serialized directly to the driver if the 
size is equal to or less than "maxDirectResultSize", (3) pointers to the task 
results in the block manager are sent back to the driver in the case of result 
size between "maxDirectResultSize" and MAX_RESULT_SIZE.

For behavior 3 to successfully complete (i.e. the driver is sent pointers into 
the block manager rather than data), the driver needs to be able to access the 
block manager address in the executor JVM. This currently fails in our 
configuration when using a Spark driver outside the K8S cluster, as the block 
manager addresses are private K8S pods addresses and thus not routable from the 
driver, at least in our configuration, but we think others may be affected.

How to *reproduce* the issue:
 * Use Spark on K8S with a driver located externally to the K8S cluster
 * Start the Spark session using --conf *task.maxDirectResultSize*=<size>, 
for example with <size> = 100, to make this problem appear for most tasks, even 
those returning small results.

Existing *workaround*:
 * Set *task.maxDirectResultSize* to the same value as 
spark.driver.maxResultSize when running in the configuration affected by this 
bug (K8S with driver external to the cluster).

Additional *notes*:
 * The executors are registered with routable public IPs but the executor 
block managers are registered with IPs private to K8S; that is why task 
assignment works but the driver cannot connect to the executor block manager to 
pull the final result (if the result is bigger than maxDirectResultSize).
 * Spark documentation correctly states "spark executors must be able to 
connect to the Spark driver over a hostname and a port that is routable from 
the Spark executors"; however, this case is about the Spark driver needing to 
connect to the executors’ block managers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees

2018-11-16 Thread Herman van Hovell (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689312#comment-16689312
 ] 

Herman van Hovell commented on SPARK-26084:
---

[~simeons] since you have already proposed a solution, do you mind opening a PR?

> AggregateExpression.references fails on unresolved expression trees
> ---
>
> Key: SPARK-26084
> URL: https://issues.apache.org/jira/browse/SPARK-26084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: aggregate, regression, sql
>
> [SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a 
> stable ordering in {{AttributeSet.toSeq}} using expression IDs 
> ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128])
>  without noticing that {{AggregateExpression.references}} used 
> {{AttributeSet.toSeq}} as a shortcut 
> ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]).
>  The net result is that {{AggregateExpression.references}} fails for 
> unresolved aggregate functions.
> {code:scala}
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
>   org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
>   mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
>   isDistinct = false
> ).references
> {code}
> fails with
> {code:scala}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'y
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>   at java.util.TimSort.sort(TimSort.java:220)
>   at java.util.Arrays.sort(Arrays.java:1438)
>   at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
>   at scala.collection.AbstractSeq.sorted(Seq.scala:41)
>   at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
>   at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
> {code}
> The solution is to avoid calling {{toSeq}} as ordering is not important in 
> {{references}} and simplify (and speed up) the implementation to something 
> like
> {code:scala}
> mode match {
>   case Partial | Complete => aggregateFunction.references
>   case PartialMerge | Final => 
> AttributeSet(aggregateFunction.aggBufferAttributes)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION

2018-11-16 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689268#comment-16689268
 ] 

Liang-Chi Hsieh commented on SPARK-26078:
-

I had a quick look at it, but I don't have a great fix yet. If [~mgaido] can 
come up with a PR, I can help review it. Thanks.

> WHERE .. IN fails to filter rows when used in combination with UNION
> 
>
> Key: SPARK-26078
> URL: https://issues.apache.org/jira/browse/SPARK-26078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Arttu Voutilainen
>Priority: Blocker
>  Labels: correctness
>
> Hey,
> We encountered a case where Spark SQL does not seem to handle WHERE .. IN 
> correctly when used in combination with UNION, but instead also returns rows 
> that do not fulfill the condition. Swapping the order of the datasets in the 
> UNION makes the problem go away. Repro below:
>  
> {code}
> sql = SQLContext(sc)
> a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> a.registerTempTable('a')
> b.registerTempTable('b')
> bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'a' as source FROM a
> UNION ALL
> SELECT id, num, 'b' as source FROM b
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> no_bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'b' as source FROM b
> UNION ALL
> SELECT id, num, 'a' as source FROM a
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> bug.show()
> no_bug.show()
> bug.explain(True)
> no_bug.explain(True)
> {code}
> This results in one extra row in the "bug" DF coming from DF "b", which should 
> not be there as it does not satisfy the WHERE .. IN condition:
> {code:java}
> >>> bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| a|
> |  a|  2| b|
> |  b|  1| b|
> +---+---+--+
> >>> no_bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| b|
> |  a|  2| a|
> +---+---+--+
> {code}
>  The reason can be seen in the query plans:
> {code:java}
> >>> bug.explain(True)
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#0, num#1L, a AS source#136]
> :  +- Join LeftSemi, (id#0 = id#4)
> : :- LogicalRDD [id#0, num#1L], false
> : +- Project [id#4]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Join LeftSemi, (id#4#172 = id#4#172)
>:- Project [id#4, num#5L, b AS source#137]
>:  +- LogicalRDD [id#4, num#5L], false
>+- Project [id#4 AS id#4#172]
>   +- Filter (isnotnull(num#5L) && (num#5L = 2))
>  +- LogicalRDD [id#4, num#5L], false
> {code}
> Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition 
> seems wrong, and I believe it causes the LeftSemi to return true for all rows 
> in the left-hand-side table, thus failing to filter as the WHERE .. IN 
> should. Compare with the non-buggy version, where both LeftSemi joins have 
> distinct #-things on both sides:
> {code:java}
> >>> no_bug.explain()
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#4, num#5L, b AS source#142]
> :  +- Join LeftSemi, (id#4 = id#4#173)
> : :- LogicalRDD [id#4, num#5L], false
> : +- Project [id#4 AS id#4#173]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Project [id#0, num#1L, a AS source#143]
>+- Join LeftSemi, (id#0 = id#4#173)
>   :- LogicalRDD [id#0, num#1L], false
>   +- Project [id#4 AS id#4#173]
>  +- Filter (isnotnull(num#5L) && (num#5L = 2))
> +- LogicalRDD [id#4, num#5L], false
> {code}
>  
> Best,
> -Arttu 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION

2018-11-16 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689262#comment-16689262
 ] 

Marco Gaido commented on SPARK-26078:
-

I'll investigate this immediately, thanks [~cloud_fan].

> WHERE .. IN fails to filter rows when used in combination with UNION
> 
>
> Key: SPARK-26078
> URL: https://issues.apache.org/jira/browse/SPARK-26078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Arttu Voutilainen
>Priority: Blocker
>  Labels: correctness
>
> Hey,
> We encountered a case where Spark SQL does not seem to handle WHERE .. IN 
> correctly when used in combination with UNION, but instead also returns rows 
> that do not fulfill the condition. Swapping the order of the datasets in the 
> UNION makes the problem go away. Repro below:
>  
> {code}
> sql = SQLContext(sc)
> a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> a.registerTempTable('a')
> b.registerTempTable('b')
> bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'a' as source FROM a
> UNION ALL
> SELECT id, num, 'b' as source FROM b
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> no_bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'b' as source FROM b
> UNION ALL
> SELECT id, num, 'a' as source FROM a
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> bug.show()
> no_bug.show()
> bug.explain(True)
> no_bug.explain(True)
> {code}
> This results in one extra row in the "bug" DF coming from DF "b", which should 
> not be there as it does not satisfy the WHERE .. IN condition:
> {code:java}
> >>> bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| a|
> |  a|  2| b|
> |  b|  1| b|
> +---+---+--+
> >>> no_bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| b|
> |  a|  2| a|
> +---+---+--+
> {code}
>  The reason can be seen in the query plans:
> {code:java}
> >>> bug.explain(True)
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#0, num#1L, a AS source#136]
> :  +- Join LeftSemi, (id#0 = id#4)
> : :- LogicalRDD [id#0, num#1L], false
> : +- Project [id#4]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Join LeftSemi, (id#4#172 = id#4#172)
>:- Project [id#4, num#5L, b AS source#137]
>:  +- LogicalRDD [id#4, num#5L], false
>+- Project [id#4 AS id#4#172]
>   +- Filter (isnotnull(num#5L) && (num#5L = 2))
>  +- LogicalRDD [id#4, num#5L], false
> {code}
> Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition 
> seems wrong, and I believe it causes the LeftSemi to return true for all rows 
> in the left-hand-side table, thus failing to filter as the WHERE .. IN 
> should. Compare with the non-buggy version, where both LeftSemi joins have 
> distinct #-things on both sides:
> {code:java}
> >>> no_bug.explain()
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#4, num#5L, b AS source#142]
> :  +- Join LeftSemi, (id#4 = id#4#173)
> : :- LogicalRDD [id#4, num#5L], false
> : +- Project [id#4 AS id#4#173]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Project [id#0, num#1L, a AS source#143]
>+- Join LeftSemi, (id#0 = id#4#173)
>   :- LogicalRDD [id#0, num#1L], false
>   +- Project [id#4 AS id#4#173]
>  +- Filter (isnotnull(num#5L) && (num#5L = 2))
> +- LogicalRDD [id#4, num#5L], false
> {code}
>  
> Best,
> -Arttu 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26045) Error in the spark 2.4 release package with the spark-avro_2.11 dependency

2018-11-16 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689217#comment-16689217
 ] 

Marco Gaido commented on SPARK-26045:
-

[~o.garcia] can you please create a PR for this?

> Error in the spark 2.4 release package with the spark-avro_2.11 dependency
> 
>
> Key: SPARK-26045
> URL: https://issues.apache.org/jira/browse/SPARK-26045
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
> Environment: 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 
> 2018 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Oscar garcía 
>Priority: Major
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Hello, I have been having problems with the latest Spark 2.4 release: the read 
> Avro file feature does not seem to be working. I have fixed it locally by building 
> the source code and updating the *avro-1.8.2.jar* in the *$SPARK_HOME*/jars/ 
> dependencies.
> With the default Spark 2.4 release, when I try to read an Avro file Spark 
> raises the following exception.  
> {code:java}
> spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
> scala> spark.read.format("avro").load("file.avro")
> java.lang.NoSuchMethodError: 
> org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:51)
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105
> {code}
> Checksum:  spark-2.4.0-bin-without-hadoop.tgz: 7670E29B 59EAE7A8 5DBC9350 
> 085DD1E0 F056CA13 11365306 7A6A32E9 B607C68E A8DAA666 EF053350 008D0254 
> 318B70FB DE8A8B97 6586CA19 D65BA2B3 FD7F919E
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26059) Spark standalone mode, does not correctly record a failed Spark Job.

2018-11-16 Thread Prashant Sharma (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689210#comment-16689210
 ] 

Prashant Sharma commented on SPARK-26059:
-

Well, to reproduce it, we need to submit a failing job to Spark standalone.

e.g. if spark://10.0.0.1:7077 is the master URL, then submit a SparkPi job as

bin/run-example --master spark://10.0.0.1:7077 SparkPi test

Since it expects a numeric argument for the number of partitions, the job will 
fail. On the Web UI or through a REST query, we can see the status of 
completed apps, and it shows FINISHED.

I have not taken a screenshot, but it does not give any more detail.

> Spark standalone mode, does not correctly record a failed Spark Job.
> 
>
> Key: SPARK-26059
> URL: https://issues.apache.org/jira/browse/SPARK-26059
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 3.0.0
>Reporter: Prashant Sharma
>Priority: Major
>
> In order to reproduce, submit a failing job to the Spark standalone master. The 
> status of the job is shown as FINISHED, irrespective of whether it 
> failed or succeeded. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26075) Cannot broadcast the table that is larger than 8GB : Spark 2.3

2018-11-16 Thread Neeraj Bhadani (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689200#comment-16689200
 ] 

Neeraj Bhadani commented on SPARK-26075:


[~hyukjin.kwon] I haven't tried this with Spark 2.4. However, I have verified 
above with Spark 2.2 and Spark 2.3 as I mentioned earlier.

> Cannot broadcast the table that is larger than 8GB : Spark 2.3
> --
>
> Key: SPARK-26075
> URL: https://issues.apache.org/jira/browse/SPARK-26075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Neeraj Bhadani
>Priority: Major
>
>  I am trying to use the broadcast join but I am getting the below error in Spark 2.3. 
> However, the same code is working fine in Spark 2.2.
>  
> Upon checking the size of the dataframes, it's merely 50 MB and I have set the 
> threshold to 200 MB as well. As I mentioned above, the same code is working fine 
> in Spark 2.2
>  
> {{Error: "Cannot broadcast the table that is larger than 8GB". }}
> However, disabling the broadcasting is working fine.
> {{'spark.sql.autoBroadcastJoinThreshold': '-1'}}
>  
> {{Regards,}}
> {{Neeraj}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26075) Cannot broadcast the table that is larger than 8GB : Spark 2.3

2018-11-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689202#comment-16689202
 ] 

Hyukjin Kwon commented on SPARK-26075:
--

It would be helpful if that could be verified.

> Cannot broadcast the table that is larger than 8GB : Spark 2.3
> --
>
> Key: SPARK-26075
> URL: https://issues.apache.org/jira/browse/SPARK-26075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Neeraj Bhadani
>Priority: Major
>
>  I am trying to use the broadcast join but I am getting the below error in Spark 2.3. 
> However, the same code is working fine in Spark 2.2.
>  
> Upon checking the size of the dataframes, it's merely 50 MB and I have set the 
> threshold to 200 MB as well. As I mentioned above, the same code is working fine 
> in Spark 2.2
>  
> {{Error: "Cannot broadcast the table that is larger than 8GB". }}
> However, disabling the broadcasting is working fine.
> {{'spark.sql.autoBroadcastJoinThreshold': '-1'}}
>  
> {{Regards,}}
> {{Neeraj}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


