[jira] [Resolved] (SPARK-26637) Makes GetArrayItem nullability more precise

2019-01-22 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26637.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23566
[https://github.com/apache/spark/pull/23566]

> Makes GetArrayItem nullability more precise
> ---
>
> Key: SPARK-26637
> URL: https://issues.apache.org/jira/browse/SPARK-26637
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.0.0
>
>
> In master, GetArrayItem's nullability is always true;
> https://github.com/apache/spark/blob/cf133e611020ed178f90358464a1b88cdd9b7889/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L236
> But, if the input array size is constant and the ordinal is foldable, we could make 
> GetArrayItem's nullability more precise.
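For illustration, a short spark-shell sketch of the behavior being tightened (Spark 2.4 API; the commented schema line is what the current conservative rule produces, and the note about what it could become is the ticket's claim, not verified output):

{code:java}
// spark-shell, Spark 2.4: even when every element of the array is non-nullable and the
// ordinal is a literal, the extracted column is still reported as nullable.
import org.apache.spark.sql.functions._

val df = Seq((1, 2)).toDF("a", "b")
  .select(array(col("a"), col("b")).getItem(0).as("first"))

df.printSchema()
// root
//  |-- first: integer (nullable = true)   // the ticket proposes deriving false here
{code}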






[jira] [Assigned] (SPARK-26637) Makes GetArrayItem nullability more precise

2019-01-22 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26637:
---

Assignee: Takeshi Yamamuro

> Makes GetArrayItem nullability more precise
> ---
>
> Key: SPARK-26637
> URL: https://issues.apache.org/jira/browse/SPARK-26637
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.0.0
>
>
> In master, GetArrayItem's nullability is always true;
> https://github.com/apache/spark/blob/cf133e611020ed178f90358464a1b88cdd9b7889/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L236
> But, if the input array size is constant and the ordinal is foldable, we could make 
> GetArrayItem's nullability more precise.






[jira] [Created] (SPARK-26699) Dataset column discrepancies between Parquet

2019-01-22 Thread Lakshmi Praveena (JIRA)
Lakshmi Praveena created SPARK-26699:


 Summary: Dataset column discrepancies between Parquet 
 Key: SPARK-26699
 URL: https://issues.apache.org/jira/browse/SPARK-26699
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.3.2
Reporter: Lakshmi Praveena


Hi,

 

When I run my job in local mode with the same Parquet input files, the output is:

 

locations

[[[true, [[, phys...
[[[true, [[, phys...
[[[true, [[, phys...
 null
[[[true, [[, phys...
[[[true, [[, phys...
[[[true, [[, phys...
[[[true, [[, phys...
[[[true, [[, phys...
[[[true, [[, phys...

 

But when I run the same code base with the same input Parquet files in YARN 
cluster mode, the output is as below:


 locations

[*WrappedArray*([tr...
[*WrappedArray*([tr...
[WrappedArray([tr...
 null
[WrappedArray([tr...
[WrappedArray([tr...
[WrappedArray([tr...
[WrappedArray([tr...
[WrappedArray([tr...
[WrappedArray([tr...

It's appending WrappedArray :(

I am using Apache Spark 2.3.2, and the EMR version of the cluster is 5.19.0. What 
could be the reason for the discrepancies in the output of certain table columns?
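Not part of the original report: a hedged way to check whether the data itself differs between the two runs or only the string rendering of the nested array column does (the path and column name below are placeholders taken from the report):

{code:java}
// Compare the schema and raw Row values instead of show() output; show() truncates and
// formats nested collections, so identical data can still look different on screen.
val df = spark.read.parquet("/path/to/input")        // placeholder input path
df.printSchema()
df.select("locations").limit(5).collect().foreach(println)
{code}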






[jira] [Commented] (SPARK-26678) Empty values end up as quoted empty strings in CSV files

2019-01-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749598#comment-16749598
 ] 

Hyukjin Kwon commented on SPARK-26678:
--

We should distinguish between an empty string and a missing value. Use the 
{{emptyValue}} option to make that distinction.
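A minimal sketch of that suggestion (Spark 2.4 CSV writer; the exact value to pass and the resulting file contents are assumptions to verify, not confirmed output):

{code:java}
val df = List("aa", "", "bb").toDF("name")
df.coalesce(1)
  .write
  .option("header", "true")
  .option("emptyValue", "")          // distinguish empty strings from nulls explicitly
  .csv("/tmp/24-emptyValue.csv")
{code}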

> Empty values end up as quoted empty strings in CSV files
> 
>
> Key: SPARK-26678
> URL: https://issues.apache.org/jira/browse/SPARK-26678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Robert V
>Priority: Major
>  Labels: csv
>
> h1. Problem statement
> Empty string values were written to CSV as unquoted strings prior to Spark 
> version 2.4.0.
> From version 2.4.0, empty string values end up as "" values in CSV files, which 
> is a problem if an application expects empty values not to be wrapped in 
> quotes (which is certainly the case if the CSV is intended to be used in 
> Microsoft Power BI, for example, as it doesn't handle CSV files with double 
> quotes).
> The following code ends up with the following results in the different 
> versions of Spark:
>  
> ||Spark version||Code||Result||
> |2.3.0|{code:java}
> val df = List("aa", "", "bb").toDF("name")
> df.coalesce(1).write.option("header", "true").csv("/23.csv")
> {code}|{noformat}
> name
> aa
> bb
> {noformat}|
> |2.4.0|{code:java}
> val df = List("aa", "", "bb").toDF("name")
> df.coalesce(1).write.option("header", "true").csv("/24.csv")
> {code}|{noformat}
> name
> aa
> ""
> bb
> {noformat}|
> |2.4.0|{code:java}
> val df = List("aa", "", "bb").toDF("name")
> df.coalesce(1).write.option("header", "true").option("quote", 
> "").csv("/24-2.csv")
> {code}|{noformat}
> name
> aa
> ""
> bb
> {noformat}|
> If the intention was to produce standard-looking CSV files (even though no CSV 
> standard exists), we still need a way to disable automatic quoting.
> Also, using
> {code:java}
> option("quote", "\u")
> {code}
> had no effect; double quotes were still used.
> h1. Proposed solution
> Using the option
> {code:java}
> option("quote", "")
> {code}
> should disable quotes.
>  






[jira] [Resolved] (SPARK-26678) Empty values end up as quoted empty strings in CSV files

2019-01-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26678.
--
Resolution: Not A Problem

> Empty values end up as quoted empty strings in CSV files
> 
>
> Key: SPARK-26678
> URL: https://issues.apache.org/jira/browse/SPARK-26678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Robert V
>Priority: Major
>  Labels: csv
>
> h1. Problem statement
> Empty string values were written to CSV as unquoted strings prior to Spark 
> version 2.4.0.
> From version 2.4.0, empty string values end up as "" values in CSV files, which 
> is a problem if an application expects empty values not to be wrapped in 
> quotes (which is certainly the case if the CSV is intended to be used in 
> Microsoft Power BI, for example, as it doesn't handle CSV files with double 
> quotes).
> The following code ends up with the following results in the different 
> versions of Spark:
>  
> ||Spark version||Code||Result||
> |2.3.0|{code:java}
> val df = List("aa", "", "bb").toDF("name")
> df.coalesce(1).write.option("header", "true").csv("/23.csv")
> {code}|{noformat}
> name
> aa
> bb
> {noformat}|
> |2.4.0|{code:java}
> val df = List("aa", "", "bb").toDF("name")
> df.coalesce(1).write.option("header", "true").csv("/24.csv")
> {code}|{noformat}
> name
> aa
> ""
> bb
> {noformat}|
> |2.4.0|{code:java}
> val df = List("aa", "", "bb").toDF("name")
> df.coalesce(1).write.option("header", "true").option("quote", 
> "").csv("/24-2.csv")
> {code}|{noformat}
> name
> aa
> ""
> bb
> {noformat}|
> If the intention was to produce standard-looking CSV files (even though no CSV 
> standard exists), we still need a way to disable automatic quoting.
> Also, using
> {code:java}
> option("quote", "\u")
> {code}
> had no effect; double quotes were still used.
> h1. Proposed solution
> Using the option
> {code:java}
> option("quote", "")
> {code}
> should disable quotes.
>  






[jira] [Updated] (SPARK-26699) Dataset column output discrepancies

2019-01-22 Thread Lakshmi Praveena (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lakshmi Praveena updated SPARK-26699:
-
Summary: Dataset column output discrepancies   (was: Dataset column 
discrepancies between Parquet )

> Dataset column output discrepancies 
> 
>
> Key: SPARK-26699
> URL: https://issues.apache.org/jira/browse/SPARK-26699
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.2
>Reporter: Lakshmi Praveena
>Priority: Major
>
> Hi,
>  
> When I run my job in local mode with the same Parquet input files, the output is:
>  
> locations
> 
> [[[true, [[, phys...
> [[[true, [[, phys...
> [[[true, [[, phys...
>  null
> [[[true, [[, phys...
> [[[true, [[, phys...
> [[[true, [[, phys...
> [[[true, [[, phys...
> [[[true, [[, phys...
> [[[true, [[, phys...
>  
> But when I run the same code base with the same input Parquet files in YARN 
> cluster mode, the output is as below:
> 
>  locations
> 
> [*WrappedArray*([tr...
> [*WrappedArray*([tr...
> [WrappedArray([tr...
>  null
> [WrappedArray([tr...
> [WrappedArray([tr...
> [WrappedArray([tr...
> [WrappedArray([tr...
> [WrappedArray([tr...
> [WrappedArray([tr...
> It's appending WrappedArray :(
> I am using Apache Spark 2.3.2, and the EMR version of the cluster is 5.19.0. What 
> could be the reason for the discrepancies in the output of certain table columns?






[jira] [Created] (SPARK-26698) Use ConfigEntry for hardcoded configs for memory and storage categories

2019-01-22 Thread SongYadong (JIRA)
SongYadong created SPARK-26698:
--

 Summary: Use ConfigEntry for hardcoded configs for memory and 
storage categories
 Key: SPARK-26698
 URL: https://issues.apache.org/jira/browse/SPARK-26698
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: SongYadong


Some of the following memory- and storage-related configs are still hardcoded. We 
try to use ConfigEntry and put them in the config package (a sketch of the 
ConfigEntry pattern follows the list below):
spark.storage.cleanupFilesAfterExecutorExit
spark.diskStore.subDirectories
spark.block.failures.beforeLocationRefresh
spark.storage.unrollMemoryThreshold
spark.storage.memoryMapThreshold
spark.memory.storageFraction
spark.memory.offHeap.enabled
spark.memory.offHeap.size
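As a reference for the change, a rough sketch of the ConfigEntry pattern (the entry name, doc string, and default below are illustrative, not the ones the eventual patch will define):

{code:java}
// Inside the Spark source tree (org.apache.spark.internal.config), a hardcoded key
// typically becomes a typed entry so call sites read conf.get(ENTRY) instead of a raw string.
private[spark] val STORAGE_UNROLL_MEMORY_THRESHOLD =
  ConfigBuilder("spark.storage.unrollMemoryThreshold")
    .doc("Initial memory to request before unrolling any block.")
    .longConf
    .createWithDefault(1024 * 1024)

// Call site:
// val threshold = conf.get(STORAGE_UNROLL_MEMORY_THRESHOLD)
{code}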






[jira] [Assigned] (SPARK-26698) Use ConfigEntry for hardcoded configs for memory and storage categories

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26698:


Assignee: (was: Apache Spark)

> Use ConfigEntry for hardcoded configs for memory and storage categories
> ---
>
> Key: SPARK-26698
> URL: https://issues.apache.org/jira/browse/SPARK-26698
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: SongYadong
>Priority: Major
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Some of the following memory- and storage-related configs are still hardcoded. We 
> try to use ConfigEntry and put them in the config package:
> spark.storage.cleanupFilesAfterExecutorExit
> spark.diskStore.subDirectories
> spark.block.failures.beforeLocationRefresh
> spark.storage.unrollMemoryThreshold
> spark.storage.memoryMapThreshold
> spark.memory.storageFraction
> spark.memory.offHeap.enabled
> spark.memory.offHeap.size






[jira] [Assigned] (SPARK-26698) Use ConfigEntry for hardcoded configs for memory and storage categories

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26698:


Assignee: Apache Spark

> Use ConfigEntry for hardcoded configs for memory and storage categories
> ---
>
> Key: SPARK-26698
> URL: https://issues.apache.org/jira/browse/SPARK-26698
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: SongYadong
>Assignee: Apache Spark
>Priority: Major
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Some of the following memory- and storage-related configs are still hardcoded. We 
> try to use ConfigEntry and put them in the config package:
> spark.storage.cleanupFilesAfterExecutorExit
> spark.diskStore.subDirectories
> spark.block.failures.beforeLocationRefresh
> spark.storage.unrollMemoryThreshold
> spark.storage.memoryMapThreshold
> spark.memory.storageFraction
> spark.memory.offHeap.enabled
> spark.memory.offHeap.size






[jira] [Resolved] (SPARK-19478) JDBC Sink

2019-01-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19478.
--
Resolution: Not A Problem

I'm resolving this per 
https://github.com/apache/spark/pull/17190#issuecomment-456456303

> JDBC Sink
> -
>
> Key: SPARK-19478
> URL: https://issues.apache.org/jira/browse/SPARK-19478
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.0.0
>Reporter: Michael Armbrust
>Priority: Major
>
> A sink that transactionally commits data into a database using JDBC.
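Not the sink this ticket asked for, but for readers landing here: since 2.4 a stream can be written to JDBC through {{foreachBatch}}. A hedged sketch only; the JDBC URL, table name, checkpoint path, and connection properties are placeholders, and exactly-once delivery still depends on the target table being idempotent or transactional:

{code:java}
import java.util.Properties
import org.apache.spark.sql.DataFrame

val stream = spark.readStream.format("rate").load()    // any streaming source

val props = new Properties()                            // add user/password as needed
stream.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // each micro-batch goes through the existing batch JDBC writer
    batch.write.mode("append").jdbc("jdbc:postgresql://host/db", "target_table", props)
  }
  .option("checkpointLocation", "/tmp/checkpoints/jdbc-sketch")
  .start()
{code}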






[jira] [Commented] (SPARK-26679) Deconflict spark.executor.pyspark.memory and spark.python.worker.memory

2019-01-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749557#comment-16749557
 ] 

Hyukjin Kwon commented on SPARK-26679:
--

[~rdblue], if there's no case we can currently come up with (I can't), let's 
combine both. We can think of this like the JVM memory limit: it spills when it 
fails to allocate more memory, and it goes OOM if it tries to use more memory 
than the limit.

> Deconflict spark.executor.pyspark.memory and spark.python.worker.memory
> ---
>
> Key: SPARK-26679
> URL: https://issues.apache.org/jira/browse/SPARK-26679
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Ryan Blue
>Priority: Major
>
> In 2.4.0, spark.executor.pyspark.memory was added to limit the total memory 
> space of a python worker. There is another RDD setting, 
> spark.python.worker.memory that controls when Spark decides to spill data to 
> disk. These are currently similar, but not related to one another.
> PySpark should probably use spark.executor.pyspark.memory to limit or default 
> the setting of spark.python.worker.memory because the latter property 
> controls spilling and should be lower than the total memory limit. Renaming 
> spark.python.worker.memory would also help clarity because it sounds like it 
> should control the limit, but is more like the JVM setting 
> spark.memory.fraction.
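For reference, the two settings side by side as they would appear in a SparkConf (the values are arbitrary examples, not recommendations):

{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.pyspark.memory", "2g")   // hard cap on each Python worker's memory (added in 2.4.0)
  .set("spark.python.worker.memory", "512m")    // threshold at which Python aggregation spills, not a hard cap
{code}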






[jira] [Assigned] (SPARK-26696) Dataset encoder should be publicly accessible

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26696:


Assignee: (was: Apache Spark)

> Dataset encoder should be publicly accessible
> -
>
> Key: SPARK-26696
> URL: https://issues.apache.org/jira/browse/SPARK-26696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: dataset, encoding
>
> As a platform, Spark should enable framework developers to accomplish outside 
> of the Spark codebase much of what can be accomplished inside the Spark 
> codebase. One of the obstacles to this is a historical pattern of excessive 
> data hiding in Spark, e.g., {{expr}} in {{Column}} not being accessible. This 
> issue is an example of this pattern when it comes to {{Dataset}}.
> Consider a transformation with the signature `def foo[A](ds: Dataset[A]): 
> Dataset[A]`, which requires the use of {{toDF()}}. To get back to 
> {{Dataset[A]}} would require calling {{.as[A]}}, which requires an implicit 
> {{Encoder[A]}}. A naive approach would change the function signature to 
> `foo[A : Encoder]`, but this is poor API design that requires unnecessarily 
> carrying implicits from user code into framework code. We know 
> `Encoder[A]` exists because we have access to an instance of `Dataset[A]`... 
> but its `encoder` is not accessible.
> The solution is simple: make {{encoder}} a {{@transient val}} just as is the 
> case with {{queryExecution}}.
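A sketch of what the requested change would allow. This assumes {{encoder}} becomes publicly accessible, which is exactly what the ticket asks for, so it does not compile against 2.4.0 as-is; {{foo}} and the filter are made-up placeholders:

{code:java}
import org.apache.spark.sql.{Dataset, Encoder}

def foo[A](ds: Dataset[A]): Dataset[A] = {
  implicit val enc: Encoder[A] = ds.encoder   // the accessor this issue asks to expose
  ds.toDF().filter("1 = 1").as[A]             // placeholder DataFrame-level transformation
}
{code}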






[jira] [Assigned] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26677:


Assignee: (was: Apache Spark)

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
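Until a fix lands, one mitigation worth trying (assumption only: if the wrong result comes from Parquet predicate pushdown, disabling the existing {{spark.sql.parquet.filterPushdown}} conf should avoid it; this has not been verified here):

{code:java}
scala> spark.conf.set("spark.sql.parquet.filterPushdown", "false")
scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
{code}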






[jira] [Assigned] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26677:


Assignee: Apache Spark

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}






[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749526#comment-16749526
 ] 

Hyukjin Kwon commented on SPARK-26677:
--

I'm going to open a PR soon.

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}






[jira] [Assigned] (SPARK-26697) ShuffleBlockFetcherIterator can log block sizes in addition to num blocks

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26697:


Assignee: Apache Spark

> ShuffleBlockFetcherIterator can log block sizes in addition to num blocks
> -
>
> Key: SPARK-26697
> URL: https://issues.apache.org/jira/browse/SPARK-26697
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Major
>
> Every so often I find myself looking at executor logs, wondering why 
> something is going wrong (failed exec, or seems to be stuck etc) and I wish I 
> had a bit more info about shuffle sizes.  {{ShuffleBlockFetcherIterator}} 
> logs the number of local & remote blocks, but not their sizes.  It would be 
> really easy to add in size info too.
> eg. instead of 
> {noformat}
> 19/01/22 23:10:49.978 Executor task launch worker for task 8 INFO 
> ShuffleBlockFetcherIterator: Getting 2 non-empty blocks including 1 local 
> blocks and 1 remote blocks
> {noformat}
> it should be 
> {noformat}
> 19/01/22 23:10:49.978 Executor task launch worker for task 8 INFO 
> ShuffleBlockFetcherIterator: Getting 2 (194.0 B) non-empty blocks including 1 
> (97.0 B) local blocks and 1 (97.0 B) remote blocks
> {noformat}
> I know this is a really minor change, but I've wanted it multiple times, 
> seems worth it.
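A rough sketch of the proposed message ({{Utils.bytesToString}} is Spark core's existing byte formatter; the counter names below are made up for illustration):

{code:java}
// Sketch only; numBlocks/totalBytes etc. stand in for whatever counters the iterator tracks.
logInfo(s"Getting $numBlocks (${Utils.bytesToString(totalBytes)}) non-empty blocks " +
  s"including $numLocal (${Utils.bytesToString(localBytes)}) local blocks " +
  s"and $numRemote (${Utils.bytesToString(remoteBytes)}) remote blocks")
{code}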






[jira] [Assigned] (SPARK-26697) ShuffleBlockFetcherIterator can log block sizes in addition to num blocks

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26697:


Assignee: (was: Apache Spark)

> ShuffleBlockFetcherIterator can log block sizes in addition to num blocks
> -
>
> Key: SPARK-26697
> URL: https://issues.apache.org/jira/browse/SPARK-26697
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> Every so often I find myself looking at executor logs, wondering why 
> something is going wrong (failed exec, or seems to be stuck etc) and I wish I 
> had a bit more info about shuffle sizes.  {{ShuffleBlockFetcherIterator}} 
> logs the number of local & remote blocks, but not their sizes.  It would be 
> really easy to add in size info too.
> eg. instead of 
> {noformat}
> 19/01/22 23:10:49.978 Executor task launch worker for task 8 INFO 
> ShuffleBlockFetcherIterator: Getting 2 non-empty blocks including 1 local 
> blocks and 1 remote blocks
> {noformat}
> it should be 
> {noformat}
> 19/01/22 23:10:49.978 Executor task launch worker for task 8 INFO 
> ShuffleBlockFetcherIterator: Getting 2 (194.0 B) non-empty blocks including 1 
> (97.0 B) local blocks and 1 (97.0 B) remote blocks
> {noformat}
> I know this is a really minor change, but I've wanted it multiple times, 
> seems worth it.






[jira] [Created] (SPARK-26697) ShuffleBlockFetcherIterator can log block sizes in addition to num blocks

2019-01-22 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-26697:


 Summary: ShuffleBlockFetcherIterator can log block sizes in 
addition to num blocks
 Key: SPARK-26697
 URL: https://issues.apache.org/jira/browse/SPARK-26697
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 2.4.0
Reporter: Imran Rashid


Every so often I find myself looking at executor logs, wondering why something 
is going wrong (failed exec, or seems to be stuck etc) and I wish I had a bit 
more info about shuffle sizes.  {{ShuffleBlockFetcherIterator}} logs the number 
of local & remote blocks, but not their sizes.  It would be really easy to add 
in size info too.

eg. instead of 

{noformat}
19/01/22 23:10:49.978 Executor task launch worker for task 8 INFO 
ShuffleBlockFetcherIterator: Getting 2 non-empty blocks including 1 local 
blocks and 1 remote blocks
{noformat}

it should be 

{noformat}
19/01/22 23:10:49.978 Executor task launch worker for task 8 INFO 
ShuffleBlockFetcherIterator: Getting 2 (194.0 B) non-empty blocks including 1 
(97.0 B) local blocks and 1 (97.0 B) remote blocks
{noformat}

I know this is a really minor change, but I've wanted it multiple times, seems 
worth it.






[jira] [Updated] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26677:
-
Priority: Blocker  (was: Major)

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}






[jira] [Updated] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26677:
-
Labels: correctness  (was: )

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Major
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}






[jira] [Commented] (SPARK-26696) Dataset encoder should be publicly accessible

2019-01-22 Thread Simeon Simeonov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749518#comment-16749518
 ] 

Simeon Simeonov commented on SPARK-26696:
-

[PR with improvement|https://github.com/apache/spark/pull/23620]

> Dataset encoder should be publicly accessible
> -
>
> Key: SPARK-26696
> URL: https://issues.apache.org/jira/browse/SPARK-26696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: dataset, encoding
>
> As a platform, Spark should enable framework developers to accomplish outside 
> of the Spark codebase much of what can be accomplished inside the Spark 
> codebase. One of the obstacles to this is a historical pattern of excessive 
> data hiding in Spark, e.g., {{expr}} in {{Column}} not being accessible. This 
> issue is an example of this pattern when it comes to {{Dataset}}.
> Consider a transformation with the signature `def foo[A](ds: Dataset[A]): 
> Dataset[A]`, which requires the use of {{toDF()}}. To get back to 
> {{Dataset[A]}} would require calling {{.as[A]}}, which requires an implicit 
> {{Encoder[A]}}. A naive approach would change the function signature to 
> `foo[A : Encoder]`, but this is poor API design that requires unnecessarily 
> carrying implicits from user code into framework code. We know 
> `Encoder[A]` exists because we have access to an instance of `Dataset[A]`... 
> but its `encoder` is not accessible.
> The solution is simple: make {{encoder}} a {{@transient val}} just as is the 
> case with {{queryExecution}}.






[jira] [Assigned] (SPARK-26696) Dataset encoder should be publicly accessible

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26696:


Assignee: Apache Spark

> Dataset encoder should be publicly accessible
> -
>
> Key: SPARK-26696
> URL: https://issues.apache.org/jira/browse/SPARK-26696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Simeon Simeonov
>Assignee: Apache Spark
>Priority: Major
>  Labels: dataset, encoding
>
> As a platform, Spark should enable framework developers to accomplish outside 
> of the Spark codebase much of what can be accomplished inside the Spark 
> codebase. One of the obstacles to this is a historical pattern of excessive 
> data hiding in Spark, e.g., {{expr}} in {{Column}} not being accessible. This 
> issue is an example of this pattern when it comes to {{Dataset}}.
> Consider a transformation with the signature `def foo[A](ds: Dataset[A]): 
> Dataset[A]`, which requires the use of {{toDF()}}. To get back to 
> {{Dataset[A]}} would require calling {{.as[A]}}, which requires an implicit 
> {{Encoder[A]}}. A naive approach would change the function signature to 
> `foo[A : Encoder]`, but this is poor API design that requires unnecessarily 
> carrying implicits from user code into framework code. We know 
> `Encoder[A]` exists because we have access to an instance of `Dataset[A]`... 
> but its `encoder` is not accessible.
> The solution is simple: make {{encoder}} a {{@transient val}} just as is the 
> case with {{queryExecution}}.






[jira] [Created] (SPARK-26696) Dataset encoder should be publicly accessible

2019-01-22 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-26696:
---

 Summary: Dataset encoder should be publicly accessible
 Key: SPARK-26696
 URL: https://issues.apache.org/jira/browse/SPARK-26696
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Simeon Simeonov


As a platform, Spark should enable framework developers to accomplish outside 
of the Spark codebase much of what can be accomplished inside the Spark 
codebase. One of the obstacles to this is a historical pattern of excessive 
data hiding in Spark, e.g., {{expr}} in {{Column}} not being accessible. This 
issue is an example of this pattern when it comes to {{Dataset}}.

Consider a transformation with the signature `def foo[A](ds: Dataset[A]): 
Dataset[A]`, which requires the use of {{toDF()}}. To get back to 
{{Dataset[A]}} would require calling {{.as[A]}}, which requires an implicit 
{{Encoder[A]}}. A naive approach would change the function signature to `foo[A 
: Encoder]`, but this is poor API design that requires unnecessarily carrying 
implicits from user code into framework code. We know `Encoder[A]` exists 
because we have access to an instance of `Dataset[A]`... but its `encoder` is 
not accessible.

The solution is simple: make {{encoder}} a {{@transient val}} just as is the 
case with {{queryExecution}}.






[jira] [Commented] (SPARK-26668) One Kafka broker serve is down,the spark streaming start consuming delay

2019-01-22 Thread Chang Quanyou (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749484#comment-16749484
 ] 

Chang Quanyou commented on SPARK-26668:
---

No, I tried it with version 2.2.2 and it does not work. I use Spark on YARN with a 
CDH distribution, so I'm not sure whether I can use a higher version on the client.

> One Kafka broker serve is down,the spark streaming start consuming delay
> 
>
> Key: SPARK-26668
> URL: https://issues.apache.org/jira/browse/SPARK-26668
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Chang Quanyou
>Priority: Major
> Attachments:  batch_interval_10min_later.png,  
> batch_interval_stage.png, 10.4.41.64_shutdown.png, 10.4.42.64_start.png, 
> batch_interval_6s.png, batch_interval_6s_processing.png, executor.log, 
> kafka_consumer.log, shutdown.png
>
>
> Description: The Kafka cluster size is 5 and one server is down (the machine is 
> shut down due to a disk failure) and we can't reboot it. We discovered that the 
> Spark Streaming job is delayed: the batch processing time went from 3 s to 1 min 
> and the batch size keeps shrinking. When the server comes back, the batches 
> return to normal.






[jira] [Assigned] (SPARK-26695) data source V2 API refactoring (continuous read)

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26695:


Assignee: Wenchen Fan  (was: Apache Spark)

> data source V2 API refactoring (continuous read)
> 
>
> Key: SPARK-26695
> URL: https://issues.apache.org/jira/browse/SPARK-26695
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-26695) data source V2 API refactoring (continuous read)

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26695:


Assignee: Apache Spark  (was: Wenchen Fan)

> data source V2 API refactoring (continuous read)
> 
>
> Key: SPARK-26695
> URL: https://issues.apache.org/jira/browse/SPARK-26695
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-26695) data source V2 API refactoring (continuous read)

2019-01-22 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-26695:
---

 Summary: data source V2 API refactoring (continuous read)
 Key: SPARK-26695
 URL: https://issues.apache.org/jira/browse/SPARK-26695
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Updated] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26677:
-
Component/s: (was: Spark Core)
 SQL

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Major
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}






[jira] [Commented] (SPARK-26427) Upgrade Apache ORC to 1.5.4

2019-01-22 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749400#comment-16749400
 ] 

Dongjoon Hyun commented on SPARK-26427:
---

It only changes the ORC dependencies.
{code}
-orc-core-1.5.2-nohive.jar
-orc-mapreduce-1.5.2-nohive.jar
-orc-shims-1.5.2.jar
+orc-core-1.5.4-nohive.jar
+orc-mapreduce-1.5.4-nohive.jar
+orc-shims-1.5.4.jar
{code}

> Upgrade Apache ORC to 1.5.4
> ---
>
> Key: SPARK-26427
> URL: https://issues.apache.org/jira/browse/SPARK-26427
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue aims to update the Apache ORC dependency to the latest version, 1.5.4, 
> released on Dec. 20. ([Release 
> Notes|https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320=12344187])
> {code}
> [ORC-237] - OrcFile.mergeFiles Specified block size is less than configured 
> minimum value
> [ORC-409] - Changes for extending MemoryManagerImpl
> [ORC-410] - Fix a locale-dependent test in TestCsvReader
> [ORC-416] - Avoid opening data reader when there is no stripe
> [ORC-417] - Use dynamic Apache Maven mirror link
> [ORC-419] - Ensure to call `close` at RecordReaderImpl constructor exception
> [ORC-432] - openjdk 8 has a bug that prevents surefire from working
> [ORC-435] - Ability to read stripes that are greater than 2GB
> [ORC-437] - Make acid schema checks case insensitive
> [ORC-411] - Update build to work with Java 10.
> [ORC-418] - Fix broken docker build script
> {code}






[jira] [Updated] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26677:
-
Priority: Major  (was: Critical)

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Major
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}






[jira] [Commented] (SPARK-26668) One Kafka broker serve is down,the spark streaming start consuming delay

2019-01-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749396#comment-16749396
 ] 

Hyukjin Kwon commented on SPARK-26668:
--

[~quanyou.chang], are you able to test this in a higher Spark version, like 2.4?

> One Kafka broker serve is down,the spark streaming start consuming delay
> 
>
> Key: SPARK-26668
> URL: https://issues.apache.org/jira/browse/SPARK-26668
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Chang Quanyou
>Priority: Major
> Attachments:  batch_interval_10min_later.png,  
> batch_interval_stage.png, 10.4.41.64_shutdown.png, 10.4.42.64_start.png, 
> batch_interval_6s.png, batch_interval_6s_processing.png, executor.log, 
> kafka_consumer.log, shutdown.png
>
>
> Description: The Kafka cluster size is 5 and one server is down (the machine is 
> shut down due to a disk failure) and we can't reboot it. We discovered that the 
> Spark Streaming job is delayed: the batch processing time went from 3 s to 1 min 
> and the batch size keeps shrinking. When the server comes back, the batches 
> return to normal.






[jira] [Commented] (SPARK-26427) Upgrade Apache ORC to 1.5.4

2019-01-22 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749391#comment-16749391
 ] 

Wenchen Fan commented on SPARK-26427:
-

Does it include upgrades of other transitive dependencies?

> Upgrade Apache ORC to 1.5.4
> ---
>
> Key: SPARK-26427
> URL: https://issues.apache.org/jira/browse/SPARK-26427
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue aims to update the Apache ORC dependency to the latest version, 1.5.4, 
> released on Dec. 20. ([Release 
> Notes|https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320=12344187])
> {code}
> [ORC-237] - OrcFile.mergeFiles Specified block size is less than configured 
> minimum value
> [ORC-409] - Changes for extending MemoryManagerImpl
> [ORC-410] - Fix a locale-dependent test in TestCsvReader
> [ORC-416] - Avoid opening data reader when there is no stripe
> [ORC-417] - Use dynamic Apache Maven mirror link
> [ORC-419] - Ensure to call `close` at RecordReaderImpl constructor exception
> [ORC-432] - openjdk 8 has a bug that prevents surefire from working
> [ORC-435] - Ability to read stripes that are greater than 2GB
> [ORC-437] - Make acid schema checks case insensitive
> [ORC-411] - Update build to work with Java 10.
> [ORC-418] - Fix broken docker build script
> {code}






[jira] [Assigned] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2019-01-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26228:
-

Assignee: Sean Owen

> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Assignee: Sean Owen
>Priority: Major
> Attachments: 1.jpeg
>
>
> {quote}/**
>  * Computes the Gramian matrix `A^T A`.
>  *
>  * @note This cannot be computed on matrices with more than 65535 columns.
>  */
> {quote}
> As the above annotation of computeGramianMatrix in RowMatrix.scala says, it 
> supports computing on matrices with no more than 65535 columns.
> However, we find that it throws an OOM (Requested Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
> The root cause seems to be that the treeAggregate writes a very long buffer array 
> (16000*16000*8) which exceeds the JVM limit (2^31 - 1).
> Does RowMatrix really support computing on matrices with no more than 65535 
> columns?
> I suspect that computeGramianMatrix has a very serious performance issue.
> Has anyone done performance experiments on this before?
>  
>  






[jira] [Resolved] (SPARK-26228) OOM issue encountered when computing Gramian matrix

2019-01-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26228.
---
   Resolution: Fixed
Fix Version/s: 2.3.4
   2.4.1
   3.0.0

Issue resolved by pull request 23600
[https://github.com/apache/spark/pull/23600]

> OOM issue encountered when computing Gramian matrix 
> 
>
> Key: SPARK-26228
> URL: https://issues.apache.org/jira/browse/SPARK-26228
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Chen Lin
>Assignee: Sean Owen
>Priority: Major
> Fix For: 3.0.0, 2.4.1, 2.3.4
>
> Attachments: 1.jpeg
>
>
> {quote}/**
>  * Computes the Gramian matrix `A^T A`.
>  *
>  * @note This cannot be computed on matrices with more than 65535 columns.
>  */
> {quote}
> As the above annotation of computeGramianMatrix in RowMatrix.scala says, it 
> supports computing on matrices with no more than 65535 columns.
> However, we find that it throws an OOM (Requested Array Size Exceeds VM Limit) 
> when computing on matrices with 16000 columns.
> The root cause seems to be that the treeAggregate writes a very long buffer array 
> (16000*16000*8) which exceeds the JVM limit (2^31 - 1).
> Does RowMatrix really support computing on matrices with no more than 65535 
> columns?
> I suspect that computeGramianMatrix has a very serious performance issue.
> Has anyone done performance experiments on this before?
>  
>  






[jira] [Assigned] (SPARK-26694) Console progress bar not showing in 3.0

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26694:


Assignee: (was: Apache Spark)

> Console progress bar not showing in 3.0
> ---
>
> Key: SPARK-26694
> URL: https://issues.apache.org/jira/browse/SPARK-26694
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> This is the code that initializes it:
> {code}
> _progressBar =
>   if (_conf.get(UI_SHOW_CONSOLE_PROGRESS) && !log.isInfoEnabled) {
> Some(new ConsoleProgressBar(this))
>   } else {
> None
>   }
> {code}
> SPARK-25118 changed the way the log system is initialized, and the code above 
> is not initializing the progress bar anymore.
> [~ankur.gupta]
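One illustrative direction, not the actual fix: gate the bar on the config alone, since the {{!log.isInfoEnabled}} guard is what the SPARK-25118 logging change invalidated. Whether dropping that guard is acceptable is an open question, because it existed to keep the bar from interleaving with INFO-level console logging.

{code:java}
_progressBar =
  if (_conf.get(UI_SHOW_CONSOLE_PROGRESS)) {
    Some(new ConsoleProgressBar(this))
  } else {
    None
  }
{code}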






[jira] [Assigned] (SPARK-26694) Console progress bar not showing in 3.0

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26694:


Assignee: Apache Spark

> Console progress bar not showing in 3.0
> ---
>
> Key: SPARK-26694
> URL: https://issues.apache.org/jira/browse/SPARK-26694
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Critical
>
> This is the code that initializes it:
> {code}
> _progressBar =
>   if (_conf.get(UI_SHOW_CONSOLE_PROGRESS) && !log.isInfoEnabled) {
> Some(new ConsoleProgressBar(this))
>   } else {
> None
>   }
> {code}
> SPARK-25118 changed the way the log system is initialized, and the code above 
> is not initializing the progress bar anymore.
> [~ankur.gupta]






[jira] [Assigned] (SPARK-26605) New executors failing with expired tokens in client mode

2019-01-22 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26605:
--

Assignee: Marcelo Vanzin

> New executors failing with expired tokens in client mode
> 
>
> Key: SPARK-26605
> URL: https://issues.apache.org/jira/browse/SPARK-26605
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
>
> We ran into an issue with new executors being started with expired tokens in 
> client mode; cluster mode is fine. Master branch is also not affected.
> This means that executors that start after 7 days would fail in this scenario.
> Patch coming up.






[jira] [Resolved] (SPARK-26605) New executors failing with expired tokens in client mode

2019-01-22 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26605.

   Resolution: Fixed
Fix Version/s: 2.4.1

Issue resolved by pull request 23523
[https://github.com/apache/spark/pull/23523]

> New executors failing with expired tokens in client mode
> 
>
> Key: SPARK-26605
> URL: https://issues.apache.org/jira/browse/SPARK-26605
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.4.1
>
>
> We ran into an issue with new executors being started with expired tokens in 
> client mode; cluster mode is fine. Master branch is also not affected.
> This means that executors that start after 7 days would fail in this scenario.
> Patch coming up.






[jira] [Resolved] (SPARK-24484) Power Iteration Clustering is giving incorrect clustering results when there are mutiple leading eigen values.

2019-01-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24484.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21627
[https://github.com/apache/spark/pull/21627]

> Power Iteration Clustering is giving incorrect clustering results when there 
> are mutiple leading eigen values.
> --
>
> Key: SPARK-24484
> URL: https://issues.apache.org/jira/browse/SPARK-24484
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Major
> Fix For: 3.0.0
>
>
> When there are multiple leading eigenvalues of the normalized affinity 
> matrix, power iteration clustering gives incorrect results.
> We should either give an error or a warning to the user when PIC doesn't 
> converge (i.e. when |\lambda_1/\lambda_2| = 1).
> {code:java}
> test("Fail to converge: Multiple leading eigen values") {
> /*
>  Graph:
>        2
>       /
>      /
>     1        3 - - 4
>  Adjacency matrix:
>         [(0, 1, 0, 0),
>          (1, 0, 0, 0),
>   A =    (0, 0, 0, 1),
>          (0, 0, 1, 0)]
> */
> val data = Seq[(Long, Long, Double)](
>   (1, 2, 1.0),
>   (3, 4, 1.0)
> ).toDF("src", "dst", "weight")
> val result = new PowerIterationClustering()
>   .setK(2)
>   .setMaxIter(20)
>   .setInitMode("random")
>   .setWeightCol("weight")
>   .assignClusters(data)
>   .select('id, 'cluster)
> val predictions = Array.fill(2)(mutable.Set.empty[Long])
> result.collect().foreach {
>   case Row(id: Long, cluster: Integer) => predictions(cluster) += id
> }
> assert(predictions.toSet == Set(Array(1, 2).toSet, Array(3, 4).toSet))
>   }
>  {code}
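
For context on the non-convergence condition above, the standard power-iteration 
bound (a textbook fact, not something introduced by this patch) makes the failure 
mode explicit; the notation below assumes eigenvalues ordered by magnitude and a 
starting vector with a nonzero component along the leading eigenvector.

{code:none}
% Power iteration on the normalized affinity matrix W, eigenvalues ordered
% |\lambda_1| \ge |\lambda_2| \ge \dots, iterates v_{k+1} = W v_k / \|W v_k\|_1.
% The distance to the leading eigenvector u_1 decays geometrically:
\[
  \| v_k - u_1 \| \;=\; O\!\left( \left| \frac{\lambda_2}{\lambda_1} \right|^{k} \right)
\]
% Hence when |\lambda_1 / \lambda_2| = 1 (multiple leading eigenvalues, as with
% the disconnected graph in the test above) the error never decays, and PIC has
% no single dominant eigenvector to converge to.
{code}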



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24484) Power Iteration Clustering is giving incorrect clustering results when there are mutiple leading eigen values.

2019-01-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-24484:
-

Assignee: shahid

> Power Iteration Clustering is giving incorrect clustering results when there 
> are mutiple leading eigen values.
> --
>
> Key: SPARK-24484
> URL: https://issues.apache.org/jira/browse/SPARK-24484
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Major
>
> When there are multiple leading eigenvalues of the normalized affinity 
> matrix, power iteration clustering gives incorrect results.
> We should either give an error or a warning to the user when PIC doesn't 
> converge (i.e. when |\lambda_1/\lambda_2| = 1).
> {code:java}
> test("Fail to converge: Multiple leading eigen values") {
> /*
>  Graph:
>        2
>       /
>      /
>     1        3 - - 4
>  Adjacency matrix:
>         [(0, 1, 0, 0),
>          (1, 0, 0, 0),
>   A =    (0, 0, 0, 1),
>          (0, 0, 1, 0)]
> */
> val data = Seq[(Long, Long, Double)](
>   (1, 2, 1.0),
>   (3, 4, 1.0)
> ).toDF("src", "dst", "weight")
> val result = new PowerIterationClustering()
>   .setK(2)
>   .setMaxIter(20)
>   .setInitMode("random")
>   .setWeightCol("weight")
>   .assignClusters(data)
>   .select('id, 'cluster)
> val predictions = Array.fill(2)(mutable.Set.empty[Long])
> result.collect().foreach {
>   case Row(id: Long, cluster: Integer) => predictions(cluster) += id
> }
> assert(predictions.toSet == Set(Array(1, 2).toSet, Array(3, 4).toSet))
>   }
>  {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26694) Console progress bar not showing in 3.0

2019-01-22 Thread Ankur Gupta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749280#comment-16749280
 ] 

Ankur Gupta commented on SPARK-26694:
-

I am working on this

> Console progress bar not showing in 3.0
> ---
>
> Key: SPARK-26694
> URL: https://issues.apache.org/jira/browse/SPARK-26694
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Critical
>
> This is the code that initializes it:
> {code}
> _progressBar =
>   if (_conf.get(UI_SHOW_CONSOLE_PROGRESS) && !log.isInfoEnabled) {
> Some(new ConsoleProgressBar(this))
>   } else {
> None
>   }
> {code}
> SPARK-25118 changed the way the log system is initialized, and the code above 
> is not initializing the progress bar anymore.
> [~ankur.gupta]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21708) use sbt 1.x

2019-01-22 Thread PJ Fanning (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PJ Fanning updated SPARK-21708:
---
Summary: use sbt 1.x  (was: use sbt 1.0.0)

> use sbt 1.x
> ---
>
> Key: SPARK-21708
> URL: https://issues.apache.org/jira/browse/SPARK-21708
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>Priority: Minor
>
> Should improve sbt build times.
> http://www.scala-sbt.org/1.0/docs/sbt-1.0-Release-Notes.html
> According to https://github.com/sbt/sbt/issues/3424, we will need to change 
> the HTTP location where we get the sbt-launch jar.
> Other related issues:
> SPARK-14401
> https://github.com/typesafehub/sbteclipse/issues/343
> https://github.com/jrudolph/sbt-dependency-graph/issues/134
> https://github.com/AlpineNow/junit_xml_listener/issues/6
> https://github.com/spray/sbt-revolver/issues/62
> https://github.com/ihji/sbt-antlr4/issues/14



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26427) Upgrade Apache ORC to 1.5.4

2019-01-22 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749223#comment-16749223
 ] 

Dongjoon Hyun commented on SPARK-26427:
---

Hi, [~smilegator] and [~cloud_fan].
In general, we don't change dependencies in a maintenance release.
But for this case, I'm wondering if we can take this to reduce the resource 
leakage in `branch-2.4` for the next Spark 2.4.1.
Can I proceed to make a backport PR?

> Upgrade Apache ORC to 1.5.4
> ---
>
> Key: SPARK-26427
> URL: https://issues.apache.org/jira/browse/SPARK-26427
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue aims to update the Apache ORC dependency to the latest version 
> 1.5.4, released on Dec. 20. ([Release 
> Notes|https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320=12344187])
> {code}
> [ORC-237] - OrcFile.mergeFiles Specified block size is less than configured 
> minimum value
> [ORC-409] - Changes for extending MemoryManagerImpl
> [ORC-410] - Fix a locale-dependent test in TestCsvReader
> [ORC-416] - Avoid opening data reader when there is no stripe
> [ORC-417] - Use dynamic Apache Maven mirror link
> [ORC-419] - Ensure to call `close` at RecordReaderImpl constructor exception
> [ORC-432] - openjdk 8 has a bug that prevents surefire from working
> [ORC-435] - Ability to read stripes that are greater than 2GB
> [ORC-437] - Make acid schema checks case insensitive
> [ORC-411] - Update build to work with Java 10.
> [ORC-418] - Fix broken docker build script
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23505) Flaky test: ParquetQuerySuite

2019-01-22 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749220#comment-16749220
 ] 

Dongjoon Hyun edited comment on SPARK-23505 at 1/22/19 10:31 PM:
-

I've been monitoring the Jenkins. At least, for `branch-2.4` and `master`, this 
seems to be resolved. I don't see the `ParquetQuerySuite` failure in Jenkins. 
I'll monitor more.


was (Author: dongjoon):
I've been monitoring Jenkins. At least, for `branch-2.4` and `master`, this 
seems to be resolved. I don't see the failure in Jenkins.

> Flaky test: ParquetQuerySuite
> -
>
> Key: SPARK-23505
> URL: https://issues.apache.org/jira/browse/SPARK-23505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> Seen on an unrelated PR;
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87635/testReport/org.apache.spark.sql.execution.datasources.parquet/ParquetQuerySuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01253324699 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01253324699 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.eventually(ParquetQuerySuite.scala:41)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.eventually(ParquetQuerySuite.scala:41)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.afterEach(ParquetQuerySuite.scala:41)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
>   at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
>   at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
>   at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.runTest(ParquetQuerySuite.scala:41)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> 

[jira] [Commented] (SPARK-23505) Flaky test: ParquetQuerySuite

2019-01-22 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749220#comment-16749220
 ] 

Dongjoon Hyun commented on SPARK-23505:
---

I've been monitoring Jenkins. At least, for `branch-2.4` and `master`, this 
seems to be resolved. I don't see the failure in Jenkins.

> Flaky test: ParquetQuerySuite
> -
>
> Key: SPARK-23505
> URL: https://issues.apache.org/jira/browse/SPARK-23505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> Seen on an unrelated PR;
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87635/testReport/org.apache.spark.sql.execution.datasources.parquet/ParquetQuerySuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01253324699 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01253324699 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.eventually(ParquetQuerySuite.scala:41)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.eventually(ParquetQuerySuite.scala:41)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.afterEach(ParquetQuerySuite.scala:41)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
>   at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
>   at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
>   at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.runTest(ParquetQuerySuite.scala:41)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> 

[jira] [Created] (SPARK-26694) Console progress bar not showing in 3.0

2019-01-22 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-26694:
--

 Summary: Console progress bar not showing in 3.0
 Key: SPARK-26694
 URL: https://issues.apache.org/jira/browse/SPARK-26694
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


This is the code that initializes it:

{code}
_progressBar =
  if (_conf.get(UI_SHOW_CONSOLE_PROGRESS) && !log.isInfoEnabled) {
Some(new ConsoleProgressBar(this))
  } else {
None
  }
{code}

SPARK-25118 changed the way the log system is initialized, and the code above 
is not initializing the progress bar anymore.

[~ankur.gupta]
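
As a self-contained illustration of the ordering hazard (this is not Spark's 
code, and the eventual fix may look different; the object and field names below 
are invented), a condition over mutable logging state that is captured eagerly 
at construction time sees the pre-initialization value, while a lazily evaluated 
one sees the value after the log system has been set up:

{code:scala}
object InitOrderSketch {
  // Stand-in for log4j state; in SPARK-26694 this is what SPARK-25118 now
  // configures later than SparkContext used to assume.
  var infoLoggingEnabled = true   // default before the shell lowers the level

  // Eager: evaluated while INFO still looks enabled, so no bar is created.
  val eagerBar: Option[String] =
    if (!infoLoggingEnabled) Some("console progress bar") else None

  // Lazy: evaluated on first use, i.e. after the log setup has finished.
  lazy val lazyBar: Option[String] =
    if (!infoLoggingEnabled) Some("console progress bar") else None

  def main(args: Array[String]): Unit = {
    infoLoggingEnabled = false    // the shell/log system finishes its setup here
    println(eagerBar)             // None  -- decided too early
    println(lazyBar)              // Some(console progress bar)
  }
}
{code}

Whether the actual patch defers the check, initializes logging earlier, or drops 
the isInfoEnabled guard entirely is up to the fix; the sketch only shows why the 
current check can silently flip.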



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26661) Show actual class name of the writing command in CTAS explain

2019-01-22 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26661.
---
   Resolution: Fixed
 Assignee: Kris Mok
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23582

> Show actual class name of the writing command in CTAS explain
> -
>
> Key: SPARK-26661
> URL: https://issues.apache.org/jira/browse/SPARK-26661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Trivial
> Fix For: 3.0.0
>
>
> The explain output of the Hive CTAS command, regardless of whether it's 
> actually writing via Hive's SerDe or converted into using Spark's data 
> source, would always show that it's using {{InsertIntoHiveTable}} because 
> it's hardcoded.
> e.g.
> {code:none}
> Execute OptimizedCreateHiveTableAsSelectCommand [Database:default, TableName: 
> foo, InsertIntoHiveTable]
> {code}
> This CTAS is converted into using Spark's data source, but it still says 
> {{InsertIntoHiveTable}} in the explain output.
> It's better to show the actual class name of the writing command used. For 
> the example above, it'd be:
> {code:none}
> Execute OptimizedCreateHiveTableAsSelectCommand [Database:default, TableName: 
> foo, InsertIntoHadoopFsRelationCommand]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26693) Large Numbers Truncated

2019-01-22 Thread Jason Blahovec (JIRA)
Jason Blahovec created SPARK-26693:
--

 Summary: Large Numbers Truncated 
 Key: SPARK-26693
 URL: https://issues.apache.org/jira/browse/SPARK-26693
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
 Environment: Code was run in Zeppelin using Spark 2.4.
Reporter: Jason Blahovec


We have a process that takes a file dumped from an external API and formats it 
for use in other processes.  These API dumps are brought into Spark with all 
fields read in as strings.  One of the fields is a 19 digit visitor ID.  Since 
implementing Spark 2.4 a few weeks ago, we have noticed that dataframes read 
the 19 digits correctly but any function in SQL appears to truncate the last 
two digits and replace them with "00".  

Our process is set up to convert these numbers to bigint, which worked before 
Spark 2.4.  We looked into data types, including the possibility of changing to 
a "long" type, with no luck.  At that point we tried bringing in the string 
value as is, with the same result.  I've added code that should replicate the 
issue with a few 19-digit test cases and demonstrate the type conversions I 
tried.

Results for the code below are shown here:

dfTestExpanded.show:

+-------------------+-------------------+-------------------+
|         idAsString|         idAsBigint|           idAsLong|
+-------------------+-------------------+-------------------+
|4065453307562594031|4065453307562594031|4065453307562594031|
|    765995720523059|    765995720523059|    765995720523059|
|1614560078712787995|1614560078712787995|1614560078712787995|
+-------------------+-------------------+-------------------+

Run this query in a paragraph:

%sql

select * from global_temp.testTable

and see these results (all 3 columns):

4065453307562594000

765995720523000

1614560078712788000

 

Another notable observation was that this issue does not appear to affect joins 
on the affected fields - we are seeing issues when the fields are used in where 
clauses or as part of a select list.

 

 
{code:python}
%pyspark

from pyspark.sql.functions import *
from pyspark.sql.types import StructField, StructType, StringType

sfTestValue = StructField("testValue", StringType(), True)
schemaTest = StructType([sfTestValue])

listTestValues = []
listTestValues.append(("4065453307562594031",))
listTestValues.append(("765995720523059",))
listTestValues.append(("1614560078712787995",))

dfTest = spark.createDataFrame(listTestValues, schemaTest)

dfTestExpanded = dfTest.selectExpr(
    "testValue as idAsString",
    "cast(testValue as bigint) as idAsBigint",
    "cast(testValue as long) as idAsLong")

dfTestExpanded.show()  # This will show three columns of data correctly.

# When this table is viewed in a %sql paragraph, the truncated values are shown.
dfTestExpanded.createOrReplaceGlobalTempView('testTable')
{code}
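
One way to narrow down where the digits are lost (a hedged suggestion, not part 
of the report) is to read the values back through Spark SQL itself and collect 
them as longs, bypassing the notebook's rendering path. The sketch below assumes 
the default Zeppelin setup where a %spark (Scala) paragraph shares the session, 
and hence the testTable global temp view, with the %pyspark and %sql paragraphs 
above.

{code:scala}
// If the collected longs keep all 19 digits, the truncation happens in the
// %sql display layer (e.g. a JavaScript double losing precision); if they
// already end in "00" here, Spark itself is altering the value.
val ids = spark.sql("SELECT idAsBigint FROM global_temp.testTable")
  .collect()
  .map(_.getLong(0))

ids.foreach(println)   // expect e.g. 4065453307562594031 if Spark is intact
{code}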



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26608) Remove Jenkins jobs for `branch-2.2`

2019-01-22 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749156#comment-16749156
 ] 

shane knapp commented on SPARK-26608:
-

no, it's for the jenkins job builder configs that are currently held in a 
databricks repo...  these will soon be ported over to the main spark repo.

> Remove Jenkins jobs for `branch-2.2`
> 
>
> Key: SPARK-26608
> URL: https://issues.apache.org/jira/browse/SPARK-26608
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.2.3
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
> Attachments: Screen Shot 2019-01-11 at 8.47.27 PM.png, Screen Shot 
> 2019-01-14 at 11.39.05 AM.png, Screen Shot 2019-01-14 at 11.42.46 AM.png, 
> Screen Shot 2019-01-22 at 12.47.43 PM.png
>
>
> This issue aims to remove the following Jenkins jobs for `branch-2.2` because 
> of EOL.
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.7/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.7/]
> As of today, the branch is healthy.
> !Screen Shot 2019-01-11 at 8.47.27 PM.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26608) Remove Jenkins jobs for `branch-2.2`

2019-01-22 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749151#comment-16749151
 ] 

Dongjoon Hyun commented on SPARK-26608:
---

Thanks! Is the pending spark config PR for the AmpLab Git repo instead of the 
Apache Spark repo?

> Remove Jenkins jobs for `branch-2.2`
> 
>
> Key: SPARK-26608
> URL: https://issues.apache.org/jira/browse/SPARK-26608
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.2.3
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
> Attachments: Screen Shot 2019-01-11 at 8.47.27 PM.png, Screen Shot 
> 2019-01-14 at 11.39.05 AM.png, Screen Shot 2019-01-14 at 11.42.46 AM.png, 
> Screen Shot 2019-01-22 at 12.47.43 PM.png
>
>
> This issue aims to remove the following Jenkins jobs for `branch-2.2` because 
> of EOL.
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.7/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.7/]
> As of today, the branch is healthy.
> !Screen Shot 2019-01-11 at 8.47.27 PM.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26608) Remove Jenkins jobs for `branch-2.2`

2019-01-22 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749137#comment-16749137
 ] 

shane knapp edited comment on SPARK-26608 at 1/22/19 8:48 PM:
--

they're all disabled, and i'm waiting on the spark config PR to be merged 
before i resolve this.

 !Screen Shot 2019-01-22 at 12.47.43 PM.png! 


was (Author: shaneknapp):
they're all disabled, and i'm waiting on the spark config PR to be merged 
before i resolve this.

> Remove Jenkins jobs for `branch-2.2`
> 
>
> Key: SPARK-26608
> URL: https://issues.apache.org/jira/browse/SPARK-26608
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.2.3
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
> Attachments: Screen Shot 2019-01-11 at 8.47.27 PM.png, Screen Shot 
> 2019-01-14 at 11.39.05 AM.png, Screen Shot 2019-01-14 at 11.42.46 AM.png, 
> Screen Shot 2019-01-22 at 12.47.43 PM.png
>
>
> This issue aims to remove the following Jenkins jobs for `branch-2.2` because 
> of EOL.
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.7/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.7/]
> As of today, the branch is healthy.
> !Screen Shot 2019-01-11 at 8.47.27 PM.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26608) Remove Jenkins jobs for `branch-2.2`

2019-01-22 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749137#comment-16749137
 ] 

shane knapp commented on SPARK-26608:
-

they're all disabled, and i'm waiting on the spark config PR to be merged 
before i resolve this.

> Remove Jenkins jobs for `branch-2.2`
> 
>
> Key: SPARK-26608
> URL: https://issues.apache.org/jira/browse/SPARK-26608
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.2.3
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
> Attachments: Screen Shot 2019-01-11 at 8.47.27 PM.png, Screen Shot 
> 2019-01-14 at 11.39.05 AM.png, Screen Shot 2019-01-14 at 11.42.46 AM.png, 
> Screen Shot 2019-01-22 at 12.47.43 PM.png
>
>
> This issue aims to remove the following Jenkins jobs for `branch-2.2` because 
> of EOL.
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.7/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.7/]
> As of today, the branch is healthy.
> !Screen Shot 2019-01-11 at 8.47.27 PM.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26608) Remove Jenkins jobs for `branch-2.2`

2019-01-22 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749122#comment-16749122
 ] 

Dongjoon Hyun commented on SPARK-26608:
---

Oh, sorry. I forgot to reply here. Yes, right!

> Remove Jenkins jobs for `branch-2.2`
> 
>
> Key: SPARK-26608
> URL: https://issues.apache.org/jira/browse/SPARK-26608
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.2.3
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
> Attachments: Screen Shot 2019-01-11 at 8.47.27 PM.png, Screen Shot 
> 2019-01-14 at 11.39.05 AM.png, Screen Shot 2019-01-14 at 11.42.46 AM.png
>
>
> This issue aims to remove the following Jenkins jobs for `branch-2.2` because 
> of EOL.
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.7/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.7/]
> As of today, the branch is healthy.
> !Screen Shot 2019-01-11 at 8.47.27 PM.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26608) Remove Jenkins jobs for `branch-2.2`

2019-01-22 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749100#comment-16749100
 ] 

shane knapp commented on SPARK-26608:
-

ping [~dongjoon]

i'm assuming we'll want to remove ALL 2.2 jobs, yes?

> Remove Jenkins jobs for `branch-2.2`
> 
>
> Key: SPARK-26608
> URL: https://issues.apache.org/jira/browse/SPARK-26608
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.2.3
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
> Attachments: Screen Shot 2019-01-11 at 8.47.27 PM.png, Screen Shot 
> 2019-01-14 at 11.39.05 AM.png, Screen Shot 2019-01-14 at 11.42.46 AM.png
>
>
> This issue aims to remove the following Jenkins jobs for `branch-2.2` because 
> of EOL.
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.7/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.6/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.7/]
> As of today, the branch is healthy.
> !Screen Shot 2019-01-11 at 8.47.27 PM.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26685) Building Spark Images with latest Docker does not honour spark_uid build argument

2019-01-22 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26685.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23611
[https://github.com/apache/spark/pull/23611]

> Building Spark Images with latest Docker does not honour spark_uid build 
> argument
> -
>
> Key: SPARK-26685
> URL: https://issues.apache.org/jira/browse/SPARK-26685
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Rob Vesse
>Assignee: Rob Vesse
>Priority: Major
> Fix For: 3.0.0
>
>
> Latest Docker releases are stricter in their interpretation of the scope of 
> build arguments, meaning the location of the {{ARG spark_uid}} declaration 
> puts it out of scope by the time the variable is consumed, resulting in the 
> Python and R images still running as {{root}} regardless of what the user may 
> have specified as the desired UID.
> e.g. Images built with {{-u 456}} provided to {{bin/docker-image-tool.sh}}
> {noformat}
> > docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456
> bash-4.4# whoami
> root
> bash-4.4# id -u
> 0
> bash-4.4# exit
> > docker run -it --entrypoint /bin/bash rvesse/spark:uid456
> bash-4.4$ id -u
> 456
> bash-4.4$ exit
> {noformat}
> Note that for the Python image the build argument was out of scope and 
> ignored.  For the base image the {{ARG}} declaration is in an in-scope 
> location and so is honoured correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26685) Building Spark Images with latest Docker does not honour spark_uid build argument

2019-01-22 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26685:
--

Assignee: Rob Vesse

> Building Spark Images with latest Docker does not honour spark_uid build 
> argument
> -
>
> Key: SPARK-26685
> URL: https://issues.apache.org/jira/browse/SPARK-26685
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Rob Vesse
>Assignee: Rob Vesse
>Priority: Major
>
> Latest Docker releases are stricter in their interpretation of the scope of 
> build arguments, meaning the location of the {{ARG spark_uid}} declaration 
> puts it out of scope by the time the variable is consumed, resulting in the 
> Python and R images still running as {{root}} regardless of what the user may 
> have specified as the desired UID.
> e.g. Images built with {{-u 456}} provided to {{bin/docker-image-tool.sh}}
> {noformat}
> > docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456
> bash-4.4# whoami
> root
> bash-4.4# id -u
> 0
> bash-4.4# exit
> > docker run -it --entrypoint /bin/bash rvesse/spark:uid456
> bash-4.4$ id -u
> 456
> bash-4.4$ exit
> {noformat}
> Note that for the Python image the build argument was out of scope and 
> ignored.  For the base image the {{ARG}} declaration is in an in-scope 
> location and so is honoured correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25887) Allow specifying Kubernetes context to use

2019-01-22 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25887.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22904
[https://github.com/apache/spark/pull/22904]

> Allow specifying Kubernetes context to use
> --
>
> Key: SPARK-25887
> URL: https://issues.apache.org/jira/browse/SPARK-25887
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Rob Vesse
>Assignee: Rob Vesse
>Priority: Major
> Fix For: 3.0.0
>
>
> In working on SPARK-25809, support was added to the integration testing 
> machinery for Spark on K8S to use an arbitrary context from the user's K8S 
> config file.  However this can fail/cause false positives because, regardless 
> of what the integration test harness does, the K8S submission client uses the 
> Fabric 8 client library in such a way that it only ever configures itself 
> from the current context.
> For users who work with multiple K8S clusters, or who have multiple K8S 
> "users" for interacting with their cluster, being able to support arbitrary 
> contexts without forcing the user to first run {{kubectl config use-context 
> }} is an important improvement.
> This would be a fairly small fix to {{SparkKubernetesClientFactory}} and an 
> associated configuration key, likely {{spark.kubernetes.context}}, to go along 
> with this.
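
A minimal usage sketch, assuming the key really does land as 
{{spark.kubernetes.context}} (the description marks that name as tentative) and 
that the value names an existing context in the user's kubeconfig; the master 
URL and context name below are placeholders.

{code:scala}
import org.apache.spark.SparkConf

// Select a non-current kubeconfig context for submission without first
// running `kubectl config use-context`.
val conf = new SparkConf()
  .setMaster("k8s://https://example-apiserver:6443")      // placeholder API server
  .setAppName("k8s-context-demo")
  .set("spark.kubernetes.context", "staging-cluster")     // placeholder context name
{code}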



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25887) Allow specifying Kubernetes context to use

2019-01-22 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25887:
--

Assignee: Rob Vesse

> Allow specifying Kubernetes context to use
> --
>
> Key: SPARK-25887
> URL: https://issues.apache.org/jira/browse/SPARK-25887
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Rob Vesse
>Assignee: Rob Vesse
>Priority: Major
>
> In working on SPARK-25809, support was added to the integration testing 
> machinery for Spark on K8S to use an arbitrary context from the user's K8S 
> config file.  However this can fail/cause false positives because, regardless 
> of what the integration test harness does, the K8S submission client uses the 
> Fabric 8 client library in such a way that it only ever configures itself 
> from the current context.
> For users who work with multiple K8S clusters, or who have multiple K8S 
> "users" for interacting with their cluster, being able to support arbitrary 
> contexts without forcing the user to first run {{kubectl config use-context 
> }} is an important improvement.
> This would be a fairly small fix to {{SparkKubernetesClientFactory}} and an 
> associated configuration key, likely {{spark.kubernetes.context}}, to go along 
> with this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-22 Thread ANAND CHINNAKANNAN (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748879#comment-16748879
 ] 

ANAND CHINNAKANNAN edited comment on SPARK-26677 at 1/22/19 6:04 PM:
-

I have done the analysis for this bug. Below is the initial analysis:

scala> Seq("A", "B", null).toDS.repartition(1).write.parquet("t3");

scala> spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show;

+-----+
|value|
+-----+
|    B|
| null|
+-----+

The issue happens only if the column has duplicate row data. Will do 
further analysis.


was (Author: anandchinn):
I have done the analysis for this bug. Below is the initial analysis:

scala> Seq("A", "B", null).toDS.repartition(1).write.parquet("t3");

scala> spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show;

+-----+
|value|
+-----+
|    B|
| null|
+-----+

The issue happens only if the column has duplicate row data. I will do 
further research and dig deeper into the code.

 

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Critical
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
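
Two quick checks that may help localize this (suggestions only, not analysis 
from the reporter): first, evaluate the same predicate on the in-memory Dataset 
without the Parquet round trip to confirm the expected answer; second, disable 
Parquet filter pushdown before re-reading, on the unverified assumption that the 
pushdown translation of not(eqNullSafe) is where the null row is dropped.

{code:scala}
// spark-shell
import org.apache.spark.sql.functions.{col, not}
import spark.implicits._

// Baseline without Parquet: the predicate should keep only the null row.
Seq("A", "A", null).toDS
  .where(not(col("value").eqNullSafe("A")))
  .show()

// Re-read the Parquet data with filter pushdown disabled; if the null row
// reappears, the pushdown path is the likely culprit.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show()
{code}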



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-22 Thread ANAND CHINNAKANNAN (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748879#comment-16748879
 ] 

ANAND CHINNAKANNAN edited comment on SPARK-26677 at 1/22/19 6:03 PM:
-

I have done the analysis for this bug. Below is the initial analysis:

scala> Seq("A", "B", null).toDS.repartition(1).write.parquet("t3");

scala> spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show;

+-----+
|value|
+-----+
|    B|
| null|
+-----+

The issue happens only if the column has duplicate row data. I will do 
further research and dig deeper into the code.

 


was (Author: anandchinn):
I have done the analysis for this bug. Below is the initial analysis:

scala> Seq("A", "B", null).toDS.repartition(1).write.parquet("t3");

scala> spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show;

+-----+
|value|
+-----+
|    B|
| null|
+-----+

The issue happens only if the column has duplicate row data. I will do 
further research and dig deeper into the code.

 

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Critical
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> +-+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-+
> |value|
> +-+
> | null|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26668) One Kafka broker serve is down,the spark streaming start consuming delay

2019-01-22 Thread Chang Quanyou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Quanyou updated SPARK-26668:
--
Attachment: batch_interval_6s.png
batch_interval_6s_processing.png
 batch_interval_stage.png
 batch_interval_10min_later.png

> One Kafka broker serve is down,the spark streaming start consuming delay
> 
>
> Key: SPARK-26668
> URL: https://issues.apache.org/jira/browse/SPARK-26668
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Chang Quanyou
>Priority: Major
> Attachments:  batch_interval_10min_later.png,  
> batch_interval_stage.png, 10.4.41.64_shutdown.png, 10.4.42.64_start.png, 
> batch_interval_6s.png, batch_interval_6s_processing.png, executor.log, 
> kafka_consumer.log, shutdown.png
>
>
> description: The Kafka cluster has 5 brokers and one server is down (the 
> machine was shut down due to a disk failure) and we can't reboot it. The Spark 
> Streaming job then becomes delayed: the batch processing time grows from 3s to 
> 1 min and the batch size shrinks. When the server is started again, the 
> batches return to normal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26668) One Kafka broker serve is down,the spark streaming start consuming delay

2019-01-22 Thread Chang Quanyou (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748954#comment-16748954
 ] 

Chang Quanyou commented on SPARK-26668:
---

Thank you for replying. Here are the details:

1: It is DStream; the Spark Streaming version is 2.2.

2: Kafka integration: spark-streaming-kafka-0-10_2.11; the underlying Kafka 
version is 0.10.1.

The detailed steps (follow them in order and you can reproduce it easily):

1: Start with at least 2 brokers and restart all of them.

2: Create one topic.

3: Prepare code to consume the topic; of course the topic has data.

4: Pick one broker and run kafka-server-stop.sh, then watch the UI; everything 
is normal, no influence.

5: Run `./kafka-server-start.sh ../config/server.properties` (note: without 
-daemon). Five seconds later press Ctrl+C and run jps; you can still see the 
(dead) Kafka process. The strange thing is that the scheduling time and 
processing time then increase. This step is similar to shutting the broker 
down, but it is easier to test this way.

Okay, that's my test procedure; now the logs:

1: Executor log: please see the attachment executor.log. The two broker 
addresses are [10.4.42.65:6667, 10.4.42.64:6667], and 10.4.42.64 was shut down 
(just like step 5).

2: I also used the plain Kafka consumer API and checked its log; it updates the 
cluster metadata and removes 10.4.42.64, see the attachment kafka_consumer.log.

3: I also uploaded the wireshark results for when the machine goes down or 
restarts.

Okay, that's the detail; now my workaround.

To be honest, I almost wanted to give up; I tried many ways to solve it but 
couldn't. Finally I analyzed the stages and looked at the event time, and found 
that only one or two tasks took much longer (3s~4s) while the batch interval is 
3s. So I increased the batch interval to 6s and it was okay, though there are 
still problems (a minimal sketch of this change follows after the attachment 
list). See the attachments:

batch_interval_stage.png: the event time case
batch_interval_6s.png / batch_interval_6s_processing.png: after increasing the 
batch interval to 6s
batch_interval_10min_later.png: 10 minutes after increasing the interval
 
 
 
 

> One Kafka broker serve is down,the spark streaming start consuming delay
> 
>
> Key: SPARK-26668
> URL: https://issues.apache.org/jira/browse/SPARK-26668
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Chang Quanyou
>Priority: Major
> Attachments:  batch_interval_10min_later.png,  
> batch_interval_stage.png, 10.4.41.64_shutdown.png, 10.4.42.64_start.png, 
> batch_interval_6s.png, batch_interval_6s_processing.png, executor.log, 
> kafka_consumer.log, shutdown.png
>
>
> description: The Kafka cluster has 5 brokers and one server is down (the 
> machine was shut down due to a disk failure) and we can't reboot it. The Spark 
> Streaming job then becomes delayed: the batch processing time grows from 3s to 
> 1 min and the batch size shrinks. When the server is started again, the 
> batches return to normal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26692) Structured Streaming: Aggregation + JOIN not working

2019-01-22 Thread Theo Diefenthal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Theo Diefenthal resolved SPARK-26692.
-
Resolution: Invalid

Just read the crucial part of the doc again
{code:java}
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#support-matrix-for-joins-in-streaming-querie{code}
{code:java}
As of Spark 2.3, you cannot use other non-map-like operations before joins. 
Here are a few examples of what cannot be used.

- Cannot use streaming aggregations before joins.

- Cannot use mapGroupsWithState and flatMapGroupsWithState in Update mode 
before joins.{code}
It would still be nice if Spark would raise an exception instead of just 
delivering empty results

> Structured Streaming: Aggregation + JOIN not working
> 
>
> Key: SPARK-26692
> URL: https://issues.apache.org/jira/browse/SPARK-26692
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Theo Diefenthal
>Priority: Major
>
> I tried to set up a simple streaming pipeline with two streams on two data 
> sources (CSV files) where one stream is first windowed (aggregated) and then 
> the streams are joined. As output, I chose console for development in Append 
> Mode.
> After multiple hours of setup and testing, I still wasn't able to get a 
> working example running. I also found a StackOverflow topic here 
> [https://stackoverflow.com/questions/52300247/spark-scala-structured-streaming-aggregation-and-self-join]
>  where they have the same findings as I had: "In general, {{append}} output 
> mode with aggregations is not a recommended way. As far as I understand". And 
> "I had the same empty output in my case with different aggregation. Once I 
> changed it to {{update}} mode I got output. I have strong doubts that you 
> will be able to do it in {{append}} mode. My understanding is that {{append}} 
> mode only for {{map-like}} operations, e.g. filter, transform entry etc. I 
> believe that multiphase processing". 
> I observed the same and only got empty output in my example:
> {code:java}
> public static SparkSession buildSession() throws Exception {
> return SparkSession.builder()
> .appName("StreamGroupJoin")
> .config("spark.sql.shuffle.partitions", 4)
> .master("local[2]")
> .getOrCreate();
> }
> public static Dataset<Row> loadData(SparkSession session, String filepath) {
> return session
> .readStream()
> .format("csv")
> .option("header", true)
> .option("path", filepath)
> .schema(new StructType().add("ts", 
> DataTypes.TimestampType).add("color", DataTypes.StringType).add("data", 
> DataTypes.StringType))
> .load();
> }
> public static void main(String[] args) throws Exception {
> SparkSession session = buildSession();
> Dataset<Row> shieldStream = loadData(session, 
> "streamingpoc/src/main/resources/simpleSHIELD");
> Dataset<Row> argusStream = loadData(session, 
> "streamingpoc/src/main/resources/simpleARGUS");
> shieldStream = shieldStream.withWatermark("ts", "0 hours");
> argusStream = argusStream.withWatermark("ts", "0 hours");
> argusStream = argusStream.groupBy(window(col("ts"), "24 hours"), 
> col("color")).count();
> argusStream = argusStream.select(col("window.start").as("argusStart"), 
> col("window.end").as("argusEnd"), col("color").as("argusColor"), 
> col("count").as("argusCount"));
> //argusStream = argusStream.withWatermark("argusStart", "0 hours");
> Dataset<Row> joinedStream = argusStream.join(shieldStream, expr("color = 
> argusColor AND ts >= argusStart AND ts <= argusEnd"));
> joinedStream = joinedStream.withWatermark("ts", "0 hours");
> StreamingQuery joinedQuery = joinedStream.writeStream()
> .outputMode(OutputMode.Append())
> .format("console")
> .start();
> joinedQuery.awaitTermination();
> System.out.println("DONE");
> }{code}
> I'd like to address that at least in my testing version of Spark 2.4.0, it is 
> not even possible to switch to OutputMode.Update due to "_Inner join between 
> two streaming DataFrames/Datasets is not supported in Complete output mode, 
> only in Append output mode_"
> In my example, I used two simple CSV datasets having the same format and one 
> matching row which should be output after the JOIN. 
> If I work without JOIN, both streams (aggregated and not) work fine. If I 
> work without aggregation, JOIN works fine. But if I use both (at least in 
> append mode), it doesn't work out. If I don't use Spark Structured Streaming 
> but standard Spark Dataframes, I get the result I also planned to have. 
> *Possible Solutions*
>  #  Either there is a bug/misuse in my 

[jira] [Commented] (SPARK-24437) Memory leak in UnsafeHashedRelation

2019-01-22 Thread Dave DeCaprio (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748948#comment-16748948
 ] 

Dave DeCaprio commented on SPARK-24437:
---

I'm actually running into a very similar situation right now.

[~dvogelbacher], I'm interested to know what your solution was.

> Memory leak in UnsafeHashedRelation
> ---
>
> Key: SPARK-24437
> URL: https://issues.apache.org/jira/browse/SPARK-24437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: gagan taneja
>Priority: Critical
> Attachments: Screen Shot 2018-05-30 at 2.05.40 PM.png, Screen Shot 
> 2018-05-30 at 2.07.22 PM.png, Screen Shot 2018-11-01 at 10.38.30 AM.png
>
>
> There seems to be a memory leak with 
> org.apache.spark.sql.execution.joins.UnsafeHashedRelation
> We have a long running instance of STS.
> With each query execution requiring Broadcast Join, UnsafeHashedRelation is 
> getting added for cleanup in ContextCleaner. This reference to 
> UnsafeHashedRelation is being held by some other collection and does not 
> become eligible for GC, so ContextCleaner is not able to clean it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26665) BlockTransferService.fetchBlockSync may hang forever

2019-01-22 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26665:
-
Fix Version/s: 2.3.4

> BlockTransferService.fetchBlockSync may hang forever
> 
>
> Key: SPARK-26665
> URL: https://issues.apache.org/jira/browse/SPARK-26665
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> `ByteBuffer.allocate` may throw OutOfMemoryError when the block is large but 
> not enough memory is available. When this happens, 
> BlockTransferService.fetchBlockSync currently just hangs forever.
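
Editor's note: a minimal, generic sketch of the failure mode, using only scala.concurrent (this is not the actual Spark patch). If an OutOfMemoryError escapes the fetch callback before the promise is completed, a synchronous wait on that promise never returns; catching the error and failing the promise lets the caller fail fast instead:

{code:java}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration

// A callback that allocates a large buffer may throw OutOfMemoryError. If the
// error escaped before success/failure was recorded, the promise would never
// complete and Await.result below would block forever.
def onBlockReceived(promise: Promise[Array[Byte]], blockSize: Int): Unit = {
  try {
    val buf = new Array[Byte](blockSize)  // stand-in for ByteBuffer.allocate
    promise.success(buf)
  } catch {
    case t: Throwable =>
      promise.tryFailure(t)               // fail fast instead of hanging the caller
  }
}

val p = Promise[Array[Byte]]()
onBlockReceived(p, 1024)
val block = Await.result(p.future, Duration.Inf)  // returns, or rethrows the captured error
{code}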



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26668) One Kafka broker serve is down,the spark streaming start consuming delay

2019-01-22 Thread Chang Quanyou (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang Quanyou updated SPARK-26668:
--
Attachment: kafka_consumer.log
executor.log
10.4.42.64_start.png
10.4.41.64_shutdown.png

> One Kafka broker serve is down,the spark streaming start consuming delay
> 
>
> Key: SPARK-26668
> URL: https://issues.apache.org/jira/browse/SPARK-26668
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Chang Quanyou
>Priority: Major
> Attachments: 10.4.41.64_shutdown.png, 10.4.42.64_start.png, 
> executor.log, kafka_consumer.log, shutdown.png
>
>
> Description: The Kafka cluster has 5 brokers and one server is down (the machine 
> was shut down due to a disk failure) and cannot be rebooted. We observed that the 
> Spark Streaming job became delayed: the batch processing time grew from 3 s to 
> about 1 min and the batch size shrank. Once the server was started again, the 
> batches returned to normal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26665) BlockTransferService.fetchBlockSync may hang forever

2019-01-22 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26665:
-
Affects Version/s: 2.3.0
   2.3.1

> BlockTransferService.fetchBlockSync may hang forever
> 
>
> Key: SPARK-26665
> URL: https://issues.apache.org/jira/browse/SPARK-26665
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> `ByteBuffer.allocate` may throw OutOfMemoryError when the block is large but 
> not enough memory is available. When this happens, 
> BlockTransferService.fetchBlockSync currently just hangs forever.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26665) BlockTransferService.fetchBlockSync may hang forever

2019-01-22 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26665:
-
Affects Version/s: 2.3.2

> BlockTransferService.fetchBlockSync may hang forever
> 
>
> Key: SPARK-26665
> URL: https://issues.apache.org/jira/browse/SPARK-26665
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> `ByteBuffer.allocate` may throw OutOfMemoryError when the block is large but 
> not enough memory is available. When this happens, 
> BlockTransferService.fetchBlockSync currently just hangs forever.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-01-22 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748922#comment-16748922
 ] 

Attila Zsolt Piros commented on SPARK-26688:


As far as I know, a node can only have one label.

So if node labels are already used for something else (to group your nodes in 
some way), this could be a lightweight way to still exclude some YARN nodes.

> Provide configuration of initially blacklisted YARN nodes
> -
>
> Key: SPARK-26688
> URL: https://issues.apache.org/jira/browse/SPARK-26688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Introducing new config for initially blacklisted YARN nodes.
> This came up in the apache spark user mailing list: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26692) Structured Streaming: Aggregation + JOIN not working

2019-01-22 Thread Theo Diefenthal (JIRA)
Theo Diefenthal created SPARK-26692:
---

 Summary: Structured Streaming: Aggregation + JOIN not working
 Key: SPARK-26692
 URL: https://issues.apache.org/jira/browse/SPARK-26692
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Theo Diefenthal


I tried to set up a simple streaming pipeline with two streams on two data 
sources (CSV files), where one stream is first windowed (aggregated) and then the 
streams are joined. As output, I chose the console sink in Append mode for development.

After multiple hours of setup and testing, I still wasn't able to get a working 
example running. I also found a StackOverflow topic here 
[https://stackoverflow.com/questions/52300247/spark-scala-structured-streaming-aggregation-and-self-join]
 where others report the same findings as I had: "In general, {{append}} output 
mode with aggregations is not a recommended way. As far as I understand". And 
"I had the same empty output in my case with different aggregation. Once I 
changed it to {{update}} mode I got output. I have strong doubts that you will 
be able to do it in {{append}} mode. My understanding is that {{append}} mode 
only for {{map-like}} operations, e.g. filter, transform entry etc. I believe 
that multiphase processing". 

I observed the same and only got empty output in my example:
{code:java}
public static SparkSession buildSession() throws Exception {
return SparkSession.builder()
.appName("StreamGroupJoin")
.config("spark.sql.shuffle.partitions", 4)
.master("local[2]")
.getOrCreate();
}

public static Dataset loadData(SparkSession session, String filepath) {
return session
.readStream()
.format("csv")
.option("header", true)
.option("path", filepath)
.schema(new StructType().add("ts", 
DataTypes.TimestampType).add("color", DataTypes.StringType).add("data", 
DataTypes.StringType))
.load();
}

public static void main(String[] args) throws Exception {

SparkSession session = buildSession();

Dataset shieldStream = loadData(session, 
"streamingpoc/src/main/resources/simpleSHIELD");
Dataset argusStream = loadData(session, 
"streamingpoc/src/main/resources/simpleARGUS");

shieldStream = shieldStream.withWatermark("ts", "0 hours");

argusStream = argusStream.withWatermark("ts", "0 hours");
argusStream = argusStream.groupBy(window(col("ts"), "24 hours"), 
col("color")).count();
argusStream = argusStream.select(col("window.start").as("argusStart"), 
col("window.end").as("argusEnd"), col("color").as("argusColor"), 
col("count").as("argusCount"));
//argusStream = argusStream.withWatermark("argusStart", "0 hours");

Dataset joinedStream = argusStream.join(shieldStream, expr("color = 
argusColor AND ts >= argusStart AND ts <= argusEnd"));
joinedStream = joinedStream.withWatermark("ts", "0 hours");

StreamingQuery joinedQuery = joinedStream.writeStream()
.outputMode(OutputMode.Append())
.format("console")
.start();

joinedQuery.awaitTermination();

System.out.println("DONE");
}{code}
I'd like to point out that, at least in my test version of Spark 2.4.0, it is 
not even possible to switch to OutputMode.Update, due to "_Inner join between 
two streaming DataFrames/Datasets is not supported in Complete output mode, 
only in Append output mode_".

In my example, I used two simple CSV datasets having the same format and one 
matching row which should be output after the JOIN. 

If I work without the JOIN, both streams (aggregated and not) work fine. If I work 
without the aggregation, the JOIN works fine. But if I use both (at least in append 
mode), no output is produced. If I don't use Spark Structured Streaming but 
standard Spark DataFrames, I get the expected result. 
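
(Editor-added illustration in Scala, not the reporter's code.) For reference, a minimal batch, i.e. non-streaming, sketch of the same aggregate-then-join, showing the result the streaming pipeline above is expected to produce. It assumes the same ts/color/data CSV layout and reuses the resource paths from the example above:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("BatchEquivalent").master("local[2]").getOrCreate()

// Same header and schema as the streaming example above; paths are placeholders.
def load(path: String) = spark.read
  .option("header", "true")
  .schema("ts timestamp, color string, data string")
  .csv(path)

val shield = load("streamingpoc/src/main/resources/simpleSHIELD")
val argus  = load("streamingpoc/src/main/resources/simpleARGUS")

// 24h windowed count per color on the ARGUS side ...
val argusAgg = argus
  .groupBy(window(col("ts"), "24 hours"), col("color")).count()
  .select(col("window.start").as("argusStart"), col("window.end").as("argusEnd"),
          col("color").as("argusColor"), col("count").as("argusCount"))

// ... then join the SHIELD rows that fall into the matching window.
val joined = shield.join(argusAgg,
  expr("color = argusColor AND ts >= argusStart AND ts <= argusEnd"))

joined.show(false)
{code}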

*Possible Solutions*
 #  Either there is a bug or a misuse in my code. In that case, the ticket can be 
closed and I'd be happy if someone could tell me what I did wrong. I tried 
quite a lot of different watermark settings but wasn't able to find a working 
setup.
 # Fix OutputMode.Append for this case if technically possible (from my 
theoretical understanding of big-data streaming in general, this should be 
possible, but I'm not too deep into the topic and I'm not familiar with Spark 
internals).
 # Make this combination unavailable in Spark (i.e. print a pipeline error stating 
that an aggregated stream can't be joined in append mode, as is already done if I 
try to join two aggregated streams). In that case, the documentation should also 
be updated to state that anyone who wants to perform aggregations and JOINs is 
advised to write the aggregation output back to a sink such as Kafka and re-read 
it from there for the join.

 

Following is the result I'd like to obtain (and which 

[jira] [Commented] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-01-22 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748915#comment-16748915
 ] 

Mridul Muralidharan commented on SPARK-26688:
-

What is the use case for this?
As others have mentioned on the mailing list, node labels are the typical way to 
require (or prevent) that allocation requests satisfy some constraint.

> Provide configuration of initially blacklisted YARN nodes
> -
>
> Key: SPARK-26688
> URL: https://issues.apache.org/jira/browse/SPARK-26688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Introducing new config for initially blacklisted YARN nodes.
> This came up in the apache spark user mailing list: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26665) BlockTransferService.fetchBlockSync may hang forever

2019-01-22 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26665.
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

> BlockTransferService.fetchBlockSync may hang forever
> 
>
> Key: SPARK-26665
> URL: https://issues.apache.org/jira/browse/SPARK-26665
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> `ByteBuffer.allocate` may throw OutOfMemoryError when the block is large but 
> not enough memory is available. When this happens, 
> BlockTransferService.fetchBlockSync currently just hangs forever.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26649) Noop Streaming Sink using DSV2

2019-01-22 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748887#comment-16748887
 ] 

Gabor Somogyi commented on SPARK-26649:
---

Started to work on this.

> Noop Streaming Sink using DSV2
> --
>
> Key: SPARK-26649
> URL: https://issues.apache.org/jira/browse/SPARK-26649
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Extend this noop data source to support a streaming sink that ignores all the 
> elements.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26691) WholeStageCodegen after InMemoryTableScan task takes significant time and time increases based on the input size

2019-01-22 Thread Vikash Kumar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikash Kumar updated SPARK-26691:
-
Summary: WholeStageCodegen after InMemoryTableScan task takes significant 
time and time increases based on the input size  (was: WholeStageCodegen after 
InMemoryTableScan task takes more time and time increases based on the input 
size)

> WholeStageCodegen after InMemoryTableScan task takes significant time and 
> time increases based on the input size
> 
>
> Key: SPARK-26691
> URL: https://issues.apache.org/jira/browse/SPARK-26691
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Vikash Kumar
>Priority: Major
> Attachments: DataScale_LkpPolicy_FirstRow_SF1_JobDetail.png, 
> DataScale_LkpPolicy_FirstRow_SF50_JobDetail.png, WholeStageCodegen.PNG
>
>
> Scenario: I am doing a left outer join between a streaming dataframe and a 
> static dataframe and writing the result to a Kafka target. The static dataframe 
> is created from a Hive source and the streaming dataframe is created from a 
> Kafka source, and the two dataframes are joined on an equality condition. Here 
> is a sample program.
>  
> {code:java}
> package com.spark.exec;
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions.{ broadcast => infabroadcast }
> import java.io._
> import java.sql.Timestamp
> import scala.reflect.ClassTag
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.streaming._
> import org.apache.spark.sql.streaming.Trigger._
> import java.util.UUID.randomUUID
> import org.apache.spark.storage.StorageLevel
> object Spark0 {
> def main(s:Array[String]) {
> val sqlContext = SparkSession.builder().enableHiveSupport().getOrCreate()
> import sqlContext.implicits._
> import org.apache.spark.sql.functions.{stddev_samp, var_samp}
> val v1 = 
> sqlContext.readStream.format("kafka").option("kafka.bootstrap.servers", 
> "localhost:49092").option("subscribe", "source").load().toDF();
> val schema = StructType(List(StructField("id", IntegerType, true), 
> StructField("name", StringType, true)))
> val stream = v1.selectExpr("cast (value as string) as json")
> .select(from_json($"json", schema=schema) as "data")
> .select("data.*")
> val static = sqlContext.sql("SELECT hive_lookup.c0 as id, hive_lookup.c1 as 
> name FROM 
> default.hive_lookup").repartition(18).persist(StorageLevel.MEMORY_AND_DISK).toDF;
> val joinDF = stream.join(static, stream.col("id").equalTo(static.col("id")), 
> "left_outer")
> val result = joinDF.selectExpr("to_json(struct(*)) AS value")
> val UUID = randomUUID().toString
> val checkpoint = "/tmp/" + UUID
> result.writeStream.format("kafka").option("kafka.bootstrap.servers", 
> "localhost:49092")
> .option("topic", "target").options(Map(Tuple2("batch.size", "16384"), 
> Tuple2("metadata.fetch.timeout.ms", "1"), Tuple2("linger.ms", "1000")))
> .option("checkpointLocation", 
> checkpoint).trigger(Trigger.ProcessingTime(2L)).start()
> val activeStreams = sqlContext.streams.active
> activeStreams.foreach( stream => stream.awaitTermination())
> }
> }
> {code}
>  
> On the static dataframe I applied the repartition and persist functions.
> {code:java}
> val static = sqlContext.sql("SELECT hive_lookup.c0 as id, hive_lookup.c1 as 
> name FROM 
> default.hive_lookup").repartition(18).persist(StorageLevel.MEMORY_AND_DISK).toDF;{code}
> What I observed in the Spark UI is that the WholeStageCodegen stage after 
> InMemoryTableScan takes a significant amount of time in every batch, which 
> degrades performance. The time also increases with the size of the dataset in 
> the Hive source (the static dataframe), even though we have already persisted 
> the data after repartitioning. What is WholeStageCodegen doing here that takes 
> a significant amount of time depending on the Hive source dataset? Is this 
> happening by design?
> The expectation is that once we have repartitioned and persisted the dataframe 
> in memory or on disk, we should only need to read the data from memory and pass 
> it to the join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26657) Port DayWeek, DayOfWeek and WeekDay on Proleptic Gregorian calendar

2019-01-22 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-26657.
---
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 3.0.0

> Port DayWeek, DayOfWeek and WeekDay on Proleptic Gregorian calendar
> ---
>
> Key: SPARK-26657
> URL: https://issues.apache.org/jira/browse/SPARK-26657
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently DayWeek and its children use hybrid calendar. Need to port the 
> classes on java.time and use Proleptic Gregorian calendar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-22 Thread ANAND CHINNAKANNAN (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748879#comment-16748879
 ] 

ANAND CHINNAKANNAN edited comment on SPARK-26677 at 1/22/19 4:22 PM:
-

I have done an initial analysis of this bug:

scala> Seq("A", "B", null).toDS.repartition(1).write.parquet("t3")

scala> spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show

+-----+
|value|
+-----+
|    B|
| null|
+-----+

The issue only happens if the column contains duplicate row values. I will dig 
deeper into the code.


was (Author: anandchinn):
I have done the analysis for this bug. Below are the initial analysis 

scala> Seq("A", "B", null).toDS.repartition(1).write.parquet("t3");

scala> spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show;

+-----+
|value|
+-----+
|    B|
| null|
+-----+

When the issue happens only if the columns has Duplicate row data. We do the 
research to understand the code. 

 

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Critical
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-22 Thread ANAND CHINNAKANNAN (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748879#comment-16748879
 ] 

ANAND CHINNAKANNAN commented on SPARK-26677:


I have done an initial analysis of this bug:

scala> Seq("A", "B", null).toDS.repartition(1).write.parquet("t3")

scala> spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show

+-----+
|value|
+-----+
|    B|
| null|
+-----+

The issue only happens if the column contains duplicate row values. I will dig 
deeper into the code.
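
Editor's note: since the wrong result only shows up when reading back from Parquet, one hedged diagnostic (an assumption about where to look, not a confirmed root cause) is to rerun the query with Parquet filter pushdown disabled and compare the output:

{code:java}
import org.apache.spark.sql.functions.{col, not}

// Diagnostic sketch (spark-shell): compare results with and without Parquet filter pushdown.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show()

spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show()
{code}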

 

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Critical
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16838) Add PMML export for ML KMeans in PySpark

2019-01-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-16838:
-

Assignee: Huaxin Gao

> Add PMML export for ML KMeans in PySpark
> 
>
> Key: SPARK-16838
> URL: https://issues.apache.org/jira/browse/SPARK-16838
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: Huaxin Gao
>Priority: Major
>
> After we finish SPARK-11237 we should also expose PMML export in the Python 
> API for KMeans.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26691) WholeStageCodegen after InMemoryTableScan task takes more time and time increases based on the input size

2019-01-22 Thread Vikash Kumar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikash Kumar updated SPARK-26691:
-
Attachment: WholeStageCodegen.PNG

> WholeStageCodegen after InMemoryTableScan task takes more time and time 
> increases based on the input size
> -
>
> Key: SPARK-26691
> URL: https://issues.apache.org/jira/browse/SPARK-26691
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Vikash Kumar
>Priority: Major
> Attachments: DataScale_LkpPolicy_FirstRow_SF1_JobDetail.png, 
> DataScale_LkpPolicy_FirstRow_SF50_JobDetail.png, WholeStageCodegen.PNG
>
>
> Scenario: I am doing a left outer join between a streaming dataframe and a 
> static dataframe and writing the result to a Kafka target. The static dataframe 
> is created from a Hive source and the streaming dataframe is created from a 
> Kafka source, and the two dataframes are joined on an equality condition. Here 
> is a sample program.
>  
> {code:java}
> package com.spark.exec;
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions.{ broadcast => infabroadcast }
> import java.io._
> import java.sql.Timestamp
> import scala.reflect.ClassTag
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.streaming._
> import org.apache.spark.sql.streaming.Trigger._
> import java.util.UUID.randomUUID
> import org.apache.spark.storage.StorageLevel
> object Spark0 {
> def main(s:Array[String]) {
> val sqlContext = SparkSession.builder().enableHiveSupport().getOrCreate()
> import sqlContext.implicits._
> import org.apache.spark.sql.functions.{stddev_samp, var_samp}
> val v1 = 
> sqlContext.readStream.format("kafka").option("kafka.bootstrap.servers", 
> "localhost:49092").option("subscribe", "source").load().toDF();
> val schema = StructType(List(StructField("id", IntegerType, true), 
> StructField("name", StringType, true)))
> val stream = v1.selectExpr("cast (value as string) as json")
> .select(from_json($"json", schema=schema) as "data")
> .select("data.*")
> val static = sqlContext.sql("SELECT hive_lookup.c0 as id, hive_lookup.c1 as 
> name FROM 
> default.hive_lookup").repartition(18).persist(StorageLevel.MEMORY_AND_DISK).toDF;
> val joinDF = stream.join(static, stream.col("id").equalTo(static.col("id")), 
> "left_outer")
> val result = joinDF.selectExpr("to_json(struct(*)) AS value")
> val UUID = randomUUID().toString
> val checkpoint = "/tmp/" + UUID
> result.writeStream.format("kafka").option("kafka.bootstrap.servers", 
> "localhost:49092")
> .option("topic", "target").options(Map(Tuple2("batch.size", "16384"), 
> Tuple2("metadata.fetch.timeout.ms", "1"), Tuple2("linger.ms", "1000")))
> .option("checkpointLocation", 
> checkpoint).trigger(Trigger.ProcessingTime(2L)).start()
> val activeStreams = sqlContext.streams.active
> activeStreams.foreach( stream => stream.awaitTermination())
> }
> }
> {code}
>  
> On the static dataframe I applied the repartition and persist functions.
> {code:java}
> val static = sqlContext.sql("SELECT hive_lookup.c0 as id, hive_lookup.c1 as 
> name FROM 
> default.hive_lookup").repartition(18).persist(StorageLevel.MEMORY_AND_DISK).toDF;{code}
> What I observed in the Spark UI is that the WholeStageCodegen stage after 
> InMemoryTableScan takes a significant amount of time in every batch, which 
> degrades performance. The time also increases with the size of the dataset in 
> the Hive source (the static dataframe), even though we have already persisted 
> the data after repartitioning. What is WholeStageCodegen doing here that takes 
> a significant amount of time depending on the Hive source dataset? Is this 
> happening by design?
> The expectation is that once we have repartitioned and persisted the dataframe 
> in memory or on disk, we should only need to read the data from memory and pass 
> it to the join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26691) WholeStageCodegen after InMemoryTableScan task takes more time and time increases based on the input size

2019-01-22 Thread Vikash Kumar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikash Kumar updated SPARK-26691:
-
Attachment: DataScale_LkpPolicy_FirstRow_SF50_JobDetail.png
DataScale_LkpPolicy_FirstRow_SF1_JobDetail.png

> WholeStageCodegen after InMemoryTableScan task takes more time and time 
> increases based on the input size
> -
>
> Key: SPARK-26691
> URL: https://issues.apache.org/jira/browse/SPARK-26691
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Vikash Kumar
>Priority: Major
> Attachments: DataScale_LkpPolicy_FirstRow_SF1_JobDetail.png, 
> DataScale_LkpPolicy_FirstRow_SF50_JobDetail.png
>
>
> Scenario: I am doing a left outer join between a streaming dataframe and a 
> static dataframe and writing the result to a Kafka target. The static dataframe 
> is created from a Hive source and the streaming dataframe is created from a 
> Kafka source, and the two dataframes are joined on an equality condition. Here 
> is a sample program.
>  
> {code:java}
> package com.spark.exec;
> import org.apache.spark._
> import org.apache.spark.rdd._
> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions.{ broadcast => infabroadcast }
> import java.io._
> import java.sql.Timestamp
> import scala.reflect.ClassTag
> import scala.collection.JavaConversions._
> import org.apache.spark.sql.streaming._
> import org.apache.spark.sql.streaming.Trigger._
> import java.util.UUID.randomUUID
> import org.apache.spark.storage.StorageLevel
> object Spark0 {
> def main(s:Array[String]) {
> val sqlContext = SparkSession.builder().enableHiveSupport().getOrCreate()
> import sqlContext.implicits._
> import org.apache.spark.sql.functions.{stddev_samp, var_samp}
> val v1 = 
> sqlContext.readStream.format("kafka").option("kafka.bootstrap.servers", 
> "localhost:49092").option("subscribe", "source").load().toDF();
> val schema = StructType(List(StructField("id", IntegerType, true), 
> StructField("name", StringType, true)))
> val stream = v1.selectExpr("cast (value as string) as json")
> .select(from_json($"json", schema=schema) as "data")
> .select("data.*")
> val static = sqlContext.sql("SELECT hive_lookup.c0 as id, hive_lookup.c1 as 
> name FROM 
> default.hive_lookup").repartition(18).persist(StorageLevel.MEMORY_AND_DISK).toDF;
> val joinDF = stream.join(static, stream.col("id").equalTo(static.col("id")), 
> "left_outer")
> val result = joinDF.selectExpr("to_json(struct(*)) AS value")
> val UUID = randomUUID().toString
> val checkpoint = "/tmp/" + UUID
> result.writeStream.format("kafka").option("kafka.bootstrap.servers", 
> "localhost:49092")
> .option("topic", "target").options(Map(Tuple2("batch.size", "16384"), 
> Tuple2("metadata.fetch.timeout.ms", "1"), Tuple2("linger.ms", "1000")))
> .option("checkpointLocation", 
> checkpoint).trigger(Trigger.ProcessingTime(2L)).start()
> val activeStreams = sqlContext.streams.active
> activeStreams.foreach( stream => stream.awaitTermination())
> }
> }
> {code}
>  
> On the static dataframe I applied the repartition and persist functions.
> {code:java}
> val static = sqlContext.sql("SELECT hive_lookup.c0 as id, hive_lookup.c1 as 
> name FROM 
> default.hive_lookup").repartition(18).persist(StorageLevel.MEMORY_AND_DISK).toDF;{code}
> What I observed in the Spark UI is that the WholeStageCodegen stage after 
> InMemoryTableScan takes a significant amount of time in every batch, which 
> degrades performance. The time also increases with the size of the dataset in 
> the Hive source (the static dataframe), even though we have already persisted 
> the data after repartitioning. What is WholeStageCodegen doing here that takes 
> a significant amount of time depending on the Hive source dataset? Is this 
> happening by design?
> The expectation is that once we have repartitioned and persisted the dataframe 
> in memory or on disk, we should only need to read the data from memory and pass 
> it to the join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26691) WholeStageCodegen after InMemoryTableScan task takes more time and time increases based on the input size

2019-01-22 Thread Vikash Kumar (JIRA)
Vikash Kumar created SPARK-26691:


 Summary: WholeStageCodegen after InMemoryTableScan task takes more 
time and time increases based on the input size
 Key: SPARK-26691
 URL: https://issues.apache.org/jira/browse/SPARK-26691
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.1
Reporter: Vikash Kumar


Scenario: I am doing a left outer join between a streaming dataframe and a static 
dataframe and writing the result to a Kafka target. The static dataframe is created 
from a Hive source and the streaming dataframe is created from a Kafka source, and 
the two dataframes are joined on an equality condition. Here is a sample program.

 
{code:java}
package com.spark.exec;

import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.storage.StorageLevel._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.{ broadcast => infabroadcast }
import java.io._
import java.sql.Timestamp
import scala.reflect.ClassTag
import scala.collection.JavaConversions._
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.Trigger._
import java.util.UUID.randomUUID
import org.apache.spark.storage.StorageLevel


object Spark0 {

def main(s:Array[String]) {
val sqlContext = SparkSession.builder().enableHiveSupport().getOrCreate()
import sqlContext.implicits._
import org.apache.spark.sql.functions.{stddev_samp, var_samp}

val v1 = 
sqlContext.readStream.format("kafka").option("kafka.bootstrap.servers", 
"localhost:49092").option("subscribe", "source").load().toDF();

val schema = StructType(List(StructField("id", IntegerType, true), 
StructField("name", StringType, true)))

val stream = v1.selectExpr("cast (value as string) as json")
.select(from_json($"json", schema=schema) as "data")
.select("data.*")

val static = sqlContext.sql("SELECT hive_lookup.c0 as id, hive_lookup.c1 as 
name FROM 
default.hive_lookup").repartition(18).persist(StorageLevel.MEMORY_AND_DISK).toDF;

val joinDF = stream.join(static, stream.col("id").equalTo(static.col("id")), 
"left_outer")
val result = joinDF.selectExpr("to_json(struct(*)) AS value")

val UUID = randomUUID().toString
val checkpoint = "/tmp/" + UUID
result.writeStream.format("kafka").option("kafka.bootstrap.servers", 
"localhost:49092")
.option("topic", "target").options(Map(Tuple2("batch.size", "16384"), 
Tuple2("metadata.fetch.timeout.ms", "1"), Tuple2("linger.ms", "1000")))
.option("checkpointLocation", 
checkpoint).trigger(Trigger.ProcessingTime(2L)).start()

val activeStreams = sqlContext.streams.active
activeStreams.foreach( stream => stream.awaitTermination())
}
}
{code}
 

On the static dataframe I applied the repartition and persist functions.
{code:java}
val static = sqlContext.sql("SELECT hive_lookup.c0 as id, hive_lookup.c1 as 
name FROM 
default.hive_lookup").repartition(18).persist(StorageLevel.MEMORY_AND_DISK).toDF;{code}
What I observed in the Spark UI is that the WholeStageCodegen stage after 
InMemoryTableScan takes a significant amount of time in every batch, which degrades 
performance. The time also increases with the size of the dataset in the Hive source 
(the static dataframe), even though we have already persisted the data after 
repartitioning. What is WholeStageCodegen doing here that takes a significant amount 
of time depending on the Hive source dataset? Is this happening by design?

The expectation is that once we have repartitioned and persisted the dataframe in 
memory or on disk, we should only need to read the data from memory and pass it to 
the join.
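
Editor's note: two things worth checking here (assumptions, not a confirmed diagnosis): whether the cached static side is actually materialized once before the streaming query starts, and whether a broadcast join removes the per-batch scan and codegen cost when the lookup side is small. A minimal sketch against the {{static}} and {{stream}} frames defined above:

{code:java}
import org.apache.spark.sql.functions.broadcast

// Force the persisted static dataframe to be materialized once, up front,
// instead of lazily inside the first micro-batch.
static.count()

// If the lookup side is small enough, a broadcast join avoids shuffling and
// re-scanning the cached relation in every micro-batch.
val joinDF = stream.join(broadcast(static),
  stream.col("id").equalTo(static.col("id")), "left_outer")
{code}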

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16838) Add PMML export for ML KMeans in PySpark

2019-01-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16838.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23592
[https://github.com/apache/spark/pull/23592]

> Add PMML export for ML KMeans in PySpark
> 
>
> Key: SPARK-16838
> URL: https://issues.apache.org/jira/browse/SPARK-16838
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> After we finish SPARK-11237 we should also expose PMML export in the Python 
> API for KMeans.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26187) Stream-stream left outer join returns outer nulls for already matched rows

2019-01-22 Thread Pavel Chernikov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748808#comment-16748808
 ] 

Pavel Chernikov commented on SPARK-26187:
-

[~sandeep.katta2007], initially I used Spark 2.3.2, but I could also reproduce 
it on master, as [~kabhwan] did.

> Stream-stream left outer join returns outer nulls for already matched rows
> --
>
> Key: SPARK-26187
> URL: https://issues.apache.org/jira/browse/SPARK-26187
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Pavel Chernikov
>Priority: Major
>
> This is basically the same issue as SPARK-26154, but with a more easily 
> reproducible and concrete example:
> {code:java}
> val rateStream = session.readStream
>  .format("rate")
>  .option("rowsPerSecond", 1)
>  .option("numPartitions", 1)
>  .load()
> import org.apache.spark.sql.functions._
> val fooStream = rateStream
>  .select(col("value").as("fooId"), col("timestamp").as("fooTime"))
> val barStream = rateStream
>  // Introduce misses for ease of debugging
>  .where(col("value") % 2 === 0)
>  .select(col("value").as("barId"), col("timestamp").as("barTime")){code}
> If barStream is configured to happen earlier than fooStream, based on the time 
> range condition, then everything is all right: no previously matched records 
> are flushed with outer NULLs:
> {code:java}
> val query = fooStream
>  .withWatermark("fooTime", "5 seconds")
>  .join(
>barStream.withWatermark("barTime", "5 seconds"),
>expr("""
>  barId = fooId AND
>  fooTime >= barTime AND
>  fooTime <= barTime + interval 5 seconds
> """),
>joinType = "leftOuter"
>  )
>  .writeStream
>  .format("console")
>  .option("truncate", false)
>  .start(){code}
> It's easy to observe that only odd rows are flushed with NULLs on the right:
> {code:java}
> [info] Batch: 1 
> [info] +-+---+-+---+ 
> [info] |fooId|fooTime                |barId|barTime                | 
> [info] +-+---+-+---+ 
> [info] |0    |2018-11-27 13:12:34.976|0    |2018-11-27 13:12:34.976| 
> [info] |6    |2018-11-27 13:12:40.976|6    |2018-11-27 13:12:40.976| 
> [info] |10   |2018-11-27 13:12:44.976|10   |2018-11-27 13:12:44.976| 
> [info] |8    |2018-11-27 13:12:42.976|8    |2018-11-27 13:12:42.976| 
> [info] |2    |2018-11-27 13:12:36.976|2    |2018-11-27 13:12:36.976| 
> [info] |4    |2018-11-27 13:12:38.976|4    |2018-11-27 13:12:38.976| 
> [info] +-+---+-+---+ 
> [info] Batch: 2 
> [info] +-+---+-+---+ 
> [info] |fooId|fooTime                |barId|barTime                | 
> [info] +-+---+-+---+ 
> [info] |1    |2018-11-27 13:12:35.976|null |null                   | 
> [info] |3    |2018-11-27 13:12:37.976|null |null                   | 
> [info] |12   |2018-11-27 13:12:46.976|12   |2018-11-27 13:12:46.976| 
> [info] |18   |2018-11-27 13:12:52.976|18   |2018-11-27 13:12:52.976| 
> [info] |14   |2018-11-27 13:12:48.976|14   |2018-11-27 13:12:48.976| 
> [info] |20   |2018-11-27 13:12:54.976|20   |2018-11-27 13:12:54.976| 
> [info] |16   |2018-11-27 13:12:50.976|16   |2018-11-27 13:12:50.976| 
> [info] +-+---+-+---+ 
> [info] Batch: 3 
> [info] +-+---+-+---+ 
> [info] |fooId|fooTime                |barId|barTime                | 
> [info] +-+---+-+---+ 
> [info] |26   |2018-11-27 13:13:00.976|26   |2018-11-27 13:13:00.976| 
> [info] |22   |2018-11-27 13:12:56.976|22   |2018-11-27 13:12:56.976| 
> [info] |7    |2018-11-27 13:12:41.976|null |null                   | 
> [info] |9    |2018-11-27 13:12:43.976|null |null                   | 
> [info] |28   |2018-11-27 13:13:02.976|28   |2018-11-27 13:13:02.976| 
> [info] |5    |2018-11-27 13:12:39.976|null |null                   | 
> [info] |11   |2018-11-27 13:12:45.976|null |null                   | 
> [info] |13   |2018-11-27 13:12:47.976|null |null                   | 
> [info] |24   |2018-11-27 13:12:58.976|24   |2018-11-27 13:12:58.976| 
> [info] +-+---+-+---+
> {code}
> On the other hand, if we switch the ordering so that fooStream happens 
> earlier based on the time range condition:
> {code:java}
> val query = fooStream
>  .withWatermark("fooTime", "5 seconds")
>  .join(
>barStream.withWatermark("barTime", "5 seconds"),
>expr("""
>  barId = fooId AND
>  barTime >= fooTime AND
>  barTime <= fooTime + interval 5 seconds
> """),
>

[jira] [Assigned] (SPARK-26680) StackOverflowError if Stream passed to groupBy

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26680:


Assignee: Apache Spark

> StackOverflowError if Stream passed to groupBy
> --
>
> Key: SPARK-26680
> URL: https://issues.apache.org/jira/browse/SPARK-26680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0, 3.0.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> This Java code results in a StackOverflowError:
> {code:java}
> List groupByCols = new ArrayList<>();
> groupByCols.add(new Column("id1"));
> scala.collection.Seq groupByColsSeq =
> JavaConverters.asScalaIteratorConverter(groupByCols.iterator())
> .asScala().toSeq();
> df.groupBy(groupByColsSeq).max("id2").toDF("id1", "id2").show();
> {code}
> The {{toSeq}} method above produces a Stream. Passing a Stream to groupBy 
> results in the StackOverflowError. In fact, the error can be produced more 
> easily in spark-shell:
> {noformat}
> scala> val df = spark.read.schema("id1 int, id2 int").csv("testinput.csv")
> df: org.apache.spark.sql.DataFrame = [id1: int, id2: int]
> scala> val groupBySeq = Stream(col("id1"))
> groupBySeq: scala.collection.immutable.Stream[org.apache.spark.sql.Column] = 
> Stream(id1, ?)
> scala> df.groupBy(groupBySeq: _*).max("id2").toDF("id1", "id2").collect
> java.lang.StackOverflowError
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1161)
>   at scala.collection.immutable.Stream.drop(Stream.scala:797)
>   at scala.collection.immutable.Stream.drop(Stream.scala:204)
>   at scala.collection.LinearSeqOptimized.apply(LinearSeqOptimized.scala:66)
>   at scala.collection.LinearSeqOptimized.apply$(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.Stream.apply(Stream.scala:204)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$consume$3(WholeStageCodegenExec.scala:159)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1171)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1161)
>   at scala.collection.immutable.Stream.drop(Stream.scala:797)
>   at scala.collection.immutable.Stream.drop(Stream.scala:204)
>   at scala.collection.LinearSeqOptimized.apply(LinearSeqOptimized.scala:66)
>   at scala.collection.LinearSeqOptimized.apply$(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.Stream.apply(Stream.scala:204)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$consume$3(WholeStageCodegenExec.scala:159)
> ...etc...
> {noformat}
> This is due to the lazy nature of Streams. The method {{consume}} in 
> {{CodegenSupport}} assumes that a map function will be eagerly evaluated:
> {code:java}
> val inputVars =
> ctx.currentVars = null <== the closure cares about this
> ctx.INPUT_ROW = row
> output.zipWithIndex.map { case (attr, i) =>
>   BoundReference(i, attr.dataType, attr.nullable).genCode(ctx)
> -
> -
> -
> ctx.currentVars = inputVars
> ctx.INPUT_ROW = null
> ctx.freshNamePrefix = parent.variablePrefix
> val evaluated = evaluateRequiredVariables(output, inputVars, 
> parent.usedInputs)
> {code}
> The closure passed to the map function assumes {{ctx.currentVars}} will be 
> set to null. But due to lazy evaluation, {{ctx.currentVars}} is set to 
> something else by the time the closure is actually called. Worse yet, 
> {{ctx.currentVars}} is set to the yet-to-be evaluated inputVars stream. The 
> closure uses {{ctx.currentVars}} (via the call {{genCode(ctx)}}), therefore 
> it ends up using the data structure it is attempting to create.
> You can recreate the problem in a vanilla Scala shell:
> {code:java}
> scala> var p1: Seq[Any] = null
> p1: Seq[Any] = null
> scala> val s = Stream(1, 2).zipWithIndex.map { case (x, i) => if (p1 != null) 
> p1(i) else x }
> s: scala.collection.immutable.Stream[Any] = Stream(1, ?)
> scala> p1 = s
> p1: Seq[Any] = Stream(1, ?)
> scala> s.foreach(println)
> 1
> java.lang.StackOverflowError
>   at 
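
Editor's note on SPARK-26680: a self-contained sketch of the hazard described above and of one way around it (plain Scala, no Spark required, and not the actual patch): force the lazily mapped collection before the captured variable is reassigned.

{code:java}
var p1: Seq[Any] = null

// Lazy: elements are computed on demand, so if p1 later points back at this
// Stream, evaluating the tail re-enters the Stream and can blow the stack.
val lazySeq = Stream(1, 2).zipWithIndex.map { case (x, i) => if (p1 != null) p1(i) else x }

// Forced: materializing eagerly runs the closure while p1 is still null, so
// reassigning p1 afterwards cannot change (or recurse into) the result.
val forcedSeq = Stream(1, 2).zipWithIndex
  .map { case (x, i) => if (p1 != null) p1(i) else x }
  .toIndexedSeq

p1 = forcedSeq
println(forcedSeq)   // Vector(1, 2), no StackOverflowError
{code}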

[jira] [Assigned] (SPARK-26680) StackOverflowError if Stream passed to groupBy

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26680:


Assignee: (was: Apache Spark)

> StackOverflowError if Stream passed to groupBy
> --
>
> Key: SPARK-26680
> URL: https://issues.apache.org/jira/browse/SPARK-26680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0, 3.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This Java code results in a StackOverflowError:
> {code:java}
> List groupByCols = new ArrayList<>();
> groupByCols.add(new Column("id1"));
> scala.collection.Seq groupByColsSeq =
> JavaConverters.asScalaIteratorConverter(groupByCols.iterator())
> .asScala().toSeq();
> df.groupBy(groupByColsSeq).max("id2").toDF("id1", "id2").show();
> {code}
> The {{toSeq}} method above produces a Stream. Passing a Stream to groupBy 
> results in the StackOverflowError. In fact, the error can be produced more 
> easily in spark-shell:
> {noformat}
> scala> val df = spark.read.schema("id1 int, id2 int").csv("testinput.csv")
> df: org.apache.spark.sql.DataFrame = [id1: int, id2: int]
> scala> val groupBySeq = Stream(col("id1"))
> groupBySeq: scala.collection.immutable.Stream[org.apache.spark.sql.Column] = 
> Stream(id1, ?)
> scala> df.groupBy(groupBySeq: _*).max("id2").toDF("id1", "id2").collect
> java.lang.StackOverflowError
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1161)
>   at scala.collection.immutable.Stream.drop(Stream.scala:797)
>   at scala.collection.immutable.Stream.drop(Stream.scala:204)
>   at scala.collection.LinearSeqOptimized.apply(LinearSeqOptimized.scala:66)
>   at scala.collection.LinearSeqOptimized.apply$(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.Stream.apply(Stream.scala:204)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$consume$3(WholeStageCodegenExec.scala:159)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1171)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1161)
>   at scala.collection.immutable.Stream.drop(Stream.scala:797)
>   at scala.collection.immutable.Stream.drop(Stream.scala:204)
>   at scala.collection.LinearSeqOptimized.apply(LinearSeqOptimized.scala:66)
>   at scala.collection.LinearSeqOptimized.apply$(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.Stream.apply(Stream.scala:204)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$consume$3(WholeStageCodegenExec.scala:159)
> ...etc...
> {noformat}
> This is due to the lazy nature of Streams. The method {{consume}} in 
> {{CodegenSupport}} assumes that a map function will be eagerly evaluated:
> {code:java}
> val inputVars =
> ctx.currentVars = null <== the closure cares about this
> ctx.INPUT_ROW = row
> output.zipWithIndex.map { case (attr, i) =>
>   BoundReference(i, attr.dataType, attr.nullable).genCode(ctx)
> -
> -
> -
> ctx.currentVars = inputVars
> ctx.INPUT_ROW = null
> ctx.freshNamePrefix = parent.variablePrefix
> val evaluated = evaluateRequiredVariables(output, inputVars, 
> parent.usedInputs)
> {code}
> The closure passed to the map function assumes {{ctx.currentVars}} will be 
> set to null. But due to lazy evaluation, {{ctx.currentVars}} is set to 
> something else by the time the closure is actually called. Worse yet, 
> {{ctx.currentVars}} is set to the yet-to-be evaluated inputVars stream. The 
> closure uses {{ctx.currentVars}} (via the call {{genCode(ctx)}}), therefore 
> it ends up using the data structure it is attempting to create.
> You can recreate the problem in a vanilla Scala shell:
> {code:java}
> scala> var p1: Seq[Any] = null
> p1: Seq[Any] = null
> scala> val s = Stream(1, 2).zipWithIndex.map { case (x, i) => if (p1 != null) 
> p1(i) else x }
> s: scala.collection.immutable.Stream[Any] = Stream(1, ?)
> scala> p1 = s
> p1: Seq[Any] = Stream(1, ?)
> scala> s.foreach(println)
> 1
> java.lang.StackOverflowError
>   at 

[jira] [Commented] (SPARK-26680) StackOverflowError if Stream passed to groupBy

2019-01-22 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748811#comment-16748811
 ] 

Apache Spark commented on SPARK-26680:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/23617

> StackOverflowError if Stream passed to groupBy
> --
>
> Key: SPARK-26680
> URL: https://issues.apache.org/jira/browse/SPARK-26680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0, 3.0.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> This Java code results in a StackOverflowError:
> {code:java}
> List groupByCols = new ArrayList<>();
> groupByCols.add(new Column("id1"));
> scala.collection.Seq groupByColsSeq =
> JavaConverters.asScalaIteratorConverter(groupByCols.iterator())
> .asScala().toSeq();
> df.groupBy(groupByColsSeq).max("id2").toDF("id1", "id2").show();
> {code}
> The {{toSeq}} method above produces a Stream. Passing a Stream to groupBy 
> results in the StackOverflowError. In fact, the error can be produced more 
> easily in spark-shell:
> {noformat}
> scala> val df = spark.read.schema("id1 int, id2 int").csv("testinput.csv")
> df: org.apache.spark.sql.DataFrame = [id1: int, id2: int]
> scala> val groupBySeq = Stream(col("id1"))
> groupBySeq: scala.collection.immutable.Stream[org.apache.spark.sql.Column] = 
> Stream(id1, ?)
> scala> df.groupBy(groupBySeq: _*).max("id2").toDF("id1", "id2").collect
> java.lang.StackOverflowError
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1161)
>   at scala.collection.immutable.Stream.drop(Stream.scala:797)
>   at scala.collection.immutable.Stream.drop(Stream.scala:204)
>   at scala.collection.LinearSeqOptimized.apply(LinearSeqOptimized.scala:66)
>   at scala.collection.LinearSeqOptimized.apply$(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.Stream.apply(Stream.scala:204)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$consume$3(WholeStageCodegenExec.scala:159)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1171)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1161)
>   at scala.collection.immutable.Stream.drop(Stream.scala:797)
>   at scala.collection.immutable.Stream.drop(Stream.scala:204)
>   at scala.collection.LinearSeqOptimized.apply(LinearSeqOptimized.scala:66)
>   at scala.collection.LinearSeqOptimized.apply$(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.Stream.apply(Stream.scala:204)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$consume$3(WholeStageCodegenExec.scala:159)
> ...etc...
> {noformat}
> This is due to the lazy nature of Streams. The method {{consume}} in 
> {{CodegenSupport}} assumes that a map function will be eagerly evaluated:
> {code:java}
> val inputVars =
> ctx.currentVars = null <== the closure cares about this
> ctx.INPUT_ROW = row
> output.zipWithIndex.map { case (attr, i) =>
>   BoundReference(i, attr.dataType, attr.nullable).genCode(ctx)
> -
> -
> -
> ctx.currentVars = inputVars
> ctx.INPUT_ROW = null
> ctx.freshNamePrefix = parent.variablePrefix
> val evaluated = evaluateRequiredVariables(output, inputVars, 
> parent.usedInputs)
> {code}
> The closure passed to the map function assumes {{ctx.currentVars}} will be 
> set to null. But due to lazy evaluation, {{ctx.currentVars}} is set to 
> something else by the time the closure is actually called. Worse yet, 
> {{ctx.currentVars}} is set to the yet-to-be evaluated inputVars stream. The 
> closure uses {{ctx.currentVars}} (via the call {{genCode(ctx)}}), therefore 
> it ends up using the data structure it is attempting to create.
> You can recreate the problem in a vanilla Scala shell:
> {code:java}
> scala> var p1: Seq[Any] = null
> p1: Seq[Any] = null
> scala> val s = Stream(1, 2).zipWithIndex.map { case (x, i) => if (p1 != null) 
> p1(i) else x }
> s: scala.collection.immutable.Stream[Any] = Stream(1, ?)
> 
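A minimal user-side workaround, assuming the Spark 2.4.x behaviour described above, is to materialize the lazy Stream into a strict collection before handing it to groupBy. The file name and schema below are just the ones from the report; a sketch, not the eventual fix:

{code:scala}
// Workaround sketch (assumes the spark-shell session from the report above):
// force the Stream into a strict List so whole-stage codegen never sees a
// lazily evaluated column sequence.
import org.apache.spark.sql.functions.col

val df = spark.read.schema("id1 int, id2 int").csv("testinput.csv")
val groupBySeq = Stream(col("id1"))

// .toList materializes the Stream eagerly; the query then runs without the
// StackOverflowError shown above.
df.groupBy(groupBySeq.toList: _*).max("id2").toDF("id1", "id2").show()
{code}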

[jira] [Commented] (SPARK-26668) One Kafka broker server is down, Spark Streaming starts consuming with a delay

2019-01-22 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748809#comment-16748809
 ] 

Gabor Somogyi commented on SPARK-26668:
---

[~quanyou.chang] Is it DStreams or Structured Streaming?
Anyway, without steps to reproduce/ or driver/executor logs hard to make any 
judgement.

> One Kafka broker server is down, Spark Streaming starts consuming with a delay
> 
>
> Key: SPARK-26668
> URL: https://issues.apache.org/jira/browse/SPARK-26668
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Chang Quanyou
>Priority: Major
> Attachments: shutdown.png
>
>
> Description: The Kafka cluster has 5 brokers and one server is down (the machine 
> was shut down due to a disk problem) and cannot be rebooted. We found that Spark 
> Streaming is delayed: the batch processing time grew from 3 s to 1 min and the 
> batch size shrank. When the server came back up, the batches returned to normal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26668) One Kafka broker server is down, Spark Streaming starts consuming with a delay

2019-01-22 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748809#comment-16748809
 ] 

Gabor Somogyi edited comment on SPARK-26668 at 1/22/19 3:03 PM:


[~quanyou.chang] Is it DStreams or Structured Streaming?
Anyway, without steps to reproduce or driver/executor logs it is hard to make any 
judgement.


was (Author: gsomogyi):
[~quanyou.chang] Is it DStreams or Structured Streaming?
Anyway, without steps to reproduce/ or driver/executor logs hard to make any 
judgement.

> One Kafka broker server is down, Spark Streaming starts consuming with a delay
> 
>
> Key: SPARK-26668
> URL: https://issues.apache.org/jira/browse/SPARK-26668
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Chang Quanyou
>Priority: Major
> Attachments: shutdown.png
>
>
> Description: The Kafka cluster has 5 brokers and one server is down (the machine 
> was shut down due to a disk problem) and cannot be rebooted. We found that Spark 
> Streaming is delayed: the batch processing time grew from 3 s to 1 min and the 
> batch size shrank. When the server came back up, the batches returned to normal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26649) Noop Streaming Sink using DSV2

2019-01-22 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-26649:
--
Issue Type: New Feature  (was: Bug)

> Noop Streaming Sink using DSV2
> --
>
> Key: SPARK-26649
> URL: https://issues.apache.org/jira/browse/SPARK-26649
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> Extend this noop data source to support a streaming sink that ignores all the 
> elements.  
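For context, a usage sketch of what such a sink could look like from the user side. The {{format("noop")}} name mirrors the existing batch noop writer and is an assumption about the final streaming API, not a confirmed signature; the checkpoint path is a placeholder.

{code:scala}
// Hedged sketch: stream from the built-in "rate" test source and discard every
// row via the proposed noop streaming sink (useful for benchmarking the source
// and the engine without any sink-side work).
import org.apache.spark.sql.streaming.Trigger

val query = spark.readStream
  .format("rate")                                   // built-in test source
  .load()
  .writeStream
  .format("noop")                                   // proposed sink: ignore all elements
  .option("checkpointLocation", "/tmp/noop-chk")    // placeholder path
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

query.awaitTermination(30000)                       // run for ~30 s, then return
{code}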



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2019-01-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24938:
--
Priority: Minor  (was: Major)

> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Nihar Sheth
>Priority: Minor
>  Labels: memory-analysis
> Fix For: 3.0.0
>
>
> While adding some instrumentation (using SPARK-24918 and 
> https://github.com/squito/spark-memory), we observed that netty uses a large 
> amount of onheap memory in its pools, in addition to the expected offheap 
> memory. We should figure out why it's using that memory, and whether it's 
> really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by 
> 16 MB, which could lead to a 128 MB spike of an almost entirely unused pool. 
> Switching to requesting a buffer from the default pool would probably fix 
> this.
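For illustration only, a hedged sketch of the "request the header buffer outside the pooled arenas" idea mentioned above. This is not the actual Spark patch, and the header layout is simplified:

{code:scala}
// Sketch: a tiny, short-lived header buffer taken from Unpooled instead of the
// channel's pooled allocator, so a burst of small messages does not force each
// pooled arena to grow by a 16 MB chunk.
import io.netty.buffer.{ByteBuf, Unpooled}

def encodeHeader(frameLength: Long, msgTypeId: Byte): ByteBuf = {
  val headerLength = 8 + 1                    // simplified: frame length + type tag
  val header = Unpooled.buffer(headerLength)  // heap buffer, no arena involved
  header.writeLong(frameLength)
  header.writeByte(msgTypeId)
  header
}
{code}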



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2019-01-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24938.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22114
[https://github.com/apache/spark/pull/22114]

> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Nihar Sheth
>Priority: Major
>  Labels: memory-analysis
> Fix For: 3.0.0
>
>
> While adding some instrumentation (using SPARK-24918 and 
> https://github.com/squito/spark-memory), we observed that netty uses a large 
> amount of onheap memory in its pools, in addition to the expected offheap 
> memory. We should figure out why it's using that memory, and whether it's 
> really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by 
> 16 MB, which could lead to a 128 MB spike of an almost entirely unused pool. 
> Switching to requesting a buffer from the default pool would probably fix 
> this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools

2019-01-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-24938:
-

Assignee: Nihar Sheth

> Understand usage of netty's onheap memory use, even with offheap pools
> --
>
> Key: SPARK-24938
> URL: https://issues.apache.org/jira/browse/SPARK-24938
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Nihar Sheth
>Priority: Major
>  Labels: memory-analysis
>
> While adding some instrumentation (using SPARK-24918 and 
> https://github.com/squito/spark-memory), we observed that netty uses a large 
> amount of onheap memory in its pools, in addition to the expected offheap 
> memory. We should figure out why it's using that memory, and whether it's 
> really necessary.
> It might be just this one line:
> https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82
> which means that even with a small burst of messages, each arena will grow by 
> 16 MB, which could lead to a 128 MB spike of an almost entirely unused pool. 
> Switching to requesting a buffer from the default pool would probably fix 
> this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26688:


Assignee: Apache Spark

> Provide configuration of initially blacklisted YARN nodes
> -
>
> Key: SPARK-26688
> URL: https://issues.apache.org/jira/browse/SPARK-26688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Apache Spark
>Priority: Major
>
> Introducing a new config for initially blacklisted YARN nodes.
> This came up in the apache spark user mailing list: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]
>  
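A usage sketch of how such a setting would typically be supplied. The configuration key below is a placeholder, since the final name is not stated in this ticket:

{code:scala}
// Hypothetical usage (placeholder key, not the real one introduced by the PR):
// tell the YARN allocator up front which nodes to avoid when requesting
// executors.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("initially-blacklisted-nodes-demo")
  .config("spark.yarn.initiallyBlacklistedNodes", "badhost1,badhost2") // placeholder key
  .getOrCreate()
{code}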



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-01-22 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26688:


Assignee: (was: Apache Spark)

> Provide configuration of initially blacklisted YARN nodes
> -
>
> Key: SPARK-26688
> URL: https://issues.apache.org/jira/browse/SPARK-26688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Introducing a new config for initially blacklisted YARN nodes.
> This came up in the apache spark user mailing list: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26680) StackOverflowError if Stream passed to groupBy

2019-01-22 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-26680:
--
Affects Version/s: 2.3.2

> StackOverflowError if Stream passed to groupBy
> --
>
> Key: SPARK-26680
> URL: https://issues.apache.org/jira/browse/SPARK-26680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0, 3.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This Java code results in a StackOverflowError:
> {code:java}
> List<Column> groupByCols = new ArrayList<>();
> groupByCols.add(new Column("id1"));
> scala.collection.Seq<Column> groupByColsSeq =
> JavaConverters.asScalaIteratorConverter(groupByCols.iterator())
> .asScala().toSeq();
> df.groupBy(groupByColsSeq).max("id2").toDF("id1", "id2").show();
> {code}
> The {{toSeq}} method above produces a Stream. Passing a Stream to groupBy 
> results in the StackOverflowError. In fact, the error can be produced more 
> easily in spark-shell:
> {noformat}
> scala> val df = spark.read.schema("id1 int, id2 int").csv("testinput.csv")
> df: org.apache.spark.sql.DataFrame = [id1: int, id2: int]
> scala> val groupBySeq = Stream(col("id1"))
> groupBySeq: scala.collection.immutable.Stream[org.apache.spark.sql.Column] = 
> Stream(id1, ?)
> scala> df.groupBy(groupBySeq: _*).max("id2").toDF("id1", "id2").collect
> java.lang.StackOverflowError
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1161)
>   at scala.collection.immutable.Stream.drop(Stream.scala:797)
>   at scala.collection.immutable.Stream.drop(Stream.scala:204)
>   at scala.collection.LinearSeqOptimized.apply(LinearSeqOptimized.scala:66)
>   at scala.collection.LinearSeqOptimized.apply$(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.Stream.apply(Stream.scala:204)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$consume$3(WholeStageCodegenExec.scala:159)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1171)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1161)
>   at scala.collection.immutable.Stream.drop(Stream.scala:797)
>   at scala.collection.immutable.Stream.drop(Stream.scala:204)
>   at scala.collection.LinearSeqOptimized.apply(LinearSeqOptimized.scala:66)
>   at scala.collection.LinearSeqOptimized.apply$(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.Stream.apply(Stream.scala:204)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$consume$3(WholeStageCodegenExec.scala:159)
> ...etc...
> {noformat}
> This is due to the lazy nature of Streams. The method {{consume}} in 
> {{CodegenSupport}} assumes that a map function will be eagerly evaluated:
> {code:java}
> val inputVars =
> ctx.currentVars = null <== the closure cares about this
> ctx.INPUT_ROW = row
> output.zipWithIndex.map { case (attr, i) =>
>   BoundReference(i, attr.dataType, attr.nullable).genCode(ctx)
> -
> -
> -
> ctx.currentVars = inputVars
> ctx.INPUT_ROW = null
> ctx.freshNamePrefix = parent.variablePrefix
> val evaluated = evaluateRequiredVariables(output, inputVars, 
> parent.usedInputs)
> {code}
> The closure passed to the map function assumes {{ctx.currentVars}} will be 
> set to null. But due to lazy evaluation, {{ctx.currentVars}} is set to 
> something else by the time the closure is actually called. Worse yet, 
> {{ctx.currentVars}} is set to the yet-to-be evaluated inputVars stream. The 
> closure uses {{ctx.currentVars}} (via the call {{genCode(ctx)}}), therefore 
> it ends up using the data structure it is attempting to create.
> You can recreate the problem in a vanilla Scala shell:
> {code:java}
> scala> var p1: Seq[Any] = null
> p1: Seq[Any] = null
> scala> val s = Stream(1, 2).zipWithIndex.map { case (x, i) => if (p1 != null) 
> p1(i) else x }
> s: scala.collection.immutable.Stream[Any] = Stream(1, ?)
> scala> p1 = s
> p1: Seq[Any] = Stream(1, ?)
> scala> s.foreach(println)
> 1
> java.lang.StackOverflowError
>   at 
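To make the lazy-evaluation point concrete, here is a minimal sketch of the fix pattern in plain Scala (no Spark involved): forcing the mapped collection while the mutable variable is still null breaks the cycle. This is a sketch of the pattern, not the patch that was merged.

{code:scala}
// Same shape as the vanilla-shell repro above, but the map result is forced
// with toIndexedSeq before p1 is reassigned, so the closure runs while p1 is
// still null and no self-referential Stream is created.
var p1: Seq[Any] = null
val s = Stream(1, 2).zipWithIndex.map { case (x, i) =>
  if (p1 != null) p1(i) else x
}.toIndexedSeq            // eager evaluation is the key difference

p1 = s
s.foreach(println)        // prints 1 and 2; no StackOverflowError
{code}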

[jira] [Updated] (SPARK-26680) StackOverflowError if Stream passed to groupBy

2019-01-22 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-26680:
--
Affects Version/s: 2.4.0

> StackOverflowError if Stream passed to groupBy
> --
>
> Key: SPARK-26680
> URL: https://issues.apache.org/jira/browse/SPARK-26680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> This Java code results in a StackOverflowError:
> {code:java}
> List<Column> groupByCols = new ArrayList<>();
> groupByCols.add(new Column("id1"));
> scala.collection.Seq<Column> groupByColsSeq =
> JavaConverters.asScalaIteratorConverter(groupByCols.iterator())
> .asScala().toSeq();
> df.groupBy(groupByColsSeq).max("id2").toDF("id1", "id2").show();
> {code}
> The {{toSeq}} method above produces a Stream. Passing a Stream to groupBy 
> results in the StackOverflowError. In fact, the error can be produced more 
> easily in spark-shell:
> {noformat}
> scala> val df = spark.read.schema("id1 int, id2 int").csv("testinput.csv")
> df: org.apache.spark.sql.DataFrame = [id1: int, id2: int]
> scala> val groupBySeq = Stream(col("id1"))
> groupBySeq: scala.collection.immutable.Stream[org.apache.spark.sql.Column] = 
> Stream(id1, ?)
> scala> df.groupBy(groupBySeq: _*).max("id2").toDF("id1", "id2").collect
> java.lang.StackOverflowError
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1161)
>   at scala.collection.immutable.Stream.drop(Stream.scala:797)
>   at scala.collection.immutable.Stream.drop(Stream.scala:204)
>   at scala.collection.LinearSeqOptimized.apply(LinearSeqOptimized.scala:66)
>   at scala.collection.LinearSeqOptimized.apply$(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.Stream.apply(Stream.scala:204)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$consume$3(WholeStageCodegenExec.scala:159)
>   at scala.collection.immutable.Stream.$anonfun$map$1(Stream.scala:418)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1171)
>   at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1161)
>   at scala.collection.immutable.Stream.drop(Stream.scala:797)
>   at scala.collection.immutable.Stream.drop(Stream.scala:204)
>   at scala.collection.LinearSeqOptimized.apply(LinearSeqOptimized.scala:66)
>   at scala.collection.LinearSeqOptimized.apply$(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.Stream.apply(Stream.scala:204)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport.$anonfun$consume$3(WholeStageCodegenExec.scala:159)
> ...etc...
> {noformat}
> This is due to the lazy nature of Streams. The method {{consume}} in 
> {{CodegenSupport}} assumes that a map function will be eagerly evaluated:
> {code:java}
> val inputVars =
> ctx.currentVars = null <== the closure cares about this
> ctx.INPUT_ROW = row
> output.zipWithIndex.map { case (attr, i) =>
>   BoundReference(i, attr.dataType, attr.nullable).genCode(ctx)
> -
> -
> -
> ctx.currentVars = inputVars
> ctx.INPUT_ROW = null
> ctx.freshNamePrefix = parent.variablePrefix
> val evaluated = evaluateRequiredVariables(output, inputVars, 
> parent.usedInputs)
> {code}
> The closure passed to the map function assumes {{ctx.currentVars}} will be 
> set to null. But due to lazy evaluation, {{ctx.currentVars}} is set to 
> something else by the time the closure is actually called. Worse yet, 
> {{ctx.currentVars}} is set to the yet-to-be evaluated inputVars stream. The 
> closure uses {{ctx.currentVars}} (via the call {{genCode(ctx)}}), therefore 
> it ends up using the data structure it is attempting to create.
> You can recreate the problem in a vanilla Scala shell:
> {code:java}
> scala> var p1: Seq[Any] = null
> p1: Seq[Any] = null
> scala> val s = Stream(1, 2).zipWithIndex.map { case (x, i) => if (p1 != null) 
> p1(i) else x }
> s: scala.collection.immutable.Stream[Any] = Stream(1, ?)
> scala> p1 = s
> p1: Seq[Any] = Stream(1, ?)
> scala> s.foreach(println)
> 1
> java.lang.StackOverflowError
>   at 

[jira] [Resolved] (SPARK-26463) Use ConfigEntry for hardcoded configs for scheduler categories.

2019-01-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26463.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23416
[https://github.com/apache/spark/pull/23416]

> Use ConfigEntry for hardcoded configs for scheduler categories.
> ---
>
> Key: SPARK-26463
> URL: https://issues.apache.org/jira/browse/SPARK-26463
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 3.0.0
>
>
> Make the following hardcoded configs use {{ConfigEntry}}.
> {code}
> spark.dynamicAllocation
> spark.scheduler
> spark.rpc
> spark.task
> spark.speculation
> spark.cleaner
> {code}
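For readers unfamiliar with the pattern, a hedged sketch of what a {{ConfigEntry}} definition looks like. The entry below is illustrative and not necessarily one of the entries migrated by this ticket; {{ConfigBuilder}} is Spark-internal, so this only compiles inside the org.apache.spark source tree.

{code:scala}
// Sketch of the ConfigEntry style used in org.apache.spark.internal.config:
// the key, type and default live in one typed constant instead of scattered
// hardcoded strings.
package org.apache.spark.internal.config

object SchedulerConfSketch {
  // illustrative entry; spark.scheduler.mode really does default to FIFO
  val SCHEDULER_MODE = ConfigBuilder("spark.scheduler.mode")
    .stringConf
    .createWithDefault("FIFO")
}
{code}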



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26690) Checkpoints of Dataframes are not visible in the SQL UI

2019-01-22 Thread Tom van Bussel (JIRA)
Tom van Bussel created SPARK-26690:
--

 Summary: Checkpoints of Dataframes are not visible in the SQL UI
 Key: SPARK-26690
 URL: https://issues.apache.org/jira/browse/SPARK-26690
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Tom van Bussel


Checkpoints and local checkpoints of dataframes do not show up in the SQL UI.
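A minimal repro sketch, assuming a plain spark-shell session; the checkpoint directory is a placeholder path:

{code:scala}
// Both variants run Spark jobs, but per this report neither shows up as a
// query on the SQL tab of the web UI.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // placeholder dir

val df = spark.range(0, 1000).toDF("id")
val checkpointed = df.checkpoint()       // eager, reliable checkpoint
val local        = df.localCheckpoint()  // eager, executor-local checkpoint
{code}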



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26463) Use ConfigEntry for hardcoded configs for scheduler categories.

2019-01-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26463:
-

Assignee: Kazuaki Ishizaki

> Use ConfigEntry for hardcoded configs for scheduler categories.
> ---
>
> Key: SPARK-26463
> URL: https://issues.apache.org/jira/browse/SPARK-26463
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Kazuaki Ishizaki
>Priority: Major
>
> Make the following hardcoded configs use {{ConfigEntry}}.
> {code}
> spark.dynamicAllocation
> spark.scheduler
> spark.rpc
> spark.task
> spark.speculation
> spark.cleaner
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26616) Expose document frequency in IDFModel

2019-01-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26616.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23549
[https://github.com/apache/spark/pull/23549]

> Expose document frequency in IDFModel
> -
>
> Key: SPARK-26616
> URL: https://issues.apache.org/jira/browse/SPARK-26616
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Jatin Puri
>Assignee: Jatin Puri
>Priority: Minor
> Fix For: 3.0.0
>
>
> As part of `org.apache.spark.ml.feature.IDFModel`, the following can be 
> exposed:
>  
> 1. Document frequency vector
> 2. Number of documents
>  
> The above are already computed when calculating the IDF vector. They simply 
> need to be exposed as `public val`s.
>  
> This avoids a re-implementation for someone who needs to compute the document 
> frequency of terms.
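A short usage sketch: fit an IDF model on a toy dataset and read back the values this ticket proposes to expose. The accessor names in the comments ({{docFreq}}, {{numDocs}}) are an assumption about the final API and should be checked against the 3.0 scaladoc.

{code:scala}
// Fit an IDFModel; after this change the fitted model would also expose the raw
// document-frequency vector and the document count, so callers no longer have
// to recompute them.
import org.apache.spark.ml.feature.IDF
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  Tuple1(Vectors.dense(1.0, 0.0, 3.0)),
  Tuple1(Vectors.dense(0.0, 1.0, 4.0))
)
val df = spark.createDataFrame(data).toDF("features")

val model = new IDF().setInputCol("features").setOutputCol("tfidf").fit(df)
println(model.idf)      // the IDF vector was already public before this change
// Tentative accessors added by this ticket (names assumed, verify against 3.0):
// model.docFreq  -> per-term document frequencies
// model.numDocs  -> number of documents the model was fit on
{code}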



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


