[jira] [Assigned] (SPARK-25592) Bump master branch version to 3.0.0-SNAPSHOT

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25592:


Assignee: Apache Spark  (was: Xiao Li)

> Bump master branch version to 3.0.0-SNAPSHOT
> 
>
> Key: SPARK-25592
> URL: https://issues.apache.org/jira/browse/SPARK-25592
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> This patch bumps the master branch version to `3.0.0-SNAPSHOT`.






[jira] [Commented] (SPARK-25592) Bump master branch version to 3.0.0-SNAPSHOT

2018-10-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16635058#comment-16635058
 ] 

Apache Spark commented on SPARK-25592:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22606

> Bump master branch version to 3.0.0-SNAPSHOT
> 
>
> Key: SPARK-25592
> URL: https://issues.apache.org/jira/browse/SPARK-25592
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> This patch bumps the master branch version to `3.0.0-SNAPSHOT`.






[jira] [Assigned] (SPARK-25592) Bump master branch version to 3.0.0-SNAPSHOT

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25592:


Assignee: Xiao Li  (was: Apache Spark)

> Bump master branch version to 3.0.0-SNAPSHOT
> 
>
> Key: SPARK-25592
> URL: https://issues.apache.org/jira/browse/SPARK-25592
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> This patch bumps the master branch version to `3.0.0-SNAPSHOT`.






[jira] [Created] (SPARK-25592) Bump master branch version to 3.0.0-SNAPSHOT

2018-10-01 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25592:
---

 Summary: Bump master branch version to 3.0.0-SNAPSHOT
 Key: SPARK-25592
 URL: https://issues.apache.org/jira/browse/SPARK-25592
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Xiao Li
Assignee: Xiao Li


This patch bumps the master branch version to `3.0.0-SNAPSHOT`.








[jira] [Comment Edited] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

2018-10-01 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16635048#comment-16635048
 ] 

Liang-Chi Hsieh edited comment on SPARK-25461 at 10/2/18 5:27 AM:
--

I've looked more at this. We don't really check whether the pandas.Series's type matches 
the pre-defined return type, and for this case the conversion is not correct.

I was trying to add a check and throw an exception when a mismatch is detected, but it 
looks like we rely on this behavior in the current codebase. For example, there is a 
test {{test_vectorized_udf_null_short}}:
{code}
data = [(None,), (2,), (3,), (4,)]
schema = StructType().add("short", ShortType())
df = self.spark.createDataFrame(data, schema)
short_f = pandas_udf(lambda x: x, ShortType())
res = df.select(short_f(col('short')))
self.assertEquals(df.collect(), res.collect())
{code}
The pandas.Series is of float64, but we define the return type as ShortType, and in this 
case it works well. So disallowing such a conversion does not seem feasible. For now, 
I think we can print a warning message if such a mismatch is detected.

cc [~hyukjin.kwon] What do you think about this idea?


was (Author: viirya):
I've looked more at this. We don't really check whether the pandas.Series's type matches 
the pre-defined return type, and for this case the conversion is not correct and is 
silently ignored.

I was trying to add a check and throw an exception when a mismatch is detected, but it 
looks like we rely on this behavior in the current codebase. For example, there is a 
test {{test_vectorized_udf_null_short}}:

{code:python}
data = [(None,), (2,), (3,), (4,)]
schema = StructType().add("short", ShortType())
df = self.spark.createDataFrame(data, schema)
short_f = pandas_udf(lambda x: x, ShortType())
res = df.select(short_f(col('short')))
self.assertEquals(df.collect(), res.collect())
{code}

The pandas.Series is of float64, but we define the return type as ShortType, and in this 
case it works well. So disallowing such a conversion does not seem feasible. For now, 
I think we can print a warning message if such a mismatch is detected.

cc [~hyukjin.kwon] What do you think about this idea?

 

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> 
>
> Key: SPARK-25461
> URL: https://issues.apache.org/jira/browse/SPARK-25461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>Reporter: Chongyuan Xiang
>Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
> return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
> .withColumn('potential_bad_col', gt_2('A'))
> )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) 
> | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-----------------+-----------+
> |  A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |1.0|            false|      false|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.






[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

2018-10-01 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16635048#comment-16635048
 ] 

Liang-Chi Hsieh commented on SPARK-25461:
-

I've looked more at this. We don't really check whether the pandas.Series's type matches 
the pre-defined return type, and for this case the conversion is not correct and is 
silently ignored.

I was trying to add a check and throw an exception when a mismatch is detected, but it 
looks like we rely on this behavior in the current codebase. For example, there is a 
test {{test_vectorized_udf_null_short}}:

{code:python}
data = [(None,), (2,), (3,), (4,)]
schema = StructType().add("short", ShortType())
df = self.spark.createDataFrame(data, schema)
short_f = pandas_udf(lambda x: x, ShortType())
res = df.select(short_f(col('short')))
self.assertEquals(df.collect(), res.collect())
{code}

The pandas.Series is of float64, but we define the return type as ShortType, and in this 
case it works well. So disallowing such a conversion does not seem feasible. For now, 
I think we can print a warning message if such a mismatch is detected.

cc [~hyukjin.kwon] What do you think about this idea?
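
A minimal sketch of the kind of warning being proposed (my own illustration in plain pandas/NumPy terms, not PySpark's actual worker code; the helper name {{check_returned_series}} is an assumption):
{code:python}
import warnings

import numpy as np
import pandas as pd


def check_returned_series(result: pd.Series, expected_dtype: np.dtype, return_type_name: str) -> pd.Series:
    """Warn, rather than fail, when a Pandas UDF returns a Series whose dtype does
    not match the declared Spark return type. Conversions such as float64 ->
    ShortType (see test_vectorized_udf_null_short) are relied upon and must keep
    working; the warning only makes the mismatch visible."""
    if result.dtype != expected_dtype:
        warnings.warn(
            "Pandas UDF returned a Series of dtype %s, but the declared return type is "
            "%s (expected dtype %s); values will be converted and may be silently "
            "truncated or nulled." % (result.dtype, return_type_name, expected_dtype))
    return result


# Example: the UDF from this ticket declares BooleanType but produces an
# object-dtype Series once None is involved, which would trigger the warning.
s = pd.Series([None, 1.0, 2.0])
check_returned_series((s >= 2).where(s.notnull()), np.dtype("bool"), "BooleanType")
{code}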

 

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> 
>
> Key: SPARK-25461
> URL: https://issues.apache.org/jira/browse/SPARK-25461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>Reporter: Chongyuan Xiang
>Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
> return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
> .withColumn('potential_bad_col', gt_2('A'))
> )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) 
> | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-----------------+-----------+
> |  A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |1.0|            false|      false|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.






[jira] [Created] (SPARK-25591) PySpark Accumulators with multiple PythonUDFs

2018-10-01 Thread Abdeali Kothari (JIRA)
Abdeali Kothari created SPARK-25591:
---

 Summary: PySpark Accumulators with multiple PythonUDFs
 Key: SPARK-25591
 URL: https://issues.apache.org/jira/browse/SPARK-25591
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.2
Reporter: Abdeali Kothari


When having multiple Python UDFs - the last Python UDF's accumulator is the 
only accumulator that gets updated.


{code:python}
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as F
from pyspark.sql import types as T

from pyspark import AccumulatorParam

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
test_accum = spark.sparkContext.accumulator(0.0)

SHUFFLE = False

def main(data):
    print(">>> Check0", test_accum.value)
    def test(x):
        global test_accum
        test_accum += 1.0
        return x

    print(">>> Check1", test_accum.value)

    def test2(x):
        global test_accum
        test_accum += 100.0
        return x

    print(">>> Check2", test_accum.value)
    func_udf = F.udf(test, T.DoubleType())
    print(">>> Check3", test_accum.value)
    func_udf2 = F.udf(test2, T.DoubleType())
    print(">>> Check4", test_accum.value)

    data = data.withColumn("out1", func_udf(data["a"]))
    if SHUFFLE:
        data = data.repartition(2)
    print(">>> Check5", test_accum.value)
    data = data.withColumn("out2", func_udf2(data["b"]))
    if SHUFFLE:
        data = data.repartition(2)
    print(">>> Check6", test_accum.value)

    data.show()  # ACTION
    print(">>> Check7", test_accum.value)
    return data


df = spark.createDataFrame([
    [1.0, 2.0]
], schema=T.StructType([T.StructField(field_name, T.DoubleType(), True) for field_name in ["a", "b"]]))

df2 = main(df)
{code}



 Output 1 - with SHUFFLE=False
...
# >>> Check7 100.0


 Output 2 - with SHUFFLE=True
...
# >>> Check7 101.0

Basically, it looks like:
 - The accumulator works only for the last UDF before a shuffle-like operation






[jira] [Commented] (SPARK-15689) Data source API v2

2018-10-01 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634871#comment-16634871
 ] 

Wenchen Fan commented on SPARK-15689:
-

So {{SupportsReportPartitioning}} is not powerful enough to support custom hash 
functions yet.

There are two major operators that may introduce a shuffle: join and aggregate. 
Aggregate only needs the data to be clustered, but doesn't care how, so data 
source v2 can support it if your implementation handles {{ClusteredDistribution}}. 
Join needs the data of its two children to be clustered by Spark's shuffle hash 
function, which data source v2 does not support currently.
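
For reference, a minimal sketch of the reader-side piece being discussed (illustrative only; the class name and clustering column are assumptions, not Spark code): a {{Partitioning}} reported via {{SupportsReportPartitioning}} that satisfies only {{ClusteredDistribution}}, which is why aggregates can avoid the shuffle while joins cannot.
{code:scala}
import org.apache.spark.sql.sources.v2.reader.partitioning.{ClusteredDistribution, Distribution, Partitioning}

// Hypothetical reader-side partitioning: the source already groups its rows by
// `column` across `parts` input partitions.
class KeyGroupedPartitioning(column: String, parts: Int) extends Partitioning {

  override def numPartitions(): Int = parts

  // Only a ClusteredDistribution on the clustering column is satisfied. Aggregates
  // ask for exactly that, so they can skip the shuffle; joins need Spark's own
  // shuffle hash partitioning, which this interface cannot express today.
  override def satisfy(distribution: Distribution): Boolean = distribution match {
    case c: ClusteredDistribution => c.clusteredColumns.contains(column)
    case _ => false
  }
}
{code}
A reader would expose this by implementing {{SupportsReportPartitioning}} and returning such an instance from {{outputPartitioning()}}.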

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: SPIP, releasenotes
> Fix For: 2.3.0
>
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers. The current data source API has a wide surface with dependency on 
> DataFrame/SQLContext, making the data source API compatibility depending on 
> the upper level API. The current data source API is also only row oriented 
> and has to go through an expensive external data type conversion to internal 
> data type.






[jira] [Updated] (SPARK-25543) Confusing log messages at DEBUG level, in K8s mode.

2018-10-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25543:
--
Fix Version/s: (was: 2.4.1)
   (was: 2.5.0)
   2.4.0

> Confusing log messages at DEBUG level, in K8s mode.
> ---
>
> Key: SPARK-25543
> URL: https://issues.apache.org/jira/browse/SPARK-25543
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Minor
> Fix For: 2.4.0
>
>
> Steps to reproduce.
> Start spark shell by providing a K8s master. Then turn the debug log on, 
> {code}
> scala> sc.setLogLevel("DEBUG")
> {code}
> {code}
> sc.setLogLevel("DEBUG")
> scala> 2018-09-26 09:33:54 DEBUG ExecutorPodsLifecycleManager:58 - Removed 
> executors with ids  from Spark that were either found to be deleted or 
> non-existent in the cluster.
> 2018-09-26 09:33:55 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors 
> with ids  from Spark that were either found to be deleted or non-existent in 
> the cluster.
> 2018-09-26 09:33:56 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors 
> with ids  from Spark that were either found to be deleted or non-existent in 
> the cluster.
> 2018-09-26 09:33:56 DEBUG ExecutorPodsPollingSnapshotSource:58 - 
> Resynchronizing full executor pod state from Kubernetes.
> 2018-09-26 09:33:57 DEBUG ExecutorPodsAllocator:58 - Currently have 1 running 
> executors and 0 pending executors. Map() executors have been requested but 
> are pending appearance in the cluster.
> 2018-09-26 09:33:57 DEBUG ExecutorPodsAllocator:58 - Current number of 
> running executors is equal to the number of requested executors. Not scaling 
> up further.
> 2018-09-26 09:33:57 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors 
> with ids  from Spark that were either found to be deleted or non-existent in 
> the cluster.
> 2018-09-26 09:33:58 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors 
> with ids  from Spark that were either found to be deleted or non-existent in 
> the cluster.
> 2018-09-26 09:33:59 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors 
> with ids  from Spark that were either found to be deleted or non-existent in 
> the cluster.
> 2018-09-26 09:34:00 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors 
> with ids  from Spark that were either found to be deleted or non-existent in 
> the cluster.
> 2018-09-26 09:34:01 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors 
> with ids  from Spark that were either found to be deleted or non-existent in 
> the cluster.
> 2018-09-26 09:34:02 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors 
> with ids  from Spark that were either found to be deleted or non-existent in 
> the cluster.
> 2018-09-26 09:34:03 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors 
> with ids  from Spark that were either found to be deleted or non-existent in 
> the cluster.
> 2018-09-26 09:34:04 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors 
> with ids  from Spark that were either found to be deleted or non-existent in 
> the cluster.
> 2018-09-26 09:34:05 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors 
> with ids  from Spark that were either found to be deleted or non-existent in 
> the cluster.
> 2018-09-26 09:34:06 DEBUG ExecutorPodsLifecycleManager:58 - Removed executors 
> with ids  from ...
> {code}
> The fix is easy: first check whether there are any removed executors before 
> producing the log message.
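
A minimal sketch of that guard (my own illustration with assumed names, not the actual patch for this ticket):
{code:scala}
// Hypothetical helper: only emit the debug message when executors were actually
// removed in this pass. The names here are assumptions for illustration.
object RemovalLogging {
  def logRemoved(removedExecutorIds: Seq[Long], logDebug: String => Unit): Unit = {
    if (removedExecutorIds.nonEmpty) {
      logDebug(s"Removed executors with ids ${removedExecutorIds.mkString(",")} from Spark " +
        "that were either found to be deleted or non-existent in the cluster.")
    }
  }
}
{code}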






[jira] [Updated] (SPARK-23401) Improve test cases for all supported types and unsupported types

2018-10-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23401:
--
Fix Version/s: (was: 2.4.1)
   (was: 2.5.0)
   2.4.0

> Improve test cases for all supported types and unsupported types
> 
>
> Key: SPARK-23401
> URL: https://issues.apache.org/jira/browse/SPARK-23401
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Aleksandr Koriagin
>Priority: Minor
> Fix For: 2.4.0
>
>
> Looks there are some missing types to test in supported types. 
> For example, please see 
> https://github.com/apache/spark/blob/c338c8cf8253c037ecd4f39bbd58ed5a86581b37/python/pyspark/sql/tests.py#L4397-L4401
> We can improve this test coverage.






[jira] [Updated] (SPARK-25542) Flaky test: OpenHashMapSuite

2018-10-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25542:
--
Fix Version/s: (was: 2.4.1)
   2.4.0

> Flaky test: OpenHashMapSuite
> 
>
> Key: SPARK-25542
> URL: https://issues.apache.org/jira/browse/SPARK-25542
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.5.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> - 
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96585/testReport/org.apache.spark.util.collection/OpenHashMapSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/]
>  (Sep 25, 2018 5:52:56 PM)
> {code:java}
> org.apache.spark.util.collection.OpenHashMapSuite.(It is not a test it is a 
> sbt.testing.SuiteSelector)
> Failing for the past 1 build (Since #96585 )
> Took 0 ms.
> Error Message
> sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: Java heap space
> Stacktrace
> sbt.ForkMain$ForkError: sbt.ForkMain$ForkError: java.lang.OutOfMemoryError: 
> Java heap space
>   at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:117)
>   at scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:115)
>   at 
> org.apache.spark.util.collection.OpenHashMap$$anonfun$1.apply$mcVI$sp(OpenHashMap.scala:159)
>   at 
> org.apache.spark.util.collection.OpenHashSet.rehash(OpenHashSet.scala:234)
>   at 
> org.apache.spark.util.collection.OpenHashSet.rehashIfNeeded(OpenHashSet.scala:171)
>   at 
> org.apache.spark.util.collection.OpenHashMap$mcI$sp.update$mcI$sp(OpenHashMap.scala:86)
>   at 
> org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17$$anonfun$apply$4.apply$mcVI$sp(OpenHashMapSuite.scala:192)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>   at 
> org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17.apply(OpenHashMapSuite.scala:191)
>   at 
> org.apache.spark.util.collection.OpenHashMapSuite$$anonfun$17.apply(OpenHashMapSuite.scala:188)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
> {code}






[jira] [Updated] (SPARK-25572) SparkR tests failed on CRAN on Java 10

2018-10-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25572:
--
Target Version/s:   (was: 2.4.1, 2.5.0)
   Fix Version/s: (was: 2.4.1)
  (was: 2.5.0)
  2.4.0

> SparkR tests failed on CRAN on Java 10
> --
>
> Key: SPARK-25572
> URL: https://issues.apache.org/jira/browse/SPARK-25572
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
> Fix For: 2.4.0
>
>
> Follow-up to SPARK-24255.
> From the 2.3.2 release we can see that CRAN doesn't seem to respect the system 
> requirements when running tests - we have seen cases where SparkR is run on 
> Java 10, which unfortunately Spark does not start on. For 2.4.x, let's attempt 
> to skip all tests.






[jira] [Updated] (SPARK-25578) Update to Scala 2.12.7

2018-10-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25578:
--
Target Version/s:   (was: 2.4.1)

> Update to Scala 2.12.7
> --
>
> Key: SPARK-25578
> URL: https://issues.apache.org/jira/browse/SPARK-25578
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.4.0
>
>
> We should use Scala 2.12.7 over 2.12.6 now, to pick up this fix. We ought to 
> be able to back out a workaround in Spark if so.
> [https://github.com/scala/scala/releases/tag/v2.12.7]
> [https://github.com/scala/scala/pull/7156] 






[jira] [Updated] (SPARK-25570) Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite

2018-10-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25570:
--
Fix Version/s: (was: 2.4.1)
   (was: 2.5.0)
   2.4.0

> Replace 2.3.1 with 2.3.2 in HiveExternalCatalogVersionsSuite
> 
>
> Key: SPARK-25570
> URL: https://issues.apache.org/jira/browse/SPARK-25570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.3, 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.3.3, 2.4.0
>
>
> This issue aims to prevent test slowdowns at HiveExternalCatalogVersionsSuite 
> by using the latest Spark 2.3.2 because the Apache mirror will remove the old 
> Spark 2.3.1 eventually. HiveExternalCatalogVersionsSuite will not fail 
> because SPARK-24813 implements a fallback logic, but it causes many trials in 
> all builds over `branch-2.3/branch-2.4/master`. We had better fix this issue.






[jira] [Resolved] (SPARK-25578) Update to Scala 2.12.7

2018-10-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25578.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22600
[https://github.com/apache/spark/pull/22600]

> Update to Scala 2.12.7
> --
>
> Key: SPARK-25578
> URL: https://issues.apache.org/jira/browse/SPARK-25578
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.4.0
>
>
> We should use Scala 2.12.7 over 2.12.6 now, to pick up this fix. We ought to 
> be able to back out a workaround in Spark if so.
> [https://github.com/scala/scala/releases/tag/v2.12.7]
> [https://github.com/scala/scala/pull/7156] 






[jira] [Assigned] (SPARK-25578) Update to Scala 2.12.7

2018-10-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25578:
-

Assignee: Sean Owen

> Update to Scala 2.12.7
> --
>
> Key: SPARK-25578
> URL: https://issues.apache.org/jira/browse/SPARK-25578
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.4.0
>
>
> We should use Scala 2.12.7 over 2.12.6 now, to pick up this fix. We ought to 
> be able to back out a workaround in Spark if so.
> [https://github.com/scala/scala/releases/tag/v2.12.7]
> [https://github.com/scala/scala/pull/7156] 






[jira] [Updated] (SPARK-25587) NPE in Dataset when reading from Parquet as Product

2018-10-01 Thread Michael Heuer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Heuer updated SPARK-25587:
--
Description: 
In an attempt to replicate the following issue in ADAM, a library downstream of 
Spark
https://github.com/bigdatagenomics/adam/issues/2058

also reported as
https://issues.apache.org/jira/browse/SPARK-25588

the following Spark Shell script throws NPE when attempting to read from 
Parquet.
{code:scala}
sc.setLogLevel("INFO")

import spark.implicits._

case class Inner(
  names: Seq[String] = Seq()
) 

case class Outer(
  inners: Seq[Inner] = Seq()
)

val inner = Inner(Seq("name0", "name1"))
val outer = Outer(Seq(inner))
val dataset = sc.parallelize(Seq(outer)).toDS()

val path = "outers.parquet"
dataset.toDF().write.format("parquet").save(path)

val roundtrip = spark.read.parquet(path).as[Outer]
roundtrip.first
{code}

Stack trace
{noformat}
$ spark-shell -i failure.scala
...
2018-10-01 16:57:48 INFO  ParquetWriteSupport:54 - Initialized Parquet 
WriteSupport with Catalyst schema:
{
  "type" : "struct",
  "fields" : [ {
"name" : "inners",
"type" : {
  "type" : "array",
  "elementType" : {
"type" : "struct",
"fields" : [ {
  "name" : "names",
  "type" : {
"type" : "array",
"elementType" : "string",
"containsNull" : true
  },
  "nullable" : true,
  "metadata" : { }
} ]
  },
  "containsNull" : true
},
"nullable" : true,
"metadata" : { }
  } ]
}
and corresponding Parquet message type:
message spark_schema {
  optional group inners (LIST) {
repeated group list {
  optional group element {
optional group names (LIST) {
  repeated group list {
optional binary element (UTF8);
  }
}
  }
}
  }
}

16:57:48 INFO  InternalParquetRecordWriter:165 - Flushing mem columnStore to 
file. allocated memory: 0
2018-10-01 16:57:48 INFO  InternalParquetRecordWriter:165 - Flushing mem 
columnStore to file. allocated memory: 26
...
2018-10-01 16:57:49 INFO  FileSourceStrategy:54 - Output Data Schema: 
struct>>>
2018-10-01 16:57:49 INFO  FileSourceScanExec:54 - Pushed Filters:
java.lang.NullPointerException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at 
org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
  at scala.Option.map(Option.scala:146)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
  at 
org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
  at 
org.apache.spark.sql.catalyst.expressions.objects.MapObjects.doGenCode(objects.scala:796)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
  at 
org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
  at 
org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:296)
  at 
org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$class.prepareArguments(objects.scala:98)
  at 

[jira] [Updated] (SPARK-25590) kubernetes-model-2.0.0.jar masks default Spark logging config

2018-10-01 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-25590:
---
Description: 
That jar file, which is packaged when the k8s profile is enabled, has a log4j 
configuration embedded in it:

{noformat}
$ jar tf /path/to/kubernetes-model-2.0.0.jar | grep log4j
log4j.properties
{noformat}

What this causes is that Spark will always use that log4j configuration instead 
of its own default (log4j-defaults.properties), unless the user overrides it by 
somehow adding their own in the classpath before the kubernetes one.

You can see that by running spark-shell. With the k8s jar in:

{noformat}
$ ./bin/spark-shell 
...
Setting default log level to "WARN"
{noformat}

Removing the k8s jar:

{noformat}
$ ./bin/spark-shell 
...
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
{noformat}

The proper fix would be for the k8s jar to not ship that file, and then just 
upgrade the dependency in Spark, but if there's something easy we can do in the 
meantime...

  was:
That jar file, which is packaged when the k8s profile is enabled, has a log4j 
configuration embedded in it:

{noformat}
$ jar tf /path/to/kubernetes-model-2.0.0.jar | grep log4j
log4j.properties
{noformat}

What this causes is that Spark will always use that log4j configuration instead 
of its own default (log4j-defaults.properties), unless the user overrides it by 
somehow adding their own in the classpath before the kubernetes one.

You can see that by running spark-shell. With the k8s jar in:

{noformat}
$ ./bin/spark-shell 
...
Setting default log level to "WARN"
{noformat}

Removing the k8s jar:

{noformat}

{noformat}
$ ./bin/spark-shell 
...
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
{noformat}

The proper fix would be for the k8s jar to not ship that file, and then just 
upgrade the dependency in Spark, but if there's something easy we can do in the 
meantime...


> kubernetes-model-2.0.0.jar masks default Spark logging config
> -
>
> Key: SPARK-25590
> URL: https://issues.apache.org/jira/browse/SPARK-25590
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> That jar file, which is packaged when the k8s profile is enabled, has a log4j 
> configuration embedded in it:
> {noformat}
> $ jar tf /path/to/kubernetes-model-2.0.0.jar | grep log4j
> log4j.properties
> {noformat}
> What this causes is that Spark will always use that log4j configuration 
> instead of its own default (log4j-defaults.properties), unless the user 
> overrides it by somehow adding their own in the classpath before the 
> kubernetes one.
> You can see that by running spark-shell. With the k8s jar in:
> {noformat}
> $ ./bin/spark-shell 
> ...
> Setting default log level to "WARN"
> {noformat}
> Removing the k8s jar:
> {noformat}
> $ ./bin/spark-shell 
> ...
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> {noformat}
> The proper fix would be for the k8s jar to not ship that file, and then just 
> upgrade the dependency in Spark, but if there's something easy we can do in 
> the meantime...






[jira] [Created] (SPARK-25590) kubernetes-model-2.0.0.jar masks default Spark logging config

2018-10-01 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-25590:
--

 Summary: kubernetes-model-2.0.0.jar masks default Spark logging 
config
 Key: SPARK-25590
 URL: https://issues.apache.org/jira/browse/SPARK-25590
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


That jar file, which is packaged when the k8s profile is enabled, has a log4j 
configuration embedded in it:

{noformat}
$ jar tf /path/to/kubernetes-model-2.0.0.jar | grep log4j
log4j.properties
{noformat}

What this causes is that Spark will always use that log4j configuration instead 
of its own default (log4j-defaults.properties), unless the user overrides it by 
somehow adding their own in the classpath before the kubernetes one.

You can see that by running spark-shell. With the k8s jar in:

{noformat}
$ ./bin/spark-shell 
...
Setting default log level to "WARN"
{noformat}

Removing the k8s jar:

{noformat}
$ ./bin/spark-shell 
...
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
{noformat}

The proper fix would be for the k8s jar to not ship that file, and then just 
upgrade the dependency in Spark, but if there's something easy we can do in the 
meantime...






[jira] [Commented] (SPARK-25589) Add BloomFilterBenchmark

2018-10-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634741#comment-16634741
 ] 

Apache Spark commented on SPARK-25589:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22605

> Add BloomFilterBenchmark
> 
>
> Key: SPARK-25589
> URL: https://issues.apache.org/jira/browse/SPARK-25589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Tests
>Affects Versions: 2.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to add BloomFilterBenchmark.
> For the ORC data source, bloom filters have been supported for a long time.
> For the Parquet data source, support is expected to be added in the next Parquet release.






[jira] [Assigned] (SPARK-25589) Add BloomFilterBenchmark

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25589:


Assignee: (was: Apache Spark)

> Add BloomFilterBenchmark
> 
>
> Key: SPARK-25589
> URL: https://issues.apache.org/jira/browse/SPARK-25589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Tests
>Affects Versions: 2.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to add BloomFilterBenchmark.
> For the ORC data source, bloom filters have been supported for a long time.
> For the Parquet data source, support is expected to be added in the next Parquet release.






[jira] [Commented] (SPARK-25589) Add BloomFilterBenchmark

2018-10-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634740#comment-16634740
 ] 

Apache Spark commented on SPARK-25589:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22605

> Add BloomFilterBenchmark
> 
>
> Key: SPARK-25589
> URL: https://issues.apache.org/jira/browse/SPARK-25589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Tests
>Affects Versions: 2.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to add BloomFilterBenchmark.
> For the ORC data source, bloom filters have been supported for a long time.
> For the Parquet data source, support is expected to be added in the next Parquet release.






[jira] [Assigned] (SPARK-25589) Add BloomFilterBenchmark

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25589:


Assignee: Apache Spark

> Add BloomFilterBenchmark
> 
>
> Key: SPARK-25589
> URL: https://issues.apache.org/jira/browse/SPARK-25589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Tests
>Affects Versions: 2.5.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> This issue aims to add BloomFilterBenchmark.
> For ORC data source, it has been supported for a long time.
> For Parquet data source, it's expected to be added in next Parquet release.






[jira] [Updated] (SPARK-25586) toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError

2018-10-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25586:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

This is not a bug; SPARK-25118 is not committed. This is an improvement that 
might work around a problem in the proposed implementation of that issue.

> toString method of GeneralizedLinearRegressionTrainingSummary runs in 
> infinite loop throwing StackOverflowError
> ---
>
> Key: SPARK-25586
> URL: https://issues.apache.org/jira/browse/SPARK-25586
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Affects Versions: 2.3.0
>Reporter: Ankur Gupta
>Priority: Minor
>
> After the change in SPARK-25118, which enables spark-shell to run with the 
> default log level, test_glr_summary started failing with a StackOverflowError.
> Cause: ClosureCleaner calls logDebug on various objects, and when it is called 
> for GeneralizedLinearRegressionTrainingSummary, it starts a Spark job which 
> runs into an infinite loop and fails with the exception below.
> {code}
> ==
> ERROR: test_glr_summary (pyspark.ml.tests.TrainingSummaryTest)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests.py", 
> line 1809, in test_glr_summary
> self.assertTrue(isinstance(s.aic, float))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py",
>  line 1781, in aic
> return self._call_java("aic")
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/wrapper.py",
>  line 55, in _call_java
> return _java2py(sc, m(*java_args))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o31639.aic.
> : java.lang.StackOverflowError
>   at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>   at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
>   at java.io.File.exists(File.java:819)
>   at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1245)
>   at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1212)
>   at sun.misc.URLClassPath.findResource(URLClassPath.java:188)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
>   at java.lang.ClassLoader.getResource(ClassLoader.java:1093)
>   at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
>   at java.lang.Class.getResourceAsStream(Class.java:2223)
>   at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:43)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:87)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:269)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2342)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:864)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:863)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:364)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:863)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> 

[jira] [Updated] (SPARK-25589) Add BloomFilterBenchmark

2018-10-01 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25589:
--
Component/s: Tests

> Add BloomFilterBenchmark
> 
>
> Key: SPARK-25589
> URL: https://issues.apache.org/jira/browse/SPARK-25589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Tests
>Affects Versions: 2.5.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to add BloomFilterBenchmark.
> For the ORC data source, bloom filters have been supported for a long time.
> For the Parquet data source, support is expected to be added in the next Parquet release.






[jira] [Created] (SPARK-25589) Add BloomFilterBenchmark

2018-10-01 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-25589:
-

 Summary: Add BloomFilterBenchmark
 Key: SPARK-25589
 URL: https://issues.apache.org/jira/browse/SPARK-25589
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.5.0
Reporter: Dongjoon Hyun


This issue aims to add BloomFilterBenchmark.
For the ORC data source, bloom filters have been supported for a long time.
For the Parquet data source, support is expected to be added in the next Parquet release.






[jira] [Assigned] (SPARK-25575) SQL tab in the spark UI doesn't have the option of hiding tables, even though other UI tabs do.

2018-10-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25575:
-

Assignee: shahid

> SQL tab in the spark UI doesn't have the option of hiding tables, even though 
> other UI tabs do. 
> -
>
> Key: SPARK-25575
> URL: https://issues.apache.org/jira/browse/SPARK-25575
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
> Fix For: 2.5.0
>
> Attachments: Screenshot from 2018-09-29 23-26-45.png, Screenshot from 
> 2018-09-29 23-26-57.png
>
>
> Test steps:
>  1) bin/spark-shell
> {code:java}
> sql("create table a (id int)")
> for(i <- 1 to 100) sql(s"insert into a values ($i)")
> {code}
> Open SQL tab in the web UI,
>  !Screenshot from 2018-09-29 23-26-45.png! 
> Open Jobs tab,
>  !Screenshot from 2018-09-29 23-26-57.png! 
>  






[jira] [Resolved] (SPARK-25575) SQL tab in the spark UI doesn't have the option of hiding tables, even though other UI tabs do.

2018-10-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25575.
---
   Resolution: Fixed
Fix Version/s: 2.5.0

Issue resolved by pull request 22592
[https://github.com/apache/spark/pull/22592]

> SQL tab in the spark UI doesn't have the option of hiding tables, even though 
> other UI tabs do. 
> -
>
> Key: SPARK-25575
> URL: https://issues.apache.org/jira/browse/SPARK-25575
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
> Fix For: 2.5.0
>
> Attachments: Screenshot from 2018-09-29 23-26-45.png, Screenshot from 
> 2018-09-29 23-26-57.png
>
>
> Test steps:
>  1) bin/spark-shell
> {code:java}
> sql("create table a (id int)")
> for(i <- 1 to 100) sql(s"insert into a values ($i)")
> {code}
> Open SQL tab in the web UI,
>  !Screenshot from 2018-09-29 23-26-45.png! 
> Open Jobs tab,
>  !Screenshot from 2018-09-29 23-26-57.png! 
>  






[jira] [Updated] (SPARK-25587) NPE in Dataset when reading from Parquet as Product

2018-10-01 Thread Michael Heuer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Heuer updated SPARK-25587:
--
Description: 
In an attempt to replicate the following issue in ADAM, a library downstream of 
Spark
https://github.com/bigdatagenomics/adam/issues/2058

also reported as
https://issues.apache.org/jira/browse/SPARK-25588

the following Spark Shell script throws NPE when attempting to read from 
Parquet.
{code:scala}
sc.setLogLevel("INFO")

import spark.implicits._

case class Inner(
  names: Seq[String] = Seq()
) extends Product {
  def productArity: Int = 1
  def productElement(i: Int): Any = i match {
case 0 => names
  }
  def canEqual(that: Any): Boolean = that match {
case Inner => true
case _ => false
  }
}

case class Outer(
  inners: Seq[Inner] = Seq()
) extends Product {
  def productArity: Int = 1
  def productElement(i: Int): Any = i match {
case 0 => inners
  }
  def canEqual(that: Any): Boolean = that match {
case Outer => true
case _ => false
  }
}

val inner = Inner(Seq("name0", "name1"))
val outer = Outer(Seq(inner))
val dataset = sc.parallelize(Seq(outer)).toDS()

val path = "outers.parquet"
dataset.toDF().write.format("parquet").save(path)

val roundtrip = spark.read.parquet(path).as[Outer]
roundtrip.first
{code}

Stack trace
{noformat}
$ spark-shell -i failure.scala
...
2018-10-01 16:57:48 INFO  ParquetWriteSupport:54 - Initialized Parquet 
WriteSupport with Catalyst schema:
{
  "type" : "struct",
  "fields" : [ {
"name" : "inners",
"type" : {
  "type" : "array",
  "elementType" : {
"type" : "struct",
"fields" : [ {
  "name" : "names",
  "type" : {
"type" : "array",
"elementType" : "string",
"containsNull" : true
  },
  "nullable" : true,
  "metadata" : { }
} ]
  },
  "containsNull" : true
},
"nullable" : true,
"metadata" : { }
  } ]
}
and corresponding Parquet message type:
message spark_schema {
  optional group inners (LIST) {
repeated group list {
  optional group element {
optional group names (LIST) {
  repeated group list {
optional binary element (UTF8);
  }
}
  }
}
  }
}

16:57:48 INFO  InternalParquetRecordWriter:165 - Flushing mem columnStore to 
file. allocated memory: 0
2018-10-01 16:57:48 INFO  InternalParquetRecordWriter:165 - Flushing mem 
columnStore to file. allocated memory: 26
...
2018-10-01 16:57:49 INFO  FileSourceStrategy:54 - Output Data Schema: 
struct>>>
2018-10-01 16:57:49 INFO  FileSourceScanExec:54 - Pushed Filters:
java.lang.NullPointerException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at 
org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
  at scala.Option.map(Option.scala:146)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
  at 
org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
  at 
org.apache.spark.sql.catalyst.expressions.objects.MapObjects.doGenCode(objects.scala:796)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
  at 
org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
  at 
org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
  at 

[jira] [Updated] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-01 Thread Michael Heuer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Heuer updated SPARK-25588:
--
Description: 
In ADAM, a library downstream of Spark, we use Avro to define a schema, 
generate Java classes from the Avro schema using the avro-maven-plugin, and 
generate Scala Products from the Avro schema using our own code generation 
library.

In the code path demonstrated by the following unit test, we write out to 
Parquet and read back in using an RDD of Avro-generated Java classes and then 
write out to Parquet and read back in using a Dataset of Avro-generated Scala 
Products.

{code:scala}
  sparkTest("transform reads to variant rdd") {
val reads = sc.loadAlignments(testFile("small.sam"))

def checkSave(variants: VariantRDD) {
  val tempPath = tmpLocation(".adam")
  variants.saveAsParquet(tempPath)

  assert(sc.loadVariants(tempPath).rdd.count === 20)
}

val variants: VariantRDD = reads.transmute[Variant, VariantProduct, 
VariantRDD](
  (rdd: RDD[AlignmentRecord]) => {
rdd.map(AlignmentRecordRDDSuite.varFn)
  })

checkSave(variants)

val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

val variantsDs: VariantRDD = reads.transmuteDataset[Variant, 
VariantProduct, VariantRDD](
  (ds: Dataset[AlignmentRecordProduct]) => {
ds.map(r => {
  VariantProduct.fromAvro(
AlignmentRecordRDDSuite.varFn(r.toAvro))
})
  })

checkSave(variantsDs)
}
{code}
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540
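
As background for the two schemas shown below: Spark SQL already exposes a switch between the two Parquet LIST encodings involved here, spark.sql.parquet.writeLegacyFormat. A minimal spark-shell sketch (illustrative only, not a verified fix for this issue; the output path is invented):
{code:scala}
// Sketch only: with writeLegacyFormat=true the Dataset writer emits the
// pre-Spark-1.5, parquet-avro-style list layout (similar to the RDD code path
// below); with the default (false) it emits the standard three-level layout
// (repeated group list { optional element }).
val session = spark.newSession()
session.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
session.range(1)
  .selectExpr("array('name0', 'name1') as names")
  .write.mode("overwrite").parquet("/tmp/legacy-list-layout.parquet")
// Inspect the layout with: parquet-tools schema /tmp/legacy-list-layout.parquet
{code}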

Note the schemas in Parquet are different:

RDD code path
{noformat}
$ parquet-tools schema 
/var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
message org.bdgenomics.formats.avro.Variant {
  optional binary contigName (UTF8);
  optional int64 start;
  optional int64 end;
  required group names (LIST) {
repeated binary array (UTF8);
  }
  optional boolean splitFromMultiAllelic;
  optional binary referenceAllele (UTF8);
  optional binary alternateAllele (UTF8);
  optional double quality;
  optional boolean filtersApplied;
  optional boolean filtersPassed;
  required group filtersFailed (LIST) {
repeated binary array (UTF8);
  }
  optional group annotation {
optional binary ancestralAllele (UTF8);
optional int32 alleleCount;
optional int32 readDepth;
optional int32 forwardReadDepth;
optional int32 reverseReadDepth;
optional int32 referenceReadDepth;
optional int32 referenceForwardReadDepth;
optional int32 referenceReverseReadDepth;
optional float alleleFrequency;
optional binary cigar (UTF8);
optional boolean dbSnp;
optional boolean hapMap2;
optional boolean hapMap3;
optional boolean validated;
optional boolean thousandGenomes;
optional boolean somatic;
required group transcriptEffects (LIST) {
  repeated group array {
optional binary alternateAllele (UTF8);
required group effects (LIST) {
  repeated binary array (UTF8);
}
optional binary geneName (UTF8);
optional binary geneId (UTF8);
optional binary featureType (UTF8);
optional binary featureId (UTF8);
optional binary biotype (UTF8);
optional int32 rank;
optional int32 total;
optional binary genomicHgvs (UTF8);
optional binary transcriptHgvs (UTF8);
optional binary proteinHgvs (UTF8);
optional int32 cdnaPosition;
optional int32 cdnaLength;
optional int32 cdsPosition;
optional int32 cdsLength;
optional int32 proteinPosition;
optional int32 proteinLength;
optional int32 distance;
required group messages (LIST) {
  repeated binary array (ENUM);
}
  }
}
required group attributes (MAP) {
  repeated group map (MAP_KEY_VALUE) {
required binary key (UTF8);
required binary value (UTF8);
  }
}
  }
}
{noformat}

Dataset code path:
{noformat}
$ parquet-tools schema 
/var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite2879366708769671307.adam/part-0-b123eb8b-2648-4648-8096-b3de95343141-c000.snappy.parquet
message spark_schema {
  optional binary contigName (UTF8);
  optional int64 start;
  optional int64 end;
  optional group names (LIST) {
repeated group list {
  optional binary element (UTF8);
}
  }
  optional boolean splitFromMultiAllelic;
  optional binary referenceAllele (UTF8);
  optional binary alternateAllele (UTF8);
  optional double quality;
  optional boolean filtersApplied;
  optional boolean filtersPassed;
  optional group filtersFailed (LIST) {
repeated group list {
  optional binary element (UTF8);
}
  }
  optional group annotation {
optional 

[jira] [Created] (SPARK-25588) SchemaParseException: Can't redefine: list when reading from Parquet

2018-10-01 Thread Michael Heuer (JIRA)
Michael Heuer created SPARK-25588:
-

 Summary: SchemaParseException: Can't redefine: list when reading 
from Parquet
 Key: SPARK-25588
 URL: https://issues.apache.org/jira/browse/SPARK-25588
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
 Environment: Spark version 2.4.0 (RC2).

{noformat}
$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_181
Branch
Compiled by user  on 2018-09-27T14:50:10Z
Revision
Url
Type --help for more information.
{noformat}
Reporter: Michael Heuer


In ADAM, a library downstream of Spark, we use Avro to define a schema, 
generate Java classes from the Avro schema using the avro-maven-plugin, and 
generate Scala Products from the Avro schema using our own code generation 
library.

In the code path demonstrated by the following unit test, we write out to 
Parquet and read back in using an RDD of Avro-generated Java classes and then 
write out to Parquet and read back in using a Dataset of Avro-generated Scala 
Products.

{code:scala}
  sparkTest("transform reads to variant rdd") {
val reads = sc.loadAlignments(testFile("small.sam"))

def checkSave(variants: VariantRDD) {
  val tempPath = tmpLocation(".adam")
  variants.saveAsParquet(tempPath)

  assert(sc.loadVariants(tempPath).rdd.count === 20)
}

val variants: VariantRDD = reads.transmute[Variant, VariantProduct, 
VariantRDD](
  (rdd: RDD[AlignmentRecord]) => {
rdd.map(AlignmentRecordRDDSuite.varFn)
  })

checkSave(variants)

val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

val variantsDs: VariantRDD = reads.transmuteDataset[Variant, 
VariantProduct, VariantRDD](
  (ds: Dataset[AlignmentRecordProduct]) => {
ds.map(r => {
  VariantProduct.fromAvro(
AlignmentRecordRDDSuite.varFn(r.toAvro))
})
  })

checkSave(variantsDs)
}
{code}
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDSuite.scala#L1540

Note the schemas in Parquet are different:

RDD code path
{noformat}
$ parquet-tools schema 
/var/folders/m6/4yqn_4q129lbth_dq3qzj_8hgn/T/TempSuite3400691035694870641.adam/part-r-0.gz.parquet
message org.bdgenomics.formats.avro.Variant {
  optional binary contigName (UTF8);
  optional int64 start;
  optional int64 end;
  required group names (LIST) {
repeated binary array (UTF8);
  }
  optional boolean splitFromMultiAllelic;
  optional binary referenceAllele (UTF8);
  optional binary alternateAllele (UTF8);
  optional double quality;
  optional boolean filtersApplied;
  optional boolean filtersPassed;
  required group filtersFailed (LIST) {
repeated binary array (UTF8);
  }
  optional group annotation {
optional binary ancestralAllele (UTF8);
optional int32 alleleCount;
optional int32 readDepth;
optional int32 forwardReadDepth;
optional int32 reverseReadDepth;
optional int32 referenceReadDepth;
optional int32 referenceForwardReadDepth;
optional int32 referenceReverseReadDepth;
optional float alleleFrequency;
optional binary cigar (UTF8);
optional boolean dbSnp;
optional boolean hapMap2;
optional boolean hapMap3;
optional boolean validated;
optional boolean thousandGenomes;
optional boolean somatic;
required group transcriptEffects (LIST) {
  repeated group array {
optional binary alternateAllele (UTF8);
required group effects (LIST) {
  repeated binary array (UTF8);
}
optional binary geneName (UTF8);
optional binary geneId (UTF8);
optional binary featureType (UTF8);
optional binary featureId (UTF8);
optional binary biotype (UTF8);
optional int32 rank;
optional int32 total;
optional binary genomicHgvs (UTF8);
optional binary transcriptHgvs (UTF8);
optional binary proteinHgvs (UTF8);
optional int32 cdnaPosition;
optional int32 cdnaLength;
optional int32 cdsPosition;
optional int32 cdsLength;
optional int32 proteinPosition;
optional int32 proteinLength;
optional int32 distance;
required group messages (LIST) {
  repeated binary array (ENUM);
}
  }
}
required group attributes (MAP) {
  repeated group map (MAP_KEY_VALUE) {
required binary key (UTF8);
required binary value (UTF8);
  }
}
  }
}
{noformat}

Dataset code path:
{noformat}
$ parquet-tools schema 

[jira] [Updated] (SPARK-25587) NPE in Dataset when reading from Parquet as Product

2018-10-01 Thread Michael Heuer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Heuer updated SPARK-25587:
--
Description: 
In an attempt to replicate the following issue in ADAM, a library downstream of 
Spark
https://github.com/bigdatagenomics/adam/issues/2058

the following spark-shell script throws a NullPointerException (NPE) when 
attempting to read from Parquet.
{code:scala}
sc.setLogLevel("INFO")

import spark.implicits._

case class Inner(
  names: Seq[String] = Seq()
) extends Product {
  def productArity: Int = 1
  def productElement(i: Int): Any = i match {
case 0 => names
  }
  def canEqual(that: Any): Boolean = that match {
case Inner => true
case _ => false
  }
}

case class Outer(
  inners: Seq[Inner] = Seq()
) extends Product {
  def productArity: Int = 1
  def productElement(i: Int): Any = i match {
case 0 => inners
  }
  def canEqual(that: Any): Boolean = that match {
case Outer => true
case _ => false
  }
}

val inner = Inner(Seq("name0", "name1"))
val outer = Outer(Seq(inner))
val dataset = sc.parallelize(Seq(outer)).toDS()

val path = "outers.parquet"
dataset.toDF().write.format("parquet").save(path)

val roundtrip = spark.read.parquet(path).as[Outer]
roundtrip.first
{code}

Stack trace
{noformat}
$ spark-shell -i failure.scala
...
2018-10-01 16:57:48 INFO  ParquetWriteSupport:54 - Initialized Parquet 
WriteSupport with Catalyst schema:
{
  "type" : "struct",
  "fields" : [ {
"name" : "inners",
"type" : {
  "type" : "array",
  "elementType" : {
"type" : "struct",
"fields" : [ {
  "name" : "names",
  "type" : {
"type" : "array",
"elementType" : "string",
"containsNull" : true
  },
  "nullable" : true,
  "metadata" : { }
} ]
  },
  "containsNull" : true
},
"nullable" : true,
"metadata" : { }
  } ]
}
and corresponding Parquet message type:
message spark_schema {
  optional group inners (LIST) {
repeated group list {
  optional group element {
optional group names (LIST) {
  repeated group list {
optional binary element (UTF8);
  }
}
  }
}
  }
}

16:57:48 INFO  InternalParquetRecordWriter:165 - Flushing mem columnStore to 
file. allocated memory: 0
2018-10-01 16:57:48 INFO  InternalParquetRecordWriter:165 - Flushing mem 
columnStore to file. allocated memory: 26
...
2018-10-01 16:57:49 INFO  FileSourceStrategy:54 - Output Data Schema: 
struct<inners:array<struct<names:array<string>>>>
2018-10-01 16:57:49 INFO  FileSourceScanExec:54 - Pushed Filters:
java.lang.NullPointerException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at 
org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
  at scala.Option.map(Option.scala:146)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
  at 
org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
  at 
org.apache.spark.sql.catalyst.expressions.objects.MapObjects.doGenCode(objects.scala:796)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
  at 
org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:99)
  at 
org.apache.spark.sql.catalyst.expressions.objects.InvokeLike$$anonfun$3.apply(objects.scala:98)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 

[jira] [Created] (SPARK-25587) NPE in Dataset when reading from Parquet as Product

2018-10-01 Thread Michael Heuer (JIRA)
Michael Heuer created SPARK-25587:
-

 Summary: NPE in Dataset when reading from Parquet as Product
 Key: SPARK-25587
 URL: https://issues.apache.org/jira/browse/SPARK-25587
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
 Environment: Spark version 2.4.0 (RC2).

{noformat}
$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_181
Branch
Compiled by user  on 2018-09-27T14:50:10Z
Revision
Url
Type --help for more information.
{noformat}

Reporter: Michael Heuer


In an attempt to replicate the following issue in ADAM, a library downstream of 
Spark
https://github.com/bigdatagenomics/adam/issues/2058

the following spark-shell script throws a NullPointerException (NPE) when 
attempting to read from Parquet.
{code:scala}
sc.setLogLevel("INFO")

import spark.implicits._

case class Inner(
  names: Seq[String] = Seq()
) extends Product {
  def productArity: Int = 1
  def productElement(i: Int): Any = i match {
case 0 => names
  }
  def canEqual(that: Any): Boolean = that match {
case Inner => true
case _ => false
  }
}

case class Outer(
  inners: Seq[Inner] = Seq()
) extends Product {
  def productArity: Int = 1
  def productElement(i: Int): Any = i match {
case 0 => inners
  }
  def canEqual(that: Any): Boolean = that match {
case Outer => true
case _ => false
  }
}

val inner = Inner(Seq("name0", "name1"))
val outer = Outer(Seq(inner))
val dataset = sc.parallelize(Seq(outer)).toDS()

val path = "outers.parquet"
dataset.toDF().write.format("parquet").save(path)

val roundtrip = spark.read.parquet(path).as[Outer]
roundtrip.first
{code}

Stack trace
{noformat}
$ spark-shell -i failure.scala
...
2018-10-01 16:57:48 INFO  ParquetWriteSupport:54 - Initialized Parquet 
WriteSupport with Catalyst schema:
{
  "type" : "struct",
  "fields" : [ {
"name" : "inners",
"type" : {
  "type" : "array",
  "elementType" : {
"type" : "struct",
"fields" : [ {
  "name" : "names",
  "type" : {
"type" : "array",
"elementType" : "string",
"containsNull" : true
  },
  "nullable" : true,
  "metadata" : { }
} ]
  },
  "containsNull" : true
},
"nullable" : true,
"metadata" : { }
  } ]
}
and corresponding Parquet message type:
message spark_schema {
  optional group inners (LIST) {
repeated group list {
  optional group element {
optional group names (LIST) {
  repeated group list {
optional binary element (UTF8);
  }
}
  }
}
  }
}

16:57:48 INFO  InternalParquetRecordWriter:165 - Flushing mem columnStore to 
file. allocated memory: 0
2018-10-01 16:57:48 INFO  InternalParquetRecordWriter:165 - Flushing mem 
columnStore to file. allocated memory: 26
...
2018-10-01 16:57:49 INFO  FileSourceStrategy:54 - Output Data Schema: 
struct<inners:array<struct<names:array<string>>>>
2018-10-01 16:57:49 INFO  FileSourceScanExec:54 - Pushed Filters:
java.lang.NullPointerException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at 
org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
  at scala.Option.map(Option.scala:146)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
  at 
org.apache.spark.sql.catalyst.expressions.If.doGenCode(conditionalExpressions.scala:70)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
  at 
org.apache.spark.sql.catalyst.expressions.objects.MapObjects.doGenCode(objects.scala:796)
  at 

[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence

2018-10-01 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634679#comment-16634679
 ] 

John Bauer commented on SPARK-21542:


The above is not as minimal as I would have liked. It is based on the unit 
tests associated with the fix referenced for DefaultParamsReadable and 
DefaultParamsWritable, which I thought would cover the desired behavior, i.e. 
saving and loading a pipeline after calling fit(). Unfortunately that case was 
not tested, so I flailed at the code for a while until I got something that 
worked. A lot of the leftover unit-test scaffolding could probably be removed, 
but at least this seems to work.
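
For readers unfamiliar with these helpers, the Scala-side pattern that the new Python DefaultParamsReadable/DefaultParamsWritable mirror looks roughly like the sketch below (illustrative only; NoopTransformer and the save path are invented and are not part of this ticket or of the test code discussed here):
{code:scala}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// A params-only transformer: mixing in DefaultParamsWritable provides a
// metadata-based writer, and DefaultParamsReadable on the companion provides load().
class NoopTransformer(override val uid: String) extends Transformer with DefaultParamsWritable {
  def this() = this(Identifiable.randomUID("noop"))
  override def transform(dataset: Dataset[_]): DataFrame = dataset.toDF()
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): NoopTransformer = defaultCopy(extra)
}
object NoopTransformer extends DefaultParamsReadable[NoopTransformer]

// Round trip: only params and metadata are persisted, no data.
// new NoopTransformer().write.overwrite().save("/tmp/noop-transformer")
// val restored = NoopTransformer.load("/tmp/noop-transformer")
{code}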

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWritable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence

2018-10-01 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634677#comment-16634677
 ] 

John Bauer commented on SPARK-21542:



{code:python}
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Sep 27 10:25:10 2018

@author: JohnBauer
"""
from pyspark.sql import DataFrame, Row
from pyspark.sql import SQLContext 
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.functions import udf

from pyspark import keyword_only, SparkContext
from pyspark.ml import Estimator, Model, Pipeline, PipelineModel, Transformer, UnaryTransformer

from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
#from pyspark.ml.util import *
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.types import FloatType, DoubleType 
#, LongType, ArrayType, StringType, StructType, StructField

spark = SparkSession\
.builder\
.appName("Minimal_1")\
.getOrCreate()

data_path = "/Users/JohnBauer/spark/data/mllib"
# Load training data
data = spark.read.format("libsvm").load("{}/sample_libsvm_data.txt".format(data_path))
train, test = data.randomSplit([0.7, 0.3])
train.show(5)


class MockDataset(DataFrame):

def __init__(self):
self.index = 0


class HasFake(Params):

def __init__(self):
super(HasFake, self).__init__()
self.fake = Param(self, "fake", "fake param")

def getFake(self):
return self.getOrDefault(self.fake)


class MockTransformer(Transformer, DefaultParamsReadable, 
DefaultParamsWritable, HasFake):

def __init__(self):
super(MockTransformer, self).__init__()
self.dataset_index = None

def _transform(self, dataset):
self.dataset_index = dataset.index
dataset.index += 1
return dataset


class MockUnaryTransformer(UnaryTransformer,
   DefaultParamsReadable,
   DefaultParamsWritable,):
   #HasInputCol):

shift = Param(Params._dummy(), "shift", "The amount by which to shift " +
  "data in a DataFrame",
  typeConverter=TypeConverters.toFloat)

inputCol = Param(Params._dummy(), "inputCol", "column of DataFrame to transform",
  typeConverter=TypeConverters.toString)

outputCol = Param(Params._dummy(), "outputCol", "name of transformed column " +
  "to be added to DataFrame",
  typeConverter=TypeConverters.toString)

@keyword_only
def __init__(self, shiftVal=1, inputCol="features", outputCol="outputCol"):  #, inputCol='features'):
super(MockUnaryTransformer, self).__init__()
self._setDefault(shift=1)
self._set(shift=shiftVal)
self._setDefault(inputCol=inputCol)
self._setDefault(outputCol=outputCol)

def getShift(self):
return self.getOrDefault(self.shift)

def setShift(self, shift):
self._set(shift=shift)

def createTransformFunc(self):
shiftVal = self.getShift()
return lambda x: x + shiftVal

def outputDataType(self):
return DoubleType()

def validateInputType(self, inputType):
if inputType != DoubleType():
print("input type: {}".format(inputType))
return
#raise TypeError("Bad input type: {}. ".format(inputType) +
#"Requires Double.")

def _transform(self, dataset):
shift = self.getOrDefault("shift")

def f(v):
return v + shift

t = FloatType()
out_col = self.getOutputCol()
in_col = dataset[self.getInputCol()]
return dataset.withColumn(out_col, udf(f, t)(in_col))

class MockEstimator(Estimator, DefaultParamsReadable, DefaultParamsWritable, 
HasFake):

def __init__(self):
super(MockEstimator, self).__init__()
self.dataset_index = None

def _fit(self, dataset):
self.dataset_index = dataset.index
model = MockModel()
self._copyValues(model)
return model

class MockModel(MockTransformer, Model, HasFake):
pass


#class PipelineTests(PySparkTestCase):
class PipelineTests(object):

def test_pipeline(self, data=None):
#dataset = MockDataset()
dataset = MockDataset() if data is None else data
estimator0 = MockEstimator()
transformer1 = MockTransformer()
estimator2 = MockEstimator()
transformer3 = MockTransformer()
transformer4 = MockUnaryTransformer(inputCol="label",
outputCol="shifted_label")
pipeline = Pipeline(stages=[estimator0, transformer1, estimator2,
transformer3, transformer4])
pipeline_model = pipeline.fit(dataset,
 

[jira] [Commented] (SPARK-15689) Data source API v2

2018-10-01 Thread Geoff Freeman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634650#comment-16634650
 ] 

Geoff Freeman commented on SPARK-15689:
---

I'm having trouble figuring out how to expose a custom hash function from my 
DataSourceV2. I'm trying to implement SupportsReportPartitioning, but I don't 
see how I can convert the physical.Partitioning that's required by 
outputPartitioning() into a HashPartitioning. I'd like to figure out how to 
pass this to DataSourceScanExec so that we can avoid shuffles. [~rxin] 
[~cloud_fan]

Thanks!

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: SPIP, releasenotes
> Fix For: 2.3.0
>
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers. The current data source API has a wide surface with dependency on 
> DataFrame/SQLContext, making the data source API compatibility depending on 
> the upper level API. The current data source API is also only row oriented 
> and has to go through an expensive external data type conversion to internal 
> data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15689) Data source API v2

2018-10-01 Thread Geoff Freeman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634650#comment-16634650
 ] 

Geoff Freeman edited comment on SPARK-15689 at 10/1/18 9:24 PM:


I'm having trouble figuring out how to expose a custom hash function from my 
DataSourceV2. I'm trying to implement SupportsReportPartitioning, but I don't 
see how I can convert the physical.Partitioning that's required by 
outputPartitioning() into a HashPartitioning. I'd like to figure out how to 
pass this to DataSourceScanExec so that we can avoid shuffles. Are there any 
examples of where it's been implemented that I could look at? [~rxin] 
[~cloud_fan]

Thanks!


was (Author: gfreeman):
I'm having trouble figuring out how to expose a custom hash function from my 
DataSourceV2. I'm trying to implement SupportsReportPartitioning, but I don't 
see how I can convert the physical.Partitioning that's required by 
outputPartitioning() into a HashPartitioning. I'd like to figure out how to 
pass this to DataSourceScanExec so that we can avoid shuffles. [~rxin] 
[~cloud_fan]

Thanks!

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: SPIP, releasenotes
> Fix For: 2.3.0
>
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers. The current data source API has a wide surface with dependency on 
> DataFrame/SQLContext, making the data source API compatibility depending on 
> the upper level API. The current data source API is also only row oriented 
> and has to go through an expensive external data type conversion to internal 
> data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25586) toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25586:


Assignee: (was: Apache Spark)

> toString method of GeneralizedLinearRegressionTrainingSummary runs in 
> infinite loop throwing StackOverflowError
> ---
>
> Key: SPARK-25586
> URL: https://issues.apache.org/jira/browse/SPARK-25586
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core
>Affects Versions: 2.3.0
>Reporter: Ankur Gupta
>Priority: Major
>
> After the change in SPARK-25118, which enables spark-shell to run with the 
> default log level, test_glr_summary started failing with a StackOverflowError.
> Cause: ClosureCleaner calls logDebug on various objects, and when it is called 
> for GeneralizedLinearRegressionTrainingSummary, it starts a Spark job which 
> runs into an infinite loop and fails with the exception below.
> {code}
> ==
> ERROR: test_glr_summary (pyspark.ml.tests.TrainingSummaryTest)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests.py", 
> line 1809, in test_glr_summary
> self.assertTrue(isinstance(s.aic, float))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py",
>  line 1781, in aic
> return self._call_java("aic")
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/wrapper.py",
>  line 55, in _call_java
> return _java2py(sc, m(*java_args))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o31639.aic.
> : java.lang.StackOverflowError
>   at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>   at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
>   at java.io.File.exists(File.java:819)
>   at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1245)
>   at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1212)
>   at sun.misc.URLClassPath.findResource(URLClassPath.java:188)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
>   at java.lang.ClassLoader.getResource(ClassLoader.java:1093)
>   at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
>   at java.lang.Class.getResourceAsStream(Class.java:2223)
>   at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:43)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:87)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:269)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2342)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:864)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:863)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:364)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:863)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> 

[jira] [Assigned] (SPARK-25586) toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25586:


Assignee: Apache Spark

> toString method of GeneralizedLinearRegressionTrainingSummary runs in 
> infinite loop throwing StackOverflowError
> ---
>
> Key: SPARK-25586
> URL: https://issues.apache.org/jira/browse/SPARK-25586
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core
>Affects Versions: 2.3.0
>Reporter: Ankur Gupta
>Assignee: Apache Spark
>Priority: Major
>
> After the change in SPARK-25118, which enables spark-shell to run with the 
> default log level, test_glr_summary started failing with a StackOverflowError.
> Cause: ClosureCleaner calls logDebug on various objects, and when it is called 
> for GeneralizedLinearRegressionTrainingSummary, it starts a Spark job which 
> runs into an infinite loop and fails with the exception below.
> {code}
> ==
> ERROR: test_glr_summary (pyspark.ml.tests.TrainingSummaryTest)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests.py", 
> line 1809, in test_glr_summary
> self.assertTrue(isinstance(s.aic, float))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py",
>  line 1781, in aic
> return self._call_java("aic")
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/wrapper.py",
>  line 55, in _call_java
> return _java2py(sc, m(*java_args))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o31639.aic.
> : java.lang.StackOverflowError
>   at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>   at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
>   at java.io.File.exists(File.java:819)
>   at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1245)
>   at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1212)
>   at sun.misc.URLClassPath.findResource(URLClassPath.java:188)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
>   at java.lang.ClassLoader.getResource(ClassLoader.java:1093)
>   at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
>   at java.lang.Class.getResourceAsStream(Class.java:2223)
>   at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:43)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:87)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:269)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2342)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:864)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:863)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:364)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:863)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> 

[jira] [Commented] (SPARK-25586) toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError

2018-10-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634643#comment-16634643
 ] 

Apache Spark commented on SPARK-25586:
--

User 'ankuriitg' has created a pull request for this issue:
https://github.com/apache/spark/pull/22604

> toString method of GeneralizedLinearRegressionTrainingSummary runs in 
> infinite loop throwing StackOverflowError
> ---
>
> Key: SPARK-25586
> URL: https://issues.apache.org/jira/browse/SPARK-25586
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core
>Affects Versions: 2.3.0
>Reporter: Ankur Gupta
>Priority: Major
>
> After the change in SPARK-25118, which enables spark-shell to run with the 
> default log level, test_glr_summary started failing with a StackOverflowError.
> Cause: ClosureCleaner calls logDebug on various objects, and when it is called 
> for GeneralizedLinearRegressionTrainingSummary, it starts a Spark job which 
> runs into an infinite loop and fails with the exception below.
> {code}
> ==
> ERROR: test_glr_summary (pyspark.ml.tests.TrainingSummaryTest)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests.py", 
> line 1809, in test_glr_summary
> self.assertTrue(isinstance(s.aic, float))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py",
>  line 1781, in aic
> return self._call_java("aic")
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/wrapper.py",
>  line 55, in _call_java
> return _java2py(sc, m(*java_args))
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o31639.aic.
> : java.lang.StackOverflowError
>   at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>   at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
>   at java.io.File.exists(File.java:819)
>   at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1245)
>   at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1212)
>   at sun.misc.URLClassPath.findResource(URLClassPath.java:188)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
>   at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
>   at java.lang.ClassLoader.getResource(ClassLoader.java:1093)
>   at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
>   at java.lang.Class.getResourceAsStream(Class.java:2223)
>   at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:43)
>   at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:87)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:269)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2342)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:864)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:863)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:364)
>   at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:863)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> 

[jira] [Created] (SPARK-25586) toString method of GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing StackOverflowError

2018-10-01 Thread Ankur Gupta (JIRA)
Ankur Gupta created SPARK-25586:
---

 Summary: toString method of 
GeneralizedLinearRegressionTrainingSummary runs in infinite loop throwing 
StackOverflowError
 Key: SPARK-25586
 URL: https://issues.apache.org/jira/browse/SPARK-25586
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Spark Core
Affects Versions: 2.3.0
Reporter: Ankur Gupta


After the change in SPARK-25118, which enables spark-shell to run with the 
default log level, test_glr_summary started failing with a StackOverflowError.

Cause: ClosureCleaner calls logDebug on various objects, and when it is called 
for GeneralizedLinearRegressionTrainingSummary, it starts a Spark job which 
runs into an infinite loop and fails with the exception below.
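
A minimal sketch of the reported cycle, with invented names (this is not Spark code, and the real loop runs through ClosureCleaner and job submission): an object's toString triggers work whose debug logging stringifies the same object again, so the call never terminates.
{code:scala}
// Illustrative only: debugLog stands in for logDebug with the debug level enabled.
class SummarySketch {
  private def debugLog(msg: => String): Unit = {
    val debugEnabled = true            // debug messages are evaluated
    if (debugEnabled) println(msg)
  }
  private def runJob(): Double = {
    debugLog(s"cleaning closure that captures $this")   // re-enters toString
    0.0
  }
  override def toString: String = s"SummarySketch(aic=${runJob()})"
}

new SummarySketch().toString   // java.lang.StackOverflowError
{code}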

{code}
==
ERROR: test_glr_summary (pyspark.ml.tests.TrainingSummaryTest)
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests.py", 
line 1809, in test_glr_summary
self.assertTrue(isinstance(s.aic, float))
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py",
 line 1781, in aic
return self._call_java("aic")
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/wrapper.py", 
line 55, in _call_java
return _java2py(sc, m(*java_args))
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
 line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py", 
line 63, in deco
return f(*a, **kw)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
 line 328, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o31639.aic.
: java.lang.StackOverflowError
at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
at java.io.File.exists(File.java:819)
at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1245)
at sun.misc.URLClassPath$FileLoader.findResource(URLClassPath.java:1212)
at sun.misc.URLClassPath.findResource(URLClassPath.java:188)
at java.net.URLClassLoader$2.run(URLClassLoader.java:569)
at java.net.URLClassLoader$2.run(URLClassLoader.java:567)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findResource(URLClassLoader.java:566)
at java.lang.ClassLoader.getResource(ClassLoader.java:1093)
at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:232)
at java.lang.Class.getResourceAsStream(Class.java:2223)
at 
org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:43)
at 
org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:87)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:269)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2342)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:864)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:863)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:364)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:863)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:613)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.DeserializeToObjectExec.doExecute(objects.scala:89)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 

[jira] [Created] (SPARK-25585) Allow users to specify scale of result in Decimal arithmetic

2018-10-01 Thread Benito Kestelman (JIRA)
Benito Kestelman created SPARK-25585:


 Summary: Allow users to specify scale of result in Decimal 
arithmetic
 Key: SPARK-25585
 URL: https://issues.apache.org/jira/browse/SPARK-25585
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Benito Kestelman


The current behavior of Spark Decimal during arithmetic makes it difficult for 
users to achieve their desired level of precision. Numeric literals are 
automatically cast to unlimited precision during arithmetic, but the final 
result is cast down depending on the precision and scale of the operands, 
according to MS SQL rules (discussed in other JIRAs). This final cast can 
cause substantial loss of scale.

For example:
{noformat}
scala> spark.sql("select 1.3/3.41").show(false)

++

|(CAST(1.3 AS DECIMAL(3,2)) / CAST(3.41 AS DECIMAL(3,2)))|

++

|0.381232    |

++{noformat}
To get higher scale in the result, a user must cast the operands to higher 
scale:
{noformat}
scala> spark.sql("select cast(1.3 as decimal(5,4))/cast(3.41 as 
decimal(5,4))").show(false)

++

|(CAST(1.3 AS DECIMAL(5,4)) / CAST(3.41 AS DECIMAL(5,4)))|

++

|0.3812316716    |

++

scala> spark.sql("select cast(1.3 as decimal(10,9))/cast(3.41 as 
decimal(10,9))").show(false)

+--+

|(CAST(1.3 AS DECIMAL(10,9)) / CAST(3.41 AS DECIMAL(10,9)))|

+--+

|0.38123167155425219941    |

+--+{noformat}
But if the user casts too high, the result's scale decreases. 
{noformat}
scala> spark.sql("select cast(1.3 as decimal(25,24))/cast(3.41 as 
decimal(25,24))").show(false)

++

|(CAST(1.3 AS DECIMAL(25,24)) / CAST(3.41 AS DECIMAL(25,24)))|

++

|0.3812316715543 |

++{noformat}
Thus, the user has no way of knowing how to cast to get the scale he wants. 
This problem is even harder to deal with when using variables instead of 
literals. 

The user should be able to explicitly set the desired scale of the result. 
MySQL offers this capability in the form of a system variable called 
"div_precision_increment."

From the MySQL docs: "In division performed with 
[{{/}}|https://dev.mysql.com/doc/refman/8.0/en/arithmetic-functions.html#operator_divide], 
the scale of the result when using two exact-value operands is the scale of the 
first operand plus the value of the 
[{{div_precision_increment}}|https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_div_precision_increment] 
system variable (which is 4 by default). For example, the result of the 
expression {{5.05 / 0.014}} has a scale of six decimal places ({{360.714286}})."
{noformat}
mysql> SELECT 1/7;
+--------+
| 1/7    |
+--------+
| 0.1429 |
+--------+
mysql> SET div_precision_increment = 12;
mysql> SELECT 1/7;
+----------------+
| 1/7            |
+----------------+
| 0.142857142857 |
+----------------+
{noformat}
This gives the user full control of the result's scale after arithmetic and 
obviates the need for casting all over the place.

Since Spark 2.3, we already have DecimalType.MINIMUM_ADJUSTED_SCALE, which is 
similar to div_precision_increment. It just needs to be made modifiable by the 
user. 
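
A sketch of how that could look to users (the configuration key below is invented for illustration; today DecimalType.MINIMUM_ADJUSTED_SCALE is an internal constant, currently 6, and is not user-settable):
{code:scala}
// Hypothetical usage -- "spark.sql.decimalOperations.minimumAdjustedScale" does
// not exist today; it only illustrates the knob this ticket asks for.
spark.conf.set("spark.sql.decimalOperations.minimumAdjustedScale", "12")
spark.sql("select 1.3/3.41").show(false)   // would keep at least 12 digits of scale
{code}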

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-10-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634534#comment-16634534
 ] 

Apache Spark commented on SPARK-25538:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22602

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Blocker
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25538) incorrect row counts after distinct()

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25538:


Assignee: (was: Apache Spark)

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Blocker
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-10-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634532#comment-16634532
 ] 

Apache Spark commented on SPARK-25538:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22602

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Blocker
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25538) incorrect row counts after distinct()

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25538:


Assignee: Apache Spark

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25578) Update to Scala 2.12.7

2018-10-01 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25578:
--
Issue Type: Bug  (was: Improvement)

OK. I'm proposing it for 2.4.0 mostly because it _might_ be a bug fix, given 
[~sadhen]'s comment. As far as I can tell it's not breaking 2.4, but according to 
the Scala release notes the change was made to support 2.4. Maybe that overstated 
things, but it might be less confusing to get this one in if we have another RC.

> Update to Scala 2.12.7
> --
>
> Key: SPARK-25578
> URL: https://issues.apache.org/jira/browse/SPARK-25578
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Minor
>
> We should use Scala 2.12.7 over 2.12.6 now, to pick up this fix. We ought to 
> be able to back out a workaround in Spark if so.
> [https://github.com/scala/scala/releases/tag/v2.12.7]
> [https://github.com/scala/scala/pull/7156] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25062) Clean up BlockLocations in FileStatus objects

2018-10-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634527#comment-16634527
 ] 

Apache Spark commented on SPARK-25062:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/22603

> Clean up BlockLocations in FileStatus objects
> -
>
> Key: SPARK-25062
> URL: https://issues.apache.org/jira/browse/SPARK-25062
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> When Spark lists a collection of files, it either does the listing on the driver or 
> creates tasks to list the files, depending on the number of files. See 
> [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170]
> If Spark creates tasks to list files, each task creates one FileStatus object 
> per file. Before sending a FileStatus to the driver, Spark converts it to a 
> SerializableFileStatus. On the driver side, Spark turns the SerializableFileStatus 
> back into a FileStatus and also creates a BlockLocation object for each 
> FileStatus using 
>  
> {code:java}
> new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) 
> {code}
>  
> After deserialization on the driver side, a BlockLocation lacks much of the 
> information that the original HdfsBlockLocation had.
>  
> If Spark does the listing on the driver side, each FileStatus object holds 
> HdfsBlockLocation objects, which carry a lot of information that Spark doesn't use. 
> Because of this, the FileStatus objects take more memory than if they were created 
> on the executor side.
>  
> Later, Spark puts all of these objects into _SharedInMemoryCache_, and that cache 
> takes 2.2x more memory if the files were listed on the driver side than if they were 
> listed on the executor side.
>  
> In our case, _SharedInMemoryCache_ takes 125M when we do the scan on executors 
> and 270M when we do it on the driver. This is for about 19000 files.
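
For context, a minimal sketch of the kind of normalization described above (hypothetical helper name, not Spark's actual code or the contents of the PR): rebuild each BlockLocation with only the fields Spark reads, so driver-side listings would cache the same lightweight objects as executor-side listings.

{code:scala}
import org.apache.hadoop.fs.{BlockLocation, LocatedFileStatus}

// Hypothetical helper: keep only names/hosts/offset/length, mirroring what the
// executor-side listing path already produces after (de)serialization.
def slimBlockLocations(status: LocatedFileStatus): Array[BlockLocation] =
  status.getBlockLocations.map { loc =>
    new BlockLocation(loc.getNames, loc.getHosts, loc.getOffset, loc.getLength)
  }
{code}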



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25062) Clean up BlockLocations in FileStatus objects

2018-10-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634524#comment-16634524
 ] 

Apache Spark commented on SPARK-25062:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/22603

> Clean up BlockLocations in FileStatus objects
> -
>
> Key: SPARK-25062
> URL: https://issues.apache.org/jira/browse/SPARK-25062
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> When Spark lists a collection of files, it either does the listing on the driver or 
> creates tasks to list the files, depending on the number of files. See 
> [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170]
> If Spark creates tasks to list files, each task creates one FileStatus object 
> per file. Before sending a FileStatus to the driver, Spark converts it to a 
> SerializableFileStatus. On the driver side, Spark turns the SerializableFileStatus 
> back into a FileStatus and also creates a BlockLocation object for each 
> FileStatus using 
>  
> {code:java}
> new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) 
> {code}
>  
> After deserialization on the driver side, a BlockLocation lacks much of the 
> information that the original HdfsBlockLocation had.
>  
> If Spark does the listing on the driver side, each FileStatus object holds 
> HdfsBlockLocation objects, which carry a lot of information that Spark doesn't use. 
> Because of this, the FileStatus objects take more memory than if they were created 
> on the executor side.
>  
> Later, Spark puts all of these objects into _SharedInMemoryCache_, and that cache 
> takes 2.2x more memory if the files were listed on the driver side than if they were 
> listed on the executor side.
>  
> In our case, _SharedInMemoryCache_ takes 125M when we do the scan on executors 
> and 270M when we do it on the driver. This is for about 19000 files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25062) Clean up BlockLocations in FileStatus objects

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25062:


Assignee: (was: Apache Spark)

> Clean up BlockLocations in FileStatus objects
> -
>
> Key: SPARK-25062
> URL: https://issues.apache.org/jira/browse/SPARK-25062
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: andrzej.stankev...@gmail.com
>Priority: Major
>
> When Spark lists a collection of files, it either does the listing on the driver or 
> creates tasks to list the files, depending on the number of files. See 
> [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170]
> If Spark creates tasks to list files, each task creates one FileStatus object 
> per file. Before sending a FileStatus to the driver, Spark converts it to a 
> SerializableFileStatus. On the driver side, Spark turns the SerializableFileStatus 
> back into a FileStatus and also creates a BlockLocation object for each 
> FileStatus using 
>  
> {code:java}
> new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) 
> {code}
>  
> After deserialization on the driver side, a BlockLocation lacks much of the 
> information that the original HdfsBlockLocation had.
>  
> If Spark does the listing on the driver side, each FileStatus object holds 
> HdfsBlockLocation objects, which carry a lot of information that Spark doesn't use. 
> Because of this, the FileStatus objects take more memory than if they were created 
> on the executor side.
>  
> Later, Spark puts all of these objects into _SharedInMemoryCache_, and that cache 
> takes 2.2x more memory if the files were listed on the driver side than if they were 
> listed on the executor side.
>  
> In our case, _SharedInMemoryCache_ takes 125M when we do the scan on executors 
> and 270M when we do it on the driver. This is for about 19000 files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25062) Clean up BlockLocations in FileStatus objects

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25062:


Assignee: Apache Spark

> Clean up BlockLocations in FileStatus objects
> -
>
> Key: SPARK-25062
> URL: https://issues.apache.org/jira/browse/SPARK-25062
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: andrzej.stankev...@gmail.com
>Assignee: Apache Spark
>Priority: Major
>
> When Spark lists a collection of files, it either does the listing on the driver or 
> creates tasks to list the files, depending on the number of files. See 
> [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170]
> If Spark creates tasks to list files, each task creates one FileStatus object 
> per file. Before sending a FileStatus to the driver, Spark converts it to a 
> SerializableFileStatus. On the driver side, Spark turns the SerializableFileStatus 
> back into a FileStatus and also creates a BlockLocation object for each 
> FileStatus using 
>  
> {code:java}
> new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) 
> {code}
>  
> After deserialization on the driver side, a BlockLocation lacks much of the 
> information that the original HdfsBlockLocation had.
>  
> If Spark does the listing on the driver side, each FileStatus object holds 
> HdfsBlockLocation objects, which carry a lot of information that Spark doesn't use. 
> Because of this, the FileStatus objects take more memory than if they were created 
> on the executor side.
>  
> Later, Spark puts all of these objects into _SharedInMemoryCache_, and that cache 
> takes 2.2x more memory if the files were listed on the driver side than if they were 
> listed on the executor side.
>  
> In our case, _SharedInMemoryCache_ takes 125M when we do the scan on executors 
> and 270M when we do it on the driver. This is for about 19000 files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25578) Update to Scala 2.12.7

2018-10-01 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634492#comment-16634492
 ] 

Dongjoon Hyun commented on SPARK-25578:
---

[~srowen], could you update the `Type` and `Priority`? We don't allow `Minor` and 
`Improvement` JIRAs to land on `branch-2.4`.

> Update to Scala 2.12.7
> --
>
> Key: SPARK-25578
> URL: https://issues.apache.org/jira/browse/SPARK-25578
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Minor
>
> We should use Scala 2.12.7 over 2.12.6 now, to pick up this fix. We ought to 
> be able to back out a workaround in Spark if so.
> [https://github.com/scala/scala/releases/tag/v2.12.7]
> [https://github.com/scala/scala/pull/7156] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-10-01 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634430#comment-16634430
 ] 

Dongjoon Hyun commented on SPARK-25538:
---

[~mgaido]'s PR, https://github.com/apache/spark/pull/22602, fixes the decimal 
issue and looks reasonable.

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Blocker
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25315) setting "auto.offset.reset" to "earliest" has no effect in Structured Streaming with Spark 2.3.1 and Kafka 1.0

2018-10-01 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-25315.
--
Resolution: Not A Bug

> setting "auto.offset.reset" to "earliest" has no effect in Structured 
> Streaming with Spark 2.3.1 and Kafka 1.0
> --
>
> Key: SPARK-25315
> URL: https://issues.apache.org/jira/browse/SPARK-25315
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
> Environment: Standalone; running in IDEA
>Reporter: Zhenhao Li
>Priority: Major
>
> The following code won't read from the beginning of the topic:
> {code:java}
> val kafkaOptions = Map[String, String](
>  "kafka.bootstrap.servers" -> KAFKA_BOOTSTRAP_SERVERS,
>  "subscribe" -> TOPIC,
>  "group.id" -> GROUP_ID,
>  "auto.offset.reset" -> "earliest"
> )
> val myStream = sparkSession
> .readStream
> .format("kafka")
> .options(kafkaOptions)
> .load()
> .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
>   myStream
> .writeStream
> .format("console")
> .start()
> .awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25315) setting "auto.offset.reset" to "earliest" has no effect in Structured Streaming with Spark 2.3.1 and Kafka 1.0

2018-10-01 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634423#comment-16634423
 ] 

Shixiong Zhu commented on SPARK-25315:
--

Kafka’s own configurations should be set with the "kafka." prefix, so the "group.id" and 
"auto.offset.reset" options above will be ignored.

In addition, after you add the "kafka." prefix, you will see error messages because 
"group.id" and "auto.offset.reset" are not supported. They are documented here: 
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations
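
For reference, a minimal sketch of the documented way to read from the beginning of the topic (reusing the KAFKA_BOOTSTRAP_SERVERS and TOPIC values from the report above; "startingOffsets" is the Spark-level replacement for Kafka's "auto.offset.reset"):

{code:scala}
val myStream = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)  // Kafka params need the "kafka." prefix
  .option("subscribe", TOPIC)
  .option("startingOffsets", "earliest")                       // instead of auto.offset.reset
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
{code}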

> setting "auto.offset.reset" to "earliest" has no effect in Structured 
> Streaming with Spark 2.3.1 and Kafka 1.0
> --
>
> Key: SPARK-25315
> URL: https://issues.apache.org/jira/browse/SPARK-25315
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
> Environment: Standalone; running in IDEA
>Reporter: Zhenhao Li
>Priority: Major
>
> The following code won't read from the beginning of the topic:
> {code:java}
> val kafkaOptions = Map[String, String](
>  "kafka.bootstrap.servers" -> KAFKA_BOOTSTRAP_SERVERS,
>  "subscribe" -> TOPIC,
>  "group.id" -> GROUP_ID,
>  "auto.offset.reset" -> "earliest"
> )
> val myStream = sparkSession
> .readStream
> .format("kafka")
> .options(kafkaOptions)
> .load()
> .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
>   myStream
> .writeStream
> .format("console")
> .start()
> .awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25582) Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java library

2018-10-01 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634389#comment-16634389
 ] 

Marco Gaido commented on SPARK-25582:
-

Sorry, I linked the wrong JIRA in the PR. Please disregard it. I'll unlink it ASAP, 
as I am not in front of my laptop right now. Sorry for the trouble.

> Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java 
> library
> -
>
> Key: SPARK-25582
> URL: https://issues.apache.org/jira/browse/SPARK-25582
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> I have noticed an error that appears in the Spark logs when using the Spark 
> SQL library in a Java 8 project.
> When I run the code below with the attached files as input, I can see the 
> ERROR below in the application logs.
> I am using the *org.apache.spark:spark-sql_2.11:2.2.0* library in my Java 
> project.
> Note that the same logic implemented with the Python API (pyspark) doesn't 
> produce any Exception like this.
> *Code*
> {code:java}
> SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
> SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
> Dataset<Row> df_a = sparkSession.read().option("header", 
> true).csv("local/fileA.csv").dropDuplicates();
> Dataset<Row> df_b = sparkSession.read().option("header", 
> true).csv("local/fileB.csv").dropDuplicates();
> Dataset<Row> df_c = sparkSession.read().option("header", 
> true).csv("local/fileC.csv").dropDuplicates();
> String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", 
> "colF"};
> String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
> Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
> Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, 
> arrayToSeq(key_join_2), "left");
> df_inventory_2.show();
> {code}
> *Error message*
> {code:java}
> 18/10/01 09:58:07 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:393)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:385)
>     at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1286)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:385)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1285)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:825)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:411)
>     at 

[jira] [Updated] (SPARK-25583) Add newly added History server related configurations in the documentation

2018-10-01 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-25583:
---
Priority: Minor  (was: Trivial)

> Add newly added History server related configurations in the documentation
> --
>
> Key: SPARK-25583
> URL: https://issues.apache.org/jira/browse/SPARK-25583
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: shahid
>Priority: Minor
>
> Some of the history server related configurations are missing from the 
> documentation, e.g. 'spark.history.store.maxDiskUsage' and 
> 'spark.ui.liveUpdate.period'.
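
For reference, a sketch of how such entries could look in conf/spark-defaults.conf (the values shown are illustrative, not recommendations):

{code}
# Maximum disk usage for the history server's on-disk application store
spark.history.store.maxDiskUsage   10g
# How often live UI entities are flushed to the store
spark.ui.liveUpdate.period         100ms
{code}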



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25576) Fix lint failure in 2.2

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25576:


Assignee: Apache Spark

> Fix lint failure in 2.2
> ---
>
> Key: SPARK-25576
> URL: https://issues.apache.org/jira/browse/SPARK-25576
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.2
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> See the errors:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.2-lint/913/console



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25538) incorrect row counts after distinct()

2018-10-01 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634344#comment-16634344
 ] 

Kazuaki Ishizaki edited comment on SPARK-25538 at 10/1/18 5:21 PM:
---

This test case does not print {{63}} using the master branch.

{code}
  test("test2") {
    val df = spark.read.parquet("file:///SPARK-25538-repro")
    val c1 = df.distinct.count
    val c2 = df.sort("col_0").distinct.count
    val c3 = df.withColumnRenamed("col_0", "new").distinct.count
    val c0 = df.count
    print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n")
  }

c1=64, c2=73, c3=64, c0=123
{code}


was (Author: kiszk):
This test case does not print {{63}}.

{code}
  test("test2") {
    val df = spark.read.parquet("file:///SPARK-25538-repro")
    val c1 = df.distinct.count
    val c2 = df.sort("col_0").distinct.count
    val c3 = df.withColumnRenamed("col_0", "new").distinct.count
    val c0 = df.count
    print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n")
  }

c1=64, c2=73, c3=64, c0=123
{code}

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Blocker
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25576) Fix lint failure in 2.2

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25576:


Assignee: (was: Apache Spark)

> Fix lint failure in 2.2
> ---
>
> Key: SPARK-25576
> URL: https://issues.apache.org/jira/browse/SPARK-25576
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.2
>Reporter: Xiao Li
>Priority: Major
>
> See the errors:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-branch-2.2-lint/913/console



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25322) ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25322.
---
Resolution: Done

> ML, Graph 2.4 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-25322
> URL: https://issues.apache.org/jira/browse/SPARK-25322
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25319) Spark MLlib, GraphX 2.4 QA umbrella

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25319.
---
Resolution: Done
  Assignee: Weichen Xu  (was: Joseph K. Bradley)

> Spark MLlib, GraphX 2.4 QA umbrella
> ---
>
> Key: SPARK-25319
> URL: https://issues.apache.org/jira/browse/SPARK-25319
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Critical
> Fix For: 2.4.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. SparkR is tracked separately.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25325) ML, Graph 2.4 QA: Update user guide for new features & APIs

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25325.
---
Resolution: Won't Do

> ML, Graph 2.4 QA: Update user guide for new features & APIs
> ---
>
> Key: SPARK-25325
> URL: https://issues.apache.org/jira/browse/SPARK-25325
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Major
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25323) ML 2.4 QA: API: Python API coverage

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25323:
--
Priority: Major  (was: Critical)

> ML 2.4 QA: API: Python API coverage
> ---
>
> Key: SPARK-25323
> URL: https://issues.apache.org/jira/browse/SPARK-25323
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Major
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25323) ML 2.4 QA: API: Python API coverage

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25323.
---
Resolution: Won't Do

> ML 2.4 QA: API: Python API coverage
> ---
>
> Key: SPARK-25323
> URL: https://issues.apache.org/jira/browse/SPARK-25323
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25325) ML, Graph 2.4 QA: Update user guide for new features & APIs

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25325:
--
Priority: Major  (was: Critical)

> ML, Graph 2.4 QA: Update user guide for new features & APIs
> ---
>
> Key: SPARK-25325
> URL: https://issues.apache.org/jira/browse/SPARK-25325
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Major
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> For MLlib:
> * This task does not include major reorganizations for the programming guide.
> * We should now begin copying algorithm details from the spark.mllib guide to 
> spark.ml as needed, rather than just linking back to the corresponding 
> algorithms in the spark.mllib user guide.
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25326) ML, Graph 2.4 QA: Programming guide update and migration guide

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25326.
---
Resolution: Won't Do

> ML, Graph 2.4 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-25326
> URL: https://issues.apache.org/jira/browse/SPARK-25326
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25326) ML, Graph 2.4 QA: Programming guide update and migration guide

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25326:
--
Priority: Major  (was: Critical)

> ML, Graph 2.4 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-25326
> URL: https://issues.apache.org/jira/browse/SPARK-25326
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Major
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides. Updates will include:
>  * Add migration guide subsection.
>  ** Use the results of the QA audit JIRAs.
>  * Check phrasing, especially in main sections (for outdated items such as 
> "In this release, ...")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25584) Document libsvm data source in doc site

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25584:
--
Component/s: ML

> Document libsvm data source in doc site
> ---
>
> Key: SPARK-25584
> URL: https://issues.apache.org/jira/browse/SPARK-25584
> Project: Spark
>  Issue Type: Story
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Currently, we only have Scala/Java API docs for image data source. It would 
> be nice to have some documentation in the doc site. So Python/R users can 
> also discover this feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25347) Document image data source in doc site

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25347:
--
Component/s: ML

> Document image data source in doc site
> --
>
> Key: SPARK-25347
> URL: https://issues.apache.org/jira/browse/SPARK-25347
> Project: Spark
>  Issue Type: Story
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Currently, we only have Scala/Java API docs for image data source. It would 
> be nice to have some documentation in the doc site. So Python/R users can 
> also discover this feature.
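
For reference, a short usage sketch that such a doc page might include (the directory path is illustrative):

{code:scala}
// Load a directory of images as a DataFrame with a single "image" struct column.
val images = spark.read.format("image")
  .option("dropInvalid", true)
  .load("data/mllib/images/origin")
images.select("image.origin", "image.width", "image.height").show(5)
{code}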



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25524) Spark datasource for image/libsvm user guide

2018-10-01 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634347#comment-16634347
 ] 

Xiangrui Meng commented on SPARK-25524:
---

Marked this as a duplicate and created SPARK-25584 for libsvm separately.

> Spark datasource for image/libsvm user guide
> 
>
> Key: SPARK-25524
> URL: https://issues.apache.org/jira/browse/SPARK-25524
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Major
>
> Add Spark datasource for image/libsvm user guide.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25584) Document libsvm data source in doc site

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25584:
--
Description: Currently, we only have Scala/Java API docs for libsvm data 
source. It would be nice to have some documentation in the doc site. So 
Python/R users can also discover this feature.  (was: Currently, we only have 
Scala/Java API docs for image data source. It would be nice to have some 
documentation in the doc site. So Python/R users can also discover this 
feature.)

> Document libsvm data source in doc site
> ---
>
> Key: SPARK-25584
> URL: https://issues.apache.org/jira/browse/SPARK-25584
> Project: Spark
>  Issue Type: Story
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Currently, we only have Scala/Java API docs for libsvm data source. It would 
> be nice to have some documentation in the doc site. So Python/R users can 
> also discover this feature.
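
A minimal usage sketch that the doc page could start from (the path and numFeatures value are illustrative):

{code:scala}
// Load a LIBSVM-formatted file as a DataFrame with "label" and "features" columns.
val df = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")
df.show(5)
{code}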



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25584) Document libsvm data source in doc site

2018-10-01 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-25584:
-

 Summary: Document libsvm data source in doc site
 Key: SPARK-25584
 URL: https://issues.apache.org/jira/browse/SPARK-25584
 Project: Spark
  Issue Type: Story
  Components: Documentation
Affects Versions: 2.4.0
Reporter: Xiangrui Meng


Currently, we only have Scala/Java API docs for image data source. It would be 
nice to have some documentation in the doc site. So Python/R users can also 
discover this feature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-10-01 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634344#comment-16634344
 ] 

Kazuaki Ishizaki commented on SPARK-25538:
--

This test case does not print {{63}}.

{code}
  test("test2") {
    val df = spark.read.parquet("file:///SPARK-25538-repro")
    val c1 = df.distinct.count
    val c2 = df.sort("col_0").distinct.count
    val c3 = df.withColumnRenamed("col_0", "new").distinct.count
    val c0 = df.count
    print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n")
  }

c1=64, c2=73, c3=64, c0=123
{code}

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Blocker
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25524) Spark datasource for image/libsvm user guide

2018-10-01 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25524.
---
Resolution: Duplicate

> Spark datasource for image/libsvm user guide
> 
>
> Key: SPARK-25524
> URL: https://issues.apache.org/jira/browse/SPARK-25524
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Major
>
> Add Spark datasource for image/libsvm user guide.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4

2018-10-01 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634335#comment-16634335
 ] 

Xiangrui Meng commented on SPARK-25378:
---

I don't think I'm the right person to decide here, because I know little about 
how UTF8String is being used in Spark SQL. As a user, I do want to use 
spark-tensorflow-connector with the upcoming Spark 2.4 release.

I already made the change in the TF connector to use ObjectType: 
https://github.com/tensorflow/ecosystem/pull/100. But they need to wait for the TF 
1.12 release, which might come out in the second half of October. If we don't make 
the final 2.4 release by then, maybe we don't have to fix the 2.4 branch. The risk 
is that other data sources might have similar usage that will break, which we don't 
really know.
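
For anyone hitting the same ClassCastException on 2.4, a minimal workaround sketch (an illustration only, not the connector's fix) is to hand ArrayData the UTF8String values that the StringType accessor expects:

{code:scala}
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// Build the ArrayData from UTF8String values so getUTF8String's cast succeeds.
val data = ArrayData.toArrayData(Array("a", "b").map(UTF8String.fromString))
val strings = data.toArray[UTF8String](StringType).map(_.toString)
{code}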

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> --
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25561) HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql

2018-10-01 Thread Karthik Manamcheri (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634332#comment-16634332
 ] 

Karthik Manamcheri commented on SPARK-25561:


I am working on a patch for this and will post a PR as soon as possible.
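
To sketch the direction (hypothetical helper name, not the actual patch): treat a metastore-side failure of the pushed-down filter as non-fatal and fall back to fetching all partitions, roughly:

{code:scala}
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.metastore.api.MetaException
import org.apache.hadoop.hive.ql.metadata.{Hive, Partition, Table}

// Rough sketch only: if the filter pushdown fails even after Hive's own
// direct-SQL-to-ORM fallback, degrade to client-side pruning over all
// partitions instead of throwing a RuntimeException.
def getPartitionsByFilterWithFallback(
    hive: Hive, table: Table, filter: String): Seq[Partition] = {
  try {
    hive.getPartitionsByFilter(table, filter).asScala.toSeq
  } catch {
    case _: MetaException =>
      hive.getAllPartitionsOf(table).asScala.toSeq
  }
}
{code}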

> HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql
> --
>
> Key: SPARK-25561
> URL: https://issues.apache.org/jira/browse/SPARK-25561
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Karthik Manamcheri
>Priority: Major
>
> In HiveShim.scala, the current behavior is that if 
> hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter 
> call to succeed, and if it fails we throw a RuntimeException.
> However, this might not always be the case. Hive's direct SQL functionality is 
> best-effort, meaning it will fall back to ORM if direct SQL fails. Spark should 
> handle that exception correctly if Hive falls back to ORM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25582) Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java library

2018-10-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634290#comment-16634290
 ] 

Apache Spark commented on SPARK-25582:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22602

> Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java 
> library
> -
>
> Key: SPARK-25582
> URL: https://issues.apache.org/jira/browse/SPARK-25582
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> I have noticed an error that appears in the Spark logs when using the Spark 
> SQL library in a Java 8 project.
> When I run the code below with the attached files as input, I can see the 
> ERROR below in the application logs.
> I am using the *org.apache.spark:spark-sql_2.11:2.2.0* library in my Java 
> project.
> Note that the same logic implemented with the Python API (pyspark) doesn't 
> produce any Exception like this.
> *Code*
> {code:java}
> SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
> SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
> Dataset<Row> df_a = sparkSession.read().option("header", 
> true).csv("local/fileA.csv").dropDuplicates();
> Dataset<Row> df_b = sparkSession.read().option("header", 
> true).csv("local/fileB.csv").dropDuplicates();
> Dataset<Row> df_c = sparkSession.read().option("header", 
> true).csv("local/fileC.csv").dropDuplicates();
> String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", 
> "colF"};
> String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
> Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
> Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, 
> arrayToSeq(key_join_2), "left");
> df_inventory_2.show();
> {code}
> *Error message*
> {code:java}
> 18/10/01 09:58:07 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:393)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:385)
>     at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1286)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:385)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1285)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:825)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:411)
>     at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:212)
>     at 
> 

[jira] [Assigned] (SPARK-25582) Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java library

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25582:


Assignee: (was: Apache Spark)

> Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java 
> library
> -
>
> Key: SPARK-25582
> URL: https://issues.apache.org/jira/browse/SPARK-25582
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> I have noticed an error that appears in the Spark logs when using the Spark 
> SQL library in a Java 8 project.
> When I run the code below with the attached files as input, I can see the 
> ERROR below in the application logs.
> I am using the *org.apache.spark:spark-sql_2.11:2.2.0* library in my Java 
> project.
> Note that the same logic implemented with the Python API (pyspark) doesn't 
> produce any Exception like this.
> *Code*
> {code:java}
> SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
> SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
> Dataset<Row> df_a = sparkSession.read().option("header", 
> true).csv("local/fileA.csv").dropDuplicates();
> Dataset<Row> df_b = sparkSession.read().option("header", 
> true).csv("local/fileB.csv").dropDuplicates();
> Dataset<Row> df_c = sparkSession.read().option("header", 
> true).csv("local/fileC.csv").dropDuplicates();
> String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", 
> "colF"};
> String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
> Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
> Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, 
> arrayToSeq(key_join_2), "left");
> df_inventory_2.show();
> {code}
> *Error message*
> {code:java}
> 18/10/01 09:58:07 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:393)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:385)
>     at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1286)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:385)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1285)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:825)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:411)
>     at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:390)
>     

[jira] [Assigned] (SPARK-25582) Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java library

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25582:


Assignee: Apache Spark

> Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java 
> library
> -
>
> Key: SPARK-25582
> URL: https://issues.apache.org/jira/browse/SPARK-25582
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Thomas Brugiere
>Assignee: Apache Spark
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> I have noticed an error that appears in the Spark logs when using the Spark 
> SQL library in a Java 8 project.
> When I run the code below with the attached files as input, I can see the 
> ERROR below in the application logs.
> I am using the *org.apache.spark:spark-sql_2.11:2.2.0* library in my Java 
> project.
> Note that the same logic implemented with the Python API (pyspark) doesn't 
> produce any Exception like this.
> *Code*
> {code:java}
> SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
> SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
> Dataset<Row> df_a = sparkSession.read().option("header", 
> true).csv("local/fileA.csv").dropDuplicates();
> Dataset<Row> df_b = sparkSession.read().option("header", 
> true).csv("local/fileB.csv").dropDuplicates();
> Dataset<Row> df_c = sparkSession.read().option("header", 
> true).csv("local/fileC.csv").dropDuplicates();
> String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", 
> "colF"};
> String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
> Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
> Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, 
> arrayToSeq(key_join_2), "left");
> df_inventory_2.show();
> {code}
> *Error message*
> {code:java}
> 18/10/01 09:58:07 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:393)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:385)
>     at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1286)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:385)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1285)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:825)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:411)
>     at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:212)
>     at 
> 

[jira] [Commented] (SPARK-25582) Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java library

2018-10-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634284#comment-16634284
 ] 

Apache Spark commented on SPARK-25582:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22602

> Error in Spark logs when using the org.apache.spark:spark-sql_2.11:2.2.0 Java 
> library
> -
>
> Key: SPARK-25582
> URL: https://issues.apache.org/jira/browse/SPARK-25582
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> I have noticed an error that appears in the Spark logs when using the Spark 
> SQL library in a Java 8 project.
> When I run the code below with the attached files as input, I can see the 
> ERROR below in the application logs.
> I am using the *org.apache.spark:spark-sql_2.11:2.2.0* library in my Java 
> project.
> Note that the same logic implemented with the Python API (pyspark) doesn't 
> produce any Exception like this.
> *Code*
> {code:java}
> SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
> SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
> Dataset<Row> df_a = sparkSession.read().option("header", 
> true).csv("local/fileA.csv").dropDuplicates();
> Dataset<Row> df_b = sparkSession.read().option("header", 
> true).csv("local/fileB.csv").dropDuplicates();
> Dataset<Row> df_c = sparkSession.read().option("header", 
> true).csv("local/fileC.csv").dropDuplicates();
> String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", 
> "colF"};
> String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
> Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
> Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, 
> arrayToSeq(key_join_2), "left");
> df_inventory_2.show();
> {code}
> *Error message*
> {code:java}
> 18/10/01 09:58:07 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:393)
>     at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:385)
>     at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1286)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:385)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1285)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:825)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:411)
>     at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:212)
>     at 
> 

[jira] [Commented] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-10-01 Thread Andrew Crosby (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634268#comment-16634268
 ] 

Andrew Crosby commented on SPARK-25544:
---

SPARK-23537 contains what might be another occurrence of this issue. The model 
in that case contains only binary features, so standardization shouldn't really 
be used. However, turning standardization off causes the model to take 4992 
iterations to converge as opposed to 37 iterations when standardization is 
turned on.

> Slow/failed convergence in Spark ML models due to internal predictor scaling
> 
>
> Key: SPARK-25544
> URL: https://issues.apache.org/jira/browse/SPARK-25544
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.2
> Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11
>Reporter: Andrew Crosby
>Priority: Major
>
> The LinearRegression and LogisticRegression estimators in Spark ML can take a 
> large number of iterations to converge, or fail to converge altogether, when 
> trained using the l-bfgs method with standardization turned off.
> *Details:*
> LinearRegression and LogisticRegression standardize their input features by 
> default. In SPARK-8522 the option to disable standardization was added. This 
> is implemented internally by changing the effective strength of 
> regularization rather than disabling the feature scaling. Mathematically, 
> both changing the effective regularization strength and disabling feature 
> scaling should give the same solution, but they can have very different 
> convergence properties.
> The usual justification given for scaling features is that it ensures that all 
> covariances are O(1) and should improve numerical convergence, but this 
> argument does not account for the regularization term. This doesn't cause any 
> issues if standardization is set to true, since all features will have an 
> O(1) regularization strength. But it does cause issues when standardization 
> is set to false, since the effective regularization strength of feature i is 
> now O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. 
> This means that predictors with small standard deviations (which can occur 
> legitimately, e.g. via one-hot encoding) will have very large effective 
> regularization strengths and consequently lead to very large gradients and 
> thus poor convergence in the solver.
> *Example code to recreate:*
> To demonstrate just how bad these convergence issues can be, here is a very 
> simple test case which builds a linear regression model with a categorical 
> feature, a numerical feature and their interaction. When fed the specified 
> training data, this model will fail to converge before it hits the maximum 
> iteration limit. In this case, it is the interaction between category "2" and 
> the numeric feature that leads to a feature with a small standard deviation.
> Training data:
> ||category||numericFeature||label||
> |1|1.0|0.5|
> |1|0.5|1.0|
> |2|0.01|2.0|
>  
> {code:java}
> val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 
> 2.0)).toDF("category", "numericFeature", "label")
> val indexer = new StringIndexer().setInputCol("category") 
> .setOutputCol("categoryIndex")
> val encoder = new 
> OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
> val interaction = new Interaction().setInputCols(Array("categoryEncoded", 
> "numericFeature")).setOutputCol("interaction")
> val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", 
> "interaction")).setOutputCol("features")
> val model = new 
> LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
> val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, 
> assembler, model))
> val pipelineModel  = pipeline.fit(df)
> val numIterations = 
> pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code}
>  *Possible fix:*
> These convergence issues can be fixed by turning off feature scaling when 
> standardization is set to false rather than using an effective regularization 
> strength. This can be hacked into LinearRegression.scala by simply replacing 
> line 423
> {code:java}
> val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
> {code}
> with
> {code:java}
> val featuresStd = if ($(standardization)) 
> featuresSummarizer.variance.toArray.map(math.sqrt) else 
> featuresSummarizer.variance.toArray.map(x => 1.0)
> {code}
> Rerunning the above test code with that hack in place, will lead to 
> convergence after just 4 iterations instead of hitting the max iterations 
> limit!
> *Impact:*
> I 

[jira] [Commented] (SPARK-23537) Logistic Regression without standardization

2018-10-01 Thread Andrew Crosby (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634263#comment-16634263
 ] 

Andrew Crosby commented on SPARK-23537:
---

The different results for standardization=True vs standardization=False are to 
be expected. The reason for the difference is that the two settings lead to 
different effective regularization strengths: with standardization=True, the 
regularization is applied to the scaled model coefficients, whereas with 
standardization=False it is applied to the unscaled model coefficients.

As implemented in Spark, the features actually get scaled regardless of 
whether standardization is set to true or false, but when standardization=False 
the strength of the regularization in the scaled space is adjusted to account 
for this. See the comment at 
[https://github.com/apache/spark/blob/a802c69b130b69a35b372ffe1b01289577f6fafb/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L685].

 

As an aside, your results show a very slow rate of convergence when 
standardization is set to false. I believe this to be an issue caused by the 
continued application of feature scaling when standardization=False which can 
lead to very large gradients from the regularization terms in the solver. I've 
recently raised SPARK-25544 to cover this issue.
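
To make the scaled-vs-unscaled relationship above concrete, here is an editorial sketch (plain Scala, no Spark; the regParam and per-feature standard deviations below are made-up illustrative values, not numbers from this ticket). With internal scaling x' = x/sigma the scaled coefficient is beta' = beta*sigma, so an L2 penalty lambda defined on the unscaled coefficient shows up in the scaled space with strength lambda/sigma^2:

{code:scala}
// Illustrative only: shows why a feature with a tiny standard deviation gets a
// huge effective L2 penalty when regularization targets the unscaled weights.
object EffectiveRegStrength {
  def main(args: Array[String]): Unit = {
    val regParam = 1.0                    // nominal lambda (made-up)
    val featureStd = Seq(1.0, 0.5, 0.01)  // made-up per-feature std devs

    // lambda * beta^2 = (lambda / sigma^2) * beta'^2, with beta' = beta * sigma
    featureStd.foreach { sigma =>
      val effective = regParam / (sigma * sigma)
      println(f"sigma = $sigma%6.2f  =>  effective lambda = $effective%10.2f")
    }
  }
}
{code}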

> Logistic Regression without standardization
> ---
>
> Key: SPARK-23537
> URL: https://issues.apache.org/jira/browse/SPARK-23537
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.0.2, 2.2.1
>Reporter: Jordi
>Priority: Major
> Attachments: non-standardization.log, standardization.log
>
>
> I'm trying to train a Logistic Regression model using Spark 2.2.1. I prefer 
> not to use standardization since all my features are binary, produced with the 
> hashing trick (2^20 sparse vector).
> I trained two models to compare results. I expected to end up with two 
> similar models, since it seems that internally the optimizer performs 
> standardization and "de-standardization" (when it's deactivated) in order to 
> improve convergence.
> Here you have the code I used:
> {code:java}
> val lr = new org.apache.spark.ml.classification.LogisticRegression()
> .setRegParam(0.05)
> .setElasticNetParam(0.0)
> .setFitIntercept(true)
> .setMaxIter(5000)
> .setStandardization(false)
> val model = lr.fit(data)
> {code}
> The results are concerning: I end up with two significantly different models.
> *Standardization:*
> Training time: 8min.
> Iterations: 37
> Intercept: -4.386090107224499
> Max weight: 4.724752299455218
> Min weight: -3.560570478164854
> Mean weight: -0.049325201841722795
> l1 norm: 116710.39522171849
> l2 norm: 402.2581552373957
> Non zero weights: 128084
> Non zero ratio: 0.12215042114257812
> Last 10 LBFGS Val and Grad Norms:
> {code:java}
> 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 
> 0.000559057
> 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 
> 0.000267527
> 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 
> 0.000205888
> 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 
> 0.000144173
> 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 
> 0.000140296
> 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 
> 0.000122709
> 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 
> 3.08789e-05
> 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 
> 2.23806e-05
> 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 
> 1.47422e-05
> 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 
> 2.37442e-05
> {code}
> *No standardization:*
> Training time: 7h 14 min.
> Iterations: 4992
> Intercept: -4.216690468849263
> Max weight: 0.41930559767624725
> Min weight: -0.5949182537565524
> Mean weight: -1.2659769019012E-6
> l1 norm: 14.262025330648694
> l2 norm: 1.2508777025612263
> Non zero weights: 128955
> Non zero ratio: 0.12298107147216797
> Last 10 LBFGS Val and Grad Norms:
> {code:java}
> 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 
> 0.217581
> 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 
> 0.185812
> 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 
> 0.214570
> 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 
> 0.489464
> 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 
> 0.178448
> 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 
> 0.172527
> 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 

[jira] [Assigned] (SPARK-18364) Expose metrics for YarnShuffleService

2018-10-01 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-18364:
-

Assignee: Marek Simunek

> Expose metrics for YarnShuffleService
> -
>
> Key: SPARK-18364
> URL: https://issues.apache.org/jira/browse/SPARK-18364
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Steven Rand
>Assignee: Marek Simunek
>Priority: Major
> Fix For: 2.5.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> ExternalShuffleService exposes metrics as of SPARK-16405. However, 
> YarnShuffleService does not.
> The work of instrumenting ExternalShuffleBlockHandler was already done in 
> SPARK-16405, so this JIRA is for creating a MetricsSystem in 
> YarnShuffleService similarly to how ExternalShuffleService already does it.
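
As an editorial illustration of the general approach (not the actual patch), here is a hedged sketch using Dropwizard metrics, which the SPARK-16405 instrumentation is built on. The ShuffleServiceMetricsBridge name and the JMX-reporter wiring are assumptions of this sketch; the real change would integrate with the NodeManager/Spark metrics configuration rather than hard-code a reporter.

{code:scala}
// Hypothetical sketch: expose a handler's existing MetricSet through a
// Dropwizard registry and a JMX reporter; illustration of the wiring only.
import com.codahale.metrics.{JmxReporter, MetricRegistry, MetricSet}

class ShuffleServiceMetricsBridge(allMetrics: MetricSet) {
  // Collect the metrics the block handler already exposes (SPARK-16405).
  private val registry = new MetricRegistry()
  registry.registerAll(allMetrics)

  // Publish them over JMX; a real integration would use configured sinks.
  private val reporter = JmxReporter.forRegistry(registry)
    .inDomain("sparkShuffleService")
    .build()

  def start(): Unit = reporter.start()
  def stop(): Unit = reporter.stop()
}
{code}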



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18364) Expose metrics for YarnShuffleService

2018-10-01 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-18364.
---
   Resolution: Fixed
Fix Version/s: 2.5.0

> Expose metrics for YarnShuffleService
> -
>
> Key: SPARK-18364
> URL: https://issues.apache.org/jira/browse/SPARK-18364
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Steven Rand
>Priority: Major
> Fix For: 2.5.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> ExternalShuffleService exposes metrics as of SPARK-16405. However, 
> YarnShuffleService does not.
> The work of instrumenting ExternalShuffleBlockHandler was already done in 
> SPARK-16405, so this JIRA is for creating a MetricsSystem in 
> YarnShuffleService similarly to how ExternalShuffleService already does it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25583) Add newly added History server related configurations in the documentation

2018-10-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634204#comment-16634204
 ] 

Apache Spark commented on SPARK-25583:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/22601

> Add newly added History server related configurations in the documentation
> --
>
> Key: SPARK-25583
> URL: https://issues.apache.org/jira/browse/SPARK-25583
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: shahid
>Priority: Trivial
>
> Some of the history server related configurations are missing in the 
> documentation.
> For example: 'spark.history.store.maxDiskUsage', 'spark.ui.liveUpdate.period', etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25583) Add newly added History server related configurations in the documentation

2018-10-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634202#comment-16634202
 ] 

Apache Spark commented on SPARK-25583:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/22601

> Add newly added History server related configurations in the documentation
> --
>
> Key: SPARK-25583
> URL: https://issues.apache.org/jira/browse/SPARK-25583
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: shahid
>Priority: Trivial
>
> Some of the history server related configurations are missing in the 
> documentation.
> For example: 'spark.history.store.maxDiskUsage', 'spark.ui.liveUpdate.period', etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25583) Add newly added History server related configurations in the documentation

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25583:


Assignee: (was: Apache Spark)

> Add newly added History server related configurations in the documentation
> --
>
> Key: SPARK-25583
> URL: https://issues.apache.org/jira/browse/SPARK-25583
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: shahid
>Priority: Trivial
>
> Some of the history server related configurations are missing in the 
> documentation.
> For example: 'spark.history.store.maxDiskUsage', 'spark.ui.liveUpdate.period', etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25583) Add newly added History server related configurations in the documentation

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25583:


Assignee: Apache Spark

> Add newly added History server related configurations in the documentation
> --
>
> Key: SPARK-25583
> URL: https://issues.apache.org/jira/browse/SPARK-25583
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.3.2
>Reporter: shahid
>Assignee: Apache Spark
>Priority: Trivial
>
> Some of the history server related configurations are missing in the 
> documentation.
> For example: 'spark.history.store.maxDiskUsage', 'spark.ui.liveUpdate.period', etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25583) Add newly added History server related configurations in the documentation

2018-10-01 Thread shahid (JIRA)
shahid created SPARK-25583:
--

 Summary: Add newly added History server related configurations in 
the documentation
 Key: SPARK-25583
 URL: https://issues.apache.org/jira/browse/SPARK-25583
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Affects Versions: 2.3.2, 2.3.1
Reporter: shahid


Some of the history server related configurations are missing in the 
documentation.

For example: 'spark.history.store.maxDiskUsage', 'spark.ui.liveUpdate.period', etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25538) incorrect row counts after distinct()

2018-10-01 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-25538:
--
Priority: Blocker  (was: Major)

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Blocker
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25578) Update to Scala 2.12.7

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25578:


Assignee: (was: Apache Spark)

> Update to Scala 2.12.7
> --
>
> Key: SPARK-25578
> URL: https://issues.apache.org/jira/browse/SPARK-25578
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Minor
>
> We should use Scala 2.12.7 over 2.12.6 now, to pick up this fix. We ought to 
> be able to back out a workaround in Spark if so.
> [https://github.com/scala/scala/releases/tag/v2.12.7]
> [https://github.com/scala/scala/pull/7156] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25578) Update to Scala 2.12.7

2018-10-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634111#comment-16634111
 ] 

Apache Spark commented on SPARK-25578:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22600

> Update to Scala 2.12.7
> --
>
> Key: SPARK-25578
> URL: https://issues.apache.org/jira/browse/SPARK-25578
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Minor
>
> We should use Scala 2.12.7 over 2.12.6 now, to pick up this fix. We ought to 
> be able to back out a workaround in Spark if so.
> [https://github.com/scala/scala/releases/tag/v2.12.7]
> [https://github.com/scala/scala/pull/7156] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25578) Update to Scala 2.12.7

2018-10-01 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25578:


Assignee: Apache Spark

> Update to Scala 2.12.7
> --
>
> Key: SPARK-25578
> URL: https://issues.apache.org/jira/browse/SPARK-25578
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> We should use Scala 2.12.7 over 2.12.6 now, to pick up this fix. We ought to 
> be able to back out a workaround in Spark if so.
> [https://github.com/scala/scala/releases/tag/v2.12.7]
> [https://github.com/scala/scala/pull/7156] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-10-01 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634106#comment-16634106
 ] 

Marco Gaido commented on SPARK-25538:
-

I was also able to reproduce this using limit instead of sort:
{code}
scala> df.limit(80).distinct.count
res83: Long = 72

scala> df.distinct.count
res84: Long = 64

scala> df.limit(20).distinct.count
res88: Long = 20

scala> df.limit(20).distinct.collect.distinct.length
res89: Int = 17
{code}

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25510) Create new trait replace BenchmarkWithCodegen

2018-10-01 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25510:
-

Assignee: Yuming Wang

>  Create new trait replace BenchmarkWithCodegen
> --
>
> Key: SPARK-25510
> URL: https://issues.apache.org/jira/browse/SPARK-25510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25510) Create a new trait SqlBasedBenchmark

2018-10-01 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25510:
--
Summary:  Create a new trait SqlBasedBenchmark  (was:  Create new trait 
replace BenchmarkWithCodegen)

>  Create a new trait SqlBasedBenchmark
> -
>
> Key: SPARK-25510
> URL: https://issues.apache.org/jira/browse/SPARK-25510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25510) Create new trait replace BenchmarkWithCodegen

2018-10-01 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25510.
---
   Resolution: Fixed
Fix Version/s: 2.5.0

Issue resolved by pull request 22484
[https://github.com/apache/spark/pull/22484]

>  Create new trait replace BenchmarkWithCodegen
> --
>
> Key: SPARK-25510
> URL: https://issues.apache.org/jira/browse/SPARK-25510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25476) Refactor AggregateBenchmark to use main method

2018-10-01 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25476.
---
   Resolution: Fixed
Fix Version/s: 2.5.0

Issue resolved by pull request 22484
[https://github.com/apache/spark/pull/22484]

> Refactor AggregateBenchmark to use main method
> --
>
> Key: SPARK-25476
> URL: https://issues.apache.org/jira/browse/SPARK-25476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25476) Refactor AggregateBenchmark to use main method

2018-10-01 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25476:
-

Assignee: Yuming Wang

> Refactor AggregateBenchmark to use main method
> --
>
> Key: SPARK-25476
> URL: https://issues.apache.org/jira/browse/SPARK-25476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.5.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.5.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


