[jira] [Updated] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28264: - Labels: release-notes (was: )

> Revisiting Python / pandas UDF
> --
>
> Key: SPARK-28264
> URL: https://issues.apache.org/jira/browse/SPARK-28264
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Reynold Xin
> Assignee: Hyukjin Kwon
> Priority: Blocker
> Labels: release-notes
> Fix For: 3.0.0
>
> In the past two years, pandas UDFs have perhaps been the most important change to Spark for Python data science. However, these functionalities have evolved organically, leading to inconsistencies and confusion among users. This document revisits UDF definition and naming, as a result of discussions among Xiangrui, Li Jin, Hyukjin, and Reynold.
> -See document here: [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]-
> New proposal: https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31948) expose mapSideCombine in aggByKey/reduceByKey/foldByKey
[ https://issues.apache.org/jira/browse/SPARK-31948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132930#comment-17132930 ] zhengruifeng commented on SPARK-31948: -- [~viirya] [~srowen] I will test whether there is a performance gain; it may benefit aggregation on a partially aggregated dataset. In ML impls, like RobustScaler above, there is nothing to combine on the map side, since each key is distinct within a partition. The same situation exists in impls like KMeans, BiKMeans, GMM, etc.

> expose mapSideCombine in aggByKey/reduceByKey/foldByKey
> ---
>
> Key: SPARK-31948
> URL: https://issues.apache.org/jira/browse/SPARK-31948
> Project: Spark
> Issue Type: Improvement
> Components: ML, Spark Core
> Affects Versions: 3.1.0
> Reporter: zhengruifeng
> Priority: Minor
>
> 1. {{aggregateByKey}}, {{reduceByKey}} and {{foldByKey}} will always perform {{mapSideCombine}};
> However, this can be skipped sometimes, especially in ML (RobustScaler):
> {code:java}
> vectors.mapPartitions { iter =>
>   if (iter.hasNext) {
>     val summaries = Array.fill(numFeatures)(
>       new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, relativeError))
>     while (iter.hasNext) {
>       val vec = iter.next
>       vec.foreach { (i, v) => if (!v.isNaN) summaries(i) = summaries(i).insert(v) }
>     }
>     Iterator.tabulate(numFeatures)(i => (i, summaries(i).compress))
>   } else Iterator.empty
> }.reduceByKey { case (s1, s2) => s1.merge(s2) }
> {code}
>
> This {{reduceByKey}} in {{RobustScaler}} does not need {{mapSideCombine}} at all; similar places exist in {{KMeans}}, {{GMM}}, etc.
> To my knowledge, we do not need {{mapSideCombine}} if the reduction factor isn't high.
>
> 2. {{treeAggregate}} and {{treeReduce}} are based on {{foldByKey}}; the {{mapSideCombine}} in the first call of {{foldByKey}} can also be avoided.
>
> SPARK-772:
> {quote}
> Map side combine in group by key case does not reduce the amount of data shuffled. Instead, it forces a lot more objects to go into old gen, and leads to worse GC.
> {quote}
>
> So what about:
> 1. exposing mapSideCombine in {{aggByKey}}/{{reduceByKey}}/{{foldByKey}}, so that users can disable unnecessary mapSideCombine
> 2. disabling the {{mapSideCombine}} in the first call of {{foldByKey}} in {{treeAggregate}} and {{treeReduce}}
> 3. disabling the unnecessary {{mapSideCombine}} in ML
> Friendly ping [~srowen] [~huaxingao] [~weichenxu123] [~hyukjin.kwon] [~viirya]
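The point about distinct keys can be made concrete with a minimal pure-Python sketch (not Spark API): a reduceByKey-like shuffle where map-side combining only shrinks the shuffle when a partition holds repeated keys. When every key is already distinct within each partition, as in RobustScaler's per-partition summaries, combining saves nothing and only adds work.

```python
from collections import defaultdict

def shuffle_size(partitions, map_side_combine):
    """Count the records that cross the shuffle boundary for a
    reduceByKey-like operation over a list of partitions of (key, value)."""
    shuffled = 0
    for part in partitions:
        if map_side_combine:
            combined = defaultdict(int)
            for k, v in part:
                combined[k] += v          # pre-aggregate per key on the map side
            shuffled += len(combined)     # one record per distinct key
        else:
            shuffled += len(part)         # every record is shuffled as-is
    return shuffled

# Repeated keys per partition: combining cuts the shuffle from 6 to 2 records.
skewed = [[("a", 1)] * 3, [("a", 1)] * 3]
assert shuffle_size(skewed, map_side_combine=True) == 2
assert shuffle_size(skewed, map_side_combine=False) == 6

# Keys already distinct within each partition (the RobustScaler case):
# combining shuffles exactly as many records, so it is pure overhead.
distinct = [[(0, 1), (1, 1), (2, 1)], [(0, 1), (1, 1), (2, 1)]]
assert shuffle_size(distinct, map_side_combine=True) == 6
assert shuffle_size(distinct, map_side_combine=False) == 6
```

In Spark itself, `combineByKeyWithClassTag` already exposes a `mapSideCombine` flag, while `reduceByKey`/`aggregateByKey`/`foldByKey` hard-code it to true; exposing that flag is what this ticket proposes.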
[jira] [Resolved] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31966. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28798 [https://github.com/apache/spark/pull/28798] > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-31966 > URL: https://issues.apache.org/jira/browse/SPARK-31966 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
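The traceback above comes from a polling helper, `_eventually`, that retries a condition until a timeout. A standalone sketch of that shape (the helper name and message format are taken from the traceback; the body here is illustrative, not PySpark's actual code):

```python
import time

def eventually(condition, timeout=30.0, interval=0.1):
    """Poll `condition` until it returns True or the timeout elapses,
    then raise with the last value the condition returned."""
    deadline = time.monotonic() + timeout
    last = None
    while time.monotonic() < deadline:
        last = condition()
        if last is True:
            return
        time.sleep(interval)
    raise AssertionError(
        "Test failed due to timeout after %g sec, with last condition "
        "returning: %s" % (timeout, last))

# A condition that succeeds on the third poll:
state = {"calls": 0}
def improves_enough():
    state["calls"] += 1
    return state["calls"] >= 3

eventually(improves_enough, timeout=5.0, interval=0.01)
assert state["calls"] == 3
```

A fixed wall-clock timeout like this is exactly why the test flakes under heavy Jenkins load: the model may still be improving, just not within 30 seconds.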
[jira] [Updated] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
[ https://issues.apache.org/jira/browse/SPARK-31964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31964: -- Issue Type: Improvement (was: Bug)

> Avoid Pandas import for CategoricalDtype with Arrow conversion
> --
>
> Key: SPARK-31964
> URL: https://issues.apache.org/jira/browse/SPARK-31964
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.1.0
> Reporter: Bryan Cutler
> Assignee: Bryan Cutler
> Priority: Minor
> Fix For: 3.1.0
>
> The import location for CategoricalDtype changed in Pandas between 0.23 and 1.0, and PySpark currently checks two places for the import. It would be better to check the type as a string and avoid any imports.
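A sketch of the string-based check described above, assuming the idea is to compare the type's name rather than import `CategoricalDtype` from either pandas location (the stand-in class below is only for illustration; a real pandas `CategoricalDtype` instance would match the same way):

```python
def is_categorical_dtype(dtype):
    """Detect pandas' CategoricalDtype without importing pandas.

    Comparing the type name as a string avoids depending on where
    CategoricalDtype lives, which moved between pandas 0.23 and 1.0.
    """
    return type(dtype).__name__ == "CategoricalDtype"

# Hypothetical stand-in so the sketch runs without pandas installed.
class CategoricalDtype:
    pass

assert is_categorical_dtype(CategoricalDtype())
assert not is_categorical_dtype("category")
assert not is_categorical_dtype(object())
```

The trade-off of a name check is that any class named `CategoricalDtype` matches, but for Arrow conversion paths that only ever see pandas dtypes this is harmless and removes the import entirely.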
[jira] [Assigned] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31965: - Assignee: Hyukjin Kwon

> Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
> --
>
> Key: SPARK-31965
> URL: https://issues.apache.org/jira/browse/SPARK-31965
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Tests
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
>
> If you do a plain package with sbt:
> {code}
> ./build/sbt -DskipTests -Phive-thriftserver clean package
> ./run-tests --python-executable=python3 --testname="pyspark.sql.udf UserDefinedFunction"
> {code}
> The doctests in registerJavaFunction and registerJavaUDAF fail because they require classes from the test compilation.
> We should skip them conditionally.
[jira] [Resolved] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31965. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28795 [https://github.com/apache/spark/pull/28795]

> Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
> --
>
> Key: SPARK-31965
> URL: https://issues.apache.org/jira/browse/SPARK-31965
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Tests
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
> Fix For: 3.0.0
>
> If you do a plain package with sbt:
> {code}
> ./build/sbt -DskipTests -Phive-thriftserver clean package
> ./run-tests --python-executable=python3 --testname="pyspark.sql.udf UserDefinedFunction"
> {code}
> The doctests in registerJavaFunction and registerJavaUDAF fail because they require classes from the test compilation.
> We should skip them conditionally.
[jira] [Commented] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132900#comment-17132900 ] Apache Spark commented on SPARK-31966: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28798 > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-31966 > URL: https://issues.apache.org/jira/browse/SPARK-31966 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31966: Assignee: Hyukjin Kwon (was: Apache Spark) > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-31966 > URL: https://issues.apache.org/jira/browse/SPARK-31966 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31966: Assignee: Apache Spark (was: Hyukjin Kwon) > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-31966 > URL: https://issues.apache.org/jira/browse/SPARK-31966 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31966: - Fix Version/s: (was: 3.0.0) > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-31966 > URL: https://issues.apache.org/jira/browse/SPARK-31966 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31966: - Description: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ {code} == FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) Test that the model improves on toy data with no. of batches -- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 367, in test_training_and_prediction self._eventually(condition) File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 78, in _eventually % (timeout, lastValue)) AssertionError: Test failed due to timeout after 30 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 -- Ran 13 tests in 185.051s FAILED (failures=1, skipped=1) {code} was: Looks this test is flaky https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console {code} == FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) Test that the model improves on toy data with no. 
of batches -- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 367, in test_training_and_prediction self._eventually(condition) File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 78, in _eventually % (timeout, lastValue)) AssertionError: Test failed due to timeout after 30 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 -- Ran 13 tests in 185.051s FAILED (failures=1, skipped=1) {code} This looks happening after increasing the parallelism in Jenkins to speed up. I am able to reproduce this manually when the resource usage is heavy with manual decrease of timeout. > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-31966 > URL: https://issues.apache.org/jira/browse/SPARK-31966 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. 
of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, >
[jira] [Created] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
Hyukjin Kwon created SPARK-31966: Summary: Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction Key: SPARK-31966 URL: https://issues.apache.org/jira/browse/SPARK-31966 Project: Spark Issue Type: Test Components: MLlib, PySpark Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon Fix For: 3.0.0

This test looks flaky:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console

{code}
==
FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
Test that the model improves on toy data with no. of batches
--
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 367, in test_training_and_prediction
    self._eventually(condition)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 78, in _eventually
    % (timeout, lastValue))
AssertionError: Test failed due to timeout after 30 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74
--
Ran 13 tests in 185.051s

FAILED (failures=1, skipped=1)
{code}

This seems to happen after the parallelism in Jenkins was increased to speed up builds. I can reproduce it manually under heavy resource usage by manually decreasing the timeout.
[jira] [Updated] (SPARK-31915) Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-31915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31915: - Fix Version/s: 3.0.1 > Resolve the grouping column properly per the case sensitivity in grouped and > cogrouped pandas UDFs > -- > > Key: SPARK-31915 > URL: https://issues.apache.org/jira/browse/SPARK-31915 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > {code} > from pyspark.sql.functions import * > df = spark.createDataFrame([[1, 1]], ["column", "Score"]) > @pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP) > def my_pandas_udf(pdf): > return pdf.assign(Score=0.5) > df.groupby('COLUMN').apply(my_pandas_udf).show() > {code} > {code} > pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could > be: COLUMN, COLUMN.; > {code} > {code} > df1 = spark.createDataFrame([(1, 1)], ("column", "value")) > df2 = spark.createDataFrame([(1, 1)], ("column", "value")) > df1.groupby("COLUMN").cogroup( > df2.groupby("COLUMN") > ).applyInPandas(lambda r, l: r + l, df1.schema).show() > {code} > {code} > pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input > columns: [COLUMN, COLUMN, value, value];; > 'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], (column#9L, value#10L, > column#13L, value#14L), [column#22L, value#23L] > :- Project [COLUMN#9L, column#9L, value#10L] > : +- LogicalRDD [column#9L, value#10L], false > +- Project [COLUMN#13L, column#13L, value#14L] >+- LogicalRDD [column#13L, value#14L], false > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
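The two failures above share one root cause: the grouping key is resolved without honoring the session's case sensitivity, so `'COLUMN'` matches both the original column and the key the plan adds. A small pure-Python sketch of the intended resolution semantics (not Spark's analyzer code; names are illustrative):

```python
def resolve_column(name, columns, case_sensitive=False):
    """Resolve `name` against available `columns` the way an analyzer might.

    Under case-insensitive resolution, more than one candidate makes the
    reference ambiguous -- the failure mode in the ticket, where the
    grouping key and the data column both matched 'COLUMN'.
    """
    if case_sensitive:
        matches = [c for c in columns if c == name]
    else:
        matches = [c for c in columns if c.lower() == name.lower()]
    if not matches:
        raise ValueError("cannot resolve %r given input columns: %s" % (name, columns))
    if len(matches) > 1:
        raise ValueError("Reference %r is ambiguous, could be: %s" % (name, matches))
    return matches[0]

# Normal case: one case-insensitive match resolves cleanly.
assert resolve_column("COLUMN", ["column", "Score"]) == "column"

# The buggy plan effectively offered the same column twice.
try:
    resolve_column("COLUMN", ["COLUMN", "column", "value"])
    raise RuntimeError("expected ambiguity error")
except ValueError as e:
    assert "ambiguous" in str(e)
```

The fix is to resolve the grouping column once, against the child's output, using the configured case sensitivity, instead of re-adding a duplicate column to resolve against.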
[jira] [Commented] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132890#comment-17132890 ] Apache Spark commented on SPARK-31926: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/28797

> Fix concurrency issue for ThriftCLIService to getPortNumber
> ---
>
> Key: SPARK-31926
> URL: https://issues.apache.org/jira/browse/SPARK-31926
> Project: Spark
> Issue Type: Test
> Components: SQL
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Kent Yao
> Priority: Major
>
> When org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext is called, it starts ThriftCLIService in the background on a new thread. If we call ThriftCLIService.getPortNumber at the same time, we might not get the bound port if it is configured as 0.
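The race described above is the classic "read the port before the background thread has bound the socket" problem. A minimal Python sketch of one standard fix, waiting on an event that the server thread sets only after binding (this is illustrative; ThriftCLIService's actual code is Java and may solve it differently):

```python
import socket
import threading

class Server:
    """Bind a socket on a background thread; expose the port only after bind."""

    def __init__(self):
        self._bound = threading.Event()
        self.port = None

    def _run(self):
        sock = socket.socket()
        sock.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        self.port = sock.getsockname()[1]
        self._bound.set()             # signal that self.port is now valid
        sock.close()                  # a real server would serve here instead

    def start(self):
        threading.Thread(target=self._run, daemon=True).start()

    def get_port_number(self, timeout=5.0):
        # Without this wait, a caller racing the bind could read port None.
        if not self._bound.wait(timeout):
            raise TimeoutError("server did not bind in time")
        return self.port

s = Server()
s.start()
assert s.get_port_number() > 0
```

Reading `self.port` directly right after `start()` would intermittently return `None`, which is the same shape of flakiness this ticket fixes.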
[jira] [Assigned] (SPARK-31942) Revert SPARK-31864 Adjust AQE skew join trigger condition
[ https://issues.apache.org/jira/browse/SPARK-31942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31942: --- Assignee: Manu Zhang

> Revert SPARK-31864 Adjust AQE skew join trigger condition
> -
>
> Key: SPARK-31942
> URL: https://issues.apache.org/jira/browse/SPARK-31942
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Manu Zhang
> Assignee: Manu Zhang
> Priority: Minor
>
> As discussed in [https://github.com/apache/spark/pull/28669#issuecomment-641044531], revert SPARK-31864 so that the skew join optimization works for extremely clustered keys.
[jira] [Resolved] (SPARK-31942) Revert SPARK-31864 Adjust AQE skew join trigger condition
[ https://issues.apache.org/jira/browse/SPARK-31942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31942. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28770 [https://github.com/apache/spark/pull/28770]

> Revert SPARK-31864 Adjust AQE skew join trigger condition
> -
>
> Key: SPARK-31942
> URL: https://issues.apache.org/jira/browse/SPARK-31942
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Manu Zhang
> Assignee: Manu Zhang
> Priority: Minor
> Fix For: 3.1.0
>
> As discussed in [https://github.com/apache/spark/pull/28669#issuecomment-641044531], revert SPARK-31864 so that the skew join optimization works for extremely clustered keys.
[jira] [Assigned] (SPARK-31939) Fix Parsing day of year when year field pattern is missing
[ https://issues.apache.org/jira/browse/SPARK-31939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31939: --- Assignee: Kent Yao

> Fix Parsing day of year when year field pattern is missing
> --
>
> Key: SPARK-31939
> URL: https://issues.apache.org/jira/browse/SPARK-31939
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Minor
>
> If a datetime pattern contains no year field, the day-of-year field should not be ignored if it exists.
[jira] [Resolved] (SPARK-31939) Fix Parsing day of year when year field pattern is missing
[ https://issues.apache.org/jira/browse/SPARK-31939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31939. - Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Issue resolved by pull request 28766 [https://github.com/apache/spark/pull/28766] > Fix Parsing day of year when year field pattern is missing > -- > > Key: SPARK-31939 > URL: https://issues.apache.org/jira/browse/SPARK-31939 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > If a datetime pattern contains no year field, the day of year field should > not be ignored if it exists -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31935) Hadoop file system config should be effective in data source options
[ https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132882#comment-17132882 ] Apache Spark commented on SPARK-31935: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/28796 > Hadoop file system config should be effective in data source options > - > > Key: SPARK-31935 > URL: https://issues.apache.org/jira/browse/SPARK-31935 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1, 3.1.0, 2.4.7 >Reporter: Gengliang Wang >Priority: Major > > Data source options should be propagated into the hadoop configuration of > method `checkAndGlobPathIfNecessary` > From org.apache.hadoop.fs.FileSystem.java: > {code:java} > public static FileSystem get(URI uri, Configuration conf) throws > IOException { > String scheme = uri.getScheme(); > String authority = uri.getAuthority(); > if (scheme == null && authority == null) { // use default FS > return get(conf); > } > if (scheme != null && authority == null) { // no authority > URI defaultUri = getDefaultUri(conf); > if (scheme.equals(defaultUri.getScheme())// if scheme matches > default > && defaultUri.getAuthority() != null) { // & default has authority > return get(defaultUri, conf); // return default > } > } > > String disableCacheName = String.format("fs.%s.impl.disable.cache", > scheme); > if (conf.getBoolean(disableCacheName, false)) { > return createFileSystem(uri, conf); > } > return CACHE.get(uri, conf); > } > {code} > With this, we can specify URI schema and authority related configurations for > scanning file systems. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
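The fix proposed above amounts to overlaying the data source options onto the Hadoop configuration before `checkAndGlobPathIfNecessary` resolves paths, so that per-read settings (e.g. an `fs.s3a.*` key) take effect for that scan only. A minimal sketch of that merge, using a plain dict in place of Hadoop's `Configuration` (the helper name `effective_hadoop_conf` is illustrative, not Spark's actual code):

```python
def effective_hadoop_conf(base_conf, ds_options):
    """Overlay data source options onto a copy of the session-wide
    Hadoop configuration; the copy keeps the per-read settings from
    leaking into other scans."""
    conf = dict(base_conf)      # never mutate the shared config
    conf.update(ds_options)     # data source options win on conflict
    return conf

base = {"fs.defaultFS": "hdfs://nn:8020",
        "fs.s3a.endpoint": "s3.amazonaws.com"}
opts = {"fs.s3a.endpoint": "s3.eu-west-1.amazonaws.com"}
merged = effective_hadoop_conf(base, opts)
```

The resolved `FileSystem` for the scan would then be built from `merged`, while `base` remains untouched for every other reader.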
[jira] [Commented] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132878#comment-17132878 ] Apache Spark commented on SPARK-31965: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28795 > Run the tests in registerJavaFunction and registerJavaUDAF only when test > classes are compiled > -- > > Key: SPARK-31965 > URL: https://issues.apache.org/jira/browse/SPARK-31965 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.6, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > If you do a plain package with sbt: > {code} > ./build/sbt -DskipTests -Phive-thriftserver clean package > ./run-tests --python-executable=python3 --testname="pyspark.sql.udf > UserDefinedFunction" > {code} > The doctests in registerJavaFunction and registerJavaUDAF fail because they > require some classes from the test compilation. > We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132877#comment-17132877 ] Apache Spark commented on SPARK-31965: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28795 > Run the tests in registerJavaFunction and registerJavaUDAF only when test > classes are compiled > -- > > Key: SPARK-31965 > URL: https://issues.apache.org/jira/browse/SPARK-31965 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.6, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > If you do a plain package with sbt: > {code} > ./build/sbt -DskipTests -Phive-thriftserver clean package > ./run-tests --python-executable=python3 --testname="pyspark.sql.udf > UserDefinedFunction" > {code} > The doctests in registerJavaFunction and registerJavaUDAF fail because they > require some classes from the test compilation. > We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31965: Assignee: (was: Apache Spark) > Run the tests in registerJavaFunction and registerJavaUDAF only when test > classes are compiled > -- > > Key: SPARK-31965 > URL: https://issues.apache.org/jira/browse/SPARK-31965 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.6, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > If you do a plain package with sbt: > {code} > ./build/sbt -DskipTests -Phive-thriftserver clean package > ./run-tests --python-executable=python3 --testname="pyspark.sql.udf > UserDefinedFunction" > {code} > The doctests in registerJavaFunction and registerJavaUDAF fail because they > require some classes from the test compilation. > We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31965: Assignee: Apache Spark > Run the tests in registerJavaFunction and registerJavaUDAF only when test > classes are compiled > -- > > Key: SPARK-31965 > URL: https://issues.apache.org/jira/browse/SPARK-31965 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.6, 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > If you do a plain package with sbt: > {code} > ./build/sbt -DskipTests -Phive-thriftserver clean package > ./run-tests --python-executable=python3 --testname="pyspark.sql.udf > UserDefinedFunction" > {code} > The doctests in registerJavaFunction and registerJavaUDAF fail because they > require some classes from the test compilation. > We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31965: - Summary: Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled (was: Run the tests in registerJavaFunction and registerJavaUDAF if test classes are not compiled) > Run the tests in registerJavaFunction and registerJavaUDAF only when test > classes are compiled > -- > > Key: SPARK-31965 > URL: https://issues.apache.org/jira/browse/SPARK-31965 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.6, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > If you do a plain package with sbt: > {code} > ./build/sbt -DskipTests -Phive-thriftserver clean package > ./run-tests --python-executable=python3 --testname="pyspark.sql.udf > UserDefinedFunction" > {code} > The doctests in registerJavaFunction and registerJavaUDAF fail because they > require some classes from the test compilation. > We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF if test classes are not compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31965: - Summary: Run the tests in registerJavaFunction and registerJavaUDAF if test classes are not compiled (was: Skip tests in registerJavaFunction and registerJavaUDAF if test classes are not compiled) > Run the tests in registerJavaFunction and registerJavaUDAF if test classes > are not compiled > --- > > Key: SPARK-31965 > URL: https://issues.apache.org/jira/browse/SPARK-31965 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.6, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > If you do a plain package with sbt: > {code} > ./build/sbt -DskipTests -Phive-thriftserver clean package > ./run-tests --python-executable=python3 --testname="pyspark.sql.udf > UserDefinedFunction" > {code} > The doctests in registerJavaFunction and registerJavaUDAF fail because they > require some classes from the test compilation. > We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31965) Skip tests in registerJavaFunction and registerJavaUDAF if test classes are not compiled
Hyukjin Kwon created SPARK-31965: Summary: Skip tests in registerJavaFunction and registerJavaUDAF if test classes are not compiled Key: SPARK-31965 URL: https://issues.apache.org/jira/browse/SPARK-31965 Project: Spark Issue Type: Bug Components: PySpark, Tests Affects Versions: 2.4.6, 3.0.0 Reporter: Hyukjin Kwon If you do a plain package with sbt: {code} ./build/sbt -DskipTests -Phive-thriftserver clean package ./run-tests --python-executable=python3 --testname="pyspark.sql.udf UserDefinedFunction" {code} The doctests in registerJavaFunction and registerJavaUDAF fail because they require some classes from the test compilation. We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31955) Beeline discards the last line of the sql file when submitted to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31955: Priority: Major (was: Blocker) > Beeline discards the last line of the sql file when submitted to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Major > > I submitted a sql file via beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line. This appears to be a bug in beeline's parsing of the sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31956) Do not fail if there is no ambiguous self join
[ https://issues.apache.org/jira/browse/SPARK-31956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132845#comment-17132845 ] Dongjoon Hyun commented on SPARK-31956: --- Thanks, [~kabhwan]. > Do not fail if there is no ambiguous self join > -- > > Key: SPARK-31956 > URL: https://issues.apache.org/jira/browse/SPARK-31956 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31956) Do not fail if there is no ambiguous self join
[ https://issues.apache.org/jira/browse/SPARK-31956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31956: -- Fix Version/s: (was: 3.0.0) 3.0.1 > Do not fail if there is no ambiguous self join > -- > > Key: SPARK-31956 > URL: https://issues.apache.org/jira/browse/SPARK-31956 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31926: Assignee: (was: Apache Spark) > Fix concurrency issue for ThriftCLIService to getPortNumber > --- > > Key: SPARK-31926 > URL: https://issues.apache.org/jira/browse/SPARK-31926 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > When > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext > is called, > it starts ThriftCLIService in the background with a new Thread; if at the same > time we call ThriftCLIService.getPortNumber, we might not get the bound port > if it's configured with 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
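The race described above is easy to reproduce with any listener started on a background thread: until `bind()` has run, there is no port to report when the service was configured with port 0. One common remedy, sketched here with a plain socket rather than the actual Thrift server (the `TinyServer` class is illustrative), is to publish the port only after signalling a "bound" event:

```python
import socket
import threading


class TinyServer:
    """Bind a listener on an ephemeral port in a background thread.
    The Event is set only after bind(), so get_port() can never
    observe the unbound state -- the race described for
    ThriftCLIService.getPortNumber."""

    def __init__(self):
        self._port = None
        self._bound = threading.Event()
        self._thread = threading.Thread(target=self._serve, daemon=True)

    def _serve(self):
        sock = socket.socket()
        sock.bind(("127.0.0.1", 0))   # 0 = let the OS pick a free port
        self._port = sock.getsockname()[1]
        self._bound.set()             # signal: the port number is now valid
        sock.listen(1)

    def start(self):
        self._thread.start()

    def get_port(self, timeout=5.0):
        if not self._bound.wait(timeout):
            raise TimeoutError("server did not bind in time")
        return self._port
```

Without the `Event`, a caller racing the background thread could read `None` instead of the bound port, which is the failure mode the issue reports.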
[jira] [Assigned] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31926: Assignee: Apache Spark > Fix concurrency issue for ThriftCLIService to getPortNumber > --- > > Key: SPARK-31926 > URL: https://issues.apache.org/jira/browse/SPARK-31926 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > When > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext > is called, > it starts ThriftCLIService in the background with a new Thread; if at the same > time we call ThriftCLIService.getPortNumber, we might not get the bound port > if it's configured with 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
[ https://issues.apache.org/jira/browse/SPARK-31964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31964: Assignee: Bryan Cutler > Avoid Pandas import for CategoricalDtype with Arrow conversion > -- > > Key: SPARK-31964 > URL: https://issues.apache.org/jira/browse/SPARK-31964 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > > The import for CategoricalDtype changed in Pandas from 0.23 to 1.0 and > currently pyspark checks 2 places for the import. It would be better to check the > type as a string and avoid any imports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
[ https://issues.apache.org/jira/browse/SPARK-31964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31964. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28793 [https://github.com/apache/spark/pull/28793] > Avoid Pandas import for CategoricalDtype with Arrow conversion > -- > > Key: SPARK-31964 > URL: https://issues.apache.org/jira/browse/SPARK-31964 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > Fix For: 3.1.0 > > > The import for CategoricalDtype changed in Pandas from 0.23 to 1.0 and > currently pyspark checks 2 places for the import. It would be better to check the > type as a string and avoid any imports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31955) Beeline discards the last line of the sql file when submitted to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132812#comment-17132812 ] Lin Gang Deng commented on SPARK-31955: --- [~younggyuchun] No results. {code:java} [test@192.168.0.1 denglg]$ cat -A test3.sql select * from info_dev.beeline_test where name='bbb'[test@192.168.0.1 denglg]$ {code} {code:java} 0: jdbc:hive2://spark-sql.hadoo> select * from info_dev.beeline_test where name='bbb' Closing: 0: jdbc:hive2://spark-sql.hadoop.srv:1/;principal=xxx?mapreduce.job.queuename=xxx [test@192.168.0.1 denglg]$ {code} > Beeline discards the last line of the sql file when submitted to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Blocker > > I submitted a sql file via beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line. This appears to be a bug in beeline's parsing of the sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
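The symptom in the comment above is consistent with a reader that only emits a statement once it sees a terminating newline: the `cat -A` output shows the file's last line has no trailing newline, so that statement is never flushed. A toy sketch of the buggy behaviour and the obvious fix (both functions are illustrative, not beeline's actual code):

```python
def read_statements_buggy(text):
    """Emit a line only when a newline terminates it; an unterminated
    last line sits in the buffer and is silently dropped -- matching
    the beeline symptom."""
    out, buf = [], ""
    for ch in text:
        if ch == "\n":
            out.append(buf)
            buf = ""
        else:
            buf += ch
    return out  # `buf` (the unterminated last line) is lost here


def read_statements_fixed(text):
    """Same scan, but flush the trailing buffer so the last line
    survives even without a final newline."""
    out, buf = [], ""
    for ch in text:
        if ch == "\n":
            out.append(buf)
            buf = ""
        else:
            buf += ch
    if buf:
        out.append(buf)
    return out
```

Appending a newline to the end of the sql file is the user-side workaround; the fix above is the reader-side one.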
[jira] [Commented] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132807#comment-17132807 ] Dongjoon Hyun commented on SPARK-31926: --- This is reverted because the Maven test fails consistently. > Fix concurrency issue for ThriftCLIService to getPortNumber > --- > > Key: SPARK-31926 > URL: https://issues.apache.org/jira/browse/SPARK-31926 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > When > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext > is called, > it starts ThriftCLIService in the background with a new Thread; if at the same > time we call ThriftCLIService.getPortNumber, we might not get the bound port > if it's configured with 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26905) Revisit reserved/non-reserved keywords based on the ANSI SQL standard
[ https://issues.apache.org/jira/browse/SPARK-26905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132806#comment-17132806 ] Takeshi Yamamuro commented on SPARK-26905: -- okay, I'll check if we can do so. > Revisit reserved/non-reserved keywords based on the ANSI SQL standard > - > > Key: SPARK-26905 > URL: https://issues.apache.org/jira/browse/SPARK-26905 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > Attachments: spark-ansiNonReserved.txt, spark-keywords-list.txt, > spark-nonReserved.txt, spark-strictNonReserved.txt, > sql2016-02-nonreserved.txt, sql2016-02-reserved.txt, > sql2016-09-nonreserved.txt, sql2016-09-reserved.txt, > sql2016-14-nonreserved.txt, sql2016-14-reserved.txt > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31926: - Assignee: (was: Kent Yao) > Fix concurrency issue for ThriftCLIService to getPortNumber > --- > > Key: SPARK-31926 > URL: https://issues.apache.org/jira/browse/SPARK-31926 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > When > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext > is called, > it starts ThriftCLIService in the background with a new Thread; if at the same > time we call ThriftCLIService.getPortNumber, we might not get the bound port > if it's configured with 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31926: -- Fix Version/s: (was: 3.0.1) (was: 3.1.0) > Fix concurrency issue for ThriftCLIService to getPortNumber > --- > > Key: SPARK-31926 > URL: https://issues.apache.org/jira/browse/SPARK-31926 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > When > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext > is called, > it starts ThriftCLIService in the background with a new Thread; if at the same > time we call ThriftCLIService.getPortNumber, we might not get the bound port > if it's configured with 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-31926: --- > Fix concurrency issue for ThriftCLIService to getPortNumber > --- > > Key: SPARK-31926 > URL: https://issues.apache.org/jira/browse/SPARK-31926 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > When > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext > is called, > it starts ThriftCLIService in the background with a new Thread; if at the same > time we call ThriftCLIService.getPortNumber, we might not get the bound port > if it's configured with 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
[ https://issues.apache.org/jira/browse/SPARK-31964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31964: Assignee: (was: Apache Spark) > Avoid Pandas import for CategoricalDtype with Arrow conversion > -- > > Key: SPARK-31964 > URL: https://issues.apache.org/jira/browse/SPARK-31964 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Bryan Cutler >Priority: Minor > > The import for CategoricalDtype changed in Pandas from 0.23 to 1.0 and > currently pyspark checks 2 places for the import. It would be better to check the > type as a string and avoid any imports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
[ https://issues.apache.org/jira/browse/SPARK-31964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132803#comment-17132803 ] Apache Spark commented on SPARK-31964: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/28793 > Avoid Pandas import for CategoricalDtype with Arrow conversion > -- > > Key: SPARK-31964 > URL: https://issues.apache.org/jira/browse/SPARK-31964 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Bryan Cutler >Priority: Minor > > The import for CategoricalDtype changed in Pandas from 0.23 to 1.0 and > currently pyspark checks 2 places for the import. It would be better to check the > type as a string and avoid any imports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
[ https://issues.apache.org/jira/browse/SPARK-31964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31964: Assignee: Apache Spark > Avoid Pandas import for CategoricalDtype with Arrow conversion > -- > > Key: SPARK-31964 > URL: https://issues.apache.org/jira/browse/SPARK-31964 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Minor > > The import for CategoricalDtype changed in Pandas from 0.23 to 1.0 and > currently pyspark checks 2 places for the import. It would be better to check the > type as a string and avoid any imports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
Bryan Cutler created SPARK-31964: Summary: Avoid Pandas import for CategoricalDtype with Arrow conversion Key: SPARK-31964 URL: https://issues.apache.org/jira/browse/SPARK-31964 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.1.0 Reporter: Bryan Cutler The import for CategoricalDtype changed in Pandas from 0.23 to 1.0 and currently pyspark checks 2 places for the import. It would be better to check the type as a string and avoid any imports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
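"Check the type as a string" here means comparing the class name rather than importing `CategoricalDtype` from either of its historical module paths. A minimal sketch of that idea (the stand-in class exists only so the example runs without pandas installed):

```python
def is_categorical_dtype_name(dtype):
    """True when `dtype` looks like pandas' CategoricalDtype,
    identified by class name alone -- no pandas import, so the
    0.23 vs 1.0 module-path difference never matters."""
    return type(dtype).__name__ == "CategoricalDtype"


class CategoricalDtype:
    """Stand-in for pandas.CategoricalDtype, for demonstration only."""
    pass
```

The trade-off of a name-based check is that any class named `CategoricalDtype` matches, which is acceptable when the value is known to come from a pandas DataFrame's dtypes.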
[jira] [Commented] (SPARK-31935) Hadoop file system config should be effective in data source options
[ https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132788#comment-17132788 ] Apache Spark commented on SPARK-31935: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28791 > Hadoop file system config should be effective in data source options > - > > Key: SPARK-31935 > URL: https://issues.apache.org/jira/browse/SPARK-31935 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1, 3.1.0, 2.4.7 >Reporter: Gengliang Wang >Priority: Major > > Data source options should be propagated into the hadoop configuration of > method `checkAndGlobPathIfNecessary` > From org.apache.hadoop.fs.FileSystem.java: > {code:java} > public static FileSystem get(URI uri, Configuration conf) throws > IOException { > String scheme = uri.getScheme(); > String authority = uri.getAuthority(); > if (scheme == null && authority == null) { // use default FS > return get(conf); > } > if (scheme != null && authority == null) { // no authority > URI defaultUri = getDefaultUri(conf); > if (scheme.equals(defaultUri.getScheme())// if scheme matches > default > && defaultUri.getAuthority() != null) { // & default has authority > return get(defaultUri, conf); // return default > } > } > > String disableCacheName = String.format("fs.%s.impl.disable.cache", > scheme); > if (conf.getBoolean(disableCacheName, false)) { > return createFileSystem(uri, conf); > } > return CACHE.get(uri, conf); > } > {code} > With this, we can specify URI schema and authority related configurations for > scanning file systems. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25186) Stabilize Data Source V2 API
[ https://issues.apache.org/jira/browse/SPARK-25186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132780#comment-17132780 ] Jeremy Bloom edited comment on SPARK-25186 at 6/10/20, 11:18 PM: - I'm not sure whether I'm in the right place, but I want to discuss my experience using the DataSource V2 API (using Spark 2.4.5). To be clear, I'm not a Spark contributor, but I am using Spark in several applications. If this is not the appropriate place for my comments, please direct me to the right place. My current project is part of an open source program called COIN-OR (I can provide links if you like). Among other things, we are building a data exchange standard for mathematical optimization solvers, called MOSDEX. I am using Spark SQL as the backbone of this application. I work mostly in Java, some in Python, and not at all in Scala. I came to this point because a critical link in my application is to load data from JSON into a Spark Dataset. For various reasons, the standard JSON reader in Spark is not adequate for my purposes. I want to use a Java Stream to load the data into a Spark dataframe. It turns out that this is remarkably hard to do. SparkSession provides a createDataFrame method whose argument is a Java List. However, because I am dealing with potentially very large data sets, I do not want to create a list in memory. My first attempt was to wrap the stream in a "virtual list" that fulfilled most of the list "contract" but did not actually create a list object. The hang-up with this approach is that it cannot count the elements in the stream unless and until the stream has been read, so the size of the list cannot be determined until that happens. Using the virtual list in the createDataFrame method sometimes succeeds but fails at other times, when some component of Spark needs to call the size method on the list. I raised this question on Stack Overflow, and someone suggested using DataSource V2.
There isn't much out there, but I did find a couple of examples, and based on them, I was able to build a reader and a writer for Java Streams. However, a number of issues arose that forced me to use some "unsavory" tricks in my code that I am uncomfortable with (they "smell" in Fowler's sense). Here are some of the issues I found. 1) DataSource V2 still relies on the calling sequences of DataFrameReader and DataFrameWriter from V1. The reader is instantiated by the SparkSession read method while the writer is instantiated by the Dataset write method. The separation is confusing. Why not use a static factory in Dataset to instantiate the reader? 2) The actual calls to the reader and writer are contained in the DataFrameReader and DataFrameWriter format methods. Using "format" for the name of this method is misleading, but once you realize what it is doing, it makes even less sense. The parameter of the format method is a string, which isn't documented, but which turns out to be a class name. Then format creates a class using reflection. Use of reflection in this circumstance appears to be unnecessary and restrictive. Why not make the parameter an instance of an appropriate interface, such as ReadSupport or WriteSupport? 3) Using reflection means that the reader or writer will be created by a parameterless constructor. Thus, it is not possible to pass any information to the reader and writer. In my case, I want to pass an instance of a Java Stream and a Spark schema to the reader, and a Spark dataset to the writer. The only way I have found to do that is to create static fields in the reader and writer classes and set them before calling the format method to create the reader and writer objects. That's what I mean about "unsavory" programming. 4) There are other, less critical issues, but these are still annoying. Links to external files in the option method are undocumented strings, but they look like paths. 
Since the Path class has been part of Java just about forever, why not use it? And when reading from or writing to a file, why not use the Java InputStream and OutputStream classes, which also have been around forever? I realize that programming style in Scala might be different from Java, but I can't believe that the design of DataSource V2 requires these contortions even in Scala. Given the power that is available in Spark, it seems to me that basic input and output should be much better structured than it is. Thanks for your attention to this matter.
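Points 2 and 3 above, loading a class from a name string via reflection and then smuggling state in through static fields, can be sketched in a few lines of Python. Everything here is hypothetical illustration of the pattern, not the actual DataSource V2 API:

```python
import importlib


class StreamSource:
    """Hypothetical reader class, standing in for a ReadSupport implementation.

    The loader below only receives a class-name string, so the constructor
    must take no arguments; state is smuggled in through a class-level
    (static) field -- the "unsavory" trick described above.
    """

    pending_stream = None  # static field, set before instantiation

    def __init__(self):  # parameterless, as reflection-style loading requires
        self.stream = StreamSource.pending_stream


def load_by_name(qualified_name):
    """Mimic format(name): resolve a class from a string and instantiate
    it with no constructor arguments."""
    module_name, _, cls_name = qualified_name.rpartition(".")
    cls = getattr(importlib.import_module(module_name), cls_name)
    return cls()  # parameterless construction


# The workaround: set the static field, then trigger the reflective load.
StreamSource.pending_stream = iter([1, 2, 3])
reader = load_by_name(f"{__name__}.StreamSource")
values = list(reader.stream)
print(values)
```

Accepting an interface instance instead of a class name would let callers pass the stream and schema through an ordinary constructor, which is the alternative suggested above.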
> Stabilize Data Source V2 API > - > > Key: SPARK-25186 > URL: https://issues.apache.org/jira/browse/SPARK-25186 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira
[jira] [Resolved] (SPARK-31915) Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-31915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-31915. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28777 [https://github.com/apache/spark/pull/28777] > Resolve the grouping column properly per the case sensitivity in grouped and > cogrouped pandas UDFs > -- > > Key: SPARK-31915 > URL: https://issues.apache.org/jira/browse/SPARK-31915 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > {code} > from pyspark.sql.functions import * > df = spark.createDataFrame([[1, 1]], ["column", "Score"]) > @pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP) > def my_pandas_udf(pdf): > return pdf.assign(Score=0.5) > df.groupby('COLUMN').apply(my_pandas_udf).show() > {code} > {code} > pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could > be: COLUMN, COLUMN.; > {code} > {code} > df1 = spark.createDataFrame([(1, 1)], ("column", "value")) > df2 = spark.createDataFrame([(1, 1)], ("column", "value")) > df1.groupby("COLUMN").cogroup( > df2.groupby("COLUMN") > ).applyInPandas(lambda r, l: r + l, df1.schema).show() > {code} > {code} > pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input > columns: [COLUMN, COLUMN, value, value];; > 'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], (column#9L, value#10L, > column#13L, value#14L), [column#22L, value#23L] > :- Project [COLUMN#9L, column#9L, value#10L] > : +- LogicalRDD [column#9L, value#10L], false > +- Project [COLUMN#13L, column#13L, value#14L] >+- LogicalRDD [column#13L, value#14L], false > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
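The root cause above is case-insensitive column resolution (Spark's default, spark.sql.caseSensitive=false) colliding with a plan that projects the grouping key twice. A minimal resolver sketch, not the actual Spark analyzer, shows both failure messages:

```python
def resolve_column(name, columns, case_sensitive=False):
    """Analyzer-style column resolution (illustrative sketch).

    With case_sensitive=False, 'COLUMN' should match the single field
    'column'; more than one match is reported as ambiguous, which is the
    symptom shown in the issue above.
    """
    if case_sensitive:
        matches = [c for c in columns if c == name]
    else:
        matches = [c for c in columns if c.lower() == name.lower()]
    if not matches:
        raise ValueError(f"cannot resolve '{name}' given input columns: {columns}")
    if len(matches) > 1:
        raise ValueError(f"Reference '{name}' is ambiguous, could be: {', '.join(matches)}")
    return matches[0]


# The intended behavior: 'COLUMN' resolves to the lone 'column' field.
print(resolve_column("COLUMN", ["column", "Score"]))

# The buggy plan effectively projected the grouping key a second time:
try:
    resolve_column("COLUMN", ["COLUMN", "column", "value"])
except ValueError as exc:
    print(exc)
```

The fix is to resolve the grouping column against the child's output once, respecting the case-sensitivity setting, rather than re-projecting it under a differently-cased name.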
[jira] [Assigned] (SPARK-31915) Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-31915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-31915: Assignee: Hyukjin Kwon > Resolve the grouping column properly per the case sensitivity in grouped and > cogrouped pandas UDFs > -- > > Key: SPARK-31915 > URL: https://issues.apache.org/jira/browse/SPARK-31915 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > {code} > from pyspark.sql.functions import * > df = spark.createDataFrame([[1, 1]], ["column", "Score"]) > @pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP) > def my_pandas_udf(pdf): > return pdf.assign(Score=0.5) > df.groupby('COLUMN').apply(my_pandas_udf).show() > {code} > {code} > pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could > be: COLUMN, COLUMN.; > {code} > {code} > df1 = spark.createDataFrame([(1, 1)], ("column", "value")) > df2 = spark.createDataFrame([(1, 1)], ("column", "value")) > df1.groupby("COLUMN").cogroup( > df2.groupby("COLUMN") > ).applyInPandas(lambda r, l: r + l, df1.schema).show() > {code} > {code} > pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input > columns: [COLUMN, COLUMN, value, value];; > 'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], (column#9L, value#10L, > column#13L, value#14L), [column#22L, value#23L] > :- Project [COLUMN#9L, column#9L, value#10L] > : +- LogicalRDD [column#9L, value#10L], false > +- Project [COLUMN#13L, column#13L, value#14L] >+- LogicalRDD [column#13L, value#14L], false > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28199) Move Trigger implementations to Triggers.scala and avoid exposing these to the end users
[ https://issues.apache.org/jira/browse/SPARK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132756#comment-17132756 ] Apache Spark commented on SPARK-28199: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/28790 > Move Trigger implementations to Triggers.scala and avoid exposing these to > the end users > > > Key: SPARK-28199 > URL: https://issues.apache.org/jira/browse/SPARK-28199 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > Even though ProcessingTime was deprecated in 2.2.0, it is still used in the > Spark codebase, and the alternative Spark proposes itself uses deprecated > methods, which feels circular: the usage could never be removed. > In fact, ProcessingTime is deprecated because we want to expose only > Trigger.xxx instead of exposing actual implementations, and I think we missed > some other implementations as well. > This issue aims to move all Trigger implementations to Triggers.scala and > hide them from end users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
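The "expose only Trigger.xxx" idea above is a facade over private implementation classes. A minimal sketch of the pattern (illustrative names, not Spark's actual Triggers.scala):

```python
class _ProcessingTimeTrigger:
    """Module-private implementation; end users should never name this type
    directly, so it can be moved or removed without breaking public API."""

    def __init__(self, interval_ms):
        self.interval_ms = interval_ms


class Trigger:
    """Facade sketch: expose only Trigger.xxx factory methods and keep the
    concrete trigger classes hidden, as the issue above proposes."""

    @staticmethod
    def processing_time(interval_ms):
        return _ProcessingTimeTrigger(interval_ms)


# Callers depend only on the factory, never on the concrete class.
trigger = Trigger.processing_time(1000)
print(type(trigger).__name__, trigger.interval_ms)
```

Because callers never reference the concrete class, deprecating or relocating it (the circular-deprecation problem described above) becomes a private refactor.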
[jira] [Comment Edited] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132743#comment-17132743 ] Everett Rush edited comment on SPARK-27249 at 6/10/20, 9:52 PM: I checked out the code and the design. Thanks for the great effort, but this doesn't quite meet the need. The transformer contract can still be met with a transform function that returns a Dataframe. I would like something more like this. {code:java} @DeveloperApi abstract class MultiColumnTransformer[T <: MultiColumnTransformer[T]] extends Transformer with HasOutputCol with Logging { def setOutputCol(value: String): T = set(outputCol, value).asInstanceOf[T] /** Returns the data type of the output column. */ protected def outputDataType: DataType protected def transformFunc: Iterator[Row] => Iterator[Row] override def transformSchema(schema: StructType): StructType = { if (schema.fieldNames.contains($(outputCol))) { throw new IllegalArgumentException(s"Output column ${$(outputCol)} already exists.") } val outputFields = schema.fields :+ StructField($(outputCol), outputDataType, nullable = false) StructType(outputFields) } def transform(dataset: DataFrame, targetSchema: StructType): DataFrame = { val targetEncoder = RowEncoder(targetSchema) dataset.mapPartitions(transformFunc)(targetEncoder) } override def transform(dataset: Dataset[_]): DataFrame = { val dataframe = dataset.toDF() val targetSchema = transformSchema(dataframe.schema, logging = true) transform(dataframe, targetSchema) } override def copy(extra: ParamMap): T = defaultCopy(extra) } {code} > Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.1.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png > > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (i.e. UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize > repeatedly in a UDF such as a database connection. 
> > Design: > Abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row] > NB: This parallels the UnaryTransformer createTransformFunc method > > When developers subclass this transformer, they can provide their own schema > for the output Row in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively the developer can set > output Datatype and output col name. Then the PartitionTransformer class will > create a new schema, a row encoder, and execute the transformation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
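The per-partition design described above (a developer-supplied Iterator[Row] => Iterator[Row] function, so expensive resources are created once per partition rather than once per row) can be sketched as follows. All names are illustrative, not Spark API:

```python
def make_partition_transformer(transform_func):
    """Sketch of the PartitionTransformer idea above: the developer supplies
    an iterator-to-iterator function that is applied once per partition, so
    expensive objects such as a database connection are initialized per
    partition rather than per row."""

    def apply(partitions):  # partitions: an iterable of row-iterators
        for partition in partitions:
            yield list(transform_func(iter(partition)))

    return apply


def add_width_column(rows):
    connection = object()  # stand-in for a costly per-partition resource
    for row in rows:
        yield (*row, len(row))  # append a derived output column


transformer = make_partition_transformer(add_width_column)
result = list(transformer([[(1, 2)], [(3, 4), (5, 6)]]))
print(result)
```

In real Spark this corresponds to Dataset.mapPartitions with a row encoder built from the output schema, as in the Scala sketch above.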
[jira] [Commented] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
[ https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132742#comment-17132742 ] Jungtaek Lim commented on SPARK-31961: -- Great. Then just construct the config key you want to use, like "kafka." + ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG. If you'd rather not add the "kafka." prefix yourself, then please submit a PR to propose a fix. > Add a class in spark with all Kafka configuration key available as string > - > > Key: SPARK-31961 > URL: https://issues.apache.org/jira/browse/SPARK-31961 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Gunjan Kumar >Priority: Minor > Labels: kafka, sql, structured-streaming > > Add a class in spark with all Kafka configuration keys available as strings. > See the highlighted class which I want. > eg:- > Current code:- > val df_cluster1 = spark > .read > .format("kafka") > .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") > .option("subscribe", "topic1") > Expected code:- > val df_cluster1 = spark > .read > .format("kafka") > .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") > .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
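The suggestion above, reusing Kafka's own config constants and adding Spark's "kafka." prefix, looks like this in a small Python sketch. The constant value matches Kafka's real ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG; the helper name is illustrative:

```python
# Kafka's client library already exposes its config keys as constants
# (in Java, ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG == "bootstrap.servers");
# the idea is to reuse those and add the "kafka." prefix Spark expects.
BOOTSTRAP_SERVERS_CONFIG = "bootstrap.servers"  # value matches Kafka's constant


def kafka_option(key):
    """Prefix a Kafka client config key for the Spark Kafka source (sketch)."""
    return "kafka." + key


options = {
    kafka_option(BOOTSTRAP_SERVERS_CONFIG): "cluster1_host:cluster1_port",
    "subscribe": "topic1",  # Spark-specific options take no prefix
}
print(options)
```

This avoids hand-typed key strings without Spark having to mirror (and version-track) Kafka's entire config surface.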
[jira] [Assigned] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31963: - Assignee: William Hyun > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31963. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28789 [https://github.com/apache/spark/pull/28789] > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.1.0 > > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
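The fix described above amounts to importing CategoricalDtype from pandas.api.types, the public location that exists in both pandas 0.23 and 1.0, rather than from a module path that only some versions expose. A hedged sketch of a compatibility helper (the function name is illustrative; it returns None when pandas is absent so the sketch stays runnable anywhere):

```python
def get_categorical_dtype():
    """Return the CategoricalDtype class from its stable public location,
    pandas.api.types, which works across pandas 0.23 and 1.0.
    Returns None if pandas is not installed."""
    try:
        from pandas.api.types import CategoricalDtype
        return CategoricalDtype
    except ImportError:
        return None


dtype = get_categorical_dtype()
print(dtype)
```

Pinning imports to the documented pandas.api namespace is what keeps one code path working across both major versions.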
[jira] [Commented] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
[ https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132739#comment-17132739 ] Gunjan Kumar commented on SPARK-31961: -- No, Kafka users don't write down config keys like that; there are special classes, ProducerConfig and ConsumerConfig. [https://kafka.apache.org/22/javadoc/index.html?org/apache/kafka/clients/consumer/ConsumerConfig.html] > Add a class in spark with all Kafka configuration key available as string > - > > Key: SPARK-31961 > URL: https://issues.apache.org/jira/browse/SPARK-31961 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Gunjan Kumar >Priority: Minor > Labels: kafka, sql, structured-streaming > > Add a class in spark with all Kafka configuration keys available as strings. > See the highlighted class which I want. > eg:- > Current code:- > val df_cluster1 = spark > .read > .format("kafka") > .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") > .option("subscribe", "topic1") > Expected code:- > val df_cluster1 = spark > .read > .format("kafka") > .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") > .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
[ https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132736#comment-17132736 ] Jungtaek Lim edited comment on SPARK-31961 at 6/10/20, 9:37 PM: My point is that given they're Kafka configs, why are you requiring them from Spark? If Kafka provides such constants then you can just use those. If not, how could you tolerate the fact that Kafka users also have to write down config keys as strings? Either way, Kafka should provide it under that rationale, not Spark. > Add a class in spark with all Kafka configuration key available as string > - > > Key: SPARK-31961 > URL: https://issues.apache.org/jira/browse/SPARK-31961 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Gunjan Kumar >Priority: Minor > Labels: kafka, sql, structured-streaming > > Add a class in spark with all Kafka configuration keys available as strings. > See the highlighted class which I want. > eg:- > Current code:- > val df_cluster1 = spark > .read > .format("kafka") > .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") > .option("subscribe", "topic1") > Expected code:- > val df_cluster1 = spark > .read > .format("kafka") > .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") > .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
[ https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132725#comment-17132725 ] Gunjan Kumar commented on SPARK-31961: -- Spark will not be tied to a Kafka version, as these property keys never change across Kafka versions. In the current scenario, suppose I want to set a poll timeout but I don't know the property name for it; then I have to go to the Kafka docs, search for that property, and paste it into option(). This is not how a programmer should code :) > Add a class in spark with all Kafka configuration key available as string > - > > Key: SPARK-31961 > URL: https://issues.apache.org/jira/browse/SPARK-31961 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Gunjan Kumar >Priority: Minor > Labels: kafka, sql, structured-streaming > > Add a class in spark with all Kafka configuration keys available as strings. > See the highlighted class which I want. > eg:- > Current code:- > val df_cluster1 = spark > .read > .format("kafka") > .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") > .option("subscribe", "topic1") > Expected code:- > val df_cluster1 = spark > .read > .format("kafka") > .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") > .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31963: -- Component/s: SQL > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132714#comment-17132714 ] Apache Spark commented on SPARK-31963: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28789 > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31963: Assignee: Apache Spark > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: Apache Spark >Priority: Major > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31963: Assignee: (was: Apache Spark) > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132713#comment-17132713 ] Apache Spark commented on SPARK-31963: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28789 > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31963) Support both pandas 0.23 and 1.0
William Hyun created SPARK-31963: Summary: Support both pandas 0.23 and 1.0 Key: SPARK-31963 URL: https://issues.apache.org/jira/browse/SPARK-31963 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.1.0 Reporter: William Hyun This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. {code} $ pip install pandas==0.23.2 $ python -c "import pandas.CategoricalDtype" Traceback (most recent call last): File "", line 1, in ModuleNotFoundError: No module named 'pandas.CategoricalDtype' $ python -c "from pandas.api.types import CategoricalDtype" {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
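The failing and working imports above suggest the general fix: resolve a symbol by trying several dotted paths in order, preferring the stable public path. A minimal, generic sketch of that pattern; the helper name is ours, and the pandas paths in the comment are shown only as an illustration of how it would be used, not as a claim about Spark's actual patch:

```python
import importlib


def import_from_candidates(candidates):
    """Return the first attribute importable from a list of
    (module_name, attribute_name) candidates; raise ImportError if none work."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError):
            continue
    raise ImportError(f"none of the candidates resolved: {candidates}")


# Hypothetical usage for the pandas case (paths illustrative only):
# CategoricalDtype = import_from_candidates([
#     ("pandas.api.types", "CategoricalDtype"),  # stable public path, 0.23+
# ])
```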
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Description: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical delta files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], there is a method, _checkAndGlobPathIfNecessary,_ which appears to create an in-memory index of files for a given path. There may be a rather clean opportunity to consider options here. Having the ability to provide an option specifying a timestamp by which to begin globbing files would mean considerably less complexity for a consumer who leverages the ability to stream from a folder path but does not have an interest in reading what could be thousands of files that are not relevant. One example could be "createdFileTime", accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("createdFileTime", "2020-05-01 00:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been created at or later than the specified time in order to be consumed for purposes of reading the files in general or for purposes of structured streaming. 
was: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical delta files in CSV format. When I start reading from a folder, however, I might only care about files were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], there is a method, _checkAndGlobPathIfNecessary,_ which appears create an in-memory index of files for a given path. There may a rather clean opportunity to consider options here. Having the ability to provide an option specifying a timestamp by which to begin globbing files would result in quite a bit of less complexity needed on a consumer who leverages the ability to stream from a folder path but does not have an interest in reading what could be thousands of files that are not relevant. One example to could be "createdFileTime" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("createdFileTime", "2020-05-01 00:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have a created been created at or later than the specified time in order to be consumed for purposes of reading the files in general or for purposes of structured streaming. 
> Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming with a FileDataSource, I've encountered a > number of occasions where I want to be able to stream from a folder > containing any number of historical delta files in CSV format. When I start > reading from a folder, however, I might only care about files were created > after a certain time. > {code:java} > spark.readStream > .option("header", "true") > .option("delimiter", "\t") > .format("csv") > .load("/mnt/Deltas") > {code} > > In > [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], > there is a method, _checkAndGlobPathIfNecessary,_ which appears create an > in-memory index of files for a given path. There may a rather clean > opportunity to
[jira] [Commented] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132708#comment-17132708 ] Christopher Highman commented on SPARK-31962: - I have a good, working version of this for our internal purposes and would love to contribute it back. > Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming with a FileDataSource, I've encountered a > number of occasions where I want to be able to stream from a folder > containing any number of historical delta files in CSV format. When I start > reading from a folder, however, I might only care about files were created > after a certain time. > {code:java} > spark.readStream > .option("header", "true") > .option("delimiter", "\t") > .format("csv") > .load("/mnt/Deltas") > {code} > > In > [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], > there is a method, _checkAndGlobPathIfNecessary,_ which appears create an > in-memory index of files for a given path. There may a rather clean > opportunity to consider options here. > Having the ability to provide an option specifying a timestamp by which to > begin globbing files would result in quite a bit of less complexity needed on > a consumer who leverages the ability to stream from a folder path but does > not have an interest in reading what could be thousands of files that are not > relevant. > One example to could be "createdFileTime" accepting a UTC datetime like below. 
> {code:java} > spark.readStream > .option("header", "true") > .option("delimiter", "\t") > .option("createdFileTime", "2020-05-01 00:00:00") > .format("csv") > .load("/mnt/Deltas") > {code} > > If this option is specified, the expected behavior would be that files within > the _"/mnt/Deltas/"_ path must have a created been created at or later than > the specified time in order to be consumed for purposes of reading the files > in general or for purposes of structured streaming. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
Christopher Highman created SPARK-31962: --- Summary: Provide option to load files after a specified date when reading from a folder path Key: SPARK-31962 URL: https://issues.apache.org/jira/browse/SPARK-31962 Project: Spark Issue Type: Improvement Components: SQL, Structured Streaming Affects Versions: 3.1.0 Reporter: Christopher Highman When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical delta files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas/") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], there is a method, _checkAndGlobPathIfNecessary,_ which appears to create an in-memory index of files for a given path. There may be a rather clean opportunity to consider options here. Having the ability to provide an option specifying a timestamp by which to begin globbing files would mean considerably less complexity for a consumer who leverages the ability to stream from a folder path but does not have an interest in reading what could be thousands of files that are not relevant. One example could be "createdFileTime", accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("createdFileTime", "2020-05-01 00:00:00") .format("csv") .load("/mnt/Deltas/") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been created at or later than the specified time in order to be consumed for purposes of reading the files in general or for purposes of structured streaming. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
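The proposed "createdFileTime" option boils down to filtering a directory listing by file timestamp before building the in-memory index. A rough sketch of that filter, assuming modification time (st_mtime) as a portable stand-in for creation time; the function name and behavior are illustrative, not Spark's implementation:

```python
import os
from datetime import datetime, timezone


def list_files_created_after(folder, cutoff_utc):
    """Yield paths of regular files under `folder` whose timestamp is at or
    after `cutoff_utc` (a naive datetime interpreted as UTC).
    Uses st_mtime as a proxy for creation time, which is the portable choice."""
    cutoff = cutoff_utc.replace(tzinfo=timezone.utc).timestamp()
    for entry in os.scandir(folder):
        if entry.is_file() and entry.stat().st_mtime >= cutoff:
            yield entry.path
```

With a cutoff of `datetime(2020, 5, 1)`, files whose timestamp predates 2020-05-01 00:00:00 UTC would simply never enter the index, which is the behavior the ticket asks for.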
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Description: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical delta files in CSV format. When I start reading from a folder, however, I might only care about files were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], there is a method, _checkAndGlobPathIfNecessary,_ which appears create an in-memory index of files for a given path. There may a rather clean opportunity to consider options here. Having the ability to provide an option specifying a timestamp by which to begin globbing files would result in quite a bit of less complexity needed on a consumer who leverages the ability to stream from a folder path but does not have an interest in reading what could be thousands of files that are not relevant. One example to could be "createdFileTime" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("createdFileTime", "2020-05-01 00:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have a created been created at or later than the specified time in order to be consumed for purposes of reading the files in general or for purposes of structured streaming. 
was: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical delta files in CSV format. When I start reading from a folder, however, I might only care about files were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas/") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], there is a method, _checkAndGlobPathIfNecessary,_ which appears create an in-memory index of files for a given path. There may a rather clean opportunity to consider options here. Having the ability to provide an option specifying a timestamp by which to begin globbing files would result in quite a bit of less complexity needed on a consumer who leverages the ability to stream from a folder path but does not have an interest in reading what could be thousands of files that are not relevant. One example to could be "createdFileTime" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("createdFileTime", "2020-05-01 00:00:00") .format("csv") .load("/mnt/Deltas/") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have a created been created at or later than the specified time in order to be consumed for purposes of reading the files in general or for purposes of structured streaming. 
> Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming with a FileDataSource, I've encountered a > number of occasions where I want to be able to stream from a folder > containing any number of historical delta files in CSV format. When I start > reading from a folder, however, I might only care about files were created > after a certain time. > {code:java} > spark.readStream > .option("header", "true") > .option("delimiter", "\t") > .format("csv") > .load("/mnt/Deltas") > {code} > > In > [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], > there is a method, _checkAndGlobPathIfNecessary,_ which appears create an > in-memory index of files for a given path. There may a rather clean >
[jira] [Commented] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
[ https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132707#comment-17132707 ] Jungtaek Lim commented on SPARK-31961: -- Don't you think the constants should be available in Kafka instead of Spark? Spark should try to avoid being tied to a specific version of Kafka in its codebase. If Kafka provided them, all you would need is to add the "kafka." prefix. > Add a class in spark with all Kafka configuration key available as string > - > > Key: SPARK-31961 > URL: https://issues.apache.org/jira/browse/SPARK-31961 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Gunjan Kumar >Priority: Minor > Labels: kafka, sql, structured-streaming > > Add a class in Spark with all Kafka configuration keys available as strings. > See the highlighted class that I want. > e.g.: > Current code: > val df_cluster1 = spark > .read > .format("kafka") > .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") > .option("subscribe", "topic1") > Expected code: > val df_cluster1 = spark > .read > .format("kafka") > .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") > .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
[ https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-31961: - Fix Version/s: (was: 2.4.7) > Add a class in spark with all Kafka configuration key available as string > - > > Key: SPARK-31961 > URL: https://issues.apache.org/jira/browse/SPARK-31961 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Gunjan Kumar >Priority: Minor > Labels: kafka, sql, structured-streaming > > Add a class in Spark with all Kafka configuration keys available as strings. > See the highlighted class that I want. > e.g.: > Current code: > val df_cluster1 = spark > .read > .format("kafka") > .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") > .option("subscribe", "topic1") > Expected code: > val df_cluster1 = spark > .read > .format("kafka") > .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") > .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
Gunjan Kumar created SPARK-31961: Summary: Add a class in spark with all Kafka configuration key available as string Key: SPARK-31961 URL: https://issues.apache.org/jira/browse/SPARK-31961 Project: Spark Issue Type: New Feature Components: SQL, Structured Streaming Affects Versions: 2.4.6 Reporter: Gunjan Kumar Fix For: 2.4.7 Add a class in Spark with all Kafka configuration keys available as strings. See the highlighted class that I want. e.g.: Current code: val df_cluster1 = spark .read .format("kafka") .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") .option("subscribe", "topic1") Expected code: val df_cluster1 = spark .read .format("kafka") .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
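What the report asks for is essentially a plain constants holder, so that option keys are discoverable through auto-completion instead of being looked up in the Kafka docs. A minimal sketch of that shape; the class name is hypothetical (no such class exists in Spark), and only keys shown in the ticket or Spark's Kafka source documentation are used:

```python
class KafkaOptionKeys:
    """Hypothetical constants mirroring the string option keys of the
    Spark Kafka source; names here are illustrative, not a Spark API."""
    KAFKA_BOOTSTRAP_SERVERS = "kafka.bootstrap.servers"
    SUBSCRIBE = "subscribe"
    STARTING_OFFSETS = "startingOffsets"
```

Usage would then read `.option(KafkaOptionKeys.SUBSCRIBE, "topic1")`. The counterargument in the thread still applies: keeping such a list in Spark couples it to Kafka's option surface, whereas constants shipped by Kafka itself would only need the "kafka." prefix added.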
[jira] [Commented] (SPARK-7101) Spark SQL should support java.sql.Time
[ https://issues.apache.org/jira/browse/SPARK-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132694#comment-17132694 ] YoungGyu Chun commented on SPARK-7101: -- I will try to get this done, but there is a ton of work ;) > Spark SQL should support java.sql.Time > -- > > Key: SPARK-7101 > URL: https://issues.apache.org/jira/browse/SPARK-7101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 > Environment: All >Reporter: Peter Hagelund >Priority: Major > > Several RDBMSes support the TIME data type; for more exact mapping between > those and Spark SQL, support for java.sql.Time with an associated > DataType.TimeType would be helpful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31956) Do not fail if there is no ambiguous self join
[ https://issues.apache.org/jira/browse/SPARK-31956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132687#comment-17132687 ] Jungtaek Lim commented on SPARK-31956: -- Probably the correct fix version would be 3.0.1, as 3.0.0 RC3 vote is already passed. > Do not fail if there is no ambiguous self join > -- > > Key: SPARK-31956 > URL: https://issues.apache.org/jira/browse/SPARK-31956 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31956) Do not fail if there is no ambiguous self join
[ https://issues.apache.org/jira/browse/SPARK-31956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31956. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28783 [https://github.com/apache/spark/pull/28783] > Do not fail if there is no ambiguous self join > -- > > Key: SPARK-31956 > URL: https://issues.apache.org/jira/browse/SPARK-31956 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build
[ https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31960: Assignee: Apache Spark > Only populate Hadoop classpath for no-hadoop build > -- > > Key: SPARK-31960 > URL: https://issues.apache.org/jira/browse/SPARK-31960 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build
[ https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31960: Assignee: (was: Apache Spark) > Only populate Hadoop classpath for no-hadoop build > -- > > Key: SPARK-31960 > URL: https://issues.apache.org/jira/browse/SPARK-31960 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build
[ https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31960: Assignee: Apache Spark > Only populate Hadoop classpath for no-hadoop build > -- > > Key: SPARK-31960 > URL: https://issues.apache.org/jira/browse/SPARK-31960 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build
[ https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132682#comment-17132682 ] Apache Spark commented on SPARK-31960: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/28788 > Only populate Hadoop classpath for no-hadoop build > -- > > Key: SPARK-31960 > URL: https://issues.apache.org/jira/browse/SPARK-31960 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build
[ https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-31960: Summary: Only populate Hadoop classpath for no-hadoop build (was: Only populate Hadoop classpath when hadoop is provided) > Only populate Hadoop classpath for no-hadoop build > -- > > Key: SPARK-31960 > URL: https://issues.apache.org/jira/browse/SPARK-31960 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31960) Only populate Hadoop classpath when hadoop is provided
DB Tsai created SPARK-31960: --- Summary: Only populate Hadoop classpath when hadoop is provided Key: SPARK-31960 URL: https://issues.apache.org/jira/browse/SPARK-31960 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 3.0.0 Reporter: DB Tsai -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31959) Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian to Julian"
[ https://issues.apache.org/jira/browse/SPARK-31959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132593#comment-17132593 ] Apache Spark commented on SPARK-31959: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28787 > Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian > to Julian" > > > Key: SPARK-31959 > URL: https://issues.apache.org/jira/browse/SPARK-31959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123688/testReport/org.apache.spark.sql.catalyst.util/RebaseDateTimeSuite/optimization_of_micros_rebasing___Gregorian_to_Julian/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31959) Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian to Julian"
[ https://issues.apache.org/jira/browse/SPARK-31959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31959: Assignee: Apache Spark > Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian > to Julian" > > > Key: SPARK-31959 > URL: https://issues.apache.org/jira/browse/SPARK-31959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123688/testReport/org.apache.spark.sql.catalyst.util/RebaseDateTimeSuite/optimization_of_micros_rebasing___Gregorian_to_Julian/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31959) Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian to Julian"
[ https://issues.apache.org/jira/browse/SPARK-31959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31959: Assignee: (was: Apache Spark) > Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian > to Julian" > > > Key: SPARK-31959 > URL: https://issues.apache.org/jira/browse/SPARK-31959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123688/testReport/org.apache.spark.sql.catalyst.util/RebaseDateTimeSuite/optimization_of_micros_rebasing___Gregorian_to_Julian/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31925) Summary.totalIterations greater than maxIters
[ https://issues.apache.org/jira/browse/SPARK-31925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31925: Assignee: (was: Apache Spark) > Summary.totalIterations greater than maxIters > - > > Key: SPARK-31925 > URL: https://issues.apache.org/jira/browse/SPARK-31925 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.6, 3.0.0 >Reporter: zhengruifeng >Priority: Trivial > > I am not sure whether it is a bug, but if we set *maxIter=n* in LiR/LiR/etc, > the model.summary.totalIterations will return *n+1* if the training procedure > does not drop out. > > friendly ping [~huaxingao] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31925) Summary.totalIterations greater than maxIters
[ https://issues.apache.org/jira/browse/SPARK-31925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31925: Assignee: Apache Spark > Summary.totalIterations greater than maxIters > - > > Key: SPARK-31925 > URL: https://issues.apache.org/jira/browse/SPARK-31925 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.6, 3.0.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Trivial > > I am not sure whether it is a bug, but if we set *maxIter=n* in LiR/LiR/etc, > the model.summary.totalIterations will return *n+1* if the training procedure > does not drop out. > > friendly ping [~huaxingao] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31925) Summary.totalIterations greater than maxIters
[ https://issues.apache.org/jira/browse/SPARK-31925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132582#comment-17132582 ] Apache Spark commented on SPARK-31925: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28786 > Summary.totalIterations greater than maxIters > - > > Key: SPARK-31925 > URL: https://issues.apache.org/jira/browse/SPARK-31925 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.6, 3.0.0 >Reporter: zhengruifeng >Priority: Trivial > > I am not sure whether it is a bug, but if we set *maxIter=n* in LiR/LiR/etc, > the model.summary.totalIterations will return *n+1* if the training procedure > does not drop out. > > friendly ping [~huaxingao] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
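One plausible cause of the n+1 behavior described above (a sketch in plain Python, not the actual Spark ML code; `fit` and its loss update are hypothetical) is a solver that records the objective history including the initial state, so the history length is the number of iterations plus one:

```python
# Hypothetical sketch (not the actual Spark ML implementation): an iterative
# solver that appends the initial objective before the loop, so the recorded
# history has max_iter + 1 entries even though only max_iter updates ran.
def fit(max_iter):
    history = []
    loss = 100.0
    history.append(loss)   # initial objective, recorded before iterating
    for _ in range(max_iter):
        loss *= 0.5        # one optimization step
        history.append(loss)
    return history

history = fit(max_iter=10)
assert len(history) == 11  # a summary reporting len(history) appears off by one
```

If the summary derives its iteration count from such a history, reporting `len(history) - 1` would match the configured maximum.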
[jira] [Commented] (SPARK-31948) expose mapSideCombine in aggByKey/reduceByKey/foldByKey
[ https://issues.apache.org/jira/browse/SPARK-31948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132569#comment-17132569 ] L. C. Hsieh commented on SPARK-31948: - I guess I have a similar question. When the reduction factor isn't high, is there any disadvantage to running map-side combining? I think it can still reduce shuffle data, although maybe not by much.
> expose mapSideCombine in aggByKey/reduceByKey/foldByKey
> ---
>
> Key: SPARK-31948
> URL: https://issues.apache.org/jira/browse/SPARK-31948
> Project: Spark
> Issue Type: Improvement
> Components: ML, Spark Core
> Affects Versions: 3.1.0
> Reporter: zhengruifeng
> Priority: Minor
>
> 1. {{aggregateByKey}}, {{reduceByKey}} and {{foldByKey}} will always perform {{mapSideCombine}};
> however, this can sometimes be skipped, especially in ML (RobustScaler):
> {code:java}
> vectors.mapPartitions { iter =>
>   if (iter.hasNext) {
>     val summaries = Array.fill(numFeatures)(
>       new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, relativeError))
>     while (iter.hasNext) {
>       val vec = iter.next
>       vec.foreach { (i, v) => if (!v.isNaN) summaries(i) = summaries(i).insert(v) }
>     }
>     Iterator.tabulate(numFeatures)(i => (i, summaries(i).compress))
>   } else Iterator.empty
> }.reduceByKey { case (s1, s2) => s1.merge(s2) }
> {code}
>
> This {{reduceByKey}} in {{RobustScaler}} does not need {{mapSideCombine}} at all; similar places exist in {{KMeans}}, {{GMM}}, etc.
> To my knowledge, we do not need {{mapSideCombine}} if the reduction factor isn't high.
>
> 2. {{treeAggregate}} and {{treeReduce}} are based on {{foldByKey}}; the {{mapSideCombine}} in the first call of {{foldByKey}} can also be avoided.
>
> SPARK-772:
> {quote}
> Map side combine in group by key case does not reduce the amount of data shuffled. Instead, it forces a lot more objects to go into old gen, and leads to worse GC.
> {quote}
>
> So what about:
> 1.
exposing mapSideCombine in {{aggByKey}}/{{reduceByKey}}/{{foldByKey}}, so that users can disable unnecessary mapSideCombine
> 2. disabling the {{mapSideCombine}} in the first call of {{foldByKey}} in {{treeAggregate}} and {{treeReduce}}
> 3. disabling the unnecessary {{mapSideCombine}} in ML;
> Friendly ping [~srowen] [~huaxingao] [~weichenxu123] [~hyukjin.kwon] [~viirya]
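To make the reduction-factor argument above concrete, here is a small pure-Python simulation (not Spark code; the function and data are hypothetical) counting how many records would reach the shuffle with and without a map-side combine. When every key is already distinct within a partition, as in the RobustScaler case, combining removes nothing:

```python
# Hypothetical simulation (plain Python, not Spark): count the records that
# would be written to the shuffle with and without a map-side combine.
def shuffled_records(partitions, map_side_combine):
    total = 0
    for part in partitions:
        if map_side_combine:
            combined = {}
            for key, value in part:
                # pre-aggregate values per key within the partition
                combined[key] = combined.get(key, 0) + value
            total += len(combined)
        else:
            total += len(part)
    return total

# RobustScaler-like case: each partition emits one record per feature key,
# so every key is already distinct and combining is a no-op.
distinct = [[(f, 1) for f in range(4)] for _ in range(3)]
assert shuffled_records(distinct, True) == shuffled_records(distinct, False) == 12

# High-reduction case: many duplicate keys per partition, combining pays off.
skewed = [[(0, 1)] * 100 for _ in range(3)]
assert shuffled_records(skewed, True) == 3
assert shuffled_records(skewed, False) == 300
```

The simulation only counts records; it ignores the GC cost of building the per-partition hash map, which is the overhead SPARK-772 cites for the low-reduction case.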
[jira] [Issue Comment Deleted] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YoungGyu Chun updated SPARK-31955: -- Comment: was deleted (was: Thanks, I will work on this) > Beeline discard the last line of the sql file when submited to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Blocker > > I submitted a sql file on beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line.This should be beeline's bug parsing sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31959) Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian to Julian"
Maxim Gekk created SPARK-31959: -- Summary: Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian to Julian" Key: SPARK-31959 URL: https://issues.apache.org/jira/browse/SPARK-31959 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.1.0 Reporter: Maxim Gekk See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123688/testReport/org.apache.spark.sql.catalyst.util/RebaseDateTimeSuite/optimization_of_micros_rebasing___Gregorian_to_Julian/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132566#comment-17132566 ] YoungGyu Chun commented on SPARK-31955: --- As a workaround, try adding a space between the select statement and the where statement and putting them on a single line: select * from info_dev.beeline_test where name='bbb'
> Beeline discard the last line of the sql file when submited to thriftserver
> via beeline
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.4
> Reporter: Lin Gang Deng
> Priority: Blocker
>
> I submitted a sql file on beeline and the result returned is wrong. After many tests, it was found that the sql executed by Spark would discard the last line. This should be beeline's bug parsing sql file.
[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132549#comment-17132549 ] YoungGyu Chun commented on SPARK-31955: --- Thanks, I will work on this > Beeline discard the last line of the sql file when submited to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Blocker > > I submitted a sql file on beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line.This should be beeline's bug parsing sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31890) Decommissioning Test Does not wait long enough for executors to launch
[ https://issues.apache.org/jira/browse/SPARK-31890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-31890. -- Fix Version/s: 3.1.0 Resolution: Fixed Fixed in one of the decom PRs > Decommissioning Test Does not wait long enough for executors to launch > -- > > Key: SPARK-31890 > URL: https://issues.apache.org/jira/browse/SPARK-31890 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > Fix For: 3.1.0 > > > I've seen some decommissioning test failures where the initial executors are > not launching in time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130795#comment-17130795 ] Lin Gang Deng edited comment on SPARK-31955 at 6/10/20, 3:29 PM: -
{code:java}
0: jdbc:hive2://hadoop.spark-sql.hadoo> select * from info_dev.beeline_test;
+-----+-------+
| id  | name  |
+-----+-------+
| 3   | ccc   |
| 2   | bbb   |
| 1   | aaa   |
+-----+-------+
3 rows selected (1.402 seconds)
{code}
Then the SQL file is as follows:
{code:java}
[test@192.168.0.1 denglg]$ cat -A test2.sql
select * from info_dev.beeline_test$
where name='bbb';[test@192.168.0.1 denglg]$
{code}
The result is as follows:
{code:java}
0: jdbc:hive2://spark-sql.hadoo> select * from info_dev.beeline_test
0: jdbc:hive2://spark-sql.hadoo> where name='bbb';
+-----+-------+
| id  | name  |
+-----+-------+
| 3   | ccc   |
| 2   | bbb   |
| 1   | aaa   |
+-----+-------+
3 rows selected (1.594 seconds)
{code}
As you can see, it got the wrong result.
> Beeline discard the last line of the sql file when submited to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Blocker > > I submitted a sql file on beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line.This should be beeline's bug parsing sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130795#comment-17130795 ] Lin Gang Deng commented on SPARK-31955: ---
{code:java}
0: jdbc:hive2://hadoop.spark-sql.hadoo> select * from info_dev.beeline_test;
+-----+-------+
| id  | name  |
+-----+-------+
| 3   | ccc   |
| 2   | bbb   |
| 1   | aaa   |
+-----+-------+
3 rows selected (1.402 seconds)
{code}
Then the SQL file is as follows:
{code:java}
[test@192.168.0.1 denglg]$ cat -A test2.sql
select * from info_dev.beeline_test$
where name='bbb';[test@192.168.0.1 denglg]$
{code}
The result is as follows:
{code:java}
0: jdbc:hive2://spark-sql.hadoo> select * from info_dev.beeline_test
0: jdbc:hive2://spark-sql.hadoo> where name='bbb';
+-----+-------+
| id  | name  |
+-----+-------+
| 3   | ccc   |
| 2   | bbb   |
| 1   | aaa   |
+-----+-------+
3 rows selected (1.594 seconds)
{code}
As you can see, it got the wrong result.
> Beeline discard the last line of the sql file when submited to thriftserver
> via beeline
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.4
> Reporter: Lin Gang Deng
> Priority: Blocker
>
> I submitted a sql file on beeline and the result returned is wrong. After many tests, it was found that the sql executed by Spark would discard the last line. This should be beeline's bug parsing sql file.
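In the `cat -A` repro above there is a `$` (end-of-line marker) after the first line but none after `where name='bbb';`, i.e. the file has no trailing newline. A minimal sketch of how that could lose the clause (a hypothetical reader in plain Python, not the actual beeline parser):

```python
# Hypothetical illustration (not the actual beeline code): a reader that
# only emits newline-terminated lines silently drops a final line that has
# no trailing newline.
def terminated_lines(text):
    lines, buffer = [], []
    for ch in text:
        if ch == "\n":
            lines.append("".join(buffer))
            buffer = []
        else:
            buffer.append(ch)
    return lines  # an unterminated tail left in `buffer` is discarded

sql = "select * from info_dev.beeline_test\nwhere name='bbb';"  # no trailing \n
assert terminated_lines(sql) == ["select * from info_dev.beeline_test"]

# With a trailing newline, the WHERE clause survives:
assert terminated_lines(sql + "\n")[-1] == "where name='bbb';"
```

If this is what happens, ending the file with a newline would be a workaround regardless of where the fix lands.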
[jira] [Commented] (SPARK-31958) normalize special floating numbers in subquery
[ https://issues.apache.org/jira/browse/SPARK-31958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130789#comment-17130789 ] Apache Spark commented on SPARK-31958: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/28785 > normalize special floating numbers in subquery > -- > > Key: SPARK-31958 > URL: https://issues.apache.org/jira/browse/SPARK-31958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31958) normalize special floating numbers in subquery
[ https://issues.apache.org/jira/browse/SPARK-31958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31958: Assignee: Apache Spark (was: Wenchen Fan) > normalize special floating numbers in subquery > -- > > Key: SPARK-31958 > URL: https://issues.apache.org/jira/browse/SPARK-31958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31958) normalize special floating numbers in subquery
[ https://issues.apache.org/jira/browse/SPARK-31958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31958: Assignee: Wenchen Fan (was: Apache Spark) > normalize special floating numbers in subquery > -- > > Key: SPARK-31958 > URL: https://issues.apache.org/jira/browse/SPARK-31958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31958) normalize special floating numbers in subquery
Wenchen Fan created SPARK-31958: --- Summary: normalize special floating numbers in subquery Key: SPARK-31958 URL: https://issues.apache.org/jira/browse/SPARK-31958 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
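The ticket above carries no description, but as background on why special floating-point values need normalizing before equality-based operations such as grouping and joins, here is a plain-Python illustration (IEEE-754 behavior, not Spark code):

```python
import struct

# IEEE-754 quirks that motivate canonicalizing special floats before
# equality-based operations (grouping, joins, subquery comparisons).
nan1, nan2 = float("nan"), float("nan")
assert nan1 != nan2                 # NaN never compares equal to NaN

assert 0.0 == -0.0                  # semantically equal...
def bits(x):
    return struct.pack("<d", x)     # raw 64-bit representation
assert bits(0.0) != bits(-0.0)      # ...but with different bit patterns

# So a bit-level comparison disagrees with semantic equality in both
# directions, and engines normalize NaN and -0.0 to one canonical form.
```

This is only the general motivation; where exactly the normalization rule must apply (e.g. inside subqueries) is what the ticket addresses.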
[jira] [Comment Edited] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130750#comment-17130750 ] YoungGyu Chun edited comment on SPARK-31955 at 6/10/20, 2:45 PM: - [~denglg] can you add screenshots or examples showing the last line being discarded, along with the SQL file you are testing?
> Beeline discard the last line of the sql file when submited to thriftserver
> via beeline
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.4
> Reporter: Lin Gang Deng
> Priority: Blocker
>
> I submitted a sql file on beeline and the result returned is wrong. After many tests, it was found that the sql executed by Spark would discard the last line. This should be beeline's bug parsing sql file.