[jira] [Updated] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28264: - Labels: release-notes (was: )

> Revisiting Python / pandas UDF
> --
>
> Key: SPARK-28264
> URL: https://issues.apache.org/jira/browse/SPARK-28264
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 3.0.0
> Reporter: Reynold Xin
> Assignee: Hyukjin Kwon
> Priority: Blocker
> Labels: release-notes
> Fix For: 3.0.0
>
> In the past two years, pandas UDFs have perhaps been the most important change to Spark for Python data science. However, these functionalities have evolved organically, leading to inconsistencies and confusion among users. This document revisits UDF definition and naming, as a result of discussions among Xiangrui, Li Jin, Hyukjin, and Reynold.
> -See document here: [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]-
> New proposal: https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31948) expose mapSideCombine in aggByKey/reduceByKey/foldByKey
[ https://issues.apache.org/jira/browse/SPARK-31948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132930#comment-17132930 ] zhengruifeng commented on SPARK-31948: -- [~viirya] [~srowen] I will test whether there is a performance gain; it may benefit aggregation on a partially aggregated dataset. In ML impls, like RobustScaler above, there is nothing to combine on the map side, since each key is distinct within a partition. The same situation exists in impls like KMeans, BiKMeans, GMM, etc.

> expose mapSideCombine in aggByKey/reduceByKey/foldByKey
> ---
>
> Key: SPARK-31948
> URL: https://issues.apache.org/jira/browse/SPARK-31948
> Project: Spark
> Issue Type: Improvement
> Components: ML, Spark Core
> Affects Versions: 3.1.0
> Reporter: zhengruifeng
> Priority: Minor
>
> 1. {{aggregateByKey}}, {{reduceByKey}} and {{foldByKey}} will always perform {{mapSideCombine}};
> However, this can be skipped sometimes, especially in ML (RobustScaler):
> {code:java}
> vectors.mapPartitions { iter =>
>   if (iter.hasNext) {
>     val summaries = Array.fill(numFeatures)(
>       new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, relativeError))
>     while (iter.hasNext) {
>       val vec = iter.next
>       vec.foreach { (i, v) => if (!v.isNaN) summaries(i) = summaries(i).insert(v) }
>     }
>     Iterator.tabulate(numFeatures)(i => (i, summaries(i).compress))
>   } else Iterator.empty
> }.reduceByKey { case (s1, s2) => s1.merge(s2) }
> {code}
>
> This {{reduceByKey}} in {{RobustScaler}} does not need {{mapSideCombine}} at all; similar places exist in {{KMeans}}, {{GMM}}, etc.
> To my knowledge, we do not need {{mapSideCombine}} if the reduction factor isn't high.
>
> 2. {{treeAggregate}} and {{treeReduce}} are based on {{foldByKey}}; the {{mapSideCombine}} in the first call of {{foldByKey}} can also be avoided.
>
> SPARK-772:
> {quote}
> Map side combine in group by key case does not reduce the amount of data shuffled. Instead, it forces a lot more objects to go into old gen, and leads to worse GC.
> {quote}
>
> So what about:
> 1. exposing mapSideCombine in {{aggByKey}}/{{reduceByKey}}/{{foldByKey}}, so that users can disable unnecessary mapSideCombine
> 2. disabling the {{mapSideCombine}} in the first call of {{foldByKey}} in {{treeAggregate}} and {{treeReduce}}
> 3. disabling the unnecessary {{mapSideCombine}} in ML
> Friendly ping [~srowen] [~huaxingao] [~weichenxu123] [~hyukjin.kwon] [~viirya]
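The point about distinct keys can be made concrete with a minimal pure-Python sketch (not Spark API): a reduceByKey-like shuffle where map-side combining only shrinks the shuffle when a partition holds repeated keys. When every key is already distinct within each partition, as in RobustScaler's per-partition summaries, combining saves nothing and only adds work.

```python
from collections import defaultdict

def shuffle_size(partitions, map_side_combine):
    """Count the records that cross the shuffle boundary for a
    reduceByKey-like operation over a list of partitions of (key, value)."""
    shuffled = 0
    for part in partitions:
        if map_side_combine:
            combined = defaultdict(int)
            for k, v in part:
                combined[k] += v          # pre-aggregate per key on the map side
            shuffled += len(combined)     # one record per distinct key
        else:
            shuffled += len(part)         # every record is shuffled as-is
    return shuffled

# Repeated keys per partition: combining cuts the shuffle from 6 to 2 records.
skewed = [[("a", 1)] * 3, [("a", 1)] * 3]
assert shuffle_size(skewed, map_side_combine=True) == 2
assert shuffle_size(skewed, map_side_combine=False) == 6

# Keys already distinct within each partition (the RobustScaler case):
# combining shuffles exactly as many records, so it is pure overhead.
distinct = [[(0, 1), (1, 1), (2, 1)], [(0, 1), (1, 1), (2, 1)]]
assert shuffle_size(distinct, map_side_combine=True) == 6
assert shuffle_size(distinct, map_side_combine=False) == 6
```

In Spark itself, `combineByKeyWithClassTag` already exposes a `mapSideCombine` flag, while `reduceByKey`/`aggregateByKey`/`foldByKey` hard-code it to true; exposing that flag is what this ticket proposes.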
[jira] [Resolved] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31966. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28798 [https://github.com/apache/spark/pull/28798] > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-31966 > URL: https://issues.apache.org/jira/browse/SPARK-31966 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
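The traceback above comes from a polling helper, `_eventually`, that retries a condition until a timeout. A standalone sketch of that shape (the helper name and message format are taken from the traceback; the body here is illustrative, not PySpark's actual code):

```python
import time

def eventually(condition, timeout=30.0, interval=0.1):
    """Poll `condition` until it returns True or the timeout elapses,
    then raise with the last value the condition returned."""
    deadline = time.monotonic() + timeout
    last = None
    while time.monotonic() < deadline:
        last = condition()
        if last is True:
            return
        time.sleep(interval)
    raise AssertionError(
        "Test failed due to timeout after %g sec, with last condition "
        "returning: %s" % (timeout, last))

# A condition that succeeds on the third poll:
state = {"calls": 0}
def improves_enough():
    state["calls"] += 1
    return state["calls"] >= 3

eventually(improves_enough, timeout=5.0, interval=0.01)
assert state["calls"] == 3
```

A fixed wall-clock timeout like this is exactly why the test flakes under heavy Jenkins load: the model may still be improving, just not within 30 seconds.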
[jira] [Updated] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
[ https://issues.apache.org/jira/browse/SPARK-31964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31964: -- Issue Type: Improvement (was: Bug)

> Avoid Pandas import for CategoricalDtype with Arrow conversion
> --
>
> Key: SPARK-31964
> URL: https://issues.apache.org/jira/browse/SPARK-31964
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.1.0
> Reporter: Bryan Cutler
> Assignee: Bryan Cutler
> Priority: Minor
> Fix For: 3.1.0
>
> The import location for CategoricalDtype changed in Pandas between 0.23 and 1.0, and PySpark currently checks two places for the import. It would be better to check the type as a string and avoid any imports.
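A sketch of the string-based check described above, assuming the idea is to compare the type's name rather than import `CategoricalDtype` from either pandas location (the stand-in class below is only for illustration; a real pandas `CategoricalDtype` instance would match the same way):

```python
def is_categorical_dtype(dtype):
    """Detect pandas' CategoricalDtype without importing pandas.

    Comparing the type name as a string avoids depending on where
    CategoricalDtype lives, which moved between pandas 0.23 and 1.0.
    """
    return type(dtype).__name__ == "CategoricalDtype"

# Hypothetical stand-in so the sketch runs without pandas installed.
class CategoricalDtype:
    pass

assert is_categorical_dtype(CategoricalDtype())
assert not is_categorical_dtype("category")
assert not is_categorical_dtype(object())
```

The trade-off of a name check is that any class named `CategoricalDtype` matches, but for Arrow conversion paths that only ever see pandas dtypes this is harmless and removes the import entirely.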
[jira] [Assigned] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31965: - Assignee: Hyukjin Kwon

> Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
> --
>
> Key: SPARK-31965
> URL: https://issues.apache.org/jira/browse/SPARK-31965
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Tests
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
>
> If you do a plain package with sbt:
> {code}
> ./build/sbt -DskipTests -Phive-thriftserver clean package
> ./run-tests --python-executable=python3 --testname="pyspark.sql.udf UserDefinedFunction"
> {code}
> The doctests in registerJavaFunction and registerJavaUDAF fail because they require classes from the test compilation.
> We should skip them conditionally.
[jira] [Resolved] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31965. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28795 [https://github.com/apache/spark/pull/28795]

> Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
> --
>
> Key: SPARK-31965
> URL: https://issues.apache.org/jira/browse/SPARK-31965
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Tests
> Affects Versions: 2.4.6, 3.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
> Fix For: 3.0.0
>
> If you do a plain package with sbt:
> {code}
> ./build/sbt -DskipTests -Phive-thriftserver clean package
> ./run-tests --python-executable=python3 --testname="pyspark.sql.udf UserDefinedFunction"
> {code}
> The doctests in registerJavaFunction and registerJavaUDAF fail because they require classes from the test compilation.
> We should skip them conditionally.
[jira] [Commented] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132900#comment-17132900 ] Apache Spark commented on SPARK-31966: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28798 > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-31966 > URL: https://issues.apache.org/jira/browse/SPARK-31966 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31966: Assignee: Hyukjin Kwon (was: Apache Spark) > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-31966 > URL: https://issues.apache.org/jira/browse/SPARK-31966 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31966: Assignee: Apache Spark (was: Hyukjin Kwon) > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-31966 > URL: https://issues.apache.org/jira/browse/SPARK-31966 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31966: - Fix Version/s: (was: 3.0.0) > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-31966 > URL: https://issues.apache.org/jira/browse/SPARK-31966 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-31966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31966: - Description: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ {code} == FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) Test that the model improves on toy data with no. of batches -- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 367, in test_training_and_prediction self._eventually(condition) File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 78, in _eventually % (timeout, lastValue)) AssertionError: Test failed due to timeout after 30 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 -- Ran 13 tests in 185.051s FAILED (failures=1, skipped=1) {code} was: Looks this test is flaky https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console {code} == FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) Test that the model improves on toy data with no. 
of batches -- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 367, in test_training_and_prediction self._eventually(condition) File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 78, in _eventually % (timeout, lastValue)) AssertionError: Test failed due to timeout after 30 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 -- Ran 13 tests in 185.051s FAILED (failures=1, skipped=1) {code} This looks happening after increasing the parallelism in Jenkins to speed up. I am able to reproduce this manually when the resource usage is heavy with manual decrease of timeout. > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-31966 > URL: https://issues.apache.org/jira/browse/SPARK-31966 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/ > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. 
of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, >
[jira] [Created] (SPARK-31966) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
Hyukjin Kwon created SPARK-31966: Summary: Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction Key: SPARK-31966 URL: https://issues.apache.org/jira/browse/SPARK-31966 Project: Spark Issue Type: Test Components: MLlib, PySpark Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon Fix For: 3.0.0

This test looks flaky:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console

{code}
==
FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
Test that the model improves on toy data with no. of batches
--
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 367, in test_training_and_prediction
    self._eventually(condition)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 78, in _eventually
    % (timeout, lastValue))
AssertionError: Test failed due to timeout after 30 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74
--
Ran 13 tests in 185.051s

FAILED (failures=1, skipped=1)
{code}

This seems to happen after the parallelism in Jenkins was increased to speed up builds. I can reproduce it manually under heavy resource usage by manually decreasing the timeout.
[jira] [Updated] (SPARK-31915) Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-31915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31915: - Fix Version/s: 3.0.1 > Resolve the grouping column properly per the case sensitivity in grouped and > cogrouped pandas UDFs > -- > > Key: SPARK-31915 > URL: https://issues.apache.org/jira/browse/SPARK-31915 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > {code} > from pyspark.sql.functions import * > df = spark.createDataFrame([[1, 1]], ["column", "Score"]) > @pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP) > def my_pandas_udf(pdf): > return pdf.assign(Score=0.5) > df.groupby('COLUMN').apply(my_pandas_udf).show() > {code} > {code} > pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could > be: COLUMN, COLUMN.; > {code} > {code} > df1 = spark.createDataFrame([(1, 1)], ("column", "value")) > df2 = spark.createDataFrame([(1, 1)], ("column", "value")) > df1.groupby("COLUMN").cogroup( > df2.groupby("COLUMN") > ).applyInPandas(lambda r, l: r + l, df1.schema).show() > {code} > {code} > pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input > columns: [COLUMN, COLUMN, value, value];; > 'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], (column#9L, value#10L, > column#13L, value#14L), [column#22L, value#23L] > :- Project [COLUMN#9L, column#9L, value#10L] > : +- LogicalRDD [column#9L, value#10L], false > +- Project [COLUMN#13L, column#13L, value#14L] >+- LogicalRDD [column#13L, value#14L], false > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
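The two failures above share one root cause: the grouping key is resolved without honoring the session's case sensitivity, so `'COLUMN'` matches both the original column and the key the plan adds. A small pure-Python sketch of the intended resolution semantics (not Spark's analyzer code; names are illustrative):

```python
def resolve_column(name, columns, case_sensitive=False):
    """Resolve `name` against available `columns` the way an analyzer might.

    Under case-insensitive resolution, more than one candidate makes the
    reference ambiguous -- the failure mode in the ticket, where the
    grouping key and the data column both matched 'COLUMN'.
    """
    if case_sensitive:
        matches = [c for c in columns if c == name]
    else:
        matches = [c for c in columns if c.lower() == name.lower()]
    if not matches:
        raise ValueError("cannot resolve %r given input columns: %s" % (name, columns))
    if len(matches) > 1:
        raise ValueError("Reference %r is ambiguous, could be: %s" % (name, matches))
    return matches[0]

# Normal case: one case-insensitive match resolves cleanly.
assert resolve_column("COLUMN", ["column", "Score"]) == "column"

# The buggy plan effectively offered the same column twice.
try:
    resolve_column("COLUMN", ["COLUMN", "column", "value"])
    raise RuntimeError("expected ambiguity error")
except ValueError as e:
    assert "ambiguous" in str(e)
```

The fix is to resolve the grouping column once, against the child's output, using the configured case sensitivity, instead of re-adding a duplicate column to resolve against.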
[jira] [Commented] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132890#comment-17132890 ] Apache Spark commented on SPARK-31926: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/28797

> Fix concurrency issue for ThriftCLIService to getPortNumber
> ---
>
> Key: SPARK-31926
> URL: https://issues.apache.org/jira/browse/SPARK-31926
> Project: Spark
> Issue Type: Test
> Components: SQL
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Kent Yao
> Priority: Major
>
> When org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext is called, it starts ThriftCLIService in the background on a new thread. If we call ThriftCLIService.getPortNumber at the same time, we might not get the bound port if it is configured as 0.
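The race described above is the classic "read the port before the background thread has bound the socket" problem. A minimal Python sketch of one standard fix, waiting on an event that the server thread sets only after binding (this is illustrative; ThriftCLIService's actual code is Java and may solve it differently):

```python
import socket
import threading

class Server:
    """Bind a socket on a background thread; expose the port only after bind."""

    def __init__(self):
        self._bound = threading.Event()
        self.port = None

    def _run(self):
        sock = socket.socket()
        sock.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        self.port = sock.getsockname()[1]
        self._bound.set()             # signal that self.port is now valid
        sock.close()                  # a real server would serve here instead

    def start(self):
        threading.Thread(target=self._run, daemon=True).start()

    def get_port_number(self, timeout=5.0):
        # Without this wait, a caller racing the bind could read port None.
        if not self._bound.wait(timeout):
            raise TimeoutError("server did not bind in time")
        return self.port

s = Server()
s.start()
assert s.get_port_number() > 0
```

Reading `self.port` directly right after `start()` would intermittently return `None`, which is the same shape of flakiness this ticket fixes.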
[jira] [Assigned] (SPARK-31942) Revert SPARK-31864 Adjust AQE skew join trigger condition
[ https://issues.apache.org/jira/browse/SPARK-31942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31942: --- Assignee: Manu Zhang

> Revert SPARK-31864 Adjust AQE skew join trigger condition
> -
>
> Key: SPARK-31942
> URL: https://issues.apache.org/jira/browse/SPARK-31942
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Manu Zhang
> Assignee: Manu Zhang
> Priority: Minor
>
> As discussed in [https://github.com/apache/spark/pull/28669#issuecomment-641044531], revert SPARK-31864 so that the skew join optimization works for extremely clustered keys.
[jira] [Resolved] (SPARK-31942) Revert SPARK-31864 Adjust AQE skew join trigger condition
[ https://issues.apache.org/jira/browse/SPARK-31942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31942. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28770 [https://github.com/apache/spark/pull/28770]

> Revert SPARK-31864 Adjust AQE skew join trigger condition
> -
>
> Key: SPARK-31942
> URL: https://issues.apache.org/jira/browse/SPARK-31942
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Manu Zhang
> Assignee: Manu Zhang
> Priority: Minor
> Fix For: 3.1.0
>
> As discussed in [https://github.com/apache/spark/pull/28669#issuecomment-641044531], revert SPARK-31864 so that the skew join optimization works for extremely clustered keys.
[jira] [Assigned] (SPARK-31939) Fix Parsing day of year when year field pattern is missing
[ https://issues.apache.org/jira/browse/SPARK-31939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31939: --- Assignee: Kent Yao

> Fix Parsing day of year when year field pattern is missing
> --
>
> Key: SPARK-31939
> URL: https://issues.apache.org/jira/browse/SPARK-31939
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Minor
>
> If a datetime pattern contains no year field, the day-of-year field should not be ignored if it exists.
[jira] [Resolved] (SPARK-31939) Fix Parsing day of year when year field pattern is missing
[ https://issues.apache.org/jira/browse/SPARK-31939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31939. - Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Issue resolved by pull request 28766 [https://github.com/apache/spark/pull/28766] > Fix Parsing day of year when year field pattern is missing > -- > > Key: SPARK-31939 > URL: https://issues.apache.org/jira/browse/SPARK-31939 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > If a datetime pattern contains no year field, the day of year field should > not be ignored if it exists -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31935) Hadoop file system config should be effective in data source options
[ https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132882#comment-17132882 ] Apache Spark commented on SPARK-31935: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/28796 > Hadoop file system config should be effective in data source options > - > > Key: SPARK-31935 > URL: https://issues.apache.org/jira/browse/SPARK-31935 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1, 3.1.0, 2.4.7 >Reporter: Gengliang Wang >Priority: Major > > Data source options should be propagated into the hadoop configuration of > method `checkAndGlobPathIfNecessary` > From org.apache.hadoop.fs.FileSystem.java: > {code:java} > public static FileSystem get(URI uri, Configuration conf) throws > IOException { > String scheme = uri.getScheme(); > String authority = uri.getAuthority(); > if (scheme == null && authority == null) { // use default FS > return get(conf); > } > if (scheme != null && authority == null) { // no authority > URI defaultUri = getDefaultUri(conf); > if (scheme.equals(defaultUri.getScheme())// if scheme matches > default > && defaultUri.getAuthority() != null) { // & default has authority > return get(defaultUri, conf); // return default > } > } > > String disableCacheName = String.format("fs.%s.impl.disable.cache", > scheme); > if (conf.getBoolean(disableCacheName, false)) { > return createFileSystem(uri, conf); > } > return CACHE.get(uri, conf); > } > {code} > With this, we can specify URI schema and authority related configurations for > scanning file systems. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
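The fix proposed above amounts to overlaying the data source options onto the Hadoop configuration before `checkAndGlobPathIfNecessary` resolves paths, so that per-read settings (e.g. an `fs.s3a.*` key) take effect for that scan only. A minimal sketch of that merge, using a plain dict in place of Hadoop's `Configuration` (the helper name `effective_hadoop_conf` is illustrative, not Spark's actual code):

```python
def effective_hadoop_conf(base_conf, ds_options):
    """Overlay data source options onto a copy of the session-wide
    Hadoop configuration; the copy keeps the per-read settings from
    leaking into other scans."""
    conf = dict(base_conf)      # never mutate the shared config
    conf.update(ds_options)     # data source options win on conflict
    return conf

base = {"fs.defaultFS": "hdfs://nn:8020",
        "fs.s3a.endpoint": "s3.amazonaws.com"}
opts = {"fs.s3a.endpoint": "s3.eu-west-1.amazonaws.com"}
merged = effective_hadoop_conf(base, opts)
```

The resolved `FileSystem` for the scan would then be built from `merged`, while `base` remains untouched for every other reader.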
[jira] [Commented] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132878#comment-17132878 ] Apache Spark commented on SPARK-31965: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28795 > Run the tests in registerJavaFunction and registerJavaUDAF only when test > classes are compiled > -- > > Key: SPARK-31965 > URL: https://issues.apache.org/jira/browse/SPARK-31965 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.6, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > If you do a plain package with sbt: > {code} > ./build/sbt -DskipTests -Phive-thriftserver clean package > ./run-tests --python-executable=python3 --testname="pyspark.sql.udf > UserDefinedFunction" > {code} > The doctests in registerJavaFunction and registerJavaUDAF fail because they > require some classes from the test compilation. > We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132877#comment-17132877 ] Apache Spark commented on SPARK-31965: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28795 > Run the tests in registerJavaFunction and registerJavaUDAF only when test > classes are compiled > -- > > Key: SPARK-31965 > URL: https://issues.apache.org/jira/browse/SPARK-31965 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.6, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > If you do a plain package with sbt: > {code} > ./build/sbt -DskipTests -Phive-thriftserver clean package > ./run-tests --python-executable=python3 --testname="pyspark.sql.udf > UserDefinedFunction" > {code} > The doctests in registerJavaFunction and registerJavaUDAF fail because they > require some classes from the test compilation. > We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31965: Assignee: (was: Apache Spark) > Run the tests in registerJavaFunction and registerJavaUDAF only when test > classes are compiled > -- > > Key: SPARK-31965 > URL: https://issues.apache.org/jira/browse/SPARK-31965 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.6, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > If you do a plain package with sbt: > {code} > ./build/sbt -DskipTests -Phive-thriftserver clean package > ./run-tests --python-executable=python3 --testname="pyspark.sql.udf > UserDefinedFunction" > {code} > The doctests in registerJavaFunction and registerJavaUDAF fail because they > require some classes from the test compilation. > We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31965: Assignee: Apache Spark > Run the tests in registerJavaFunction and registerJavaUDAF only when test > classes are compiled > -- > > Key: SPARK-31965 > URL: https://issues.apache.org/jira/browse/SPARK-31965 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.6, 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > If you do a plain package with sbt: > {code} > ./build/sbt -DskipTests -Phive-thriftserver clean package > ./run-tests --python-executable=python3 --testname="pyspark.sql.udf > UserDefinedFunction" > {code} > The doctests in registerJavaFunction and registerJavaUDAF fail because they > require some classes from the test compilation. > We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31965: - Summary: Run the tests in registerJavaFunction and registerJavaUDAF only when test classes are compiled (was: Run the tests in registerJavaFunction and registerJavaUDAF if test classes are not compiled) > Run the tests in registerJavaFunction and registerJavaUDAF only when test > classes are compiled > -- > > Key: SPARK-31965 > URL: https://issues.apache.org/jira/browse/SPARK-31965 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.6, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > If you do a plain package with sbt: > {code} > ./build/sbt -DskipTests -Phive-thriftserver clean package > ./run-tests --python-executable=python3 --testname="pyspark.sql.udf > UserDefinedFunction" > {code} > The doctests in registerJavaFunction and registerJavaUDAF fail because they > require some classes from the test compilation. > We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31965) Run the tests in registerJavaFunction and registerJavaUDAF if test classes are not compiled
[ https://issues.apache.org/jira/browse/SPARK-31965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31965: - Summary: Run the tests in registerJavaFunction and registerJavaUDAF if test classes are not compiled (was: Skip tests in registerJavaFunction and registerJavaUDAF if test classes are not compiled) > Run the tests in registerJavaFunction and registerJavaUDAF if test classes > are not compiled > --- > > Key: SPARK-31965 > URL: https://issues.apache.org/jira/browse/SPARK-31965 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.6, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > If you do a plain package with sbt: > {code} > ./build/sbt -DskipTests -Phive-thriftserver clean package > ./run-tests --python-executable=python3 --testname="pyspark.sql.udf > UserDefinedFunction" > {code} > The doctests in registerJavaFunction and registerJavaUDAF fail because they > require some classes from the test compilation. > We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31965) Skip tests in registerJavaFunction and registerJavaUDAF if test classes are not compiled
Hyukjin Kwon created SPARK-31965: Summary: Skip tests in registerJavaFunction and registerJavaUDAF if test classes are not compiled Key: SPARK-31965 URL: https://issues.apache.org/jira/browse/SPARK-31965 Project: Spark Issue Type: Bug Components: PySpark, Tests Affects Versions: 2.4.6, 3.0.0 Reporter: Hyukjin Kwon If you do a plain package with sbt: {code} ./build/sbt -DskipTests -Phive-thriftserver clean package ./run-tests --python-executable=python3 --testname="pyspark.sql.udf UserDefinedFunction" {code} The doctests in registerJavaFunction and registerJavaUDAF fail because they require some classes from the test compilation. We should skip them conditionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31955) Beeline discards the last line of the sql file when submitted to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31955: Priority: Major (was: Blocker) > Beeline discards the last line of the sql file when submitted to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Major > > I submitted a sql file via beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line. This appears to be a bug in beeline's parsing of the sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31956) Do not fail if there is no ambiguous self join
[ https://issues.apache.org/jira/browse/SPARK-31956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132845#comment-17132845 ] Dongjoon Hyun commented on SPARK-31956: --- Thanks, [~kabhwan]. > Do not fail if there is no ambiguous self join > -- > > Key: SPARK-31956 > URL: https://issues.apache.org/jira/browse/SPARK-31956 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31956) Do not fail if there is no ambiguous self join
[ https://issues.apache.org/jira/browse/SPARK-31956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31956: -- Fix Version/s: (was: 3.0.0) 3.0.1 > Do not fail if there is no ambiguous self join > -- > > Key: SPARK-31956 > URL: https://issues.apache.org/jira/browse/SPARK-31956 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31926: Assignee: (was: Apache Spark) > Fix concurrency issue for ThriftCLIService to getPortNumber > --- > > Key: SPARK-31926 > URL: https://issues.apache.org/jira/browse/SPARK-31926 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > When > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext > is called, > it starts ThriftCLIService in the background with a new Thread; if at the same > time we call ThriftCLIService.getPortNumber, we might not get the bound port > if it's configured with 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
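The race described above is easy to reproduce with any listener started on a background thread: until `bind()` has run, there is no port to report when the service was configured with port 0. One common remedy, sketched here with a plain socket rather than the actual Thrift server (the `TinyServer` class is illustrative), is to publish the port only after signalling a "bound" event:

```python
import socket
import threading


class TinyServer:
    """Bind a listener on an ephemeral port in a background thread.
    The Event is set only after bind(), so get_port() can never
    observe the unbound state -- the race described for
    ThriftCLIService.getPortNumber."""

    def __init__(self):
        self._port = None
        self._bound = threading.Event()
        self._thread = threading.Thread(target=self._serve, daemon=True)

    def _serve(self):
        sock = socket.socket()
        sock.bind(("127.0.0.1", 0))   # 0 = let the OS pick a free port
        self._port = sock.getsockname()[1]
        self._bound.set()             # signal: the port number is now valid
        sock.listen(1)

    def start(self):
        self._thread.start()

    def get_port(self, timeout=5.0):
        if not self._bound.wait(timeout):
            raise TimeoutError("server did not bind in time")
        return self._port
```

Without the `Event`, a caller racing the background thread could read `None` instead of the bound port, which is the failure mode the issue reports.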
[jira] [Assigned] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31926: Assignee: Apache Spark > Fix concurrency issue for ThriftCLIService to getPortNumber > --- > > Key: SPARK-31926 > URL: https://issues.apache.org/jira/browse/SPARK-31926 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > When > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext > is called, > it starts ThriftCLIService in the background with a new Thread; if at the same > time we call ThriftCLIService.getPortNumber, we might not get the bound port > if it's configured with 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
[ https://issues.apache.org/jira/browse/SPARK-31964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31964: Assignee: Bryan Cutler > Avoid Pandas import for CategoricalDtype with Arrow conversion > -- > > Key: SPARK-31964 > URL: https://issues.apache.org/jira/browse/SPARK-31964 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > > The import for CategoricalDtype changed in Pandas from 0.23 to 1.0 and > currently pyspark checks 2 places for the import. It would be better to check the > type as a string and avoid any imports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
[ https://issues.apache.org/jira/browse/SPARK-31964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31964. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28793 [https://github.com/apache/spark/pull/28793] > Avoid Pandas import for CategoricalDtype with Arrow conversion > -- > > Key: SPARK-31964 > URL: https://issues.apache.org/jira/browse/SPARK-31964 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > Fix For: 3.1.0 > > > The import for CategoricalDtype changed in Pandas from 0.23 to 1.0 and > currently pyspark checks 2 places for the import. It would be better to check the > type as a string and avoid any imports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31955) Beeline discards the last line of the sql file when submitted to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132812#comment-17132812 ] Lin Gang Deng commented on SPARK-31955: --- [~younggyuchun] No results. {code:java} [test@192.168.0.1 denglg]$ cat -A test3.sql select * from info_dev.beeline_test where name='bbb'[test@192.168.0.1 denglg]$ {code} {code:java} 0: jdbc:hive2://spark-sql.hadoo> select * from info_dev.beeline_test where name='bbb' Closing: 0: jdbc:hive2://spark-sql.hadoop.srv:1/;principal=xxx?mapreduce.job.queuename=xxx [test@192.168.0.1 denglg]$ {code} > Beeline discards the last line of the sql file when submitted to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Blocker > > I submitted a sql file via beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line. This appears to be a bug in beeline's parsing of the sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
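The symptom in the comment above is consistent with a reader that only emits a statement once it sees a terminating newline: the `cat -A` output shows the file's last line has no trailing newline, so that statement is never flushed. A toy sketch of the buggy behaviour and the obvious fix (both functions are illustrative, not beeline's actual code):

```python
def read_statements_buggy(text):
    """Emit a line only when a newline terminates it; an unterminated
    last line sits in the buffer and is silently dropped -- matching
    the beeline symptom."""
    out, buf = [], ""
    for ch in text:
        if ch == "\n":
            out.append(buf)
            buf = ""
        else:
            buf += ch
    return out  # `buf` (the unterminated last line) is lost here


def read_statements_fixed(text):
    """Same scan, but flush the trailing buffer so the last line
    survives even without a final newline."""
    out, buf = [], ""
    for ch in text:
        if ch == "\n":
            out.append(buf)
            buf = ""
        else:
            buf += ch
    if buf:
        out.append(buf)
    return out
```

Appending a newline to the end of the sql file is the user-side workaround; the fix above is the reader-side one.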
[jira] [Commented] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132807#comment-17132807 ] Dongjoon Hyun commented on SPARK-31926: --- This is reverted because the Maven test fails consistently. > Fix concurrency issue for ThriftCLIService to getPortNumber > --- > > Key: SPARK-31926 > URL: https://issues.apache.org/jira/browse/SPARK-31926 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > When > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext > is called, > it starts ThriftCLIService in the background with a new Thread; if at the same > time we call ThriftCLIService.getPortNumber, we might not get the bound port > if it's configured with 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26905) Revisit reserved/non-reserved keywords based on the ANSI SQL standard
[ https://issues.apache.org/jira/browse/SPARK-26905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132806#comment-17132806 ] Takeshi Yamamuro commented on SPARK-26905: -- okay, I'll check if we can do so. > Revisit reserved/non-reserved keywords based on the ANSI SQL standard > - > > Key: SPARK-26905 > URL: https://issues.apache.org/jira/browse/SPARK-26905 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xiao Li >Priority: Major > Attachments: spark-ansiNonReserved.txt, spark-keywords-list.txt, > spark-nonReserved.txt, spark-strictNonReserved.txt, > sql2016-02-nonreserved.txt, sql2016-02-reserved.txt, > sql2016-09-nonreserved.txt, sql2016-09-reserved.txt, > sql2016-14-nonreserved.txt, sql2016-14-reserved.txt > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31926: - Assignee: (was: Kent Yao) > Fix concurrency issue for ThriftCLIService to getPortNumber > --- > > Key: SPARK-31926 > URL: https://issues.apache.org/jira/browse/SPARK-31926 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > When > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext > is called, > it starts ThriftCLIService in the background with a new Thread; if at the same > time we call ThriftCLIService.getPortNumber, we might not get the bound port > if it's configured with 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31926: -- Fix Version/s: (was: 3.0.1) (was: 3.1.0) > Fix concurrency issue for ThriftCLIService to getPortNumber > --- > > Key: SPARK-31926 > URL: https://issues.apache.org/jira/browse/SPARK-31926 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > When > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext > is called, > it starts ThriftCLIService in the background with a new Thread; if at the same > time we call ThriftCLIService.getPortNumber, we might not get the bound port > if it's configured with 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-31926) Fix concurrency issue for ThriftCLIService to getPortNumber
[ https://issues.apache.org/jira/browse/SPARK-31926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-31926: --- > Fix concurrency issue for ThriftCLIService to getPortNumber > --- > > Key: SPARK-31926 > URL: https://issues.apache.org/jira/browse/SPARK-31926 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > When > org.apache.spark.sql.hive.thriftserver.HiveThriftServer2#startWithContext > is called, > it starts ThriftCLIService in the background with a new Thread; if at the same > time we call ThriftCLIService.getPortNumber, we might not get the bound port > if it's configured with 0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
[ https://issues.apache.org/jira/browse/SPARK-31964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31964: Assignee: (was: Apache Spark) > Avoid Pandas import for CategoricalDtype with Arrow conversion > -- > > Key: SPARK-31964 > URL: https://issues.apache.org/jira/browse/SPARK-31964 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Bryan Cutler >Priority: Minor > > The import for CategoricalDtype changed in Pandas from 0.23 to 1.0 and > currently pyspark checks 2 places for the import. It would be better to check the > type as a string and avoid any imports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
[ https://issues.apache.org/jira/browse/SPARK-31964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132803#comment-17132803 ] Apache Spark commented on SPARK-31964: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/28793 > Avoid Pandas import for CategoricalDtype with Arrow conversion > -- > > Key: SPARK-31964 > URL: https://issues.apache.org/jira/browse/SPARK-31964 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Bryan Cutler >Priority: Minor > > The import for CategoricalDtype changed in Pandas from 0.23 to 1.0 and > currently pyspark checks 2 places for the import. It would be better to check the > type as a string and avoid any imports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
[ https://issues.apache.org/jira/browse/SPARK-31964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31964: Assignee: Apache Spark > Avoid Pandas import for CategoricalDtype with Arrow conversion > -- > > Key: SPARK-31964 > URL: https://issues.apache.org/jira/browse/SPARK-31964 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Minor > > The import for CategoricalDtype changed in Pandas from 0.23 to 1.0 and > currently pyspark checks 2 places for the import. It would be better to check the > type as a string and avoid any imports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31964) Avoid Pandas import for CategoricalDtype with Arrow conversion
Bryan Cutler created SPARK-31964: Summary: Avoid Pandas import for CategoricalDtype with Arrow conversion Key: SPARK-31964 URL: https://issues.apache.org/jira/browse/SPARK-31964 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.1.0 Reporter: Bryan Cutler The import for CategoricalDtype changed in Pandas from 0.23 to 1.0 and currently pyspark checks 2 places for the import. It would be better to check the type as a string and avoid any imports. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
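"Check the type as a string" here means comparing the class name rather than importing `CategoricalDtype` from either of its historical module paths. A minimal sketch of that idea (the stand-in class exists only so the example runs without pandas installed):

```python
def is_categorical_dtype_name(dtype):
    """True when `dtype` looks like pandas' CategoricalDtype,
    identified by class name alone -- no pandas import, so the
    0.23 vs 1.0 module-path difference never matters."""
    return type(dtype).__name__ == "CategoricalDtype"


class CategoricalDtype:
    """Stand-in for pandas.CategoricalDtype, for demonstration only."""
    pass
```

The trade-off of a name-based check is that any class named `CategoricalDtype` matches, which is acceptable when the value is known to come from a pandas DataFrame's dtypes.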
[jira] [Commented] (SPARK-31935) Hadoop file system config should be effective in data source options
[ https://issues.apache.org/jira/browse/SPARK-31935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132788#comment-17132788 ] Apache Spark commented on SPARK-31935: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28791 > Hadoop file system config should be effective in data source options > - > > Key: SPARK-31935 > URL: https://issues.apache.org/jira/browse/SPARK-31935 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1, 3.1.0, 2.4.7 >Reporter: Gengliang Wang >Priority: Major > > Data source options should be propagated into the hadoop configuration of > method `checkAndGlobPathIfNecessary` > From org.apache.hadoop.fs.FileSystem.java: > {code:java} > public static FileSystem get(URI uri, Configuration conf) throws > IOException { > String scheme = uri.getScheme(); > String authority = uri.getAuthority(); > if (scheme == null && authority == null) { // use default FS > return get(conf); > } > if (scheme != null && authority == null) { // no authority > URI defaultUri = getDefaultUri(conf); > if (scheme.equals(defaultUri.getScheme())// if scheme matches > default > && defaultUri.getAuthority() != null) { // & default has authority > return get(defaultUri, conf); // return default > } > } > > String disableCacheName = String.format("fs.%s.impl.disable.cache", > scheme); > if (conf.getBoolean(disableCacheName, false)) { > return createFileSystem(uri, conf); > } > return CACHE.get(uri, conf); > } > {code} > With this, we can specify URI schema and authority related configurations for > scanning file systems. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25186) Stabilize Data Source V2 API
[ https://issues.apache.org/jira/browse/SPARK-25186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132780#comment-17132780 ] Jeremy Bloom edited comment on SPARK-25186 at 6/10/20, 11:18 PM: - I'm not sure whether I'm in the right place, but I want to discuss my experience using the DataSource V2 API (using Spark 2.4.5). To be clear, I'm not a Spark contributor, but I am using Spark in several applications. If this is not the appropriate place for my comments, please direct me to the right place. My current project is part of an open source program called COIN-OR (I can provide links if you like). Among other things, we are building a data exchange standard for mathematical optimization solvers, called MOSDEX. I am using Spark SQL as the backbone of this application. I work mostly in Java, some in Python, and not at all in Scala. I came to this point because a critical link in my application is to load data from JSON into a Spark Dataset. For various reasons, the standard JSON reader in Spark is not adequate for my purposes. I want to use a Java Stream to load the data into a Spark dataframe. It turns out that this is remarkably hard to do. SparkSession provides a createDataFrame method whose argument is a Java List. However, because I am dealing with potentially very large data sets, I do not want to create a list in memory. My first attempt was to wrap the stream in a "virtual list" that fulfilled most of the list "contract" but did not actually create a list object. The hang-up with this approach is that it cannot count the elements in the stream unless and until the stream has been read, so the size of the list cannot be determined until that happens. Using the virtual list in the createDataFrame method sometimes succeeds but fails at other times, when some component of Spark needs to call the size method on the list. I raised this question on Stack Overflow, and someone suggested using DataSource V2.
There isn't much out there, but I did find a couple of examples, and based on them, I was able to build a reader and a writer for Java Streams. However, a number of issues arose that forced me to use some "unsavory" tricks in my code that I am uncomfortable with (they "smell" in Fowler's sense). Here are some of the issues I found. 1) DataSource V2 still relies on the calling sequences of DataFrameReader and DataFrameWriter from V1. The reader is instantiated by the SparkSession read method while the writer is instantiated by the Dataset write method. The separation is confusing. Why not use a static factory in Dataset to instantiate the reader? 2) The actual calls to the reader and writer are contained in the DataFrameReader and DataFrameWriter format methods. Using "format" for the name of this method is misleading, but once you realize what it is doing, it makes even less sense. The parameter of the format method is a string, which isn't documented, but which turns out to be a class name. Then format creates a class using reflection. Use of reflection in this circumstance appears to be unnecessary and restrictive. Why not make the parameter an instance of an appropriate interface, such as ReadSupport or WriteSupport? 3) Using reflection means that the reader or writer will be created by a parameterless constructor. Thus, it is not possible to pass any information to the reader and writer. In my case, I want to pass an instance of a Java Stream and a Spark schema to the reader, and a Spark dataset to the writer. The only way I have found to do that is to create static fields in the reader and writer classes and set them before calling the format method to create the reader and writer objects. That's what I mean about "unsavory" programming. 4) There are other, less critical issues, but these are still annoying. Links to external files in the option method are undocumented strings, but they look like paths. 
Since the Path class has been part of Java just about forever, why not use it? And when reading from or writing to a file, why not use the Java InputStream and OutputStream classes, which also have been around forever? I realize that programming style in Scala might be different from Java, but I can't believe that the design of DataSource V2 requires these contortions even in Scala. Given the power that is available in Spark, it seems to me that basic input and output should be much better structured than it is. Thanks for your attention to this matter.
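Points 2 and 3 above, loading a class from a name string via reflection and then smuggling state in through static fields, can be sketched in a few lines of Python. Everything here is hypothetical illustration of the pattern, not the actual DataSource V2 API:

```python
import importlib


class StreamSource:
    """Hypothetical reader class, standing in for a ReadSupport implementation.

    The loader below only receives a class-name string, so the constructor
    must take no arguments; state is smuggled in through a class-level
    (static) field -- the "unsavory" trick described above.
    """

    pending_stream = None  # static field, set before instantiation

    def __init__(self):  # parameterless, as reflection-style loading requires
        self.stream = StreamSource.pending_stream


def load_by_name(qualified_name):
    """Mimic format(name): resolve a class from a string and instantiate
    it with no constructor arguments."""
    module_name, _, cls_name = qualified_name.rpartition(".")
    cls = getattr(importlib.import_module(module_name), cls_name)
    return cls()  # parameterless construction


# The workaround: set the static field, then trigger the reflective load.
StreamSource.pending_stream = iter([1, 2, 3])
reader = load_by_name(f"{__name__}.StreamSource")
values = list(reader.stream)
print(values)
```

Accepting an interface instance instead of a class name would let callers pass the stream and schema through an ordinary constructor, which is the alternative suggested above.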
> Stabilize Data Source V2 API > - > > Key: SPARK-25186 > URL: https://issues.apache.org/jira/browse/SPARK-25186 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira
[jira] [Resolved] (SPARK-31915) Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-31915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-31915. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28777 [https://github.com/apache/spark/pull/28777] > Resolve the grouping column properly per the case sensitivity in grouped and > cogrouped pandas UDFs > -- > > Key: SPARK-31915 > URL: https://issues.apache.org/jira/browse/SPARK-31915 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > {code} > from pyspark.sql.functions import * > df = spark.createDataFrame([[1, 1]], ["column", "Score"]) > @pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP) > def my_pandas_udf(pdf): > return pdf.assign(Score=0.5) > df.groupby('COLUMN').apply(my_pandas_udf).show() > {code} > {code} > pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could > be: COLUMN, COLUMN.; > {code} > {code} > df1 = spark.createDataFrame([(1, 1)], ("column", "value")) > df2 = spark.createDataFrame([(1, 1)], ("column", "value")) > df1.groupby("COLUMN").cogroup( > df2.groupby("COLUMN") > ).applyInPandas(lambda r, l: r + l, df1.schema).show() > {code} > {code} > pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input > columns: [COLUMN, COLUMN, value, value];; > 'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], (column#9L, value#10L, > column#13L, value#14L), [column#22L, value#23L] > :- Project [COLUMN#9L, column#9L, value#10L] > : +- LogicalRDD [column#9L, value#10L], false > +- Project [COLUMN#13L, column#13L, value#14L] >+- LogicalRDD [column#13L, value#14L], false > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
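The root cause above is case-insensitive column resolution (Spark's default, spark.sql.caseSensitive=false) colliding with a plan that projects the grouping key twice. A minimal resolver sketch, not the actual Spark analyzer, shows both failure messages:

```python
def resolve_column(name, columns, case_sensitive=False):
    """Analyzer-style column resolution (illustrative sketch).

    With case_sensitive=False, 'COLUMN' should match the single field
    'column'; more than one match is reported as ambiguous, which is the
    symptom shown in the issue above.
    """
    if case_sensitive:
        matches = [c for c in columns if c == name]
    else:
        matches = [c for c in columns if c.lower() == name.lower()]
    if not matches:
        raise ValueError(f"cannot resolve '{name}' given input columns: {columns}")
    if len(matches) > 1:
        raise ValueError(f"Reference '{name}' is ambiguous, could be: {', '.join(matches)}")
    return matches[0]


# The intended behavior: 'COLUMN' resolves to the lone 'column' field.
print(resolve_column("COLUMN", ["column", "Score"]))

# The buggy plan effectively projected the grouping key a second time:
try:
    resolve_column("COLUMN", ["COLUMN", "column", "value"])
except ValueError as exc:
    print(exc)
```

The fix is to resolve the grouping column against the child's output once, respecting the case-sensitivity setting, rather than re-projecting it under a differently-cased name.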
[jira] [Assigned] (SPARK-31915) Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-31915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-31915: Assignee: Hyukjin Kwon > Resolve the grouping column properly per the case sensitivity in grouped and > cogrouped pandas UDFs > -- > > Key: SPARK-31915 > URL: https://issues.apache.org/jira/browse/SPARK-31915 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > {code} > from pyspark.sql.functions import * > df = spark.createDataFrame([[1, 1]], ["column", "Score"]) > @pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP) > def my_pandas_udf(pdf): > return pdf.assign(Score=0.5) > df.groupby('COLUMN').apply(my_pandas_udf).show() > {code} > {code} > pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could > be: COLUMN, COLUMN.; > {code} > {code} > df1 = spark.createDataFrame([(1, 1)], ("column", "value")) > df2 = spark.createDataFrame([(1, 1)], ("column", "value")) > df1.groupby("COLUMN").cogroup( > df2.groupby("COLUMN") > ).applyInPandas(lambda r, l: r + l, df1.schema).show() > {code} > {code} > pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input > columns: [COLUMN, COLUMN, value, value];; > 'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], (column#9L, value#10L, > column#13L, value#14L), [column#22L, value#23L] > :- Project [COLUMN#9L, column#9L, value#10L] > : +- LogicalRDD [column#9L, value#10L], false > +- Project [COLUMN#13L, column#13L, value#14L] >+- LogicalRDD [column#13L, value#14L], false > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28199) Move Trigger implementations to Triggers.scala and avoid exposing these to the end users
[ https://issues.apache.org/jira/browse/SPARK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132756#comment-17132756 ] Apache Spark commented on SPARK-28199: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/28790 > Move Trigger implementations to Triggers.scala and avoid exposing these to > the end users > > > Key: SPARK-28199 > URL: https://issues.apache.org/jira/browse/SPARK-28199 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > Even though ProcessingTime was deprecated in 2.2.0, it is still used in the > Spark codebase, and the alternative Spark proposes itself uses deprecated > methods, which feels circular: the usage could never be removed. > In fact, ProcessingTime is deprecated because we want to expose only > Trigger.xxx instead of exposing actual implementations, and I think we missed > some other implementations as well. > This issue aims to move all Trigger implementations to Triggers.scala and > hide them from end users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
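The "expose only Trigger.xxx" idea above is a facade over private implementation classes. A minimal sketch of the pattern (illustrative names, not Spark's actual Triggers.scala):

```python
class _ProcessingTimeTrigger:
    """Module-private implementation; end users should never name this type
    directly, so it can be moved or removed without breaking public API."""

    def __init__(self, interval_ms):
        self.interval_ms = interval_ms


class Trigger:
    """Facade sketch: expose only Trigger.xxx factory methods and keep the
    concrete trigger classes hidden, as the issue above proposes."""

    @staticmethod
    def processing_time(interval_ms):
        return _ProcessingTimeTrigger(interval_ms)


# Callers depend only on the factory, never on the concrete class.
trigger = Trigger.processing_time(1000)
print(type(trigger).__name__, trigger.interval_ms)
```

Because callers never reference the concrete class, deprecating or relocating it (the circular-deprecation problem described above) becomes a private refactor.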
[jira] [Comment Edited] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132743#comment-17132743 ] Everett Rush edited comment on SPARK-27249 at 6/10/20, 9:52 PM: I checked out the code and the design. Thanks for the great effort, but this doesn't quite meet the need. The transformer contract can still be met with a transform function that returns a Dataframe. I would like something more like this. {code:java} @DeveloperApi abstract class MultiColumnTransformer[T <: MultiColumnTransformer[T]] extends Transformer with HasOutputCol with Logging { def setOutputCol(value: String): T = set(outputCol, value).asInstanceOf[T] /** Returns the data type of the output column. */ protected def outputDataType: DataType protected def transformFunc: Iterator[Row] => Iterator[Row] override def transformSchema(schema: StructType): StructType = { if (schema.fieldNames.contains($(outputCol))) { throw new IllegalArgumentException(s"Output column ${$(outputCol)} already exists.") } val outputFields = schema.fields :+ StructField($(outputCol), outputDataType, nullable = false) StructType(outputFields) } def transform(dataset: DataFrame, targetSchema: StructType): DataFrame = { val targetEncoder = RowEncoder(targetSchema) dataset.mapPartitions(transformFunc)(targetEncoder) } override def transform(dataset: Dataset[_]): DataFrame = { val dataframe = dataset.toDF() val targetSchema = transformSchema(dataframe.schema, logging = true) transform(dataframe, targetSchema) } override def copy(extra: ParamMap): T = defaultCopy(extra) } {code} > Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.1.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png > > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (i.e. UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize > repeatedly in a UDF such as a database connection. 
> > Design: > Abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row] > NB: This parallels the UnaryTransformer createTransformFunc method > > When developers subclass this transformer, they can provide their own schema > for the output Row in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively the developer can set > output Datatype and output col name. Then the PartitionTransformer class will > create a new schema, a row encoder, and execute the transformation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
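The per-partition design described above (a developer-supplied Iterator[Row] => Iterator[Row] function, so expensive resources are created once per partition rather than once per row) can be sketched as follows. All names are illustrative, not Spark API:

```python
def make_partition_transformer(transform_func):
    """Sketch of the PartitionTransformer idea above: the developer supplies
    an iterator-to-iterator function that is applied once per partition, so
    expensive objects such as a database connection are initialized per
    partition rather than per row."""

    def apply(partitions):  # partitions: an iterable of row-iterators
        for partition in partitions:
            yield list(transform_func(iter(partition)))

    return apply


def add_width_column(rows):
    connection = object()  # stand-in for a costly per-partition resource
    for row in rows:
        yield (*row, len(row))  # append a derived output column


transformer = make_partition_transformer(add_width_column)
result = list(transformer([[(1, 2)], [(3, 4), (5, 6)]]))
print(result)
```

In real Spark this corresponds to Dataset.mapPartitions with a row encoder built from the output schema, as in the Scala sketch above.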
[jira] [Commented] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
[ https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132742#comment-17132742 ] Jungtaek Lim commented on SPARK-31961: -- Great. Then just construct the config key you want to use, like "kafka." + ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG. If you'd rather not add the "kafka." prefix yourself, then please submit a PR to propose a fix. > Add a class in spark with all Kafka configuration key available as string > - > > Key: SPARK-31961 > URL: https://issues.apache.org/jira/browse/SPARK-31961 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Gunjan Kumar >Priority: Minor > Labels: kafka, sql, structured-streaming > > Add a class in spark with all Kafka configuration keys available as strings. > See the highlighted class which I want. > eg:- > Current code:- > val df_cluster1 = spark > .read > .format("kafka") > .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") > .option("subscribe", "topic1") > Expected code:- > val df_cluster1 = spark > .read > .format("kafka") > .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") > .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
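The suggestion above, reusing Kafka's own config constants and adding Spark's "kafka." prefix, looks like this in a small Python sketch. The constant value matches Kafka's real ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG; the helper name is illustrative:

```python
# Kafka's client library already exposes its config keys as constants
# (in Java, ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG == "bootstrap.servers");
# the idea is to reuse those and add the "kafka." prefix Spark expects.
BOOTSTRAP_SERVERS_CONFIG = "bootstrap.servers"  # value matches Kafka's constant


def kafka_option(key):
    """Prefix a Kafka client config key for the Spark Kafka source (sketch)."""
    return "kafka." + key


options = {
    kafka_option(BOOTSTRAP_SERVERS_CONFIG): "cluster1_host:cluster1_port",
    "subscribe": "topic1",  # Spark-specific options take no prefix
}
print(options)
```

This avoids hand-typed key strings without Spark having to mirror (and version-track) Kafka's entire config surface.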
[jira] [Assigned] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31963: - Assignee: William Hyun > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31963. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28789 [https://github.com/apache/spark/pull/28789] > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.1.0 > > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
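The fix described above amounts to importing CategoricalDtype from pandas.api.types, the public location that exists in both pandas 0.23 and 1.0, rather than from a module path that only some versions expose. A hedged sketch of a compatibility helper (the function name is illustrative; it returns None when pandas is absent so the sketch stays runnable anywhere):

```python
def get_categorical_dtype():
    """Return the CategoricalDtype class from its stable public location,
    pandas.api.types, which works across pandas 0.23 and 1.0.
    Returns None if pandas is not installed."""
    try:
        from pandas.api.types import CategoricalDtype
        return CategoricalDtype
    except ImportError:
        return None


dtype = get_categorical_dtype()
print(dtype)
```

Pinning imports to the documented pandas.api namespace is what keeps one code path working across both major versions.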
[jira] [Commented] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
[ https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132739#comment-17132739 ] Gunjan Kumar commented on SPARK-31961: -- No, Kafka users don't write down config keys like that; there are special classes, ProducerConfig and ConsumerConfig. [https://kafka.apache.org/22/javadoc/index.html?org/apache/kafka/clients/consumer/ConsumerConfig.html] > Add a class in spark with all Kafka configuration key available as string > - > > Key: SPARK-31961 > URL: https://issues.apache.org/jira/browse/SPARK-31961 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Gunjan Kumar >Priority: Minor > Labels: kafka, sql, structured-streaming > > Add a class in spark with all Kafka configuration keys available as strings. > See the highlighted class which I want. > eg:- > Current code:- > val df_cluster1 = spark > .read > .format("kafka") > .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") > .option("subscribe", "topic1") > Expected code:- > val df_cluster1 = spark > .read > .format("kafka") > .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") > .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
[ https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132736#comment-17132736 ] Jungtaek Lim edited comment on SPARK-31961 at 6/10/20, 9:37 PM: My point is that given they're Kafka configs, why are you requiring them from Spark? If Kafka provides such constants then you can just use those. If not, how could you tolerate the fact that Kafka users also have to write down config keys as strings? Either way, Kafka should provide it under that rationale, not Spark. > Add a class in spark with all Kafka configuration key available as string > - > > Key: SPARK-31961 > URL: https://issues.apache.org/jira/browse/SPARK-31961 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Gunjan Kumar >Priority: Minor > Labels: kafka, sql, structured-streaming > > Add a class in spark with all Kafka configuration keys available as strings. > See the highlighted class which I want. > eg:- > Current code:- > val df_cluster1 = spark > .read > .format("kafka") > .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") > .option("subscribe", "topic1") > Expected code:- > val df_cluster1 = spark > .read > .format("kafka") > .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") > .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
[ https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132725#comment-17132725 ] Gunjan Kumar commented on SPARK-31961: -- Spark will not be tied to a Kafka version, as these property keys never change across Kafka versions. In the current scenario, suppose I want to set a poll timeout but I don't know the property name for it; then I have to go to the Kafka docs, search for that property, and paste it into option(). This is not how a programmer should code :) > Add a class in spark with all Kafka configuration key available as string > - > > Key: SPARK-31961 > URL: https://issues.apache.org/jira/browse/SPARK-31961 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Gunjan Kumar >Priority: Minor > Labels: kafka, sql, structured-streaming > > Add a class in spark with all Kafka configuration keys available as strings. > See the highlighted class which I want. > eg:- > Current code:- > val df_cluster1 = spark > .read > .format("kafka") > .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") > .option("subscribe", "topic1") > Expected code:- > val df_cluster1 = spark > .read > .format("kafka") > .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") > .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31963: -- Component/s: SQL > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132714#comment-17132714 ] Apache Spark commented on SPARK-31963: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28789 > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31963: Assignee: Apache Spark > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: Apache Spark >Priority: Major > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31963: Assignee: (was: Apache Spark) > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31963) Support both pandas 0.23 and 1.0
[ https://issues.apache.org/jira/browse/SPARK-31963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132713#comment-17132713 ] Apache Spark commented on SPARK-31963: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28789 > Support both pandas 0.23 and 1.0 > > > Key: SPARK-31963 > URL: https://issues.apache.org/jira/browse/SPARK-31963 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > > This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. > {code} > $ pip install pandas==0.23.2 > $ python -c "import pandas.CategoricalDtype" > Traceback (most recent call last): > File "", line 1, in > ModuleNotFoundError: No module named 'pandas.CategoricalDtype' > $ python -c "from pandas.api.types import CategoricalDtype" > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31963) Support both pandas 0.23 and 1.0
William Hyun created SPARK-31963: Summary: Support both pandas 0.23 and 1.0 Key: SPARK-31963 URL: https://issues.apache.org/jira/browse/SPARK-31963 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.1.0 Reporter: William Hyun This issue aims to fix a bug by supporting both pandas 0.23 and 1.0. {code} $ pip install pandas==0.23.2 $ python -c "import pandas.CategoricalDtype" Traceback (most recent call last): File "", line 1, in ModuleNotFoundError: No module named 'pandas.CategoricalDtype' $ python -c "from pandas.api.types import CategoricalDtype" {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
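The failing and working imports above suggest the general fix: resolve a symbol by trying several dotted paths in order, preferring the stable public path. A minimal, generic sketch of that pattern; the helper name is ours, and the pandas paths in the comment are shown only as an illustration of how it would be used, not as a claim about Spark's actual patch:

```python
import importlib


def import_from_candidates(candidates):
    """Return the first attribute importable from a list of
    (module_name, attribute_name) candidates; raise ImportError if none work."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError):
            continue
    raise ImportError(f"none of the candidates resolved: {candidates}")


# Hypothetical usage for the pandas case (paths illustrative only):
# CategoricalDtype = import_from_candidates([
#     ("pandas.api.types", "CategoricalDtype"),  # stable public path, 0.23+
# ])
```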
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Description: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical delta files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], there is a method, _checkAndGlobPathIfNecessary,_ which appears to create an in-memory index of files for a given path. There may be a rather clean opportunity to consider options here. Having the ability to provide an option specifying a timestamp by which to begin globbing files would mean considerably less complexity for a consumer who leverages the ability to stream from a folder path but does not have an interest in reading what could be thousands of files that are not relevant. One example could be "createdFileTime", accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("createdFileTime", "2020-05-01 00:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been created at or later than the specified time in order to be consumed for purposes of reading the files in general or for purposes of structured streaming. 
was: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical delta files in CSV format. When I start reading from a folder, however, I might only care about files were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], there is a method, _checkAndGlobPathIfNecessary,_ which appears create an in-memory index of files for a given path. There may a rather clean opportunity to consider options here. Having the ability to provide an option specifying a timestamp by which to begin globbing files would result in quite a bit of less complexity needed on a consumer who leverages the ability to stream from a folder path but does not have an interest in reading what could be thousands of files that are not relevant. One example to could be "createdFileTime" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("createdFileTime", "2020-05-01 00:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have a created been created at or later than the specified time in order to be consumed for purposes of reading the files in general or for purposes of structured streaming. 
> Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming with a FileDataSource, I've encountered a > number of occasions where I want to be able to stream from a folder > containing any number of historical delta files in CSV format. When I start > reading from a folder, however, I might only care about files were created > after a certain time. > {code:java} > spark.readStream > .option("header", "true") > .option("delimiter", "\t") > .format("csv") > .load("/mnt/Deltas") > {code} > > In > [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], > there is a method, _checkAndGlobPathIfNecessary,_ which appears create an > in-memory index of files for a given path. There may a rather clean > opportunity to
[jira] [Commented] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132708#comment-17132708 ] Christopher Highman commented on SPARK-31962: - I have a good, working version of this for our internal purposes and would love to contribute it back. > Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming with a FileDataSource, I've encountered a > number of occasions where I want to be able to stream from a folder > containing any number of historical delta files in CSV format. When I start > reading from a folder, however, I might only care about files were created > after a certain time. > {code:java} > spark.readStream > .option("header", "true") > .option("delimiter", "\t") > .format("csv") > .load("/mnt/Deltas") > {code} > > In > [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], > there is a method, _checkAndGlobPathIfNecessary,_ which appears create an > in-memory index of files for a given path. There may a rather clean > opportunity to consider options here. > Having the ability to provide an option specifying a timestamp by which to > begin globbing files would result in quite a bit of less complexity needed on > a consumer who leverages the ability to stream from a folder path but does > not have an interest in reading what could be thousands of files that are not > relevant. > One example to could be "createdFileTime" accepting a UTC datetime like below. 
> {code:java} > spark.readStream > .option("header", "true") > .option("delimiter", "\t") > .option("createdFileTime", "2020-05-01 00:00:00") > .format("csv") > .load("/mnt/Deltas") > {code} > > If this option is specified, the expected behavior would be that files within > the _"/mnt/Deltas/"_ path must have a created been created at or later than > the specified time in order to be consumed for purposes of reading the files > in general or for purposes of structured streaming. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
Christopher Highman created SPARK-31962: --- Summary: Provide option to load files after a specified date when reading from a folder path Key: SPARK-31962 URL: https://issues.apache.org/jira/browse/SPARK-31962 Project: Spark Issue Type: Improvement Components: SQL, Structured Streaming Affects Versions: 3.1.0 Reporter: Christopher Highman When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical delta files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas/") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], there is a method, _checkAndGlobPathIfNecessary,_ which appears to create an in-memory index of files for a given path. There may be a rather clean opportunity to consider options here. Having the ability to provide an option specifying a timestamp by which to begin globbing files would mean considerably less complexity for a consumer who leverages the ability to stream from a folder path but does not have an interest in reading what could be thousands of files that are not relevant. One example could be "createdFileTime", accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("createdFileTime", "2020-05-01 00:00:00") .format("csv") .load("/mnt/Deltas/") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been created at or later than the specified time in order to be consumed for purposes of reading the files in general or for purposes of structured streaming. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
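The proposed "createdFileTime" option boils down to filtering a directory listing by file timestamp before building the in-memory index. A rough sketch of that filter, assuming modification time (st_mtime) as a portable stand-in for creation time; the function name and behavior are illustrative, not Spark's implementation:

```python
import os
from datetime import datetime, timezone


def list_files_created_after(folder, cutoff_utc):
    """Yield paths of regular files under `folder` whose timestamp is at or
    after `cutoff_utc` (a naive datetime interpreted as UTC).
    Uses st_mtime as a proxy for creation time, which is the portable choice."""
    cutoff = cutoff_utc.replace(tzinfo=timezone.utc).timestamp()
    for entry in os.scandir(folder):
        if entry.is_file() and entry.stat().st_mtime >= cutoff:
            yield entry.path
```

With a cutoff of `datetime(2020, 5, 1)`, files whose timestamp predates 2020-05-01 00:00:00 UTC would simply never enter the index, which is the behavior the ticket asks for.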
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Description: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical delta files in CSV format. When I start reading from a folder, however, I might only care about files were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], there is a method, _checkAndGlobPathIfNecessary,_ which appears create an in-memory index of files for a given path. There may a rather clean opportunity to consider options here. Having the ability to provide an option specifying a timestamp by which to begin globbing files would result in quite a bit of less complexity needed on a consumer who leverages the ability to stream from a folder path but does not have an interest in reading what could be thousands of files that are not relevant. One example to could be "createdFileTime" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("createdFileTime", "2020-05-01 00:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have a created been created at or later than the specified time in order to be consumed for purposes of reading the files in general or for purposes of structured streaming. 
was: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical delta files in CSV format. When I start reading from a folder, however, I might only care about files were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas/") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], there is a method, _checkAndGlobPathIfNecessary,_ which appears create an in-memory index of files for a given path. There may a rather clean opportunity to consider options here. Having the ability to provide an option specifying a timestamp by which to begin globbing files would result in quite a bit of less complexity needed on a consumer who leverages the ability to stream from a folder path but does not have an interest in reading what could be thousands of files that are not relevant. One example to could be "createdFileTime" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("createdFileTime", "2020-05-01 00:00:00") .format("csv") .load("/mnt/Deltas/") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have a created been created at or later than the specified time in order to be consumed for purposes of reading the files in general or for purposes of structured streaming. 
> Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming with a FileDataSource, I've encountered a > number of occasions where I want to be able to stream from a folder > containing any number of historical delta files in CSV format. When I start > reading from a folder, however, I might only care about files were created > after a certain time. > {code:java} > spark.readStream > .option("header", "true") > .option("delimiter", "\t") > .format("csv") > .load("/mnt/Deltas") > {code} > > In > [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], > there is a method, _checkAndGlobPathIfNecessary,_ which appears create an > in-memory index of files for a given path. There may a rather clean >
[jira] [Commented] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
[ https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132707#comment-17132707 ] Jungtaek Lim commented on SPARK-31961: -- Don't you think the constants should be available in Kafka instead of Spark? Spark should try to avoid being tied to a specific version of Kafka in its codebase. If Kafka provided them, all you would need is to add the "kafka." prefix. > Add a class in spark with all Kafka configuration key available as string > - > > Key: SPARK-31961 > URL: https://issues.apache.org/jira/browse/SPARK-31961 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Gunjan Kumar >Priority: Minor > Labels: kafka, sql, structured-streaming > > Add a class in Spark with all Kafka configuration keys available as strings. > See the highlighted class that I want. > e.g.: > Current code: > val df_cluster1 = spark > .read > .format("kafka") > .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") > .option("subscribe", "topic1") > Expected code: > val df_cluster1 = spark > .read > .format("kafka") > .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") > .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
[ https://issues.apache.org/jira/browse/SPARK-31961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-31961: - Fix Version/s: (was: 2.4.7) > Add a class in spark with all Kafka configuration key available as string > - > > Key: SPARK-31961 > URL: https://issues.apache.org/jira/browse/SPARK-31961 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.4.6 >Reporter: Gunjan Kumar >Priority: Minor > Labels: kafka, sql, structured-streaming > > Add a class in Spark with all Kafka configuration keys available as strings. > See the highlighted class that I want. > e.g.: > Current code: > val df_cluster1 = spark > .read > .format("kafka") > .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") > .option("subscribe", "topic1") > Expected code: > val df_cluster1 = spark > .read > .format("kafka") > .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") > .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31961) Add a class in spark with all Kafka configuration key available as string
Gunjan Kumar created SPARK-31961: Summary: Add a class in spark with all Kafka configuration key available as string Key: SPARK-31961 URL: https://issues.apache.org/jira/browse/SPARK-31961 Project: Spark Issue Type: New Feature Components: SQL, Structured Streaming Affects Versions: 2.4.6 Reporter: Gunjan Kumar Fix For: 2.4.7 Add a class in Spark with all Kafka configuration keys available as strings. See the highlighted class that I want. e.g.: Current code: val df_cluster1 = spark .read .format("kafka") .option("kafka.bootstrap.servers", "cluster1_host:cluster1_port") .option("subscribe", "topic1") Expected code: val df_cluster1 = spark .read .format("kafka") .option(*KafkaConstantClass*.KAFKA_BOOTSTRAP_SERVERS, "cluster1_host:cluster1_port") .option(*KafkaConstantClass*.SUBSCRIBE, "topic1") -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
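What the report asks for is essentially a plain constants holder, so that option keys are discoverable through auto-completion instead of being looked up in the Kafka docs. A minimal sketch of that shape; the class name is hypothetical (no such class exists in Spark), and only keys shown in the ticket or Spark's Kafka source documentation are used:

```python
class KafkaOptionKeys:
    """Hypothetical constants mirroring the string option keys of the
    Spark Kafka source; names here are illustrative, not a Spark API."""
    KAFKA_BOOTSTRAP_SERVERS = "kafka.bootstrap.servers"
    SUBSCRIBE = "subscribe"
    STARTING_OFFSETS = "startingOffsets"
```

Usage would then read `.option(KafkaOptionKeys.SUBSCRIBE, "topic1")`. The counterargument in the thread still applies: keeping such a list in Spark couples it to Kafka's option surface, whereas constants shipped by Kafka itself would only need the "kafka." prefix added.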
[jira] [Commented] (SPARK-7101) Spark SQL should support java.sql.Time
[ https://issues.apache.org/jira/browse/SPARK-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132694#comment-17132694 ] YoungGyu Chun commented on SPARK-7101: -- I will try to get this done, but there is a ton of work ;) > Spark SQL should support java.sql.Time > -- > > Key: SPARK-7101 > URL: https://issues.apache.org/jira/browse/SPARK-7101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 > Environment: All >Reporter: Peter Hagelund >Priority: Major > > Several RDBMSes support the TIME data type; for more exact mapping between > those and Spark SQL, support for java.sql.Time with an associated > DataType.TimeType would be helpful. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31956) Do not fail if there is no ambiguous self join
[ https://issues.apache.org/jira/browse/SPARK-31956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132687#comment-17132687 ] Jungtaek Lim commented on SPARK-31956: -- Probably the correct fix version would be 3.0.1, as 3.0.0 RC3 vote is already passed. > Do not fail if there is no ambiguous self join > -- > > Key: SPARK-31956 > URL: https://issues.apache.org/jira/browse/SPARK-31956 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31956) Do not fail if there is no ambiguous self join
[ https://issues.apache.org/jira/browse/SPARK-31956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31956. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28783 [https://github.com/apache/spark/pull/28783] > Do not fail if there is no ambiguous self join > -- > > Key: SPARK-31956 > URL: https://issues.apache.org/jira/browse/SPARK-31956 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build
[ https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31960: Assignee: Apache Spark > Only populate Hadoop classpath for no-hadoop build > -- > > Key: SPARK-31960 > URL: https://issues.apache.org/jira/browse/SPARK-31960 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build
[ https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31960: Assignee: (was: Apache Spark) > Only populate Hadoop classpath for no-hadoop build > -- > > Key: SPARK-31960 > URL: https://issues.apache.org/jira/browse/SPARK-31960 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build
[ https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31960: Assignee: Apache Spark > Only populate Hadoop classpath for no-hadoop build > -- > > Key: SPARK-31960 > URL: https://issues.apache.org/jira/browse/SPARK-31960 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: DB Tsai >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build
[ https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132682#comment-17132682 ] Apache Spark commented on SPARK-31960: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/28788 > Only populate Hadoop classpath for no-hadoop build > -- > > Key: SPARK-31960 > URL: https://issues.apache.org/jira/browse/SPARK-31960 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build
[ https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-31960: Summary: Only populate Hadoop classpath for no-hadoop build (was: Only populate Hadoop classpath when hadoop is provided) > Only populate Hadoop classpath for no-hadoop build > -- > > Key: SPARK-31960 > URL: https://issues.apache.org/jira/browse/SPARK-31960 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31960) Only populate Hadoop classpath when hadoop is provided
DB Tsai created SPARK-31960: --- Summary: Only populate Hadoop classpath when hadoop is provided Key: SPARK-31960 URL: https://issues.apache.org/jira/browse/SPARK-31960 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 3.0.0 Reporter: DB Tsai -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31959) Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian to Julian"
[ https://issues.apache.org/jira/browse/SPARK-31959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132593#comment-17132593 ] Apache Spark commented on SPARK-31959: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28787 > Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian > to Julian" > > > Key: SPARK-31959 > URL: https://issues.apache.org/jira/browse/SPARK-31959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123688/testReport/org.apache.spark.sql.catalyst.util/RebaseDateTimeSuite/optimization_of_micros_rebasing___Gregorian_to_Julian/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31959) Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian to Julian"
[ https://issues.apache.org/jira/browse/SPARK-31959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31959: Assignee: Apache Spark > Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian > to Julian" > > > Key: SPARK-31959 > URL: https://issues.apache.org/jira/browse/SPARK-31959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123688/testReport/org.apache.spark.sql.catalyst.util/RebaseDateTimeSuite/optimization_of_micros_rebasing___Gregorian_to_Julian/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31959) Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian to Julian"
[ https://issues.apache.org/jira/browse/SPARK-31959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31959: Assignee: (was: Apache Spark) > Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian > to Julian" > > > Key: SPARK-31959 > URL: https://issues.apache.org/jira/browse/SPARK-31959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > See > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123688/testReport/org.apache.spark.sql.catalyst.util/RebaseDateTimeSuite/optimization_of_micros_rebasing___Gregorian_to_Julian/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31925) Summary.totalIterations greater than maxIters
[ https://issues.apache.org/jira/browse/SPARK-31925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31925: Assignee: (was: Apache Spark) > Summary.totalIterations greater than maxIters > - > > Key: SPARK-31925 > URL: https://issues.apache.org/jira/browse/SPARK-31925 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.6, 3.0.0 >Reporter: zhengruifeng >Priority: Trivial > > I am not sure whether it is a bug, but if we set *maxIter=n* in LiR/LiR/etc, > the model.summary.totalIterations will return *n+1* if the training procedure > does not drop out. > > friendly ping [~huaxingao] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31925) Summary.totalIterations greater than maxIters
[ https://issues.apache.org/jira/browse/SPARK-31925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31925: Assignee: Apache Spark > Summary.totalIterations greater than maxIters > - > > Key: SPARK-31925 > URL: https://issues.apache.org/jira/browse/SPARK-31925 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.6, 3.0.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Trivial > > I am not sure whether it is a bug, but if we set *maxIter=n* in LiR/LiR/etc, > the model.summary.totalIterations will return *n+1* if the training procedure > does not drop out. > > friendly ping [~huaxingao] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31925) Summary.totalIterations greater than maxIters
[ https://issues.apache.org/jira/browse/SPARK-31925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132582#comment-17132582 ] Apache Spark commented on SPARK-31925: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28786 > Summary.totalIterations greater than maxIters > - > > Key: SPARK-31925 > URL: https://issues.apache.org/jira/browse/SPARK-31925 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.6, 3.0.0 >Reporter: zhengruifeng >Priority: Trivial > > I am not sure whether it is a bug, but if we set *maxIter=n* in LiR/LiR/etc, > the model.summary.totalIterations will return *n+1* if the training procedure > does not drop out. > > friendly ping [~huaxingao] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
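One plausible cause of the n+1 behavior described above (a sketch in plain Python, not the actual Spark ML code; `fit` and its loss update are hypothetical) is a solver that records the objective history including the initial state, so the history length is the number of iterations plus one:

```python
# Hypothetical sketch (not the actual Spark ML implementation): an iterative
# solver that appends the initial objective before the loop, so the recorded
# history has max_iter + 1 entries even though only max_iter updates ran.
def fit(max_iter):
    history = []
    loss = 100.0
    history.append(loss)   # initial objective, recorded before iterating
    for _ in range(max_iter):
        loss *= 0.5        # one optimization step
        history.append(loss)
    return history

history = fit(max_iter=10)
assert len(history) == 11  # a summary reporting len(history) appears off by one
```

If the summary derives its iteration count from such a history, reporting `len(history) - 1` would match the configured maximum.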
[jira] [Commented] (SPARK-31948) expose mapSideCombine in aggByKey/reduceByKey/foldByKey
[ https://issues.apache.org/jira/browse/SPARK-31948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132569#comment-17132569 ] L. C. Hsieh commented on SPARK-31948: - I guess I have a similar question. When the reduction factor isn't high, is there any disadvantage to running map-side combining? I think it can still reduce shuffle data, although maybe not by much.
> expose mapSideCombine in aggByKey/reduceByKey/foldByKey
> ---
>
> Key: SPARK-31948
> URL: https://issues.apache.org/jira/browse/SPARK-31948
> Project: Spark
> Issue Type: Improvement
> Components: ML, Spark Core
> Affects Versions: 3.1.0
> Reporter: zhengruifeng
> Priority: Minor
>
> 1. {{aggregateByKey}}, {{reduceByKey}} and {{foldByKey}} will always perform {{mapSideCombine}};
> however, this can sometimes be skipped, especially in ML (RobustScaler):
> {code:java}
> vectors.mapPartitions { iter =>
>   if (iter.hasNext) {
>     val summaries = Array.fill(numFeatures)(
>       new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, relativeError))
>     while (iter.hasNext) {
>       val vec = iter.next
>       vec.foreach { (i, v) => if (!v.isNaN) summaries(i) = summaries(i).insert(v) }
>     }
>     Iterator.tabulate(numFeatures)(i => (i, summaries(i).compress))
>   } else Iterator.empty
> }.reduceByKey { case (s1, s2) => s1.merge(s2) }
> {code}
>
> This {{reduceByKey}} in {{RobustScaler}} does not need {{mapSideCombine}} at all; similar places exist in {{KMeans}}, {{GMM}}, etc.
> To my knowledge, we do not need {{mapSideCombine}} if the reduction factor isn't high.
>
> 2. {{treeAggregate}} and {{treeReduce}} are based on {{foldByKey}}; the {{mapSideCombine}} in the first call of {{foldByKey}} can also be avoided.
>
> SPARK-772:
> {quote}
> Map side combine in group by key case does not reduce the amount of data shuffled. Instead, it forces a lot more objects to go into old gen, and leads to worse GC.
> {quote}
>
> So what about:
> 1.
exposing mapSideCombine in {{aggByKey}}/{{reduceByKey}}/{{foldByKey}}, so that users can disable unnecessary mapSideCombine
> 2. disabling the {{mapSideCombine}} in the first call of {{foldByKey}} in {{treeAggregate}} and {{treeReduce}}
> 3. disabling the unnecessary {{mapSideCombine}} in ML;
> Friendly ping [~srowen] [~huaxingao] [~weichenxu123] [~hyukjin.kwon] [~viirya]
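To make the reduction-factor argument above concrete, here is a small pure-Python simulation (not Spark code; the function and data are hypothetical) counting how many records would reach the shuffle with and without a map-side combine. When every key is already distinct within a partition, as in the RobustScaler case, combining removes nothing:

```python
# Hypothetical simulation (plain Python, not Spark): count the records that
# would be written to the shuffle with and without a map-side combine.
def shuffled_records(partitions, map_side_combine):
    total = 0
    for part in partitions:
        if map_side_combine:
            combined = {}
            for key, value in part:
                # pre-aggregate values per key within the partition
                combined[key] = combined.get(key, 0) + value
            total += len(combined)
        else:
            total += len(part)
    return total

# RobustScaler-like case: each partition emits one record per feature key,
# so every key is already distinct and combining is a no-op.
distinct = [[(f, 1) for f in range(4)] for _ in range(3)]
assert shuffled_records(distinct, True) == shuffled_records(distinct, False) == 12

# High-reduction case: many duplicate keys per partition, combining pays off.
skewed = [[(0, 1)] * 100 for _ in range(3)]
assert shuffled_records(skewed, True) == 3
assert shuffled_records(skewed, False) == 300
```

The simulation only counts records; it ignores the GC cost of building the per-partition hash map, which is the overhead SPARK-772 cites for the low-reduction case.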
[jira] [Issue Comment Deleted] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YoungGyu Chun updated SPARK-31955: -- Comment: was deleted (was: Thanks, I will work on this) > Beeline discard the last line of the sql file when submited to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Blocker > > I submitted a sql file on beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line.This should be beeline's bug parsing sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31959) Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian to Julian"
Maxim Gekk created SPARK-31959: -- Summary: Test failure "RebaseDateTimeSuite.optimization of micros rebasing - Gregorian to Julian" Key: SPARK-31959 URL: https://issues.apache.org/jira/browse/SPARK-31959 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.1.0 Reporter: Maxim Gekk See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123688/testReport/org.apache.spark.sql.catalyst.util/RebaseDateTimeSuite/optimization_of_micros_rebasing___Gregorian_to_Julian/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132566#comment-17132566 ] YoungGyu Chun commented on SPARK-31955: --- As a workaround, try adding a space between the select statement and the where statement and putting them on a single line: select * from info_dev.beeline_test where name='bbb'
> Beeline discard the last line of the sql file when submited to thriftserver
> via beeline
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.4
> Reporter: Lin Gang Deng
> Priority: Blocker
>
> I submitted a sql file on beeline and the result returned is wrong. After many tests, it was found that the sql executed by Spark would discard the last line. This should be beeline's bug parsing sql file.
[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132549#comment-17132549 ] YoungGyu Chun commented on SPARK-31955: --- Thanks, I will work on this > Beeline discard the last line of the sql file when submited to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Blocker > > I submitted a sql file on beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line.This should be beeline's bug parsing sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31890) Decommissioning Test Does not wait long enough for executors to launch
[ https://issues.apache.org/jira/browse/SPARK-31890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-31890. -- Fix Version/s: 3.1.0 Resolution: Fixed Fixed in one of the decom PRs > Decommissioning Test Does not wait long enough for executors to launch > -- > > Key: SPARK-31890 > URL: https://issues.apache.org/jira/browse/SPARK-31890 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > Fix For: 3.1.0 > > > I've seen some decommissioning test failures where the initial executors are > not launching in time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130795#comment-17130795 ] Lin Gang Deng edited comment on SPARK-31955 at 6/10/20, 3:29 PM: -
{code:java}
0: jdbc:hive2://hadoop.spark-sql.hadoo> select * from info_dev.beeline_test;
+-----+-------+
| id  | name  |
+-----+-------+
| 3   | ccc   |
| 2   | bbb   |
| 1   | aaa   |
+-----+-------+
3 rows selected (1.402 seconds)
{code}
Then the SQL file is as follows:
{code:java}
[test@192.168.0.1 denglg]$ cat -A test2.sql
select * from info_dev.beeline_test$
where name='bbb';[test@192.168.0.1 denglg]$
{code}
The result is as follows:
{code:java}
0: jdbc:hive2://spark-sql.hadoo> select * from info_dev.beeline_test
0: jdbc:hive2://spark-sql.hadoo> where name='bbb';
+-----+-------+
| id  | name  |
+-----+-------+
| 3   | ccc   |
| 2   | bbb   |
| 1   | aaa   |
+-----+-------+
3 rows selected (1.594 seconds)
{code}
As you can see, it got the wrong result.
> Beeline discard the last line of the sql file when submited to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Blocker > > I submitted a sql file on beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line.This should be beeline's bug parsing sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130795#comment-17130795 ] Lin Gang Deng commented on SPARK-31955: ---
{code:java}
0: jdbc:hive2://hadoop.spark-sql.hadoo> select * from info_dev.beeline_test;
+-----+-------+
| id  | name  |
+-----+-------+
| 3   | ccc   |
| 2   | bbb   |
| 1   | aaa   |
+-----+-------+
3 rows selected (1.402 seconds)
{code}
Then the SQL file is as follows:
{code:java}
[test@192.168.0.1 denglg]$ cat -A test2.sql
select * from info_dev.beeline_test$
where name='bbb';[test@192.168.0.1 denglg]$
{code}
The result is as follows:
{code:java}
0: jdbc:hive2://spark-sql.hadoo> select * from info_dev.beeline_test
0: jdbc:hive2://spark-sql.hadoo> where name='bbb';
+-----+-------+
| id  | name  |
+-----+-------+
| 3   | ccc   |
| 2   | bbb   |
| 1   | aaa   |
+-----+-------+
3 rows selected (1.594 seconds)
{code}
As you can see, it got the wrong result.
> Beeline discard the last line of the sql file when submited to thriftserver
> via beeline
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.4
> Reporter: Lin Gang Deng
> Priority: Blocker
>
> I submitted a sql file on beeline and the result returned is wrong. After many tests, it was found that the sql executed by Spark would discard the last line. This should be beeline's bug parsing sql file.
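In the `cat -A` repro above there is a `$` (end-of-line marker) after the first line but none after `where name='bbb';`, i.e. the file has no trailing newline. A minimal sketch of how that could lose the clause (a hypothetical reader in plain Python, not the actual beeline parser):

```python
# Hypothetical illustration (not the actual beeline code): a reader that
# only emits newline-terminated lines silently drops a final line that has
# no trailing newline.
def terminated_lines(text):
    lines, buffer = [], []
    for ch in text:
        if ch == "\n":
            lines.append("".join(buffer))
            buffer = []
        else:
            buffer.append(ch)
    return lines  # an unterminated tail left in `buffer` is discarded

sql = "select * from info_dev.beeline_test\nwhere name='bbb';"  # no trailing \n
assert terminated_lines(sql) == ["select * from info_dev.beeline_test"]

# With a trailing newline, the WHERE clause survives:
assert terminated_lines(sql + "\n")[-1] == "where name='bbb';"
```

If this is what happens, ending the file with a newline would be a workaround regardless of where the fix lands.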
[jira] [Commented] (SPARK-31958) normalize special floating numbers in subquery
[ https://issues.apache.org/jira/browse/SPARK-31958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130789#comment-17130789 ] Apache Spark commented on SPARK-31958: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/28785 > normalize special floating numbers in subquery > -- > > Key: SPARK-31958 > URL: https://issues.apache.org/jira/browse/SPARK-31958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31958) normalize special floating numbers in subquery
[ https://issues.apache.org/jira/browse/SPARK-31958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31958: Assignee: Apache Spark (was: Wenchen Fan) > normalize special floating numbers in subquery > -- > > Key: SPARK-31958 > URL: https://issues.apache.org/jira/browse/SPARK-31958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31958) normalize special floating numbers in subquery
[ https://issues.apache.org/jira/browse/SPARK-31958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31958: Assignee: Wenchen Fan (was: Apache Spark) > normalize special floating numbers in subquery > -- > > Key: SPARK-31958 > URL: https://issues.apache.org/jira/browse/SPARK-31958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31958) normalize special floating numbers in subquery
Wenchen Fan created SPARK-31958: --- Summary: normalize special floating numbers in subquery Key: SPARK-31958 URL: https://issues.apache.org/jira/browse/SPARK-31958 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
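The ticket above carries no description, but as background on why special floating-point values need normalizing before equality-based operations such as grouping and joins, here is a plain-Python illustration (IEEE-754 behavior, not Spark code):

```python
import struct

# IEEE-754 quirks that motivate canonicalizing special floats before
# equality-based operations (grouping, joins, subquery comparisons).
nan1, nan2 = float("nan"), float("nan")
assert nan1 != nan2                 # NaN never compares equal to NaN

assert 0.0 == -0.0                  # semantically equal...
def bits(x):
    return struct.pack("<d", x)     # raw 64-bit representation
assert bits(0.0) != bits(-0.0)      # ...but with different bit patterns

# So a bit-level comparison disagrees with semantic equality in both
# directions, and engines normalize NaN and -0.0 to one canonical form.
```

This is only the general motivation; where exactly the normalization rule must apply (e.g. inside subqueries) is what the ticket addresses.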
[jira] [Comment Edited] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130750#comment-17130750 ] YoungGyu Chun edited comment on SPARK-31955 at 6/10/20, 2:45 PM: - [~denglg] can you add screenshots or examples showing the last line being discarded, along with the SQL file you are testing?
> Beeline discard the last line of the sql file when submited to thriftserver
> via beeline
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.4
> Reporter: Lin Gang Deng
> Priority: Blocker
>
> I submitted a sql file on beeline and the result returned is wrong. After many tests, it was found that the sql executed by Spark would discard the last line. This should be beeline's bug parsing sql file.