[jira] [Commented] (SPARK-15719) Disable writing Parquet summary files by default
[ https://issues.apache.org/jira/browse/SPARK-15719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814061#comment-16814061 ]

Ruslan Dautkhanov commented on SPARK-15719:
-------------------------------------------

[~lian cheng] A quick question on this part of the description:
{quote}
when schema merging is enabled, we need to read footers of all files anyway to do the merge
{quote}
Is that still accurate in current Spark 2.3/2.4? I was looking at ParquetFileFormat.inferSchema, and it does consult the `_common_metadata` and `_metadata` files here:
https://github.com/apache/spark/blob/v2.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L231
Or would Spark still need to look at files in all partitions, though not necessarily every Parquet file? Thank you.

> Disable writing Parquet summary files by default
> -------------------------------------------------
>
>                 Key: SPARK-15719
>                 URL: https://issues.apache.org/jira/browse/SPARK-15719
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>            Priority: Major
>              Labels: release_notes, releasenotes
>             Fix For: 2.0.0
>
> Parquet summary files are not particularly useful nowadays, since
> # when schema merging is disabled, we assume the schemas of all Parquet part-files are identical, thus we can read the footer from any part-file.
> # when schema merging is enabled, we need to read footers of all files anyway to do the merge.
> On the other hand, writing summary files can be expensive because footers of all part-files must be read and merged. This is particularly costly when appending a small dataset to a large existing Parquet dataset.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
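For readers landing on this thread from search: the behavior discussed above is governed by a handful of configuration keys. A hedged sketch follows (key names as of Spark 2.x and parquet-hadoop; verify them against your version's documentation before relying on them):

```properties
# spark-defaults.conf sketch -- verify keys against your Spark/Parquet version
spark.sql.parquet.mergeSchema          false   # skip footer-merging schema inference on read
spark.sql.parquet.respectSummaryFiles  true    # when merging, trust _common_metadata if present
parquet.enable.summary-metadata        false   # Hadoop-side switch for writing _metadata files
```

`spark.sql.parquet.mergeSchema` can also be set per-read via `option("mergeSchema", "true")`, which is what triggers the footer reads the comment above asks about.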
[jira] [Updated] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
[ https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chendi.Xue updated SPARK-27412:
-------------------------------
    Labels: core  (was: shuffle)

> Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-27412
>                 URL: https://issues.apache.org/jira/browse/SPARK-27412
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Chendi.Xue
>            Priority: Minor
>              Labels: core
>         Attachments: PmemShuffleManager-DesignDoc.pdf
>
> Add a new shuffle manager called "PmemShuffleManager", which lets Spark use a Persistent Memory device as storage for shuffle output and external-sorter spilling.
> This implementation leverages the Persistent Memory Development Kit (PMDK) to support transactional writes with high performance.
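Spark's shuffle manager is pluggable via configuration, which is how a proposal like this one would be wired in. A hedged sketch; the fully-qualified class name below is an assumption based on the issue title, not something taken from the attached design doc:

```properties
# spark-defaults.conf sketch -- the pmem class name is hypothetical
spark.shuffle.manager  org.apache.spark.shuffle.pmem.PmemShuffleManager

# stock Spark ships only the sort-based manager, i.e. the default:
# spark.shuffle.manager  sort
```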
[jira] [Updated] (SPARK-27410) Remove deprecated/no-op mllib.KMeans get/setRuns methods
[ https://issues.apache.org/jira/browse/SPARK-27410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-27410:
------------------------------
    Docs Text: In Spark 3.0, the methods getRuns and setRuns in org.apache.spark.mllib.clustering.KMeans have been removed. They have been no-ops and deprecated since Spark 2.1.0.
       Labels: release-notes  (was: )

> Remove deprecated/no-op mllib.KMeans get/setRuns methods
> ----------------------------------------------------------
>
>                 Key: SPARK-27410
>                 URL: https://issues.apache.org/jira/browse/SPARK-27410
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 3.0.0
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Trivial
>              Labels: release-notes
>             Fix For: 3.0.0
>
> mllib.KMeans has getRuns and setRuns methods which haven't done anything since Spark 2.1. They're deprecated no-ops and should be removed for Spark 3.
[jira] [Resolved] (SPARK-27410) Remove deprecated/no-op mllib.KMeans get/setRuns methods
[ https://issues.apache.org/jira/browse/SPARK-27410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-27410.
-------------------------------
    Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24320
[https://github.com/apache/spark/pull/24320]

> Remove deprecated/no-op mllib.KMeans get/setRuns methods
> ----------------------------------------------------------
>
>                 Key: SPARK-27410
>                 URL: https://issues.apache.org/jira/browse/SPARK-27410
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 3.0.0
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>            Priority: Trivial
>             Fix For: 3.0.0
>
> mllib.KMeans has getRuns and setRuns methods which haven't done anything since Spark 2.1. They're deprecated no-ops and should be removed for Spark 3.
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813896#comment-16813896 ]

Bryan Cutler commented on SPARK-27389:
--------------------------------------

Thanks [~shaneknapp] for the fix. I couldn't come up with any idea why this was happening all of a sudden either, but at least we are up and running again!

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -------------------------------------------------------------------
>
>                 Key: SPARK-27389
>                 URL: https://issues.apache.org/jira/browse/SPARK-27389
>             Project: Spark
>          Issue Type: Task
>          Components: jenkins, PySpark
>    Affects Versions: 3.0.0
>            Reporter: Imran Rashid
>            Assignee: shane knapp
>            Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about "UnknownTimeZoneError: 'US/Pacific-New'", e.g.
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really supposed to be a timezone at all:
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins. That said, I can't figure out what is wrong. There does seem to be a timezone entry for US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to be there on every amp-jenkins-worker, so I don't know why that alone would cause this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be totally wrong here and it is really a pyspark problem.
>
> Full stack trace from the test failure:
> {noformat}
> ======================================================================
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", line 522, in test_to_pandas
>     pdf = self._to_pandas()
>   File "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", line 517, in _to_pandas
>     return df.toPandas()
>   File "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", line 2189, in toPandas
>     _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", line 1891, in _check_series_convert_timestamps_local_tz
>     return _check_series_convert_timestamps_localize(s, None, timezone)
>   File "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", line 1877, in _check_series_convert_timestamps_localize
>     lambda ts: ts.tz_localize(from_tz, ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 2294, in apply
>     mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer (pandas/lib.c:66124)
>   File "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", line 1878, in <lambda>
>     if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 178, in timezone
>     raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}
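As an aside for test authors, the failure mode above (an unknown zone name blowing up deep inside a conversion) can be guarded against at the point where the zone name is resolved. A minimal sketch using the stdlib zoneinfo module (Python 3.9+) rather than the pytz of the traceback; `resolve_zone` is a hypothetical helper for illustration, not the Jenkins fix:

```python
# Hedged sketch: resolve a timezone name defensively, falling back to UTC
# when the tz database does not know the name. 'US/Pacific-New' was never a
# real tz-database zone, which is why pytz raised UnknownTimeZoneError above;
# zoneinfo raises ZoneInfoNotFoundError for the same situation.
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def resolve_zone(name: str, fallback: str = "UTC") -> ZoneInfo:
    """Return the named zone, or the fallback if the tz database lacks it."""
    try:
        return ZoneInfo(name)
    except ZoneInfoNotFoundError:
        return ZoneInfo(fallback)


tz = resolve_zone("US/Pacific-New")  # falls back to UTC where the zone is absent
```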
[jira] [Resolved] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shane knapp resolved SPARK-27389.
---------------------------------
    Resolution: Fixed

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -------------------------------------------------------------------
>
>                 Key: SPARK-27389
>                 URL: https://issues.apache.org/jira/browse/SPARK-27389
>             Project: Spark
>          Issue Type: Task
>          Components: jenkins, PySpark
>    Affects Versions: 3.0.0
>            Reporter: Imran Rashid
>            Assignee: shane knapp
>            Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about "UnknownTimeZoneError: 'US/Pacific-New'", e.g.
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really supposed to be a timezone at all:
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins. That said, I can't figure out what is wrong. There does seem to be a timezone entry for US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to be there on every amp-jenkins-worker, so I don't know why that alone would cause this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File 
"/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shane knapp reassigned SPARK-27389:
-----------------------------------
    Assignee: shane knapp

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -------------------------------------------------------------------
>
>                 Key: SPARK-27389
>                 URL: https://issues.apache.org/jira/browse/SPARK-27389
>             Project: Spark
>          Issue Type: Task
>          Components: jenkins, PySpark
>    Affects Versions: 3.0.0
>            Reporter: Imran Rashid
>            Assignee: shane knapp
>            Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about "UnknownTimeZoneError: 'US/Pacific-New'", e.g.
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really supposed to be a timezone at all:
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins. That said, I can't figure out what is wrong. There does seem to be a timezone entry for US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to be there on every amp-jenkins-worker, so I don't know why that alone would cause this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File 
"/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table
[ https://issues.apache.org/jira/browse/SPARK-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813883#comment-16813883 ]

Shivu Sondur commented on SPARK-27421:
--------------------------------------

I am checking this issue.

> RuntimeException when querying a view on a partitioned parquet table
> ----------------------------------------------------------------------
>
>                 Key: SPARK-27421
>                 URL: https://issues.apache.org/jira/browse/SPARK-27421
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)
>            Reporter: Eric Maynard
>            Priority: Minor
>
> When running a simple query, I get the following stacktrace:
> {code}
> java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
>   at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>   at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261)
>   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
>   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
>   at org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>   at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
>   at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
>   at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
>   at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>   at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at
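The exception text quoted above already names the workaround. As a sketch, the relevant setting (with the performance caveat the message itself documents) would look like this; whether it is the right fix for the underlying bug is exactly what the issue is about:

```properties
# Workaround named in the exception message. Disabling file-source partition
# management avoids the Hive getPartitionsByFilter call, at the cost of
# degraded performance for partition pruning.
spark.sql.hive.manageFilesourcePartitions  false
```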
[jira] [Resolved] (SPARK-27387) Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests
[ https://issues.apache.org/jira/browse/SPARK-27387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-27387.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Fixed in https://github.com/apache/spark/pull/24306

> Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-27387
>                 URL: https://issues.apache.org/jira/browse/SPARK-27387
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Tests
>    Affects Versions: 2.4.1
>            Reporter: Bryan Cutler
>            Priority: Major
>             Fix For: 3.0.0
>
> In PySpark unit tests, sqlutils ReusedSQLTestCase.assertPandasEqual is meant to check whether two pandas.DataFrames are equal, but for later versions of Pandas this can fail if the DataFrame has an array column. This method can be replaced by {{assert_frame_equal}} from pandas.util.testing, which is meant for exactly this purpose and gives a better assertion message as well.
> The test failure I have seen is:
> {noformat}
> ======================================================================
> ERROR: test_supported_types (pyspark.sql.tests.test_pandas_udf_grouped_map.GroupedMapPandasUDFTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/home/bryan/git/spark/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py", line 128, in test_supported_types
>     self.assertPandasEqual(expected1, result1)
>   File "/home/bryan/git/spark/python/pyspark/testing/sqlutils.py", line 268, in assertPandasEqual
>     self.assertTrue(expected.equals(result), msg=msg)
>   File "/home/bryan/miniconda2/envs/pa012/lib/python3.6/site-packages/pandas
>   ...
>   File "pandas/_libs/lib.pyx", line 523, in pandas._libs.lib.array_equivalent_object
> ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
> {noformat}
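A minimal sketch of the replacement the issue proposes, using the `pandas.testing` import path of current pandas (older versions exposed the same function as `pandas.util.testing.assert_frame_equal`). The frames here are toy data, not the Spark test fixtures:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Two logically equal frames. The JIRA failure involved object columns holding
# arrays, where the old DataFrame.equals-based check raised
# "truth value of an array ... is ambiguous".
expected = pd.DataFrame({"id": [1, 2], "v": [[1.0, 2.0], [3.0]]})
result = pd.DataFrame({"id": [1, 2], "v": [[1.0, 2.0], [3.0]]})

# assert_frame_equal raises AssertionError with a readable per-column diff
# when the frames differ; it returns None when they match.
assert_frame_equal(expected, result)
```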
[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813880#comment-16813880 ]

shane knapp edited comment on SPARK-27389 at 4/9/19 10:54 PM:
--------------------------------------------------------------

btw, the total impact of this problem "only" failed 73 builds over the past seven days and was limited to two workers, amp-jenkins-worker-03 and -05.
{noformat}
      1 NewSparkPullRequestBuilder
      2 spark-branch-2.3-test-sbt-hadoop-2.6
      3 spark-branch-2.3-test-sbt-hadoop-2.7
      1 spark-branch-2.4-test-sbt-hadoop-2.6
      6 spark-branch-2.4-test-sbt-hadoop-2.7
     11 spark-master-test-sbt-hadoop-2.7
     49 SparkPullRequestBuilder
{noformat}
i still haven't figured out *why* things broke... it wasn't an errant package install by a build, as i have the anaconda dirs locked down and the only way to add/update packages there is to use sudo.

was (Author: shaneknapp):
btw, the total impact of this problem "only" failed 73 builds over the past seven days and was limited to two workers, amp-jenkins-worker-03 and -05.
i still haven't figured out *why* things broke... it wasn't an errant package install by a build, as i have the anaconda dirs locked down and the only way to add/update packages there is to use sudo.

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -------------------------------------------------------------------
>
>                 Key: SPARK-27389
>                 URL: https://issues.apache.org/jira/browse/SPARK-27389
>             Project: Spark
>          Issue Type: Task
>          Components: jenkins, PySpark
>    Affects Versions: 3.0.0
>            Reporter: Imran Rashid
>            Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about "UnknownTimeZoneError: 'US/Pacific-New'", e.g.
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really supposed to be a timezone at all:
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins. That said, I can't figure out what is wrong. There does seem to be a timezone entry for US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to be there on every amp-jenkins-worker, so I don't know why that alone would cause this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be totally wrong here and it is really a pyspark problem.
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813880#comment-16813880 ]

shane knapp commented on SPARK-27389:
--------------------------------------

btw, the total impact of this problem "only" failed 73 builds over the past seven days and was limited to two workers, amp-jenkins-worker-03 and -05.
i still haven't figured out *why* things broke... it wasn't an errant package install by a build, as i have the anaconda dirs locked down and the only way to add/update packages there is to use sudo.

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -------------------------------------------------------------------
>
>                 Key: SPARK-27389
>                 URL: https://issues.apache.org/jira/browse/SPARK-27389
>             Project: Spark
>          Issue Type: Task
>          Components: jenkins, PySpark
>    Affects Versions: 3.0.0
>            Reporter: Imran Rashid
>            Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about "UnknownTimeZoneError: 'US/Pacific-New'", e.g.
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really supposed to be a timezone at all:
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins. That said, I can't figure out what is wrong. There does seem to be a timezone entry for US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to be there on every amp-jenkins-worker, so I don't know why that alone would cause this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be totally wrong here and it is really a pyspark problem.
[jira] [Resolved] (SPARK-27401) Refactoring conversion of Date/Timestamp to/from java.sql.Date/Timestamp
[ https://issues.apache.org/jira/browse/SPARK-27401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27401. --- Resolution: Fixed Assignee: Maxim Gekk Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24311 > Refactoring conversion of Date/Timestamp to/from java.sql.Date/Timestamp > > > Key: SPARK-27401 > URL: https://issues.apache.org/jira/browse/SPARK-27401 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The fromJavaTimestamp/toJavaTimestamp and toJavaDate/fromJavaDate can be > implemented using existing methods DateTimeUtils like > instantToMicros/microsToInstant and daysToLocalDate/localDateToDays. This > should allow: > # To avoid invocation of millisToDays and time zone offset calculation > # To simplify implementation of toJavaTimestamp, and properly handle > negative inputs > # Detect arithmetic overflow of Long -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
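Point 3 of the description (detect arithmetic overflow of Long) can be illustrated outside of Spark. A minimal Python sketch, not the actual Scala DateTimeUtils code, of converting an instant to microseconds while modeling the 64-bit Long overflow check (Python ints do not overflow, so the range check is made explicit; all names here are hypothetical):

```python
# Model Scala's Long range explicitly, since Python ints are unbounded.
LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1
MICROS_PER_SECOND = 1_000_000

def instant_to_micros(epoch_seconds: int, nanos: int) -> int:
    """Convert (seconds, nanos) since the epoch to microseconds,
    raising instead of silently wrapping on 64-bit overflow."""
    micros = epoch_seconds * MICROS_PER_SECOND + nanos // 1_000
    if not (LONG_MIN <= micros <= LONG_MAX):
        raise OverflowError("timestamp out of range for 64-bit microseconds")
    return micros
```

This mirrors the idea of implementing the conversion via instant-based helpers so that out-of-range inputs fail loudly rather than producing wrapped values.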
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813873#comment-16813873 ] shane knapp commented on SPARK-27389: - we are most definitely good to go... this build is running on amp-jenkins-worker-05 and the python2.7 pyspark.sql.tests.test_dataframe tests successfully passed: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5718 this build was previously failing on the same worker w/the TZ issue: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5699/console > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File 
"/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
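The failing check can be reproduced generically. A sketch using the stdlib zoneinfo module as a stand-in for pytz (the Jenkins failure was in pytz, which raises the analogous pytz.UnknownTimeZoneError when the tzinfo file for a zone name is missing from its data directory):

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

def zone_is_known(name: str) -> bool:
    """Return True if the time zone database on this host resolves `name`."""
    try:
        ZoneInfo(name)
        return True
    except (ZoneInfoNotFoundError, ValueError):
        # ZoneInfoNotFoundError: no data file for this key;
        # ValueError: the key itself is malformed.
        return False
```

Running such a probe on each worker would have shown which hosts lacked the (deprecated) US/Pacific-New entry.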
[jira] [Created] (SPARK-27423) Cast DATE to/from TIMESTAMP according to SQL standard
Maxim Gekk created SPARK-27423: -- Summary: Cast DATE to/from TIMESTAMP according to SQL standard Key: SPARK-27423 URL: https://issues.apache.org/jira/browse/SPARK-27423 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk According to the SQL standard, DATE is a union of (year, month, day). To convert it to Spark's TIMESTAMP, which is TIMESTAMP WITH TIME ZONE, the date should be extended with the time at midnight - (year, month, day, hour = 0, minute = 0, second = 0). The resulting timestamp should be interpreted as a timestamp in the session time zone, and transformed to microseconds since the epoch in UTC. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
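The conversion described above can be sketched in plain Python (a hypothetical illustration of the semantics, not Catalyst code): extend the date to midnight in the session time zone, then express that instant as microseconds since the epoch in UTC.

```python
from datetime import date, datetime, timezone
from zoneinfo import ZoneInfo

MICROS_PER_SECOND = 1_000_000

def cast_date_to_timestamp_micros(d: date, session_tz: str) -> int:
    """Extend a DATE to 00:00:00 in the session time zone and return
    microseconds since the epoch in UTC."""
    local_midnight = datetime(d.year, d.month, d.day,
                              tzinfo=ZoneInfo(session_tz))
    utc = local_midnight.astimezone(timezone.utc)
    return int(utc.timestamp() * MICROS_PER_SECOND)
```

Note that the result depends on the session time zone: the same DATE casts to different TIMESTAMP values under different sessions.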
[jira] [Commented] (SPARK-25348) Data source for binary files
[ https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813834#comment-16813834 ] Xiangrui Meng commented on SPARK-25348: --- Sampling could be supported later. > Data source for binary files > > > Key: SPARK-25348 > URL: https://issues.apache.org/jira/browse/SPARK-25348 > Project: Spark > Issue Type: Story > Components: ML, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > It would be useful to have a data source implementation for binary files, > which can be used to build features to load images, audio, and videos. > Microsoft has an implementation at > [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be > great if we can merge it into Spark main repo. > cc: [~mhamilton] and [~imatiach] > Proposed API: > Format name: "binary-file" > Schema: > * content: BinaryType > * status (following Hadoop FIleStatus): > ** path: StringType > ** modification_time: Timestamp > ** length: LongType (size limit 2GB) > Options: > * pathFilterRegex: only include files with path matching the regex pattern > * maxBytesPerPartition: The max total file size for each partition unless the > partition only contains one file > We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as > convenience aliases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
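The maxBytesPerPartition option in the proposed API above describes a packing rule: files are grouped into a partition until the size cap is reached, except that a single oversized file still gets its own partition. A hedged sketch of that rule (not the actual implementation):

```python
def pack_partitions(file_sizes, max_bytes):
    """Greedily group file sizes into partitions of at most `max_bytes`,
    allowing a lone file to exceed the cap."""
    partitions, current, current_bytes = [], [], 0
    for size in file_sizes:
        if current and current_bytes + size > max_bytes:
            # Close the current partition before it would overflow.
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions
```

An oversized file (e.g. 10 bytes with a 6-byte cap) lands alone in its partition rather than being split.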
[jira] [Updated] (SPARK-25348) Data source for binary files
[ https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-25348: -- Description: It would be useful to have a data source implementation for binary files, which can be used to build features to load images, audio, and videos. Microsoft has an implementation at [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be great if we can merge it into Spark main repo. cc: [~mhamilton] and [~imatiach] Proposed API: Format name: "binary-file" Schema: * content: BinaryType * status (following Hadoop FIleStatus): ** path: StringType ** modification_time: Timestamp ** length: LongType (size limit 2GB) Options: * pathFilterRegex: only include files with path matching the regex pattern * maxBytesPerPartition: The max total file size for each partition unless the partition only contains one file We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as convenience aliases. was: It would be useful to have a data source implementation for binary files, which can be used to build features to load images, audio, and videos. Microsoft has an implementation at [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be great if we can merge it into Spark main repo. cc: [~mhamilton] and [~imatiach] Proposed API: Format name: "binary-file" Schema: * content: BinaryType * status (following Hadoop FIleStatus): * path: StringType * modification_time: Timestamp * length: LongType (size limit 2GB) Options: * pathFilterRegex: only include files with path matching the regex pattern * maxBytesPerPartition: The max total file size for each partition unless the partition only contains one file We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as convenience aliases. 
> Data source for binary files > > > Key: SPARK-25348 > URL: https://issues.apache.org/jira/browse/SPARK-25348 > Project: Spark > Issue Type: Story > Components: ML, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > It would be useful to have a data source implementation for binary files, > which can be used to build features to load images, audio, and videos. > Microsoft has an implementation at > [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be > great if we can merge it into Spark main repo. > cc: [~mhamilton] and [~imatiach] > Proposed API: > Format name: "binary-file" > Schema: > * content: BinaryType > * status (following Hadoop FIleStatus): > ** path: StringType > ** modification_time: Timestamp > ** length: LongType (size limit 2GB) > Options: > * pathFilterRegex: only include files with path matching the regex pattern > * maxBytesPerPartition: The max total file size for each partition unless the > partition only contains one file > We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as > convenience aliases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25348) Data source for binary files
[ https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813832#comment-16813832 ] Xiangrui Meng commented on SPARK-25348: --- Updated the description and proposed APIs. > Data source for binary files > > > Key: SPARK-25348 > URL: https://issues.apache.org/jira/browse/SPARK-25348 > Project: Spark > Issue Type: Story > Components: ML, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > It would be useful to have a data source implementation for binary files, > which can be used to build features to load images, audio, and videos. > Microsoft has an implementation at > [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be > great if we can merge it into Spark main repo. > cc: [~mhamilton] and [~imatiach] > Proposed API: > Format name: "binary-file" > Schema: > * content: BinaryType > * status (following Hadoop FIleStatus): > * path: StringType > * modification_time: Timestamp > * length: LongType (size limit 2GB) > Options: > * pathFilterRegex: only include files with path matching the regex pattern > * maxBytesPerPartition: The max total file size for each partition unless the > partition only contains one file > We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as > convenience aliases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25348) Data source for binary files
[ https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-25348: -- Description: It would be useful to have a data source implementation for binary files, which can be used to build features to load images, audio, and videos. Microsoft has an implementation at [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be great if we can merge it into Spark main repo. cc: [~mhamilton] and [~imatiach] Proposed API: Format name: "binary-file" Schema: * content: BinaryType * status (following Hadoop FIleStatus): * path: StringType * modification_time: Timestamp * length: LongType (size limit 2GB) Options: * pathFilterRegex: only include files with path matching the regex pattern * maxBytesPerPartition: The max total file size for each partition unless the partition only contains one file We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as convenience aliases. was: It would be useful to have a data source implementation for binary files, which can be used to build features to load images, audio, and videos. Microsoft has an implementation at [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be great if we can merge it into Spark main repo. cc: [~mhamilton] and [~imatiach] > Data source for binary files > > > Key: SPARK-25348 > URL: https://issues.apache.org/jira/browse/SPARK-25348 > Project: Spark > Issue Type: Story > Components: ML, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > It would be useful to have a data source implementation for binary files, > which can be used to build features to load images, audio, and videos. > Microsoft has an implementation at > [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be > great if we can merge it into Spark main repo. 
> cc: [~mhamilton] and [~imatiach] > Proposed API: > Format name: "binary-file" > Schema: > * content: BinaryType > * status (following Hadoop FIleStatus): > * path: StringType > * modification_time: Timestamp > * length: LongType (size limit 2GB) > Options: > * pathFilterRegex: only include files with path matching the regex pattern > * maxBytesPerPartition: The max total file size for each partition unless the > partition only contains one file > We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as > convenience aliases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25348) Data source for binary files
[ https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-25348: - Assignee: Weichen Xu > Data source for binary files > > > Key: SPARK-25348 > URL: https://issues.apache.org/jira/browse/SPARK-25348 > Project: Spark > Issue Type: Story > Components: ML, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Weichen Xu >Priority: Major > > It would be useful to have a data source implementation for binary files, > which can be used to build features to load images, audio, and videos. > Microsoft has an implementation at > [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be > great if we can merge it into Spark main repo. > cc: [~mhamilton] and [~imatiach] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27357) Cast timestamps to/from dates independently from time zones
[ https://issues.apache.org/jira/browse/SPARK-27357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-27357. Resolution: Not A Problem > Cast timestamps to/from dates independently from time zones > --- > > Key: SPARK-27357 > URL: https://issues.apache.org/jira/browse/SPARK-27357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Both Catalyst's types TIMESTAMP and DATE internally represent time intervals > since epoch in UTC time zone. The TIMESTAMP type contains number of > microseconds since epoch, and DATE is number of days since epoch (00:00:00 1 > January 1970). As a consequence of that, the conversion should be independent > from session or local time zone. The ticket aims to fix current behavior and > makes the conversion independent from time zones. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
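The description's premise - TIMESTAMP as microseconds since the epoch and DATE as days since the epoch - makes the proposed conversion pure integer arithmetic. A sketch of that view (note the ticket was resolved as Not A Problem, so this illustrates the proposal, not Spark's actual behavior):

```python
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000

def micros_to_epoch_days(micros: int) -> int:
    # Floor division keeps pre-1970 instants on the correct day.
    return micros // MICROS_PER_DAY

def epoch_days_to_micros(days: int) -> int:
    # 00:00:00 UTC of the given epoch day.
    return days * MICROS_PER_DAY
```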
[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813812#comment-16813812 ] shane knapp edited comment on SPARK-27389 at 4/9/19 9:09 PM: - ok, this should be fixed now... i got all the workers to recognize US/Pacific-New w/python2.7 and the python/run-tests script now passes! the following was run on amp-jenkins-worker-05, which was failing continuously w/the unknown tz error: {noformat} -bash-4.1$ python/run-tests --python-executables=python2.7 Running PySpark tests. Output is in /home/jenkins/src/spark/python/unit-tests.log Will test against the following Python executables: ['python2.7'] Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming'] ...trimming a bunch to get to the failing tests... Finished test(python2.7): pyspark.sql.tests.test_dataframe (32s) ... 2 tests were skipped ...yay! it passed! now skipping more output to get to the end... Tests passed in 797 seconds Skipped tests in pyspark.sql.tests.test_dataframe with python2.7: test_create_dataframe_required_pandas_not_found (pyspark.sql.tests.test_dataframe.DataFrameTests) ... skipped 'Required Pandas was found.' test_to_pandas_required_pandas_not_found (pyspark.sql.tests.test_dataframe.DataFrameTests) ... skipped 'Required Pandas was found.' {noformat} turns out that a couple of workers were missing the US/Pacific-New tzinfo file in the pytz libdir. a quick scp + python2.7 -m compileall later and things seem to be happy! i'll leave this open for now, and if anyone notices other builds failing in this way please link to them here. was (Author: shaneknapp): ok, this should be fixed now... i got all the workers to recognize US/Pacific-New w/python2.7 and the python/run-tests script now passes! {noformat} -bash-4.1$ python/run-tests --python-executables=python2.7 Running PySpark tests. 
Output is in /home/jenkins/src/spark/python/unit-tests.log Will test against the following Python executables: ['python2.7'] Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming'] ...trimming a bunch to get to the failing tests... Finished test(python2.7): pyspark.sql.tests.test_dataframe (32s) ... 2 tests were skipped ...yay! it passed! now skipping more output to get to the end... Tests passed in 797 seconds Skipped tests in pyspark.sql.tests.test_dataframe with python2.7: test_create_dataframe_required_pandas_not_found (pyspark.sql.tests.test_dataframe.DataFrameTests) ... skipped 'Required Pandas was found.' test_to_pandas_required_pandas_not_found (pyspark.sql.tests.test_dataframe.DataFrameTests) ... skipped 'Required Pandas was found.' {noformat} turns out that a couple of workers were missing the US/Pacific-New tzinfo file in the pytz libdir. a quick scp + python2.7 -m compileall later and things seem to be happy! i'll leave this open for now, and if anyone notices other builds failing in this way please link to them here. > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. 
There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. > Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File >
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813812#comment-16813812 ] shane knapp commented on SPARK-27389: - ok, this should be fixed now... i got all the workers to recognize US/Pacific-New w/python2.7 and the python/run-tests script now passes! {noformat} -bash-4.1$ python/run-tests --python-executables=python2.7 Running PySpark tests. Output is in /home/jenkins/src/spark/python/unit-tests.log Will test against the following Python executables: ['python2.7'] Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming'] ...trimming a bunch to get to the failing tests... Finished test(python2.7): pyspark.sql.tests.test_dataframe (32s) ... 2 tests were skipped ...yay! it passed! now skipping more output to get to the end... Tests passed in 797 seconds Skipped tests in pyspark.sql.tests.test_dataframe with python2.7: test_create_dataframe_required_pandas_not_found (pyspark.sql.tests.test_dataframe.DataFrameTests) ... skipped 'Required Pandas was found.' test_to_pandas_required_pandas_not_found (pyspark.sql.tests.test_dataframe.DataFrameTests) ... skipped 'Required Pandas was found.' {noformat} turns out that a couple of workers were missing the US/Pacific-New tzinfo file in the pytz libdir. a quick scp + python2.7 -m compileall later and things seem to be happy! i'll leave this open for now, and if anyone notices other builds failing in this way please link to them here. > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. > Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = 
lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx",
[jira] [Created] (SPARK-27422) CurrentDate should return local date
Maxim Gekk created SPARK-27422: -- Summary: CurrentDate should return local date Key: SPARK-27422 URL: https://issues.apache.org/jira/browse/SPARK-27422 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk According to the SQL standard, the DATE type is a union of (year, month, day), and current date should return a (year, month, day) triple in the session's local time zone. The ticket aims to follow this requirement and calculate the local date for the session time zone. The local date should be converted to an epoch day, and stored internally as the DATE value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
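The requested behavior can be sketched in Python (a hypothetical illustration of the semantics, not Spark code): take the current instant, project it into the session time zone, and encode the resulting local date as days since the epoch.

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo

EPOCH = date(1970, 1, 1)

def current_date_epoch_days(session_tz: str, now_utc: datetime) -> int:
    """Local date for the session time zone, encoded as days since the epoch.
    `now_utc` is passed in (rather than read from the clock) for testability."""
    local = now_utc.astimezone(ZoneInfo(session_tz)).date()
    return (local - EPOCH).days
```

The same instant can yield different epoch days in different session time zones, which is exactly the point of the ticket.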
[jira] [Updated] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table
[ https://issues.apache.org/jira/browse/SPARK-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Maynard updated SPARK-27421: - Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141) > RuntimeException when querying a view on a partitioned parquet table > > > Key: SPARK-27421 > URL: https://issues.apache.org/jira/browse/SPARK-27421 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit > Server VM, Java 1.8.0_141) >Reporter: Eric Maynard >Priority: Minor > > When running a simple query, I get the following stacktrace: > {code} > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. You can set the Spark configuration > setting spark.sql.hive.manageFilesourcePartitions to false to work around > this problem, however this will result in degraded performance. Please report > a bug: https://issues.apache.org/jira/browse/SPARK > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684) > at > 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957) > at > org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27) > at > 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76) > at scala.collection.immutable.List.foreach(List.scala:392) > at >
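The exception message itself names the workaround. A hedged sketch of applying it when building the session (the application code around it is a placeholder; spark.sql.hive.manageFilesourcePartitions is a static conf, so it must be set before the SparkSession is created, and disabling it degrades partition-pruning performance, as the message warns):

```python
from pyspark.sql import SparkSession

# Workaround from the error message: let Hive, not Spark, manage
# file source partition metadata. Trades away pruning performance.
spark = (SparkSession.builder
         .config("spark.sql.hive.manageFilesourcePartitions", "false")
         .getOrCreate())
```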
[jira] [Created] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table
Eric Maynard created SPARK-27421: Summary: RuntimeException when querying a view on a partitioned parquet table Key: SPARK-27421 URL: https://issues.apache.org/jira/browse/SPARK-27421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Eric Maynard When running a simple query, I get the following stacktrace: {code} java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266) at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261) at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957) at org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73) at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63) at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27) at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at 
scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66) at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72) at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77) at
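The exception text above points at its own escape hatch. If getting partition metadata by filter is what fails, one mitigation (with the documented cost of degraded performance, since metastore partition pruning is lost) is to turn off metastore-managed file source partitions. A minimal sketch, e.g. in spark-defaults.conf:

```
# Workaround named by the error message itself; this trades away metastore
# partition pruning, so expect degraded performance on large partitioned tables.
spark.sql.hive.manageFilesourcePartitions   false
```

The same setting can be passed per job via `--conf spark.sql.hive.manageFilesourcePartitions=false`. As the message says, the underlying MetaException still deserves a bug report; this only works around it.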
[jira] [Updated] (SPARK-27420) KinesisInputDStream should expose a way to disable CloudWatch metrics
[ https://issues.apache.org/jira/browse/SPARK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jerome Gagnon updated SPARK-27420: -- Priority: Major (was: Minor) > KinesisInputDStream should expose a way to disable CloudWatch metrics > - > > Key: SPARK-27420 > URL: https://issues.apache.org/jira/browse/SPARK-27420 > Project: Spark > Issue Type: Improvement > Components: DStreams, Input/Output >Affects Versions: 2.3.3 >Reporter: Jerome Gagnon >Priority: Major > > KinesisInputDStream currently does not provide a way to disable the CloudWatch > metrics push. The Kinesis Client Library (KCL), which is used under the hood, > provides this ability through its `withMetrics` methods. > To make things worse, the default level is "DETAILED", which pushes tens of > metrics every 10 seconds. When dealing with multiple streaming jobs this adds > up quickly, leading to thousands of dollars in cost. > Exposing a way to disable monitoring, or to set the proper level, is critical to > us. We had to send invalid credentials and suppress the resulting logs as a > less-than-ideal workaround: see > [https://stackoverflow.com/questions/41811039/disable-cloudwatch-for-aws-kinesis-at-spark-streaming/55599002#55599002] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27420) KinesisInputDStream should expose a way to disable CloudWatch metrics
Jerome Gagnon created SPARK-27420: - Summary: KinesisInputDStream should expose a way to disable CloudWatch metrics Key: SPARK-27420 URL: https://issues.apache.org/jira/browse/SPARK-27420 Project: Spark Issue Type: Improvement Components: DStreams, Input/Output Affects Versions: 2.3.3 Reporter: Jerome Gagnon KinesisInputDStream currently does not provide a way to disable the CloudWatch metrics push. The Kinesis Client Library (KCL), which is used under the hood, provides this ability through its `withMetrics` methods. To make things worse, the default level is "DETAILED", which pushes tens of metrics every 10 seconds. When dealing with multiple streaming jobs this adds up quickly, leading to thousands of dollars in cost. Exposing a way to disable monitoring, or to set the proper level, is critical to us. We had to send invalid credentials and suppress the resulting logs as a less-than-ideal workaround: see [https://stackoverflow.com/questions/41811039/disable-cloudwatch-for-aws-kinesis-at-spark-streaming/55599002#55599002]
[jira] [Created] (SPARK-27419) When setting spark.executor.heartbeatInterval to a value less than 1 second, it will always fail
Shixiong Zhu created SPARK-27419: Summary: When setting spark.executor.heartbeatInterval to a value less than 1 second, it will always fail Key: SPARK-27419 URL: https://issues.apache.org/jira/browse/SPARK-27419 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.1, 2.4.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu When setting spark.executor.heartbeatInterval to a value less than 1 second in branch-2.4, it will always fail because the value will be converted to 0 and the heartbeat will always time out and finally kill the executor.
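The failure mode described here is consistent with integer unit truncation: a sub-second interval expressed in whole seconds floors to 0, and a zero timeout can never be satisfied. A self-contained sketch of that arithmetic (this models the symptom only; Spark's actual conversion code is not shown here):

```python
# Toy model of sub-second config truncation (not Spark's real code path):
# converting a millisecond interval to whole seconds floors values < 1000 ms to 0.
def to_whole_seconds(interval_ms: int) -> int:
    return interval_ms // 1000

assert to_whole_seconds(950) == 0       # a "950ms" heartbeat becomes 0
assert to_whole_seconds(10_000) == 10   # the 10s default survives conversion
```

Any configured value below one second collapses to 0 under this conversion, which matches the report that such settings "always fail".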
[jira] [Created] (SPARK-27418) Migrate Parquet to File Data Source V2
Gengliang Wang created SPARK-27418: -- Summary: Migrate Parquet to File Data Source V2 Key: SPARK-27418 URL: https://issues.apache.org/jira/browse/SPARK-27418 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang
[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813625#comment-16813625 ] shane knapp edited comment on SPARK-27389 at 4/9/19 5:08 PM: - done. {noformat} $ pssh -h jenkins_workers.txt "cp /root/python/__init__.py /home/anaconda/lib/python2.7/site-packages/pytz/__init__.py" [1] 10:06:19 [SUCCESS] amp-jenkins-worker-02 [2] 10:06:19 [SUCCESS] amp-jenkins-worker-06 [3] 10:06:19 [SUCCESS] amp-jenkins-worker-05 [4] 10:06:19 [SUCCESS] amp-jenkins-worker-03 [5] 10:06:19 [SUCCESS] amp-jenkins-worker-01 [6] 10:06:19 [SUCCESS] amp-jenkins-worker-04 $ ssh amp-jenkins-worker-03 "grep Pacific-New /home/anaconda/lib/python2.7/site-packages/pytz/__init__.py" 'US/Pacific-New', 'US/Pacific-New', {noformat} and {noformat} [sknapp@amp-jenkins-worker-04 ~]$ python2.7 -c "import pytz; print 'US/Pacific-New' in pytz.all_timezones" True {noformat} was (Author: shaneknapp): done. {noformat} $ pssh -h jenkins_workers.txt "cp /root/python/__init__.py /home/anaconda/lib/python2.7/site-packages/pytz/__init__.py" [1] 10:06:19 [SUCCESS] amp-jenkins-worker-02 [2] 10:06:19 [SUCCESS] amp-jenkins-worker-06 [3] 10:06:19 [SUCCESS] amp-jenkins-worker-05 [4] 10:06:19 [SUCCESS] amp-jenkins-worker-03 [5] 10:06:19 [SUCCESS] amp-jenkins-worker-01 [6] 10:06:19 [SUCCESS] amp-jenkins-worker-04 $ ssh amp-jenkins-worker-03 "grep Pacific-New /home/anaconda/lib/python2.7/site-packages/pytz/__init__.py" 'US/Pacific-New', 'US/Pacific-New', {noformat} > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. > Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = 
lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813625#comment-16813625 ] shane knapp commented on SPARK-27389: - done. {noformat} $ pssh -h jenkins_workers.txt "cp /root/python/__init__.py /home/anaconda/lib/python2.7/site-packages/pytz/__init__.py" [1] 10:06:19 [SUCCESS] amp-jenkins-worker-02 [2] 10:06:19 [SUCCESS] amp-jenkins-worker-06 [3] 10:06:19 [SUCCESS] amp-jenkins-worker-05 [4] 10:06:19 [SUCCESS] amp-jenkins-worker-03 [5] 10:06:19 [SUCCESS] amp-jenkins-worker-01 [6] 10:06:19 [SUCCESS] amp-jenkins-worker-04 $ ssh amp-jenkins-worker-03 "grep Pacific-New /home/anaconda/lib/python2.7/site-packages/pytz/__init__.py" 'US/Pacific-New', 'US/Pacific-New', {noformat} > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File 
"/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat}
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813620#comment-16813620 ] shane knapp commented on SPARK-27389: - ok, i am going to go all cowboy on this and manually update: {noformat} /home/anaconda/lib/python2.7/site-packages/pytz/__init__.py {noformat} and add the US/Pacific-New TZ. this should definitely fix the problem, and if it doesn't, i can very quickly roll back. > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File 
"/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat}
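The failure above is easy to model: pytz resolves zone names against the library's own compiled-in registry, so a zone file that exists under /usr/share/zoneinfo does not help if the name is absent from that registry, which is why patching the installed `__init__.py` fixed the builds. A minimal stand-in (pytz itself is not used here; the registry below is a hypothetical two-entry subset):

```python
# Minimal stand-in for pytz's name lookup: a zone file on disk does not help
# if the name is missing from the library's own registry of known zones.
class UnknownTimeZoneError(KeyError):
    pass

REGISTRY = {"UTC", "US/Pacific"}  # hypothetical subset of pytz.all_timezones

def timezone(zone: str) -> str:
    if zone not in REGISTRY:
        raise UnknownTimeZoneError(zone)
    return zone

try:
    resolved = timezone("US/Pacific-New")   # not in the registry -> raises
except UnknownTimeZoneError:
    resolved = timezone("US/Pacific")       # the zone the alias was meant to track

assert resolved == "US/Pacific"
```

The jenkins fix took the other direction, adding 'US/Pacific-New' to the registry, but the lookup mechanics are the same.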
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813554#comment-16813554 ] Sean Owen commented on SPARK-25150: --- What happens on master, and what happens if you run the SQL query in your example -- is it different? Your second example is unexpected to me, so I think there is probably an issue here somewhere, especially if ANSI SQL mandates a different behavior here (does it? I don't know) > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. 
> I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior.
[jira] [Resolved] (SPARK-27394) The staleness of UI may last minutes or hours when no tasks start or finish
[ https://issues.apache.org/jira/browse/SPARK-27394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27394. Resolution: Fixed Fix Version/s: 3.0.0 > The staleness of UI may last minutes or hours when no tasks start or finish > --- > > Key: SPARK-27394 > URL: https://issues.apache.org/jira/browse/SPARK-27394 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0, 2.4.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 3.0.0 > > > Run the following code on a cluster that has at least 2 cores. > {code} > sc.makeRDD(1 to 1000, 1000).foreach { i => > Thread.sleep(30) > } > {code} > The jobs page will just show one running task. > This is because when the second task event calls > "AppStatusListener.maybeUpdate" for a job, it will just be ignored, since the gap > between the two events is smaller than `spark.ui.liveUpdate.period`. > After the second task event, in the above case, because there won't be any > other task events, the Spark UI will always be stale until the next task > event gets fired (after 300 seconds).
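The throttling described in this issue is straightforward to model: when every event after the first arrives within the update period of the last persisted write, nothing after the first write is ever stored, and the stored view stays stale until a much later event clears the gap. A simplified sketch (class and field names are illustrative, not AppStatusListener's actual ones):

```python
# Simplified model of a liveUpdate-style throttle: an update is skipped when
# it arrives within `period` of the last persisted write.
class ThrottledStore:
    def __init__(self, period: float):
        self.period = period
        self.last_write = float("-inf")
        self.persisted = None

    def maybe_update(self, now: float, value):
        if now - self.last_write >= self.period:
            self.persisted = value
            self.last_write = now

store = ThrottledStore(period=100.0)
store.maybe_update(0.0, "1 task running")    # first event is written
store.maybe_update(1.0, "2 tasks running")   # within the period -> dropped
# no further events arrive, so the stored snapshot never catches up
assert store.persisted == "1 task running"
```

This is why the jobs page in the report keeps showing one running task: the second (and last) task event fell inside the throttle window, and no later event arrived to flush the newer state.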
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813511#comment-16813511 ] Brandon Perry commented on SPARK-25150: --- [~srowen], I ran into this situation yesterday as well, and I think there may be some miscommunication about expected behavior vs actual here. Many people are accustomed to writing joins in a sequential manner in SQL; using the sample scenario here:
{code:SQL|borderstyle=solid}
SELECT a.State, a.`Total Population`, b.count AS `Total Humans`, c.count AS `Total Zombies`
FROM states AS a
JOIN total_humans AS b ON a.state = b.state
JOIN total_zombies AS c ON a.state = c.state
ORDER BY a.state ASC;
{code}
On virtually all ANSI SQL systems, this will result in the output which [~nchammas] mentions is expected. However, it looks like Spark actually evaluates the chained joins by doing something like (states JOIN humans ON state) JOIN (states JOIN zombies ON state) ON (_no condition specified_). Part of the problem is that even when you attempt to fix the states['State'] join, you get the "trivially inferred" warning with inappropriate output, as they share the same lineage and Spark optimizes past the intended logic:
{code:Python|borderstyle=solid}
states_with_humans = states \
    .join(
        total_humans,
        on=(states['State'] == total_humans['State'])
    )

analysis = states_with_humans \
    .join(
        total_zombies,
        on=(states_with_humans['State'] == total_zombies['State'])
    ) \
    .orderBy(states['State'], ascending=True) \
    .select(
        states_with_humans['State'],
        states_with_humans['Total Population'],
        states_with_humans['count'].alias('Total Humans'),
        total_zombies['count'].alias('Total Zombies'),
    )
{code}
Is there something we're all missing here? This seems to be a cookie-cutter example of a three-way join not functioning as expected without explicit aliasing. Is there a reason this behavior is desirable?
> Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: expected-output.txt, > output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, > persons.csv, states.csv, zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not "correct" in the sense that it should > be left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior.
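One way to see why the join condition degenerates is to model how Catalyst resolves columns: each attribute carries an expression id, and a DataFrame derived from another reuses its parent's ids. Two frames derived from the same source therefore share ids, so an equi-join condition between them can resolve both sides to the same attribute and reduce to a tautology. A deliberately simplified model (Catalyst's real attribute resolution is more involved; the class below is a toy, not Spark code):

```python
# Toy model of Catalyst-style attribute resolution: columns are identified by
# expression ids, and derived frames reuse their parent's ids, so a join
# condition between two frames derived from the same source compares an
# attribute with itself -> trivially true -> implicit cartesian product.
import itertools

_ids = itertools.count()

class Frame:
    def __init__(self, columns=None, parent=None):
        if parent is not None:
            self.ids = dict(parent.ids)   # derived frames keep the same ids
        else:
            self.ids = {c: next(_ids) for c in columns}

    def __getitem__(self, name):
        return self.ids[name]

persons = Frame(columns=["State", "count"])
total_humans = Frame(parent=persons)    # e.g. an aggregate over persons
total_zombies = Frame(parent=persons)   # derived from the same source

# the intended condition resolves to the same expression id on both sides:
assert total_humans["State"] == total_zombies["State"]
```

Under this model the usual advice (explicit `alias()` on the derived frames, or rebuilding one side from a fresh read) amounts to forcing the two sides onto distinct attribute ids before the join.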
[jira] [Assigned] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-27361: - Assignee: Thomas Graves > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > > Design and implement YARN support for GPU-aware scheduling: > * User can request GPU resources at Spark application level. > * YARN can pass GPU info to Spark executor. > * Integrate with YARN 3.2 GPU support.
[jira] [Resolved] (SPARK-27417) CLONE - ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods
[ https://issues.apache.org/jira/browse/SPARK-27417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangpengyu resolved SPARK-27417. Resolution: Fixed > CLONE - ExternalSorter and ExternalAppendOnlyMap should free shuffle memory > in their stop() methods > --- > > Key: SPARK-27417 > URL: https://issues.apache.org/jira/browse/SPARK-27417 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0 >Reporter: yangpengyu >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.6.0 > > > I discovered multiple leaks of shuffle memory while working on my memory > manager consolidation patch, which added the ability to do strict memory leak > detection for the bookkeeping that used to be performed by the > ShuffleMemoryManager. This uncovered a handful of places where tasks can > acquire execution/shuffle memory but never release it, starving themselves of > memory. > Problems that I found: > * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution > memory. > * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a > {{CompletionIterator}}. > * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing > its resources.
[jira] [Closed] (SPARK-27417) CLONE - ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods
[ https://issues.apache.org/jira/browse/SPARK-27417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangpengyu closed SPARK-27417. -- > CLONE - ExternalSorter and ExternalAppendOnlyMap should free shuffle memory > in their stop() methods > --- > > Key: SPARK-27417 > URL: https://issues.apache.org/jira/browse/SPARK-27417 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0 >Reporter: yangpengyu >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.6.0 > > > I discovered multiple leaks of shuffle memory while working on my memory > manager consolidation patch, which added the ability to do strict memory leak > detection for the bookkeeping that used to be performed by the > ShuffleMemoryManager. This uncovered a handful of places where tasks can > acquire execution/shuffle memory but never release it, starving themselves of > memory. > Problems that I found: > * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution > memory. > * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a > {{CompletionIterator}}. > * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing > its resources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27417) CLONE - ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods
yangpengyu created SPARK-27417: -- Summary: CLONE - ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods Key: SPARK-27417 URL: https://issues.apache.org/jira/browse/SPARK-27417 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0 Reporter: yangpengyu Assignee: Josh Rosen Fix For: 1.6.0 I discovered multiple leaks of shuffle memory while working on my memory manager consolidation patch, which added the ability to do strict memory leak detection for the bookkeeping that used to be performed by the ShuffleMemoryManager. This uncovered a handful of places where tasks can acquire execution/shuffle memory but never release it, starving themselves of memory. Problems that I found: * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution memory. * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a {{CompletionIterator}}. * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing its resources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11293) ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813495#comment-16813495 ] yangpengyu commented on SPARK-11293: I hit the same problem when I ran a TPC-H test on Spark 1.6.0. My dataset scale is SF=1000. Environment: 1 master, 3 workers, onHeapMemory=10g, offHeapMemory=20g, 24 threads per worker. Queries 3 and 17 detected memory leaks. Some of the logs are as follows:
19/04/09 21:57:59 ERROR Executor: Managed memory leak detected; size = 536870912 bytes, TID = 2685
19/04/09 21:58:16 WARN TaskMemoryManager: leak 512.0 MB memory from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@385b3b3f
19/04/09 21:58:16 ERROR Executor: Managed memory leak detected; size = 536870912 bytes, TID = 2683
19/04/09 21:58:16 WARN TaskMemoryManager: leak 512.0 MB memory from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@35be7b55
19/04/09 21:58:16 ERROR Executor: Managed memory leak detected; size = 536870912 bytes, TID = 2703
19/04/09 21:58:20 WARN TaskMemoryManager: leak 512.0 MB memory from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@50f93582
19/04/09 21:58:20 ERROR Executor: Managed memory leak detected; size = 536870912 bytes, TID = 2709
19/04/09 21:58:21 WARN TaskMemoryManager: leak 512.0 MB memory from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@28e3ec7a
19/04/09 21:58:21 ERROR Executor: Managed memory leak detected; size = 536870912 bytes, TID = 2723
19/04/09 21:59:50 WARN TaskMemoryManager: leak 512.0 MB memory from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5b2f5dbc
19/04/09 21:59:50 ERROR Executor: Managed memory leak detected; size = 536870912 bytes, TID = 2687
19/04/09 22:00:50 WARN TransportChannelHandler: Exception in connection from hw083/172.18.11.83:42989
> ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their > stop() methods > --- > > Key: 
SPARK-11293 > URL: https://issues.apache.org/jira/browse/SPARK-11293 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.6.0 > > > I discovered multiple leaks of shuffle memory while working on my memory > manager consolidation patch, which added the ability to do strict memory leak > detection for the bookkeeping that used to be performed by the > ShuffleMemoryManager. This uncovered a handful of places where tasks can > acquire execution/shuffle memory but never release it, starving themselves of > memory. > Problems that I found: > * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution > memory. > * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a > {{CompletionIterator}}. > * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing > its resources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
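The fix described in SPARK-11293 hinges on wiring resource cleanup to iterator exhaustion. Below is a minimal, self-contained sketch of that pattern; `CompletionIteratorSketch` and `wrap` are hypothetical names for illustration, not Spark's actual `CompletionIterator` class:

```scala
// Sketch: wrap an iterator so that a cleanup callback (e.g. releasing a
// sorter's shuffle/execution memory) fires exactly once, as soon as the
// consumer observes that the underlying iterator is exhausted.
object CompletionIteratorSketch {
  def wrap[A](sub: Iterator[A])(completion: => Unit): Iterator[A] =
    new Iterator[A] {
      private var completed = false
      def hasNext: Boolean = {
        val more = sub.hasNext
        // Run the completion callback the first time exhaustion is observed.
        if (!more && !completed) { completed = true; completion }
        more
      }
      def next(): A = sub.next()
    }
}
```

A shuffle reader can then hand out something like `wrap(sorter.iterator) { sorter.stop() }`, so the sorter's memory is freed the moment the task finishes consuming the sorted records, rather than leaking until the task ends.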
[jira] [Closed] (SPARK-27415) UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-27415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peng bo closed SPARK-27415. --- > UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines > have different Oops size > > > Key: SPARK-27415 > URL: https://issues.apache.org/jira/browse/SPARK-27415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1 >Reporter: peng bo >Priority: Major > > This is a follow-up to > https://issues.apache.org/jira/browse/SPARK-27406, > https://issues.apache.org/jira/browse/SPARK-10914 > This issue is to fix the UnsafeMapData & UnsafeArrayData Kryo serialization > issue when two machines have different Oops size. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27415) UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-27415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] peng bo resolved SPARK-27415. - Resolution: Invalid Duplicate issue created due to a network problem. > UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines > have different Oops size > > > Key: SPARK-27415 > URL: https://issues.apache.org/jira/browse/SPARK-27415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1 >Reporter: peng bo >Priority: Major > > This is a follow-up to > https://issues.apache.org/jira/browse/SPARK-27406, > https://issues.apache.org/jira/browse/SPARK-10914 > This issue is to fix the UnsafeMapData & UnsafeArrayData Kryo serialization > issue when two machines have different Oops size. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27416) UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size
peng bo created SPARK-27416: --- Summary: UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size Key: SPARK-27416 URL: https://issues.apache.org/jira/browse/SPARK-27416 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.1 Reporter: peng bo This is a follow-up to https://issues.apache.org/jira/browse/SPARK-27406, https://issues.apache.org/jira/browse/SPARK-10914 This issue is to fix the UnsafeMapData & UnsafeArrayData Kryo serialization issue when two machines have different Oops size. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27415) UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size
peng bo created SPARK-27415: --- Summary: UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size Key: SPARK-27415 URL: https://issues.apache.org/jira/browse/SPARK-27415 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.1 Reporter: peng bo This is a follow-up to https://issues.apache.org/jira/browse/SPARK-27406, https://issues.apache.org/jira/browse/SPARK-10914 This issue is to fix the UnsafeMapData & UnsafeArrayData Kryo serialization issue when two machines have different Oops size. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27411) DataSourceV2Strategy should not eliminate subquery
[ https://issues.apache.org/jira/browse/SPARK-27411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27411: --- Assignee: Mingcong Han > DataSourceV2Strategy should not eliminate subquery > -- > > Key: SPARK-27411 > URL: https://issues.apache.org/jira/browse/SPARK-27411 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mingcong Han >Assignee: Mingcong Han >Priority: Major > Fix For: 3.0.0 > > > In DataSourceV2Strategy, it seems we eliminate the subqueries by mistake > after normalizing filters. Here is an example: > We have a SQL query with a scalar subquery: > {code:scala} > val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)") > plan.explain(true) > {code} > And we get the log info of DataSourceV2Strategy: > {noformat} > Pushing operators to csv:examples/src/main/resources/t2.txt > Pushed Filters: > Post-Scan Filters: isnotnull(t2a#30) > Output: t2a#30, t2b#31 > {noformat} > The `Post-Scan Filters` should contain the scalar subquery, but we eliminate > it by mistake. 
> {noformat} > == Parsed Logical Plan == > 'Project [*] > +- 'Filter ('t2a > scalar-subquery#56 []) >: +- 'Project [unresolvedalias('max('t1a), None)] >: +- 'UnresolvedRelation `t1` >+- 'UnresolvedRelation `t2` > == Analyzed Logical Plan == > t2a: string, t2b: string > Project [t2a#30, t2b#31] > +- Filter (t2a#30 > scalar-subquery#56 []) >: +- Aggregate [max(t1a#13) AS max(t1a)#63] >: +- SubqueryAlias `t1` >:+- RelationV2[t1a#13, t1b#14] > csv:examples/src/main/resources/t1.txt >+- SubqueryAlias `t2` > +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt > == Optimized Logical Plan == > Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 [])) > : +- Aggregate [max(t1a#13) AS max(t1a)#63] > : +- Project [t1a#13] > :+- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt > +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt > == Physical Plan == > *(1) Project [t2a#30, t2b#31] > +- *(1) Filter isnotnull(t2a#30) >+- *(1) BatchScan[t2a#30, t2b#31] class > org.apache.spark.sql.execution.datasources.v2.csv.CSVScan > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
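To make the reported bug concrete, here is a minimal sketch of the filter-splitting step; the `Expr` hierarchy and `split` below are hypothetical stand-ins for illustration, not Spark's actual `DataSourceV2Strategy` code. The point is that a predicate containing a scalar subquery cannot be pushed to the source, but it must survive as a post-scan filter rather than being dropped:

```scala
// Hypothetical model of splitting normalized filters in a pushdown strategy.
// Simple predicates can be translated and pushed to the data source;
// predicates containing a subquery cannot, so they must stay post-scan.
object FilterSplitSketch {
  sealed trait Expr
  case class Simple(sql: String) extends Expr         // translatable, pushable
  case class WithSubquery(sql: String) extends Expr   // contains a subquery

  // Correct behavior: everything that is not pushed remains a post-scan
  // filter; nothing is silently discarded.
  def split(filters: Seq[Expr]): (Seq[Expr], Seq[Expr]) =
    filters.partition(_.isInstanceOf[Simple])
}
```

With the buggy behavior the subquery predicate appeared in neither the pushed set nor the post-scan set, which silently changes query results.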
[jira] [Resolved] (SPARK-27411) DataSourceV2Strategy should not eliminate subquery
[ https://issues.apache.org/jira/browse/SPARK-27411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27411. - Resolution: Fixed Issue resolved by pull request 24321 [https://github.com/apache/spark/pull/24321] > DataSourceV2Strategy should not eliminate subquery > -- > > Key: SPARK-27411 > URL: https://issues.apache.org/jira/browse/SPARK-27411 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mingcong Han >Priority: Major > Fix For: 3.0.0 > > > In DataSourceV2Strategy, it seems we eliminate the subqueries by mistake > after normalizing filters. Here is an example: > We have a SQL query with a scalar subquery: > {code:scala} > val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)") > plan.explain(true) > {code} > And we get the log info of DataSourceV2Strategy: > {noformat} > Pushing operators to csv:examples/src/main/resources/t2.txt > Pushed Filters: > Post-Scan Filters: isnotnull(t2a#30) > Output: t2a#30, t2b#31 > {noformat} > The `Post-Scan Filters` should contain the scalar subquery, but we eliminate > it by mistake. 
> {noformat} > == Parsed Logical Plan == > 'Project [*] > +- 'Filter ('t2a > scalar-subquery#56 []) >: +- 'Project [unresolvedalias('max('t1a), None)] >: +- 'UnresolvedRelation `t1` >+- 'UnresolvedRelation `t2` > == Analyzed Logical Plan == > t2a: string, t2b: string > Project [t2a#30, t2b#31] > +- Filter (t2a#30 > scalar-subquery#56 []) >: +- Aggregate [max(t1a#13) AS max(t1a)#63] >: +- SubqueryAlias `t1` >:+- RelationV2[t1a#13, t1b#14] > csv:examples/src/main/resources/t1.txt >+- SubqueryAlias `t2` > +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt > == Optimized Logical Plan == > Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 [])) > : +- Aggregate [max(t1a#13) AS max(t1a)#63] > : +- Project [t1a#13] > :+- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt > +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt > == Physical Plan == > *(1) Project [t2a#30, t2b#31] > +- *(1) Filter isnotnull(t2a#30) >+- *(1) BatchScan[t2a#30, t2b#31] class > org.apache.spark.sql.execution.datasources.v2.csv.CSVScan > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27414) make it clear that date type is timezone independent
Wenchen Fan created SPARK-27414: --- Summary: make it clear that date type is timezone independent Key: SPARK-27414 URL: https://issues.apache.org/jira/browse/SPARK-27414 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
[ https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chendi.Xue updated SPARK-27412: --- Attachment: PmemShuffleManager-DesignDoc.pdf > Add a new shuffle manager to use Persistent Memory as shuffle and spilling > storage > -- > > Key: SPARK-27412 > URL: https://issues.apache.org/jira/browse/SPARK-27412 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Chendi.Xue >Priority: Minor > Labels: shuffle > Attachments: PmemShuffleManager-DesignDoc.pdf > > > Add a new shuffle manager called "PmemShuffleManager", which lets us use > Persistent Memory devices as storage for shuffle and external-sorter > spilling. > In this implementation, we leveraged the Persistent Memory Development > Kit (PMDK) to support transactional writes with high performance. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27413) Keep the same epoch pace between driver and executor.
Genmao Yu created SPARK-27413: - Summary: Keep the same epoch pace between driver and executor. Key: SPARK-27413 URL: https://issues.apache.org/jira/browse/SPARK-27413 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Genmao Yu The pace of epoch generation in the driver and epoch pulling in the executor can differ. This results in many empty epochs for a partition when the epoch-pulling interval is larger than the epoch-generation interval. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic
[ https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813195#comment-16813195 ] Valeria Vasylieva commented on SPARK-20597: --- [~jlaskowski] I kindly ask you to review the PR, so that this task does not stall. > KafkaSourceProvider falls back on path as synonym for topic > --- > > Key: SPARK-20597 > URL: https://issues.apache.org/jira/browse/SPARK-20597 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > Labels: starter > > # {{KafkaSourceProvider}} supports {{topic}} option that sets the Kafka topic > to save a DataFrame's rows to > # {{KafkaSourceProvider}} can use {{topic}} column to assign rows to Kafka > topics for writing > What seems a quite interesting option is to support {{start(path: String)}} > as the least precedence option in which {{path}} would designate the default > topic when no other options are used. > {code} > df.writeStream.format("kafka").start("topic") > {code} > See > http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html > for discussion -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.
[ https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813158#comment-16813158 ] ketan kunde commented on SPARK-9858: [~aroberts]: did the ExchangeCoordinatorSuite test cases pass in your big-endian environment, specifically the tests named test(s"determining the number of reducers: complex query 1 and test(s"determining the number of reducers: complex query 2? The same test cases fail in my big-endian environment with the following logs: * determining the number of reducers: complex query 1 *** FAILED *** Set(1, 2) did not equal Set(2, 3) (ExchangeCoordinatorSuite.scala:424) - determining the number of reducers: complex query 2 *** FAILED *** Set(4, 2) did not equal Set(5, 3) (ExchangeCoordinatorSuite.scala:476) Since this ticket is RESOLVED, I would like to know what change you made to get these test cases passing. Could you also highlight which exact Spark feature these test cases exercise? I would be very grateful for your reply. Regards Ketan > Introduce an ExchangeCoordinator to estimate the number of post-shuffle > partitions. > --- > > Key: SPARK-9858 > URL: https://issues.apache.org/jira/browse/SPARK-9858 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Major > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25994) SPIP: Property Graphs, Cypher Queries, and Algorithms
[ https://issues.apache.org/jira/browse/SPARK-25994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813156#comment-16813156 ] Martin Junghanns commented on SPARK-25994: -- [~kanjilal] Thanks for the initial comments on the doc. Looking forward to your PR comments. Let's discuss tasks after the API is settled and https://issues.apache.org/jira/browse/SPARK-27300 is merged. How familiar are you with pyspark? Would implementing the Python API be of interest to you? > SPIP: Property Graphs, Cypher Queries, and Algorithms > - > > Key: SPARK-25994 > URL: https://issues.apache.org/jira/browse/SPARK-25994 > Project: Spark > Issue Type: Epic > Components: Graph >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Martin Junghanns >Priority: Major > Labels: SPIP > > Copied from the SPIP doc: > {quote} > GraphX was one of the foundational pillars of the Spark project, and is the > current graph component. This reflects the importance of the graphs data > model, which naturally pairs with an important class of analytic function, > the network or graph algorithm. > However, GraphX is not actively maintained. It is based on RDDs, and cannot > exploit Spark 2’s Catalyst query engine. GraphX is only available to Scala > users. > GraphFrames is a Spark package, which implements DataFrame-based graph > algorithms, and also incorporates simple graph pattern matching with fixed > length patterns (called “motifs”). GraphFrames is based on DataFrames, but > has a semantically weak graph data model (based on untyped edges and > vertices). The motif pattern matching facility is very limited by comparison > with the well-established Cypher language. > The Property Graph data model has become quite widespread in recent years, > and is the primary focus of commercial graph data management and of graph > data research, both for on-premises and cloud data management. 
Many users of > transactional graph databases also wish to work with immutable graphs in > Spark. > The idea is to define a Cypher-compatible Property Graph type based on > DataFrames; to replace GraphFrames querying with Cypher; to reimplement > GraphX/GraphFrames algos on the PropertyGraph type. > To achieve this goal, a core subset of Cypher for Apache Spark (CAPS), > reusing existing proven designs and code, will be employed in Spark 3.0. This > graph query processor, like CAPS, will overlay and drive the SparkSQL > Catalyst query engine, using the CAPS graph query planner. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20984) Reading back from ORC format gives error on big endian systems.
[ https://issues.apache.org/jira/browse/SPARK-20984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813126#comment-16813126 ] ketan kunde commented on SPARK-20984: - Hi, I understand that the ORC file format is not read correctly on big-endian systems. I am looking to build Spark as standalone, since the ORC-related test cases are exclusive to the Hive module, which will not be part of the standalone build. Can I skip all ORC-related test cases for the Spark standalone build and still be sure that I am not compromising any Spark standalone features? Regards Ketan Kunde > Reading back from ORC format gives error on big endian systems. > --- > > Key: SPARK-20984 > URL: https://issues.apache.org/jira/browse/SPARK-20984 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: Redhat 7 on power 7 Big endian platform. > [testuser@soe10-vm12 spark]$ cat /etc/redhat- > redhat-access-insights/ redhat-release > [testuser@soe10-vm12 spark]$ cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) > [testuser@soe10-vm12 spark]$ lscpu > Architecture: ppc64 > CPU op-mode(s):32-bit, 64-bit > Byte Order:Big Endian > CPU(s):8 > On-line CPU(s) list: 0-7 > Thread(s) per core:1 > Core(s) per socket:1 > Socket(s): 8 > NUMA node(s): 1 > Model: IBM pSeries (emulated by qemu) > L1d cache: 32K > L1i cache: 32K > NUMA node0 CPU(s): 0-7 > [testuser@soe10-vm12 spark]$ >Reporter: Mahesh >Priority: Major > Labels: big-endian > Attachments: hive_test_failure_log.txt > > > All orc test cases seem to be failing here. Looks like spark is not able to > read back what is written. Following is a way to check it on spark shell. I > am also pasting the test case which probably passes on x86. > All test cases in OrcHadoopFsRelationSuite.scala are failing. 
> test("SPARK-12218: 'Not' is included in ORC filter pushdown") { > import testImplicits._ > withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") { > withTempPath { dir => > val path = s"${dir.getCanonicalPath}/table1" > (1 to 5).map(i => (i, (i % 2).toString)).toDF("a", > "b").write.orc(path) > checkAnswer( > spark.read.orc(path).where("not (a = 2) or not(b in ('1'))"), > (1 to 5).map(i => Row(i, (i % 2).toString))) > checkAnswer( > spark.read.orc(path).where("not (a = 2 and b in ('1'))"), > (1 to 5).map(i => Row(i, (i % 2).toString))) > } > } > } > Same can be reproduced on spark shell > **Create a DF and write it in orc > scala> (1 to 5).map(i => (i, (i % 2).toString)).toDF("a", > "b").write.orc("test") > **Now try to read it back > scala> spark.read.orc("test").where("not (a = 2) or not(b in ('1'))").show > 17/06/05 04:20:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > org.iq80.snappy.CorruptionException: Invalid copy offset for opcode starting > at 13 > at > org.iq80.snappy.SnappyDecompressor.decompressAllTags(SnappyDecompressor.java:165) > at > org.iq80.snappy.SnappyDecompressor.uncompress(SnappyDecompressor.java:76) > at org.iq80.snappy.Snappy.uncompress(Snappy.java:43) > at > org.apache.hadoop.hive.ql.io.orc.SnappyCodec.decompress(SnappyCodec.java:71) > at > org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:214) > at > org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:238) > at java.io.InputStream.read(InputStream.java:101) > at > org.apache.hive.com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737) > at > org.apache.hive.com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701) > at > org.apache.hive.com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.(OrcProto.java:10661) > at > 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.(OrcProto.java:10625) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10730) > at > org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10725) > at > org.apache.hive.com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200) > at > org.apache.hive.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217) > at > org.apache.hive.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223) > at >
[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813094#comment-16813094 ] Gabor Somogyi commented on SPARK-27409: --- It would definitely help if you can provide your application code. > Micro-batch support for Kafka Source in Spark 2.3 > - > > Key: SPARK-27409 > URL: https://issues.apache.org/jira/browse/SPARK-27409 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Prabhjot Singh Bharaj >Priority: Major > > It seems with this change - > [https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50] > in Spark 2.3 for Kafka Source Provider, a Kafka source can not be run in > micro-batch mode but only in continuous mode. Is that understanding correct ? > {code:java} > E Py4JJavaError: An error occurred while calling o217.load. > E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549) > E at > org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > E at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > E at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > E at java.lang.reflect.Method.invoke(Method.java:498) > E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > E at py4j.Gateway.invoke(Gateway.java:282) > E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > E at py4j.commands.CallCommand.execute(CallCommand.java:79) > E at py4j.GatewayConnection.run(GatewayConnection.java:238) > E at java.lang.Thread.run(Thread.java:748) > E Caused by: org.apache.kafka.common.KafkaException: > org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: > non-existent (No such file or directory) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44) > E at > org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93) > E at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51) > E at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657) > E ... 19 more > E Caused by: org.apache.kafka.common.KafkaException: > java.io.FileNotFoundException: non-existent (No such file or directory) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41) > E ... 
23 more > E Caused by: java.io.FileNotFoundException: non-existent (No such file or > directory) > E at java.io.FileInputStream.open0(Native Method) > E at java.io.FileInputStream.open(FileInputStream.java:195) > E at java.io.FileInputStream.(FileInputStream.java:138) > E at java.io.FileInputStream.(FileInputStream.java:93) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:216) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:201) > E at > org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:137) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:119) > E ... 24 more{code} > When running a simple data stream loader for kafka without an SSL cert, it > goes through this code block - > > {code:java} > ... > ... > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > ... > ...{code} > > Note that I
[jira] [Updated] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
[ https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chendi.Xue updated SPARK-27412:
---
External issue URL: (was: https://github.com/apache/spark/pull/24322)

> Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
> Key: SPARK-27412
> URL: https://issues.apache.org/jira/browse/SPARK-27412
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle, Spark Core
> Affects Versions: 3.0.0
> Reporter: Chendi.Xue
> Priority: Minor
> Labels: shuffle
>
> Add a new shuffle manager called "PmemShuffleManager", which uses a Persistent Memory device as the storage for shuffle data and external-sorter spills.
> This implementation leverages the Persistent Memory Development Kit (PMDK) to support high-performance transactional writes.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
[ https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chendi.Xue updated SPARK-27412:
---
External issue URL: https://github.com/apache/spark/pull/24322

> Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
> Key: SPARK-27412
> URL: https://issues.apache.org/jira/browse/SPARK-27412
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle, Spark Core
> Affects Versions: 3.0.0
> Reporter: Chendi.Xue
> Priority: Minor
> Labels: shuffle
>
> Add a new shuffle manager called "PmemShuffleManager", which uses a Persistent Memory device as the storage for shuffle data and external-sorter spills.
> This implementation leverages the Persistent Memory Development Kit (PMDK) to support high-performance transactional writes.
[jira] [Created] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
Chendi.Xue created SPARK-27412:
--
Summary: Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
Key: SPARK-27412
URL: https://issues.apache.org/jira/browse/SPARK-27412
Project: Spark
Issue Type: New Feature
Components: Shuffle, Spark Core
Affects Versions: 3.0.0
Reporter: Chendi.Xue

Add a new shuffle manager called "PmemShuffleManager", which uses a Persistent Memory device as the storage for shuffle data and external-sorter spills. This implementation leverages the Persistent Memory Development Kit (PMDK) to support high-performance transactional writes.
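For readers unfamiliar with pluggable shuffle managers: Spark selects the shuffle implementation via the `spark.shuffle.manager` configuration key, so a build shipping this class could be enabled with a fragment like the one below. The fully qualified package prefix is hypothetical; the JIRA only gives the short class name "PmemShuffleManager".

{code}
# spark-defaults.conf -- sketch, assuming the plugin jar is on both the
# driver and executor classpaths; the package prefix below is a guess.
spark.shuffle.manager  org.apache.spark.shuffle.pmem.PmemShuffleManager
{code}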
[jira] [Created] (SPARK-27411) DataSourceV2Strategy should not eliminate subquery
Mingcong Han created SPARK-27411:
--
Summary: DataSourceV2Strategy should not eliminate subquery
Key: SPARK-27411
URL: https://issues.apache.org/jira/browse/SPARK-27411
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.0.0
Reporter: Mingcong Han
Fix For: 3.0.0

In DataSourceV2Strategy, it seems we eliminate the subqueries by mistake after normalizing the filters. Here is an example of a SQL query with a scalar subquery:
{code:scala}
val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)")
plan.explain(true)
{code}
The log output of DataSourceV2Strategy is:
{noformat}
Pushing operators to csv:examples/src/main/resources/t2.txt
Pushed Filters:
Post-Scan Filters: isnotnull(t2a#30)
Output: t2a#30, t2b#31
{noformat}
The `Post-Scan Filters` should contain the scalar subquery, but it is eliminated by mistake.
{noformat}
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('t2a > scalar-subquery#56 [])
   :  +- 'Project [unresolvedalias('max('t1a), None)]
   :     +- 'UnresolvedRelation `t1`
   +- 'UnresolvedRelation `t2`

== Analyzed Logical Plan ==
t2a: string, t2b: string
Project [t2a#30, t2b#31]
+- Filter (t2a#30 > scalar-subquery#56 [])
   :  +- Aggregate [max(t1a#13) AS max(t1a)#63]
   :     +- SubqueryAlias `t1`
   :        +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
   +- SubqueryAlias `t2`
      +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt

== Optimized Logical Plan ==
Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 []))
:  +- Aggregate [max(t1a#13) AS max(t1a)#63]
:     +- Project [t1a#13]
:        +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
+- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt

== Physical Plan ==
*(1) Project [t2a#30, t2b#31]
+- *(1) Filter isnotnull(t2a#30)
   +- *(1) BatchScan[t2a#30, t2b#31] class org.apache.spark.sql.execution.datasources.v2.csv.CSVScan
{noformat}
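The failure mode described above, a predicate containing a scalar subquery being silently dropped during filter pushdown, can be illustrated with a small generic sketch. This is not Spark's actual implementation; `split_filters` and `translatable` are hypothetical names used only to show the invariant: every predicate that cannot be pushed to the source must survive as a post-scan filter.

```python
# Illustrative sketch of splitting predicates for data source pushdown.
# A predicate that cannot be translated into a source filter (for example,
# one containing a scalar subquery) must be kept as a post-scan filter;
# dropping it changes query results, which is the bug reported above.

def split_filters(predicates, translatable):
    """Partition predicates into (pushed, post_scan).

    `translatable(p)` decides whether predicate `p` can be pushed down.
    The union of the two lists always equals the input: nothing is dropped.
    """
    pushed = [p for p in predicates if translatable(p)]
    post_scan = [p for p in predicates if not translatable(p)]
    return pushed, post_scan

# Mirrors the example query: one pushable null check, one subquery predicate.
predicates = ["isnotnull(t2a)", "t2a > scalar-subquery#56"]
pushed, post_scan = split_filters(
    predicates, translatable=lambda p: "scalar-subquery" not in p)
```

With this split, `post_scan` retains the subquery comparison, whereas the buggy strategy left only `isnotnull(t2a)` behind.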