[jira] [Commented] (SPARK-15719) Disable writing Parquet summary files by default

2019-04-09 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814061#comment-16814061
 ] 

Ruslan Dautkhanov commented on SPARK-15719:
---

[~lian cheng] quick question on this part from the description -

{quote}
when schema merging is enabled, we need to read footers of all files anyway to 
do the merge
{quote}
Is that still accurate in current Spark 2.3 / 2.4?
I was looking at ParquetFileFormat.inferSchema and it does look at 
`_common_metadata` and `_metadata` files here - 

https://github.com/apache/spark/blob/v2.4.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L231

Or would Spark still need to look at all files in all partitions, rather than 
just these summary files?

Thank you.
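
For reference, a minimal PySpark sketch of the read-side settings being discussed, assuming the Spark 2.x configuration keys spark.sql.parquet.mergeSchema and spark.sql.parquet.respectSummaryFiles and a hypothetical dataset path; whether the summary files are actually consulted depends on these settings, not on anything added here.

{code:python}
# Minimal sketch, not an authoritative description of ParquetFileFormat internals.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Merge the schemas of the Parquet part-files when reading.
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
# Opt in to trusting _common_metadata/_metadata instead of reading the footer
# of every part-file during schema merging.
spark.conf.set("spark.sql.parquet.respectSummaryFiles", "true")

df = spark.read.parquet("/data/events")  # hypothetical path
df.printSchema()
{code}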

> Disable writing Parquet summary files by default
> 
>
> Key: SPARK-15719
> URL: https://issues.apache.org/jira/browse/SPARK-15719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Major
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> Parquet summary files are not particularly useful nowadays since
> # when schema merging is disabled, we assume the schemas of all Parquet 
> part-files are identical, thus we can read the footer from any part-file.
> # when schema merging is enabled, we need to read footers of all files anyway 
> to do the merge.
> On the other hand, writing summary files can be expensive because footers of 
> all part-files must be read and merged. This is particularly costly when 
> appending a small dataset to a large existing Parquet dataset.
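
A minimal sketch of explicitly controlling summary-file writing on the write side, assuming the parquet-mr Hadoop properties parquet.enable.summary-metadata (older releases) and parquet.summary.metadata.level (newer releases); the output path is hypothetical.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

# Disable _metadata/_common_metadata generation so that appends do not have to
# re-read and merge the footers of the existing part-files.
hconf.set("parquet.enable.summary-metadata", "false")
hconf.set("parquet.summary.metadata.level", "NONE")

spark.range(10).write.mode("overwrite").parquet("/tmp/no_summary_files")  # hypothetical path
{code}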






[jira] [Updated] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage

2019-04-09 Thread Chendi.Xue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chendi.Xue updated SPARK-27412:
---
Labels: core  (was: shuffle)

> Add a new shuffle manager to use Persistent Memory as shuffle and spilling 
> storage
> --
>
> Key: SPARK-27412
> URL: https://issues.apache.org/jira/browse/SPARK-27412
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Chendi.Xue
>Priority: Minor
>  Labels: core
> Attachments: PmemShuffleManager-DesignDoc.pdf
>
>
> Add a new shuffle manager called "PmemShuffleManager", which lets us use a 
> Persistent Memory device as storage for shuffle data and external-sorter 
> spills.
> In this implementation, we leverage the Persistent Memory Development Kit 
> (PMDK) to support transactional writes with high performance.
>  
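
As an illustration of how such a manager would be wired in, here is a minimal sketch of selecting a custom shuffle manager through spark.shuffle.manager; the fully qualified class name and the device option below are placeholders based on the proposal, not classes or settings that exist in Spark today.

{code:python}
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        # Hypothetical class name following the proposal's naming.
        .set("spark.shuffle.manager",
             "org.apache.spark.shuffle.pmem.PmemShuffleManager")
        # Hypothetical option pointing the manager at a persistent-memory device.
        .set("spark.shuffle.pmem.devices", "/dev/dax0.0"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()
{code}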






[jira] [Updated] (SPARK-27410) Remove deprecated/no-op mllib.Kmeans get/setRuns methods

2019-04-09 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-27410:
--
Docs Text: In Spark 3.0, the methods getRuns and setRuns in 
org.apache.spark.mllib.clustering.KMeans have been removed. They have been 
no-ops and deprecated since Spark 2.1.0.
   Labels: release-notes  (was: )

> Remove deprecated/no-op mllib.Kmeans get/setRuns methods
> 
>
> Key: SPARK-27410
> URL: https://issues.apache.org/jira/browse/SPARK-27410
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Trivial
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> mllib.KMeans has getRuns and setRuns methods which haven't done anything since 
> Spark 2.1. They're deprecated no-ops and should be removed for Spark 3.






[jira] [Resolved] (SPARK-27410) Remove deprecated/no-op mllib.Kmeans get/setRuns methods

2019-04-09 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27410.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24320
[https://github.com/apache/spark/pull/24320]

> Remove deprecated/no-op mllib.Kmeans get/setRuns methods
> 
>
> Key: SPARK-27410
> URL: https://issues.apache.org/jira/browse/SPARK-27410
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 3.0.0
>
>
> mllib.KMeans has getRuns and setRuns methods which haven't done anything since 
> Spark 2.1. They're deprecated no-ops and should be removed for Spark 3.






[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-09 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813896#comment-16813896
 ] 

Bryan Cutler commented on SPARK-27389:
--

Thanks [~shaneknapp] for the fix. I couldn't come up with any idea why this was 
happening all of a sudden either, but at least we are up and running again!

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: shane knapp
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}






[jira] [Resolved] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-09 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp resolved SPARK-27389.
-
Resolution: Fixed

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: shane knapp
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}






[jira] [Assigned] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-09 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp reassigned SPARK-27389:
---

Assignee: shane knapp

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: shane knapp
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}






[jira] [Commented] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table

2019-04-09 Thread Shivu Sondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813883#comment-16813883
 ] 

Shivu Sondur commented on SPARK-27421:
--

I am checking this issue.

> RuntimeException when querying a view on a partitioned parquet table
> 
>
> Key: SPARK-27421
> URL: https://issues.apache.org/jira/browse/SPARK-27421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit 
> Server VM, Java 1.8.0_141)
>Reporter: Eric Maynard
>Priority: Minor
>
> When running a simple query, I get the following stacktrace:
> {code}
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
>  at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
>  at scala.collection.immutable.List.foreach(List.scala:392)
>  at 
> 

[jira] [Resolved] (SPARK-27387) Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests

2019-04-09 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27387.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Fixed in https://github.com/apache/spark/pull/24306

> Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests
> --
>
> Key: SPARK-27387
> URL: https://issues.apache.org/jira/browse/SPARK-27387
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 2.4.1
>Reporter: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>
> In PySpark unit tests, sqlutils ReusedSQLTestCase.assertPandasEqual is meant 
> to check whether two pandas.DataFrames are equal, but with later versions of 
> Pandas this can fail if the DataFrame has an array column. The method can be 
> replaced by {{assert_frame_equal}} from pandas.util.testing, which is designed 
> for exactly this purpose and gives a better assertion message as well.
> The test failure I have seen is:
>  {noformat}
> ==
> ERROR: test_supported_types 
> (pyspark.sql.tests.test_pandas_udf_grouped_map.GroupedMapPandasUDFTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/bryan/git/spark/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py",
>  line 128, in test_supported_types
>     self.assertPandasEqual(expected1, result1)
>   File "/home/bryan/git/spark/python/pyspark/testing/sqlutils.py", line 268, 
> in assertPandasEqual
>     self.assertTrue(expected.equals(result), msg=msg)
>   File "/home/bryan/miniconda2/envs/pa012/lib/python3.6/site-packages/pandas
> ...
>   File "pandas/_libs/lib.pyx", line 523, in 
> pandas._libs.lib.array_equivalent_object
> ValueError: The truth value of an array with more than one element is 
> ambiguous. Use a.any() or a.all()
>  {noformat}
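
A minimal sketch of the suggested replacement, assuming a pandas version that ships assert_frame_equal (pandas.util.testing in older releases, pandas.testing in newer ones); the sample frames are made up for illustration.

{code:python}
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"id": [1, 2], "v": [[1.0, 2.0], [3.0]]})
result = pd.DataFrame({"id": [1, 2], "v": [[1.0, 2.0], [3.0]]})

# Compares element-wise and raises with a descriptive message on mismatch,
# instead of tripping over the ambiguous truth value of an array column.
assert_frame_equal(expected, result)
{code}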






[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813880#comment-16813880
 ] 

shane knapp edited comment on SPARK-27389 at 4/9/19 10:54 PM:
--

btw, the total impact of this problem was "only" 73 failed builds over the past 
seven days, limited to two workers, amp-jenkins-worker-03 and -05.

{noformat}
  1 NewSparkPullRequestBuilder
  2 spark-branch-2.3-test-sbt-hadoop-2.6
  3 spark-branch-2.3-test-sbt-hadoop-2.7
  1 spark-branch-2.4-test-sbt-hadoop-2.6
  6 spark-branch-2.4-test-sbt-hadoop-2.7
 11 spark-master-test-sbt-hadoop-2.7
 49 SparkPullRequestBuilder
{noformat}

i still haven't figured out *why* things broke...  it wasn't an errant package 
install by a build as i have the anaconda dirs locked down and the only way to 
add/update packages there is to use sudo.


was (Author: shaneknapp):
btw, the total impact of this problem was "only" 73 failed builds over the past 
seven days, limited to two workers, amp-jenkins-worker-03 and -05.

i still haven't figured out *why* things broke...  it wasn't an errant package 
install by a build as i have the anaconda dirs locked down and the only way to 
add/update packages there is to use sudo.

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}




[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813880#comment-16813880
 ] 

shane knapp commented on SPARK-27389:
-

btw, the total impact of this problem was "only" 73 failed builds over the past 
seven days, limited to two workers, amp-jenkins-worker-03 and -05.

i still haven't figured out *why* things broke...  it wasn't an errant package 
install by a build as i have the anaconda dirs locked down and the only way to 
add/update packages there is to use sudo.

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}






[jira] [Resolved] (SPARK-27401) Refactoring conversion of Date/Timestamp to/from java.sql.Date/Timestamp

2019-04-09 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27401.
---
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24311

> Refactoring conversion of Date/Timestamp to/from java.sql.Date/Timestamp
> 
>
> Key: SPARK-27401
> URL: https://issues.apache.org/jira/browse/SPARK-27401
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The fromJavaTimestamp/toJavaTimestamp and toJavaDate/fromJavaDate methods can 
> be implemented using existing DateTimeUtils methods like 
> instantToMicros/microsToInstant and daysToLocalDate/localDateToDays. This 
> should allow us to:
>  # avoid invoking millisToDays and the time zone offset calculation
>  # simplify the implementation of toJavaTimestamp and properly handle 
> negative inputs
>  # detect arithmetic overflow of Long






[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813873#comment-16813873
 ] 

shane knapp commented on SPARK-27389:
-

we are most definitely good to go...  this build is running on 
amp-jenkins-worker-05 and the python2.7 pyspark.sql.tests.test_dataframe tests 
successfully passed:

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5718

this build was previously failing on the same worker w/the TZ issue:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/5699/console

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}






[jira] [Created] (SPARK-27423) Cast DATE to/from TIMESTAMP according to SQL standard

2019-04-09 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-27423:
--

 Summary: Cast DATE to/from TIMESTAMP according to SQL standard
 Key: SPARK-27423
 URL: https://issues.apache.org/jira/browse/SPARK-27423
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


According to the SQL standard, DATE is a union of (year, month, day). To convert 
it to Spark's TIMESTAMP, which is a TIMESTAMP WITH TIME ZONE, the date should be 
extended by the time at midnight: (year, month, day, hour = 0, minute = 0, 
second = 0). The resulting timestamp should be interpreted in the session time 
zone and transformed to microseconds since the epoch in UTC.






[jira] [Commented] (SPARK-25348) Data source for binary files

2019-04-09 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813834#comment-16813834
 ] 

Xiangrui Meng commented on SPARK-25348:
---

Sampling could be supported later.

> Data source for binary files
> 
>
> Key: SPARK-25348
> URL: https://issues.apache.org/jira/browse/SPARK-25348
> Project: Spark
>  Issue Type: Story
>  Components: ML, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> It would be useful to have a data source implementation for binary files, 
> which can be used to build features to load images, audio, and videos.
> Microsoft has an implementation at 
> [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be 
> great if we can merge it into Spark main repo.
> cc: [~mhamilton] and [~imatiach]
> Proposed API:
> Format name: "binary-file"
> Schema:
> * content: BinaryType
> * status (following Hadoop FileStatus):
> ** path: StringType
> ** modification_time: Timestamp
> ** length: LongType (size limit 2GB)
> Options:
> * pathFilterRegex: only include files with path matching the regex pattern
> * maxBytesPerPartition: The max total file size for each partition unless the 
> partition only contains one file
> We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as 
> convenience aliases.
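
A minimal sketch of what using the proposed API could look like; the "binary-file" format name, the options, and the status schema are taken from the proposal above and do not exist in Spark at the time of this ticket, and the input path is hypothetical.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("binary-file")                  # proposed format name
      .option("pathFilterRegex", ".*\\.png$")           # proposed option
      .option("maxBytesPerPartition", 128 * 1024 * 1024)
      .load("/data/images"))                            # hypothetical path

# Proposed schema: raw bytes plus a Hadoop FileStatus-like struct.
df.select("status.path", "status.length", "content").show(5)
{code}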






[jira] [Updated] (SPARK-25348) Data source for binary files

2019-04-09 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25348:
--
Description: 
It would be useful to have a data source implementation for binary files, which 
can be used to build features to load images, audio, and videos.

Microsoft has an implementation at 
[https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be 
great if we can merge it into Spark main repo.

cc: [~mhamilton] and [~imatiach]

Proposed API:

Format name: "binary-file"

Schema:
* content: BinaryType
* status (following Hadoop FileStatus):
** path: StringType
** modification_time: Timestamp
** length: LongType (size limit 2GB)

Options:
* pathFilterRegex: only include files with path matching the regex pattern
* maxBytesPerPartition: The max total file size for each partition unless the 
partition only contains one file

We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as 
convenience aliases.

  was:
It would be useful to have a data source implementation for binary files, which 
can be used to build features to load images, audio, and videos.

Microsoft has an implementation at 
[https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be 
great if we can merge it into Spark main repo.

cc: [~mhamilton] and [~imatiach]

Proposed API:

Format name: "binary-file"

Schema:
* content: BinaryType
* status (following Hadoop FileStatus):
 * path: StringType
 * modification_time: Timestamp
 * length: LongType (size limit 2GB)

Options:
* pathFilterRegex: only include files with path matching the regex pattern
* maxBytesPerPartition: The max total file size for each partition unless the 
partition only contains one file

We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as 
convenience aliases.


> Data source for binary files
> 
>
> Key: SPARK-25348
> URL: https://issues.apache.org/jira/browse/SPARK-25348
> Project: Spark
>  Issue Type: Story
>  Components: ML, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> It would be useful to have a data source implementation for binary files, 
> which can be used to build features to load images, audio, and videos.
> Microsoft has an implementation at 
> [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be 
> great if we can merge it into Spark main repo.
> cc: [~mhamilton] and [~imatiach]
> Proposed API:
> Format name: "binary-file"
> Schema:
> * content: BinaryType
> * status (following Hadoop FileStatus):
> ** path: StringType
> ** modification_time: Timestamp
> ** length: LongType (size limit 2GB)
> Options:
> * pathFilterRegex: only include files with path matching the regex pattern
> * maxBytesPerPartition: The max total file size for each partition unless the 
> partition only contains one file
> We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as 
> convenience aliases.






[jira] [Commented] (SPARK-25348) Data source for binary files

2019-04-09 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813832#comment-16813832
 ] 

Xiangrui Meng commented on SPARK-25348:
---

Updated the description and proposed APIs.

> Data source for binary files
> 
>
> Key: SPARK-25348
> URL: https://issues.apache.org/jira/browse/SPARK-25348
> Project: Spark
>  Issue Type: Story
>  Components: ML, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> It would be useful to have a data source implementation for binary files, 
> which can be used to build features to load images, audio, and videos.
> Microsoft has an implementation at 
> [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be 
> great if we can merge it into Spark main repo.
> cc: [~mhamilton] and [~imatiach]
> Proposed API:
> Format name: "binary-file"
> Schema:
> * content: BinaryType
> * status (following Hadoop FileStatus):
>  * path: StringType
>  * modification_time: Timestamp
>  * length: LongType (size limit 2GB)
> Options:
> * pathFilterRegex: only include files with path matching the regex pattern
> * maxBytesPerPartition: The max total file size for each partition unless the 
> partition only contains one file
> We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as 
> convenience aliases.






[jira] [Updated] (SPARK-25348) Data source for binary files

2019-04-09 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25348:
--
Description: 
It would be useful to have a data source implementation for binary files, which 
can be used to build features to load images, audio, and videos.

Microsoft has an implementation at 
[https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be 
great if we can merge it into Spark main repo.

cc: [~mhamilton] and [~imatiach]

Proposed API:

Format name: "binary-file"

Schema:
* content: BinaryType
* status (following Hadoop FileStatus):
 * path: StringType
 * modification_time: Timestamp
 * length: LongType (size limit 2GB)

Options:
* pathFilterRegex: only include files with path matching the regex pattern
* maxBytesPerPartition: The max total file size for each partition unless the 
partition only contains one file

We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as 
convenience aliases.

  was:
It would be useful to have a data source implementation for binary files, which 
can be used to build features to load images, audio, and videos.

Microsoft has an implementation at 
[https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be 
great if we can merge it into Spark main repo.

cc: [~mhamilton] and [~imatiach]


> Data source for binary files
> 
>
> Key: SPARK-25348
> URL: https://issues.apache.org/jira/browse/SPARK-25348
> Project: Spark
>  Issue Type: Story
>  Components: ML, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> It would be useful to have a data source implementation for binary files, 
> which can be used to build features to load images, audio, and videos.
> Microsoft has an implementation at 
> [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be 
> great if we can merge it into Spark main repo.
> cc: [~mhamilton] and [~imatiach]
> Proposed API:
> Format name: "binary-file"
> Schema:
> * content: BinaryType
> * status (following Hadoop FileStatus):
>  * path: StringType
>  * modification_time: Timestamp
>  * length: LongType (size limit 2GB)
> Options:
> * pathFilterRegex: only include files with path matching the regex pattern
> * maxBytesPerPartition: The max total file size for each partition unless the 
> partition only contains one file
> We will also add `binaryFile` to `DataFrameReader` and `DataStreamReader` as 
> convenience aliases.






[jira] [Assigned] (SPARK-25348) Data source for binary files

2019-04-09 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-25348:
-

Assignee: Weichen Xu

> Data source for binary files
> 
>
> Key: SPARK-25348
> URL: https://issues.apache.org/jira/browse/SPARK-25348
> Project: Spark
>  Issue Type: Story
>  Components: ML, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> It would be useful to have a data source implementation for binary files, 
> which can be used to build features to load images, audio, and videos.
> Microsoft has an implementation at 
> [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be 
> great if we can merge it into Spark main repo.
> cc: [~mhamilton] and [~imatiach]






[jira] [Resolved] (SPARK-27357) Cast timestamps to/from dates independently from time zones

2019-04-09 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-27357.

Resolution: Not A Problem

> Cast timestamps to/from dates independently from time zones
> ---
>
> Key: SPARK-27357
> URL: https://issues.apache.org/jira/browse/SPARK-27357
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Both of Catalyst's types TIMESTAMP and DATE internally represent time 
> intervals since the epoch in the UTC time zone. The TIMESTAMP type contains 
> the number of microseconds since the epoch, and DATE is the number of days 
> since the epoch (00:00:00 on 1 January 1970). As a consequence, the conversion 
> should be independent of the session or local time zone. This ticket aims to 
> fix the current behavior and make the conversion independent of time zones.






[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813812#comment-16813812
 ] 

shane knapp edited comment on SPARK-27389 at 4/9/19 9:09 PM:
-

ok, this should be fixed now...  i got all the workers to recognize 
US/Pacific-New w/python2.7 and the python/run-tests script now passes!

the following was run on amp-jenkins-worker-05, which was failing continuously 
w/the unknown tz error:
{noformat}
-bash-4.1$ python/run-tests --python-executables=python2.7
Running PySpark tests. Output is in 
/home/jenkins/src/spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 
'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']

...trimming a bunch to get to the failing tests...

Finished test(python2.7): pyspark.sql.tests.test_dataframe (32s) ... 2 tests 
were skipped

...yay!  it passed!  now skipping more output to get to the end...

Tests passed in 797 seconds

Skipped tests in pyspark.sql.tests.test_dataframe with python2.7:
test_create_dataframe_required_pandas_not_found 
(pyspark.sql.tests.test_dataframe.DataFrameTests) ... skipped 'Required Pandas 
was found.'
test_to_pandas_required_pandas_not_found 
(pyspark.sql.tests.test_dataframe.DataFrameTests) ... skipped 'Required Pandas 
was found.'
{noformat}

turns out that a couple of workers were missing the US/Pacific-New tzinfo file 
in the pytz libdir.  a quick scp + python2.7 -m compileall later and things 
seem to be happy!

i'll leave this open for now, and if anyone notices other builds failing in 
this way please link to them here.


was (Author: shaneknapp):
ok, this should be fixed now...  i got all the workers to recognize 
US/Pacific-New w/python2.7 and the python/run-tests script now passes!


{noformat}
-bash-4.1$ python/run-tests --python-executables=python2.7
Running PySpark tests. Output is in 
/home/jenkins/src/spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 
'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']

...trimming a bunch to get to the failing tests...

Finished test(python2.7): pyspark.sql.tests.test_dataframe (32s) ... 2 tests 
were skipped

...yay!  it passed!  now skipping more output to get to the end...

Tests passed in 797 seconds

Skipped tests in pyspark.sql.tests.test_dataframe with python2.7:
test_create_dataframe_required_pandas_not_found 
(pyspark.sql.tests.test_dataframe.DataFrameTests) ... skipped 'Required Pandas 
was found.'
test_to_pandas_required_pandas_not_found 
(pyspark.sql.tests.test_dataframe.DataFrameTests) ... skipped 'Required Pandas 
was found.'
{noformat}

turns out that a couple of workers were missing the US/Pacific-New tzinfo file 
in the pytz libdir.  a quick scp + python2.7 -m compileall later and things 
seem to be happy!

i'll leave this open for now, and if anyone notices other builds failing in 
this way please link to them here.

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> 

[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813812#comment-16813812
 ] 

shane knapp commented on SPARK-27389:
-

ok, this should be fixed now...  i got all the workers to recognize 
US/Pacific-New w/python2.7 and the python/run-tests script now passes!


{noformat}
-bash-4.1$ python/run-tests --python-executables=python2.7
Running PySpark tests. Output is in 
/home/jenkins/src/spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 
'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']

...trimming a bunch to get to the failing tests...

Finished test(python2.7): pyspark.sql.tests.test_dataframe (32s) ... 2 tests 
were skipped

...yay!  it passed!  now skipping more output to get to the end...

Tests passed in 797 seconds

Skipped tests in pyspark.sql.tests.test_dataframe with python2.7:
test_create_dataframe_required_pandas_not_found 
(pyspark.sql.tests.test_dataframe.DataFrameTests) ... skipped 'Required Pandas 
was found.'
test_to_pandas_required_pandas_not_found 
(pyspark.sql.tests.test_dataframe.DataFrameTests) ... skipped 'Required Pandas 
was found.'
{noformat}

turns out that a couple of workers were missing the US/Pacific-New tzinfo file 
in the pytz libdir.  a quick scp + python2.7 -m compileall later and things 
seem to be happy!

i'll leave this open for now, and if anyone notices other builds failing in 
this way please link to them here.

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", 

[jira] [Created] (SPARK-27422) CurrentDate should return local date

2019-04-09 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-27422:
--

 Summary: CurrentDate should return local date
 Key: SPARK-27422
 URL: https://issues.apache.org/jira/browse/SPARK-27422
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


According to the SQL standard, the DATE type is a combination of (year, month, day), 
and the current date should return that (year, month, day) triple in the session-local 
time zone. This ticket aims to follow the requirement and calculate the local date 
for the session time zone. The local date should be converted to an epoch day 
and stored internally as the DATE value.
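
A minimal sketch of the intended calculation (plain java.time code, not the actual Spark implementation; the session time zone below is just an assumed example):

{code:scala}
import java.time.{Instant, LocalDate, ZoneId}

// Assume the session time zone is America/Los_Angeles (example only).
val sessionTimeZone = ZoneId.of("America/Los_Angeles")

// The local (year, month, day) triple as currently seen in the session time zone...
val localDate: LocalDate = Instant.now().atZone(sessionTimeZone).toLocalDate

// ...converted to an epoch day (days since 1970-01-01), i.e. the internal DATE value.
val epochDay: Long = localDate.toEpochDay
{code}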



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table

2019-04-09 Thread Eric Maynard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Maynard updated SPARK-27421:
-
Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server 
VM, Java 1.8.0_141)

> RuntimeException when querying a view on a partitioned parquet table
> 
>
> Key: SPARK-27421
> URL: https://issues.apache.org/jira/browse/SPARK-27421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit 
> Server VM, Java 1.8.0_141)
>Reporter: Eric Maynard
>Priority: Minor
>
> When running a simple query, I get the following stacktrace:
> {code}
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
>  at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
>  at scala.collection.immutable.List.foreach(List.scala:392)
>  at 
> 

[jira] [Created] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table

2019-04-09 Thread Eric Maynard (JIRA)
Eric Maynard created SPARK-27421:


 Summary: RuntimeException when querying a view on a partitioned 
parquet table
 Key: SPARK-27421
 URL: https://issues.apache.org/jira/browse/SPARK-27421
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Eric Maynard


When running a simple query, I get the following stacktrace:


{code}
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
 at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261)
 at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
 at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
 at 
org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
 at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
 at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
 at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
 at scala.collection.immutable.List.foreach(List.scala:392)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
 at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
 at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
 at 
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
 at 

[jira] [Updated] (SPARK-27420) KinesisInputDStream should expose a way to disable CloudWatch metrics

2019-04-09 Thread Jerome Gagnon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerome Gagnon updated SPARK-27420:
--
Priority: Major  (was: Minor)

> KinesisInputDStream should expose a way to disable CloudWatch metrics
> -
>
> Key: SPARK-27420
> URL: https://issues.apache.org/jira/browse/SPARK-27420
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Input/Output
>Affects Versions: 2.3.3
>Reporter: Jerome Gagnon
>Priority: Major
>
> KinesisInputDStream currently does not provide a way to disable the CloudWatch 
> metrics push. The Kinesis Client Library (KCL), which is used under the hood, 
> provides this ability through its `withMetrics` methods.
> To make things worse, the default level is "DETAILED", which pushes tens of 
> metrics every 10 seconds. When dealing with multiple streaming jobs this adds 
> up pretty quickly, leading to thousands of dollars in cost. 
> Exposing a way to disable monitoring, or to set the proper level, is critical to 
> us. We had to send invalid credentials and suppress the resulting logs as a 
> less-than-ideal workaround: see 
> [https://stackoverflow.com/questions/41811039/disable-cloudwatch-for-aws-kinesis-at-spark-streaming/55599002#55599002]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27420) KinesisInputDStream should expose a way to disable CloudWatch metrics

2019-04-09 Thread Jerome Gagnon (JIRA)
Jerome Gagnon created SPARK-27420:
-

 Summary: KinesisInputDStream should expose a way to disable 
CloudWatch metrics
 Key: SPARK-27420
 URL: https://issues.apache.org/jira/browse/SPARK-27420
 Project: Spark
  Issue Type: Improvement
  Components: DStreams, Input/Output
Affects Versions: 2.3.3
Reporter: Jerome Gagnon


KinesisInputDStream currently does not provide a way to disable the CloudWatch 
metrics push. The Kinesis Client Library (KCL), which is used under the hood, 
provides this ability through its `withMetrics` methods.

To make things worse, the default level is "DETAILED", which pushes tens of 
metrics every 10 seconds. When dealing with multiple streaming jobs this adds up 
pretty quickly, leading to thousands of dollars in cost. 

Exposing a way to disable monitoring, or to set the proper level, is critical to us. 
We had to send invalid credentials and suppress the resulting logs as a 
less-than-ideal workaround: see 
[https://stackoverflow.com/questions/41811039/disable-cloudwatch-for-aws-kinesis-at-spark-streaming/55599002#55599002]
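
For context, a rough sketch of the knob that the KCL itself exposes and that the Spark builder would need to surface (class and method names assume KCL 1.x; the application, stream, and worker names are placeholders):

{code:scala}
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration
import com.amazonaws.services.kinesis.metrics.interfaces.MetricsLevel

// Assumed KCL 1.x API: build a KCL configuration and turn CloudWatch metrics off.
val kclConfig = new KinesisClientLibConfiguration(
    "my-checkpoint-app",                      // application name (placeholder)
    "my-stream",                              // Kinesis stream name (placeholder)
    new DefaultAWSCredentialsProviderChain(), // regular AWS credentials provider
    "worker-1")                               // worker id (placeholder)
  .withMetricsLevel(MetricsLevel.NONE)        // NONE disables the CloudWatch push
{code}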

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27419) When setting spark.executor.heartbeatInterval to a value less than 1 seconds, it will always fail

2019-04-09 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27419:


 Summary: When setting spark.executor.heartbeatInterval to a value 
less than 1 seconds, it will always fail
 Key: SPARK-27419
 URL: https://issues.apache.org/jira/browse/SPARK-27419
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.1, 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


When setting spark.executor.heartbeatInterval to a value less than 1 second in 
branch-2.4, it will always fail because the value will be converted to 0, so 
the heartbeat will always time out and eventually kill the executor.
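
As a rough illustration of the truncation involved (not the actual Spark code path), converting a sub-second interval to whole seconds yields 0, and a zero interval makes every heartbeat look timed out:

{code:scala}
import java.util.concurrent.TimeUnit

// e.g. spark.executor.heartbeatInterval=500ms in branch-2.4
val heartbeatIntervalMs = 500L

// Whole-second conversion truncates the sub-second value to 0.
val asSeconds = TimeUnit.MILLISECONDS.toSeconds(heartbeatIntervalMs)
println(asSeconds) // prints 0
{code}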



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27418) Migrate Parquet to File Data Source V2

2019-04-09 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-27418:
--

 Summary: Migrate Parquet to File Data Source V2
 Key: SPARK-27418
 URL: https://issues.apache.org/jira/browse/SPARK-27418
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813625#comment-16813625
 ] 

shane knapp edited comment on SPARK-27389 at 4/9/19 5:08 PM:
-

done.


{noformat}
$ pssh -h jenkins_workers.txt "cp /root/python/__init__.py 
/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py"
[1] 10:06:19 [SUCCESS] amp-jenkins-worker-02
[2] 10:06:19 [SUCCESS] amp-jenkins-worker-06
[3] 10:06:19 [SUCCESS] amp-jenkins-worker-05
[4] 10:06:19 [SUCCESS] amp-jenkins-worker-03
[5] 10:06:19 [SUCCESS] amp-jenkins-worker-01
[6] 10:06:19 [SUCCESS] amp-jenkins-worker-04
$ ssh amp-jenkins-worker-03 "grep Pacific-New 
/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py"
 'US/Pacific-New',
 'US/Pacific-New',
{noformat}

and


{noformat}
[sknapp@amp-jenkins-worker-04 ~]$ python2.7 -c "import pytz; print 
'US/Pacific-New' in pytz.all_timezones"
True
{noformat}



was (Author: shaneknapp):
done.


{noformat}
$ pssh -h jenkins_workers.txt "cp /root/python/__init__.py 
/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py"
[1] 10:06:19 [SUCCESS] amp-jenkins-worker-02
[2] 10:06:19 [SUCCESS] amp-jenkins-worker-06
[3] 10:06:19 [SUCCESS] amp-jenkins-worker-05
[4] 10:06:19 [SUCCESS] amp-jenkins-worker-03
[5] 10:06:19 [SUCCESS] amp-jenkins-worker-01
[6] 10:06:19 [SUCCESS] amp-jenkins-worker-04
$ ssh amp-jenkins-worker-03 "grep Pacific-New 
/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py"
 'US/Pacific-New',
 'US/Pacific-New',
{noformat}


> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in 

[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813625#comment-16813625
 ] 

shane knapp commented on SPARK-27389:
-

done.


{noformat}
$ pssh -h jenkins_workers.txt "cp /root/python/__init__.py 
/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py"
[1] 10:06:19 [SUCCESS] amp-jenkins-worker-02
[2] 10:06:19 [SUCCESS] amp-jenkins-worker-06
[3] 10:06:19 [SUCCESS] amp-jenkins-worker-05
[4] 10:06:19 [SUCCESS] amp-jenkins-worker-03
[5] 10:06:19 [SUCCESS] amp-jenkins-worker-01
[6] 10:06:19 [SUCCESS] amp-jenkins-worker-04
$ ssh amp-jenkins-worker-03 "grep Pacific-New 
/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py"
 'US/Pacific-New',
 'US/Pacific-New',
{noformat}


> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-09 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813620#comment-16813620
 ] 

shane knapp commented on SPARK-27389:
-

ok, i am going to go all cowboy on this and manually update:

{noformat}
/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py
{noformat}

and add the US/Pacific-New TZ.  this should definitely fix the problem, and if 
it doesn't, i can very quickly roll back.

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2019-04-09 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813554#comment-16813554
 ] 

Sean Owen commented on SPARK-25150:
---

What happens on master, and what happens if you run the SQL query in your 
example -- is it different?
Your second example is unexpected to me, so I think there is probably an issue 
here somewhere, especially if ANSI SQL mandates a different behavior (does it? I 
don't know).

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27394) The staleness of UI may last minutes or hours when no tasks start or finish

2019-04-09 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27394.

   Resolution: Fixed
Fix Version/s: 3.0.0

> The staleness of UI may last minutes or hours when no tasks start or finish
> ---
>
> Key: SPARK-27394
> URL: https://issues.apache.org/jira/browse/SPARK-27394
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0, 2.4.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 3.0.0
>
>
> Run the following codes on a cluster that has at least 2 cores.
> {code}
> sc.makeRDD(1 to 1000, 1000).foreach { i =>
>   Thread.sleep(30)
> }
> {code}
> The jobs page will just show one running task.
> This is because when the second task event calls 
> "AppStatusListener.maybeUpdate" for a job, the update is simply skipped, since 
> the gap between the two events is smaller than `spark.ui.liveUpdate.period`.
> After the second task event, in the above case, because there won't be any 
> other task events, the Spark UI will always be stale until the next task 
> event gets fired (after 300 seconds).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2019-04-09 Thread Brandon Perry (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813511#comment-16813511
 ] 

Brandon Perry commented on SPARK-25150:
---

[~srowen], I ran into this situation yesterday as well, and I think there may 
be some miscommunication about expected behavior vs actual here.  Many people 
are accustomed to writing joins in a sequential manner in SQL; using the sample 
scenario here:

{code:SQL|borderstyle=solid}
SELECT 
a.State, 
a.`Total Population`,
b.count AS `Total Humans`,
c.count AS `Total Zombies`
FROM states AS a
JOIN total_humans AS b
ON a.state = b.state
JOIN total_zombies AS c
ON a.state = c.state
ORDER BY a.state ASC;
{code}

On virtually all ANSI SQL systems, this will result in the output which 
[~nchammas] mentions is expected.  However, it looks like Spark actually 
evaluates the chained joins by doing something like (states JOIN humans ON 
state) JOIN (states JOIN zombies ON state) ON (_no condition specified_).

Part of the problem is that even when you attempt to fix the states['State'] 
join, you get the "trivially inferred" warning with inappropriate output, as 
they share the same lineage and Spark optimizes past the intended logic:

{code:Python|borderstyle=solid}
states_with_humans = states \
    .join(
        total_humans,
        on=(states['State'] == total_humans['State'])
    )
analysis = states_with_humans \
    .join(
        total_zombies,
        on=(states_with_humans['State'] == total_zombies['State'])
    ) \
    .orderBy(states['State'], ascending=True) \
    .select(
        states_with_humans['State'],
        states_with_humans['Total Population'],
        states_with_humans['count'].alias('Total Humans'),
        total_zombies['count'].alias('Total Zombies'),
    )
{code}

Is there something we're all missing here?  This seems to be a cookie-cutter 
example of a three-way join not functioning as expected without explicit 
aliasing.  Is there a reason this behavior is desirable?
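
A minimal sketch of the explicit-aliasing workaround, for comparison (Scala, not the reporter's code; `totalHumans`/`totalZombies` are assumed stand-ins for the derived DataFrames, each with a `count` column):

{code:scala}
import org.apache.spark.sql.functions.col

// Give each lineage an explicit alias so the join conditions resolve to
// distinct attribute sets instead of being treated as trivially inferred.
val s = states.alias("s")
val h = totalHumans.alias("h")
val z = totalZombies.alias("z")

val analysis = s
  .join(h, col("s.State") === col("h.State"))
  .join(z, col("s.State") === col("z.State"))
  .select(
    col("s.State"),
    col("h.count").alias("Total Humans"),
    col("z.count").alias("Total Zombies"))
  .orderBy(col("s.State"))
{code}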

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27361) YARN support for GPU-aware scheduling

2019-04-09 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-27361:
-

Assignee: Thomas Graves

> YARN support for GPU-aware scheduling
> -
>
> Key: SPARK-27361
> URL: https://issues.apache.org/jira/browse/SPARK-27361
> Project: Spark
>  Issue Type: Story
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
>
> Design and implement YARN support for GPU-aware scheduling:
> * User can request GPU resources at Spark application level.
> * YARN can pass GPU info to Spark executor.
> * Integrate with YARN 3.2 GPU support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27417) CLONE - ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods

2019-04-09 Thread yangpengyu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yangpengyu resolved SPARK-27417.

Resolution: Fixed

> CLONE - ExternalSorter and ExternalAppendOnlyMap should free shuffle memory 
> in their stop() methods
> ---
>
> Key: SPARK-27417
> URL: https://issues.apache.org/jira/browse/SPARK-27417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: yangpengyu
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.0
>
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-27417) CLONE - ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods

2019-04-09 Thread yangpengyu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yangpengyu closed SPARK-27417.
--

> CLONE - ExternalSorter and ExternalAppendOnlyMap should free shuffle memory 
> in their stop() methods
> ---
>
> Key: SPARK-27417
> URL: https://issues.apache.org/jira/browse/SPARK-27417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: yangpengyu
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.0
>
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27417) CLONE - ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods

2019-04-09 Thread yangpengyu (JIRA)
yangpengyu created SPARK-27417:
--

 Summary: CLONE - ExternalSorter and ExternalAppendOnlyMap should 
free shuffle memory in their stop() methods
 Key: SPARK-27417
 URL: https://issues.apache.org/jira/browse/SPARK-27417
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
Reporter: yangpengyu
Assignee: Josh Rosen
 Fix For: 1.6.0


I discovered multiple leaks of shuffle memory while working on my memory 
manager consolidation patch, which added the ability to do strict memory leak 
detection for the bookkeeping that used to be performed by the 
ShuffleMemoryManager. This uncovered a handful of places where tasks can 
acquire execution/shuffle memory but never release it, starving themselves of 
memory.

Problems that I found:

* {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
memory.
* BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
{{CompletionIterator}}.
* {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing its 
resources.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11293) ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods

2019-04-09 Thread yangpengyu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813495#comment-16813495
 ] 

yangpengyu commented on SPARK-11293:


I hit the same problem when I ran the TPC-H test on Spark 1.6.0.

My dataset scale is SF=1000. Environment as follows: 1 master, 3 workers; 
onHeapMemory=10g; offHeapMemory=20g; 24 threads per worker.

Query 3 and query 17 detected a memory leak. Some logs are as follows: 

9/04/09 21:57:59 ERROR Executor: Managed memory leak detected; size = 536870912 
bytes, TID = 2685
 41 19/04/09 21:58:16 WARN TaskMemoryManager: leak 512.0 MB memory from 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@385b3b3f
 42 19/04/09 21:58:16 ERROR Executor: Managed memory leak detected; size = 
536870912 bytes, TID = 2683
 43 19/04/09 21:58:16 WARN TaskMemoryManager: leak 512.0 MB memory from 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@35be7b55
 44 19/04/09 21:58:16 ERROR Executor: Managed memory leak detected; size = 
536870912 bytes, TID = 2703
 45 19/04/09 21:58:20 WARN TaskMemoryManager: leak 512.0 MB memory from 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@50f93582
 46 19/04/09 21:58:20 ERROR Executor: Managed memory leak detected; size = 
536870912 bytes, TID = 2709
 47 19/04/09 21:58:21 WARN TaskMemoryManager: leak 512.0 MB memory from 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@28e3ec7a
 48 19/04/09 21:58:21 ERROR Executor: Managed memory leak detected; size = 
536870912 bytes, TID = 2723
 49 19/04/09 21:59:50 WARN TaskMemoryManager: leak 512.0 MB memory from 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5b2f5dbc
 50 19/04/09 21:59:50 ERROR Executor: Managed memory leak detected; size = 
536870912 bytes, TID = 2687
 51 19/04/09 22:00:50 WARN TransportChannelHandler: Exception in connection 
from hw083/172.18.11.83:42989

> ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their 
> stop() methods
> ---
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.0
>
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-27415) UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size

2019-04-09 Thread peng bo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peng bo closed SPARK-27415.
---

> UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines 
> have different Oops size
> 
>
> Key: SPARK-27415
> URL: https://issues.apache.org/jira/browse/SPARK-27415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: peng bo
>Priority: Major
>
> Actually this is a follow-up for 
> https://issues.apache.org/jira/browse/SPARK-27406, 
> https://issues.apache.org/jira/browse/SPARK-10914
> This issue is to fix the UnsafeMapData & UnsafeArrayData Kryo serialization 
> issue when two machines have different Oops size.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27415) UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size

2019-04-09 Thread peng bo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peng bo resolved SPARK-27415.
-
Resolution: Invalid

Duplicate created due to a network issue.

> UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines 
> have different Oops size
> 
>
> Key: SPARK-27415
> URL: https://issues.apache.org/jira/browse/SPARK-27415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: peng bo
>Priority: Major
>
> Actually this is a follow-up for 
> https://issues.apache.org/jira/browse/SPARK-27406, 
> https://issues.apache.org/jira/browse/SPARK-10914
> This issue is to fix the UnsafeMapData & UnsafeArrayData Kryo serialization 
> issue when two machines have different Oops size.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27416) UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size

2019-04-09 Thread peng bo (JIRA)
peng bo created SPARK-27416:
---

 Summary: UnsafeMapData & UnsafeArrayData Kryo serialization breaks 
when two machines have different Oops size
 Key: SPARK-27416
 URL: https://issues.apache.org/jira/browse/SPARK-27416
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.1
Reporter: peng bo


Actually this is a follow-up for 
https://issues.apache.org/jira/browse/SPARK-27406, 
https://issues.apache.org/jira/browse/SPARK-10914

This issue is to fix the UnsafeMapData & UnsafeArrayData Kryo serialization 
issue when two machines have different Oops size.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27415) UnsafeMapData & UnsafeArrayData Kryo serialization breaks when two machines have different Oops size

2019-04-09 Thread peng bo (JIRA)
peng bo created SPARK-27415:
---

 Summary: UnsafeMapData & UnsafeArrayData Kryo serialization breaks 
when two machines have different Oops size
 Key: SPARK-27415
 URL: https://issues.apache.org/jira/browse/SPARK-27415
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.1
Reporter: peng bo


Actually this is a follow-up for 
https://issues.apache.org/jira/browse/SPARK-27406, 
https://issues.apache.org/jira/browse/SPARK-10914

This issue is to fix the UnsafeMapData & UnsafeArrayData Kryo serialization 
issue when two machines have different Oops size.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27411) DataSourceV2Strategy should not eliminate subquery

2019-04-09 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27411:
---

Assignee: Mingcong Han

> DataSourceV2Strategy should not eliminate subquery
> --
>
> Key: SPARK-27411
> URL: https://issues.apache.org/jira/browse/SPARK-27411
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mingcong Han
>Assignee: Mingcong Han
>Priority: Major
> Fix For: 3.0.0
>
>
> In DataSourceV2Strategy, it seems we eliminate the subqueries by mistake 
> after normalizing filters. Here is an example:
> We have a SQL query with a scalar subquery:
> {code:scala}
> val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)")
> plan.explain(true)
> {code}
> And we get the log info of DataSourceV2Strategy:
> {noformat}
> Pushing operators to csv:examples/src/main/resources/t2.txt
> Pushed Filters: 
> Post-Scan Filters: isnotnull(t2a#30)
> Output: t2a#30, t2b#31
> {noformat}
> The `Post-Scan Filters` should contain the scalar subquery, but we eliminate 
> it by mistake.
> {noformat}
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter ('t2a > scalar-subquery#56 [])
>:  +- 'Project [unresolvedalias('max('t1a), None)]
>: +- 'UnresolvedRelation `t1`
>+- 'UnresolvedRelation `t2`
> == Analyzed Logical Plan ==
> t2a: string, t2b: string
> Project [t2a#30, t2b#31]
> +- Filter (t2a#30 > scalar-subquery#56 [])
>:  +- Aggregate [max(t1a#13) AS max(t1a)#63]
>: +- SubqueryAlias `t1`
>:+- RelationV2[t1a#13, t1b#14] 
> csv:examples/src/main/resources/t1.txt
>+- SubqueryAlias `t2`
>   +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt
> == Optimized Logical Plan ==
> Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 []))
> :  +- Aggregate [max(t1a#13) AS max(t1a)#63]
> : +- Project [t1a#13]
> :+- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
> +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt
> == Physical Plan ==
> *(1) Project [t2a#30, t2b#31]
> +- *(1) Filter isnotnull(t2a#30)
>+- *(1) BatchScan[t2a#30, t2b#31] class 
> org.apache.spark.sql.execution.datasources.v2.csv.CSVScan
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27411) DataSourceV2Strategy should not eliminate subquery

2019-04-09 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27411.
-
Resolution: Fixed

Issue resolved by pull request 24321
[https://github.com/apache/spark/pull/24321]

> DataSourceV2Strategy should not eliminate subquery
> --
>
> Key: SPARK-27411
> URL: https://issues.apache.org/jira/browse/SPARK-27411
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mingcong Han
>Priority: Major
> Fix For: 3.0.0
>
>
> In DataSourceV2Strategy, it seems we eliminate the subqueries by mistake 
> after normalizing filters. Here is an example:
> We have a SQL query with a scalar subquery:
> {code:scala}
> val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)")
> plan.explain(true)
> {code}
> And we get the log info of DataSourceV2Strategy:
> {noformat}
> Pushing operators to csv:examples/src/main/resources/t2.txt
> Pushed Filters: 
> Post-Scan Filters: isnotnull(t2a#30)
> Output: t2a#30, t2b#31
> {noformat}
> The `Post-Scan Filters` should contain the scalar subquery, but we eliminate 
> it by mistake.
> {noformat}
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter ('t2a > scalar-subquery#56 [])
>:  +- 'Project [unresolvedalias('max('t1a), None)]
>: +- 'UnresolvedRelation `t1`
>+- 'UnresolvedRelation `t2`
> == Analyzed Logical Plan ==
> t2a: string, t2b: string
> Project [t2a#30, t2b#31]
> +- Filter (t2a#30 > scalar-subquery#56 [])
>:  +- Aggregate [max(t1a#13) AS max(t1a)#63]
>: +- SubqueryAlias `t1`
>:+- RelationV2[t1a#13, t1b#14] 
> csv:examples/src/main/resources/t1.txt
>+- SubqueryAlias `t2`
>   +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt
> == Optimized Logical Plan ==
> Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 []))
> :  +- Aggregate [max(t1a#13) AS max(t1a)#63]
> : +- Project [t1a#13]
> :+- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
> +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt
> == Physical Plan ==
> *(1) Project [t2a#30, t2b#31]
> +- *(1) Filter isnotnull(t2a#30)
>+- *(1) BatchScan[t2a#30, t2b#31] class 
> org.apache.spark.sql.execution.datasources.v2.csv.CSVScan
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27414) make it clear that date type is timezone independent

2019-04-09 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27414:
---

 Summary: make it clear that date type is timezone independent
 Key: SPARK-27414
 URL: https://issues.apache.org/jira/browse/SPARK-27414
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage

2019-04-09 Thread Chendi.Xue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chendi.Xue updated SPARK-27412:
---
Attachment: PmemShuffleManager-DesignDoc.pdf

> Add a new shuffle manager to use Persistent Memory as shuffle and spilling 
> storage
> --
>
> Key: SPARK-27412
> URL: https://issues.apache.org/jira/browse/SPARK-27412
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Chendi.Xue
>Priority: Minor
>  Labels: shuffle
> Attachments: PmemShuffleManager-DesignDoc.pdf
>
>
> Add a new shuffle manager called "PmemShuffleManager" that lets Spark use a 
> Persistent Memory device as the storage for shuffle output and external sorter 
> spilling.
> In this implementation, we leverage the Persistent Memory Development Kit (PMDK) 
> to support transactional writes with high performance.
>  
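Since spark.shuffle.manager already accepts a fully qualified ShuffleManager class 
name, wiring in the proposed manager would presumably look like the sketch below; 
the class name and the pmem device setting are placeholders implied by the 
proposal, not a published configuration:

{code:scala}
// Hypothetical wiring sketch only; PmemShuffleManager's actual class name and
// configuration keys are not final and are assumed here for illustration.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pmem-shuffle-sketch")
  // spark.shuffle.manager accepts a fully qualified ShuffleManager class name.
  .config("spark.shuffle.manager", "org.apache.spark.shuffle.pmem.PmemShuffleManager")
  // Placeholder for wherever the PMDK-backed device or pool would be configured.
  .config("spark.shuffle.pmem.devices", "/dev/dax0.0")
  .getOrCreate()

import spark.implicits._

// Any shuffle-heavy job would then route shuffle and spill writes through the
// configured manager.
spark.range(0, 1000000L).groupBy(($"id" % 100).as("bucket")).count().show()
{code}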



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27413) Keep the same epoch pace between driver and executor.

2019-04-09 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-27413:
-

 Summary: Keep the same epoch pace between driver and executor.
 Key: SPARK-27413
 URL: https://issues.apache.org/jira/browse/SPARK-27413
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Genmao Yu


The pace of epoch generation on the driver and of epoch pulling on the executor can 
differ. This results in many empty epochs per partition when the epoch pulling 
interval is larger than the epoch generation interval.
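For reference, a minimal sketch of the two knobs involved: the driver-side epoch 
interval comes from the continuous trigger, while the executor-side pull cadence is 
governed by spark.sql.streaming.continuous.executorPollIntervalMs (values below are 
illustrative):

{code:scala}
// Illustrative values only; the mismatch described above appears when the poll
// interval and the trigger interval diverge.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("continuous-epoch-pace")
  // Executor-side cadence for pulling new epoch markers from the driver.
  .config("spark.sql.streaming.continuous.executorPollIntervalMs", "100")
  .getOrCreate()

val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/continuous-epoch-ckpt")
  // Driver-side epoch generation interval: one epoch per second here.
  .trigger(Trigger.Continuous("1 second"))
  .start()
{code}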



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20597) KafkaSourceProvider falls back on path as synonym for topic

2019-04-09 Thread Valeria Vasylieva (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813195#comment-16813195
 ] 

Valeria Vasylieva commented on SPARK-20597:
---

[~jlaskowski] kindly asking you to review the PR, so that this task does not keep 
hanging around.

> KafkaSourceProvider falls back on path as synonym for topic
> ---
>
> Key: SPARK-20597
> URL: https://issues.apache.org/jira/browse/SPARK-20597
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> # {{KafkaSourceProvider}} supports {{topic}} option that sets the Kafka topic 
> to save a DataFrame's rows to
> # {{KafkaSourceProvider}} can use {{topic}} column to assign rows to Kafka 
> topics for writing
> What seems a quite interesting option is to support {{start(path: String)}} 
> as the least precedence option in which {{path}} would designate the default 
> topic when no other options are used.
> {code}
> df.writeStream.format("kafka").start("topic")
> {code}
> See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/KafkaSourceProvider-Why-topic-option-and-column-without-reverting-to-path-as-the-least-priority-td21458.html
>  for discussion
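For contrast, a sketch of the two mechanisms available today next to the proposed 
path fallback (broker address, topic names, and checkpoint paths are placeholders):

{code:scala}
// Sketch only; item 3 is the behaviour proposed in this ticket, not current behaviour.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-topic-options").getOrCreate()
val df = spark.readStream.format("rate").load()
  .selectExpr("CAST(value AS STRING) AS value")

// 1. Topic fixed for the whole query via the "topic" option.
df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("topic", "events")
  .option("checkpointLocation", "/tmp/ckpt-topic-option")
  .start()

// 2. Topic chosen per row via a "topic" column.
df.selectExpr("'events' AS topic", "value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("checkpointLocation", "/tmp/ckpt-topic-column")
  .start()

// 3. Proposed: the path argument as the lowest-precedence default topic.
df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("checkpointLocation", "/tmp/ckpt-topic-path")
  .start("events")
{code}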



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9858) Introduce an ExchangeCoordinator to estimate the number of post-shuffle partitions.

2019-04-09 Thread ketan kunde (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-9858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813158#comment-16813158
 ] 

ketan kunde commented on SPARK-9858:


[~aroberts]: did the ExchangeCoordinatorSuite test cases pass in your big endian 
environment, specifically the following tests?

 - "determining the number of reducers: complex query 1"

 - "determining the number of reducers: complex query 2"

The above test cases also fail in my big endian environment, with the following 
respective logs:
 - determining the number of reducers: complex query 1 *** FAILED ***
   Set(1, 2) did not equal Set(2, 3) (ExchangeCoordinatorSuite.scala:424)
 - determining the number of reducers: complex query 2 *** FAILED ***
   Set(4, 2) did not equal Set(5, 3) (ExchangeCoordinatorSuite.scala:476)

Since this ticket is RESOLVED, I would like to know what change you made to get 
these test cases to pass.

Could you also highlight which exact Spark feature these test cases exercise?

I would be very grateful for your reply.

 

Regards

Ketan 

 

> Introduce an ExchangeCoordinator to estimate the number of post-shuffle 
> partitions.
> ---
>
> Key: SPARK-9858
> URL: https://issues.apache.org/jira/browse/SPARK-9858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Major
> Fix For: 1.6.0
>
>
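For reference, these tests exercise the adaptive estimation of post-shuffle 
partition counts driven by ExchangeCoordinator. A minimal sketch of the 
configuration that enables it in the 1.6-2.x line (values are illustrative):

{code:scala}
// Illustrative configuration only; the failing tests use their own thresholds.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("exchange-coordinator-sketch")
  // Enables ExchangeCoordinator-based estimation of post-shuffle partition counts.
  .config("spark.sql.adaptive.enabled", "true")
  // Target bytes per post-shuffle partition when coalescing reducers (64 MB here).
  .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")
  .getOrCreate()

import spark.implicits._

// A join plus aggregation, similar in shape to the "complex query" tests: the
// number of post-shuffle partitions is decided at runtime from map output sizes.
val t1 = spark.range(0, 1000).select(($"id" % 20).as("key1"), $"id".as("value1"))
val t2 = spark.range(0, 1000).select(($"id" % 20).as("key2"), ($"id" * 2).as("value2"))
t1.join(t2, $"key1" === $"key2").groupBy($"key1").count().collect()
{code}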




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25994) SPIP: Property Graphs, Cypher Queries, and Algorithms

2019-04-09 Thread Martin Junghanns (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813156#comment-16813156
 ] 

Martin Junghanns commented on SPARK-25994:
--

[~kanjilal] Thanks for the initial comments on the doc. Looking forward to your 
PR comments. Let's discuss tasks after the API is settled and 
https://issues.apache.org/jira/browse/SPARK-27300 is merged. How familiar are 
you with pyspark? Would implementing the Python API be something of interest 
for you?

> SPIP: Property Graphs, Cypher Queries, and Algorithms
> -
>
> Key: SPARK-25994
> URL: https://issues.apache.org/jira/browse/SPARK-25994
> Project: Spark
>  Issue Type: Epic
>  Components: Graph
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Martin Junghanns
>Priority: Major
>  Labels: SPIP
>
> Copied from the SPIP doc:
> {quote}
> GraphX was one of the foundational pillars of the Spark project, and is the 
> current graph component. This reflects the importance of the graphs data 
> model, which naturally pairs with an important class of analytic function, 
> the network or graph algorithm. 
> However, GraphX is not actively maintained. It is based on RDDs, and cannot 
> exploit Spark 2’s Catalyst query engine. GraphX is only available to Scala 
> users.
> GraphFrames is a Spark package, which implements DataFrame-based graph 
> algorithms, and also incorporates simple graph pattern matching with fixed 
> length patterns (called “motifs”). GraphFrames is based on DataFrames, but 
> has a semantically weak graph data model (based on untyped edges and 
> vertices). The motif pattern matching facility is very limited by comparison 
> with the well-established Cypher language. 
> The Property Graph data model has become quite widespread in recent years, 
> and is the primary focus of commercial graph data management and of graph 
> data research, both for on-premises and cloud data management. Many users of 
> transactional graph databases also wish to work with immutable graphs in 
> Spark.
> The idea is to define a Cypher-compatible Property Graph type based on 
> DataFrames; to replace GraphFrames querying with Cypher; to reimplement 
> GraphX/GraphFrames algos on the PropertyGraph type. 
> To achieve this goal, a core subset of Cypher for Apache Spark (CAPS), 
> reusing existing proven designs and code, will be employed in Spark 3.0. This 
> graph query processor, like CAPS, will overlay and drive the SparkSQL 
> Catalyst query engine, using the CAPS graph query planner.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20984) Reading back from ORC format gives error on big endian systems.

2019-04-09 Thread ketan kunde (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813126#comment-16813126
 ] 

ketan kunde commented on SPARK-20984:
-

Hi 

I understand that the ORC file format is not read back correctly on big endian 
systems.

I am looking to build Spark as a standalone build, since the ORC-related test cases 
are exclusive to the Hive module, which will not be part of that build.

Can I skip all ORC-related test cases for the standalone build and still be sure 
that I am not compromising any standalone Spark features?

 

Regards

Ketan Kunde

> Reading back from ORC format gives error on big endian systems.
> ---
>
> Key: SPARK-20984
> URL: https://issues.apache.org/jira/browse/SPARK-20984
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Redhat 7 on power 7 Big endian platform.
> [testuser@soe10-vm12 spark]$ cat /etc/redhat-
> redhat-access-insights/ redhat-release
> [testuser@soe10-vm12 spark]$ cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
> [testuser@soe10-vm12 spark]$ lscpu
> Architecture:  ppc64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Big Endian
> CPU(s):8
> On-line CPU(s) list:   0-7
> Thread(s) per core:1
> Core(s) per socket:1
> Socket(s): 8
> NUMA node(s):  1
> Model: IBM pSeries (emulated by qemu)
> L1d cache: 32K
> L1i cache: 32K
> NUMA node0 CPU(s): 0-7
> [testuser@soe10-vm12 spark]$
>Reporter: Mahesh
>Priority: Major
>  Labels: big-endian
> Attachments: hive_test_failure_log.txt
>
>
> All ORC test cases seem to be failing here. It looks like Spark is not able to 
> read back what it has written. Below is a way to check it in the spark shell; I 
> am also pasting the test case, which probably passes on x86. 
> All test cases in OrcHadoopFsRelationSuite.scala are failing.
>  test("SPARK-12218: 'Not' is included in ORC filter pushdown") {
> import testImplicits._
> withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
>   withTempPath { dir =>
> val path = s"${dir.getCanonicalPath}/table1"
> (1 to 5).map(i => (i, (i % 2).toString)).toDF("a", 
> "b").write.orc(path)
> checkAnswer(
>   spark.read.orc(path).where("not (a = 2) or not(b in ('1'))"),
>   (1 to 5).map(i => Row(i, (i % 2).toString)))
> checkAnswer(
>   spark.read.orc(path).where("not (a = 2 and b in ('1'))"),
>   (1 to 5).map(i => Row(i, (i % 2).toString)))
>   }
> }
>   }
> The same can be reproduced in the spark shell.
> **Create a DF and write it in ORC:
> scala> (1 to 5).map(i => (i, (i % 2).toString)).toDF("a", 
> "b").write.orc("test")
> **Now try to read it back:
> scala> spark.read.orc("test").where("not (a = 2) or not(b in ('1'))").show
> 17/06/05 04:20:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> org.iq80.snappy.CorruptionException: Invalid copy offset for opcode starting 
> at 13
> at 
> org.iq80.snappy.SnappyDecompressor.decompressAllTags(SnappyDecompressor.java:165)
> at 
> org.iq80.snappy.SnappyDecompressor.uncompress(SnappyDecompressor.java:76)
> at org.iq80.snappy.Snappy.uncompress(Snappy.java:43)
> at 
> org.apache.hadoop.hive.ql.io.orc.SnappyCodec.decompress(SnappyCodec.java:71)
> at 
> org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:214)
> at 
> org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:238)
> at java.io.InputStream.read(InputStream.java:101)
> at 
> org.apache.hive.com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
> at 
> org.apache.hive.com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
> at 
> org.apache.hive.com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.(OrcProto.java:10661)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.(OrcProto.java:10625)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10730)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10725)
> at 
> org.apache.hive.com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
> at 
> org.apache.hive.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
> at 
> org.apache.hive.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
> at 
> 

[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3

2019-04-09 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813094#comment-16813094
 ] 

Gabor Somogyi commented on SPARK-27409:
---

It would definitely help if you could provide your application code.
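
For reference, a minimal sketch of the kind of application code in question 
(broker, topic, and checkpoint path are placeholders; with the default 
processing-time trigger this requests micro-batch execution, so the open question 
is whether the load() call below actually forces continuous mode or only touches 
createContinuousReader for schema inference):

{code:scala}
// Hypothetical minimal reproduction, assuming a plain (non-SSL) local broker.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-microbatch-repro").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test-topic")
  .load()  // the stack trace in the description comes from this call

df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/kafka-repro-ckpt")
  .start()
  .awaitTermination()
{code}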

> Micro-batch support for Kafka Source in Spark 2.3
> -
>
> Key: SPARK-27409
> URL: https://issues.apache.org/jira/browse/SPARK-27409
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Prabhjot Singh Bharaj
>Priority: Major
>
> It seems with this change - 
> [https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50]
>  in Spark 2.3 for the Kafka source provider, a Kafka source cannot be run in 
> micro-batch mode but only in continuous mode. Is that understanding correct?
> {code:java}
> E Py4JJavaError: An error occurred while calling o217.load.
> E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
> E at 
> org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717)
> E at 
> org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566)
> E at 
> org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549)
> E at 
> org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62)
> E at 
> org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314)
> E at 
> org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78)
> E at 
> org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130)
> E at 
> org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43)
> E at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185)
> E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> E at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> E at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> E at java.lang.reflect.Method.invoke(Method.java:498)
> E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> E at py4j.Gateway.invoke(Gateway.java:282)
> E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> E at py4j.commands.CallCommand.execute(CallCommand.java:79)
> E at py4j.GatewayConnection.run(GatewayConnection.java:238)
> E at java.lang.Thread.run(Thread.java:748)
> E Caused by: org.apache.kafka.common.KafkaException: 
> org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: 
> non-existent (No such file or directory)
> E at 
> org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44)
> E at 
> org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93)
> E at 
> org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51)
> E at 
> org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84)
> E at 
> org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657)
> E ... 19 more
> E Caused by: org.apache.kafka.common.KafkaException: 
> java.io.FileNotFoundException: non-existent (No such file or directory)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121)
> E at 
> org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41)
> E ... 23 more
> E Caused by: java.io.FileNotFoundException: non-existent (No such file or 
> directory)
> E at java.io.FileInputStream.open0(Native Method)
> E at java.io.FileInputStream.open(FileInputStream.java:195)
> E at java.io.FileInputStream.(FileInputStream.java:138)
> E at java.io.FileInputStream.(FileInputStream.java:93)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:216)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:201)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:137)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:119)
> E ... 24 more{code}
>  When running a simple data stream loader for kafka without an SSL cert, it 
> goes through this code block - 
>  
> {code:java}
> ...
> ...
> org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130)
> E at 
> org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43)
> E at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185)
> ...
> ...{code}
>  
> Note that I 

[jira] [Updated] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage

2019-04-09 Thread Chendi.Xue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chendi.Xue updated SPARK-27412:
---
External issue URL:   (was: https://github.com/apache/spark/pull/24322)

> Add a new shuffle manager to use Persistent Memory as shuffle and spilling 
> storage
> --
>
> Key: SPARK-27412
> URL: https://issues.apache.org/jira/browse/SPARK-27412
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Chendi.Xue
>Priority: Minor
>  Labels: shuffle
>
> Add a new shuffle manager called "PmemShuffleManager" that lets Spark use a 
> Persistent Memory device as the storage for shuffle output and external sorter 
> spilling.
> In this implementation, we leverage the Persistent Memory Development Kit (PMDK) 
> to support transactional writes with high performance.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage

2019-04-09 Thread Chendi.Xue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chendi.Xue updated SPARK-27412:
---
External issue URL: https://github.com/apache/spark/pull/24322

> Add a new shuffle manager to use Persistent Memory as shuffle and spilling 
> storage
> --
>
> Key: SPARK-27412
> URL: https://issues.apache.org/jira/browse/SPARK-27412
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Chendi.Xue
>Priority: Minor
>  Labels: shuffle
>
> Add a new shuffle manager called "PmemShuffleManager" that lets Spark use a 
> Persistent Memory device as the storage for shuffle output and external sorter 
> spilling.
> In this implementation, we leverage the Persistent Memory Development Kit (PMDK) 
> to support transactional writes with high performance.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage

2019-04-09 Thread Chendi.Xue (JIRA)
Chendi.Xue created SPARK-27412:
--

 Summary: Add a new shuffle manager to use Persistent Memory as 
shuffle and spilling storage
 Key: SPARK-27412
 URL: https://issues.apache.org/jira/browse/SPARK-27412
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle, Spark Core
Affects Versions: 3.0.0
Reporter: Chendi.Xue


Add a new shuffle manager called "PmemShuffleManager" that lets Spark use a 
Persistent Memory device as the storage for shuffle output and external sorter 
spilling.

In this implementation, we leverage the Persistent Memory Development Kit (PMDK) to 
support transactional writes with high performance.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27411) DataSourceV2Strategy should not eliminate subquery

2019-04-09 Thread Mingcong Han (JIRA)
Mingcong Han created SPARK-27411:


 Summary: DataSourceV2Strategy should not eliminate subquery
 Key: SPARK-27411
 URL: https://issues.apache.org/jira/browse/SPARK-27411
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Mingcong Han
 Fix For: 3.0.0


In DataSourceV2Strategy, it seems we eliminate the subqueries by mistake after 
normalizing filters. Here is an example:
We have a SQL query with a scalar subquery:
{code:scala}
val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)")
plan.explain(true)
{code}
And we get the log info of DataSourceV2Strategy:
{noformat}
Pushing operators to csv:examples/src/main/resources/t2.txt
Pushed Filters: 
Post-Scan Filters: isnotnull(t2a#30)
Output: t2a#30, t2b#31
{noformat}
The `Post-Scan Filters` should contain the scalar subquery, but we eliminate it 
by mistake.
{noformat}
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('t2a > scalar-subquery#56 [])
   :  +- 'Project [unresolvedalias('max('t1a), None)]
   : +- 'UnresolvedRelation `t1`
   +- 'UnresolvedRelation `t2`

== Analyzed Logical Plan ==
t2a: string, t2b: string
Project [t2a#30, t2b#31]
+- Filter (t2a#30 > scalar-subquery#56 [])
   :  +- Aggregate [max(t1a#13) AS max(t1a)#63]
   : +- SubqueryAlias `t1`
   :+- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
   +- SubqueryAlias `t2`
  +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt

== Optimized Logical Plan ==
Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 []))
:  +- Aggregate [max(t1a#13) AS max(t1a)#63]
: +- Project [t1a#13]
:+- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
+- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt

== Physical Plan ==
*(1) Project [t2a#30, t2b#31]
+- *(1) Filter isnotnull(t2a#30)
   +- *(1) BatchScan[t2a#30, t2b#31] class 
org.apache.spark.sql.execution.datasources.v2.csv.CSVScan
{noformat}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org