[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-05 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811461#comment-16811461
 ] 

Felix Cheung commented on SPARK-27389:
--

Maybe a new JDK changed the TimeZone?
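
If a JDK change is the culprit, one quick way to check from a PySpark session on the affected worker is to compare what the JVM reports with what pytz accepts (a diagnostic sketch only; it assumes an active SparkSession named `spark`, and `spark._jvm` is an internal py4j handle):

{code:python}
# Diagnostic sketch: compare the JVM's default time zone ID with the zones
# pytz knows about.
import pytz

jvm_zone = spark._jvm.java.util.TimeZone.getDefault().getID()
print("JVM default zone:", jvm_zone)
print("known to pytz:", jvm_zone in pytz.all_timezones_set)
{code}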

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: shane knapp
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in <lambda>
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}






[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-05 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811355#comment-16811355
 ] 

Bryan Cutler commented on SPARK-27389:
--

From the stacktrace, it looks like it's getting this from 
"spark.sql.session.timeZone", which defaults to 
java.util.TimeZone.getDefault().getID().
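
If that is indeed the source, a possible Python-side workaround (a sketch only, not a fix for the Jenkins configuration itself) is to pin the session time zone to a zone both the JVM and pytz understand before converting to pandas:

{code:python}
# Workaround sketch: pin the session time zone to a well-known IANA zone so
# the pandas conversion path never sees 'US/Pacific-New'.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
pdf = spark.range(3).selectExpr("current_timestamp() AS ts").toPandas()
{code}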







[jira] [Assigned] (SPARK-27391) deadlock in ContinuousExecution unit tests

2019-04-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27391:
-

Assignee: Jose Torres

> deadlock in ContinuousExecution unit tests
> --
>
> Key: SPARK-27391
> URL: https://issues.apache.org/jira/browse/SPARK-27391
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 3.0.0
>
>
> ContinuousExecution (in the final query execution phase) holds the lazy val 
> lock of its IncrementalExecution for the entire duration of the (indefinite 
> length) job. This can cause deadlocks in unit tests, which hook into internal 
> APIs and try to instantiate other lazy vals.
>  
> (Note that this should not be able to affect production.)






[jira] [Resolved] (SPARK-27391) deadlock in ContinuousExecution unit tests

2019-04-05 Thread Jose Torres (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Torres resolved SPARK-27391.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24301
[https://github.com/apache/spark/pull/24301]

> deadlock in ContinuousExecution unit tests
> --
>
> Key: SPARK-27391
> URL: https://issues.apache.org/jira/browse/SPARK-27391
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
> Fix For: 3.0.0
>
>
> ContinuousExecution (in the final query execution phase) holds the lazy val 
> lock of its IncrementalExecution for the entire duration of the (indefinite 
> length) job. This can cause deadlocks in unit tests, which hook into internal 
> APIs and try to instantiate other lazy vals.
>  
> (Note that this should not be able to affect production.)






[jira] [Commented] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data

2019-04-05 Thread Bijith Kumar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811206#comment-16811206
 ] 

Bijith Kumar commented on SPARK-16548:
--

[~cloud_fan] I am getting the same exception in Spark 2.3.2. I am wondering why 
that would happen, since this was fixed in 2.3.0.
{code:java}
java.io.CharConversionException: Invalid UTF-32 character 0x4d89aa (above 
10ffff) at char #63, byte #255) at 
com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189) 
at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150) at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
 at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:2017)
 at 
com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:577)
 at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:350)
 at 
org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:347)
 at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2589) at 
org.apache.spark.sql.catalyst.json.JacksonParser.parse(JacksonParser.scala:347) 
at 
org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$3.apply(JsonDataSource.scala:128)
 at 
org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$3.apply(JsonDataSource.scala:128)
 at 
org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
 at 
org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$readFile$2.apply(JsonDataSource.scala:132)
 at 
org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$$anonfun$readFile$2.apply(JsonDataSource.scala:132)
 at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at 
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
 at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) 
at org.apache.spark.scheduler.Task.run(Task.scala:109) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)
{code}
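
As an aside for anyone hitting this while reading JSON directly, permissive parsing is the behaviour the reporter asks for below; a sketch is shown here (the path and column names are illustrative, and whether it catches this particular CharConversionException depends on the Spark version, which is exactly what this ticket tracks):

{code:python}
# Mitigation sketch: read in permissive mode so a record that cannot be decoded
# becomes nulls plus a corrupt-record column instead of failing the whole job.
df = (spark.read
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("hdfs:///data/events.json"))
df.show(truncate=False)
{code}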

> java.io.CharConversionException: Invalid UTF-32 character  prevents me from 
> querying my data
> 
>
> Key: SPARK-16548
> URL: https://issues.apache.org/jira/browse/SPARK-16548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Egor Pahomov
>Priority: Minor
> Fix For: 2.2.0, 2.3.0
>
>
> Basically, when I query my json data I get 
> {code}
> java.io.CharConversionException: Invalid UTF-32 character 0x7b2265 (above 
> 10ffff)  at char #192, byte #771)
>   at 
> com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
>   at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142)
> {code}
> I do not like it. If you cannot process one JSON record among 100500, please 
> return null instead of failing everything. I have a dirty one-line fix, and I 
> understand how I can make it more 

[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support

2019-04-05 Thread Robert Joseph Evans (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811183#comment-16811183
 ] 

Robert Joseph Evans commented on SPARK-27396:
-

I have kept this at a high level just answering the questions in the SPIP 
questionnaire initially.  I am not totally sure where a design doc would fit 
into all of this, or an example of some of the things we want to do with the 
APIs.

 

I am happy to work on design docs or share more details about how I see the 
APIs working as needed.

> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
>  
> The Dataset/DataFrame API in Spark currently only exposes to users one row at 
> a time when processing data.  The goals of this are to 
>  
>  # Expose to end users a new option of processing the data in a columnar 
> format, multiple rows at a time, with the data organized into contiguous 
> arrays in memory. 
>  # Make any transitions between the columnar memory layout and a row based 
> layout transparent to the end user.
>  # Allow for simple data exchange with other systems, DL/ML libraries, 
> pandas, etc. by having clean APIs to transform the columnar data into an 
> Apache Arrow compatible layout.
>  # Provide a plugin mechanism for columnar processing support so an advanced 
> user could avoid data transition between columnar and row based processing 
> even through shuffles. This means we should at least support pluggable APIs 
> so an advanced end user can implement the columnar partitioning themselves, 
> and provide the glue necessary to shuffle the data still in a columnar format.
>  # Expose new APIs that allow advanced users or frameworks to implement 
> columnar processing either as UDFs, or by adjusting the physical plan to do 
> columnar processing.  If the latter is too controversial we can move it to 
> another SPIP, but we plan to implement some accelerated computing in parallel 
> with this feature to be sure the APIs work, and without this feature it makes 
> that impossible.
>  
> Not Requirements, but things that would be nice to have.
>  # Provide default implementations for partitioning columnar data, so users 
> don’t have to.
>  # Transition the existing in memory columnar layouts to be compatible with 
> Apache Arrow.  This would make the transformations to Apache Arrow format a 
> no-op. The existing formats are already very close to those layouts in many 
> cases.  This would not be using the Apache Arrow java library, but instead 
> being compatible with the memory 
> [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a 
> subset of that layout.
>  # Provide a clean transition from the existing code to the new one.  The 
> existing APIs which are public but evolving are not that far off from what is 
> being proposed.  We should be able to create a new parallel API that can wrap 
> the existing one. This means any file format that is trying to support 
> columnar can still do so until we make a conscious decision to deprecate and 
> then turn off the old APIs.
>  
> *Q2.* What problem is this proposal NOT designed to solve?
> This is not trying to implement any of the processing itself in a columnar 
> way, with the exception of examples for documentation, and possibly default 
> implementations for partitioning of columnar shuffle. 
>  
> *Q3.* How is it done today, and what are the limits of current practice?
> The current columnar support is limited to 3 areas.
>  # Input formats, optionally can return a ColumnarBatch instead of rows.  The 
> code generation phase knows how to take that columnar data and iterate 
> through it as rows for stages that wants rows, which currently is almost 
> everything.  The limitations here are mostly implementation specific. The 
> current standard is to abuse Scala’s type erasure to return ColumnarBatches 
> as the elements of an RDD[InternalRow]. The code generation can handle this 
> because it is generating java code, so it bypasses scala’s type checking and 
> just casts the InternalRow to the desired ColumnarBatch.  This makes it 
> difficult for others to implement the same functionality for different 
> processing because they can only do it through code generation. There really 
> is no clean separate path in the code generation for columnar vs row based. 
> Additionally, because it is only supported through code generation, if for any 
> reason code generation fails there is no backup.  This is 

[jira] [Created] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support

2019-04-05 Thread Robert Joseph Evans (JIRA)
Robert Joseph Evans created SPARK-27396:
---

 Summary: SPIP: Public APIs for extended Columnar Processing Support
 Key: SPARK-27396
 URL: https://issues.apache.org/jira/browse/SPARK-27396
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Robert Joseph Evans


*Q1.* What are you trying to do? Articulate your objectives using absolutely no 
jargon.

 

The Dataset/DataFrame API in Spark currently only exposes to users one row at a 
time when processing data.  The goals of this are to 

 
 # Expose to end users a new option of processing the data in a columnar 
format, multiple rows at a time, with the data organized into contiguous arrays 
in memory. 
 # Make any transitions between the columnar memory layout and a row based 
layout transparent to the end user.
 # Allow for simple data exchange with other systems, DL/ML libraries, pandas, 
etc. by having clean APIs to transform the columnar data into an Apache Arrow 
compatible layout.
 # Provide a plugin mechanism for columnar processing support so an advanced 
user could avoid data transition between columnar and row based processing even 
through shuffles. This means we should at least support pluggable APIs so an 
advanced end user can implement the columnar partitioning themselves, and 
provide the glue necessary to shuffle the data still in a columnar format.
 # Expose new APIs that allow advanced users or frameworks to implement 
columnar processing either as UDFs, or by adjusting the physical plan to do 
columnar processing.  If the latter is too controversial we can move it to 
another SPIP, but we plan to implement some accelerated computing in parallel 
with this feature to be sure the APIs work, and without this feature it makes 
that impossible.

 

Not Requirements, but things that would be nice to have.
 # Provide default implementations for partitioning columnar data, so users 
don’t have to.
 # Transition the existing in memory columnar layouts to be compatible with 
Apache Arrow.  This would make the transformations to Apache Arrow format a 
no-op. The existing formats are already very close to those layouts in many 
cases.  This would not be using the Apache Arrow java library, but instead 
being compatible with the memory 
[layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a 
subset of that layout.
 # Provide a clean transition from the existing code to the new one.  The 
existing APIs which are public but evolving are not that far off from what is 
being proposed.  We should be able to create a new parallel API that can wrap 
the existing one. This means any file format that is trying to support columnar 
can still do so until we make a conscious decision to deprecate and then turn 
off the old APIs.

 

*Q2.* What problem is this proposal NOT designed to solve?

This is not trying to implement any of the processing itself in a columnar way, 
with the exception of examples for documentation, and possibly default 
implementations for partitioning of columnar shuffle. 

 

*Q3.* How is it done today, and what are the limits of current practice?

The current columnar support is limited to 3 areas.
 # Input formats, optionally can return a ColumnarBatch instead of rows.  The 
code generation phase knows how to take that columnar data and iterate through 
it as rows for stages that want rows, which currently is almost everything.  
The limitations here are mostly implementation specific. The current standard 
is to abuse Scala’s type erasure to return ColumnarBatches as the elements of 
an RDD[InternalRow]. The code generation can handle this because it is 
generating java code, so it bypasses scala’s type checking and just casts the 
InternalRow to the desired ColumnarBatch.  This makes it difficult for others 
to implement the same functionality for different processing because they can 
only do it through code generation. There really is no clean separate path in 
the code generation for columnar vs row based. Additionally, because it is only 
supported through code generation, if for any reason code generation fails 
there is no backup.  This is typically fine for input formats but can be 
problematic when we get into more extensive processing. 
 # When caching data it can optionally be cached in a columnar format if the 
input is also columnar.  This is similar to the first area and has the same 
limitations because the cache acts as an input, but it is the only piece of 
code that also consumes columnar data as an input.
 # Pandas vectorized processing.  To be able to support Pandas UDFs Spark will 
build up a batch of data and send it python for processing, and then get a 
batch of data back as a result.  The format of the data being sent to python 
can either be pickle, which is the default, or optionally Arrow. The result 
returned is 
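
Of the three areas above, the Pandas/Arrow path is the one Python users already touch directly; a minimal sketch of that existing exchange follows (assuming pyarrow is installed, a SparkSession named `spark`, and the Spark 2.4-era pandas_udf API):

{code:python}
# Minimal sketch of the existing Arrow-based columnar exchange with Python.
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

@pandas_udf("long", PandasUDFType.SCALAR)
def plus_one(v):
    # `v` arrives as a pandas Series backed by an Arrow batch, not row by row.
    return v + 1

spark.range(5).select(plus_one("id")).show()
{code}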

[jira] [Resolved] (SPARK-26936) On yarn-client mode, insert overwrite local directory can not create temporary path in local staging directory

2019-04-05 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26936.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23841
[https://github.com/apache/spark/pull/23841]

> On yarn-client mode, insert overwrite local directory can not create 
> temporary path in local staging directory
> --
>
> Key: SPARK-26936
> URL: https://issues.apache.org/jira/browse/SPARK-26936
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0
>
>
> Let me introduce a bug of 'insert overwrite local directory'.
> If I execute the SQL mentioned before, a HiveException will appear as follows:
> {code:java}
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2037)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
> ... 36 more
> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: Mkdirs failed to create 
> file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-1/_temporary/0/_temporary/attempt_20190219173233_0002_m_00_3
>  (exists=false, 
> cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_11)
> at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
> at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
> at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
> at 
> 
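
The failing SQL itself is not quoted in this ticket; a hypothetical statement of the class the title describes (table and directory names are illustrative only, not taken from the ticket) would be:

{code:python}
# Hypothetical example of the statement class this ticket covers; names are
# illustrative. Requires a SparkSession with Hive support enabled.
spark.sql("""
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  SELECT * FROM some_hive_table
""")
{code}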

[jira] [Assigned] (SPARK-26936) On yarn-client mode, insert overwrite local directory can not create temporary path in local staging directory

2019-04-05 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26936:
-

Assignee: jiaan.geng

> On yarn-client mode, insert overwrite local directory can not create 
> temporary path in local staging directory
> --
>
> Key: SPARK-26936
> URL: https://issues.apache.org/jira/browse/SPARK-26936
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Let me introduce a bug of 'insert overwrite local directory'.
> If I execute the SQL mentioned before, a HiveException will appear as follows:
> {code:java}
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2037)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
> ... 36 more
> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: Mkdirs failed to create 
> file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-1/_temporary/0/_temporary/attempt_20190219173233_0002_m_00_3
>  (exists=false, 
> cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_11)
> at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
> at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
> at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
> at 
> 

[jira] [Resolved] (SPARK-27390) Fix package name mismatch

2019-04-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27390.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0
   2.4.2

This is fixed via https://github.com/apache/spark/pull/24300

> Fix package name mismatch
> -
>
> Key: SPARK-27390
> URL: https://issues.apache.org/jira/browse/SPARK-27390
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
> Fix For: 2.4.2, 3.0.0
>
>







[jira] [Resolved] (SPARK-27192) spark.task.cpus should be less than or equal to spark.executor.cores when using static executor allocation

2019-04-05 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27192.
---
Resolution: Fixed

Issue resolved by pull request 24261
[https://github.com/apache/spark/pull/24261]

> spark.task.cpus should be less than or equal to spark.executor.cores when 
> using static executor allocation
> 
>
> Key: SPARK-27192
> URL: https://issues.apache.org/jira/browse/SPARK-27192
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0, 2.4.0
>Reporter: Lijia Liu
>Assignee: Lijia Liu
>Priority: Minor
> Fix For: 3.0.0
>
>
> When using dynamic executor allocation, if we set spark.executor.cores smaller 
> than spark.task.cpus, an exception will be thrown as follows:
> '''spark.executor.cores must not be < spark.task.cpus'''
> But if dynamic executor allocation is not enabled, Spark will hang when a new 
> job is submitted, because TaskSchedulerImpl will not schedule a task on an 
> executor whose available cores are smaller than spark.task.cpus. See 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L351]
> So spark.task.cpus should be checked when the task scheduler starts.
> reproduce
> $SPARK_HOME/bin/spark-shell --conf spark.task.cpus=2  --master local[1]
> scala> sc.parallelize(1 to 9).collect
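
For PySpark users, the same hang reproduces with `pyspark --conf spark.task.cpus=2 --master local[1]`. The kind of up-front check the ticket proposes could be sketched application-side like this (illustrative only; the real validation belongs in the scheduler, and the fallback defaults used here are assumptions):

{code:python}
# Illustrative application-side guard; the real check this ticket asks for
# belongs in TaskSchedulerImpl. The fallback defaults are assumptions.
conf = spark.sparkContext.getConf()
task_cpus = int(conf.get("spark.task.cpus", "1"))
executor_cores = int(conf.get("spark.executor.cores", "1"))
if task_cpus > executor_cores:
    raise ValueError("spark.task.cpus (%d) must not exceed spark.executor.cores (%d)"
                     % (task_cpus, executor_cores))

spark.sparkContext.parallelize(range(1, 10)).collect()  # hangs without such a check
{code}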






[jira] [Commented] (SPARK-25994) SPIP: Property Graphs, Cypher Queries, and Algorithms

2019-04-05 Thread Saikat Kanjilal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811163#comment-16811163
 ] 

Saikat Kanjilal commented on SPARK-25994:
-

[~mju] I added some initial comments and will review the PR next. I would like 
to help out on this going forward; please advise on how best to help other than 
reviewing the design doc and the PR.

> SPIP: Property Graphs, Cypher Queries, and Algorithms
> -
>
> Key: SPARK-25994
> URL: https://issues.apache.org/jira/browse/SPARK-25994
> Project: Spark
>  Issue Type: Epic
>  Components: Graph
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Martin Junghanns
>Priority: Major
>  Labels: SPIP
>
> Copied from the SPIP doc:
> {quote}
> GraphX was one of the foundational pillars of the Spark project, and is the 
> current graph component. This reflects the importance of the graphs data 
> model, which naturally pairs with an important class of analytic function, 
> the network or graph algorithm. 
> However, GraphX is not actively maintained. It is based on RDDs, and cannot 
> exploit Spark 2’s Catalyst query engine. GraphX is only available to Scala 
> users.
> GraphFrames is a Spark package, which implements DataFrame-based graph 
> algorithms, and also incorporates simple graph pattern matching with fixed 
> length patterns (called “motifs”). GraphFrames is based on DataFrames, but 
> has a semantically weak graph data model (based on untyped edges and 
> vertices). The motif pattern matching facility is very limited by comparison 
> with the well-established Cypher language. 
> The Property Graph data model has become quite widespread in recent years, 
> and is the primary focus of commercial graph data management and of graph 
> data research, both for on-premises and cloud data management. Many users of 
> transactional graph databases also wish to work with immutable graphs in 
> Spark.
> The idea is to define a Cypher-compatible Property Graph type based on 
> DataFrames; to replace GraphFrames querying with Cypher; to reimplement 
> GraphX/GraphFrames algos on the PropertyGraph type. 
> To achieve this goal, a core subset of Cypher for Apache Spark (CAPS), 
> reusing existing proven designs and code, will be employed in Spark 3.0. This 
> graph query processor, like CAPS, will overlay and drive the SparkSQL 
> Catalyst query engine, using the CAPS graph query planner.
> {quote}
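
For context, the fixed-length "motif" pattern matching referred to above looks roughly like this in GraphFrames (an illustrative sketch; it requires the external graphframes package, and all names are made up):

{code:python}
# Illustrative GraphFrames motif query of the kind the SPIP contrasts with
# Cypher. Requires the external `graphframes` package.
from graphframes import GraphFrame

v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "knows")], ["src", "dst", "relationship"])
g = GraphFrame(v, e)
g.find("(x)-[r]->(y)").show()   # fixed-length pattern, no variable-length paths
{code}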






[jira] [Resolved] (SPARK-27358) Update jquery to 1.12.x to pick up security fixes

2019-04-05 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27358.
---
   Resolution: Fixed
Fix Version/s: 2.3.4
   2.4.2
   3.0.0

Issue resolved by pull request 24288
[https://github.com/apache/spark/pull/24288]

> Update jquery to 1.12.x to pick up security fixes
> -
>
> Key: SPARK-27358
> URL: https://issues.apache.org/jira/browse/SPARK-27358
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 3.0.0, 2.4.2, 2.3.4
>
>
> jquery 1.11.1 is affected by a CVE:
> https://www.cvedetails.com/cve/CVE-2016-7103/
> This triggers some warnings in tools that check for known security issues in 
> dependencies.
> Note that I do not know whether this actually manifests as a security problem 
> for Spark. But, we can easily update to 1.12.4 (latest 1.x version) to 
> resolve it.
> (Note that https://www.cvedetails.com/cve/CVE-2015-9251/ seems to have been 
> fixed in 1.12 but then unfixed, so this may require a much bigger jump to 
> jquery 3.x if it's a problem; leaving that until later.)
> Along the way we will want to update jquery datatables to 1.10.18 to match 
> jquery 1.12.4.
> Relatedly, jquery mustache 0.8.1 also has a CVE: 
> https://snyk.io/test/npm/mustache/0.8.2
> I propose to update to 2.3.12 (latest 2.x) to resolve it.
> Although targeted for 3.0, I believe this is back-port-able to 2.4.x if 
> needed, assuming we find no UI issues.






[jira] [Commented] (SPARK-20787) PySpark can't handle datetimes before 1900

2019-04-05 Thread Ruben Berenguel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811044#comment-16811044
 ] 

Ruben Berenguel commented on SPARK-20787:
-

Hi [~AdiC], indeed, I have not done any additional work. So far I haven't found 
a way of fixing it that does not introduce what is effectively a breaking change 
to the behaviour of dates when using Python.

If anyone else wants to pick this ticket up, please do.
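
In the meantime, a workaround sketch for anyone hitting the time.mktime path shown in the quoted traceback below: ship the value as a string and cast it on the JVM side, so the Python-side conversion never runs (results around historical time zone offsets may still differ).

{code:python}
# Workaround sketch: keep pre-1900 values out of the Python-side time.mktime
# path by shipping them as strings and casting in Spark SQL.
import datetime as dt
from pyspark.sql.functions import col

rows = [(str(dt.datetime(1899, 12, 31)),)]
df = (spark.createDataFrame(rows, ["ts_str"])
      .withColumn("ts", col("ts_str").cast("timestamp")))
df.count()
{code}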

> PySpark can't handle datetimes before 1900
> --
>
> Key: SPARK-20787
> URL: https://issues.apache.org/jira/browse/SPARK-20787
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Keith Bourgoin
>Priority: Major
>
> When trying to put a datetime before 1900 into a DataFrame, it throws an 
> error because of the use of time.mktime.
> {code}
> Python 2.7.13 (default, Mar  8 2017, 17:29:55)
> Type "copyright", "credits" or "license" for more information.
> IPython 5.3.0 -- An enhanced Interactive Python.
> ? -> Introduction and overview of IPython's features.
> %quickref -> Quick reference.
> help  -> Python's own help system.
> object?   -> Details about 'object', use 'object??' for extra details.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/05/17 12:45:59 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/05/17 12:46:02 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 2.7.13 (default, Mar  8 2017 17:29:55)
> SparkSession available as 'spark'.
> In [1]: import datetime as dt
> In [2]: 
> sqlContext.createDataFrame(sc.parallelize([[dt.datetime(1899,12,31)]])).count()
> 17/05/17 12:46:16 ERROR Executor: Exception in task 3.0 in stage 2.0 (TID 7)
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 174, in main
> process()
>   File 
> "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 169, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 268, in dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/home/kfb/src/projects/spark/python/pyspark/sql/types.py", line 576, 
> in toInternal
> return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
>   File "/home/kfb/src/projects/spark/python/pyspark/sql/types.py", line 576, 
> in 
> return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
>   File 
> "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/sql/types.py", 
> line 436, in toInternal
> return self.dataType.toInternal(obj)
>   File 
> "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/sql/types.py", 
> line 191, in toInternal
> else time.mktime(dt.timetuple()))
> ValueError: year out of range
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 

[jira] [Resolved] (SPARK-27393) Show ReusedSubquery in the plan when the subquery is reused

2019-04-05 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27393.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Show ReusedSubquery in the plan when the subquery is reused
> ---
>
> Key: SPARK-27393
> URL: https://issues.apache.org/jira/browse/SPARK-27393
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 3.0.0
>
>
> We need to easily identify the plan difference when subquery is reused.
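
An illustrative query in which the same scalar subquery appears twice; with this change the reused instance should be visible in the explain output (exact operator names depend on the Spark build):

{code:python}
# Illustrative query with a repeated scalar subquery; after this change the
# second occurrence should show up as ReusedSubquery in the physical plan.
spark.range(100).createOrReplaceTempView("t")
spark.sql("""
  SELECT * FROM t
  WHERE id > (SELECT avg(id) FROM t)
     OR id < (SELECT avg(id) FROM t) - 10
""").explain()
{code}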






[jira] [Commented] (SPARK-27176) Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4

2019-04-05 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810652#comment-16810652
 ] 

Yuming Wang commented on SPARK-27176:
-

For Hive 2.3.4, we also need {{hive-llap-common}} and {{hive-llap-client}}:

{{hive-llap-common}} is needed when registering functions:
{noformat}
scala> spark.range(10).write.saveAsTable("test_hadoop3")
java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/llap/security/LlapSigner$Signable
  at java.lang.Class.getDeclaredConstructors0(Native Method)
  at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
  at java.lang.Class.getConstructor0(Class.java:3075)
  at java.lang.Class.getDeclaredConstructor(Class.java:2178)
  at 
org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79)
  at 
org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208)
  at 
org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201)
  at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.(FunctionRegistry.java:500)
  at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:247)
  at 
org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231)
  at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:388)
  at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:332)
  at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:312)
  at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:288)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:250)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:272)
...
{noformat}
{{hive-llap-client}} is needed when testing Hive:
{noformat}
spark.sharedState.externalCatalog.unwrapped.asInstanceOf[HiveExternalCatalog]
  .client.runSqlHive("SELECT COUNT(*) FROM test_hadoop3")

...
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/io/api/LlapProxy
at 
org.apache.hadoop.hive.ql.exec.GlobalWorkMapFactory.get(GlobalWorkMapFactory.java:102)
at 
org.apache.hadoop.hive.ql.exec.Utilities.clearWorkMapForConf(Utilities.java:3435)
at 
org.apache.hadoop.hive.ql.exec.Utilities.clearWork(Utilities.java:290)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:443)
at 
org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:151)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:733)
...
{noformat}
We can exclude {{org.apache.curator:curator-framework:jar}} and 
{{org.apache.curator:apache-curator.jar}}, as they are used to add consistent 
node replacement to LLAP for splits; see HIVE-14589.

> Upgrade hadoop-3's built-in Hive maven dependencies to 2.3.4
> 
>
> Key: SPARK-27176
> URL: https://issues.apache.org/jira/browse/SPARK-27176
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Updated] (SPARK-27388) expression encoder for avro like objects

2019-04-05 Thread Taoufik DACHRAOUI (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Taoufik DACHRAOUI updated SPARK-27388:
--
Description: 
Add an expression encoder for objects defined by properties (i.e. what I call 
Beans). A property of an object is defined by _setter_ and _getter_ functions, 
where the _getter_ return type is equal to the _setter_'s unique parameter type, 
and either the _getter_ and _setter_ have the same name or the _getter_ name is 
prefixed by "get" and the _setter_ name is prefixed by "set".

 

An example of a Bean:
{code:java}
class Bar {
 ...   
  def age():Int = {...}

  def age(v:Int):Unit {...}
  ...
}

class Foo extends Bar { 
...  
   def setName(n:String) {...}
  def getName():String = {...}
...
}
{code}

 The class _Foo in the example above_ has 2 properties _age_ and _Name._

 

Also, in this new feature, we added support for _java.util.List_, 
_java.util.Map_ and java _Enums_ 

Avro objects are beans and thus we can create an expression encoder for avro 
objects with the current addition. All avro types, including fixed types, and 
excluding complex union types, are supported by this addition.

 

Currently complex avro unions are not supported because a complex union is 
declared as Object.

 

  was:
Add an expression encoder for objects defined by properties (ie. I call them 
Beans). A property of an object is defined by a _setter_ and a _getter_ 
functions where the _getter_ return type is equal to the _setter_ unique 
parameter type and the _getter_ and _setter_ functions have the same name or 
the _getter_ name is prefixed by "get" and the _setter_ name is prefixed by 
"set"

 

An example of a Bean:
{code:java}
class Bar {
 ...   
  def age():Int = {...}

  def age(v:Int):Unit {...}
  ...
}

class Foo extends Bar { 
...  
   def setName(n:String) {...}
  def getName():String = {...}
...
}
{code}
```

 The class _Foo in the example above_ has 2 properties _age_ and _Name._

 

Also, in this new feature, we added support for _java.util.List_, 
_java.util.Map_ and java _Enums_ 

Avro objects are beans and thus we can create an expression encoder for avro 
objects with the current addition. All avro types, including fixed types, and 
excluding complex union types, are suppported by this addition.

 

Currently complex avro unions are not supported because a complex union is 
declared as Object and there cannot be an expression encoder for Object type 
(need to use a custom serializer like kryo for example)

 


> expression encoder for avro like objects
> 
>
> Key: SPARK-27388
> URL: https://issues.apache.org/jira/browse/SPARK-27388
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Taoufik DACHRAOUI
>Priority: Major
>
> Add an expression encoder for objects defined by properties (i.e. what I call 
> Beans). A property of an object is defined by _setter_ and _getter_ functions, 
> where the _getter_ return type is equal to the _setter_'s unique parameter 
> type, and either the _getter_ and _setter_ have the same name or the _getter_ 
> name is prefixed by "get" and the _setter_ name is prefixed by "set".
>  
> An example of a Bean:
> {code:java}
> class Bar {
>  ...   
>   def age():Int = {...}
>   def age(v:Int):Unit {...}
>   ...
> }
> class Foo extends Bar { 
> ...  
>def setName(n:String) {...}
>   def getName():String = {...}
> ...
> }
> {code}
>  The class _Foo in the example above_ has 2 properties _age_ and _Name._
>  
> Also, in this new feature, we added support for _java.util.List_, 
> _java.util.Map_ and java _Enums_ 
> Avro objects are beans and thus we can create an expression encoder for avro 
> objects with the current addition. All avro types, including fixed types, and 
> excluding complex union types, are supported by this addition.
>  
> Currently complex avro unions are not supported because a complex union is 
> declared as Object.
>  






[jira] [Commented] (SPARK-27289) spark-submit explicit configuration does not take effect but Spark UI shows it's effective

2019-04-05 Thread Udbhav Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810627#comment-16810627
 ] 

Udbhav Agrawal commented on SPARK-27289:


[~KaiXu] For me it is working correctly; I could not reproduce this.
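
One way to see the value the driver actually resolved, rather than only the UI's Environment tab, is a diagnostic check like the following (a sketch; note that the resolved entry and the directories actually used can still diverge when the cluster manager or SPARK_LOCAL_DIRS overrides local dirs):

{code:python}
# Diagnostic sketch: print the value the running application resolved for
# spark.local.dir instead of relying on the UI's Environment tab alone.
effective = spark.sparkContext.getConf().get("spark.local.dir", "<unset>")
print("effective spark.local.dir:", effective)
{code}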

> spark-submit explicit configuration does not take effect but Spark UI shows 
> it's effective
> --
>
> Key: SPARK-27289
> URL: https://issues.apache.org/jira/browse/SPARK-27289
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Documentation, Spark Submit, Web UI
>Affects Versions: 2.3.3
>Reporter: KaiXu
>Priority: Minor
> Attachments: Capture.PNG
>
>
> The [doc 
> |https://spark.apache.org/docs/latest/submitting-applications.html]says that  
> "In general, configuration values explicitly set on a {{SparkConf}} take the 
> highest precedence, then flags passed to {{spark-submit}}, then values in the 
> defaults file", but when setting spark.local.dir through --conf with 
> spark-submit, it still uses the values from 
> ${SPARK_HOME}/conf/spark-defaults.conf. What's more, the Spark runtime UI 
> environment page shows the value from --conf, which is really misleading.
> e.g.
> I submit my application through the command:
> /opt/spark233/bin/spark-submit --properties-file /opt/spark.conf --conf 
> spark.local.dir=/tmp/spark_local -v --class 
> org.apache.spark.examples.mllib.SparseNaiveBayes --master 
> spark://bdw-slave20:7077 
> /opt/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar 
> hdfs://bdw-slave20:8020/Bayes/Input
>  
> the spark.local.dir in ${SPARK_HOME}/conf/spark-defaults.conf is:
> spark.local.dir=/mnt/nvme1/spark_local
> When the application is running, I found the intermediate shuffle data was 
> written to /mnt/nvme1/spark_local, which is set through 
> ${SPARK_HOME}/conf/spark-defaults.conf, but the Web UI shows that the 
> environment value spark.local.dir=/tmp/spark_local.
> The spark-submit verbose output also shows spark.local.dir=/tmp/spark_local, 
> which is misleading. 
>  
> !image-2019-03-27-10-59-38-377.png!
> spark-submit verbose:
> 
> Spark properties used, including those specified through
>  --conf and those from the properties file /opt/spark.conf:
>  (spark.local.dir,/tmp/spark_local)
>  (spark.default.parallelism,132)
>  (spark.driver.memory,10g)
>  (spark.executor.memory,352g)
> X






[jira] [Resolved] (SPARK-24545) Function hour not working as expected for hour 2 in PySpark

2019-04-05 Thread Eric Blanco (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Blanco resolved SPARK-24545.
-
Resolution: Fixed

The date field was in String format, so the change happened because it was 
converted to a timestamp. Also, the time zones of Python and Java were different.
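
A sketch of how to make the result deterministic on the PySpark side (Spark 2.2+, where the session time zone is configurable): parse the string explicitly and pin the session time zone, so the value never falls into a DST gap such as 2016-03-27 02:00 in Central European Time.

{code:python}
# Sketch: pin the session time zone and parse the string explicitly, so the
# hour does not shift across a DST transition.
from pyspark.sql.functions import col, hour, to_timestamp

spark.conf.set("spark.sql.session.timeZone", "UTC")
df = spark.createDataFrame([(4, "2016-03-27 02:00:00")], ["id", "date"])
df.withColumn("hours", hour(to_timestamp(col("date")))).show()
{code}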

> Function hour not working as expected for hour 2 in PySpark
> ---
>
> Key: SPARK-24545
> URL: https://issues.apache.org/jira/browse/SPARK-24545
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.1
>Reporter: Eric Blanco
>Priority: Minor
> Attachments: image-2018-06-13-13-52-06-165.png, 
> image-2018-06-13-13-53-21-185.png
>
>
> Hello,
> I tried to get the hour out of a date and it works except if the hour is 2. 
> It works well in Scala but in PySpark it shows hour 3 instead of hour 2.
> Example:
> {code:java}
> from pyspark.sql.functions import *
>  columns = ["id","date"]
>  vals = [(4,"2016-03-27 02:00:00")]
>  df = sqlContext.createDataFrame(vals, columns)
>  df.withColumn("hours", hour(col("date"))).show(){code}
> |id|date|hours|
> |4|2016-03-27 2:00:00|3|
> It works as expected for other hours.
> Also, if you change the year or month apparently it works well. 


