[jira] [Updated] (SPARK-26983) Spark PassThroughSuite failure on bigendian

2019-02-24 Thread salamani (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

salamani updated SPARK-26983:
-
Description: 
The following failures are observed for PassThroughSuite in Spark Project SQL on a 
big-endian system:
 - PassThrough with FLOAT: empty column for decompress()
 - PassThrough with FLOAT: long random series for decompress() *** FAILED ***
 Expected 0.10990685, but got -6.6357654E14 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with FLOAT: simple case with null for decompress() *** FAILED ***
 Expected 2.0, but got 9.0E-44 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with DOUBLE: empty column
 - PassThrough with DOUBLE: long random series
 - PassThrough with DOUBLE: empty column for decompress()
 - PassThrough with DOUBLE: long random series for decompress() *** FAILED ***
 Expected 0.20634564007984624, but got 5.902392643940031E-230 Wrong 0-th 
decoded double value (PassThroughEncodingSuite.scala:150)
 - PassThrough with DOUBLE: simple case with null for decompress() *** FAILED 
***
 Expected 2.0, but got 3.16E-322 Wrong 0-th decoded double value 
(PassThroughEncodingSuite.scala:150)
 Run completed in 9 seconds, 72 milliseconds.
 Total number of tests run: 30
 Suites: completed 2, aborted 0
 Tests: succeeded 26, failed 4, canceled 0, ignored 0, pending 0
 ** 
 *** 4 TESTS FAILED ***
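
The expected/got pairs look consistent with the raw IEEE-754 bytes being read back with the wrong byte order. A minimal sketch in plain Python (not Spark code, and not a confirmed diagnosis of where in the decompress path the order is assumed) reproduces exactly the values reported above:

{code:python}
import struct

# Pack 2.0 as little-endian IEEE-754 and read the same bytes back with the opposite order.
le_float = struct.pack("<f", 2.0)
le_double = struct.pack("<d", 2.0)

print(struct.unpack(">f", le_float)[0])   # ~9.0e-44   (matches "Expected 2.0, but got 9.0E-44")
print(struct.unpack(">d", le_double)[0])  # ~3.16e-322 (matches "Expected 2.0, but got 3.16E-322")
{code}

That the byte-swapped values line up with the failing assertions supports a byte-order mix-up as the likely cause, but only inspecting the PassThrough decompress path can confirm it.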

  was:
Following failures are observed for PassThroughSuite in Spark Project SQL  


 - PassThrough with FLOAT: empty column for decompress()
 - PassThrough with FLOAT: long random series for decompress() *** FAILED ***
 Expected 0.10990685, but got -6.6357654E14 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with FLOAT: simple case with null for decompress() *** FAILED ***
 Expected 2.0, but got 9.0E-44 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with DOUBLE: empty column
 - PassThrough with DOUBLE: long random series
 - PassThrough with DOUBLE: empty column for decompress()
 - PassThrough with DOUBLE: long random series for decompress() *** FAILED ***
 Expected 0.20634564007984624, but got 5.902392643940031E-230 Wrong 0-th 
decoded double value (PassThroughEncodingSuite.scala:150)
 - PassThrough with DOUBLE: simple case with null for decompress() *** FAILED 
***
 Expected 2.0, but got 3.16E-322 Wrong 0-th decoded double value 
(PassThroughEncodingSuite.scala:150)
 Run completed in 9 seconds, 72 milliseconds.
 Total number of tests run: 30
 Suites: completed 2, aborted 0
 Tests: succeeded 26, failed 4, canceled 0, ignored 0, pending 0
 ** 
 *** 4 TESTS FAILED ***



> Spark PassThroughSuite failure on bigendian
> ---
>
> Key: SPARK-26983
> URL: https://issues.apache.org/jira/browse/SPARK-26983
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: salamani
>Priority: Major
> Fix For: 2.3.2
>
>
> The following failures are observed for PassThroughSuite in Spark Project SQL on a 
> big-endian system:
>  - PassThrough with FLOAT: empty column for decompress()
>  - PassThrough with FLOAT: long random series for decompress() *** FAILED ***
>  Expected 0.10990685, but got -6.6357654E14 Wrong 0-th decoded float value 
> (PassThroughEncodingSuite.scala:146)
>  - PassThrough with FLOAT: simple case with null for decompress() *** FAILED 
> ***
>  Expected 2.0, but got 9.0E-44 Wrong 0-th decoded float value 
> (PassThroughEncodingSuite.scala:146)
>  - PassThrough with DOUBLE: empty column
>  - PassThrough with DOUBLE: long random series
>  - PassThrough with DOUBLE: empty column for decompress()
>  - PassThrough with DOUBLE: long random series for decompress() *** FAILED ***
>  Expected 0.20634564007984624, but got 5.902392643940031E-230 Wrong 0-th 
> decoded double value (PassThroughEncodingSuite.scala:150)
>  - PassThrough with DOUBLE: simple case with null for decompress() *** FAILED 
> ***
>  Expected 2.0, but got 3.16E-322 Wrong 0-th decoded double value 
> (PassThroughEncodingSuite.scala:150)
>  Run completed in 9 seconds, 72 milliseconds.
>  Total number of tests run: 30
>  Suites: completed 2, aborted 0
>  Tests: succeeded 26, failed 4, canceled 0, ignored 0, pending 0
>  ** 
>  *** 4 TESTS FAILED ***



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26983) Spark PassThroughSuite failure on bigendian

2019-02-24 Thread salamani (JIRA)
salamani created SPARK-26983:


 Summary: Spark PassThroughSuite failure on bigendian
 Key: SPARK-26983
 URL: https://issues.apache.org/jira/browse/SPARK-26983
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.2
Reporter: salamani
 Fix For: 2.3.2


The following failures are observed for PassThroughSuite in Spark Project SQL:

```
 - PassThrough with FLOAT: empty column for decompress()
 - PassThrough with FLOAT: long random series for decompress() *** FAILED ***
 Expected 0.10990685, but got -6.6357654E14 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with FLOAT: simple case with null for decompress() *** FAILED ***
 Expected 2.0, but got 9.0E-44 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with DOUBLE: empty column
 - PassThrough with DOUBLE: long random series
 - PassThrough with DOUBLE: empty column for decompress()
 - PassThrough with DOUBLE: long random series for decompress() *** FAILED ***
 Expected 0.20634564007984624, but got 5.902392643940031E-230 Wrong 0-th 
decoded double value (PassThroughEncodingSuite.scala:150)
 - PassThrough with DOUBLE: simple case with null for decompress() *** FAILED 
***
 Expected 2.0, but got 3.16E-322 Wrong 0-th decoded double value 
(PassThroughEncodingSuite.scala:150)
 Run completed in 9 seconds, 72 milliseconds.
 Total number of tests run: 30
 Suites: completed 2, aborted 0
 Tests: succeeded 26, failed 4, canceled 0, ignored 0, pending 0
 ** 
 *** 4 TESTS FAILED ***
 ```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19019) PySpark does not work with Python 3.6.0

2019-02-24 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776582#comment-16776582
 ] 

Jungtaek Lim commented on SPARK-19019:
--

According to the fix versions, the fix in the 1.6.x line is 1.6.4. You would need to 
upgrade to 1.6.4, but I believe 1.6 is EOL and no longer supported by the community. 
You may want to upgrade to 2.3.3 (if you feel safer having the bugfix releases of a 
minor version) or to 2.4.0.

> PySpark does not work with Python 3.6.0
> ---
>
> Key: SPARK-19019
> URL: https://issues.apache.org/jira/browse/SPARK-19019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
>
> Currently, PySpark does not work with Python 3.6.0.
> Running {{./bin/pyspark}} simply throws the error as below:
> {code}
> Traceback (most recent call last):
>   File ".../spark/python/pyspark/shell.py", line 30, in 
> import pyspark
>   File ".../spark/python/pyspark/__init__.py", line 46, in 
> from pyspark.context import SparkContext
>   File ".../spark/python/pyspark/context.py", line 36, in 
> from pyspark.java_gateway import launch_gateway
>   File ".../spark/python/pyspark/java_gateway.py", line 31, in 
> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
>   File "", line 961, in _find_and_load
>   File "", line 950, in _find_and_load_unlocked
>   File "", line 646, in _load_unlocked
>   File "", line 616, in _load_backward_compatible
>   File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 18, in 
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py",
>  line 62, in 
> import pkgutil
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py",
>  line 22, in 
> ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
>   File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
> cls = _old_namedtuple(*args, **kwargs)
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
> The problem is in 
> https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394
>  as the error says, and the cause seems to be that the arguments of 
> {{namedtuple}} are now entirely keyword-only as of Python 3.6.0 
> (See https://bugs.python.org/issue25628).
> We currently copy this function via {{types.FunctionType}} which does not set 
> the default values of keyword-only arguments (meaning 
> {{namedtuple.__kwdefaults__}}), and this seems to leave values missing 
> internally in the function (non-bound arguments).
> This ends up as below:
> {code}
> import types
> import collections
> def _copy_func(f):
>     return types.FunctionType(f.__code__, f.__globals__, f.__name__,
>                               f.__defaults__, f.__closure__)
> _old_namedtuple = _copy_func(collections.namedtuple)
> _old_namedtuple("a", "b")
> {code}
> If we call as below:
> {code}
> >>> _old_namedtuple("a", "b")
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
>  It throws an exception as above because {{__kwdefaults__}} for the required 
> keyword-only arguments seems to be unset in the copied function. So, if we give 
> explicit values for these,
> {code}
> >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None)
> 
> {code}
> It works fine.
> It seems we should now properly set these on the hijacked one.
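
A minimal sketch of the direction the description points at: copy {{__kwdefaults__}} onto the copied function as well. The names here (Point) are illustrative, and the actual change in pyspark/serializers.py may differ:

{code:python}
import types
import collections

def _copy_func(f):
    fn = types.FunctionType(f.__code__, f.__globals__, f.__name__,
                            f.__defaults__, f.__closure__)
    # Carry over the keyword-only defaults that FunctionType(...) alone does not copy.
    fn.__kwdefaults__ = f.__kwdefaults__
    return fn

_old_namedtuple = _copy_func(collections.namedtuple)
Point = _old_namedtuple("Point", "x y")   # no longer requires verbose/rename/module on 3.6
print(Point(1, 2))                        # Point(x=1, y=2)
{code}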



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19019) PySpark does not work with Python 3.6.0

2019-02-24 Thread Parixit Odedara (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776567#comment-16776567
 ] 

Parixit Odedara edited comment on SPARK-19019 at 2/25/19 6:41 AM:
--

I am facing the same issue in Spark 1.6.0. Was this fixed for the Spark 1.6.0 
version? If not, are there any plans to do so?


was (Author: parixit):
I am facing the same issue 1.6.0? Was this fixed for Spark 1.6.0 version? If 
not, are there any plans to do so?

> PySpark does not work with Python 3.6.0
> ---
>
> Key: SPARK-19019
> URL: https://issues.apache.org/jira/browse/SPARK-19019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
>
> Currently, PySpark does not work with Python 3.6.0.
> Running {{./bin/pyspark}} simply throws the error as below:
> {code}
> Traceback (most recent call last):
>   File ".../spark/python/pyspark/shell.py", line 30, in 
> import pyspark
>   File ".../spark/python/pyspark/__init__.py", line 46, in 
> from pyspark.context import SparkContext
>   File ".../spark/python/pyspark/context.py", line 36, in 
> from pyspark.java_gateway import launch_gateway
>   File ".../spark/python/pyspark/java_gateway.py", line 31, in 
> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
>   File "", line 961, in _find_and_load
>   File "", line 950, in _find_and_load_unlocked
>   File "", line 646, in _load_unlocked
>   File "", line 616, in _load_backward_compatible
>   File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 18, in 
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py",
>  line 62, in 
> import pkgutil
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py",
>  line 22, in 
> ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
>   File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
> cls = _old_namedtuple(*args, **kwargs)
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
> The problem is in 
> https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394
>  as the error says, and the cause seems to be that the arguments of 
> {{namedtuple}} are now entirely keyword-only as of Python 3.6.0 
> (See https://bugs.python.org/issue25628).
> We currently copy this function via {{types.FunctionType}} which does not set 
> the default values of keyword-only arguments (meaning 
> {{namedtuple.__kwdefaults__}}), and this seems to leave values missing 
> internally in the function (non-bound arguments).
> This ends up as below:
> {code}
> import types
> import collections
> def _copy_func(f):
>     return types.FunctionType(f.__code__, f.__globals__, f.__name__,
>                               f.__defaults__, f.__closure__)
> _old_namedtuple = _copy_func(collections.namedtuple)
> _old_namedtuple("a", "b")
> {code}
> If we call as below:
> {code}
> >>> _old_namedtuple("a", "b")
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
>  It throws an exception as above because {{__kwdefaults__}} for the required 
> keyword-only arguments seems to be unset in the copied function. So, if we give 
> explicit values for these,
> {code}
> >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None)
> 
> {code}
> It works fine.
> It seems we should now properly set these on the hijacked one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19019) PySpark does not work with Python 3.6.0

2019-02-24 Thread Parixit Odedara (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776567#comment-16776567
 ] 

Parixit Odedara commented on SPARK-19019:
-

I am facing the same issue on 1.6.0. Was this fixed for the Spark 1.6.0 version? If 
not, are there any plans to do so?

> PySpark does not work with Python 3.6.0
> ---
>
> Key: SPARK-19019
> URL: https://issues.apache.org/jira/browse/SPARK-19019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
>
> Currently, PySpark does not work with Python 3.6.0.
> Running {{./bin/pyspark}} simply throws the error as below:
> {code}
> Traceback (most recent call last):
>   File ".../spark/python/pyspark/shell.py", line 30, in 
> import pyspark
>   File ".../spark/python/pyspark/__init__.py", line 46, in 
> from pyspark.context import SparkContext
>   File ".../spark/python/pyspark/context.py", line 36, in 
> from pyspark.java_gateway import launch_gateway
>   File ".../spark/python/pyspark/java_gateway.py", line 31, in 
> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
>   File "", line 961, in _find_and_load
>   File "", line 950, in _find_and_load_unlocked
>   File "", line 646, in _load_unlocked
>   File "", line 616, in _load_backward_compatible
>   File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 18, in 
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py",
>  line 62, in 
> import pkgutil
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py",
>  line 22, in 
> ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
>   File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
> cls = _old_namedtuple(*args, **kwargs)
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
> The problem is in 
> https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394
>  as the error says, and the cause seems to be that the arguments of 
> {{namedtuple}} are now entirely keyword-only as of Python 3.6.0 
> (See https://bugs.python.org/issue25628).
> We currently copy this function via {{types.FunctionType}} which does not set 
> the default values of keyword-only arguments (meaning 
> {{namedtuple.__kwdefaults__}}), and this seems to leave values missing 
> internally in the function (non-bound arguments).
> This ends up as below:
> {code}
> import types
> import collections
> def _copy_func(f):
>     return types.FunctionType(f.__code__, f.__globals__, f.__name__,
>                               f.__defaults__, f.__closure__)
> _old_namedtuple = _copy_func(collections.namedtuple)
> _old_namedtuple("a", "b")
> {code}
> If we call as below:
> {code}
> >>> _old_namedtuple("a", "b")
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
>  It throws an exception as above because {{__kwdefaults__}} for the required 
> keyword-only arguments seems to be unset in the copied function. So, if we give 
> explicit values for these,
> {code}
> >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None)
> 
> {code}
> It works fine.
> It seems we should now properly set these on the hijacked one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26411) Streaming: _spark_metadata and checkpoints out of sync cause checkpoint packing failure

2019-02-24 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776565#comment-16776565
 ] 

Jungtaek Lim commented on SPARK-26411:
--

I tend to agree with this concern: the metadata of the file stream sink is mainly 
leveraged by other queries that read the output as a file source, which is, I think, 
"optional" from the point of view of the current query. We have some guidance on 
removing checkpoint data and rerunning the query, but we haven't mentioned that users 
may also need to remove the metadata in the case of a file stream sink.

I think this is related a bit to SPARK-24295, because the reporter there suffered from 
the metadata growing huge too fast and wanted a way to purge it, but given that 
multiple queries can access the output, there is no safe way to purge. The only 
workaround that safely avoids the metadata size issue is simply not writing it when it 
is not necessary.

The file stream sink itself only uses the last succeeded batch ID, which might be 
checkpointed together with the query checkpoint data.
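
For context, a minimal sketch of the file-source-to-file-sink setup under discussion, with hypothetical paths: the sink keeps its own log under <output path>/_spark_metadata while the query's progress is tracked under the checkpoint location, and the report below is about those two getting out of sync.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-sink-metadata-sketch").getOrCreate()

# File source: picks up new text files as they appear in the input directory.
lines = spark.readStream.schema("value STRING").text("/tmp/demo/input")

# File sink: writes Parquet and maintains /tmp/demo/output/_spark_metadata,
# while batch progress is also tracked under the checkpoint location.
query = (lines.writeStream
              .format("parquet")
              .option("path", "/tmp/demo/output")
              .option("checkpointLocation", "/tmp/demo/checkpoint")
              .start())
{code}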

> Streaming: _spark_metadata and checkpoints out of sync cause checkpoint 
> packing failure
> ---
>
> Key: SPARK-26411
> URL: https://issues.apache.org/jira/browse/SPARK-26411
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Alexander Panzhin
>Priority: Major
>
> Spark Structured Streaming from a file source to a file sink seems to be picking up 
> information from the `_spark_metadata` directory for checkpoint data compaction.
> The worst part is that, with the output and checkpoint out of sync, data is not being 
> written.
> *This is not documented anywhere. Removing checkpoint data and leaving 
> _spark_metadata in the output directory WILL CAUSE data loss.*
>  
> FileSourceScanExec.createNonBucketedReadRDD kicks off compaction and fails 
> the whole job, because it expects deltas to be present.
> But the delta files are never written because FileStreamSink.addBatch doesn't 
> execute the Dataframe that it receives.
> {code:java}
> ...
> INFO  [2018-12-17 03:20:02,784] org.apache.spark.sql.execution.streaming.FileStreamSink: Skipping already committed batch 75
> ...
> INFO [2018-12-17 03:30:01,691] org.apache.spark.sql.execution.streaming.FileStreamSource: Log offset set to 76 with 29 new files
> INFO [2018-12-17 03:30:01,700] org.apache.spark.sql.execution.streaming.MicroBatchExecution: Committed offsets for batch 76. Metadata OffsetSeqMetadata(0,1545017401691,Map(spark.sql.shuffle.partitions -> 200, spark.sql.streaming.stateStore.providerClass -> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider))
> INFO [2018-12-17 03:30:01,704] org.apache.spark.sql.execution.streaming.FileStreamSource: Processing 29 files from 76:76
> INFO [2018-12-17 03:30:01,983] org.apache.spark.sql.execution.datasources.FileSourceStrategy: Pruning directories with:
> INFO [2018-12-17 03:30:01,983] org.apache.spark.sql.execution.datasources.FileSourceStrategy: Post-Scan Filters:
> INFO [2018-12-17 03:30:01,984] org.apache.spark.sql.execution.datasources.FileSourceStrategy: Output Data Schema: struct
> INFO [2018-12-17 03:30:01,984] org.apache.spark.sql.execution.FileSourceScanExec: Pushed Filters:
> INFO [2018-12-17 03:30:02,581] org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: Code generated in 16.205011 ms
> INFO [2018-12-17 03:30:02,593] org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: Code generated in 9.368244 ms
> INFO [2018-12-17 03:30:02,629] org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: Code generated in 31.126375 ms
> INFO [2018-12-17 03:30:02,640] org.apache.spark.SparkContext: Created broadcast 86 from start at SourceStream.scala:55
> INFO [2018-12-17 03:30:02,643] org.apache.spark.sql.execution.FileSourceScanExec: Planning scan with bin packing, max size: 14172786 bytes, open cost is considered as scanning 4194304 bytes.
> INFO [2018-12-17 03:30:02,700] org.apache.spark.ContextCleaner: Cleaned accumulator 4321
> INFO [2018-12-17 03:30:02,700] org.apache.spark.ContextCleaner: Cleaned accumulator 4326
> INFO [2018-12-17 03:30:02,700] org.apache.spark.ContextCleaner: Cleaned accumulator 4324
> INFO [2018-12-17 03:30:02,700] org.apache.spark.ContextCleaner: Cleaned accumulator 4320
> INFO [2018-12-17 03:30:02,700] org.apache.spark.ContextCleaner: Cleaned accumulator 4325
> INFO [2018-12-17 03:30:02,737] org.apache.spark.SparkContext: Created broadcast 87 from start at SourceStream.scala:55
> INFO [2018-12-17 03:30:02,756] org.apache.spark.SparkContext: Starting job: start at SourceStream.scala:55
> INFO [2018-12-17 03:30:02,761] org.apache.spark.SparkContext: Created broadcast 88 from broadcast at 

[jira] [Updated] (SPARK-26982) Enhance describe framework to describe the output of a query

2019-02-24 Thread Dilip Biswal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal updated SPARK-26982:
-
Summary: Enhance describe framework to describe the output of a query  
(was: Enhance describe frame work to describe the output of a query)

> Enhance describe framework to describe the output of a query
> 
>
> Key: SPARK-26982
> URL: https://issues.apache.org/jira/browse/SPARK-26982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Priority: Major
>
> Currently we can use `df.printSchema` to discover the schema information for 
> a query. We should have a way to describe the output schema of a query using 
> the SQL interface.
>  
> Example:
> DESCRIBE SELECT * FROM desc_table
> DESCRIBE QUERY SELECT * FROM desc_table



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26982) Enhance describe frame work to describe the output of a query

2019-02-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26982:


Assignee: Apache Spark

> Enhance describe frame work to describe the output of a query
> -
>
> Key: SPARK-26982
> URL: https://issues.apache.org/jira/browse/SPARK-26982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Assignee: Apache Spark
>Priority: Major
>
> Currently we can use `df.printSchema` to discover the schema information for 
> a query. We should have a way to describe the output schema of a query using 
> the SQL interface.
>  
> Example:
> DESCRIBE SELECT * FROM desc_table
> DESCRIBE QUERY SELECT * FROM desc_table



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26982) Enhance describe frame work to describe the output of a query

2019-02-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26982:


Assignee: (was: Apache Spark)

> Enhance describe frame work to describe the output of a query
> -
>
> Key: SPARK-26982
> URL: https://issues.apache.org/jira/browse/SPARK-26982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Priority: Major
>
> Currently we can use `df.printSchema` to discover the schema information for 
> a query. We should have a way to describe the output schema of a query using 
> the SQL interface.
>  
> Example:
> DESCRIBE SELECT * FROM desc_table
> DESCRIBE QUERY SELECT * FROM desc_table



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26982) Enhance describe frame work to describe the output of a query

2019-02-24 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-26982:


 Summary: Enhance describe frame work to describe the output of a 
query
 Key: SPARK-26982
 URL: https://issues.apache.org/jira/browse/SPARK-26982
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Dilip Biswal


Currently we can use `df.printSchema` to discover the schema information for a 
query. We should have a way to describe the output schema of a query using the SQL 
interface.

 

Example:

DESCRIBE SELECT * FROM desc_table

DESCRIBE QUERY SELECT * FROM desc_table
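
For comparison, what is available today versus the proposed syntax, shown through the Python API (the DESCRIBE QUERY form is the proposal of this ticket, not an existing feature in 2.4.0; an active SparkSession named spark is assumed):

{code:python}
# Today: inspect the output schema of a query programmatically.
spark.sql("SELECT * FROM desc_table").printSchema()

# Proposed in this ticket (a SQL-only way to do the same):
# spark.sql("DESCRIBE QUERY SELECT * FROM desc_table").show()
{code}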



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26953) Test TimSort for ArrayIndexOutOfBoundsException

2019-02-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26953.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23856
[https://github.com/apache/spark/pull/23856]

> Test TimSort for ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-26953
> URL: https://issues.apache.org/jira/browse/SPARK-26953
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The paper (https://arxiv.org/pdf/1805.08612.pdf, at the end) shows a case where 
> TimSort can cause an ArrayIndexOutOfBoundsException. In particular, the test in 
> Java is http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java. The test allocates 
> huge arrays of ints, but that seems unnecessary; probably a smaller array of 
> bytes can be used in the test.
> The ticket aims to add a test which checks that Spark's TimSort doesn't cause an 
> ArrayIndexOutOfBoundsException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26953) Test TimSort for ArrayIndexOutOfBoundsException

2019-02-24 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26953:
-

Assignee: Maxim Gekk

> Test TimSort for ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-26953
> URL: https://issues.apache.org/jira/browse/SPARK-26953
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> The paper (https://arxiv.org/pdf/1805.08612.pdf, at the end) shows a case where 
> TimSort can cause an ArrayIndexOutOfBoundsException. In particular, the test in 
> Java is http://igm.univ-mlv.fr/~pivoteau/Timsort/Test.java. The test allocates 
> huge arrays of ints, but that seems unnecessary; probably a smaller array of 
> bytes can be used in the test.
> The ticket aims to add a test which checks that Spark's TimSort doesn't cause an 
> ArrayIndexOutOfBoundsException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26918) All .md should have ASF license header

2019-02-24 Thread Mani M (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776400#comment-16776400
 ] 

Mani M commented on SPARK-26918:


Let me know if I should raise the PR.

> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Minor
>
> Per policy, all .md files should have the header, e.g. 
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> Currently this one does not:
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26927) Race condition may cause dynamic allocation not working

2019-02-24 Thread wuyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776309#comment-16776309
 ] 

wuyi commented on SPARK-26927:
--

[~liupengcheng] I cannot clearly understand the issue from your description. Could 
you elaborate with a simpler, more concrete example?

> Race condition may cause dynamic allocation not working
> ---
>
> Key: SPARK-26927
> URL: https://issues.apache.org/jira/browse/SPARK-26927
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.4.0
>Reporter: liupengcheng
>Priority: Major
> Attachments: Selection_042.jpg, Selection_043.jpg, Selection_044.jpg, 
> Selection_045.jpg, Selection_046.jpg
>
>
> Recently, we caught a bug that caused our production Spark Thrift Server to hang:
> There is a race condition in the ExecutorAllocationManager: the 
> `SparkListenerExecutorRemoved` event is posted before the 
> `SparkListenerTaskStart` event, which causes an incorrect `executorIds` set. 
> Then, when some executor idles, real executors will be removed even when the 
> executor count equals `minNumExecutors`, due to the incorrect computation of 
> `newExecutorTotal` (which may be greater than `minNumExecutors`), finally 
> leaving zero available executors while a wrong set of executorIds is kept in memory.
> What's more, even the `SparkListenerTaskEnd` event cannot release the fake 
> `executorIds`, because later idle events for the fake executors cannot cause the 
> real removal of these executors, as they are already removed and no longer exist 
> in the `executorDataMap` of `CoarseGrainedSchedulerBackend`.
> Logs:
> !Selection_042.jpg!
> !Selection_043.jpg!
> !Selection_044.jpg!
> !Selection_045.jpg!
> !Selection_046.jpg!  
> Event logs (showing the out-of-order events):
> {code:java}
> {"Event":"SparkListenerExecutorRemoved","Timestamp":1549936077543,"Executor 
> ID":"131","Removed Reason":"Container 
> container_e28_1547530852233_236191_02_000180 exited from explicit termination 
> request."}
> {"Event":"SparkListenerTaskStart","Stage ID":136689,"Stage Attempt ID":0,"Task Info":{"Task ID":448048,"Index":2,"Attempt":0,"Launch Time":1549936032872,"Executor ID":"131","Host":"mb2-hadoop-prc-st474.awsind","Locality":"RACK_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1549936032906,"Failed":false,"Killed":false,"Accumulables":[
> {"ID":12923945,"Name":"internal.metrics.executorDeserializeTime","Update":10,"Value":13,"Internal":true,"Count Failed Values":true},
> {"ID":12923946,"Name":"internal.metrics.executorDeserializeCpuTime","Update":2244016,"Value":4286494,"Internal":true,"Count Failed Values":true},
> {"ID":12923947,"Name":"internal.metrics.executorRunTime","Update":20,"Value":39,"Internal":true,"Count Failed Values":true},
> {"ID":12923948,"Name":"internal.metrics.executorCpuTime","Update":13412614,"Value":26759061,"Internal":true,"Count Failed Values":true},
> {"ID":12923949,"Name":"internal.metrics.resultSize","Update":3578,"Value":7156,"Internal":true,"Count Failed Values":true},
> {"ID":12923954,"Name":"internal.metrics.peakExecutionMemory","Update":33816576,"Value":67633152,"Internal":true,"Count Failed Values":true},
> {"ID":12923962,"Name":"internal.metrics.shuffle.write.bytesWritten","Update":1367,"Value":2774,"Internal":true,"Count Failed Values":true},
> {"ID":12923963,"Name":"internal.metrics.shuffle.write.recordsWritten","Update":23,"Value":45,"Internal":true,"Count Failed Values":true},
> {"ID":12923964,"Name":"internal.metrics.shuffle.write.writeTime","Update":3259051,"Value":6858121,"Internal":true,"Count Failed Values":true},
> {"ID":12921550,"Name":"number of output rows","Update":"158","Value":"289","Internal":true,"Count Failed Values":true,"Metadata":"sql"},
> {"ID":12921546,"Name":"number of output rows","Update":"23","Value":"45","Internal":true,"Count Failed Values":true,"Metadata":"sql"},
> {"ID":12921547,"Name":"peak memory total (min, med, max)","Update":"33816575","Value":"67633149","Internal":true,"Count Failed Values":true,"Metadata":"sql"},
> {"ID":12921541,"Name":"data size total (min, med, max)","Update":"551","Value":"1077","Internal":true,"Count Failed Values":true,"Metadata":"sql"}
> ]}}
> {code}
>  
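
A toy model (plain Python, not Spark's listener code) of the bookkeeping effect described above: if the removed event is handled before a task-start event that references the same executor, the id is re-added and never cleaned up afterwards.

{code:python}
executor_ids = {"1", "2"}            # executors the allocation manager believes exist

def on_executor_removed(executor_id):
    executor_ids.discard(executor_id)

def on_task_start(executor_id):
    executor_ids.add(executor_id)    # assumes the executor hosting the task is alive

# Out-of-order delivery, as in the attached event log (removed posted before task start):
on_executor_removed("131")
on_task_start("131")

print(executor_ids)                  # {'1', '2', '131'} -- "131" is tracked but no longer exists
{code}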



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26959) Join of two tables, bucketed the same way, on bucket columns and one or more other coulmns should not need a shuffle

2019-02-24 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776247#comment-16776247
 ] 

Yuming Wang commented on SPARK-26959:
-

Duplicate of [SPARK-24087|https://issues.apache.org/jira/browse/SPARK-24087]?

> Join of two tables, bucketed the same way, on bucket columns and one or more 
> other coulmns should not need a shuffle
> 
>
> Key: SPARK-26959
> URL: https://issues.apache.org/jira/browse/SPARK-26959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1, 2.4.0
>Reporter: Natang
>Priority: Major
>
> _When two tables that are bucketed the same way are joined using bucket 
> columns and one or more other columns, Spark should be able to perform the 
> join without doing a shuffle._
> Consider the example below. There are two tables, 'join_left_table' and 
> 'join_right_table', bucketed by 'col1' into 4 buckets. When these tables are 
> joined on 'col1' and 'col2', Spark should be able to do the join without 
> having to do a shuffle. All entries for a given value of 'col1' would be in 
> the same bucket for both tables, irrespective of the values of 'col2'.
>  
> 
>  
>  
> {noformat}
> def randomInt1to100 = scala.util.Random.nextInt(100)+1
> val left = sc.parallelize(
>   Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)}
> ).toDF("col1", "col2", "col3")
> val right = sc.parallelize(
>   Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)}
> ).toDF("col1", "col2", "col3")
> import org.apache.spark.sql.SaveMode
> left.write
> .bucketBy(4,"col1")
> .sortBy("col1", "col2")
> .mode(SaveMode.Overwrite)
> .saveAsTable("join_left_table")
> 
> right.write
> .bucketBy(4,"col1")
> .sortBy("col1", "col2")
> .mode(SaveMode.Overwrite)
> .saveAsTable("join_right_table")
> val left_table = spark.read.table("join_left_table")
> val right_table = spark.read.table("join_right_table")
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
> val join_on_col1=left_table.join(
> right_table,
> Seq("col1"))
> join_on_col1.explain
> ### BEGIN Output
> join_on_col1: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 3 
> more fields]
> == Physical Plan ==
> *Project [col1#250, col2#251, col3#252, col2#258, col3#259]
> +- *SortMergeJoin [col1#250], [col1#257], Inner
>:- *Sort [col1#250 ASC NULLS FIRST], false, 0
>:  +- *Project [col1#250, col2#251, col3#252]
>: +- *Filter isnotnull(col1#250)
>:+- *FileScan parquet 
> default.join_left_table[col1#250,col2#251,col3#252] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_left_table],
>  PartitionFilters: [], PushedFilters: [IsNotNull(col1)], ReadSchema: 
> struct
>+- *Sort [col1#257 ASC NULLS FIRST], false, 0
>   +- *Project [col1#257, col2#258, col3#259]
>  +- *Filter isnotnull(col1#257)
> +- *FileScan parquet 
> default.join_right_table[col1#257,col2#258,col3#259] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_right_table],
>  PartitionFilters: [], PushedFilters: [IsNotNull(col1)], ReadSchema: 
> struct
> ### END Output
> val join_on_col1_col2=left_table.join(
> right_table,
> Seq("col1","col2"))
> join_on_col1_col2.explain
> ### BEGIN Output
> join_on_col1_col2: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 
> 2 more fields]
> == Physical Plan ==
> *Project [col1#250, col2#251, col3#252, col3#259]
> +- *SortMergeJoin [col1#250, col2#251], [col1#257, col2#258], Inner
>:- *Sort [col1#250 ASC NULLS FIRST, col2#251 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(col1#250, col2#251, 200)
>: +- *Project [col1#250, col2#251, col3#252]
>:+- *Filter (isnotnull(col2#251) && isnotnull(col1#250))
>:   +- *FileScan parquet 
> default.join_left_table[col1#250,col2#251,col3#252] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_left_table],
>  PartitionFilters: [], PushedFilters: [IsNotNull(col2), IsNotNull(col1)], 
> ReadSchema: struct
>+- *Sort [col1#257 ASC NULLS FIRST, col2#258 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(col1#257, col2#258, 200)
>  +- *Project [col1#257, col2#258, col3#259]
> +- *Filter (isnotnull(col2#258) && isnotnull(col1#257))
>+- *FileScan parquet 
> default.join_right_table[col1#257,col2#258,col3#259] Batched: true, Format: 
> Parquet, Location: 
> 

[jira] [Updated] (SPARK-26979) [PySpark] Some SQL functions do not take column names

2019-02-24 Thread Andre Sa de Mello (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andre Sa de Mello updated SPARK-26979:
--
Description: 
Most SQL functions defined in _org.apache.spark.sql.functions_ have two 
variations, one taking a Column object as input, and another taking a string 
representing a column name, which is then converted into a Column object 
internally.

There are, however, a few notable exceptions:
 * lower()
 * upper()
 * abs()
 * bitwiseNOT()

While this doesn't break anything, as you can easily create a Column object 
yourself prior to passing it to one of these functions, it has two undesirable 
consequences:
 # It is surprising - it breaks coder's expectations when they are first 
starting with Spark. Every API should be as consistent as possible, so as to 
make the learning curve smoother and to reduce causes for human error;
 # It gets in the way of stylistic conventions. Most of the time it makes 
Python/Scala/Java code more readable to use literal names, and the API provides 
ample support for that, but these few exceptions prevent this pattern from 
being universally applicable.

This is a very easy fix, and I see no reason not to apply it. I have a PR ready.

*UPDATE:* Turns out there are many exceptions over this pattern that I wasn't 
aware of. The reason I missed them is because I had been looking at things from 
PySpark's point of view, and the API there does support column name literals 
for almost all SQL functions.

Exceptions for the PySpark API include all the above plus:
 * ltrim()
 * rtrim()
 * trim()
 * ascii()
 * base64()
 * unbase64()

The argument for making the API consistent still stands, however. I have been 
working on a PR to fix this on *PySpark's side*, and it should still be a 
painless change. 
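
A quick illustration of the inconsistency from the PySpark side; the DataFrame and column names are made up for the example, and the behaviour is as described in this ticket for 2.4.0:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "Ada")], ["id", "name"])

df.select(F.max("id"))               # most functions accept a plain column name
df.select(F.upper(F.col("name")))    # upper() works only with an explicit Column
# df.select(F.upper("name"))         # per this ticket, the string-name variant is missing
{code}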

  was:
Most SQL functions defined in _org.apache.spark.sql.functions_ have two 
variations, one taking a Column object as input, and another taking a string 
representing a column name, which is then converted into a Column object 
internally.

There are, however, a few notable exceptions:
 * lower()
 * upper()
 * abs()
 * bitwiseNOT()

While this doesn't break anything, as you can easily create a Column object 
yourself prior to passing it to one of these functions, it has two undesirable 
consequences:
 # It is surprising - it breaks coder's expectations when they are first 
starting with Spark. Every API should be as consistent as possible, so as to 
make the learning curve smoother and to reduce causes for human error;
 # It gets in the way of stylistic conventions. Most of the time it makes 
Python/Scala/Java code more readable to use literal names, and the API provides 
ample support for that, but these few exceptions prevent this pattern from 
being universally applicable.

This is a very easy fix, and I see no reason not to apply it. I have a PR ready.

*UPDATE:* Turns out there are many exceptions over this pattern that I wasn't 
aware of. The reason I missed them is because I had been looking at things from 
PySpark's point of view, and the API there does support column name literals 
for almost all SQL functions.

Exceptions for the PySpark API include all the above plus:
 * ltrim()
 * rtrim()
 * trim()
 * ascii()
 * initcap()
 * base64()
 * unbase64()

The argument for making the API consistent still stands, however. I have been 
working on a PR to fix this on *PySpark's side*, and it should still be a 
painless change. 


> [PySpark] Some SQL functions do not take column names
> -
>
> Key: SPARK-26979
> URL: https://issues.apache.org/jira/browse/SPARK-26979
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Andre Sa de Mello
>Priority: Minor
>  Labels: easyfix, pull-request-available, usability
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Most SQL functions defined in _org.apache.spark.sql.functions_ have two 
> variations, one taking a Column object as input, and another taking a string 
> representing a column name, which is then converted into a Column object 
> internally.
> There are, however, a few notable exceptions:
>  * lower()
>  * upper()
>  * abs()
>  * bitwiseNOT()
> While this doesn't break anything, as you can easily create a Column object 
> yourself prior to passing it to one of these functions, it has two 
> undesirable consequences:
>  # It is surprising - it breaks coder's expectations when they are first 
> starting with Spark. Every API should be as consistent as possible, so as to 
> make the learning curve smoother and to reduce causes for human error;
>  # It gets in the way of stylistic conventions. Most of the time it makes 
> Python/Scala/Java code more readable to use literal names, and the API 
> provides ample support 

[jira] [Updated] (SPARK-26979) [PySpark] Some SQL functions do not take column names

2019-02-24 Thread Andre Sa de Mello (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andre Sa de Mello updated SPARK-26979:
--
Description: 
Most SQL functions defined in _org.apache.spark.sql.functions_ have two 
variations, one taking a Column object as input, and another taking a string 
representing a column name, which is then converted into a Column object 
internally.

There are, however, a few notable exceptions:
 * lower()
 * upper()
 * abs()
 * bitwiseNOT()

While this doesn't break anything, as you can easily create a Column object 
yourself prior to passing it to one of these functions, it has two undesirable 
consequences:
 # It is surprising - it breaks coder's expectations when they are first 
starting with Spark. Every API should be as consistent as possible, so as to 
make the learning curve smoother and to reduce causes for human error;
 # It gets in the way of stylistic conventions. Most of the time it makes 
Python/Scala/Java code more readable to use literal names, and the API provides 
ample support for that, but these few exceptions prevent this pattern from 
being universally applicable.

This is a very easy fix, and I see no reason not to apply it. I have a PR ready.

*UPDATE:* Turns out there are many exceptions over this pattern that I wasn't 
aware of. The reason I missed them is because I had been looking at things from 
PySpark's point of view, and the API there does support column name literals 
for almost all SQL functions.

Exceptions for the PySpark API include all the above plus:
 * ltrim()
 * rtrim()
 * trim()
 * ascii()
 * initcap()
 * base64()
 * unbase64()

The argument for making the API consistent still stands, however. I have been 
working on a PR to fix this on *PySpark's side*, and it should still be a 
painless change. 

  was:
Most SQL functions defined in _org.apache.spark.sql.functions_ have two 
variations, one taking a Column object as input, and another taking a string 
representing a column name, which is then converted into a Column object 
internally.

There are, however, a few notable exceptions:
 * lower()
 * upper()
 * abs()
 * bitwiseNOT()

While this doesn't break anything, as you can easily create a Column object 
yourself prior to passing it to one of these functions, it has two undesirable 
consequences:
 # It is surprising - it breaks coder's expectations when they are first 
starting with Spark. Every API should be as consistent as possible, so as to 
make the learning curve smoother and to reduce causes for human error;
 # It gets in the way of stylistic conventions. Most of the time it makes 
Python/Scala/Java code more readable to use literal names, and the API provides 
ample support for that, but these few exceptions prevent this pattern from 
being universally applicable.

This is a very easy fix, and I see no reason not to apply it. I have a PR ready.

*UPDATE:* Turns out there are many exceptions over this pattern that I wasn't 
aware of. The reason I missed them is because I had been looking at things from 
PySpark's point of view, and the API there does support column name literals 
for almost all SQL functions.

Exceptions for the PySpark API include all the above plus:
 * ltrim()
 * rtrim()
 * trim()
 * ascii()
 * initcap()
 * base64()
 * unbase64()

The argument for making the API consistent still stands, however. I have been 
working on a PR to fix this on PySpark's side, and it should still be a 
painless change. 


> [PySpark] Some SQL functions do not take column names
> -
>
> Key: SPARK-26979
> URL: https://issues.apache.org/jira/browse/SPARK-26979
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Andre Sa de Mello
>Priority: Minor
>  Labels: easyfix, pull-request-available, usability
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Most SQL functions defined in _org.apache.spark.sql.functions_ have two 
> variations, one taking a Column object as input, and another taking a string 
> representing a column name, which is then converted into a Column object 
> internally.
> There are, however, a few notable exceptions:
>  * lower()
>  * upper()
>  * abs()
>  * bitwiseNOT()
> While this doesn't break anything, as you can easily create a Column object 
> yourself prior to passing it to one of these functions, it has two 
> undesirable consequences:
>  # It is surprising - it breaks coder's expectations when they are first 
> starting with Spark. Every API should be as consistent as possible, so as to 
> make the learning curve smoother and to reduce causes for human error;
>  # It gets in the way of stylistic conventions. Most of the time it makes 
> Python/Scala/Java code more readable to use literal names, and the API 
> provides 

[jira] [Updated] (SPARK-26979) [PySpark] Some SQL functions do not take column names

2019-02-24 Thread Andre Sa de Mello (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andre Sa de Mello updated SPARK-26979:
--
Component/s: (was: SQL)
 PySpark

> [PySpark] Some SQL functions do not take column names
> -
>
> Key: SPARK-26979
> URL: https://issues.apache.org/jira/browse/SPARK-26979
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Andre Sa de Mello
>Priority: Minor
>  Labels: easyfix, pull-request-available, usability
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Most SQL functions defined in _org.apache.spark.sql.functions_ have two 
> variations, one taking a Column object as input, and another taking a string 
> representing a column name, which is then converted into a Column object 
> internally.
> There are, however, a few notable exceptions:
>  * lower()
>  * upper()
>  * abs()
>  * bitwiseNOT()
> While this doesn't break anything, as you can easily create a Column object 
> yourself prior to passing it to one of these functions, it has two 
> undesirable consequences:
>  # It is surprising - it breaks coder's expectations when they are first 
> starting with Spark. Every API should be as consistent as possible, so as to 
> make the learning curve smoother and to reduce causes for human error;
>  # It gets in the way of stylistic conventions. Most of the time it makes 
> Python/Scala/Java code more readable to use literal names, and the API 
> provides ample support for that, but these few exceptions prevent this 
> pattern from being universally applicable.
> This is a very easy fix, and I see no reason not to apply it. I have a PR 
> ready.
> *UPDATE:* Turns out there are many exceptions over this pattern that I wasn't 
> aware of. The reason I missed them is because I had been looking at things 
> from PySpark's point of view, and the API there does support column name 
> literals for almost all SQL functions.
> Exceptions for the PySpark API include all the above plus:
>  * ltrim()
>  * rtrim()
>  * trim()
>  * ascii()
>  * initcap()
>  * base64()
>  * unbase64()
> The argument for making the API consistent still stands, however. I have been 
> working on a PR to fix this on PySpark's side, and it should still be a 
> painless change. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26979) Some SQL functions do not take column names

2019-02-24 Thread Andre Sa de Mello (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andre Sa de Mello updated SPARK-26979:
--
Description: 
Most SQL functions defined in _org.apache.spark.sql.functions_ have two 
variations, one taking a Column object as input, and another taking a string 
representing a column name, which is then converted into a Column object 
internally.

There are, however, a few notable exceptions:
 * lower()
 * upper()
 * abs()
 * bitwiseNOT()

While this doesn't break anything, as you can easily create a Column object 
yourself prior to passing it to one of these functions, it has two undesirable 
consequences:
 # It is surprising - it breaks coder's expectations when they are first 
starting with Spark. Every API should be as consistent as possible, so as to 
make the learning curve smoother and to reduce causes for human error;
 # It gets in the way of stylistic conventions. Most of the time it makes 
Python/Scala/Java code more readable to use literal names, and the API provides 
ample support for that, but these few exceptions prevent this pattern from 
being universally applicable.

This is a very easy fix, and I see no reason not to apply it. I have a PR ready.

*UPDATE:* Turns out there are many exceptions over this pattern that I wasn't 
aware of. The reason I missed them is because I had been looking at things from 
PySpark's point of view, and the API there does support column name literals 
for almost all SQL functions.

Exceptions for the PySpark API include all the above plus:
 * ltrim()
 * rtrim()
 * trim()
 * ascii()
 * initcap()
 * base64()
 * unbase64()

The argument for making the API consistent still stands, however. I have been 
working on a PR to fix this on PySpark's side, and it should still be a 
painless change. 

  was:
Most SQL functions defined in _org.apache.spark.sql.functions_ have two 
variations, one taking a Column object as input, and another taking a string 
representing a column name, which is then converted into a Column object 
internally.

There are, however, a few notable exceptions:
 * lower()
 * upper()
 * abs()
 * bitwiseNOT()

While this doesn't break anything, as you can easily create a Column object 
yourself prior to passing it to one of these functions, it has two undesirable 
consequences:
 # It is surprising - it breaks coder's expectations when they are first 
starting with Spark. Every API should be as consistent as possible, so as to 
make the learning curve smoother and to reduce causes for human error;
 # It gets in the way of stylistic conventions. Most of the time it makes 
Python/Scala/Java code more readable to use literal names, and the API provides 
ample support for that, but these few exceptions prevent this pattern from 
being universally applicable.

This is a very easy fix, and I see no reason not to apply it. I have a PR ready.

 


> Some SQL functions do not take column names
> ---
>
> Key: SPARK-26979
> URL: https://issues.apache.org/jira/browse/SPARK-26979
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Andre Sa de Mello
>Priority: Minor
>  Labels: easyfix, pull-request-available, usability
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Most SQL functions defined in _org.apache.spark.sql.functions_ have two 
> variations, one taking a Column object as input, and another taking a string 
> representing a column name, which is then converted into a Column object 
> internally.
> There are, however, a few notable exceptions:
>  * lower()
>  * upper()
>  * abs()
>  * bitwiseNOT()
> While this doesn't break anything, as you can easily create a Column object 
> yourself prior to passing it to one of these functions, it has two 
> undesirable consequences:
>  # It is surprising - it breaks coders' expectations when they are first 
> starting with Spark. Every API should be as consistent as possible, so as to 
> make the learning curve smoother and to reduce causes for human error;
>  # It gets in the way of stylistic conventions. Most of the time it makes 
> Python/Scala/Java code more readable to use literal names, and the API 
> provides ample support for that, but these few exceptions prevent this 
> pattern from being universally applicable.
> This is a very easy fix, and I see no reason not to apply it. I have a PR 
> ready.
> *UPDATE:* It turns out there are many more exceptions to this pattern than I 
> was aware of. The reason I missed them is that I had been looking at things 
> from PySpark's point of view, and the API there does support column name 
> literals for almost all SQL functions.
> Exceptions for the PySpark API include all the above plus:
>  * ltrim()
>  * rtrim()
>  * trim()
>  * ascii()
>  * initcap()
>  * 

[jira] [Updated] (SPARK-26979) [PySpark] Some SQL functions do not take column names

2019-02-24 Thread Andre Sa de Mello (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andre Sa de Mello updated SPARK-26979:
--
Summary: [PySpark] Some SQL functions do not take column names  (was: Some 
SQL functions do not take column names)

> [PySpark] Some SQL functions do not take column names
> -
>
> Key: SPARK-26979
> URL: https://issues.apache.org/jira/browse/SPARK-26979
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Andre Sa de Mello
>Priority: Minor
>  Labels: easyfix, pull-request-available, usability
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Most SQL functions defined in _org.apache.spark.sql.functions_ have two 
> variations, one taking a Column object as input, and another taking a string 
> representing a column name, which is then converted into a Column object 
> internally.
> There are, however, a few notable exceptions:
>  * lower()
>  * upper()
>  * abs()
>  * bitwiseNOT()
> While this doesn't break anything, as you can easily create a Column object 
> yourself prior to passing it to one of these functions, it has two 
> undesirable consequences:
>  # It is surprising - it breaks coders' expectations when they are first 
> starting with Spark. Every API should be as consistent as possible, so as to 
> make the learning curve smoother and to reduce causes for human error;
>  # It gets in the way of stylistic conventions. Most of the time it makes 
> Python/Scala/Java code more readable to use literal names, and the API 
> provides ample support for that, but these few exceptions prevent this 
> pattern from being universally applicable.
> This is a very easy fix, and I see no reason not to apply it. I have a PR 
> ready.
> *UPDATE:* It turns out there are many more exceptions to this pattern than I 
> was aware of. The reason I missed them is that I had been looking at things 
> from PySpark's point of view, and the API there does support column name 
> literals for almost all SQL functions.
> Exceptions for the PySpark API include all the above plus:
>  * ltrim()
>  * rtrim()
>  * trim()
>  * ascii()
>  * initcap()
>  * base64()
>  * unbase64()
> The argument for making the API consistent still stands, however. I have been 
> working on a PR to fix this on PySpark's side, and it should still be a 
> painless change. 






[jira] [Commented] (SPARK-26961) Found Java-level deadlock in Spark Driver

2019-02-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776211#comment-16776211
 ] 

Hyukjin Kwon commented on SPARK-26961:
--

How did this happen? Would you be able to provide a reproducer and/or narrow 
this down further?

> Found Java-level deadlock in Spark Driver
> -
>
> Key: SPARK-26961
> URL: https://issues.apache.org/jira/browse/SPARK-26961
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Rong Jialei
>Priority: Major
>
> Our Spark job usually finishes in minutes; however, we recently found it 
> taking days to run, and we could only kill it when this happened.
> An investigation showed that no worker container could connect to the driver 
> after start and the driver was hanging; using jstack, we found a Java-level 
> deadlock.
>  
> *The jstack output for the deadlocked part is shown below:*
>  
> Found one Java-level deadlock:
> =
> "SparkUI-907":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> "ForkJoinPool-1-worker-57":
>  waiting to lock monitor 0x7f3860574298 (object 0x0005b7991168, a 
> org.apache.spark.util.MutableURLClassLoader),
>  which is held by "ForkJoinPool-1-worker-7"
> "ForkJoinPool-1-worker-7":
>  waiting to lock monitor 0x7f387761b398 (object 0x0005c0c1e5e0, a 
> org.apache.hadoop.conf.Configuration),
>  which is held by "ForkJoinPool-1-worker-57"
> Java stack information for the threads listed above:
> ===
> "SparkUI-907":
>  at org.apache.hadoop.conf.Configuration.getOverlay(Configuration.java:1328)
>  - waiting to lock <0x0005c0c1e5e0> (a 
> org.apache.hadoop.conf.Configuration)
>  at 
> org.apache.hadoop.conf.Configuration.handleDeprecation(Configuration.java:684)
>  at org.apache.hadoop.conf.Configuration.get(Configuration.java:1088)
>  at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1145)
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2363)
>  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
>  at 
> org.apache.hadoop.fs.FsUrlStreamHandlerFactory.createURLStreamHandler(FsUrlStreamHandlerFactory.java:74)
>  at java.net.URL.getURLStreamHandler(URL.java:1142)
>  at java.net.URL.<init>(URL.java:599)
>  at java.net.URL.<init>(URL.java:490)
>  at java.net.URL.<init>(URL.java:439)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doRequest(JettyUtils.scala:176)
>  at org.apache.spark.ui.JettyUtils$$anon$4.doGet(JettyUtils.scala:161)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:171)
>  at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>  at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>  at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>  at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>  at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>  at org.spark_project.jetty.server.Server.handle(Server.java:534)
>  at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>  at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>  at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>  at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
>  at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>  at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>  at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>  at java.lang.Thread.run(Thread.java:748)
> "ForkJoinPool-1-worker-57":
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
>  - waiting to lock 

[jira] [Commented] (SPARK-26968) option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation

2019-02-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776210#comment-16776210
 ] 

Hyukjin Kwon commented on SPARK-26968:
--

{{quoteMode}} does not exist in Spark anymore because we had to switch the CSV 
library to Univocity.
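
For reference, a hedged PySpark sketch of the quoting knobs the Univocity-backed 
writer does expose ({{quoteAll}} quotes every field regardless of type; the 
DataFrame and output path below are illustrative only):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("03142", "LENAX", 267, 43)],
    ["codeCommuneCR", "nomCommuneCR", "populationCR", "resultatComptable"])

# quoteAll, quote and escapeQuotes are the available quoting options;
# there is no per-type (NON_NUMERIC) quoting mode.
(df.coalesce(1).write.mode("overwrite")
   .option("header", True)
   .option("quoteAll", True)
   .option("quote", '"')
   .csv("/tmp/out_200071470.csv"))
{code}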

> option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation
> -
>
> Key: SPARK-26968
> URL: https://issues.apache.org/jira/browse/SPARK-26968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: M. Le Bihan
>Priority: Minor
>
> I have a CSV to write that has this schema:
> {code:java}
> StructType s = schema.add("codeCommuneCR", StringType, false);
> s = s.add("nomCommuneCR", StringType, false);
> s = s.add("populationCR", IntegerType, false);
> s = s.add("resultatComptable", IntegerType, false);{code}
> If I don't provide the option "_quoteMode_", or even if I set it to 
> {{NON_NUMERIC}}, like this:
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteMode", "NON_NUMERIC").option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> the CSV written by {{Spark}} is this one:
> {code:java}
> codeCommuneCR,nomCommuneCR,populationCR,resultatComptable
> 03142,LENAX,267,43{code}
> If I set the option "_quoteAll_" instead, like this:
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteAll", true).option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> it generates :
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable" 
> "03142","LENAX","267","43"{code}
> It seems that the {{.option("quoteMode", "NON_NUMERIC")}} is broken. It 
> should generate:
>  
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
> "03142","LENAX",267,43
> {code}
>  






[jira] [Resolved] (SPARK-26968) option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation

2019-02-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26968.
--
Resolution: Not A Problem

> option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation
> -
>
> Key: SPARK-26968
> URL: https://issues.apache.org/jira/browse/SPARK-26968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: M. Le Bihan
>Priority: Minor
>
> I have a CSV to write that has this schema:
> {code:java}
> StructType s = schema.add("codeCommuneCR", StringType, false);
> s = s.add("nomCommuneCR", StringType, false);
> s = s.add("populationCR", IntegerType, false);
> s = s.add("resultatComptable", IntegerType, false);{code}
> If I don't provide the option "_quoteMode_", or even if I set it to 
> {{NON_NUMERIC}}, like this:
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteMode", "NON_NUMERIC").option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> the CSV written by {{Spark}} is this one:
> {code:java}
> codeCommuneCR,nomCommuneCR,populationCR,resultatComptable
> 03142,LENAX,267,43{code}
> If I set the option "_quoteAll_" instead, like this:
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteAll", true).option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> it generates :
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable" 
> "03142","LENAX","267","43"{code}
> It seems that the {{.option("quoteMode", "NON_NUMERIC")}} is broken. It 
> should generate:
>  
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
> "03142","LENAX",267,43
> {code}
>  






[jira] [Updated] (SPARK-26969) [Spark] Using ODBC not able to see the data in table when datatype is decimal

2019-02-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26969:
-
Description: 
{code}
# Install ODBC using the odbc rpm file
 # Connect to ODBC using isql -v spark2xsingle
 # SQL> create table t1_t(id decimal(15,2));
 # SQL> insert into t1_t values(15);
 # 
SQL> select * from t1_t;
+-+
| id |
+-+
+-+  Actual output is empty
{code}

Note: When creating a table with an int data type, select gives the result below
{code}
SQL> create table test_t1(id int);
SQL> insert into test_t1 values(10);
SQL> select * from test_t1;
++
| id |
++
| 10 |
++
{code}

The decimal case needs to be handled.
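
For cross-checking, a hedged PySpark sketch of the same steps through the SQL 
path ({{USING parquet}} is an assumption; if this returns the row, the problem 
is likely confined to the ODBC/Thrift handling of decimals):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS t1_t (id DECIMAL(15,2)) USING parquet")
spark.sql("INSERT INTO t1_t VALUES (15)")
spark.sql("SELECT * FROM t1_t").show()  # expected: one row showing 15.00
{code}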



  was:
# Using odbc rpm file install odbc 
 # connect to odbc using isql -v spark2xsingle
 # SQL> create table t1_t(id decimal(15,2));
 # SQL> insert into t1_t values(15);
 # 
SQL> select * from t1_t;
+-+
| id |
+-+
+-+  Actual output is empty

Note: When creating table of int data type select is giving result as below
SQL> create table test_t1(id int);
SQL> insert into test_t1 values(10);
SQL> select * from test_t1;
++
| id |
++
| 10 |
++

Needs to handle for decimal case.




> [Spark] Using ODBC not able to see the data in table when datatype is decimal
> -
>
> Key: SPARK-26969
> URL: https://issues.apache.org/jira/browse/SPARK-26969
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.4.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> {code}
> # Install ODBC using the odbc rpm file
>  # Connect to ODBC using isql -v spark2xsingle
>  # SQL> create table t1_t(id decimal(15,2));
>  # SQL> insert into t1_t values(15);
>  # 
> SQL> select * from t1_t;
> +-+
> | id |
> +-+
> +-+  Actual output is empty
> {code}
> Note: When creating a table with an int data type, select gives the result below
> {code}
> SQL> create table test_t1(id int);
> SQL> insert into test_t1 values(10);
> SQL> select * from test_t1;
> ++
> | id |
> ++
> | 10 |
> ++
> {code}
> The decimal case needs to be handled.






[jira] [Comment Edited] (SPARK-26932) Orc compatibility between hive and spark

2019-02-24 Thread Bo Hai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776192#comment-16776192
 ] 

Bo Hai edited comment on SPARK-26932 at 2/24/19 9:48 AM:
-

Relevant hive jiras:
* https://issues.apache.org/jira/browse/HIVE-16683
* https://issues.apache.org/jira/browse/HIVE-14007


was (Author: haiboself):
Relevant hive jiras:
* https://jira.apache.org/jira/browse/SPARK-24322
* https://issues.apache.org/jira/browse/HIVE-14007

> Orc compatibility between hive and spark
> 
>
> Key: SPARK-26932
> URL: https://issues.apache.org/jira/browse/SPARK-26932
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Bo Hai
>Priority: Minor
>
> As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer 
> and reader. Older versions of Hive implement their own ORC reader, which is 
> not forward-compatible.
> So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, 
> which use apache/orc instead of the Hive ORC reader.
> I think we should add this information to the Spark 2.4 ORC documentation 
> page: https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html






[jira] [Comment Edited] (SPARK-26932) Orc compatibility between hive and spark

2019-02-24 Thread Bo Hai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776192#comment-16776192
 ] 

Bo Hai edited comment on SPARK-26932 at 2/24/19 9:47 AM:
-

Relevant hive jiras:
* https://jira.apache.org/jira/browse/SPARK-24322
* https://issues.apache.org/jira/browse/HIVE-14007


was (Author: haiboself):
Relevant hive jiras:
* https://jira.apache.org/jira/browse/SPARK-24322

> Orc compatibility between hive and spark
> 
>
> Key: SPARK-26932
> URL: https://issues.apache.org/jira/browse/SPARK-26932
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Bo Hai
>Priority: Minor
>
> As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer 
> and reader. Older versions of Hive implement their own ORC reader, which is 
> not forward-compatible.
> So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, 
> which use apache/orc instead of the Hive ORC reader.
> I think we should add this information to the Spark 2.4 ORC documentation 
> page: https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html






[jira] [Commented] (SPARK-26971) How to read delimiter (Cedilla) in spark RDD and Dataframes

2019-02-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776194#comment-16776194
 ] 

Hyukjin Kwon commented on SPARK-26971:
--

Questions should go to the mailing list rather than being filed as an issue here.

> How to read delimiter (Cedilla) in spark RDD and Dataframes
> ---
>
> Key: SPARK-26971
> URL: https://issues.apache.org/jira/browse/SPARK-26971
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Babu
>Priority: Minor
>
>  
> I am trying to read a cedilla-delimited HDFS text file. I am getting the 
> error below; did anyone face a similar issue?
> {{hadoop fs -cat test_file.dat }}
> {{1ÇCelvelandÇOhio 2ÇDurhamÇNC 3ÇDallasÇTexas }}
> {{>>> rdd = sc.textFile("test_file.dat") }}
> {{>>> rdd.collect() [u'1\xc7Celveland\xc7Ohio', u'2\xc7Durham\xc7NC', 
> u'3Dallas\xc7Texas'] }}
> {{>>> rdd.map(lambda p: p.split("\xc7")).collect() UnicodeDecodeError: 
> 'ascii' codec can't decode byte 0xc7 in position 0: ordinal not in range(128) 
> }}
> {{>>> 
> sqlContext.read.format("text").option("delimiter","Ç").option("encoding","ISO-8859").load("/user/cloudera/test_file.dat").show()
>  }}
> |1ÇCelvelandÇOhio|
> {{2ÇDurhamÇNC}}
> {{ 3DallasÇTexas}}
>  
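
For context, a hedged Python 2 sketch of what most likely goes wrong above: 
splitting a unicode string with a byte-string separator forces an implicit 
ASCII decode, which fails on 0xC7. The data below is copied from the report for 
illustration only:

{code:python}
# -*- coding: utf-8 -*-
# Python 2 sketch (the report targets Spark 1.6 / Python 2).
line = u"1\xc7Celveland\xc7Ohio"      # sc.textFile() yields unicode strings

# line.split("\xc7") mixes unicode data with a byte-string separator; Python 2
# implicitly decodes the separator as ASCII, which raises the reported
# UnicodeDecodeError ("can't decode byte 0xc7 in position 0").

# Splitting with a unicode separator avoids the implicit decode:
print(line.split(u"\xc7"))            # [u'1', u'Celveland', u'Ohio']

# In the RDD pipeline this would be, assuming an existing SparkContext sc:
# rdd = sc.textFile("test_file.dat")
# rdd.map(lambda p: p.split(u"\xc7")).collect()
{code}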






[jira] [Resolved] (SPARK-26971) How to read delimiter (Cedilla) in spark RDD and Dataframes

2019-02-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26971.
--
Resolution: Invalid

> How to read delimiter (Cedilla) in spark RDD and Dataframes
> ---
>
> Key: SPARK-26971
> URL: https://issues.apache.org/jira/browse/SPARK-26971
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Babu
>Priority: Minor
>
>  
> I am trying to read a cedilla-delimited HDFS text file. I am getting the 
> error below; did anyone face a similar issue?
> {{hadoop fs -cat test_file.dat }}
> {{1ÇCelvelandÇOhio 2ÇDurhamÇNC 3ÇDallasÇTexas }}
> {{>>> rdd = sc.textFile("test_file.dat") }}
> {{>>> rdd.collect() [u'1\xc7Celveland\xc7Ohio', u'2\xc7Durham\xc7NC', 
> u'3Dallas\xc7Texas'] }}
> {{>>> rdd.map(lambda p: p.split("\xc7")).collect() UnicodeDecodeError: 
> 'ascii' codec can't decode byte 0xc7 in position 0: ordinal not in range(128) 
> }}
> {{>>> 
> sqlContext.read.format("text").option("delimiter","Ç").option("encoding","ISO-8859").load("/user/cloudera/test_file.dat").show()
>  }}
> |1ÇCelvelandÇOhio|
> {{2ÇDurhamÇNC}}
> {{ 3DallasÇTexas}}
>  






[jira] [Commented] (SPARK-26932) Orc compatibility between hive and spark

2019-02-24 Thread Bo Hai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776192#comment-16776192
 ] 

Bo Hai commented on SPARK-26932:


Relevant hive jiras:
* https://jira.apache.org/jira/browse/SPARK-24322

> Orc compatibility between hive and spark
> 
>
> Key: SPARK-26932
> URL: https://issues.apache.org/jira/browse/SPARK-26932
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Bo Hai
>Priority: Minor
>
> As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer 
> and reader. Older versions of Hive implement their own ORC reader, which is 
> not forward-compatible.
> So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, 
> which use apache/orc instead of the Hive ORC reader.
> I think we should add this information to the Spark 2.4 ORC documentation 
> page: https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html






[jira] [Commented] (SPARK-26972) Issue with CSV import and inferSchema set to true

2019-02-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776193#comment-16776193
 ] 

Hyukjin Kwon commented on SPARK-26972:
--

Your books.csv contains {{\r\n}} line endings, and I guess you read it on 
macOS. The current behaviour looks correct if that's the issue. You will be 
able to set the line separator in Spark 3.0 (see SPARK-26108).
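
For illustration, a hedged PySpark sketch of the usual way to sidestep 
inference differences entirely by supplying the schema up front (column types 
follow the schema reported for v2.0.1; the session and path are assumptions):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Declaring the schema avoids depending on inferSchema, whose results differ
# across the listed Spark versions for this file.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("authorId", IntegerType(), True),
    StructField("title", StringType(), True),
    StructField("releaseDate", StringType(), True),
    StructField("link", StringType(), True),
])

df = (spark.read.format("csv")
      .option("header", True)
      .option("multiline", True)
      .option("sep", ";")
      .option("quote", "*")
      .schema(schema)
      .load("data/books.csv"))
df.show(7)
df.printSchema()
{code}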

> Issue with CSV import and inferSchema set to true
> -
>
> Key: SPARK-26972
> URL: https://issues.apache.org/jira/browse/SPARK-26972
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.3, 2.3.3, 2.4.0
> Environment: Java 8/Scala 2.11/MacOs
>Reporter: Jean Georges Perrin
>Priority: Major
> Attachments: ComplexCsvToDataframeApp.java, 
> ComplexCsvToDataframeWithSchemaApp.java, books.csv, issue.txt, pom.xml
>
>
>  
> I found a few discrepancies while working with inferSchema set to true in CSV 
> ingestion.
> Given the following CSV in the attached books.csv:
> {noformat}
> id;authorId;title;releaseDate;link
> 1;1;Fantastic Beasts and Where to Find Them: The Original 
> Screenplay;11/18/16;http://amzn.to/2kup94P
> 2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry 
> Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP
> 3;1;*The Tales of Beedle the Bard, Standard Edition (Harry 
> Potter)*;12/4/08;http://amzn.to/2kYezqr
> 4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry 
> Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n
> 5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the 
> Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT
> 6;2;*Development Tools in 2006: any Room for a 4GL-style Language?
> An independent study by Jean Georges Perrin, IIUG Board 
> Member*;12/28/16;http://amzn.to/2vBxOe1
> 7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav
> 8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD
> 10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA
> 11;4;Diderot Encyclopedia: The Complete Illustrations 
> 1762-1777;;http://amzn.to/2i2zo3I
> 12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ
> 13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW
> 14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk
> 15;7;Soft Skills: The software developer's life 
> manual;12/29/14;http://amzn.to/2zNnSyn
> 16;8;Of Mice and Men;;http://amzn.to/2zJjXoc
> 17;9;*Java 8 in Action: Lambdas; Streams; and functional-style 
> programming*;8/28/14;http://amzn.to/2isdqoL
> 18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY
> 19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG
> 20;14;*Fables choisies; mises en vers par M. de La 
> Fontaine*;9/1/1999;http://amzn.to/2yRH10W
> 21;15;Discourse on Method and Meditations on First 
> Philosophy;6/15/1999;http://amzn.to/2hwB8zc
> 22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo
> 23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo{noformat}
> And this Java code:
> {code:java}
> Dataset df = spark.read().format("csv")
>  .option("header", "true")
>  .option("multiline", true)
>  .option("sep", ";")
>  .option("quote", "*")
>  .option("dateFormat", "M/d/y")
>  .option("inferSchema", true)
>  .load("data/books.csv");
> df.show(7);
> df.printSchema();
> {code}
> h1. In Spark v2.0.1
> Output: 
> {noformat}
> +---+++---++
> | id|authorId|   title|releaseDate|link|
> +---+++---++
> |  1|   1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
> |  2|   1|Harry Potter and ...|10/6/15|http://amzn.to/2l...|
> |  3|   1|The Tales of Beed...|12/4/08|http://amzn.to/2k...|
> |  4|   1|Harry Potter and ...|10/4/16|http://amzn.to/2k...|
> |  5|   2|Informix 12.10 on...|4/23/17|http://amzn.to/2i...|
> |  6|   2|Development Tools...|   12/28/16|http://amzn.to/2v...|
> |  7|   3|Adventures of Huc...|.   5/26/94|http://amzn.to/2w...|
> +---+++---++
> only showing top 7 rows
> Dataframe's schema:
> root
> |-- id: integer (nullable = true)
> |-- authorId: integer (nullable = true)
> |-- title: string (nullable = true)
> |-- releaseDate: string (nullable = true)
> |-- link: string (nullable = true)
> {noformat}
> *This is fine and the expected output*.
> h1. Using Apache Spark v2.1.3
> Excerpt of the dataframe content: 
> {noformat}
> ++++---++
> | id|authorId| title|releaseDate| link|
> ++++---++
> | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
> | 2| 1|Harry Potter and 

[jira] [Commented] (SPARK-26932) Orc compatibility between hive and spark

2019-02-24 Thread Bo Hai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776188#comment-16776188
 ] 

Bo Hai commented on SPARK-26932:


We discussed this issue on the dev mailing list before; see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Time-to-cut-an-Apache-2-4-1-release-tt26381.html#a26428

> Orc compatibility between hive and spark
> 
>
> Key: SPARK-26932
> URL: https://issues.apache.org/jira/browse/SPARK-26932
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Bo Hai
>Priority: Minor
>
> As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer 
> and reader. Older versions of Hive implement their own ORC reader, which is 
> not forward-compatible.
> So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, 
> which use apache/orc instead of the Hive ORC reader.
> I think we should add this information to the Spark 2.4 ORC documentation 
> page: https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html






[jira] [Comment Edited] (SPARK-26932) Orc compatibility between hive and spark

2019-02-24 Thread Bo Hai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776181#comment-16776181
 ] 

Bo Hai edited comment on SPARK-26932 at 2/24/19 9:25 AM:
-

To reproduce this issue, please create an ORC table with Spark 2.3.2/2.4 and 
read it with Hive 2.1.1, like:

spark-sql --conf 'spark.sql.orc.impl=native' -e 'CREATE TABLE tmp.orcTable2 
USING orc AS SELECT * FROM tmp.orcTable1 limit 10;'

hive -e 'select * from tmp.orcTable2;'

Hive will throw the exception shown below:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 6
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
at org.apache.orc.tools.FileDump.main(FileDump.java:154)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)


was (Author: haiboself):
To reproduce this issue, please create ORC table by Spark 2.3.2/2.4 and read by 
Hive 2.1.1 like :

spark-sql -e 'CREATE TABLE tmp.orcTable2 USING orc AS SELECT * FROM 
tmp.orcTable1 limit 10;'

hive -e 'select * from tmp.orcTable2;'

Hive will throw exception showing below:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 6
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
at org.apache.orc.tools.FileDump.main(FileDump.java:154)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

> Orc compatibility between hive and spark
> 
>
> Key: SPARK-26932
> URL: https://issues.apache.org/jira/browse/SPARK-26932
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Bo Hai
>Priority: Minor
>
> As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer 
> and reader. Older versions of Hive implement their own ORC reader, which is 
> not forward-compatible.
> So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, 
> which use apache/orc instead of the Hive ORC reader.
> I think we should add this information to the Spark 2.4 ORC documentation 
> page: https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html






[jira] [Comment Edited] (SPARK-26932) Orc compatibility between hive and spark

2019-02-24 Thread Bo Hai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776181#comment-16776181
 ] 

Bo Hai edited comment on SPARK-26932 at 2/24/19 9:22 AM:
-

To reproduce this issue, please create an ORC table with Spark 2.3.2/2.4 and 
read it with Hive 2.1.1, like:

spark-sql -e 'CREATE TABLE tmp.orcTable2 USING orc AS SELECT * FROM 
tmp.orcTable1 limit 10;'

hive -e 'select * from tmp.orcTable2;'

Hive will throw the exception shown below:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 6
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
at org.apache.orc.tools.FileDump.main(FileDump.java:154)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)


was (Author: haiboself):
To reproduce this issue, please create ORC table by Spark 2.4 and read by Hive 
2.1.1 like :

spark-sql -e 'CREATE TABLE tmp.orcTable2 USING orc AS SELECT * FROM 
tmp.orcTable1 limit 10;'

hive -e 'select * from tmp.orcTable2;'

Hive will throw exception showing below:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 6
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
at org.apache.orc.tools.FileDump.main(FileDump.java:154)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

> Orc compatibility between hive and spark
> 
>
> Key: SPARK-26932
> URL: https://issues.apache.org/jira/browse/SPARK-26932
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Bo Hai
>Priority: Minor
>
> As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer 
> and reader. Older versions of Hive implement their own ORC reader, which is 
> not forward-compatible.
> So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, 
> which use apache/orc instead of the Hive ORC reader.
> I think we should add this information to the Spark 2.4 ORC documentation 
> page: https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html






[jira] [Commented] (SPARK-26932) Orc compatibility between hive and spark

2019-02-24 Thread Bo Hai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776181#comment-16776181
 ] 

Bo Hai commented on SPARK-26932:


To reproduce this issue, please create an ORC table with Spark 2.4 and read it 
with Hive 2.1.1, like:

spark-sql -e 'CREATE TABLE tmp.orcTable2 USING orc AS SELECT * FROM 
tmp.orcTable1 limit 10;'

hive -e 'select * from tmp.orcTable2;'

Hive will throw the exception shown below:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 6
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:385)
at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
at org.apache.orc.tools.FileDump.main(FileDump.java:154)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

> Orc compatibility between hive and spark
> 
>
> Key: SPARK-26932
> URL: https://issues.apache.org/jira/browse/SPARK-26932
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Bo Hai
>Priority: Minor
>
> As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer 
> and reader. Older versions of Hive implement their own ORC reader, which is 
> not forward-compatible.
> So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, 
> which use apache/orc instead of the Hive ORC reader.
> I think we should add this information to the Spark 2.4 ORC documentation 
> page: https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html






[jira] [Updated] (SPARK-26932) Orc compatibility between hive and spark

2019-02-24 Thread Bo Hai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Hai updated SPARK-26932:
---
Description: 
As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer and 
reader. Older versions of Hive implement their own ORC reader, which is not 
forward-compatible.

So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, 
which use apache/orc instead of the Hive ORC reader.

I think we should add this information to the Spark 2.4 ORC documentation page: 
https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html
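
As an illustration of the kind of guidance that page could include, a hedged 
PySpark sketch of writing ORC with the Hive-based implementation so older Hive 
readers stay compatible (the output path is illustrative; the exact wording for 
the docs is up for review):

{code:python}
from pyspark.sql import SparkSession

# Use the Hive ORC implementation instead of the apache/orc ("native") one,
# which is the default in Spark 2.4, so the files should remain readable by
# Hive 2.2 and older.
spark = (SparkSession.builder
         .config("spark.sql.orc.impl", "hive")
         .getOrCreate())

spark.range(10).write.mode("overwrite").orc("/tmp/orc_hive_compat")
{code}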

  was:
Since Spark 2.3 and Hive 2.3, both supports using apache/orc as orc writer and 
reader. In older version of Hive, orc reader(isn't forward-compitaient) 
implemented by its own.

So Hive 2.2 and older can not read orc table created by spark 2.3 and newer 
which using apache/orc instead of Hive orc.

Spark2.4 orc configuration: 
https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html


> Orc compatibility between hive and spark
> 
>
> Key: SPARK-26932
> URL: https://issues.apache.org/jira/browse/SPARK-26932
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Bo Hai
>Priority: Minor
>
> As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer 
> and reader. Older versions of Hive implement their own ORC reader, which is 
> not forward-compatible.
> So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer, 
> which use apache/orc instead of the Hive ORC reader.
> I think we should add this information to the Spark 2.4 ORC documentation 
> page: https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html






[jira] [Commented] (SPARK-26907) Does ShuffledRDD Replication Work With External Shuffle Service

2019-02-24 Thread Han Altae-Tran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776150#comment-16776150
 ] 

Han Altae-Tran commented on SPARK-26907:


OK, thank you, I will try the mailing list.

I think my main point is that Spark could be improved for use with preemptible 
virtual machines if shuffle files can be replicated across the cluster. In my 
experience, whenever there is a shuffle map task, a single node being preempted 
can cause the entire stage to be retried, causing a huge loss of uptime as all 
tasks fail until the retry is initiated. Using persist with replication doesn't 
seem to help this issue, so I figured there is an optimization around shuffle 
files that could be made for this use case.
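
For context, a hedged PySpark sketch of what "persist with replication" refers 
to here; the *_2 storage levels replicate cached blocks, but (as far as I can 
tell) not the underlying shuffle files, which is why preemption still hurts:

{code:python}
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Cached blocks can be replicated (two copies per block with *_2 levels)...
grouped = (sc.parallelize(range(1000))
             .map(lambda x: (x % 10, x))
             .groupByKey()
             .persist(StorageLevel.MEMORY_AND_DISK_2))
grouped.count()

# ...but the map-side shuffle files behind groupByKey() still live only on the
# node that wrote them (served by its executor or external shuffle service),
# so losing that node can still trigger FetchFailedException and a stage retry.
{code}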

> Does ShuffledRDD Replication Work With External Shuffle Service
> ---
>
> Key: SPARK-26907
> URL: https://issues.apache.org/jira/browse/SPARK-26907
> Project: Spark
>  Issue Type: Question
>  Components: Block Manager, YARN
>Affects Versions: 2.3.2
>Reporter: Han Altae-Tran
>Priority: Major
>
> I am interested in working with high replication environments for extreme 
> fault tolerance (e.g. 10x replication), but have noticed that when using 
> groupBy or groupWith followed by persist (with 10x replication), even if one 
> node fails, the entire stage can fail with FetchFailedException.
>  
> Is this because the External Shuffle Service writes and services intermediate 
> shuffle data only to/from the local disk attached to the executor that 
> generated it, causing spark to ignore possible replicated shuffle data (from 
> the persist) that may be serviced elsewhere? If so, is there any way to 
> increase the replication factor of the External Shuffle Service to make it 
> fault tolerant?


