[jira] [Commented] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access

2018-10-18 Thread Greg Senia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656308#comment-16656308
 ] 

Greg Senia commented on SPARK-25778:


Could we add a snippet like this? I just tested it in my environment and this option 
lets me work around the issue. Could it be exposed as a -D system property or a Spark 
configuration flag?

  try {
    // The WriteAheadLogUtils.createLog*** method needs a directory to create a
    // WriteAheadLog object, as the default FileBasedWriteAheadLog needs a directory
    // for writing log data. However, the directory is not needed if data only has to
    // be read, so a dummy path is provided to satisfy the method parameter
    // requirements. FileBasedWriteAheadLog will not create any file or directory at
    // that path. Also, this dummy directory should not already exist, otherwise the
    // WAL will try to recover past events from the directory and throw errors.
    // Proposed change: let the base directory for the dummy path be overridden with
    // the "hdfs.writeahead.tmpdir" system property, falling back to java.io.tmpdir.
    // (Option(...) guards against the property being unset, which would otherwise NPE.)
    val tmpDir = Option(System.getProperty("hdfs.writeahead.tmpdir"))
      .filter(_.nonEmpty)
      .getOrElse(System.getProperty("java.io.tmpdir"))
    val nonExistentDirectory = new File(tmpDir, UUID.randomUUID().toString).getAbsolutePath
    writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
      SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
    dataRead = writeAheadLog.read(partition.walRecordHandle)
  } catch {
    case NonFatal(e) =>
      throw new SparkException(
        s"Could not read data from write ahead log record ${partition.walRecordHandle}", e)
  }
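
If a property like this were adopted, it would presumably have to reach the JVMs that 
actually read the WAL (driver and executors). A minimal usage sketch under that 
assumption, reusing the hdfs.writeahead.tmpdir name from the snippet above and a 
placeholder directory /app/spark/waltmp:

{code:scala}
// Hypothetical usage sketch, not an existing Spark option: pass the proposed
// hdfs.writeahead.tmpdir system property to the driver and executor JVMs.
// spark.driver.extraJavaOptions / spark.executor.extraJavaOptions are standard
// Spark settings; in YARN cluster mode they are normally supplied via --conf
// on spark-submit rather than set programmatically.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-Dhdfs.writeahead.tmpdir=/app/spark/waltmp")
  .set("spark.executor.extraJavaOptions", "-Dhdfs.writeahead.tmpdir=/app/spark/waltmp")
{code}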

> WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access
> -
>
> Key: SPARK-25778
> URL: https://issues.apache.org/jira/browse/SPARK-25778
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, YARN
>Affects Versions: 2.2.1, 2.2.2, 2.3.1, 2.3.2
>Reporter: Greg Senia
>Priority: Critical
>
> WriteAheadLogBackedBlockRDD in YARN cluster mode fails due to lack of access to an 
> HDFS path, because the dummy directory name it builds (from java.io.tmpdir, i.e. the 
> $PWD folder of the YARN container) coincides with a path that already exists in HDFS.
> While attempting to use Spark Streaming with WriteAheadLogs, I noticed the following 
> errors after the driver attempted to recover the already-read data that had been 
> written to HDFS in the checkpoint folder. After spending many hours on the error 
> below, which occurs because the parent folder /hadoop already exists in our HDFS 
> filesystem, I wonder if it would be possible to add a configuration option to choose 
> an alternate bogus directory that will never be used.
> hadoop fs -ls /
> drwx--   - dsadm dsadm   0 2017-06-20 13:20 /hadoop
> hadoop fs -ls /hadoop/apps
> drwx--   - dsadm dsadm  0 2017-06-20 13:20 /hadoop/apps
> streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
>   val nonExistentDirectory = new File(
>   System.getProperty("java.io.tmpdir"), 
> UUID.randomUUID().toString).getAbsolutePath
> writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
>   SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
> dataRead = writeAheadLog.read(partition.walRecordHandle)
> 18/10/19 00:03:03 DEBUG YarnSchedulerBackend$YarnDriverEndpoint: Launching 
> task 72 on executor id: 1 hostname: ha20t5002dn.tech.hdp.example.com.
> 18/10/19 00:03:03 DEBUG BlockManager: Getting local block broadcast_4_piece0 
> as bytes
> 18/10/19 00:03:03 DEBUG BlockManager: Level for block broadcast_4_piece0 is 
> StorageLevel(disk, memory, 1 replicas)
> 18/10/19 00:03:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory 
> on ha20t5002dn.tech.hdp.example.com:32768 (size: 33.7 KB, free: 912.2 MB)
> 18/10/19 00:03:03 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, 
> ha20t5002dn.tech.hdp.example.com, executor 1): 
> org.apache.spark.SparkException: Could not read data from write ahead log 
> record 
> FileBasedWriteAheadLogSegment(hdfs://tech/user/hdpdevspark/sparkstreaming/Spark_Streaming_MQ_IDMS/receivedData/0/log-1539921695606-1539921755606,0,1017)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at 
> 

[jira] [Updated] (SPARK-25775) Race between end-of-task and completion iterator read lock release

2018-10-18 Thread Ablimit A. Keskin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ablimit A. Keskin updated SPARK-25775:
--
Description: 
The following issue comes from a production Spark job where executors die due 
to uncaught exceptions during block release. When the task runs with a specific 
configuration for _executor-cores_ and _total-executor-cores_ (e.g. 4 & 8 or 1 
& 8),  it constantly fails at the same code segment.  Following are logs from 
our run:

 
{code:java}
18/10/18 23:06:18 INFO DAGScheduler: Submitting 1 missing tasks from 
ResultStage 27 (PythonRDD[94] at RDD at PythonRDD.scala:49) (first 15 tasks are 
for partitions Vector(0))
18/10/18 23:06:18 INFO TaskSchedulerImpl: Adding task set 27.0 with 1 tasks
18/10/18 23:06:18 INFO TaskSetManager: Starting task 0.0 in stage 27.0 (TID 
112, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
18/10/18 23:06:18 INFO BlockManagerInfo: Added broadcast_37_piece0 in memory on 
10.248.7.2:44871 (size: 9.1 KB, free: 13.0 GB)
18/10/18 23:06:24 WARN TaskSetManager: Lost task 0.0 in stage 27.0 (TID 112, 
10.248.7.2, executor 0): java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at 
org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

18/10/18 23:06:24 INFO TaskSetManager: Starting task 0.1 in stage 27.0 (TID 
113, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
18/10/18 23:06:31 INFO TaskSetManager: Lost task 0.1 in stage 27.0 (TID 113) on 
10.248.7.2, executor 0: java.lang.AssertionError (assertion failed) [duplicate 
1]
18/10/18 23:06:31 INFO TaskSetManager: Starting task 0.2 in stage 27.0 (TID 
114, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
18/10/18 23:06:31 ERROR TaskSchedulerImpl: Lost executor 0 on 10.248.7.2: 
Remote RPC client disassociated. Likely due to containers exceeding thresholds, 
or network issues. Check driver logs for WARN messages.
18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor updated: 
app-20181018230546-0040/0 is now EXITED (Command exited with code 50)
18/10/18 23:06:31 INFO StandaloneSchedulerBackend: Executor 
app-20181018230546-0040/0 removed: Command exited with code 50
18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor added: 
app-20181018230546-0040/2 on worker-20181018173742-10.248.110.2-40787 
(10.248.110.2:40787) with 4 cores
18/10/18 23:06:31 INFO StandaloneSchedulerBackend: Granted executor ID 
app-20181018230546-0040/2 on hostPort 10.248.110.2:40787 with 4 cores, 25.0 GB 
RAM
18/10/18 23:06:31 WARN TaskSetManager: Lost task 0.2 in stage 27.0 (TID 114, 
10.248.7.2, executor 0): ExecutorLostFailure (executor 0 exited caused by one 
of the running tasks) Reason: Remote RPC client disassociated. Likely due to 
containers exceeding thresholds, or network issues. Check driver logs for WARN 
messages.
18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor updated: 
app-20181018230546-0040/2 is now RUNNING
18/10/18 23:06:31 INFO DAGScheduler: Executor lost: 0 (epoch 11)
18/10/18 23:06:31 INFO BlockManagerMaster: Removal of executor 0 requested
18/10/18 23:06:31 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 
from BlockManagerMaster.
18/10/18 23:06:31 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to 
remove non-existent executor 0
18/10/18 23:06:31 INFO TaskSetManager: Starting task 0.3 in stage 27.0 (TID 
115, 10.248.21.2, executor 1, partition 0, ANY, 5585 bytes)
18/10/18 23:06:31 WARN BlockManagerMasterEndpoint: No more replicas available 
for rdd_27_2 !
18/10/18 23:06:31 WARN BlockManagerMasterEndpoint: No more replicas available 
for rdd_44_7 !
18/10/18 23:06:31 WARN 

[jira] [Created] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due to lack of access

2018-10-18 Thread Greg Senia (JIRA)
Greg Senia created SPARK-25778:
--

 Summary: WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails 
due to lack of access
 Key: SPARK-25778
 URL: https://issues.apache.org/jira/browse/SPARK-25778
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming, YARN
Affects Versions: 2.3.2, 2.3.1, 2.2.2, 2.2.1
Reporter: Greg Senia


WriteAheadLogBackedBlockRDD in YARN cluster mode fails due to lack of access to an 
HDFS path, because the dummy directory name it builds (from java.io.tmpdir, i.e. the 
$PWD folder of the YARN container) coincides with a path that already exists in HDFS.

While attempting to use Spark Streaming with WriteAheadLogs, I noticed the following 
errors after the driver attempted to recover the already-read data that had been 
written to HDFS in the checkpoint folder. After spending many hours on the error 
below, which occurs because the parent folder /hadoop already exists in our HDFS 
filesystem, I wonder if it would be possible to add a configuration option to choose 
an alternate bogus directory that will never be used.

hadoop fs -ls /
drwx--   - dsadm dsadm   0 2017-06-20 13:20 /hadoop
hadoop fs -ls /hadoop/apps
drwx--   - dsadm dsadm  0 2017-06-20 13:20 /hadoop/apps


streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
  val nonExistentDirectory = new File(
  System.getProperty("java.io.tmpdir"), 
UUID.randomUUID().toString).getAbsolutePath
writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
  SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
dataRead = writeAheadLog.read(partition.walRecordHandle)
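
To make the failure mode concrete, here is my reading of why a directory derived from 
java.io.tmpdir ends up being permission-checked against HDFS at all. This is an 
assumption based on the stack trace below, not a quote from the Spark source:

{code:scala}
// Hedged illustration of the suspected path resolution: the dummy directory
// string has no scheme, so Hadoop resolves it against fs.defaultFS (HDFS on
// this cluster), and any metadata call then needs EXECUTE permission on the
// existing /hadoop parent, which is owned by another user.
import java.io.File
import java.util.UUID
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()   // picks up fs.defaultFS from the cluster config
val dummy = new File(System.getProperty("java.io.tmpdir"), UUID.randomUUID().toString)
val path = new Path(dummy.getAbsolutePath)   // scheme-less path string
val fs = path.getFileSystem(hadoopConf)      // resolves to the default FS, not the local FS
// fs.exists(path) would then hit the HDFS permission checks seen in the trace.
{code}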

18/10/19 00:03:03 DEBUG YarnSchedulerBackend$YarnDriverEndpoint: Launching task 
72 on executor id: 1 hostname: ha20t5002dn.tech.hdp.example.com.
18/10/19 00:03:03 DEBUG BlockManager: Getting local block broadcast_4_piece0 as 
bytes
18/10/19 00:03:03 DEBUG BlockManager: Level for block broadcast_4_piece0 is 
StorageLevel(disk, memory, 1 replicas)
18/10/19 00:03:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 
ha20t5002dn.tech.hdp.example.com:32768 (size: 33.7 KB, free: 912.2 MB)
18/10/19 00:03:03 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, 
ha20t5002dn.tech.hdp.example.com, executor 1): org.apache.spark.SparkException: 
Could not read data from write ahead log record 
FileBasedWriteAheadLogSegment(hdfs://tech/user/hdpdevspark/sparkstreaming/Spark_Streaming_MQ_IDMS/receivedData/0/log-1539921695606-1539921755606,0,1017)
at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.security.AccessControlException: Permission 
denied: user=hdpdevspark, access=EXECUTE, 
inode="/hadoop/diskc/hadoop/yarn/local/usercache/hdpdevspark/appcache/application_1539554105597_0338/container_e322_1539554105597_0338_01_02/tmp/170f36b8-9202-4556-89a4-64587c7136b6":dsadm:dsadm:drwx--
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
at 
org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:307)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1827)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:108)
at 

[jira] [Commented] (SPARK-25777) Optimize DetermineTableStats if LogicalRelation already cached

2018-10-18 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656228#comment-16656228
 ] 

Yuming Wang commented on SPARK-25777:
-

I'm working on it.

This fix needs to cover at least these tests:
 org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from a 
hive table with a new column - orc
 org.apache.spark.sql.hive.execution.SQLQuerySuite.SPARK-18355 Read data from a 
hive table with a new column - parquet

> Optimize DetermineTableStats if LogicalRelation already cached 
> ---
>
> Key: SPARK-25777
> URL: https://issues.apache.org/jira/browse/SPARK-25777
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We can avoid computing stats if the {{LogicalRelation}} is already cached, because 
> the computed stats will not take effect.






[jira] [Created] (SPARK-25777) Optimize DetermineTableStats if LogicalRelation already cached

2018-10-18 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-25777:
---

 Summary: Optimize DetermineTableStats if LogicalRelation already 
cached 
 Key: SPARK-25777
 URL: https://issues.apache.org/jira/browse/SPARK-25777
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


We can avoid computing stats if the {{LogicalRelation}} is already cached, because the 
computed stats will not take effect.
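
As a rough sketch of the idea (under my own assumptions, not the actual patch), the 
rule could skip the size computation whenever the relation is already in the plan 
cache; {{lookupCachedData}} below stands in for Spark's CacheManager lookup:

{code:scala}
// Rough sketch of the proposed guard, not the real DetermineTableStats change.
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

def shouldComputeFallbackStats(
    relation: LogicalPlan,
    lookupCachedData: LogicalPlan => Option[_]): Boolean = {
  // When the relation is cached, its statistics come from the cached plan,
  // so recomputing a size estimate would not take effect.
  lookupCachedData(relation).isEmpty
}
{code}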






[jira] [Resolved] (SPARK-25493) CRLF Line Separators don't work in multiline CSVs

2018-10-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25493.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22503
[https://github.com/apache/spark/pull/22503]

> CRLF Line Separators don't work in multiline CSVs
> -
>
> Key: SPARK-25493
> URL: https://issues.apache.org/jira/browse/SPARK-25493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Justin Uang
>Assignee: Justin Uang
>Priority: Major
> Fix For: 3.0.0
>
>
> CSVs with Windows-style CRLF (carriage return line feed) line endings don't work in 
> multiline mode. They work fine in single line mode because the line 
> separation is done by Hadoop, which can handle all the different types of 
> line separators. In multiline mode, the Univocity parser is used to also 
> handle splitting of records.
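
For reference, a minimal way to exercise the multiline path described above; this is 
an illustrative sketch only, and the file path and header option are placeholders:

{code:scala}
// Illustrative sketch: read a CRLF-terminated CSV with multiLine enabled, which
// hands record splitting to the Univocity parser as described above.
// `spark` is an existing SparkSession.
val df = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .csv("/path/to/crlf_file.csv")
df.show()
{code}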






[jira] [Assigned] (SPARK-25493) CRLF Line Separators don't work in multiline CSVs

2018-10-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25493:


Assignee: Justin Uang

> CRLF Line Separators don't work in multiline CSVs
> -
>
> Key: SPARK-25493
> URL: https://issues.apache.org/jira/browse/SPARK-25493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Justin Uang
>Assignee: Justin Uang
>Priority: Major
> Fix For: 3.0.0
>
>
> CSVs with Windows-style CRLF (carriage return line feed) line endings don't work in 
> multiline mode. They work fine in single line mode because the line 
> separation is done by Hadoop, which can handle all the different types of 
> line separators. In multiline mode, the Univocity parser is used to also 
> handle splitting of records.






[jira] [Commented] (SPARK-25776) The disk write buffer size must be greater than 12.

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656180#comment-16656180
 ] 

Apache Spark commented on SPARK-25776:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/22754

> The disk write buffer size must be greater than 12.
> ---
>
> Key: SPARK-25776
> URL: https://issues.apache.org/jira/browse/SPARK-25776
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a 
> record to a spill file with {{ {color:#205081}void write(Object baseObject, 
> long baseOffset, int recordLength, long keyPrefix{color})}}, 
> {color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} 
> {color}will be written to the disk write buffer first, and these will take 12 
> bytes, so the disk write buffer size must be greater than 12.
> If {color:#205081}{{diskWriteBufferSize}} {color}is 10, it will print this 
> exception info:
> _java.lang.ArrayIndexOutOfBoundsException: 10_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
>  (UnsafeSorterSpillWriter.java:91)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
>  _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_






[jira] [Assigned] (SPARK-25776) The disk write buffer size must be greater than 12.

2018-10-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25776:


Assignee: (was: Apache Spark)

> The disk write buffer size must be greater than 12.
> ---
>
> Key: SPARK-25776
> URL: https://issues.apache.org/jira/browse/SPARK-25776
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a 
> record to a spill file with {{ {color:#205081}void write(Object baseObject, 
> long baseOffset, int recordLength, long keyPrefix{color})}}, 
> {color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} 
> {color}will be written to the disk write buffer first, and these will take 12 
> bytes, so the disk write buffer size must be greater than 12.
> If {color:#205081}{{diskWriteBufferSize}} {color}is 10, it will print this 
> exception info:
> _java.lang.ArrayIndexOutOfBoundsException: 10_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
>  (UnsafeSorterSpillWriter.java:91)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
>  _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_






[jira] [Commented] (SPARK-25776) The disk write buffer size must be greater than 12.

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656177#comment-16656177
 ] 

Apache Spark commented on SPARK-25776:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/22754

> The disk write buffer size must be greater than 12.
> ---
>
> Key: SPARK-25776
> URL: https://issues.apache.org/jira/browse/SPARK-25776
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a 
> record to a spill file with {{ {color:#205081}void write(Object baseObject, 
> long baseOffset, int recordLength, long keyPrefix{color})}}, 
> {color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} 
> {color}will be written to the disk write buffer first, and these will take 12 
> bytes, so the disk write buffer size must be greater than 12.
> If {color:#205081}{{diskWriteBufferSize}} {color}is 10, it will print this 
> exception info:
> _java.lang.ArrayIndexOutOfBoundsException: 10_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
>  (UnsafeSorterSpillWriter.java:91)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
>  _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_






[jira] [Assigned] (SPARK-25776) The disk write buffer size must be greater than 12.

2018-10-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25776:


Assignee: Apache Spark

> The disk write buffer size must be greater than 12.
> ---
>
> Key: SPARK-25776
> URL: https://issues.apache.org/jira/browse/SPARK-25776
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: Apache Spark
>Priority: Minor
>
> In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a 
> record to a spill file with {{ {color:#205081}void write(Object baseObject, 
> long baseOffset, int recordLength, long keyPrefix{color})}}, 
> {color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} 
> {color}will be written to the disk write buffer first, and these will take 12 
> bytes, so the disk write buffer size must be greater than 12.
> If {color:#205081}{{diskWriteBufferSize}} {color}is 10, it will print this 
> exception info:
> _java.lang.ArrayIndexOutOfBoundsException: 10_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
>  (UnsafeSorterSpillWriter.java:91)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
>  _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_






[jira] [Updated] (SPARK-25776) The disk write buffer size must be greater than 12.

2018-10-18 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25776:

Description: 
In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a 
record to a spill file with {{ {color:#205081}void write(Object baseObject, 
long baseOffset, int recordLength, long keyPrefix{color})}}, 
{color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} 
{color}will be written to the disk write buffer first, and these will take 12 
bytes, so the disk write buffer size must be greater than 12.

If {{diskWriteBufferSize}} is 10, it will print this exception info:

_java.lang.ArrayIndexOutOfBoundsException: 10_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
 (UnsafeSorterSpillWriter.java:91)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
 _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_

  was:
In {{UnsafeSorterSpillWriter.java}}, when we write a record to a spill file 
with {{ void write(Object baseObject, long baseOffset, int recordLength, long 
keyPrefix)}}, {{recordLength}} and {{keyPrefix}} will be written to the disk write 
buffer first, and these will take 12 bytes, so the disk write buffer size must 
be greater than 12.

If {{diskWriteBufferSize}} is 10, it will print this exception info:

_java.lang.ArrayIndexOutOfBoundsException: 10_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
 (UnsafeSorterSpillWriter.java:91)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
 _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_


> The disk write buffer size must be greater than 12.
> ---
>
> Key: SPARK-25776
> URL: https://issues.apache.org/jira/browse/SPARK-25776
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a 
> record to a spill file with {{ {color:#205081}void write(Object baseObject, 
> long baseOffset, int recordLength, long keyPrefix{color})}}, 
> {color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} 
> {color}will be written to the disk write buffer first, and these will take 12 
> bytes, so the disk write buffer size must be greater than 12.
> If {{diskWriteBufferSize}} is 10, it will print this exception info:
> _java.lang.ArrayIndexOutOfBoundsException: 10_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
>  (UnsafeSorterSpillWriter.java:91)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
>  _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_






[jira] [Updated] (SPARK-25776) The disk write buffer size must be greater than 12.

2018-10-18 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25776:

Description: 
In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a 
record to a spill file with {{ {color:#205081}void write(Object baseObject, 
long baseOffset, int recordLength, long keyPrefix{color})}}, 
{color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} 
{color}will be written to the disk write buffer first, and these will take 12 
bytes, so the disk write buffer size must be greater than 12.

If {color:#205081}{{diskWriteBufferSize}} {color}is 10, it will print this 
exception info:

_java.lang.ArrayIndexOutOfBoundsException: 10_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
 (UnsafeSorterSpillWriter.java:91)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
 _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_

  was:
In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a 
record to a spill file with {{ {color:#205081}void write(Object baseObject, 
long baseOffset, int recordLength, long keyPrefix{color})}}, 
{color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} 
{color}will be written to the disk write buffer first, and these will take 12 
bytes, so the disk write buffer size must be greater than 12.

If {{diskWriteBufferSize}} is 10, it will print this exception info:

_java.lang.ArrayIndexOutOfBoundsException: 10_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
 (UnsafeSorterSpillWriter.java:91)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
 _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_


> The disk write buffer size must be greater than 12.
> ---
>
> Key: SPARK-25776
> URL: https://issues.apache.org/jira/browse/SPARK-25776
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> In {color:#205081}{{UnsafeSorterSpillWriter.java}}{color}, when we write a 
> record to a spill file with {{ {color:#205081}void write(Object baseObject, 
> long baseOffset, int recordLength, long keyPrefix{color})}}, 
> {color:#205081}{{recordLength}} {color}and {color:#205081}{{keyPrefix}} 
> {color}will be written to the disk write buffer first, and these will take 12 
> bytes, so the disk write buffer size must be greater than 12.
> If {color:#205081}{{diskWriteBufferSize}} {color}is 10, it will print this 
> exception info:
> _java.lang.ArrayIndexOutOfBoundsException: 10_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
>  (UnsafeSorterSpillWriter.java:91)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
>  _at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
>  _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_






[jira] [Created] (SPARK-25776) The disk write buffer size must be greater than 12.

2018-10-18 Thread liuxian (JIRA)
liuxian created SPARK-25776:
---

 Summary: The disk write buffer size must be greater than 12.
 Key: SPARK-25776
 URL: https://issues.apache.org/jira/browse/SPARK-25776
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: liuxian


In {{UnsafeSorterSpillWriter.java}}, when we write a record to a spill file 
with {{ void write(Object baseObject, long baseOffset, int recordLength, long 
keyPrefix)}}, {{recordLength}} and {{keyPrefix}} will be written to the disk write 
buffer first, and these will take 12 bytes, so the disk write buffer size must 
be greater than 12.

If {{diskWriteBufferSize}} is 10, it will print this exception info:

_java.lang.ArrayIndexOutOfBoundsException: 10_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.writeLongToBuffer
 (UnsafeSorterSpillWriter.java:91)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:123)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spillIterator(UnsafeExternalSorter.java:498)_
 _at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:222)_
 _at org.apache.spark.memory.MemoryConsumer.spill(MemoryConsumer.java:65)_
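
The 12-byte figure is simply the record header: a 4-byte int for {{recordLength}} plus 
an 8-byte long for {{keyPrefix}}. A small illustration of the arithmetic (not Spark's 
spill writer itself, which writes into a raw byte[] and fails with 
ArrayIndexOutOfBoundsException; a ByteBuffer fails with BufferOverflowException, but 
the sizes are the same):

{code:scala}
// Illustration of the header-size arithmetic only, not UnsafeSorterSpillWriter code.
import java.nio.ByteBuffer

val diskWriteBufferSize = 10                   // the failing value from the report
val buffer = ByteBuffer.allocate(diskWriteBufferSize)
buffer.putInt(42)                              // recordLength header: 4 bytes
buffer.putLong(7L)                             // keyPrefix header: 8 more bytes -> needs 12, overflows
{code}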






[jira] [Resolved] (SPARK-25764) Avoid usage of deprecated methods in examples for BisectingKMeans

2018-10-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25764.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22763
[https://github.com/apache/spark/pull/22763]

> Avoid usage of deprecated methods in examples for BisectingKMeans
> -
>
> Key: SPARK-25764
> URL: https://issues.apache.org/jira/browse/SPARK-25764
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> SPARK-25758 is deprecating {{computeCost}} for {{BisectingKMeans}} and the 
> method is going to be removed in 3.0. So we should not use this method in our 
> examples.
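
If I recall correctly, the deprecation notice points users at ClusteringEvaluator, so a 
hedged sketch of where the examples are headed looks like this ({{predictions}} is 
assumed to be the fitted model's transform() output):

{code:scala}
// Hedged sketch: evaluate clustering quality with ClusteringEvaluator (silhouette
// by default) instead of the deprecated BisectingKMeansModel.computeCost.
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
{code}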






[jira] [Assigned] (SPARK-25764) Avoid usage of deprecated methods in examples for BisectingKMeans

2018-10-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25764:
---

Assignee: Marco Gaido

> Avoid usage of deprecated methods in examples for BisectingKMeans
> -
>
> Key: SPARK-25764
> URL: https://issues.apache.org/jira/browse/SPARK-25764
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> SPARK-25758 is deprecating {{computeCost}} for {{BisectingKMeans}} and the 
> method is going to be removed in 3.0. So we should not use this method in our 
> examples.






[jira] [Updated] (SPARK-25740) Refactor DetermineTableStats to invalidate cache when some configuration changed

2018-10-18 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25740:

Description: 
How to reproduce:
{code:sql}
# spark-sql
create table t1 (a int) stored as parquet;
create table t2 (a int) stored as parquet;
insert into table t1 values (1);
insert into table t2 values (1);
-- clear cache
REFRESH TABLE t1;
REFRESH TABLE t2;

explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin, it should be BroadcastHashJoin

-- clear cache
REFRESH TABLE t1;
REFRESH TABLE t2;
explain select * from t1, t2 where t1.a = t2.a;
-- BroadcastHashJoin
{code}

  was:
How to reproduce:
{code:sql}
# spark-sql
create table t1 (a int) stored as parquet;
create table t2 (a int) stored as parquet;
insert into table t1 values (1);
insert into table t2 values (1);
exit;

spark-sql
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- BroadcastHashJoin
exit;

spark-sql
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin, it should be BroadcastHashJoin
exit;
{code}
We need {{LogicalPlanStats.invalidateStatsCache}}, but it seems the only thing we can 
do is call invalidateAllCachedTables when executing a SET command:
{code:java}
val isInvalidateAllCachedTablesKeys = Set(
  SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key,
  SQLConf.DEFAULT_SIZE_IN_BYTES.key
)
sparkSession.conf.set(key, value)
if (isInvalidateAllCachedTablesKeys.contains(key)) {
  sparkSession.sessionState.catalog.invalidateAllCachedTables()
}
{code}


> Refactor DetermineTableStats to invalidate cache when some configuration 
> changed
> 
>
> Key: SPARK-25740
> URL: https://issues.apache.org/jira/browse/SPARK-25740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> # spark-sql
> create table t1 (a int) stored as parquet;
> create table t2 (a int) stored as parquet;
> insert into table t1 values (1);
> insert into table t2 values (1);
> -- clear cache
> REFRESH TABLE t1;
> REFRESH TABLE t2;
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin
> set spark.sql.statistics.fallBackToHdfs=true;
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin, it should be BroadcastHashJoin
> -- clear cache
> REFRESH TABLE t1;
> REFRESH TABLE t2;
> explain select * from t1, t2 where t1.a = t2.a;
> -- BroadcastHashJoin
> {code}






[jira] [Resolved] (SPARK-25761) When Spark SQL executes a SQL statement, the statement completes successfully, but the Spark UI still shows it as unfinished and in the running state.

2018-10-18 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-25761.
-
Resolution: Invalid

> When Spark SQL executes a SQL statement, the statement completes successfully, but the Spark UI still shows it as unfinished and in the running state.
> ---
>
> Key: SPARK-25761
> URL: https://issues.apache.org/jira/browse/SPARK-25761
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: hanrentong
>Priority: Major
>
> I ran a SQL statement through Spark SQL; the statement has already finished, but the Spark UI still shows it as running, and it cannot be killed either.






[jira] [Commented] (SPARK-24499) Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656065#comment-16656065
 ] 

Apache Spark commented on SPARK-24499:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22772

> Split the page of sql-programming-guide.html to multiple separate pages
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.4.0
>
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html






[jira] [Commented] (SPARK-24499) Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656062#comment-16656062
 ] 

Apache Spark commented on SPARK-24499:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22772

> Split the page of sql-programming-guide.html to multiple separate pages
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.4.0
>
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html






[jira] [Updated] (SPARK-25775) Race between end-of-task and completion iterator read lock release

2018-10-18 Thread Ablimit A. Keskin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ablimit A. Keskin updated SPARK-25775:
--
Affects Version/s: (was: 2.2.0)
   2.2.2

> Race between end-of-task and completion iterator read lock release
> --
>
> Key: SPARK-25775
> URL: https://issues.apache.org/jira/browse/SPARK-25775
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.2.2
>Reporter: Ablimit A. Keskin
>Priority: Major
>
> The following issue comes from a production Spark job where executors die due 
> to uncaught exceptions during block release. When the task runs with a 
> specific configuration for _executor-cores_ and _total-executor-cores_ 
> (e.g. 4 & 8 or 1 & 8),  it constantly fails at the same code segment.  
> Following are logs from our run:
>  
> {code:java}
> 18/10/18 23:06:18 INFO DAGScheduler: Submitting 1 missing tasks from 
> ResultStage 27 (PythonRDD[94] at RDD at PythonRDD.scala:49) (first 15 tasks 
> are for partitions Vector(0))
> 18/10/18 23:06:18 INFO TaskSchedulerImpl: Adding task set 27.0 with 1 tasks
> 18/10/18 23:06:18 INFO TaskSetManager: Starting task 0.0 in stage 27.0 (TID 
> 112, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
> 18/10/18 23:06:18 INFO BlockManagerInfo: Added broadcast_37_piece0 in memory 
> on 10.248.7.2:44871 (size: 9.1 KB, free: 13.0 GB)
> 18/10/18 23:06:24 WARN TaskSetManager: Lost task 0.0 in stage 27.0 (TID 112, 
> 10.248.7.2, executor 0): java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:156)
> at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
> at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
> at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
> at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
> at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
> at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 18/10/18 23:06:24 INFO TaskSetManager: Starting task 0.1 in stage 27.0 (TID 
> 113, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
> 18/10/18 23:06:31 INFO TaskSetManager: Lost task 0.1 in stage 27.0 (TID 113) 
> on 10.248.7.2, executor 0: java.lang.AssertionError (assertion failed) 
> [duplicate 1]
> 18/10/18 23:06:31 INFO TaskSetManager: Starting task 0.2 in stage 27.0 (TID 
> 114, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
> 18/10/18 23:06:31 ERROR TaskSchedulerImpl: Lost executor 0 on 10.248.7.2: 
> Remote RPC client disassociated. Likely due to containers exceeding 
> thresholds, or network issues. Check driver logs for WARN messages.
> 18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor updated: 
> app-20181018230546-0040/0 is now EXITED (Command exited with code 50)
> 18/10/18 23:06:31 INFO StandaloneSchedulerBackend: Executor 
> app-20181018230546-0040/0 removed: Command exited with code 50
> 18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor added: 
> app-20181018230546-0040/2 on worker-20181018173742-10.248.110.2-40787 
> (10.248.110.2:40787) with 4 cores
> 18/10/18 23:06:31 INFO StandaloneSchedulerBackend: Granted executor ID 
> app-20181018230546-0040/2 on hostPort 10.248.110.2:40787 with 4 cores, 25.0 
> GB RAM
> 18/10/18 23:06:31 WARN TaskSetManager: Lost task 0.2 in stage 27.0 (TID 114, 
> 10.248.7.2, executor 0): ExecutorLostFailure (executor 0 exited caused by one 
> of the running tasks) Reason: Remote RPC client disassociated. Likely due to 
> containers exceeding thresholds, or network issues. Check driver logs for 
> WARN messages.
> 18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor updated: 
> app-20181018230546-0040/2 is now RUNNING
> 18/10/18 23:06:31 INFO DAGScheduler: Executor lost: 

[jira] [Updated] (SPARK-25775) Race between end-of-task and completion iterator read lock release

2018-10-18 Thread Ablimit A. Keskin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ablimit A. Keskin updated SPARK-25775:
--
Description: 
The following issue comes from a production Spark job where executors die due 
to uncaught exceptions during block release. When the task runs with a specific 
configuration for _executor-cores_ and _total-executor-cores_ (e.g. 4 & 8 or 
1 & 8),  it constantly fails at the same code segment.  Following are logs from 
our run:

 
{code:java}
18/10/18 23:06:18 INFO DAGScheduler: Submitting 1 missing tasks from 
ResultStage 27 (PythonRDD[94] at RDD at PythonRDD.scala:49) (first 15 tasks are 
for partitions Vector(0))
18/10/18 23:06:18 INFO TaskSchedulerImpl: Adding task set 27.0 with 1 tasks
18/10/18 23:06:18 INFO TaskSetManager: Starting task 0.0 in stage 27.0 (TID 
112, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
18/10/18 23:06:18 INFO BlockManagerInfo: Added broadcast_37_piece0 in memory on 
10.248.7.2:44871 (size: 9.1 KB, free: 13.0 GB)
18/10/18 23:06:24 WARN TaskSetManager: Lost task 0.0 in stage 27.0 (TID 112, 
10.248.7.2, executor 0): java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at 
org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

18/10/18 23:06:24 INFO TaskSetManager: Starting task 0.1 in stage 27.0 (TID 
113, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
18/10/18 23:06:31 INFO TaskSetManager: Lost task 0.1 in stage 27.0 (TID 113) on 
10.248.7.2, executor 0: java.lang.AssertionError (assertion failed) [duplicate 
1]
18/10/18 23:06:31 INFO TaskSetManager: Starting task 0.2 in stage 27.0 (TID 
114, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
18/10/18 23:06:31 ERROR TaskSchedulerImpl: Lost executor 0 on 10.248.7.2: 
Remote RPC client disassociated. Likely due to containers exceeding thresholds, 
or network issues. Check driver logs for WARN messages.
18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor updated: 
app-20181018230546-0040/0 is now EXITED (Command exited with code 50)
18/10/18 23:06:31 INFO StandaloneSchedulerBackend: Executor 
app-20181018230546-0040/0 removed: Command exited with code 50
18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor added: 
app-20181018230546-0040/2 on worker-20181018173742-10.248.110.2-40787 
(10.248.110.2:40787) with 4 cores
18/10/18 23:06:31 INFO StandaloneSchedulerBackend: Granted executor ID 
app-20181018230546-0040/2 on hostPort 10.248.110.2:40787 with 4 cores, 25.0 GB 
RAM
18/10/18 23:06:31 WARN TaskSetManager: Lost task 0.2 in stage 27.0 (TID 114, 
10.248.7.2, executor 0): ExecutorLostFailure (executor 0 exited caused by one 
of the running tasks) Reason: Remote RPC client disassociated. Likely due to 
containers exceeding thresholds, or network issues. Check driver logs for WARN 
messages.
18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor updated: 
app-20181018230546-0040/2 is now RUNNING
18/10/18 23:06:31 INFO DAGScheduler: Executor lost: 0 (epoch 11)
18/10/18 23:06:31 INFO BlockManagerMaster: Removal of executor 0 requested
18/10/18 23:06:31 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 
from BlockManagerMaster.
18/10/18 23:06:31 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to 
remove non-existent executor 0
18/10/18 23:06:31 INFO TaskSetManager: Starting task 0.3 in stage 27.0 (TID 
115, 10.248.21.2, executor 1, partition 0, ANY, 5585 bytes)
18/10/18 23:06:31 WARN BlockManagerMasterEndpoint: No more replicas available 
for rdd_27_2 !
18/10/18 23:06:31 WARN BlockManagerMasterEndpoint: No more replicas available 
for rdd_44_7 !
18/10/18 23:06:31 

[jira] [Created] (SPARK-25775) Race between end-of-task and completion iterator read lock release

2018-10-18 Thread Ablimit A. Keskin (JIRA)
Ablimit A. Keskin created SPARK-25775:
-

 Summary: Race between end-of-task and completion iterator read 
lock release
 Key: SPARK-25775
 URL: https://issues.apache.org/jira/browse/SPARK-25775
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 2.2.0
Reporter: Ablimit A. Keskin


The following issue comes from a production Spark job where executors die due 
to uncaught exceptions during block release. When the task runs with a specific 
configuration for "--executor-cores" and "--total-executor-cores" (e.g. 4 & 8 
or 1 & 8),  it constantly fails at the same code segment.  Following are logs 
from our run:

 
{code:java}
18/10/18 23:06:18 INFO DAGScheduler: Submitting 1 missing tasks from 
ResultStage 27 (PythonRDD[94] at RDD at PythonRDD.scala:49) (first 15 tasks are 
for partitions Vector(0))
18/10/18 23:06:18 INFO TaskSchedulerImpl: Adding task set 27.0 with 1 tasks
18/10/18 23:06:18 INFO TaskSetManager: Starting task 0.0 in stage 27.0 (TID 
112, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
18/10/18 23:06:18 INFO BlockManagerInfo: Added broadcast_37_piece0 in memory on 
10.248.7.2:44871 (size: 9.1 KB, free: 13.0 GB)
18/10/18 23:06:24 WARN TaskSetManager: Lost task 0.0 in stage 27.0 (TID 112, 
10.248.7.2, executor 0): java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
at 
org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
at 
org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

18/10/18 23:06:24 INFO TaskSetManager: Starting task 0.1 in stage 27.0 (TID 
113, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
18/10/18 23:06:31 INFO TaskSetManager: Lost task 0.1 in stage 27.0 (TID 113) on 
10.248.7.2, executor 0: java.lang.AssertionError (assertion failed) [duplicate 
1]
18/10/18 23:06:31 INFO TaskSetManager: Starting task 0.2 in stage 27.0 (TID 
114, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
18/10/18 23:06:31 ERROR TaskSchedulerImpl: Lost executor 0 on 10.248.7.2: 
Remote RPC client disassociated. Likely due to containers exceeding thresholds, 
or network issues. Check driver logs for WARN messages.
18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor updated: 
app-20181018230546-0040/0 is now EXITED (Command exited with code 50)
18/10/18 23:06:31 INFO StandaloneSchedulerBackend: Executor 
app-20181018230546-0040/0 removed: Command exited with code 50
18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor added: 
app-20181018230546-0040/2 on worker-20181018173742-10.248.110.2-40787 
(10.248.110.2:40787) with 4 cores
18/10/18 23:06:31 INFO StandaloneSchedulerBackend: Granted executor ID 
app-20181018230546-0040/2 on hostPort 10.248.110.2:40787 with 4 cores, 25.0 GB 
RAM
18/10/18 23:06:31 WARN TaskSetManager: Lost task 0.2 in stage 27.0 (TID 114, 
10.248.7.2, executor 0): ExecutorLostFailure (executor 0 exited caused by one 
of the running tasks) Reason: Remote RPC client disassociated. Likely due to 
containers exceeding thresholds, or network issues. Check driver logs for WARN 
messages.
18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor updated: 
app-20181018230546-0040/2 is now RUNNING
18/10/18 23:06:31 INFO DAGScheduler: Executor lost: 0 (epoch 11)
18/10/18 23:06:31 INFO BlockManagerMaster: Removal of executor 0 requested
18/10/18 23:06:31 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 
from BlockManagerMaster.
18/10/18 23:06:31 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to 
remove non-existent executor 0
18/10/18 23:06:31 INFO TaskSetManager: Starting task 0.3 in stage 27.0 (TID 
115, 10.248.21.2, executor 1, partition 0, 

[jira] [Updated] (SPARK-21402) Fix java array of structs deserialization

2018-10-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21402:
--
Fix Version/s: 2.4.0
   2.3.3
   2.2.3

> Fix java array of structs deserialization
> -
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Assignee: Vladimir Kuriatkov
>Priority: Major
> Fix For: 2.2.3, 2.3.3, 2.4.0
>
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21402) Fix java array of structs deserialization

2018-10-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-21402.
---
  Resolution: Fixed
Target Version/s: 2.2.3, 2.3.3, 2.4.0

> Fix java array of structs deserialization
> -
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Assignee: Vladimir Kuriatkov
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21402) Fix java array of structs deserialization

2018-10-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21402:
--
Target Version/s:   (was: 2.2.3, 2.3.3, 2.4.0)

> Fix java array of structs deserialization
> -
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Assignee: Vladimir Kuriatkov
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21402) Fix java array of structs deserialization

2018-10-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-21402:
-

Assignee: Vladimir Kuriatkov

> Fix java array of structs deserialization
> -
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Assignee: Vladimir Kuriatkov
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25683) Updated the log for the firstTime event Drop occurs.

2018-10-18 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25683:
--

Assignee: Shivu Sondur

> Updated the log for the firstTime event Drop occurs.
> 
>
> Key: SPARK-25683
> URL: https://issues.apache.org/jira/browse/SPARK-25683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Devaraj K
>Assignee: Shivu Sondur
>Priority: Trivial
> Fix For: 3.0.0
>
>
> {code:xml}
> 18/10/08 17:51:40 ERROR AsyncEventQueue: Dropping event from queue eventLog. 
> This likely means one of the listeners is too slow and cannot keep up with 
> the rate at which tasks are being started by the scheduler.
> 18/10/08 17:51:40 WARN AsyncEventQueue: Dropped 1 events from eventLog since 
> Wed Dec 31 16:00:00 PST 1969.
> 18/10/08 17:52:40 WARN AsyncEventQueue: Dropped 144853 events from eventLog 
> since Mon Oct 08 17:51:40 PDT 2018.
> {code}
> Here it shows the time as Wed Dec 31 16:00:00 PST 1969 for the first log. The 
> log should instead be updated to say "... since start of the application" when 
> 'lastReportTimestamp' == 0, i.e. when the first dropEvent occurs.
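A minimal sketch of the intended behaviour (not the actual patch; the method and 
parameter names below are placeholders for state AsyncEventQueue already keeps):

{code:scala}
// Choose the "since ..." wording based on whether any drop has been reported
// yet: lastReportTimestamp stays 0 until the first report, which is what
// currently yields the misleading "since Wed Dec 31 16:00:00 PST 1969" text.
def droppedEventsMessage(queueName: String, dropped: Long, lastReportTimestamp: Long): String = {
  val since =
    if (lastReportTimestamp == 0) "the start of the application"
    else new java.util.Date(lastReportTimestamp).toString
  s"Dropped $dropped events from $queueName since $since."
}
{code}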



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25683) Updated the log for the firstTime event Drop occurs.

2018-10-18 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25683.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22677
[https://github.com/apache/spark/pull/22677]

> Updated the log for the firstTime event Drop occurs.
> 
>
> Key: SPARK-25683
> URL: https://issues.apache.org/jira/browse/SPARK-25683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Devaraj K
>Assignee: Shivu Sondur
>Priority: Trivial
> Fix For: 3.0.0
>
>
> {code:xml}
> 18/10/08 17:51:40 ERROR AsyncEventQueue: Dropping event from queue eventLog. 
> This likely means one of the listeners is too slow and cannot keep up with 
> the rate at which tasks are being started by the scheduler.
> 18/10/08 17:51:40 WARN AsyncEventQueue: Dropped 1 events from eventLog since 
> Wed Dec 31 16:00:00 PST 1969.
> 18/10/08 17:52:40 WARN AsyncEventQueue: Dropped 144853 events from eventLog 
> since Mon Oct 08 17:51:40 PDT 2018.
> {code}
> Here it shows the time as Wed Dec 31 16:00:00 PST 1969 for the first log. The 
> log should instead be updated to say "... since start of the application" when 
> 'lastReportTimestamp' == 0, i.e. when the first dropEvent occurs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25774) Eliminate query anomalies with empty partitions - TRUNCATE, SELECT DISTINCT, etc.

2018-10-18 Thread Steven Cardella (JIRA)
Steven Cardella created SPARK-25774:
---

 Summary: Eliminate query anomalies with empty partitions - 
TRUNCATE, SELECT DISTINCT, etc.
 Key: SPARK-25774
 URL: https://issues.apache.org/jira/browse/SPARK-25774
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
 Environment: Right now, I'm using Cloudera with Spark 2.2.0, but I 
understand it's a widespread thing.
Reporter: Steven Cardella


If you run a Spark SQL TRUNCATE TABLE command on a managed table in Hive, it 
deletes the files in HDFS but leaves the partitions and partition folder 
structure.  If you then SELECT DISTINCT on the partition columns, it returns 
all the empty partition values.  So, you can have a SELECT DISTINCT return rows 
but SELECT * on the same table returns 0 rows.  

Coming from SQL Server and the like, SELECT DISTINCT always reflects the ROWS, 
and Impala works like that as well.  

I'd like SELECT DISTINCT to reflect rows, not partitions, TRUNCATE TABLE to 
have the option to drop partitions, and MSCK REPAIR TABLE to have the option to 
drop empty partitions.
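A minimal sequence showing the reported anomaly (table and column names are made 
up for illustration; assumes a partitioned managed table in a Hive-enabled 
session):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("truncate-anomaly-sketch")   // hypothetical
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical partitioned managed table with one partition of data.
spark.sql("CREATE TABLE sales (amount DOUBLE) PARTITIONED BY (day STRING)")
spark.sql("INSERT INTO sales PARTITION (day = '2018-10-18') VALUES (1.0)")

spark.sql("TRUNCATE TABLE sales")

// The data files are deleted, but the partition metadata and folders remain:
spark.sql("SELECT * FROM sales").count()            // 0 rows
spark.sql("SELECT DISTINCT day FROM sales").show()  // still lists '2018-10-18'
{code}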



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25773) Cancel zombie tasks in a result stage when the job finishes

2018-10-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25773:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Cancel zombie tasks in a result stage when the job finishes
> ---
>
> Key: SPARK-25773
> URL: https://issues.apache.org/jira/browse/SPARK-25773
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> When a job finishes, there may be some zombie tasks still running due to 
> stage retry. Since a result stage will never be used by other jobs, running 
> these tasks is just wasting cluster resources. This PR just asks 
> TaskScheduler to cancel the running tasks of a result stage when it's already 
> finished. Credits go to @srinathshankar who suggested this idea to me.
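A rough user-side approximation of the idea, using only public APIs (the real 
change belongs inside the scheduler; this is a sketch, not the proposed patch):

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// When a job ends, cancel any of its stages that still report active tasks, so
// zombie attempts left over from stage retries stop occupying executor slots.
class CancelZombieResultTasks(sc: SparkContext) extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    for {
      job     <- sc.statusTracker.getJobInfo(jobEnd.jobId)
      stageId <- job.stageIds()
      stage   <- sc.statusTracker.getStageInfo(stageId)
      if stage.numActiveTasks() > 0
    } sc.cancelStage(stageId)
  }
}

// Registration: sc.addSparkListener(new CancelZombieResultTasks(sc))
{code}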



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25773) Cancel zombie tasks in a result stage when the job finishes

2018-10-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25773:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Cancel zombie tasks in a result stage when the job finishes
> ---
>
> Key: SPARK-25773
> URL: https://issues.apache.org/jira/browse/SPARK-25773
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> When a job finishes, there may be some zombie tasks still running due to 
> stage retry. Since a result stage will never be used by other jobs, running 
> these tasks is just wasting cluster resources. This PR just asks 
> TaskScheduler to cancel the running tasks of a result stage when it's already 
> finished. Credits go to @srinathshankar who suggested this idea to me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25773) Cancel zombie tasks in a result stage when the job finishes

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655887#comment-16655887
 ] 

Apache Spark commented on SPARK-25773:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22771

> Cancel zombie tasks in a result stage when the job finishes
> ---
>
> Key: SPARK-25773
> URL: https://issues.apache.org/jira/browse/SPARK-25773
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> When a job finishes, there may be some zombie tasks still running due to 
> stage retry. Since a result stage will never be used by other jobs, running 
> these tasks is just wasting cluster resources. This PR just asks 
> TaskScheduler to cancel the running tasks of a result stage when it's already 
> finished. Credits go to @srinathshankar who suggested this idea to me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25773) Cancel zombie tasks in a result stage when the job finishes

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655888#comment-16655888
 ] 

Apache Spark commented on SPARK-25773:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22771

> Cancel zombie tasks in a result stage when the job finishes
> ---
>
> Key: SPARK-25773
> URL: https://issues.apache.org/jira/browse/SPARK-25773
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> When a job finishes, there may be some zombie tasks still running due to 
> stage retry. Since a result stage will never be used by other jobs, running 
> these tasks is just wasting cluster resources. This PR just asks 
> TaskScheduler to cancel the running tasks of a result stage when it's already 
> finished. Credits go to @srinathshankar who suggested this idea to me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25773) Cancel zombie tasks in a result stage when the job finishes

2018-10-18 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25773:
-
Description: When a job finishes, there may be some zombie tasks still 
running due to stage retry. Since a result stage will never be used by other 
jobs, running these tasks is just wasting cluster resources. This PR just 
asks TaskScheduler to cancel the running tasks of a result stage when it's 
already finished. Credits go to @srinathshankar who suggested this idea to me.

> Cancel zombie tasks in a result stage when the job finishes
> ---
>
> Key: SPARK-25773
> URL: https://issues.apache.org/jira/browse/SPARK-25773
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> When a job finishes, there may be some zombie tasks still running due to 
> stage retry. Since a result stage will never be used by other jobs, running 
> these tasks is just wasting cluster resources. This PR just asks 
> TaskScheduler to cancel the running tasks of a result stage when it's already 
> finished. Credits go to @srinathshankar who suggested this idea to me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25773) Cancel zombie tasks in a result stage when the job finishes

2018-10-18 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25773:


 Summary: Cancel zombie tasks in a result stage when the job 
finishes
 Key: SPARK-25773
 URL: https://issues.apache.org/jira/browse/SPARK-25773
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21402) Fix java array of structs deserialization

2018-10-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21402:
--
Summary: Fix java array of structs deserialization  (was: Java encoders - 
switch fields on collectAsList)

> Fix java array of structs deserialization
> -
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25772) Java encoders - switch fields on collectAsList

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655862#comment-16655862
 ] 

Apache Spark commented on SPARK-25772:
--

User 'vofque' has created a pull request for this issue:
https://github.com/apache/spark/pull/22745

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-25772
> URL: https://issues.apache.org/jira/browse/SPARK-25772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25772) Java encoders - switch fields on collectAsList

2018-10-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25772:


Assignee: (was: Apache Spark)

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-25772
> URL: https://issues.apache.org/jira/browse/SPARK-25772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25772) Java encoders - switch fields on collectAsList

2018-10-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25772:


Assignee: Apache Spark

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-25772
> URL: https://issues.apache.org/jira/browse/SPARK-25772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Assignee: Apache Spark
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25772) Java encoders - switch fields on collectAsList

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655861#comment-16655861
 ] 

Apache Spark commented on SPARK-25772:
--

User 'vofque' has created a pull request for this issue:
https://github.com/apache/spark/pull/22745

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-25772
> URL: https://issues.apache.org/jira/browse/SPARK-25772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-18 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655855#comment-16655855
 ] 

Dongjoon Hyun commented on SPARK-21402:
---

We already committed with SPARK JIRA ID, SPARK-21402. Too bad.

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25771) Fix improper synchronization in PythonWorkerFactory

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655852#comment-16655852
 ] 

Apache Spark commented on SPARK-25771:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22770

> Fix improper synchronization in PythonWorkerFactory
> ---
>
> Key: SPARK-25771
> URL: https://issues.apache.org/jira/browse/SPARK-25771
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-18 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655845#comment-16655845
 ] 

Dongjoon Hyun edited comment on SPARK-21402 at 10/18/18 8:47 PM:
-

I cloned this JIRA issue. Can we do the following? [~cloud_fan] and 
[~vofque] and [~srowen] and [~tomron]

- SPARK-21402 (this) for `Fix java array of structs deserialization`
- SPARK-25772 (new cloned one) for the original Map issue?


was (Author: dongjoon):
I cloned this JIRA issue. Can we do the following? [~cloud_fan] and 
[~vofque].

- SPARK-21402 (this) for `Fix java array of structs deserialization`
- SPARK-25772 (new cloned one) for the original Map issue?

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25771) Fix improper synchronization in PythonWorkerFactory

2018-10-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25771:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix improper synchronization in PythonWorkerFactory
> ---
>
> Key: SPARK-25771
> URL: https://issues.apache.org/jira/browse/SPARK-25771
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25771) Fix improper synchronization in PythonWorkerFactory

2018-10-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25771:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix improper synchronization in PythonWorkerFactory
> ---
>
> Key: SPARK-25771
> URL: https://issues.apache.org/jira/browse/SPARK-25771
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25771) Fix improper synchronization in PythonWorkerFactory

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655850#comment-16655850
 ] 

Apache Spark commented on SPARK-25771:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22770

> Fix improper synchronization in PythonWorkerFactory
> ---
>
> Key: SPARK-25771
> URL: https://issues.apache.org/jira/browse/SPARK-25771
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-18 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655845#comment-16655845
 ] 

Dongjoon Hyun commented on SPARK-21402:
---

I cloned this JIRA issue. Can we do the following? [~cloud_fan] and 
[~vofque].

- SPARK-21402 (this) for `Fix java array of structs deserialization`
- SPARK-25772 (new cloned one) for the original Map issue?

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25772) Java encoders - switch fields on collectAsList

2018-10-18 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-25772:
-

 Summary: Java encoders - switch fields on collectAsList
 Key: SPARK-25772
 URL: https://issues.apache.org/jira/browse/SPARK-25772
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
 Environment: mac os
spark 2.1.1
Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
Reporter: Tom


I have the following schema in a dataset -

root
 |-- userId: string (nullable = true)
 |-- data: map (nullable = true)
 ||-- key: string
 ||-- value: struct (valueContainsNull = true)
 |||-- startTime: long (nullable = true)
 |||-- endTime: long (nullable = true)
 |-- offset: long (nullable = true)


 And I have the following classes (+ setters and getters, which I omitted for 
simplicity) -


 
{code:java}
public class MyClass {

private String userId;

private Map<String, MyDTO> data;

private Long offset;
 }

public class MyDTO {

private long startTime;
private long endTime;

}
{code}


I collect the result the following way - 


{code:java}
Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
Dataset<MyClass> results = raw_df.as(myClassEncoder);
List<MyClass> lst = results.collectAsList();

{code}

I do several calculations to get the result I want and the result is correct 
all the way through before I collect it.
This is the result for - 


{code:java}
results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);

{code}

|data[2017-07-01].startTime|data[2017-07-01].endTime|
+--------------------------+------------------------+
|1498854000                |1498870800              |


This is the result after collecting the results for - 


{code:java}
MyClass userData = results.collectAsList().get(0);
MyDTO userDTO = userData.getData().get("2017-07-01");
System.out.println("userDTO startTime: " + userDTO.getStartTime());
System.out.println("userDTO endTime: " + userDTO.getEndTime());

{code}

--
data startTime: 1498870800
data endTime: 1498854000

I tend to believe it is a spark issue. Would love any suggestions on how to 
bypass it.
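One possible way to sidestep this while on an affected version (a sketch in 
Scala, not the committed fix): skip bean deserialization for the struct-valued 
map and read the nested fields by name through the DataFrame API, which the 
select/show output above already returns correctly:

{code:scala}
import org.apache.spark.sql.functions.col

// Collect plain Rows with the nested fields selected by name instead of
// collecting beans; access by name avoids the swapped-field mapping.
val rows = raw_df
  .select(
    col("userId"),
    col("data")("2017-07-01")("startTime").as("startTime"),
    col("data")("2017-07-01")("endTime").as("endTime"))
  .collect()

rows.foreach { r =>
  println(s"${r.getAs[String]("userId")}: " +
    s"start=${r.getAs[Long]("startTime")}, end=${r.getAs[Long]("endTime")}")
}
{code}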



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25771) Fix improper synchronization in PythonWorkerFactory

2018-10-18 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25771:


 Summary: Fix improper synchronization in PythonWorkerFactory
 Key: SPARK-25771
 URL: https://issues.apache.org/jira/browse/SPARK-25771
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.2
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24499) Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655825#comment-16655825
 ] 

Apache Spark commented on SPARK-24499:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22769

> Split the page of sql-programming-guide.html to multiple separate pages
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.4.0
>
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24499) Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655824#comment-16655824
 ] 

Apache Spark commented on SPARK-24499:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22769

> Split the page of sql-programming-guide.html to multiple separate pages
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.4.0
>
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24499) Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655776#comment-16655776
 ] 

Xiao Li commented on SPARK-24499:
-

Feel free to create the extra tasks to improve the documentation. I just merged 
this to 2.4 and 3.0. 

> Split the page of sql-programming-guide.html to multiple separate pages
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.4.0
>
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24499) Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24499.
-
  Resolution: Fixed
Assignee: Yuanjian Li
   Fix Version/s: 2.4.0
Target Version/s:   (was: 3.0.0)

> Split the page of sql-programming-guide.html to multiple separate pages
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.4.0
>
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24499) Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24499:

Summary: Split the page of sql-programming-guide.html to multiple separate 
pages  (was: Documentation improvement of Spark core and SQL)

> Split the page of sql-programming-guide.html to multiple separate pages
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24499) Split the page of sql-programming-guide.html to multiple separate pages

2018-10-18 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24499:

Component/s: (was: Spark Core)

> Split the page of sql-programming-guide.html to multiple separate pages
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.4.0
>
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25769) UnresolvedAttribute.sql() incorrectly escapes nested columns

2018-10-18 Thread Huaxin Gao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655711#comment-16655711
 ] 

Huaxin Gao commented on SPARK-25769:


I will work on this. Thanks!

> UnresolvedAttribute.sql() incorrectly escapes nested columns
> 
>
> Key: SPARK-25769
> URL: https://issues.apache.org/jira/browse/SPARK-25769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: sql
>
> {{UnresolvedAttribute.sql()}} output is incorrectly escaped for nested columns
> {code:java}
> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
> // The correct output is a.b, without backticks, or `a`.`b`.
> $"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
> // res1: String = `a.b`
> // Parsing is correct; the bug is localized to sql() 
> $"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
> // res2: Seq[String] = ArrayBuffer(a, b)
> {code}
> The likely culprit is that the {{sql()}} implementation does not check for 
> {{nameParts}} being non-empty.
> {code:java}
> override def sql: String = name match { 
>   case ParserUtils.escapedIdentifier(_) | 
> ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
>   case _ => quoteIdentifier(name) 
> }
> {code}
>  
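
For illustration only, here is a minimal, standalone sketch (not the actual Spark patch; the 
object name and the local helper below are made up) of the direction a fix could take: quote 
each element of {{nameParts}} separately and join the parts with dots, so a nested column 
renders as {{`a`.`b`}} while a single column whose name contains a dot still renders as {{`a.b`}}.

{code:scala}
// Hypothetical sketch, independent of Spark internals: quoteIdentifier here is a
// local stand-in, not Spark's own implementation.
object NestedColumnSqlSketch {
  private def quoteIdentifier(name: String): String =
    "`" + name.replace("`", "``") + "`"

  // Quote each name part separately, then join the parts with dots.
  def sqlFor(nameParts: Seq[String]): String =
    nameParts.map(quoteIdentifier).mkString(".")

  def main(args: Array[String]): Unit = {
    println(sqlFor(Seq("a", "b"))) // `a`.`b` -- the nested column a.b
    println(sqlFor(Seq("a.b")))    // `a.b`   -- a single column whose name contains a dot
  }
}
{code}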



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25770) support SparkDataFrame pretty print

2018-10-18 Thread Weiqiang Zhuang (JIRA)
Weiqiang Zhuang created SPARK-25770:
---

 Summary: support SparkDataFrame pretty print
 Key: SPARK-25770
 URL: https://issues.apache.org/jira/browse/SPARK-25770
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.3.2, 2.3.1, 2.3.0
Reporter: Weiqiang Zhuang


This is for continuous discussion with a requirement added in 
[https://github.com/apache/spark/pull/22455#discussion_r223197863.]

 

Summary:

SparkDataFrame is an S4 object, and `show()` is the default method for displaying a 
data frame on screen. Currently the output is simply the string returned by the 
`showString()` call, which pre-formats the data frame and displays it as a table. 
This lacks the flexibility to re-format the output in a more user-friendly and 
prettier fashion, as seen in 1) the S3 `print()` method, which allows arguments 
such as `quote` to control the output; and 2) external tools such as `Jupyter` R 
notebooks, which implement their own customized display.

This JIRA aims to explore a feasible solution to improve the screen-output 
experience, both by supporting pretty printing from within the SparkR package and 
by offering a common hook for external tools to customize the display function.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25758) Deprecate BisectingKMeans compute cost

2018-10-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25758:
--
Issue Type: Task  (was: Improvement)

> Deprecate BisectingKMeans compute cost
> --
>
> Key: SPARK-25758
> URL: https://issues.apache.org/jira/browse/SPARK-25758
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> In SPARK-23451 the method {{computeCost}} from KMeans was deprecated, as we now 
> have a better way to evaluate a clustering algorithm (the {{ClusteringEvaluator}}). 
> Moreover, in the deprecation, the method was targeted for removal in 3.0.
> I think we should deprecate the {{computeCost}} method on BisectingKMeans for 
> the same reasons.
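
As an illustration of the suggested replacement, here is a short sketch (assuming a running 
{{SparkSession}} named {{spark}}; the toy dataset is made up) that evaluates a 
{{BisectingKMeans}} model with {{ClusteringEvaluator}} instead of {{computeCost}}:

{code:scala}
// Sketch: evaluate a BisectingKMeans clustering with the silhouette score from
// ClusteringEvaluator rather than the deprecated computeCost.
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.linalg.Vectors

import spark.implicits._

// Tiny example dataset with two well-separated groups of points.
val dataset = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
).map(Tuple1.apply).toDF("features")

val model = new BisectingKMeans().setK(2).setSeed(1L).fit(dataset)
val predictions = model.transform(dataset)

// ClusteringEvaluator computes the silhouette score by default (higher is better).
val silhouette = new ClusteringEvaluator().evaluate(predictions)
println(s"Silhouette = $silhouette")
{code}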



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25758) Deprecate BisectingKMeans compute cost

2018-10-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25758.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22756
[https://github.com/apache/spark/pull/22756]

> Deprecate BisectingKMeans compute cost
> --
>
> Key: SPARK-25758
> URL: https://issues.apache.org/jira/browse/SPARK-25758
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> In SPARK-23451 the method {{computeCost}} from KMeans was deprecated, as we now 
> have a better way to evaluate a clustering algorithm (the {{ClusteringEvaluator}}). 
> Moreover, in the deprecation, the method was targeted for removal in 3.0.
> I think we should deprecate the {{computeCost}} method on BisectingKMeans for 
> the same reasons.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25758) Deprecate BisectingKMeans compute cost

2018-10-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25758:
-

Assignee: Marco Gaido

> Deprecate BisectingKMeans compute cost
> --
>
> Key: SPARK-25758
> URL: https://issues.apache.org/jira/browse/SPARK-25758
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
>
> In SPARK-23451 the method {{computeCost}} from KMeans was deprecated, as we now 
> have a better way to evaluate a clustering algorithm (the {{ClusteringEvaluator}}). 
> Moreover, in the deprecation, the method was targeted for removal in 3.0.
> I think we should deprecate the {{computeCost}} method on BisectingKMeans for 
> the same reasons.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25682) Docker images generated from dev build and from dist tarball are different

2018-10-18 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25682.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22681
[https://github.com/apache/spark/pull/22681]

> Docker images generated from dev build and from dist tarball are different
> --
>
> Key: SPARK-25682
> URL: https://issues.apache.org/jira/browse/SPARK-25682
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 3.0.0
>
>
> There's at least one difference I noticed, because of this line:
> {noformat}
> COPY examples /opt/spark/examples
> {noformat}
> In a dev build, "examples" contains your usual source code and maven-style 
> directories, whereas in the dist version, it's this:
> {code}
> cp "$SPARK_HOME"/examples/target/scala*/jars/* "$DISTDIR/examples/jars"
> {code}
> So the path to the actual jar files ends up being different depending on how 
> you built the image.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25682) Docker images generated from dev build and from dist tarball are different

2018-10-18 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25682:
--

Assignee: Marcelo Vanzin

> Docker images generated from dev build and from dist tarball are different
> --
>
> Key: SPARK-25682
> URL: https://issues.apache.org/jira/browse/SPARK-25682
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> There's at least one difference I noticed, because of this line:
> {noformat}
> COPY examples /opt/spark/examples
> {noformat}
> In a dev build, "examples" contains your usual source code and maven-style 
> directories, whereas in the dist version, it's this:
> {code}
> cp "$SPARK_HOME"/examples/target/scala*/jars/* "$DISTDIR/examples/jars"
> {code}
> So the path to the actual jar files ends up being different depending on how 
> you built the image.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25766) AMCredentialRenewer can leak FS clients

2018-10-18 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655574#comment-16655574
 ] 

Marcelo Vanzin commented on SPARK-25766:


That code doesn't exist anymore in 2.4... there could be a similar issue in 
{{HadoopFSDelegationTokenProvider}}, though.

> AMCredentialRenewer can leak FS clients
> ---
>
> Key: SPARK-25766
> URL: https://issues.apache.org/jira/browse/SPARK-25766
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Priority: Trivial
>
> AMCredentialRenewer's scheduled {{writeNewCredentialsToHDFS}} operation 
> creates a new FS connector each time so as to access the store with 
> refreshed credentials, but it doesn't close it afterwards, so any resources 
> used by the client are kept around. This is more expensive with the cloud 
> store connectors, which create thread pools. 
> It should call {{remoteFs.close()}} at the end of its work.
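
For illustration, a minimal sketch of the suggested pattern (not the actual Spark code; the 
helper name {{withFreshFileSystem}} is made up): obtain a fresh, uncached client, use it, and 
close it in a {{finally}} block so that sockets and thread pools are released.

{code:scala}
// Hypothetical helper, not AMCredentialRenewer itself: create an uncached FileSystem
// client, run the action, and always close the client afterwards.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def withFreshFileSystem[T](path: Path, hadoopConf: Configuration)(action: FileSystem => T): T = {
  // newInstance bypasses the shared FileSystem cache, so closing this client is safe.
  val remoteFs = FileSystem.newInstance(path.toUri, hadoopConf)
  try {
    action(remoteFs)
  } finally {
    remoteFs.close() // release connections and thread pools held by the connector
  }
}
{code}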



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25763) Use more `@contextmanager` to ensure clean-up each test.

2018-10-18 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25763.
--
   Resolution: Fixed
 Assignee: Takuya Ueshin
Fix Version/s: 3.0.0

fixed in https://github.com/apache/spark/pull/22762

> Use more `@contextmanager` to ensure clean-up each test.
> 
>
> Key: SPARK-25763
> URL: https://issues.apache.org/jira/browse/SPARK-25763
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently each test in {{SQLTest}} in PySpark is not cleaned up properly.
>  We should introduce and use more {{@contextmanager}}s to make it convenient to 
> clean up the context properly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25767) Error reported in Spark logs when using the org.apache.spark:spark-sql_2.11:2.3.2 Java library

2018-10-18 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655530#comment-16655530
 ] 

Marco Gaido commented on SPARK-25767:
-

Your conversion of a Java array into a Scala Seq creates a Stream.

> Error reported in Spark logs when using the 
> org.apache.spark:spark-sql_2.11:2.3.2 Java library
> --
>
> Key: SPARK-25767
> URL: https://issues.apache.org/jira/browse/SPARK-25767
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0, 2.3.2
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> Hi,
> Here is a bug I found using the latest version of spark-sql_2.11:2.2.0. Note 
> that this case was also tested with spark-sql_2.11:2.3.2 and the bug is also 
> present.
> This issue is a duplicate of the SPARK-25582 issue that I had to close after 
> an accidental manipulation from another developer (was linked to a wrong PR)
> You will find attached three small sample CSV files with the minimal content 
> to raise the bug.
> Find below a reproducer code:
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import scala.collection.JavaConverters;
> import scala.collection.Seq;
> import java.util.Arrays;
> public class SparkBug {
>     private static <T> Seq<T> arrayToSeq(T[] input) {
>         return JavaConverters.asScalaIteratorConverter(Arrays.asList(input).iterator()).asScala().toSeq();
>     }
>
>     public static void main(String[] args) throws Exception {
>         SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
>         SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
>         Dataset<Row> df_a = sparkSession.read().option("header", true).csv("local/fileA.csv").dropDuplicates();
>         Dataset<Row> df_b = sparkSession.read().option("header", true).csv("local/fileB.csv").dropDuplicates();
>         Dataset<Row> df_c = sparkSession.read().option("header", true).csv("local/fileC.csv").dropDuplicates();
>         String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", "colF"};
>         String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
>         Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
>         Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, arrayToSeq(key_join_2), "left");
>         df_inventory_2.show();
>     }
> }
> {code}
> When running this code, I can see the exception below:
> {code:java}
> 18/10/18 09:25:49 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:212)
>     at 

[jira] [Resolved] (SPARK-25760) Set AddJarCommand return empty

2018-10-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25760.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22747
[https://github.com/apache/spark/pull/22747]

> Set AddJarCommand return empty
> --
>
> Key: SPARK-25760
> URL: https://issues.apache.org/jira/browse/SPARK-25760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> {noformat}
> spark-sql> add jar 
> /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar;
> ADD JAR /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar
> 0
> spark-sql>{noformat}
> {{AddJarCommand}} only returns 0, so the user will be confused about what it 
> means. The result should be empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25760) Set AddJarCommand return empty

2018-10-18 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25760:
-

Assignee: Yuming Wang

> Set AddJarCommand return empty
> --
>
> Key: SPARK-25760
> URL: https://issues.apache.org/jira/browse/SPARK-25760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> {noformat}
> spark-sql> add jar 
> /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar;
> ADD JAR /Users/yumwang/spark/sql/hive/src/test/resources/TestUDTF.jar
> 0
> spark-sql>{noformat}
> {{AddJarCommand}} only returns 0, so the user will be confused about what it 
> means. The result should be empty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25767) Error reported in Spark logs when using the org.apache.spark:spark-sql_2.11:2.3.2 Java library

2018-10-18 Thread Thomas Brugiere (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655445#comment-16655445
 ] 

Thomas Brugiere commented on SPARK-25767:
-

Can you please provide an example? I don't see any Stream in my code.

> Error reported in Spark logs when using the 
> org.apache.spark:spark-sql_2.11:2.3.2 Java library
> --
>
> Key: SPARK-25767
> URL: https://issues.apache.org/jira/browse/SPARK-25767
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0, 2.3.2
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> Hi,
> Here is a bug I found using the latest version of spark-sql_2.11:2.2.0. Note 
> that this case was also tested with spark-sql_2.11:2.3.2 and the bug is also 
> present.
> This issue is a duplicate of the SPARK-25582 issue that I had to close after 
> an accidental manipulation from another developer (was linked to a wrong PR)
> You will find attached three small sample CSV files with the minimal content 
> to raise the bug.
> Find below a reproducer code:
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import scala.collection.JavaConverters;
> import scala.collection.Seq;
> import java.util.Arrays;
> public class SparkBug {
>     private static <T> Seq<T> arrayToSeq(T[] input) {
>         return JavaConverters.asScalaIteratorConverter(Arrays.asList(input).iterator()).asScala().toSeq();
>     }
>
>     public static void main(String[] args) throws Exception {
>         SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
>         SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
>         Dataset<Row> df_a = sparkSession.read().option("header", true).csv("local/fileA.csv").dropDuplicates();
>         Dataset<Row> df_b = sparkSession.read().option("header", true).csv("local/fileB.csv").dropDuplicates();
>         Dataset<Row> df_c = sparkSession.read().option("header", true).csv("local/fileC.csv").dropDuplicates();
>         String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", "colF"};
>         String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
>         Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
>         Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, arrayToSeq(key_join_2), "left");
>         df_inventory_2.show();
>     }
> }
> {code}
> When running this code, I can see the exception below:
> {code:java}
> 18/10/18 09:25:49 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at 

[jira] [Commented] (SPARK-21725) spark thriftserver insert overwrite table partition select

2018-10-18 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655442#comment-16655442
 ] 

Yuming Wang commented on SPARK-21725:
-

[~owen.omalley] [~ste...@apache.org]
I found lots of related issues; can we fix this on the Hadoop side?
 
[https://stackoverflow.com/questions/17421218/multiples-hadoop-filesystem-instances/]
 
[https://stackoverflow.com/questions/48592337/hive-hadoop-intermittent-failure-unable-to-move-source-to-destination]
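
For context, a small sketch of the behaviour behind "java.io.IOException: Filesystem closed" 
(the namenode address below is just a placeholder): {{FileSystem.get}} returns a JVM-wide cached 
instance per (scheme, authority, user), so one caller closing it breaks every other user of that 
cached client. Disabling the cache via {{fs.hdfs.impl.disable.cache}}, or using 
{{FileSystem.newInstance}}, is a common workaround.

{code:scala}
// Illustration only: why closing a cached FileSystem breaks other callers.
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val conf = new Configuration()
// conf.setBoolean("fs.hdfs.impl.disable.cache", true) // opt out of the shared cache

val uri = new URI("hdfs://namenode:8020/") // placeholder address
val fs1 = FileSystem.get(uri, conf)
val fs2 = FileSystem.get(uri, conf)
println(fs1 eq fs2) // true while the cache is enabled: both callers share one client
fs1.close()         // from now on, fs2 throws java.io.IOException: Filesystem closed
{code}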

> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>Priority: Major
>  Labels: spark-sql
>
> Use the thriftserver to create tables with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> The documentation describing Parquet tables is here: 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused: the problem appears with partitioned tables, but it is OK with 
> non-partitioned tables. Does that mean Spark does not use its own Parquet support here?
> Could someone suggest how I can avoid this issue?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25767) Error reported in Spark logs when using the org.apache.spark:spark-sql_2.11:2.3.2 Java library

2018-10-18 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655440#comment-16655440
 ] 

Marco Gaido commented on SPARK-25767:
-

So I tracked down the issue. The problem is that you are passing a Stream as the 
parameter for the join keys. The easy workaround is to use a Buffer instead.
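
To make the workaround concrete, here is a small sketch in plain Scala (independent of the 
reporter's code) showing that converting a Java iterator to a {{Seq}} with {{toSeq}} produces a 
lazy {{Stream}}, while {{toBuffer}} (or {{toList}}) forces an eager collection. From Java, 
converting the {{java.util.List}} with {{JavaConverters.asScalaBufferConverter}} also yields a 
{{Buffer}} directly.

{code:scala}
// Illustration: Iterator.toSeq is lazy (a Stream), Iterator.toBuffer is eager.
import java.util.Arrays
import scala.collection.JavaConverters._

val cols = Arrays.asList("colA", "colB", "colC")

val lazySeq  = cols.iterator().asScala.toSeq    // backed by scala.collection.immutable.Stream
val eagerSeq = cols.iterator().asScala.toBuffer // scala.collection.mutable.ArrayBuffer

println(lazySeq.getClass.getName)  // a Stream class
println(eagerSeq.getClass.getName) // an ArrayBuffer class
{code}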

> Error reported in Spark logs when using the 
> org.apache.spark:spark-sql_2.11:2.3.2 Java library
> --
>
> Key: SPARK-25767
> URL: https://issues.apache.org/jira/browse/SPARK-25767
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0, 2.3.2
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> Hi,
> Here is a bug I found using the latest version of spark-sql_2.11:2.2.0. Note 
> that this case was also tested with spark-sql_2.11:2.3.2 and the bug is also 
> present.
> This issue is a duplicate of the SPARK-25582 issue that I had to close after 
> an accidental manipulation from another developer (was linked to a wrong PR)
> You will find attached three small sample CSV files with the minimal content 
> to raise the bug.
> Find below a reproducer code:
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import scala.collection.JavaConverters;
> import scala.collection.Seq;
> import java.util.Arrays;
> public class SparkBug {
>     private static <T> Seq<T> arrayToSeq(T[] input) {
>         return JavaConverters.asScalaIteratorConverter(Arrays.asList(input).iterator()).asScala().toSeq();
>     }
>
>     public static void main(String[] args) throws Exception {
>         SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
>         SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
>         Dataset<Row> df_a = sparkSession.read().option("header", true).csv("local/fileA.csv").dropDuplicates();
>         Dataset<Row> df_b = sparkSession.read().option("header", true).csv("local/fileB.csv").dropDuplicates();
>         Dataset<Row> df_c = sparkSession.read().option("header", true).csv("local/fileC.csv").dropDuplicates();
>         String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", "colF"};
>         String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
>         Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
>         Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, arrayToSeq(key_join_2), "left");
>         df_inventory_2.show();
>     }
> }
> {code}
> When running this code, I can see the exception below:
> {code:java}
> 18/10/18 09:25:49 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at 

[jira] [Commented] (SPARK-25767) Error reported in Spark logs when using the org.apache.spark:spark-sql_2.11:2.3.2 Java library

2018-10-18 Thread Thomas Brugiere (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655428#comment-16655428
 ] 

Thomas Brugiere commented on SPARK-25767:
-

It looks like it's not stopping the processing, but I would like to make sure 
it has no side effect on the data processing.

> Error reported in Spark logs when using the 
> org.apache.spark:spark-sql_2.11:2.3.2 Java library
> --
>
> Key: SPARK-25767
> URL: https://issues.apache.org/jira/browse/SPARK-25767
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0, 2.3.2
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> Hi,
> Here is a bug I found using the latest version of spark-sql_2.11:2.2.0. Note 
> that this case was also tested with spark-sql_2.11:2.3.2 and the bug is also 
> present.
> This issue is a duplicate of the SPARK-25582 issue that I had to close after 
> an accidental manipulation from another developer (was linked to a wrong PR)
> You will find attached three small sample CSV files with the minimal content 
> to raise the bug.
> Find below a reproducer code:
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import scala.collection.JavaConverters;
> import scala.collection.Seq;
> import java.util.Arrays;
> public class SparkBug {
>     private static <T> Seq<T> arrayToSeq(T[] input) {
>         return JavaConverters.asScalaIteratorConverter(Arrays.asList(input).iterator()).asScala().toSeq();
>     }
>
>     public static void main(String[] args) throws Exception {
>         SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
>         SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
>         Dataset<Row> df_a = sparkSession.read().option("header", true).csv("local/fileA.csv").dropDuplicates();
>         Dataset<Row> df_b = sparkSession.read().option("header", true).csv("local/fileB.csv").dropDuplicates();
>         Dataset<Row> df_c = sparkSession.read().option("header", true).csv("local/fileC.csv").dropDuplicates();
>         String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", "colF"};
>         String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
>         Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
>         Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, arrayToSeq(key_join_2), "left");
>         df_inventory_2.show();
>     }
> }
> {code}
> When running this code, I can see the exception below:
> {code:java}
> 18/10/18 09:25:49 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at 

[jira] [Commented] (SPARK-25767) Error reported in Spark logs when using the org.apache.spark:spark-sql_2.11:2.3.2 Java library

2018-10-18 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655418#comment-16655418
 ] 

Marco Gaido commented on SPARK-25767:
-

It is interesting: I can reproduce this with the Java API but not with the Scala 
one...

> Error reported in Spark logs when using the 
> org.apache.spark:spark-sql_2.11:2.3.2 Java library
> --
>
> Key: SPARK-25767
> URL: https://issues.apache.org/jira/browse/SPARK-25767
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0, 2.3.2
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> Hi,
> Here is a bug I found using the latest version of spark-sql_2.11:2.2.0. Note 
> that this case was also tested with spark-sql_2.11:2.3.2 and the bug is also 
> present.
> This issue is a duplicate of the SPARK-25582 issue that I had to close after 
> an accidental manipulation from another developer (was linked to a wrong PR)
> You will find attached three small sample CSV files with the minimal content 
> to raise the bug.
> Find below a reproducer code:
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import scala.collection.JavaConverters;
> import scala.collection.Seq;
> import java.util.Arrays;
> public class SparkBug {
>     private static <T> Seq<T> arrayToSeq(T[] input) {
>         return JavaConverters.asScalaIteratorConverter(Arrays.asList(input).iterator()).asScala().toSeq();
>     }
>
>     public static void main(String[] args) throws Exception {
>         SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
>         SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
>         Dataset<Row> df_a = sparkSession.read().option("header", true).csv("local/fileA.csv").dropDuplicates();
>         Dataset<Row> df_b = sparkSession.read().option("header", true).csv("local/fileB.csv").dropDuplicates();
>         Dataset<Row> df_c = sparkSession.read().option("header", true).csv("local/fileC.csv").dropDuplicates();
>         String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", "colF"};
>         String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
>         Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
>         Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, arrayToSeq(key_join_2), "left");
>         df_inventory_2.show();
>     }
> }
> {code}
> When running this code, I can see the exception below:
> {code:java}
> 18/10/18 09:25:49 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at 

[jira] [Commented] (SPARK-25665) Refactor ObjectHashAggregateExecBenchmark to use main method

2018-10-18 Thread Peter Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655416#comment-16655416
 ] 

Peter Toth commented on SPARK-25665:


I will submit a PR but I've run into SPARK-25768.

> Refactor ObjectHashAggregateExecBenchmark to use main method
> 
>
> Key: SPARK-25665
> URL: https://issues.apache.org/jira/browse/SPARK-25665
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25767) Error reported in Spark logs when using the org.apache.spark:spark-sql_2.11:2.3.2 Java library

2018-10-18 Thread Thomas Brugiere (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655391#comment-16655391
 ] 

Thomas Brugiere commented on SPARK-25767:
-

In case it helps, I have created a GitHub repo with the problematic code: 
[https://github.com/onyssius/spark-troubleshooting-bug]

Once I run the SparkBug.main() method, I can see the error mentioned in the 
description in the application logs (search for ERROR).

Note that the exception doesn't reach the main thread, but it does appear in the 
logs.

> Error reported in Spark logs when using the 
> org.apache.spark:spark-sql_2.11:2.3.2 Java library
> --
>
> Key: SPARK-25767
> URL: https://issues.apache.org/jira/browse/SPARK-25767
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0, 2.3.2
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> Hi,
> Here is a bug I found using the latest version of spark-sql_2.11:2.2.0. Note 
> that this case was also tested with spark-sql_2.11:2.3.2 and the bug is also 
> present.
> This issue is a duplicate of the SPARK-25582 issue that I had to close after 
> an accidental manipulation from another developer (was linked to a wrong PR)
> You will find attached three small sample CSV files with the minimal content 
> to raise the bug.
> Find below a reproducer code:
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import scala.collection.JavaConverters;
> import scala.collection.Seq;
> import java.util.Arrays;
> public class SparkBug {
>     private static <T> Seq<T> arrayToSeq(T[] input) {
>         return JavaConverters.asScalaIteratorConverter(Arrays.asList(input).iterator()).asScala().toSeq();
>     }
>
>     public static void main(String[] args) throws Exception {
>         SparkConf conf = new SparkConf().setAppName("SparkBug").setMaster("local");
>         SparkSession sparkSession = SparkSession.builder().config(conf).getOrCreate();
>         Dataset<Row> df_a = sparkSession.read().option("header", true).csv("local/fileA.csv").dropDuplicates();
>         Dataset<Row> df_b = sparkSession.read().option("header", true).csv("local/fileB.csv").dropDuplicates();
>         Dataset<Row> df_c = sparkSession.read().option("header", true).csv("local/fileC.csv").dropDuplicates();
>         String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", "colE", "colF"};
>         String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", "colE"};
>         Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), "left");
>         Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, arrayToSeq(key_join_2), "left");
>         df_inventory_2.show();
>     }
> }
> {code}
> When running this code, I can see the exception below:
> {code:java}
> 18/10/18 09:25:49 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> 

[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655383#comment-16655383
 ] 

Apache Spark commented on SPARK-21402:
--

User 'vofque' has created a pull request for this issue:
https://github.com/apache/spark/pull/22768

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setter and getters which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
>     private String userId;
>     private Map<String, MyDTO> data;
>     private Long offset;
> }
>
> public class MyDTO {
>     private long startTime;
>     private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want, and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000|1498870800  |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25768) Constant argument expecting Hive UDAFs doesn't work

2018-10-18 Thread Peter Toth (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-25768:
---
Description: 
 The following doesn't work since SPARK-18186 so it is a regression.
{code:java}
test("constant argument expecting Hive UDAF") {
  withTempView("inputTable") {
spark.range(10).createOrReplaceTempView("inputTable")
withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
  val numFunc = spark.catalog.listFunctions().count()
  sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
s"${classOf[GenericUDAFPercentileApprox].getName}'")
  checkAnswer(
sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM inputTable"),
Seq(Row(4.0)))
}
  }
}

{code}
 

  was:
 The following doesn't work since SPARK-18168 so it is a regression.
{code:java}
test("constant argument expecting Hive UDAF") {
  withTempView("inputTable") {
spark.range(10).createOrReplaceTempView("inputTable")
withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
  val numFunc = spark.catalog.listFunctions().count()
  sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
s"${classOf[GenericUDAFPercentileApprox].getName}'")
  checkAnswer(
sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM inputTable"),
Seq(Row(4.0)))
}
  }
}

{code}
 


> Constant argument expecting Hive UDAFs doesn't work
> ---
>
> Key: SPARK-25768
> URL: https://issues.apache.org/jira/browse/SPARK-25768
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Peter Toth
>Priority: Major
>
>  The following doesn't work since SPARK-18186 so it is a regression.
> {code:java}
> test("constant argument expecting Hive UDAF") {
>   withTempView("inputTable") {
> spark.range(10).createOrReplaceTempView("inputTable")
> withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
>   val numFunc = spark.catalog.listFunctions().count()
>   sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
> s"${classOf[GenericUDAFPercentileApprox].getName}'")
>   checkAnswer(
> sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM 
> inputTable"),
> Seq(Row(4.0)))
> }
>   }
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655382#comment-16655382
 ] 

Apache Spark commented on SPARK-21402:
--

User 'vofque' has created a pull request for this issue:
https://github.com/apache/spark/pull/22768

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setter and getters which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
>     private String userId;
>     private Map<String, MyDTO> data;
>     private Long offset;
> }
>
> public class MyDTO {
>     private long startTime;
>     private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want, and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21402:


Assignee: (was: Apache Spark)

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want, and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21402:


Assignee: Apache Spark

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Assignee: Apache Spark
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want, and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25768) Constant argument expecting Hive UDAFs doesn't work

2018-10-18 Thread Peter Toth (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-25768:
---
Description: 
 The following doesn't work since SPARK-18168 so it is a regression.
{code:java}
test("constant argument expecting Hive UDAF") {
  withTempView("inputTable") {
spark.range(10).createOrReplaceTempView("inputTable")
withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
  val numFunc = spark.catalog.listFunctions().count()
  sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
s"${classOf[GenericUDAFPercentileApprox].getName}'")
  checkAnswer(
sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM inputTable"),
Seq(Row(4.0)))
}
  }
}

{code}
 

  was:
 

The following doesn't work since -SPARK-18186:(-
{code:java}
test("constant argument expecting Hive UDAF") {
  withTempView("inputTable") {
spark.range(10).createOrReplaceTempView("inputTable")
withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
  val numFunc = spark.catalog.listFunctions().count()
  sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
s"${classOf[GenericUDAFPercentileApprox].getName}'")
  checkAnswer(
sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM inputTable"),
Seq(Row(4.0)))
}
  }
}

{code}
 


> Constant argument expecting Hive UDAFs doesn't work
> ---
>
> Key: SPARK-25768
> URL: https://issues.apache.org/jira/browse/SPARK-25768
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Peter Toth
>Priority: Major
>
>  The following doesn't work since SPARK-18168 so it is a regression.
> {code:java}
> test("constant argument expecting Hive UDAF") {
>   withTempView("inputTable") {
> spark.range(10).createOrReplaceTempView("inputTable")
> withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
>   val numFunc = spark.catalog.listFunctions().count()
>   sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
> s"${classOf[GenericUDAFPercentileApprox].getName}'")
>   checkAnswer(
> sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM 
> inputTable"),
> Seq(Row(4.0)))
> }
>   }
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655380#comment-16655380
 ] 

Apache Spark commented on SPARK-21402:
--

User 'vofque' has created a pull request for this issue:
https://github.com/apache/spark/pull/22767

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want, and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25768) Constant argument expecting Hive UDAFs doesn't work

2018-10-18 Thread Peter Toth (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-25768:
---
Description: 
 

The following doesn't work since -SPARK-18186:(-
{code:java}
test("constant argument expecting Hive UDAF") {
  withTempView("inputTable") {
spark.range(10).createOrReplaceTempView("inputTable")
withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
  val numFunc = spark.catalog.listFunctions().count()
  sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
s"${classOf[GenericUDAFPercentileApprox].getName}'")
  checkAnswer(
sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM inputTable"),
Seq(Row(4.0)))
}
  }
}

{code}
 

  was:
 

The following doesn't work since SPARK-18186:
{code:java}
test("constant argument expecting Hive UDAF") {
  val testData = spark.range(10).toDF()
  withTempView("inputTable") {
testData.createOrReplaceTempView("inputTable")
withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
  val numFunc = spark.catalog.listFunctions().count()
  sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
s"${classOf[GenericUDAFPercentileApprox].getName}'")
  checkAnswer(
sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM inputTable"),
Seq(Row(4.0)))
}
  }
}

{code}
 


> Constant argument expecting Hive UDAFs doesn't work
> ---
>
> Key: SPARK-25768
> URL: https://issues.apache.org/jira/browse/SPARK-25768
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Peter Toth
>Priority: Major
>
>  
> The following doesn't work since -SPARK-18186:(-
> {code:java}
> test("constant argument expecting Hive UDAF") {
>   withTempView("inputTable") {
> spark.range(10).createOrReplaceTempView("inputTable")
> withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
>   val numFunc = spark.catalog.listFunctions().count()
>   sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
> s"${classOf[GenericUDAFPercentileApprox].getName}'")
>   checkAnswer(
> sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM 
> inputTable"),
> Seq(Row(4.0)))
> }
>   }
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25767) Error reported in Spark logs when using the org.apache.spark:spark-sql_2.11:2.3.2 Java library

2018-10-18 Thread Thomas Brugiere (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655374#comment-16655374
 ] 

Thomas Brugiere commented on SPARK-25767:
-

Hi Marco,

I noticed the bug using the jar coming from the Maven Central repository 
(http://central.maven.org/maven2/org/apache/spark/spark-sql_2.11/2.3.2/spark-sql_2.11-2.3.2.jar).

Find below the *build.gradle* I'm using for reference:
{code:java}
buildscript {
ext {
sparkVersion = '2.3.2'
}
}

plugins {
id 'java'
}

group 'spark'
version '1.0-SNAPSHOT'

sourceCompatibility = 1.8

repositories {
mavenCentral()
}

dependencies {
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: 
"${sparkVersion}"
}
{code}

> Error reported in Spark logs when using the 
> org.apache.spark:spark-sql_2.11:2.3.2 Java library
> --
>
> Key: SPARK-25767
> URL: https://issues.apache.org/jira/browse/SPARK-25767
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0, 2.3.2
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> Hi,
> Here is a bug I found using spark-sql_2.11:2.2.0. Note that this case was also 
> tested with spark-sql_2.11:2.3.2 and the bug is also present.
> This issue is a duplicate of SPARK-25582, which I had to close after an 
> accidental manipulation by another developer (it was linked to a wrong PR).
> You will find attached three small sample CSV files with the minimal content 
> needed to raise the bug.
> Find below the reproducer code:
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import scala.collection.JavaConverters;
> import scala.collection.Seq;
> import java.util.Arrays;
> public class SparkBug {
> private static <T> Seq<T> arrayToSeq(T[] input) {
> return 
> JavaConverters.asScalaIteratorConverter(Arrays.asList(input).iterator()).asScala().toSeq();
> }
> public static void main(String[] args) throws Exception {
> SparkConf conf = new 
> SparkConf().setAppName("SparkBug").setMaster("local");
> SparkSession sparkSession = 
> SparkSession.builder().config(conf).getOrCreate();
> Dataset<Row> df_a = sparkSession.read().option("header", 
> true).csv("local/fileA.csv").dropDuplicates();
> Dataset<Row> df_b = sparkSession.read().option("header", 
> true).csv("local/fileB.csv").dropDuplicates();
> Dataset<Row> df_c = sparkSession.read().option("header", 
> true).csv("local/fileC.csv").dropDuplicates();
> String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", 
> "colE", "colF"};
> String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", 
> "colE"};
> Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), 
> "left");
> Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, 
> arrayToSeq(key_join_2), "left");
> df_inventory_2.show();
> }
> }
> {code}
> When running this code, I can see the exception below:
> {code:java}
> 18/10/18 09:25:49 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> 

[jira] [Commented] (SPARK-20760) Memory Leak of RDD blocks

2018-10-18 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655369#comment-16655369
 ] 

Wenchen Fan commented on SPARK-20760:
-

Has this been fixed by https://issues.apache.org/jira/browse/SPARK-24889?

> Memory Leak of RDD blocks 
> --
>
> Key: SPARK-20760
> URL: https://issues.apache.org/jira/browse/SPARK-20760
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0
>Reporter: Binzi Cao
>Priority: Major
> Attachments: RDD Blocks .png, RDD blocks in spark 2.1.1.png, Storage 
> in spark 2.1.1.png
>
>
> Memory leak of RDD blocks in a long-running RDD process.
> We have a long-running application that does computations over RDDs, and we 
> found that the RDD blocks keep increasing in the Spark UI page. The RDD blocks 
> and memory usage do not match the cached RDDs and memory. It looks like Spark 
> keeps old RDDs in memory and never releases them, or never gets a chance to 
> release them. The job will eventually die of out-of-memory errors. 
> In addition, I'm not seeing this issue in Spark 1.6. We are seeing the same 
> issue in YARN cluster mode in both Kafka streaming and batch applications. 
> The issue in streaming is similar; however, the RDD blocks seem to grow a 
> bit slower than in batch jobs. 
> Below is the sample code; the issue is reproducible by just running it in 
> local mode. 
> Scala file:
> {code}
> import scala.concurrent.duration.Duration
> import scala.util.{Try, Failure, Success}
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import scala.concurrent._
> import ExecutionContext.Implicits.global
> case class Person(id: String, name: String)
> object RDDApp {
>   def run(sc: SparkContext) = {
> while (true) {
>   val r = scala.util.Random
>   val data = (1 to r.nextInt(100)).toList.map { a =>
> Person(a.toString, a.toString)
>   }
>   val rdd = sc.parallelize(data)
>   rdd.cache
>   println("running")
>   val a = (1 to 100).toList.map { x =>
> Future(rdd.filter(_.id == x.toString).collect)
>   }
>   a.foreach { f =>
> println(Await.ready(f, Duration.Inf).value.get)
>   }
>   rdd.unpersist()
> }
>   }
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("test")
> val sc   = new SparkContext(conf)
> run(sc)
>   }
> }
> {code}
> build sbt file:
> {code}
> name := "RDDTest"
> version := "0.1.1"
> scalaVersion := "2.11.5"
> libraryDependencies ++= Seq (
> "org.scalaz" %% "scalaz-core" % "7.2.0",
> "org.scalaz" %% "scalaz-concurrent" % "7.2.0",
> "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
> "org.apache.spark" % "spark-hive_2.11" % "2.1.0" % "provided"
>   )
> addCompilerPlugin("org.spire-math" %% "kind-projector" % "0.7.1")
> mainClass in assembly := Some("RDDApp")
> test in assembly := {}
> {code}
> To reproduce it: 
> Just 
> {code}
> spark-2.1.0-bin-hadoop2.7/bin/spark-submit   --driver-memory 4G \
> --executor-memory 4G \
> --executor-cores 1 \
> --num-executors 1 \
> --class "RDDApp" --master local[4] RDDTest-assembly-0.1.1.jar
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25767) Error reported in Spark logs when using the org.apache.spark:spark-sql_2.11:2.3.2 Java library

2018-10-18 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655366#comment-16655366
 ] 

Marco Gaido commented on SPARK-25767:
-

I tried on the current master branch but I wasn't able to reproduce it. Could 
you try it on the master branch too? Thanks.

> Error reported in Spark logs when using the 
> org.apache.spark:spark-sql_2.11:2.3.2 Java library
> --
>
> Key: SPARK-25767
> URL: https://issues.apache.org/jira/browse/SPARK-25767
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0, 2.3.2
>Reporter: Thomas Brugiere
>Priority: Major
> Attachments: fileA.csv, fileB.csv, fileC.csv
>
>
> Hi,
> Here is a bug I found using spark-sql_2.11:2.2.0. Note that this case was also 
> tested with spark-sql_2.11:2.3.2 and the bug is also present.
> This issue is a duplicate of SPARK-25582, which I had to close after an 
> accidental manipulation by another developer (it was linked to a wrong PR).
> You will find attached three small sample CSV files with the minimal content 
> needed to raise the bug.
> Find below the reproducer code:
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import scala.collection.JavaConverters;
> import scala.collection.Seq;
> import java.util.Arrays;
> public class SparkBug {
> private static <T> Seq<T> arrayToSeq(T[] input) {
> return 
> JavaConverters.asScalaIteratorConverter(Arrays.asList(input).iterator()).asScala().toSeq();
> }
> public static void main(String[] args) throws Exception {
> SparkConf conf = new 
> SparkConf().setAppName("SparkBug").setMaster("local");
> SparkSession sparkSession = 
> SparkSession.builder().config(conf).getOrCreate();
> Dataset<Row> df_a = sparkSession.read().option("header", 
> true).csv("local/fileA.csv").dropDuplicates();
> Dataset<Row> df_b = sparkSession.read().option("header", 
> true).csv("local/fileB.csv").dropDuplicates();
> Dataset<Row> df_c = sparkSession.read().option("header", 
> true).csv("local/fileC.csv").dropDuplicates();
> String[] key_join_1 = new String[]{"colA", "colB", "colC", "colD", 
> "colE", "colF"};
> String[] key_join_2 = new String[]{"colA", "colB", "colC", "colD", 
> "colE"};
> Dataset<Row> df_inventory_1 = df_a.join(df_b, arrayToSeq(key_join_1), 
> "left");
> Dataset<Row> df_inventory_2 = df_inventory_1.join(df_c, 
> arrayToSeq(key_join_2), "left");
> df_inventory_2.show();
> }
> }
> {code}
> When running this code, I can see the exception below:
> {code:java}
> 18/10/18 09:25:49 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 202, Column 18: Expression "agg_isNull_28" is not an rvalue
>     at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11821)
>     at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7170)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5332)
>     at org.codehaus.janino.UnitCompiler.access$9400(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$13$1.visitAmbiguousName(UnitCompiler.java:5287)
>     at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4053)
>     at org.codehaus.janino.UnitCompiler$13.visitLvalue(UnitCompiler.java:5284)
>     at org.codehaus.janino.Java$Lvalue.accept(Java.java:3977)
>     at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5280)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2391)
>     at org.codehaus.janino.UnitCompiler.access$1900(UnitCompiler.java:212)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1474)
>     at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1466)
>     at org.codehaus.janino.Java$IfStatement.accept(Java.java:2926)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1466)
>     at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1546)
>     at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3075)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1336)
>     at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1309)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:799)
>     at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:958)
>     at 

[jira] [Updated] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21402:

Fix Version/s: (was: 2.4.0)

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want, and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-18 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-21402:
-
  Assignee: (was: Vladimir Kuriatkov)

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want, and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-18 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655322#comment-16655322
 ] 

Wenchen Fan commented on SPARK-21402:
-

Oh, this ticket reports the bug for the map type. I'm reopening it. Can you 
create a ticket for the array type?

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Assignee: Vladimir Kuriatkov
>Priority: Major
> Fix For: 2.4.0
>
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want, and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-18 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655314#comment-16655314
 ] 

Wenchen Fan commented on SPARK-21402:
-

Can you create a ticket for the map case? Basically, what we need is to resolve 
the TODO at 
https://github.com/apache/spark/blob/v2.4.0-rc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L202
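
To make the swap concrete, here is a hedged illustration of the suspected root
cause (an assumption drawn from the discussion above, not a verified walkthrough
of the internals): Encoders.bean orders bean properties alphabetically, so
MyDTO's encoder schema comes out as (endTime, startTime), while the source
struct is (startTime, endTime); binding the two by position instead of by name
swaps the values.
{code:java}
// Hypothetical check, runnable in a Spark shell or test with MyDTO on the classpath.
// The printed schema is expected to list endTime before startTime (alphabetical
// order), which does not match the (startTime, endTime) order of the source struct.
Encoders.bean(classOf[MyDTO]).schema.printTreeString()
{code}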

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Assignee: Vladimir Kuriatkov
>Priority: Major
> Fix For: 2.4.0
>
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want, and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-18 Thread Vladimir Kuriatkov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655295#comment-16655295
 ] 

Vladimir Kuriatkov commented on SPARK-21402:


I guess only the array case has been fixed so far; the initial map case is 
still in progress.

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Assignee: Vladimir Kuriatkov
>Priority: Major
> Fix For: 2.4.0
>
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setters and getters, which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want, and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25769) UnresolvedAttribute.sql() incorrectly escapes nested columns

2018-10-18 Thread Simeon Simeonov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-25769:

Description: 
{{UnresolvedAttribute.sql()}} output is incorrectly escaped for nested columns
{code:java}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The correct output is a.b, without backticks, or `a`.`b`.
$"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
// res1: String = `a.b`

// Parsing is correct; the bug is localized to sql() 
$"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
// res2: Seq[String] = ArrayBuffer(a, b)
{code}
The likely culprit is that the {{sql()}} implementation does not check for 
{{nameParts}} being non-empty.
{code:java}
override def sql: String = name match { 
  case ParserUtils.escapedIdentifier(_) | 
ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
  case _ => quoteIdentifier(name) 
}
{code}
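A minimal sketch of one possible fix (an assumption, not the actual patch; it
simply follows the nameParts observation above and assumes quoteIdentifier is
the right escaping helper for each individual part):
{code:java}
// Quote every name part separately and join with '.', so the nested column
// a.b renders as `a`.`b` rather than as the single escaped identifier `a.b`.
override def sql: String = nameParts.map(quoteIdentifier).mkString(".")
{code}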
 

  was:
 This issue affects dynamic SQL generation that relies on {{sql()}}.
{code:java}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The correct output is a.b, without backticks, or `a`.`b`.
$"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
// res2: String = `a.b`

// Parsing is correct; the bug is localized to sql() 
$"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
// res1: Seq[String] = ArrayBuffer(a, b)
{code}
The likely culprit is that the {{sql()}} implementation does not check for 
{{nameParts}} being non-empty.
{code:java}
override def sql: String = name match { 
  case ParserUtils.escapedIdentifier(_) | 
ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
  case _ => quoteIdentifier(name) 
}
{code}
 


> UnresolvedAttribute.sql() incorrectly escapes nested columns
> 
>
> Key: SPARK-25769
> URL: https://issues.apache.org/jira/browse/SPARK-25769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: sql
>
> {{UnresolvedAttribute.sql()}} output is incorrectly escaped for nested columns
> {code:java}
> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
> // The correct output is a.b, without backticks, or `a`.`b`.
> $"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
> // res1: String = `a.b`
> // Parsing is correct; the bug is localized to sql() 
> $"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
> // res2: Seq[String] = ArrayBuffer(a, b)
> {code}
> The likely culprit is that the {{sql()}} implementation does not check for 
> {{nameParts}} being non-empty.
> {code:java}
> override def sql: String = name match { 
>   case ParserUtils.escapedIdentifier(_) | 
> ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
>   case _ => quoteIdentifier(name) 
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25769) UnresolvedAttribute.sql() incorrectly escapes nested columns

2018-10-18 Thread Simeon Simeonov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-25769:

Description: 
 This issue affects dynamic SQL generation that relies on {{sql()}}.
{code:java}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The correct output is a.b, without backticks, or `a`.`b`.
$"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
// res2: String = `a.b`

// Parsing is correct; the bug is localized to sql() 
$"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
// res1: Seq[String] = ArrayBuffer(a, b)
{code}
The likely culprit is that the {{sql()}} implementation does not check for 
{{nameParts}} being non-empty.
{code:java}
override def sql: String = name match { 
  case ParserUtils.escapedIdentifier(_) | 
ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
  case _ => quoteIdentifier(name) 
}
{code}
 

  was:
 

This issue affects dynamic SQL generation that relies on {{sql()}}.

 
{code:java}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The correct output is a.b, without backticks, or `a`.`b`.
$"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
// res2: String = `a.b`

// Parsing is correct; the bug is localized to sql() 
$"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
// res1: Seq[String] = ArrayBuffer(a, b)
{code}
 

The likely culprit is that the {{sql()}} implementation does not check for 
{{nameParts}} being non-empty.

 

 
{code:java}
override def sql: String = name match { 
  case ParserUtils.escapedIdentifier(_) | 
ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
  case _ => quoteIdentifier(name) 
}
{code}
 


> UnresolvedAttribute.sql() incorrectly escapes nested columns
> 
>
> Key: SPARK-25769
> URL: https://issues.apache.org/jira/browse/SPARK-25769
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: sql
>
>  This issue affects dynamic SQL generation that relies on {{sql()}}.
> {code:java}
> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
> // The correct output is a.b, without backticks, or `a`.`b`.
> $"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
> // res2: String = `a.b`
> // Parsing is correct; the bug is localized to sql() 
> $"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
> // res1: Seq[String] = ArrayBuffer(a, b)
> {code}
> The likely culprit is that the {{sql()}} implementation does not check for 
> {{nameParts}} being non-empty.
> {code:java}
> override def sql: String = name match { 
>   case ParserUtils.escapedIdentifier(_) | 
> ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
>   case _ => quoteIdentifier(name) 
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25769) UnresolvedAttribute.sql() incorrectly escapes nested columns

2018-10-18 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-25769:
---

 Summary: UnresolvedAttribute.sql() incorrectly escapes nested 
columns
 Key: SPARK-25769
 URL: https://issues.apache.org/jira/browse/SPARK-25769
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
Reporter: Simeon Simeonov


 

This issue affects dynamic SQL generation that relies on {{sql()}}.

 
{code:java}
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The correct output is a.b, without backticks, or `a`.`b`.
$"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
// res2: String = `a.b`

// Parsing is correct; the bug is localized to sql() 
$"a.b".expr.asInstanceOf[UnresolvedAttribute].nameParts 
// res1: Seq[String] = ArrayBuffer(a, b)
{code}
 

The likely culprit is that the {{sql()}} implementation does not check for 
{{nameParts}} being non-empty.

 

 
{code:java}
override def sql: String = name match { 
  case ParserUtils.escapedIdentifier(_) | 
ParserUtils.qualifiedEscapedIdentifier(_, _) => name 
  case _ => quoteIdentifier(name) 
}
{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25768) Constant argument expecting Hive UDAFs doesn't work

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655255#comment-16655255
 ] 

Apache Spark commented on SPARK-25768:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/22766

> Constant argument expecting Hive UDAFs doesn't work
> ---
>
> Key: SPARK-25768
> URL: https://issues.apache.org/jira/browse/SPARK-25768
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Peter Toth
>Priority: Major
>
>  
> The following doesn't work since SPARK-18186:
> {code:java}
> test("constant argument expecting Hive UDAF") {
>   val testData = spark.range(10).toDF()
>   withTempView("inputTable") {
> testData.createOrReplaceTempView("inputTable")
> withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
>   val numFunc = spark.catalog.listFunctions().count()
>   sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
> s"${classOf[GenericUDAFPercentileApprox].getName}'")
>   checkAnswer(
> sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM 
> inputTable"),
> Seq(Row(4.0)))
> }
>   }
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25768) Constant argument expecting Hive UDAFs doesn't work

2018-10-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25768:


Assignee: (was: Apache Spark)

> Constant argument expecting Hive UDAFs doesn't work
> ---
>
> Key: SPARK-25768
> URL: https://issues.apache.org/jira/browse/SPARK-25768
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Peter Toth
>Priority: Major
>
>  
> The following doesn't work since SPARK-18186:
> {code:java}
> test("constant argument expecting Hive UDAF") {
>   val testData = spark.range(10).toDF()
>   withTempView("inputTable") {
> testData.createOrReplaceTempView("inputTable")
> withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
>   val numFunc = spark.catalog.listFunctions().count()
>   sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
> s"${classOf[GenericUDAFPercentileApprox].getName}'")
>   checkAnswer(
> sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM 
> inputTable"),
> Seq(Row(4.0)))
> }
>   }
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25768) Constant argument expecting Hive UDAFs doesn't work

2018-10-18 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655254#comment-16655254
 ] 

Apache Spark commented on SPARK-25768:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/22766

> Constant argument expecting Hive UDAFs doesn't work
> ---
>
> Key: SPARK-25768
> URL: https://issues.apache.org/jira/browse/SPARK-25768
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Peter Toth
>Priority: Major
>
>  
> The following doesn't work since SPARK-18186:
> {code:java}
> test("constant argument expecting Hive UDAF") {
>   val testData = spark.range(10).toDF()
>   withTempView("inputTable") {
> testData.createOrReplaceTempView("inputTable")
> withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
>   val numFunc = spark.catalog.listFunctions().count()
>   sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
> s"${classOf[GenericUDAFPercentileApprox].getName}'")
>   checkAnswer(
> sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM 
> inputTable"),
> Seq(Row(4.0)))
> }
>   }
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25768) Constant argument expecting Hive UDAFs doesn't work

2018-10-18 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25768:


Assignee: Apache Spark

> Constant argument expecting Hive UDAFs doesn't work
> ---
>
> Key: SPARK-25768
> URL: https://issues.apache.org/jira/browse/SPARK-25768
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Peter Toth
>Assignee: Apache Spark
>Priority: Major
>
>  
> The following doesn't work since SPARK-18186:
> {code:java}
> test("constant argument expecting Hive UDAF") {
>   val testData = spark.range(10).toDF()
>   withTempView("inputTable") {
> testData.createOrReplaceTempView("inputTable")
> withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
>   val numFunc = spark.catalog.listFunctions().count()
>   sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
> s"${classOf[GenericUDAFPercentileApprox].getName}'")
>   checkAnswer(
> sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM 
> inputTable"),
> Seq(Row(4.0)))
> }
>   }
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25768) Constant argument expecting Hive UDAFs doesn't work

2018-10-18 Thread Peter Toth (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-25768:
---
Description: 
 

The following doesn't work since SPARK-18186:
{code:java}
test("constant argument expecting Hive UDAF") {
  val testData = spark.range(10).toDF()
  withTempView("inputTable") {
testData.createOrReplaceTempView("inputTable")
withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
  val numFunc = spark.catalog.listFunctions().count()
  sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
s"${classOf[GenericUDAFPercentileApprox].getName}'")
  checkAnswer(
sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM inputTable"),
Seq(Row(4.0)))
}
  }
}

{code}
 

  was:
 

The following doesn't work since SPARK-18186:
{code:java}
test("constant argument expecting Hive UDF") {
  val testData = spark.range(10).toDF()
  withTempView("inputTable") {
testData.createOrReplaceTempView("inputTable")
withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
  val numFunc = spark.catalog.listFunctions().count()
  sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
s"${classOf[GenericUDAFPercentileApprox].getName}'")
  checkAnswer(
sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM inputTable"),
Seq(Row(4.0)))
}
  }
}

{code}
 


> Constant argument expecting Hive UDAFs doesn't work
> ---
>
> Key: SPARK-25768
> URL: https://issues.apache.org/jira/browse/SPARK-25768
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Peter Toth
>Priority: Major
>
>  
> The following doesn't work since SPARK-18186:
> {code:java}
> test("constant argument expecting Hive UDAF") {
>   val testData = spark.range(10).toDF()
>   withTempView("inputTable") {
> testData.createOrReplaceTempView("inputTable")
> withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
>   val numFunc = spark.catalog.listFunctions().count()
>   sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
> s"${classOf[GenericUDAFPercentileApprox].getName}'")
>   checkAnswer(
> sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM 
> inputTable"),
> Seq(Row(4.0)))
> }
>   }
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25768) Constant argument expecting Hive UDAFs doesn't work

2018-10-18 Thread Peter Toth (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-25768:
---
Summary: Constant argument expecting Hive UDAFs doesn't work  (was: 
Constant argument expecting Hive UDFs doesn't work)

> Constant argument expecting Hive UDAFs doesn't work
> ---
>
> Key: SPARK-25768
> URL: https://issues.apache.org/jira/browse/SPARK-25768
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Peter Toth
>Priority: Major
>
>  
> The following doesn't work since SPARK-18186:
> {code:java}
> test("constant argument expecting Hive UDF") {
>   val testData = spark.range(10).toDF()
>   withTempView("inputTable") {
> testData.createOrReplaceTempView("inputTable")
> withUserDefinedFunction("testGenericUDAFPercentileApprox" -> false) {
>   val numFunc = spark.catalog.listFunctions().count()
>   sql(s"CREATE FUNCTION testGenericUDAFPercentileApprox AS '" +
> s"${classOf[GenericUDAFPercentileApprox].getName}'")
>   checkAnswer(
> sql("SELECT testGenericUDAFPercentileApprox(id, 0.5) FROM 
> inputTable"),
> Seq(Row(4.0)))
> }
>   }
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


