[jira] [Created] (SPARK-15448) Flaky test:pyspark.ml.tests.DefaultValuesTests.test_java_params

2016-05-20 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15448:
--

 Summary: Flaky 
test:pyspark.ml.tests.DefaultValuesTests.test_java_params
 Key: SPARK-15448
 URL: https://issues.apache.org/jira/browse/SPARK-15448
 Project: Spark
  Issue Type: Test
Affects Versions: 2.0.0
Reporter: Davies Liu


{code}

==
FAIL [1.284s]: test_java_params (pyspark.ml.tests.DefaultValuesTests)
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/ml/tests.py",
 line 1161, in test_java_params
self.check_params(cls())
  File 
"/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/ml/tests.py",
 line 1136, in check_params
% (p.name, str(py_stage)))
AssertionError: True != False : Default value mismatch of param 
linkPredictionCol for Params GeneralizedLinearRegression_4a78b84aab05b0ed2192

--
{code}
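
The failing check appears to compare the default values declared on the Python
wrapper with its Java counterpart. A minimal sketch of inspecting the
Python-side defaults (illustrative only, run inside a pyspark session; this is
not the actual tests.py code):

{code}
# Illustrative sketch: list the defaults declared on the Python wrapper,
# which is one side of the comparison that test_java_params performs.
from pyspark.ml.regression import GeneralizedLinearRegression

glr = GeneralizedLinearRegression()
for p in glr.params:
    if glr.hasDefault(p):
        print(p.name, glr.getOrDefault(p))
{code}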






[jira] [Created] (SPARK-15438) Improve the explain of whole-stage codegen

2016-05-19 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15438:
--

 Summary: Improve the explain of whole-stage codegen
 Key: SPARK-15438
 URL: https://issues.apache.org/jira/browse/SPARK-15438
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Davies Liu
Assignee: Davies Liu


Currently, the explain of a query with whole-stage codegen looks like this
{code}
>>> df = sqlCtx.range(1000);df2 = 
>>> sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 
>>> 'id').explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [id#1L]
: +- BroadcastHashJoin [id#1L], [id#4L], Inner, BuildRight, None
::- Range 0, 1, 4, 1000, [id#1L]
:+- INPUT
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
   +- WholeStageCodegen
  :  +- Range 0, 1, 4, 1000, [id#4L]
{code}
The problem is that the plan looks very different from the logical plan, which 
makes it hard to understand (especially when the logical plan is not shown 
alongside it).
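
For comparison, the extended explain output already prints the logical plans
alongside the physical plan; a minimal sketch, assuming the same pyspark
session as the snippet above:

{code}
>>> # explain(True) prints the parsed/analyzed/optimized logical plans
>>> # together with the physical plan, which makes the two easier to relate.
>>> df = sqlCtx.range(1000)
>>> df2 = sqlCtx.range(1000)
>>> df.join(pyspark.sql.functions.broadcast(df2), 'id').explain(True)
{code}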






[jira] [Created] (SPARK-15432) Two executors with same id in Spark UI

2016-05-19 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15432:
--

 Summary: Two executors with same id in Spark UI
 Key: SPARK-15432
 URL: https://issues.apache.org/jira/browse/SPARK-15432
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Davies Liu


Both of them are dead.

{code}
56  10.0.245.96:50929   Dead0   0.0 B / 15.3 GB 0.0 B   4   
0   0   0   0   0 ms (0 ms) 0.0 B   0.0 B   0.0 B   
stdout
stderr
56  10.0.245.96:50929   Dead0   0.0 B / 15.3 GB 0.0 B   4   
0   0   0   0   0 ms (0 ms) 0.0 B   0.0 B   0.0 B   
stdout
stderr

{code}






[jira] [Updated] (SPARK-14959) ​Problem Reading partitioned ORC or Parquet files

2016-05-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14959:
---
Priority: Blocker  (was: Major)

> ​Problem Reading partitioned ORC or Parquet files
> -
>
> Key: SPARK-14959
> URL: https://issues.apache.org/jira/browse/SPARK-14959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Hadoop 2.7.1.2.4.0.0-169 (HDP 2.4)
>Reporter: Sebastian YEPES FERNANDEZ
>Priority: Blocker
>
> Hello,
> I have noticed that in the past few days there is an issue when trying to 
> read partitioned files from HDFS.
> I am running on the Spark master branch (commit #c544356).
> The write actually works, but the read fails.
> {code:title=Issue Reproduction}
> case class Data(id: Int, text: String)
> val ds = spark.createDataset( Seq(Data(0, "hello"), Data(1, "hello"), Data(0, 
> "world"), Data(1, "there")) )
> scala> 
> ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("parquet").partitionBy("id").save("/user/spark/test.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".  
>   
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> java.io.FileNotFoundException: Path is not a file: 
> /user/spark/test.parquet/id=0
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242)
>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1285)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:228)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
>   at 
> org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:372)
>   at 
> org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:360)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:3
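
For reference, the reproduction above only shows the write step; a read that
would hit the reported failure might look like the following (hypothetical,
shown in PySpark rather than the Scala shell used above):

{code}
# Hypothetical read of the partitioned output written above; this is the step
# the report says fails with "Path is not a file: /user/spark/test.parquet/id=0".
df = spark.read.parquet("/user/spark/test.parquet")
df.show()
{code}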

[jira] [Updated] (SPARK-14959) ​Problem Reading partitioned ORC or Parquet files

2016-05-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14959:
---
Target Version/s: 2.0.0

> ​Problem Reading partitioned ORC or Parquet files
> -
>
> Key: SPARK-14959
> URL: https://issues.apache.org/jira/browse/SPARK-14959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Hadoop 2.7.1.2.4.0.0-169 (HDP 2.4)
>Reporter: Sebastian YEPES FERNANDEZ
>Priority: Blocker
>
> Hello,
> I have noticed that in the past few days there is an issue when trying to 
> read partitioned files from HDFS.
> I am running on the Spark master branch (commit #c544356).
> The write actually works, but the read fails.
> {code:title=Issue Reproduction}
> case class Data(id: Int, text: String)
> val ds = spark.createDataset( Seq(Data(0, "hello"), Data(1, "hello"), Data(0, 
> "world"), Data(1, "there")) )
> scala> 
> ds.write.mode(org.apache.spark.sql.SaveMode.Overwrite).format("parquet").partitionBy("id").save("/user/spark/test.parquet")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".  
>   
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> java.io.FileNotFoundException: Path is not a file: 
> /user/spark/test.parquet/id=0
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>   at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>   at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242)
>   at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
>   at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1285)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:221)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:217)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:228)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:209)
>   at 
> org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:372)
>   at 
> org.apache.spark.sql.execution.datasources.HDFSFileCatalog$$anonfun$9$$anonfun$apply$4.apply(fileSourceInterfaces.scala:360)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   a

[jira] [Updated] (SPARK-15396) [Spark] [SQL] [DOC] It can't connect hive metastore database

2016-05-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15396:
---
Target Version/s: 2.0.0
  Issue Type: Documentation  (was: Bug)
 Summary: [Spark] [SQL] [DOC] It can't connect hive metastore 
database  (was: [Spark] [SQL] It can't connect hive metastore database)

> [Spark] [SQL] [DOC] It can't connect hive metastore database
> 
>
> Key: SPARK-15396
> URL: https://issues.apache.org/jira/browse/SPARK-15396
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yi Zhou
>Priority: Critical
>
> I am trying to run Spark SQL using bin/spark-sql with Spark 2.0 master 
> code (commit ba181c0c7a32b0e81bbcdbe5eed94fc97b58c83e), but ran across an 
> issue: it always connects to a local Derby database and can't connect to my 
> existing Hive metastore database. Could you help me check the root cause? 
> What is the specific configuration for integrating with the Hive metastore 
> in Spark 2.0? BTW, this case is OK in Spark 1.6. Thanks in advance!
> Build package command:
> ./dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0-cdh5.5.1 -Phive -Phive-thriftserver -DskipTests
> Key configurations in spark-defaults.conf:
> {code}
> spark.sql.hive.metastore.version=1.1.0
> spark.sql.hive.metastore.jars=/usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
> spark.executor.extraClassPath=/etc/hive/conf
> spark.driver.extraClassPath=/etc/hive/conf
> spark.yarn.jars=local:/usr/lib/spark/jars/*
> {code}
> There is an existing Hive metastore database named "test_sparksql". I always 
> get the error "metastore.ObjectStore: Failed to get database test_sparksql, 
> returning NoSuchObjectException" after issuing 'use test_sparksql'. Please 
> see the steps below for details.
>  
> $ /usr/lib/spark/bin/spark-sql --master yarn --deploy-mode client
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/05/12 22:23:28 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:30 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 16/05/12 22:23:30 INFO metastore.ObjectStore: ObjectStore, initialize called
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.store.rdbms" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-rdbms-3.2.9.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" 
> is already registered. Ensure you dont have multiple JAR versions of the same 
> plugin in the classpath. The URL 
> "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already registered, 
> and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/spark/jars/datanucleus-core-3.2.10.jar."
> 16/05/12 22:23:30 WARN DataNucleus.General: Plugin (Bundle) 
> "org.datanucleus.api.jdo" is already registered. Ensure you dont have 
> multiple JAR versions of the same plugin in the classpath. The URL 
> "file:/usr/lib/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already 
> registered, and you are trying to register an identical plugin located at URL 
> "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> datanucleus.cache.level2 unknown - will be ignored
> 16/05/12 22:23:30 INFO DataNucleus.Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 16/05/12 22:23:31 WARN conf.HiveConf: HiveConf of name 
> hive.enable.spark.execution.engine does not exist
> 16/05/12 22:23:31 INFO metastore.ObjectStore: Setting MetaStore object pin 
> classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 16/05/12 22:23:32 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table

[jira] [Updated] (SPARK-15393) Writing empty Dataframes doesn't save any _metadata files

2016-05-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15393:
---
Priority: Critical  (was: Major)

> Writing empty Dataframes doesn't save any _metadata files
> -
>
> Key: SPARK-15393
> URL: https://issues.apache.org/jira/browse/SPARK-15393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Priority: Critical
>
> Writing empty DataFrames is broken on the latest master.
> It omits the metadata and sometimes throws the following exception (when 
> saving as Parquet):
> {code}
> 8-May-2016 22:37:14 WARNING: 
> org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary 
> file for file:/some/test/file
> java.lang.NullPointerException
> at 
> org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
> at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:220)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:144)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:115)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:252)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:234)
> at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:626)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> It only saves an _SUCCESS file (which is also incorrect behaviour, because it 
> raised an exception).
> This means that loading it again will result in the following error:
> {code}
> Unable to infer schema for ParquetFormat at /some/test/file. It must be 
> specified manually;'
> {code}
> It looks like this problem was introduced in 
> https://github.com/apache/spark/pull/12855 (SPARK-10216).
> After reverting those changes I could save the empty dataframe as parquet and 
> load it again without Spark complaining or throwing any exceptions.
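
A minimal PySpark reproduction of the behaviour described above could look
like this (illustrative only; the path and the sqlCtx handle are assumptions):

{code}
# Write an empty DataFrame as Parquet, then try to read it back.
empty_df = sqlCtx.range(1000).filter("id < 0")      # an empty DataFrame
empty_df.write.mode("overwrite").parquet("/some/test/file")
# Reading it back fails with "Unable to infer schema for ParquetFormat ..."
sqlCtx.read.parquet("/some/test/file").show()
{code}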




[jira] [Commented] (SPARK-15332) OutOfMemory in TimSort

2016-05-19 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15292143#comment-15292143
 ] 

Davies Liu commented on SPARK-15332:


It only happens in some corner cases and could not be reproduced easily, so it 
may not be targeted for 2.0.

> OutOfMemory in TimSort 
> ---
>
> Key: SPARK-15332
> URL: https://issues.apache.org/jira/browse/SPARK-15332
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Davies Liu
>
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o154.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 
> in stage 230.0 failed 1 times, most recent failure: Lost task 1.0 in stage 
> 230.0 (TID 1881, localhost): java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:88)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:32)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.ensureCapacity(TimSort.java:951)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:699)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:285)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:199)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:363)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:378)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:92)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:357)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:736)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.(SortMergeJoinExec.scala:611)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.(SortMergeJoinExec.scala:206)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:204)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:100)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> {code}






[jira] [Updated] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results

2016-05-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14343:
---
Priority: Critical  (was: Major)

> Dataframe operations on a partitioned dataset (using partition discovery) 
> return invalid results
> 
>
> Key: SPARK-14343
> URL: https://issues.apache.org/jira/browse/SPARK-14343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
> Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS
>Reporter: Jurriaan Pruis
>Priority: Critical
>
> When reading a dataset using {{sqlContext.read.text()}}, queries on the 
> partitioned column return invalid results.
> h2. How to reproduce:
> h3. Generate datasets
> {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--++
> | value|year|
> +--++
> |data from 2014|2014|
> |data from 2015|2015|
> +--++
> >>> df.select('year').show()
> ++
> |year|
> ++
> |  14|
> |  14|
> ++
> {code}
> This is clearly wrong. Seems like it returns the length of the value column?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--+-+
> | value|month|
> +--+-+
> |data from june| june|
> |data from july| july|
> +--+-+
> >>> df.select('month').show()
> +--+
> | month|
> +--+
> |data from june|
> |data from july|
> +--+
> {code}
> Here it returns the value of the value column instead of the month partition.
> h3. Workaround
> When I convert the dataframe to an RDD and back to a DataFrame I get the 
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-+
> |month|
> +-+
> | june|
> | july|
> +-+
> {code}






[jira] [Commented] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results

2016-05-19 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15292139#comment-15292139
 ] 

Davies Liu commented on SPARK-14343:


[~jurriaanpruis] Since you fixed SPARK-14463, could you double-check this 
one?

> Dataframe operations on a partitioned dataset (using partition discovery) 
> return invalid results
> 
>
> Key: SPARK-14343
> URL: https://issues.apache.org/jira/browse/SPARK-14343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
> Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS
>Reporter: Jurriaan Pruis
>
> When reading a dataset using {{sqlContext.read.text()}}, queries on the 
> partitioned column return invalid results.
> h2. How to reproduce:
> h3. Generate datasets
> {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--++
> | value|year|
> +--++
> |data from 2014|2014|
> |data from 2015|2015|
> +--++
> >>> df.select('year').show()
> ++
> |year|
> ++
> |  14|
> |  14|
> ++
> {code}
> This is clearly wrong. Seems like it returns the length of the value column?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--+-+
> | value|month|
> +--+-+
> |data from june| june|
> |data from july| july|
> +--+-+
> >>> df.select('month').show()
> +--+
> | month|
> +--+
> |data from june|
> |data from july|
> +--+
> {code}
> Here it returns the value of the value column instead of the month partition.
> h3. Workaround
> When I convert the dataframe to an RDD and back to a DataFrame I get the 
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-+
> |month|
> +-+
> | june|
> | july|
> +-+
> {code}






[jira] [Closed] (SPARK-15415) Marking partitions for broadcast broken

2016-05-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-15415.
--
Resolution: Won't Fix
  Assignee: Davies Liu

> Marking partitions for broadcast broken
> ---
>
> Key: SPARK-15415
> URL: https://issues.apache.org/jira/browse/SPARK-15415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Assignee: Davies Liu
>
> I couldn't get the broadcast(DataFrame) sql function to work in Spark 2.0.
> It does work in Spark 1.6.1:
> {code}
> $ pyspark --conf spark.sql.autoBroadcastJoinThreshold=0
> >>> df = sqlCtx.range(1000);df2 = 
> >>> sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 
> >>> 'id').explain()
> == Physical Plan ==
> Project [id#0L]
> +- BroadcastHashJoin [id#0L], [id#1L], BuildRight
>:- ConvertToUnsafe
>:  +- Scan ExistingRDD[id#0L]
>+- ConvertToUnsafe
>   +- Scan ExistingRDD[id#1L]
> {code}
> While in Spark 2.0 this results in:
> {code}
> >>> df = sqlCtx.range(1000);df2 = 
> >>> sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 
> >>> 'id').explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- Project [id#6L]
> : +- SortMergeJoin [id#6L], [id#9L], Inner, None
> ::- INPUT
> :+- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [id#6L ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(id#6L, 200), None
> : +- WholeStageCodegen
> ::  +- Range 0, 1, 8, 1000, [id#6L]
> +- WholeStageCodegen
>:  +- Sort [id#9L ASC], false, 0
>: +- INPUT
>+- ReusedExchange [id#9L], Exchange hashpartitioning(id#6L, 200), None
> {code}
>  
> While it should look like (output when you remove the 
> spark.sql.autoBroadcastJoinThreshold conf):
> {code}
> == Physical Plan ==
> WholeStageCodegen
> :  +- Project [id#0L]
> : +- BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight, None
> ::- Range 0, 1, 8, 1000, [id#0L]
> :+- INPUT
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
>+- WholeStageCodegen
>   :  +- Range 0, 1, 8, 1000, [id#3L]
> {code}






[jira] [Commented] (SPARK-15415) Marking partitions for broadcast broken

2016-05-19 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15292134#comment-15292134
 ] 

Davies Liu commented on SPARK-15415:


[~jurriaanpruis] The implementation of broadcast() was changed in 2.0, so it 
will not work with spark.sql.autoBroadcastJoinThreshold=0. 
spark.sql.autoBroadcastJoinThreshold should be at least 1 for broadcast() to 
take effect; otherwise no broadcast join will be used.
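
In other words, keeping the threshold positive lets the explicit hint take
effect; a minimal sketch based on the comment above (the value 1 is only an
example):

{code}
$ pyspark --conf spark.sql.autoBroadcastJoinThreshold=1
>>> df = sqlCtx.range(1000); df2 = sqlCtx.range(1000)
>>> df.join(pyspark.sql.functions.broadcast(df2), 'id').explain()
{code}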

> Marking partitions for broadcast broken
> ---
>
> Key: SPARK-15415
> URL: https://issues.apache.org/jira/browse/SPARK-15415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>
> I couldn't get the broadcast(DataFrame) sql function to work in Spark 2.0.
> It does work in Spark 1.6.1:
> {code}
> $ pyspark --conf spark.sql.autoBroadcastJoinThreshold=0
> >>> df = sqlCtx.range(1000);df2 = 
> >>> sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 
> >>> 'id').explain()
> == Physical Plan ==
> Project [id#0L]
> +- BroadcastHashJoin [id#0L], [id#1L], BuildRight
>:- ConvertToUnsafe
>:  +- Scan ExistingRDD[id#0L]
>+- ConvertToUnsafe
>   +- Scan ExistingRDD[id#1L]
> {code}
> While in Spark 2.0 this results in:
> {code}
> >>> df = sqlCtx.range(1000);df2 = 
> >>> sqlCtx.range(1000);df.join(pyspark.sql.functions.broadcast(df2), 
> >>> 'id').explain()
> == Physical Plan ==
> WholeStageCodegen
> :  +- Project [id#6L]
> : +- SortMergeJoin [id#6L], [id#9L], Inner, None
> ::- INPUT
> :+- INPUT
> :- WholeStageCodegen
> :  :  +- Sort [id#6L ASC], false, 0
> :  : +- INPUT
> :  +- Exchange hashpartitioning(id#6L, 200), None
> : +- WholeStageCodegen
> ::  +- Range 0, 1, 8, 1000, [id#6L]
> +- WholeStageCodegen
>:  +- Sort [id#9L ASC], false, 0
>: +- INPUT
>+- ReusedExchange [id#9L], Exchange hashpartitioning(id#6L, 200), None
> {code}
>  
> While it should look like (output when you remove the 
> spark.sql.autoBroadcastJoinThreshold conf):
> {code}
> == Physical Plan ==
> WholeStageCodegen
> :  +- Project [id#0L]
> : +- BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight, None
> ::- Range 0, 1, 8, 1000, [id#0L]
> :+- INPUT
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
>+- WholeStageCodegen
>   :  +- Range 0, 1, 8, 1000, [id#3L]
> {code}






[jira] [Updated] (SPARK-13513) add some tests for leap year handling in catalyst

2016-05-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13513:
---
Affects Version/s: (was: 2.0.0)

> add some tests for leap year handling in catalyst
> -
>
> Key: SPARK-13513
> URL: https://issues.apache.org/jira/browse/SPARK-13513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Steve Loughran
>
> Add some basic tests for SQLDate and SQLCatalyst to verify that Feb 29 works.
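
A minimal illustration of the kind of check intended (a sketch in PySpark
rather than the catalyst test suites named above; sqlCtx is assumed):

{code}
# Round-trip Feb 29 of a leap year through a DataFrame and make sure the
# date comes back unchanged.
import datetime

df = sqlCtx.createDataFrame([(datetime.date(2016, 2, 29),)], ["d"])
assert df.first().d == datetime.date(2016, 2, 29)
{code}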






[jira] [Updated] (SPARK-13513) add some tests for leap year handling in catalyst

2016-05-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13513:
---
Priority: Minor  (was: Major)

> add some tests for leap year handling in catalyst
> -
>
> Key: SPARK-13513
> URL: https://issues.apache.org/jira/browse/SPARK-13513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Steve Loughran
>Priority: Minor
>
> Add some basic tests for SQLDate and SQLCatalyst to verify that Feb 29 works.






[jira] [Updated] (SPARK-15390) Memory management issue in complex DataFrame join and filter

2016-05-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15390:
---
Assignee: Davies Liu

> Memory management issue in complex DataFrame join and filter
> 
>
> Key: SPARK-15390
> URL: https://issues.apache.org/jira/browse/SPARK-15390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: branch-2.0, 16 workers
>Reporter: Joseph K. Bradley
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> See [SPARK-15389] for a description of the code which produces this bug. I 
> am filing this as a separate JIRA since the bug in 2.0 is different.
> In 2.0, the code fails with a memory management error. Here is the 
> stack trace:
> {code}
> OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support 
> was removed in 8.0
> 16/05/18 19:23:16 ERROR Uncaught throwable from user code: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange SinglePartition, None
> +- WholeStageCodegen
>:  +- TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#170L])
>: +- Project
>:+- BroadcastHashJoin [id#70L], [id#110L], Inner, BuildLeft, None
>:   :- INPUT
>:   +- Project [id#110L]
>:  +- Filter (degree#115 > 200)
>: +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[id#110L,degree#115])
>:+- INPUT
>:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
>:  +- WholeStageCodegen
>: :  +- Project [row#66.id AS id#70L]
>: : +- Filter isnotnull(row#66.id)
>: :+- INPUT
>: +- Scan ExistingRDD[row#66,uniq_id#67]
>+- Exchange hashpartitioning(id#110L, 200), None
>   +- WholeStageCodegen
>  :  +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[id#110L,count#136L])
>  : +- Filter isnotnull(id#110L)
>  :+- INPUT
>  +- Generate explode(array(src#2L, dst#3L)), false, false, [id#110L]
> +- WholeStageCodegen
>:  +- Filter ((isnotnull(src#2L) && isnotnull(dst#3L)) && NOT 
> (src#2L = dst#3L))
>: +- INPUT
>+- InMemoryTableScan [src#2L,dst#3L], 
> [isnotnull(src#2L),isnotnull(dst#3L),NOT (src#2L = dst#3L)], InMemoryRelation 
> [src#2L,dst#3L], true, 1, StorageLevel(disk=true, memory=true, 
> offheap=false, deserialized=true, replication=1), WholeStageCodegen, None
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.inputRDDs(TungstenAggregate.scala:134)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:348)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:287)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2122)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withN

[jira] [Resolved] (SPARK-15390) Memory management issue in complex DataFrame join and filter

2016-05-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15390.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13182
[https://github.com/apache/spark/pull/13182]

> Memory management issue in complex DataFrame join and filter
> 
>
> Key: SPARK-15390
> URL: https://issues.apache.org/jira/browse/SPARK-15390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: branch-2.0, 16 workers
>Reporter: Joseph K. Bradley
> Fix For: 2.0.0
>
>
> See [SPARK-15389] for a description of the code which produces this bug. I 
> am filing this as a separate JIRA since the bug in 2.0 is different.
> In 2.0, the code fails with a memory management error. Here is the 
> stack trace:
> {code}
> OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support 
> was removed in 8.0
> 16/05/18 19:23:16 ERROR Uncaught throwable from user code: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange SinglePartition, None
> +- WholeStageCodegen
>:  +- TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#170L])
>: +- Project
>:+- BroadcastHashJoin [id#70L], [id#110L], Inner, BuildLeft, None
>:   :- INPUT
>:   +- Project [id#110L]
>:  +- Filter (degree#115 > 200)
>: +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Final,isDistinct=false)], 
> output=[id#110L,degree#115])
>:+- INPUT
>:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint]))
>:  +- WholeStageCodegen
>: :  +- Project [row#66.id AS id#70L]
>: : +- Filter isnotnull(row#66.id)
>: :+- INPUT
>: +- Scan ExistingRDD[row#66,uniq_id#67]
>+- Exchange hashpartitioning(id#110L, 200), None
>   +- WholeStageCodegen
>  :  +- TungstenAggregate(key=[id#110L], 
> functions=[(count(1),mode=Partial,isDistinct=false)], 
> output=[id#110L,count#136L])
>  : +- Filter isnotnull(id#110L)
>  :+- INPUT
>  +- Generate explode(array(src#2L, dst#3L)), false, false, [id#110L]
> +- WholeStageCodegen
>:  +- Filter ((isnotnull(src#2L) && isnotnull(dst#3L)) && NOT 
> (src#2L = dst#3L))
>: +- INPUT
>+- InMemoryTableScan [src#2L,dst#3L], 
> [isnotnull(src#2L),isnotnull(dst#3L),NOT (src#2L = dst#3L)], InMemoryRelation 
> [src#2L,dst#3L], true, 1, StorageLevel(disk=true, memory=true, 
> offheap=false, deserialized=true, replication=1), WholeStageCodegen, None
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate.inputRDDs(TungstenAggregate.scala:134)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:348)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:287)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2122)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExec

[jira] [Resolved] (SPARK-15381) physical object operator should define `reference` correctly

2016-05-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15381.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13167
[https://github.com/apache/spark/pull/13167]

> physical object operator should define `reference` correctly
> 
>
> Key: SPARK-15381
> URL: https://issues.apache.org/jira/browse/SPARK-15381
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> A test case is given in https://issues.apache.org/jira/browse/SPARK-15384.






[jira] [Assigned] (SPARK-15392) The default value of size estimation is not good

2016-05-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-15392:
--

Assignee: Davies Liu

> The default value of size estimation is not good
> 
>
> Key: SPARK-15392
> URL: https://issues.apache.org/jira/browse/SPARK-15392
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> We use autoBroadcastJoinThreshold + 1L as the default value of the size 
> estimation. That is not good in 2.0, because we calculate the size based on 
> the size of the schema, so the estimation could end up less than 
> autoBroadcastJoinThreshold if you have a SELECT on top of a DataFrame 
> created from an RDD.
> We should use an even bigger default value, for example, MaxLong.






[jira] [Created] (SPARK-15392) The default value of size estimation is not good

2016-05-18 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15392:
--

 Summary: The default value of size estimation is not good
 Key: SPARK-15392
 URL: https://issues.apache.org/jira/browse/SPARK-15392
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Davies Liu


We use autoBroadcastJoinThreshold + 1L as the default value of the size 
estimation. That is not good in 2.0, because we calculate the size based on 
the size of the schema, so the estimation could end up less than 
autoBroadcastJoinThreshold if you have a SELECT on top of a DataFrame created 
from an RDD.

We should use an even bigger default value, for example, MaxLong.
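
A rough back-of-the-envelope illustration of the concern (all numbers are
hypothetical; this is not Spark's actual estimator):

{code}
threshold = 10 * 1024 * 1024          # autoBroadcastJoinThreshold: 10 MB
default_estimate = threshold + 1      # unknown-size plan: just over the bar
# A SELECT of a narrow column causes the size to be re-estimated from the
# (much smaller) schema; model that here as scaling by row width, e.g. a
# 4-byte column out of a 100-byte row:
narrow_estimate = default_estimate * 4 // 100
print(narrow_estimate < threshold)    # True: the plan now looks broadcastable
{code}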






[jira] [Resolved] (SPARK-15342) PySpark test for non ascii column name does not actually test with unicode column name

2016-05-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15342.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13134
[https://github.com/apache/spark/pull/13134]

> PySpark test for non ascii column name does not actually test with unicode 
> column name
> --
>
> Key: SPARK-15342
> URL: https://issues.apache.org/jira/browse/SPARK-15342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.0.0
>
>
> The PySpark SQL test test_column_name_with_non_ascii wants to test a 
> non-ASCII (i.e., UTF-8) column name, but it doesn't actually test it. We 
> should fix it.
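
A sketch of what actually exercising a non-ASCII column name could look like
(illustrative; the column name and the sqlCtx handle are assumptions, not the
actual test):

{code}
# -*- coding: utf-8 -*-
from pyspark.sql.types import StructType, StructField, LongType

# Use a column name that is genuinely non-ASCII so the code path under test
# really sees UTF-8 data.
schema = StructType([StructField(u"数量", LongType(), True)])
df = sqlCtx.createDataFrame([(1,)], schema)
df.select(df[u"数量"]).show()
{code}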






[jira] [Updated] (SPARK-15342) PySpark test for non ascii column name does not actually test with unicode column name

2016-05-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15342:
---
Assignee: Liang-Chi Hsieh

> PySpark test for non ascii column name does not actually test with unicode 
> column name
> --
>
> Key: SPARK-15342
> URL: https://issues.apache.org/jira/browse/SPARK-15342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.0.0
>
>
> The PySpark SQL test test_column_name_with_non_ascii wants to test a 
> non-ASCII (i.e., UTF-8) column name, but it doesn't actually test it. We 
> should fix it.






[jira] [Resolved] (SPARK-15357) Cooperative spilling should check consumer memory mode

2016-05-18 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15357.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13151
[https://github.com/apache/spark/pull/13151]

> Cooperative spilling should check consumer memory mode
> --
>
> Key: SPARK-15357
> URL: https://issues.apache.org/jira/browse/SPARK-15357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> In TaskMemoryManager.java:
> {code}
> for (MemoryConsumer c: consumers) {
>   if (c != consumer && c.getUsed() > 0) {
> try {
>   long released = c.spill(required - got, consumer);
> if (released > 0 && mode == tungstenMemoryMode) {
>   got += memoryManager.acquireExecutionMemory(required - got, 
> taskAttemptId, mode);
>   if (got >= required) {
> break;
>   }
> }
>   } catch(...) { ... }
> }
>   }
> }
> {code}
> Currently, when non-tungsten consumers acquire execution memory, they may 
> force other tungsten consumers to spill and then NOT use the freed memory. A 
> better way to do this is to incorporate the memory mode in the consumer 
> itself and spill only those with matching memory modes.
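
The proposal in the last paragraph, sketched as Python-style pseudocode (the
names are illustrative, not Spark's actual internals):

{code}
# Only ask consumers that hold memory in the same mode (on-heap vs. off-heap)
# as the requester to spill, so that the freed memory is actually usable.
def pick_spill_candidates(consumers, requester, mode):
    return [c for c in consumers
            if c is not requester and c.used > 0 and c.memory_mode == mode]
{code}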






[jira] [Resolved] (SPARK-15244) Type of column name created with sqlContext.createDataFrame() is not consistent.

2016-05-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15244.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13097
[https://github.com/apache/spark/pull/13097]

> Type of column name created with sqlContext.createDataFrame() is not 
> consistent.
> 
>
> Key: SPARK-15244
> URL: https://issues.apache.org/jira/browse/SPARK-15244
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: CentOS 7, Spark 1.6.0
>Reporter: Kazuki Yokoishi
>Priority: Minor
> Fix For: 2.0.0
>
>
> StructField() converts the field name to str in __init__.
> But when a list of str/unicode is passed to sqlContext.createDataFrame() as a 
> schema, the type of StructField.name is not converted.
> To reproduce:
> {noformat}
> >>> schema = StructType([StructField(u"col", StringType())])
> >>> df1 = sqlContext.createDataFrame([("a",)], schema)
> >>> df1.columns # "col" is str
> ['col']
> >>> df2 = sqlContext.createDataFrame([("a",)], [u"col"])
> >>> df2.columns # "col" is unicode
> [u'col']
> {noformat}






[jira] [Updated] (SPARK-15244) Type of column name created with sqlContext.createDataFrame() is not consistent.

2016-05-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15244:
---
Assignee: Dongjoon Hyun

> Type of column name created with sqlContext.createDataFrame() is not 
> consistent.
> 
>
> Key: SPARK-15244
> URL: https://issues.apache.org/jira/browse/SPARK-15244
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: CentOS 7, Spark 1.6.0
>Reporter: Kazuki Yokoishi
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.0
>
>
> StructField() converts the field name to str in __init__.
> But when a list of str/unicode is passed to sqlContext.createDataFrame() as a 
> schema, the type of StructField.name is not converted.
> To reproduce:
> {noformat}
> >>> schema = StructType([StructField(u"col", StringType())])
> >>> df1 = sqlContext.createDataFrame([("a",)], schema)
> >>> df1.columns # "col" is str
> ['col']
> >>> df2 = sqlContext.createDataFrame([("a",)], [u"col"])
> >>> df2.columns # "col" is unicode
> [u'col']
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15357) Cooperative spilling should check consumer memory mode

2016-05-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-15357:
--

Assignee: Davies Liu

> Cooperative spilling should check consumer memory mode
> --
>
> Key: SPARK-15357
> URL: https://issues.apache.org/jira/browse/SPARK-15357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Davies Liu
>
> In TaskMemoryManager.java:
> {code}
> for (MemoryConsumer c: consumers) {
>   if (c != consumer && c.getUsed() > 0) {
> try {
>   long released = c.spill(required - got, consumer);
> if (released > 0 && mode == tungstenMemoryMode) {
>   got += memoryManager.acquireExecutionMemory(required - got, 
> taskAttemptId, mode);
>   if (got >= required) {
> break;
>   }
> }
>   } catch(...) { ... }
> }
>   }
> }
> {code}
> Currently, when non-tungsten consumers acquire execution memory, they may 
> force other tungsten consumers to spill and then NOT use the freed memory. A 
> better way to do this is to incorporate the memory mode in the consumer 
> itself and spill only those with matching memory modes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe

2016-05-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15165:
---
Fix Version/s: 2.0.0

> Codegen can break because toCommentSafeString is not actually safe
> --
>
> Key: SPARK-15165
> URL: https://issues.apache.org/jira/browse/SPARK-15165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Kousuke Saruta
>Priority: Blocker
> Fix For: 2.0.0
>
>
> The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking 
> codegen.
> But if an even number of "\" is put before "u", like "\\u", in a string 
> literal in the query, codegen can still break.
> The following code causes a compilation error.
> {code}
> val df = Seq(...).toDF
> df.select("'\u002A/'").show
> {code}
> The compilation error occurs because "\u002A/" is translated 
> into "*/" (the end of the comment).
> Due to this unsafety, arbitrary code can be injected as follows.
> {code}
> val df = Seq(...).toDF
> // Inject "System.exit(1)"
> df.select("'\u002A/{System.exit(1);}/*'").show
> {code}
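
For background, here is a hedged Scala sketch of one way to make arbitrary text safe
inside a generated /* ... */ block comment; it is illustrative only and not necessarily
the exact fix that was merged:

{code}
object CommentEscapeSketch {
  /** Make arbitrary text safe to embed inside a Java block comment. */
  def toCommentSafe(text: String): String = {
    text
      .replace("\\", "\\\\")   // double every backslash so no unicode escape survives
      .replace("*/", "*\\/")   // break any literal end-of-comment marker
      .replace("/*", "/\\*")   // break any nested comment opener
  }
}
{code}

Doubling every backslash means the backslash in front of any "u" is always preceded by
an odd number of backslashes, so the Java compiler never treats it as a unicode escape,
and breaking "*/" prevents the comment from being terminated early.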



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe

2016-05-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15165.

Resolution: Fixed

> Codegen can break because toCommentSafeString is not actually safe
> --
>
> Key: SPARK-15165
> URL: https://issues.apache.org/jira/browse/SPARK-15165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Blocker
> Fix For: 2.0.0
>
>
> The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking 
> codegen.
> But if an even number of "\" is put before "u", like "\\u", in a string 
> literal in the query, codegen can still break.
> The following code causes a compilation error.
> {code}
> val df = Seq(...).toDF
> df.select("'\u002A/'").show
> {code}
> The compilation error occurs because "\u002A/" is translated 
> into "*/" (the end of the comment).
> Due to this unsafety, arbitrary code can be injected as follows.
> {code}
> val df = Seq(...).toDF
> // Inject "System.exit(1)"
> df.select("'\u002A/{System.exit(1);}/*'").show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe

2016-05-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15165:
---
Assignee: Kousuke Saruta

> Codegen can break because toCommentSafeString is not actually safe
> --
>
> Key: SPARK-15165
> URL: https://issues.apache.org/jira/browse/SPARK-15165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Blocker
> Fix For: 2.0.0
>
>
> The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking 
> codegen.
> But if an even number of "\" is put before "u", like "\\u", in a string 
> literal in the query, codegen can still break.
> The following code causes a compilation error.
> {code}
> val df = Seq(...).toDF
> df.select("'\u002A/'").show
> {code}
> The compilation error occurs because "\u002A/" is translated 
> into "*/" (the end of the comment).
> Due to this unsafety, arbitrary code can be injected as follows.
> {code}
> val df = Seq(...).toDF
> // Inject "System.exit(1)"
> df.select("'\u002A/{System.exit(1);}/*'").show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15165) Codegen can break because toCommentSafeString is not actually safe

2016-05-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15165:
---
Affects Version/s: 1.5.2
   1.6.1

> Codegen can break because toCommentSafeString is not actually safe
> --
>
> Key: SPARK-15165
> URL: https://issues.apache.org/jira/browse/SPARK-15165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Kousuke Saruta
>Priority: Blocker
>
> The toCommentSafeString method replaces "\u" with "\\u" to avoid breaking 
> codegen.
> But if an even number of "\" is put before "u", like "\\u", in a string 
> literal in the query, codegen can still break.
> The following code causes a compilation error.
> {code}
> val df = Seq(...).toDF
> df.select("'\u002A/'").show
> {code}
> The compilation error occurs because "\u002A/" is translated 
> into "*/" (the end of the comment).
> Due to this unsafety, arbitrary code can be injected as follows.
> {code}
> val df = Seq(...).toDF
> // Inject "System.exit(1)"
> df.select("'\u002A/{System.exit(1);}/*'").show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15332) OutOfMemory in TimSort

2016-05-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15332:
--

 Summary: OutOfMemory in TimSort 
 Key: SPARK-15332
 URL: https://issues.apache.org/jira/browse/SPARK-15332
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1, 2.0.0
Reporter: Davies Liu


{code}
py4j.protocol.Py4JJavaError: An error occurred while calling 
o154.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 230.0 failed 1 times, most recent failure: Lost task 1.0 in stage 230.0 
(TID 1881, localhost): java.lang.OutOfMemoryError: Java heap space
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:88)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:32)
at 
org.apache.spark.util.collection.TimSort$SortState.ensureCapacity(TimSort.java:951)
at 
org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:699)
at 
org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525)
at 
org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453)
at 
org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325)
at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153)
at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:285)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:199)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:363)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:378)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:92)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:357)
at 
org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:736)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.<init>(SortMergeJoinExec.scala:611)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.<init>(SortMergeJoinExec.scala:206)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:204)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1.apply(SortMergeJoinExec.scala:100)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13866) Handle decimal type in CSV inference

2016-05-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13866:
---
Assignee: Hyukjin Kwon

> Handle decimal type in CSV inference
> 
>
> Key: SPARK-13866
> URL: https://issues.apache.org/jira/browse/SPARK-13866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hossein Falaki
>Assignee: Hyukjin Kwon
> Fix For: 2.0.0
>
>
> CSV schema inference does not check for the decimal type. As a result, large 
> integers are cast to double. This is not a desired outcome for many ID 
> columns in real data scenarios. 
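
For illustration, a rough Scala sketch of an inference order that avoids the problem;
the object and method names are made up and this is not the actual CSVInferSchema code:

{code}
import scala.util.Try
import org.apache.spark.sql.types._

object CsvNumericInferenceSketch {
  // Prefer exact integral types over DoubleType so large integer IDs keep their value.
  def inferNumericType(field: String): DataType = {
    lazy val decimal = Try(BigDecimal(field)).toOption
    if (Try(field.toLong).isSuccess) {
      LongType
    } else if (decimal.exists(d => d.scale == 0 && d.precision <= 38)) {
      DecimalType(decimal.get.precision, 0)   // integral, but wider than Long
    } else if (Try(field.toDouble).isSuccess) {
      DoubleType
    } else {
      StringType
    }
  }
}
{code}

An ID like "12345678901234567890123" would then be inferred as DecimalType(23,0)
instead of a lossy double.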



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13866) Handle decimal type in CSV inference

2016-05-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13866.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11724
[https://github.com/apache/spark/pull/11724]

> Handle decimal type in CSV inference
> 
>
> Key: SPARK-13866
> URL: https://issues.apache.org/jira/browse/SPARK-13866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hossein Falaki
> Fix For: 2.0.0
>
>
> CSV schema inference does not check for the decimal type. As a result, large 
> integers are cast to double. This is not a desired outcome for many ID 
> columns in real data scenarios. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15307) Super slow to load a partitioned table from local disks

2016-05-12 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15307:
--

 Summary: Super slow to load a partitioned table from local disks
 Key: SPARK-15307
 URL: https://issues.apache.org/jira/browse/SPARK-15307
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu


It took 22 seconds to load a partitioned table with 3000 files on local disks, 
because DeprecatedRawLocalFileStatus in RawLocalFileSystem uses a subprocess to 
load permission information.

Right now we don't actually use this permission information, so we could use 
another constructor of LocatedFileStatus to avoid this.
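
For illustration, a hedged sketch of the idea against the Hadoop 2.x FileSystem API;
the helper name is made up and the actual Spark change may differ:

{code}
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}

object ListWithoutPermissionsSketch {
  /** List a directory without forcing RawLocalFileSystem to shell out for permissions. */
  def locatedStatuses(fs: FileSystem, dir: Path): Seq[LocatedFileStatus] = {
    fs.listStatus(dir).toSeq.map { f =>
      val locations = fs.getFileBlockLocations(f, 0, f.getLen)
      // Build the LocatedFileStatus from fields we already have and leave
      // permission/owner/group as null, since they are never read here.
      new LocatedFileStatus(
        f.getLen, f.isDirectory, f.getReplication.toInt, f.getBlockSize,
        f.getModificationTime, f.getAccessTime,
        null, null, null,   // permission, owner, group
        null,               // symlink
        f.getPath, locations)
    }
  }
}
{code}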



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15307) Super slow to load a partitioned table from local disks

2016-05-12 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282270#comment-15282270
 ] 

Davies Liu commented on SPARK-15307:


cc [~liancheng]

> Super slow to load a partitioned table from local disks
> ---
>
> Key: SPARK-15307
> URL: https://issues.apache.org/jira/browse/SPARK-15307
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>
> It took 22 seconds to load a partitioned table with 3000 files on local 
> disks, because DeprecatedRawLocalFileStatus in RawLocalFileSystem uses a 
> subprocess to load permission information.
> Right now we don't actually use this permission information, so we could use 
> another constructor of LocatedFileStatus to avoid this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15287) Spark SQL partition filter clause with different literal type will scan all hive partitions

2016-05-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-15287.
--
Resolution: Won't Fix

> Spark SQL partition filter clause with different literal type will scan all 
> hive partitions
> ---
>
> Key: SPARK-15287
> URL: https://issues.apache.org/jira/browse/SPARK-15287
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Tao Li
>
> I have a Hive table with a partition column of string type. When the query 
> filters on a string literal, it calls getPartitionsByFilter. But when the 
> query filters on an integer literal, it calls getAllPartitions, which 
> causes a heavy query in the Hive metastore when the table has too many 
> partitions.
> I find that Spark SQL casts the integer literal to double in the logical 
> plan (I don't know why Spark SQL does this cast) and then dynamically 
> decides whether to call getAllPartitions() or getPartitionsByFilter(). It 
> seems that the method convertFilters() can't cover the double case.
> {code:java}
> def convertFilters(table: Table, filters: Seq[Expression]): String = {
> // hive varchar is treated as catalyst string, but hive varchar can't be 
> pushed down.
> val varcharKeys = table.getPartitionKeys.asScala
>   .filter(col => col.getType.startsWith(serdeConstants.VARCHAR_TYPE_NAME) 
> ||
> col.getType.startsWith(serdeConstants.CHAR_TYPE_NAME))
>   .map(col => col.getName).toSet
> filters.collect {
>   case op @ BinaryComparison(a: Attribute, Literal(v, _: IntegralType)) =>
> s"${a.name} ${op.symbol} $v"
>   case op @ BinaryComparison(Literal(v, _: IntegralType), a: Attribute) =>
> s"$v ${op.symbol} ${a.name}"
>   case op @ BinaryComparison(a: Attribute, Literal(v, _: StringType))
>   if !varcharKeys.contains(a.name) =>
> s"""${a.name} ${op.symbol} "$v
>   case op @ BinaryComparison(Literal(v, _: StringType), a: Attribute)
>   if !varcharKeys.contains(a.name) =>
> s$v" ${op.symbol} ${a.name}"""
> }.mkString(" and ")
>   }
>   override def getPartitionsByFilter(
>   hive: Hive,
>   table: Table,
>   predicates: Seq[Expression]): Seq[Partition] = {
> // Hive getPartitionsByFilter() takes a string that represents partition
> // predicates like "str_key=\"value\" and int_key=1 ..."
> val filter = convertFilters(table, predicates)
> val partitions =
>   if (filter.isEmpty) {
> getAllPartitionsMethod.invoke(hive, 
> table).asInstanceOf[JSet[Partition]]
>   } else {
> logDebug(s"Hive metastore filter is '$filter'.")
> getPartitionsByFilterMethod.invoke(hive, table, 
> filter).asInstanceOf[JArrayList[Partition]]
>   }
> partitions.asScala.toSeq
>   }
> {code}
> The query plan with the string literal filter, and logdate is the parition of 
> string type:
> {noformat}
> == Parsed Logical Plan ==
> Limit 21
> +- Limit 20
>+- Project 
> [ip#1,field1#2,field2#3,field3#4,manualtime#5,timezone#6,pbstr#7,retcode#8,pagesize#9,refer#10,useragent#11,field4#12,responsetime#13,usid#14,style#15,unid#16,pl#17,usid_src#18,resinip#19,upstreamtime#20,uuid#21,qua#22,q_ext#23,line#24,logdate#0]
>   +- Filter (logdate#0 = 20160505)
>  +- MetastoreRelation default, wap_nginx_5min, None
> == Analyzed Logical Plan ==
> ip: string, field1: string, field2: string, field3: string, manualtime: 
> string, timezone: string, pbstr: map, retcode: string, 
> pagesize: string, refer: string, useragent: string, field4: string, 
> responsetime: string, usid: string, style: string, unid: string, pl: string, 
> usid_src: string, resinip: string, upstreamtime: string, uuid: string, qua: 
> string, q_ext: string, line: string, logdate: string
> Limit 21
> +- Limit 20
>+- Project 
> [ip#1,field1#2,field2#3,field3#4,manualtime#5,timezone#6,pbstr#7,retcode#8,pagesize#9,refer#10,useragent#11,field4#12,responsetime#13,usid#14,style#15,unid#16,pl#17,usid_src#18,resinip#19,upstreamtime#20,uuid#21,qua#22,q_ext#23,line#24,logdate#0]
>   +- Filter (logdate#0 = 20160505)
>  +- MetastoreRelation default, wap_nginx_5min, None
> == Optimized Logical Plan ==
> Limit 20
> +- Filter (logdate#0 = 20160505)
>+- MetastoreRelation default, wap_nginx_5min, None
> == Physical Plan ==
> Limit 20
> +- HiveTableScan 
> [ip#1,field1#2,field2#3,field3#4,manualtime#5,timezone#6,pbstr#7,retcode#8,pagesize#9,refer#10,useragent#11,field4#12,responsetime#13,usid#14,style#15,unid#16,pl#17,usid_src#18,resinip#19,upstreamtime#20,uuid#21,qua#22,q_ext#23,line#24,logdate#0],
>  MetastoreRelation default, wap_nginx_5min, None, [(logdate#0 = 20160505)]
> {noformat}
> The query plan with 

[jira] [Commented] (SPARK-15287) Spark SQL partition filter clause with different literal type will scan all hive partitions

2016-05-12 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282247#comment-15282247
 ] 

Davies Liu commented on SPARK-15287:


By design, we can't compare two expressions with different types (IntegerType 
and StringType), so both of them are converted to DoubleType. But the Hive 
metastore does not support such complicated predicates on partition columns.

So I think we can't fix this easily. You should use string literals instead of 
integer literals, or use IntegerType for the partition column.
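
To illustrate, a hypothetical spark-shell session (the table name is taken from this
report); only the second query keeps a partition filter that can be pushed to the
metastore:

{code}
// Integer literal: both sides are cast to double, so the predicate no longer
// looks like a simple comparison that convertFilters() can translate.
sqlContext.sql("SELECT * FROM wap_nginx_5min WHERE logdate = 20160505").explain(true)

// String literal: the filter stays a plain equality on the partition column
// and getPartitionsByFilter() can be used.
sqlContext.sql("SELECT * FROM wap_nginx_5min WHERE logdate = '20160505'").explain(true)
{code}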

> Spark SQL partition filter clause with different literal type will scan all 
> hive partitions
> ---
>
> Key: SPARK-15287
> URL: https://issues.apache.org/jira/browse/SPARK-15287
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Tao Li
>
> I have a Hive table with a partition column of string type. When the query 
> filters on a string literal, it calls getPartitionsByFilter. But when the 
> query filters on an integer literal, it calls getAllPartitions, which 
> causes a heavy query in the Hive metastore when the table has too many 
> partitions.
> I find that Spark SQL casts the integer literal to double in the logical 
> plan (I don't know why Spark SQL does this cast) and then dynamically 
> decides whether to call getAllPartitions() or getPartitionsByFilter(). It 
> seems that the method convertFilters() can't cover the double case.
> {code:java}
> def convertFilters(table: Table, filters: Seq[Expression]): String = {
> // hive varchar is treated as catalyst string, but hive varchar can't be 
> pushed down.
> val varcharKeys = table.getPartitionKeys.asScala
>   .filter(col => col.getType.startsWith(serdeConstants.VARCHAR_TYPE_NAME) 
> ||
> col.getType.startsWith(serdeConstants.CHAR_TYPE_NAME))
>   .map(col => col.getName).toSet
> filters.collect {
>   case op @ BinaryComparison(a: Attribute, Literal(v, _: IntegralType)) =>
> s"${a.name} ${op.symbol} $v"
>   case op @ BinaryComparison(Literal(v, _: IntegralType), a: Attribute) =>
> s"$v ${op.symbol} ${a.name}"
>   case op @ BinaryComparison(a: Attribute, Literal(v, _: StringType))
>   if !varcharKeys.contains(a.name) =>
> s"""${a.name} ${op.symbol} "$v
>   case op @ BinaryComparison(Literal(v, _: StringType), a: Attribute)
>   if !varcharKeys.contains(a.name) =>
> s$v" ${op.symbol} ${a.name}"""
> }.mkString(" and ")
>   }
>   override def getPartitionsByFilter(
>   hive: Hive,
>   table: Table,
>   predicates: Seq[Expression]): Seq[Partition] = {
> // Hive getPartitionsByFilter() takes a string that represents partition
> // predicates like "str_key=\"value\" and int_key=1 ..."
> val filter = convertFilters(table, predicates)
> val partitions =
>   if (filter.isEmpty) {
> getAllPartitionsMethod.invoke(hive, 
> table).asInstanceOf[JSet[Partition]]
>   } else {
> logDebug(s"Hive metastore filter is '$filter'.")
> getPartitionsByFilterMethod.invoke(hive, table, 
> filter).asInstanceOf[JArrayList[Partition]]
>   }
> partitions.asScala.toSeq
>   }
> {code}
> The query plan with the string literal filter, and logdate is the parition of 
> string type:
> {noformat}
> == Parsed Logical Plan ==
> Limit 21
> +- Limit 20
>+- Project 
> [ip#1,field1#2,field2#3,field3#4,manualtime#5,timezone#6,pbstr#7,retcode#8,pagesize#9,refer#10,useragent#11,field4#12,responsetime#13,usid#14,style#15,unid#16,pl#17,usid_src#18,resinip#19,upstreamtime#20,uuid#21,qua#22,q_ext#23,line#24,logdate#0]
>   +- Filter (logdate#0 = 20160505)
>  +- MetastoreRelation default, wap_nginx_5min, None
> == Analyzed Logical Plan ==
> ip: string, field1: string, field2: string, field3: string, manualtime: 
> string, timezone: string, pbstr: map, retcode: string, 
> pagesize: string, refer: string, useragent: string, field4: string, 
> responsetime: string, usid: string, style: string, unid: string, pl: string, 
> usid_src: string, resinip: string, upstreamtime: string, uuid: string, qua: 
> string, q_ext: string, line: string, logdate: string
> Limit 21
> +- Limit 20
>+- Project 
> [ip#1,field1#2,field2#3,field3#4,manualtime#5,timezone#6,pbstr#7,retcode#8,pagesize#9,refer#10,useragent#11,field4#12,responsetime#13,usid#14,style#15,unid#16,pl#17,usid_src#18,resinip#19,upstreamtime#20,uuid#21,qua#22,q_ext#23,line#24,logdate#0]
>   +- Filter (logdate#0 = 20160505)
>  +- MetastoreRelation default, wap_nginx_5min, None
> == Optimized Logical Plan ==
> Limit 20
> +- Filter (logdate#0 = 20160505)
>+- MetastoreRelation default, wap_nginx_5min, None
> == Physical Plan

[jira] [Updated] (SPARK-15300) Can't remove a block if it's under evicting

2016-05-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15300:
---
Summary: Can't remove a block if it's under evicting  (was: Can't remove a 
block if it's under eviting)

> Can't remove a block if it's under evicting
> ---
>
> Key: SPARK-15300
> URL: https://issues.apache.org/jira/browse/SPARK-15300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> {code}
> 16/04/15 12:17:05 INFO ContextCleaner: Cleaned shuffle 94
> 16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433121
> 16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433122
> 16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433123
> 16/04/15 12:17:05 INFO BlockManagerInfo: Removed broadcast_629_piece0 on 
> 10.0.164.43:39651 in memory (size: 23.4 KB, free: 15.8 GB)
> 16/04/15 12:17:05 ERROR BlockManagerSlaveEndpoint: Error in removing block 
> broadcast_631_piece0
> java.lang.IllegalStateException: Task -1024 has already locked 
> broadcast_631_piece0 for writing
>   at 
> org.apache.spark.storage.BlockInfoManager.lockForWriting(BlockInfoManager.scala:232)
>   at 
> org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1286)
>   at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$1.apply$mcZ$sp(BlockManagerSlaveEndpoint.scala:47)
>   at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$1.apply(BlockManagerSlaveEndpoint.scala:46)
>   at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$1.apply(BlockManagerSlaveEndpoint.scala:46)
>   at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:82)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/04/15 12:17:05 INFO BlockManagerInfo: Removed broadcast_626_piece0 on 
> 10.0.164.43:39651 in memory (size: 23.4 KB, free: 15.8 GB)
> 16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433124
> 16/04/15 12:17:05 INFO BlockManagerInfo: Removed broadcast_627_piece0 on 
> 10.0.164.43:39651 in memory (size: 23.3 KB, free: 15.8 GB)
> 16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433125
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15300) Can't remove a block if it's under eviting

2016-05-12 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15300:
--

 Summary: Can't remove a block if it's under eviting
 Key: SPARK-15300
 URL: https://issues.apache.org/jira/browse/SPARK-15300
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Davies Liu
Assignee: Davies Liu


{code}
16/04/15 12:17:05 INFO ContextCleaner: Cleaned shuffle 94
16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433121
16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433122
16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433123
16/04/15 12:17:05 INFO BlockManagerInfo: Removed broadcast_629_piece0 on 
10.0.164.43:39651 in memory (size: 23.4 KB, free: 15.8 GB)
16/04/15 12:17:05 ERROR BlockManagerSlaveEndpoint: Error in removing block 
broadcast_631_piece0
java.lang.IllegalStateException: Task -1024 has already locked 
broadcast_631_piece0 for writing
at 
org.apache.spark.storage.BlockInfoManager.lockForWriting(BlockInfoManager.scala:232)
at 
org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1286)
at 
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$1.apply$mcZ$sp(BlockManagerSlaveEndpoint.scala:47)
at 
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$1.apply(BlockManagerSlaveEndpoint.scala:46)
at 
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$1.apply(BlockManagerSlaveEndpoint.scala:46)
at 
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:82)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/04/15 12:17:05 INFO BlockManagerInfo: Removed broadcast_626_piece0 on 
10.0.164.43:39651 in memory (size: 23.4 KB, free: 15.8 GB)
16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433124
16/04/15 12:17:05 INFO BlockManagerInfo: Removed broadcast_627_piece0 on 
10.0.164.43:39651 in memory (size: 23.3 KB, free: 15.8 GB)
16/04/15 12:17:05 INFO ContextCleaner: Cleaned accumulator 1433125
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13522) Executor should kill itself when it's unable to heartbeat to the driver more than N times

2016-05-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13522:
---
Fix Version/s: (was: 1.6.2)

> Executor should kill itself when it's unable to heartbeat to the driver more 
> than N times
> -
>
> Key: SPARK-13522
> URL: https://issues.apache.org/jira/browse/SPARK-13522
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Sometimes the network disconnection event won't be triggered, due to other 
> potential race conditions that we may not have thought of, and then the 
> executor will keep sending heartbeats to the driver and never exit.
> We should make the executor kill itself when it's unable to heartbeat to the 
> driver more than N times.
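
For illustration, a minimal Scala sketch of the counting logic; the threshold and exit
code are assumptions, not the actual Executor implementation:

{code}
object HeartbeatWatchdogSketch {
  val MaxFailures = 60   // assumed limit, analogous to spark.executor.heartbeat.maxFailures
  private var failures = 0

  def reportHeartBeat(send: () => Unit): Unit = {
    try {
      send()
      failures = 0       // any successful heartbeat resets the counter
    } catch {
      case _: Exception =>
        failures += 1
        if (failures >= MaxFailures) {
          // Exiting is safer than running as a zombie the driver can no longer reach.
          sys.exit(56)
        }
    }
  }
}
{code}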



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15260) UnifiedMemoryManager could be in bad state if any exception happen while evicting blocks

2016-05-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15260:
---
Fix Version/s: 1.6.2

> UnifiedMemoryManager could be in bad state if any exception happen while 
> evicting blocks
> 
>
> Key: SPARK-15260
> URL: https://issues.apache.org/jira/browse/SPARK-15260
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Davies Liu
>Assignee: Andrew Or
> Fix For: 1.6.2, 2.0.0
>
>
> {code}
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 62 in stage 19.0 failed 4 times, most
> recent failure: Lost task 62.3 in stage 19.0 (TID 2841, 
> ip-10-109-240-229.ec2.internal): java.io.IOException:
> java.lang.AssertionError: assertion failed at 
> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1223) at
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock( 
> TorrentBroadcast.scala:165) at
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute( 
> TorrentBroadcast.scala:64) at
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast. 
> scala:64) at
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast. 
> scala:88) at
> org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala: 71) at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala: 46) at
> org.apache.spark.scheduler.Task.run(Task.scala:96) at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222) at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor. 
> java:1142) at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor. 
> java:617) at
> java.lang.Thread.run(Thread.java:745) Caused by: java.lang.AssertionError: 
> assertion failed at
> scala.Predef$.assert(Predef.scala:165) at 
> org.apache.spark.memory.UnifiedMemoryManager.acquireStorageMemory(
> UnifiedMemoryManager.scala:140) at 
> org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:387) at
> org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:346) at
> org.apache.spark.storage.MemoryStore.putBytes(MemoryStore.scala:99) at
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:803) at
> org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:690) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$or
> 1c5ab38dcb7d9b112f54b116debbe7fcast$$anonfun$$getRemote$1$1.apply( 
> TorrentBroadcast.scala:130) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$or
> 1c5ab38dcb7d9b112f54b116debbe7fcast$$anonfun$$getRemote$1$1.apply( 
> TorrentBroadcast.scala:127) at
> scala.Option.map(Option.scala:145) at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$
> broadcast$TorrentBroadcast$$readBlocks$1.org$apache$spark$broadcast$
> TorrentBroadcast$$anonfun$$getRemote$1(TorrentBroadcast.scala:127) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$
> broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$1.apply( 
> TorrentBroadcast.scala:137) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$
> broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$1.apply( 
> TorrentBroadcast.scala:137) at
> scala.Option.orElse(Option.scala:257) at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$
> broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast. 
> scala:137) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$
> broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala: 120) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$
> broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala: 120) at
> scala.collection.immutable.List.foreach(List.scala:318) at
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$
> TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$ 
> 1.apply(TorrentBroadcast.scala:175) at
> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1220) ... 12 more 
> Driver stacktrace: at
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$
> DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply( 
> DAGScheduler.scala:1419) at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply( 
> DAGScheduler.scala:1418) at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray. 
> scala:59) at
> scala.collection.mutable.ArrayBuffer.foreach(

[jira] [Updated] (SPARK-13522) Executor should kill itself when it's unable to heartbeat to the driver more than N times

2016-05-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13522:
---
Fix Version/s: 1.6.2

> Executor should kill itself when it's unable to heartbeat to the driver more 
> than N times
> -
>
> Key: SPARK-13522
> URL: https://issues.apache.org/jira/browse/SPARK-13522
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Sometimes the network disconnection event won't be triggered, due to other 
> potential race conditions that we may not have thought of, and then the 
> executor will keep sending heartbeats to the driver and never exit.
> We should make the executor kill itself when it's unable to heartbeat to the 
> driver more than N times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15256) Clarify the docstring for DataFrameReader.jdbc()

2016-05-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15256:
---
Assignee: Nicholas Chammas

> Clarify the docstring for DataFrameReader.jdbc()
> 
>
> Key: SPARK-15256
> URL: https://issues.apache.org/jira/browse/SPARK-15256
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 2.0.0
>
>
> The doc for the {{properties}} parameter [currently 
> reads|https://github.com/apache/spark/blob/d37c7f7f042f7943b5b684e53cf4284c601fb347/python/pyspark/sql/readwriter.py#L437-L439]:
> {quote}
> :param properties: JDBC database connection arguments, a list of 
> arbitrary string
>tag/value. Normally at least a "user" and 
> "password" property
>should be included.
> {quote}
> This is incorrect, since {{properties}} is expected to be a dictionary.
> Some of the other parameters have cryptic descriptions. I'll try to clarify 
> those as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15256) Clarify the docstring for DataFrameReader.jdbc()

2016-05-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15256.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13034
[https://github.com/apache/spark/pull/13034]

> Clarify the docstring for DataFrameReader.jdbc()
> 
>
> Key: SPARK-15256
> URL: https://issues.apache.org/jira/browse/SPARK-15256
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Nicholas Chammas
>Priority: Minor
> Fix For: 2.0.0
>
>
> The doc for the {{properties}} parameter [currently 
> reads|https://github.com/apache/spark/blob/d37c7f7f042f7943b5b684e53cf4284c601fb347/python/pyspark/sql/readwriter.py#L437-L439]:
> {quote}
> :param properties: JDBC database connection arguments, a list of 
> arbitrary string
>tag/value. Normally at least a "user" and 
> "password" property
>should be included.
> {quote}
> This is incorrect, since {{properties}} is expected to be a dictionary.
> Some of the other parameters have cryptic descriptions. I'll try to clarify 
> those as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15278) Remove experimental tag from Python DataFrame

2016-05-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15278.

Resolution: Fixed

Issue resolved by pull request 13062
[https://github.com/apache/spark/pull/13062]

> Remove experimental tag from Python DataFrame
> -
>
> Key: SPARK-15278
> URL: https://issues.apache.org/jira/browse/SPARK-15278
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> Earlier we removed experimental tag for Scala/Java DataFrames, but haven't 
> done so for Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15270) Creating HiveContext does not work

2016-05-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15270.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13056
[https://github.com/apache/spark/pull/13056]

> Creating HiveContext does not work
> --
>
> Key: SPARK-15270
> URL: https://issues.apache.org/jira/browse/SPARK-15270
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Piotr Milanowski
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Built spark (commit c6d23b6604e85bcddbd1fb6a2c1c3edbfd2be2c1, branch-2.0)  
> with command:
> /dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> Launched master and slave, launched ./bin/pyspark
> Creating hive context fails:
> {code}
> from pyspark.sql import HiveContext
> hc = HiveContext(sc)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "spark-2.0/python/pyspark/sql/context.py", line 458, in __init__
> sparkSession = SparkSession.withHiveSupport(sparkContext)
>   File "spark-2.0/python/pyspark/sql/session.py", line 192, in withHiveSupport
> jsparkSession = 
> sparkContext._jvm.SparkSession.withHiveSupport(sparkContext._jsc.sc())
>   File "spark-2.0/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py", line 
> 1048, in __getattr__
> py4j.protocol.Py4JError: org.apache.spark.sql.SparkSession.withHiveSupport 
> does not exist in the JVM
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15260) UnifiedMemoryManager could be in bad state if any exception happen while evicting blocks

2016-05-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15260.

   Resolution: Fixed
Fix Version/s: 2.0.0

> UnifiedMemoryManager could be in bad state if any exception happen while 
> evicting blocks
> 
>
> Key: SPARK-15260
> URL: https://issues.apache.org/jira/browse/SPARK-15260
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1, 2.0.0
>Reporter: Davies Liu
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> {code}
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 62 in stage 19.0 failed 4 times, most
> recent failure: Lost task 62.3 in stage 19.0 (TID 2841, 
> ip-10-109-240-229.ec2.internal): java.io.IOException:
> java.lang.AssertionError: assertion failed at 
> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1223) at
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock( 
> TorrentBroadcast.scala:165) at
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute( 
> TorrentBroadcast.scala:64) at
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast. 
> scala:64) at
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast. 
> scala:88) at
> org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala: 71) at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala: 46) at
> org.apache.spark.scheduler.Task.run(Task.scala:96) at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222) at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor. 
> java:1142) at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor. 
> java:617) at
> java.lang.Thread.run(Thread.java:745) Caused by: java.lang.AssertionError: 
> assertion failed at
> scala.Predef$.assert(Predef.scala:165) at 
> org.apache.spark.memory.UnifiedMemoryManager.acquireStorageMemory(
> UnifiedMemoryManager.scala:140) at 
> org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:387) at
> org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:346) at
> org.apache.spark.storage.MemoryStore.putBytes(MemoryStore.scala:99) at
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:803) at
> org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:690) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$or
> 1c5ab38dcb7d9b112f54b116debbe7fcast$$anonfun$$getRemote$1$1.apply( 
> TorrentBroadcast.scala:130) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$or
> 1c5ab38dcb7d9b112f54b116debbe7fcast$$anonfun$$getRemote$1$1.apply( 
> TorrentBroadcast.scala:127) at
> scala.Option.map(Option.scala:145) at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$
> broadcast$TorrentBroadcast$$readBlocks$1.org$apache$spark$broadcast$
> TorrentBroadcast$$anonfun$$getRemote$1(TorrentBroadcast.scala:127) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$
> broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$1.apply( 
> TorrentBroadcast.scala:137) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$
> broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$1.apply( 
> TorrentBroadcast.scala:137) at
> scala.Option.orElse(Option.scala:257) at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$
> broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast. 
> scala:137) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$
> broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala: 120) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$
> broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala: 120) at
> scala.collection.immutable.List.foreach(List.scala:318) at
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$
> TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120) at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$ 
> 1.apply(TorrentBroadcast.scala:175) at
> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1220) ... 12 more 
> Driver stacktrace: at
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$
> DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply( 
> DAGScheduler.scala:1419) at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply( 
> DAGScheduler.scala:1418) at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray. 
> scala:59) at
> scala.collection.mutable.

[jira] [Resolved] (SPARK-15259) Sort time metric should not include spill and record insertion time

2016-05-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15259.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Sort time metric should not include spill and record insertion time
> ---
>
> Key: SPARK-15259
> URL: https://issues.apache.org/jira/browse/SPARK-15259
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> After SPARK-14669 it seems the sort time metric includes both spill and 
> record insertion time. This makes it not very useful since the metric becomes 
> close to the total execution time of the node.
> We should track just the time spent for in-memory sort, as before.
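
For illustration, a minimal Scala sketch of scoping the timer to just the sort call;
the names are hypothetical:

{code}
object SortMetricSketch {
  /** Time only the in-memory sort itself; spilling and record insertion are measured elsewhere. */
  def timed[T](recordNanos: Long => Unit)(doSort: => T): T = {
    val start = System.nanoTime()
    val result = doSort
    recordNanos(System.nanoTime() - start)
    result
  }
}
{code}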



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15259) Sort time metric should not include spill and record insertion time

2016-05-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15259:
---
Assignee: Eric Liang

> Sort time metric should not include spill and record insertion time
> ---
>
> Key: SPARK-15259
> URL: https://issues.apache.org/jira/browse/SPARK-15259
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
>
> After SPARK-14669 it seems the sort time metric includes both spill and 
> record insertion time. This makes it not very useful since the metric becomes 
> close to the total execution time of the node.
> We should track just the time spent for in-memory sort, as before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15241) support scala decimal in external row

2016-05-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15241.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13019
[https://github.com/apache/spark/pull/13019]

> support scala decimal in external row
> -
>
> Key: SPARK-15241
> URL: https://issues.apache.org/jira/browse/SPARK-15241
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15242) keep decimal precision and scale when convert external decimal to catalyst decimal

2016-05-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15242.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13019
[https://github.com/apache/spark/pull/13019]

> keep decimal precision and scale when convert external decimal to catalyst 
> decimal
> --
>
> Key: SPARK-15242
> URL: https://issues.apache.org/jira/browse/SPARK-15242
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15260) UnifiedMemoryManager could be in bad state if any exception happen while evicting blocks

2016-05-10 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15260:
--

 Summary: UnifiedMemoryManager could be in bad state if any 
exception happen while evicting blocks
 Key: SPARK-15260
 URL: https://issues.apache.org/jira/browse/SPARK-15260
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.1, 1.6.0, 2.0.0
Reporter: Davies Liu
Assignee: Andrew Or


{code}

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 
62 in stage 19.0 failed 4 times, most
recent failure: Lost task 62.3 in stage 19.0 (TID 2841, 
ip-10-109-240-229.ec2.internal): java.io.IOException:
java.lang.AssertionError: assertion failed
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1223)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46)
at org.apache.spark.scheduler.Task.run(Task.scala:96)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at org.apache.spark.memory.UnifiedMemoryManager.acquireStorageMemory(UnifiedMemoryManager.scala:140)
at org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:387)
at org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:346)
at org.apache.spark.storage.MemoryStore.putBytes(MemoryStore.scala:99)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:803)
at org.apache.spark.storage.BlockManager.putBytes(BlockManager.scala:690)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$or1c5ab38dcb7d9b112f54b116debbe7fcast$$anonfun$$getRemote$1$1.apply(TorrentBroadcast.scala:130)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$or1c5ab38dcb7d9b112f54b116debbe7fcast$$anonfun$$getRemote$1$1.apply(TorrentBroadcast.scala:127)
at scala.Option.map(Option.scala:145)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.org$apache$spark$broadcast$TorrentBroadcast$$anonfun$$getRemote$1(TorrentBroadcast.scala:127)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$1.apply(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$1.apply(TorrentBroadcast.scala:137)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1220)
... 12 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed

[jira] [Updated] (SPARK-12661) Drop Python 2.6 support in PySpark

2016-05-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12661:
---
Target Version/s: 2.1.0  (was: 2.0.0)

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html
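As a rough illustration of item 2 in the description, here is a minimal sketch of the kind of interpreter check that could replace the removed 2.6 compatibility code paths (hypothetical, not the actual PySpark source):

{code}
import sys

# Hypothetical guard: once the Python 2.6-specific shims are removed,
# PySpark can simply refuse to start on an unsupported interpreter.
if sys.version_info[:2] < (2, 7):
    raise RuntimeError(
        "PySpark no longer supports Python %d.%d; please use Python 2.7+"
        % sys.version_info[:2])
{code}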



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2016-05-10 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278630#comment-15278630
 ] 

Davies Liu commented on SPARK-12661:


I think the goal is clear, but we did not do enough to make it happen in 2.0, so I'm 
targeting this for 2.1; sounds good?

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14560) Cooperative Memory Management for Spillables

2016-05-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14560.

   Resolution: Fixed
 Assignee: Lianhui Wang  (was: Imran Rashid)
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/10024

> Cooperative Memory Management for Spillables
> 
>
> Key: SPARK-14560
> URL: https://issues.apache.org/jira/browse/SPARK-14560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Imran Rashid
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> SPARK-10342 introduced cooperative memory management for SQL operators that 
> can spill; however, {{Spillable}}s used by the old RDD API still do not 
> cooperate.  This can lead to memory starvation, in particular on a 
> shuffle-to-shuffle stage, eventually resulting in errors like:
> {noformat}
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory 
> were used by task 3081 but are not associated with specific consumers
> 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory 
> are used for execution and 1710484 bytes of memory are used for storage
> 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size 
> = 1317230346 bytes, TID = 3081
> 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage 
> 3.0 (TID 3081)
> java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
> at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This can happen anytime the shuffle read side requires more memory than what 
> is available for the task.  Since the shuffle-read side doubles its memory 
> request each time, it can easily end up acquiring all of the available 
> memory, even if it does not use it.  Eg., say that after the final spill, the 
> shuffle-read side requires 10 MB more memory, and there is 15 MB of memory 
> available.  But if it starts at 2 MB, it will double to 4, 8, and then 
> request 16 MB of memory, and in fact get all available 15 MB.  Since the 15 
> MB of memory is sufficient, it will not spill, and will continue holding on 
> to all available memory.  But this leaves *no* memory available for the 
> shuffle-write side.  Since the shuffle-write side cannot request the 
> shuffle-read side to free up memory, this leads to an OOM.
> The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as 
> well, so RDDs can benefit from the cooperative memory management introduced 
> by SPARK-10342.
> Note that an additional improvement would be for the shuffle-read side to 
> simply release unused memory, without spilling, in case that would leave 
> enough memory, and only spill if that was inadequate.  However, that can come 
> as a later improvement.
> *Workaround*:  You can set 
> {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to 
> occur every {{N}} elements, thus preventing the shuffle-read side from ever 
> grabbing all of the available memory.  However, this requires careful tuning 
> of {{N}} to specific workloads: too big, and you will still get an OOM; too 
> small, and there will be so much spilling that performance will suffer 
> drastically.  Furthermore, this workaround uses an *undocumented* 
> configuration with *no compatibility guarantees* for future versions of spark.
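A hedged sketch of how the workaround above could be wired into a PySpark application (the threshold value here is only a placeholder to be tuned per workload, and, as noted, the setting is undocumented with no compatibility guarantees):

{code}
# Sketch of applying the workaround described above (PySpark assumed).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("force-spill-workaround")
        .set("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000"))
sc = SparkContext(conf=conf)
{code}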



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
T

[jira] [Updated] (SPARK-15179) Enable SQL generation for subqueries

2016-05-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15179:
---
Assignee: Herman van Hovell

> Enable SQL generation for subqueries
> 
>
> Key: SPARK-15179
> URL: https://issues.apache.org/jira/browse/SPARK-15179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> SQL Generation for subqueries is broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14773) Enable the tests in HiveCompatibilitySuite for subquery

2016-05-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14773.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12988
[https://github.com/apache/spark/pull/12988]

> Enable the tests in HiveCompatibilitySuite for subquery
> ---
>
> Key: SPARK-14773
> URL: https://issues.apache.org/jira/browse/SPARK-14773
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> There are a few test cases in HiveCompatibilitySuite for subqueries; we should 
> enable them to get better coverage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15179) Enable SQL generation for subqueries

2016-05-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15179.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12988
[https://github.com/apache/spark/pull/12988]

> Enable SQL generation for subqueries
> 
>
> Key: SPARK-15179
> URL: https://issues.apache.org/jira/browse/SPARK-15179
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
> Fix For: 2.0.0
>
>
> SQL Generation for subqueries is broken.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15154) LongHashedRelation test fails on Big Endian platform

2016-05-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15154:
---
Assignee: Pete Robbins

> LongHashedRelation test fails on Big Endian platform
> 
>
> Key: SPARK-15154
> URL: https://issues.apache.org/jira/browse/SPARK-15154
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Pete Robbins
>Assignee: Pete Robbins
>Priority: Minor
>  Labels: big-endian
> Fix For: 2.0.0
>
>
> NPE in 
> org.apache.spark.sql.execution.joins.HashedRelationSuite.LongToUnsafeRowMap
> Error Message
> java.lang.NullPointerException was thrown.
> Stacktrace
>   java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3$$anonfun$apply$mcV$sp$1.apply$mcVI$sp(HashedRelationSuite.scala:121)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3.apply$mcV$sp(HashedRelationSuite.scala:119)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3.apply(HashedRelationSuite.scala:112)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3.apply(HashedRelationSuite.scala:112)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:29)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1526)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1526)
>   at 
> org.scalatest.tools.DiscoverySuite.runNestedSuites(DiscoverySuite.scala:29)
>   at org.scalatest.Suite$class.run(Suite.scala:1421)
>   at org.scalatest.tools.DiscoverySuite.run(DiscoverySuite.scala:29)
>   at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
>   at 
> org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDo

[jira] [Resolved] (SPARK-15154) LongHashedRelation test fails on Big Endian platform

2016-05-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15154.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13009
[https://github.com/apache/spark/pull/13009]

> LongHashedRelation test fails on Big Endian platform
> 
>
> Key: SPARK-15154
> URL: https://issues.apache.org/jira/browse/SPARK-15154
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Pete Robbins
>Priority: Minor
>  Labels: big-endian
> Fix For: 2.0.0
>
>
> NPE in 
> org.apache.spark.sql.execution.joins.HashedRelationSuite.LongToUnsafeRowMap
> Error Message
> java.lang.NullPointerException was thrown.
> Stacktrace
>   java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3$$anonfun$apply$mcV$sp$1.apply$mcVI$sp(HashedRelationSuite.scala:121)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3.apply$mcV$sp(HashedRelationSuite.scala:119)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3.apply(HashedRelationSuite.scala:112)
>   at 
> org.apache.spark.sql.execution.joins.HashedRelationSuite$$anonfun$3.apply(HashedRelationSuite.scala:112)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:29)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1526)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1526)
>   at 
> org.scalatest.tools.DiscoverySuite.runNestedSuites(DiscoverySuite.scala:29)
>   at org.scalatest.Suite$class.run(Suite.scala:1421)
>   at org.scalatest.tools.DiscoverySuite.run(DiscoverySuite.scala:29)
>   at org.scalatest.tools.SuiteRunner.run(SuiteRunner.sca

[jira] [Resolved] (SPARK-14972) Improve performance of JSON schema inference's inferField step

2016-05-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14972.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12750
[https://github.com/apache/spark/pull/12750]

> Improve performance of JSON schema inference's inferField step
> --
>
> Key: SPARK-14972
> URL: https://issues.apache.org/jira/browse/SPARK-14972
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> JSON schema inference spends a lot of time in inferField and there are a 
> number of techniques to speed it up, including eliminating unnecessary 
> sorting and the use of inefficient collections.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14946) Spark 2.0 vs 1.6.1 Query Time(out)

2016-05-09 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276743#comment-15276743
 ] 

Davies Liu commented on SPARK-14946:


[~raymond.honderd...@sizmek.com] It seems that the second job (which scans the 
bigger table) did not get started; could you try to disable the broadcast join 
by setting spark.sql.autoBroadcastJoinThreshold to 0?
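For reference, a minimal way to apply that suggestion from PySpark (just a sketch, not part of the ticket; the same setting can also be issued as a SET command from beeline):

{code}
# Disable broadcast joins for this experiment by setting the threshold to 0
# (assumes an existing SQLContext named sqlContext, as in the pyspark shell).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "0")
{code}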

> Spark 2.0 vs 1.6.1 Query Time(out)
> --
>
> Key: SPARK-14946
> URL: https://issues.apache.org/jira/browse/SPARK-14946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Raymond Honderdors
>Priority: Critical
> Attachments: Query Plan 1.6.1.png, screenshot-spark_2.0.png, 
> spark-defaults.conf, spark-env.sh, version 1.6.1 screen 1 - thrift collect = 
> true.png, version 1.6.1 screen 1 thrift collect = false.png, version 1.6.1 
> screen 2 thrift collect =false.png, version 2.0 -screen 1 thrift collect = 
> false.png, version 2.0 screen 2 thrift collect = true.png, versiuon 2.0 
> screen 1 thrift collect = true.png
>
>
> I run a query using the JDBC driver. On version 1.6.1 it returns after 5–6 min; 
> the same query against version 2.0 fails after 2h (due to timeout). For details 
> on how to reproduce, also see the comments below.
> Here is what I tried:
> I run the following query: select * from pe_servingdata sd inner join 
> pe_campaigns_gzip c on sd.campaignid = c.campaign_id ;
> (with and without a count and a group by on campaign_id)
> I run Spark 1.6.1 and the Thriftserver, then run the SQL from beeline or 
> SQuirreL; after a few minutes I get an answer (0 rows), which is correct 
> because my data did not have matching campaign ids in both tables.
> When I run Spark 2.0 and the Thriftserver, I once again run the SQL statement; 
> after 2:30 min it gives up, but already after 30/60 sec I stop seeing 
> activity on the Spark UI.
> (Sorry for the delay in completing the description of the bug; I was on and 
> off work due to national holidays.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates

2016-05-06 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15122.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12954
[https://github.com/apache/spark/pull/12954]

> TPC-DS Query 41 fails with The correlated scalar subquery can only contain 
> equality predicates
> -
>
> Key: SPARK-15122
> URL: https://issues.apache.org/jira/browse/SPARK-15122
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: JESSE CHEN
>Priority: Critical
> Fix For: 2.0.0
>
>
> The official TPC-DS query 41 fails with the following error:
> {noformat}
> Error in query: The correlated scalar subquery can only contain equality 
> predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women) 
> && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce) 
> || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra 
> large || (((i_category#36 = Women) && ((i_color#41 = brown) || 
> (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) && 
> ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) && 
> ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) || 
> (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large || 
> (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 = 
> cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39 
> = medium) || (i_size#39 = extra large))) || ((i_manufact#38 = 
> i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) || 
> (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) && 
> ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 = 
> Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 = 
> Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small) 
> || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 = 
> frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 = 
> petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41 
> = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 = 
> Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large;
> {noformat}
> The output plans showed the following errors
> {noformat}
> == Parsed Logical Plan ==
> 'GlobalLimit 100
> +- 'LocalLimit 100
>+- 'Sort ['i_product_name ASC], true
>   +- 'Distinct
>  +- 'Project ['i_product_name]
> +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + 
> 40))) && (scalar-subquery#1 [] > 0))
>:  +- 'SubqueryAlias scalar-subquery#1 []
>: +- 'Project ['count(1) AS item_cnt#0]
>:+- 'Filter ((('i_manufact = 'i1.i_manufact) && 
> ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) && 
> ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size = 
> extra large || ((('i_category = Women) && (('i_color = brown) || 
> ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) && 
> (('i_size = N/A) || ('i_size = small) || 'i_category = Men) && 
> (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) || 
> ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large || 
> ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) && 
> ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size 
> = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category = 
> Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units = 
> Pallet) || ('i_units = Gross)) && (('i_size = medium) || ('i_size = extra 
> large || ((('i_category = Women) && (('i_color = cyan) || ('i_color = 
> papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) || 
> ('i_size = small) || 'i_category = Men) && (('i_color = orange) || 
> ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) && 
> (('i_size = petite) || ('i_size = large || ((('i_category = Men) && 
> (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) || 
> ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large
>:   +- 'UnresolvedRelation `item`, None
>+- 'UnresolvedRelation `item`, Some(i1)
> == Analyzed Logical Plan ==
> i_product_name: string
> GlobalLimit 100
> +- LocalLimit 100
>+- Sort [i_product_name#24 ASC], true
>   +- Distinct
>  +- Project [i_product_name#24]
> +- Filter (((i_manufact_id#1

[jira] [Resolved] (SPARK-1239) Improve fetching of map output statuses

2016-05-06 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-1239.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12113
[https://github.com/apache/spark/pull/12113]

> Improve fetching of map output statuses
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Patrick Wendell
>Assignee: Thomas Graves
> Fix For: 2.0.0
>
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14512) Add python example for QuantileDiscretizer

2016-05-06 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14512.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12281
[https://github.com/apache/spark/pull/12281]

> Add python example for QuantileDiscretizer
> --
>
> Key: SPARK-14512
> URL: https://issues.apache.org/jira/browse/SPARK-14512
> Project: Spark
>  Issue Type: Improvement
>Reporter: zhengruifeng
> Fix For: 2.0.0
>
>
> Add the missing python example for QuantileDiscretizer
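A minimal sketch of the kind of Python example the ticket asks for (assumes a Spark 2.0 SparkSession named spark, as in the other Python examples; the toy data and column names are made up):

{code}
from pyspark.ml.feature import QuantileDiscretizer

data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)]
df = spark.createDataFrame(data, ["id", "hour"])

# Bucket the continuous "hour" column into 3 quantile-based bins.
discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result")
discretizer.fit(df).transform(df).show()
{code}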



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15110) SparkR - Implement repartitionByColumn on DataFrame

2016-05-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15110.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12887
[https://github.com/apache/spark/pull/12887]

> SparkR - Implement repartitionByColumn on DataFrame
> ---
>
> Key: SPARK-15110
> URL: https://issues.apache.org/jira/browse/SPARK-15110
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> Implement repartitionByColumn on DataFrame.
> This will allow us to run R functions on each partition identified by column 
> groups with the dapply() method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14946) Spark 2.0 vs 1.6.1 Query Time(out)

2016-05-05 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272656#comment-15272656
 ] 

Davies Liu commented on SPARK-14946:


The screenshot from 2.0 suggests that the second job (the main job) has not 
reported any metrics back after 1.8 minutes, which is weird. 

> Spark 2.0 vs 1.6.1 Query Time(out)
> --
>
> Key: SPARK-14946
> URL: https://issues.apache.org/jira/browse/SPARK-14946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Raymond Honderdors
>Priority: Critical
> Attachments: Query Plan 1.6.1.png, screenshot-spark_2.0.png, 
> spark-defaults.conf, spark-env.sh
>
>
> I run a query using the JDBC driver. On version 1.6.1 it returns after 5–6 min; 
> the same query against version 2.0 fails after 2h (due to timeout). For details 
> on how to reproduce, also see the comments below.
> Here is what I tried:
> I run the following query: select * from pe_servingdata sd inner join 
> pe_campaigns_gzip c on sd.campaignid = c.campaign_id ;
> (with and without a count and a group by on campaign_id)
> I run Spark 1.6.1 and the Thriftserver, then run the SQL from beeline or 
> SQuirreL; after a few minutes I get an answer (0 rows), which is correct 
> because my data did not have matching campaign ids in both tables.
> When I run Spark 2.0 and the Thriftserver, I once again run the SQL statement; 
> after 2:30 min it gives up, but already after 30/60 sec I stop seeing 
> activity on the Spark UI.
> (Sorry for the delay in completing the description of the bug; I was on and 
> off work due to national holidays.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14946) Spark 2.0 vs 1.6.1 Query Time(out)

2016-05-05 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272646#comment-15272646
 ] 

Davies Liu commented on SPARK-14946:


It would be great to narrow this issue down; I can't reproduce it right now.

Have you set spark.sql.thriftServer.incrementalCollect?

> Spark 2.0 vs 1.6.1 Query Time(out)
> --
>
> Key: SPARK-14946
> URL: https://issues.apache.org/jira/browse/SPARK-14946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Raymond Honderdors
>Priority: Critical
> Attachments: Query Plan 1.6.1.png, screenshot-spark_2.0.png, 
> spark-defaults.conf, spark-env.sh
>
>
> I run a query using the JDBC driver. On version 1.6.1 it returns after 5–6 min; 
> the same query against version 2.0 fails after 2h (due to timeout). For details 
> on how to reproduce, also see the comments below.
> Here is what I tried:
> I run the following query: select * from pe_servingdata sd inner join 
> pe_campaigns_gzip c on sd.campaignid = c.campaign_id ;
> (with and without a count and a group by on campaign_id)
> I run Spark 1.6.1 and the Thriftserver, then run the SQL from beeline or 
> SQuirreL; after a few minutes I get an answer (0 rows), which is correct 
> because my data did not have matching campaign ids in both tables.
> When I run Spark 2.0 and the Thriftserver, I once again run the SQL statement; 
> after 2:30 min it gives up, but already after 30/60 sec I stop seeing 
> activity on the Spark UI.
> (Sorry for the delay in completing the description of the bug; I was on and 
> off work due to national holidays.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15045) Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable

2016-05-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15045:
---
Assignee: Jacek Lewandowski

> Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
> -
>
> Key: SPARK-15045
> URL: https://issues.apache.org/jira/browse/SPARK-15045
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Assignee: Jacek Lewandowski
> Fix For: 2.0.0
>
>
> Unless my eyes trick me, {{TaskMemoryManager}} first clears up {{pageTable}} 
> in a synchronized block, and right after the block it does it again. I think 
> the cleanup outside the block is dead code.
> See 
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L382-L397
>  with the relevant snippet pasted below:
> {code}
>   public long cleanUpAllAllocatedMemory() {
> synchronized (this) {
>   Arrays.fill(pageTable, null);
>   ...
> }
> for (MemoryBlock page : pageTable) {
>   if (page != null) {
> memoryManager.tungstenMemoryAllocator().free(page);
>   }
> }
> Arrays.fill(pageTable, null);
>...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15045) Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable

2016-05-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15045.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12829
[https://github.com/apache/spark/pull/12829]

> Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
> -
>
> Key: SPARK-15045
> URL: https://issues.apache.org/jira/browse/SPARK-15045
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
> Fix For: 2.0.0
>
>
> Unless my eyes trick me, {{TaskMemoryManager}} first clears up {{pageTable}} 
> in a synchronized block, and right after the block it does it again. I think 
> the cleanup outside the block is dead code.
> See 
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L382-L397
>  with the relevant snippet pasted below:
> {code}
>   public long cleanUpAllAllocatedMemory() {
> synchronized (this) {
>   Arrays.fill(pageTable, null);
>   ...
> }
> for (MemoryBlock page : pageTable) {
>   if (page != null) {
> memoryManager.tungstenMemoryAllocator().free(page);
>   }
> }
> Arrays.fill(pageTable, null);
>...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14946) Spark 2.0 vs 1.6.1 Query Time(out)

2016-05-04 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271332#comment-15271332
 ] 

Davies Liu commented on SPARK-14946:


[~raymond.honderd...@sizmek.com] Could you try to run this query in 
spark-shell, and also check the Spark UI to see how much time is spent on each 
job/stage/task?

> Spark 2.0 vs 1.6.1 Query Time(out)
> --
>
> Key: SPARK-14946
> URL: https://issues.apache.org/jira/browse/SPARK-14946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Raymond Honderdors
>Priority: Critical
> Attachments: Query Plan 1.6.1.png, screenshot-spark_2.0.png, 
> spark-defaults.conf, spark-env.sh
>
>
> I run a query using the JDBC driver. On version 1.6.1 it returns after 5–6 min; 
> the same query against version 2.0 fails after 2h (due to timeout). For details 
> on how to reproduce, also see the comments below.
> Here is what I tried:
> I run the following query: select * from pe_servingdata sd inner join 
> pe_campaigns_gzip c on sd.campaignid = c.campaign_id ;
> (with and without a count and a group by on campaign_id)
> I run Spark 1.6.1 and the Thriftserver, then run the SQL from beeline or 
> SQuirreL; after a few minutes I get an answer (0 rows), which is correct 
> because my data did not have matching campaign ids in both tables.
> When I run Spark 2.0 and the Thriftserver, I once again run the SQL statement; 
> after 2:30 min it gives up, but already after 30/60 sec I stop seeing 
> activity on the Spark UI.
> (Sorry for the delay in completing the description of the bug; I was on and 
> off work due to national holidays.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14951) Subexpression elimination in wholestage codegen version of TungstenAggregate

2016-05-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14951:
---
Assignee: Liang-Chi Hsieh

> Subexpression elimination in wholestage codegen version of TungstenAggregate
> 
>
> Key: SPARK-14951
> URL: https://issues.apache.org/jira/browse/SPARK-14951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> Currently the wholestage codegen version of TungstenAggregate does not support 
> subexpression elimination. We should support it.
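As an illustration (not taken from the ticket) of the kind of aggregate that benefits, assuming a Spark 2.0 SparkSession named spark: with subexpression elimination, the repeated expression a + b below could be evaluated once per input row inside the generated code instead of once per aggregate.

{code}
# Hypothetical example: the common subexpression (a + b) appears in several
# aggregate expressions handled by the same TungstenAggregate operator.
df = spark.range(1000).selectExpr("id AS a", "id * 2 AS b")
df.selectExpr("sum(a + b)", "avg(a + b)", "max(a + b)").show()
{code}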



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14951) Subexpression elimination in wholestage codegen version of TungstenAggregate

2016-05-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14951.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12729
[https://github.com/apache/spark/pull/12729]

> Subexpression elimination in wholestage codegen version of TungstenAggregate
> 
>
> Key: SPARK-14951
> URL: https://issues.apache.org/jira/browse/SPARK-14951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> Currently wholestage codegen version of TungstenAggregate does not support 
> subexpression elimination. We should support it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15105) Remove HiveSessionHook from ThriftServer

2016-05-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15105:
--

 Summary: Remove HiveSessionHook from ThriftServer
 Key: SPARK-15105
 URL: https://issues.apache.org/jira/browse/SPARK-15105
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15102) remove delegation token from ThriftServer

2016-05-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15102:
--

 Summary: remove delegation token from ThriftServer
 Key: SPARK-15102
 URL: https://issues.apache.org/jira/browse/SPARK-15102
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu
Assignee: Davies Liu


This feature is only useful for Hadoop



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11316) coalesce doesn't handle UnionRDD with partial locality properly

2016-05-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11316.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11327
[https://github.com/apache/spark/pull/11327]

> coalesce doesn't handle UnionRDD with partial locality properly
> ---
>
> Key: SPARK-11316
> URL: https://issues.apache.org/jira/browse/SPARK-11316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
> Fix For: 2.0.0
>
>
> I haven't fully debugged this yet, but I'm reporting what I'm seeing and what I 
> think might be going on.
> I have a graph processing job that is seeing a huge slowdown in setupGroups in 
> the location iterator, where it's getting the preferred locations for the 
> coalesce.  It is coalescing from 2400 down to 1200 partitions and it's taking 17+ 
> hours to do the calculation.  I killed it at that point, so I don't know the total 
> time.
> It appears that the job is doing an isEmpty call, a bunch of other 
> transformations, then a coalesce (which is where it takes so long), other 
> transformations, and then finally a count to trigger it.
> It appears that there is only one node that it's finding in the setupGroup 
> call, and to get to that node it first has to go through the while loop:
> while (numCreated < targetLen && tries < expectedCoupons2) {
> where expectedCoupons2 is around 19000.  It finds very few or none in this 
> loop.
> Then it does the second loop:
> while (numCreated < targetLen) {  // if we don't have enough partition 
> groups, create duplicates
>   var (nxt_replica, nxt_part) = rotIt.next()
>   val pgroup = PartitionGroup(nxt_replica)
>   groupArr += pgroup
>   groupHash.getOrElseUpdate(nxt_replica, ArrayBuffer()) += pgroup
>   var tries = 0
>   while (!addPartToPGroup(nxt_part, pgroup) && tries < targetLen) { // 
> ensure at least one part
> nxt_part = rotIt.next()._2
> tries += 1
>   }
>   numCreated += 1
> }
> Here it has an inner while loop, and both of those are going 1200 times: 
> 1200*1200 iterations.  This takes a very long time.
> The user can work around the issue by adding a count() call right after the 
> isEmpty call, before the coalesce is called.  I also tried putting 
> in a take(1) right before the isEmpty call and it seems to work around 
> the issue, but it took 1 hour with the take vs. a few minutes with the count().
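A rough sketch of where the workaround call goes, in PySpark with made-up data (the RDD, sizes, and partition counts here are placeholders, not the reporter's job):

{code}
# Hypothetical skeleton: forcing evaluation with count() right after
# isEmpty() avoids the slow preferred-location computation that the later
# coalesce() would otherwise trigger.
rdd = sc.parallelize(range(1000000), 2400).map(lambda x: (x % 100, x))
if not rdd.isEmpty():
    rdd.count()                      # workaround: materialize here
    result = rdd.coalesce(1200).count()
    print(result)
{code}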



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14521) StackOverflowError in Kryo when executing TPC-DS

2016-05-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14521.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12598
[https://github.com/apache/spark/pull/12598]

> StackOverflowError in Kryo when executing TPC-DS
> 
>
> Key: SPARK-14521
> URL: https://issues.apache.org/jira/browse/SPARK-14521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Build details:  Spark build from master branch (Apr-10)
> DataSet:TPC-DS at 200 GB scale in Parq format stored in hive.
> Client: $SPARK_HOME/bin/beeline 
> Query:  TPC-DS Query27
> spark.sql.sources.fileScan=true (this is the default value anyways)
> Exception:
> {noformat}
> Exception in thread "broadcast-exchange-0" java.lang.StackOverflowError
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:108)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:99)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
> at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15095) Drop binary mode in ThriftServer

2016-05-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15095:
--

 Summary: Drop binary mode in ThriftServer
 Key: SPARK-15095
 URL: https://issues.apache.org/jira/browse/SPARK-15095
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14226) Caching a table with 1,100 columns and a few million rows fails

2016-05-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-14226.
--
Resolution: Duplicate

> Caching a table with 1,100 columns and a few million rows fails
> ---
>
> Key: SPARK-14226
> URL: https://issues.apache.org/jira/browse/SPARK-14226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Priority: Critical
>
> I created a DataFrame from the 1000 genomes data set using csv data source. I 
> register it as a table and tried to cache it. I get the following error:
> {code}
> val vcfData = sqlContext.read.format("csv").options(Map(
>   "comment" -> "#", "header" -> "false", "delimiter" -> "\t"
> )).load("/1000genomes/")
> val colNames = 
> sc.textFile("/hossein/1000genomes/").take(100).filter(_.startsWith("#CHROM")).head.split("\t")
> val data = vcfData.toDF(colNames: _*).registerTempTable("genomeTable")
> %sql cache table genomeTable
> org.apache.spark.SparkException: Job aborted due to stage failure: Total size 
> of serialized results of 2086 tasks (4.0 GB) is bigger than 
> spark.driver.maxResultSize (4.0 GB)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1457)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1445)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1444)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1444)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1666)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1625)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1614)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1765)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1778)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1791)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1805)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:881)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:880)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:276)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:1979)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2242)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:1978)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:1985)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:1995)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:1994)
>   at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2255)
>   at org.apache.spark.sql.Dataset.count(Dataset.scala:1994)
>   at 
> org.apache.spark.sql.execution.command.CacheTableCommand.run(commands.scala:270)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:61)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommand.sideEffectResult(commands.scala:59)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommand.doExecute(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$exec

[jira] [Updated] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2016-05-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12837:
---
Target Version/s: 2.0.0
Priority: Critical  (was: Major)

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory even when there are no requests to collect data back 
> to the driver.
> Here are steps to re-produce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2016-05-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12837:
---
Assignee: Wenchen Fan

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory even when no data is collected back to the driver.
> Here are the steps to reproduce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2016-05-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15269132#comment-15269132
 ] 

Davies Liu commented on SPARK-12837:


With spark.driver.maxResultSize=1m, this simple job will fail:
{code}
>>> sc.range(0, 1000, 1, 1000).count()
Job aborted due to stage failure: Total size of serialized results of 370 tasks 
(1024.7 KB) is bigger than spark.driver.maxResultSize (1024.0 KB)
{code}

That means the average size of a serialized task result is about 2.8 KB.

After some debugging, the actual result is only about 60 bytes; all the rest are 
accumulator updates (23 accumulators), but this query should not update so many 
accumulators. It seems that we are collecting all the accumulators to the driver 
(not just the updated ones).

Another issue is that each accumulator is serialized to about 100 bytes; we could 
also reduce that size.
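
As a back-of-envelope check on the numbers above (a sketch only, not taken from 
the actual debugging session):

{code}
// 370 task results totalling 1024.7 KB works out to ~2.8 KB per task, even though
// the real per-task result (a single count) is only ~60 bytes; the rest is
// per-task accumulator-update overhead (roughly 23 accumulators at ~100 bytes
// each, plus serialization framing).
val totalBytes = 1024.7 * 1024
val tasks = 370
val avgPerTask = totalBytes / tasks             // ≈ 2836 bytes ≈ 2.8 KB
val resultOnly = 60.0                           // the actual result payload
val overheadPerTask = avgPerTask - resultOnly   // ≈ 2776 bytes that are not the result
{code}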

cc [~cloud_fan]


> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory even when no data is collected back to the driver.
> Here are the steps to reproduce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14992) Flaky test: BucketedReadSuite.only shuffle one side when join bucketed table and non-bucketed table

2016-05-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14992.

   Resolution: Fixed
Fix Version/s: 2.0.0

Fixed by https://github.com/apache/spark/pull/12773

> Flaky test: BucketedReadSuite.only shuffle one side when join bucketed table 
> and non-bucketed table
> ---
>
> Key: SPARK-14992
> URL: https://issues.apache.org/jira/browse/SPARK-14992
> Project: Spark
>  Issue Type: Test
>Reporter: Davies Liu
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57279/testReport/org.apache.spark.sql.sources/BucketedReadSuite/only_shuffle_one_side_when_join_bucketed_table_and_non_bucketed_table/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15088) Remove SparkSqlSerializer

2016-05-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15088.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12864
[https://github.com/apache/spark/pull/12864]

> Remove SparkSqlSerializer
> -
>
> Key: SPARK-15088
> URL: https://issues.apache.org/jira/browse/SPARK-15088
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> This seems unused now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12540) Support all TPCDS queries

2016-05-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12540.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Support all TPCDS queries
> -
>
> Key: SPARK-12540
> URL: https://issues.apache.org/jira/browse/SPARK-12540
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> As of Spark SQL 1.6, Spark can run 55 out of 99 TPCDS queries. The goal of 
> this epic is to support running all the TPC-DS queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12540) Support all TPCDS queries

2016-05-02 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267739#comment-15267739
 ] 

Davies Liu commented on SPARK-12540:


We made it into Spark 2.0 finally, bingo!

> Support all TPCDS queries
> -
>
> Key: SPARK-12540
> URL: https://issues.apache.org/jira/browse/SPARK-12540
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> As of Spark SQL 1.6, Spark can run 55 out of 99 TPCDS queries. The goal of 
> this epic is to support running all the TPC-DS queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14785) Support correlated scalar subquery

2016-05-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14785:
---
Assignee: Herman van Hovell

> Support correlated scalar subquery
> --
>
> Key: SPARK-14785
> URL: https://issues.apache.org/jira/browse/SPARK-14785
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> For example:
> {code}
> SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)
> {code}
> it could be rewritten as 
> {code}
> SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON 
> t3.id = t.id where b > avg_c
> {code}
> TPCDS Q92, Q81, and Q6 require this.
> Update: TPCDS Q1 and Q30 also require correlated scalar subquery support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14785) Support correlated scalar subquery

2016-05-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14785.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12822
[https://github.com/apache/spark/pull/12822]

> Support correlated scalar subquery
> --
>
> Key: SPARK-14785
> URL: https://issues.apache.org/jira/browse/SPARK-14785
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> For example:
> {code}
> SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id)
> {code}
> it could be rewritten as 
> {code}
> SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON 
> t3.id = t.id where b > avg_c
> {code}
> TPCDS Q92, Q81, and Q6 require this.
> Update: TPCDS Q1 and Q30 also require correlated scalar subquery support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13753) Column nullable is derived incorrectly

2016-05-02 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267689#comment-15267689
 ] 

Davies Liu commented on SPARK-13753:


After looking at the query, the bug is caused by the assumption that the keys of a 
MapType are always non-nullable; when creating a map using map(), we do not check 
the nullability of the keys.

So the solution could be either 1) enforcing the nullability check in map(), which 
would break this use case, or 2) allowing `null` as a key in MapType, which may 
require more API changes.
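
A minimal sketch of the shape of the problem (hypothetical table and column names, 
not the original query):

{code}
// The map *key* expression below can evaluate to NULL at runtime, but MapType
// keys are assumed to be non-nullable, so after explode() the key column `k` is
// derived as non-nullable and the "k IS NOT NULL" filter can be optimized away.
val df = sqlContext.sql("""
  SELECT k, v
  FROM (SELECT explode(map(IF(id > 0, id, NULL), 'x')) AS (k, v)
        FROM some_table) t
  WHERE k IS NOT NULL
""")
df.explain(true)   // check whether the Filter on k survives in the optimized plan
{code}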

cc [~rxin] [~marmbrus] [~yhuai]

> Column nullable is derived incorrectly
> --
>
> Key: SPARK-13753
> URL: https://issues.apache.org/jira/browse/SPARK-13753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Jingwei Lu
>Priority: Critical
>
> There is a problem in spark sql to derive nullable column and used in 
> optimization incorrectly. In following query:
> {code}
> select concat("perf.realtime.web", b.tags[1]) as metric, b.value, b.tags[0]
>   from (
> select explode(map(a.frontend[0], 
> ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), 
> ",action:", COALESCE(action, "null")), ".p50"),
>  a.frontend[1], 
> ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), 
> ",action:", COALESCE(action, "null")), ".p90"),
>  a.backend[0], ARRAY(concat("metric:backend", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p50"),
>  a.backend[1], ARRAY(concat("metric:backend", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p90"),
>  a.render[0], ARRAY(concat("metric:render", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p50"),
>  a.render[1], ARRAY(concat("metric:render", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p90"),
>  a.page_load_time[0], 
> ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p50"),
>  a.page_load_time[1], 
> ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p90"),
>  a.total_load_time[0], 
> ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p50"),
>  a.total_load_time[1], 
> ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p90"))) as (value, tags)
> from (
>   select  data.controller as controller, data.action as 
> action,
>   percentile(data.frontend, array(0.5, 0.9)) as 
> frontend,
>   percentile(data.backend, array(0.5, 0.9)) as 
> backend,
>   percentile(data.render, array(0.5, 0.9)) as render,
>   percentile(data.page_load_time, array(0.5, 0.9)) as 
> page_load_time,
>   percentile(data.total_load_time, array(0.5, 0.9)) 
> as total_load_time
>   from air_events_rt
>   where type='air_events' and data.event_name='pageload'
>   group by data.controller, data.action
> ) a
>   ) b
>   where b.value is not null
> {code}
> b.value is incorrectly derived as not nullable. The "b.value is not null" 
> predicate is therefore ignored by the optimizer, which causes the query to 
> return an incorrect result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14226) Caching a table with 1,100 columns and a few million rows fails

2016-05-02 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267651#comment-15267651
 ] 

Davies Liu commented on SPARK-14226:


[~falaki] Could you reproduce this with latest master? (or 2.0 branch)

> Caching a table with 1,100 columns and a few million rows fails
> ---
>
> Key: SPARK-14226
> URL: https://issues.apache.org/jira/browse/SPARK-14226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>Priority: Critical
>
> I created a DataFrame from the 1000 Genomes data set using the csv data source. I 
> registered it as a table and tried to cache it. I get the following error:
> {code}
> val vcfData = sqlContext.read.format("csv").options(Map(
>   "comment" -> "#", "header" -> "false", "delimiter" -> "\t"
> )).load("/1000genomes/")
> val colNames = 
> sc.textFile("/hossein/1000genomes/").take(100).filter(_.startsWith("#CHROM")).head.split("\t")
> val data = vcfData.toDF(colNames: _*).registerTempTable("genomeTable")
> %sql cache table genomeTable
> org.apache.spark.SparkException: Job aborted due to stage failure: Total size 
> of serialized results of 2086 tasks (4.0 GB) is bigger than 
> spark.driver.maxResultSize (4.0 GB)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1457)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1445)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1444)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1444)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1666)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1625)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1614)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1765)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1778)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1791)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1805)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:881)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:880)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:276)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:1979)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2242)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:1978)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:1985)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:1995)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:1994)
>   at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2255)
>   at org.apache.spark.sql.Dataset.count(Dataset.scala:1994)
>   at 
> org.apache.spark.sql.execution.command.CacheTableCommand.run(commands.scala:270)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:61)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommand.sideEffectResult(commands.scala:59)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommand.doExecute(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$

[jira] [Updated] (SPARK-12141) Use Jackson to serialize all events when writing event log

2016-05-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12141:
---
Target Version/s:   (was: 2.0.0)

> Use Jackson to serialize all events when writing event log
> --
>
> Key: SPARK-12141
> URL: https://issues.apache.org/jira/browse/SPARK-12141
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>
> SPARK-11206 added infrastructure to serialize events using Jackson, so that 
> manual serialization code is not needed anymore.
> We should write all events using that support, and remove all the manual 
> serialization code in {{JsonProtocol}}.
> Since the event log format is a semi-public API, I'm targeting this at 2.0. 
> Also, we can't remove the manual deserialization code, since we need to be 
> able to read old event logs.
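
A minimal sketch of the direction described, assuming jackson-module-scala is on 
the classpath (the event class below is a stand-in, not a real listener event):

{code}
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

case class StageCompletedStandIn(stageId: Int, durationMs: Long, status: String)

// One shared mapper replaces the hand-written per-event JSON code in JsonProtocol.
val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
val json = mapper.writeValueAsString(StageCompletedStandIn(3, 1200L, "succeeded"))
println(json)   // e.g. {"stageId":3,"durationMs":1200,"status":"succeeded"}
{code}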



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13756) Reuse Query Fragments

2016-05-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13756:
---
Target Version/s:   (was: 2.0.0)

> Reuse Query Fragments
> -
>
> Key: SPARK-13756
> URL: https://issues.apache.org/jira/browse/SPARK-13756
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Mark Hamstra
>
> Query fragments that have been materialized in RDDs can and should be reused 
> either within the same query or in subsequent queries.
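
Until that happens automatically, the manual version of the idea looks roughly like 
this (illustrative only; table and column names are made up):

{code}
import org.apache.spark.sql.functions._

// Materialize a shared fragment once and reuse it from several queries; this
// umbrella issue is about Spark recognizing and reusing such fragments itself.
val fragment = spark.table("sales")
  .filter(col("year") === 2016)
  .groupBy(col("store_id")).agg(sum(col("amount")).as("total"))
fragment.cache()
fragment.createOrReplaceTempView("sales_2016_by_store")

val bigStores = spark.sql("SELECT * FROM sales_2016_by_store WHERE total > 100000")
val avgTotal  = spark.sql("SELECT avg(total) FROM sales_2016_by_store")
{code}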



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14389) OOM during BroadcastNestedLoopJoin

2016-05-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14389.

   Resolution: Fixed
 Assignee: Davies Liu
Fix Version/s: 2.0.0

This is resolved by https://issues.apache.org/jira/browse/SPARK-13213

> OOM during BroadcastNestedLoopJoin
> --
>
> Key: SPARK-14389
> URL: https://issues.apache.org/jira/browse/SPARK-14389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: OS: Amazon Linux AMI 2015.09
> EMR: 4.3.0 & 4.5.0
> Hadoop: Amazon 2.7.1 & 2.7.2
> Spark 1.6.0 & 1.6.1
> Ganglia 3.7.2
> Master: m3.xlarge & m4.xlarge
> Core: m3.xlarge & m4.xlarge
> m3.xlarge: 4 CPU, 15GB mem, 2x40GB SSD
> m4.xlarge: 4 CPU, 16GB mem, EBS SSD
>Reporter: Steve Johnston
>Assignee: Davies Liu
> Fix For: 2.0.0
>
> Attachments: controller_2.txt, jps_command_results.txt, lineitem.tbl, 
> plans.txt, sample_script.py, stderr_2.txt, stdout.txt, stdout_2.txt
>
>
> When executing the attached sample_script.py in client mode with a single 
> executor, a "java.lang.OutOfMemoryError: Java heap space" exception occurs 
> during the self-join of a small table (TPC-H lineitem generated for a 1M 
> dataset). See also the attached execution log stdout.txt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15071) Check the result of all TPCDS queries

2016-05-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-15071:
--

 Summary: Check the result of all TPCDS queries
 Key: SPARK-15071
 URL: https://issues.apache.org/jira/browse/SPARK-15071
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Davies Liu


We should compare the results of all TPC-DS queries against another database that 
can support all of them (for example, IBM Big SQL or PostgreSQL).
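
One possible way to do the comparison (a sketch only; paths, the query file layout 
and the reference-side commands are placeholders):

{code}
import org.apache.spark.sql.functions.col

// Run one TPC-DS query in Spark, write the sorted result out, and diff it against
// the output of the same query from the reference database.
val q19 = scala.io.Source.fromFile("tpcds/queries/q19.sql").mkString
val result = spark.sql(q19)
result.orderBy(result.columns.map(col): _*)
  .coalesce(1)
  .write.option("header", "true").csv("/tmp/spark-q19")

// Reference side (e.g. PostgreSQL), run outside Spark:
//   psql tpcds -c "\copy (SELECT ... ORDER BY ...) TO '/tmp/pg-q19.csv' CSV HEADER"
//   diff /tmp/spark-q19/part-*.csv /tmp/pg-q19.csv
{code}

Floating-point columns would need a tolerance-based comparison rather than a plain 
diff.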



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14953) LocalBackend should revive offers periodically

2016-05-02 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267625#comment-15267625
 ] 

Davies Liu commented on SPARK-14953:


see this one as a reference: https://github.com/apache/spark/pull/4147
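
A rough sketch of what "revive offers periodically" could look like (illustrative 
only; reviveOffers() below is a stand-in, not the code in that PR or in 
LocalBackend):

{code}
import java.util.concurrent.{Executors, TimeUnit}

def reviveOffers(): Unit = { /* re-offer the local cores to the TaskScheduler */ }

// Revive on a fixed interval in addition to the event-driven calls, so a task
// whose locality-delay timer has expired still gets another resource offer.
val reviveIntervalMs = 1000L
val reviveThread = Executors.newSingleThreadScheduledExecutor()
reviveThread.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = reviveOffers()
}, reviveIntervalMs, reviveIntervalMs, TimeUnit.MILLISECONDS)
{code}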

> LocalBackend should revive offers periodically
> --
>
> Key: SPARK-14953
> URL: https://issues.apache.org/jira/browse/SPARK-14953
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Cheng Lian
>
> {{LocalBackend}} only revives offers when tasks are submitted, succeed, or 
> fail. This may lead to a deadlock due to delayed scheduling. A case study is 
> provided in [this PR 
> comment|https://github.com/apache/spark/pull/12527#issuecomment-213034425].
> Basically, a job may have a task whose scheduling is delayed due to a locality 
> mismatch. The default delay timeout is 3s. If all other tasks finish during 
> this period, {{LocalBackend}} won't revive any offers after the timeout, since 
> no tasks are submitted, succeed, or fail after that point. Thus, the delayed 
> task will never be scheduled again and the job never completes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation

2016-05-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14834:
---
Target Version/s:   (was: 2.0.0)

> Force adding doc for new api in pyspark with @since annotation
> --
>
> Key: SPARK-14834
> URL: https://issues.apache.org/jira/browse/SPARK-14834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>
> This is for requiring users to add a Python doc when adding a new Python API with 
> the @since annotation. But thinking about it again, this is only suitable for 
> adding a new API to an existing Python module. If it is a new Python module 
> migrated from a Scala API, a Python doc is not mandatory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10343) Consider nullability of expression in codegen

2016-05-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10343.

   Resolution: Fixed
 Assignee: Davies Liu
Fix Version/s: 2.0.0

> Consider nullability of expression in codegen
> -
>
> Key: SPARK-10343
> URL: https://issues.apache.org/jira/browse/SPARK-10343
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 2.0.0
>
>
> In codegen, we didn't consider the nullability of expressions. Once we take it 
> into account, we can avoid lots of null checks (reducing the size of the 
> generated code and also improving performance).
> Before that, we should double-check the correctness of the nullability of all 
> expressions and schemas, or we will hit NPEs or wrong results.
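
An illustrative sketch of the idea (a hypothetical helper, not Catalyst's actual 
codegen API):

{code}
// Only emit the isNull bookkeeping when the input can actually be null; for a
// non-nullable attribute the generated code is a single read with no branch.
def genGetLong(ordinal: Int, nullable: Boolean): String = {
  if (nullable) {
    s"""boolean isNull$ordinal = row.isNullAt($ordinal);
       |long value$ordinal = isNull$ordinal ? -1L : row.getLong($ordinal);""".stripMargin
  } else {
    s"long value$ordinal = row.getLong($ordinal);"
  }
}

println(genGetLong(0, nullable = true))    // with the null check
println(genGetLong(0, nullable = false))   // without: smaller generated code, no branch
{code}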



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13950) Generate code for sort merge left/right outer join

2016-05-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-13950.
--
Resolution: Won't Fix

> Generate code for sort merge left/right outer join
> --
>
> Key: SPARK-13950
> URL: https://issues.apache.org/jira/browse/SPARK-13950
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14476) Show table name or path in string of DataSourceScan

2016-05-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14476:
---
Priority: Critical  (was: Major)

> Show table name or path in string of DataSourceScan
> ---
>
> Key: SPARK-14476
> URL: https://issues.apache.org/jira/browse/SPARK-14476
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Cheng Lian
>Priority: Critical
>
> Right now, the string representation of DataSourceScan is only "HadoopFiles xxx", 
> without any information about the table name or path.
> Since we had that in 1.6, this is kind of a regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14009) Fail the tests if the any catalyst rule reach max number of iteration.

2016-05-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-14009.
--
Resolution: Duplicate

> Fail the tests if the any catalyst rule reach max number of iteration.
> --
>
> Key: SPARK-14009
> URL: https://issues.apache.org/jira/browse/SPARK-14009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Recently some Catalyst rules have become unstable (conflicting with each other, 
> or continually adding stuff to the query plan). We should detect this early by 
> failing the tests if any rule batch reaches the max number of iterations (200).
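
A toy sketch of the requested behaviour (not the actual RuleExecutor code): run the 
rules to a fixed point, but fail loudly in tests when the iteration cap is hit 
instead of silently stopping.

{code}
def fixedPoint[T](start: T, maxIterations: Int, isTesting: Boolean)(step: T => T): T = {
  var current = start
  var prev = current
  var i = 0
  do {
    prev = current
    current = step(current)
    i += 1
  } while (current != prev && i < maxIterations)

  if (current != prev) {   // hit the cap without converging: rules keep undoing each other
    val msg = s"Rule batch did not converge within $maxIterations iterations"
    if (isTesting) throw new IllegalStateException(msg) else println(s"WARN: $msg")
  }
  current
}

// e.g. two "rules" that fight each other trip the cap:
// fixedPoint(0, maxIterations = 200, isTesting = true)(x => 1 - x)
{code}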



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14476) Show table name or path in string of DataSourceScan

2016-05-02 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-14476:
---
Target Version/s: 2.0.0

> Show table name or path in string of DataSourceScan
> ---
>
> Key: SPARK-14476
> URL: https://issues.apache.org/jira/browse/SPARK-14476
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Cheng Lian
>
> Right now, the string representation of DataSourceScan is only "HadoopFiles xxx", 
> without any information about the table name or path.
> Since we had that in 1.6, this is kind of a regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


