[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2018-09-11 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611633#comment-16611633
 ] 

Kazuaki Ishizaki commented on SPARK-16196:
--

[~cloud_fan] The PR in this JIRA entry proposes two fixes
 # Read data in a table cache directly from a columnar storage
 # Generate code to build a table cache

We have already implemented 1, but we have not implemented 2 yet. Let us address 
2 in the next release.

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Major
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.
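
A rough, hedged way to isolate the compression cost called out above: build the 
in-memory columnar cache of the table with compression enabled and disabled and 
compare the timings. The row count and key cardinality below are arbitrary 
choices, not the values from the original benchmark.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("inmemory-scan-benchmark").getOrCreate()

// Arbitrary test data: an id column plus a low-cardinality key column.
spark.range(10000000L)
  .selectExpr("id", "floor(rand() * 10000) as k")
  .createOrReplaceTempView("test")

Seq("true", "false").foreach { compressed =>
  spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", compressed)
  spark.catalog.cacheTable("test")
  spark.time(spark.sql("select count(k), count(id) from test").collect())  // builds the cache
  spark.time(spark.sql("select count(k), count(id) from test").collect())  // scans the cache
  spark.catalog.uncacheTable("test")
}
{code}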



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25412) FeatureHasher would change the value of output feature

2018-09-11 Thread Vincent (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611628#comment-16611628
 ] 

Vincent commented on SPARK-25412:
-

[~nick.pentre...@gmail.com] thanks.

> FeatureHasher would change the value of output feature
> --
>
> Key: SPARK-25412
> URL: https://issues.apache.org/jira/browse/SPARK-25412
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index, and it is suggested to 
> use a large integer value as the numFeatures parameter.
> We found several issues with the current implementation: 
>  # The feature name cannot be recovered from its index after the FeatureHasher 
> transform, for example when getting feature importances from decision tree 
> training that follows a FeatureHasher.
>  # When indices collide, which is very likely especially when 'numFeatures' is 
> relatively small, the feature value is changed to a new value (the sum of the 
> current and old values).
>  # To avoid collisions, we would have to set 'numFeatures' to a large number, 
> but the resulting highly sparse vectors increase the computation complexity of 
> model training.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25412) FeatureHasher would change the value of output feature

2018-09-11 Thread Vincent (JIRA)
Vincent created SPARK-25412:
---

 Summary: FeatureHasher would change the value of output feature
 Key: SPARK-25412
 URL: https://issues.apache.org/jira/browse/SPARK-25412
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.1
Reporter: Vincent


In the current implementation of FeatureHasher.transform, a simple modulo on 
the hashed value is used to determine the vector index, and it is suggested to 
use a large integer value as the numFeatures parameter.

We found several issues with the current implementation: 
 # The feature name cannot be recovered from its index after the FeatureHasher 
transform, for example when getting feature importances from decision tree 
training that follows a FeatureHasher.
 # When indices collide, which is very likely especially when 'numFeatures' is 
relatively small, the feature value is changed to a new value (the sum of the 
current and old values); see the sketch after this list.
 # To avoid collisions, we would have to set 'numFeatures' to a large number, 
but the resulting highly sparse vectors increase the computation complexity of 
model training.
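
A minimal sketch of the behavior described in item 2, showing how a 
modulo-based index plus value summation loses information on collisions. The 
hash function and the feature map here are simplified placeholders, not Spark's 
actual MurmurHash3-based FeatureHasher implementation.

{code:scala}
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Simplified stand-in for FeatureHasher.transform: hash each feature name,
// map it to an index with a modulo, and sum values that land on the same index.
def hashFeatures(features: Map[String, Double], numFeatures: Int): Map[Int, Double] = {
  val vector = mutable.Map.empty[Int, Double].withDefaultValue(0.0)
  features.foreach { case (name, value) =>
    val idx = ((MurmurHash3.stringHash(name) % numFeatures) + numFeatures) % numFeatures
    vector(idx) += value  // on a collision the original values are replaced by their sum
  }
  vector.toMap
}

// With a small numFeatures, two distinct features can share an index, so neither
// the original value nor the feature name is recoverable from the result.
val hashed = hashFeatures(Map("age" -> 30.0, "height" -> 175.0, "weight" -> 80.0), numFeatures = 4)
println(hashed)
{code}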



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory

2018-09-11 Thread Michael Spector (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611606#comment-16611606
 ] 

Michael Spector commented on SPARK-25380:
-

[~vanzin] Here's the breakdown:

!Screen Shot 2018-09-12 at 8.20.05.png|width=1021,height=713!

LMK if you need more information; I'll be more than glad to help.

> Generated plans occupy over 50% of Spark driver memory
> --
>
> Key: SPARK-25380
> URL: https://issues.apache.org/jira/browse/SPARK-25380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 (AWS emr-5.16.0)
>  
>Reporter: Michael Spector
>Priority: Minor
> Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot 
> 2018-09-12 at 8.20.05.png, heapdump_OOM.png
>
>
> When debugging an OOM exception during a long run of a Spark application (many 
> iterations of the same code), I've found that generated plans occupy most of 
> the driver memory. I'm not sure whether this is a memory leak or not, but it 
> would be helpful if old plans could be purged from memory anyway.
> Attached are screenshots of OOM heap dump opened in JVisualVM.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory

2018-09-11 Thread Michael Spector (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Spector updated SPARK-25380:

Attachment: Screen Shot 2018-09-12 at 8.20.05.png

> Generated plans occupy over 50% of Spark driver memory
> --
>
> Key: SPARK-25380
> URL: https://issues.apache.org/jira/browse/SPARK-25380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 (AWS emr-5.16.0)
>  
>Reporter: Michael Spector
>Priority: Minor
> Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot 
> 2018-09-12 at 8.20.05.png, heapdump_OOM.png
>
>
> When debugging an OOM exception during a long run of a Spark application (many 
> iterations of the same code), I've found that generated plans occupy most of 
> the driver memory. I'm not sure whether this is a memory leak or not, but it 
> would be helpful if old plans could be purged from memory anyway.
> Attached are screenshots of OOM heap dump opened in JVisualVM.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25385) Upgrade jackson version to 2.7.8

2018-09-11 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-25385.
-
Resolution: Won't Fix

> Upgrade jackson version to 2.7.8
> 
>
> Key: SPARK-25385
> URL: https://issues.apache.org/jira/browse/SPARK-25385
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> This upgrade fixes the following {{JsonMappingException}}:
> {noformat}
> export SPARK_PREPEND_CLASSES=true
> build/sbt clean package -Phadoop-3.1
> spark-shell
> scala> spark.range(10).write.parquet("/tmp/spark/parquet")
> com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson 
> version: 2.7.8
>   at 
> com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
>   at 
> com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.(RDDOperationScope.scala:82)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.(RDDOperationScope.scala){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25410) Spark executor on YARN does not include memoryOverhead when starting an ExecutorRunnable

2018-09-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25410.

Resolution: Not A Bug

bq. This means that the amount of memoryOverhead will not be used in running 
the job, hence wasted.

That's completely wrong. The "executor memory" is just the Java heap memory. 
The overhead accounts for all the non-heap memory: memory used by the JVM 
other than the Java heap, shared libraries, JNI libraries that allocate 
memory, and so on.
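
As a rough illustration of the sizing described above, here is a minimal sketch 
of the arithmetic, assuming the default overhead of 10% of the executor memory 
with a 384 MB floor; the actual value also depends on 
spark.executor.memoryOverhead and the Spark version.

{code:scala}
// Hypothetical numbers: the YARN container is sized as heap + overhead,
// while the executor JVM's -Xmx covers only the heap portion.
val executorMemoryMb = 4096                                       // spark.executor.memory
val overheadMb = math.max((0.10 * executorMemoryMb).toInt, 384)   // default unless overridden
val containerRequestMb = executorMemoryMb + overheadMb

println(s"executor JVM heap (-Xmx): $executorMemoryMb MB")        // 4096 MB
println(s"YARN container request:   $containerRequestMb MB")      // 4096 + 409 = 4505 MB
{code}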

> Spark executor on YARN does not include memoryOverhead when starting an 
> ExecutorRunnable
> 
>
> Key: SPARK-25410
> URL: https://issues.apache.org/jira/browse/SPARK-25410
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.1
>Reporter: Anbang Hu
>Priority: Major
>
> When deploying on YARN, only {{executorMemory}} is used to launch executors 
> in 
> [YarnAllocator.scala#L529|https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L529]:
> {code}
>   try {
> new ExecutorRunnable(
>   Some(container),
>   conf,
>   sparkConf,
>   driverUrl,
>   executorId,
>   executorHostname,
>   executorMemory,
>   executorCores,
>   appAttemptId.getApplicationId.toString,
>   securityMgr,
>   localResources
> ).run()
> updateInternalState()
>   } catch {
> {code}
> However, resource capability requested for each executor is {{executorMemory 
> + memoryOverhead}} in 
> [YarnAllocator.scala#L142|https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L142]:
> {code:scala}
>   // Resource capability requested for each executors
>   private[yarn] val resource = Resource.newInstance(executorMemory + 
> memoryOverhead, executorCores)
> {code}
> This means that the amount of {{memoryOverhead}} will not be used in running 
> the job, hence wasted.
> Checking both k8s and Mesos, it looks like they both include overhead memory.
> For k8s, in 
> [ExecutorPodFactory.scala#L179|https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L179]:
> {code}
> val executorContainer = new ContainerBuilder()
>   .withName("executor")
>   .withImage(executorContainerImage)
>   .withImagePullPolicy(imagePullPolicy)
>   .withNewResources()
> .addToRequests("memory", executorMemoryQuantity)
> .addToLimits("memory", executorMemoryLimitQuantity)
> .addToRequests("cpu", executorCpuQuantity)
> .endResources()
>   .addAllToEnv(executorEnv.asJava)
>   .withPorts(requiredPorts.asJava)
>   .addToArgs("executor")
>   .build()
> {code}
> For Mesos, 
> in[MesosSchedulerUtils.scala#L374|https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L374]:
> {code}
>   /**
>* Return the amount of memory to allocate to each executor, taking into 
> account
>* container overheads.
>*
>* @param sc SparkContext to use to get 
> `spark.mesos.executor.memoryOverhead` value
>* @return memory requirement as (0.1 * memoryOverhead) or 
> MEMORY_OVERHEAD_MINIMUM
>* (whichever is larger)
>*/
>   def executorMemory(sc: SparkContext): Int = {
> sc.conf.getInt("spark.mesos.executor.memoryOverhead",
>   math.max(MEMORY_OVERHEAD_FRACTION * sc.executorMemory, 
> MEMORY_OVERHEAD_MINIMUM).toInt) +
>   sc.executorMemory
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8000) SQLContext.read.load() should be able to auto-detect input data

2018-09-11 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611541#comment-16611541
 ] 

Hyukjin Kwon commented on SPARK-8000:
-

Yup, it's still a good thing to do.

> SQLContext.read.load() should be able to auto-detect input data
> ---
>
> Key: SPARK-8000
> URL: https://issues.apache.org/jira/browse/SPARK-8000
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>
> If it is a parquet file, use parquet. If it is a JSON file, use JSON. If it 
> is an ORC file, use ORC. If it is a CSV file, use CSV.
> Maybe Spark SQL can also write an output metadata file to specify the schema 
> & data source that's used.
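
A user-side sketch of the requested behavior, picking the data source from the 
file extension before calling load(). This is only an illustration, not how 
SQLContext.read.load() behaves today, and the helper name is hypothetical.

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: detect the format from the path and dispatch to the
// corresponding built-in data source.
def loadAuto(spark: SparkSession, path: String): DataFrame = {
  val format = path.toLowerCase match {
    case p if p.endsWith(".parquet") => "parquet"
    case p if p.endsWith(".json")    => "json"
    case p if p.endsWith(".orc")     => "orc"
    case p if p.endsWith(".csv")     => "csv"
    case _ => sys.error(s"cannot detect a data source for $path")
  }
  spark.read.format(format).load(path)
}
{code}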



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4

2018-09-11 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611538#comment-16611538
 ] 

Hyukjin Kwon commented on SPARK-25378:
--

If there's a simple way to fix it, it might be okay, but it's still not a public 
API ...
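
For reference, a caller-side workaround sketch under the 2.4 behavior: build the 
ArrayData from UTF8String values rather than java.lang.String, so the internal 
accessor's cast succeeds. This relies on internal catalyst/unsafe classes and is 
only an assumption about how a caller might adapt, not a fix for the regression.

{code:scala}
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// Store UTF8String elements up front so getUTF8String's cast does not fail.
val data = ArrayData.toArrayData(Array("a", "b").map(UTF8String.fromString))
val strings: Array[UTF8String] = data.toArray[UTF8String](StringType)
{code}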

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> --
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-11 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611537#comment-16611537
 ] 

Hyukjin Kwon commented on SPARK-25396:
--

Yea, I postponed closing this because of the attempt I made at that time, IIRC - 
that should also be related to handling malformed records. I hope it's not too 
complicated since, as you already know, the code here is quite convoluted. One 
possibility is to add another method that parses only the array and returns an 
iterator.

> Read array of JSON objects via an Iterator
> --
>
> Key: SPARK-25396
> URL: https://issues.apache.org/jira/browse/SPARK-25396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> If a JSON file has a structure like below:
> {code}
> [
>   {
>  "time":"2018-08-13T18:00:44.086Z",
>  "resourceId":"some-text",
>  "category":"A",
>  "level":2,
>  "operationName":"Error",
>  "properties":{...}
>  },
> {
>  "time":"2018-08-14T18:00:44.086Z",
>  "resourceId":"some-text2",
>  "category":"B",
>  "level":3,
>  "properties":{...}
>  },
>   ...
> ]
> {code}
> it should be read in the `multiLine` mode. In this mode, Spark reads the whole 
> array into memory, both when the schema is `ArrayType` and when it is 
> `StructType`. This can lead to unnecessary memory consumption and even to OOM 
> for big JSON files.
> In general, there is no need to materialize all parsed JSON records in memory 
> there: 
> https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
>  . So, the JSON objects of an array can be read via an Iterator. 
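
A minimal sketch of what iterator-based reading of a top-level JSON array could 
look like, using Jackson's streaming parser directly; the wiring into Spark's 
JacksonParser is not shown and is only assumed.

{code:scala}
import com.fasterxml.jackson.core.JsonToken
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

// Iterate over the elements of a top-level JSON array without materializing the
// whole array: advance past START_ARRAY, then parse one element at a time.
def arrayElements(json: String): Iterator[JsonNode] = {
  val mapper = new ObjectMapper()
  val parser = mapper.getFactory.createParser(json)
  require(parser.nextToken() == JsonToken.START_ARRAY, "expected a top-level JSON array")
  Iterator
    .continually(parser.nextToken())
    .takeWhile(_ != JsonToken.END_ARRAY)
    .map(_ => mapper.readTree[JsonNode](parser))  // consumes exactly one array element
}

arrayElements("""[{"a":1},{"a":2},{"a":3}]""").foreach(println)
{code}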



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25271) Creating parquet table with all the column null throws exception

2018-09-11 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611535#comment-16611535
 ] 

Liang-Chi Hsieh commented on SPARK-25271:
-

Yeah, it looks like after some changes this kind of query now uses Hive's 
record writer, so it inherits the issue from Hive.

> Creating parquet table with all the column null throws exception
> 
>
> Key: SPARK-25271
> URL: https://issues.apache.org/jira/browse/SPARK-25271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: shivusondur
>Priority: Major
> Attachments: image-2018-09-07-09-12-34-944.png, 
> image-2018-09-07-09-29-33-370.png, image-2018-09-07-09-29-52-899.png, 
> image-2018-09-07-09-32-43-892.png, image-2018-09-07-09-33-03-095.png
>
>
> {code:java}
>  1)cat /data/parquet.dat
> 1$abc2$pqr:3$xyz
> null{code}
>  
> {code:java}
> 2)spark.sql("create table vp_reader_temp (projects map) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' 
> MAP KEYS TERMINATED BY '$'")
> {code}
> {code:java}
> 3)spark.sql("
> LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp")
> {code}
> {code:java}
> 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from 
> vp_reader_temp")
> {code}
> *Result :* Throwing exception (Working fine with spark 2.2.1)
> {code:java}
> java.lang.RuntimeException: Parquet record is malformed: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320)
>   at 
> org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
>   at 
> 

[jira] [Assigned] (SPARK-23483) Feature parity for Python vs Scala APIs

2018-09-11 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23483:


Assignee: Huaxin Gao

> Feature parity for Python vs Scala APIs
> ---
>
> Key: SPARK-23483
> URL: https://issues.apache.org/jira/browse/SPARK-23483
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Huaxin Gao
>Priority: Major
>
> Investigate the feature parity for Python vs Scala APIs and address them



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23483) Feature parity for Python vs Scala APIs

2018-09-11 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611528#comment-16611528
 ] 

Hyukjin Kwon commented on SPARK-23483:
--

I am also assigning this to [~huaxingao] since she's been mainly working on 
this.

> Feature parity for Python vs Scala APIs
> ---
>
> Key: SPARK-23483
> URL: https://issues.apache.org/jira/browse/SPARK-23483
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Investigate the feature parity for Python vs Scala APIs and address them



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23483) Feature parity for Python vs Scala APIs

2018-09-11 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23483.
--
Resolution: Done

> Feature parity for Python vs Scala APIs
> ---
>
> Key: SPARK-23483
> URL: https://issues.apache.org/jira/browse/SPARK-23483
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Investigate the feature parity for Python vs Scala APIs and address them



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23483) Feature parity for Python vs Scala APIs

2018-09-11 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-23483:
--

> Feature parity for Python vs Scala APIs
> ---
>
> Key: SPARK-23483
> URL: https://issues.apache.org/jira/browse/SPARK-23483
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Investigate the feature parity for Python vs Scala APIs and address them



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23483) Feature parity for Python vs Scala APIs

2018-09-11 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23483.
--
Resolution: Fixed

Let me leave this resolved.

> Feature parity for Python vs Scala APIs
> ---
>
> Key: SPARK-23483
> URL: https://issues.apache.org/jira/browse/SPARK-23483
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Investigate the feature parity for Python vs Scala APIs and address them



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25238) Lint-Python: Upgrading to the current version of pycodestyle fails

2018-09-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611523#comment-16611523
 ] 

Apache Spark commented on SPARK-25238:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22400

> Lint-Python: Upgrading to the current version of pycodestyle fails
> --
>
> Key: SPARK-25238
> URL: https://issues.apache.org/jira/browse/SPARK-25238
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.1
> Environment: https://github.com/apache/spark
>Reporter: cclauss
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> See https://github.com/apache/spark/pull/22231
> 
> Running Python style checks
> 
> pycodestyle checks failed.
> ./python/pyspark/sql/context.py:484:13: W504 line break after binary operator
> ./python/pyspark/sql/dataframe.py:1906:30: W504 line break after binary 
> operator
> ./python/pyspark/sql/dataframe.py:2153:21: W504 line break after binary 
> operator
> ./python/pyspark/sql/types.py:1604:29: W504 line break after binary operator
> ./python/pyspark/sql/types.py:1657:29: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:232:16: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:233:16: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:1100:28: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:1101:28: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:1780:17: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:1781:17: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:3131:28: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:3132:28: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:3133:28: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:4145:46: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:4315:33: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:4328:33: W504 line break after binary operator
> ./python/pyspark/sql/readwriter.py:455:-6594: W605 invalid escape sequence 
> '\`'
> ./python/pyspark/sql/readwriter.py:916:-2652: W605 invalid escape sequence 
> '\`'
> ./python/pyspark/sql/functions.py:123:13: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:142:24: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:144:23: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:146:25: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:148:24: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:174:15: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:176:20: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:178:19: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:286:101: E501 line too long (104 > 100 
> characters)
> ./python/pyspark/sql/functions.py:349:101: E501 line too long (103 > 100 
> characters)
> ./python/pyspark/sql/functions.py:1458:16: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:1703:-361: W605 invalid escape sequence '\d'
> ./python/pyspark/sql/functions.py:1703:-355: W605 invalid escape sequence '\d'
> ./python/pyspark/sql/functions.py:1703:-205: W605 invalid escape sequence '\d'
> ./python/pyspark/sql/streaming.py:396:34: W504 line break after binary 
> operator
> ./python/pyspark/sql/streaming.py:667:-6306: W605 invalid escape sequence '\`'
> ./python/pyspark/ml/classification.py:249:23: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:250:23: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:255:20: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:370:34: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:371:34: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:406:34: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:407:34: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:412:34: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:876:22: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:877:22: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:1250:22: W504 line break after binary 
> operator
> 

[jira] [Commented] (SPARK-25238) Lint-Python: Upgrading to the current version of pycodestyle fails

2018-09-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611524#comment-16611524
 ] 

Apache Spark commented on SPARK-25238:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22400

> Lint-Python: Upgrading to the current version of pycodestyle fails
> --
>
> Key: SPARK-25238
> URL: https://issues.apache.org/jira/browse/SPARK-25238
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.1
> Environment: https://github.com/apache/spark
>Reporter: cclauss
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> See https://github.com/apache/spark/pull/22231
> 
> Running Python style checks
> 
> pycodestyle checks failed.
> ./python/pyspark/sql/context.py:484:13: W504 line break after binary operator
> ./python/pyspark/sql/dataframe.py:1906:30: W504 line break after binary 
> operator
> ./python/pyspark/sql/dataframe.py:2153:21: W504 line break after binary 
> operator
> ./python/pyspark/sql/types.py:1604:29: W504 line break after binary operator
> ./python/pyspark/sql/types.py:1657:29: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:232:16: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:233:16: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:1100:28: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:1101:28: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:1780:17: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:1781:17: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:3131:28: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:3132:28: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:3133:28: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:4145:46: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:4315:33: W504 line break after binary operator
> ./python/pyspark/sql/tests.py:4328:33: W504 line break after binary operator
> ./python/pyspark/sql/readwriter.py:455:-6594: W605 invalid escape sequence 
> '\`'
> ./python/pyspark/sql/readwriter.py:916:-2652: W605 invalid escape sequence 
> '\`'
> ./python/pyspark/sql/functions.py:123:13: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:142:24: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:144:23: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:146:25: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:148:24: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:174:15: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:176:20: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:178:19: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:286:101: E501 line too long (104 > 100 
> characters)
> ./python/pyspark/sql/functions.py:349:101: E501 line too long (103 > 100 
> characters)
> ./python/pyspark/sql/functions.py:1458:16: W504 line break after binary 
> operator
> ./python/pyspark/sql/functions.py:1703:-361: W605 invalid escape sequence '\d'
> ./python/pyspark/sql/functions.py:1703:-355: W605 invalid escape sequence '\d'
> ./python/pyspark/sql/functions.py:1703:-205: W605 invalid escape sequence '\d'
> ./python/pyspark/sql/streaming.py:396:34: W504 line break after binary 
> operator
> ./python/pyspark/sql/streaming.py:667:-6306: W605 invalid escape sequence '\`'
> ./python/pyspark/ml/classification.py:249:23: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:250:23: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:255:20: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:370:34: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:371:34: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:406:34: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:407:34: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:412:34: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:876:22: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:877:22: W504 line break after binary 
> operator
> ./python/pyspark/ml/classification.py:1250:22: W504 line break after binary 
> operator
> 

[jira] [Created] (SPARK-25411) Implement range partition in Spark

2018-09-11 Thread Wang, Gang (JIRA)
Wang, Gang created SPARK-25411:
--

 Summary: Implement range partition in Spark
 Key: SPARK-25411
 URL: https://issues.apache.org/jira/browse/SPARK-25411
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wang, Gang


In our PROD environment, there are some partitioned fact tables, which are all 
quite huge. To accelerate join execution, we need to make them bucketed as well. 
Then comes the problem: if the bucket number is large, there may be too many 
files (file count = bucket number * partition count), which may put pressure on 
HDFS. And if the bucket number is small, Spark will only launch an equal number 
of tasks to read/write it.

 

So, can we implement a new kind of partitioning that supports range values, just 
like range partitioning in Oracle/MySQL 
([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html])? 
Say, we could partition by a date column and make every two months a partition, 
or partition by an integer column and make every interval of 1 a partition.

 

Ideally, a feature like range partitioning should be implemented in Hive. 
However, it has always been hard to update the Hive version in a prod 
environment, and it is much more lightweight and flexible to implement it in 
Spark.
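
Not the proposed feature itself, but a sketch of a workaround that is possible 
with current Spark: derive a coarse range column from the partition key and 
partition the output by it, which caps the file count. The table and column 
names here are hypothetical, and the bucketing rule (monthly buckets of an 
integer yyyyMMdd key) is just an example.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, floor}

val spark = SparkSession.builder().appName("range-partition-workaround").getOrCreate()

// Hypothetical fact table with an integer date key like 20180911.
val fact = spark.table("fact_events")

// Derive a coarse "range" column (monthly buckets of yyyyMM) and partition the
// output files by it instead of by the raw key, to limit the file count.
fact
  .withColumn("date_range", floor(col("date_key") / 100))  // 20180911 -> 201809
  .write
  .partitionBy("date_range")
  .mode("overwrite")
  .saveAsTable("fact_events_ranged")
{code}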



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2018-09-11 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611502#comment-16611502
 ] 

Kazuaki Ishizaki commented on SPARK-20184:
--

Although I created another JIRA, 
https://issues.apache.org/jira/browse/SPARK-20479, there is no PR yet. Let me 
check the performance in the 2.4 branch.

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>Priority: Major
>
> The performance of the following SQL gets much worse in Spark 2.x with whole 
> stage codegen on, in contrast with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> The number of rows in aggtable is about 3500.
> whole stage codegen on (spark.sql.codegen.wholeStage = true): 40s
> whole stage codegen off (spark.sql.codegen.wholeStage = false): 6s
> After some analysis I think this is related to the huge Java method (a Java 
> method of thousands of lines) generated by codegen.
> And if I configure -XX:-DontCompileHugeMethods, the performance gets much 
> better (about 7s).
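
For anyone reproducing this, a small sketch of how to flip the knobs mentioned 
above from a Spark application: spark.sql.codegen.wholeStage is a runtime SQL 
conf, and the JVM flag is passed through executor Java options. It assumes the 
aggtable from the report already exists.

{code:scala}
import org.apache.spark.sql.SparkSession

// -XX:-DontCompileHugeMethods lets the JIT compile the huge generated methods.
val spark = SparkSession.builder()
  .appName("codegen-regression-check")
  .config("spark.executor.extraJavaOptions", "-XX:-DontCompileHugeMethods")
  .getOrCreate()

// Time the same aggregation with whole-stage codegen on and off.
Seq("true", "false").foreach { flag =>
  spark.conf.set("spark.sql.codegen.wholeStage", flag)
  spark.time {
    spark.sql("SELECT sum(COUNTER_57), DIM_1, DIM_2, DIM_3 FROM aggtable " +
      "GROUP BY DIM_1, DIM_2, DIM_3 LIMIT 100").collect()
  }
}
{code}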



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2018-09-11 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611494#comment-16611494
 ] 

Kazuaki Ishizaki commented on SPARK-16196:
--

I see. I will check this.

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Major
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25354) Parquet vectorized record reader has unneeded operation in several methods

2018-09-11 Thread SongYadong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SongYadong resolved SPARK-25354.

Resolution: Won't Fix

The benefit is not obvious. 

> Parquet vectorized record reader has unneeded operation in several methods
> --
>
> Key: SPARK-25354
> URL: https://issues.apache.org/jira/browse/SPARK-25354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: SongYadong
>Priority: Major
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> The VectorizedParquetRecordReader class has unneeded operations in the 
> nextKeyValue method and other functions called from it:
> 1. In the nextKeyValue() method, we call resultBatch() only to initialize a 
> columnar batch if it is not initialized, not to return the columnar batch, so 
> we can move the initBatch() operation to nextBatch().
> 2. In the nextBatch() method, we need not reset columnVectors every time. When 
> rowsReturned >= totalRowCount, the function returns and the reset cost is 
> wasted, so we can put "if (rowsReturned >= totalRowCount) return false;" before 
> the columnVectors reset for performance (see the sketch below).
> 3. In the nextBatch() method, we need not call checkEndOfRowGroup() every time. 
> When rowsReturned != totalCountLoadedSoFar, checkEndOfRowGroup does nothing but 
> return, so we can call checkEndOfRowGroup only when rowsReturned == 
> totalCountLoadedSoFar to reduce function calls.
> 4. In the checkEndOfRowGroup() function, we need not get the columns of 
> requestedSchema every time. We can get the columns only the first time and 
> save them for future use, for performance.
> According to analysis of a Spark application with the JMC tool, we found that 
> the Parquet vectorized record reader calls nextKeyValue() and its subsequent 
> functions very frequently, so performance gains from optimizing this process 
> are worth pursuing.
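
A minimal, simplified sketch of the reordering proposed in item 2 above; the 
fields and batch size are placeholders, not the actual Java class in Spark.

{code:scala}
// Simplified sketch of change (2): check the end-of-data condition before
// paying for the per-batch reset. Fields stand in for the reader's real state.
class BatchReader(totalRowCount: Long) {
  private var rowsReturned: Long = 0L
  private val columnVectors = new Array[Int](16)      // stand-in for real column vectors

  def nextBatch(): Boolean = {
    if (rowsReturned >= totalRowCount) return false   // early return: nothing to reset
    java.util.Arrays.fill(columnVectors, 0)           // reset only when a batch will be read
    rowsReturned += math.min(1024L, totalRowCount - rowsReturned)
    true
  }
}
{code}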



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25354) Parquet vectorized record reader has unneeded operation in several methods

2018-09-11 Thread SongYadong (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611432#comment-16611432
 ] 

SongYadong commented on SPARK-25354:


The benefit is not obvious, so the PR is closed.

> Parquet vectorized record reader has unneeded operation in several methods
> --
>
> Key: SPARK-25354
> URL: https://issues.apache.org/jira/browse/SPARK-25354
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: SongYadong
>Priority: Major
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> The VectorizedParquetRecordReader class has unneeded operations in the 
> nextKeyValue method and other functions called from it:
> 1. In the nextKeyValue() method, we call resultBatch() only to initialize a 
> columnar batch if it is not initialized, not to return the columnar batch, so 
> we can move the initBatch() operation to nextBatch().
> 2. In the nextBatch() method, we need not reset columnVectors every time. When 
> rowsReturned >= totalRowCount, the function returns and the reset cost is 
> wasted, so we can put "if (rowsReturned >= totalRowCount) return false;" before 
> the columnVectors reset for performance.
> 3. In the nextBatch() method, we need not call checkEndOfRowGroup() every time. 
> When rowsReturned != totalCountLoadedSoFar, checkEndOfRowGroup does nothing but 
> return, so we can call checkEndOfRowGroup only when rowsReturned == 
> totalCountLoadedSoFar to reduce function calls.
> 4. In the checkEndOfRowGroup() function, we need not get the columns of 
> requestedSchema every time. We can get the columns only the first time and 
> save them for future use, for performance.
> According to analysis of a Spark application with the JMC tool, we found that 
> the Parquet vectorized record reader calls nextKeyValue() and its subsequent 
> functions very frequently, so performance gains from optimizing this process 
> are worth pursuing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25410) Spark executor on YARN does not include memoryOverhead when starting an ExecutorRunnable

2018-09-11 Thread Anbang Hu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anbang Hu updated SPARK-25410:
--
Description: 
When deploying on YARN, only {{executorMemory}} is used to launch executors in 
[YarnAllocator.scala#L529|https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L529]:
{code}
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
updateInternalState()
  } catch {
{code}
However, resource capability requested for each executor is {{executorMemory + 
memoryOverhead}} in 
[YarnAllocator.scala#L142|https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L142]:

{code:scala}
  // Resource capability requested for each executors
  private[yarn] val resource = Resource.newInstance(executorMemory + 
memoryOverhead, executorCores)
{code}

This means that the amount of {{memoryOverhead}} will not be used in running 
the job, hence wasted.

Checking both k8s and Mesos, it looks like they both include overhead memory.
For k8s, in 
[ExecutorPodFactory.scala#L179|https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L179]:

{code}
val executorContainer = new ContainerBuilder()
  .withName("executor")
  .withImage(executorContainerImage)
  .withImagePullPolicy(imagePullPolicy)
  .withNewResources()
.addToRequests("memory", executorMemoryQuantity)
.addToLimits("memory", executorMemoryLimitQuantity)
.addToRequests("cpu", executorCpuQuantity)
.endResources()
  .addAllToEnv(executorEnv.asJava)
  .withPorts(requiredPorts.asJava)
  .addToArgs("executor")
  .build()
{code}

For Mesos, 
in[MesosSchedulerUtils.scala#L374|https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L374]:

{code}
  /**
   * Return the amount of memory to allocate to each executor, taking into 
account
   * container overheads.
   *
   * @param sc SparkContext to use to get `spark.mesos.executor.memoryOverhead` 
value
   * @return memory requirement as (0.1 * memoryOverhead) or 
MEMORY_OVERHEAD_MINIMUM
   * (whichever is larger)
   */
  def executorMemory(sc: SparkContext): Int = {
sc.conf.getInt("spark.mesos.executor.memoryOverhead",
  math.max(MEMORY_OVERHEAD_FRACTION * sc.executorMemory, 
MEMORY_OVERHEAD_MINIMUM).toInt) +
  sc.executorMemory
  }
{code}


  was:
When deploying on YARN, only `executorMemory` is used to launch executors: 
https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L529

However, resource capability requested for each executor is `executorMemory + 
memoryOverhead`: 
https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L142

This means that the amount of `memoryOverhead` will not be used in running the 
job, hence wasted.

 Checking both k8s and Mesos, it looks like they both include overhead memory: 
https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L179
https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L374






> Spark executor on YARN does not include memoryOverhead when starting an 
> ExecutorRunnable
> 
>
> Key: SPARK-25410
> URL: https://issues.apache.org/jira/browse/SPARK-25410
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.1
>Reporter: Anbang Hu
>Priority: Major
>
> When deploying on YARN, only {{executorMemory}} is used to launch executors 
> in 
> 

[jira] [Created] (SPARK-25410) Spark executor on YARN does not include memoryOverhead when starting an ExecutorRunnable

2018-09-11 Thread Anbang Hu (JIRA)
Anbang Hu created SPARK-25410:
-

 Summary: Spark executor on YARN does not include memoryOverhead 
when starting an ExecutorRunnable
 Key: SPARK-25410
 URL: https://issues.apache.org/jira/browse/SPARK-25410
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.3.1
Reporter: Anbang Hu


When deploying on YARN, only `executorMemory` is used to launch executors: 
https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L529

However, resource capability requested for each executor is `executorMemory + 
memoryOverhead`: 
https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L142

This means that the amount of `memoryOverhead` will not be used in running the 
job, hence wasted.

 Checking both k8s and Mesos, it looks like they both include overhead memory: 
https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L179
https://github.com/apache/spark/blob/18688d370399dcf92f4228db6c7e3cb186804c18/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L374







--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19489) Stable serialization format for external & native code integration

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19489:

Target Version/s:   (was: 2.4.0)

> Stable serialization format for external & native code integration
> --
>
> Key: SPARK-19489
> URL: https://issues.apache.org/jira/browse/SPARK-19489
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Priority: Major
> Fix For: 2.3.0
>
>
> As a Spark user, I want access to a (semi) stable serialization format that 
> is high performance so I can integrate Spark with my application written in 
> native code (C, C++, Rust, etc).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19489) Stable serialization format for external & native code integration

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19489.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Stable serialization format for external & native code integration
> --
>
> Key: SPARK-19489
> URL: https://issues.apache.org/jira/browse/SPARK-19489
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Priority: Major
> Fix For: 2.3.0
>
>
> As a Spark user, I want access to a (semi) stable serialization format that 
> is high performance so I can integrate Spark with my application written in 
> native code (C, C++, Rust, etc).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25409) Speed up Spark History at start if there are tens of thousands of applications.

2018-09-11 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611409#comment-16611409
 ] 

Yuming Wang commented on SPARK-25409:
-

Please create a pull request on GitHub: https://github.com/apache/spark/pulls

> Speed up Spark History at start if there are tens of thousands of 
> applications.
> ---
>
> Key: SPARK-25409
> URL: https://issues.apache.org/jira/browse/SPARK-25409
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rong Tang
>Priority: Major
> Attachments: SPARK-25409.0001.patch
>
>
> We have a Spark history server storing 7 days' worth of applications; it 
> usually has 10K to 20K attempts.
> We found that it can take hours at start-up to load/replay the logs in the 
> event-logs folder, so newly finished applications have to wait several hours 
> to be seen. So I made 2 improvements for it.
>  # As we run Spark on YARN, the on-going applications' information can also 
> be seen via the resource manager, so I introduce a flag 
> spark.history.fs.load.incomplete to say whether to load logs for incomplete 
> attempts or not.
>  # Incremental loading of applications. As I said, we have more than 10K 
> applications stored, and it can take hours to load all of them the first time, 
> so I introduced a config spark.history.fs.appRefreshNum to say how many 
> applications to load each time, so that the server gets a chance to check for 
> the latest updates.
> Here are the benchmarks I did. Our system has 1K incomplete applications (they 
> were not cleaned up for some reason, which is another issue I need to 
> investigate), and applications' log sizes can be gigabytes. 
>  
> Not load incomplete attempts.
> | |Load Count|Load incomplete APPs|All attempts number|Time Cost|Increase 
> with more attempts|
> |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|All|No|13K|31 minutes| yes|
>  
>  
> Limit each time how much to load.
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Increase 
> with more attempts|
> |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|3000|Yes|13K|42 minutes except last 1.6K
> (The last 1.6K attempts cost extremely long 2.5 hours)|NO|
>  
>  
> Limit each time how many to load, and not load incomplete jobs.
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst 
> Cost|Avg|Increase with more attempts|
> |1 ( current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes|
> |2|3000|NO|12K|17minutes
>  |10 minutes
> ( 41 minutes in total)|NO|
>  
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst 
> Cost|Avg|Increase with more attempts|
> |1 ( current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes|
> |2|3000|NO|18.5K|20minutes|18 minutes
> (2 hours 18 minutes in total)
>  |NO|
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25395) Replace Spark Optional class with Java Optional

2018-09-11 Thread Mario Molina (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mario Molina updated SPARK-25395:
-
Summary: Replace Spark Optional class with Java Optional  (was: Remove 
Spark Optional Java API)

> Replace Spark Optional class with Java Optional
> ---
>
> Key: SPARK-25395
> URL: https://issues.apache.org/jira/browse/SPARK-25395
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Mario Molina
>Priority: Minor
>
> Previous Spark versions didn't require Java 8, so an ``Optional`` Spark Java
> API had to be implemented to support optional values.
> Since Spark 2.4 requires Java 8, the ``Optional`` Spark Java API should be
> removed so that Spark uses the original Java ``Optional`` API.
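
A small Scala-side usage sketch (illustrative only; the method name and default
value below are made up) of {{java.util.Optional}}, the standard type the
Java-facing API would switch to:
{code:java}
import java.util.Optional

// Hypothetical API method that takes the standard Optional instead of Spark's own class.
def checkpointDir(maybeDir: Optional[String]): String =
  maybeDir.orElse("/tmp/checkpoints")

checkpointDir(Optional.of("/data/ckpt"))  // "/data/ckpt"
checkpointDir(Optional.empty[String]())   // "/tmp/checkpoints"
{code}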



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24572) "eager execution" for R shell, IDE

2018-09-11 Thread Weiqiang Zhuang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611357#comment-16611357
 ] 

Weiqiang Zhuang commented on SPARK-24572:
-

[~felixcheung] are you thinking of using the same config introduced by
SPARK-24215 and doing something like the snippet below (the relevant part was
highlighted in red in the original comment)?
{code:java}
setMethod("show", "SparkDataFrame",
  function(object) {
if (identical(sparkR.conf("spark.sql.repl.eagerEval.enabled", "false")[[1]], "true")) {
  return(showDF(object))
}
cols <- lapply(dtypes(object), function(l) {
  paste(l, collapse = ":")
})
s <- paste(cols, collapse = ", ")
cat(paste(class(object), "[", s, "]\n", sep = ""))
  })
{code}
 

> "eager execution" for R shell, IDE
> --
>
> Key: SPARK-24572
> URL: https://issues.apache.org/jira/browse/SPARK-24572
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Felix Cheung
>Priority: Major
>
> like python in SPARK-24215
> we could also have eager execution when SparkDataFrame is returned to the R 
> shell



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25399) Reusing execution threads from continuous processing for microbatch streaming can result in correctness issues

2018-09-11 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-25399.
---
   Resolution: Fixed
Fix Version/s: 2.4.0
   3.0.0

Issue resolved by pull request 22386
[https://github.com/apache/spark/pull/22386]

> Reusing execution threads from continuous processing for microbatch streaming 
> can result in correctness issues
> --
>
> Key: SPARK-25399
> URL: https://issues.apache.org/jira/browse/SPARK-25399
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Critical
>  Labels: correctness
> Fix For: 3.0.0, 2.4.0
>
>
> Continuous processing sets some thread-local variables that, when read by a 
> thread running a microbatch stream, may result in incorrect or no previous 
> state being read, producing wrong answers. This was caught by a job 
> running the StreamSuite tests, and only repros occasionally when the same 
> threads are used.
> The issue is in StateStoreRDD.compute - when we compute currentVersion, we 
> read from a thread local variable which is set by continuous processing 
> threads. If this value is set, we then think we're on the wrong state version.
> I imagine very few people, if any, would run into this bug, because you'd 
> have to use continuous processing and then microbatch processing in the same 
> cluster. However, it can result in silent correctness issues, and it would be 
> very difficult for someone to tell if they were impacted by this or not.
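
For readers unfamiliar with the mechanism, here is a generic Scala sketch
(hypothetical names, not Spark's internal classes) of how a pooled thread can
leak a thread-local value from a continuous-processing task into a later
micro-batch task running on the same thread:
{code:java}
// Hypothetical names; not Spark's internals.
object CurrentStateVersion {
  private val local = new ThreadLocal[Option[Long]] {
    override def initialValue(): Option[Long] = None
  }
  def set(v: Long): Unit = local.set(Some(v))
  def get: Option[Long] = local.get()
  def clear(): Unit = local.remove() // the kind of reset the fix relies on
}

// A continuous-processing task runs on pooled thread T and sets a version:
CurrentStateVersion.set(42L)

// Later, a micro-batch task reuses thread T; without clear() it reads the
// stale 42 and concludes it is on an unexpected state version.
assert(CurrentStateVersion.get.contains(42L))
CurrentStateVersion.clear()
{code}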



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25399) Reusing execution threads from continuous processing for microbatch streaming can result in correctness issues

2018-09-11 Thread Tathagata Das (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-25399:
-

Assignee: Mukul Murthy

> Reusing execution threads from continuous processing for microbatch streaming 
> can result in correctness issues
> --
>
> Key: SPARK-25399
> URL: https://issues.apache.org/jira/browse/SPARK-25399
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Critical
>  Labels: correctness
> Fix For: 2.4.0, 3.0.0
>
>
> Continuous processing sets some thread-local variables that, when read by a 
> thread running a microbatch stream, may result in incorrect or no previous 
> state being read, producing wrong answers. This was caught by a job 
> running the StreamSuite tests, and only repros occasionally when the same 
> threads are used.
> The issue is in StateStoreRDD.compute - when we compute currentVersion, we 
> read from a thread local variable which is set by continuous processing 
> threads. If this value is set, we then think we're on the wrong state version.
> I imagine very few people, if any, would run into this bug, because you'd 
> have to use continuous processing and then microbatch processing in the same 
> cluster. However, it can result in silent correctness issues, and it would be 
> very difficult for someone to tell if they were impacted by this or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-09-11 Thread Bruce Robbins (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611331#comment-16611331
 ] 

Bruce Robbins commented on SPARK-25164:
---

Thanks [~Tagar] for the feedback. I assume the 44% improvement was for a table 
with lots of file splits.

I have thoughts on this issue, but fetching 70k rows from a wide table that has 
itself only 70k rows should be somewhat fast. With a table like that, I can 
fetch 70k rows on my laptop in under a minute.

Fetching 70k rows from a table that has, say, 10 million rows can be pretty 
poky, even with this fix.

I have theories (or partially informed speculation) about why this is. Here is 
the gist:

Let's say
 * your projection includes every column of a wide table (i.e., {{select *}})
 * you are filtering away most rows from a large table (e.g., {{select * from 
table where id1 = 1}}, which would fetch, say, only 0.2% of the rows)
 * matching rows are sprinkled fairly evenly throughout the table

In this case, Spark ends up reading (potentially) every data page from the 
parquet file, and realizing each wide row in memory, just to pass the row to 
Spark's filter operator so it can be (most likely) discarded.

This is true even when Spark pushes down the filter to the parquet reader.

This is because the matching rows are sprinkled evenly throughout the table, so 
(potentially) every data page for column id1 has at least one entry where value 
= 1. When a page has even a single matching entry, Spark realizes all the rows 
associated with that page.

Realizing very wide rows in memory seems to be somewhat expensive, according to 
my profiling. I am not sure yet what part of realizing the rows in memory is 
expensive.

If the matching rows tend to be clumped together in one part of the table (say, 
the table is sorted on id1), most data pages will not contain matching rows. 
Spark can skip reading most data pages, and therefore avoid realizing most rows 
in memory. In that case, the query will be much faster: In a test I just ran, 
my query on the sorted table was 3 times faster. The filter push down code 
seems to be intended for cases like this.
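
A hedged illustration of that last point, assuming an active SparkSession
{{spark}} and the 6000-column test table from the description below (the output
path is made up for the example): rewriting the data sorted by the filter
column clusters the matching rows, so the Parquet min/max statistics let the
pushed-down filter skip most data pages.
{code:java}
import org.apache.spark.sql.functions.col

// Rewrite the wide test table sorted by the filter column.
spark.table("6000_1m_double")
  .sort("id1")
  .write
  .mode("overwrite")
  .parquet("/tmp/6000_1m_double_sorted_by_id1")

// The same selective filter now reads far fewer data pages and
// materializes far fewer wide rows.
spark.read.parquet("/tmp/6000_1m_double_sorted_by_id1")
  .filter(col("id1") === 1)
  .collect()
{code}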

{{select * from table limit }} is also slow, although not as slow as 
filtering the same number of records, but I have not looked into why.

> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
>     List<String[]> paths = this.getPaths(0);
>     List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size());
>     for (String[] path : paths) {
>       // TODO: optimize this
>       PrimitiveType primitiveType = getType(path).asPrimitiveType();
>       columns.add(new ColumnDescriptor(
>           path,
>           primitiveType,
>           getMaxRepetitionLevel(path),
>           getMaxDefinitionLevel(path)));
>     }
>     return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates 
> colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
> To verify that this is an issue, I created a 1 million record parquet table 
> with 6000 columns of type double and 67 files (so initializeInternal is 
> called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter 
> returns only a few thousand records. The query ran (on average) for 6.4 
> minutes.
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
> {{requestedSchema.getColumns()}}.
> With the column cache variable, the same query runs in 5 minutes. So with my 
> simple query, you save 22% of the time by not rebuilding the column list for 
> each column.
> You get additional savings with a paths cache variable, now saving 34% in 
> total on the above query.

[jira] [Updated] (SPARK-25409) Speed up Spark History at start if there are tens of thousands of applications.

2018-09-11 Thread Rong Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rong Tang updated SPARK-25409:
--
Attachment: SPARK-25409.0001.patch

> Speed up Spark History at start if there are tens of thousands of 
> applications.
> ---
>
> Key: SPARK-25409
> URL: https://issues.apache.org/jira/browse/SPARK-25409
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rong Tang
>Priority: Major
> Attachments: SPARK-25409.0001.patch
>
>
> We have a Spark history server storing 7 days' worth of applications; it
> usually has 10K to 20K attempts.
> We found that it can take hours at startup, loading/replaying the logs in the
> event-logs folder. Thus, newly finished applications have to wait several
> hours to be seen. So I made 2 improvements for it.
>  # As we run Spark on YARN, the on-going applications' information can also
> be seen via the resource manager, so I introduce a flag
> spark.history.fs.load.incomplete to control whether logs for incomplete
> attempts are loaded.
>  # Incremental loading of applications. As I said, we have more than 10K
> applications stored, and it can take hours to load all of them the first
> time, so I introduced a config spark.history.fs.appRefreshNum to say how many
> applications to load each time, after which it gets a chance to check the
> latest updates.
> Here are the benchmarks I did. Our system has 1K incomplete applications
> (they were not cleaned up for some reason; that is another issue I need to
> investigate), and application log sizes can be gigabytes.
>  
> Not loading incomplete attempts:
> | |Load Count|Load incomplete APPs|All attempts number|Time Cost|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|All|No|13K|31 minutes|Yes|
>  
> Limiting how much to load each time:
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
> |2|3000|Yes|13K|42 minutes except the last 1.6K (the last 1.6K attempts took an extremely long 2.5 hours)|No|
>  
> Limiting how many to load each time, and not loading incomplete jobs:
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Avg|Increase with more attempts|
> |1 (current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes|
> |2|3000|No|12K|17 minutes|10 minutes (41 minutes in total)|No|
>  
> | |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Avg|Increase with more attempts|
> |1 (current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes|
> |2|3000|No|18.5K|20 minutes|18 minutes (2 hours 18 minutes in total)|No|
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25409) Speed up Spark History at start if there are tens of thousands of applications.

2018-09-11 Thread Rong Tang (JIRA)
Rong Tang created SPARK-25409:
-

 Summary: Speed up Spark History at start if there are tens of 
thousands of applications.
 Key: SPARK-25409
 URL: https://issues.apache.org/jira/browse/SPARK-25409
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Rong Tang


We have a Spark history server storing 7 days' worth of applications; it
usually has 10K to 20K attempts.

We found that it can take hours at startup, loading/replaying the logs in the
event-logs folder. Thus, newly finished applications have to wait several hours
to be seen. So I made 2 improvements for it (a configuration sketch follows the
list below).
 # As we run Spark on YARN, the on-going applications' information can also be
seen via the resource manager, so I introduce a flag
spark.history.fs.load.incomplete to control whether logs for incomplete
attempts are loaded.
 # Incremental loading of applications. As I said, we have more than 10K
applications stored, and it can take hours to load all of them the first time,
so I introduced a config spark.history.fs.appRefreshNum to say how many
applications to load each time, after which it gets a chance to check the
latest updates.
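
A minimal sketch, assuming the two flag names proposed here (they are not part
of released Spark), of how a provider-side check could gate which attempts get
replayed in one scan:
{code:java}
import org.apache.spark.SparkConf

// Proposed flags from this ticket; the names come from this proposal, not from released Spark.
val conf = new SparkConf()
  .set("spark.history.fs.load.incomplete", "false") // skip replaying in-progress attempts
  .set("spark.history.fs.appRefreshNum", "3000")    // cap attempts replayed per scan

// Hypothetical helper: replay an attempt only if it is complete (or incomplete
// loading is enabled) and the per-scan cap has not been reached yet.
def shouldReplay(isComplete: Boolean, replayedThisScan: Int): Boolean = {
  val loadIncomplete = conf.getBoolean("spark.history.fs.load.incomplete", true)
  val batchSize = conf.getInt("spark.history.fs.appRefreshNum", Int.MaxValue)
  (isComplete || loadIncomplete) && replayedThisScan < batchSize
}
{code}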

Here are the benchmarks I did. Our system has 1K incomplete applications (they
were not cleaned up for some reason; that is another issue I need to
investigate), and application log sizes can be gigabytes.

Not loading incomplete attempts:
| |Load Count|Load incomplete APPs|All attempts number|Time Cost|Increase with more attempts|
|1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
|2|All|No|13K|31 minutes|Yes|

Limiting how much to load each time:
| |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Increase with more attempts|
|1 (current implementation)|All|Yes|13K|2 hours 14 minutes|Yes|
|2|3000|Yes|13K|42 minutes except the last 1.6K (the last 1.6K attempts took an extremely long 2.5 hours)|No|

Limiting how many to load each time, and not loading incomplete jobs:
| |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Avg|Increase with more attempts|
|1 (current implementation)|All|Yes|13K|2 hours 14 minutes| |Yes|
|2|3000|No|12K|17 minutes|10 minutes (41 minutes in total)|No|

| |Load Count|Load incomplete APPs|All attempts number|Worst Cost|Avg|Increase with more attempts|
|1 (current implementation)|All|Yes|20K|1 hour 52 minutes| |Yes|
|2|3000|No|18.5K|20 minutes|18 minutes (2 hours 18 minutes in total)|No|

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-11 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611244#comment-16611244
 ] 

Mihaly Toth commented on SPARK-25331:
-

I will try to make it idempotent then.
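
A generic Scala sketch of what making the commit idempotent could look like (a
hypothetical helper, not the actual {{ManifestFileCommitProtocol}} code):
re-committing a batch whose manifest already exists becomes a no-op, so
repeating {{addBatch}} after a driver failure cannot duplicate the output.
{code:java}
// Hypothetical helper, not Spark code.
def commitBatch(batchId: Long,
                manifestExists: Long => Boolean,
                writeManifest: Long => Unit): Unit = {
  if (manifestExists(batchId)) {
    // This batch was already committed by a previous attempt; do nothing.
  } else {
    writeManifest(batchId)
  }
}
{code}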

> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and then the resulting task sets 
> complete but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in duplicate writing of the data.
> In other words, if the driver fails after the executors start processing the 
> job, the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully
> # Verify the file output - according to the {{Sink.addBatch}} documentation, 
> the RDD should be written only once
> I have created a WIP PR with a unit test:
> https://github.com/apache/spark/pull/22331



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25408) Move to idiomatic Java 8

2018-09-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611227#comment-16611227
 ] 

Apache Spark commented on SPARK-25408:
--

User 'Fokko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22399

> Move to idiomatic Java 8
> 
>
> Key: SPARK-25408
> URL: https://issues.apache.org/jira/browse/SPARK-25408
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Fokko Driesprong
>Priority: Major
>
> Java 8 has some nice features such as try-with-resources and the Collections 
> library, which aren't used much in the Spark codebase. We might consider 
> using these.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25408) Move to idiomatic Java 8

2018-09-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611223#comment-16611223
 ] 

Apache Spark commented on SPARK-25408:
--

User 'Fokko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22399

> Move to idiomatic Java 8
> 
>
> Key: SPARK-25408
> URL: https://issues.apache.org/jira/browse/SPARK-25408
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Fokko Driesprong
>Priority: Major
>
> Java 8 has some nice features such as try-with-resources and the Collections 
> library, which aren't used much in the Spark codebase. We might consider 
> using these.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25408) Move to idiomatic Java 8

2018-09-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25408:


Assignee: Apache Spark

> Move to idiomatic Java 8
> 
>
> Key: SPARK-25408
> URL: https://issues.apache.org/jira/browse/SPARK-25408
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Fokko Driesprong
>Assignee: Apache Spark
>Priority: Major
>
> Java 8 has some nice features such as try-with-resources and the Collections 
> library, which aren't used much in the Spark codebase. We might consider 
> using these.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25408) Move to idiomatic Java 8

2018-09-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25408:


Assignee: (was: Apache Spark)

> Move to idiomatic Java 8
> 
>
> Key: SPARK-25408
> URL: https://issues.apache.org/jira/browse/SPARK-25408
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Fokko Driesprong
>Priority: Major
>
> Java 8 has some nice features such as try-with-resources and the Collections 
> library, which aren't used much in the Spark codebase. We might consider 
> using these.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25408) Move to idiomatic Java 8

2018-09-11 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-25408:


 Summary: Move to idiomatic Java 8
 Key: SPARK-25408
 URL: https://issues.apache.org/jira/browse/SPARK-25408
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Fokko Driesprong


Java 8 has some nice features such as try-with-resources and the Collections 
library, which aren't used much in the Spark codebase. We might consider 
using these.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23820) Allow the long form of call sites to be recorded in the log

2018-09-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611175#comment-16611175
 ] 

Apache Spark commented on SPARK-23820:
--

User 'michaelmior' has created a pull request for this issue:
https://github.com/apache/spark/pull/22398

> Allow the long form of call sites to be recorded in the log
> ---
>
> Key: SPARK-23820
> URL: https://issues.apache.org/jira/browse/SPARK-23820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michael Mior
>Assignee: Michael Mior
>Priority: Trivial
>
> It would be nice if the long form of the callsite information could be 
> included in the log. An example of what I'm proposing is here: 
> https://github.com/michaelmior/spark/commit/4b4076cfb1d51ceb20fd2b0a3b1b5be2aebb6416



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23820) Allow the long form of call sites to be recorded in the log

2018-09-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23820:


Assignee: Michael Mior  (was: Apache Spark)

> Allow the long form of call sites to be recorded in the log
> ---
>
> Key: SPARK-23820
> URL: https://issues.apache.org/jira/browse/SPARK-23820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michael Mior
>Assignee: Michael Mior
>Priority: Trivial
>
> It would be nice if the long form of the callsite information could be 
> included in the log. An example of what I'm proposing is here: 
> https://github.com/michaelmior/spark/commit/4b4076cfb1d51ceb20fd2b0a3b1b5be2aebb6416



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23820) Allow the long form of call sites to be recorded in the log

2018-09-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23820:


Assignee: Apache Spark  (was: Michael Mior)

> Allow the long form of call sites to be recorded in the log
> ---
>
> Key: SPARK-23820
> URL: https://issues.apache.org/jira/browse/SPARK-23820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michael Mior
>Assignee: Apache Spark
>Priority: Trivial
>
> It would be nice if the long form of the callsite information could be 
> included in the log. An example of what I'm proposing is here: 
> https://github.com/michaelmior/spark/commit/4b4076cfb1d51ceb20fd2b0a3b1b5be2aebb6416



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25170) Add Task Metrics description to the documentation

2018-09-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611159#comment-16611159
 ] 

Apache Spark commented on SPARK-25170:
--

User 'LucaCanali' has created a pull request for this issue:
https://github.com/apache/spark/pull/22397

> Add Task Metrics description to the documentation
> -
>
> Key: SPARK-25170
> URL: https://issues.apache.org/jira/browse/SPARK-25170
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Luca Canali
>Priority: Minor
>  Labels: documentation
>
> The REST API, as well as other instrumentation and tools based on the Spark 
> ListenerBus, exposes the values of the Executor Task Metrics, which are quite 
> useful for workload/performance troubleshooting. I propose to add the list of 
> the available Task Metrics, with a short description, to the monitoring 
> documentation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23820) Allow the long form of call sites to be recorded in the log

2018-09-11 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23820:
--
Fix Version/s: (was: 2.4.0)

> Allow the long form of call sites to be recorded in the log
> ---
>
> Key: SPARK-23820
> URL: https://issues.apache.org/jira/browse/SPARK-23820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michael Mior
>Assignee: Michael Mior
>Priority: Trivial
>
> It would be nice if the long form of the callsite information could be 
> included in the log. An example of what I'm proposing is here: 
> https://github.com/michaelmior/spark/commit/4b4076cfb1d51ceb20fd2b0a3b1b5be2aebb6416



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23820) Allow the long form of call sites to be recorded in the log

2018-09-11 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-23820:
---

Reopened pending some further discussion about the implementation

> Allow the long form of call sites to be recorded in the log
> ---
>
> Key: SPARK-23820
> URL: https://issues.apache.org/jira/browse/SPARK-23820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michael Mior
>Assignee: Michael Mior
>Priority: Trivial
>
> It would be nice if the long form of the callsite information could be 
> included in the log. An example of what I'm proposing is here: 
> https://github.com/michaelmior/spark/commit/4b4076cfb1d51ceb20fd2b0a3b1b5be2aebb6416



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25398) Minor bugs from comparing unrelated types

2018-09-11 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25398.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22384
[https://github.com/apache/spark/pull/22384]

> Minor bugs from comparing unrelated types
> -
>
> Key: SPARK-25398
> URL: https://issues.apache.org/jira/browse/SPARK-25398
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core, YARN
>Affects Versions: 2.3.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.4.0
>
>
> I noticed a potential issue from Scala inspections, like this clause in 
> LiveEntity.scala around line 586:
> {code:java}
>  (!acc.metadata.isDefined ||
>   acc.metadata.get != Some(AccumulatorContext.SQL_ACCUM_IDENTIFIER)){code}
> The issue is that acc.metadata is an Option[String], so it can't equal a 
> Some[String]. This is just meant to be:
> {code:java}
>  acc.metadata != Some(AccumulatorContext.SQL_ACCUM_IDENTIFIER){code}
> This may or may not actually cause a bug, but seems worth fixing. And then 
> there are a number of other ones like this, mostly in tests, that might 
> likewise mask real assertion problems.
> Many are, interestingly, flagging items like this on a Seq[String]:
> {code:java}
> .filter(_.getFoo.equals("foo")){code}
> It complains that Any => Any is compared to String. Either it's wrong or, 
> somehow, this is parsed as (_.getFoo).equals("foo"). In any event, it's easy 
> enough to write this more clearly as:
> {code:java}
> .filter(_.getFoo == "foo"){code}
> And so on.
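
A tiny demonstration of the first pitfall, runnable in the Scala REPL:
{code:java}
val metadata: Option[String] = Some("sql")

metadata.get != Some("sql") // true  -- a String is never equal to a Some[String]
metadata != Some("sql")     // false -- the comparison that was actually intended
{code}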



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator

2018-09-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15815:


Assignee: (was: Apache Spark)

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -
>
> Key: SPARK-15815
> URL: https://issues.apache.org/jira/browse/SPARK-15815
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> Enable executor blacklisting with a blacklist time larger than 120s, and 
> enable dynamic allocation with minExecutors = 0.
> 1. Assume there is only 1 task left running on executor A, and all other 
> executors have timed out.
> 2. The task failed, so it will not be scheduled on executor A again because 
> of the blacklist time.
> 3. In ExecutorAllocationManager, it always requests targetNumExecutors = 1; 
> since we already have executor A, oldTargetNumExecutors == targetNumExecutors 
> = 1, so it will never add more executors... even if executor A has timed out. 
> It ends up endlessly requesting delta = 0 executors.
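
A hedged sketch of the configuration combination described above (the blacklist
key is the 1.6-era setting referenced here; treat the exact names as
assumptions for your Spark version):
{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")
  // Blacklist a failed task's executor for longer than the 120s idle timeout.
  .set("spark.scheduler.executorTaskBlacklistTime", "180000") // ms, i.e. > 120s
{code}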



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator

2018-09-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15815:


Assignee: Apache Spark

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -
>
> Key: SPARK-15815
> URL: https://issues.apache.org/jira/browse/SPARK-15815
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: SuYan
>Assignee: Apache Spark
>Priority: Minor
>
> Enable executor blacklisting with a blacklist time larger than 120s, and 
> enable dynamic allocation with minExecutors = 0.
> 1. Assume there is only 1 task left running on executor A, and all other 
> executors have timed out.
> 2. The task failed, so it will not be scheduled on executor A again because 
> of the blacklist time.
> 3. In ExecutorAllocationManager, it always requests targetNumExecutors = 1; 
> since we already have executor A, oldTargetNumExecutors == targetNumExecutors 
> = 1, so it will never add more executors... even if executor A has timed out. 
> It ends up endlessly requesting delta = 0 executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator

2018-09-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611133#comment-16611133
 ] 

Apache Spark commented on SPARK-15815:
--

User 'dhruve' has created a pull request for this issue:
https://github.com/apache/spark/pull/22288

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -
>
> Key: SPARK-15815
> URL: https://issues.apache.org/jira/browse/SPARK-15815
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> Enable executor blacklisting with a blacklist time larger than 120s, and 
> enable dynamic allocation with minExecutors = 0.
> 1. Assume there is only 1 task left running on executor A, and all other 
> executors have timed out.
> 2. The task failed, so it will not be scheduled on executor A again because 
> of the blacklist time.
> 3. In ExecutorAllocationManager, it always requests targetNumExecutors = 1; 
> since we already have executor A, oldTargetNumExecutors == targetNumExecutors 
> = 1, so it will never add more executors... even if executor A has timed out. 
> It ends up endlessly requesting delta = 0 executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-09-11 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611089#comment-16611089
 ] 

Ruslan Dautkhanov edited comment on SPARK-25164 at 9/11/18 7:00 PM:


Thanks [~bersprockets]

Very good find! Thanks.

As described in SPARK-24316, "even simple queries of fetching 70k rows takes 20 
minutes". 

This PR-22188 gives 21-44% improvement, reducing total runtime to 11-16 minutes.

It seems *reading 70k rows for over 10 minutes* with multiple executors is 
still quite slow.

Do you think there might be another issue? So it seems the time complexity of 
reading parquet files is O(num_columns * num_parquet_files)?
Is there any way to optimize this further?

Thanks.

 


was (Author: tagar):
Thanks [~bersprockets]

Very good find ! Thanks.

As described in SPARK-24316, "even simple queries of fetching 70k rows takes 20 
minutes". 

This PR-22188 gives 21-44% improvement, reducing total runtime to 11-16 minutes.

It seems *saving 70k rows for over 10 minutes* with multiple executors is still 
quite slow. 

Do you think there might be other issue? So it seems time complexity of reading 
parquet files is O(num_columns * num_parquet_files)?
Is there is any way to optimize this further?

Thanks.

 

> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
>     List<String[]> paths = this.getPaths(0);
>     List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size());
>     for (String[] path : paths) {
>       // TODO: optimize this
>       PrimitiveType primitiveType = getType(path).asPrimitiveType();
>       columns.add(new ColumnDescriptor(
>           path,
>           primitiveType,
>           getMaxRepetitionLevel(path),
>           getMaxDefinitionLevel(path)));
>     }
>     return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates 
> colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
> To verify that this is an issue, I created a 1 million record parquet table 
> with 6000 columns of type double and 67 files (so initializeInternal is 
> called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter 
> returns only a few thousand records. The query ran (on average) for 6.4 
> minutes.
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
> {{requestedSchema.getColumns()}}.
> With the column cache variable, the same query runs in 5 minutes. So with my 
> simple query, you save 22% of the time by not rebuilding the column list for 
> each column.
> You get additional savings with a paths cache variable, now saving 34% in 
> total on the above query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-09-11 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611089#comment-16611089
 ] 

Ruslan Dautkhanov commented on SPARK-25164:
---

Thanks [~bersprockets]

Very good find! Thanks.

As described in SPARK-24316, "even simple queries of fetching 70k rows takes 20 
minutes". 

This PR-22188 gives 21-44% improvement, reducing total runtime to 11-16 minutes.

It seems *saving 70k rows for over 10 minutes* with multiple executors is still 
quite slow.

Do you think there might be another issue? So it seems the time complexity of 
reading parquet files is O(num_columns * num_parquet_files)?
Is there any way to optimize this further?

Thanks.

 

> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
>     List<String[]> paths = this.getPaths(0);
>     List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size());
>     for (String[] path : paths) {
>       // TODO: optimize this
>       PrimitiveType primitiveType = getType(path).asPrimitiveType();
>       columns.add(new ColumnDescriptor(
>           path,
>           primitiveType,
>           getMaxRepetitionLevel(path),
>           getMaxDefinitionLevel(path)));
>     }
>     return columns;
>   }
> {noformat}
> This means that for each parquet file, this routine indirectly iterates 
> colCount*colCount times.
> This is actually not particularly noticeable unless you have:
>  - many parquet files
>  - many columns
> To verify that this is an issue, I created a 1 million record parquet table 
> with 6000 columns of type double and 67 files (so initializeInternal is 
> called 67 times). I ran the following query:
> {noformat}
> sql("select * from 6000_1m_double where id1 = 1").collect
> {noformat}
> I used Spark from the master branch. I had 8 executor threads. The filter 
> returns only a few thousand records. The query ran (on average) for 6.4 
> minutes.
> Then I cached the column list at the top of {{initializeInternal}} as follows:
> {noformat}
> List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
> {noformat}
> Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
> {{requestedSchema.getColumns()}}.
> With the column cache variable, the same query runs in 5 minutes. So with my 
> simple query, you save 22% of the time by not rebuilding the column list for 
> each column.
> You get additional savings with a paths cache variable, now saving 34% in 
> total on the above query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence

2018-09-11 Thread John Bauer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611085#comment-16611085
 ] 

John Bauer commented on SPARK-21542:


You don't show your code for __init__ or setParams. I recall getting this 
error before using the @keyword_only decorator; for example, see 
https://stackoverflow.com/questions/32331848/create-a-custom-transformer-in-pyspark-ml

I will be trying to get my custom transformer pipeline to persist sometime next 
week, I hope. If I succeed, I will try to provide an example if no one else has.

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23425) load data for hdfs file path with wild card usage is not working properly

2018-09-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611079#comment-16611079
 ] 

Apache Spark commented on SPARK-23425:
--

User 'sujith71955' has created a pull request for this issue:
https://github.com/apache/spark/pull/22396

> load data for hdfs file path with wild card usage is not working properly
> -
>
> Key: SPARK-23425
> URL: https://issues.apache.org/jira/browse/SPARK-23425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Sujith
>Assignee: Sujith
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
> Attachments: wildcard_issue.PNG
>
>
> The load data command for loading data from non-local file paths using 
> wildcard strings like * is not working.
> e.g.:
> "load data inpath 'hdfs://hacluster/user/ext*' into table t1"
> Getting an AnalysisException while executing this query:
> !image-2018-02-14-23-41-39-923.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23425) load data for hdfs file path with wild card usage is not working properly

2018-09-11 Thread Sujith (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith updated SPARK-23425:
---
Docs Text: 
Release notes:

Wildcard symbols {{*}} and {{?}} can now be used in SQL paths when loading 
data, e.g.:

LOAD DATA INPATH 'hdfs://hacluster/user/ext*'
LOAD DATA INPATH 'hdfs://hacluster/user/???/data'

Where these characters are used literally in paths, they must be escaped with a 
backslash.

Wildcards can now also be used at the folder level of a local file system in 
the LOAD command.

 e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/'

From now on, the normal space convention can be used in folder/file names (e.g. 
file Name.csv); in older versions, spaces in folder/file names had to be 
represented using '%20' (e.g. myFile%20Name).

#

  was:
Release notes:

Wildcard symbols {{*}} and {{?}} can now be used in SQL paths when loading 
data, e.g.:

LOAD DATA INPATH 'hdfs://hacluster/user/ext*'
LOAD DATA INPATH 'hdfs://hacluster/user/???/data'

Where these characters are used literally in paths, they must be escaped with a 
backslash.

Wildcards can be used in the folder level of a local File system in Load 
command from now.

 e.g. LOAD DATA LOCAL INPATH 'tmp/folder*/


> load data for hdfs file path with wild card usage is not working properly
> -
>
> Key: SPARK-23425
> URL: https://issues.apache.org/jira/browse/SPARK-23425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Sujith
>Assignee: Sujith
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
> Attachments: wildcard_issue.PNG
>
>
> The load data command for loading data from non-local file paths using 
> wildcard strings like * is not working.
> e.g.:
> "load data inpath 'hdfs://hacluster/user/ext*' into table t1"
> Getting an AnalysisException while executing this query:
> !image-2018-02-14-23-41-39-923.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory

2018-09-11 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611012#comment-16611012
 ] 

Marcelo Vanzin commented on SPARK-25380:


Another bit of information that would be useful is the breakdown of memory 
usage of the {{SQLExecutionUIData}} instances. They seem to hold a lot more 
memory than just the plan graph structures do, it would be nice to know what 
exactly is holding on to that memory.

(Assuming you can't share the heap dump...)

> Generated plans occupy over 50% of Spark driver memory
> --
>
> Key: SPARK-25380
> URL: https://issues.apache.org/jira/browse/SPARK-25380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 (AWS emr-5.16.0)
>  
>Reporter: Michael Spector
>Priority: Minor
> Attachments: Screen Shot 2018-09-06 at 23.19.56.png, heapdump_OOM.png
>
>
> When debugging an OOM exception during a long run of a Spark application 
> (many iterations of the same code), I've found that generated plans occupy 
> most of the driver memory. I'm not sure whether this is a memory leak or not, 
> but it would be helpful if old plans could be purged from memory anyway.
> Attached are screenshots of the OOM heap dump opened in JVisualVM.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24889) dataset.unpersist() doesn't update storage memory stats

2018-09-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24889.

   Resolution: Fixed
Fix Version/s: 2.3.2
   2.4.0

Issue resolved by pull request 22341
[https://github.com/apache/spark/pull/22341]

> dataset.unpersist() doesn't update storage memory stats
> ---
>
> Key: SPARK-24889
> URL: https://issues.apache.org/jira/browse/SPARK-24889
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yuri Bogomolov
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0, 2.3.2
>
> Attachments: image-2018-07-23-10-53-58-474.png
>
>
> Steps to reproduce:
> 1) Start a Spark cluster, and check the storage memory value from the Spark 
> Web UI "Executors" tab (it should be equal to zero if you just started)
> 2) Run:
> {code:java}
> val df = spark.sqlContext.range(1, 10)
> df.cache()
> df.count()
> df.unpersist(true){code}
> 3) Check the storage memory value again, now it's equal to 1GB
>  
> Looks like the memory is actually released, but stats aren't updated. This 
> issue makes cluster management more complicated.
> !image-2018-07-23-10-53-58-474.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24889) dataset.unpersist() doesn't update storage memory stats

2018-09-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24889:
--

Assignee: Liang-Chi Hsieh

> dataset.unpersist() doesn't update storage memory stats
> ---
>
> Key: SPARK-24889
> URL: https://issues.apache.org/jira/browse/SPARK-24889
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Yuri Bogomolov
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
> Attachments: image-2018-07-23-10-53-58-474.png
>
>
> Steps to reproduce:
> 1) Start a Spark cluster, and check the storage memory value from the Spark 
> Web UI "Executors" tab (it should be equal to zero if you just started)
> 2) Run:
> {code:java}
> val df = spark.sqlContext.range(1, 10)
> df.cache()
> df.count()
> df.unpersist(true){code}
> 3) Check the storage memory value again, now it's equal to 1GB
>  
> Looks like the memory is actually released, but stats aren't updated. This 
> issue makes cluster management more complicated.
> !image-2018-07-23-10-53-58-474.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19903) Watermark metadata is lost when using resolved attributes

2018-09-11 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19903:
-
Target Version/s:   (was: 2.4.0)

> Watermark metadata is lost when using resolved attributes
> -
>
> Key: SPARK-19903
> URL: https://issues.apache.org/jira/browse/SPARK-19903
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: Ubuntu Linux
>Reporter: Piotr Nestorow
>Priority: Major
>
> PySpark example reads a Kafka stream. There is watermarking set when handling 
> the data window. The defined query uses output Append mode.
> The PySpark engine reports the error:
> 'Append output mode not supported when there are streaming aggregations on 
> streaming DataFrames/DataSets'
> The Python example:
> ---
> {code}
> import sys
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode, split, window
> if __name__ == "__main__":
> if len(sys.argv) != 4:
> print("""
> Usage: structured_kafka_wordcount.py  
>  
> """, file=sys.stderr)
> exit(-1)
> bootstrapServers = sys.argv[1]
> subscribeType = sys.argv[2]
> topics = sys.argv[3]
> spark = SparkSession\
> .builder\
> .appName("StructuredKafkaWordCount")\
> .getOrCreate()
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", bootstrapServers)\
> .option(subscribeType, topics)\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array 
> into multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> {code}
> The corresponding example in Zeppelin notebook:
> {code}
> %spark.pyspark
> from pyspark.sql.functions import explode, split, window
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", "localhost:9092")\
> .option("subscribe", "words")\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array into 
> multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> --
> Note that the Scala version of the same example in Zeppelin notebook works 
> fine:
> 
> import java.sql.Timestamp
> import org.apache.spark.sql.streaming.ProcessingTime
> import org.apache.spark.sql.functions._
> // Create DataSet representing the stream of input lines from kafka
> val lines = spark
> .readStream
> .format("kafka")
> .option("kafka.bootstrap.servers", "localhost:9092")
> .option("subscribe", "words")
> .load()
> // Split the lines into words, retaining timestamps
> val words = lines
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS 
> TIMESTAMP)")
> .as[(String, Timestamp)]
> .flatMap(line => line._1.split(" ").map(word => (word, line._2)))
> .toDF("word", "timestamp")
> // Group the data by 

[jira] [Commented] (SPARK-19903) Watermark metadata is lost when using resolved attributes

2018-09-11 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610931#comment-16610931
 ] 

Shixiong Zhu commented on SPARK-19903:
--

Yes. I removed the target version.

> Watermark metadata is lost when using resolved attributes
> -
>
> Key: SPARK-19903
> URL: https://issues.apache.org/jira/browse/SPARK-19903
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: Ubuntu Linux
>Reporter: Piotr Nestorow
>Priority: Major
>
> The PySpark example reads a Kafka stream. A watermark is set when handling 
> the data window, and the defined query uses Append output mode.
> The PySpark engine reports the error:
> 'Append output mode not supported when there are streaming aggregations on 
> streaming DataFrames/DataSets'
> The Python example:
> ---
> {code}
> import sys
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import explode, split, window
> if __name__ == "__main__":
> if len(sys.argv) != 4:
> print("""
> Usage: structured_kafka_wordcount.py  
>  
> """, file=sys.stderr)
> exit(-1)
> bootstrapServers = sys.argv[1]
> subscribeType = sys.argv[2]
> topics = sys.argv[3]
> spark = SparkSession\
> .builder\
> .appName("StructuredKafkaWordCount")\
> .getOrCreate()
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", bootstrapServers)\
> .option(subscribeType, topics)\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array 
> into multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> {code}
> The corresponding example in Zeppelin notebook:
> {code}
> %spark.pyspark
> from pyspark.sql.functions import explode, split, window
> # Create DataSet representing the stream of input lines from kafka
> lines = spark\
> .readStream\
> .format("kafka")\
> .option("kafka.bootstrap.servers", "localhost:9092")\
> .option("subscribe", "words")\
> .load()\
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
> # Split the lines into words, retaining timestamps
> # split() splits each line into an array, and explode() turns the array into 
> multiple rows
> words = lines.select(
> explode(split(lines.value, ' ')).alias('word'),
> lines.timestamp
> )
> # Group the data by window and word and compute the count of each group
> windowedCounts = words.withWatermark("timestamp", "30 seconds").groupBy(
> window(words.timestamp, "30 seconds", "30 seconds"), words.word
> ).count()
> # Start running the query that prints the running counts to the console
> query = windowedCounts\
> .writeStream\
> .outputMode('append')\
> .format('console')\
> .option("truncate", "false")\
> .start()
> query.awaitTermination()
> --
> Note that the Scala version of the same example in Zeppelin notebook works 
> fine:
> 
> import java.sql.Timestamp
> import org.apache.spark.sql.streaming.ProcessingTime
> import org.apache.spark.sql.functions._
> // Create DataSet representing the stream of input lines from kafka
> val lines = spark
> .readStream
> .format("kafka")
> .option("kafka.bootstrap.servers", "localhost:9092")
> .option("subscribe", "words")
> .load()
> // Split the lines into words, retaining timestamps
> val words = lines
> .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS 
> TIMESTAMP)")
> .as[(String, Timestamp)]
> .flatMap(line => line._1.split(" ").map(word => (word, line._2)))
> .toDF("word", 

[jira] [Assigned] (SPARK-25221) [DEPLOY] Consistent trailing whitespace treatment of conf values

2018-09-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25221:
--

Assignee: Gera Shegalov

> [DEPLOY] Consistent trailing whitespace treatment of conf values
> 
>
> Key: SPARK-25221
> URL: https://issues.apache.org/jira/browse/SPARK-25221
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.1
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
>Priority: Major
> Fix For: 2.4.0
>
>
> When you use a custom line delimiter 
> {{spark.hadoop.textinputformat.record.delimiter}} that has a leading or a 
> trailing whitespace character, it can only be set via {{--conf}}. Our 
> pipeline consists of highly customized generated jobs. Storing all the 
> config in a properties file is not only better for readability but also 
> necessary to avoid hitting {{ARGS_MAX}} on different OSes. Spark should 
> uniformly avoid trimming conf values in both cases.
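For reference, a minimal sketch of setting such a conf programmatically, where the value survives exactly as given (the delimiter value and app name below are made up for illustration; only the standard SparkSession builder API is assumed):

{code}
// Minimal sketch: a delimiter ending in a newline set directly on the builder is
// preserved verbatim, unlike a properties-file entry that may get trimmed.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DelimiterExample")
  .config("spark.hadoop.textinputformat.record.delimiter", "|\n")
  .getOrCreate()

// The custom-delimited file is then read through the Hadoop text input format,
// e.g. via spark.sparkContext.textFile(path).
{code}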



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25221) [DEPLOY] Consistent trailing whitespace treatment of conf values

2018-09-11 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25221.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22213
[https://github.com/apache/spark/pull/22213]

> [DEPLOY] Consistent trailing whitespace treatment of conf values
> 
>
> Key: SPARK-25221
> URL: https://issues.apache.org/jira/browse/SPARK-25221
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.1
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
>Priority: Major
> Fix For: 2.4.0
>
>
> When you use a custom line delimiter 
> {{spark.hadoop.textinputformat.record.delimiter}} that has a leading or a 
> trailing whitespace character, it can only be set via {{--conf}}. Our 
> pipeline consists of highly customized generated jobs. Storing all the 
> config in a properties file is not only better for readability but also 
> necessary to avoid hitting {{ARGS_MAX}} on different OSes. Spark should 
> uniformly avoid trimming conf values in both cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25407) Spark throws a `ParquetDecodingException` when attempting to read a field from a complex type in certain cases of schema merging

2018-09-11 Thread Michael Allman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610841#comment-16610841
 ] 

Michael Allman commented on SPARK-25407:


I have a code-complete patch for this bug, but I want to add some code comments 
before submitting it.

> Spark throws a `ParquetDecodingException` when attempting to read a field 
> from a complex type in certain cases of schema merging
> 
>
> Key: SPARK-25407
> URL: https://issues.apache.org/jira/browse/SPARK-25407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Michael Allman
>Priority: Major
>
> Spark supports merging schemata across table partitions in which one 
> partition is missing a subfield that's present in another. However, 
> attempting to select that missing field with a query that includes a 
> partition pruning predicate that filters out the partitions that include that 
> field results in a `ParquetDecodingException` when attempting to get the 
> query results.
> This bug is specifically exercised by the failing (but ignored) test case 
> https://github.com/apache/spark/blob/f2d35427eedeacceb6edb8a51974a7e8bbb94bc2/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L125-L131.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25407) Spark throws a `ParquetDecodingException` when attempting to read a field from a complex type in certain cases of schema merging

2018-09-11 Thread Michael Allman (JIRA)
Michael Allman created SPARK-25407:
--

 Summary: Spark throws a `ParquetDecodingException` when attempting 
to read a field from a complex type in certain cases of schema merging
 Key: SPARK-25407
 URL: https://issues.apache.org/jira/browse/SPARK-25407
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Michael Allman


Spark supports merging schemata across table partitions in which one partition 
is missing a subfield that's present in another. However, attempting to select 
that missing field with a query that includes a partition pruning predicate that 
filters out the partitions that include that field results in a 
`ParquetDecodingException` when attempting to get the query results.

This bug is specifically exercised by the failing (but ignored) test case 
https://github.com/apache/spark/blob/f2d35427eedeacceb6edb8a51974a7e8bbb94bc2/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala#L125-L131.
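A rough reproduction sketch intended to exercise the scenario described above (paths, schemas, and values are hypothetical and are not taken from the ignored test case):

{code}
// Hypothetical reproduction: partition p=1 has the nested subfield s.b,
// partition p=2 does not. Selecting s.b while pruning away p=1 is the case
// described above.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

Seq((1L, 1, 2)).toDF("id", "a", "b")
  .select($"id", struct($"a", $"b").as("s"))
  .write.parquet("/tmp/merge_demo/p=1")

Seq((2L, 3)).toDF("id", "a")
  .select($"id", struct($"a").as("s"))
  .write.parquet("/tmp/merge_demo/p=2")

spark.read.option("mergeSchema", "true").parquet("/tmp/merge_demo")
  .where($"p" === 2)      // prunes away the only partition that actually has s.b
  .select($"s.b")
  .show()
{code}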



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16323) Avoid unnecessary cast when doing integral divide

2018-09-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610837#comment-16610837
 ] 

Apache Spark commented on SPARK-16323:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22395

> Avoid unnecessary cast when doing integral divide
> -
>
> Key: SPARK-16323
> URL: https://issues.apache.org/jira/browse/SPARK-16323
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sean Zhong
>Priority: Minor
>
> This is a follow up of issue SPARK-15776
> *Problem:*
> For Integer divide operator div:
> {code}
> scala> spark.sql("select 6 div 3").explain(true)
> ...
> == Analyzed Logical Plan ==
> CAST((6 / 3) AS BIGINT): bigint
> Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / 
> 3) AS BIGINT)#5L]
> +- OneRowRelation$
> ...
> {code}
> For performance reasons, we should not do the unnecessary cast {{cast(xx as 
> double)}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields

2018-09-11 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25389:
-

Assignee: Dongjoon Hyun

> INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
> 
>
> Key: SPARK-25389
> URL: https://issues.apache.org/jira/browse/SPARK-25389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY 
> STORED AS` should not generate files with duplicate fields because Spark 
> cannot read those files.
> *INSERT OVERWRITE DIRECTORY USING*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet 
> SELECT 'id', 'id2' id")
> ... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) when 
> inserting into file:/tmp/parquet: `id`;
> {code}
> *INSERT OVERWRITE DIRECTORY STORED AS*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS 
> parquet SELECT 'id', 'id2' id")
> scala> spark.read.parquet("/tmp/parquet").show
> 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data 
> schema and the partition schema: `id`;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16323) Avoid unnecessary cast when doing integral divide

2018-09-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610836#comment-16610836
 ] 

Apache Spark commented on SPARK-16323:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22395

> Avoid unnecessary cast when doing integral divide
> -
>
> Key: SPARK-16323
> URL: https://issues.apache.org/jira/browse/SPARK-16323
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sean Zhong
>Priority: Minor
>
> This is a follow up of issue SPARK-15776
> *Problem:*
> For Integer divide operator div:
> {code}
> scala> spark.sql("select 6 div 3").explain(true)
> ...
> == Analyzed Logical Plan ==
> CAST((6 / 3) AS BIGINT): bigint
> Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / 
> 3) AS BIGINT)#5L]
> +- OneRowRelation$
> ...
> {code}
> For performance reasons, we should not do the unnecessary cast {{cast(xx as 
> double)}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25389) INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields

2018-09-11 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25389.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22378
[https://github.com/apache/spark/pull/22378]

> INSERT OVERWRITE DIRECTORY STORED AS should prevent duplicate fields
> 
>
> Key: SPARK-25389
> URL: https://issues.apache.org/jira/browse/SPARK-25389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> Like `INSERT OVERWRITE DIRECTORY USING` syntax, `INSERT OVERWRITE DIRECTORY 
> STORED AS` should not generate files with duplicate fields because Spark 
> cannot read those files.
> *INSERT OVERWRITE DIRECTORY USING*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' USING parquet 
> SELECT 'id', 'id2' id")
> ... ERROR InsertIntoDataSourceDirCommand: Failed to write to directory ...
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) when 
> inserting into file:/tmp/parquet: `id`;
> {code}
> *INSERT OVERWRITE DIRECTORY STORED AS*
> {code}
> scala> sql("INSERT OVERWRITE DIRECTORY 'file:///tmp/parquet' STORED AS 
> parquet SELECT 'id', 'id2' id")
> scala> spark.read.parquet("/tmp/parquet").show
> 18/09/09 22:09:57 WARN DataSource: Found duplicate column(s) in the data 
> schema and the partition schema: `id`;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25406) Incorrect usage of withSQLConf method in Parquet schema pruning test suite masks failing tests

2018-09-11 Thread Michael Allman (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Allman updated SPARK-25406:
---
Priority: Major  (was: Critical)

> Incorrect usage of withSQLConf method in Parquet schema pruning test suite 
> masks failing tests
> --
>
> Key: SPARK-25406
> URL: https://issues.apache.org/jira/browse/SPARK-25406
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Michael Allman
>Priority: Major
>
> In {{ParquetSchemaPruning.scala}}, we use the helper method {{withSQLConf}} 
> to set configuration settings within the scope of a test. However, the way we 
> use that method is incorrect, and as a result the desired configuration 
> settings are not propagated to the test cases. This masks some test case 
> failures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19489) Stable serialization format for external & native code integration

2018-09-11 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610810#comment-16610810
 ] 

Reynold Xin commented on SPARK-19489:
-

We can close this now.




> Stable serialization format for external & native code integration
> --
>
> Key: SPARK-19489
> URL: https://issues.apache.org/jira/browse/SPARK-19489
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Priority: Major
>
> As a Spark user, I want access to a (semi) stable serialization format that 
> is high performance so I can integrate Spark with my application written in 
> native code (C, C++, Rust, etc).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19489) Stable serialization format for external & native code integration

2018-09-11 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610798#comment-16610798
 ] 

Wes McKinney commented on SPARK-19489:
--

Since there's a native Rust library for Arrow in development, it would be 
interesting to see a POC for UDFs written in Rust at some point.

> Stable serialization format for external & native code integration
> --
>
> Key: SPARK-19489
> URL: https://issues.apache.org/jira/browse/SPARK-19489
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Priority: Major
>
> As a Spark user, I want access to a (semi) stable serialization format that 
> is high performance so I can integrate Spark with my application written in 
> native code (C, C++, Rust, etc).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25406) Incorrect usage of withSQLConf method in Parquet schema pruning test suite masks failing tests

2018-09-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25406:


Assignee: (was: Apache Spark)

> Incorrect usage of withSQLConf method in Parquet schema pruning test suite 
> masks failing tests
> --
>
> Key: SPARK-25406
> URL: https://issues.apache.org/jira/browse/SPARK-25406
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Michael Allman
>Priority: Critical
>
> In {{ParquetSchemaPruning.scala}}, we use the helper method {{withSQLConf}} 
> to set configuration settings within the scope of a test. However, the way we 
> use that method is incorrect, and as a result the desired configuration 
> settings are not propagated to the test cases. This masks some test case 
> failures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25406) Incorrect usage of withSQLConf method in Parquet schema pruning test suite masks failing tests

2018-09-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610788#comment-16610788
 ] 

Apache Spark commented on SPARK-25406:
--

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/22394

> Incorrect usage of withSQLConf method in Parquet schema pruning test suite 
> masks failing tests
> --
>
> Key: SPARK-25406
> URL: https://issues.apache.org/jira/browse/SPARK-25406
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Michael Allman
>Priority: Critical
>
> In {{ParquetSchemaPruning.scala}}, we use the helper method {{withSQLConf}} 
> to set configuration settings within the scope of a test. However, the way we 
> use that method is incorrect, and as a result the desired configuration 
> settings are not propagated to the test cases. This masks some test case 
> failures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25406) Incorrect usage of withSQLConf method in Parquet schema pruning test suite masks failing tests

2018-09-11 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25406:


Assignee: Apache Spark

> Incorrect usage of withSQLConf method in Parquet schema pruning test suite 
> masks failing tests
> --
>
> Key: SPARK-25406
> URL: https://issues.apache.org/jira/browse/SPARK-25406
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Michael Allman
>Assignee: Apache Spark
>Priority: Critical
>
> In {{ParquetSchemaPruning.scala}}, we use the helper method {{withSQLConf}} 
> to set configuration settings within the scope of a test. However, the way we 
> use that method is incorrect, and as a result the desired configuration 
> settings are not propagated to the test cases. This masks some test case 
> failures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25406) Incorrect usage of withSQLConf method in Parquet schema pruning test suite masks failing tests

2018-09-11 Thread Michael Allman (JIRA)
Michael Allman created SPARK-25406:
--

 Summary: Incorrect usage of withSQLConf method in Parquet schema 
pruning test suite masks failing tests
 Key: SPARK-25406
 URL: https://issues.apache.org/jira/browse/SPARK-25406
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Michael Allman


In {{ParquetSchemaPruning.scala}}, we use the helper method {{withSQLConf}} to 
set configuration settings within the scope of a test. However, the way we use 
that method is incorrect, and as a result the desired configuration settings 
are not propagated to the test cases. This masks some test case failures.
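A self-contained illustration of the registration-time versus run-time pitfall being described (the helper names below are simplified stand-ins, not Spark's actual test utilities):

{code}
// Sketch of the pitfall: a test framework only *registers* the body when
// test(...) is called and runs it later, so a config scope wrapped around the
// registration has already been undone by the time the body executes.
object ConfScopeDemo {
  private var conf = Map.empty[String, String]
  private var registered = List.empty[(String, () => Unit)]
  private val key = "pruning.enabled"

  def withConf(kv: (String, String))(body: => Unit): Unit = {
    val saved = conf
    conf = conf + kv
    try body finally conf = saved
  }

  def test(name: String)(body: => Unit): Unit =
    registered = (name, () => body) :: registered   // deferred execution

  def main(args: Array[String]): Unit = {
    // Incorrect pattern: the conf is set only while the test is being registered.
    withConf(key -> "true") {
      test("conf wrapped around registration") {
        println(s"$key at run time: ${conf.get(key)}")   // None
      }
    }
    // Correct pattern: set the conf inside the body so it is active when it runs.
    test("conf set inside the body") {
      withConf(key -> "true") {
        println(s"$key at run time: ${conf.get(key)}")   // Some(true)
      }
    }
    registered.reverse.foreach { case (_, run) => run() }
  }
}
{code}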



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7768) Make user-defined type (UDT) API public

2018-09-11 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610733#comment-16610733
 ] 

Wenchen Fan edited comment on SPARK-7768 at 9/11/18 2:47 PM:
-

I'm retargeting to 3.0. BTW, are there any concerns about making UDT public? cc 
[~josephkb] [~rxin] [~WeichenXu123]


was (Author: cloud_fan):
I'm retargeting to 3.0. BTW is there any concerns about making UDT public? cc 
[~josephkb] [~rxin]

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2018-09-11 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610733#comment-16610733
 ] 

Wenchen Fan commented on SPARK-7768:


I'm retargeting to 3.0. BTW, are there any concerns about making UDT public? cc 
[~josephkb] [~rxin]

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9576) DataFrame API improvement umbrella ticket (in Spark 2.x)

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-9576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-9576:
---
Target Version/s: 3.0.0  (was: 2.4.0)

> DataFrame API improvement umbrella ticket (in Spark 2.x)
> 
>
> Key: SPARK-9576
> URL: https://issues.apache.org/jira/browse/SPARK-9576
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7768) Make user-defined type (UDT) API public

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-7768:
---
Target Version/s: 3.0.0  (was: 2.4.0)

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8000) SQLContext.read.load() should be able to auto-detect input data

2018-09-11 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610731#comment-16610731
 ] 

Wenchen Fan commented on SPARK-8000:


do we still want to do it?

> SQLContext.read.load() should be able to auto-detect input data
> ---
>
> Key: SPARK-8000
> URL: https://issues.apache.org/jira/browse/SPARK-8000
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>
> If it is a parquet file, use parquet. If it is a JSON file, use JSON. If it 
> is an ORC file, use ORC. If it is a CSV file, use CSV.
> Maybe Spark SQL can also write an output metadata file to specify the schema 
> & data source that's used.
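A sketch of the dispatch a user effectively has to write by hand today (extension-based guessing; purely illustrative, not the proposed mechanism, and the function name is made up):

{code}
// Illustrative only: pick a reader based on the file extension. The proposal
// would fold this kind of detection (or a schema/metadata file) into
// spark.read.load() itself.
import org.apache.spark.sql.{DataFrame, SparkSession}

def loadAuto(spark: SparkSession, path: String): DataFrame =
  path.toLowerCase.split('.').lastOption match {
    case Some("json") => spark.read.json(path)
    case Some("csv")  => spark.read.option("header", "true").csv(path)
    case Some("orc")  => spark.read.orc(path)
    case _            => spark.read.parquet(path)
  }
{code}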



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2018-09-11 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610729#comment-16610729
 ] 

Wenchen Fan commented on SPARK-12978:
-

Sorry we missed this issue. I think this is now a very complicated problem and 
we need to write a design doc for it. Possible approaches:
1. Introduce a logical partial aggregate.
2. Treat partial aggregate as a physical optimization; we need to think about 
how it interacts with EnsureRequirements.

> Skip unnecessary final group-by when input data already clustered with 
> group-by keys
> 
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> This ticket targets the optimization to skip an unnecessary group-by 
> operation below;
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>  output=[col0#159,sum#200,sum#201,count#202L])
>+- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
> InMemoryRelation [col0#159,col1#160,col2#161], true, 1, 
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
> [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
> true, 1), ConvertToUnsafe, None
> {code}
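A small example of the situation this ticket targets (column names are illustrative): the input is already hash-partitioned by the grouping key, so the partial/final aggregate pair could in principle collapse into a single complete aggregate.

{code}
// Illustrative setup: data pre-clustered by the grouping key. Today the plan
// still contains a Partial and a Final aggregate; the optimization discussed
// here would skip the redundant step.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, sum}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.range(0, 1000)
  .selectExpr("id % 10 AS col0", "id AS col1", "id * 2 AS col2")

val clustered = df.repartition($"col0").cache()

clustered.groupBy("col0")
  .agg(sum("col1"), avg("col2"))
  .explain()
{code}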



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page

2018-09-11 Thread sandeep katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610728#comment-16610728
 ] 

sandeep katta commented on SPARK-25392:
---

Yes, the link is active. I will analyze more from a code perspective and check how 
to disable the link.

If possible I'll take the guidance of [~vanzin] to solve this :)

> [Spark Job History]Inconsistent behaviour for pool details in spark web UI 
> and history server page 
> ---
>
> Key: SPARK-25392
> URL: https://issues.apache.org/jira/browse/SPARK-25392
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: OS: SUSE 11
> Spark Version: 2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Steps:
> 1.Enable spark.scheduler.mode = FAIR
> 2.Submitted beeline jobs
> create database JH;
> use JH;
> create table one12( id int );
> insert into one12 values(12);
> insert into one12 values(13);
> Select * from one12;
> 3.Click on JDBC Incompleted Application ID in Job History Page
> 4. Go to Job Tab in staged Web UI page
> 5. Click on run at AccessController.java:0 under Description column
> 6. Click default under the Pool Name column of the Completed Stages table
> URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default
> 7. It throws below error
> HTTP ERROR 400
> Problem accessing /history/application_1536399199015_0006/stages/pool/. 
> Reason:
> Unknown pool: default
> Powered by Jetty:// x.y.z
> But the Yarn resource page displays the summary under Fair Scheduler Pool: default 
> URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default
> Summary
> ||Pool Name||Minimum Share||Pool Weight||Active Stages||Running Tasks||SchedulingMode||
> |default|0|1|0|0|FIFO|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13682) Finalize the public API for FileFormat

2018-09-11 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610723#comment-16610723
 ] 

Wenchen Fan commented on SPARK-13682:
-

I'm closing it. We will keep the `FileFormat` API private, and will migrate the 
file source to data source v2.

> Finalize the public API for FileFormat
> --
>
> Key: SPARK-13682
> URL: https://issues.apache.org/jira/browse/SPARK-13682
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Major
>
> The current file format interface needs to be cleaned up before it's 
> acceptable for public consumption:
>  - Have a version that takes Row and does a conversion, hide the internal API.
>  - Remove bucketing
>  - Remove RDD and the broadcastedConf
>  - Remove SQLContext (maybe include SparkSession?)
>  - Pass a better conf object



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13682) Finalize the public API for FileFormat

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-13682.
-
Resolution: Won't Fix

> Finalize the public API for FileFormat
> --
>
> Key: SPARK-13682
> URL: https://issues.apache.org/jira/browse/SPARK-13682
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Major
>
> The current file format interface needs to be cleaned up before it's 
> acceptable for public consumption:
>  - Have a version that takes Row and does a conversion, hide the internal API.
>  - Remove bucketing
>  - Remove RDD and the broadcastedConf
>  - Remove SQLContext (maybe include SparkSession?)
>  - Pass a better conf object



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-14098:

Target Version/s: 3.0.0  (was: 2.4.0)

> Generate Java code to build CachedColumnarBatch and get values from 
> CachedColumnarBatch when DataFrame.cache() is called
> 
>
> Key: SPARK-14098
> URL: https://issues.apache.org/jira/browse/SPARK-14098
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Priority: Major
>  Labels: releasenotes
>
> [Here|https://docs.google.com/document/d/1-2BnW5ibuHIeQzmHEGIGkEcuMUCTk87pmPis2DKRg-Q/edit?usp=sharing]
>  is a design document for this change (***TODO: Update the document***).
> This JIRA implements a new in-memory cache feature used by DataFrame.cache 
> and Dataset.cache. The following is the basic design based on discussions with 
> Sameer, Weichen, Xiao, Herman, and Nong.
> * Use ColumnarBatch with ColumnVector that are common data representations 
> for columnar storage
> * Use multiple compression schemes (such as RLE, IntDelta, and so on) for each 
> ColumnVector in ColumnarBatch depending on its data type
> * Generate code that is simple and specialized for each in-memory cache to 
> build an in-memory cache
> * Generate code that directly reads data from ColumnVector for the in-memory 
> cache by whole-stage codegen.
> * Enhance ColumnVector to keep UnsafeArrayData
> * Use primitive-type array for primitive uncompressed data type in 
> ColumnVector
> * Use byte[] for UnsafeArrayData and compressed data
> Based on this design, this JIRA generates two kinds of Java code for 
> DataFrame.cache()/Dataset.cache()
> * Generate Java code to build CachedColumnarBatch, which keeps data in 
> ColumnarBatch
> * Generate Java code to get a value of each column from ColumnarBatch
> ** a. Get a value directly from ColumnarBatch in code generated by whole-stage 
> codegen (primary path)
> ** b. Get a value through an iterator if whole-stage codegen is disabled (e.g. 
> the number of columns is more than 100), as a backup path



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-14922:

Target Version/s: 3.0.0  (was: 2.4.0)

> Alter Table Drop Partition Using Predicate-based Partition Spec
> ---
>
> Key: SPARK-14922
> URL: https://issues.apache.org/jira/browse/SPARK-14922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.2, 2.2.1
>Reporter: Xiao Li
>Priority: Major
>
> Below is allowed in Hive, but not allowed in Spark.
> {noformat}
> alter table ptestfilter drop partition (c='US', d<'2')
> {noformat}
> This example is copied from drop_partitions_filter.q
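A minimal sketch of the gap (the table and values are hypothetical, and the exact error surface may vary by version): an exact partition spec works in Spark, while the comparison-based spec from the Hive test above is rejected.

{code}
// Hypothetical table; only the last statement exercises the unsupported syntax.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.sql("CREATE TABLE ptest (v INT, c STRING, d STRING) USING parquet PARTITIONED BY (c, d)")
spark.sql("ALTER TABLE ptest DROP PARTITION (c='US', d='1')")  // exact spec: supported
spark.sql("ALTER TABLE ptest DROP PARTITION (c='US', d<'2')")  // predicate spec: rejected by Spark
{code}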



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15380) Generate code that stores a float/double value in each column from ColumnarBatch when DataFrame.cache() is used

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15380.
-
Resolution: Won't Fix

> Generate code that stores a float/double value in each column from 
> ColumnarBatch when DataFrame.cache() is used
> ---
>
> Key: SPARK-15380
> URL: https://issues.apache.org/jira/browse/SPARK-15380
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> When DataFrame.cache() is called, data will be stored as column-oriented 
> storage in CachedBatch. The current Catalyst generates a Java program to store 
> a computed value into an InternalRow, and then the value is stored into 
> CachedBatch even if the data is read from a ColumnarBatch by the Parquet reader. 
> This JIRA generates Java code to store a value into a ColumnarBatch, and 
> store data from the ColumnarBatch to the CachedBatch. This JIRA handles only 
> float and double types for a value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15420) Repartition and sort before Parquet writes

2018-09-11 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610719#comment-16610719
 ] 

Wenchen Fan commented on SPARK-15420:
-

Have we fixed it?

> Repartition and sort before Parquet writes
> --
>
> Key: SPARK-15420
> URL: https://issues.apache.org/jira/browse/SPARK-15420
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ryan Blue
>Priority: Major
>
> Parquet requires buffering data in memory before writing a group of rows 
> organized by column. This causes significant memory pressure when writing 
> partitioned output because each open file must buffer rows.
> Currently, Spark will sort data and spill if necessary in the 
> {{WriterContainer}} to avoid keeping many files open at once. But, this isn't 
> a full solution for a few reasons:
> * The final sort is always performed, even if incoming data is already sorted 
> correctly. For example, a global sort will cause two sorts to happen, even if 
> the global sort correctly prepares the data.
> * To prevent a large number of small output files, users must manually 
> add a repartition step. That step is also ignored by the sort within the 
> writer.
> * Hive does not currently support {{DataFrameWriter#sortBy}}
> The sort in {{WriterContainer}} makes sense to prevent problems, but should 
> detect if the incoming data is already sorted. The {{DataFrameWriter}} should 
> also expose the ability to repartition data before the write stage, and the 
> query planner should expose an option to automatically insert repartition 
> operations.
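A sketch of the manual workaround alluded to above (column and path names are hypothetical): repartition by the output partition column and sort within partitions before writing, so each task keeps only one Parquet file open at a time.

{code}
// Hypothetical columns/paths: cluster rows by the write-partition column and
// sort within partitions so the writer sees each partition's rows contiguously.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.range(0, 100000)
  .selectExpr("id AS key", "CAST(id % 30 AS STRING) AS dt", "id * 2 AS value")

df.repartition($"dt")
  .sortWithinPartitions($"dt", $"key")
  .write
  .partitionBy("dt")
  .parquet("/tmp/events_out")
{code}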



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15690:

Target Version/s:   (was: 2.4.0)

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>Priority: Major
>
> Spark's current shuffle implementation sorts all intermediate data by its 
> partition id, and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single node. When operating on a single node against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fallback to disk 
> if the data size does not fit in memory. Given the number of partitions is 
> usually small (say less than 256), it'd require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.
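A toy, single-process illustration of the single-pass idea mentioned above: with a small partition count, a counting/radix pass over (partitionId, record) pairs groups the data by partition without a comparison sort. The record representation here is made up; this is the algorithmic idea only, not Spark code.

{code}
// Toy counting sort by partition id.
def groupByPartition(records: Array[(Int, Array[Byte])],
                     numPartitions: Int): Array[(Int, Array[Byte])] = {
  // 1. Histogram of records per partition.
  val offsets = new Array[Int](numPartitions + 1)
  records.foreach { case (p, _) => offsets(p + 1) += 1 }
  // 2. Prefix sums give each partition's starting offset in the output.
  for (i <- 1 to numPartitions) offsets(i) += offsets(i - 1)
  // 3. A single scatter pass places every record in its partition's slot.
  val out = new Array[(Int, Array[Byte])](records.length)
  val cursor = offsets.clone()
  records.foreach { r => out(cursor(r._1)) = r; cursor(r._1) += 1 }
  out
}
{code}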



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2018-09-11 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610715#comment-16610715
 ] 

Wenchen Fan commented on SPARK-15690:
-

I'm removing the target version, since no one is working on it.

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>Priority: Major
>
> Spark's current shuffle implementation sorts all intermediate data by its 
> partition id, and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single node. When operating on a single node against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort 
> to do data shuffling on a single node, and still gracefully fallback to disk 
> if the data size does not fit in memory. Given the number of partitions is 
> usually small (say less than 256), it'd require only a single pass to do the 
> radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15693) Write schema definition out for file-based data sources to avoid schema inference

2018-09-11 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610713#comment-16610713
 ] 

Wenchen Fan commented on SPARK-15693:
-

Do we still want to do it?

> Write schema definition out for file-based data sources to avoid schema 
> inference
> -
>
> Key: SPARK-15693
> URL: https://issues.apache.org/jira/browse/SPARK-15693
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>
> Spark supports reading a variety of data formats, many of which don't have a 
> self-describing schema. For these file formats, Spark can often infer the 
> schema by going through all the data. However, schema inference is expensive 
> and does not always infer the intended schema (for example, with json data 
> Spark always infers integer types as long, rather than int).
> It would be great if Spark can write the schema definition out for file-based 
> formats, and when reading the data in, schema can be "inferred" directly by 
> reading the schema definition file without going through full schema 
> inference. If the file does not exist, then the good old schema inference 
> should be performed.
> This ticket certainly merits a design doc that should discuss the spec for 
> schema definition, as well as all the corner cases that this feature needs to 
> handle (e.g. schema merging, schema evolution, partitioning). It would be 
> great if the schema definition uses a human-readable format (e.g. JSON).
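A sketch of what a user can already do by hand today (file locations are hypothetical): persist the inferred schema as JSON once, then apply it on subsequent reads to skip inference. The ticket proposes folding this bookkeeping into Spark itself.

{code}
// Hypothetical paths; only the public StructType/DataFrameReader APIs are used.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// One-time: infer the schema and keep its JSON representation somewhere durable.
val inferred: StructType = spark.read.json("/data/events").schema
val schemaJson: String = inferred.json

// Later reads: apply the stored schema instead of re-inferring it.
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val df = spark.read.schema(schema).json("/data/events")
{code}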



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15691) Refactor and improve Hive support

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15691:

Target Version/s: 3.0.0  (was: 2.4.0)

> Refactor and improve Hive support
> -
>
> Key: SPARK-15691
> URL: https://issues.apache.org/jira/browse/SPARK-15691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>
> Hive support is important to Spark SQL, as many Spark users use it to read 
> from Hive. The current architecture is very difficult to maintain, and this 
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to accomplish are:
> - Move the Hive specific catalog logic into HiveExternalCatalog.
>   -- Remove HiveSessionCatalog. All Hive-related stuff should go into 
> HiveExternalCatalog. This would require moving caching either into 
> HiveExternalCatalog, or just into SessionCatalog.
>   -- Move using properties to store data source options into 
> HiveExternalCatalog (So, for a CatalogTable returned by HiveExternalCatalog, 
> we do not need to distinguish tables stored in hive formats and data source 
> tables).
>   -- Potentially more.
> - Remove Hive's specific ScriptTransform implementation and make it more 
> general so we can put it in sql/core.
> - Implement HiveTableScan (and write path) as a data source, so we don't need 
> a special planner rule for HiveTableScan.
> - Remove HiveSharedState and HiveSessionState.
> One thing that is still unclear to me is how to work with Hive UDF support. 
> We might still need a special planner rule there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15694) Implement ScriptTransformation in sql/core

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15694:

Target Version/s: 3.0.0  (was: 2.4.0)

> Implement ScriptTransformation in sql/core
> --
>
> Key: SPARK-15694
> URL: https://issues.apache.org/jira/browse/SPARK-15694
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>
> ScriptTransformation currently relies on Hive internals. It'd be great if we 
> can implement a native ScriptTransformation in sql/core module to remove the 
> extra Hive dependency here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16011) SQL metrics include duplicated attempts

2018-09-11 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610705#comment-16610705
 ] 

Wenchen Fan commented on SPARK-16011:
-

Since the behavior is intentional, I'm closing this ticket. We can create a new 
ticket if we want to support SQL metrics without duplicated attempts.

> SQL metrics include duplicated attempts
> ---
>
> Key: SPARK-16011
> URL: https://issues.apache.org/jira/browse/SPARK-16011
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Davies Liu
>Assignee: Wenchen Fan
>Priority: Major
>
> When I ran a simple scan-and-aggregate query, the number of rows in the scan 
> could differ from run to run. The actual scanned result is correct, but the 
> SQL metrics are wrong (they should not include duplicated attempts); this is a 
> regression since 1.6.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15867) Use bucket files for TABLESAMPLE BUCKET

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15867:

Target Version/s:   (was: 2.4.0)

> Use bucket files for TABLESAMPLE BUCKET
> ---
>
> Key: SPARK-15867
> URL: https://issues.apache.org/jira/browse/SPARK-15867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Andrew Or
>Priority: Major
>
> {code}
> SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16)
> {code}
> In Hive, this would select the 3rd bucket out of every 16 buckets there are 
> in the table. E.g. if the table was clustered by 32 buckets then this would 
> sample the 3rd and the 19th bucket. (See 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling)
> In Spark, however, we simply sample 3/16 of the number of input rows.
> We should either not support it in Spark at all or implement it in a way 
> that's consistent with Hive.
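A quick illustration of the Hive semantics described above (plain Scala arithmetic, no Spark involved): with the table clustered into 32 buckets, BUCKET 3 OUT OF 16 picks buckets 3 and 19.

{code}
// For a table clustered into numBuckets buckets, TABLESAMPLE (BUCKET x OUT OF y)
// in Hive selects every bucket whose 1-based id i satisfies (i - x) % y == 0.
val numBuckets = 32
val (x, y) = (3, 16)
val selected = (1 to numBuckets).filter(i => (i - x) % y == 0)
println(selected.mkString(", "))   // 3, 19
{code}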



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15867) Use bucket files for TABLESAMPLE BUCKET

2018-09-11 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610709#comment-16610709
 ] 

Wenchen Fan commented on SPARK-15867:
-

I'm removing the target version, since no one is working on it.

> Use bucket files for TABLESAMPLE BUCKET
> ---
>
> Key: SPARK-15867
> URL: https://issues.apache.org/jira/browse/SPARK-15867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Andrew Or
>Priority: Major
>
> {code}
> SELECT * FROM boxes TABLESAMPLE (BUCKET 3 OUT OF 16)
> {code}
> In Hive, this would select the 3rd bucket out of every 16 buckets there are 
> in the table. E.g. if the table was clustered by 32 buckets then this would 
> sample the 3rd and the 19th bucket. (See 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling)
> In Spark, however, we simply sample 3/16 of the number of input rows.
> We should either not support it in Spark at all or implement it in a way 
> that's consistent with Hive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16323) Avoid unnecessary cast when doing integral divide

2018-09-11 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610703#comment-16610703
 ] 

Marco Gaido commented on SPARK-16323:
-

Sure [~cloud_fan], will do. Thanks.

> Avoid unnecessary cast when doing integral divide
> -
>
> Key: SPARK-16323
> URL: https://issues.apache.org/jira/browse/SPARK-16323
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sean Zhong
>Priority: Minor
>
> This is a follow up of issue SPARK-15776
> *Problem:*
> For Integer divide operator div:
> {code}
> scala> spark.sql("select 6 div 3").explain(true)
> ...
> == Analyzed Logical Plan ==
> CAST((6 / 3) AS BIGINT): bigint
> Project [cast((cast(6 as double) / cast(3 as double)) as bigint) AS CAST((6 / 
> 3) AS BIGINT)#5L]
> +- OneRowRelation$
> ...
> {code}
> For performance reasons, we should not do the unnecessary cast {{cast(xx as 
> double)}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16011) SQL metrics include duplicated attempts

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16011.
-
Resolution: Won't Fix

> SQL metrics include duplicated attempts
> ---
>
> Key: SPARK-16011
> URL: https://issues.apache.org/jira/browse/SPARK-16011
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Davies Liu
>Assignee: Wenchen Fan
>Priority: Major
>
> When I ran a simple scan-and-aggregate query, the number of rows in the scan 
> could differ from run to run. The actual scanned result is correct, but the 
> SQL metrics are wrong (they should not include duplicated attempts); this is a 
> regression since 1.6.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16011) SQL metrics include duplicated attempts

2018-09-11 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-16011:

Target Version/s:   (was: 2.4.0)

> SQL metrics include duplicated attempts
> ---
>
> Key: SPARK-16011
> URL: https://issues.apache.org/jira/browse/SPARK-16011
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Davies Liu
>Assignee: Wenchen Fan
>Priority: Major
>
> When I ran a simple scan-and-aggregate query, the number of rows in the scan 
> could differ from run to run. The actual scanned result is correct, but the 
> SQL metrics are wrong (they should not include duplicated attempts); this is a 
> regression since 1.6.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


