[jira] [Commented] (SPARK-20780) Spark Kafka10 Consumer Hangs
[ https://issues.apache.org/jira/browse/SPARK-20780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013633#comment-16013633 ] Sean Owen commented on SPARK-20780: --- This doesn't sound like a Spark problem though. You're timing out in reading from Kafka? > Spark Kafka10 Consumer Hangs > > > Key: SPARK-20780 > URL: https://issues.apache.org/jira/browse/SPARK-20780 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 > Spark Streaming Kafka 010 > Yarn - Cluster Mode > CDH 5.8.4 > CentOS Linux release 7.2 >Reporter: jayadeepj > Attachments: streaming_1.png, streaming_2.png, tasks_timing_out_3.png > > > We have recently upgraded our Streaming App with Direct Stream to Spark 2 > (spark-streaming-kafka-0-10 - 2.1.0) with Kafka version (0.10.0.0) & Consumer > 10 . We find abnormal delays after the application has run for a couple of > hours or completed consumption of approx. ~ 5 million records. > See screenshot 1 & 2 > There is a sudden dip in the processing time from ~15 seconds (usual for this > app) to ~3 minutes & from then on the processing time keeps degrading > throughout. > We have seen that the delay is due to certain tasks taking the exact time > duration of the configured Kafka Consumer 'request.timeout.ms' . We have > tested this by varying timeout property to different values. > See screenshot 3. > I think the get(offset: Long, timeout: Long): ConsumerRecord[K, V] method & > subsequent poll(timeout) method in CachedKafkaConsumer.scala is actually > timing out on some of the partitions without reading data. But the executor > logs it as successfully completed after the exact timeout duration. Note that > most other tasks are completing successfully with millisecond duration. The > timeout is most likely from the > org.apache.kafka.clients.consumer.KafkaConsumer & we did not observe any > network latency difference. 
> We have observed this across multiple clusters & multiple apps with & without > TLS/SSL. Spark 1.6 with 0-8 consumer seems to be fine with consistent > performance > 17/05/17 10:30:06 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446288 > 17/05/17 10:30:06 INFO executor.Executor: Running task 11.0 in stage 5663.0 > (TID 446288) > 17/05/17 10:30:06 INFO kafka010.KafkaRDD: Computing topic XX-XXX-XX, > partition 0 offsets 776843 -> 779591 > 17/05/17 10:30:06 INFO kafka010.CachedKafkaConsumer: Initial fetch for > spark-executor-default1 XX-XXX-XX 0 776843 > 17/05/17 10:30:56 INFO executor.Executor: Finished task 11.0 in stage 5663.0 > (TID 446288). 1699 bytes result sent to driver > 17/05/17 10:30:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446329 > 17/05/17 10:30:56 INFO executor.Executor: Running task 0.0 in stage 5667.0 > (TID 446329) > 17/05/17 10:30:56 INFO spark.MapOutputTrackerWorker: Updating epoch to 3116 > and clearing cache > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 6807 > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807_piece0 stored > as bytes in memory (estimated size 13.1 KB, free 4.1 GB) > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 6807 took 4 ms > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807 stored as > values in m > We can see that the log statement differ with the exact timeout duration. > Our consumer config is below. 
> 17/05/17 12:33:13 INFO dstream.ForEachDStream: Initialized and validated > org.apache.spark.streaming.dstream.ForEachDStream@1171dde4 > 17/05/17 12:33:13 INFO consumer.ConsumerConfig: ConsumerConfig values: > metric.reporters = [] > metadata.max.age.ms = 30 > partition.assignment.strategy = > [org.apache.kafka.clients.consumer.RangeAssignor] > reconnect.backoff.ms = 50 > sasl.kerberos.ticket.renew.window.factor = 0.8 > max.partition.fetch.bytes = 1048576 > bootstrap.servers = [x.xxx.xxx:9092] > ssl.keystore.type = JKS > enable.auto.commit = true > sasl.mechanism = GSSAPI > interceptor.classes = null > exclude.internal.topics = true > ssl.truststore.password = null > client.id = > ssl.endpoint.identification.algorithm = null > max.poll.records = 2147483647 > check.crcs = true > request.timeout.ms = 5 > heartbeat.interval.ms = 3000 > auto.commit.interval.ms = 5000 > receive.buffer.bytes = 65536 > ssl.truststore.type = JKS > ssl.truststore.location = null > ssl.keystore.password = null > fetch.min.bytes = 1 > send.buffer.bytes = 131072 > value.deserializer = class > org.apache.kafka.common.serialization.ByteArrayDeserializer > group.i
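The failure mode described in SPARK-20780 (tasks "completing" after exactly `request.timeout.ms`) matches the shape of a deadline-bounded poll loop. The sketch below is illustrative only, not Spark's `CachedKafkaConsumer` code: a loop that polls until data arrives or the deadline elapses will silently exit after roughly the full timeout when the broker returns no records, which would make a task look successful after exactly the configured duration.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of deadline-bounded polling. `poller` stands in for a
// KafkaConsumer.poll(timeout) call that returns a possibly-empty batch.
public class DeadlinePoll {
    public static <T> List<T> pollUntilData(Supplier<List<T>> poller, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            List<T> batch = poller.get();
            if (!batch.isEmpty()) {
                return batch; // data arrived before the deadline
            }
        }
        // Timed out without reading data: the caller observes a task that
        // "finished" after ~timeoutMs, exactly the symptom in the report.
        return Collections.emptyList();
    }
}
```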
[jira] [Commented] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013624#comment-16013624 ] Sean Owen commented on SPARK-20779: --- You don't need a JIRA for this. There's not even a 'right' place as these headers are optional, although recommended practice. It's OK to make things consistent but it's trivial. > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Priority: Trivial > > when i test some examples, i found the license is not at the top in some > files. and it will be best if we update these places of the ASF header to be > consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20772) Add support for query parameters in redirects on Yarn
[ https://issues.apache.org/jira/browse/SPARK-20772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjorn Jonsson closed SPARK-20772. - Resolution: Not A Problem > Add support for query parameters in redirects on Yarn > - > > Key: SPARK-20772 > URL: https://issues.apache.org/jira/browse/SPARK-20772 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.1.0 >Reporter: Bjorn Jonsson >Priority: Minor > > Spark uses rewrites of query parameters to paths > (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). > This works fine in local or standalone mode, but does not work on Yarn (with > the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where > the query parameter is dropped. > The repro steps are: > - Start up the spark-shell and run a job > - Try to access the job details through http://:4040/jobs/job?id=0 > - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) > Going directly through the RM proxy works (does not cause query parameters to > be dropped).
[jira] [Commented] (SPARK-20772) Add support for query parameters in redirects on Yarn
[ https://issues.apache.org/jira/browse/SPARK-20772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013618#comment-16013618 ] Wilfred Spiegelenburg commented on SPARK-20772: --- correct, opening up a yarn issue and providing a fix there: YARN-6615 Please close this > Add support for query parameters in redirects on Yarn > - > > Key: SPARK-20772 > URL: https://issues.apache.org/jira/browse/SPARK-20772 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.1.0 >Reporter: Bjorn Jonsson >Priority: Minor > > Spark uses rewrites of query parameters to paths > (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). > This works fine in local or standalone mode, but does not work on Yarn (with > the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where > the query parameter is dropped. > The repro steps are: > - Start up the spark-shell and run a job > - Try to access the job details through http://:4040/jobs/job?id=0 > - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) > Going directly through the RM proxy works (does not cause query parameters to > be dropped).
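The rewrite the issue describes (`/jobs/job?id=0` redirected to `/jobs/job/?id=0`) only works if the query string is carried over to the new location. A minimal, hypothetical sketch of that trailing-slash rewrite in plain Java (this is not the YARN `AmIpFilter` code, which is where the parameter is actually dropped):

```java
import java.net.URI;

// Illustrative trailing-slash redirect target builder: the query string from
// the original request must be appended to the rewritten path, otherwise the
// UI fails with "requirement failed: missing id parameter".
public class RedirectRewrite {
    public static String addTrailingSlash(String url) {
        URI u = URI.create(url);
        String path = u.getPath().endsWith("/") ? u.getPath() : u.getPath() + "/";
        String query = u.getQuery(); // null when no query string is present
        return u.getScheme() + "://" + u.getAuthority() + path
                + (query == null ? "" : "?" + query);
    }
}
```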
[jira] [Commented] (SPARK-20778) Implement array_intersect function
[ https://issues.apache.org/jira/browse/SPARK-20778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013617#comment-16013617 ] Sean Owen commented on SPARK-20778: --- This should just be a UDF. What's the argument that it needs to be a built-in? > Implement array_intersect function > -- > > Key: SPARK-20778 > URL: https://issues.apache.org/jira/browse/SPARK-20778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Vandenberg >Priority: Minor > > Implement an array_intersect function that takes array arguments and returns > an array containing all elements of the first array that is common with the > remaining arrays.
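The proposed semantics (elements of the first array that appear in all remaining arrays) are simple enough to express as a user-defined function rather than a built-in, which is the point of the comment above. A plain-JVM sketch of the core logic (illustrative; registering it as a Spark UDF would wrap this in the usual UDF machinery):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Sketch of array_intersect semantics: keep each element of the first array
// that also occurs in every remaining array, preserving the first array's
// order. Whether duplicates in the first array are kept is a design choice;
// this version keeps them.
public class ArrayIntersect {
    @SafeVarargs
    public static <T> List<T> intersect(List<T> first, List<T>... rest) {
        List<T> result = new ArrayList<>(first);
        for (List<T> other : rest) {
            result.retainAll(new HashSet<>(other)); // drop elements absent from `other`
        }
        return result;
    }
}
```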
[jira] [Assigned] (SPARK-20781) the location of Dockerfile in docker.properties.template is wrong
[ https://issues.apache.org/jira/browse/SPARK-20781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20781: Assignee: Apache Spark > the location of Dockerfile in docker.properties.template is wrong > - > > Key: SPARK-20781 > URL: https://issues.apache.org/jira/browse/SPARK-20781 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Assignee: Apache Spark > > the location of Dockerfile in docker.properties.template should be > "../external/docker/spark-mesos/Dockerfile"
[jira] [Assigned] (SPARK-20781) the location of Dockerfile in docker.properties.template is wrong
[ https://issues.apache.org/jira/browse/SPARK-20781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20781: Assignee: (was: Apache Spark) > the location of Dockerfile in docker.properties.template is wrong > - > > Key: SPARK-20781 > URL: https://issues.apache.org/jira/browse/SPARK-20781 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.1 >Reporter: liuzhaokun > > the location of Dockerfile in docker.properties.template should be > "../external/docker/spark-mesos/Dockerfile"
[jira] [Commented] (SPARK-20781) the location of Dockerfile in docker.properties.template is wrong
[ https://issues.apache.org/jira/browse/SPARK-20781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013563#comment-16013563 ] Apache Spark commented on SPARK-20781: -- User 'liu-zhaokun' has created a pull request for this issue: https://github.com/apache/spark/pull/18013 > the location of Dockerfile in docker.properties.template is wrong > - > > Key: SPARK-20781 > URL: https://issues.apache.org/jira/browse/SPARK-20781 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.1 >Reporter: liuzhaokun > > the location of Dockerfile in docker.properties.template should be > "../external/docker/spark-mesos/Dockerfile"
[jira] [Created] (SPARK-20781) the location of Dockerfile in docker.properties.template is wrong
liuzhaokun created SPARK-20781: -- Summary: the location of Dockerfile in docker.properties.template is wrong Key: SPARK-20781 URL: https://issues.apache.org/jira/browse/SPARK-20781 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 2.1.1 Reporter: liuzhaokun the location of Dockerfile in docker.properties.template should be "../external/docker/spark-mesos/Dockerfile"
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013540#comment-16013540 ] Reynold Xin commented on SPARK-12297: - I don't think the CSV example you gave make sense. It is still interpreted timestamp with timezone. Just specify a timezone in the string and Spark will use that timezone. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. 
Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same 
times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala.
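The shift in the quoted spark-shell output (the "America/Los_Angeles" value `2015-12-31 23:50:59.123` reading back as `2016-01-01 02:50:59.123` in "America/New_York") is exactly what instant-style semantics produce. The two semantics can be reproduced with plain `java.time`, independent of Spark or Parquet; this sketch is illustrative, not how int96 decoding is implemented:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Two readings of the same timestamp text. Instant-style (Parquet int96 as
// Spark/Impala treat it): the digits are interpreted in the writer's zone and
// re-rendered in the reader's zone, so they shift. Floating-style (textfile,
// json): the digits *are* the value, so every reader sees the same string.
public class TimestampSemantics {
    static final DateTimeFormatter F =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String instantView(String ts, ZoneId writer, ZoneId reader) {
        Instant i = LocalDateTime.parse(ts, F).atZone(writer).toInstant();
        return i.atZone(reader).toLocalDateTime().format(F);
    }

    public static String floatingView(String ts) {
        return ts; // reader's timezone is irrelevant
    }
}
```

The 3-hour difference between the two views is why the json-vs-parquet join in the example returns no rows.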
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013540#comment-16013540 ] Reynold Xin edited comment on SPARK-12297 at 5/17/17 5:12 AM: -- I don't think the CSV example you gave makes sense or supports your argument. It is still interpreted timestamp with timezone. Just specify a timezone in the string and Spark will use that timezone. was (Author: rxin): I don't think the CSV example you gave make sense. It is still interpreted timestamp with timezone. Just specify a timezone in the string and Spark will use that timezone. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. 
Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same 
times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala.
[jira] [Resolved] (SPARK-20776) Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20776. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18008 [https://github.com/apache/spark/pull/18008] > Fix JobProgressListener perf. problems caused by empty TaskMetrics > initialization > - > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.2.0 > > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent constructing empty > TaskMetrics instances inside JobProgressListener. As I'll show in a PR, we > can slightly simplify the code to remove the need to construct one empty > TaskMetrics per onTaskSubmitted event.
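The allocation pattern this issue targets is generic: building a fresh "empty" object on every event wastes time on a hot listener path when one shared immutable placeholder would do. A hypothetical sketch of the before/after shape (names are illustrative, not Spark's actual classes):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative contrast between per-event allocation and a shared immutable
// sentinel, the kind of simplification the referenced PR makes for empty
// TaskMetrics in JobProgressListener.
public class ListenerAlloc {
    static final class Metrics {
        final long runTimeMs;
        Metrics(long runTimeMs) { this.runTimeMs = runTimeMs; }
    }

    static final Metrics EMPTY = new Metrics(0L); // shared, immutable

    // Before: one new (identical, empty) object per onTaskSubmitted-style event.
    public static List<Metrics> perEvent(int events) {
        List<Metrics> out = new ArrayList<>();
        for (int i = 0; i < events; i++) out.add(new Metrics(0L));
        return out;
    }

    // After: every event reuses the same instance; no per-event construction.
    public static List<Metrics> shared(int events) {
        List<Metrics> out = new ArrayList<>();
        for (int i = 0; i < events; i++) out.add(EMPTY);
        return out;
    }
}
```

Sharing is only safe because the placeholder is immutable; a mutable metrics object would still need per-task instances.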
[jira] [Resolved] (SPARK-20690) Analyzer shouldn't add missing attributes through subquery
[ https://issues.apache.org/jira/browse/SPARK-20690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20690. - Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.3.0 > Analyzer shouldn't add missing attributes through subquery > -- > > Key: SPARK-20690 > URL: https://issues.apache.org/jira/browse/SPARK-20690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Labels: release-notes > Fix For: 2.3.0 > > > We add missing attributes into Filter in Analyzer. But we shouldn't do it > through subqueries like this: > {code} > select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 > {code} > This query works in current codebase. However, the outside where clause > shouldn't be able to refer t1.c1 attribute.
[jira] [Updated] (SPARK-20690) Analyzer shouldn't add missing attributes through subquery
[ https://issues.apache.org/jira/browse/SPARK-20690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20690: Labels: release-notes (was: ) > Analyzer shouldn't add missing attributes through subquery > -- > > Key: SPARK-20690 > URL: https://issues.apache.org/jira/browse/SPARK-20690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > Labels: release-notes > > We add missing attributes into Filter in Analyzer. But we shouldn't do it > through subqueries like this: > {code} > select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 > {code} > This query works in current codebase. However, the outside where clause > shouldn't be able to refer t1.c1 attribute.
[jira] [Updated] (SPARK-20780) Spark Kafka10 Consumer Hangs
[ https://issues.apache.org/jira/browse/SPARK-20780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jayadeepj updated SPARK-20780: -- Environment: Spark 2.1.0 Spark Streaming Kafka 010 Yarn - Cluster Mode CDH 5.8.4 CentOS Linux release 7.2 was: Spark 2.1.0 Spark Streaming Kafka 010 CDH 5.8.4 CentOS Linux release 7.2 > Spark Kafka10 Consumer Hangs > > > Key: SPARK-20780 > URL: https://issues.apache.org/jira/browse/SPARK-20780 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 > Spark Streaming Kafka 010 > Yarn - Cluster Mode > CDH 5.8.4 > CentOS Linux release 7.2 >Reporter: jayadeepj > Attachments: streaming_1.png, streaming_2.png, tasks_timing_out_3.png > > > We have recently upgraded our Streaming App with Direct Stream to Spark 2 > (spark-streaming-kafka-0-10 - 2.1.0) with Kafka version (0.10.0.0) & Consumer > 10 . We find abnormal delays after the application has run for a couple of > hours or completed consumption of approx. ~ 5 million records. > See screenshot 1 & 2 > There is a sudden dip in the processing time from ~15 seconds (usual for this > app) to ~3 minutes & from then on the processing time keeps degrading > throughout. > We have seen that the delay is due to certain tasks taking the exact time > duration of the configured Kafka Consumer 'request.timeout.ms' . We have > tested this by varying timeout property to different values. > See screenshot 3. > I think the get(offset: Long, timeout: Long): ConsumerRecord[K, V] method & > subsequent poll(timeout) method in CachedKafkaConsumer.scala is actually > timing out on some of the partitions without reading data. But the executor > logs it as successfully completed after the exact timeout duration. Note that > most other tasks are completing successfully with millisecond duration. The > timeout is most likely from the > org.apache.kafka.clients.consumer.KafkaConsumer & we did not observe any > network latency difference. 
> We have observed this across multiple clusters & multiple apps with & without > TLS/SSL. Spark 1.6 with 0-8 consumer seems to be fine with consistent > performance > 17/05/17 10:30:06 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446288 > 17/05/17 10:30:06 INFO executor.Executor: Running task 11.0 in stage 5663.0 > (TID 446288) > 17/05/17 10:30:06 INFO kafka010.KafkaRDD: Computing topic XX-XXX-XX, > partition 0 offsets 776843 -> 779591 > 17/05/17 10:30:06 INFO kafka010.CachedKafkaConsumer: Initial fetch for > spark-executor-default1 XX-XXX-XX 0 776843 > 17/05/17 10:30:56 INFO executor.Executor: Finished task 11.0 in stage 5663.0 > (TID 446288). 1699 bytes result sent to driver > 17/05/17 10:30:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446329 > 17/05/17 10:30:56 INFO executor.Executor: Running task 0.0 in stage 5667.0 > (TID 446329) > 17/05/17 10:30:56 INFO spark.MapOutputTrackerWorker: Updating epoch to 3116 > and clearing cache > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 6807 > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807_piece0 stored > as bytes in memory (estimated size 13.1 KB, free 4.1 GB) > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 6807 took 4 ms > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807 stored as > values in m > We can see that the log statement differ with the exact timeout duration. > Our consumer config is below. 
> 17/05/17 12:33:13 INFO dstream.ForEachDStream: Initialized and validated > org.apache.spark.streaming.dstream.ForEachDStream@1171dde4 > 17/05/17 12:33:13 INFO consumer.ConsumerConfig: ConsumerConfig values: > metric.reporters = [] > metadata.max.age.ms = 30 > partition.assignment.strategy = > [org.apache.kafka.clients.consumer.RangeAssignor] > reconnect.backoff.ms = 50 > sasl.kerberos.ticket.renew.window.factor = 0.8 > max.partition.fetch.bytes = 1048576 > bootstrap.servers = [x.xxx.xxx:9092] > ssl.keystore.type = JKS > enable.auto.commit = true > sasl.mechanism = GSSAPI > interceptor.classes = null > exclude.internal.topics = true > ssl.truststore.password = null > client.id = > ssl.endpoint.identification.algorithm = null > max.poll.records = 2147483647 > check.crcs = true > request.timeout.ms = 5 > heartbeat.interval.ms = 3000 > auto.commit.interval.ms = 5000 > receive.buffer.bytes = 65536 > ssl.truststore.type = JKS > ssl.truststore.location = null > ssl.keystore.password = null > fetch.min.bytes = 1 > send.buffer.bytes = 131072 > value.deserializer = class > org.apache.kafka.common.serial
[jira] [Updated] (SPARK-20762) Make String Params Case-Insensitive
[ https://issues.apache.org/jira/browse/SPARK-20762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-20762: - Description: Make String Params (excpet Cols) case-insensitve: {{solver}} {{modelType}} {{initMode}} {{metricName}} {{handleInvalid}} {{strategy}} {{stringOrderType}} {{coldStartStrategy}} {{impurity}} {{lossType}} {{featureSubsetStrategy}} {{intermediateStorageLevel}} {{finalStorageLevel}} was: Make String Params (excpet Cols) case-insensitve: {{solver}} {{modelType}} {{initMode}} {{metricName}} {{handleInvalid}} {{strategy}} {{stringOrderType}} {{coldStartStrategy}} {{impurity}} {{lossType}} {{}} > Make String Params Case-Insensitive > --- > > Key: SPARK-20762 > URL: https://issues.apache.org/jira/browse/SPARK-20762 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > Make String Params (excpet Cols) case-insensitve: > {{solver}} > {{modelType}} > {{initMode}} > {{metricName}} > {{handleInvalid}} > {{strategy}} > {{stringOrderType}} > {{coldStartStrategy}} > {{impurity}} > {{lossType}} > {{featureSubsetStrategy}} > {{intermediateStorageLevel}} > {{finalStorageLevel}}
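The requested behavior for these string-valued params can be sketched as simple lower-case normalization at validation time. This is an illustrative sketch only, not Spark ML's actual `Param` implementation, and `CaseInsensitiveParam` is a hypothetical name:

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical case-insensitive string param: both the allowed values and the
// user-supplied value are compared in lower case, and the canonical
// (lower-case) form is what gets stored.
public class CaseInsensitiveParam {
    private final Set<String> allowed; // expected to be lower-case

    public CaseInsensitiveParam(Set<String> allowedValues) {
        this.allowed = allowedValues;
    }

    // Returns the canonical value, or throws for an unsupported one.
    public String validate(String value) {
        String canonical = value.toLowerCase(Locale.ROOT);
        if (!allowed.contains(canonical)) {
            throw new IllegalArgumentException("unsupported value: " + value);
        }
        return canonical;
    }
}
```

Using `Locale.ROOT` avoids locale-dependent surprises (e.g. the Turkish dotless-i) when lower-casing identifiers.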
[jira] [Updated] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-20779: Description: when i test some examples, i found the license is not at the top in some files. and it will be best if we update these places of the ASF header to be consistent with other files. (was: The license is not at the top in some files. and it will be best if we update these places of the ASF header to be consistent with other files.) > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Priority: Trivial > > when i test some examples, i found the license is not at the top in some > files. and it will be best if we update these places of the ASF header to be > consistent with other files.
[jira] [Updated] (SPARK-20780) Spark Kafka10 Consumer Hangs
[ https://issues.apache.org/jira/browse/SPARK-20780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jayadeepj updated SPARK-20780: -- Environment: Spark 2.1.0 Spark Streaming Kafka 010 CDH 5.8.4 CentOS Linux release 7.2 was: Spark 2.1.0 Spark Streaming Kafka 010 CDH 5.8.4 CentOS Linux release 7.2 Priority: Major (was: Critical) > Spark Kafka10 Consumer Hangs > > > Key: SPARK-20780 > URL: https://issues.apache.org/jira/browse/SPARK-20780 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 > Spark Streaming Kafka 010 > CDH 5.8.4 > CentOS Linux release 7.2 >Reporter: jayadeepj > Attachments: streaming_1.png, streaming_2.png, tasks_timing_out_3.png > > > We have recently upgraded our Streaming App with Direct Stream to Spark 2 > (spark-streaming-kafka-0-10 - 2.1.0) with Kafka version (0.10.0.0) & Consumer > 10 . We find abnormal delays after the application has run for a couple of > hours or completed consumption of approx. ~ 5 million records. > See screenshot 1 & 2 > There is a sudden dip in the processing time from ~15 seconds (usual for this > app) to ~3 minutes & from then on the processing time keeps degrading > throughout. > We have seen that the delay is due to certain tasks taking the exact time > duration of the configured Kafka Consumer 'request.timeout.ms' . We have > tested this by varying timeout property to different values. > See screenshot 3. > I think the get(offset: Long, timeout: Long): ConsumerRecord[K, V] method & > subsequent poll(timeout) method in CachedKafkaConsumer.scala is actually > timing out on some of the partitions without reading data. But the executor > logs it as successfully completed after the exact timeout duration. Note that > most other tasks are completing successfully with millisecond duration. The > timeout is most likely from the > org.apache.kafka.clients.consumer.KafkaConsumer & we did not observe any > network latency difference. 
> We have observed this across multiple clusters & multiple apps with & without > TLS/SSL. Spark 1.6 with 0-8 consumer seems to be fine with consistent > performance > 17/05/17 10:30:06 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446288 > 17/05/17 10:30:06 INFO executor.Executor: Running task 11.0 in stage 5663.0 > (TID 446288) > 17/05/17 10:30:06 INFO kafka010.KafkaRDD: Computing topic XX-XXX-XX, > partition 0 offsets 776843 -> 779591 > 17/05/17 10:30:06 INFO kafka010.CachedKafkaConsumer: Initial fetch for > spark-executor-default1 XX-XXX-XX 0 776843 > 17/05/17 10:30:56 INFO executor.Executor: Finished task 11.0 in stage 5663.0 > (TID 446288). 1699 bytes result sent to driver > 17/05/17 10:30:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446329 > 17/05/17 10:30:56 INFO executor.Executor: Running task 0.0 in stage 5667.0 > (TID 446329) > 17/05/17 10:30:56 INFO spark.MapOutputTrackerWorker: Updating epoch to 3116 > and clearing cache > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 6807 > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807_piece0 stored > as bytes in memory (estimated size 13.1 KB, free 4.1 GB) > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 6807 took 4 ms > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807 stored as > values in m > We can see that the log statements differ by the exact timeout duration. > Our consumer config is below. 
> 17/05/17 12:33:13 INFO dstream.ForEachDStream: Initialized and validated > org.apache.spark.streaming.dstream.ForEachDStream@1171dde4 > 17/05/17 12:33:13 INFO consumer.ConsumerConfig: ConsumerConfig values: > metric.reporters = [] > metadata.max.age.ms = 30 > partition.assignment.strategy = > [org.apache.kafka.clients.consumer.RangeAssignor] > reconnect.backoff.ms = 50 > sasl.kerberos.ticket.renew.window.factor = 0.8 > max.partition.fetch.bytes = 1048576 > bootstrap.servers = [x.xxx.xxx:9092] > ssl.keystore.type = JKS > enable.auto.commit = true > sasl.mechanism = GSSAPI > interceptor.classes = null > exclude.internal.topics = true > ssl.truststore.password = null > client.id = > ssl.endpoint.identification.algorithm = null > max.poll.records = 2147483647 > check.crcs = true > request.timeout.ms = 5 > heartbeat.interval.ms = 3000 > auto.commit.interval.ms = 5000 > receive.buffer.bytes = 65536 > ssl.truststore.type = JKS > ssl.truststore.location = null > ssl.keystore.password = null > fetch.min.bytes = 1 > send.buffer.bytes = 131072 > value.deserializer = class > org.apache.kafka.common.seriali
[jira] [Updated] (SPARK-20780) Spark Kafka10 Consumer Hangs
[ https://issues.apache.org/jira/browse/SPARK-20780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jayadeepj updated SPARK-20780: -- Attachment: streaming_1.png streaming_2.png tasks_timing_out_3.png > Spark Kafka10 Consumer Hangs > > > Key: SPARK-20780 > URL: https://issues.apache.org/jira/browse/SPARK-20780 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 > Spark Streaming Kafka 010 > CDH 5.8.4 > CentOS Linux release 7.2 >Reporter: jayadeepj >Priority: Critical > Attachments: streaming_1.png, streaming_2.png, tasks_timing_out_3.png > > > We have recently upgraded our Streaming App with Direct Stream to Spark 2 > (spark-streaming-kafka-0-10 - 2.1.0) with Kafka version (0.10.0.0) & Consumer > 10 . We find abnormal delays after the application has run for a couple of > hours or completed consumption of approx. ~ 5 million records. > See screenshot 1 & 2 > There is a sudden dip in the processing time from ~15 seconds (usual for this > app) to ~3 minutes & from then on the processing time keeps degrading > throughout. > We have seen that the delay is due to certain tasks taking the exact time > duration of the configured Kafka Consumer 'request.timeout.ms' . We have > tested this by varying timeout property to different values. > See screenshot 3. > I think the get(offset: Long, timeout: Long): ConsumerRecord[K, V] method & > subsequent poll(timeout) method in CachedKafkaConsumer.scala is actually > timing out on some of the partitions without reading data. But the executor > logs it as successfully completed after the exact timeout duration. Note that > most other tasks are completing successfully with millisecond duration. The > timeout is most likely from the > org.apache.kafka.clients.consumer.KafkaConsumer & we did not observe any > network latency difference. > We have observed this across multiple clusters & multiple apps with & without > TLS/SSL. 
Spark 1.6 with 0-8 consumer seems to be fine with consistent > performance > 17/05/17 10:30:06 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446288 > 17/05/17 10:30:06 INFO executor.Executor: Running task 11.0 in stage 5663.0 > (TID 446288) > 17/05/17 10:30:06 INFO kafka010.KafkaRDD: Computing topic XX-XXX-XX, > partition 0 offsets 776843 -> 779591 > 17/05/17 10:30:06 INFO kafka010.CachedKafkaConsumer: Initial fetch for > spark-executor-default1 XX-XXX-XX 0 776843 > 17/05/17 10:30:56 INFO executor.Executor: Finished task 11.0 in stage 5663.0 > (TID 446288). 1699 bytes result sent to driver > 17/05/17 10:30:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446329 > 17/05/17 10:30:56 INFO executor.Executor: Running task 0.0 in stage 5667.0 > (TID 446329) > 17/05/17 10:30:56 INFO spark.MapOutputTrackerWorker: Updating epoch to 3116 > and clearing cache > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 6807 > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807_piece0 stored > as bytes in memory (estimated size 13.1 KB, free 4.1 GB) > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 6807 took 4 ms > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807 stored as > values in m > We can see that the log statements differ by the exact timeout duration. > Our consumer config is below. 
> 17/05/17 12:33:13 INFO dstream.ForEachDStream: Initialized and validated > org.apache.spark.streaming.dstream.ForEachDStream@1171dde4 > 17/05/17 12:33:13 INFO consumer.ConsumerConfig: ConsumerConfig values: > metric.reporters = [] > metadata.max.age.ms = 30 > partition.assignment.strategy = > [org.apache.kafka.clients.consumer.RangeAssignor] > reconnect.backoff.ms = 50 > sasl.kerberos.ticket.renew.window.factor = 0.8 > max.partition.fetch.bytes = 1048576 > bootstrap.servers = [x.xxx.xxx:9092] > ssl.keystore.type = JKS > enable.auto.commit = true > sasl.mechanism = GSSAPI > interceptor.classes = null > exclude.internal.topics = true > ssl.truststore.password = null > client.id = > ssl.endpoint.identification.algorithm = null > max.poll.records = 2147483647 > check.crcs = true > request.timeout.ms = 5 > heartbeat.interval.ms = 3000 > auto.commit.interval.ms = 5000 > receive.buffer.bytes = 65536 > ssl.truststore.type = JKS > ssl.truststore.location = null > ssl.keystore.password = null > fetch.min.bytes = 1 > send.buffer.bytes = 131072 > value.deserializer = class > org.apache.kafka.common.serialization.ByteArrayDeserializer > group.id = default1 > retry.backoff.
[jira] [Updated] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-20779: Issue Type: Improvement (was: Bug) > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Priority: Trivial > > The license is not at the top in some files. It would be best to > update these placements of the ASF header to be consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20780) Spark Kafka10 Consumer Hangs
jayadeepj created SPARK-20780: - Summary: Spark Kafka10 Consumer Hangs Key: SPARK-20780 URL: https://issues.apache.org/jira/browse/SPARK-20780 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.1.0 Environment: Spark 2.1.0 Spark Streaming Kafka 010 CDH 5.8.4 CentOS Linux release 7.2 Reporter: jayadeepj Priority: Critical We have recently upgraded our Streaming App with Direct Stream to Spark 2 (spark-streaming-kafka-0-10 - 2.1.0) with Kafka version (0.10.0.0) & Consumer 10 . We find abnormal delays after the application has run for a couple of hours or completed consumption of approx. ~ 5 million records. See screenshot 1 & 2 There is a sudden dip in the processing time from ~15 seconds (usual for this app) to ~3 minutes & from then on the processing time keeps degrading throughout. We have seen that the delay is due to certain tasks taking the exact time duration of the configured Kafka Consumer 'request.timeout.ms' . We have tested this by varying timeout property to different values. See screenshot 3. I think the get(offset: Long, timeout: Long): ConsumerRecord[K, V] method & subsequent poll(timeout) method in CachedKafkaConsumer.scala is actually timing out on some of the partitions without reading data. But the executor logs it as successfully completed after the exact timeout duration. Note that most other tasks are completing successfully with millisecond duration. The timeout is most likely from the org.apache.kafka.clients.consumer.KafkaConsumer & we did not observe any network latency difference. We have observed this across multiple clusters & multiple apps with & without TLS/SSL. 
Spark 1.6 with 0-8 consumer seems to be fine with consistent performance 17/05/17 10:30:06 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 446288 17/05/17 10:30:06 INFO executor.Executor: Running task 11.0 in stage 5663.0 (TID 446288) 17/05/17 10:30:06 INFO kafka010.KafkaRDD: Computing topic XX-XXX-XX, partition 0 offsets 776843 -> 779591 17/05/17 10:30:06 INFO kafka010.CachedKafkaConsumer: Initial fetch for spark-executor-default1 XX-XXX-XX 0 776843 17/05/17 10:30:56 INFO executor.Executor: Finished task 11.0 in stage 5663.0 (TID 446288). 1699 bytes result sent to driver 17/05/17 10:30:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 446329 17/05/17 10:30:56 INFO executor.Executor: Running task 0.0 in stage 5667.0 (TID 446329) 17/05/17 10:30:56 INFO spark.MapOutputTrackerWorker: Updating epoch to 3116 and clearing cache 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 6807 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807_piece0 stored as bytes in memory (estimated size 13.1 KB, free 4.1 GB) 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable 6807 took 4 ms 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807 stored as values in m We can see that the log statements differ by the exact timeout duration. Our consumer config is below. 
17/05/17 12:33:13 INFO dstream.ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@1171dde4 17/05/17 12:33:13 INFO consumer.ConsumerConfig: ConsumerConfig values: metric.reporters = [] metadata.max.age.ms = 30 partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor] reconnect.backoff.ms = 50 sasl.kerberos.ticket.renew.window.factor = 0.8 max.partition.fetch.bytes = 1048576 bootstrap.servers = [x.xxx.xxx:9092] ssl.keystore.type = JKS enable.auto.commit = true sasl.mechanism = GSSAPI interceptor.classes = null exclude.internal.topics = true ssl.truststore.password = null client.id = ssl.endpoint.identification.algorithm = null max.poll.records = 2147483647 check.crcs = true request.timeout.ms = 5 heartbeat.interval.ms = 3000 auto.commit.interval.ms = 5000 receive.buffer.bytes = 65536 ssl.truststore.type = JKS ssl.truststore.location = null ssl.keystore.password = null fetch.min.bytes = 1 send.buffer.bytes = 131072 value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer group.id = default1 retry.backoff.ms = 100 ssl.secure.random.implementation = null sasl.kerberos.kinit.cmd = /usr/bin/kinit sasl.kerberos.service.name = null sasl.kerberos.ticket.renew.jitter = 0.05 ssl.trustmanager.algorithm = PKIX ssl.key.password = null fetch.max.wait.ms = 500 sasl.kerberos.min.time.before.relogin = 6 connections.max.idle.ms = 54 session.timeout.ms = 3
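The failure mode described above — a poll that returns no data, with the task still logged as successfully finished after exactly the configured timeout — can be modeled with a minimal sketch (plain Python; a hypothetical simplification for illustration, not the actual CachedKafkaConsumer code):

```python
import time

# Hypothetical model of the fetch loop: poll() returns a batch of records,
# or an empty list after blocking for the full timeout when no data arrives
# (standing in for the Kafka consumer's request.timeout.ms behavior).
def poll(timeout_s, records_available):
    if records_available:
        return ["record"]
    time.sleep(timeout_s)  # block for the whole timeout, then give up
    return []

def get(offset, timeout_s, records_available):
    start = time.time()
    batch = poll(timeout_s, records_available)
    elapsed = time.time() - start
    # The task "finishes" either way; only its duration betrays the timeout,
    # consistent with the 50-second gap between the executor log lines above.
    return batch, elapsed

batch, elapsed = get(776843, 0.05, records_available=False)
assert batch == [] and elapsed >= 0.04  # timed out, yet reported as finished
```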
[jira] [Assigned] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20779: Assignee: (was: Apache Spark) > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Priority: Trivial > > The license is not at the top in some files. It would be best to > update these placements of the ASF header to be consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20779: Assignee: Apache Spark > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Assignee: Apache Spark >Priority: Trivial > > The license is not at the top in some files. It would be best to > update these placements of the ASF header to be consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013465#comment-16013465 ] Apache Spark commented on SPARK-20779: -- User 'zuotingbing' has created a pull request for this issue: https://github.com/apache/spark/pull/18012 > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Priority: Trivial > > The license is not at the top in some files. It would be best to > update these placements of the ASF header to be consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20779) The ASF header placed in an incorrect location in some files
zuotingbing created SPARK-20779: --- Summary: The ASF header placed in an incorrect location in some files Key: SPARK-20779 URL: https://issues.apache.org/jira/browse/SPARK-20779 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 2.1.1, 2.1.0 Reporter: zuotingbing Priority: Trivial The license is not at the top in some files. It would be best to update these placements of the ASF header to be consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
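The check behind this fix is mechanical; a minimal sketch (plain Python, illustration only — `header_at_top` is a hypothetical checker, not part of the actual patch) of verifying that the ASF header is the first thing in a source file:

```python
# Hypothetical checker: the ASF header marker should appear near the top
# of a source file, before any code (illustration only, not the real patch).
ASF_MARKER = "Licensed to the Apache Software Foundation"

def header_at_top(text, search_lines=5):
    """True if the ASF header marker appears within the first few lines."""
    return any(ASF_MARKER in line for line in text.splitlines()[:search_lines])

good = "/*\n * Licensed to the Apache Software Foundation (ASF) ...\n */\npackage example\n"
bad = "package example\n\nimport something\n\n// more code\n// Licensed to the Apache Software Foundation\n"
assert header_at_top(good) and not header_at_top(bad)
```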
[jira] [Commented] (SPARK-20772) Add support for query parameters in redirects on Yarn
[ https://issues.apache.org/jira/browse/SPARK-20772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013431#comment-16013431 ] Saisai Shao commented on SPARK-20772: - I'm guessing this is an issue with {{AmIpFilter}}; if so, it should be a YARN issue, not related to Spark? > Add support for query parameters in redirects on Yarn > - > > Key: SPARK-20772 > URL: https://issues.apache.org/jira/browse/SPARK-20772 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.1.0 >Reporter: Bjorn Jonsson >Priority: Minor > > Spark uses rewrites of query parameters to paths > (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). > This works fine in local or standalone mode, but does not work on Yarn (with > the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where > the query parameter is dropped. > The repro steps are: > - Start up the spark-shell and run a job > - Try to access the job details through http://:4040/jobs/job?id=0 > - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) > Going directly through the RM proxy works (does not cause query parameters to > be dropped). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20772) Add support for query parameters in redirects on Yarn
[ https://issues.apache.org/jira/browse/SPARK-20772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjorn Jonsson updated SPARK-20772: -- Description: Spark uses rewrites of query parameters to paths (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). This works fine in local or standalone mode, but does not work on Yarn (with the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where the query parameter is dropped. The repro steps are: - Start up the spark-shell and run a job - Try to access the job details through http://:4040/jobs/job?id=0 - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) Going directly through the RM proxy works (does not cause query parameters to be dropped). was: Spark uses rewrites of query parameters to paths (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). This works fine in local or standalone mode, but does not work on Yarn (with the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where the query parameter is dropped. The repro steps are: - Start up the spark-shell in yarn client or cluster mode and run a job - Try to access the job details through http://:4040/jobs/job?id=0 - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) Going directly through the RM proxy works (does not cause query parameters to be dropped). > Add support for query parameters in redirects on Yarn > - > > Key: SPARK-20772 > URL: https://issues.apache.org/jira/browse/SPARK-20772 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.1.0 >Reporter: Bjorn Jonsson >Priority: Minor > > Spark uses rewrites of query parameters to paths > (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). > This works fine in local or standalone mode, but does not work on Yarn (with > the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where > the query parameter is dropped. 
> The repro steps are: > - Start up the spark-shell and run a job > - Try to access the job details through http://:4040/jobs/job?id=0 > - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) > Going directly through the RM proxy works (does not cause query parameters to > be dropped). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
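The rewrite described above can keep the query string when the trailing slash is added; a minimal sketch in plain Python (illustration only — the real fix would live in the proxy/redirect handling, and `redirect_with_query` plus the `host` placeholder are hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit

def redirect_with_query(url):
    """Add the trailing slash to the path while keeping the query string."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    if not path.endswith("/"):
        path += "/"
    return urlunsplit((scheme, netloc, path, query, fragment))

# The ?id=0 parameter survives the rewrite instead of being dropped.
assert redirect_with_query("http://host:4040/jobs/job?id=0") == "http://host:4040/jobs/job/?id=0"
```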
[jira] [Assigned] (SPARK-19089) Support nested arrays/seqs in Datasets
[ https://issues.apache.org/jira/browse/SPARK-19089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19089: Assignee: (was: Apache Spark) > Support nested arrays/seqs in Datasets > -- > > Key: SPARK-19089 > URL: https://issues.apache.org/jira/browse/SPARK-19089 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michal Šenkýř >Priority: Minor > > Nested arrays and seqs are not supported in Datasets: > {code} > scala> spark.createDataset(Seq(Array(Array(1)))) > <console>:24: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. >spark.createDataset(Seq(Array(Array(1)))) > ^ > scala> Seq(Array(Array(1))).toDS() > <console>:24: error: value toDS is not a member of Seq[Array[Array[Int]]] >Seq(Array(Array(1))).toDS() > scala> spark.createDataset(Seq(Seq(Seq(1)))) > <console>:24: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. >spark.createDataset(Seq(Seq(Seq(1)))) > scala> Seq(Seq(Seq(1))).toDS() > <console>:24: error: value toDS is not a member of Seq[Seq[Seq[Int]]] >Seq(Seq(Seq(1))).toDS() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19089) Support nested arrays/seqs in Datasets
[ https://issues.apache.org/jira/browse/SPARK-19089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19089: Assignee: Apache Spark > Support nested arrays/seqs in Datasets > -- > > Key: SPARK-19089 > URL: https://issues.apache.org/jira/browse/SPARK-19089 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michal Šenkýř >Assignee: Apache Spark >Priority: Minor > > Nested arrays and seqs are not supported in Datasets: > {code} > scala> spark.createDataset(Seq(Array(Array(1)))) > <console>:24: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. >spark.createDataset(Seq(Array(Array(1)))) > ^ > scala> Seq(Array(Array(1))).toDS() > <console>:24: error: value toDS is not a member of Seq[Array[Array[Int]]] >Seq(Array(Array(1))).toDS() > scala> spark.createDataset(Seq(Seq(Seq(1)))) > <console>:24: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. >spark.createDataset(Seq(Seq(Seq(1)))) > scala> Seq(Seq(Seq(1))).toDS() > <console>:24: error: value toDS is not a member of Seq[Seq[Seq[Int]]] >Seq(Seq(Seq(1))).toDS() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19089) Support nested arrays/seqs in Datasets
[ https://issues.apache.org/jira/browse/SPARK-19089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013350#comment-16013350 ] Apache Spark commented on SPARK-19089: -- User 'michalsenkyr' has created a pull request for this issue: https://github.com/apache/spark/pull/18011 > Support nested arrays/seqs in Datasets > -- > > Key: SPARK-19089 > URL: https://issues.apache.org/jira/browse/SPARK-19089 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michal Šenkýř >Priority: Minor > > Nested arrays and seqs are not supported in Datasets: > {code} > scala> spark.createDataset(Seq(Array(Array(1)))) > <console>:24: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. >spark.createDataset(Seq(Array(Array(1)))) > ^ > scala> Seq(Array(Array(1))).toDS() > <console>:24: error: value toDS is not a member of Seq[Array[Array[Int]]] >Seq(Array(Array(1))).toDS() > scala> spark.createDataset(Seq(Seq(Seq(1)))) > <console>:24: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. >spark.createDataset(Seq(Seq(Seq(1)))) > scala> Seq(Seq(Seq(1))).toDS() > <console>:24: error: value toDS is not a member of Seq[Seq[Seq[Int]]] >Seq(Seq(Seq(1))).toDS() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20778) Implement array_intersect function
[ https://issues.apache.org/jira/browse/SPARK-20778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20778: Assignee: (was: Apache Spark) > Implement array_intersect function > -- > > Key: SPARK-20778 > URL: https://issues.apache.org/jira/browse/SPARK-20778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Vandenberg >Priority: Minor > > Implement an array_intersect function that takes array arguments and returns > an array containing all elements of the first array that is common with the > remaining arrays. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20778) Implement array_intersect function
[ https://issues.apache.org/jira/browse/SPARK-20778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20778: Assignee: Apache Spark > Implement array_intersect function > -- > > Key: SPARK-20778 > URL: https://issues.apache.org/jira/browse/SPARK-20778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Vandenberg >Assignee: Apache Spark >Priority: Minor > > Implement an array_intersect function that takes array arguments and returns > an array containing all elements of the first array that is common with the > remaining arrays. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20778) Implement array_intersect function
[ https://issues.apache.org/jira/browse/SPARK-20778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013337#comment-16013337 ] Apache Spark commented on SPARK-20778: -- User 'ericvandenbergfb' has created a pull request for this issue: https://github.com/apache/spark/pull/18010 > Implement array_intersect function > -- > > Key: SPARK-20778 > URL: https://issues.apache.org/jira/browse/SPARK-20778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Vandenberg >Priority: Minor > > Implement an array_intersect function that takes array arguments and returns > an array containing all elements of the first array that is common with the > remaining arrays. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20778) Implement array_intersect function
Eric Vandenberg created SPARK-20778: --- Summary: Implement array_intersect function Key: SPARK-20778 URL: https://issues.apache.org/jira/browse/SPARK-20778 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Eric Vandenberg Priority: Minor Implement an array_intersect function that takes array arguments and returns an array containing all elements of the first array that is common with the remaining arrays. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
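The proposed semantics can be sketched in plain Python (an assumption about the intended behavior — for instance, how duplicates in the first array are handled is not specified above — not the actual SQL implementation):

```python
def array_intersect(first, *rest):
    """Elements of the first array that appear in every remaining array,
    in the first array's order (duplicate handling is a guess here)."""
    others = [set(arr) for arr in rest]
    return [x for x in first if all(x in s for s in others)]

assert array_intersect([1, 2, 3, 4], [2, 4, 6], [4, 2]) == [2, 4]
```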
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013317#comment-16013317 ] Josh Rosen commented on SPARK-18838: I think that SPARK-20776 / https://github.com/apache/spark/pull/18008 might help here: it addresses a nasty performance bug in JobProgressListener.onTaskStart caused by the unnecessary creation of empty TaskMetrics objects. [~sitalke...@gmail.com] [~zsxwing], I'd be interested to see if we can do a coarse-grained split between users' custom listeners and Spark's own internal listeners. If we're careful in performance optimization of Spark's core internal listeners (such as ExecutorAllocationManagerListener) then it might be okay to publish events directly to those listeners (without buffering) and use buffering only for third-party listeners where we don't want to risk perf. bugs slowing down the cluster. Alternatively, we could use two queues, one for internal listeners and another for external ones. This wouldn't be as fine-grained as thread-per-listener but might buy us a lot of the benefits with perhaps less code needed. > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and this delay might > hurt the job performance significantly or even fail the job. For example, a > significant delay in receiving the `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` manager to mistakenly remove an executor which is > not idle. 
> The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the Listeners for each event and processes each event > synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single-threaded processor often becomes the bottleneck for large jobs. > Also, if one of the Listeners is very slow, all the listeners will pay the > price of the delay incurred by the slow listener. In addition, a slow > listener can cause events to be dropped from the event queue, which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single-threaded event processor. Instead, each listener will have its own > dedicated single-threaded executor service. Whenever an event is posted, it > will be submitted to the executor service of every listener. The single-threaded > executor service will guarantee in-order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow the memory indefinitely. The downside of this > approach is that a separate event queue per listener will increase the driver memory > footprint.
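The proposal above (one dedicated single-threaded, bounded executor per listener) can be sketched in plain Java. The class and method names below are illustrative, not Spark's actual API; the capacity and drop policy are assumptions for the sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class PerListenerBus {
    private final List<Consumer<Object>> listeners = new ArrayList<>();
    private final List<ExecutorService> executors = new ArrayList<>();

    public void addListener(Consumer<Object> listener) {
        listeners.add(listener);
        // One single-threaded executor per listener: preserves per-listener
        // event order while isolating slow listeners from fast ones.
        // The bounded queue caps memory; when it fills up, the event is
        // silently dropped (DiscardPolicy), mirroring the dropped-event
        // behavior described above.
        executors.add(new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(10000),
            new ThreadPoolExecutor.DiscardPolicy()));
    }

    public void post(Object event) {
        // Posting never blocks the caller on a slow listener; each
        // listener consumes from its own queue at its own pace.
        for (int i = 0; i < listeners.size(); i++) {
            Consumer<Object> l = listeners.get(i);
            executors.get(i).submit(() -> l.accept(event));
        }
    }

    public void stop() throws InterruptedException {
        for (ExecutorService e : executors) {
            e.shutdown();
            e.awaitTermination(10, TimeUnit.SECONDS);
        }
    }
}
```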
[jira] [Commented] (SPARK-20235) Hive on S3 s3:sse and non S3:sse buckets
[ https://issues.apache.org/jira/browse/SPARK-20235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013290#comment-16013290 ] Franck Tago commented on SPARK-20235: - was this comment meant for me? What does that mean? > Hive on S3 s3:sse and non S3:sse buckets > - > > Key: SPARK-20235 > URL: https://issues.apache.org/jira/browse/SPARK-20235 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Franck Tago >Priority: Minor > > My Spark application writes into 2 Hive tables. > Both tables are external, with data residing on S3. > I want to encrypt the data when writing into Hive table 1, but I do not want > to encrypt the data when writing into Hive table 2. > Given that the parameter fs.s3a.server-side-encryption-algorithm is set > globally, I do not see how these use cases are supported in Spark.
[jira] [Commented] (SPARK-18891) Support for specific collection types
[ https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013278#comment-16013278 ] Apache Spark commented on SPARK-18891: -- User 'michalsenkyr' has created a pull request for this issue: https://github.com/apache/spark/pull/18009 > Support for specific collection types > - > > Key: SPARK-18891 > URL: https://issues.apache.org/jira/browse/SPARK-18891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.1.0 >Reporter: Michael Armbrust >Priority: Critical > > Encoders treat all collections the same (i.e. {{Seq}} vs {{List}}) which > force users to only define classes with the most generic type. > An [example > error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]: > {code} > case class SpecificCollection(aList: List[Int]) > Seq(SpecificCollection(1 :: Nil)).toDS().collect() > {code} > {code} > java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 98, Column 120: No applicable constructor/method found > for actual parameters "scala.collection.Seq"; candidates are: > "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)" > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20777) Spark Streaming NullPointerException when restoring from hdfs checkpoint
Richard Moorhead created SPARK-20777: Summary: Spark Streaming NullPointerException when restoring from hdfs checkpoint Key: SPARK-20777 URL: https://issues.apache.org/jira/browse/SPARK-20777 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2 Environment: AWS EMR 5.2 Reporter: Richard Moorhead Priority: Minor When restoring a Spark Streaming job from a checkpoint, I am experiencing infrequent NullPointerExceptions when transformations are applied to RDDs with `foreachRDD`. See http://stackoverflow.com/questions/43984672/spark-kinesis-streaming-checkpoint-recovery-rdd-nullpointer-exception
[jira] [Updated] (SPARK-20776) Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-20776: --- Attachment: (was: screenshot-1.png) > Fix JobProgressListener perf. problems caused by empty TaskMetrics > initialization > - > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent constructing empty > TaskMetrics instances. As I'll show in a PR, we can slightly simplify the > code to remove the need to construct one empty TaskMetrics per > onTaskSubmitted event. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20776) Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-20776: --- Description: In {code} ./bin/spark-shell --master=local[64] {code} I ran {code} sc.parallelize(1 to 10, 10).count() {code} and profiled the time spend in the LiveListenerBus event processing thread. I discovered that the majority of the time was being spent constructing empty TaskMetrics instances. As I'll show in a PR, we can slightly simplify the code to remove the need to construct one empty TaskMetrics per onTaskSubmitted event. was: In {code} ./bin/spark-shell --master=local[64] {code} I ran {code} sc.parallelize(1 to 10, 10).count() {code} and profiled the time spend in the LiveListenerBus event processing thread. I discovered that the majority of the time was being spent initializing the {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to remove this bottleneck and prevent dropped listener events. > Fix JobProgressListener perf. problems caused by empty TaskMetrics > initialization > - > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent constructing empty > TaskMetrics instances. As I'll show in a PR, we can slightly simplify the > code to remove the need to construct one empty TaskMetrics per > onTaskSubmitted event. 
[jira] [Updated] (SPARK-20776) Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-20776: --- Description: In {code} ./bin/spark-shell --master=local[64] {code} I ran {code} sc.parallelize(1 to 10, 10).count() {code} and profiled the time spend in the LiveListenerBus event processing thread. I discovered that the majority of the time was being spent constructing empty TaskMetrics instances inside JobProgressListener. As I'll show in a PR, we can slightly simplify the code to remove the need to construct one empty TaskMetrics per onTaskSubmitted event. was: In {code} ./bin/spark-shell --master=local[64] {code} I ran {code} sc.parallelize(1 to 10, 10).count() {code} and profiled the time spend in the LiveListenerBus event processing thread. I discovered that the majority of the time was being spent constructing empty TaskMetrics instances. As I'll show in a PR, we can slightly simplify the code to remove the need to construct one empty TaskMetrics per onTaskSubmitted event. > Fix JobProgressListener perf. problems caused by empty TaskMetrics > initialization > - > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent constructing empty > TaskMetrics instances inside JobProgressListener. As I'll show in a PR, we > can slightly simplify the code to remove the need to construct one empty > TaskMetrics per onTaskSubmitted event. 
[jira] [Updated] (SPARK-20776) Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-20776: --- Summary: Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization (was: Fix performance problems in TaskMetrics.nameToAccums map initialization) > Fix JobProgressListener perf. problems caused by empty TaskMetrics > initialization > - > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Attachments: screenshot-1.png > > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent initializing the > {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the > use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to > remove this bottleneck and prevent dropped listener events. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013247#comment-16013247 ] Ruslan Dautkhanov commented on SPARK-16441: --- We did not have spark.dynamicAllocation.maxExecutors set either, and after we set it to a value equal to the maximum number of containers possible per YARN allocation, the issue {noformat} Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler {noformat} happens less often, but still happens. We suspect that having concurrentSQL turned on also demands higher values of spark.scheduler.listenerbus.eventqueue.size. > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0, 2.1.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai > Attachments: SPARK-16441-compare-apply-PR-16819.zip, > SPARK-16441-stage.jpg, SPARK-16441-threadDump.jpg, > SPARK-16441-yarn-metrics.jpg > > > The Spark application is waiting for an RPC response all the time and the Spark > listener is blocked by dynamic allocation. Executors cannot connect to the > driver and are lost. 
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > 
org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618) > - waiting to lock <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListener
[jira] [Commented] (SPARK-20776) Fix performance problems in TaskMetrics.nameToAccums map initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013220#comment-16013220 ] Apache Spark commented on SPARK-20776: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/18008 > Fix performance problems in TaskMetrics.nameToAccums map initialization > --- > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Attachments: screenshot-1.png > > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent initializing the > {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the > use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to > remove this bottleneck and prevent dropped listener events. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20776) Fix performance problems in TaskMetrics.nameToAccums map initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20776: Assignee: Apache Spark (was: Josh Rosen) > Fix performance problems in TaskMetrics.nameToAccums map initialization > --- > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Apache Spark > Attachments: screenshot-1.png > > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent initializing the > {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the > use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to > remove this bottleneck and prevent dropped listener events. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20776) Fix performance problems in TaskMetrics.nameToAccums map initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20776: Assignee: Josh Rosen (was: Apache Spark) > Fix performance problems in TaskMetrics.nameToAccums map initialization > --- > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Attachments: screenshot-1.png > > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent initializing the > {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the > use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to > remove this bottleneck and prevent dropped listener events. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20776) Fix performance problems in TaskMetrics.nameToAccums map initialization
Josh Rosen created SPARK-20776: -- Summary: Fix performance problems in TaskMetrics.nameToAccums map initialization Key: SPARK-20776 URL: https://issues.apache.org/jira/browse/SPARK-20776 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Attachments: screenshot-1.png In {code} ./bin/spark-shell --master=local[64] {code} I ran {code} sc.parallelize(1 to 10, 10).count() {code} and profiled the time spent in the LiveListenerBus event processing thread. I discovered that the majority of the time was being spent initializing the {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to remove this bottleneck and prevent dropped listener events.
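The fix described here (replacing Scala's LinkedHashMap with a pre-sized Java hashmap) rests on a general point: pre-sizing a hash map for a known entry count avoids repeated resizing and rehashing on a hot path. A minimal standalone Java illustration; the method and key names are made up for the sketch and are not Spark's:

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMapDemo {
    // Builds a map of n entries. Pre-sizing with the expected count
    // (divided by HashMap's default 0.75 load factor) lets the backing
    // array be allocated once instead of being grown and rehashed
    // repeatedly as entries are inserted.
    public static Map<String, Long> buildMetrics(int n) {
        // Capacity chosen so that n entries fit without triggering a resize.
        Map<String, Long> m = new HashMap<>((int) (n / 0.75f) + 1);
        for (int i = 0; i < n; i++) {
            m.put("metric." + i, 0L);
        }
        return m;
    }

    public static void main(String[] args) {
        Map<String, Long> metrics = buildMetrics(16);
        System.out.println(metrics.size());  // 16
    }
}
```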
[jira] [Updated] (SPARK-20776) Fix performance problems in TaskMetrics.nameToAccums map initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-20776: --- Attachment: screenshot-1.png > Fix performance problems in TaskMetrics.nameToAccums map initialization > --- > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Attachments: screenshot-1.png > > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent initializing the > {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the > use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to > remove this bottleneck and prevent dropped listener events. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15703) Make ListenerBus event queue size configurable
[ https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013208#comment-16013208 ] Ruslan Dautkhanov commented on SPARK-15703: --- We keep running into this issue too - it would be great to document spark.scheduler.listenerbus.eventqueue.size > Make ListenerBus event queue size configurable > -- > > Key: SPARK-15703 > URL: https://issues.apache.org/jira/browse/SPARK-15703 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Thomas Graves >Assignee: Dhruve Ashar >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot > 2016-06-01 at 11.23.48 AM.png, spark-dynamic-executor-allocation.png, > SparkListenerBus .png > > > The Spark UI doesn't seem to be showing all the tasks and metrics. > I ran a job with 10 tasks but the Detail stage page says it completed 93029: > Summary Metrics for 93029 Completed Tasks > The Stages for all jobs pages list that only 89519/10 tasks finished but > it's completed. The metrics for shuffled write and input are also incorrect. > I will attach screen shots. > I checked the logs and it does show that all the tasks actually finished. > 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 > (TID 54038) in 265309 ms on 10.213.45.51 (10/10) > 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks > have all completed, from pool
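The dropped-event behavior that makes this queue size worth tuning comes from a bounded queue whose non-blocking offer fails when the queue is full. A minimal Java sketch of that mechanism; the class name and capacity are illustrative, and Spark's real queue lives inside LiveListenerBus:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedEventQueue {
    private final BlockingQueue<Object> queue;
    private long droppedCount = 0;

    public BoundedEventQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking post: if the consumer thread has fallen behind and the
    // queue is full, the event is dropped and counted. This is the point
    // where "Dropping SparkListenerEvent ..." style warnings would be
    // logged, and why a larger configured capacity reduces drops.
    public boolean post(Object event) {
        if (queue.offer(event)) {
            return true;
        }
        droppedCount++;
        return false;
    }

    public long dropped() { return droppedCount; }

    public int size() { return queue.size(); }
}
```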
[jira] [Comment Edited] (SPARK-20700) InferFiltersFromConstraints stackoverflows for query (v2)
[ https://issues.apache.org/jira/browse/SPARK-20700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013199#comment-16013199 ] Jiang Xingbo edited comment on SPARK-20700 at 5/16/17 10:23 PM: In the previous approach we used `aliasMap` to link an `Attribute` to the expression with potentially the form `f(a, b)`, but we only searched the `expressions` and `children.expressions` for this, which is not enough when an `Alias` may lie deep in the logical plan. In that case, we can't generate the valid equivalent constraint classes and thus we fail at preventing the recursive deductions. I'll send a PR to fix this later today. was (Author: jiangxb1987): In the previous approach we used `aliasMap` to link an `Attribute` to the expression with potentially the form `f(a, b)`, but we only searched the `expressions` and `children.expressions` for this, which is not enough when an `Alias` may lie deep in the logical plan. In that case, we can't generate the valid equivalent constraint classes and thus we fail to prevent the recursive deductions. I'll send a PR to fix this later today. 
> InferFiltersFromConstraints stackoverflows for query (v2) > - > > Key: SPARK-20700 > URL: https://issues.apache.org/jira/browse/SPARK-20700 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Jiang Xingbo > > The following (complicated) query eventually fails with a stack overflow > during optimization: > {code} > CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, > int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES > ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', > TIMESTAMP('2015-01-14 00:00:00.0'), '947'), > ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', > TIMESTAMP('1999-08-15 00:00:00.0'), '437'), > ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', > TIMESTAMP('1991-05-23 00:00:00.0'), '630'), > ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), > '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'), > ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS > STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'), > ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', > CAST(NULL AS TIMESTAMP), '-740'), > ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL > AS TIMESTAMP), CAST(NULL AS STRING)), > ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', > CAST(NULL AS TIMESTAMP), '181'), > ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', > TIMESTAMP('2016-06-30 00:00:00.0'), '487'), > ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS > STRING), CAST(NULL AS TIMESTAMP), '-62'); > CREATE VIEW bools(a, b) as values (1, true), (1, true), (1, null); > SELECT > AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 PRECEDING ) AS > float_col, > COUNT(t1.smallint_col_2) AS int_col > FROM table_5 t1 > INNER JOIN ( > SELECT > (MIN(-83) OVER (PARTITION BY t2.a ORDER 
BY t2.a, (t1.int_col_4) * > (t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 15 FOLLOWING)) NOT IN (-222, 928) > AS boolean_col, > t2.a, > (t1.int_col_4) * (t1.int_col_4) AS int_col > FROM table_5 t1 > LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4) > WHERE > (t1.smallint_col_2) > (t1.smallint_col_2) > GROUP BY > t2.a, > (t1.int_col_4) * (t1.int_col_4) > HAVING > ((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * (t1.int_col_4), > SUM(t1.int_col_4)) > ) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = (t1.int_col_4))) AND > ((t2.a) = (t1.smallint_col_2)); > {code} > (I haven't tried to minimize this failing case yet). > Based on sampled jstacks from the driver, it looks like the query might be > repeatedly inferring filters from constraints and then pruning those filters. > Here's part of the stack at the point where it stackoverflows: > {code} > [... repeats ...] > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(Tra
[jira] [Commented] (SPARK-20700) InferFiltersFromConstraints stackoverflows for query (v2)
[ https://issues.apache.org/jira/browse/SPARK-20700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013199#comment-16013199 ] Jiang Xingbo commented on SPARK-20700: -- In the previous approach we used `aliasMap` to link an `Attribute` to the expression with potentially the form `f(a, b)`, but we only searched the `expressions` and `children.expressions` for this, which is not enough when an `Alias` may lies deep in the logical plan. In that case, we can't generate the valid equivalent constraint classes and thus we fail to prevent the recursive deductions. I'll send a PR to fix this later today. > InferFiltersFromConstraints stackoverflows for query (v2) > - > > Key: SPARK-20700 > URL: https://issues.apache.org/jira/browse/SPARK-20700 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Jiang Xingbo > > The following (complicated) query eventually fails with a stack overflow > during optimization: > {code} > CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, > int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES > ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', > TIMESTAMP('2015-01-14 00:00:00.0'), '947'), > ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', > TIMESTAMP('1999-08-15 00:00:00.0'), '437'), > ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', > TIMESTAMP('1991-05-23 00:00:00.0'), '630'), > ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), > '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'), > ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS > STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'), > ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', > CAST(NULL AS TIMESTAMP), '-740'), > ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL > AS TIMESTAMP), 
CAST(NULL AS STRING)), > ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', > CAST(NULL AS TIMESTAMP), '181'), > ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', > TIMESTAMP('2016-06-30 00:00:00.0'), '487'), > ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS > STRING), CAST(NULL AS TIMESTAMP), '-62'); > CREATE VIEW bools(a, b) as values (1, true), (1, true), (1, null); > SELECT > AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 PRECEDING ) AS > float_col, > COUNT(t1.smallint_col_2) AS int_col > FROM table_5 t1 > INNER JOIN ( > SELECT > (MIN(-83) OVER (PARTITION BY t2.a ORDER BY t2.a, (t1.int_col_4) * > (t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 15 FOLLOWING)) NOT IN (-222, 928) > AS boolean_col, > t2.a, > (t1.int_col_4) * (t1.int_col_4) AS int_col > FROM table_5 t1 > LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4) > WHERE > (t1.smallint_col_2) > (t1.smallint_col_2) > GROUP BY > t2.a, > (t1.int_col_4) * (t1.int_col_4) > HAVING > ((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * (t1.int_col_4), > SUM(t1.int_col_4)) > ) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = (t1.int_col_4))) AND > ((t2.a) = (t1.smallint_col_2)); > {code} > (I haven't tried to minimize this failing case yet). > Based on sampled jstacks from the driver, it looks like the query might be > repeatedly inferring filters from constraints and then pruning those filters. > Here's part of the stack at the point where it stackoverflows: > {code} > [... repeats ...] 
> at > org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$ca
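[Editor's sketch] The stack trace above shows `gatherCommutative` recursing once per level of expression nesting, so an optimizer loop that keeps re-inferring constraints over ever-deeper expressions eventually exhausts the stack. A minimal Python analogue (names and structure are illustrative, not Spark's actual code):

```python
import sys

def gather_commutative(expr):
    # Recursively flatten nested operands of the same commutative operator,
    # the way Canonicalize.gatherCommutative does. Recursion depth grows
    # with expression nesting depth.
    op, operands = expr
    out = []
    for o in operands:
        if isinstance(o, tuple) and o[0] == op:
            out.extend(gather_commutative(o))
        else:
            out.append(o)
    return out

# Build a left-deep chain of 10,000 nested '+' nodes, standing in for the
# constraints the infer/prune cycle keeps regenerating.
expr = ("+", [1, 2])
for i in range(10_000):
    expr = ("+", [expr, i])

try:
    gather_commutative(expr)
    overflowed = False
except RecursionError:
    overflowed = True
print(overflowed)  # True: depth exceeds the interpreter's recursion limit
```

The fix described in the comment is to resolve aliases wherever they occur in the plan so the constraint classes reach a fixpoint instead of deepening on every pass.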
[jira] [Resolved] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries
[ https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-20140. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 > Remove hardcoded kinesis retry wait and max retries > --- > > Key: SPARK-20140 > URL: https://issues.apache.org/jira/browse/SPARK-20140 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma >Assignee: Yash Sharma > Labels: kinesis, recovery > Fix For: 2.2.1, 2.3.0 > > > The pull requests proposes to remove the hardcoded values for Amazon Kinesis > - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES. > This change is critical for kinesis checkpoint recovery when the kinesis > backed rdd is huge. > Following happens in a typical kinesis recovery : > - kinesis throttles large number of requests while recovering > - retries in case of throttling are not able to recover due to the small wait > period > - kinesis throttles per second, the wait period should be configurable for > recovery > The patch picks the spark kinesis configs from: > - spark.streaming.kinesis.retry.wait.time > - spark.streaming.kinesis.retry.max.attempts -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries
[ https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013191#comment-16013191 ] Burak Yavuz commented on SPARK-20140: - resolved by https://github.com/apache/spark/pull/17467 > Remove hardcoded kinesis retry wait and max retries > --- > > Key: SPARK-20140 > URL: https://issues.apache.org/jira/browse/SPARK-20140 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma >Assignee: Yash Sharma > Labels: kinesis, recovery > > The pull requests proposes to remove the hardcoded values for Amazon Kinesis > - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES. > This change is critical for kinesis checkpoint recovery when the kinesis > backed rdd is huge. > Following happens in a typical kinesis recovery : > - kinesis throttles large number of requests while recovering > - retries in case of throttling are not able to recover due to the small wait > period > - kinesis throttles per second, the wait period should be configurable for > recovery > The patch picks the spark kinesis configs from: > - spark.streaming.kinesis.retry.wait.time > - spark.streaming.kinesis.retry.max.attempts -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries
[ https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-20140: --- Assignee: Yash Sharma > Remove hardcoded kinesis retry wait and max retries > --- > > Key: SPARK-20140 > URL: https://issues.apache.org/jira/browse/SPARK-20140 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma >Assignee: Yash Sharma > Labels: kinesis, recovery > > The pull requests proposes to remove the hardcoded values for Amazon Kinesis > - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES. > This change is critical for kinesis checkpoint recovery when the kinesis > backed rdd is huge. > Following happens in a typical kinesis recovery : > - kinesis throttles large number of requests while recovering > - retries in case of throttling are not able to recover due to the small wait > period > - kinesis throttles per second, the wait period should be configurable for > recovery > The patch picks the spark kinesis configs from: > - spark.streaming.kinesis.retry.wait.time > - spark.streaming.kinesis.retry.max.attempts -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit
[ https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-19372: Assignee: Kazuaki Ishizaki > Code generation for Filter predicate including many OR conditions exceeds JVM > method size limit > > > Key: SPARK-19372 > URL: https://issues.apache.org/jira/browse/SPARK-19372 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi >Assignee: Kazuaki Ishizaki > Fix For: 2.3.0 > > Attachments: wide400cols.csv > > > For the attached csv file, the code below causes the exception > "org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" > grows beyond 64 KB > Code: > {code:borderStyle=solid} > val conf = new SparkConf().setMaster("local[1]") > val sqlContext = > SparkSession.builder().config(conf).getOrCreate().sqlContext > val dataframe = > sqlContext > .read > .format("com.databricks.spark.csv") > .load("wide400cols.csv") > val filter = (0 to 399) > .foldLeft(lit(false))((e, index) => > e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}")) > val filtered = dataframe.filter(filter) > filtered.show(100) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit
[ https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19372. -- Resolution: Fixed Fix Version/s: 2.3.0 > Code generation for Filter predicate including many OR conditions exceeds JVM > method size limit > > > Key: SPARK-19372 > URL: https://issues.apache.org/jira/browse/SPARK-19372 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi > Fix For: 2.3.0 > > Attachments: wide400cols.csv > > > For the attached csv file, the code below causes the exception > "org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" > grows beyond 64 KB > Code: > {code:borderStyle=solid} > val conf = new SparkConf().setMaster("local[1]") > val sqlContext = > SparkSession.builder().config(conf).getOrCreate().sqlContext > val dataframe = > sqlContext > .read > .format("com.databricks.spark.csv") > .load("wide400cols.csv") > val filter = (0 to 399) > .foldLeft(lit(false))((e, index) => > e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}")) > val filtered = dataframe.filter(filter) > filtered.show(100) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
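[Editor's sketch] The `foldLeft` in the report builds one 400-way OR that codegen inlines into a single Java method, which is what overruns the JVM's 64 KB bytecode-per-method limit. A Python sketch of the shape of the problem and the usual mitigation (splitting the predicate into helper-method-sized chunks, as Spark's expression-splitting later does; the literal `'x'` stands in for the per-column comparison values):

```python
from functools import reduce

columns = [f"col{i}" for i in range(400)]

# Mirror the foldLeft from the report: start from `false` and OR in one
# inequality per column, yielding a single huge boolean expression.
predicate = reduce(lambda acc, c: f"({acc} OR {c} != 'x')", columns, "false")

# Generated code for this grows linearly with the column count; inlined into
# one method it eventually passes the JVM's 64 KB per-method limit. The fix
# is to emit one helper method per chunk of conditions and OR their results:
chunks = [columns[i:i + 50] for i in range(0, len(columns), 50)]
helper_bodies = [" || ".join(f"{c} != 'x'" for c in chunk) for chunk in chunks]
print(len(helper_bodies))  # 8 small methods instead of one giant one
```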
[jira] [Created] (SPARK-20775) from_json should also have an API where the schema is specified with a string
Burak Yavuz created SPARK-20775: --- Summary: from_json should also have an API where the schema is specified with a string Key: SPARK-20775 URL: https://issues.apache.org/jira/browse/SPARK-20775 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Burak Yavuz Right now you also have to provide a java.util.Map which is not nice for Scala users. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
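[Editor's sketch] The requested overload would accept a DDL-style schema string (e.g. `"a INT, b STRING"`) in place of a `java.util.Map`. A toy Python model of that API shape — the parser and `from_json` wrapper here are hypothetical, not Spark's implementation:

```python
import json

def parse_ddl(schema_str):
    # Toy DDL parser: "a INT, b STRING" -> {"a": "INT", "b": "STRING"}.
    fields = {}
    for part in schema_str.split(","):
        name, dtype = part.strip().split()
        fields[name] = dtype.upper()
    return fields

def from_json(value, schema):
    # Hypothetical overload: accept either a dict (the java.util.Map style)
    # or a DDL string, which is friendlier for Scala and SQL users.
    if isinstance(schema, str):
        schema = parse_ddl(schema)
    record = json.loads(value)
    return {k: record.get(k) for k in schema}

row = from_json('{"a": 1, "b": "x", "c": true}', "a INT, b STRING")
print(row)  # {'a': 1, 'b': 'x'}
```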
[jira] [Updated] (SPARK-20503) ML 2.2 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20503: -- Fix Version/s: 2.2.0 > ML 2.2 QA: API: Python API coverage > --- > > Key: SPARK-20503 > URL: https://issues.apache.org/jira/browse/SPARK-20503 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Blocker > Fix For: 2.2.0 > > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-20509. --- Resolution: Done Fix Version/s: 2.2.0 > SparkR 2.2 QA: New R APIs and API docs > -- > > Key: SPARK-20509 > URL: https://issues.apache.org/jira/browse/SPARK-20509 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > Fix For: 2.2.0 > > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20774) BroadcastExchangeExec doesn't cancel the Spark job if broadcasting a relation timeouts.
Shixiong Zhu created SPARK-20774: Summary: BroadcastExchangeExec doesn't cancel the Spark job if broadcasting a relation timeouts. Key: SPARK-20774 URL: https://issues.apache.org/jira/browse/SPARK-20774 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1, 2.1.0 Reporter: Shixiong Zhu When broadcasting a table takes too long and triggers timeout, the SQL query will fail. However, the background Spark job is still running and it wastes resources. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
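[Editor's sketch] The behavior the ticket asks for — actively cancelling the background job when the broadcast wait times out, instead of letting it keep consuming resources — can be modeled with a cancellation flag and a bounded wait (the timeout value is an analogue of `spark.sql.broadcastTimeout`; the worker below is a stand-in, not Spark code):

```python
import threading
import concurrent.futures

cancel = threading.Event()

def build_broadcast():
    # Stand-in for the background Spark job that collects the relation;
    # it cooperatively checks the cancellation flag.
    while not cancel.is_set():
        cancel.wait(0.01)
    return None

with concurrent.futures.ThreadPoolExecutor() as pool:
    fut = pool.submit(build_broadcast)
    try:
        fut.result(timeout=0.05)  # bounded wait for the broadcast to build
    except concurrent.futures.TimeoutError:
        # The fix: on timeout, tell the background work to stop rather than
        # failing the query while the job runs on.
        cancel.set()

print(cancel.is_set())  # True: the worker was told to stop
```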
[jira] [Commented] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013060#comment-16013060 ] Joseph K. Bradley commented on SPARK-20509: --- I checked the new and changed APIs, comparing the Scala and R docs. I did not see any issues, so I'll mark this complete. But [~felixcheung] and others, please say if you find any during your checks. Thanks! > SparkR 2.2 QA: New R APIs and API docs > -- > > Key: SPARK-20509 > URL: https://issues.apache.org/jira/browse/SPARK-20509 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > Fix For: 2.2.0 > > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14584) Improve recognition of non-nullability in Dataset transformations
[ https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013001#comment-16013001 ] Takeshi Yamamuro commented on SPARK-14584: -- Could we close this as resolved? It seems the merged pr in SPARK-18284 [~hyukjin.kwon] suggested includes test cases this ticket describes. > Improve recognition of non-nullability in Dataset transformations > - > > Key: SPARK-14584 > URL: https://issues.apache.org/jira/browse/SPARK-14584 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen > > There are many cases where we can statically know that a field will never be > null. For instance, a field in a case class with a primitive type will never > return null. However, there are currently several cases in the Dataset API > where we do not properly recognize this non-nullability. For instance: > {code} > case class MyCaseClass(foo: Int) > sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema > {code} > claims that the {{foo}} field is nullable even though this is impossible. > I believe that this is due to the way that we reason about nullability when > constructing serializer expressions in ExpressionEncoders. 
The following > assertion will currently fail if added to ExpressionEncoder: > {code} > require(schema.size == serializer.size) > schema.fields.zip(serializer).foreach { case (field, fieldSerializer) => > require(field.dataType == fieldSerializer.dataType, s"Field > ${field.name}'s data type is " + > s"${field.dataType} in the schema but ${fieldSerializer.dataType} in > its serializer") > require(field.nullable == fieldSerializer.nullable, s"Field > ${field.name}'s nullability is " + > s"${field.nullable} in the schema but ${fieldSerializer.nullable} in > its serializer") > } > {code} > Most often, the schema claims that a field is non-nullable while the encoder > allows for nullability, but occasionally we see a mismatch in the datatypes > due to disagreements over the nullability of nested structs' fields (or > fields of structs in arrays). > I think the problem is that when we're reasoning about nullability in a > struct's schema we consider its fields' nullability to be independent of the > nullability of the struct itself, whereas in the serializer expressions we > are considering those field extraction expressions to be nullable if the > input objects themselves can be nullable. > I'm not sure what's the simplest way to fix this. One proposal would be to > leave the serializers unchanged and have ObjectOperator derive its output > attributes from an explicitly-passed schema rather than using the > serializers' attributes. However, I worry that this might introduce bugs in > case the serializer and schema disagree. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
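[Editor's sketch] The quoted `require` block compares each schema field against its serializer expression. A minimal Python rendering of that consistency check, seeded with the ticket's `MyCaseClass(foo: Int)` example (tuples and labels are illustrative, not Spark's types):

```python
# (name, dataType, nullable) triples: what the schema says vs. what the
# encoder's serializer expression claims.
schema = [("foo", "int", False)]     # primitive Int field: can never be null
serializer = [("foo", "int", True)]  # serializer conservatively marks it nullable

mismatches = []
for (name, dtype, nullable), (_, s_dtype, s_nullable) in zip(schema, serializer):
    if dtype != s_dtype:
        mismatches.append(f"{name}: dataType {dtype} vs {s_dtype}")
    if nullable != s_nullable:
        mismatches.append(f"{name}: nullable {nullable} vs {s_nullable}")

print(mismatches)  # the exact disagreement the ticket describes
```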
[jira] [Assigned] (SPARK-20773) ParquetWriteSupport.writeFields is quadratic in number of fields
[ https://issues.apache.org/jira/browse/SPARK-20773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20773: Assignee: (was: Apache Spark) > ParquetWriteSupport.writeFields is quadratic in number of fields > > > Key: SPARK-20773 > URL: https://issues.apache.org/jira/browse/SPARK-20773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: T Poterba >Priority: Minor > Labels: easyfix, performance > Original Estimate: 10m > Remaining Estimate: 10m > > The writeFields method in ParquetWriteSupport uses Seq.apply(i) to select all > elements. Since the fieldWriters object is a List, this is a quadratic > operation. > See line 123: > https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20773) ParquetWriteSupport.writeFields is quadratic in number of fields
[ https://issues.apache.org/jira/browse/SPARK-20773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20773: Assignee: Apache Spark > ParquetWriteSupport.writeFields is quadratic in number of fields > > > Key: SPARK-20773 > URL: https://issues.apache.org/jira/browse/SPARK-20773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: T Poterba >Assignee: Apache Spark >Priority: Minor > Labels: easyfix, performance > Original Estimate: 10m > Remaining Estimate: 10m > > The writeFields method in ParquetWriteSupport uses Seq.apply(i) to select all > elements. Since the fieldWriters object is a List, this is a quadratic > operation. > See line 123: > https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20773) ParquetWriteSupport.writeFields is quadratic in number of fields
[ https://issues.apache.org/jira/browse/SPARK-20773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012992#comment-16012992 ] Apache Spark commented on SPARK-20773: -- User 'tpoterba' has created a pull request for this issue: https://github.com/apache/spark/pull/18005 > ParquetWriteSupport.writeFields is quadratic in number of fields > > > Key: SPARK-20773 > URL: https://issues.apache.org/jira/browse/SPARK-20773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: T Poterba >Priority: Minor > Labels: easyfix, performance > Original Estimate: 10m > Remaining Estimate: 10m > > The writeFields method in ParquetWriteSupport uses Seq.apply(i) to select all > elements. Since the fieldWriters object is a List, this is a quadratic > operation. > See line 123: > https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20773) ParquetWriteSupport.writeFields is quadratic in number of fields
[ https://issues.apache.org/jira/browse/SPARK-20773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] T Poterba updated SPARK-20773: -- Summary: ParquetWriteSupport.writeFields is quadratic in number of fields (was: ParquetWriteSupport.writeFields has is quadratic in number of fields) > ParquetWriteSupport.writeFields is quadratic in number of fields > > > Key: SPARK-20773 > URL: https://issues.apache.org/jira/browse/SPARK-20773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: T Poterba >Priority: Minor > Labels: easyfix, performance > Original Estimate: 10m > Remaining Estimate: 10m > > The writeFields method in ParquetWriteSupport uses Seq.apply(i) to select all > elements. Since the fieldWriters object is a List, this is a quadratic > operation. > See line 123: > https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
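[Editor's sketch] The bug is the classic linked-list pitfall: `List.apply(i)` walks `i` links, so indexing every element inside a loop costs O(n²). A small Python model with a cons list (Python's built-in list has O(1) indexing, so the linked structure is built explicitly):

```python
class Cons:
    # Minimal singly linked list, like Scala's immutable List.
    def __init__(self, head, tail=None):
        self.head, self.tail = head, tail

    def apply(self, i):
        # Positional access walks i links: O(i) per call, so calling it for
        # every index in a loop is O(n^2) overall -- the writeFields bug.
        node = self
        for _ in range(i):
            node = node.tail
        return node.head

def build(n):
    lst = None
    for v in range(n - 1, -1, -1):
        lst = Cons(v, lst)
    return lst

lst = build(5)

# Quadratic pattern (indexing in a loop):
by_index = [lst.apply(i) for i in range(5)]

# Linear fix: traverse the list once instead of re-indexing from the head.
by_iter, node = [], lst
while node is not None:
    by_iter.append(node.head)
    node = node.tail

print(by_index == by_iter == [0, 1, 2, 3, 4])  # True, but at very different cost
```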
[jira] [Commented] (SPARK-20746) Built-in SQL Function Improvement
[ https://issues.apache.org/jira/browse/SPARK-20746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012982#comment-16012982 ] Takeshi Yamamuro commented on SPARK-20746: -- Could I take on some of them? [~smilegator] > Built-in SQL Function Improvement > - > > Key: SPARK-20746 > URL: https://issues.apache.org/jira/browse/SPARK-20746 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > SQL functions are part of the core of the ISO/ANSI standards. This umbrella > JIRA is trying to list all the ISO/ANS SQL functions that are not fully > implemented by Spark SQL, fix the documentation and test case issues in the > supported functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20773) ParquetWriteSupport.writeFields has is quadratic in number of fields
T Poterba created SPARK-20773: - Summary: ParquetWriteSupport.writeFields has is quadratic in number of fields Key: SPARK-20773 URL: https://issues.apache.org/jira/browse/SPARK-20773 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Reporter: T Poterba Priority: Minor The writeFields method in ParquetWriteSupport uses Seq.apply(i) to select all elements. Since the fieldWriters object is a List, this is a quadratic operation. See line 123: https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20509: - Assignee: Joseph K. Bradley > SparkR 2.2 QA: New R APIs and API docs > -- > > Key: SPARK-20509 > URL: https://issues.apache.org/jira/browse/SPARK-20509 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012913#comment-16012913 ] Joseph K. Bradley commented on SPARK-20509: --- I'll take this one > SparkR 2.2 QA: New R APIs and API docs > -- > > Key: SPARK-20509 > URL: https://issues.apache.org/jira/browse/SPARK-20509 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20772) Add support for query parameters in redirects on Yarn
Bjorn Jonsson created SPARK-20772: - Summary: Add support for query parameters in redirects on Yarn Key: SPARK-20772 URL: https://issues.apache.org/jira/browse/SPARK-20772 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.1.0 Reporter: Bjorn Jonsson Priority: Minor Spark uses rewrites of query parameters to paths (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). This works fine in local or standalone mode, but does not work on Yarn (with the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where the query parameter is dropped. The repro steps are: - Start up the spark-shell in yarn client or cluster mode and run a job - Try to access the job details through http://:4040/jobs/job?id=0 - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) Going directly through the RM proxy works (does not cause query parameters to be dropped). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
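[Editor's sketch] The requested behavior is a trailing-slash redirect that carries the query string along instead of dropping it, as the AmIpFilter path does. A standalone Python illustration of the rewrite (the `host:4040` URL is a placeholder for the elided host in the report):

```python
from urllib.parse import urlsplit, urlunsplit

def redirect_with_query(url):
    # Rewrite /jobs/job -> /jobs/job/ while preserving the query string.
    # The reported bug: the proxied redirect dropped `?id=0` entirely,
    # producing "requirement failed: missing id parameter".
    parts = urlsplit(url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

print(redirect_with_query("http://host:4040/jobs/job?id=0"))
# http://host:4040/jobs/job/?id=0
```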
[jira] [Updated] (SPARK-20771) Usability issues with weekofyear()
[ https://issues.apache.org/jira/browse/SPARK-20771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-20771: -- Description: The weekofyear() implementation follows HIVE / ISO 8601 week number. However it is not useful because it doesn't return the year of the week start. For example, weekofyear("2017-01-01") returns 52 Anyone using this with groupBy('week) might do the aggregation or ordering wrong. A better implementation should return the year number of the week as well. MySQL's yearweek() is much better in this sense: https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek. Maybe we should implement that in Spark. was: The weekofyear() implementation follows HIVE / ISO 8601 week number. However it is not useful because it doesn't return the year of the week start. For example, weekofyear("2017-01-01") returns 52 Anyone using this with groupBy('week) might do the aggregation wrong. A better implementation should return the year number of the week as well. MySQL's yearweek() is much better in this sense: https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek. Maybe we should implement that in Spark. > Usability issues with weekofyear() > -- > > Key: SPARK-20771 > URL: https://issues.apache.org/jira/browse/SPARK-20771 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiangrui Meng >Priority: Minor > > The weekofyear() implementation follows HIVE / ISO 8601 week number. However > it is not useful because it doesn't return the year of the week start. For > example, > weekofyear("2017-01-01") returns 52 > Anyone using this with groupBy('week) might do the aggregation or ordering > wrong. A better implementation should return the year number of the week as > well. > MySQL's yearweek() is much better in this sense: > https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek. 
> Maybe we should implement that in Spark. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20771) Usability issues with weekofyear()
Xiangrui Meng created SPARK-20771: - Summary: Usability issues with weekofyear() Key: SPARK-20771 URL: https://issues.apache.org/jira/browse/SPARK-20771 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Xiangrui Meng Priority: Minor The weekofyear() implementation follows HIVE / ISO 8601 week number. However it is not useful because it doesn't return the year of the week start. For example, weekofyear("2017-01-01") returns 52 Anyone using this with groupBy('week) might do the aggregation wrong. A better implementation should return the year number of the week as well. MySQL's yearweek() is much better in this sense: https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek. Maybe we should implement that in Spark. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
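[Editor's sketch] The surprising value is easy to reproduce with ISO week numbering: 2017-01-01 is a Sunday and belongs to week 52 of 2016, so a bare week number is ambiguous across year boundaries. The `yearweek`-style helper below uses ISO weeks (this matches MySQL's mode-3 behavior; MySQL's default mode numbers weeks differently):

```python
from datetime import date

# ISO 8601 assigns 2017-01-01 to week 52 *of 2016* -- the week number alone,
# as weekofyear() returns it, silently groups this day with the wrong year.
iso_year, iso_week, _ = date(2017, 1, 1).isocalendar()
print(iso_week)  # 52
print(iso_year)  # 2016

def yearweek(d):
    # Encode the ISO year and week together so groupBy/orderBy are unambiguous.
    y, w, _ = d.isocalendar()
    return y * 100 + w

print(yearweek(date(2017, 1, 1)))  # 201652
```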
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012818#comment-16012818 ] Brian Zhang commented on SPARK-15616: - Hello, Just wondering what's the current status of this issue? I think this fix would be really helpful. Thanks! > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
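[Editor's sketch] The proposed fallback is: when the metastore has no statistics, size the relation by summing the on-HDFS bytes of only the partitions that survive filter pruning, then compare against the broadcast threshold. A toy model of that decision (the function and threshold name are illustrative; the threshold is an analogue of `spark.sql.autoBroadcastJoinThreshold`):

```python
def effective_size(table_stats_bytes, pruned_partition_file_sizes):
    # Prefer metastore statistics when present; otherwise fall back to the
    # HDFS sizes of just the partitions the query actually touches.
    if table_stats_bytes is not None:
        return table_stats_bytes
    return sum(pruned_partition_file_sizes)

BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, the common default

# No stats, but the filter pruned the table down to two small partitions:
size = effective_size(None, [2 * 1024 * 1024, 3 * 1024 * 1024])
print(size <= BROADCAST_THRESHOLD)  # True: a broadcast join becomes possible
```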
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012813#comment-16012813 ] Apache Spark commented on SPARK-18838: -- User 'bOOm-X' has created a pull request for this issue: https://github.com/apache/spark/pull/18004 > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Currently we are observing very high event processing delay in the > driver's `ListenerBus` for large jobs with many tasks. Many critical > components of the scheduler, like `ExecutorAllocationManager` and > `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might > hurt job performance significantly or even fail the job. For example, a > significant delay in receiving `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the listeners for each event and processes each event > synchronously: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single-threaded processor often becomes the bottleneck for large jobs. > Also, if one of the listeners is very slow, all the listeners will pay the > price of the delay incurred by the slow listener. In addition, a slow > listener can cause events to be dropped from the event queue, which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single-threaded event processor. Instead, each listener will have its own > dedicated single-threaded executor service. Whenever an event is posted, it > will be submitted to the executor service of every listener.
The single-threaded > executor service will guarantee in-order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow memory indefinitely. The downside of this > approach is that a separate event queue per listener will increase the driver memory > footprint.
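The proposal above (one bounded queue drained by one dedicated thread per listener) can be sketched with the Python standard library; the class name, queue size, and drop-on-full policy here are illustrative, not Spark's actual implementation:

```python
import queue
import threading

class PerListenerBus:
    """Each listener gets its own bounded queue and drain thread, so a slow
    listener only delays itself; per-listener ordering is preserved."""

    def __init__(self, listeners, max_queue_size=1000):
        self._queues = []
        for listener in listeners:
            q = queue.Queue(maxsize=max_queue_size)  # bounded: memory cannot grow indefinitely
            threading.Thread(target=self._drain, args=(q, listener), daemon=True).start()
            self._queues.append(q)

    def post(self, event):
        for q in self._queues:
            try:
                q.put_nowait(event)  # only the lagging listener loses events
            except queue.Full:
                pass  # a real bus would count/log the drop

    @staticmethod
    def _drain(q, listener):
        while True:
            listener(q.get())  # events are processed in posting order per listener

# usage: a trivial listener that records events
received = []
bus = PerListenerBus([received.append])
bus.post("SparkListenerTaskStart")
```

Note the trade-off stated in the proposal: N listeners mean N queues, so driver memory grows with the listener count.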
[jira] [Resolved] (SPARK-20529) Worker should not use the received Master address
[ https://issues.apache.org/jira/browse/SPARK-20529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20529. -- Resolution: Fixed Fix Version/s: 2.2.0 > Worker should not use the received Master address > - > > Key: SPARK-20529 > URL: https://issues.apache.org/jira/browse/SPARK-20529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.2.0 > > > Right now, when a worker connects to the master, the master sends its address to the > worker. The worker then saves this address and uses it to reconnect in case of > failure. > However, sometimes this address is not correct. If there is a proxy between > the master and the worker, the address the master sent is not the address of the proxy.
[jira] [Commented] (SPARK-20503) ML 2.2 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012774#comment-16012774 ] Joseph K. Bradley commented on SPARK-20503: --- Thanks a lot! > ML 2.2 QA: API: Python API coverage > --- > > Key: SPARK-20503 > URL: https://issues.apache.org/jira/browse/SPARK-20503 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20770) Improve ColumnStats
[ https://issues.apache.org/jira/browse/SPARK-20770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20770: Assignee: Apache Spark > Improve ColumnStats > --- > > Key: SPARK-20770 > URL: https://issues.apache.org/jira/browse/SPARK-20770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > We improve the implementation of {{ColumnStats}} by using the following > approaches. > 1. Declare subclasses of {{ColumnStats}} as {{final}} > 2. Remove unnecessary call of {{row.isNullAt(ordinal)}} > 3. Remove the dependency on {{GenericInternalRow}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20770) Improve ColumnStats
[ https://issues.apache.org/jira/browse/SPARK-20770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012636#comment-16012636 ] Apache Spark commented on SPARK-20770: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/18002 > Improve ColumnStats > --- > > Key: SPARK-20770 > URL: https://issues.apache.org/jira/browse/SPARK-20770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > We improve the implementation of {{ColumnStats}} by using the following > approaches. > 1. Declare subclasses of {{ColumnStats}} as {{final}} > 2. Remove unnecessary call of {{row.isNullAt(ordinal)}} > 3. Remove the dependency on {{GenericInternalRow}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20770) Improve ColumnStats
[ https://issues.apache.org/jira/browse/SPARK-20770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20770: Assignee: (was: Apache Spark) > Improve ColumnStats > --- > > Key: SPARK-20770 > URL: https://issues.apache.org/jira/browse/SPARK-20770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > We improve the implementation of {{ColumnStats}} by using the following > approaches. > 1. Declare subclasses of {{ColumnStats}} as {{final}} > 2. Remove unnecessary call of {{row.isNullAt(ordinal)}} > 3. Remove the dependency on {{GenericInternalRow}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9215) Implement WAL-free Kinesis receiver that give at-least once guarantee
[ https://issues.apache.org/jira/browse/SPARK-9215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012591#comment-16012591 ] Richard Moorhead commented on SPARK-9215: - WAL is not necessary for fault-tolerant Kinesis streaming? Would checkpointing be enabled on the StreamingContext with writeAheadLog disabled then? > Implement WAL-free Kinesis receiver that give at-least once guarantee > - > > Key: SPARK-9215 > URL: https://issues.apache.org/jira/browse/SPARK-9215 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 1.4.1 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 1.5.0 > > > Currently, the KinesisReceiver can lose some data in the case of certain > failures (receiver and driver failures). Using the write ahead logs can > mitigate some of the problem, but it is not ideal because WALs don't work with > S3 (eventual consistency, etc.), which is the most likely file system to be > used in the EC2 environment. Hence, we have to take a different approach to > improving reliability for Kinesis. > Detailed design doc - > https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit?usp=sharing
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012542#comment-16012542 ] Zoltan Ivanfi edited comment on SPARK-12297 at 5/16/17 3:16 PM: What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table with some data and run a select in SparkSQL, then change the local timezone and run the same select again (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. was (Author: zi): What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table with some data and run a select in SparkSQL, then change the local timezone and run the same select again (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. I may have been thinking about ORC. > Add work-around for Parquet/Hive int96 timestamp bug. 
> - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from 
la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multi
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012542#comment-16012542 ] Zoltan Ivanfi edited comment on SPARK-12297 at 5/16/17 3:11 PM: What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table with some data and run a select in SparkSQL, then change the local timezone and run the same select again (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. I may have been thinking about ORC. was (Author: zi): What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table and insert a timestamp into it in SparkSQL, then change the local timezone and read the value back (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. I may have been thinking about ORC. > Add work-around for Parquet/Hive int96 timestamp bug. 
> - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from 
la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012542#comment-16012542 ] Zoltan Ivanfi commented on SPARK-12297: --- What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table and insert a timestamp into it in SparkSQL, then change the local timezone and read the value back (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. I may have been thinking about ORC. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. 
Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same 
times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, it's a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala.
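The session-timezone dependence described above can be reproduced with Python's standard library: an instant pinned to the UTC timeline (how Spark treats Parquet int96 values) renders differently in each zone, while a "floating" wall-clock value does not. The timestamp below mirrors the first row of the example; this is an illustration of the two semantics, not Spark code.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

la = ZoneInfo("America/Los_Angeles")
ny = ZoneInfo("America/New_York")

# Instant semantics (Parquet int96): the stored value is a point on the UTC
# timeline, so the rendered wall-clock time shifts with the reader's zone.
written = datetime(2015, 12, 31, 23, 50, 59, 123000, tzinfo=la)
read_in_ny = written.astimezone(ny)
print(read_in_ny.strftime("%Y-%m-%d %H:%M:%S.%f"))  # 2016-01-01 02:50:59.123000

# Floating semantics (the textfile/JSON tables in the example): the stored
# value is just the digits, so every reader sees the same wall-clock time.
floating = "2015-12-31 23:50:59.123"
print(floating)
```

The three-hour shift (PST is UTC-8, EST is UTC-5) is exactly the difference between the `la_parquet` and `la_textfile` results above, and it is why the join between the two produces no rows.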
[jira] [Created] (SPARK-20770) Improve ColumnStats
Kazuaki Ishizaki created SPARK-20770: Summary: Improve ColumnStats Key: SPARK-20770 URL: https://issues.apache.org/jira/browse/SPARK-20770 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki We improve the implementation of {{ColumnStats}} by using the following approaches. 1. Declare subclasses of {{ColumnStats}} as {{final}} 2. Remove unnecessary call of {{row.isNullAt(ordinal)}} 3. Remove the dependency on {{GenericInternalRow}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20769) Incorrect documentation for using Jupyter notebook
[ https://issues.apache.org/jira/browse/SPARK-20769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20769: Assignee: Apache Spark > Incorrect documentation for using Jupyter notebook > -- > > Key: SPARK-20769 > URL: https://issues.apache.org/jira/browse/SPARK-20769 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Andrew Ray >Assignee: Apache Spark >Priority: Minor > > SPARK-13973 incorrectly removed the required > PYSPARK_DRIVER_PYTHON_OPTS="notebook" from documentation to use pyspark with > Jupyter notebook -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20769) Incorrect documentation for using Jupyter notebook
[ https://issues.apache.org/jira/browse/SPARK-20769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20769: Assignee: (was: Apache Spark) > Incorrect documentation for using Jupyter notebook > -- > > Key: SPARK-20769 > URL: https://issues.apache.org/jira/browse/SPARK-20769 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Andrew Ray >Priority: Minor > > SPARK-13973 incorrectly removed the required > PYSPARK_DRIVER_PYTHON_OPTS="notebook" from documentation to use pyspark with > Jupyter notebook -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20769) Incorrect documentation for using Jupyter notebook
[ https://issues.apache.org/jira/browse/SPARK-20769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012527#comment-16012527 ] Apache Spark commented on SPARK-20769: -- User 'aray' has created a pull request for this issue: https://github.com/apache/spark/pull/18001 > Incorrect documentation for using Jupyter notebook > -- > > Key: SPARK-20769 > URL: https://issues.apache.org/jira/browse/SPARK-20769 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Andrew Ray >Priority: Minor > > SPARK-13973 incorrectly removed the required > PYSPARK_DRIVER_PYTHON_OPTS="notebook" from documentation to use pyspark with > Jupyter notebook -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called
[ https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-14098: - Description: [Here|https://docs.google.com/document/d/1-2BnW5ibuHIeQzmHEGIGkEcuMUCTk87pmPis2DKRg-Q/edit?usp=sharing] is a design document for this change (***TODO: Update the document***). This JIRA implements a new in-memory cache feature used by DataFrame.cache and Dataset.cache. The following is the basic design based on discussions with Sameer, Weichen, Xiao, Herman, and Nong. * Use ColumnarBatch with ColumnVector, which are common data representations for columnar storage * Use multiple compression schemes (such as RLE, int-delta, and so on) for each ColumnVector in ColumnarBatch, depending on its data type * Generate code that is simple and specialized for each in-memory cache to build an in-memory cache * Generate code that directly reads data from ColumnVector for the in-memory cache by whole-stage codegen * Enhance ColumnVector to keep UnsafeArrayData * Use a primitive-type array for primitive uncompressed data types in ColumnVector * Use byte[] for UnsafeArrayData and compressed data Based on this design, this JIRA generates two kinds of Java code for DataFrame.cache()/Dataset.cache(): * Generate Java code to build CachedColumnarBatch, which keeps data in ColumnarBatch * Generate Java code to get a value of each column from ColumnarBatch ** a Get a value directly from ColumnarBatch in code generated by whole-stage codegen (primary path) ** b Get a value through an iterator if whole-stage codegen is disabled (e.g. # of columns is more than 100, as backup path) was: When DataFrame.cache() is called, data is stored as column-oriented storage in CachedBatch. The current Catalyst generates a Java program to get a value of a column from an InternalRow that is translated from CachedBatch. This issue generates Java code to get a value of a column from CachedBatch.
While a column for a cache may be compressed, this issue handles float and double types, which are never compressed. Other primitive types, whose columns may be compressed, will be addressed in another entry. > Generate Java code to build CachedColumnarBatch and get values from > CachedColumnarBatch when DataFrame.cache() is called > > > Key: SPARK-14098 > URL: https://issues.apache.org/jira/browse/SPARK-14098 > Project: Spark > Issue Type: Umbrella > Components: SQL >Reporter: Kazuaki Ishizaki > > [Here|https://docs.google.com/document/d/1-2BnW5ibuHIeQzmHEGIGkEcuMUCTk87pmPis2DKRg-Q/edit?usp=sharing] > is a design document for this change (***TODO: Update the document***). > This JIRA implements a new in-memory cache feature used by DataFrame.cache > and Dataset.cache. The following is the basic design based on discussions with > Sameer, Weichen, Xiao, Herman, and Nong. > * Use ColumnarBatch with ColumnVector, which are common data representations > for columnar storage > * Use multiple compression schemes (such as RLE, int-delta, and so on) for each > ColumnVector in ColumnarBatch, depending on its data type > * Generate code that is simple and specialized for each in-memory cache to > build an in-memory cache > * Generate code that directly reads data from ColumnVector for the in-memory > cache by whole-stage codegen > * Enhance ColumnVector to keep UnsafeArrayData > * Use a primitive-type array for primitive uncompressed data types in > ColumnVector > * Use byte[] for UnsafeArrayData and compressed data > Based on this design, this JIRA generates two kinds of Java code for > DataFrame.cache()/Dataset.cache(): > * Generate Java code to build CachedColumnarBatch, which keeps data in > ColumnarBatch > * Generate Java code to get a value of each column from ColumnarBatch > ** a Get a value directly from ColumnarBatch in code generated by whole-stage > codegen (primary path) > ** b Get a value through an iterator if whole-stage codegen is disabled (e.g.
# > of columns is more than 100, as backup path) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
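One of the per-column compression schemes named in the design above, run-length encoding, can be sketched as follows (a simplified stand-in for the columnar cache internals, not Spark's actual code):

```python
def rle_encode(values):
    """Run-length encode a column as [(run_length, value), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, v])       # start a new run
    return [(n, v) for n, v in runs]

def rle_decode(runs):
    """Expand the runs back into the original column values."""
    return [v for n, v in runs for _ in range(n)]

# Columns with long runs of repeated values (common for low-cardinality
# columns) compress well; the decode path restores them losslessly.
column = [0, 0, 0, 7, 7, 1]
encoded = rle_encode(column)
print(encoded)  # [(3, 0), (2, 7), (1, 1)]
assert rle_decode(encoded) == column
```

The design's point about picking a scheme per data type follows from this: RLE pays off only when values repeat, which is why float and double columns are left uncompressed in the first cut.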
[jira] [Created] (SPARK-20769) Incorrect documentation for using Jupyter notebook
Andrew Ray created SPARK-20769: -- Summary: Incorrect documentation for using Jupyter notebook Key: SPARK-20769 URL: https://issues.apache.org/jira/browse/SPARK-20769 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.1.1 Reporter: Andrew Ray Priority: Minor SPARK-13973 incorrectly removed the required PYSPARK_DRIVER_PYTHON_OPTS="notebook" from documentation to use pyspark with Jupyter notebook -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
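The documented way to launch pyspark inside a Jupyter notebook needs both variables; the point of the report is that dropping the `OPTS` setting breaks it, because `jupyter` requires the "notebook" subcommand:

```shell
# Both variables are required: PYSPARK_DRIVER_PYTHON picks the driver
# binary, and PYSPARK_DRIVER_PYTHON_OPTS supplies the "notebook" subcommand.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
echo "pyspark driver command: $PYSPARK_DRIVER_PYTHON $PYSPARK_DRIVER_PYTHON_OPTS"
# then start as usual: ./bin/pyspark
```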
[jira] [Updated] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called
[ https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuaki Ishizaki updated SPARK-14098:
-
Issue Type: Umbrella (was: Improvement)

> Generate Java code to build CachedColumnarBatch and get values from
> CachedColumnarBatch when DataFrame.cache() is called
>
> Key: SPARK-14098
> URL: https://issues.apache.org/jira/browse/SPARK-14098
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Reporter: Kazuaki Ishizaki
>
> When DataFrame.cache() is called, data is stored as column-oriented storage
> in CachedBatch. The current Catalyst generates a Java program to get a value
> of a column from an InternalRow that is translated from CachedBatch. This
> issue generates Java code to get a value of a column from CachedBatch
> directly. While a column for a cache may be compressed, this issue handles
> float and double types, which are never compressed. Other primitive types,
> whose columns may be compressed, will be addressed in another entry.
[jira] [Commented] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called
[ https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012488#comment-16012488 ]

Kazuaki Ishizaki commented on SPARK-14098:
--
[~lins05] Sorry, I overlooked this message. I synced the titles, at least; I will update the description later.

> Generate Java code to build CachedColumnarBatch and get values from
> CachedColumnarBatch when DataFrame.cache() is called
[jira] [Updated] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called
[ https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuaki Ishizaki updated SPARK-14098:
-
Summary: Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called (was: Generate code that get a float/double value in each column from CachedBatch when DataFrame.cache() is called)

> Generate Java code to build CachedColumnarBatch and get values from
> CachedColumnarBatch when DataFrame.cache() is called
[jira] [Commented] (SPARK-20748) Built-in SQL Function Support - CH[A]R
[ https://issues.apache.org/jira/browse/SPARK-20748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012487#comment-16012487 ]

Yuming Wang commented on SPARK-20748:
-
I am working on this.

> Built-in SQL Function Support - CH[A]R
> --
>
> Key: SPARK-20748
> URL: https://issues.apache.org/jira/browse/SPARK-20748
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Xiao Li
> Labels: starter
>
> {noformat}
> CH[A]R()
> {noformat}
> Returns a character when given its ASCII code.
> Ref: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions019.htm
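The requested semantics, "returns a character when given its ASCII code", can be sketched as follows. This is an illustrative assumption, not the final Spark implementation: some SQL engines (e.g. Hive's chr) reduce the code modulo 256 and return an empty string for negative input, while Oracle's CHR has more elaborate character-set behavior.

```python
# Sketch of one common CHR semantics (reduce the code modulo 256, empty
# string for negative input). Assumed behavior, not Spark's final one.

def sql_chr(n: int) -> str:
    """Return the character for ASCII code n, wrapping n into 0..255."""
    if n < 0:
        return ""  # assumed behavior for negative input
    return chr(n % 256)

print(sql_chr(65))   # 'A'
print(sql_chr(321))  # 321 % 256 == 65 -> 'A'
```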
[jira] [Commented] (SPARK-20765) Cannot load persisted PySpark ML Pipeline that includes 3rd party stage (Transformer or Estimator) if the package name of stage is not "org.apache.spark" and "pyspark"
[ https://issues.apache.org/jira/browse/SPARK-20765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012465#comment-16012465 ]

APeng Zhang commented on SPARK-20765:
-
Yes, the class is on the classpath. The problem is that the current implementation cannot map my Scala class name (com.abc.xyz.ml.SomeClass) to the Python class name (xyz.ml.SomeClass).

> Cannot load persisted PySpark ML Pipeline that includes 3rd party stage
> (Transformer or Estimator) if the package name of stage is not
> "org.apache.spark" and "pyspark"
> ---
>
> Key: SPARK-20765
> URL: https://issues.apache.org/jira/browse/SPARK-20765
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0, 2.1.0, 2.2.0
> Reporter: APeng Zhang
>
> When loading a persisted PySpark ML Pipeline instance, Pipeline._from_java()
> will invoke JavaParams._from_java() to create a Python instance of each
> persisted stage. In JavaParams._from_java(), the name of the Python class is
> derived from the Java class name by replacing the string "org.apache.spark"
> with "pyspark". This is fine for Transformers and Estimators inside PySpark,
> but for 3rd party Transformers and Estimators whose package name is not
> under org.apache.spark or pyspark, there will be an error:
>
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/util.py", line 228, in load
>     return cls.read().load(path)
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/util.py", line 180, in load
>     return self._clazz._from_java(java_obj)
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/pipeline.py", line 160, in _from_java
>     py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/wrapper.py", line 169, in _from_java
>     py_type = __get_class(stage_name)
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/wrapper.py", line 163, in __get_class
>     m = __import__(module)
> ImportError: No module named com.abc.xyz.ml.testclass
>
> Related code in PySpark, in pyspark/ml/pipeline.py:
>
> class Pipeline(Estimator, MLReadable, MLWritable):
>     @classmethod
>     def _from_java(cls, java_stage):
>         # Create a new instance of this stage.
>         py_stage = cls()
>         # Load information from java_stage to the instance.
>         py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
>
> and in pyspark/ml/wrapper.py:
>
> class JavaParams(JavaWrapper, Params):
>     @staticmethod
>     def _from_java(java_stage):
>         def __get_class(clazz):
>             """
>             Loads Python class from its name.
>             """
>             parts = clazz.split('.')
>             module = ".".join(parts[:-1])
>             m = __import__(module)
>             for comp in parts[1:]:
>                 m = getattr(m, comp)
>             return m
>         stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
>         # Generate a default new instance from the stage_name class.
>         py_type = __get_class(stage_name)
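The failure mode quoted above comes down to a single string replace. The snippet below reproduces just that mapping step in isolation (the class names are illustrative, matching the ones in the report):

```python
# Demonstrates the mapping quoted from pyspark/ml/wrapper.py: the Python
# class name is derived by a plain string replace, which only works for
# classes under org.apache.spark. Third-party packages pass through
# unchanged, so the subsequent __import__ raises ImportError.

def java_to_python_name(java_class: str) -> str:
    # Mirrors the one-line mapping in JavaParams._from_java.
    return java_class.replace("org.apache.spark", "pyspark")

# Built-in stage: maps to a real PySpark module.
print(java_to_python_name("org.apache.spark.ml.feature.Binarizer"))
# -> pyspark.ml.feature.Binarizer

# Third-party stage: name is unchanged, so __import__("com.abc.xyz.ml")
# later fails with "No module named com.abc.xyz.ml...".
print(java_to_python_name("com.abc.xyz.ml.SomeClass"))
# -> com.abc.xyz.ml.SomeClass
```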
[jira] [Commented] (SPARK-18359) Let user specify locale in CSV parsing
[ https://issues.apache.org/jira/browse/SPARK-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012455#comment-16012455 ]

Sean Owen commented on SPARK-18359:
---
Using the JVM locale is a bad way to get this behavior, because it's not portable. Input would mysteriously work on one machine and not another, or succeed but quietly give the wrong output. It also caused some SQL-related methods to return the wrong value on non-US-locale machines. That's a big(ger) problem that had to be fixed.

Yes, the problem is that there isn't a way to specify non-US locales just for CSV parsing. That's what this issue is about, and yes, you should work on it if you need the functionality. As a workaround you can do some preprocessing to parse the dates manually. Not great, but not hard either.

> Let user specify locale in CSV parsing
> --
>
> Key: SPARK-18359
> URL: https://issues.apache.org/jira/browse/SPARK-18359
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0, 2.0.1
> Reporter: yannick Radji
>
> On the DataFrameReader object there is no CSV-specific option to set the
> decimal delimiter to a comma rather than a dot, as is customary in France
> and elsewhere in Europe.
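The preprocessing workaround mentioned in the comment can be sketched for the decimal-delimiter case the reporter describes. This is a plain-Python illustration (the function name and the assumption that input uses dot grouping with a comma decimal separator are mine), which could run over the raw text before handing it to Spark's CSV reader:

```python
# Sketch of the preprocessing workaround: normalize European-formatted
# numbers ("1.234,56") to the US format the CSV parser expects.
# Assumes dot = thousands separator and comma = decimal separator.

def normalize_decimal(s: str) -> str:
    """Convert '1.234,56' (dot grouping, comma decimal) to '1234.56'."""
    return s.replace(".", "").replace(",", ".")

print(float(normalize_decimal("1.234,56")))  # 1234.56
```

The order of the two replaces matters: stripping the grouping dots first guarantees the comma is the only separator left to convert.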
[jira] [Commented] (SPARK-20740) Expose UserDefinedType make sure could extends it
[ https://issues.apache.org/jira/browse/SPARK-20740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012453#comment-16012453 ]

Hyukjin Kwon commented on SPARK-20740:
--
ping [~darion]. If we can't explain the use case, I would suggest closing this.

> Expose UserDefinedType make sure could extends it
> -
>
> Key: SPARK-20740
> URL: https://issues.apache.org/jira/browse/SPARK-20740
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: darion yaphet
>
> Users may want to extend UserDefinedType to create their own data types. We
> should make UserDefinedType a public class.
[jira] [Resolved] (SPARK-20761) Union uses column order rather than schema
[ https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-20761.
--
Resolution: Duplicate

I am pretty sure that it is a duplicate of SPARK-15918. Please reopen this if I misunderstood.

> Union uses column order rather than schema
> --
>
> Key: SPARK-20761
> URL: https://issues.apache.org/jira/browse/SPARK-20761
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: Nakul Jeirath
> Priority: Minor
>
> I believe there is an issue when using union to combine two dataframes when
> the order of columns differs between the left and right side of the union:
> {code}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{BooleanType, StringType, StructField, StructType}
>
> val schema = StructType(Seq(
>   StructField("id", StringType, false),
>   StructField("flag_one", BooleanType, false),
>   StructField("flag_two", BooleanType, false),
>   StructField("flag_three", BooleanType, false)
> ))
>
> val rowRdd = spark.sparkContext.parallelize(Seq(
>   Row("1", true, false, false),
>   Row("2", false, true, false),
>   Row("3", false, false, true)
> ))
>
> spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")
> val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
>
> //Select columns out of order with respect to the emptyData schema
> val data = emptyData.union(spark.sql("select id, flag_two, flag_three, flag_one from temp_flags"))
> {code}
> Selecting the data from the "temp_flags" table results in:
> {noformat}
> spark.sql("select * from temp_flags").show
> +---+--------+--------+----------+
> | id|flag_one|flag_two|flag_three|
> +---+--------+--------+----------+
> |  1|    true|   false|     false|
> |  2|   false|    true|     false|
> |  3|   false|   false|      true|
> +---+--------+--------+----------+
> {noformat}
> Which is the data we'd expect, but when inspecting "data" we get:
> {noformat}
> data.show()
> +---+--------+--------+----------+
> | id|flag_one|flag_two|flag_three|
> +---+--------+--------+----------+
> |  1|   false|   false|      true|
> |  2|    true|   false|     false|
> |  3|   false|    true|     false|
> +---+--------+--------+----------+
> {noformat}
> Having a non-empty dataframe on the left side of the union doesn't seem to
> make a difference either:
> {noformat}
> spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, flag_three, flag_one from temp_flags")).show
> +---+--------+--------+----------+
> | id|flag_one|flag_two|flag_three|
> +---+--------+--------+----------+
> |  1|    true|   false|     false|
> |  2|   false|    true|     false|
> |  3|   false|   false|      true|
> |  1|   false|   false|      true|
> |  2|    true|   false|     false|
> |  3|   false|    true|     false|
> +---+--------+--------+----------+
> {noformat}
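The misassignment in the report is purely positional. The following pure-Python illustration (not Spark code; column and row values mirror the report) shows why union by position pairs values with the wrong names, and why matching columns by name first, which Spark exposes as DataFrame.unionByName, gives the intended result:

```python
# Pure-Python illustration of position-based vs name-based union.
left_cols = ["id", "flag_one", "flag_two", "flag_three"]
right_cols = ["id", "flag_two", "flag_three", "flag_one"]
right_row = ["1", False, False, True]  # i.e. flag_one=True for id "1"

# Position-based union: values land under the left schema's names.
by_position = dict(zip(left_cols, right_row))
print(by_position["flag_one"])  # False -- wrong

# Name-based union: reorder the right row to the left schema first.
named = dict(zip(right_cols, right_row))
by_name = {c: named[c] for c in left_cols}
print(by_name["flag_one"])      # True -- as intended
```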
[jira] [Commented] (SPARK-20765) Cannot load persisted PySpark ML Pipeline that includes 3rd party stage (Transformer or Estimator) if the package name of stage is not "org.apache.spark" and "pyspark"
[ https://issues.apache.org/jira/browse/SPARK-20765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012439#comment-16012439 ]

Sean Owen commented on SPARK-20765:
---
Yes, but doesn't this code leave com.abc.xyz as com.abc.xyz, as desired? That's your class name. Is that class on the classpath when you load?

> Cannot load persisted PySpark ML Pipeline that includes 3rd party stage
> (Transformer or Estimator) if the package name of stage is not
> "org.apache.spark" and "pyspark"
[jira] [Commented] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results
[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012422#comment-16012422 ]

Apache Spark commented on SPARK-20364:
--
User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/18000

> Parquet predicate pushdown on columns with dots return empty results
> --
>
> Key: SPARK-20364
> URL: https://issues.apache.org/jira/browse/SPARK-20364
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Hyukjin Kwon
> Priority: Critical
>
> Currently, if there are dots in the column name, predicate pushdown seems to
> fail in Parquet.
> **With dots**
> {code}
> val path = "/tmp/abcde"
> Seq(Some(1), None).toDF("col.dots").write.parquet(path)
> spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
> {code}
> {code}
> +--------+
> |col.dots|
> +--------+
> +--------+
> {code}
> **Without dots**
> {code}
> val path = "/tmp/abcde2"
> Seq(Some(1), None).toDF("coldots").write.parquet(path)
> spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
> {code}
> {code}
> +-------+
> |coldots|
> +-------+
> |      1|
> +-------+
> {code}
> It seems {{FilterApi}} splits a column name containing dots into multiple
> path segments (a {{ColumnPath}} with multiple column paths), whereas the
> actual column name is {{col.dots}}. (See [FilterApi.java#L71|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71],
> which calls [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44].)
> I tried to come up with ways to resolve it, and found two, as below:
> One is simply not to push down filters when there are dots in column names,
> so that Spark reads everything and filters on the Spark side.
> The other creates Spark's own {{FilterApi}} for those columns (the original
> seems final) so that a single column path is always used on the Spark side
> (this seems hacky), as we are not pushing down nested columns currently.
> This way, we can get a field name via {{ColumnPath.get}} rather than
> {{ColumnPath.fromDotString}}.
> I just made a rough version of the latter:
> {code}
> private[parquet] object ParquetColumns {
>   def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = {
>     new Column[Integer](ColumnPath.get(columnPath), classOf[Integer]) with SupportsLtGt
>   }
>
>   def longColumn(columnPath: String): Column[java.lang.Long] with SupportsLtGt = {
>     new Column[java.lang.Long](ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt
>   }
>
>   def floatColumn(columnPath: String): Column[java.lang.Float] with SupportsLtGt = {
>     new Column[java.lang.Float](ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt
>   }
>
>   def doubleColumn(columnPath: String): Column[java.lang.Double] with SupportsLtGt = {
>     new Column[java.lang.Double](ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt
>   }
>
>   def booleanColumn(columnPath: String): Column[java.lang.Boolean] with SupportsEqNotEq = {
>     new Column[java.lang.Boolean](ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with SupportsEqNotEq
>   }
>
>   def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = {
>     new Column[Binary](ColumnPath.get(columnPath), classOf[Binary]) with SupportsLtGt
>   }
> }
> {code}
> Neither sounds the best. Please let me know if anyone has a better idea.
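The root cause described in the issue can be mimicked in a few lines of pure Python (the real logic lives in parquet-mr's ColumnPath; the function names below are analogues, not the actual API):

```python
# Splitting the column name on dots turns the single column "col.dots" into
# a two-segment nested path, which no longer matches the Parquet schema.

def from_dot_string(name: str):
    # Analogue of ColumnPath.fromDotString: a dot means nesting.
    return tuple(name.split("."))

def get_single(name: str):
    # Analogue of ColumnPath.get: treat the name as one literal segment.
    return (name,)

print(from_dot_string("col.dots"))  # ('col', 'dots') -- two segments, no match
print(get_single("col.dots"))       # ('col.dots',)   -- one literal segment
```

This is why the issue's second approach routes column construction through ColumnPath.get: it keeps the dotted name as a single segment so the filter matches the actual column.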