[jira] [Commented] (SPARK-20780) Spark Kafka10 Consumer Hangs
[ https://issues.apache.org/jira/browse/SPARK-20780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013633#comment-16013633 ] Sean Owen commented on SPARK-20780: --- This doesn't sound like a Spark problem though. You're timing out in reading from Kafka? > Spark Kafka10 Consumer Hangs > > > Key: SPARK-20780 > URL: https://issues.apache.org/jira/browse/SPARK-20780 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 > Spark Streaming Kafka 010 > Yarn - Cluster Mode > CDH 5.8.4 > CentOS Linux release 7.2 >Reporter: jayadeepj > Attachments: streaming_1.png, streaming_2.png, tasks_timing_out_3.png > > > We have recently upgraded our Streaming App with Direct Stream to Spark 2 > (spark-streaming-kafka-0-10 - 2.1.0) with Kafka version (0.10.0.0) & Consumer > 10 . We find abnormal delays after the application has run for a couple of > hours or completed consumption of approx. ~ 5 million records. > See screenshot 1 & 2 > There is a sudden dip in the processing time from ~15 seconds (usual for this > app) to ~3 minutes & from then on the processing time keeps degrading > throughout. > We have seen that the delay is due to certain tasks taking the exact time > duration of the configured Kafka Consumer 'request.timeout.ms' . We have > tested this by varying timeout property to different values. > See screenshot 3. > I think the get(offset: Long, timeout: Long): ConsumerRecord[K, V] method & > subsequent poll(timeout) method in CachedKafkaConsumer.scala is actually > timing out on some of the partitions without reading data. But the executor > logs it as successfully completed after the exact timeout duration. Note that > most other tasks are completing successfully with millisecond duration. The > timeout is most likely from the > org.apache.kafka.clients.consumer.KafkaConsumer & we did not observe any > network latency difference. 
> We have observed this across multiple clusters & multiple apps with & without > TLS/SSL. Spark 1.6 with 0-8 consumer seems to be fine with consistent > performance > 17/05/17 10:30:06 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446288 > 17/05/17 10:30:06 INFO executor.Executor: Running task 11.0 in stage 5663.0 > (TID 446288) > 17/05/17 10:30:06 INFO kafka010.KafkaRDD: Computing topic XX-XXX-XX, > partition 0 offsets 776843 -> 779591 > 17/05/17 10:30:06 INFO kafka010.CachedKafkaConsumer: Initial fetch for > spark-executor-default1 XX-XXX-XX 0 776843 > 17/05/17 10:30:56 INFO executor.Executor: Finished task 11.0 in stage 5663.0 > (TID 446288). 1699 bytes result sent to driver > 17/05/17 10:30:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446329 > 17/05/17 10:30:56 INFO executor.Executor: Running task 0.0 in stage 5667.0 > (TID 446329) > 17/05/17 10:30:56 INFO spark.MapOutputTrackerWorker: Updating epoch to 3116 > and clearing cache > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 6807 > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807_piece0 stored > as bytes in memory (estimated size 13.1 KB, free 4.1 GB) > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 6807 took 4 ms > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807 stored as > values in m > We can see that the log statement differ with the exact timeout duration. > Our consumer config is below. 
> 17/05/17 12:33:13 INFO dstream.ForEachDStream: Initialized and validated > org.apache.spark.streaming.dstream.ForEachDStream@1171dde4 > 17/05/17 12:33:13 INFO consumer.ConsumerConfig: ConsumerConfig values: > metric.reporters = [] > metadata.max.age.ms = 30 > partition.assignment.strategy = > [org.apache.kafka.clients.consumer.RangeAssignor] > reconnect.backoff.ms = 50 > sasl.kerberos.ticket.renew.window.factor = 0.8 > max.partition.fetch.bytes = 1048576 > bootstrap.servers = [x.xxx.xxx:9092] > ssl.keystore.type = JKS > enable.auto.commit = true > sasl.mechanism = GSSAPI > interceptor.classes = null > exclude.internal.topics = true > ssl.truststore.password = null > client.id = > ssl.endpoint.identification.algorithm = null > max.poll.records = 2147483647 > check.crcs = true > request.timeout.ms = 5 > heartbeat.interval.ms = 3000 > auto.commit.interval.ms = 5000 > receive.buffer.bytes = 65536 > ssl.truststore.type = JKS > ssl.truststore.location = null > ssl.keystore.password = null > fetch.min.bytes = 1 > send.buffer.bytes = 131072 > value.deserializer = class > org.apache.kafka.common.serialization.ByteArrayDeserializer > group.i
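The failure mode described in SPARK-20780 (tasks "completing" after exactly `request.timeout.ms`) matches the shape of a deadline-bounded poll loop. The sketch below is illustrative only, not Spark's `CachedKafkaConsumer` code: a loop that polls until data arrives or the deadline elapses will silently exit after roughly the full timeout when the broker returns no records, which would make a task look successful after exactly the configured duration.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of deadline-bounded polling. `poller` stands in for a
// KafkaConsumer.poll(timeout) call that returns a possibly-empty batch.
public class DeadlinePoll {
    public static <T> List<T> pollUntilData(Supplier<List<T>> poller, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            List<T> batch = poller.get();
            if (!batch.isEmpty()) {
                return batch; // data arrived before the deadline
            }
        }
        // Timed out without reading data: the caller observes a task that
        // "finished" after ~timeoutMs, exactly the symptom in the report.
        return Collections.emptyList();
    }
}
```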
[jira] [Commented] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013624#comment-16013624 ] Sean Owen commented on SPARK-20779: --- You don't need a JIRA for this. There's not even a 'right' place as these headers are optional, although recommended practice. It's OK to make things consistent but it's trivial. > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Priority: Trivial > > when i test some examples, i found the license is not at the top in some > files. and it will be best if we update these places of the ASF header to be > consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20772) Add support for query parameters in redirects on Yarn
[ https://issues.apache.org/jira/browse/SPARK-20772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjorn Jonsson closed SPARK-20772. - Resolution: Not A Problem > Add support for query parameters in redirects on Yarn > - > > Key: SPARK-20772 > URL: https://issues.apache.org/jira/browse/SPARK-20772 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.1.0 >Reporter: Bjorn Jonsson >Priority: Minor > > Spark uses rewrites of query parameters to paths > (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). > This works fine in local or standalone mode, but does not work on Yarn (with > the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where > the query parameter is dropped. > The repro steps are: > - Start up the spark-shell and run a job > - Try to access the job details through http://:4040/jobs/job?id=0 > - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) > Going directly through the RM proxy works (does not cause query parameters to > be dropped).
[jira] [Commented] (SPARK-20772) Add support for query parameters in redirects on Yarn
[ https://issues.apache.org/jira/browse/SPARK-20772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013618#comment-16013618 ] Wilfred Spiegelenburg commented on SPARK-20772: --- correct, opening up a yarn issue and providing a fix there: YARN-6615 Please close this > Add support for query parameters in redirects on Yarn > - > > Key: SPARK-20772 > URL: https://issues.apache.org/jira/browse/SPARK-20772 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.1.0 >Reporter: Bjorn Jonsson >Priority: Minor > > Spark uses rewrites of query parameters to paths > (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). > This works fine in local or standalone mode, but does not work on Yarn (with > the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where > the query parameter is dropped. > The repro steps are: > - Start up the spark-shell and run a job > - Try to access the job details through http://:4040/jobs/job?id=0 > - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) > Going directly through the RM proxy works (does not cause query parameters to > be dropped).
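The rewrite the issue describes (`/jobs/job?id=0` redirected to `/jobs/job/?id=0`) only works if the query string is carried over to the new location. A minimal, hypothetical sketch of that trailing-slash rewrite in plain Java (this is not the YARN `AmIpFilter` code, which is where the parameter is actually dropped):

```java
import java.net.URI;

// Illustrative trailing-slash redirect target builder: the query string from
// the original request must be appended to the rewritten path, otherwise the
// UI fails with "requirement failed: missing id parameter".
public class RedirectRewrite {
    public static String addTrailingSlash(String url) {
        URI u = URI.create(url);
        String path = u.getPath().endsWith("/") ? u.getPath() : u.getPath() + "/";
        String query = u.getQuery(); // null when no query string is present
        return u.getScheme() + "://" + u.getAuthority() + path
                + (query == null ? "" : "?" + query);
    }
}
```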
[jira] [Commented] (SPARK-20778) Implement array_intersect function
[ https://issues.apache.org/jira/browse/SPARK-20778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013617#comment-16013617 ] Sean Owen commented on SPARK-20778: --- This should just be a UDF. What's the argument that it needs to be a built-in? > Implement array_intersect function > -- > > Key: SPARK-20778 > URL: https://issues.apache.org/jira/browse/SPARK-20778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Vandenberg >Priority: Minor > > Implement an array_intersect function that takes array arguments and returns > an array containing all elements of the first array that is common with the > remaining arrays.
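The proposed semantics (elements of the first array that appear in all remaining arrays) are simple enough to express as a user-defined function rather than a built-in, which is the point of the comment above. A plain-JVM sketch of the core logic (illustrative; registering it as a Spark UDF would wrap this in the usual UDF machinery):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Sketch of array_intersect semantics: keep each element of the first array
// that also occurs in every remaining array, preserving the first array's
// order. Whether duplicates in the first array are kept is a design choice;
// this version keeps them.
public class ArrayIntersect {
    @SafeVarargs
    public static <T> List<T> intersect(List<T> first, List<T>... rest) {
        List<T> result = new ArrayList<>(first);
        for (List<T> other : rest) {
            result.retainAll(new HashSet<>(other)); // drop elements absent from `other`
        }
        return result;
    }
}
```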
[jira] [Assigned] (SPARK-20781) the location of Dockerfile in docker.properties.template is wrong
[ https://issues.apache.org/jira/browse/SPARK-20781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20781: Assignee: Apache Spark > the location of Dockerfile in docker.properties.template is wrong > - > > Key: SPARK-20781 > URL: https://issues.apache.org/jira/browse/SPARK-20781 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Assignee: Apache Spark > > the location of Dockerfile in docker.properties.template should be > "../external/docker/spark-mesos/Dockerfile"
[jira] [Assigned] (SPARK-20781) the location of Dockerfile in docker.properties.template is wrong
[ https://issues.apache.org/jira/browse/SPARK-20781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20781: Assignee: (was: Apache Spark) > the location of Dockerfile in docker.properties.template is wrong > - > > Key: SPARK-20781 > URL: https://issues.apache.org/jira/browse/SPARK-20781 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.1 >Reporter: liuzhaokun > > the location of Dockerfile in docker.properties.template should be > "../external/docker/spark-mesos/Dockerfile"
[jira] [Commented] (SPARK-20781) the location of Dockerfile in docker.properties.template is wrong
[ https://issues.apache.org/jira/browse/SPARK-20781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013563#comment-16013563 ] Apache Spark commented on SPARK-20781: -- User 'liu-zhaokun' has created a pull request for this issue: https://github.com/apache/spark/pull/18013 > the location of Dockerfile in docker.properties.template is wrong > - > > Key: SPARK-20781 > URL: https://issues.apache.org/jira/browse/SPARK-20781 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.1 >Reporter: liuzhaokun > > the location of Dockerfile in docker.properties.template should be > "../external/docker/spark-mesos/Dockerfile"
[jira] [Created] (SPARK-20781) the location of Dockerfile in docker.properties.template is wrong
liuzhaokun created SPARK-20781: -- Summary: the location of Dockerfile in docker.properties.template is wrong Key: SPARK-20781 URL: https://issues.apache.org/jira/browse/SPARK-20781 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 2.1.1 Reporter: liuzhaokun the location of Dockerfile in docker.properties.template should be "../external/docker/spark-mesos/Dockerfile"
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013540#comment-16013540 ] Reynold Xin commented on SPARK-12297: - I don't think the CSV example you gave make sense. It is still interpreted timestamp with timezone. Just specify a timezone in the string and Spark will use that timezone. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. 
Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same 
times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala.
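The shift in the quoted spark-shell output (the "America/Los_Angeles" value `2015-12-31 23:50:59.123` reading back as `2016-01-01 02:50:59.123` in "America/New_York") is exactly what instant-style semantics produce. The two semantics can be reproduced with plain `java.time`, independent of Spark or Parquet; this sketch is illustrative, not how int96 decoding is implemented:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Two readings of the same timestamp text. Instant-style (Parquet int96 as
// Spark/Impala treat it): the digits are interpreted in the writer's zone and
// re-rendered in the reader's zone, so they shift. Floating-style (textfile,
// json): the digits *are* the value, so every reader sees the same string.
public class TimestampSemantics {
    static final DateTimeFormatter F =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    public static String instantView(String ts, ZoneId writer, ZoneId reader) {
        Instant i = LocalDateTime.parse(ts, F).atZone(writer).toInstant();
        return i.atZone(reader).toLocalDateTime().format(F);
    }

    public static String floatingView(String ts) {
        return ts; // reader's timezone is irrelevant
    }
}
```

The 3-hour difference between the two views is why the json-vs-parquet join in the example returns no rows.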
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013540#comment-16013540 ] Reynold Xin edited comment on SPARK-12297 at 5/17/17 5:12 AM: -- I don't think the CSV example you gave makes sense or supports your argument. It is still interpreted timestamp with timezone. Just specify a timezone in the string and Spark will use that timezone. was (Author: rxin): I don't think the CSV example you gave make sense. It is still interpreted timestamp with timezone. Just specify a timezone in the string and Spark will use that timezone. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. 
Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same 
times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, its a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala.
[jira] [Resolved] (SPARK-20776) Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20776. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18008 [https://github.com/apache/spark/pull/18008] > Fix JobProgressListener perf. problems caused by empty TaskMetrics > initialization > - > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.2.0 > > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent constructing empty > TaskMetrics instances inside JobProgressListener. As I'll show in a PR, we > can slightly simplify the code to remove the need to construct one empty > TaskMetrics per onTaskSubmitted event.
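The allocation pattern this issue targets is generic: building a fresh "empty" object on every event wastes time on a hot listener path when one shared immutable placeholder would do. A hypothetical sketch of the before/after shape (names are illustrative, not Spark's actual classes):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative contrast between per-event allocation and a shared immutable
// sentinel, the kind of simplification the referenced PR makes for empty
// TaskMetrics in JobProgressListener.
public class ListenerAlloc {
    static final class Metrics {
        final long runTimeMs;
        Metrics(long runTimeMs) { this.runTimeMs = runTimeMs; }
    }

    static final Metrics EMPTY = new Metrics(0L); // shared, immutable

    // Before: one new (identical, empty) object per onTaskSubmitted-style event.
    public static List<Metrics> perEvent(int events) {
        List<Metrics> out = new ArrayList<>();
        for (int i = 0; i < events; i++) out.add(new Metrics(0L));
        return out;
    }

    // After: every event reuses the same instance; no per-event construction.
    public static List<Metrics> shared(int events) {
        List<Metrics> out = new ArrayList<>();
        for (int i = 0; i < events; i++) out.add(EMPTY);
        return out;
    }
}
```

Sharing is only safe because the placeholder is immutable; a mutable metrics object would still need per-task instances.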
[jira] [Resolved] (SPARK-20690) Analyzer shouldn't add missing attributes through subquery
[ https://issues.apache.org/jira/browse/SPARK-20690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20690. - Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.3.0 > Analyzer shouldn't add missing attributes through subquery > -- > > Key: SPARK-20690 > URL: https://issues.apache.org/jira/browse/SPARK-20690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Labels: release-notes > Fix For: 2.3.0 > > > We add missing attributes into Filter in Analyzer. But we shouldn't do it > through subqueries like this: > {code} > select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 > {code} > This query works in current codebase. However, the outside where clause > shouldn't be able to refer t1.c1 attribute.
[jira] [Updated] (SPARK-20690) Analyzer shouldn't add missing attributes through subquery
[ https://issues.apache.org/jira/browse/SPARK-20690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20690: Labels: release-notes (was: ) > Analyzer shouldn't add missing attributes through subquery > -- > > Key: SPARK-20690 > URL: https://issues.apache.org/jira/browse/SPARK-20690 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > Labels: release-notes > > We add missing attributes into Filter in Analyzer. But we shouldn't do it > through subqueries like this: > {code} > select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1 > {code} > This query works in current codebase. However, the outside where clause > shouldn't be able to refer t1.c1 attribute.
[jira] [Updated] (SPARK-20780) Spark Kafka10 Consumer Hangs
[ https://issues.apache.org/jira/browse/SPARK-20780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jayadeepj updated SPARK-20780: -- Environment: Spark 2.1.0 Spark Streaming Kafka 010 Yarn - Cluster Mode CDH 5.8.4 CentOS Linux release 7.2 was: Spark 2.1.0 Spark Streaming Kafka 010 CDH 5.8.4 CentOS Linux release 7.2 > Spark Kafka10 Consumer Hangs > > > Key: SPARK-20780 > URL: https://issues.apache.org/jira/browse/SPARK-20780 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 > Spark Streaming Kafka 010 > Yarn - Cluster Mode > CDH 5.8.4 > CentOS Linux release 7.2 >Reporter: jayadeepj > Attachments: streaming_1.png, streaming_2.png, tasks_timing_out_3.png > > > We have recently upgraded our Streaming App with Direct Stream to Spark 2 > (spark-streaming-kafka-0-10 - 2.1.0) with Kafka version (0.10.0.0) & Consumer > 10 . We find abnormal delays after the application has run for a couple of > hours or completed consumption of approx. ~ 5 million records. > See screenshot 1 & 2 > There is a sudden dip in the processing time from ~15 seconds (usual for this > app) to ~3 minutes & from then on the processing time keeps degrading > throughout. > We have seen that the delay is due to certain tasks taking the exact time > duration of the configured Kafka Consumer 'request.timeout.ms' . We have > tested this by varying timeout property to different values. > See screenshot 3. > I think the get(offset: Long, timeout: Long): ConsumerRecord[K, V] method & > subsequent poll(timeout) method in CachedKafkaConsumer.scala is actually > timing out on some of the partitions without reading data. But the executor > logs it as successfully completed after the exact timeout duration. Note that > most other tasks are completing successfully with millisecond duration. The > timeout is most likely from the > org.apache.kafka.clients.consumer.KafkaConsumer & we did not observe any > network latency difference. 
> We have observed this across multiple clusters & multiple apps with & without > TLS/SSL. Spark 1.6 with 0-8 consumer seems to be fine with consistent > performance > 17/05/17 10:30:06 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446288 > 17/05/17 10:30:06 INFO executor.Executor: Running task 11.0 in stage 5663.0 > (TID 446288) > 17/05/17 10:30:06 INFO kafka010.KafkaRDD: Computing topic XX-XXX-XX, > partition 0 offsets 776843 -> 779591 > 17/05/17 10:30:06 INFO kafka010.CachedKafkaConsumer: Initial fetch for > spark-executor-default1 XX-XXX-XX 0 776843 > 17/05/17 10:30:56 INFO executor.Executor: Finished task 11.0 in stage 5663.0 > (TID 446288). 1699 bytes result sent to driver > 17/05/17 10:30:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446329 > 17/05/17 10:30:56 INFO executor.Executor: Running task 0.0 in stage 5667.0 > (TID 446329) > 17/05/17 10:30:56 INFO spark.MapOutputTrackerWorker: Updating epoch to 3116 > and clearing cache > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 6807 > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807_piece0 stored > as bytes in memory (estimated size 13.1 KB, free 4.1 GB) > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 6807 took 4 ms > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807 stored as > values in m > We can see that the log statement differ with the exact timeout duration. > Our consumer config is below. 
> 17/05/17 12:33:13 INFO dstream.ForEachDStream: Initialized and validated > org.apache.spark.streaming.dstream.ForEachDStream@1171dde4 > 17/05/17 12:33:13 INFO consumer.ConsumerConfig: ConsumerConfig values: > metric.reporters = [] > metadata.max.age.ms = 30 > partition.assignment.strategy = > [org.apache.kafka.clients.consumer.RangeAssignor] > reconnect.backoff.ms = 50 > sasl.kerberos.ticket.renew.window.factor = 0.8 > max.partition.fetch.bytes = 1048576 > bootstrap.servers = [x.xxx.xxx:9092] > ssl.keystore.type = JKS > enable.auto.commit = true > sasl.mechanism = GSSAPI > interceptor.classes = null > exclude.internal.topics = true > ssl.truststore.password = null > client.id = > ssl.endpoint.identification.algorithm = null > max.poll.records = 2147483647 > check.crcs = true > request.timeout.ms = 5 > heartbeat.interval.ms = 3000 > auto.commit.interval.ms = 5000 > receive.buffer.bytes = 65536 > ssl.truststore.type = JKS > ssl.truststore.location = null > ssl.keystore.password = null > fetch.min.bytes = 1 > send.buffer.bytes = 131072 > value.deserializer = class > org.apache.kafka.common.serial
[jira] [Updated] (SPARK-20762) Make String Params Case-Insensitive
[ https://issues.apache.org/jira/browse/SPARK-20762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-20762: - Description: Make String Params (excpet Cols) case-insensitve: {{solver}} {{modelType}} {{initMode}} {{metricName}} {{handleInvalid}} {{strategy}} {{stringOrderType}} {{coldStartStrategy}} {{impurity}} {{lossType}} {{featureSubsetStrategy}} {{intermediateStorageLevel}} {{finalStorageLevel}} was: Make String Params (excpet Cols) case-insensitve: {{solver}} {{modelType}} {{initMode}} {{metricName}} {{handleInvalid}} {{strategy}} {{stringOrderType}} {{coldStartStrategy}} {{impurity}} {{lossType}} {{}} > Make String Params Case-Insensitive > --- > > Key: SPARK-20762 > URL: https://issues.apache.org/jira/browse/SPARK-20762 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: zhengruifeng > > Make String Params (excpet Cols) case-insensitve: > {{solver}} > {{modelType}} > {{initMode}} > {{metricName}} > {{handleInvalid}} > {{strategy}} > {{stringOrderType}} > {{coldStartStrategy}} > {{impurity}} > {{lossType}} > {{featureSubsetStrategy}} > {{intermediateStorageLevel}} > {{finalStorageLevel}}
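The requested behavior for these string-valued params can be sketched as simple lower-case normalization at validation time. This is an illustrative sketch only, not Spark ML's actual `Param` implementation, and `CaseInsensitiveParam` is a hypothetical name:

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical case-insensitive string param: both the allowed values and the
// user-supplied value are compared in lower case, and the canonical
// (lower-case) form is what gets stored.
public class CaseInsensitiveParam {
    private final Set<String> allowed; // expected to be lower-case

    public CaseInsensitiveParam(Set<String> allowedValues) {
        this.allowed = allowedValues;
    }

    // Returns the canonical value, or throws for an unsupported one.
    public String validate(String value) {
        String canonical = value.toLowerCase(Locale.ROOT);
        if (!allowed.contains(canonical)) {
            throw new IllegalArgumentException("unsupported value: " + value);
        }
        return canonical;
    }
}
```

Using `Locale.ROOT` avoids locale-dependent surprises (e.g. the Turkish dotless-i) when lower-casing identifiers.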
[jira] [Updated] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-20779: Description: when i test some examples, i found the license is not at the top in some files. and it will be best if we update these places of the ASF header to be consistent with other files. (was: The license is not at the top in some files. and it will be best if we update these places of the ASF header to be consistent with other files.) > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Priority: Trivial > > when i test some examples, i found the license is not at the top in some > files. and it will be best if we update these places of the ASF header to be > consistent with other files.
[jira] [Updated] (SPARK-20780) Spark Kafka10 Consumer Hangs
[ https://issues.apache.org/jira/browse/SPARK-20780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jayadeepj updated SPARK-20780: -- Environment: Spark 2.1.0 Spark Streaming Kafka 010 CDH 5.8.4 CentOS Linux release 7.2 was: Spark 2.1.0 Spark Streaming Kafka 010 CDH 5.8.4 CentOS Linux release 7.2 Priority: Major (was: Critical) > Spark Kafka10 Consumer Hangs > > > Key: SPARK-20780 > URL: https://issues.apache.org/jira/browse/SPARK-20780 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 > Spark Streaming Kafka 010 > CDH 5.8.4 > CentOS Linux release 7.2 >Reporter: jayadeepj > Attachments: streaming_1.png, streaming_2.png, tasks_timing_out_3.png > > > We have recently upgraded our Streaming App with Direct Stream to Spark 2 > (spark-streaming-kafka-0-10 - 2.1.0) with Kafka version (0.10.0.0) & Consumer > 10 . We find abnormal delays after the application has run for a couple of > hours or completed consumption of approx. ~ 5 million records. > See screenshot 1 & 2 > There is a sudden dip in the processing time from ~15 seconds (usual for this > app) to ~3 minutes & from then on the processing time keeps degrading > throughout. > We have seen that the delay is due to certain tasks taking the exact time > duration of the configured Kafka Consumer 'request.timeout.ms' . We have > tested this by varying timeout property to different values. > See screenshot 3. > I think the get(offset: Long, timeout: Long): ConsumerRecord[K, V] method & > subsequent poll(timeout) method in CachedKafkaConsumer.scala is actually > timing out on some of the partitions without reading data. But the executor > logs it as successfully completed after the exact timeout duration. Note that > most other tasks are completing successfully with millisecond duration. The > timeout is most likely from the > org.apache.kafka.clients.consumer.KafkaConsumer & we did not observe any > network latency difference. 
> We have observed this across multiple clusters & multiple apps with & without > TLS/SSL. Spark 1.6 with 0-8 consumer seems to be fine with consistent > performance > 17/05/17 10:30:06 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446288 > 17/05/17 10:30:06 INFO executor.Executor: Running task 11.0 in stage 5663.0 > (TID 446288) > 17/05/17 10:30:06 INFO kafka010.KafkaRDD: Computing topic XX-XXX-XX, > partition 0 offsets 776843 -> 779591 > 17/05/17 10:30:06 INFO kafka010.CachedKafkaConsumer: Initial fetch for > spark-executor-default1 XX-XXX-XX 0 776843 > 17/05/17 10:30:56 INFO executor.Executor: Finished task 11.0 in stage 5663.0 > (TID 446288). 1699 bytes result sent to driver > 17/05/17 10:30:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446329 > 17/05/17 10:30:56 INFO executor.Executor: Running task 0.0 in stage 5667.0 > (TID 446329) > 17/05/17 10:30:56 INFO spark.MapOutputTrackerWorker: Updating epoch to 3116 > and clearing cache > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 6807 > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807_piece0 stored > as bytes in memory (estimated size 13.1 KB, free 4.1 GB) > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 6807 took 4 ms > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807 stored as > values in m > We can see that the log statements differ by the exact timeout duration. > Our consumer config is below. 
> 17/05/17 12:33:13 INFO dstream.ForEachDStream: Initialized and validated > org.apache.spark.streaming.dstream.ForEachDStream@1171dde4 > 17/05/17 12:33:13 INFO consumer.ConsumerConfig: ConsumerConfig values: > metric.reporters = [] > metadata.max.age.ms = 30 > partition.assignment.strategy = > [org.apache.kafka.clients.consumer.RangeAssignor] > reconnect.backoff.ms = 50 > sasl.kerberos.ticket.renew.window.factor = 0.8 > max.partition.fetch.bytes = 1048576 > bootstrap.servers = [x.xxx.xxx:9092] > ssl.keystore.type = JKS > enable.auto.commit = true > sasl.mechanism = GSSAPI > interceptor.classes = null > exclude.internal.topics = true > ssl.truststore.password = null > client.id = > ssl.endpoint.identification.algorithm = null > max.poll.records = 2147483647 > check.crcs = true > request.timeout.ms = 5 > heartbeat.interval.ms = 3000 > auto.commit.interval.ms = 5000 > receive.buffer.bytes = 65536 > ssl.truststore.type = JKS > ssl.truststore.location = null > ssl.keystore.password = null > fetch.min.bytes = 1 > send.buffer.bytes = 131072 > value.deserializer = class > org.apache.kafka.common.seriali
[jira] [Updated] (SPARK-20780) Spark Kafka10 Consumer Hangs
[ https://issues.apache.org/jira/browse/SPARK-20780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jayadeepj updated SPARK-20780: -- Attachment: streaming_1.png streaming_2.png tasks_timing_out_3.png > Spark Kafka10 Consumer Hangs > > > Key: SPARK-20780 > URL: https://issues.apache.org/jira/browse/SPARK-20780 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 > Spark Streaming Kafka 010 > CDH 5.8.4 > CentOS Linux release 7.2 >Reporter: jayadeepj >Priority: Critical > Attachments: streaming_1.png, streaming_2.png, tasks_timing_out_3.png > > > We have recently upgraded our Streaming App with Direct Stream to Spark 2 > (spark-streaming-kafka-0-10 - 2.1.0) with Kafka version (0.10.0.0) & Consumer > 10 . We find abnormal delays after the application has run for a couple of > hours or completed consumption of approx. ~ 5 million records. > See screenshot 1 & 2 > There is a sudden dip in the processing time from ~15 seconds (usual for this > app) to ~3 minutes & from then on the processing time keeps degrading > throughout. > We have seen that the delay is due to certain tasks taking the exact time > duration of the configured Kafka Consumer 'request.timeout.ms' . We have > tested this by varying timeout property to different values. > See screenshot 3. > I think the get(offset: Long, timeout: Long): ConsumerRecord[K, V] method & > subsequent poll(timeout) method in CachedKafkaConsumer.scala is actually > timing out on some of the partitions without reading data. But the executor > logs it as successfully completed after the exact timeout duration. Note that > most other tasks are completing successfully with millisecond duration. The > timeout is most likely from the > org.apache.kafka.clients.consumer.KafkaConsumer & we did not observe any > network latency difference. > We have observed this across multiple clusters & multiple apps with & without > TLS/SSL. 
Spark 1.6 with 0-8 consumer seems to be fine with consistent > performance > 17/05/17 10:30:06 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446288 > 17/05/17 10:30:06 INFO executor.Executor: Running task 11.0 in stage 5663.0 > (TID 446288) > 17/05/17 10:30:06 INFO kafka010.KafkaRDD: Computing topic XX-XXX-XX, > partition 0 offsets 776843 -> 779591 > 17/05/17 10:30:06 INFO kafka010.CachedKafkaConsumer: Initial fetch for > spark-executor-default1 XX-XXX-XX 0 776843 > 17/05/17 10:30:56 INFO executor.Executor: Finished task 11.0 in stage 5663.0 > (TID 446288). 1699 bytes result sent to driver > 17/05/17 10:30:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 446329 > 17/05/17 10:30:56 INFO executor.Executor: Running task 0.0 in stage 5667.0 > (TID 446329) > 17/05/17 10:30:56 INFO spark.MapOutputTrackerWorker: Updating epoch to 3116 > and clearing cache > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 6807 > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807_piece0 stored > as bytes in memory (estimated size 13.1 KB, free 4.1 GB) > 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 6807 took 4 ms > 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807 stored as > values in m > We can see that the log statements differ by the exact timeout duration. > Our consumer config is below. 
> 17/05/17 12:33:13 INFO dstream.ForEachDStream: Initialized and validated > org.apache.spark.streaming.dstream.ForEachDStream@1171dde4 > 17/05/17 12:33:13 INFO consumer.ConsumerConfig: ConsumerConfig values: > metric.reporters = [] > metadata.max.age.ms = 30 > partition.assignment.strategy = > [org.apache.kafka.clients.consumer.RangeAssignor] > reconnect.backoff.ms = 50 > sasl.kerberos.ticket.renew.window.factor = 0.8 > max.partition.fetch.bytes = 1048576 > bootstrap.servers = [x.xxx.xxx:9092] > ssl.keystore.type = JKS > enable.auto.commit = true > sasl.mechanism = GSSAPI > interceptor.classes = null > exclude.internal.topics = true > ssl.truststore.password = null > client.id = > ssl.endpoint.identification.algorithm = null > max.poll.records = 2147483647 > check.crcs = true > request.timeout.ms = 5 > heartbeat.interval.ms = 3000 > auto.commit.interval.ms = 5000 > receive.buffer.bytes = 65536 > ssl.truststore.type = JKS > ssl.truststore.location = null > ssl.keystore.password = null > fetch.min.bytes = 1 > send.buffer.bytes = 131072 > value.deserializer = class > org.apache.kafka.common.serialization.ByteArrayDeserializer > group.id = default1 > retry.backoff.
[jira] [Updated] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-20779: Issue Type: Improvement (was: Bug) > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Priority: Trivial > > The license is not at the top in some files. It would be best to > update these placements of the ASF header to be consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20780) Spark Kafka10 Consumer Hangs
jayadeepj created SPARK-20780: - Summary: Spark Kafka10 Consumer Hangs Key: SPARK-20780 URL: https://issues.apache.org/jira/browse/SPARK-20780 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.1.0 Environment: Spark 2.1.0 Spark Streaming Kafka 010 CDH 5.8.4 CentOS Linux release 7.2 Reporter: jayadeepj Priority: Critical We have recently upgraded our Streaming App with Direct Stream to Spark 2 (spark-streaming-kafka-0-10 - 2.1.0) with Kafka version (0.10.0.0) & Consumer 10 . We find abnormal delays after the application has run for a couple of hours or completed consumption of approx. ~ 5 million records. See screenshot 1 & 2 There is a sudden dip in the processing time from ~15 seconds (usual for this app) to ~3 minutes & from then on the processing time keeps degrading throughout. We have seen that the delay is due to certain tasks taking the exact time duration of the configured Kafka Consumer 'request.timeout.ms' . We have tested this by varying timeout property to different values. See screenshot 3. I think the get(offset: Long, timeout: Long): ConsumerRecord[K, V] method & subsequent poll(timeout) method in CachedKafkaConsumer.scala is actually timing out on some of the partitions without reading data. But the executor logs it as successfully completed after the exact timeout duration. Note that most other tasks are completing successfully with millisecond duration. The timeout is most likely from the org.apache.kafka.clients.consumer.KafkaConsumer & we did not observe any network latency difference. We have observed this across multiple clusters & multiple apps with & without TLS/SSL. 
Spark 1.6 with 0-8 consumer seems to be fine with consistent performance 17/05/17 10:30:06 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 446288 17/05/17 10:30:06 INFO executor.Executor: Running task 11.0 in stage 5663.0 (TID 446288) 17/05/17 10:30:06 INFO kafka010.KafkaRDD: Computing topic XX-XXX-XX, partition 0 offsets 776843 -> 779591 17/05/17 10:30:06 INFO kafka010.CachedKafkaConsumer: Initial fetch for spark-executor-default1 XX-XXX-XX 0 776843 17/05/17 10:30:56 INFO executor.Executor: Finished task 11.0 in stage 5663.0 (TID 446288). 1699 bytes result sent to driver 17/05/17 10:30:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 446329 17/05/17 10:30:56 INFO executor.Executor: Running task 0.0 in stage 5667.0 (TID 446329) 17/05/17 10:30:56 INFO spark.MapOutputTrackerWorker: Updating epoch to 3116 and clearing cache 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 6807 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807_piece0 stored as bytes in memory (estimated size 13.1 KB, free 4.1 GB) 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable 6807 took 4 ms 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807 stored as values in m We can see that the log statements differ by the exact timeout duration. Our consumer config is below. 
17/05/17 12:33:13 INFO dstream.ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@1171dde4 17/05/17 12:33:13 INFO consumer.ConsumerConfig: ConsumerConfig values: metric.reporters = [] metadata.max.age.ms = 30 partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor] reconnect.backoff.ms = 50 sasl.kerberos.ticket.renew.window.factor = 0.8 max.partition.fetch.bytes = 1048576 bootstrap.servers = [x.xxx.xxx:9092] ssl.keystore.type = JKS enable.auto.commit = true sasl.mechanism = GSSAPI interceptor.classes = null exclude.internal.topics = true ssl.truststore.password = null client.id = ssl.endpoint.identification.algorithm = null max.poll.records = 2147483647 check.crcs = true request.timeout.ms = 5 heartbeat.interval.ms = 3000 auto.commit.interval.ms = 5000 receive.buffer.bytes = 65536 ssl.truststore.type = JKS ssl.truststore.location = null ssl.keystore.password = null fetch.min.bytes = 1 send.buffer.bytes = 131072 value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer group.id = default1 retry.backoff.ms = 100 ssl.secure.random.implementation = null sasl.kerberos.kinit.cmd = /usr/bin/kinit sasl.kerberos.service.name = null sasl.kerberos.ticket.renew.jitter = 0.05 ssl.trustmanager.algorithm = PKIX ssl.key.password = null fetch.max.wait.ms = 500 sasl.kerberos.min.time.before.relogin = 6 connections.max.idle.ms = 54 session.timeout.ms = 3
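The failure mode described above — a poll that returns no data, with the task still logged as successfully finished after exactly the configured timeout — can be modeled with a minimal sketch (plain Python; a hypothetical simplification for illustration, not the actual CachedKafkaConsumer code):

```python
import time

# Hypothetical model of the fetch loop: poll() returns a batch of records,
# or an empty list after blocking for the full timeout when no data arrives
# (standing in for the Kafka consumer's request.timeout.ms behavior).
def poll(timeout_s, records_available):
    if records_available:
        return ["record"]
    time.sleep(timeout_s)  # block for the whole timeout, then give up
    return []

def get(offset, timeout_s, records_available):
    start = time.time()
    batch = poll(timeout_s, records_available)
    elapsed = time.time() - start
    # The task "finishes" either way; only its duration betrays the timeout,
    # consistent with the 50-second gap between the executor log lines above.
    return batch, elapsed

batch, elapsed = get(776843, 0.05, records_available=False)
assert batch == [] and elapsed >= 0.04  # timed out, yet reported as finished
```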
[jira] [Assigned] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20779: Assignee: (was: Apache Spark) > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Priority: Trivial > > The license is not at the top in some files. It would be best to > update these placements of the ASF header to be consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20779: Assignee: Apache Spark > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Assignee: Apache Spark >Priority: Trivial > > The license is not at the top in some files. It would be best to > update these placements of the ASF header to be consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013465#comment-16013465 ] Apache Spark commented on SPARK-20779: -- User 'zuotingbing' has created a pull request for this issue: https://github.com/apache/spark/pull/18012 > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Priority: Trivial > > The license is not at the top in some files. It would be best to > update these placements of the ASF header to be consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20779) The ASF header placed in an incorrect location in some files
zuotingbing created SPARK-20779: --- Summary: The ASF header placed in an incorrect location in some files Key: SPARK-20779 URL: https://issues.apache.org/jira/browse/SPARK-20779 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 2.1.1, 2.1.0 Reporter: zuotingbing Priority: Trivial The license is not at the top in some files. It would be best to update these placements of the ASF header to be consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
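The check behind this fix is mechanical; a minimal sketch (plain Python, illustration only — `header_at_top` is a hypothetical checker, not part of the actual patch) of verifying that the ASF header is the first thing in a source file:

```python
# Hypothetical checker: the ASF header marker should appear near the top
# of a source file, before any code (illustration only, not the real patch).
ASF_MARKER = "Licensed to the Apache Software Foundation"

def header_at_top(text, search_lines=5):
    """True if the ASF header marker appears within the first few lines."""
    return any(ASF_MARKER in line for line in text.splitlines()[:search_lines])

good = "/*\n * Licensed to the Apache Software Foundation (ASF) ...\n */\npackage example\n"
bad = "package example\n\nimport something\n\n// more code\n// Licensed to the Apache Software Foundation\n"
assert header_at_top(good) and not header_at_top(bad)
```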
[jira] [Commented] (SPARK-20772) Add support for query parameters in redirects on Yarn
[ https://issues.apache.org/jira/browse/SPARK-20772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013431#comment-16013431 ] Saisai Shao commented on SPARK-20772: - I'm guessing this is an issue with {{AmIpFilter}}; if so, it should be a YARN issue, not related to Spark? > Add support for query parameters in redirects on Yarn > - > > Key: SPARK-20772 > URL: https://issues.apache.org/jira/browse/SPARK-20772 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.1.0 >Reporter: Bjorn Jonsson >Priority: Minor > > Spark uses rewrites of query parameters to paths > (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). > This works fine in local or standalone mode, but does not work on Yarn (with > the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where > the query parameter is dropped. > The repro steps are: > - Start up the spark-shell and run a job > - Try to access the job details through http://:4040/jobs/job?id=0 > - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) > Going directly through the RM proxy works (does not cause query parameters to > be dropped). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20772) Add support for query parameters in redirects on Yarn
[ https://issues.apache.org/jira/browse/SPARK-20772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjorn Jonsson updated SPARK-20772: -- Description: Spark uses rewrites of query parameters to paths (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). This works fine in local or standalone mode, but does not work on Yarn (with the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where the query parameter is dropped. The repro steps are: - Start up the spark-shell and run a job - Try to access the job details through http://:4040/jobs/job?id=0 - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) Going directly through the RM proxy works (does not cause query parameters to be dropped). was: Spark uses rewrites of query parameters to paths (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). This works fine in local or standalone mode, but does not work on Yarn (with the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where the query parameter is dropped. The repro steps are: - Start up the spark-shell in yarn client or cluster mode and run a job - Try to access the job details through http://:4040/jobs/job?id=0 - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) Going directly through the RM proxy works (does not cause query parameters to be dropped). > Add support for query parameters in redirects on Yarn > - > > Key: SPARK-20772 > URL: https://issues.apache.org/jira/browse/SPARK-20772 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.1.0 >Reporter: Bjorn Jonsson >Priority: Minor > > Spark uses rewrites of query parameters to paths > (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). > This works fine in local or standalone mode, but does not work on Yarn (with > the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where > the query parameter is dropped. 
> The repro steps are: > - Start up the spark-shell and run a job > - Try to access the job details through http://:4040/jobs/job?id=0 > - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) > Going directly through the RM proxy works (does not cause query parameters to > be dropped). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
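The rewrite described above can keep the query string when the trailing slash is added; a minimal sketch in plain Python (illustration only — the real fix would live in the proxy/redirect handling, and `redirect_with_query` plus the `host` placeholder are hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit

def redirect_with_query(url):
    """Add the trailing slash to the path while keeping the query string."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    if not path.endswith("/"):
        path += "/"
    return urlunsplit((scheme, netloc, path, query, fragment))

# The ?id=0 parameter survives the rewrite instead of being dropped.
assert redirect_with_query("http://host:4040/jobs/job?id=0") == "http://host:4040/jobs/job/?id=0"
```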
[jira] [Assigned] (SPARK-19089) Support nested arrays/seqs in Datasets
[ https://issues.apache.org/jira/browse/SPARK-19089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19089: Assignee: (was: Apache Spark) > Support nested arrays/seqs in Datasets > -- > > Key: SPARK-19089 > URL: https://issues.apache.org/jira/browse/SPARK-19089 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michal Šenkýř >Priority: Minor > > Nested arrays and seqs are not supported in Datasets: > {code} > scala> spark.createDataset(Seq(Array(Array(1)))) > <console>:24: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. >spark.createDataset(Seq(Array(Array(1)))) > ^ > scala> Seq(Array(Array(1))).toDS() > <console>:24: error: value toDS is not a member of Seq[Array[Array[Int]]] >Seq(Array(Array(1))).toDS() > scala> spark.createDataset(Seq(Seq(Seq(1)))) > <console>:24: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. >spark.createDataset(Seq(Seq(Seq(1)))) > scala> Seq(Seq(Seq(1))).toDS() > <console>:24: error: value toDS is not a member of Seq[Seq[Seq[Int]]] >Seq(Seq(Seq(1))).toDS() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19089) Support nested arrays/seqs in Datasets
[ https://issues.apache.org/jira/browse/SPARK-19089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19089: Assignee: Apache Spark > Support nested arrays/seqs in Datasets > -- > > Key: SPARK-19089 > URL: https://issues.apache.org/jira/browse/SPARK-19089 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michal Šenkýř >Assignee: Apache Spark >Priority: Minor > > Nested arrays and seqs are not supported in Datasets: > {code} > scala> spark.createDataset(Seq(Array(Array(1)))) > <console>:24: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. >spark.createDataset(Seq(Array(Array(1)))) > ^ > scala> Seq(Array(Array(1))).toDS() > <console>:24: error: value toDS is not a member of Seq[Array[Array[Int]]] >Seq(Array(Array(1))).toDS() > scala> spark.createDataset(Seq(Seq(Seq(1)))) > <console>:24: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. >spark.createDataset(Seq(Seq(Seq(1)))) > scala> Seq(Seq(Seq(1))).toDS() > <console>:24: error: value toDS is not a member of Seq[Seq[Seq[Int]]] >Seq(Seq(Seq(1))).toDS() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19089) Support nested arrays/seqs in Datasets
[ https://issues.apache.org/jira/browse/SPARK-19089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013350#comment-16013350 ] Apache Spark commented on SPARK-19089: -- User 'michalsenkyr' has created a pull request for this issue: https://github.com/apache/spark/pull/18011 > Support nested arrays/seqs in Datasets > -- > > Key: SPARK-19089 > URL: https://issues.apache.org/jira/browse/SPARK-19089 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michal Šenkýř >Priority: Minor > > Nested arrays and seqs are not supported in Datasets: > {code} > scala> spark.createDataset(Seq(Array(Array(1)))) > <console>:24: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. >spark.createDataset(Seq(Array(Array(1)))) > ^ > scala> Seq(Array(Array(1))).toDS() > <console>:24: error: value toDS is not a member of Seq[Array[Array[Int]]] >Seq(Array(Array(1))).toDS() > scala> spark.createDataset(Seq(Seq(Seq(1)))) > <console>:24: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. >spark.createDataset(Seq(Seq(Seq(1)))) > scala> Seq(Seq(Seq(1))).toDS() > <console>:24: error: value toDS is not a member of Seq[Seq[Seq[Int]]] >Seq(Seq(Seq(1))).toDS() > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20778) Implement array_intersect function
[ https://issues.apache.org/jira/browse/SPARK-20778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20778: Assignee: (was: Apache Spark) > Implement array_intersect function > -- > > Key: SPARK-20778 > URL: https://issues.apache.org/jira/browse/SPARK-20778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Vandenberg >Priority: Minor > > Implement an array_intersect function that takes array arguments and returns > an array containing all elements of the first array that is common with the > remaining arrays. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20778) Implement array_intersect function
[ https://issues.apache.org/jira/browse/SPARK-20778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20778: Assignee: Apache Spark > Implement array_intersect function > -- > > Key: SPARK-20778 > URL: https://issues.apache.org/jira/browse/SPARK-20778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Vandenberg >Assignee: Apache Spark >Priority: Minor > > Implement an array_intersect function that takes array arguments and returns > an array containing all elements of the first array that is common with the > remaining arrays. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20778) Implement array_intersect function
[ https://issues.apache.org/jira/browse/SPARK-20778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013337#comment-16013337 ] Apache Spark commented on SPARK-20778: -- User 'ericvandenbergfb' has created a pull request for this issue: https://github.com/apache/spark/pull/18010 > Implement array_intersect function > -- > > Key: SPARK-20778 > URL: https://issues.apache.org/jira/browse/SPARK-20778 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Eric Vandenberg >Priority: Minor > > Implement an array_intersect function that takes array arguments and returns > an array containing all elements of the first array that is common with the > remaining arrays. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20778) Implement array_intersect function
Eric Vandenberg created SPARK-20778: --- Summary: Implement array_intersect function Key: SPARK-20778 URL: https://issues.apache.org/jira/browse/SPARK-20778 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Eric Vandenberg Priority: Minor Implement an array_intersect function that takes array arguments and returns an array containing all elements of the first array that is common with the remaining arrays. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
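The proposed semantics can be sketched in plain Python (an assumption about the intended behavior — for instance, how duplicates in the first array are handled is not specified above — not the actual SQL implementation):

```python
def array_intersect(first, *rest):
    """Elements of the first array that appear in every remaining array,
    in the first array's order (duplicate handling is a guess here)."""
    others = [set(arr) for arr in rest]
    return [x for x in first if all(x in s for s in others)]

assert array_intersect([1, 2, 3, 4], [2, 4, 6], [4, 2]) == [2, 4]
```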
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013317#comment-16013317 ] Josh Rosen commented on SPARK-18838: I think that SPARK-20776 / https://github.com/apache/spark/pull/18008 might help here: it addresses a nasty performance bug in JobProgressListener.onTaskStart caused by the unnecessary creation of empty TaskMetrics objects. [~sitalke...@gmail.com] [~zsxwing], I'd be interested to see if we can do a coarse-grained split between users' custom listeners and Spark's own internal listeners. If we're careful in performance optimization of Spark's core internal listeners (such as ExecutorAllocationManagerListener) then it might be okay to publish events directly to those listeners (without buffering) and use buffering only for third-party listeners where we don't want to risk perf. bugs slowing down the cluster. Alternatively, we could use two queues, one for internal listeners and another for external ones. This wouldn't be as fine-grained as thread-per-listener but might buy us a lot of the benefits with perhaps less code needed. > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and this delay might > hurt the job performance significantly or even fail the job. For example, a > significant delay in receiving the `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` manager to mistakenly remove an executor which is > not idle. 
> The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the Listeners for each event and processes each event > synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single-threaded processor often becomes the bottleneck for large jobs. > Also, if one of the Listeners is very slow, all the listeners will pay the > price of the delay incurred by the slow listener. In addition, a slow > listener can cause events to be dropped from the event queue, which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single-threaded event processor. Instead, each listener will have its own > dedicated single-threaded executor service. Whenever an event is posted, it > will be submitted to the executor service of every listener. The single-threaded > executor service will guarantee in-order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow the memory indefinitely. The downside of this > approach is that a separate event queue per listener will increase the driver memory > footprint.
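The proposal above (one dedicated single-threaded, bounded executor per listener) can be sketched in plain Java. The class and method names below are illustrative, not Spark's actual API; the capacity and drop policy are assumptions for the sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class PerListenerBus {
    private final List<Consumer<Object>> listeners = new ArrayList<>();
    private final List<ExecutorService> executors = new ArrayList<>();

    public void addListener(Consumer<Object> listener) {
        listeners.add(listener);
        // One single-threaded executor per listener: preserves per-listener
        // event order while isolating slow listeners from fast ones.
        // The bounded queue caps memory; when it fills up, the event is
        // silently dropped (DiscardPolicy), mirroring the dropped-event
        // behavior described above.
        executors.add(new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(10000),
            new ThreadPoolExecutor.DiscardPolicy()));
    }

    public void post(Object event) {
        // Posting never blocks the caller on a slow listener; each
        // listener consumes from its own queue at its own pace.
        for (int i = 0; i < listeners.size(); i++) {
            Consumer<Object> l = listeners.get(i);
            executors.get(i).submit(() -> l.accept(event));
        }
    }

    public void stop() throws InterruptedException {
        for (ExecutorService e : executors) {
            e.shutdown();
            e.awaitTermination(10, TimeUnit.SECONDS);
        }
    }
}
```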
[jira] [Commented] (SPARK-20235) Hive on S3 s3:sse and non S3:sse buckets
[ https://issues.apache.org/jira/browse/SPARK-20235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013290#comment-16013290 ] Franck Tago commented on SPARK-20235: - was this comment meant for me? What does that mean? > Hive on S3 s3:sse and non S3:sse buckets > - > > Key: SPARK-20235 > URL: https://issues.apache.org/jira/browse/SPARK-20235 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Franck Tago >Priority: Minor > > My Spark application writes into 2 Hive tables. > Both tables are external, with data residing on S3. > I want to encrypt the data when writing into Hive table 1, but I do not want > to encrypt the data when writing into Hive table 2. > Given that the parameter fs.s3a.server-side-encryption-algorithm is set > globally, I do not see how these use cases are supported in Spark.
[jira] [Commented] (SPARK-18891) Support for specific collection types
[ https://issues.apache.org/jira/browse/SPARK-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013278#comment-16013278 ] Apache Spark commented on SPARK-18891: -- User 'michalsenkyr' has created a pull request for this issue: https://github.com/apache/spark/pull/18009 > Support for specific collection types > - > > Key: SPARK-18891 > URL: https://issues.apache.org/jira/browse/SPARK-18891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.3, 2.1.0 >Reporter: Michael Armbrust >Priority: Critical > > Encoders treat all collections the same (i.e. {{Seq}} vs {{List}}) which > force users to only define classes with the most generic type. > An [example > error|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/2398463439880241/2840265927289860/latest.html]: > {code} > case class SpecificCollection(aList: List[Int]) > Seq(SpecificCollection(1 :: Nil)).toDS().collect() > {code} > {code} > java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 98, Column 120: No applicable constructor/method found > for actual parameters "scala.collection.Seq"; candidates are: > "line29e7e4b1e36445baa3505b2e102aa86b29.$read$$iw$$iw$$iw$$iw$SpecificCollection(scala.collection.immutable.List)" > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20777) Spark Streaming NullPointerException when restoring from hdfs checkpoint
Richard Moorhead created SPARK-20777: Summary: Spark Streaming NullPointerException when restoring from hdfs checkpoint Key: SPARK-20777 URL: https://issues.apache.org/jira/browse/SPARK-20777 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2 Environment: AWS EMR 5.2 Reporter: Richard Moorhead Priority: Minor When restoring a Spark Streaming job from a checkpoint, I am experiencing infrequent NullPointerExceptions when transformations are applied to RDDs with `foreachRDD`. See http://stackoverflow.com/questions/43984672/spark-kinesis-streaming-checkpoint-recovery-rdd-nullpointer-exception
[jira] [Updated] (SPARK-20776) Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-20776: --- Attachment: (was: screenshot-1.png) > Fix JobProgressListener perf. problems caused by empty TaskMetrics > initialization > - > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent constructing empty > TaskMetrics instances. As I'll show in a PR, we can slightly simplify the > code to remove the need to construct one empty TaskMetrics per > onTaskSubmitted event. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20776) Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-20776: --- Description: In {code} ./bin/spark-shell --master=local[64] {code} I ran {code} sc.parallelize(1 to 10, 10).count() {code} and profiled the time spend in the LiveListenerBus event processing thread. I discovered that the majority of the time was being spent constructing empty TaskMetrics instances. As I'll show in a PR, we can slightly simplify the code to remove the need to construct one empty TaskMetrics per onTaskSubmitted event. was: In {code} ./bin/spark-shell --master=local[64] {code} I ran {code} sc.parallelize(1 to 10, 10).count() {code} and profiled the time spend in the LiveListenerBus event processing thread. I discovered that the majority of the time was being spent initializing the {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to remove this bottleneck and prevent dropped listener events. > Fix JobProgressListener perf. problems caused by empty TaskMetrics > initialization > - > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent constructing empty > TaskMetrics instances. As I'll show in a PR, we can slightly simplify the > code to remove the need to construct one empty TaskMetrics per > onTaskSubmitted event. 
[jira] [Updated] (SPARK-20776) Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-20776: --- Description: In {code} ./bin/spark-shell --master=local[64] {code} I ran {code} sc.parallelize(1 to 10, 10).count() {code} and profiled the time spend in the LiveListenerBus event processing thread. I discovered that the majority of the time was being spent constructing empty TaskMetrics instances inside JobProgressListener. As I'll show in a PR, we can slightly simplify the code to remove the need to construct one empty TaskMetrics per onTaskSubmitted event. was: In {code} ./bin/spark-shell --master=local[64] {code} I ran {code} sc.parallelize(1 to 10, 10).count() {code} and profiled the time spend in the LiveListenerBus event processing thread. I discovered that the majority of the time was being spent constructing empty TaskMetrics instances. As I'll show in a PR, we can slightly simplify the code to remove the need to construct one empty TaskMetrics per onTaskSubmitted event. > Fix JobProgressListener perf. problems caused by empty TaskMetrics > initialization > - > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent constructing empty > TaskMetrics instances inside JobProgressListener. As I'll show in a PR, we > can slightly simplify the code to remove the need to construct one empty > TaskMetrics per onTaskSubmitted event. 
[jira] [Updated] (SPARK-20776) Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-20776: --- Summary: Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization (was: Fix performance problems in TaskMetrics.nameToAccums map initialization) > Fix JobProgressListener perf. problems caused by empty TaskMetrics > initialization > - > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Attachments: screenshot-1.png > > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent initializing the > {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the > use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to > remove this bottleneck and prevent dropped listener events. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013247#comment-16013247 ] Ruslan Dautkhanov commented on SPARK-16441: --- We did not have spark.dynamicAllocation.maxExecutors set either, and after we set it to a value equal to the maximum number of containers possible per YARN allocation, the issue {noformat} Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler {noformat} happens less often, but still happens. We suspect that having concurrentSQL turned on also demands higher values of spark.scheduler.listenerbus.eventqueue.size. > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0, 2.1.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai > Attachments: SPARK-16441-compare-apply-PR-16819.zip, > SPARK-16441-stage.jpg, SPARK-16441-threadDump.jpg, > SPARK-16441-yarn-metrics.jpg > > > The Spark application is waiting for an RPC response all the time and the Spark > listener is blocked by dynamic allocation. Executors cannot connect to the > driver and are lost. 
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > 
org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618) > - waiting to lock <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListener
[jira] [Commented] (SPARK-20776) Fix performance problems in TaskMetrics.nameToAccums map initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013220#comment-16013220 ] Apache Spark commented on SPARK-20776: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/18008 > Fix performance problems in TaskMetrics.nameToAccums map initialization > --- > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Attachments: screenshot-1.png > > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent initializing the > {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the > use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to > remove this bottleneck and prevent dropped listener events. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20776) Fix performance problems in TaskMetrics.nameToAccums map initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20776: Assignee: Apache Spark (was: Josh Rosen) > Fix performance problems in TaskMetrics.nameToAccums map initialization > --- > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Apache Spark > Attachments: screenshot-1.png > > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent initializing the > {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the > use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to > remove this bottleneck and prevent dropped listener events. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20776) Fix performance problems in TaskMetrics.nameToAccums map initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20776: Assignee: Josh Rosen (was: Apache Spark) > Fix performance problems in TaskMetrics.nameToAccums map initialization > --- > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Attachments: screenshot-1.png > > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent initializing the > {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the > use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to > remove this bottleneck and prevent dropped listener events. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20776) Fix performance problems in TaskMetrics.nameToAccums map initialization
Josh Rosen created SPARK-20776: -- Summary: Fix performance problems in TaskMetrics.nameToAccums map initialization Key: SPARK-20776 URL: https://issues.apache.org/jira/browse/SPARK-20776 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Attachments: screenshot-1.png In {code} ./bin/spark-shell --master=local[64] {code} I ran {code} sc.parallelize(1 to 10, 10).count() {code} and profiled the time spent in the LiveListenerBus event processing thread. I discovered that the majority of the time was being spent initializing the {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to remove this bottleneck and prevent dropped listener events.
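The fix described here (replacing Scala's LinkedHashMap with a pre-sized Java hashmap) rests on a general point: pre-sizing a hash map for a known entry count avoids repeated resizing and rehashing on a hot path. A minimal standalone Java illustration; the method and key names are made up for the sketch and are not Spark's:

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMapDemo {
    // Builds a map of n entries. Pre-sizing with the expected count
    // (divided by HashMap's default 0.75 load factor) lets the backing
    // array be allocated once instead of being grown and rehashed
    // repeatedly as entries are inserted.
    public static Map<String, Long> buildMetrics(int n) {
        // Capacity chosen so that n entries fit without triggering a resize.
        Map<String, Long> m = new HashMap<>((int) (n / 0.75f) + 1);
        for (int i = 0; i < n; i++) {
            m.put("metric." + i, 0L);
        }
        return m;
    }

    public static void main(String[] args) {
        Map<String, Long> metrics = buildMetrics(16);
        System.out.println(metrics.size());  // 16
    }
}
```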
[jira] [Updated] (SPARK-20776) Fix performance problems in TaskMetrics.nameToAccums map initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-20776: --- Attachment: screenshot-1.png > Fix performance problems in TaskMetrics.nameToAccums map initialization > --- > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Attachments: screenshot-1.png > > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent initializing the > {{TaskMetrics.nameToAccums}} map (see attached screenshot). By replacing the > use of Scala's LinkedHashMap with a pre-sized Java hashmap I was able to > remove this bottleneck and prevent dropped listener events. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15703) Make ListenerBus event queue size configurable
[ https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013208#comment-16013208 ] Ruslan Dautkhanov commented on SPARK-15703: --- We keep running into this issue too - it would be great to document spark.scheduler.listenerbus.eventqueue.size > Make ListenerBus event queue size configurable > -- > > Key: SPARK-15703 > URL: https://issues.apache.org/jira/browse/SPARK-15703 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Thomas Graves >Assignee: Dhruve Ashar >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot > 2016-06-01 at 11.23.48 AM.png, spark-dynamic-executor-allocation.png, > SparkListenerBus .png > > > The Spark UI doesn't seem to be showing all the tasks and metrics. > I ran a job with 10 tasks but the Detail stage page says it completed 93029: > Summary Metrics for 93029 Completed Tasks > The Stages for all jobs pages list that only 89519/10 tasks finished but > it's completed. The metrics for shuffled write and input are also incorrect. > I will attach screen shots. > I checked the logs and it does show that all the tasks actually finished. > 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 > (TID 54038) in 265309 ms on 10.213.45.51 (10/10) > 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks > have all completed, from pool
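The dropped-event behavior that makes this queue size worth tuning comes from a bounded queue whose non-blocking offer fails when the queue is full. A minimal Java sketch of that mechanism; the class name and capacity are illustrative, and Spark's real queue lives inside LiveListenerBus:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedEventQueue {
    private final BlockingQueue<Object> queue;
    private long droppedCount = 0;

    public BoundedEventQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking post: if the consumer thread has fallen behind and the
    // queue is full, the event is dropped and counted. This is the point
    // where "Dropping SparkListenerEvent ..." style warnings would be
    // logged, and why a larger configured capacity reduces drops.
    public boolean post(Object event) {
        if (queue.offer(event)) {
            return true;
        }
        droppedCount++;
        return false;
    }

    public long dropped() { return droppedCount; }

    public int size() { return queue.size(); }
}
```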
[jira] [Comment Edited] (SPARK-20700) InferFiltersFromConstraints stackoverflows for query (v2)
[ https://issues.apache.org/jira/browse/SPARK-20700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013199#comment-16013199 ] Jiang Xingbo edited comment on SPARK-20700 at 5/16/17 10:23 PM: In the previous approach we used `aliasMap` to link an `Attribute` to the expression with potentially the form `f(a, b)`, but we only searched the `expressions` and `children.expressions` for this, which is not enough when an `Alias` may lie deep in the logical plan. In that case, we can't generate the valid equivalent constraint classes and thus we fail at preventing the recursive deductions. I'll send a PR to fix this later today. was (Author: jiangxb1987): In the previous approach we used `aliasMap` to link an `Attribute` to the expression with potentially the form `f(a, b)`, but we only searched the `expressions` and `children.expressions` for this, which is not enough when an `Alias` may lie deep in the logical plan. In that case, we can't generate the valid equivalent constraint classes and thus we fail to prevent the recursive deductions. I'll send a PR to fix this later today. 
> InferFiltersFromConstraints stackoverflows for query (v2) > - > > Key: SPARK-20700 > URL: https://issues.apache.org/jira/browse/SPARK-20700 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Jiang Xingbo > > The following (complicated) query eventually fails with a stack overflow > during optimization: > {code} > CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, > int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES > ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', > TIMESTAMP('2015-01-14 00:00:00.0'), '947'), > ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', > TIMESTAMP('1999-08-15 00:00:00.0'), '437'), > ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', > TIMESTAMP('1991-05-23 00:00:00.0'), '630'), > ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), > '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'), > ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS > STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'), > ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', > CAST(NULL AS TIMESTAMP), '-740'), > ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL > AS TIMESTAMP), CAST(NULL AS STRING)), > ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', > CAST(NULL AS TIMESTAMP), '181'), > ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', > TIMESTAMP('2016-06-30 00:00:00.0'), '487'), > ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS > STRING), CAST(NULL AS TIMESTAMP), '-62'); > CREATE VIEW bools(a, b) as values (1, true), (1, true), (1, null); > SELECT > AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 PRECEDING ) AS > float_col, > COUNT(t1.smallint_col_2) AS int_col > FROM table_5 t1 > INNER JOIN ( > SELECT > (MIN(-83) OVER (PARTITION BY t2.a ORDER 
BY t2.a, (t1.int_col_4) * > (t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 15 FOLLOWING)) NOT IN (-222, 928) > AS boolean_col, > t2.a, > (t1.int_col_4) * (t1.int_col_4) AS int_col > FROM table_5 t1 > LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4) > WHERE > (t1.smallint_col_2) > (t1.smallint_col_2) > GROUP BY > t2.a, > (t1.int_col_4) * (t1.int_col_4) > HAVING > ((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * (t1.int_col_4), > SUM(t1.int_col_4)) > ) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = (t1.int_col_4))) AND > ((t2.a) = (t1.smallint_col_2)); > {code} > (I haven't tried to minimize this failing case yet). > Based on sampled jstacks from the driver, it looks like the query might be > repeatedly inferring filters from constraints and then pruning those filters. > Here's part of the stack at the point where it stackoverflows: > {code} > [... repeats ...] > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(Tra
[jira] [Commented] (SPARK-20700) InferFiltersFromConstraints stackoverflows for query (v2)
[ https://issues.apache.org/jira/browse/SPARK-20700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013199#comment-16013199 ] Jiang Xingbo commented on SPARK-20700: -- In the previous approach we used `aliasMap` to link an `Attribute` to the expression with potentially the form `f(a, b)`, but we only searched the `expressions` and `children.expressions` for this, which is not enough when an `Alias` may lies deep in the logical plan. In that case, we can't generate the valid equivalent constraint classes and thus we fail to prevent the recursive deductions. I'll send a PR to fix this later today. > InferFiltersFromConstraints stackoverflows for query (v2) > - > > Key: SPARK-20700 > URL: https://issues.apache.org/jira/browse/SPARK-20700 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Jiang Xingbo > > The following (complicated) query eventually fails with a stack overflow > during optimization: > {code} > CREATE TEMPORARY VIEW table_5(varchar0002_col_1, smallint_col_2, float_col_3, > int_col_4, string_col_5, timestamp_col_6, string_col_7) AS VALUES > ('68', CAST(NULL AS SMALLINT), CAST(244.90413 AS FLOAT), -137, '571', > TIMESTAMP('2015-01-14 00:00:00.0'), '947'), > ('82', CAST(213 AS SMALLINT), CAST(53.184647 AS FLOAT), -724, '-278', > TIMESTAMP('1999-08-15 00:00:00.0'), '437'), > ('-7', CAST(-15 AS SMALLINT), CAST(NULL AS FLOAT), -890, '778', > TIMESTAMP('1991-05-23 00:00:00.0'), '630'), > ('22', CAST(676 AS SMALLINT), CAST(385.27386 AS FLOAT), CAST(NULL AS INT), > '-10', TIMESTAMP('1996-09-29 00:00:00.0'), '641'), > ('16', CAST(430 AS SMALLINT), CAST(187.23717 AS FLOAT), 989, CAST(NULL AS > STRING), TIMESTAMP('2024-04-21 00:00:00.0'), '-234'), > ('83', CAST(760 AS SMALLINT), CAST(-695.45386 AS FLOAT), -970, '330', > CAST(NULL AS TIMESTAMP), '-740'), > ('68', CAST(-930 AS SMALLINT), CAST(NULL AS FLOAT), -915, '-766', CAST(NULL > AS TIMESTAMP), 
CAST(NULL AS STRING)), > ('48', CAST(692 AS SMALLINT), CAST(-220.59615 AS FLOAT), 940, '-514', > CAST(NULL AS TIMESTAMP), '181'), > ('21', CAST(44 AS SMALLINT), CAST(NULL AS FLOAT), -175, '761', > TIMESTAMP('2016-06-30 00:00:00.0'), '487'), > ('50', CAST(953 AS SMALLINT), CAST(837.2948 AS FLOAT), 705, CAST(NULL AS > STRING), CAST(NULL AS TIMESTAMP), '-62'); > CREATE VIEW bools(a, b) as values (1, true), (1, true), (1, null); > SELECT > AVG(-13) OVER (ORDER BY COUNT(t1.smallint_col_2) DESC ROWS 27 PRECEDING ) AS > float_col, > COUNT(t1.smallint_col_2) AS int_col > FROM table_5 t1 > INNER JOIN ( > SELECT > (MIN(-83) OVER (PARTITION BY t2.a ORDER BY t2.a, (t1.int_col_4) * > (t1.int_col_4) ROWS BETWEEN CURRENT ROW AND 15 FOLLOWING)) NOT IN (-222, 928) > AS boolean_col, > t2.a, > (t1.int_col_4) * (t1.int_col_4) AS int_col > FROM table_5 t1 > LEFT JOIN bools t2 ON (t2.a) = (t1.int_col_4) > WHERE > (t1.smallint_col_2) > (t1.smallint_col_2) > GROUP BY > t2.a, > (t1.int_col_4) * (t1.int_col_4) > HAVING > ((t1.int_col_4) * (t1.int_col_4)) IN ((t1.int_col_4) * (t1.int_col_4), > SUM(t1.int_col_4)) > ) t2 ON (((t2.int_col) = (t1.int_col_4)) AND ((t2.a) = (t1.int_col_4))) AND > ((t2.a) = (t1.smallint_col_2)); > {code} > (I haven't tried to minimize this failing case yet). > Based on sampled jstacks from the driver, it looks like the query might be > repeatedly inferring filters from constraints and then pruning those filters. > Here's part of the stack at the point where it stackoverflows: > {code} > [... repeats ...] 
> at > org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative$1.apply(Canonicalize.scala:50) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$.org$apache$spark$sql$catalyst$expressions$Canonicalize$$gatherCommutative(Canonicalize.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Canonicalize$$anonfun$org$apache$spark$sql$ca
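[Editor's sketch] The stack trace above shows `gatherCommutative` recursing once per level of expression nesting, so an optimizer loop that keeps re-inferring constraints over ever-deeper expressions eventually exhausts the stack. A minimal Python analogue (names and structure are illustrative, not Spark's actual code):

```python
import sys

def gather_commutative(expr):
    # Recursively flatten nested operands of the same commutative operator,
    # the way Canonicalize.gatherCommutative does. Recursion depth grows
    # with expression nesting depth.
    op, operands = expr
    out = []
    for o in operands:
        if isinstance(o, tuple) and o[0] == op:
            out.extend(gather_commutative(o))
        else:
            out.append(o)
    return out

# Build a left-deep chain of 10,000 nested '+' nodes, standing in for the
# constraints the infer/prune cycle keeps regenerating.
expr = ("+", [1, 2])
for i in range(10_000):
    expr = ("+", [expr, i])

try:
    gather_commutative(expr)
    overflowed = False
except RecursionError:
    overflowed = True
print(overflowed)  # True: depth exceeds the interpreter's recursion limit
```

The fix described in the comment is to resolve aliases wherever they occur in the plan so the constraint classes reach a fixpoint instead of deepening on every pass.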
[jira] [Resolved] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries
[ https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-20140. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 > Remove hardcoded kinesis retry wait and max retries > --- > > Key: SPARK-20140 > URL: https://issues.apache.org/jira/browse/SPARK-20140 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma >Assignee: Yash Sharma > Labels: kinesis, recovery > Fix For: 2.2.1, 2.3.0 > > > The pull requests proposes to remove the hardcoded values for Amazon Kinesis > - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES. > This change is critical for kinesis checkpoint recovery when the kinesis > backed rdd is huge. > Following happens in a typical kinesis recovery : > - kinesis throttles large number of requests while recovering > - retries in case of throttling are not able to recover due to the small wait > period > - kinesis throttles per second, the wait period should be configurable for > recovery > The patch picks the spark kinesis configs from: > - spark.streaming.kinesis.retry.wait.time > - spark.streaming.kinesis.retry.max.attempts -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries
[ https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013191#comment-16013191 ] Burak Yavuz commented on SPARK-20140: - resolved by https://github.com/apache/spark/pull/17467 > Remove hardcoded kinesis retry wait and max retries > --- > > Key: SPARK-20140 > URL: https://issues.apache.org/jira/browse/SPARK-20140 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma >Assignee: Yash Sharma > Labels: kinesis, recovery > > The pull requests proposes to remove the hardcoded values for Amazon Kinesis > - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES. > This change is critical for kinesis checkpoint recovery when the kinesis > backed rdd is huge. > Following happens in a typical kinesis recovery : > - kinesis throttles large number of requests while recovering > - retries in case of throttling are not able to recover due to the small wait > period > - kinesis throttles per second, the wait period should be configurable for > recovery > The patch picks the spark kinesis configs from: > - spark.streaming.kinesis.retry.wait.time > - spark.streaming.kinesis.retry.max.attempts -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20140) Remove hardcoded kinesis retry wait and max retries
[ https://issues.apache.org/jira/browse/SPARK-20140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-20140: --- Assignee: Yash Sharma > Remove hardcoded kinesis retry wait and max retries > --- > > Key: SPARK-20140 > URL: https://issues.apache.org/jira/browse/SPARK-20140 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma >Assignee: Yash Sharma > Labels: kinesis, recovery > > The pull requests proposes to remove the hardcoded values for Amazon Kinesis > - MIN_RETRY_WAIT_TIME_MS, MAX_RETRIES. > This change is critical for kinesis checkpoint recovery when the kinesis > backed rdd is huge. > Following happens in a typical kinesis recovery : > - kinesis throttles large number of requests while recovering > - retries in case of throttling are not able to recover due to the small wait > period > - kinesis throttles per second, the wait period should be configurable for > recovery > The patch picks the spark kinesis configs from: > - spark.streaming.kinesis.retry.wait.time > - spark.streaming.kinesis.retry.max.attempts -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit
[ https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-19372: Assignee: Kazuaki Ishizaki > Code generation for Filter predicate including many OR conditions exceeds JVM > method size limit > > > Key: SPARK-19372 > URL: https://issues.apache.org/jira/browse/SPARK-19372 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi >Assignee: Kazuaki Ishizaki > Fix For: 2.3.0 > > Attachments: wide400cols.csv > > > For the attached csv file, the code below causes the exception > "org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" > grows beyond 64 KB > Code: > {code:borderStyle=solid} > val conf = new SparkConf().setMaster("local[1]") > val sqlContext = > SparkSession.builder().config(conf).getOrCreate().sqlContext > val dataframe = > sqlContext > .read > .format("com.databricks.spark.csv") > .load("wide400cols.csv") > val filter = (0 to 399) > .foldLeft(lit(false))((e, index) => > e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}")) > val filtered = dataframe.filter(filter) > filtered.show(100) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit
[ https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19372. -- Resolution: Fixed Fix Version/s: 2.3.0 > Code generation for Filter predicate including many OR conditions exceeds JVM > method size limit > > > Key: SPARK-19372 > URL: https://issues.apache.org/jira/browse/SPARK-19372 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi > Fix For: 2.3.0 > > Attachments: wide400cols.csv > > > For the attached csv file, the code below causes the exception > "org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" > grows beyond 64 KB > Code: > {code:borderStyle=solid} > val conf = new SparkConf().setMaster("local[1]") > val sqlContext = > SparkSession.builder().config(conf).getOrCreate().sqlContext > val dataframe = > sqlContext > .read > .format("com.databricks.spark.csv") > .load("wide400cols.csv") > val filter = (0 to 399) > .foldLeft(lit(false))((e, index) => > e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}")) > val filtered = dataframe.filter(filter) > filtered.show(100) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
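[Editor's sketch] The `foldLeft` in the report builds one 400-way OR that codegen inlines into a single Java method, which is what overruns the JVM's 64 KB bytecode-per-method limit. A Python sketch of the shape of the problem and the usual mitigation (splitting the predicate into helper-method-sized chunks, as Spark's expression-splitting later does; the literal `'x'` stands in for the per-column comparison values):

```python
from functools import reduce

columns = [f"col{i}" for i in range(400)]

# Mirror the foldLeft from the report: start from `false` and OR in one
# inequality per column, yielding a single huge boolean expression.
predicate = reduce(lambda acc, c: f"({acc} OR {c} != 'x')", columns, "false")

# Generated code for this grows linearly with the column count; inlined into
# one method it eventually passes the JVM's 64 KB per-method limit. The fix
# is to emit one helper method per chunk of conditions and OR their results:
chunks = [columns[i:i + 50] for i in range(0, len(columns), 50)]
helper_bodies = [" || ".join(f"{c} != 'x'" for c in chunk) for chunk in chunks]
print(len(helper_bodies))  # 8 small methods instead of one giant one
```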
[jira] [Created] (SPARK-20775) from_json should also have an API where the schema is specified with a string
Burak Yavuz created SPARK-20775: --- Summary: from_json should also have an API where the schema is specified with a string Key: SPARK-20775 URL: https://issues.apache.org/jira/browse/SPARK-20775 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Burak Yavuz Right now you also have to provide a java.util.Map which is not nice for Scala users. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
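[Editor's sketch] The requested overload would accept a DDL-style schema string (e.g. `"a INT, b STRING"`) in place of a `java.util.Map`. A toy Python model of that API shape — the parser and `from_json` wrapper here are hypothetical, not Spark's implementation:

```python
import json

def parse_ddl(schema_str):
    # Toy DDL parser: "a INT, b STRING" -> {"a": "INT", "b": "STRING"}.
    fields = {}
    for part in schema_str.split(","):
        name, dtype = part.strip().split()
        fields[name] = dtype.upper()
    return fields

def from_json(value, schema):
    # Hypothetical overload: accept either a dict (the java.util.Map style)
    # or a DDL string, which is friendlier for Scala and SQL users.
    if isinstance(schema, str):
        schema = parse_ddl(schema)
    record = json.loads(value)
    return {k: record.get(k) for k in schema}

row = from_json('{"a": 1, "b": "x", "c": true}', "a INT, b STRING")
print(row)  # {'a': 1, 'b': 'x'}
```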
[jira] [Updated] (SPARK-20503) ML 2.2 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20503: -- Fix Version/s: 2.2.0 > ML 2.2 QA: API: Python API coverage > --- > > Key: SPARK-20503 > URL: https://issues.apache.org/jira/browse/SPARK-20503 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Blocker > Fix For: 2.2.0 > > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-20509. --- Resolution: Done Fix Version/s: 2.2.0 > SparkR 2.2 QA: New R APIs and API docs > -- > > Key: SPARK-20509 > URL: https://issues.apache.org/jira/browse/SPARK-20509 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > Fix For: 2.2.0 > > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20774) BroadcastExchangeExec doesn't cancel the Spark job if broadcasting a relation timeouts.
Shixiong Zhu created SPARK-20774: Summary: BroadcastExchangeExec doesn't cancel the Spark job if broadcasting a relation timeouts. Key: SPARK-20774 URL: https://issues.apache.org/jira/browse/SPARK-20774 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1, 2.1.0 Reporter: Shixiong Zhu When broadcasting a table takes too long and triggers timeout, the SQL query will fail. However, the background Spark job is still running and it wastes resources. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
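[Editor's sketch] The behavior the ticket asks for — actively cancelling the background job when the broadcast wait times out, instead of letting it keep consuming resources — can be modeled with a cancellation flag and a bounded wait (the timeout value is an analogue of `spark.sql.broadcastTimeout`; the worker below is a stand-in, not Spark code):

```python
import threading
import concurrent.futures

cancel = threading.Event()

def build_broadcast():
    # Stand-in for the background Spark job that collects the relation;
    # it cooperatively checks the cancellation flag.
    while not cancel.is_set():
        cancel.wait(0.01)
    return None

with concurrent.futures.ThreadPoolExecutor() as pool:
    fut = pool.submit(build_broadcast)
    try:
        fut.result(timeout=0.05)  # bounded wait for the broadcast to build
    except concurrent.futures.TimeoutError:
        # The fix: on timeout, tell the background work to stop rather than
        # failing the query while the job runs on.
        cancel.set()

print(cancel.is_set())  # True: the worker was told to stop
```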
[jira] [Commented] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013060#comment-16013060 ] Joseph K. Bradley commented on SPARK-20509: --- I checked the new and changed APIs, comparing the Scala and R docs. I did not see any issues, so I'll mark this complete. But [~felixcheung] and others, please say if you find any during your checks. Thanks! > SparkR 2.2 QA: New R APIs and API docs > -- > > Key: SPARK-20509 > URL: https://issues.apache.org/jira/browse/SPARK-20509 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > Fix For: 2.2.0 > > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14584) Improve recognition of non-nullability in Dataset transformations
[ https://issues.apache.org/jira/browse/SPARK-14584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013001#comment-16013001 ] Takeshi Yamamuro commented on SPARK-14584: -- Could we close this as resolved? It seems the merged pr in SPARK-18284 [~hyukjin.kwon] suggested includes test cases this ticket describes. > Improve recognition of non-nullability in Dataset transformations > - > > Key: SPARK-14584 > URL: https://issues.apache.org/jira/browse/SPARK-14584 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Josh Rosen > > There are many cases where we can statically know that a field will never be > null. For instance, a field in a case class with a primitive type will never > return null. However, there are currently several cases in the Dataset API > where we do not properly recognize this non-nullability. For instance: > {code} > case class MyCaseClass(foo: Int) > sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema > {code} > claims that the {{foo}} field is nullable even though this is impossible. > I believe that this is due to the way that we reason about nullability when > constructing serializer expressions in ExpressionEncoders. 
The following > assertion will currently fail if added to ExpressionEncoder: > {code} > require(schema.size == serializer.size) > schema.fields.zip(serializer).foreach { case (field, fieldSerializer) => > require(field.dataType == fieldSerializer.dataType, s"Field > ${field.name}'s data type is " + > s"${field.dataType} in the schema but ${fieldSerializer.dataType} in > its serializer") > require(field.nullable == fieldSerializer.nullable, s"Field > ${field.name}'s nullability is " + > s"${field.nullable} in the schema but ${fieldSerializer.nullable} in > its serializer") > } > {code} > Most often, the schema claims that a field is non-nullable while the encoder > allows for nullability, but occasionally we see a mismatch in the datatypes > due to disagreements over the nullability of nested structs' fields (or > fields of structs in arrays). > I think the problem is that when we're reasoning about nullability in a > struct's schema we consider its fields' nullability to be independent of the > nullability of the struct itself, whereas in the serializer expressions we > are considering those field extraction expressions to be nullable if the > input objects themselves can be nullable. > I'm not sure what's the simplest way to fix this. One proposal would be to > leave the serializers unchanged and have ObjectOperator derive its output > attributes from an explicitly-passed schema rather than using the > serializers' attributes. However, I worry that this might introduce bugs in > case the serializer and schema disagree. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
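[Editor's sketch] The quoted `require` block compares each schema field against its serializer expression. A minimal Python rendering of that consistency check, seeded with the ticket's `MyCaseClass(foo: Int)` example (tuples and labels are illustrative, not Spark's types):

```python
# (name, dataType, nullable) triples: what the schema says vs. what the
# encoder's serializer expression claims.
schema = [("foo", "int", False)]     # primitive Int field: can never be null
serializer = [("foo", "int", True)]  # serializer conservatively marks it nullable

mismatches = []
for (name, dtype, nullable), (_, s_dtype, s_nullable) in zip(schema, serializer):
    if dtype != s_dtype:
        mismatches.append(f"{name}: dataType {dtype} vs {s_dtype}")
    if nullable != s_nullable:
        mismatches.append(f"{name}: nullable {nullable} vs {s_nullable}")

print(mismatches)  # the exact disagreement the ticket describes
```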
[jira] [Assigned] (SPARK-20773) ParquetWriteSupport.writeFields is quadratic in number of fields
[ https://issues.apache.org/jira/browse/SPARK-20773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20773: Assignee: (was: Apache Spark) > ParquetWriteSupport.writeFields is quadratic in number of fields > > > Key: SPARK-20773 > URL: https://issues.apache.org/jira/browse/SPARK-20773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: T Poterba >Priority: Minor > Labels: easyfix, performance > Original Estimate: 10m > Remaining Estimate: 10m > > The writeFields method in ParquetWriteSupport uses Seq.apply(i) to select all > elements. Since the fieldWriters object is a List, this is a quadratic > operation. > See line 123: > https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20773) ParquetWriteSupport.writeFields is quadratic in number of fields
[ https://issues.apache.org/jira/browse/SPARK-20773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20773: Assignee: Apache Spark > ParquetWriteSupport.writeFields is quadratic in number of fields > > > Key: SPARK-20773 > URL: https://issues.apache.org/jira/browse/SPARK-20773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: T Poterba >Assignee: Apache Spark >Priority: Minor > Labels: easyfix, performance > Original Estimate: 10m > Remaining Estimate: 10m > > The writeFields method in ParquetWriteSupport uses Seq.apply(i) to select all > elements. Since the fieldWriters object is a List, this is a quadratic > operation. > See line 123: > https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20773) ParquetWriteSupport.writeFields is quadratic in number of fields
[ https://issues.apache.org/jira/browse/SPARK-20773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012992#comment-16012992 ] Apache Spark commented on SPARK-20773: -- User 'tpoterba' has created a pull request for this issue: https://github.com/apache/spark/pull/18005 > ParquetWriteSupport.writeFields is quadratic in number of fields > > > Key: SPARK-20773 > URL: https://issues.apache.org/jira/browse/SPARK-20773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: T Poterba >Priority: Minor > Labels: easyfix, performance > Original Estimate: 10m > Remaining Estimate: 10m > > The writeFields method in ParquetWriteSupport uses Seq.apply(i) to select all > elements. Since the fieldWriters object is a List, this is a quadratic > operation. > See line 123: > https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20773) ParquetWriteSupport.writeFields is quadratic in number of fields
[ https://issues.apache.org/jira/browse/SPARK-20773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] T Poterba updated SPARK-20773: -- Summary: ParquetWriteSupport.writeFields is quadratic in number of fields (was: ParquetWriteSupport.writeFields has is quadratic in number of fields) > ParquetWriteSupport.writeFields is quadratic in number of fields > > > Key: SPARK-20773 > URL: https://issues.apache.org/jira/browse/SPARK-20773 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: T Poterba >Priority: Minor > Labels: easyfix, performance > Original Estimate: 10m > Remaining Estimate: 10m > > The writeFields method in ParquetWriteSupport uses Seq.apply(i) to select all > elements. Since the fieldWriters object is a List, this is a quadratic > operation. > See line 123: > https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
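[Editor's sketch] The bug is the classic linked-list pitfall: `List.apply(i)` walks `i` links, so indexing every element inside a loop costs O(n²). A small Python model with a cons list (Python's built-in list has O(1) indexing, so the linked structure is built explicitly):

```python
class Cons:
    # Minimal singly linked list, like Scala's immutable List.
    def __init__(self, head, tail=None):
        self.head, self.tail = head, tail

    def apply(self, i):
        # Positional access walks i links: O(i) per call, so calling it for
        # every index in a loop is O(n^2) overall -- the writeFields bug.
        node = self
        for _ in range(i):
            node = node.tail
        return node.head

def build(n):
    lst = None
    for v in range(n - 1, -1, -1):
        lst = Cons(v, lst)
    return lst

lst = build(5)

# Quadratic pattern (indexing in a loop):
by_index = [lst.apply(i) for i in range(5)]

# Linear fix: traverse the list once instead of re-indexing from the head.
by_iter, node = [], lst
while node is not None:
    by_iter.append(node.head)
    node = node.tail

print(by_index == by_iter == [0, 1, 2, 3, 4])  # True, but at very different cost
```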
[jira] [Commented] (SPARK-20746) Built-in SQL Function Improvement
[ https://issues.apache.org/jira/browse/SPARK-20746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012982#comment-16012982 ] Takeshi Yamamuro commented on SPARK-20746: -- Could I take on some of them? [~smilegator] > Built-in SQL Function Improvement > - > > Key: SPARK-20746 > URL: https://issues.apache.org/jira/browse/SPARK-20746 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > SQL functions are part of the core of the ISO/ANSI standards. This umbrella > JIRA is trying to list all the ISO/ANS SQL functions that are not fully > implemented by Spark SQL, fix the documentation and test case issues in the > supported functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20773) ParquetWriteSupport.writeFields has is quadratic in number of fields
T Poterba created SPARK-20773: - Summary: ParquetWriteSupport.writeFields has is quadratic in number of fields Key: SPARK-20773 URL: https://issues.apache.org/jira/browse/SPARK-20773 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Reporter: T Poterba Priority: Minor The writeFields method in ParquetWriteSupport uses Seq.apply(i) to select all elements. Since the fieldWriters object is a List, this is a quadratic operation. See line 123: https://github.com/apache/spark/blob/ac1ab6b9db188ac54c745558d57dd0a031d0b162/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20509: - Assignee: Joseph K. Bradley > SparkR 2.2 QA: New R APIs and API docs > -- > > Key: SPARK-20509 > URL: https://issues.apache.org/jira/browse/SPARK-20509 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20509) SparkR 2.2 QA: New R APIs and API docs
[ https://issues.apache.org/jira/browse/SPARK-20509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012913#comment-16012913 ] Joseph K. Bradley commented on SPARK-20509: --- I'll take this one > SparkR 2.2 QA: New R APIs and API docs > -- > > Key: SPARK-20509 > URL: https://issues.apache.org/jira/browse/SPARK-20509 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Blocker > > Audit new public R APIs. Take note of: > * Correctness and uniformity of API > * Documentation: Missing? Bad links or formatting? > ** Check both the generated docs linked from the user guide and the R command > line docs `?read.df`. These are generated using roxygen. > As you find issues, please create JIRAs and link them to this issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20772) Add support for query parameters in redirects on Yarn
Bjorn Jonsson created SPARK-20772: - Summary: Add support for query parameters in redirects on Yarn Key: SPARK-20772 URL: https://issues.apache.org/jira/browse/SPARK-20772 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.1.0 Reporter: Bjorn Jonsson Priority: Minor Spark uses rewrites of query parameters to paths (http://:4040/jobs/job?id=0 --> http://:4040/jobs/job/?id=0). This works fine in local or standalone mode, but does not work on Yarn (with the org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter filter), where the query parameter is dropped. The repro steps are: - Start up the spark-shell in yarn client or cluster mode and run a job - Try to access the job details through http://:4040/jobs/job?id=0 - A HTTP ERROR 400 is thrown (requirement failed: missing id parameter) Going directly through the RM proxy works (does not cause query parameters to be dropped). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
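[Editor's sketch] The requested behavior is a trailing-slash redirect that carries the query string along instead of dropping it, as the AmIpFilter path does. A standalone Python illustration of the rewrite (the `host:4040` URL is a placeholder for the elided host in the report):

```python
from urllib.parse import urlsplit, urlunsplit

def redirect_with_query(url):
    # Rewrite /jobs/job -> /jobs/job/ while preserving the query string.
    # The reported bug: the proxied redirect dropped `?id=0` entirely,
    # producing "requirement failed: missing id parameter".
    parts = urlsplit(url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

print(redirect_with_query("http://host:4040/jobs/job?id=0"))
# http://host:4040/jobs/job/?id=0
```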
[jira] [Updated] (SPARK-20771) Usability issues with weekofyear()
[ https://issues.apache.org/jira/browse/SPARK-20771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-20771: -- Description: The weekofyear() implementation follows HIVE / ISO 8601 week number. However it is not useful because it doesn't return the year of the week start. For example, weekofyear("2017-01-01") returns 52 Anyone using this with groupBy('week) might do the aggregation or ordering wrong. A better implementation should return the year number of the week as well. MySQL's yearweek() is much better in this sense: https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek. Maybe we should implement that in Spark. was: The weekofyear() implementation follows HIVE / ISO 8601 week number. However it is not useful because it doesn't return the year of the week start. For example, weekofyear("2017-01-01") returns 52 Anyone using this with groupBy('week) might do the aggregation wrong. A better implementation should return the year number of the week as well. MySQL's yearweek() is much better in this sense: https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek. Maybe we should implement that in Spark. > Usability issues with weekofyear() > -- > > Key: SPARK-20771 > URL: https://issues.apache.org/jira/browse/SPARK-20771 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiangrui Meng >Priority: Minor > > The weekofyear() implementation follows HIVE / ISO 8601 week number. However > it is not useful because it doesn't return the year of the week start. For > example, > weekofyear("2017-01-01") returns 52 > Anyone using this with groupBy('week) might do the aggregation or ordering > wrong. A better implementation should return the year number of the week as > well. > MySQL's yearweek() is much better in this sense: > https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek. 
> Maybe we should implement that in Spark. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20771) Usability issues with weekofyear()
Xiangrui Meng created SPARK-20771: - Summary: Usability issues with weekofyear() Key: SPARK-20771 URL: https://issues.apache.org/jira/browse/SPARK-20771 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Xiangrui Meng Priority: Minor The weekofyear() implementation follows HIVE / ISO 8601 week number. However it is not useful because it doesn't return the year of the week start. For example, weekofyear("2017-01-01") returns 52 Anyone using this with groupBy('week) might do the aggregation wrong. A better implementation should return the year number of the week as well. MySQL's yearweek() is much better in this sense: https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_yearweek. Maybe we should implement that in Spark. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
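[Editor's sketch] The surprising value is easy to reproduce with ISO week numbering: 2017-01-01 is a Sunday and belongs to week 52 of 2016, so a bare week number is ambiguous across year boundaries. The `yearweek`-style helper below uses ISO weeks (this matches MySQL's mode-3 behavior; MySQL's default mode numbers weeks differently):

```python
from datetime import date

# ISO 8601 assigns 2017-01-01 to week 52 *of 2016* -- the week number alone,
# as weekofyear() returns it, silently groups this day with the wrong year.
iso_year, iso_week, _ = date(2017, 1, 1).isocalendar()
print(iso_week)  # 52
print(iso_year)  # 2016

def yearweek(d):
    # Encode the ISO year and week together so groupBy/orderBy are unambiguous.
    y, w, _ = d.isocalendar()
    return y * 100 + w

print(yearweek(date(2017, 1, 1)))  # 201652
```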
[jira] [Commented] (SPARK-15616) Metastore relation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012818#comment-16012818 ] Brian Zhang commented on SPARK-15616: - Hello, Just wondering what's the current status of this issue? I think this fix would be really helpful. Thanks! > Metastore relation should fallback to HDFS size of partitions that are > involved in Query if statistics are not available. > - > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Lianhui Wang > > Currently if some partitions of a partitioned table are used in join > operation we rely on Metastore returned size of table to calculate if we can > convert the operation to Broadcast join. > if Filter can prune some partitions, Hive can prune partition before > determining to use broadcast joins according to HDFS size of partitions that > are involved in Query.So sparkSQL needs it that can improve join's > performance for partitioned table. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
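[Editor's sketch] The proposed fallback is: when the metastore has no statistics, size the relation by summing the on-HDFS bytes of only the partitions that survive filter pruning, then compare against the broadcast threshold. A toy model of that decision (the function and threshold name are illustrative; the threshold is an analogue of `spark.sql.autoBroadcastJoinThreshold`):

```python
def effective_size(table_stats_bytes, pruned_partition_file_sizes):
    # Prefer metastore statistics when present; otherwise fall back to the
    # HDFS sizes of just the partitions the query actually touches.
    if table_stats_bytes is not None:
        return table_stats_bytes
    return sum(pruned_partition_file_sizes)

BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, the common default

# No stats, but the filter pruned the table down to two small partitions:
size = effective_size(None, [2 * 1024 * 1024, 3 * 1024 * 1024])
print(size <= BROADCAST_THRESHOLD)  # True: a broadcast join becomes possible
```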
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012813#comment-16012813 ] Apache Spark commented on SPARK-18838: -- User 'bOOm-X' has created a pull request for this issue: https://github.com/apache/spark/pull/18004 > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Currently we are observing very high event processing delay in the > driver's `ListenerBus` for large jobs with many tasks. Many critical > components of the scheduler, like `ExecutorAllocationManager` and > `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might > hurt job performance significantly or even fail the job. For example, a > significant delay in receiving `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the listeners for each event and processes each event > synchronously: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single-threaded processor often becomes the bottleneck for large jobs. > Also, if one of the listeners is very slow, all the listeners will pay the > price of the delay incurred by the slow listener. In addition, a slow > listener can cause events to be dropped from the event queue, which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single-threaded event processor. Instead, each listener will have its own > dedicated single-threaded executor service. Whenever an event is posted, it > will be submitted to the executor service of every listener.
The single-threaded > executor service will guarantee in-order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow memory indefinitely. The downside of this > approach is that a separate event queue per listener will increase the driver memory > footprint.
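The proposal above (one bounded queue drained by one dedicated thread per listener) can be sketched with the Python standard library; the class name, queue size, and drop-on-full policy here are illustrative, not Spark's actual implementation:

```python
import queue
import threading

class PerListenerBus:
    """Each listener gets its own bounded queue and drain thread, so a slow
    listener only delays itself; per-listener ordering is preserved."""

    def __init__(self, listeners, max_queue_size=1000):
        self._queues = []
        for listener in listeners:
            q = queue.Queue(maxsize=max_queue_size)  # bounded: memory cannot grow indefinitely
            threading.Thread(target=self._drain, args=(q, listener), daemon=True).start()
            self._queues.append(q)

    def post(self, event):
        for q in self._queues:
            try:
                q.put_nowait(event)  # only the lagging listener loses events
            except queue.Full:
                pass  # a real bus would count/log the drop

    @staticmethod
    def _drain(q, listener):
        while True:
            listener(q.get())  # events are processed in posting order per listener

# usage: a trivial listener that records events
received = []
bus = PerListenerBus([received.append])
bus.post("SparkListenerTaskStart")
```

Note the trade-off stated in the proposal: N listeners mean N queues, so driver memory grows with the listener count.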
[jira] [Resolved] (SPARK-20529) Worker should not use the received Master address
[ https://issues.apache.org/jira/browse/SPARK-20529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20529. -- Resolution: Fixed Fix Version/s: 2.2.0 > Worker should not use the received Master address > - > > Key: SPARK-20529 > URL: https://issues.apache.org/jira/browse/SPARK-20529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.2.0 > > > Right now, when a worker connects to the master, the master sends its address to the > worker. The worker then saves this address and uses it to reconnect in case of > failure. > However, sometimes this address is not correct. If there is a proxy between > the master and the worker, the address the master sent is not the address of the proxy.
[jira] [Commented] (SPARK-20503) ML 2.2 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012774#comment-16012774 ] Joseph K. Bradley commented on SPARK-20503: --- Thanks a lot! > ML 2.2 QA: API: Python API coverage > --- > > Key: SPARK-20503 > URL: https://issues.apache.org/jira/browse/SPARK-20503 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20770) Improve ColumnStats
[ https://issues.apache.org/jira/browse/SPARK-20770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20770: Assignee: Apache Spark > Improve ColumnStats > --- > > Key: SPARK-20770 > URL: https://issues.apache.org/jira/browse/SPARK-20770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > We improve the implementation of {{ColumnStats}} by using the following > approaches. > 1. Declare subclasses of {{ColumnStats}} as {{final}} > 2. Remove unnecessary call of {{row.isNullAt(ordinal)}} > 3. Remove the dependency on {{GenericInternalRow}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20770) Improve ColumnStats
[ https://issues.apache.org/jira/browse/SPARK-20770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012636#comment-16012636 ] Apache Spark commented on SPARK-20770: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/18002 > Improve ColumnStats > --- > > Key: SPARK-20770 > URL: https://issues.apache.org/jira/browse/SPARK-20770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > We improve the implementation of {{ColumnStats}} by using the following > approaches. > 1. Declare subclasses of {{ColumnStats}} as {{final}} > 2. Remove unnecessary call of {{row.isNullAt(ordinal)}} > 3. Remove the dependency on {{GenericInternalRow}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20770) Improve ColumnStats
[ https://issues.apache.org/jira/browse/SPARK-20770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20770: Assignee: (was: Apache Spark) > Improve ColumnStats > --- > > Key: SPARK-20770 > URL: https://issues.apache.org/jira/browse/SPARK-20770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > We improve the implementation of {{ColumnStats}} by using the following > approaches. > 1. Declare subclasses of {{ColumnStats}} as {{final}} > 2. Remove unnecessary call of {{row.isNullAt(ordinal)}} > 3. Remove the dependency on {{GenericInternalRow}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9215) Implement WAL-free Kinesis receiver that give at-least once guarantee
[ https://issues.apache.org/jira/browse/SPARK-9215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012591#comment-16012591 ] Richard Moorhead commented on SPARK-9215: - WAL is not necessary for fault-tolerant Kinesis streaming? Would checkpointing be enabled on the StreamingContext with writeAheadLog disabled then? > Implement WAL-free Kinesis receiver that give at-least once guarantee > - > > Key: SPARK-9215 > URL: https://issues.apache.org/jira/browse/SPARK-9215 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 1.4.1 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 1.5.0 > > > Currently, the KinesisReceiver can lose some data in the case of certain > failures (receiver and driver failures). Using the write ahead logs can > mitigate some of the problem, but it is not ideal because WALs don't work with > S3 (eventual consistency, etc.), which is the most likely file system to be > used in the EC2 environment. Hence, we have to take a different approach to > improving reliability for Kinesis. > Detailed design doc - > https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit?usp=sharing
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012542#comment-16012542 ] Zoltan Ivanfi edited comment on SPARK-12297 at 5/16/17 3:16 PM: What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table with some data and run a select in SparkSQL, then change the local timezone and run the same select again (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. was (Author: zi): What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table with some data and run a select in SparkSQL, then change the local timezone and run the same select again (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. I may have been thinking about ORC. > Add work-around for Parquet/Hive int96 timestamp bug. 
> - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from 
la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multi
[jira] [Comment Edited] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012542#comment-16012542 ] Zoltan Ivanfi edited comment on SPARK-12297 at 5/16/17 3:11 PM: What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table with some data and run a select in SparkSQL, then change the local timezone and run the same select again (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. I may have been thinking about ORC. was (Author: zi): What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table and insert a timestamp into it in SparkSQL, then change the local timezone and read the value back (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. I may have been thinking about ORC. > Add work-around for Parquet/Hive int96 timestamp bug. 
> - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from 
la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say
[jira] [Commented] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.
[ https://issues.apache.org/jira/browse/SPARK-12297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012542#comment-16012542 ] Zoltan Ivanfi commented on SPARK-12297: --- What I meant is that if a CSV file ("STORED AS TEXTFILE" in SQL terminology) contains a timestamp and you define the type of the column as TIMESTAMP, then SparkSQL interprets that timestamp as a local time value instead of a UTC-normalized one. So if you have such a table and insert a timestamp into it in SparkSQL, then change the local timezone and read the value back (using SparkSQL again), you will see the same timestamp. If you do the same with a Parquet table, you will see a different timestamp after changing the local timezone. I mentioned Avro as an example by mistake, as Avro-backed tables do not support the timestamp type at this moment. I may have been thinking about ORC. > Add work-around for Parquet/Hive int96 timestamp bug. > - > > Key: SPARK-12297 > URL: https://issues.apache.org/jira/browse/SPARK-12297 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Ryan Blue > > Spark copied Hive's behavior for parquet, but this was inconsistent with > other file formats, and inconsistent with Impala (which is the original > source of putting a timestamp as an int96 in parquet, I believe). This made > timestamps in parquet act more like timestamps with timezones, while in other > file formats, timestamps have no time zone, they are a "floating time". > The easiest way to see this issue is to write out a table with timestamps in > multiple different formats from one timezone, then try to read them back in > another timezone. 
Eg., here I write out a few timestamps to parquet and > textfile hive tables, and also just as a json file, all in the > "America/Los_Angeles" timezone: > {code} > import org.apache.spark.sql.Row > import org.apache.spark.sql.types._ > val tblPrefix = args(0) > val schema = new StructType().add("ts", TimestampType) > val rows = sc.parallelize(Seq( > "2015-12-31 23:50:59.123", > "2015-12-31 22:49:59.123", > "2016-01-01 00:39:59.123", > "2016-01-01 01:29:59.123" > ).map { x => Row(java.sql.Timestamp.valueOf(x)) }) > val rawData = spark.createDataFrame(rows, schema).toDF() > rawData.show() > Seq("parquet", "textfile").foreach { format => > val tblName = s"${tblPrefix}_$format" > spark.sql(s"DROP TABLE IF EXISTS $tblName") > spark.sql( > raw"""CREATE TABLE $tblName ( > | ts timestamp > | ) > | STORED AS $format > """.stripMargin) > rawData.write.insertInto(tblName) > } > rawData.write.json(s"${tblPrefix}_json") > {code} > Then I start a spark-shell in "America/New_York" timezone, and read the data > back from each table: > {code} > scala> spark.sql("select * from la_parquet").collect().foreach{println} > [2016-01-01 02:50:59.123] > [2016-01-01 01:49:59.123] > [2016-01-01 03:39:59.123] > [2016-01-01 04:29:59.123] > scala> spark.sql("select * from la_textfile").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").collect().foreach{println} > [2015-12-31 23:50:59.123] > [2015-12-31 22:49:59.123] > [2016-01-01 00:39:59.123] > [2016-01-01 01:29:59.123] > scala> spark.read.json("la_json").join(spark.sql("select * from > la_textfile"), "ts").show() > ++ > | ts| > ++ > |2015-12-31 23:50:...| > |2015-12-31 22:49:...| > |2016-01-01 00:39:...| > |2016-01-01 01:29:...| > ++ > scala> spark.read.json("la_json").join(spark.sql("select * from la_parquet"), > "ts").show() > +---+ > | ts| > +---+ > +---+ > {code} > The textfile and json based data shows the same 
times, and can be joined > against each other, while the times from the parquet data have changed (and > obviously joins fail). > This is a big problem for any organization that may try to read the same data > (say in S3) with clusters in multiple timezones. It can also be a nasty > surprise as an organization tries to migrate file formats. Finally, it's a > source of incompatibility between Hive, Impala, and Spark. > HIVE-12767 aims to fix this by introducing a table property which indicates > the "storage timezone" for the table. Spark should add the same to ensure > consistency between file formats, and with Hive & Impala.
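The session-timezone dependence described above can be reproduced with Python's standard library: an instant pinned to the UTC timeline (how Spark treats Parquet int96 values) renders differently in each zone, while a "floating" wall-clock value does not. The timestamp below mirrors the first row of the example; this is an illustration of the two semantics, not Spark code.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

la = ZoneInfo("America/Los_Angeles")
ny = ZoneInfo("America/New_York")

# Instant semantics (Parquet int96): the stored value is a point on the UTC
# timeline, so the rendered wall-clock time shifts with the reader's zone.
written = datetime(2015, 12, 31, 23, 50, 59, 123000, tzinfo=la)
read_in_ny = written.astimezone(ny)
print(read_in_ny.strftime("%Y-%m-%d %H:%M:%S.%f"))  # 2016-01-01 02:50:59.123000

# Floating semantics (the textfile/JSON tables in the example): the stored
# value is just the digits, so every reader sees the same wall-clock time.
floating = "2015-12-31 23:50:59.123"
print(floating)
```

The three-hour shift (PST is UTC-8, EST is UTC-5) is exactly the difference between the `la_parquet` and `la_textfile` results above, and it is why the join between the two produces no rows.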
[jira] [Created] (SPARK-20770) Improve ColumnStats
Kazuaki Ishizaki created SPARK-20770: Summary: Improve ColumnStats Key: SPARK-20770 URL: https://issues.apache.org/jira/browse/SPARK-20770 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki We improve the implementation of {{ColumnStats}} by using the following approaches. 1. Declare subclasses of {{ColumnStats}} as {{final}} 2. Remove unnecessary call of {{row.isNullAt(ordinal)}} 3. Remove the dependency on {{GenericInternalRow}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20769) Incorrect documentation for using Jupyter notebook
[ https://issues.apache.org/jira/browse/SPARK-20769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20769: Assignee: Apache Spark > Incorrect documentation for using Jupyter notebook > -- > > Key: SPARK-20769 > URL: https://issues.apache.org/jira/browse/SPARK-20769 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Andrew Ray >Assignee: Apache Spark >Priority: Minor > > SPARK-13973 incorrectly removed the required > PYSPARK_DRIVER_PYTHON_OPTS="notebook" from documentation to use pyspark with > Jupyter notebook -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20769) Incorrect documentation for using Jupyter notebook
[ https://issues.apache.org/jira/browse/SPARK-20769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20769: Assignee: (was: Apache Spark) > Incorrect documentation for using Jupyter notebook > -- > > Key: SPARK-20769 > URL: https://issues.apache.org/jira/browse/SPARK-20769 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Andrew Ray >Priority: Minor > > SPARK-13973 incorrectly removed the required > PYSPARK_DRIVER_PYTHON_OPTS="notebook" from documentation to use pyspark with > Jupyter notebook -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20769) Incorrect documentation for using Jupyter notebook
[ https://issues.apache.org/jira/browse/SPARK-20769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012527#comment-16012527 ] Apache Spark commented on SPARK-20769: -- User 'aray' has created a pull request for this issue: https://github.com/apache/spark/pull/18001 > Incorrect documentation for using Jupyter notebook > -- > > Key: SPARK-20769 > URL: https://issues.apache.org/jira/browse/SPARK-20769 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Andrew Ray >Priority: Minor > > SPARK-13973 incorrectly removed the required > PYSPARK_DRIVER_PYTHON_OPTS="notebook" from documentation to use pyspark with > Jupyter notebook -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called
[ https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-14098: - Description: [Here|https://docs.google.com/document/d/1-2BnW5ibuHIeQzmHEGIGkEcuMUCTk87pmPis2DKRg-Q/edit?usp=sharing] is a design document for this change (***TODO: Update the document***). This JIRA implements a new in-memory cache feature used by DataFrame.cache and Dataset.cache. The following is the basic design based on discussions with Sameer, Weichen, Xiao, Herman, and Nong. * Use ColumnarBatch with ColumnVector, which are common data representations for columnar storage * Use multiple compression schemes (such as RLE, int-delta, and so on) for each ColumnVector in ColumnarBatch, depending on its data type * Generate code that is simple and specialized for each in-memory cache to build an in-memory cache * Generate code that directly reads data from ColumnVector for the in-memory cache by whole-stage codegen * Enhance ColumnVector to keep UnsafeArrayData * Use a primitive-type array for primitive uncompressed data types in ColumnVector * Use byte[] for UnsafeArrayData and compressed data Based on this design, this JIRA generates two kinds of Java code for DataFrame.cache()/Dataset.cache(): * Generate Java code to build CachedColumnarBatch, which keeps data in ColumnarBatch * Generate Java code to get a value of each column from ColumnarBatch ** a Get a value directly from ColumnarBatch in code generated by whole-stage codegen (primary path) ** b Get a value through an iterator if whole-stage codegen is disabled (e.g. # of columns is more than 100, as backup path) was: When DataFrame.cache() is called, data is stored as column-oriented storage in CachedBatch. The current Catalyst generates a Java program to get a value of a column from an InternalRow that is translated from CachedBatch. This issue generates Java code to get a value of a column from CachedBatch.
While a column for a cache may be compressed, this issue handles float and double types, which are never compressed. Other primitive types, whose columns may be compressed, will be addressed in another entry. > Generate Java code to build CachedColumnarBatch and get values from > CachedColumnarBatch when DataFrame.cache() is called > > > Key: SPARK-14098 > URL: https://issues.apache.org/jira/browse/SPARK-14098 > Project: Spark > Issue Type: Umbrella > Components: SQL >Reporter: Kazuaki Ishizaki > > [Here|https://docs.google.com/document/d/1-2BnW5ibuHIeQzmHEGIGkEcuMUCTk87pmPis2DKRg-Q/edit?usp=sharing] > is a design document for this change (***TODO: Update the document***). > This JIRA implements a new in-memory cache feature used by DataFrame.cache > and Dataset.cache. The following is the basic design based on discussions with > Sameer, Weichen, Xiao, Herman, and Nong. > * Use ColumnarBatch with ColumnVector, which are common data representations > for columnar storage > * Use multiple compression schemes (such as RLE, int-delta, and so on) for each > ColumnVector in ColumnarBatch, depending on its data type > * Generate code that is simple and specialized for each in-memory cache to > build an in-memory cache > * Generate code that directly reads data from ColumnVector for the in-memory > cache by whole-stage codegen > * Enhance ColumnVector to keep UnsafeArrayData > * Use a primitive-type array for primitive uncompressed data types in > ColumnVector > * Use byte[] for UnsafeArrayData and compressed data > Based on this design, this JIRA generates two kinds of Java code for > DataFrame.cache()/Dataset.cache(): > * Generate Java code to build CachedColumnarBatch, which keeps data in > ColumnarBatch > * Generate Java code to get a value of each column from ColumnarBatch > ** a Get a value directly from ColumnarBatch in code generated by whole-stage > codegen (primary path) > ** b Get a value through an iterator if whole-stage codegen is disabled (e.g.
# > of columns is more than 100, as backup path) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
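One of the per-column compression schemes named in the design above, run-length encoding, can be sketched as follows (a simplified stand-in for the columnar cache internals, not Spark's actual code):

```python
def rle_encode(values):
    """Run-length encode a column as [(run_length, value), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, v])       # start a new run
    return [(n, v) for n, v in runs]

def rle_decode(runs):
    """Expand the runs back into the original column values."""
    return [v for n, v in runs for _ in range(n)]

# Columns with long runs of repeated values (common for low-cardinality
# columns) compress well; the decode path restores them losslessly.
column = [0, 0, 0, 7, 7, 1]
encoded = rle_encode(column)
print(encoded)  # [(3, 0), (2, 7), (1, 1)]
assert rle_decode(encoded) == column
```

The design's point about picking a scheme per data type follows from this: RLE pays off only when values repeat, which is why float and double columns are left uncompressed in the first cut.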
[jira] [Created] (SPARK-20769) Incorrect documentation for using Jupyter notebook
Andrew Ray created SPARK-20769: -- Summary: Incorrect documentation for using Jupyter notebook Key: SPARK-20769 URL: https://issues.apache.org/jira/browse/SPARK-20769 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 2.1.1 Reporter: Andrew Ray Priority: Minor SPARK-13973 incorrectly removed the required PYSPARK_DRIVER_PYTHON_OPTS="notebook" from documentation to use pyspark with Jupyter notebook -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
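The documented way to launch pyspark inside a Jupyter notebook needs both variables; the point of the report is that dropping the `OPTS` setting breaks it, because `jupyter` requires the "notebook" subcommand:

```shell
# Both variables are required: PYSPARK_DRIVER_PYTHON picks the driver
# binary, and PYSPARK_DRIVER_PYTHON_OPTS supplies the "notebook" subcommand.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
echo "pyspark driver command: $PYSPARK_DRIVER_PYTHON $PYSPARK_DRIVER_PYTHON_OPTS"
# then start as usual: ./bin/pyspark
```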
[jira] [Updated] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called
[ https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuaki Ishizaki updated SPARK-14098:
-
Issue Type: Umbrella (was: Improvement)

> Generate Java code to build CachedColumnarBatch and get values from
> CachedColumnarBatch when DataFrame.cache() is called
>
> Key: SPARK-14098
> URL: https://issues.apache.org/jira/browse/SPARK-14098
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Reporter: Kazuaki Ishizaki
>
> When DataFrame.cache() is called, data is stored as column-oriented storage
> in CachedBatch. The current Catalyst generates a Java program to get a value
> of a column from an InternalRow that is translated from CachedBatch. This
> issue generates Java code to get a value of a column from CachedBatch
> directly. While a column for a cache may be compressed, this issue handles
> float and double types, which are never compressed. Other primitive types,
> whose columns may be compressed, will be addressed in another entry.
[jira] [Commented] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called
[ https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012488#comment-16012488 ]

Kazuaki Ishizaki commented on SPARK-14098:
--
[~lins05] Sorry, I overlooked this message. I synced the titles, at least; I will update the description later.

> Generate Java code to build CachedColumnarBatch and get values from
> CachedColumnarBatch when DataFrame.cache() is called
[jira] [Updated] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called
[ https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuaki Ishizaki updated SPARK-14098:
-
Summary: Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called (was: Generate code that get a float/double value in each column from CachedBatch when DataFrame.cache() is called)

> Generate Java code to build CachedColumnarBatch and get values from
> CachedColumnarBatch when DataFrame.cache() is called
[jira] [Commented] (SPARK-20748) Built-in SQL Function Support - CH[A]R
[ https://issues.apache.org/jira/browse/SPARK-20748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012487#comment-16012487 ]

Yuming Wang commented on SPARK-20748:
-
I am working on this.

> Built-in SQL Function Support - CH[A]R
> --
>
> Key: SPARK-20748
> URL: https://issues.apache.org/jira/browse/SPARK-20748
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Xiao Li
> Labels: starter
>
> {noformat}
> CH[A]R()
> {noformat}
> Returns a character when given its ASCII code.
> Ref: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions019.htm
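The requested semantics, "returns a character when given its ASCII code", can be sketched as follows. This is an illustrative assumption, not the final Spark implementation: some SQL engines (e.g. Hive's chr) reduce the code modulo 256 and return an empty string for negative input, while Oracle's CHR has more elaborate character-set behavior.

```python
# Sketch of one common CHR semantics (reduce the code modulo 256, empty
# string for negative input). Assumed behavior, not Spark's final one.

def sql_chr(n: int) -> str:
    """Return the character for ASCII code n, wrapping n into 0..255."""
    if n < 0:
        return ""  # assumed behavior for negative input
    return chr(n % 256)

print(sql_chr(65))   # 'A'
print(sql_chr(321))  # 321 % 256 == 65 -> 'A'
```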
[jira] [Commented] (SPARK-20765) Cannot load persisted PySpark ML Pipeline that includes 3rd party stage (Transformer or Estimator) if the package name of stage is not "org.apache.spark" and "pyspark"
[ https://issues.apache.org/jira/browse/SPARK-20765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012465#comment-16012465 ]

APeng Zhang commented on SPARK-20765:
-
Yes, the class is on the classpath. The problem is that the current implementation cannot map my Scala class name (com.abc.xyz.ml.SomeClass) to the Python class name (xyz.ml.SomeClass).

> Cannot load persisted PySpark ML Pipeline that includes 3rd party stage
> (Transformer or Estimator) if the package name of stage is not
> "org.apache.spark" and "pyspark"
> ---
>
> Key: SPARK-20765
> URL: https://issues.apache.org/jira/browse/SPARK-20765
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0, 2.1.0, 2.2.0
> Reporter: APeng Zhang
>
> When loading a persisted PySpark ML Pipeline instance, Pipeline._from_java()
> will invoke JavaParams._from_java() to create a Python instance of each
> persisted stage. In JavaParams._from_java(), the name of the Python class is
> derived from the Java class name by replacing the string "org.apache.spark"
> with "pyspark". This is fine for Transformers and Estimators inside PySpark,
> but for 3rd party Transformers and Estimators whose package name is not
> under org.apache.spark or pyspark, there will be an error:
>
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/util.py", line 228, in load
>     return cls.read().load(path)
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/util.py", line 180, in load
>     return self._clazz._from_java(java_obj)
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/pipeline.py", line 160, in _from_java
>     py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/wrapper.py", line 169, in _from_java
>     py_type = __get_class(stage_name)
> File "/Users/azhang/Work/apyspark/lib/pyspark.zip/pyspark/ml/wrapper.py", line 163, in __get_class
>     m = __import__(module)
> ImportError: No module named com.abc.xyz.ml.testclass
>
> Related code in PySpark, in pyspark/ml/pipeline.py:
>
> class Pipeline(Estimator, MLReadable, MLWritable):
>     @classmethod
>     def _from_java(cls, java_stage):
>         # Create a new instance of this stage.
>         py_stage = cls()
>         # Load information from java_stage to the instance.
>         py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
>
> and in pyspark/ml/wrapper.py:
>
> class JavaParams(JavaWrapper, Params):
>     @staticmethod
>     def _from_java(java_stage):
>         def __get_class(clazz):
>             """
>             Loads Python class from its name.
>             """
>             parts = clazz.split('.')
>             module = ".".join(parts[:-1])
>             m = __import__(module)
>             for comp in parts[1:]:
>                 m = getattr(m, comp)
>             return m
>         stage_name = java_stage.getClass().getName().replace("org.apache.spark", "pyspark")
>         # Generate a default new instance from the stage_name class.
>         py_type = __get_class(stage_name)
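The failure mode quoted above comes down to a single string replace. The snippet below reproduces just that mapping step in isolation (the class names are illustrative, matching the ones in the report):

```python
# Demonstrates the mapping quoted from pyspark/ml/wrapper.py: the Python
# class name is derived by a plain string replace, which only works for
# classes under org.apache.spark. Third-party packages pass through
# unchanged, so the subsequent __import__ raises ImportError.

def java_to_python_name(java_class: str) -> str:
    # Mirrors the one-line mapping in JavaParams._from_java.
    return java_class.replace("org.apache.spark", "pyspark")

# Built-in stage: maps to a real PySpark module.
print(java_to_python_name("org.apache.spark.ml.feature.Binarizer"))
# -> pyspark.ml.feature.Binarizer

# Third-party stage: name is unchanged, so __import__("com.abc.xyz.ml")
# later fails with "No module named com.abc.xyz.ml...".
print(java_to_python_name("com.abc.xyz.ml.SomeClass"))
# -> com.abc.xyz.ml.SomeClass
```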
[jira] [Commented] (SPARK-18359) Let user specify locale in CSV parsing
[ https://issues.apache.org/jira/browse/SPARK-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012455#comment-16012455 ]

Sean Owen commented on SPARK-18359:
---
Using the JVM locale is a bad way to get this behavior, because it's not portable. Input would mysteriously work on one machine and not another, or succeed but quietly give the wrong output. It also caused some SQL-related methods to return the wrong value on non-US-locale machines. That's a big(ger) problem that had to be fixed.

Yes, the problem is that there isn't a way to specify non-US locales just for CSV parsing. That's what this issue is about, and yes, you should work on it if you need the functionality. As a workaround you can do some preprocessing to parse the dates manually. Not great, but not hard either.

> Let user specify locale in CSV parsing
> --
>
> Key: SPARK-18359
> URL: https://issues.apache.org/jira/browse/SPARK-18359
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0, 2.0.1
> Reporter: yannick Radji
>
> On the DataFrameReader object there is no CSV-specific option to set the
> decimal delimiter to a comma rather than a dot, as is customary in France
> and elsewhere in Europe.
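The preprocessing workaround mentioned in the comment can be sketched for the decimal-delimiter case the reporter describes. This is a plain-Python illustration (the function name and the assumption that input uses dot grouping with a comma decimal separator are mine), which could run over the raw text before handing it to Spark's CSV reader:

```python
# Sketch of the preprocessing workaround: normalize European-formatted
# numbers ("1.234,56") to the US format the CSV parser expects.
# Assumes dot = thousands separator and comma = decimal separator.

def normalize_decimal(s: str) -> str:
    """Convert '1.234,56' (dot grouping, comma decimal) to '1234.56'."""
    return s.replace(".", "").replace(",", ".")

print(float(normalize_decimal("1.234,56")))  # 1234.56
```

The order of the two replaces matters: stripping the grouping dots first guarantees the comma is the only separator left to convert.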
[jira] [Commented] (SPARK-20740) Expose UserDefinedType make sure could extends it
[ https://issues.apache.org/jira/browse/SPARK-20740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012453#comment-16012453 ]

Hyukjin Kwon commented on SPARK-20740:
--
ping [~darion]. If we can't explain the use case, I would suggest closing this.

> Expose UserDefinedType make sure could extends it
> -
>
> Key: SPARK-20740
> URL: https://issues.apache.org/jira/browse/SPARK-20740
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: darion yaphet
>
> Users may want to extend UserDefinedType to create their own data types. We
> should make UserDefinedType a public class.
[jira] [Resolved] (SPARK-20761) Union uses column order rather than schema
[ https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-20761.
--
Resolution: Duplicate

I am pretty sure that it is a duplicate of SPARK-15918. Please reopen this if I misunderstood.

> Union uses column order rather than schema
> --
>
> Key: SPARK-20761
> URL: https://issues.apache.org/jira/browse/SPARK-20761
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: Nakul Jeirath
> Priority: Minor
>
> I believe there is an issue when using union to combine two dataframes when
> the order of columns differs between the left and right side of the union:
> {code}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{BooleanType, StringType, StructField, StructType}
>
> val schema = StructType(Seq(
>   StructField("id", StringType, false),
>   StructField("flag_one", BooleanType, false),
>   StructField("flag_two", BooleanType, false),
>   StructField("flag_three", BooleanType, false)
> ))
>
> val rowRdd = spark.sparkContext.parallelize(Seq(
>   Row("1", true, false, false),
>   Row("2", false, true, false),
>   Row("3", false, false, true)
> ))
>
> spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")
> val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
>
> //Select columns out of order with respect to the emptyData schema
> val data = emptyData.union(spark.sql("select id, flag_two, flag_three, flag_one from temp_flags"))
> {code}
> Selecting the data from the "temp_flags" table results in:
> {noformat}
> spark.sql("select * from temp_flags").show
> +---+--------+--------+----------+
> | id|flag_one|flag_two|flag_three|
> +---+--------+--------+----------+
> |  1|    true|   false|     false|
> |  2|   false|    true|     false|
> |  3|   false|   false|      true|
> +---+--------+--------+----------+
> {noformat}
> Which is the data we'd expect, but when inspecting "data" we get:
> {noformat}
> data.show()
> +---+--------+--------+----------+
> | id|flag_one|flag_two|flag_three|
> +---+--------+--------+----------+
> |  1|   false|   false|      true|
> |  2|    true|   false|     false|
> |  3|   false|    true|     false|
> +---+--------+--------+----------+
> {noformat}
> Having a non-empty dataframe on the left side of the union doesn't seem to
> make a difference either:
> {noformat}
> spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, flag_three, flag_one from temp_flags")).show
> +---+--------+--------+----------+
> | id|flag_one|flag_two|flag_three|
> +---+--------+--------+----------+
> |  1|    true|   false|     false|
> |  2|   false|    true|     false|
> |  3|   false|   false|      true|
> |  1|   false|   false|      true|
> |  2|    true|   false|     false|
> |  3|   false|    true|     false|
> +---+--------+--------+----------+
> {noformat}
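The misassignment in the report is purely positional. The following pure-Python illustration (not Spark code; column and row values mirror the report) shows why union by position pairs values with the wrong names, and why matching columns by name first, which Spark exposes as DataFrame.unionByName, gives the intended result:

```python
# Pure-Python illustration of position-based vs name-based union.
left_cols = ["id", "flag_one", "flag_two", "flag_three"]
right_cols = ["id", "flag_two", "flag_three", "flag_one"]
right_row = ["1", False, False, True]  # i.e. flag_one=True for id "1"

# Position-based union: values land under the left schema's names.
by_position = dict(zip(left_cols, right_row))
print(by_position["flag_one"])  # False -- wrong

# Name-based union: reorder the right row to the left schema first.
named = dict(zip(right_cols, right_row))
by_name = {c: named[c] for c in left_cols}
print(by_name["flag_one"])      # True -- as intended
```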
[jira] [Commented] (SPARK-20765) Cannot load persisted PySpark ML Pipeline that includes 3rd party stage (Transformer or Estimator) if the package name of stage is not "org.apache.spark" and "pyspark"
[ https://issues.apache.org/jira/browse/SPARK-20765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012439#comment-16012439 ]

Sean Owen commented on SPARK-20765:
---
Yes, but doesn't this code leave com.abc.xyz as com.abc.xyz, as desired? That's your class name. Is that class on the classpath when you load?

> Cannot load persisted PySpark ML Pipeline that includes 3rd party stage
> (Transformer or Estimator) if the package name of stage is not
> "org.apache.spark" and "pyspark"
[jira] [Commented] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results
[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012422#comment-16012422 ]

Apache Spark commented on SPARK-20364:
--
User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/18000

> Parquet predicate pushdown on columns with dots return empty results
> --
>
> Key: SPARK-20364
> URL: https://issues.apache.org/jira/browse/SPARK-20364
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Hyukjin Kwon
> Priority: Critical
>
> Currently, if there are dots in the column name, predicate pushdown seems to
> fail in Parquet.
> **With dots**
> {code}
> val path = "/tmp/abcde"
> Seq(Some(1), None).toDF("col.dots").write.parquet(path)
> spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
> {code}
> {code}
> +--------+
> |col.dots|
> +--------+
> +--------+
> {code}
> **Without dots**
> {code}
> val path = "/tmp/abcde2"
> Seq(Some(1), None).toDF("coldots").write.parquet(path)
> spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
> {code}
> {code}
> +-------+
> |coldots|
> +-------+
> |      1|
> +-------+
> {code}
> It seems {{FilterApi}} splits a column name containing dots into multiple
> path segments (a {{ColumnPath}} with multiple column paths), whereas the
> actual column name is {{col.dots}}. (See [FilterApi.java#L71|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71],
> which calls [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44].)
> I tried to come up with ways to resolve it, and found two, as below:
> One is simply not to push down filters when there are dots in column names,
> so that Spark reads everything and filters on the Spark side.
> The other creates Spark's own {{FilterApi}} for those columns (the original
> seems final) so that a single column path is always used on the Spark side
> (this seems hacky), as we are not pushing down nested columns currently.
> This way, we can get a field name via {{ColumnPath.get}} rather than
> {{ColumnPath.fromDotString}}.
> I just made a rough version of the latter:
> {code}
> private[parquet] object ParquetColumns {
>   def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = {
>     new Column[Integer](ColumnPath.get(columnPath), classOf[Integer]) with SupportsLtGt
>   }
>
>   def longColumn(columnPath: String): Column[java.lang.Long] with SupportsLtGt = {
>     new Column[java.lang.Long](ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt
>   }
>
>   def floatColumn(columnPath: String): Column[java.lang.Float] with SupportsLtGt = {
>     new Column[java.lang.Float](ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt
>   }
>
>   def doubleColumn(columnPath: String): Column[java.lang.Double] with SupportsLtGt = {
>     new Column[java.lang.Double](ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt
>   }
>
>   def booleanColumn(columnPath: String): Column[java.lang.Boolean] with SupportsEqNotEq = {
>     new Column[java.lang.Boolean](ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with SupportsEqNotEq
>   }
>
>   def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = {
>     new Column[Binary](ColumnPath.get(columnPath), classOf[Binary]) with SupportsLtGt
>   }
> }
> {code}
> Neither sounds the best. Please let me know if anyone has a better idea.
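The root cause described in the issue can be mimicked in a few lines of pure Python (the real logic lives in parquet-mr's ColumnPath; the function names below are analogues, not the actual API):

```python
# Splitting the column name on dots turns the single column "col.dots" into
# a two-segment nested path, which no longer matches the Parquet schema.

def from_dot_string(name: str):
    # Analogue of ColumnPath.fromDotString: a dot means nesting.
    return tuple(name.split("."))

def get_single(name: str):
    # Analogue of ColumnPath.get: treat the name as one literal segment.
    return (name,)

print(from_dot_string("col.dots"))  # ('col', 'dots') -- two segments, no match
print(get_single("col.dots"))       # ('col.dots',)   -- one literal segment
```

This is why the issue's second approach routes column construction through ColumnPath.get: it keeps the dotted name as a single segment so the filter matches the actual column.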