[jira] [Commented] (SPARK-20780) Spark Kafka10 Consumer Hangs
[ https://issues.apache.org/jira/browse/SPARK-20780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712449#comment-16712449 ]

shangmin commented on SPARK-20780:
----------------------------------

The root cause is the Kafka consumer library version used by your program. Check the dependency tree with Maven or another build tool, and make sure the consumer library version is greater than 0.10.0 (0.10.2.x works fine for me). With Maven, exclude the transitive Kafka artifacts and add them back explicitly at 0.10.2.x:

{code:xml}
<!-- 1: exclude the transitive Kafka artifacts -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
  <version>2.2.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.11</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<!-- 2: add the Kafka artifacts back at 0.10.2.1 -->
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka_2.11</artifactId>
  <version>0.10.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>0.10.2.1</version>
</dependency>
{code}

> Spark Kafka10 Consumer Hangs
> ----------------------------
>
> Key: SPARK-20780
> URL: https://issues.apache.org/jira/browse/SPARK-20780
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 2.1.0
> Environment: Spark 2.1.0
> Spark Streaming Kafka 010
> Yarn - Cluster Mode
> CDH 5.8.4
> CentOS Linux release 7.2
> Reporter: jayadeepj
> Priority: Major
> Attachments: streaming_1.png, streaming_2.png, tasks_timing_out_3.png
>
> We have recently upgraded our streaming app with direct stream to Spark 2 (spark-streaming-kafka-0-10 - 2.1.0), with Kafka version 0.10.0.0 and the 0.10 consumer. We see abnormal delays after the application has run for a couple of hours or has consumed approximately ~5 million records. See screenshots 1 & 2.
> There is a sudden jump in processing time from ~15 seconds (usual for this app) to ~3 minutes, and from then on the processing time keeps degrading.
> We have seen that the delay is due to certain tasks taking exactly the duration of the configured Kafka consumer 'request.timeout.ms'. We verified this by varying the timeout property. See screenshot 3.
> I think the get(offset: Long, timeout: Long): ConsumerRecord[K, V] method and the subsequent poll(timeout) call in CachedKafkaConsumer.scala are actually timing out on some of the partitions without reading data, but the executor logs the task as successfully completed after exactly the timeout duration. Note that most other tasks complete successfully in milliseconds. The timeout most likely originates in org.apache.kafka.clients.consumer.KafkaConsumer, and we did not observe any difference in network latency. We have observed this across multiple clusters and multiple apps, with and without TLS/SSL. Spark 1.6 with the 0-8 consumer seems fine, with consistent performance.
>
> 17/05/17 10:30:06 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 446288
> 17/05/17 10:30:06 INFO executor.Executor: Running task 11.0 in stage 5663.0 (TID 446288)
> 17/05/17 10:30:06 INFO kafka010.KafkaRDD: Computing topic XX-XXX-XX, partition 0 offsets 776843 -> 779591
> 17/05/17 10:30:06 INFO kafka010.CachedKafkaConsumer: Initial fetch for spark-executor-default1 XX-XXX-XX 0 776843
> 17/05/17 10:30:56 INFO executor.Executor: Finished task 11.0 in stage 5663.0 (TID 446288). 1699 bytes result sent to driver
> 17/05/17 10:30:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 446329
> 17/05/17 10:30:56 INFO executor.Executor: Running task 0.0 in stage 5667.0 (TID 446329)
> 17/05/17 10:30:56 INFO spark.MapOutputTrackerWorker: Updating epoch to 3116 and clearing cache
> 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 6807
> 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807_piece0 stored as bytes in memory (estimated size 13.1 KB, free 4.1 GB)
> 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable 6807 took 4 ms
> 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807 stored as values in m
>
> We can see that the log statements differ by exactly the timeout duration. Our consumer config is below.
>
> 17/05/17 12:33:13 INFO dstream.ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@1171dde4
> 17/05/17 12:33:13 INFO consumer.ConsumerConfig: ConsumerConfig values:
> metric.reporters = []
> metadata.max.age.ms = 30
> partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
> reconnect.backoff.ms = 50
> sasl.kerberos.ticket.renew.window.factor = 0.8
> max.partition.fetch.bytes = 1048576
> bootstrap.servers = [x.xxx.xxx:9092]
> ssl.keystore.type = JKS
> enable.auto.commit = true
> sasl.mechanism = GSSAPI
> interceptor.classes = null
> exclude.internal.topics = true
> ssl.truststore.password = null
> client.id =
> ssl.endpoint.identification.algorithm = null
> max.poll.records = 2147483647
> check.crcs =
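The reporter's key observation is that stalled tasks take exactly `request.timeout.ms`. That correlation can be checked mechanically against task durations. A minimal Python sketch; the parameter values and the `suspicious_task_durations` helper are illustrative assumptions, not taken from the reporter's job:

```python
# Hypothetical consumer parameters for a 0-10 direct stream; the keys are
# standard Kafka consumer settings, the values are placeholders.
kafka_params = {
    "bootstrap.servers": "x.xxx.xxx:9092",   # placeholder host from the log
    "group.id": "default1",
    "enable.auto.commit": "true",
    # Stalled tasks took exactly this long, so it is the first knob to
    # inspect when a poll() appears to hang without returning data.
    "request.timeout.ms": "40000",
}

def suspicious_task_durations(task_ms, params, tolerance_ms=500):
    """Flag task durations (ms) that sit within tolerance of request.timeout.ms."""
    timeout = int(params["request.timeout.ms"])
    return [t for t in task_ms if abs(t - timeout) <= tolerance_ms]
```

Applied to a batch's task durations, any cluster of values hugging the timeout is a sign the consumer is timing out rather than doing real work.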
[jira] [Updated] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections
[ https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-23580:
----------------------------------
    Target Version/s:   (was: 2.5.0)

> Interpreted mode fallback should be implemented for all expressions & projections
> ---------------------------------------------------------------------------------
>
> Key: SPARK-23580
> URL: https://issues.apache.org/jira/browse/SPARK-23580
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Herman van Hovell
> Priority: Major
> Labels: release-notes
>
> Spark SQL currently does not support interpreted mode for all expressions and projections. This is a problem for scenarios where code generation does not work or blows past the JVM class limits; we currently cannot gracefully fall back.
> This ticket is an umbrella for fixing this class of problem in Spark SQL. The work can be divided into two main areas:
> - Add interpreted versions of all Dataset-related expressions.
> - Add an interpreted version of {{GenerateUnsafeProjection}}.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
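The fallback the ticket asks for follows a generic pattern: attempt to compile generated code, and on failure evaluate an interpreted version instead of failing the query. An illustrative Python sketch under stated assumptions; `Expression`, `compile`, and `interpret` are hypothetical stand-ins, not Spark's Catalyst API:

```python
class Expression:
    """Toy expression with a codegen path that may fail and an interpreted path."""

    def __init__(self, fn, codegen_ok=True):
        self._fn = fn
        self._codegen_ok = codegen_ok

    def compile(self):
        # Stand-in for bytecode compilation; raises when the generated code
        # blows past JVM limits (e.g. the 64KB method-size cap).
        if not self._codegen_ok:
            raise RuntimeError("generated method too large")
        return self._fn

    def interpret(self):
        # Slower tree-walking evaluation that always works.
        return self._fn


def evaluate(expr, row):
    """Prefer compiled code, but fall back to interpreted mode on failure."""
    try:
        fn = expr.compile()
    except RuntimeError:
        fn = expr.interpret()  # graceful fallback instead of a query failure
    return fn(row)
```

The point of the umbrella is that every expression and projection needs a working `interpret` path so this `try/except` is always safe to take.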
[jira] [Updated] (SPARK-24360) Support Hive 3.1 metastore
[ https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-24360:
----------------------------------
    Target Version/s:   (was: 2.5.0)

> Support Hive 3.1 metastore
> --------------------------
>
> Key: SPARK-24360
> URL: https://issues.apache.org/jira/browse/SPARK-24360
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> Hive 3.1.0 is released. This issue aims to support Hive Metastore 3.1.
[jira] [Updated] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys
[ https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-12978:
----------------------------------
    Target Version/s:   (was: 2.5.0)

> Skip unnecessary final group-by when input data already clustered with group-by keys
> ------------------------------------------------------------------------------------
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Takeshi Yamamuro
> Assignee: Takeshi Yamamuro
> Priority: Major
>
> This ticket targets the optimization to skip an unnecessary group-by operation, shown below:
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)], output=[col0#159,sum#200,sum#201,count#202L])
>    +- TungstenExchange hashpartitioning(col0#159,200), None
>       +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)], output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>    +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
[jira] [Updated] (SPARK-15117) Generate code that gets a value in each compressed column from CachedBatch when DataFrame.cache() is called
[ https://issues.apache.org/jira/browse/SPARK-15117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-15117:
----------------------------------
    Target Version/s:   (was: 2.5.0)

> Generate code that gets a value in each compressed column from CachedBatch when DataFrame.cache() is called
> -----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-15117
> URL: https://issues.apache.org/jira/browse/SPARK-15117
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Kazuaki Ishizaki
> Priority: Major
>
> Once SPARK-14098 is merged, we will migrate a feature in this JIRA entry.
> When DataFrame.cache() is called, data is stored as column-oriented storage in CachedBatch. The current Catalyst generates a Java program that gets the value of a column from an InternalRow translated from CachedBatch. This issue generates Java code that gets the value of a column directly from CachedBatch. This JIRA entry supports the other primitive types (boolean/byte/short/int/long) whose columns may be compressed.
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712414#comment-16712414 ]

Dongjoon Hyun commented on SPARK-20184:
---------------------------------------

If this exists up to 2.4.0, could you update the `Affects Versions`, [~kiszk]?

> performance regression for complex/long sql when enable whole stage codegen
> ---------------------------------------------------------------------------
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.0, 2.1.0
> Reporter: Fei Wang
> Priority: Major
>
> The performance of the following SQL gets much worse in Spark 2.x compared with codegen off.
> SELECT
>    sum(COUNTER_57)
>  ,sum(COUNTER_71)
>  ,sum(COUNTER_3)
>  ,sum(COUNTER_70)
>  ,sum(COUNTER_66)
>  ,sum(COUNTER_75)
>  ,sum(COUNTER_69)
>  ,sum(COUNTER_55)
>  ,sum(COUNTER_63)
>  ,sum(COUNTER_68)
>  ,sum(COUNTER_56)
>  ,sum(COUNTER_37)
>  ,sum(COUNTER_51)
>  ,sum(COUNTER_42)
>  ,sum(COUNTER_43)
>  ,sum(COUNTER_1)
>  ,sum(COUNTER_76)
>  ,sum(COUNTER_54)
>  ,sum(COUNTER_44)
>  ,sum(COUNTER_46)
>  ,DIM_1
>  ,DIM_2
>  ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> The number of rows in aggtable is about 3500.
> Whole-stage codegen on (spark.sql.codegen.wholeStage = true): 40s
> Whole-stage codegen off (spark.sql.codegen.wholeStage = false): 6s
> After some analysis, I think this is related to the huge Java method (a Java method of thousands of lines) generated by codegen.
> And if I configure -XX:-DontCompileHugeMethods the performance gets much better (about 7s).
[jira] [Updated] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches
[ https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-16196:
----------------------------------
    Target Version/s:   (was: 2.5.0)

> Optimize in-memory scan performance using ColumnarBatches
> ---------------------------------------------------------
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Andrew Or
> Assignee: Andrew Or
> Priority: Major
>
> A simple benchmark such as the following reveals inefficiencies in the existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression takes a long time. The second is that there are a lot of virtual function calls in this hot code path, since the rows are processed using iterators. Further, the rows are converted to and from ByteBuffers, which are slow to read in general.
[jira] [Updated] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-20184:
----------------------------------
    Target Version/s:   (was: 2.5.0)

> performance regression for complex/long sql when enable whole stage codegen
> ---------------------------------------------------------------------------
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.6.0, 2.1.0
> Reporter: Fei Wang
> Priority: Major
>
> The performance of the following SQL gets much worse in Spark 2.x compared with codegen off.
> SELECT
>    sum(COUNTER_57)
>  ,sum(COUNTER_71)
>  ,sum(COUNTER_3)
>  ,sum(COUNTER_70)
>  ,sum(COUNTER_66)
>  ,sum(COUNTER_75)
>  ,sum(COUNTER_69)
>  ,sum(COUNTER_55)
>  ,sum(COUNTER_63)
>  ,sum(COUNTER_68)
>  ,sum(COUNTER_56)
>  ,sum(COUNTER_37)
>  ,sum(COUNTER_51)
>  ,sum(COUNTER_42)
>  ,sum(COUNTER_43)
>  ,sum(COUNTER_1)
>  ,sum(COUNTER_76)
>  ,sum(COUNTER_54)
>  ,sum(COUNTER_44)
>  ,sum(COUNTER_46)
>  ,DIM_1
>  ,DIM_2
>  ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> The number of rows in aggtable is about 3500.
> Whole-stage codegen on (spark.sql.codegen.wholeStage = true): 40s
> Whole-stage codegen off (spark.sql.codegen.wholeStage = false): 6s
> After some analysis, I think this is related to the huge Java method (a Java method of thousands of lines) generated by codegen.
> And if I configure -XX:-DontCompileHugeMethods the performance gets much better (about 7s).
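The regression above stems from HotSpot refusing to JIT-compile "huge" methods, so a multi-thousand-line generated method runs interpreted. A hedged Python sketch of that decision and of the two workarounds mentioned in the report; the 8000-bytecode-byte limit is HotSpot's commonly documented default (`-XX:HugeMethodLimit`), stated here as an assumption:

```python
# HotSpot skips JIT compilation of methods larger than HugeMethodLimit
# (assumed 8000 bytecode bytes, the usual default), which matches the
# 40s (codegen on) vs 6s (codegen off) gap reported above.
HUGE_METHOD_LIMIT = 8000  # bytecode bytes, assumed HotSpot default

def will_jit(bytecode_size, dont_compile_huge_methods=True):
    """Rough mimic of HotSpot's decision to JIT-compile a method."""
    if dont_compile_huge_methods and bytecode_size > HUGE_METHOD_LIMIT:
        return False
    return True

# The two workarounds discussed (values illustrative): either disable
# whole-stage codegen, or let the JVM compile huge methods anyway.
workarounds = {
    "spark.sql.codegen.wholeStage": "false",
    "spark.executor.extraJavaOptions": "-XX:-DontCompileHugeMethods",
}
```

Passing `-XX:-DontCompileHugeMethods` flips the first branch off, which is why the reporter saw the runtime drop back to ~7s.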
[jira] [Updated] (SPARK-21318) The exception message thrown by `lookupFunction` is ambiguous.
[ https://issues.apache.org/jira/browse/SPARK-21318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-21318:
----------------------------------
    Target Version/s:   (was: 2.5.0)

> The exception message thrown by `lookupFunction` is ambiguous.
> --------------------------------------------------------------
>
> Key: SPARK-21318
> URL: https://issues.apache.org/jira/browse/SPARK-21318
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
> Reporter: StanZhai
> Assignee: StanZhai
> Priority: Minor
> Fix For: 2.4.0
>
> The function actually exists in the currently selected database, but the exception message is:
> {code}
> This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
> {code}
> My UDF has already been registered in the current database, but it failed to initialize during lookupFunction. The exception message should be:
> {code}
> No handler for Hive UDF 'site.stanzhai.UDAFXXX': org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException: Two arguments is expected
> {code}
> The current message makes it hard to locate the real problem.
[jira] [Commented] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections
[ https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712410#comment-16712410 ]

Dongjoon Hyun commented on SPARK-23580:
---------------------------------------

I removed the target version `2.5.0`. Please feel free to add `3.0.0` if needed.

> Interpreted mode fallback should be implemented for all expressions & projections
> ---------------------------------------------------------------------------------
>
> Key: SPARK-23580
> URL: https://issues.apache.org/jira/browse/SPARK-23580
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Herman van Hovell
> Priority: Major
> Labels: release-notes
>
> Spark SQL currently does not support interpreted mode for all expressions and projections. This is a problem for scenarios where code generation does not work or blows past the JVM class limits; we currently cannot gracefully fall back.
> This ticket is an umbrella for fixing this class of problem in Spark SQL. The work can be divided into two main areas:
> - Add interpreted versions of all Dataset-related expressions.
> - Add an interpreted version of {{GenerateUnsafeProjection}}.
[jira] [Updated] (SPARK-24360) Support Hive 3.1 metastore
[ https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-24360:
----------------------------------
    Target Version/s: 3.0.0

> Support Hive 3.1 metastore
> --------------------------
>
> Key: SPARK-24360
> URL: https://issues.apache.org/jira/browse/SPARK-24360
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> Hive 3.1.0 is released. This issue aims to support Hive Metastore 3.1.
[jira] [Created] (SPARK-26301) Consider switching from putting secret in environment variable directly to using secret reference
Matt Cheah created SPARK-26301:
----------------------------------

             Summary: Consider switching from putting secret in environment variable directly to using secret reference
                 Key: SPARK-26301
                 URL: https://issues.apache.org/jira/browse/SPARK-26301
             Project: Spark
          Issue Type: New Feature
          Components: Kubernetes
    Affects Versions: 3.0.0
            Reporter: Matt Cheah

In SPARK-26194 we proposed using an environment variable that is loaded in the executor pod spec to share the generated SASL secret key between the driver and the executors. However, in practice this is very difficult to secure. Most traditional Kubernetes deployments handle permissions by allowing wide access to viewing pod specs but restricting access to viewing Kubernetes secrets. Now, however, any user that can view the pod spec can also view the contents of the SASL secrets.

An example use case where this quickly breaks down is when a systems administrator is allowed to look at pods that run user code in order to debug failing infrastructure, but the cluster administrator should not be able to view the contents of secrets or other sensitive data from Spark applications run by their users.

We propose modifying the existing solution to instead automatically create a Kubernetes Secret object containing the SASL encryption key, then using the [secret reference feature in Kubernetes|https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-environment-variables] to store the data in the environment variable without putting the secret data in the pod spec itself.
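The proposed change boils down to the shape of the env entry in the pod spec: a `secretKeyRef` pointing at a Secret instead of an inline `value`. A minimal sketch of that structure as plain data; the secret and key names below are hypothetical:

```python
def secret_env_var(env_name, secret_name, secret_key):
    """Build a pod-spec env entry that references a Kubernetes Secret
    (the 'secretKeyRef' feature) instead of inlining the value."""
    return {
        "name": env_name,
        "valueFrom": {
            "secretKeyRef": {
                "name": secret_name,  # the Secret object holding the key
                "key": secret_key,    # the entry within that Secret
            }
        },
    }

# Hypothetical names. Anyone who can read the pod spec now sees only this
# reference; reading the Secret itself requires separate RBAC permission.
env = secret_env_var("SPARK_AUTH_SECRET", "spark-app-auth", "sasl-key")
```

The security argument in the ticket falls directly out of this: the sensitive bytes never appear in the pod spec, so pod-level read access no longer implies secret-level read access.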
[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712373#comment-16712373 ]

Apache Spark commented on SPARK-26239:
--------------------------------------

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/23252

> Add configurable auth secret source in k8s backend
> --------------------------------------------------
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes
> Affects Versions: 3.0.0
> Reporter: Marcelo Vanzin
> Priority: Major
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the secret-handling logic to custom implementations
> - figuring out whether this can also be used in client mode, where the driver is not created by the k8s backend in Spark.
[jira] [Assigned] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26239:
------------------------------------
    Assignee: Apache Spark

> Add configurable auth secret source in k8s backend
> --------------------------------------------------
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes
> Affects Versions: 3.0.0
> Reporter: Marcelo Vanzin
> Assignee: Apache Spark
> Priority: Major
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the secret-handling logic to custom implementations
> - figuring out whether this can also be used in client mode, where the driver is not created by the k8s backend in Spark.
[jira] [Assigned] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26239:
------------------------------------
    Assignee: (was: Apache Spark)

> Add configurable auth secret source in k8s backend
> --------------------------------------------------
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes
> Affects Versions: 3.0.0
> Reporter: Marcelo Vanzin
> Priority: Major
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the secret-handling logic to custom implementations
> - figuring out whether this can also be used in client mode, where the driver is not created by the k8s backend in Spark.
[jira] [Updated] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility
[ https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-26188:
----------------------------
    Labels: release-notes  (was: )

> Spark 2.4.0 Partitioning behavior breaks backwards compatibility
> ----------------------------------------------------------------
>
> Key: SPARK-26188
> URL: https://issues.apache.org/jira/browse/SPARK-26188
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Damien Doucet-Girard
> Assignee: Gengliang Wang
> Priority: Critical
> Labels: correctness, release-notes
> Fix For: 2.4.1, 3.0.0
>
> My team uses Spark to partition and output parquet files to Amazon S3. We typically use 256 partitions, from 00 to ff.
> We've observed that Spark 2.3.2 and prior reads the partitions as strings by default. However, in Spark 2.4.0 and later, the type of each partition is inferred by default, and partitions such as 00 become 0 and 4d becomes 4.0.
> Here is a log sample of this behavior from one of our jobs:
> 2.4.0:
> {code:java}
> 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, range: 0-662, partition values: [0]
> 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, range: 0-662, partition values: [ef]
> 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, range: 0-662, partition values: [4a]
> 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, range: 0-662, partition values: [74]
> 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, range: 0-662, partition values: [f5]
> 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, range: 0-662, partition values: [50]
> 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, range: 0-662, partition values: [70]
> 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, range: 0-662, partition values: [b9]
> 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, range: 0-662, partition values: [d2]
> 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, range: 0-662, partition values: [51]
> 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, range: 0-662, partition values: [84]
> 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, range: 0-662, partition values: [b5]
> 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, range: 0-662, partition values: [88]
> 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, range: 0-662, partition values: [4.0]
> 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, range: 0-662, partition values: [ac]
> 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, range: 0-662, partition values: [24]
> 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, range: 0-662, partition values: [fd]
> 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, range: 0-662, partition values: [52]
> 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, range: 0-662, partition values: [ab]
> 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, range: 0-662, partition values: [f8]
> 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, range: 0-662, partition values: [7a]
> 18/11/27 14:02:50
> {code}
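The surprising values in the log (`00` becoming `0`, `4d` becoming `4.0`) can be reproduced with a small sketch that mimics JVM-style numeric parsing, where a trailing `d` makes "4d" a valid double literal. This is an approximation for illustration, not Spark's actual partition-inference code; the 2.4 escape hatch mentioned in the release notes is `spark.sql.sources.partitionColumnTypeInference.enabled=false`, which restores string partition values:

```python
def infer_partition_value(raw):
    """Rough mimic of JVM-style inference: try integer, then double
    (where a trailing 'd'/'f' suffix is legal in Java parsing), else
    keep the raw string."""
    try:
        return int(raw)  # "00" -> 0
    except ValueError:
        pass
    # Java's Double.parseDouble accepts a trailing 'd'/'D'/'f'/'F' suffix,
    # which is how a hex-looking directory name like "4d" becomes 4.0.
    candidate = raw[:-1] if raw[-1:] in "dDfF" else raw
    try:
        return float(candidate)  # "4d" -> 4.0
    except ValueError:
        return raw  # "ac" stays a string
```

Run over the suffixes in the log, this reproduces the mixed output exactly: `74` and `24` become integers, `4d` becomes `4.0`, and `ac`, `ef`, `b9` survive as strings, which is why a directory layout of 256 hex suffixes comes back with inconsistent types.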
[jira] [Updated] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility
[ https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26188: Labels: correctness release-notes (was: release-notes) > Spark 2.4.0 Partitioning behavior breaks backwards compatibility > > > Key: SPARK-26188 > URL: https://issues.apache.org/jira/browse/SPARK-26188 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Damien Doucet-Girard >Assignee: Gengliang Wang >Priority: Critical > Labels: correctness, release-notes > Fix For: 2.4.1, 3.0.0 > > > My team uses spark to partition and output parquet files to amazon S3. We > typically use 256 partitions, from 00 to ff. > We've observed that in spark 2.3.2 and prior, it reads the partitions as > strings by default. However, in spark 2.4.0 and later, the type of each > partition is inferred by default, and partitions such as 00 become 0 and 4d > become 4.0. > Here is a log sample of this behavior from one of our jobs: > 2.4.0: > {code:java} > 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, > range: 0-662, partition values: [0] > 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, > range: 0-662, partition values: [ef] > 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, > range: 0-662, partition values: [4a] > 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, > range: 0-662, partition values: [74] > 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, > range: 0-662, partition values: [f5] > 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: > 
s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, > range: 0-662, partition values: [50] > 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, > range: 0-662, partition values: [70] > 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, > range: 0-662, partition values: [b9] > 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, > range: 0-662, partition values: [d2] > 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, > range: 0-662, partition values: [51] > 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, > range: 0-662, partition values: [84] > 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, > range: 0-662, partition values: [b5] > 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, > range: 0-662, partition values: [88] > 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, > range: 0-662, partition values: [4.0] > 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, > range: 0-662, partition values: [ac] > 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, > range: 0-662, partition values: [24] > 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: > 
s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, > range: 0-662, partition values: [fd] > 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, > range: 0-662, partition values: [52] > 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, > range: 0-662, partition values: [ab] > 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, > range: 0-662, partition values: [f8] > 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=7a/part-00093-hashredacted.parquet, > range: 0-662, partition values:
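The behavior in the logs above follows from Spark's partition value inference: with inference on, a directory suffix such as "00" parses as the integer 0, and "4d" parses as the Java double literal 4.0 (Java's `Double.parseDouble` accepts a trailing `d`/`D`/`f`/`F`). Spark 2.4 exposes `spark.sql.sources.partitionColumnTypeInference.enabled=false` to restore string-typed partitions. The helper below is a hypothetical pure-Python sketch of that inference logic, not Spark's actual code:

```python
def infer_partition_value(raw, infer_types=True):
    """Simplified sketch of Spark's partition value type inference.

    With inference on, "00" parses as the integer 0 and "4d" as the
    double 4.0; only values that fail both parses stay strings.
    """
    if not infer_types:
        return raw  # pre-2.4 / inference-disabled behavior: keep the string
    try:
        return int(raw)
    except ValueError:
        pass
    # Java's Double.parseDouble accepts a trailing 'd'/'D'/'f'/'F'
    # suffix, which is why "4d" becomes 4.0 instead of staying "4d".
    candidate = raw[:-1] if raw and raw[-1] in "dDfF" else raw
    try:
        return float(candidate)
    except ValueError:
        return raw
```

This also shows why only some of the 256 hex suffixes are affected: values like "ef" or "b9" fail both parses and survive as strings, while purely numeric or number-plus-suffix values silently change type.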
[jira] [Commented] (SPARK-25401) Reorder the required ordering to match the table's output ordering for bucket join
[ https://issues.apache.org/jira/browse/SPARK-25401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712350#comment-16712350 ] David Vrba commented on SPARK-25401: I was looking at it and I believe that in the class EnsureRequirements we could reorder the join predicates for SortMergeJoin once more - just before we check whether the child outputOrdering satisfies the requiredOrdering - and align the predicate keys with the child outputOrdering. In that case it will not add the unnecessary SortExec, and it will not add an unnecessary Exchange either, because Exchange is handled before. What do you guys think? Is it a good approach? (Please be patient with me, this is my first Jira on Spark) > Reorder the required ordering to match the table's output ordering for bucket > join > -- > > Key: SPARK-25401 > URL: https://issues.apache.org/jira/browse/SPARK-25401 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wang, Gang >Priority: Major > > Currently, we check whether a SortExec is needed between an operator and its child > operator in the method orderingSatisfies, and orderingSatisfies requires > that the orders in the SortOrders are all the same. > However, consider the following case: > * Table a is bucketed by (a1, a2), sorted by (a2, a1), and the number of buckets is > 200. > * Table b is bucketed by (b1, b2), sorted by (b2, b1), and the number of buckets is > 200. > * Table a is joined with table b on (a1=b1, a2=b2) > In this case, if the join is a sort merge join, the query planner won't add an > exchange on either side, but a sort will be added on both sides. Actually, the > sort is also unnecessary, since within the same bucket, e.g. bucket 1 of table a > and bucket 1 of table b, (a1=b1, a2=b2) is equivalent to (a2=b2, a1=b1). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
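The idea in the issue and comment above can be sketched as a toy model (plain Python, with ordering keys as strings; this is not Spark's actual EnsureRequirements code, and the function names are hypothetical): the satisfaction check demands a prefix match in order, so permuting the required join keys to line up with the child's existing output ordering lets the check pass without an extra sort.

```python
def ordering_satisfies(child_ordering, required):
    """Prefix check modeled on Spark's orderingSatisfies: the required
    ordering must match the child's output ordering key-for-key."""
    return list(child_ordering[: len(required)]) == list(required)

def reorder_required(required, child_ordering):
    """Proposed fix, sketched: equi-join keys are order-insensitive as a
    set, so permute them to match the child's output ordering when the
    same keys are present; otherwise leave them untouched."""
    prefix = list(child_ordering[: len(required)])
    if set(required) == set(prefix):
        return prefix
    return list(required)
```

For the example in the issue, the child is sorted by ("a2", "a1") while the join requires ("a1", "a2"): the plain check fails and a SortExec would be added, but after reordering the required keys the check passes and the sort is avoided.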
[jira] [Updated] (SPARK-26298) Upgrade Janino version to 3.0.11
[ https://issues.apache.org/jira/browse/SPARK-26298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26298: -- Issue Type: Improvement (was: Bug) > Upgrade Janino version to 3.0.11 > > > Key: SPARK-26298 > URL: https://issues.apache.org/jira/browse/SPARK-26298 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to upgrade Janino compiler to [version > 3.0.11|http://janino-compiler.github.io/janino/changelog.html]. > - Fixed issue #63: Script with many "helper" variables. > - At last, the "jdk-9" implementation works. > - Made the code that is required for JDK 9+ (ModuleFinder and consorts) > Java-6-compilable through reflection, so that commons-compiler-jdk can still > be compiled with Java 6, but functions at runtime also with Java 9+. > - Fixed Java 9+ compatibility (JRE module system). > - Fixed issue #65: Compilation Error Messages Generated by JDK. > - Added experimental support for the "StackMapFrame" attribute; not active > yet. > - Fixed issue #61: Make Unparser more flexible. > - Fixed NPEs in various "toString()" methods. > - Fixed issue #62: Optimize static method invocation with rvalue target > expression. > - Merged pull request #60: Added all missing "ClassFile.getConstant*Info()" > methods, removing the necessity for many type casts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26298) Upgrade Janino version to 3.0.11
[ https://issues.apache.org/jira/browse/SPARK-26298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26298. --- Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23250 . > Upgrade Janino version to 3.0.11 > > > Key: SPARK-26298 > URL: https://issues.apache.org/jira/browse/SPARK-26298 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > This issue aims to upgrade Janino compiler to [version > 3.0.11|http://janino-compiler.github.io/janino/changelog.html]. > - Fixed issue #63: Script with many "helper" variables. > - At last, the "jdk-9" implementation works. > - Made the code that is required for JDK 9+ (ModuleFinder and consorts) > Java-6-compilable through reflection, so that commons-compiler-jdk can still > be compiled with Java 6, but functions at runtime also with Java 9+. > - Fixed Java 9+ compatibility (JRE module system). > - Fixed issue #65: Compilation Error Messages Generated by JDK. > - Added experimental support for the "StackMapFrame" attribute; not active > yet. > - Fixed issue #61: Make Unparser more flexible. > - Fixed NPEs in various "toString()" methods. > - Fixed issue #62: Optimize static method invocation with rvalue target > expression. > - Merged pull request #60: Added all missing "ClassFile.getConstant*Info()" > methods, removing the necessity for many type casts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-26298) Upgrade Janino version to 3.0.11
[ https://issues.apache.org/jira/browse/SPARK-26298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26298: -- Comment: was deleted (was: User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/23250) > Upgrade Janino version to 3.0.11 > > > Key: SPARK-26298 > URL: https://issues.apache.org/jira/browse/SPARK-26298 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > This issue aims to upgrade Janino compiler to [version > 3.0.11|http://janino-compiler.github.io/janino/changelog.html]. > - Fixed issue #63: Script with many "helper" variables. > - At last, the "jdk-9" implementation works. > - Made the code that is required for JDK 9+ (ModuleFinder and consorts) > Java-6-compilable through reflection, so that commons-compiler-jdk can still > be compiled with Java 6, but functions at runtime also with Java 9+. > - Fixed Java 9+ compatibility (JRE module system). > - Fixed issue #65: Compilation Error Messages Generated by JDK. > - Added experimental support for the "StackMapFrame" attribute; not active > yet. > - Fixed issue #61: Make Unparser more flexible. > - Fixed NPEs in various "toString()" methods. > - Fixed issue #62: Optimize static method invocation with rvalue target > expression. > - Merged pull request #60: Added all missing "ClassFile.getConstant*Info()" > methods, removing the necessity for many type casts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`
[ https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26300: Assignee: (was: Apache Spark) > The `checkForStreaming` method may be called twice in `createQuery` > - > > Key: SPARK-26300 > URL: https://issues.apache.org/jira/browse/SPARK-26300 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > If {{checkForContinuous}} is called ({{checkForStreaming}} is called within > {{checkForContinuous}}), the {{checkForStreaming}} method will be called > twice in {{createQuery}}. This is unnecessary, and since the > {{checkForStreaming}} method contains many statements, it is better to > remove one of the calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`
[ https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712296#comment-16712296 ] Apache Spark commented on SPARK-26300: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/23251 > The `checkForStreaming` method may be called twice in `createQuery` > - > > Key: SPARK-26300 > URL: https://issues.apache.org/jira/browse/SPARK-26300 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > If {{checkForContinuous}} is called ({{checkForStreaming}} is called within > {{checkForContinuous}}), the {{checkForStreaming}} method will be called > twice in {{createQuery}}. This is unnecessary, and since the > {{checkForStreaming}} method contains many statements, it is better to > remove one of the calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`
[ https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26300: Assignee: Apache Spark > The `checkForStreaming` method may be called twice in `createQuery` > - > > Key: SPARK-26300 > URL: https://issues.apache.org/jira/browse/SPARK-26300 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: liuxian >Assignee: Apache Spark >Priority: Minor > > If {{checkForContinuous}} is called ({{checkForStreaming}} is called within > {{checkForContinuous}}), the {{checkForStreaming}} method will be called > twice in {{createQuery}}. This is unnecessary, and since the > {{checkForStreaming}} method contains many statements, it is better to > remove one of the calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`
liuxian created SPARK-26300: --- Summary: The `checkForStreaming` method may be called twice in `createQuery` Key: SPARK-26300 URL: https://issues.apache.org/jira/browse/SPARK-26300 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.0 Reporter: liuxian If {{checkForContinuous}} is called ({{checkForStreaming}} is called within {{checkForContinuous}}), the {{checkForStreaming}} method will be called twice in {{createQuery}}. This is unnecessary, and since the {{checkForStreaming}} method contains many statements, it is better to remove one of the calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
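The redundant call described above can be sketched in plain Python (hypothetical names mirroring the Scala methods; a counter stands in for the expensive validation work):

```python
calls = {"check_for_streaming": 0}

def check_for_streaming(plan):
    # Stand-in for the many validation statements in the real method.
    calls["check_for_streaming"] += 1

def check_for_continuous(plan):
    # Continuous mode reuses the streaming checks...
    check_for_streaming(plan)
    # ...then would run continuous-specific checks.

def create_query(plan, continuous):
    # Current shape of the code: the streaming check always runs here...
    check_for_streaming(plan)
    if continuous:
        # ...and runs a second time inside the continuous check.
        check_for_continuous(plan)
```

Calling `create_query(plan, continuous=True)` bumps the counter to 2, which is the duplication the issue proposes to remove (e.g. by skipping the top-level call on the continuous path).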
[jira] [Assigned] (SPARK-26263) Throw exception when Partition column value can't be converted to user specified type
[ https://issues.apache.org/jira/browse/SPARK-26263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26263: --- Assignee: Gengliang Wang > Throw exception when Partition column value can't be converted to user > specified type > - > > Key: SPARK-26263 > URL: https://issues.apache.org/jira/browse/SPARK-26263 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > Currently if user provides data schema, partition column values are converted > as per it. But if the conversion failed, e.g. converting string to int, the > column value is null. > For the following directory > /tmp/testDir > ├── p=bar > └── p=foo > If we run: > ``` > val schema = StructType(Seq(StructField("p", IntegerType, false))) > spark.read.schema(schema).csv("/tmp/testDir/").show() > ``` > We will get: > ++ > | p| > ++ > |null| > |null| > ++ > This PR propose to throw exception in such case, instead of converting into > null value silently: > 1. These null partition column values doesn't make sense to users in most > case. It is better to know the conversion failure, and then adjust the schema > or ETL jobs, etc to fix it. > 2. There are always exceptions on such conversion failure for non-partition > data columns. Partition columns should have the same behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26263) Throw exception when Partition column value can't be converted to user specified type
[ https://issues.apache.org/jira/browse/SPARK-26263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26263. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23215 [https://github.com/apache/spark/pull/23215] > Throw exception when Partition column value can't be converted to user > specified type > - > > Key: SPARK-26263 > URL: https://issues.apache.org/jira/browse/SPARK-26263 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > Currently if user provides data schema, partition column values are converted > as per it. But if the conversion failed, e.g. converting string to int, the > column value is null. > For the following directory > /tmp/testDir > ├── p=bar > └── p=foo > If we run: > ``` > val schema = StructType(Seq(StructField("p", IntegerType, false))) > spark.read.schema(schema).csv("/tmp/testDir/").show() > ``` > We will get: > ++ > | p| > ++ > |null| > |null| > ++ > This PR propose to throw exception in such case, instead of converting into > null value silently: > 1. These null partition column values doesn't make sense to users in most > case. It is better to know the conversion failure, and then adjust the schema > or ETL jobs, etc to fix it. > 2. There are always exceptions on such conversion failure for non-partition > data columns. Partition columns should have the same behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
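The behavior change above can be sketched in plain Python (a hypothetical helper, not Spark's actual conversion code): the pre-change path swallows the conversion failure and yields a null, while the post-change path fails fast so the schema or ETL job can be fixed.

```python
def cast_partition_value(raw, to_type, strict=True):
    """Sketch of SPARK-26263: cast a partition directory value (always a
    string on disk) to the user-specified column type."""
    try:
        return to_type(raw)
    except ValueError:
        if strict:
            # Post-change behavior: surface the failure instead of
            # silently producing a null partition value.
            raise ValueError(
                f"Failed to cast partition value '{raw}' to {to_type.__name__}")
        return None  # pre-change behavior: silent null
```

For the `/tmp/testDir` example, `p=bar` with an IntegerType schema yields `None` under the old behavior and an error under the new one.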
[jira] [Commented] (SPARK-26051) Can't create table with column name '22222d'
[ https://issues.apache.org/jira/browse/SPARK-26051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712279#comment-16712279 ] Xie Juntao commented on SPARK-26051: [~dkbiswal] hi, I tested in mysql. It's ok for creating a table with column name "2d": mysql> create table t1(2d int); Query OK, 0 rows affected (0.02 sec) > Can't create table with column name '2d' > > > Key: SPARK-26051 > URL: https://issues.apache.org/jira/browse/SPARK-26051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xie Juntao >Priority: Minor > > I can't create table in which the column name is '2d' when I use > spark-sql. It seems a SQL parser bug because it's ok for creating table with > the column name ''2m". > {code:java} > spark-sql> create table t1(2d int); > Error in query: > no viable alternative at input 'create table t1(2d'(line 1, pos 16) > == SQL == > create table t1(2d int) > ^^^ > spark-sql> create table t1(2m int); > 18/11/14 09:13:53 INFO HiveMetaStore: 0: get_database: global_temp > 18/11/14 09:13:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: > global_temp > 18/11/14 09:13:53 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default > 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: > default > 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default > 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: > default > 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_table : db=default tbl=t1 > 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : > db=default tbl=t1 > 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default > 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: > default > 18/11/14 09:13:55 INFO HiveMetaStore: 0: create_table: Table(tableName:t1, > dbName:default, owner:root, 
createTime:1542158033, lastAccessTime:0, > retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:2m, type:int, > comment:null)], > location:file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1, > inputFormat:org.apache.hadoop.mapred.TextInputFormat, > outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, > compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, > serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > parameters:{serialization.format=1}), bucketCols:[], sortCols:[], > parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], > skewedColValueLocationMaps:{})), partitionKeys:[], > parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"2m","type":"integer","nullable":true,"metadata":{}}]}, > spark.sql.sources.schema.numParts=1, spark.sql.create.version=2.3.1}, > viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, > privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, > rolePrivileges:null)) > 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=create_table: > Table(tableName:t1, dbName:default, owner:root, createTime:1542158033, > lastAccessTime:0, retention:0, > sd:StorageDescriptor(cols:[FieldSchema(name:2m, type:int, comment:null)], > location:file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1, > inputFormat:org.apache.hadoop.mapred.TextInputFormat, > outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, > compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, > serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > parameters:{serialization.format=1}), bucketCols:[], sortCols:[], > parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], > skewedColValueLocationMaps:{})), partitionKeys:[], > 
parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"2m","type":"integer","nullable":true,"metadata":{}}]}, > spark.sql.sources.schema.numParts=1, spark.sql.create.version=2.3.1}, > viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, > privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, > rolePrivileges:null)) > 18/11/14 09:13:55 WARN HiveMetaStore: Location: > file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1 > specified for non-external table:t1 > 18/11/14 09:13:55 INFO FileUtils: Creating directory if it doesn't exist: > file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1 > Time taken: 2.15 seconds > 18/11/14 09:13:56 INFO SparkSQLCLIDriver: Time taken: 2.15 seconds{code} -- This message was sent by Atlassian JIRA
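MySQL accepts the unquoted digit-leading column name, as the comment above shows, but most SQL dialects require such identifiers to be quoted (backticks in Spark SQL, double quotes in standard SQL). A small illustration using sqlite3 as a stand-in parser, since it is available in the Python standard library:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unquoted, a digit-leading column name is typically rejected by the
# parser, much like Spark SQL's "no viable alternative" error.
try:
    conn.execute("CREATE TABLE t1(2d INT)")
    unquoted_accepted = True
except sqlite3.OperationalError:
    unquoted_accepted = False

# Quoting the identifier makes it legal (Spark SQL would use
# backticks: CREATE TABLE t2(`2d` INT)).
conn.execute('CREATE TABLE t2("2d" INT)')
conn.execute('INSERT INTO t2("2d") VALUES (1)')
row = conn.execute('SELECT "2d" FROM t2').fetchone()
```

Whether unquoted digit-leading names should also be accepted (for MySQL parity) is the open question in the issue; quoting is the portable workaround either way.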
[jira] [Comment Edited] (SPARK-26051) Can't create table with column name '22222d'
[ https://issues.apache.org/jira/browse/SPARK-26051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712279#comment-16712279 ] Xie Juntao edited comment on SPARK-26051 at 12/7/18 2:53 AM: - [~dkbiswal] hi, I tested in mysql. It's ok for creating a table with column name "2d": {quote}mysql> create table t1(2d int); Query OK, 0 rows affected (0.02 sec) {quote} was (Author: xiejuntao1...@163.com): [~dkbiswal] hi, I tested in mysql. It's ok for creating a table with column name "2d": mysql> create table t1(2d int); Query OK, 0 rows affected (0.02 sec) > Can't create table with column name '2d' > > > Key: SPARK-26051 > URL: https://issues.apache.org/jira/browse/SPARK-26051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xie Juntao >Priority: Minor > > I can't create table in which the column name is '2d' when I use > spark-sql. It seems a SQL parser bug because it's ok for creating table with > the column name ''2m". > {code:java} > spark-sql> create table t1(2d int); > Error in query: > no viable alternative at input 'create table t1(2d'(line 1, pos 16) > == SQL == > create table t1(2d int) > ^^^ > spark-sql> create table t1(2m int); > 18/11/14 09:13:53 INFO HiveMetaStore: 0: get_database: global_temp > 18/11/14 09:13:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: > global_temp > 18/11/14 09:13:53 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default > 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: > default > 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default > 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: > default > 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_table : db=default tbl=t1 > 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : > db=default tbl=t1 > 18/11/14 09:13:55 INFO HiveMetaStore: 0: 
get_database: default > 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: > default > 18/11/14 09:13:55 INFO HiveMetaStore: 0: create_table: Table(tableName:t1, > dbName:default, owner:root, createTime:1542158033, lastAccessTime:0, > retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:2m, type:int, > comment:null)], > location:file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1, > inputFormat:org.apache.hadoop.mapred.TextInputFormat, > outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, > compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, > serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > parameters:{serialization.format=1}), bucketCols:[], sortCols:[], > parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], > skewedColValueLocationMaps:{})), partitionKeys:[], > parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"2m","type":"integer","nullable":true,"metadata":{}}]}, > spark.sql.sources.schema.numParts=1, spark.sql.create.version=2.3.1}, > viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, > privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, > rolePrivileges:null)) > 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=create_table: > Table(tableName:t1, dbName:default, owner:root, createTime:1542158033, > lastAccessTime:0, retention:0, > sd:StorageDescriptor(cols:[FieldSchema(name:2m, type:int, comment:null)], > location:file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1, > inputFormat:org.apache.hadoop.mapred.TextInputFormat, > outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, > compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, > serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > parameters:{serialization.format=1}), bucketCols:[], sortCols:[], > parameters:{}, 
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], > skewedColValueLocationMaps:{})), partitionKeys:[], > parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"2m","type":"integer","nullable":true,"metadata":{}}]}, > spark.sql.sources.schema.numParts=1, spark.sql.create.version=2.3.1}, > viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, > privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, > rolePrivileges:null)) > 18/11/14 09:13:55 WARN HiveMetaStore: Location: > file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1 > specified for non-external table:t1 > 18/11/14 09:13:55 INFO FileUtils:
[jira] [Updated] (SPARK-26299) [MAVEN] Shaded test jar in spark-streaming cause deploy test jar twice
[ https://issues.apache.org/jira/browse/SPARK-26299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu, Linhong updated SPARK-26299: - Description: In spark root pom, there is a test-jar execution for all sub-projects including spark-streaming. This will attach an artifact: org.apache.spark:spark-streaming_2.11:*{color:#ff}test-jar{color}*:tests:2.3.2 Also, in streaming pom, there is a shade test configuration, it will attach an artifact: org.apache.spark:spark-streaming_2.11:{color:#ff}*jar*{color}:tests:2.3.2 But two artifacts actually point to the same file: spark-streaming_2.11-2.3.2-tests.jar So, when deploy spark to nexus using mvn deploy, maven will upload the test.jar twice since it belongs to 2 artifacts. Then deploy fails due to nexus don't allow overrides existing file for non-SNAPSHOT release. What's more, after checking the spark-streaming, shaded test jar is exactly same to original test jar. It seems we can just delete the related shade config. {code:java} target/scala-${scala.binary.version}/classes target/scala-${scala.binary.version}/test-classes org.apache.maven.plugins maven-shade-plugin true {code} was: In spark root pom, there is a test-jar execution for all sub-projects including spark-streaming. This will attach an artifact: org.apache.spark:spark-streaming_2.11:*{color:#ff}test-jar{color}*:tests:2.3.2 Also, in streaming pom, there is a shade test configuration, it will attach an artifact: org.apache.spark:spark-streaming_2.11:{color:#ff}*jar*{color}:tests:2.3.2 But two artifacts actually point to the same file: spark-streaming_2.11-2.3.2-tests.jar So, when deploy spark to nexus using mvn deploy, maven will upload the test.jar twice since it belongs to 2 artifacts. Then deploy fails due to nexus don't allow overrides existing file for non-SNAPSHOT release. What's more, after checking the spark-streaming, shaded test jar is exactly same to original test jar. It seems we can just delete the related shade config. 
``` target/scala-${scala.binary.version}/classes target/scala-${scala.binary.version}/test-classes org.apache.maven.plugins maven-shade-plugin true ``` > [MAVEN] Shaded test jar in spark-streaming cause deploy test jar twice > -- > > Key: SPARK-26299 > URL: https://issues.apache.org/jira/browse/SPARK-26299 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.2, 2.4.0 >Reporter: Liu, Linhong >Priority: Major > > In spark root pom, there is a test-jar execution for all sub-projects > including spark-streaming. This will attach an artifact: > org.apache.spark:spark-streaming_2.11:*{color:#ff}test-jar{color}*:tests:2.3.2 > Also, in streaming pom, there is a shade test configuration, it will attach > an artifact: > org.apache.spark:spark-streaming_2.11:{color:#ff}*jar*{color}:tests:2.3.2 > But two artifacts actually point to the same file: > spark-streaming_2.11-2.3.2-tests.jar > So, when deploy spark to nexus using mvn deploy, maven will upload the > test.jar twice since it belongs to 2 artifacts. Then deploy fails due to > nexus don't allow overrides existing file for non-SNAPSHOT release. > What's more, after checking the spark-streaming, shaded test jar is exactly > same to original test jar. It seems we can just delete the related shade > config. > {code:java} > > > target/scala-${scala.binary.version}/classes > > target/scala-${scala.binary.version}/test-classes > > > org.apache.maven.plugins > maven-shade-plugin > > true > > > > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26299) [MAVEN] Shaded test jar in spark-streaming cause deploy test jar twice
[ https://issues.apache.org/jira/browse/SPARK-26299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu, Linhong updated SPARK-26299: - Description: In spark root pom, there is a test-jar execution for all sub-projects including spark-streaming. This will attach an artifact: org.apache.spark:spark-streaming_2.11:*{color:#ff}test-jar{color}*:tests:2.3.2 Also, in streaming pom, there is a shade test configuration, it will attach an artifact: org.apache.spark:spark-streaming_2.11:{color:#ff}*jar*{color}:tests:2.3.2 But two artifacts actually point to the same file: spark-streaming_2.11-2.3.2-tests.jar So, when deploy spark to nexus using mvn deploy, maven will upload the test.jar twice since it belongs to 2 artifacts. Then deploy fails due to nexus don't allow overrides existing file for non-SNAPSHOT release. What's more, after checking the spark-streaming, shaded test jar is exactly same to original test jar. It seems we can just delete the related shade config. ``` target/scala-${scala.binary.version}/classes target/scala-${scala.binary.version}/test-classes org.apache.maven.plugins maven-shade-plugin true ``` was: In spark root pom, there is a test-jar execution for all sub-projects including spark-streaming. 
This will attach an artifact: org.apache.spark:spark-streaming_2.11:*{color:#FF}test-jar{color}*:tests:2.3.2 Also, in streaming pom, there is a shade test configuration, it will attach an artifact: org.apache.spark:spark-streaming_2.11:{color:#FF}*jar*{color}:tests:2.3.2 But two artifacts actually point to the same file: spark-streaming_2.11-2.3.2-tests.jar So, when deploy spark to nexus using mvn deploy, > [MAVEN] Shaded test jar in spark-streaming cause deploy test jar twice > -- > > Key: SPARK-26299 > URL: https://issues.apache.org/jira/browse/SPARK-26299 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.2, 2.4.0 >Reporter: Liu, Linhong >Priority: Major > > In spark root pom, there is a test-jar execution for all sub-projects > including spark-streaming. This will attach an artifact: > org.apache.spark:spark-streaming_2.11:*{color:#ff}test-jar{color}*:tests:2.3.2 > Also, in streaming pom, there is a shade test configuration, it will attach > an artifact: > org.apache.spark:spark-streaming_2.11:{color:#ff}*jar*{color}:tests:2.3.2 > But two artifacts actually point to the same file: > spark-streaming_2.11-2.3.2-tests.jar > So, when deploy spark to nexus using mvn deploy, maven will upload the > test.jar twice since it belongs to 2 artifacts. Then deploy fails due to > nexus don't allow overrides existing file for non-SNAPSHOT release. > What's more, after checking the spark-streaming, shaded test jar is exactly > same to original test jar. It seems we can just delete the related shade > config. > ``` > > > target/scala-${scala.binary.version}/classes > > target/scala-${scala.binary.version}/test-classes > > > org.apache.maven.plugins > maven-shade-plugin > > true > > > > > ``` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26299) [MAVEN] Shaded test jar in spark-streaming cause deploy test jar twice
[ https://issues.apache.org/jira/browse/SPARK-26299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu, Linhong updated SPARK-26299: - Affects Version/s: 2.3.2
> [MAVEN] Shaded test jar in spark-streaming cause deploy test jar twice
> --
>
> Key: SPARK-26299
> URL: https://issues.apache.org/jira/browse/SPARK-26299
> Project: Spark
> Issue Type: Bug
> Components: Deploy
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Liu, Linhong
>Priority: Major
>
> In the Spark root pom there is a test-jar execution for all sub-projects, including spark-streaming. This attaches an artifact: org.apache.spark:spark-streaming_2.11:*{color:#FF}test-jar{color}*:tests:2.3.2
> Also, in the streaming pom there is a shaded-test-jar configuration, which attaches another artifact: org.apache.spark:spark-streaming_2.11:{color:#FF}*jar*{color}:tests:2.3.2
> But the two artifacts actually point to the same file: spark-streaming_2.11-2.3.2-tests.jar
> So, when deploying Spark to Nexus using mvn deploy,
[jira] [Updated] (SPARK-26299) [MAVEN] Shaded test jar in spark-streaming cause deploy test jar twice
[ https://issues.apache.org/jira/browse/SPARK-26299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu, Linhong updated SPARK-26299: - Description: In the Spark root pom there is a test-jar execution for all sub-projects, including spark-streaming. This attaches an artifact: org.apache.spark:spark-streaming_2.11:*{color:#FF}test-jar{color}*:tests:2.3.2 Also, in the streaming pom there is a shaded-test-jar configuration, which attaches another artifact: org.apache.spark:spark-streaming_2.11:{color:#FF}*jar*{color}:tests:2.3.2 But the two artifacts actually point to the same file: spark-streaming_2.11-2.3.2-tests.jar So, when deploying Spark to Nexus using mvn deploy,
> [MAVEN] Shaded test jar in spark-streaming cause deploy test jar twice
> --
>
> Key: SPARK-26299
> URL: https://issues.apache.org/jira/browse/SPARK-26299
> Project: Spark
> Issue Type: Bug
> Components: Deploy
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Liu, Linhong
>Priority: Major
>
> In the Spark root pom there is a test-jar execution for all sub-projects, including spark-streaming. This attaches an artifact: org.apache.spark:spark-streaming_2.11:*{color:#FF}test-jar{color}*:tests:2.3.2
> Also, in the streaming pom there is a shaded-test-jar configuration, which attaches another artifact: org.apache.spark:spark-streaming_2.11:{color:#FF}*jar*{color}:tests:2.3.2
> But the two artifacts actually point to the same file: spark-streaming_2.11-2.3.2-tests.jar
> So, when deploying Spark to Nexus using mvn deploy,
[jira] [Created] (SPARK-26299) [MAVEN] Shaded test jar in spark-streaming cause deploy test jar twice
Liu, Linhong created SPARK-26299: Summary: [MAVEN] Shaded test jar in spark-streaming cause deploy test jar twice Key: SPARK-26299 URL: https://issues.apache.org/jira/browse/SPARK-26299 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.4.0 Reporter: Liu, Linhong
[jira] [Resolved] (SPARK-26289) cleanup enablePerfMetrics parameter from BytesToBytesMap
[ https://issues.apache.org/jira/browse/SPARK-26289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26289. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23244 [https://github.com/apache/spark/pull/23244]
> cleanup enablePerfMetrics parameter from BytesToBytesMap
> --
>
> Key: SPARK-26289
> URL: https://issues.apache.org/jira/browse/SPARK-26289
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Assignee: caoxuewen
>Priority: Major
> Fix For: 3.0.0
>
> enablePerfMetrics was originally designed in BytesToBytesMap to control getNumHashCollisions, getTimeSpentResizingNs and getAverageProbesPerLookup. However, as Spark evolved, this parameter is now only used for getAverageProbesPerLookup and is always set to true when using BytesToBytesMap. It is also error-prone that callers must check whether the flag is enabled, since getAverageProbesPerLookup throws an IllegalStateException otherwise. So this PR removes the enablePerfMetrics parameter from BytesToBytesMap. Thanks.
[jira] [Assigned] (SPARK-26289) cleanup enablePerfMetrics parameter from BytesToBytesMap
[ https://issues.apache.org/jira/browse/SPARK-26289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26289: --- Assignee: caoxuewen
> cleanup enablePerfMetrics parameter from BytesToBytesMap
> --
>
> Key: SPARK-26289
> URL: https://issues.apache.org/jira/browse/SPARK-26289
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Assignee: caoxuewen
>Priority: Major
> Fix For: 3.0.0
>
> enablePerfMetrics was originally designed in BytesToBytesMap to control getNumHashCollisions, getTimeSpentResizingNs and getAverageProbesPerLookup. However, as Spark evolved, this parameter is now only used for getAverageProbesPerLookup and is always set to true when using BytesToBytesMap. It is also error-prone that callers must check whether the flag is enabled, since getAverageProbesPerLookup throws an IllegalStateException otherwise. So this PR removes the enablePerfMetrics parameter from BytesToBytesMap. Thanks.
[jira] [Commented] (SPARK-26295) [K8S] serviceAccountName is not set in client mode
[ https://issues.apache.org/jira/browse/SPARK-26295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712192#comment-16712192 ] Marcelo Vanzin commented on SPARK-26295: Can't SPARK-25887 be used for this?
> [K8S] serviceAccountName is not set in client mode
> --
>
> Key: SPARK-26295
> URL: https://issues.apache.org/jira/browse/SPARK-26295
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Adrian Tanase
>Priority: Major
>
> When deploying Spark apps in client mode (in my case from inside the driver pod), one can't specify the service account in accordance with the docs ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]).
> The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is most likely applied in cluster mode only, which would be consistent with spark.kubernetes.authenticate.driver being the cluster-mode prefix.
> We should either inject the service account specified by this property into the client-mode pods, or introduce an equivalent config: spark.kubernetes.authenticate.serviceAccountName
> This is the exception:
> {noformat}
> Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "..." is forbidden: User "system:serviceaccount:mynamespace:default" cannot get pods in the namespace "mynamespace"{noformat}
> The expectation was to see the user `mynamespace:spark` based on my submit command.
> My current workaround is to create a clusterrolebinding with edit rights for the mynamespace:default account.
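The workaround the reporter describes, granting edit rights to the default service account that client-mode driver pods fall back to, can be sketched with kubectl. The binding name and the "mynamespace" namespace below are placeholders taken from the report, not a recommended production setup:

```shell
# Placeholder names: adjust "mynamespace" and the binding name to your cluster.
# Grants the built-in "edit" ClusterRole to the default service account that
# client-mode driver pods use when no serviceAccountName is injected.
kubectl create clusterrolebinding spark-default-edit \
  --clusterrole=edit \
  --serviceaccount=mynamespace:default
```

Note that this widens the permissions of the default service account for every pod in the namespace; the cleaner fix proposed in the issue is a client-mode equivalent of the property (spark.kubernetes.authenticate.serviceAccountName), which did not exist at the time of the report.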
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712116#comment-16712116 ] Dongjoon Hyun commented on SPARK-25987: --- It's not directly related to this, but it seems that we had better upgrade Janino, too. I filed SPARK-26298. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 3.0.0 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
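Since the overflow above happens inside Janino's recursive flowAnalysis during code generation, a commonly suggested mitigation (not a fix for the root cause) is to enlarge the JVM thread stack for the processes that compile the generated code. The stack size and application jar below are illustrative placeholders:

```shell
# Illustrative only: a larger thread stack (-Xss) lets Janino's recursive
# flowAnalysis go deeper before overflowing; it does not fix the underlying
# codegen depth problem. In client mode the driver JVM is already running,
# so prefer --driver-java-options there.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Xss16m" \
  --conf "spark.executor.extraJavaOptions=-Xss16m" \
  --class com.example.ReproApp repro.jar   # placeholder application
```

Reducing the number of chained projections (e.g. fewer addSuffix calls between actions), as the reporter notes, avoids the problem entirely.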
[jira] [Assigned] (SPARK-26298) Upgrade Janino version to 3.0.11
[ https://issues.apache.org/jira/browse/SPARK-26298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26298: Assignee: Apache Spark > Upgrade Janino version to 3.0.11 > > > Key: SPARK-26298 > URL: https://issues.apache.org/jira/browse/SPARK-26298 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > This issue aims to upgrade Janino compiler to [version > 3.0.11|http://janino-compiler.github.io/janino/changelog.html]. > - Fixed issue #63: Script with many "helper" variables. > - At last, the "jdk-9" implementation works. > - Made the code that is required for JDK 9+ (ModuleFinder and consorts) > Java-6-compilable through reflection, so that commons-compiler-jdk can still > be compiled with Java 6, but functions at runtime also with Java 9+. > - Fixed Java 9+ compatibility (JRE module system). > - Fixed issue #65: Compilation Error Messages Generated by JDK. > - Added experimental support for the "StackMapFrame" attribute; not active > yet. > - Fixed issue #61: Make Unparser more flexible. > - Fixed NPEs in various "toString()" methods. > - Fixed issue #62: Optimize static method invocation with rvalue target > expression. > - Merged pull request #60: Added all missing "ClassFile.getConstant*Info()" > methods, removing the necessity for many type casts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26298) Upgrade Janino version to 3.0.11
[ https://issues.apache.org/jira/browse/SPARK-26298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712089#comment-16712089 ] Apache Spark commented on SPARK-26298: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/23250 > Upgrade Janino version to 3.0.11 > > > Key: SPARK-26298 > URL: https://issues.apache.org/jira/browse/SPARK-26298 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to upgrade Janino compiler to [version > 3.0.11|http://janino-compiler.github.io/janino/changelog.html]. > - Fixed issue #63: Script with many "helper" variables. > - At last, the "jdk-9" implementation works. > - Made the code that is required for JDK 9+ (ModuleFinder and consorts) > Java-6-compilable through reflection, so that commons-compiler-jdk can still > be compiled with Java 6, but functions at runtime also with Java 9+. > - Fixed Java 9+ compatibility (JRE module system). > - Fixed issue #65: Compilation Error Messages Generated by JDK. > - Added experimental support for the "StackMapFrame" attribute; not active > yet. > - Fixed issue #61: Make Unparser more flexible. > - Fixed NPEs in various "toString()" methods. > - Fixed issue #62: Optimize static method invocation with rvalue target > expression. > - Merged pull request #60: Added all missing "ClassFile.getConstant*Info()" > methods, removing the necessity for many type casts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26298) Upgrade Janino version to 3.0.11
[ https://issues.apache.org/jira/browse/SPARK-26298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712090#comment-16712090 ] Apache Spark commented on SPARK-26298: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/23250 > Upgrade Janino version to 3.0.11 > > > Key: SPARK-26298 > URL: https://issues.apache.org/jira/browse/SPARK-26298 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to upgrade Janino compiler to [version > 3.0.11|http://janino-compiler.github.io/janino/changelog.html]. > - Fixed issue #63: Script with many "helper" variables. > - At last, the "jdk-9" implementation works. > - Made the code that is required for JDK 9+ (ModuleFinder and consorts) > Java-6-compilable through reflection, so that commons-compiler-jdk can still > be compiled with Java 6, but functions at runtime also with Java 9+. > - Fixed Java 9+ compatibility (JRE module system). > - Fixed issue #65: Compilation Error Messages Generated by JDK. > - Added experimental support for the "StackMapFrame" attribute; not active > yet. > - Fixed issue #61: Make Unparser more flexible. > - Fixed NPEs in various "toString()" methods. > - Fixed issue #62: Optimize static method invocation with rvalue target > expression. > - Merged pull request #60: Added all missing "ClassFile.getConstant*Info()" > methods, removing the necessity for many type casts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26298) Upgrade Janino version to 3.0.11
[ https://issues.apache.org/jira/browse/SPARK-26298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26298: Assignee: (was: Apache Spark) > Upgrade Janino version to 3.0.11 > > > Key: SPARK-26298 > URL: https://issues.apache.org/jira/browse/SPARK-26298 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to upgrade Janino compiler to [version > 3.0.11|http://janino-compiler.github.io/janino/changelog.html]. > - Fixed issue #63: Script with many "helper" variables. > - At last, the "jdk-9" implementation works. > - Made the code that is required for JDK 9+ (ModuleFinder and consorts) > Java-6-compilable through reflection, so that commons-compiler-jdk can still > be compiled with Java 6, but functions at runtime also with Java 9+. > - Fixed Java 9+ compatibility (JRE module system). > - Fixed issue #65: Compilation Error Messages Generated by JDK. > - Added experimental support for the "StackMapFrame" attribute; not active > yet. > - Fixed issue #61: Make Unparser more flexible. > - Fixed NPEs in various "toString()" methods. > - Fixed issue #62: Optimize static method invocation with rvalue target > expression. > - Merged pull request #60: Added all missing "ClassFile.getConstant*Info()" > methods, removing the necessity for many type casts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26194) Support automatic spark.authenticate secret in Kubernetes backend
[ https://issues.apache.org/jira/browse/SPARK-26194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah resolved SPARK-26194. Resolution: Fixed Fix Version/s: 3.0.0 > Support automatic spark.authenticate secret in Kubernetes backend > - > > Key: SPARK-26194 > URL: https://issues.apache.org/jira/browse/SPARK-26194 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Major > Fix For: 3.0.0 > > > Currently k8s inherits the default behavior for {{spark.authenticate}}, which > is that the user must provide an auth secret. > k8s doesn't have that requirement and could instead generate its own unique > per-app secret, and propagate it to executors. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26298) Upgrade Janino version to 3.0.11
Dongjoon Hyun created SPARK-26298: - Summary: Upgrade Janino version to 3.0.11 Key: SPARK-26298 URL: https://issues.apache.org/jira/browse/SPARK-26298 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0 Reporter: Dongjoon Hyun This issue aims to upgrade Janino compiler to [version 3.0.11|http://janino-compiler.github.io/janino/changelog.html]. - Fixed issue #63: Script with many "helper" variables. - At last, the "jdk-9" implementation works. - Made the code that is required for JDK 9+ (ModuleFinder and consorts) Java-6-compilable through reflection, so that commons-compiler-jdk can still be compiled with Java 6, but functions at runtime also with Java 9+. - Fixed Java 9+ compatibility (JRE module system). - Fixed issue #65: Compilation Error Messages Generated by JDK. - Added experimental support for the "StackMapFrame" attribute; not active yet. - Fixed issue #61: Make Unparser more flexible. - Fixed NPEs in various "toString()" methods. - Fixed issue #62: Optimize static method invocation with rvalue target expression. - Merged pull request #60: Added all missing "ClassFile.getConstant*Info()" methods, removing the necessity for many type casts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712067#comment-16712067 ] Dongjoon Hyun edited comment on SPARK-25987 at 12/6/18 10:06 PM: - Hi, [~kiszk]. The situation is very similar with SPARK-22523 and this still happens in Spark 2.4.0 and master branch. Do we have an existing issue to track this? was (Author: dongjoon): Hi, [~kiszk]. The situation is very similar with SPARK-22523 and this still happens in Spark 2.4.0. Do we have an existing issue to track this? > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 3.0.0 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. 
[jira] [Updated] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25987: -- Affects Version/s: 3.0.0 > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 3.0.0 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16712067#comment-16712067 ] Dongjoon Hyun commented on SPARK-25987: --- Hi, [~kiszk]. The situation is very similar with SPARK-22523 and this still happens in Spark 2.4.0. Do we have an existing issue to track this? > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25987: -- Description: When I execute {code:java} import org.apache.spark.sql._ import org.apache.spark.sql.types._ val columnsCount = 100 val columns = (1 to columnsCount).map(i => s"col$i") val initialData = (1 to columnsCount).map(i => s"val$i") val df = spark.createDataFrame( rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), schema = StructType(columns.map(StructField(_, StringType, true))) ) val addSuffixUDF = udf( (str: String) => str + "_added" ) implicit class DFOps(df: DataFrame) { def addSuffix() = { df.select(columns.map(col => addSuffixUDF(df(col)).as(col) ): _*) } } df .addSuffix() .addSuffix() .addSuffix() .show() {code} I get {code:java} An exception or error caused a run to abort. java.lang.StackOverflowError at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) ... {code} If I reduce columns number (to 10 for example) or do `addSuffix` only once - it works fine. was: When I execute {code:java} val columnsCount = 100 val columns = (1 to columnsCount).map(i => s"col$i") val initialData = (1 to columnsCount).map(i => s"val$i") val df = sparkSession.createDataFrame( rowRDD = sparkSession.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), schema = StructType(columns.map(StructField(_, StringType, true))) ) val addSuffixUDF = udf( (str: String) => str + "_added" ) implicit class DFOps(df: DataFrame) { def addSuffix() = { df.select(columns.map(col => addSuffixUDF(df(col)).as(col) ): _*) } } df .addSuffix() .addSuffix() .addSuffix() .show() {code} I get {code:java} An exception or error caused a run to abort. java.lang.StackOverflowError at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) ... 
{code} If I reduce columns number (to 10 for example) or do `addSuffix` only once - it works fine. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df > .addSuffix() > .addSuffix() > .addSuffix() > .show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25987: -- Affects Version/s: 2.4.0 > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25987: -- Description: When I execute {code:java} import org.apache.spark.sql._ import org.apache.spark.sql.types._ val columnsCount = 100 val columns = (1 to columnsCount).map(i => s"col$i") val initialData = (1 to columnsCount).map(i => s"val$i") val df = spark.createDataFrame( rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), schema = StructType(columns.map(StructField(_, StringType, true))) ) val addSuffixUDF = udf( (str: String) => str + "_added" ) implicit class DFOps(df: DataFrame) { def addSuffix() = { df.select(columns.map(col => addSuffixUDF(df(col)).as(col) ): _*) } } df.addSuffix().addSuffix().addSuffix().show() {code} I get {code:java} An exception or error caused a run to abort. java.lang.StackOverflowError at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) ... {code} If I reduce columns number (to 10 for example) or do `addSuffix` only once - it works fine. was: When I execute {code:java} import org.apache.spark.sql._ import org.apache.spark.sql.types._ val columnsCount = 100 val columns = (1 to columnsCount).map(i => s"col$i") val initialData = (1 to columnsCount).map(i => s"val$i") val df = spark.createDataFrame( rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), schema = StructType(columns.map(StructField(_, StringType, true))) ) val addSuffixUDF = udf( (str: String) => str + "_added" ) implicit class DFOps(df: DataFrame) { def addSuffix() = { df.select(columns.map(col => addSuffixUDF(df(col)).as(col) ): _*) } } df .addSuffix() .addSuffix() .addSuffix() .show() {code} I get {code:java} An exception or error caused a run to abort. 
java.lang.StackOverflowError at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) ... {code} If I reduce columns number (to 10 for example) or do `addSuffix` only once - it works fine. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
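A note on mitigation (not from the ticket, and untested here): each chained {{addSuffix()}} adds another full projection over all 100 columns, so the generated code that Janino must flow-analyze grows with every call. Two commonly suggested mitigations are to collapse the chain into a single select, or to disable whole-stage code generation. A hedged PySpark sketch, assuming a live SparkSession `spark` and a 100-column DataFrame `df` like the reporter's:

```python
import functools
from pyspark.sql import functions as F

# Hedged sketch, not from the ticket and untested here. Assumes a live
# SparkSession and a wide DataFrame `df` with column names in `columns`.
add_suffix = F.udf(lambda s: s + "_added")

def add_suffix_times(df, columns, times=3):
    # Compose the UDF `times` times per column inside ONE select, so the
    # plan has a single projection instead of `times` stacked projections.
    def composed(c):
        return functools.reduce(lambda e, _: add_suffix(e), range(times), df[c])
    return df.select([composed(c).alias(c) for c in columns])

# If the overflow persists, disabling whole-stage codegen is a commonly
# cited fallback (smaller generated code at some performance cost):
# spark.conf.set("spark.sql.codegen.wholeStage", "false")
```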
[jira] [Commented] (SPARK-22776) Increase default value of spark.sql.codegen.maxFields
[ https://issues.apache.org/jira/browse/SPARK-22776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711988#comment-16711988 ] Dongjoon Hyun commented on SPARK-22776: --- Hi, [~kiszk]. Is there anything to do here? > Increase default value of spark.sql.codegen.maxFields > - > > Key: SPARK-22776 > URL: https://issues.apache.org/jira/browse/SPARK-22776 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > Since there are lots of effort to avoid limitation of Java class files, > generated code for whole-stage codegen works with wider columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
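For context, a hedged sketch of how this knob is used: {{spark.sql.codegen.maxFields}} is the field-count threshold above which Spark skips whole-stage codegen for a plan. The value below is illustrative only; the ticket does not state the proposed new default.

```python
# Hedged, illustrative value only. Requires an active SparkSession `spark`.
# Schemas wider than this threshold make Spark fall back from
# whole-stage codegen to the slower row-at-a-time evaluation path.
spark.conf.set("spark.sql.codegen.maxFields", "200")
```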
[jira] [Closed] (SPARK-26067) Pandas GROUPED_MAP udf breaks if DF has >255 columns
[ https://issues.apache.org/jira/browse/SPARK-26067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-26067. - > Pandas GROUPED_MAP udf breaks if DF has >255 columns > > > Key: SPARK-26067 > URL: https://issues.apache.org/jira/browse/SPARK-26067 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Abdeali Kothari >Priority: Major > > When I run Spark's Pandas GROUPED_MAP UDFs to apply a UDAF I wrote in > Python/pandas on a grouped dataframe in Spark - it fails if the number of > columns is greater than 255 in Python 3.6 and lower. > {code:java} > import pyspark > from pyspark.sql import types as T, functions as F > spark = pyspark.sql.SparkSession.builder.getOrCreate() > df = spark.createDataFrame( > [[i for i in range(256)], [i+1 for i in range(256)]], schema=["a" + > str(i) for i in range(256)]) > new_schema = T.StructType([ > field for field in df.schema] + [T.StructField("new_row", > T.DoubleType())]) > def myfunc(df): > df['new_row'] = 1 > return df > myfunc_udf = F.pandas_udf(new_schema, F.PandasUDFType.GROUPED_MAP)(myfunc) > df2 = df.groupBy(["a1"]).apply(myfunc_udf) > print(df2.count()) # This FAILS > # ERROR: > # Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > # File > "/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line > 219, in main > # func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, > eval_type) > # File > "/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line > 148, in read_udfs > # mapper = eval(mapper_str, udfs) > # File "", line 1 > # SyntaxError: more than 255 arguments > {code} > Note: In Python 3.7 the 255-argument limit was lifted, but I have not tried with > Python 3.7 > ...https://docs.python.org/3.7/whatsnew/3.7.html#other-language-changes > I was using Python 3.5 (from anaconda), Spark 2.3.1 to reproduce this on my > Hadoop Linux cluster
and also on my Mac standalone spark installation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
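The {{SyntaxError: more than 255 arguments}} in the traceback above is a CPython compiler limit, not a Spark one: the quoted worker code evaluates a generated call expression ({{mapper = eval(mapper_str, udfs)}}) with one argument per column, which trips the limit on wide DataFrames. The limit was removed in Python 3.7 (bpo-12844, linked in the report). A self-contained check, independent of Spark:

```python
import sys

# Build source for a call with 300 positional arguments: "f(0, 1, ..., 299)".
call_src = "f(" + ", ".join(str(i) for i in range(300)) + ")"

try:
    compile(call_src, "<demo>", "eval")
    supports_wide_calls = True
except SyntaxError:
    # CPython <= 3.6 raises "SyntaxError: more than 255 arguments" here.
    supports_wide_calls = False

# True on CPython >= 3.7, False on 3.6 and earlier.
print(supports_wide_calls)
```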
[jira] [Updated] (SPARK-26286) Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test
[ https://issues.apache.org/jira/browse/SPARK-26286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26286: -- Priority: Trivial (was: Minor) > Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test > --- > > Key: SPARK-26286 > URL: https://issues.apache.org/jira/browse/SPARK-26286 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.4.0 >Reporter: wangjiaochun >Priority: Trivial > > Add max page size exception bounds checking test -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26292) Assert statement of currentPage may be not in right place
[ https://issues.apache.org/jira/browse/SPARK-26292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26292. --- Resolution: Not A Problem Fix Version/s: (was: 2.4.0) Please also don't set Fix Version > Assert statement of currentPage may be not in right place > -- > > Key: SPARK-26292 > URL: https://issues.apache.org/jira/browse/SPARK-26292 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wangjiaochun >Priority: Minor > > The assert statement on currentPage is not in the right place; it should be > moved after the call to allocatePage in the acquireNewPageIfNecessary > function. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26294) Delete Unnecessary If statement
[ https://issues.apache.org/jira/browse/SPARK-26294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26294: -- Priority: Trivial (was: Minor) This is too trivial for a JIRA. > Delete Unnecessary If statement > --- > > Key: SPARK-26294 > URL: https://issues.apache.org/jira/browse/SPARK-26294 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wangjiaochun >Priority: Trivial > > Delete the unnecessary if statement: its condition can never be true when > records is less than or equal to zero, and the code only executes when > records is greater than zero. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26236) Kafka delegation token support documentation
[ https://issues.apache.org/jira/browse/SPARK-26236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26236. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23195 [https://github.com/apache/spark/pull/23195] > Kafka delegation token support documentation > > > Key: SPARK-26236 > URL: https://issues.apache.org/jira/browse/SPARK-26236 > Project: Spark > Issue Type: New Feature > Components: Documentation, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.0 > > > Because SPARK-25501 merged to master now it's time to update the docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26236) Kafka delegation token support documentation
[ https://issues.apache.org/jira/browse/SPARK-26236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-26236: -- Assignee: Gabor Somogyi > Kafka delegation token support documentation > > > Key: SPARK-26236 > URL: https://issues.apache.org/jira/browse/SPARK-26236 > Project: Spark > Issue Type: New Feature > Components: Documentation, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.0 > > > Because SPARK-25501 merged to master now it's time to update the docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26254) Move delegation token providers into a separate project
[ https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711516#comment-16711516 ] Gabor Somogyi edited comment on SPARK-26254 at 12/6/18 6:32 PM: I've reached a state where tradeoff has to be made, so interested in opinions [~vanzin] [~ste...@apache.org] I've created a project with token-providers name which is depending on core. With this successfully extracted all the nasty hive + kafka dependencies + all token providers are there. Then loaded the providers with ServiceLoader which also works fine. Finally reached a point where kafka-sql project expects couple of things from KafkaUtil which is in token-providers now. Here is the list of problems: {noformat} [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala:31: object KafkaTokenUtil is not a member of package org.apache.spark.deploy.security [error] import org.apache.spark.deploy.security.KafkaTokenUtil [error]^ [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSecurityHelper.scala:25: object KafkaTokenUtil is not a member of package org.apache.spark.deploy.security [error] import org.apache.spark.deploy.security.KafkaTokenUtil [error]^ [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSecurityHelper.scala:32: not found: value KafkaTokenUtil [error] KafkaTokenUtil.TOKEN_SERVICE) != null [error] ^ [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSecurityHelper.scala:37: not found: value KafkaTokenUtil [error] KafkaTokenUtil.TOKEN_SERVICE) [error] ^ [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala:566: not found: value KafkaTokenUtil [error] if (KafkaTokenUtil.isGlobalJaasConfigurationProvided) { [error] ^ + all 
isTokenAvailable tests expects TOKEN_KIND, TOKEN_SERVICE + KafkaDelegationTokenIdentifier in KafkaSecurityHelperSuite {noformat} Here I see these possibilities: * Hardcode TOKEN_KIND + TOKEN_SERVICE and duplicate isGlobalJaasConfigurationProvided => The drawback here is we can't really test whether the provider created token can be read in kafka-sql (we can actually but with hardcoded strings in both sides which makes it brittle) * As we're loading providers with ServiceLoader the kafka related one can be moved to kafka-sql => The drawback is that providers spread around and this code can't really be reused in DStreams. * Extract KafkaUtil to a kafka specific project => It's too heavyweight * Add token-providers provided dependency to kafka-sql project => It's kinda' weird but at the moment looks the least problematic Waiting on opinions... was (Author: gsomogyi): I've reached a state where tradeoff has to be made, so interested in opinions [~vanzin] [~ste...@apache.org] I've created a project with token-providers name which is depending on core. With this successfully extracted all the nasty hive + kafka dependencies + all token providers are there. Then loaded the providers with ServiceLoader which also works fine. Finally reached a point where kafka-sql project expects couple of things from KafkaUtil which is in token-providers now. 
Here is the list of problems: {noformat} [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala:31: object KafkaTokenUtil is not a member of package org.apache.spark.deploy.security [error] import org.apache.spark.deploy.security.KafkaTokenUtil [error]^ [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSecurityHelper.scala:25: object KafkaTokenUtil is not a member of package org.apache.spark.deploy.security [error] import org.apache.spark.deploy.security.KafkaTokenUtil [error]^ [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSecurityHelper.scala:32: not found: value KafkaTokenUtil [error] KafkaTokenUtil.TOKEN_SERVICE) != null [error] ^ [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSecurityHelper.scala:37: not found: value KafkaTokenUtil [error] KafkaTokenUtil.TOKEN_SERVICE) [error] ^ [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala:566: not found: value KafkaTokenUtil [error] if (KafkaTokenUtil.isGlobalJaasConfigurationProvided) { [error] ^ + all isTokenAvailable tests expects TOKEN_KIND, TOKEN_SERVICE + KafkaDelegationTokenIdentifier in
[jira] [Resolved] (SPARK-26213) Custom Receiver for Structured streaming
[ https://issues.apache.org/jira/browse/SPARK-26213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi resolved SPARK-26213. --- Resolution: Information Provided > Custom Receiver for Structured streaming > > > Key: SPARK-26213 > URL: https://issues.apache.org/jira/browse/SPARK-26213 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Aarthi >Priority: Major > > Hi, > I have implemented a custom receiver for an https/json data source by > implementing the Receiver abstract class as provided in the documentation > here [https://spark.apache.org/docs/latest//streaming-custom-receivers.html] > This approach works on a Spark streaming context where the custom receiver > class is passed to receiverStream. However I would like to implement the > same for Structured Streaming, as each of the DStreams has a complex > structure and they need to be joined with each other based on complex rules. > ([https://stackoverflow.com/questions/53449599/join-two-spark-dstreams-with-complex-nested-structure]) > Structured Streaming uses the SparkSession object, which takes in a > DataStreamReader, and that is a final class. Please advise on how to implement > a custom receiver for Structured Streaming. > Thanks, -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26296) Base Spark over Java and not over Scala, offering Scala as an option over Spark
[ https://issues.apache.org/jira/browse/SPARK-26296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26296. Resolution: Won't Fix Lack of newer JDK support is not only the fault of Scala. There's a lot of Java code that doesn't work on the new JDK either. > Base Spark over Java and not over Scala, offering Scala as an option over > Spark > --- > > Key: SPARK-26296 > URL: https://issues.apache.org/jira/browse/SPARK-26296 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: M. Le Bihan >Priority: Minor > > I am not using _Scala_ when I am programming _Spark_ but plain _Java_, and I > believe those using _PySpark_ aren't either. > > But _Spark_ has been built over _Scala_ instead of plain _Java_ and this is a > source of trouble, especially when upgrading the JDK. We are waiting to use > JDK 11, and [Scala is still holding Spark back on an older JDK > version|https://docs.scala-lang.org/overviews/jdk-compatibility/overview.html]. > _Big Data_ programming should not force developers to get by with _Scala_ > when it's not the language they have chosen. > > Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ > without _Hadoop_, would comfort me: a cause of issues would disappear. > Providing an optional _spark-scala_ artifact would be fine, as those who do > not need it wouldn't download it and wouldn't be affected by it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25274) Improve toPandas with Arrow by sending out-of-order record batches
[ https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-25274. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22275 https://github.com/apache/spark/pull/22275 > Improve toPandas with Arrow by sending out-of-order record batches > -- > > Key: SPARK-25274 > URL: https://issues.apache.org/jira/browse/SPARK-25274 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 3.0.0 > > > When executing {{toPandas}} with Arrow enabled, partitions that arrive in the > JVM out of order must be buffered before they can be sent to Python. This > causes an excess of memory to be used in the driver JVM and increases the > time it takes to complete because data must sit in the JVM waiting for > preceding partitions to come in. > This can be improved by sending out-of-order partitions to Python as soon as > they arrive in the JVM, followed by a list of indices so that Python can > assemble the data in the correct order. This way, data is not buffered at the > JVM and there is no waiting on particular partitions so performance will be > increased. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
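For reference, the code path this ticket improves is only taken when the Arrow optimization is enabled. A hedged usage sketch (the config name below is the Spark 2.x one; in Spark 3.x it was renamed to {{spark.sql.execution.arrow.pyspark.enabled}}):

```python
# Hedged sketch; assumes an active SparkSession `spark` and a DataFrame `df`.
# With Arrow enabled, toPandas() transfers columnar record batches instead of
# pickled rows; this ticket lets those batches arrive out of order.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = df.toPandas()
```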
[jira] [Assigned] (SPARK-25274) Improve toPandas with Arrow by sending out-of-order record batches
[ https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-25274: Assignee: Bryan Cutler > Improve toPandas with Arrow by sending out-of-order record batches > -- > > Key: SPARK-25274 > URL: https://issues.apache.org/jira/browse/SPARK-25274 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > > When executing {{toPandas}} with Arrow enabled, partitions that arrive in the > JVM out-of-order must be buffered before they can be send to Python. This > causes an excess of memory to be used in the driver JVM and increases the > time it takes to complete because data must sit in the JVM waiting for > preceding partitions to come in. > This can be improved by sending out-of-order partitions to Python as soon as > they arrive in the JVM, followed by a list of indices so that Python can > assemble the data in the correct order. This way, data is not buffered at the > JVM and there is no waiting on particular partitions so performance will be > increased. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26297) improve the doc of Distribution/Partitioning
[ https://issues.apache.org/jira/browse/SPARK-26297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26297: Assignee: Apache Spark (was: Wenchen Fan) > improve the doc of Distribution/Partitioning > > > Key: SPARK-26297 > URL: https://issues.apache.org/jira/browse/SPARK-26297 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26297) improve the doc of Distribution/Partitioning
[ https://issues.apache.org/jira/browse/SPARK-26297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711628#comment-16711628 ] Apache Spark commented on SPARK-26297: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23249 > improve the doc of Distribution/Partitioning > > > Key: SPARK-26297 > URL: https://issues.apache.org/jira/browse/SPARK-26297 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26297) improve the doc of Distribution/Partitioning
[ https://issues.apache.org/jira/browse/SPARK-26297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26297: Assignee: Wenchen Fan (was: Apache Spark) > improve the doc of Distribution/Partitioning > > > Key: SPARK-26297 > URL: https://issues.apache.org/jira/browse/SPARK-26297 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26297) improve the doc of Distribution/Partitioning
Wenchen Fan created SPARK-26297: --- Summary: improve the doc of Distribution/Partitioning Key: SPARK-26297 URL: https://issues.apache.org/jira/browse/SPARK-26297 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26244) Do not use case class as public API
[ https://issues.apache.org/jira/browse/SPARK-26244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26244. --- Resolution: Done > Do not use case class as public API > --- > > Key: SPARK-26244 > URL: https://issues.apache.org/jira/browse/SPARK-26244 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > Labels: release-notes > > It's a bad idea to use case class as public API, as it has a very wide > surface. For example, the copy method, its fields, the companion object, etc. > I don't think it's expect to expose so many stuff to end users, and usually > we only want to expose a few methods. > We should use a pure trait as public API, and use case class as an > implementation, which should be private and hide from end users. > Changing class to interface is not binary compatible(but source compatible), > so 3.0 is a good chance to do it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26254) Move delegation token providers into a separate project
[ https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711516#comment-16711516 ] Gabor Somogyi commented on SPARK-26254: --- I've reached a state where tradeoff has to be made, so interested in opinions [~vanzin] [~ste...@apache.org] I've created a project with token-providers name which is depending on core. With this successfully extracted all the nasty hive + kafka dependencies + all token providers are there. Then loaded the providers with ServiceLoader which also works fine. Finally reached a point where kafka-sql project expects couple of things from KafkaUtil which is in token-providers now. Here is the list of problems: {noformat} [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala:31: object KafkaTokenUtil is not a member of package org.apache.spark.deploy.security [error] import org.apache.spark.deploy.security.KafkaTokenUtil [error]^ [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSecurityHelper.scala:25: object KafkaTokenUtil is not a member of package org.apache.spark.deploy.security [error] import org.apache.spark.deploy.security.KafkaTokenUtil [error]^ [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSecurityHelper.scala:32: not found: value KafkaTokenUtil [error] KafkaTokenUtil.TOKEN_SERVICE) != null [error] ^ [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSecurityHelper.scala:37: not found: value KafkaTokenUtil [error] KafkaTokenUtil.TOKEN_SERVICE) [error] ^ [error] /Users/gaborsomogyi/spark/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala:566: not found: value KafkaTokenUtil [error] if (KafkaTokenUtil.isGlobalJaasConfigurationProvided) { [error] ^ + all isTokenAvailable tests 
expects TOKEN_KIND, TOKEN_SERVICE + KafkaDelegationTokenIdentifier in KafkaSecurityHelperSuite {noformat} Here I see these possibilities: * Hardcode TOKEN_KIND + TOKEN_SERVICE and duplicate isGlobalJaasConfigurationProvided => The drawback here is we can't really test whether the provider created token can be read in kafka-sql (we can actually but with hardcoded strings in both sides which makes it brittle) * As we're loading providers with ServiceLoader the kafka related one can be moved to kafka-sql => The drawback is that providers spread around and this code can't really be reused in DStreams. * Add token-providers provided dependency to kafka-sql project => It's kinda' weird but at the moment looks the least problematic Waiting on opinions... > Move delegation token providers into a separate project > --- > > Key: SPARK-26254 > URL: https://issues.apache.org/jira/browse/SPARK-26254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > There was a discussion in > [PR#22598|https://github.com/apache/spark/pull/22598] that there are several > provided dependencies inside core project which shouldn't be there (for ex. > hive and kafka). This jira is to solve this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26265) deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
[ https://issues.apache.org/jira/browse/SPARK-26265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711500#comment-16711500 ] Hyukjin Kwon commented on SPARK-26265: -- Thanks, [~qianhan], can you investigate and make a fix to Spark? > deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator > -- > > Key: SPARK-26265 > URL: https://issues.apache.org/jira/browse/SPARK-26265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: qian han >Priority: Major > > The application is running on a cluster with 72000 cores and 182000G mem. > Enviroment: > |spark.dynamicAllocation.minExecutors|5| > |spark.dynamicAllocation.initialExecutors|30| > |spark.dynamicAllocation.maxExecutors|400| > |spark.executor.cores|4| > |spark.executor.memory|20g| > > > Stage description: > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364) > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422) > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357) > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:193) > > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:498) > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894) > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198) > org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228) > org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) > 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > > jstack information as follow: > Found one Java-level deadlock: = > "Thread-ScriptTransformation-Feed": waiting to lock monitor > 0x00e0cb18 (object 0x0002f1641538, a > org.apache.spark.memory.TaskMemoryManager), which is held by "Executor task > launch worker for task 18899" "Executor task launch worker for task 18899": > waiting to lock monitor 0x00e09788 (object 0x000302faa3b0, a > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator), which is held by > "Thread-ScriptTransformation-Feed" Java stack information for the threads > listed above: === > "Thread-ScriptTransformation-Feed": at > org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:332) > - waiting to lock <0x0002f1641538> (a > org.apache.spark.memory.TaskMemoryManager) at > org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130) at > org.apache.spark.unsafe.map.BytesToBytesMap.access$300(BytesToBytesMap.java:66) > at > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.advanceToNextPage(BytesToBytesMap.java:274) > - locked <0x000302faa3b0> (a > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator) at > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.next(BytesToBytesMap.java:313) > at > org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap$1.next(UnsafeFixedWidthAggregationMap.java:173) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at > scala.collection.Iterator$class.foreach(Iterator.scala:893) at > scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at > 
org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply$mcV$sp(ScriptTransformationExec.scala:281) > at > org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:270) > at > org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread$$anonfun$run$1.apply(ScriptTransformationExec.scala:270) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1995) at > org.apache.spark.sql.hive.execution.ScriptTransformationWriterThread.run(ScriptTransformationExec.scala:270) > "Executor task launch worker for task 18899": at > org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.spill(BytesToBytesMap.java:345) > -
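[Editor's note] The jstack above is a textbook lock-ordering inversion: the feed thread holds the MapIterator monitor and wants the TaskMemoryManager monitor, while the worker thread holds the TaskMemoryManager and wants the MapIterator. As an illustrative sketch only (plain Python, not Spark code; the lock names are stand-ins for the two monitors in the trace), the standard fix is to make every code path take the locks in one global order:

```python
import threading

# Stand-ins for the two monitors in the jstack: the TaskMemoryManager and
# the BytesToBytesMap$MapIterator. Names are illustrative only.
task_memory_manager = threading.Lock()
map_iterator = threading.Lock()

def free_page():
    # Mirrors Thread-ScriptTransformation-Feed: it holds the iterator
    # monitor (advanceToNextPage) and then needs the memory manager
    # (freePage). Iterator first, memory manager second.
    with map_iterator:
        with task_memory_manager:
            pass  # release a memory page

def spill():
    # The deadlocked ordering in the trace was the reverse: the worker
    # held the TaskMemoryManager and then tried to lock the MapIterator.
    # The fix sketched here imposes the same global order as free_page(),
    # so the two paths can never wait on each other in a cycle.
    with map_iterator:
        with task_memory_manager:
            pass  # spill map contents to disk

threads = [threading.Thread(target=f) for f in (free_page, spill) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("no deadlock: all threads finished")
```

With the inverted ordering restored in spill(), the same script can hang exactly as the reported stage did.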
[jira] [Reopened] (SPARK-26265) deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
[ https://issues.apache.org/jira/browse/SPARK-26265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-26265: -- > deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator > -- > > Key: SPARK-26265 > URL: https://issues.apache.org/jira/browse/SPARK-26265 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: qian han >Priority: Major
[jira] [Updated] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService
[ https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weixiuli updated SPARK-26288: - Description: Spark on YARN records RegisteredExecutors information in a DB so that it can be reloaded and used again when the ExternalShuffleService is restarted. Neither Spark standalone mode nor Spark on Kubernetes records this information, so the RegisteredExecutors information is lost whenever the ExternalShuffleService is restarted. To solve the problem above, a method is proposed and has been committed. was: Spark on YARN uses a DB to record RegisteredExecutors information, which can be reloaded and reused when the ExternalShuffleService restarts. Neither Spark standalone mode nor Spark on Kubernetes records its RegisteredExecutors information in a DB or elsewhere, so when the ExternalShuffleService restarts that information is lost, which is not what we want. This commit adds initRegisteredExecutorsDB, which either Spark standalone or Spark on Kubernetes can use to record RegisteredExecutors information so that it can be reloaded and reused when the ExternalShuffleService restarts. > add initRegisteredExecutorsDB in ExternalShuffleService > --- > > Key: SPARK-26288 > URL: https://issues.apache.org/jira/browse/SPARK-26288 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Shuffle >Affects Versions: 2.4.0 >Reporter: weixiuli >Priority: Major > Fix For: 2.4.0 > > > Spark on YARN records RegisteredExecutors information in a DB so that it can > be reloaded and used again when the ExternalShuffleService is restarted. > Neither Spark standalone mode nor Spark on Kubernetes records this > information, so the RegisteredExecutors information is lost whenever the > ExternalShuffleService is restarted. 
> To solve the problem above, a method is proposed and has been committed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
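[Editor's note] What "recording RegisteredExecutors in a DB" buys is that a restarted shuffle service can keep serving shuffle files for executors that registered before the restart. A minimal file-backed sketch of that idea in Python is below. The class and method names are hypothetical, not Spark's actual API, and a JSON file stands in for the LevelDB store the YARN shuffle service uses:

```python
import json
import os
import tempfile

class RegisteredExecutorsDB:
    """Hypothetical sketch: persist executor registrations to a JSON file
    so a restarted shuffle service can reload them. Names are illustrative,
    not Spark's API."""

    def __init__(self, path):
        self.path = path
        self.executors = {}
        if os.path.exists(path):  # reload surviving registrations after a restart
            with open(path) as f:
                self.executors = json.load(f)

    def register(self, app_id, exec_id, shuffle_info):
        self.executors[f"{app_id}:{exec_id}"] = shuffle_info
        with open(self.path, "w") as f:  # write-through on every registration
            json.dump(self.executors, f)

db_path = os.path.join(tempfile.mkdtemp(), "registeredExecutors.db")
service = RegisteredExecutorsDB(db_path)
service.register("app-20181206", "1", {"localDirs": ["/tmp/blockmgr-x"]})

# Simulated service restart: a fresh instance pointed at the same file
# comes back with the registration intact instead of an empty map.
restarted = RegisteredExecutorsDB(db_path)
print(restarted.executors)
```

Without the reload step in `__init__`, the restarted instance starts empty, which is exactly the standalone/Kubernetes behavior the issue describes.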
[jira] [Updated] (SPARK-26296) Base Spark over Java and not over Scala, offering Scala as an option over Spark
[ https://issues.apache.org/jira/browse/SPARK-26296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M. Le Bihan updated SPARK-26296: Description: I am not using _Scala_ when I program _Spark_, but plain _Java_; I believe those using _PySpark_ do not use Scala either. But _Spark_ has been built over _Scala_ instead of plain _Java_, and this is a source of trouble, especially when upgrading the JDK. We are waiting to use JDK 11, and [Scala is still holding Spark back on a previous version of the JDK|https://docs.scala-lang.org/overviews/jdk-compatibility/overview.html]. _Big Data_ programming should not force developers to get by with _Scala_ when it is not the language they have chosen. Having a _Spark_ without _Scala_, just as it is possible to have a _Spark_ without _Hadoop_, would reassure me: a cause of issues would disappear. Providing an optional _spark-scala_ artifact would be fine, as those who do not need it would not download it and would not be affected by it. was: I am not using _Scala_ when I program _Spark_, but plain _Java_; I believe those using _PySpark_ do not use Scala either. But _Spark_ has been built over _Scala_ instead of plain _Java_, and this is a source of trouble, especially when upgrading the JDK. We are waiting to use JDK 11, and _Scala_ is still holding Spark back on a previous version of the JDK. _Big Data_ programming should not force developers to get by with _Scala_ when it is not the language they have chosen. Having a _Spark_ without _Scala_, just as it is possible to have a _Spark_ without _Hadoop_, would reassure me: a cause of issues would disappear. Providing an optional _spark-scala_ artifact would be fine, as those who do not need it would not download it and would not be affected by it. 
> Base Spark over Java and not over Scala, offering Scala as an option over > Spark > --- > > Key: SPARK-26296 > URL: https://issues.apache.org/jira/browse/SPARK-26296 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: M. Le Bihan >Priority: Minor > > I am not using _Scala_ when I program _Spark_, but plain _Java_; I believe > those using _PySpark_ do not use Scala either. > > But _Spark_ has been built over _Scala_ instead of plain _Java_, and this is > a source of trouble, especially when upgrading the JDK. We are waiting to use > JDK 11, and [Scala is still holding Spark back on a previous version of the > JDK|https://docs.scala-lang.org/overviews/jdk-compatibility/overview.html]. > _Big Data_ programming should not force developers to get by with _Scala_ > when it is not the language they have chosen. > > Having a _Spark_ without _Scala_, just as it is possible to have a _Spark_ > without _Hadoop_, would reassure me: a cause of issues would disappear. > Providing an optional _spark-scala_ artifact would be fine, as those who do > not need it would not download it and would not be affected by it.
[jira] [Updated] (SPARK-26296) Spark build over Java and not over Scala, offering Scala as an option over Spark
[ https://issues.apache.org/jira/browse/SPARK-26296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M. Le Bihan updated SPARK-26296: Description: I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I believe those using _PySpark_ no more. But _Spark_ has been build over _Scala_ instead of plain _Java_ and this is a cause of troubles, especially at the time of upgrading the JDK. We are awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous version of the JDK. _Big Data_ programming shall not force developpers to get by with _Scala_ when its not the language they have choosen. Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ without _Hadoop,_ would confort me : a cause of issues would disappear. Provide an optional _spark-scala_ artifact would be fine as those who do not need it wouldn't download it and wouldn't be affected by it. was: I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I believe those using _PySpark_ no more. But _Spark_ has been build over _Scala_ instead of plain _Java_ and this is a cause of troubles, especially at the time of upgrading the JDK. We are awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous version of the JDK. _Big Data_ programming shall not force developpers to get by with _Scala_ when its not the language they have choosen. Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ without _Hadoop,_ would confort me : a cause of issues would disappear. Provide an optional _spark-scala_ artifact would be fine as those that do not need it won't download it and wouldn't be affected by it. > Spark build over Java and not over Scala, offering Scala as an option over > Spark > > > Key: SPARK-26296 > URL: https://issues.apache.org/jira/browse/SPARK-26296 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: M. 
Le Bihan >Priority: Minor > > I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I > believe those using _PySpark_ no more. > > But _Spark_ has been build over _Scala_ instead of plain _Java_ and this is a > cause of troubles, especially at the time of upgrading the JDK. We are > awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous > version of the JDK. > _Big Data_ programming shall not force developpers to get by with _Scala_ > when its not the language they have choosen. > > Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ > without _Hadoop,_ would confort me : a cause of issues would disappear. > Provide an optional _spark-scala_ artifact would be fine as those who do not > need it wouldn't download it and wouldn't be affected by it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26296) Spark build over Java and not over Scala, offering Scala as an option over Spark
[ https://issues.apache.org/jira/browse/SPARK-26296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M. Le Bihan updated SPARK-26296: Description: I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I believe those using _PySpark_ no more. But _Spark_ as been build over _Scala_ instead of plain _Java_ and its a cause of troubles, especially at the time of upgrading the JDK. We are awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous version of the JDK. _Big Data_ programming shall not force developpers to get by with _Scala_ when its not the language they have choosen. Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ without _Hadoop,_ would confort me : a cause of issues would disappear. Provide an optional _spark-scala_ artifact would be fine as those that do not need it won't download it and wouldn't be affected by it. was: I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I believe those using _PySpark_ no more. But _Spark_ as been build over _Scala_ instead of plain _Java_ and its a cause of troubles, especially at the time of upgrading the JDK. We are awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous version of the JDK. _Big Data_ programming shall not force developpers to get by with _Scala_ when its not the language they have choosen. Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ without _Hadoop,_ would confort me : a cause of issues would disappear. Provide an optional _spark-scala_ artifact would be fine as those that do not need it won't download it. In the same move, you would to return to the generation of standard javadocs for Java classes documentation. 
> Spark build over Java and not over Scala, offering Scala as an option over > Spark > > > Key: SPARK-26296 > URL: https://issues.apache.org/jira/browse/SPARK-26296 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: M. Le Bihan >Priority: Minor > > I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I > believe those using _PySpark_ no more. > > But _Spark_ as been build over _Scala_ instead of plain _Java_ and its a > cause of troubles, especially at the time of upgrading the JDK. We are > awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous > version of the JDK. > _Big Data_ programming shall not force developpers to get by with _Scala_ > when its not the language they have choosen. > > Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ > without _Hadoop,_ would confort me : a cause of issues would disappear. > Provide an optional _spark-scala_ artifact would be fine as those that do not > need it won't download it and wouldn't be affected by it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26296) Spark build over Java and not over Scala, offering Scala as an option over Spark
[ https://issues.apache.org/jira/browse/SPARK-26296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M. Le Bihan updated SPARK-26296: Description: I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I believe those using _PySpark_ no more. But _Spark_ as been build over _Scala_ instead of plain _Java_ and its a cause of troubles, especially at the time of upgrading the JDK. We are awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous version of the JDK. _Big Data_ programming shall not force developpers to get by with _Scala_ when its not the language they have choosen. Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ without _Hadoop,_ would confort me : a cause of issues would disappear. Provide an optional _spark-scala_ artifact would be fine as those that do not need it won't download it. In the same move, you would to return to the generation of standard javadocs for Java classes documentation. was: I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I believe those using _PySpark_ no more. But _Spark_ as been build over _Scala_ instead of plain _Java_ and its a cause troubles, especially at the time of leveling JDK. We are awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous version of the JDK. _Big Data_ programming shall not force developpers to get by with _Scala_ when its not the language they have choosen. Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ without _Hadoop,_ would confort me : a cause of issues would disappear. Provide an optional _spark-scala_ artifact would be fine as those that do not need it won't download it. In the same move, you would to return to the generation of standard javadocs for Java classes documentation. 
> Spark build over Java and not over Scala, offering Scala as an option over > Spark > > > Key: SPARK-26296 > URL: https://issues.apache.org/jira/browse/SPARK-26296 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: M. Le Bihan >Priority: Minor > > I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I > believe those using _PySpark_ no more. > > But _Spark_ as been build over _Scala_ instead of plain _Java_ and its a > cause of troubles, especially at the time of upgrading the JDK. We are > awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous > version of the JDK. > _Big Data_ programming shall not force developpers to get by with _Scala_ > when its not the language they have choosen. > > Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ > without _Hadoop,_ would confort me : a cause of issues would disappear. > Provide an optional _spark-scala_ artifact would be fine as those that do not > need it won't download it. In the same move, you would to return to the > generation of standard javadocs for Java classes documentation. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26296) Spark build over Java and not over Scala, offering Scala as an option over Spark
[ https://issues.apache.org/jira/browse/SPARK-26296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M. Le Bihan updated SPARK-26296: Description: I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I believe those using _PySpark_ no more. But _Spark_ as been build over _Scala_ instead of plain _Java_ and its a cause troubles, especially at the time of leveling JDK. We are awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous version of the JDK. _Big Data_ programming shall not force developpers to get by with _Scala_ when its not the language they have choosen. Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ without _Hadoop,_ would confort me : a cause of issues would disappear. Provide an optional _spark-scala_ artifact would be fine as those that do not need it won't download it. In the same move, you would to return to the generation of standard javadocs for Java classes documentation. was: I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I believe those using _PySpark_ no more. But Spark as been build over _Scala_ instead of plain _Java_ and its a cause troubles, especially at the time of leveling JDK. We are awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous version of the JDK. Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ without _Hadoop,_ would confort me : a cause of issues would disappear. > Spark build over Java and not over Scala, offering Scala as an option over > Spark > > > Key: SPARK-26296 > URL: https://issues.apache.org/jira/browse/SPARK-26296 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: M. Le Bihan >Priority: Minor > > I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I > believe those using _PySpark_ no more. 
> > But _Spark_ as been build over _Scala_ instead of plain _Java_ and its a > cause troubles, especially at the time of leveling JDK. We are awaiting to > use JDK 11 and _Scala_ is still lowering _Spark_ in the previous version of > the JDK. > _Big Data_ programming shall not force developpers to get by with _Scala_ > when its not the language they have choosen. > > Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ > without _Hadoop,_ would confort me : a cause of issues would disappear. > Provide an optional _spark-scala_ artifact would be fine as those that do not > need it won't download it. In the same move, you would to return to the > generation of standard javadocs for Java classes documentation. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26296) Base Spark over Java and not over Scala, offering Scala as an option over Spark
[ https://issues.apache.org/jira/browse/SPARK-26296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M. Le Bihan updated SPARK-26296: Summary: Base Spark over Java and not over Scala, offering Scala as an option over Spark (was: Spark build over Java and not over Scala, offering Scala as an option over Spark) > Base Spark over Java and not over Scala, offering Scala as an option over > Spark > --- > > Key: SPARK-26296 > URL: https://issues.apache.org/jira/browse/SPARK-26296 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: M. Le Bihan >Priority: Minor > > I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I > believe those using _PySpark_ no more. > > But _Spark_ has been build over _Scala_ instead of plain _Java_ and this is a > cause of troubles, especially at the time of upgrading the JDK. We are > awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous > version of the JDK. > _Big Data_ programming shall not force developpers to get by with _Scala_ > when its not the language they have choosen. > > Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ > without _Hadoop,_ would confort me : a cause of issues would disappear. > Provide an optional _spark-scala_ artifact would be fine as those who do not > need it wouldn't download it and wouldn't be affected by it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26295) [K8S] serviceAccountName is not set in client mode
[ https://issues.apache.org/jira/browse/SPARK-26295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711464#comment-16711464 ] Adrian Tanase commented on SPARK-26295: --- ping [~vanzin], > [K8S] serviceAccountName is not set in client mode > -- > > Key: SPARK-26295 > URL: https://issues.apache.org/jira/browse/SPARK-26295 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Adrian Tanase >Priority: Major > > When deploying spark apps in client mode (in my case from inside the driver > pod), one can't specify the service account in accordance with the docs > ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]). > The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is > most likely added in cluster mode only, which would be consistent with > spark.kubernetes.authenticate.driver being the cluster-mode prefix. > We should either inject the service account specified by this property in the > client-mode pods, or specify an equivalent config: > spark.kubernetes.authenticate.serviceAccountName > This is the exception: > {noformat} > Message: Forbidden!Configured service account doesn't have access. Service > account may have been revoked. pods "..." is forbidden: User > "system:serviceaccount:mynamespace:default" cannot get pods in the namespace > "mynamespace"{noformat} > The expectation was to see the user `mynamespace:spark` based on my submit > command. > My current workaround is to create a clusterrolebinding with edit rights for > the mynamespace:default account.
[jira] [Commented] (SPARK-26295) [K8S] serviceAccountName is not set in client mode
[ https://issues.apache.org/jira/browse/SPARK-26295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711462#comment-16711462 ] Adrian Tanase commented on SPARK-26295: --- I would be happy to give it a shot with a bit of guidance, as I can't easily figure out where this should slot in, given the many classes under the *org.apache.spark.deploy.k8s* package. Also, I've just seen the docs calling out this configuration (*spark.kubernetes.authenticate.serviceAccountName*) but I don't think it's implemented, or we have a bug. > [K8S] serviceAccountName is not set in client mode > -- > > Key: SPARK-26295 > URL: https://issues.apache.org/jira/browse/SPARK-26295 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Adrian Tanase >Priority: Major
[jira] [Updated] (SPARK-26296) Spark build over Java and not over Scala, offering Scala as an option over Spark
[ https://issues.apache.org/jira/browse/SPARK-26296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M. Le Bihan updated SPARK-26296: Description: I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I believe those using _PySpark_ no more. But _Spark_ has been build over _Scala_ instead of plain _Java_ and this is a cause of troubles, especially at the time of upgrading the JDK. We are awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous version of the JDK. _Big Data_ programming shall not force developpers to get by with _Scala_ when its not the language they have choosen. Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ without _Hadoop,_ would confort me : a cause of issues would disappear. Provide an optional _spark-scala_ artifact would be fine as those that do not need it won't download it and wouldn't be affected by it. was: I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I believe those using _PySpark_ no more. But _Spark_ as been build over _Scala_ instead of plain _Java_ and its a cause of troubles, especially at the time of upgrading the JDK. We are awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous version of the JDK. _Big Data_ programming shall not force developpers to get by with _Scala_ when its not the language they have choosen. Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ without _Hadoop,_ would confort me : a cause of issues would disappear. Provide an optional _spark-scala_ artifact would be fine as those that do not need it won't download it and wouldn't be affected by it. > Spark build over Java and not over Scala, offering Scala as an option over > Spark > > > Key: SPARK-26296 > URL: https://issues.apache.org/jira/browse/SPARK-26296 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: M. 
Le Bihan >Priority: Minor > > I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I > believe those using _PySpark_ no more. > > But _Spark_ has been build over _Scala_ instead of plain _Java_ and this is a > cause of troubles, especially at the time of upgrading the JDK. We are > awaiting to use JDK 11 and _Scala_ is still lowering _Spark_ in the previous > version of the JDK. > _Big Data_ programming shall not force developpers to get by with _Scala_ > when its not the language they have choosen. > > Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ > without _Hadoop,_ would confort me : a cause of issues would disappear. > Provide an optional _spark-scala_ artifact would be fine as those that do not > need it won't download it and wouldn't be affected by it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26296) Spark build over Java and not over Scala, offering Scala as an option over Spark
M. Le Bihan created SPARK-26296: --- Summary: Spark build over Java and not over Scala, offering Scala as an option over Spark Key: SPARK-26296 URL: https://issues.apache.org/jira/browse/SPARK-26296 Project: Spark Issue Type: Wish Components: Spark Core Affects Versions: 2.4.0 Reporter: M. Le Bihan I am not using _Scala_ when I am programming _Spark_ but plain _Java_. I believe those using _PySpark_ do not either. But _Spark_ has been built over _Scala_ instead of plain _Java_ and it is a cause of trouble, especially at the time of upgrading the JDK. We are waiting to use JDK 11 and _Scala_ is still holding _Spark_ back on the previous version of the JDK. Having a _Spark_ without _Scala_, like it is possible to have a _Spark_ without _Hadoop_, would comfort me: a cause of issues would disappear.
[jira] [Comment Edited] (SPARK-24417) Build and Run Spark on JDK11
[ https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710091#comment-16710091 ] M. Le Bihan edited comment on SPARK-24417 at 12/6/18 1:15 PM: -- Hello, Unaware of the problem with JDK 11, I used it with _Spark 2.3.x_ without trouble for months, mostly calling _lookup()_ functions on RDDs. But when I attempted a _collect()_, I had a failure (an _IllegalArgumentException_). I upgraded to _Spark 2.4.0_ and a message from a class in _org.apache.xbean_ explained: "_Unsupported major.minor version 55._". Is this a problem coming from memory management or from the _Scala_ language? If _Spark 2.x_ ultimately cannot support _JDK 11_ and we have to wait for _Spark 3.0_, when is that version planned to be released? Regards, was (Author: mlebihan): Hello, Unaware if the problem with the JDK 11, I used it with _Spark 2.3.x_ without troubles for months, calling most of the times _lookup()_ functions on RDDs. But when I attempted a _collect()_, I had a failure (an _IllegalArgumentException_). I upgraded to _Spark 2.4.0_ and a message from a class in _org.apache.xbean_ explained: "_Unsupported minor major version 55._". Is it a trouble coming from memory management or from _Scala_ language ? If, eventually, _Spark 2.x_ cannot support _JDK 11_ and that we have to wait for _Spark 3.0,_ when this version is planned to be released ? Sorry if it's out of subject, but : Will this next major version still be built over _Scala_ (meaning that it has to wait that _Scala_ project can follow _Java_ JDK versions) or only over _Java_, with _Scala_ offered as an independant option ? Because it seems to me, who do not use _Scala_ for programming _Spark_ but plain _Java_ only, that _Scala_ is a cause of underlying troubles. Having a _Spark_ without _Scala_ like it is possible to have a _Spark_ without _Hadoop_ would confort me : a cause of issues would disappear. Regards, > Build and Run Spark on JDK11 > > > Key: SPARK-24417 > URL: https://issues.apache.org/jira/browse/SPARK-24417 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major > > This is an umbrella JIRA for Apache Spark to support JDK 11. > As JDK 8 is reaching EOL and JDK 9 and 10 are already end-of-life, per > community discussion we will skip JDK 9 and 10 and support JDK 11 directly.
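The "Unsupported major.minor version 55" message means that a class compiled for Java 11 (class-file major version 55) was loaded by an older JVM; Java 8 reads classes only up to major version 52. The major version lives in bytes 7-8 of every .class file, so it can be inspected directly. The snippet below is a self-contained illustration, not Spark or xbean code: it builds a stand-in 8-byte class-file header rather than using a real class.

```shell
# A .class file begins with the magic number 0xCAFEBABE, a 2-byte minor
# version, and a 2-byte major version, all big-endian.
# Major 52 = Java 8, 55 = Java 11.
# Build a stand-in header with major version 55 (octal escapes for portability):
printf '\312\376\272\276\000\000\000\067' > header.bin
# Extract bytes 7-8 and combine them into the major version number:
od -An -j 6 -N 2 -t u1 header.bin | awk '{print $1 * 256 + $2}'
```

Pointing the same `od` pipeline at any real .class file (instead of the stand-in header) reports which JDK the class was compiled for, which is a quick way to confirm a mismatch like the one described above.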
[jira] [Updated] (SPARK-26295) [K8S] serviceAccountName is not set in client mode
[ https://issues.apache.org/jira/browse/SPARK-26295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Tanase updated SPARK-26295: -- Description: When deploying spark apps in client mode (in my case from inside the driver pod), one can't specify the service account in accordance with the docs ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]). The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is most likely applied in cluster mode only, which would be consistent with spark.kubernetes.authenticate.driver being the cluster mode prefix. We should either inject the service account specified by this property in the client mode pods, or specify an equivalent config: spark.kubernetes.authenticate.serviceAccountName This is the exception: {noformat} Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "..." is forbidden: User "system:serviceaccount:mynamespace:default" cannot get pods in the namespace "mynamespace"{noformat} The expectation was to see the user `mynamespace:spark` based on my submit command. My current workaround is to create a clusterrolebinding with edit rights for the mynamespace:default account. was: When deploying spark apps in client mode (in my case from inside the driver pod), one can't specify the service account in accordance to the docs ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac).] The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is most likely added in cluster mode only, which would be consistent with spark.kubernetes.authenticate.driver being the cluster mode prefix. We should either inject the service account specified by this property in the client mode pods, or specify an equivalent config: spark.kubernetes.authenticate.serviceAccountName This is the exception: {noformat} Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "..." is forbidden: User "system:serviceaccount:mynamespace:default" cannot get pods in the namespace "mynamespace"{noformat} My current workaround is to create a clusterrolebinding with edit rights for the mynamespace:default account. > [K8S] serviceAccountName is not set in client mode > -- > > Key: SPARK-26295 > URL: https://issues.apache.org/jira/browse/SPARK-26295 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Adrian Tanase >Priority: Major > > When deploying spark apps in client mode (in my case from inside the driver > pod), one can't specify the service account in accordance with the docs > ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]). > The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is > most likely applied in cluster mode only, which would be consistent with > spark.kubernetes.authenticate.driver being the cluster mode prefix. > We should either inject the service account specified by this property in the > client mode pods, or specify an equivalent config: > spark.kubernetes.authenticate.serviceAccountName > This is the exception: > {noformat} > Message: Forbidden!Configured service account doesn't have access. Service > account may have been revoked. pods "..." is forbidden: User > "system:serviceaccount:mynamespace:default" cannot get pods in the namespace > "mynamespace"{noformat} > The expectation was to see the user `mynamespace:spark` based on my submit > command. > My current workaround is to create a clusterrolebinding with edit rights for > the mynamespace:default account.
[jira] [Updated] (SPARK-26295) [K8S] serviceAccountName is not set in client mode
[ https://issues.apache.org/jira/browse/SPARK-26295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Tanase updated SPARK-26295: -- Description: When deploying spark apps in client mode (in my case from inside the driver pod), one can't specify the service account in accordance with the docs ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]). The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is most likely applied in cluster mode only, which would be consistent with spark.kubernetes.authenticate.driver being the cluster mode prefix. We should either inject the service account specified by this property in the client mode pods, or specify an equivalent config: spark.kubernetes.authenticate.serviceAccountName This is the exception: {noformat} Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "..." is forbidden: User "system:serviceaccount:mynamespace:default" cannot get pods in the namespace "mynamespace"{noformat} My current workaround is to create a clusterrolebinding with edit rights for the mynamespace:default account. was: When deploying spark apps in client mode (in my case from inside the driver pod), one can't specify the service account in accordance to the docs ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac).] The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is most likely added in cluster mode only, which would be consistent with spark.kubernetes.authenticate.driver being the cluster mode prefix. We should either inject the service account specified by this property in the client mode pods, or specify an equivalent config: spark.kubernetes.authenticate.serviceAccountName This is the exception: {{Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "..." is forbidden: User "system:serviceaccount:mynamespace:default" cannot get pods in the namespace "mynamespace"}} My current workaround is to create a clusterrolebinding with edit rights for the mynamespace:default account. > [K8S] serviceAccountName is not set in client mode > -- > > Key: SPARK-26295 > URL: https://issues.apache.org/jira/browse/SPARK-26295 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Adrian Tanase >Priority: Major > > When deploying spark apps in client mode (in my case from inside the driver > pod), one can't specify the service account in accordance with the docs > ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]). > The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is > most likely applied in cluster mode only, which would be consistent with > spark.kubernetes.authenticate.driver being the cluster mode prefix. > We should either inject the service account specified by this property in the > client mode pods, or specify an equivalent config: > spark.kubernetes.authenticate.serviceAccountName > This is the exception: > {noformat} > Message: Forbidden!Configured service account doesn't have access. Service > account may have been revoked. pods "..." is forbidden: User > "system:serviceaccount:mynamespace:default" cannot get pods in the namespace > "mynamespace"{noformat} > My current workaround is to create a clusterrolebinding with edit rights for > the mynamespace:default account.
[jira] [Created] (SPARK-26295) [K8S] serviceAccountName is not set in client mode
Adrian Tanase created SPARK-26295: - Summary: [K8S] serviceAccountName is not set in client mode Key: SPARK-26295 URL: https://issues.apache.org/jira/browse/SPARK-26295 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.4.0 Reporter: Adrian Tanase When deploying spark apps in client mode (in my case from inside the driver pod), one can't specify the service account in accordance with the docs ([https://spark.apache.org/docs/latest/running-on-kubernetes.html#rbac]). The property {{spark.kubernetes.authenticate.driver.serviceAccountName}} is most likely applied in cluster mode only, which would be consistent with spark.kubernetes.authenticate.driver being the cluster mode prefix. We should either inject the service account specified by this property in the client mode pods, or specify an equivalent config: spark.kubernetes.authenticate.serviceAccountName This is the exception: {{Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "..." is forbidden: User "system:serviceaccount:mynamespace:default" cannot get pods in the namespace "mynamespace"}} My current workaround is to create a clusterrolebinding with edit rights for the mynamespace:default account.
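To make the two modes concrete, here is a minimal sketch. The namespace `mynamespace` comes from the report; the `spark` service account, API server address, and application jar path are illustrative assumptions, not values taken from the submit command in the report.

```shell
# Cluster mode: the documented property selects the driver pod's service account.
# (API server address, 'spark' account, and jar path are hypothetical.)
spark-submit \
  --master k8s://https://kubernetes.example.com:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.namespace=mynamespace \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/app/my-app.jar

# Client-mode workaround described in the report: grant edit rights to the
# pod's default service account, since the property above is not honored there.
kubectl create clusterrolebinding default-edit \
  --clusterrole=edit \
  --serviceaccount=mynamespace:default
```

The clusterrolebinding is a blunt instrument (it grants `edit` cluster-wide to the default account); a namespaced rolebinding per the RBAC docs would be a tighter variant of the same workaround.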
[jira] [Commented] (SPARK-26293) Cast exception when having python udf in subquery
[ https://issues.apache.org/jira/browse/SPARK-26293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711360#comment-16711360 ] Apache Spark commented on SPARK-26293: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23248 > Cast exception when having python udf in subquery > - > > Key: SPARK-26293 > URL: https://issues.apache.org/jira/browse/SPARK-26293 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Assigned] (SPARK-26293) Cast exception when having python udf in subquery
[ https://issues.apache.org/jira/browse/SPARK-26293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26293: Assignee: Apache Spark (was: Wenchen Fan) > Cast exception when having python udf in subquery > - > > Key: SPARK-26293 > URL: https://issues.apache.org/jira/browse/SPARK-26293 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-26293) Cast exception when having python udf in subquery
[ https://issues.apache.org/jira/browse/SPARK-26293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26293: Assignee: Wenchen Fan (was: Apache Spark) > Cast exception when having python udf in subquery > - > > Key: SPARK-26293 > URL: https://issues.apache.org/jira/browse/SPARK-26293 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Commented] (SPARK-26294) Delete Unnecessary If statement
[ https://issues.apache.org/jira/browse/SPARK-26294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711341#comment-16711341 ] Apache Spark commented on SPARK-26294: -- User 'wangjiaochun' has created a pull request for this issue: https://github.com/apache/spark/pull/23247 > Delete Unnecessary If statement > --- > > Key: SPARK-26294 > URL: https://issues.apache.org/jira/browse/SPARK-26294 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wangjiaochun >Priority: Minor > > Delete an unnecessary if statement: the branch can never execute when > records is less than or equal to zero; it only executes when records is > greater than zero. >
[jira] [Assigned] (SPARK-26294) Delete Unnecessary If statement
[ https://issues.apache.org/jira/browse/SPARK-26294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26294: Assignee: Apache Spark > Delete Unnecessary If statement > --- > > Key: SPARK-26294 > URL: https://issues.apache.org/jira/browse/SPARK-26294 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wangjiaochun >Assignee: Apache Spark >Priority: Minor > > Delete an unnecessary if statement: the branch can never execute when > records is less than or equal to zero; it only executes when records is > greater than zero. >
[jira] [Assigned] (SPARK-26294) Delete Unnecessary If statement
[ https://issues.apache.org/jira/browse/SPARK-26294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26294: Assignee: (was: Apache Spark) > Delete Unnecessary If statement > --- > > Key: SPARK-26294 > URL: https://issues.apache.org/jira/browse/SPARK-26294 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wangjiaochun >Priority: Minor > > Delete an unnecessary if statement: the branch can never execute when > records is less than or equal to zero; it only executes when records is > greater than zero. >
[jira] [Created] (SPARK-26294) Delete Unnecessary If statement
wangjiaochun created SPARK-26294: Summary: Delete Unnecessary If statement Key: SPARK-26294 URL: https://issues.apache.org/jira/browse/SPARK-26294 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: wangjiaochun Delete an unnecessary if statement: the branch can never execute when records is less than or equal to zero; it only executes when records is greater than zero.
[jira] [Commented] (SPARK-26292) Assert statement of currentPage may be not in right place
[ https://issues.apache.org/jira/browse/SPARK-26292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711309#comment-16711309 ] Apache Spark commented on SPARK-26292: -- User 'wangjiaochun' has created a pull request for this issue: https://github.com/apache/spark/pull/23246 > Assert statement of currentPage may be not in right place > -- > > Key: SPARK-26292 > URL: https://issues.apache.org/jira/browse/SPARK-26292 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wangjiaochun >Priority: Minor > Fix For: 2.4.0 > > > The assert statement on currentPage is not in the right place; it should be > moved to just after the allocatePage call in the acquireNewPageIfNecessary > function.
[jira] [Assigned] (SPARK-26292) Assert statement of currentPage may be not in right place
[ https://issues.apache.org/jira/browse/SPARK-26292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26292: Assignee: (was: Apache Spark) > Assert statement of currentPage may be not in right place > -- > > Key: SPARK-26292 > URL: https://issues.apache.org/jira/browse/SPARK-26292 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wangjiaochun >Priority: Minor > Fix For: 2.4.0 > > > The assert statement on currentPage is not in the right place; it should be > moved to just after the allocatePage call in the acquireNewPageIfNecessary > function.
[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711319#comment-16711319 ] Vikram Agrawal commented on SPARK-18165: Hi [~danielil] - right now it is available at https://github.com/qubole/kinesis-sql > Kinesis support in Structured Streaming > --- > > Key: SPARK-18165 > URL: https://issues.apache.org/jira/browse/SPARK-18165 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Lauren Moos >Priority: Major > > Implement Kinesis based sources and sinks for Structured Streaming
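The connector referenced above is distributed from that GitHub repository rather than shipped with Spark itself. A hedged sketch of trying it out, assuming the repository uses a standard Maven build (the build tool and output layout are assumptions, not stated in the comment; check the repository README for the actual steps):

```shell
# Fetch and build the Qubole Structured Streaming Kinesis connector
# (Maven layout assumed).
git clone https://github.com/qubole/kinesis-sql.git
cd kinesis-sql
mvn package -DskipTests
# Then put the resulting jar on the application classpath, e.g. via
# spark-submit --jars target/<connector-jar>.jar
```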
[jira] [Commented] (SPARK-26292) Assert statement of currentPage may be not in right place
[ https://issues.apache.org/jira/browse/SPARK-26292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711317#comment-16711317 ] Apache Spark commented on SPARK-26292: -- User 'wangjiaochun' has created a pull request for this issue: https://github.com/apache/spark/pull/23246 > Assert statement of currentPage may be not in right place > -- > > Key: SPARK-26292 > URL: https://issues.apache.org/jira/browse/SPARK-26292 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wangjiaochun >Priority: Minor > Fix For: 2.4.0 > > > The assert statement on currentPage is not in the right place; it should be > moved to just after the allocatePage call in the acquireNewPageIfNecessary > function.
[jira] [Commented] (SPARK-18147) Broken Spark SQL Codegen
[ https://issues.apache.org/jira/browse/SPARK-18147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711316#comment-16711316 ] Eli commented on SPARK-18147: - [~mgaido] it works well on 2.3.2. It would be a good idea to upgrade to this stable version. Thanks > Broken Spark SQL Codegen > > > Key: SPARK-18147 > URL: https://issues.apache.org/jira/browse/SPARK-18147 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: koert kuipers >Assignee: Liang-Chi Hsieh >Priority: Critical > Fix For: 2.1.0 > > > this is me on purpose trying to break spark sql codegen to uncover potential > issues, by creating arbitrarily complex data structures using primitives, > strings, basic collections (map, seq, option), tuples, and case classes. > first example: nested case classes > code: > {noformat} > class ComplexResultAgg[B: TypeTag, C: TypeTag](val zero: B, result: C) > extends Aggregator[Row, B, C] { > override def reduce(b: B, input: Row): B = b > override def merge(b1: B, b2: B): B = b1 > override def finish(reduction: B): C = result > override def bufferEncoder: Encoder[B] = ExpressionEncoder[B]() > override def outputEncoder: Encoder[C] = ExpressionEncoder[C]() > } > case class Struct2(d: Double = 0.0, s1: Seq[Double] = Seq.empty, s2: > Seq[Long] = Seq.empty) > case class Struct3(a: Struct2 = Struct2(), b: Struct2 = Struct2()) > val df1 = Seq(("a", "aa"), ("a", "aa"), ("b", "b"), ("b", null)).toDF("x", > "y").groupBy("x").agg( > new ComplexResultAgg("boo", Struct3()).toColumn > ) > df1.printSchema > df1.show > {noformat} > the result is: > {noformat} > [info] Cause: java.util.concurrent.ExecutionException: java.lang.Exception: > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 33, Column 12: Expression "isNull1" is not an rvalue > [info] /* 001 */ public java.lang.Object generate(Object[] references) { > [info] /* 002 */ return new SpecificMutableProjection(references); >
[info] /* 003 */ } > [info] /* 004 */ > [info] /* 005 */ class SpecificMutableProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection { > [info] /* 006 */ > [info] /* 007 */ private Object[] references; > [info] /* 008 */ private MutableRow mutableRow; > [info] /* 009 */ private Object[] values; > [info] /* 010 */ private java.lang.String errMsg; > [info] /* 011 */ private Object[] values1; > [info] /* 012 */ private java.lang.String errMsg1; > [info] /* 013 */ private boolean[] argIsNulls; > [info] /* 014 */ private scala.collection.Seq argValue; > [info] /* 015 */ private java.lang.String errMsg2; > [info] /* 016 */ private boolean[] argIsNulls1; > [info] /* 017 */ private scala.collection.Seq argValue1; > [info] /* 018 */ private java.lang.String errMsg3; > [info] /* 019 */ private java.lang.String errMsg4; > [info] /* 020 */ private Object[] values2; > [info] /* 021 */ private java.lang.String errMsg5; > [info] /* 022 */ private boolean[] argIsNulls2; > [info] /* 023 */ private scala.collection.Seq argValue2; > [info] /* 024 */ private java.lang.String errMsg6; > [info] /* 025 */ private boolean[] argIsNulls3; > [info] /* 026 */ private scala.collection.Seq argValue3; > [info] /* 027 */ private java.lang.String errMsg7; > [info] /* 028 */ private boolean isNull_0; > [info] /* 029 */ private InternalRow value_0; > [info] /* 030 */ > [info] /* 031 */ private void apply_1(InternalRow i) { > [info] /* 032 */ > [info] /* 033 */ if (isNull1) { > [info] /* 034 */ throw new RuntimeException(errMsg3); > [info] /* 035 */ } > [info] /* 036 */ > [info] /* 037 */ boolean isNull24 = false; > [info] /* 038 */ final com.tresata.spark.sql.Struct2 value24 = isNull24 ? > null : (com.tresata.spark.sql.Struct2) value1.a(); > [info] /* 039 */ isNull24 = value24 == null; > [info] /* 040 */ > [info] /* 041 */ boolean isNull23 = isNull24; > [info] /* 042 */ final scala.collection.Seq value23 = isNull23 ? 
null : > (scala.collection.Seq) value24.s2(); > [info] /* 043 */ isNull23 = value23 == null; > [info] /* 044 */ argIsNulls1[0] = isNull23; > [info] /* 045 */ argValue1 = value23; > [info] /* 046 */ > [info] /* 047 */ > [info] /* 048 */ > [info] /* 049 */ boolean isNull22 = false; > [info] /* 050 */ for (int idx = 0; idx < 1; idx++) { > [info] /* 051 */ if (argIsNulls1[idx]) { isNull22 = true; break; } > [info] /* 052 */ } > [info] /* 053 */ > [info] /* 054 */ final ArrayData value22 = isNull22 ? null : new > org.apache.spark.sql.catalyst.util.GenericArrayData(argValue1); > [info] /* 055 */ if (isNull22) { > [info] /* 056 */ values1[2] = null; > [info] /* 057 */ } else { > [info] /* 058 */
[jira] [Assigned] (SPARK-26292) Assert statement of currentPage may be not in right place
[ https://issues.apache.org/jira/browse/SPARK-26292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26292: Assignee: Apache Spark > Assert statement of currentPage may be not in right place > -- > > Key: SPARK-26292 > URL: https://issues.apache.org/jira/browse/SPARK-26292 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wangjiaochun >Assignee: Apache Spark >Priority: Minor > Fix For: 2.4.0 > > > The assert statement on currentPage is not in the right place; it should be > moved to just after the allocatePage call in the acquireNewPageIfNecessary > function.
[jira] [Created] (SPARK-26293) Cast exception when having python udf in subquery
Wenchen Fan created SPARK-26293: --- Summary: Cast exception when having python udf in subquery Key: SPARK-26293 URL: https://issues.apache.org/jira/browse/SPARK-26293 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Created] (SPARK-26292) Assert statement of currentPage may be not in right place
wangjiaochun created SPARK-26292: Summary: Assert statement of currentPage may be not in right place Key: SPARK-26292 URL: https://issues.apache.org/jira/browse/SPARK-26292 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: wangjiaochun Fix For: 2.4.0 The assert statement on currentPage is not in the right place; it should be moved to just after the allocatePage call in the acquireNewPageIfNecessary function.