Re: Spark can't identify the event time column being supplied to withWatermark()
Glad that it worked out! It's unfortunate that such pitfalls exist, and there is no easy way to get around them. If you can, let us know how your experience with mapGroupsWithState has been. TD On Fri, Jun 8, 2018 at 1:49 PM, frankdede wrote: > You are exactly right! A few hours ago, I tried many things and finally got > the example working by defining the event timestamp column before groupByKey, > just like what you suggested, but I wasn't able to figure out the reasoning > behind my fix. > > val sessionUpdates = events > .withWatermark("timestamp", "10 seconds") > .groupByKey(event => event.sessionId) > .mapGroupsWithState[SessionInfo, > SessionUpdate](GroupStateTimeout.EventTimeTimeout()) > > It turns out that it's just impossible for the planner to figure out the > source of the watermark column after flatMap is applied. > > Thanks Tathagata! > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >
Re: Reset the offsets, Kafka 0.10 and Spark
Structured Streaming really makes this easy. You can simply specify an option for whether to start the query from the earliest or latest offsets. Check out - https://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming - https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html On Thu, Jun 7, 2018 at 1:27 PM, Guillermo Ortiz Fernández < guillermo.ortiz.f...@gmail.com> wrote: > I'm consuming data from Kafka with createDirectStream and store the > offsets in Kafka (https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html#kafka-itself) > > val stream = KafkaUtils.createDirectStream[String, String]( > streamingContext, > PreferConsistent, > Subscribe[String, String](topics, kafkaParams)) > > > > My Spark version is 2.0.2, and Kafka is 0.10. This solution works well, > and when I restart, the Spark process starts from the last offset that > Spark consumed, but sometimes I need to reprocess the whole topic from the > beginning. > > I have seen that I could reset the offsets with a Kafka script, but it's not > enabled in Kafka 0.10... > > kafka-consumer-groups --bootstrap-server kafka-host:9092 --group > my-group --reset-offsets --to-earliest --all-topics --execute > > > Another possibility is to pass another Kafka parameter to > createDirectStream with a map of offsets, but how could I get the first > offset of each partition? I have checked the API of the new consumer > and I don't see any method to get these offsets. > > Any other way? I could start with another groupId as well, but it doesn't > seem a very clean option for production. >
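For reference, a minimal sketch of what that looks like with the Structured Streaming Kafka source (option names are from the Kafka integration guide linked above; the broker address and topic are placeholders):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-host:9092")  // placeholder broker list
      .option("subscribe", "my-topic")                        // placeholder topic
      .option("startingOffsets", "earliest")                  // or "latest"
      .load()

Note that startingOffsets only applies when a query starts without a checkpoint; a query restarted from an existing checkpoint resumes from the checkpointed offsets, so reprocessing a topic from the beginning means starting the query with a fresh checkpoint location.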
Re: Spark can't identify the event time column being supplied to withWatermark()
You are exactly right! A few hours ago, I tried many things and finally got the example working by defining the event timestamp column before groupByKey, just like what you suggested, but I wasn't able to figure out the reasoning behind my fix. val sessionUpdates = events .withWatermark("timestamp", "10 seconds") .groupByKey(event => event.sessionId) .mapGroupsWithState[SessionInfo, SessionUpdate](GroupStateTimeout.EventTimeTimeout()) It turns out that it's just impossible for the planner to figure out the source of the watermark column after flatMap is applied. Thanks Tathagata! -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Spark can't identify the event time column being supplied to withWatermark()
Try to define the watermark on the right column immediately before calling `groupByKey(...).mapGroupsWithState(...)`. You are applying the watermark and then doing a bunch of opaque transformations (a user-defined flatMap that the planner has no visibility into). This prevents the planner from propagating the watermark tag through such operations. Specifically, you are applying a flatMap that takes a timestamp and splits it into multiple records with timestamp columns. The SQL analyzer/planner cannot possibly reason from the opaque user-defined code whether the generated timestamp is the same as or different from the input timestamp column, hence it cannot propagate the watermark information down to the mapGroupsWithState. Hope this helps. On Fri, Jun 8, 2018 at 7:50 AM, frankdede wrote: > I was trying to find a way to resessionize features in different events > based on the event timestamps using Spark, and I found a code example that uses > mapGroupsWithState to resessionize events using processing timestamps in > their repo. > > https://github.com/apache/spark/blob/v2.3.0/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala > > To quickly test if this sessionization works with event timestamps, I > added withWatermark("timestamp", "10 seconds") (treating processing time as > the event timestamp) and changed ProcessingTimeTimeout to EventTimeTimeout. > > val lines = spark.readStream > .format("socket") > .option("host", host) > .option("port", port) > .option("includeTimestamp", value = true) > .load() > > // Split the lines into words, treat words as sessionId of events > val events = lines > .withWatermark("timestamp", "10 seconds") // added > .as[(String, Timestamp)] > .flatMap { case (line, timestamp) => > line.split(" ").map(word => Event(sessionId = word, timestamp)) > } > > val sessionUpdates = events > .groupByKey(event => event.sessionId) > .mapGroupsWithState[SessionInfo, > SessionUpdate](GroupStateTimeout.EventTimeTimeout) { >... > } > > // Start running the query that prints the session updates to the console > val query = sessionUpdates > .writeStream > .outputMode("update") > .format("console") > .start() > > query.awaitTermination() > However, when I ran it, Spark threw org.apache.spark.sql.AnalysisException > and said that Watermark must be specified in the query using > '[Dataset/DataFrame].withWatermark()' for using event-time timeout in a > [map|flatMap]GroupsWithState. Event-time timeout not supported without > watermark, which is not true and is confusing, because that 'timestamp' column > is clearly in the physical plan following that exception message: > > ... > +- EventTimeWatermark timestamp#3: timestamp, interval 10 seconds >+- StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@394a6d2b,socket,List(),..., > [value#2, timestamp#3] > Did I miss something or do something wrong? > > Thanks in advance! > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >
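To make that concrete, a minimal sketch of the reordered pipeline (it assumes the same Event, SessionInfo and SessionUpdate types as the quoted example; the state-update function is elided):

    // Define the watermark on the Dataset[Event] produced by the flatMap, immediately
    // before groupByKey, so the planner can tie it to the event-time column being used.
    val sessionUpdates = events
      .withWatermark("timestamp", "10 seconds")
      .groupByKey(event => event.sessionId)
      .mapGroupsWithState[SessionInfo, SessionUpdate](GroupStateTimeout.EventTimeTimeout()) {
        (sessionId, eventsInSession, state) => ...  // state-update logic elided
      }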
Re: Spark 2.3 driver pod stuck in Running state — Kubernetes
Yes, it looks like it is because there aren't enough resources to run the executor pods. Have you seen pending executor pods? On Fri, Jun 8, 2018, 11:49 AM Thodoris Zois wrote: > As far as I know from Mesos with Spark, it is a running state and not a > pending one. What you see is normal, but if I am wrong somebody correct me. > > The Spark driver starts up normally (running state), but when it comes > to starting the executors it cannot allocate resources for them and > hangs. > > - Thodoris > > On 8 Jun 2018, at 21:24, purna pradeep wrote: > > Hello, > > When I run spark-submit on the k8s cluster, I’m > > seeing the driver pod stuck in Running state, and when I pulled the driver pod logs > I was able to see the log below. > > I do understand that this warning might be because of a lack of CPU/memory, > but I expect the driver pod to be in “Pending” state rather than “Running” > state, since it’s not actually running. > > So I had to kill the driver pod and resubmit the job. > > Please suggest here! > > 2018-06-08 14:38:01 WARN TaskSchedulerImpl:66 - Initial job has not > accepted any resources; check your cluster UI to ensure that workers are > registered and have sufficient resources > > 2018-06-08 14:38:16 WARN TaskSchedulerImpl:66 - Initial job has not > accepted any resources; check your cluster UI to ensure that workers are > registered and have sufficient resources > > 2018-06-08 14:38:31 WARN TaskSchedulerImpl:66 - Initial job has not > accepted any resources; check your cluster UI to ensure that workers are > registered and have sufficient resources > > 2018-06-08 14:38:46 WARN TaskSchedulerImpl:66 - Initial job has not > accepted any resources; check your cluster UI to ensure that workers are > registered and have sufficient resources > > 2018-06-08 14:39:01 WARN TaskSchedulerImpl:66 - Initial job has not > accepted any resources; check your cluster UI to ensure that workers are > registered and have sufficient resources > >
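One hedged way to check for that (it assumes the default namespace and the spark-role label that the Spark 2.3 Kubernetes scheduler attaches to the pods it creates):

    kubectl get pods -l spark-role=executor     # were any executor pods created, and are they Pending?
    kubectl describe pod <executor-pod-name>    # the Events section shows reasons such as "Insufficient cpu"
    kubectl describe nodes                      # compare allocatable CPU/memory with what the job requests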
Re: Spark 2.3 driver pod stuck in Running state — Kubernetes
As far as I know from Mesos with Spark, it is a running state and not a pending one. What you see is normal, but if I am wrong somebody correct me. The Spark driver starts up normally (running state), but when it comes to starting the executors it cannot allocate resources for them and hangs. - Thodoris > On 8 Jun 2018, at 21:24, purna pradeep wrote: > > Hello, > When I run spark-submit on the k8s cluster, I’m > > seeing the driver pod stuck in Running state, and when I pulled the driver pod logs > I was able to see the log below. > > I do understand that this warning might be because of a lack of CPU/memory, > but I expect the driver pod to be in “Pending” state rather than “Running” state, > since it’s not actually running. > > So I had to kill the driver pod and resubmit the job. > > Please suggest here! > > 2018-06-08 14:38:01 WARN TaskSchedulerImpl:66 - Initial job has not accepted > any resources; check your cluster UI to ensure that workers are registered > and have sufficient resources > > 2018-06-08 14:38:16 WARN TaskSchedulerImpl:66 - Initial job has not accepted > any resources; check your cluster UI to ensure that workers are registered > and have sufficient resources > > 2018-06-08 14:38:31 WARN TaskSchedulerImpl:66 - Initial job has not accepted > any resources; check your cluster UI to ensure that workers are registered > and have sufficient resources > > 2018-06-08 14:38:46 WARN TaskSchedulerImpl:66 - Initial job has not accepted > any resources; check your cluster UI to ensure that workers are registered > and have sufficient resources > > 2018-06-08 14:39:01 WARN TaskSchedulerImpl:66 - Initial job has not accepted > any resources; check your cluster UI to ensure that workers are registered > and have sufficient resources
Spark 2.3 driver pod stuck in Running state — Kubernetes
Hello, When I run spark-submit on the k8s cluster, I’m seeing the driver pod stuck in Running state, and when I pulled the driver pod logs I was able to see the log below. I do understand that this warning might be because of a lack of CPU/memory, but I expect the driver pod to be in “Pending” state rather than “Running” state, since it’s not actually running. So I had to kill the driver pod and resubmit the job. Please suggest here! 2018-06-08 14:38:01 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources 2018-06-08 14:38:16 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources 2018-06-08 14:38:31 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources 2018-06-08 14:38:46 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources 2018-06-08 14:39:01 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Spark can't identify the event time column being supplied to withWatermark()
I was trying to find a way to resessionize features in different events based on the event timestamps using Spark, and I found a code example that uses mapGroupsWithState to resessionize events using processing timestamps in their repo. https://github.com/apache/spark/blob/v2.3.0/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala To quickly test if this sessionization works with event timestamps, I added withWatermark("timestamp", "10 seconds") (treating processing time as the event timestamp) and changed ProcessingTimeTimeout to EventTimeTimeout. val lines = spark.readStream .format("socket") .option("host", host) .option("port", port) .option("includeTimestamp", value = true) .load() // Split the lines into words, treat words as sessionId of events val events = lines .withWatermark("timestamp", "10 seconds") // added .as[(String, Timestamp)] .flatMap { case (line, timestamp) => line.split(" ").map(word => Event(sessionId = word, timestamp)) } val sessionUpdates = events .groupByKey(event => event.sessionId) .mapGroupsWithState[SessionInfo, SessionUpdate](GroupStateTimeout.EventTimeTimeout) { ... } // Start running the query that prints the session updates to the console val query = sessionUpdates .writeStream .outputMode("update") .format("console") .start() query.awaitTermination() However, when I ran it, Spark threw org.apache.spark.sql.AnalysisException and said that Watermark must be specified in the query using '[Dataset/DataFrame].withWatermark()' for using event-time timeout in a [map|flatMap]GroupsWithState. Event-time timeout not supported without watermark, which is not true and is confusing, because that 'timestamp' column is clearly in the physical plan following that exception message: ... +- EventTimeWatermark timestamp#3: timestamp, interval 10 seconds +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@394a6d2b,socket,List(),..., [value#2, timestamp#3] Did I miss something or do something wrong? Thanks in advance! -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Change in configuration settings?
I recently upgraded a Structured Streaming application from Spark 2.2.1 to Spark 2.3.0. This application runs in yarn-cluster mode, and it made use of the spark.yarn.{driver|executor}.memoryOverhead properties. I noticed the job started crashing unexpectedly, and after doing a bunch of digging, it seems that these properties were migrated to simply be "spark.driver.memoryOverhead" and "spark.executor.memoryOverhead" - I see that they existed in the 2.2.1 configuration documentation, but not the 2.3.0 docs. However, I can't find anything in the release notes between versions that references this change - should the old spark.yarn.* settings still work, or were they completely removed in favor of the new settings? Regards, Will
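For anyone hitting the same thing, a hedged example of how the renamed properties are passed on 2.3.0 (the values are placeholders, interpreted as MiB when unitless; the pre-2.3 names are shown commented out for comparison):

    spark-submit \
      --master yarn --deploy-mode cluster \
      --conf spark.driver.memoryOverhead=1024 \
      --conf spark.executor.memoryOverhead=2048 \
      ...
    # pre-2.3 equivalents:
    # --conf spark.yarn.driver.memoryOverhead=1024
    # --conf spark.yarn.executor.memoryOverhead=2048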
Re: [SparkLauncher] stateChanged event not received in standalone cluster mode
Thanks. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Spark YARN job submission error (code 13)
Fixed by adding 2 configurations in yarn-site,xml. Thanks all! On Fri, Jun 8, 2018 at 2:44 PM, Aakash Basu wrote: > Hi, > > I fixed that problem by putting all the Spark JARS in spark-archive.zip > and putting it in the HDFS (as that problem was happening for that reason) - > > But, I'm facing a new issue now, this is the new RPC error I get > (Stack-Trace below) - > > > > > 2018-06-08 14:26:43 WARN NativeCodeLoader:62 - Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable > 2018-06-08 14:26:45 INFO SparkContext:54 - Running Spark version 2.3.0 > 2018-06-08 14:26:45 INFO SparkContext:54 - Submitted application: > EndToEnd_FeatureEngineeringPipeline > 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing view acls to: > bblite > 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing modify acls to: > bblite > 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing view acls groups > to: > 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing modify acls groups > to: > 2018-06-08 14:26:45 INFO SecurityManager:54 - SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(bblite); groups with view permissions: Set(); users with modify > permissions: Set(bblite); groups with modify permissions: Set() > 2018-06-08 14:26:45 INFO Utils:54 - Successfully started service > 'sparkDriver' on port 41957. > 2018-06-08 14:26:45 INFO SparkEnv:54 - Registering MapOutputTracker > 2018-06-08 14:26:45 INFO SparkEnv:54 - Registering BlockManagerMaster > 2018-06-08 14:26:45 INFO BlockManagerMasterEndpoint:54 - Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 2018-06-08 14:26:45 INFO BlockManagerMasterEndpoint:54 - > BlockManagerMasterEndpoint up > 2018-06-08 14:26:45 INFO DiskBlockManager:54 - Created local directory at > /appdata/spark/tmp/blockmgr-7b035871-a1f7-47ff-aad8-f7a43367836e > 2018-06-08 14:26:45 INFO MemoryStore:54 - MemoryStore started with > capacity 366.3 MB > 2018-06-08 14:26:45 INFO SparkEnv:54 - Registering OutputCommitCoordinator > 2018-06-08 14:26:45 INFO log:192 - Logging initialized @3659ms > 2018-06-08 14:26:45 INFO Server:346 - jetty-9.3.z-SNAPSHOT > 2018-06-08 14:26:45 INFO Server:414 - Started @3733ms > 2018-06-08 14:26:45 INFO AbstractConnector:278 - Started > ServerConnector@3080efb7{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} > 2018-06-08 14:26:45 INFO Utils:54 - Successfully started service > 'SparkUI' on port 4040. 
> 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@2c3409b5{/jobs,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@7f1ba569{/jobs/json,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@493631a1{/jobs/job,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@6b12f33c{/jobs/job/json,null, > AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@490023da{/stages,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@31c3a862{/stages/json,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@4da2454f{/stages/stage,null, > AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@552f182d{/stages/stage/json, > null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@a78a7fa{/stages/pool,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@15142105{/stages/pool/json, > null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@7589c977{/storage,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@584a599b{/storage/json,null, > AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@1742621f{/storage/rdd,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@23ea75fb{/storage/rdd/json, > null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@1813d280{/environment,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@129fc698{/environment/json, > null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@16c91c4e{/executors,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@667ce6c1{/executors/json,null, > AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO
Re: Spark YARN Error - triggering spark-shell
Fixed by adding 2 configurations in yarn-site,xml. Thanks all! On Fri, Jun 8, 2018 at 2:44 PM, Aakash Basu wrote: > Hi, > > I fixed that problem by putting all the Spark JARS in spark-archive.zip > and putting it in the HDFS (as that problem was happening for that reason) - > > But, I'm facing a new issue now, this is the new RPC error I get > (Stack-Trace below) - > > > > > 2018-06-08 14:26:43 WARN NativeCodeLoader:62 - Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable > 2018-06-08 14:26:45 INFO SparkContext:54 - Running Spark version 2.3.0 > 2018-06-08 14:26:45 INFO SparkContext:54 - Submitted application: > EndToEnd_FeatureEngineeringPipeline > 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing view acls to: > bblite > 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing modify acls to: > bblite > 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing view acls groups > to: > 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing modify acls groups > to: > 2018-06-08 14:26:45 INFO SecurityManager:54 - SecurityManager: > authentication disabled; ui acls disabled; users with view permissions: > Set(bblite); groups with view permissions: Set(); users with modify > permissions: Set(bblite); groups with modify permissions: Set() > 2018-06-08 14:26:45 INFO Utils:54 - Successfully started service > 'sparkDriver' on port 41957. > 2018-06-08 14:26:45 INFO SparkEnv:54 - Registering MapOutputTracker > 2018-06-08 14:26:45 INFO SparkEnv:54 - Registering BlockManagerMaster > 2018-06-08 14:26:45 INFO BlockManagerMasterEndpoint:54 - Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 2018-06-08 14:26:45 INFO BlockManagerMasterEndpoint:54 - > BlockManagerMasterEndpoint up > 2018-06-08 14:26:45 INFO DiskBlockManager:54 - Created local directory at > /appdata/spark/tmp/blockmgr-7b035871-a1f7-47ff-aad8-f7a43367836e > 2018-06-08 14:26:45 INFO MemoryStore:54 - MemoryStore started with > capacity 366.3 MB > 2018-06-08 14:26:45 INFO SparkEnv:54 - Registering OutputCommitCoordinator > 2018-06-08 14:26:45 INFO log:192 - Logging initialized @3659ms > 2018-06-08 14:26:45 INFO Server:346 - jetty-9.3.z-SNAPSHOT > 2018-06-08 14:26:45 INFO Server:414 - Started @3733ms > 2018-06-08 14:26:45 INFO AbstractConnector:278 - Started > ServerConnector@3080efb7{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} > 2018-06-08 14:26:45 INFO Utils:54 - Successfully started service > 'SparkUI' on port 4040. 
> 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@2c3409b5{/jobs,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@7f1ba569{/jobs/json,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@493631a1{/jobs/job,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@6b12f33c{/jobs/job/json,null, > AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@490023da{/stages,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@31c3a862{/stages/json,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@4da2454f{/stages/stage,null, > AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@552f182d{/stages/stage/json, > null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@a78a7fa{/stages/pool,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@15142105{/stages/pool/json, > null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@7589c977{/storage,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@584a599b{/storage/json,null, > AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@1742621f{/storage/rdd,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@23ea75fb{/storage/rdd/json, > null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@1813d280{/environment,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@129fc698{/environment/json, > null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@16c91c4e{/executors,null,AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO ContextHandler:781 - Started > o.s.j.s.ServletContextHandler@667ce6c1{/executors/json,null, > AVAILABLE,@Spark} > 2018-06-08 14:26:45 INFO
Re: Spark YARN Error - triggering spark-shell
Hi, I fixed that problem by putting all the Spark JARS in spark-archive.zip and putting it in the HDFS (as that problem was happening for that reason) - But, I'm facing a new issue now, this is the new RPC error I get (Stack-Trace below) - 2018-06-08 14:26:43 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2018-06-08 14:26:45 INFO SparkContext:54 - Running Spark version 2.3.0 2018-06-08 14:26:45 INFO SparkContext:54 - Submitted application: EndToEnd_FeatureEngineeringPipeline 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing view acls to: bblite 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing modify acls to: bblite 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing view acls groups to: 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing modify acls groups to: 2018-06-08 14:26:45 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(bblite); groups with view permissions: Set(); users with modify permissions: Set(bblite); groups with modify permissions: Set() 2018-06-08 14:26:45 INFO Utils:54 - Successfully started service 'sparkDriver' on port 41957. 2018-06-08 14:26:45 INFO SparkEnv:54 - Registering MapOutputTracker 2018-06-08 14:26:45 INFO SparkEnv:54 - Registering BlockManagerMaster 2018-06-08 14:26:45 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 2018-06-08 14:26:45 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up 2018-06-08 14:26:45 INFO DiskBlockManager:54 - Created local directory at /appdata/spark/tmp/blockmgr-7b035871-a1f7-47ff-aad8-f7a43367836e 2018-06-08 14:26:45 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB 2018-06-08 14:26:45 INFO SparkEnv:54 - Registering OutputCommitCoordinator 2018-06-08 14:26:45 INFO log:192 - Logging initialized @3659ms 2018-06-08 14:26:45 INFO Server:346 - jetty-9.3.z-SNAPSHOT 2018-06-08 14:26:45 INFO Server:414 - Started @3733ms 2018-06-08 14:26:45 INFO AbstractConnector:278 - Started ServerConnector@3080efb7{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} 2018-06-08 14:26:45 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040. 
2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2c3409b5{/jobs,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7f1ba569{/jobs/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@493631a1{/jobs/job,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6b12f33c{/jobs/job/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@490023da{/stages,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@31c3a862{/stages/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4da2454f{/stages/stage,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@552f182d {/stages/stage/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@a78a7fa{/stages/pool,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@15142105 {/stages/pool/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7589c977{/storage,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@584a599b{/storage/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1742621f{/storage/rdd,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@23ea75fb {/storage/rdd/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1813d280{/environment,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@129fc698 {/environment/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@16c91c4e{/executors,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@667ce6c1 {/executors/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@60fdbf5c {/executors/threadDump,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@c3a1edd {/executors/threadDump/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started
Re: Spark YARN job submission error (code 13)
Hi, I fixed that problem by putting all the Spark JARS in spark-archive.zip and putting it in the HDFS (as that problem was happening for that reason) - But, I'm facing a new issue now, this is the new RPC error I get (Stack-Trace below) - 2018-06-08 14:26:43 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2018-06-08 14:26:45 INFO SparkContext:54 - Running Spark version 2.3.0 2018-06-08 14:26:45 INFO SparkContext:54 - Submitted application: EndToEnd_FeatureEngineeringPipeline 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing view acls to: bblite 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing modify acls to: bblite 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing view acls groups to: 2018-06-08 14:26:45 INFO SecurityManager:54 - Changing modify acls groups to: 2018-06-08 14:26:45 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(bblite); groups with view permissions: Set(); users with modify permissions: Set(bblite); groups with modify permissions: Set() 2018-06-08 14:26:45 INFO Utils:54 - Successfully started service 'sparkDriver' on port 41957. 2018-06-08 14:26:45 INFO SparkEnv:54 - Registering MapOutputTracker 2018-06-08 14:26:45 INFO SparkEnv:54 - Registering BlockManagerMaster 2018-06-08 14:26:45 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 2018-06-08 14:26:45 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up 2018-06-08 14:26:45 INFO DiskBlockManager:54 - Created local directory at /appdata/spark/tmp/blockmgr-7b035871-a1f7-47ff-aad8-f7a43367836e 2018-06-08 14:26:45 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB 2018-06-08 14:26:45 INFO SparkEnv:54 - Registering OutputCommitCoordinator 2018-06-08 14:26:45 INFO log:192 - Logging initialized @3659ms 2018-06-08 14:26:45 INFO Server:346 - jetty-9.3.z-SNAPSHOT 2018-06-08 14:26:45 INFO Server:414 - Started @3733ms 2018-06-08 14:26:45 INFO AbstractConnector:278 - Started ServerConnector@3080efb7{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} 2018-06-08 14:26:45 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040. 
2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2c3409b5{/jobs,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7f1ba569{/jobs/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@493631a1{/jobs/job,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6b12f33c{/jobs/job/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@490023da{/stages,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@31c3a862{/stages/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4da2454f{/stages/stage,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@552f182d {/stages/stage/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@a78a7fa{/stages/pool,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@15142105 {/stages/pool/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7589c977{/storage,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@584a599b{/storage/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1742621f{/storage/rdd,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@23ea75fb {/storage/rdd/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1813d280{/environment,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@129fc698 {/environment/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@16c91c4e{/executors,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@667ce6c1 {/executors/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@60fdbf5c {/executors/threadDump,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@c3a1edd {/executors/threadDump/json,null,AVAILABLE,@Spark} 2018-06-08 14:26:45 INFO ContextHandler:781 - Started
Re: Spark YARN Error - triggering spark-shell
It seems, your spark-on-yarn application is not able to get it's application master, org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master. Check once on yarn logs Thanks, Sathish- On Fri, Jun 8, 2018 at 2:22 PM, Jeff Zhang wrote: > > Check the yarn AM log for details. > > > > Aakash Basu 于2018年6月8日周五 下午4:36写道: > >> Hi, >> >> Getting this error when trying to run Spark Shell using YARN - >> >> Command: *spark-shell --master yarn --deploy-mode client* >> >> 2018-06-08 13:39:09 WARN Client:66 - Neither spark.yarn.jars nor >> spark.yarn.archive is set, falling back to uploading libraries under >> SPARK_HOME. >> 2018-06-08 13:39:25 ERROR SparkContext:91 - Error initializing SparkContext. >> org.apache.spark.SparkException: Yarn application has already ended! It >> might have been killed or unable to launch application master. >> >> >> The last half of stack-trace - >> >> 2018-06-08 13:56:11 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:66 - >> Attempted to request executors before the AM has registered! >> 2018-06-08 13:56:11 WARN MetricsSystem:66 - Stopping a MetricsSystem that >> is not running >> org.apache.spark.SparkException: Yarn application has already ended! It >> might have been killed or unable to launch application master. >> at >> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89) >> at >> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63) >> at >> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164) >> at org.apache.spark.SparkContext.(SparkContext.scala:500) >> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486) >> at >> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930) >> at >> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921) >> at scala.Option.getOrElse(Option.scala:121) >> at >> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) >> at org.apache.spark.repl.Main$.createSparkSession(Main.scala:103) >> ... 55 elided >> :14: error: not found: value spark >>import spark.implicits._ >> ^ >> :14: error: not found: value spark >>import spark.sql >> >> >> Tried putting the *spark-yarn_2.11-2.3.0.jar *in Hadoop yarn, still not >> working, anything else to do? >> >> Thanks, >> Aakash. >> >
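As a hedged example of pulling those logs once the application has finished (the application id placeholder is whatever spark-submit or the ResourceManager UI reports):

    yarn logs -applicationId <application_id>
    # while the application is still running, the AM container logs are reachable from the
    # "tracking URL" shown by spark-submit / the ResourceManager UI (http://<rm-host>:8088/cluster)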
Re: Spark YARN Error - triggering spark-shell
Check the yarn AM log for details. Aakash Basu 于2018年6月8日周五 下午4:36写道: > Hi, > > Getting this error when trying to run Spark Shell using YARN - > > Command: *spark-shell --master yarn --deploy-mode client* > > 2018-06-08 13:39:09 WARN Client:66 - Neither spark.yarn.jars nor > spark.yarn.archive is set, falling back to uploading libraries under > SPARK_HOME. > 2018-06-08 13:39:25 ERROR SparkContext:91 - Error initializing SparkContext. > org.apache.spark.SparkException: Yarn application has already ended! It might > have been killed or unable to launch application master. > > > The last half of stack-trace - > > 2018-06-08 13:56:11 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:66 - > Attempted to request executors before the AM has registered! > 2018-06-08 13:56:11 WARN MetricsSystem:66 - Stopping a MetricsSystem that is > not running > org.apache.spark.SparkException: Yarn application has already ended! It might > have been killed or unable to launch application master. > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164) > at org.apache.spark.SparkContext.(SparkContext.scala:500) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) > at org.apache.spark.repl.Main$.createSparkSession(Main.scala:103) > ... 55 elided > :14: error: not found: value spark >import spark.implicits._ > ^ > :14: error: not found: value spark >import spark.sql > > > Tried putting the *spark-yarn_2.11-2.3.0.jar *in Hadoop yarn, still not > working, anything else to do? > > Thanks, > Aakash. >
Re: Spark YARN job submission error (code 13)
In Spark on YARN, error code 13 means SparkContext doesn't initialize in time. You can check the yarn application log to get more information. BTW, did you just write a plain python script without creating SparkContext/SparkSession? Aakash Basu 于2018年6月8日周五 下午4:15写道: > Hi, > > I'm trying to run a program on a cluster using YARN. > > YARN is present there along with HADOOP. > > Problem I'm running into is as below - > > Container exited with a non-zero exit code 13 >> Failing this attempt. Failing the application. >> ApplicationMaster host: N/A >> ApplicationMaster RPC port: -1 >> queue: default >> start time: 1528297574594 >> final status: FAILED >> tracking URL: >> http://MasterNode:8088/cluster/app/application_1528296308262_0004 >> user: bblite >> Exception in thread "main" org.apache.spark.SparkException: Application >> application_1528296308262_0004 finished with failed status >> > > I checked on the net and most of the stackoverflow problems say, that the > users have given *.master('local[*]')* in the code while invoking the > Spark Session and at the same time, giving *--master yarn* while doing > the spark-submit, hence they're getting the error due to conflict. > > But, in my case, I've not mentioned any master at all at the code. Just > trying to run it on yarn by giving *--master yarn* while doing the > spark-submit. Below is the code spark invoking - > > spark = SparkSession\ > .builder\ > .appName("Temp_Prog")\ > .getOrCreate() > > Below is the spark-submit - > > *spark-submit --master yarn --deploy-mode cluster --num-executors 3 > --executor-cores 6 --executor-memory 4G > /appdata/codebase/backend/feature_extraction/try_yarn.py* > > I've tried without --deploy-mode too, still no help. > > Thanks, > Aakash. >
Spark YARN Error - triggering spark-shell
Hi, Getting this error when trying to run Spark Shell using YARN - Command: *spark-shell --master yarn --deploy-mode client* 2018-06-08 13:39:09 WARN Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 2018-06-08 13:39:25 ERROR SparkContext:91 - Error initializing SparkContext. org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master. The last half of stack-trace - 2018-06-08 13:56:11 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Attempted to request executors before the AM has registered! 2018-06-08 13:56:11 WARN MetricsSystem:66 - Stopping a MetricsSystem that is not running org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master. at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:89) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164) at org.apache.spark.SparkContext.(SparkContext.scala:500) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2486) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) at org.apache.spark.repl.Main$.createSparkSession(Main.scala:103) ... 55 elided :14: error: not found: value spark import spark.implicits._ ^ :14: error: not found: value spark import spark.sql Tried putting the *spark-yarn_2.11-2.3.0.jar *in Hadoop yarn, still not working, anything else to do? Thanks, Aakash.
Spark YARN job submission error (code 13)
Hi, I'm trying to run a program on a cluster using YARN. YARN is present there along with Hadoop. The problem I'm running into is below - Container exited with a non-zero exit code 13 > Failing this attempt. Failing the application. > ApplicationMaster host: N/A > ApplicationMaster RPC port: -1 > queue: default > start time: 1528297574594 > final status: FAILED > tracking URL: > http://MasterNode:8088/cluster/app/application_1528296308262_0004 > user: bblite > Exception in thread "main" org.apache.spark.SparkException: Application > application_1528296308262_0004 finished with failed status > I checked on the net, and most of the Stack Overflow threads say that the users have given *.master('local[*]')* in the code while invoking the Spark Session and, at the same time, *--master yarn* while doing the spark-submit, hence they're getting the error due to the conflict. But, in my case, I've not mentioned any master at all in the code. I'm just trying to run it on YARN by giving *--master yarn* while doing the spark-submit. Below is the code invoking Spark - spark = SparkSession\ .builder\ .appName("Temp_Prog")\ .getOrCreate() Below is the spark-submit - *spark-submit --master yarn --deploy-mode cluster --num-executors 3 --executor-cores 6 --executor-memory 4G /appdata/codebase/backend/feature_extraction/try_yarn.py* I've tried without --deploy-mode too, still no help. Thanks, Aakash.
Re: Strange codegen error for SortMergeJoin in Spark 2.2.1
Hi! I finally found the problem. I was not aware that the program was run in client mode. The client used version 2.2.0. This caused the problem. Best, Rico. On 07.06.2018 at 08:49, Kazuaki Ishizaki wrote: > Thank you for reporting a problem. > Would it be possible to create a JIRA entry with a small program that > can reproduce this problem? > > Best Regards, > Kazuaki Ishizaki > > > > From: Rico Bergmann > To: "user@spark.apache.org" > Date: 2018/06/05 19:58 > Subject: Strange codegen error for SortMergeJoin in Spark 2.2.1 > > > > > Hi! > > I get a strange error when executing a complex SQL query involving 4 > tables that are left-outer-joined: > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 37, Column 18: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', > Line 37, Column 18: No applicable constructor/method found for actual > parameters "int"; candidates are: > "org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(org.apache.spark.memory.TaskMemoryManager,org.apache.spark.storage.BlockManager, > org.apache.spark.serializer.SerializerManager, > org.apache.spark.TaskContext, int, long, int, int)", > "org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(int, > int)" > > ... > > /* 037 */ smj_matches = new > org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(2147483647); > > The same query works with Spark 2.2.0. > > I checked the Spark source code and saw that in > ExternalAppendOnlyUnsafeRowArray a second int was introduced into the > constructor in 2.2.1. > > But looking at the code generation part of SortMergeJoinExec: > > // A list to hold all matched rows from right side. > val matches = ctx.freshName("matches") > val clsName = classOf[ExternalAppendOnlyUnsafeRowArray].getName > > val spillThreshold = getSpillThreshold > val inMemoryThreshold = getInMemoryThreshold > > ctx.addMutableState(clsName, matches, > s"$matches = new $clsName($inMemoryThreshold, $spillThreshold);") > > it should get 2 parameters, not just one. > > Maybe anyone has an idea? > > Best, > > Rico. > >
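For anyone debugging a similar mismatch, a quick hedged check is to compare the Spark version of the client that launches the job with the version installed on the cluster (the remote path below is illustrative):

    spark-submit --version                          # version of the local client/launcher
    # compare with the cluster-side install, e.g.
    # ssh <cluster-host> /opt/spark/bin/spark-submit --version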