Re: Spark on YARN, HowTo kill executor or individual task?
I tried a similar approach; it works well for user functions, but I need to crash tasks or an executor when the Spark application runs "repartition". I didn't find any way to inject a "poison pill" into the repartition call :(

On Mon, 11 Feb 2019 at 21:19, Vadim Semenov wrote:
> something like this
>
> import org.apache.spark.TaskContext
> ds.map(r => {
>   val taskContext = TaskContext.get()
>   if (taskContext.partitionId == 1000) {
>     throw new RuntimeException
>   }
>   r
> })
>
> On Mon, Feb 11, 2019 at 8:41 AM Serega Sheypak wrote:
>> I need to crash task which does repartition.
>>
>> On Mon, 11 Feb 2019 at 10:37, Gabor Somogyi wrote:
>>> What blocks you to put if conditions inside the mentioned map function?
>>>
>>> On Mon, Feb 11, 2019 at 10:31 AM Serega Sheypak <serega.shey...@gmail.com> wrote:
>>>> Yeah, but I don't need to crash entire app, I want to fail several tasks or executors and then wait for completion.
>>>>
>>>> On Sun, 10 Feb 2019 at 21:49, Gabor Somogyi wrote:
>>>>> Another approach is adding artificial exception into the application's source code like this:
>>>>>
>>>>> val query = input.toDS.map(_ / 0).writeStream.format("console").start()
>>>>>
>>>>> G
>>>>>
>>>>> On Sun, Feb 10, 2019 at 9:36 PM Serega Sheypak <serega.shey...@gmail.com> wrote:
>>>>>> Hi BR,
>>>>>> thanks for your reply. I want to mimic the issue and kill tasks at a certain stage. Killing executor is also an option for me.
>>>>>> I'm curious how do core spark contributors test spark fault tolerance?
>>>>>>
>>>>>> On Sun, 10 Feb 2019 at 16:57, Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>>>>>>> Hi Serega,
>>>>>>>
>>>>>>> If I understand your problem correctly you would like to kill one executor only and the rest of the app has to be untouched.
>>>>>>> If that's true yarn -kill is not what you want because it stops the whole application.
>>>>>>>
>>>>>>> I've done similar thing when tested/testing Spark's HA features.
>>>>>>> - jps -vlm | grep "org.apache.spark.executor.CoarseGrainedExecutorBackend.*applicationid"
>>>>>>> - kill -9 pidofoneexecutor
>>>>>>>
>>>>>>> Be aware if it's a multi-node cluster check whether at least one process runs on a specific node (it's not required).
>>>>>>> Happy killing...
>>>>>>>
>>>>>>> BR,
>>>>>>> G
>>>>>>>
>>>>>>> On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke wrote:
>>>>>>>> yarn application -kill applicationid ?
>>>>>>>>
>>>>>>>> On 10.02.2019 at 13:30, Serega Sheypak <serega.shey...@gmail.com> wrote:
>>>>>>>>> Hi there!
>>>>>>>>> I have weird issue that appears only when tasks fail at specific stage. I would like to imitate failure on my own.
>>>>>>>>> The plan is to run problematic app and then kill entire executor or some tasks when execution reaches certain stage.
>>>>>>>>>
>>>>>>>>> Is it do-able?
>>>>>>>>
>>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
> Sent from my iPhone
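For a repartition-specific poison pill, one option (a sketch only, not something confirmed in this thread) is to attach the failure to the stage that reads the repartition shuffle. The mapPartitionsWithIndex call below is fused into the same stage as the shuffle read produced by repartition, so the thrown exception kills a task of that exchange. The RDD name, the partition index and the exception type are arbitrary assumptions.

import org.apache.spark.TaskContext

// Hedged sketch: fail one task in the stage that consumes the repartition
// shuffle. Only the first attempt throws, so retries let the job finish.
val poisoned = rdd                     // `rdd` stands in for your input RDD
  .repartition(42)
  .mapPartitionsWithIndex { (partitionIdx, rows) =>
    if (partitionIdx == 7 && TaskContext.get().attemptNumber() == 0) {
      throw new RuntimeException(s"Injected failure in partition $partitionIdx")
    }
    rows
  }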
Re: Spark on YARN, HowTo kill executor or individual task?
I need to crash task which does repartition. пн, 11 февр. 2019 г. в 10:37, Gabor Somogyi : > What blocks you to put if conditions inside the mentioned map function? > > On Mon, Feb 11, 2019 at 10:31 AM Serega Sheypak > wrote: > >> Yeah, but I don't need to crash entire app, I want to fail several tasks >> or executors and then wait for completion. >> >> вс, 10 февр. 2019 г. в 21:49, Gabor Somogyi : >> >>> Another approach is adding artificial exception into the application's >>> source code like this: >>> >>> val query = input.toDS.map(_ / 0).writeStream.format("console").start() >>> >>> G >>> >>> >>> On Sun, Feb 10, 2019 at 9:36 PM Serega Sheypak >>> wrote: >>> >>>> Hi BR, >>>> thanks for your reply. I want to mimic the issue and kill tasks at a >>>> certain stage. Killing executor is also an option for me. >>>> I'm curious how do core spark contributors test spark fault tolerance? >>>> >>>> >>>> вс, 10 февр. 2019 г. в 16:57, Gabor Somogyi >>> >: >>>> >>>>> Hi Serega, >>>>> >>>>> If I understand your problem correctly you would like to kill one >>>>> executor only and the rest of the app has to be untouched. >>>>> If that's true yarn -kill is not what you want because it stops the >>>>> whole application. >>>>> >>>>> I've done similar thing when tested/testing Spark's HA features. >>>>> - jps -vlm | grep >>>>> "org.apache.spark.executor.CoarseGrainedExecutorBackend.*applicationid" >>>>> - kill -9 pidofoneexecutor >>>>> >>>>> Be aware if it's a multi-node cluster check whether at least one >>>>> process runs on a specific node(it's not required). >>>>> Happy killing... >>>>> >>>>> BR, >>>>> G >>>>> >>>>> >>>>> On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke >>>>> wrote: >>>>> >>>>>> yarn application -kill applicationid ? >>>>>> >>>>>> > Am 10.02.2019 um 13:30 schrieb Serega Sheypak < >>>>>> serega.shey...@gmail.com>: >>>>>> > >>>>>> > Hi there! >>>>>> > I have weird issue that appears only when tasks fail at specific >>>>>> stage. I would like to imitate failure on my own. >>>>>> > The plan is to run problematic app and then kill entire executor or >>>>>> some tasks when execution reaches certain stage. >>>>>> > >>>>>> > Is it do-able? >>>>>> >>>>>> - >>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >>>>>> >>>>>>
Re: Spark on YARN, HowTo kill executor or individual task?
Yeah, but I don't need to crash entire app, I want to fail several tasks or executors and then wait for completion. вс, 10 февр. 2019 г. в 21:49, Gabor Somogyi : > Another approach is adding artificial exception into the application's > source code like this: > > val query = input.toDS.map(_ / 0).writeStream.format("console").start() > > G > > > On Sun, Feb 10, 2019 at 9:36 PM Serega Sheypak > wrote: > >> Hi BR, >> thanks for your reply. I want to mimic the issue and kill tasks at a >> certain stage. Killing executor is also an option for me. >> I'm curious how do core spark contributors test spark fault tolerance? >> >> >> вс, 10 февр. 2019 г. в 16:57, Gabor Somogyi : >> >>> Hi Serega, >>> >>> If I understand your problem correctly you would like to kill one >>> executor only and the rest of the app has to be untouched. >>> If that's true yarn -kill is not what you want because it stops the >>> whole application. >>> >>> I've done similar thing when tested/testing Spark's HA features. >>> - jps -vlm | grep >>> "org.apache.spark.executor.CoarseGrainedExecutorBackend.*applicationid" >>> - kill -9 pidofoneexecutor >>> >>> Be aware if it's a multi-node cluster check whether at least one process >>> runs on a specific node(it's not required). >>> Happy killing... >>> >>> BR, >>> G >>> >>> >>> On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke >>> wrote: >>> >>>> yarn application -kill applicationid ? >>>> >>>> > Am 10.02.2019 um 13:30 schrieb Serega Sheypak < >>>> serega.shey...@gmail.com>: >>>> > >>>> > Hi there! >>>> > I have weird issue that appears only when tasks fail at specific >>>> stage. I would like to imitate failure on my own. >>>> > The plan is to run problematic app and then kill entire executor or >>>> some tasks when execution reaches certain stage. >>>> > >>>> > Is it do-able? >>>> >>>> - >>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >>>> >>>>
Re: Spark on YARN, HowTo kill executor or individual task?
Hi BR, thanks for your reply. I want to mimic the issue and kill tasks at a certain stage. Killing executor is also an option for me. I'm curious how do core spark contributors test spark fault tolerance? вс, 10 февр. 2019 г. в 16:57, Gabor Somogyi : > Hi Serega, > > If I understand your problem correctly you would like to kill one executor > only and the rest of the app has to be untouched. > If that's true yarn -kill is not what you want because it stops the whole > application. > > I've done similar thing when tested/testing Spark's HA features. > - jps -vlm | grep > "org.apache.spark.executor.CoarseGrainedExecutorBackend.*applicationid" > - kill -9 pidofoneexecutor > > Be aware if it's a multi-node cluster check whether at least one process > runs on a specific node(it's not required). > Happy killing... > > BR, > G > > > On Sun, Feb 10, 2019 at 4:19 PM Jörn Franke wrote: > >> yarn application -kill applicationid ? >> >> > Am 10.02.2019 um 13:30 schrieb Serega Sheypak > >: >> > >> > Hi there! >> > I have weird issue that appears only when tasks fail at specific stage. >> I would like to imitate failure on my own. >> > The plan is to run problematic app and then kill entire executor or >> some tasks when execution reaches certain stage. >> > >> > Is it do-able? >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >>
Spark on YARN, HowTo kill executor or individual task?
Hi there! I have a weird issue that appears only when tasks fail at a specific stage. I would like to imitate the failure on my own. The plan is to run the problematic app and then kill an entire executor or some tasks when execution reaches a certain stage. Is it doable?
Spark 2.x duplicates output when task fails at "repartition" stage. Checkpointing is enabled before repartition.
Hi, I have a Spark job that produces duplicates when one or more tasks from the repartition stage fail. Here is the simplified code.

sparkContext.setCheckpointDir("hdfs://path-to-checkpoint-dir")

val inputRDDs: List[RDD[String]] = List.empty // an RDD per input dir

val updatedRDDs = inputRDDs.map { inputRDD =>
  // some stuff happens here
  inputRDD
    .filter(???)
    .map(???)
}

val unionOfUpdatedRDDs = sparkContext.union(updatedRDDs)
unionOfUpdatedRDDs.checkpoint() // it didn't help

unionOfUpdatedRDDs
  .repartition(42)                 // task failed here,
  .saveAsNewAPIHadoopFile("/path") // task failed here too.

// what really causes duplicates in the output?
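One thing worth checking in this snippet (a hedged suggestion, not a confirmed fix): RDD.checkpoint() is lazy, so without an action between checkpoint() and repartition() the checkpoint data is never materialized before the shuffle, and a retried task can recompute the upstream lineage instead of re-reading stable data. A sketch, reusing the names from the snippet above:

val unionOfUpdatedRDDs = sparkContext.union(updatedRDDs)

// checkpoint() only marks the RDD; the files are written by the next action.
// Forcing a cheap action here materializes the checkpoint, so a task that is
// retried inside repartition() re-reads the checkpointed blocks instead of
// recomputing the upstream lineage.
unionOfUpdatedRDDs.checkpoint()
unionOfUpdatedRDDs.count() // count() is an arbitrary choice of action

unionOfUpdatedRDDs
  .repartition(42)
  .saveAsNewAPIHadoopFile("/path") // as in the original snippet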
Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?
Hi Imran, here is my use case.

There is a 1K-node cluster and jobs have performance degradation because of a single node. It's rather hard to convince Cluster Ops to decommission a node because of "performance degradation". Imagine 10 dev teams chasing a single ops team for a valid reason (the node has problems), or because the code has a bug, or the data is skewed, or spots on the sun. We can't just decommission a node because a random dev complains.

Simple solution:
- Rerun the failed / delayed job and blacklist the "problematic" node in advance.
- Report the problem if the job then works without anomalies.
- Ops collect complaints about the node and start to decommission it when a "complaint threshold" is reached. There is a rather low probability that many loosely coupled teams with loosely coupled jobs complain about a single node.

Results:
- Ops are not spammed with random requests from devs.
- Devs are not blocked because of a really bad node.
- It's very cheap for everyone to "blacklist" a node during job submission without doing anything to the node.
- It's very easy to automate such behavior. Many teams use countless kinds of workflow runners and the strategy is dead simple (depends on SLA, of course); see the listener sketch after this message:
  - Just re-run the failed job excluding nodes with failed tasks (if the number of nodes is reasonable).
  - Kill a stuck job if it runs longer than XXX minutes and restart it excluding nodes with long-running tasks.

On Wed, 23 Jan 2019 at 23:09, Imran Rashid wrote:
> Serga, can you explain a bit more why you want this ability?
> If the node is really bad, wouldn't you want to decomission the NM entirely?
> If you've got heterogenous resources, than nodelabels seem like they would be more appropriate -- and I don't feel great about adding workarounds for the node-label limitations into blacklisting.
>
> I don't want to be stuck supporting a configuration with too limited a use case.
>
> (may be better to move discussion to https://issues.apache.org/jira/browse/SPARK-26688 so its better archived, I'm responding here in case you aren't watching that issue)
>
> On Tue, Jan 22, 2019 at 6:09 AM Jörn Franke wrote:
>> You can try with Yarn node labels:
>> https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/NodeLabel.html
>>
>> Then you can whitelist nodes.
>>
>> On 19.01.2019 at 00:20, Serega Sheypak wrote:
>> Hi, is there any possibility to tell Scheduler to blacklist specific nodes in advance?
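For the "re-run excluding nodes with failed tasks" step, a small listener can collect the candidate hosts during the first run. This is only a sketch of the idea, not an existing Spark feature: the class name is made up, and how the collected hosts are fed into the next submission is left to the workflow runner.

import scala.collection.mutable
import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical listener: remember every host that ran a non-successful task,
// so the workflow runner can exclude those nodes on the next submission.
class FailedHostCollector extends SparkListener {
  val failedHosts = mutable.Set.empty[String]

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.reason != Success) {
      failedHosts.synchronized { failedHosts += taskEnd.taskInfo.host }
    }
  }
}

// Register it before running the job:
// sc.addSparkListener(new FailedHostCollector)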
Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?
Hi Apiros, thanks for your reply. Is it this one: https://github.com/apache/spark/pull/23223 ? Can I try to reach you through Cloudera Support portal? пн, 21 янв. 2019 г. в 20:06, attilapiros : > Hello, I was working on this area last year (I have developed the > YarnAllocatorBlacklistTracker) and if you haven't found any solution for > your problem I can introduce a new config which would contain a sequence of > always blacklisted nodes. This way blacklisting would improve a bit again > :) > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >
Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?
Thanks, so I'll check YARN. Does anyone know if Spark-on-Yarn plans to expose such functionality? сб, 19 янв. 2019 г. в 18:04, Felix Cheung : > To clarify, yarn actually supports excluding node right when requesting > resources. It’s spark that doesn’t provide a way to populate such a > blacklist. > > If you can change yarn config, the equivalent is node label: > https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeLabel.html > > > > -- > *From:* Li Gao > *Sent:* Saturday, January 19, 2019 8:43 AM > *To:* Felix Cheung > *Cc:* Serega Sheypak; user > *Subject:* Re: Spark on Yarn, is it possible to manually blacklist nodes > before running spark job? > > on yarn it is impossible afaik. on kubernetes you can use taints to keep > certain nodes outside of spark > > On Fri, Jan 18, 2019 at 9:35 PM Felix Cheung > wrote: > >> Not as far as I recall... >> >> >> -- >> *From:* Serega Sheypak >> *Sent:* Friday, January 18, 2019 3:21 PM >> *To:* user >> *Subject:* Spark on Yarn, is it possible to manually blacklist nodes >> before running spark job? >> >> Hi, is there any possibility to tell Scheduler to blacklist specific >> nodes in advance? >> >
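Until Spark-on-YARN exposes such a blacklist directly, the node-label route from the replies above can be wired in from the application side. A sketch, assuming the cluster admins have already defined a YARN node label (here arbitrarily called "healthy") and assigned it to the good machines:

import org.apache.spark.SparkConf

// Pin both the AM and the executors to nodes carrying the "healthy" label.
// The label name is hypothetical; the two property names are standard
// Spark-on-YARN settings.
val conf = new SparkConf()
  .set("spark.yarn.am.nodeLabelExpression", "healthy")
  .set("spark.yarn.executor.nodeLabelExpression", "healthy")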
Spark on Yarn, is it possible to manually blacklist nodes before running spark job?
Hi, is there any possibility to tell Scheduler to blacklist specific nodes in advance?
Kill spark executor when spark runs specific stage
Hi, I'm running Spark on YARN. My code is very simple. I want to kill one executor when "data.repartition(10)" is executed. How can I do it in an easy way?

val data = sc.sequenceFile[NullWritable, BytesWritable](inputPath)
  .map { case (key, value) => Data.fromBytes(value) }

val process = data.repartition(10) // kill one executor here

process.map { d =>
  val data = d.toByteArray
  (new AvroKey(ByteBuffer.wrap(data)), NullWritable.get())
}
.saveAsNewAPIHadoopFile[AvroKeyOutputFormat[ByteBuffer]](outputPath)
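One way to do this without shelling into the node (a sketch, not a tested recipe) is the developer API SparkContext.killExecutor: register a listener, wait for a task of the repartition stage to start, and kill the executor it landed on. The stage id is an assumption you would read off the Spark UI or a previous run, and killExecutor only takes effect on cluster managers that support it, such as YARN.

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

// Hypothetical sketch: kill the executor that picks up the first task of the
// target stage (e.g. the stage created by data.repartition(10)).
class ExecutorKiller(sc: SparkContext, targetStageId: Int) extends SparkListener {
  @volatile private var alreadyKilled = false

  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
    if (!alreadyKilled && taskStart.stageId == targetStageId) {
      alreadyKilled = true
      sc.killExecutor(taskStart.taskInfo.executorId) // developer API
    }
  }
}

// sc.addSparkListener(new ExecutorKiller(sc, targetStageId = 1)) // stage id is an assumption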
Re: how "hour" function in Spark SQL is supposed to work?
Ok, this one works: .withColumn("hour", hour(from_unixtime(typedDataset.col("ts") / 1000))) 2018-03-20 22:43 GMT+01:00 Serega Sheypak <serega.shey...@gmail.com>: > Hi, any updates? Looks like some API inconsistency or bug..? > > 2018-03-17 13:09 GMT+01:00 Serega Sheypak <serega.shey...@gmail.com>: > >> > Not sure why you are dividing by 1000. from_unixtime expects a long type >> It expects seconds, I have milliseconds. >> >> >> >> 2018-03-12 6:16 GMT+01:00 vermanurag <anurag.ve...@fnmathlogic.com>: >> >>> Not sure why you are dividing by 1000. from_unixtime expects a long type >>> which is time in milliseconds from reference date. >>> >>> The following should work: >>> >>> val ds = dataset.withColumn("hour",hour(from_unixtime(dataset.col("ts >>> " >>> >>> >>> >>> >>> >>> >>> -- >>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ >>> >>> - >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >>> >>> >> >
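For completeness, a minimal self-contained sketch of the working pattern (assuming a SparkSession named `spark` and epoch milliseconds in `ts`): from_unixtime expects seconds, hence the division by 1000, and hour is then applied to the resulting timestamp.

import org.apache.spark.sql.functions._

// Assumes an existing SparkSession called `spark`; the sample value is arbitrary.
import spark.implicits._

val df = Seq(1521375636129L).toDF("ts") // epoch milliseconds

df.select(hour(from_unixtime($"ts" / 1000)).as("hour")).show()
// the printed hour depends on the session time zone (12 for this value in UTC)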
Re: how "hour" function in Spark SQL is supposed to work?
Hi, any updates? Looks like some API inconsistency or bug..? 2018-03-17 13:09 GMT+01:00 Serega Sheypak <serega.shey...@gmail.com>: > > Not sure why you are dividing by 1000. from_unixtime expects a long type > It expects seconds, I have milliseconds. > > > > 2018-03-12 6:16 GMT+01:00 vermanurag <anurag.ve...@fnmathlogic.com>: > >> Not sure why you are dividing by 1000. from_unixtime expects a long type >> which is time in milliseconds from reference date. >> >> The following should work: >> >> val ds = dataset.withColumn("hour",hour(from_unixtime(dataset.col(" >> ts" >> >> >> >> >> >> >> -- >> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >> >
Re: Run spark 2.2 on yarn as usual java application
Hi Jörn, thanks for your reply Oozie starts ooze java action as single "long running" MapReduce Mapper. This mapper is responsible for calling main class. Main class belongs to user and this main class starts spark job. yarn-cluster is not an option for me. I have to do something special to maintain "run away" driver. Imagine I want to kill the spark job. I can just kill oozie workflow, it will kill spawned mapper with main class with driver inside it. It won't happen in yarn-cluster mode, since driver is not running in the process "managed" by oozie. 2018-03-19 13:41 GMT+01:00 Jörn Franke <jornfra...@gmail.com>: > Maybe you should better run it in yarn cluster mode. Yarn client would > start the driver on the oozie server. > > On 19. Mar 2018, at 12:58, Serega Sheypak <serega.shey...@gmail.com> > wrote: > > I'm trying to run it as Oozie java action and reduce env dependency. The > only thing I need is Hadoop Configuration to talk to hdfs and yarn. > Spark submit is a shell thing. Trying to do all from jvm. > Oozie java action starts main class which inststiates SparkConf and > session. It works well in local mode but throws exception when I try to run > spark as yarn-client > > пн, 19 марта 2018 г. в 7:16, Jacek Laskowski <ja...@japila.pl>: > >> Hi, >> >> What's the deployment process then (if not using spark-submit)? How is >> the AM deployed? Why would you want to skip spark-submit? >> >> Jacek >> >> On 19 Mar 2018 00:20, "Serega Sheypak" <serega.shey...@gmail.com> wrote: >> >>> Hi, Is it even possible to run spark on yarn as usual java application? >>> I've built jat using maven with spark-yarn dependency and I manually >>> populate SparkConf with all hadoop properties. >>> SparkContext fails to start with exception: >>> >>>1. Caused by: java.lang.IllegalStateException: Library directory >>>'/hadoop/yarn/local/usercache/root/appcache/application_ >>>1521375636129_0022/container_e06_1521375636129_0022_01_ >>>02/assembly/target/scala-2.11/jars' does not exist; make sure >>>Spark is built. >>>2. at org.apache.spark.launcher.CommandBuilderUtils.checkState(Com >>>mandBuilderUtils.java:260) >>>3. at org.apache.spark.launcher.CommandBuilderUtils.findJarsDir(Co >>>mmandBuilderUtils.java:359) >>>4. at org.apache.spark.launcher.YarnCommandBuilderUtils$.findJarsDir( >>>YarnCommandBuilderUtils.scala:38) >>> >>> >>> I took a look at the code and it has some hardcodes and checks for >>> specific files layout. I don't follow why :) >>> Is it possible to bypass such checks? >>> >>
Re: Run spark 2.2 on yarn as usual java application
I'm trying to run it as Oozie java action and reduce env dependency. The only thing I need is Hadoop Configuration to talk to hdfs and yarn. Spark submit is a shell thing. Trying to do all from jvm. Oozie java action starts main class which inststiates SparkConf and session. It works well in local mode but throws exception when I try to run spark as yarn-client пн, 19 марта 2018 г. в 7:16, Jacek Laskowski <ja...@japila.pl>: > Hi, > > What's the deployment process then (if not using spark-submit)? How is the > AM deployed? Why would you want to skip spark-submit? > > Jacek > > On 19 Mar 2018 00:20, "Serega Sheypak" <serega.shey...@gmail.com> wrote: > >> Hi, Is it even possible to run spark on yarn as usual java application? >> I've built jat using maven with spark-yarn dependency and I manually >> populate SparkConf with all hadoop properties. >> SparkContext fails to start with exception: >> >>1. Caused by: java.lang.IllegalStateException: Library directory >> >> '/hadoop/yarn/local/usercache/root/appcache/application_1521375636129_0022/container_e06_1521375636129_0022_01_02/assembly/target/scala-2.11/jars' >>does not exist; make sure Spark is built. >>2. at org.apache.spark.launcher.CommandBuilderUtils.checkState( >>CommandBuilderUtils.java:260) >>3. at org.apache.spark.launcher.CommandBuilderUtils.findJarsDir( >>CommandBuilderUtils.java:359) >>4. at org.apache.spark.launcher.YarnCommandBuilderUtils$.findJarsDir( >>YarnCommandBuilderUtils.scala:38) >> >> >> I took a look at the code and it has some hardcodes and checks for >> specific files layout. I don't follow why :) >> Is it possible to bypass such checks? >> >
Run spark 2.2 on yarn as usual java application
Hi, Is it even possible to run spark on yarn as usual java application? I've built jat using maven with spark-yarn dependency and I manually populate SparkConf with all hadoop properties. SparkContext fails to start with exception: 1. Caused by: java.lang.IllegalStateException: Library directory '/hadoop/yarn/local/usercache/root/appcache/application_1521375636129_0022/container_e06_1521375636129_0022_01_02/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built. 2. at org.apache.spark.launcher.CommandBuilderUtils.checkState( CommandBuilderUtils.java:260) 3. at org.apache.spark.launcher.CommandBuilderUtils.findJarsDir( CommandBuilderUtils.java:359) 4. at org.apache.spark.launcher.YarnCommandBuilderUtils$.findJarsDir( YarnCommandBuilderUtils.scala:38) I took a look at the code and it has some hardcodes and checks for specific files layout. I don't follow why :) Is it possible to bypass such checks?
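If the assumption is right that this exception comes from the YARN client falling back to a local Spark distribution when neither spark.yarn.jars nor spark.yarn.archive is set, then pointing spark.yarn.jars at pre-uploaded jars should bypass the check. A sketch (the HDFS path and app name are made up):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// With spark.yarn.jars pointing at pre-uploaded jars, the YARN client ships
// those instead of searching for a local Spark build, which is what the
// "Library directory ... does not exist" check is about.
val conf = new SparkConf()
  .setMaster("yarn")
  .setAppName("launched-from-plain-jvm")                        // hypothetical name
  .set("spark.submit.deployMode", "client")
  .set("spark.yarn.jars", "hdfs:///apps/spark-2.2/jars/*.jar")  // hypothetical path

val spark = SparkSession.builder().config(conf).getOrCreate()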
Re: Append more files to existing partitioned data
Thanks a lot! 2018-03-18 9:30 GMT+01:00 Denis Bolshakov <bolshakov.de...@gmail.com>: > Please checkout. > > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand > > > and > > org.apache.spark.sql.execution.datasources.WriteRelation > > > I guess it's managed by > > job.getConfiguration.set(DATASOURCE_WRITEJOBUUID, uniqueWriteJobId.toString) > > > On 17 March 2018 at 20:46, Serega Sheypak <serega.shey...@gmail.com> > wrote: > >> Hi Denis, great to see you here :) >> It works, thanks! >> >> Do you know how spark generates datafile names? names look like >> part- with uuid appended after >> >> part-0-124a8c43-83b9-44e1-a9c4-dcc8676cdb99.c000.snappy.parquet >> >> >> >> >> 2018-03-17 14:15 GMT+01:00 Denis Bolshakov <bolshakov.de...@gmail.com>: >> >>> Hello Serega, >>> >>> https://spark.apache.org/docs/latest/sql-programming-guide.html >>> >>> Please try SaveMode.Append option. Does it work for you? >>> >>> >>> сб, 17 мар. 2018 г., 15:19 Serega Sheypak <serega.shey...@gmail.com>: >>> >>>> Hi, I', using spark-sql to process my data and store result as parquet >>>> partitioned by several columns >>>> >>>> ds.write >>>> .partitionBy("year", "month", "day", "hour", "workflowId") >>>> .parquet("/here/is/my/dir") >>>> >>>> >>>> I want to run more jobs that will produce new partitions or add more >>>> files to existing partitions. >>>> What is the right way to do it? >>>> >>> >> > > > -- > //with Best Regards > --Denis Bolshakov > e-mail: bolshakov.de...@gmail.com >
Re: Append more files to existing partitioned data
Hi Denis, great to see you here :) It works, thanks! Do you know how spark generates datafile names? names look like part- with uuid appended after part-0-124a8c43-83b9-44e1-a9c4-dcc8676cdb99.c000.snappy.parquet 2018-03-17 14:15 GMT+01:00 Denis Bolshakov <bolshakov.de...@gmail.com>: > Hello Serega, > > https://spark.apache.org/docs/latest/sql-programming-guide.html > > Please try SaveMode.Append option. Does it work for you? > > > сб, 17 мар. 2018 г., 15:19 Serega Sheypak <serega.shey...@gmail.com>: > >> Hi, I', using spark-sql to process my data and store result as parquet >> partitioned by several columns >> >> ds.write >> .partitionBy("year", "month", "day", "hour", "workflowId") >> .parquet("/here/is/my/dir") >> >> >> I want to run more jobs that will produce new partitions or add more >> files to existing partitions. >> What is the right way to do it? >> >
Append more files to existing partitioned data
Hi, I'm using spark-sql to process my data and store the result as parquet partitioned by several columns:

ds.write
  .partitionBy("year", "month", "day", "hour", "workflowId")
  .parquet("/here/is/my/dir")

I want to run more jobs that will produce new partitions or add more files to existing partitions. What is the right way to do it?
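The fix this thread converged on (see the replies above) is to write with SaveMode.Append; a minimal sketch of the same writer call:

import org.apache.spark.sql.SaveMode

// Append new partitions/files instead of failing or overwriting the directory.
ds.write
  .mode(SaveMode.Append)
  .partitionBy("year", "month", "day", "hour", "workflowId")
  .parquet("/here/is/my/dir")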
Re: how "hour" function in Spark SQL is supposed to work?
> Not sure why you are dividing by 1000. from_unixtime expects a long type It expects seconds, I have milliseconds. 2018-03-12 6:16 GMT+01:00 vermanurag: > Not sure why you are dividing by 1000. from_unixtime expects a long type > which is time in milliseconds from reference date. > > The following should work: > > val ds = dataset.withColumn("hour",hour(from_unixtime(dataset.col("ts" > > > > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >
how "hour" function in Spark SQL is supposed to work?
Hi, I'm desperately trying to extract the hour from unix time. The year, month and dayofmonth functions work as expected; the hour function always returns 0.

val ds = dataset
  .withColumn("year", year(to_date(from_unixtime(dataset.col("ts") / 1000))))
  .withColumn("month", month(to_date(from_unixtime(dataset.col("ts") / 1000))))
  .withColumn("day", dayofmonth(to_date(from_unixtime(dataset.col("ts") / 1000))))
  .withColumn("hour", hour(from_utc_timestamp(dataset.col("ts") / 1000, "UTC")))
  //.withColumn("hour", hour(dataset.col("ts") / 1000))
  //.withColumn("hour1", hour(dataset.col("ts")))
  //.withColumn("hour", hour(dataset.col("ts")))
  //.withColumn("hour", hour("2009-07-30 12:58:59"))

I took a look at the source code. year, month and dayofmonth expect

override def inputTypes: Seq[AbstractDataType] = Seq(DateType)

The hour function expects something different:

override def inputTypes: Seq[AbstractDataType] = Seq(TimestampType)

from_utc_timestamp returns a Timestamp:

override def dataType: DataType = TimestampType

but it didn't help. What do I do wrong? How can I get the hour from unix time? Thanks!
Implement Dataset reader from SEQ file with protobuf to Dataset
Hi, did anyone try to implement a Spark SQL Dataset reader that turns a SequenceFile with protobuf inside into a Dataset?

Imagine I have a protobuf definition Person:
- name: String
- lastName: String
- phones: List[String]

and the generated Scala case class:

case class Person(name: String, lastName: String, phones: List[String])

I want to write some component that gives me a Dataset with a typed schema:

val personsDataset = spark.read
  .option("inferSchema", "true")[Person]

Where can I take a look at references?
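There is no built-in reader for this, but here is a small sketch of one way to assemble it: read the SequenceFile as an RDD of bytes, parse each value with the protobuf-generated class, map it to the case class, and call toDS(). The generated parser name (PersonProto.Person) is a guess at what the protobuf bindings would produce.

import scala.collection.JavaConverters._
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(name: String, lastName: String, phones: List[String])

// Sketch: SequenceFile[NullWritable, BytesWritable] -> Dataset[Person].
// PersonProto.Person is a hypothetical protobuf-generated Java class.
def readPersons(spark: SparkSession, path: String): Dataset[Person] = {
  import spark.implicits._
  spark.sparkContext
    .sequenceFile[NullWritable, BytesWritable](path)
    .map { case (_, bytes) =>
      val proto = PersonProto.Person.parseFrom(bytes.copyBytes())
      Person(proto.getName, proto.getLastName, proto.getPhonesList.asScala.toList)
    }
    .toDS()
}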
Re: Managed memory leak detected. SPARK-11293?
Ok, it happens only in YARN+cluster mode. It works with snappy in YARN+client mode. I've started to hit this problem when I switched to cluster mode. 2016-05-18 16:31 GMT+02:00 Ted Yu <yuzhih...@gmail.com>: > According to: > > http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of > > performance of snappy and lzf were on-par to each other. > > Maybe lzf has lower memory requirement. > > On Wed, May 18, 2016 at 7:22 AM, Serega Sheypak <serega.shey...@gmail.com> > wrote: > >> Switching from snappy to lzf helped me: >> >> *spark.io.compression.codec=lzf* >> >> Do you know why? :) I can't find exact explanation... >> >> >> >> 2016-05-18 15:41 GMT+02:00 Ted Yu <yuzhih...@gmail.com>: >> >>> Please increase the number of partitions. >>> >>> Cheers >>> >>> On Wed, May 18, 2016 at 4:17 AM, Serega Sheypak < >>> serega.shey...@gmail.com> wrote: >>> >>>> Hi, please have a look at log snippet: >>>> 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Doing the fetch; >>>> tracker endpoint = >>>> NettyRpcEndpointRef(spark://mapoutputtrac...@xxx.xxx.xxx.xxx:38128) >>>> 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Got the output >>>> locations >>>> 16/05/18 03:27:16 INFO storage.ShuffleBlockFetcherIterator: Getting 30 >>>> non-empty blocks out of 30 blocks >>>> 16/05/18 03:27:16 INFO storage.ShuffleBlockFetcherIterator: Started 30 >>>> remote fetches in 3 ms >>>> 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Don't have map >>>> outputs for shuffle 1, fetching them >>>> 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Doing the fetch; >>>> tracker endpoint = >>>> NettyRpcEndpointRef(spark://mapoutputtrac...@xxx.xxx.xxx.xxx:38128) >>>> 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Got the output >>>> locations >>>> 16/05/18 03:27:16 INFO storage.ShuffleBlockFetcherIterator: Getting 1 >>>> non-empty blocks out of 1500 blocks >>>> 16/05/18 03:27:16 INFO storage.ShuffleBlockFetcherIterator: Started 1 >>>> remote fetches in 1 ms >>>> 16/05/18 03:27:17 ERROR executor.Executor: Managed memory leak >>>> detected; size = 6685476 bytes, TID = 3405 >>>> 16/05/18 03:27:17 ERROR executor.Executor: Exception in task 285.0 in >>>> stage 6.0 (TID 3405) >>>> >>>> Is it related to https://issues.apache.org/jira/browse/SPARK-11293 >>>> >>>> Is there any recommended workaround? >>>> >>> >>> >> >
Re: Managed memory leak detected. SPARK-11293?
Switching from snappy to lzf helped me: *spark.io.compression.codec=lzf* Do you know why? :) I can't find exact explanation... 2016-05-18 15:41 GMT+02:00 Ted Yu <yuzhih...@gmail.com>: > Please increase the number of partitions. > > Cheers > > On Wed, May 18, 2016 at 4:17 AM, Serega Sheypak <serega.shey...@gmail.com> > wrote: > >> Hi, please have a look at log snippet: >> 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Doing the fetch; >> tracker endpoint = >> NettyRpcEndpointRef(spark://mapoutputtrac...@xxx.xxx.xxx.xxx:38128) >> 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Got the output >> locations >> 16/05/18 03:27:16 INFO storage.ShuffleBlockFetcherIterator: Getting 30 >> non-empty blocks out of 30 blocks >> 16/05/18 03:27:16 INFO storage.ShuffleBlockFetcherIterator: Started 30 >> remote fetches in 3 ms >> 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Don't have map >> outputs for shuffle 1, fetching them >> 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Doing the fetch; >> tracker endpoint = >> NettyRpcEndpointRef(spark://mapoutputtrac...@xxx.xxx.xxx.xxx:38128) >> 16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Got the output >> locations >> 16/05/18 03:27:16 INFO storage.ShuffleBlockFetcherIterator: Getting 1 >> non-empty blocks out of 1500 blocks >> 16/05/18 03:27:16 INFO storage.ShuffleBlockFetcherIterator: Started 1 >> remote fetches in 1 ms >> 16/05/18 03:27:17 ERROR executor.Executor: Managed memory leak detected; >> size = 6685476 bytes, TID = 3405 >> 16/05/18 03:27:17 ERROR executor.Executor: Exception in task 285.0 in >> stage 6.0 (TID 3405) >> >> Is it related to https://issues.apache.org/jira/browse/SPARK-11293 >> >> Is there any recommended workaround? >> > >
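For anyone hitting the same error, here are the two mitigations mentioned in this thread (the lzf codec above and Ted's suggestion to raise the partition count) as a sketch; the partition number is only illustrative:

import org.apache.spark.SparkConf

// Workaround from this thread: use lzf instead of snappy for internal/shuffle compression.
val conf = new SparkConf().set("spark.io.compression.codec", "lzf")

// Ted's suggestion: give the wide stage more partitions.
// rdd.repartition(3000) // 3000 is only an example; tune it for your data volume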
Managed memory leak detected. SPARK-11293?
Hi, please have a look at this log snippet:

16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://mapoutputtrac...@xxx.xxx.xxx.xxx:38128)
16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Got the output locations
16/05/18 03:27:16 INFO storage.ShuffleBlockFetcherIterator: Getting 30 non-empty blocks out of 30 blocks
16/05/18 03:27:16 INFO storage.ShuffleBlockFetcherIterator: Started 30 remote fetches in 3 ms
16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them
16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://mapoutputtrac...@xxx.xxx.xxx.xxx:38128)
16/05/18 03:27:16 INFO spark.MapOutputTrackerWorker: Got the output locations
16/05/18 03:27:16 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1500 blocks
16/05/18 03:27:16 INFO storage.ShuffleBlockFetcherIterator: Started 1 remote fetches in 1 ms
16/05/18 03:27:17 ERROR executor.Executor: Managed memory leak detected; size = 6685476 bytes, TID = 3405
16/05/18 03:27:17 ERROR executor.Executor: Exception in task 285.0 in stage 6.0 (TID 3405)

Is it related to https://issues.apache.org/jira/browse/SPARK-11293 ?

Is there any recommended workaround?
Re: Why can't Spark 1.6.0 use jar files stored on HDFS?
Hi, I know about that approach. I don't want to run mess of classes from single jar, I want to utilize distributed cache functionality and ship application jar and dependent jars explicitly. --deploy-mode client unfortunately copies and distributes all jars repeatedly for every spark job started from driver class... 2016-05-17 15:41 GMT+02:00 <spark@yahoo.com>: > Hi Serega, > > Create a jar including all the the dependencies and execute it like below > through shell script > > /usr/local/spark/bin/spark-submit \ //location of your spark-submit > --class classname \ //location of your main classname > --master yarn \ > --deploy-mode cluster \ > /home/hadoop/SparkSampleProgram.jar //location of your jar file > > Thanks > Raj > > > > Sent from Yahoo Mail. Get the app <https://yho.com/148vdq> > > > On Tuesday, May 17, 2016 6:03 PM, Serega Sheypak <serega.shey...@gmail.com> > wrote: > > > hi, I'm trying to: > 1. upload my app jar files to HDFS > 2. run spark-submit with: > 2.1. --master yarn --deploy-mode cluster > or > 2.2. --master yarn --deploy-mode client > > specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar > > When spark job is submitted, SparkSubmit client outputs: > Warning: Skip remote jar hdfs:///user/baba/lib/akka-slf4j_2.11-2.3.11.jar > ... > > and then spark application main class fails with class not found exception. > Is there any workaround? > > >
Re: Why can't Spark 1.6.0 use jar files stored on HDFS?
spark-submit --conf "spark.driver.userClassPathFirst=true" --class com.MyClass --master yarn --deploy-mode client --jars hdfs:///my-lib.jar,hdfs:///my-seocnd-lib.jar jar-wth-com-MyClass.jar job_params 2016-05-17 15:41 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>: > https://issues.apache.org/jira/browse/SPARK-10643 > > Looks like it's the reason... > > 2016-05-17 15:31 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>: > >> No, and it looks like a problem. >> >> 2.2. --master yarn --deploy-mode client >> means: >> 1. submit spark as yarn app, but spark-driver is started on local >> machine. >> 2. A upload all dependent jars to HDFS and specify jar HDFS paths in >> --jars arg. >> 3. Driver runs my Spark Application main class named "MySuperSparkJob" >> and MySuperSparkJob fails because it doesn't get jars, thay are all in >> HDFS and not accessible from local machine... >> >> >> 2016-05-17 15:18 GMT+02:00 Jeff Zhang <zjf...@gmail.com>: >> >>> Do you put your app jar on hdfs ? The app jar must be on your local >>> machine. >>> >>> On Tue, May 17, 2016 at 8:33 PM, Serega Sheypak < >>> serega.shey...@gmail.com> wrote: >>> >>>> hi, I'm trying to: >>>> 1. upload my app jar files to HDFS >>>> 2. run spark-submit with: >>>> 2.1. --master yarn --deploy-mode cluster >>>> or >>>> 2.2. --master yarn --deploy-mode client >>>> >>>> specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar >>>> >>>> When spark job is submitted, SparkSubmit client outputs: >>>> Warning: Skip remote jar >>>> hdfs:///user/baba/lib/akka-slf4j_2.11-2.3.11.jar ... >>>> >>>> and then spark application main class fails with class not found >>>> exception. >>>> Is there any workaround? >>>> >>> >>> >>> >>> -- >>> Best Regards >>> >>> Jeff Zhang >>> >> >> >
Re: Why can't Spark 1.6.0 use jar files stored on HDFS?
https://issues.apache.org/jira/browse/SPARK-10643 Looks like it's the reason... 2016-05-17 15:31 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>: > No, and it looks like a problem. > > 2.2. --master yarn --deploy-mode client > means: > 1. submit spark as yarn app, but spark-driver is started on local machine. > 2. A upload all dependent jars to HDFS and specify jar HDFS paths in > --jars arg. > 3. Driver runs my Spark Application main class named "MySuperSparkJob" and > MySuperSparkJob > fails because it doesn't get jars, thay are all in HDFS and not accessible > from local machine... > > > 2016-05-17 15:18 GMT+02:00 Jeff Zhang <zjf...@gmail.com>: > >> Do you put your app jar on hdfs ? The app jar must be on your local >> machine. >> >> On Tue, May 17, 2016 at 8:33 PM, Serega Sheypak <serega.shey...@gmail.com >> > wrote: >> >>> hi, I'm trying to: >>> 1. upload my app jar files to HDFS >>> 2. run spark-submit with: >>> 2.1. --master yarn --deploy-mode cluster >>> or >>> 2.2. --master yarn --deploy-mode client >>> >>> specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar >>> >>> When spark job is submitted, SparkSubmit client outputs: >>> Warning: Skip remote jar >>> hdfs:///user/baba/lib/akka-slf4j_2.11-2.3.11.jar ... >>> >>> and then spark application main class fails with class not found >>> exception. >>> Is there any workaround? >>> >> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > >
Re: Why can't Spark 1.6.0 use jar files stored on HDFS?
No, and it looks like a problem. 2.2. --master yarn --deploy-mode client means: 1. submit spark as yarn app, but spark-driver is started on local machine. 2. A upload all dependent jars to HDFS and specify jar HDFS paths in --jars arg. 3. Driver runs my Spark Application main class named "MySuperSparkJob" and MySuperSparkJob fails because it doesn't get jars, thay are all in HDFS and not accessible from local machine... 2016-05-17 15:18 GMT+02:00 Jeff Zhang <zjf...@gmail.com>: > Do you put your app jar on hdfs ? The app jar must be on your local > machine. > > On Tue, May 17, 2016 at 8:33 PM, Serega Sheypak <serega.shey...@gmail.com> > wrote: > >> hi, I'm trying to: >> 1. upload my app jar files to HDFS >> 2. run spark-submit with: >> 2.1. --master yarn --deploy-mode cluster >> or >> 2.2. --master yarn --deploy-mode client >> >> specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar >> >> When spark job is submitted, SparkSubmit client outputs: >> Warning: Skip remote jar hdfs:///user/baba/lib/akka-slf4j_2.11-2.3.11.jar >> ... >> >> and then spark application main class fails with class not found >> exception. >> Is there any workaround? >> > > > > -- > Best Regards > > Jeff Zhang >
Why can't Spark 1.6.0 use jar files stored on HDFS?
hi, I'm trying to: 1. upload my app jar files to HDFS 2. run spark-submit with: 2.1. --master yarn --deploy-mode cluster or 2.2. --master yarn --deploy-mode client specifying --jars hdfs:///my/home/commons.jar,hdfs:///my/home/super.jar When spark job is submitted, SparkSubmit client outputs: Warning: Skip remote jar hdfs:///user/baba/lib/akka-slf4j_2.11-2.3.11.jar ... and then spark application main class fails with class not found exception. Is there any workaround?
Re: Spark SQL and Cassandra: Spark generates 769 tasks to read 3 lines from a Cassandra table
> version
We are on DSE 4.7 (Cassandra 2.1) and Spark 1.2.1.
In cqlsh, "select * from site_users" returns fast, subsecond, only 3 rows.

> Can you show some code how you're doing the reads?
dse beeline
!connect ...
select * from site_users  -- table has 3 rows, several columns in each row.
Spark runs 769 tasks and estimates the input as 80 TB.

0: jdbc:hive2://dsenode01:1 select count(*) from site_users;
+------+
| _c0  |
+------+
| 3    |
+------+
1 row selected (41.635 seconds)

Spark and Cassandra connector:
/usr/share/dse/spark/lib/spark-cassandra-connector-java_2.10-1.2.1.jar
/usr/share/dse/spark/lib/spark-cassandra-connector_2.10-1.2.1.jar

2015-06-17 13:52 GMT+02:00 Yana Kadiyska yana.kadiy...@gmail.com:
Can you show some code how you're doing the reads? Have you successfully read other stuff from Cassandra (i.e. do you have a lot of experience with this path and this particular table is causing issues or are you trying to figure out the right way to do a read). What version of Spark and Cassandra-connector are you using? Also, what do you get for select count(*) from foo -- is that just as bad?

On Wed, Jun 17, 2015 at 4:37 AM, Serega Sheypak serega.shey...@gmail.com wrote:
Hi, can somebody suggest me the way to reduce quantity of task?

2015-06-15 18:26 GMT+02:00 Serega Sheypak serega.shey...@gmail.com:
Hi, I'm running spark sql against Cassandra table. I have 3 C* nodes, Each of them has spark worker. The problem is that spark runs 869 task to read 3 lines: select bar from foo. I've tried these properties:
#try to avoid 769 tasks per dummy select foo from bar qeury
spark.cassandra.input.split.size_in_mb=32mb
spark.cassandra.input.fetch.size_in_rows=1000
spark.cassandra.input.split.size=1
but it doesn't help. Here are mean metrics for the job:
input1 = 8388608.0 TB
input2 = -320 B
input3 = -400 B
I'm confused with input, there are only 3 rows in C* table. Definitely, I don't have 8388608.0 TB of data :)
Re: Spark SQL and Cassandra: Spark generates 769 tasks to read 3 lines from a Cassandra table
Hi, can somebody suggest me the way to reduce quantity of task? 2015-06-15 18:26 GMT+02:00 Serega Sheypak serega.shey...@gmail.com: Hi, I'm running spark sql against Cassandra table. I have 3 C* nodes, Each of them has spark worker. The problem is that spark runs 869 task to read 3 lines: select bar from foo. I've tried these properties: #try to avoid 769 tasks per dummy select foo from bar qeury spark.cassandra.input.split.size_in_mb=32mb spark.cassandra.input.fetch.size_in_rows=1000 spark.cassandra.input.split.size=1 but it doesn't help. Here are mean metrics for the job : input1= 8388608.0 TB input2 = -320 B input3 = -400 B I'm confused with input, there are only 3 rows in C* table. Definitely, I don't have 8388608.0 TB of data :)
spark-sql estimates Cassandra table with 3 rows as 8 TB of data
Hi, spark-sql estimated the input for a Cassandra table with 3 rows as 8 TB. Sometimes it's estimated as -167 B. I run it on a laptop; I don't have 8 TB of space for the data.
Re: Spark SQL and Cassandra: Spark generates 769 tasks to read 3 lines from a Cassandra table
So, there is some input: So the problem could be in spark-sql-thriftserver. When I use spark console to submit SQL query, it takes 10 seconds and reasonable count of tasks. import com.datastax.spark.connector._; val cc = new CassandraSQLContext(sc); cc.sql(select su.user_id from appdata.site_users su join appdata.user_orders uo on uo.user_id=su.user_id).count(); res8: Long = 2 If the same query submitted through beeline, it takes minutes and spark creates up to 2000 tasks to read 3 lines of data. We think spark-sql-thriftserver has bugs in it. 2015-06-17 14:14 GMT+02:00 Serega Sheypak serega.shey...@gmail.com: version We are on DSE 4.7. (Cassandra 2.1) and spark 1.2.1 cqlsh select * from site_users returns fast, subsecond, only 3 rows Can you show some code how you're doing the reads? dse beeline !connect ... select * from site_users --table has 3 rows, several columns in each row. Spark eunts 769 tasks and estimates input as 80 TB 0: jdbc:hive2://dsenode01:1 select count(*) from site_users; +--+ | _c0 | +--+ | 3| +--+ 1 row selected (41.635 seconds) Spark and Cassandra-connector /usr/share/dse/spark/lib/spark-cassandra-connector-java_2.10-1.2.1.jar /usr/share/dse/spark/lib/spark-cassandra-connector_2.10-1.2.1.jar 2015-06-17 13:52 GMT+02:00 Yana Kadiyska yana.kadiy...@gmail.com: Can you show some code how you're doing the reads? Have you successfully read other stuff from Cassandra (i.e. do you have a lot of experience with this path and this particular table is causing issues or are you trying to figure out the right way to do a read). What version of Spark and Cassandra-connector are you using? Also, what do you get for select count(*) from foo -- is that just as bad? On Wed, Jun 17, 2015 at 4:37 AM, Serega Sheypak serega.shey...@gmail.com wrote: Hi, can somebody suggest me the way to reduce quantity of task? 2015-06-15 18:26 GMT+02:00 Serega Sheypak serega.shey...@gmail.com: Hi, I'm running spark sql against Cassandra table. I have 3 C* nodes, Each of them has spark worker. The problem is that spark runs 869 task to read 3 lines: select bar from foo. I've tried these properties: #try to avoid 769 tasks per dummy select foo from bar qeury spark.cassandra.input.split.size_in_mb=32mb spark.cassandra.input.fetch.size_in_rows=1000 spark.cassandra.input.split.size=1 but it doesn't help. Here are mean metrics for the job : input1= 8388608.0 TB input2 = -320 B input3 = -400 B I'm confused with input, there are only 3 rows in C* table. Definitely, I don't have 8388608.0 TB of data :)
Spark SQL and Cassandra: Spark generates 769 tasks to read 3 lines from a Cassandra table
Hi, I'm running Spark SQL against a Cassandra table. I have 3 C* nodes, each of them has a Spark worker. The problem is that Spark runs 869 tasks to read 3 lines: select bar from foo.

I've tried these properties:

# try to avoid 769 tasks per dummy "select foo from bar" query
spark.cassandra.input.split.size_in_mb=32mb
spark.cassandra.input.fetch.size_in_rows=1000
spark.cassandra.input.split.size=1

but it doesn't help. Here are the mean metrics for the job:

input1 = 8388608.0 TB
input2 = -320 B
input3 = -400 B

I'm confused by the input: there are only 3 rows in the C* table. Definitely, I don't have 8388608.0 TB of data :)
Re: Driver memory leak?
The memory leak could be related to this https://issues.apache.org/jira/browse/SPARK-5967 defect that was resolved in Spark 1.2.2 and 1.3.0. @Sean Will it be backported to CDH? I did't find that bug in CDH 5.4 release notes. 2015-04-29 14:51 GMT+02:00 Conor Fennell conor.fenn...@altocloud.com: The memory leak could be related to this https://issues.apache.org/jira/browse/SPARK-5967 defect that was resolved in Spark 1.2.2 and 1.3.0. It also was a HashMap causing the issue. -Conor On Wed, Apr 29, 2015 at 12:01 PM, Sean Owen so...@cloudera.com wrote: Please use user@, not dev@ This message does not appear to be from your driver. It also doesn't say you ran out of memory. It says you didn't tell YARN to let it use the memory you want. Look at the memory overhead param and please search first for related discussions. On Apr 29, 2015 11:43 AM, wyphao.2007 wyphao.2...@163.com wrote: Hi, Dear developer, I am using Spark Streaming to read data from kafka, the program already run about 120 hours, but today the program failed because of driver's OOM as follow: Container [pid=49133,containerID=container_1429773909253_0050_02_01] is running beyond physical memory limits. Current usage: 2.5 GB of 2.5 GB physical memory used; 3.2 GB of 50 GB virtual memory used. Killing container. I set --driver-memory to 2g, In my mind, driver is responsibility for job scheduler and job monitor(Please correct me If I'm wrong), Why it using so much memory? So I using jmap to monitor other program(already run about 48 hours): sudo /home/q/java7/jdk1.7.0_45/bin/jmap -histo:live 31256, the result as follow: the java.util.HashMap$Entry and java.lang.Long object using about 600Mb memory! and I also using jmap to monitor other program(already run about 1 hours ), the result as follow: the java.util.HashMap$Entry and java.lang.Long object doesn't using so many memory, But I found, as time goes by, the java.util.HashMap$Entry and java.lang.Long object will occupied more and more memory, It is driver's memory leak question? or other reason? Thanks Best Regards
Re: history-server doesn't read logs which are on FS
Thanks, it helped. We can't use Spark 1.3 because Cassandra DSE doesn't support it. 2015-04-17 21:48 GMT+02:00 Imran Rashid iras...@cloudera.com: are you calling sc.stop() at the end of your applications? The history server only displays completed applications, but if you don't call sc.stop(), it doesn't know that those applications have been stopped. Note that in spark 1.3, the history server can also display running applications (including completed applications, but that it thinks are still running), which improves things a little bit. On Fri, Apr 17, 2015 at 10:13 AM, Serega Sheypak serega.shey...@gmail.com wrote: Hi, started history-server Here is UI output: - *Event log directory:* file:/var/log/spark/applicationHistory/ No completed applications found! Did you specify the correct logging directory? Please verify your setting of spark.history.fs.logDirectory and whether you have the permissions to access it. It is also possible that your application did not run to completion or did not stop the SparkContext. Spark 1.2.0 I goto node where server runs and: ls -la /var/log/spark/applicationHistory/ total 44 drwxrwxrwx 11 root root4096 Apr 17 14:50 . drwxrwxrwx 3 cassandra root4096 Apr 16 15:31 .. drwxrwxrwx 2 vagrant vagrant 4096 Apr 17 10:06 app-20150417100630- drwxrwxrwx 2 vagrant vagrant 4096 Apr 17 11:01 app-20150417110140-0001 drwxrwxrwx 2 vagrant vagrant 4096 Apr 17 11:12 app-20150417111216-0002 drwxrwxrwx 2 vagrant vagrant 4096 Apr 17 11:14 app-20150417111441-0003 drwxrwx--- 2 vagrant vagrant 4096 Apr 17 11:20 *app-20150417112028-0004* drwxrwx--- 2 vagrant vagrant 4096 Apr 17 14:17 *app-20150417141733-0005* drwxrwx--- 2 vagrant vagrant 4096 Apr 17 14:32 *app-20150417143237-0006* drwxrwx--- 2 vagrant vagrant 4096 Apr 17 14:49 *app-20150417144902-0007* drwxrwx--- 2 vagrant vagrant 4096 Apr 17 14:50 *app-20150417145025-0008* So there are logs, but history-server doesn't want to display them. I've checked workers, they are pointed to that dir also, I run app, I see new log. Here is history-server log output: vagrant@dsenode01:/usr/lib/spark/logs$ cat spark-root-org.apache.spark.deploy.history.HistoryServer-1-dsenode01.out Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: java -cp ::/usr/lib/spark/sbin/../conf:/usr/lib/spark/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/usr/lib/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/lib/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/lib/spark/lib/datanucleus-core-3.2.10.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Dspark.history.fs.logDirectory=/var/log/spark/applicationHistory -Dspark.eventLog.enabled=true -Xms512m -Xmx512m org.apache.spark.deploy.history.HistoryServer Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/04/17 09:55:21 INFO HistoryServer: Registered signal handlers for [TERM, HUP, INT] 15/04/17 09:55:21 INFO SecurityManager: Changing view acls to: root 15/04/17 09:55:21 INFO SecurityManager: Changing modify acls to: root 15/04/17 09:55:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root) 15/04/17 09:55:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/04/17 09:55:24 INFO Utils: Successfully started service on port 18080. 15/04/17 09:55:24 INFO HistoryServer: Started HistoryServer at http://dsenode01:18080 What could be wrong with it?
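The fix from Imran's reply, as a small sketch of the application skeleton (assuming a driver program built around a plain SparkContext; the app name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("my-app"))
try {
  // job body
} finally {
  // Without this, the history server keeps treating the application as
  // still running / incomplete and will not list it.
  sc.stop()
}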
Spark 1.2, trying to run spark-history as a service, spark-defaults.conf is ignored
Here is a related problem: http://apache-spark-user-list.1001560.n3.nabble.com/Launching-history-server-problem-td12574.html but no answer.

What I'm trying to do: wrap spark-history with an /etc/init.d script.

Problems I have: I can't make it read spark-defaults.conf. I've put this file here:
/etc/spark/conf
/usr/lib/spark/conf
where /usr/lib/spark is the location for Spark. No luck. spark-history tries to use the default value for the application log location; it doesn't read the specified value from spark-defaults.conf.