[jira] [Reopened] (SPARK-21141) spark-update --version is hard to parse
[ https://issues.apache.org/jira/browse/SPARK-21141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

michael procopio reopened SPARK-21141:
--------------------------------------

My apologies, I meant spark-submit --version.

> spark-update --version is hard to parse
> ----------------------------------------
>
>                 Key: SPARK-21141
>                 URL: https://issues.apache.org/jira/browse/SPARK-21141
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.1.1
>         Environment: Debian 8 using Hadoop 2.7.2
>            Reporter: michael procopio
>
> We need to be able to determine the Spark version in order to reference our
> jars: one set built for Spark 1.x using Scala 2.10 and the other built for
> Spark 2.x using Scala 2.11. spark-submit --version returns a lot of
> extraneous output. It would be preferable if an option were available that
> returned only the version number.
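A workaround that avoids scraping command output at all, sketched below under the assumption that the Spark jars are already on the classpath: the version string is exposed programmatically as org.apache.spark.SPARK_VERSION (or as sc.version on a live SparkContext), so a tiny helper can print just the bare version.

    // Minimal sketch: print only the Spark version number, with no banner.
    // Assumes the Spark jars are on the classpath; no SparkContext is required.
    object PrintSparkVersion {
      def main(args: Array[String]): Unit = {
        println(org.apache.spark.SPARK_VERSION)   // e.g. "2.1.1"
      }
    }

Run with spark-submit --class PrintSparkVersion <jar>; the only line written to stdout should be the version itself.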
[jira] [Reopened] (SPARK-21140) Reduce collect high memory requirements
[ https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

michael procopio reopened SPARK-21140:
--------------------------------------

I am not sure what detail you are looking for; I provided the test code I was using. It seems to me that multiple copies of the data must be generated when collecting a partition. Having to set spark.executor.memory to 3 GB to collect a partition of 512 MB seems high to me.

> Reduce collect high memory requirements
> ----------------------------------------
>
>                 Key: SPARK-21140
>                 URL: https://issues.apache.org/jira/browse/SPARK-21140
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.1.1
>         Environment: Linux Debian 8 using Hadoop 2.7.2.
>            Reporter: michael procopio
>
> I wrote a very simple Scala application which used flatMap to create an RDD
> containing a 512 MB partition of 256-byte arrays. Experimentally, I determined
> that spark.executor.memory had to be set at 3 GB in order to collect the data.
> This seems extremely high.
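Not a fix for the underlying cost, but a sketch of one way to bound it: since the peak usage scales with how much a single task has to materialize and serialize at once, splitting the one large partition into many smaller ones before collecting keeps the per-task footprint small. sizedRdd below refers to the RDD built in the reproduction code quoted in the comment that follows.

    // Sketch only: bound per-task memory by collecting many small partitions
    // instead of one 512 MB partition. The full data set still lands on the driver.
    val results = sizedRdd.repartition(32).collect()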
[jira] [Commented] (SPARK-21140) Reduce collect high memory requirements
[ https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054043#comment-16054043 ]

michael procopio commented on SPARK-21140:
-------------------------------------------

I disagree: executor memory does depend on the size of the partition being collected, and a 6-to-1 ratio just to collect the data seems onerous to me. It seems that multiple copies must be created in the executor. I found that collecting an RDD containing a single 512 MB partition took 3 GB. Here's the code I was using:

    package com.test

    import org.apache.spark._
    import org.apache.spark.SparkContext._

    object SparkRdd2 {
      def main(args: Array[String]) {
        try {
          //
          // Process any arguments.
          //
          def parseOptions(map: Map[String, Any], listArgs: List[String]): Map[String, Any] = {
            listArgs match {
              case Nil => map
              case "-master" :: value :: tail =>
                parseOptions(map + ("master" -> value), tail)
              case "-recordSize" :: value :: tail =>
                parseOptions(map + ("recordSize" -> value.toInt), tail)
              case "-partitionSize" :: value :: tail =>
                parseOptions(map + ("partitionSize" -> value.toLong), tail)
              case "-executorMemory" :: value :: tail =>
                parseOptions(map + ("executorMemory" -> value), tail)
              case option :: tail =>
                println("unknown option " + option)
                sys.exit(1)
            }
          }

          val listArgs = args.toList
          val optionmap = parseOptions(Map[String, Any](), listArgs)
          val master = optionmap.getOrElse("master", "local").asInstanceOf[String]
          val recordSize = optionmap.getOrElse("recordSize", 128).asInstanceOf[Int]
          val partitionSize = optionmap.getOrElse("partitionSize", 1024L * 1024 * 1024).asInstanceOf[Long]
          val executorMemory = optionmap.getOrElse("executorMemory", "6g").asInstanceOf[String]

          println(f"Creating single partition of $partitionSize%d with records of length $recordSize%d")
          println(f"Setting spark.executor.memory to $executorMemory")

          //
          // Create SparkConf.
          //
          val sparkConf = new SparkConf()
          sparkConf.setAppName("MyEnvVar").setMaster(master).setExecutorEnv("myenvvar", "good")
          sparkConf.set("spark.executor.cores", "1")
          sparkConf.set("spark.executor.instances", "1")
          sparkConf.set("spark.executor.memory", executorMemory)
          sparkConf.set("spark.eventLog.enabled", "true")
          sparkConf.set("spark.eventLog.dir", "hdfs://hadoop01glnxa64:54310/user/mprocopi/spark-events")
          sparkConf.set("spark.driver.maxResultSize", "0")
          /*
          sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          sparkConf.set("spark.kryoserializer.buffer.max", "768m")
          sparkConf.set("spark.kryoserializer.buffer", "64k")
          */

          //
          // Create SparkContext.
          //
          val sc = new SparkContext(sparkConf)

          //
          // Build an iterator that keeps returning byte arrays until the
          // requested partition size has been produced.
          //
          def createdSizedPartition(recordSize: Int, partitionSize: Long): Iterator[Array[Byte]] = {
            var sizeReturned: Long = 0
            new Iterator[Array[Byte]] {
              override def hasNext: Boolean = sizeReturned < partitionSize
              override def next(): Array[Byte] = {
                sizeReturned += recordSize
                new Array[Byte](recordSize)
              }
            }
          }

          //
          // Create an RDD with a single partition of the requested size and collect it.
          //
          val rddInfo = (recordSize, partitionSize)
          val sizedRdd = sc.parallelize(Seq(rddInfo), 1)
            .flatMap(rddInfo => createdSizedPartition(rddInfo._1, rddInfo._2))
          val results = sizedRdd.collect

          var countLines: Int = 0
          var countBytes: Long = 0
          var maxRecord: Int = 0
          for (line <- results) {
            countLines = countLines + 1
            countBytes = countBytes + line.length
            if (line.length > maxRecord) {
              maxRecord = line.length
            }
          }
          println(f"Collected $countLines%d lines")
          println(f"          $countBytes%d bytes")
          println(f"Max record $maxRecord%d bytes")
        } catch {
          case e: Exception =>
            println("Error in executing application: " + e.getMessage)
            throw e
        }
      }
    }

After building, it can be invoked as:

    spark-submit --class com.test.SparkRdd2 --driver-memory 10g ./target/scala-2.11/envtest_2.11-0.0.1.jar -recordSize 256 -partitionSize 536870912

The -recordSize and -partitionSize options allow you to vary the record and partition sizes.

> Reduce collect high memory requirements
> ----------------------------------------
>
>                 Key: SPARK-21140
>                 URL: https://issues.apache.org/jira/browse/SPARK-21140
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.1.1
>         Environment: Linux Debian 8 using Hadoop 2.7.2.
>            Reporter: michael procopio
>
> I wrote a very simple Scala application which used flatMap to create an RDD
> containing a 512 MB partition of 256-byte arrays. Experimentally, I determined
> that spark.executor.memory had to be set at 3 GB in order to collect the data.
> This seems extremely high.
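If, as in the summary loop of the reproduction code above, only the line count, byte count, and maximum record length are actually needed, a sketch of an alternative is to fold those statistics on the executors with aggregate() so the raw arrays never have to be collected at all; sizedRdd refers to the RDD built in that code.

    // Sketch: compute the three summary numbers distributedly instead of
    // collecting every record to the driver.
    val (countLines, countBytes, maxRecord) =
      sizedRdd.aggregate((0L, 0L, 0))(
        (acc, rec) => (acc._1 + 1, acc._2 + rec.length, math.max(acc._3, rec.length)),
        (a, b)     => (a._1 + b._1, a._2 + b._2, math.max(a._3, b._3)))
    println(f"Collected $countLines%d lines, $countBytes%d bytes, max record $maxRecord%d bytes")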
[jira] [Created] (SPARK-21141) spark-update --version is hard to parse
michael procopio created SPARK-21141:
-----------------------------------------

             Summary: spark-update --version is hard to parse
                 Key: SPARK-21141
                 URL: https://issues.apache.org/jira/browse/SPARK-21141
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 2.1.1
         Environment: Debian 8 using Hadoop 2.7.2
            Reporter: michael procopio


We need to be able to determine the Spark version in order to reference our jars: one set built for Spark 1.x using Scala 2.10 and the other built for Spark 2.x using Scala 2.11. spark-submit --version returns a lot of extraneous output. It would be preferable if an option were available that returned only the version number.
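Until an option like this exists, one possible workaround is to scrape the version token out of the spark-submit --version banner; the sketch below is hedged on the assumption that the banner contains a "version <x.y.z>" token and may write it to either stdout or stderr, so both streams are captured.

    import scala.sys.process._

    // Sketch only: run spark-submit --version and extract the version token.
    val buf = new StringBuilder
    val logger = ProcessLogger(line => buf.append(line).append('\n'),
                               line => buf.append(line).append('\n'))
    Seq("spark-submit", "--version").!(logger)
    val version = """version\s+(\S+)""".r.findFirstMatchIn(buf.toString).map(_.group(1))
    println(version.getOrElse("unknown"))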
[jira] [Created] (SPARK-21140) Reduce collect high memory requirements
michael procopio created SPARK-21140:
-----------------------------------------

             Summary: Reduce collect high memory requirements
                 Key: SPARK-21140
                 URL: https://issues.apache.org/jira/browse/SPARK-21140
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 2.1.1
         Environment: Linux Debian 8 using Hadoop 2.7.2.
            Reporter: michael procopio


I wrote a very simple Scala application which used flatMap to create an RDD containing a 512 MB partition of 256-byte arrays. Experimentally, I determined that spark.executor.memory had to be set at 3 GB in order to collect the data. This seems extremely high.
[jira] [Created] (SPARK-19030) Dropped event errors being reported after SparkContext has been stopped
michael procopio created SPARK-19030:
-----------------------------------------

             Summary: Dropped event errors being reported after SparkContext has been stopped
                 Key: SPARK-19030
                 URL: https://issues.apache.org/jira/browse/SPARK-19030
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.2
         Environment: Debian 8 using spark-submit with MATLAB integration; the Spark code is written in Java.
            Reporter: michael procopio
            Priority: Minor


After stop has been called on SparkContext, errors are being reported:

    16/12/29 15:54:04 ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(2,WrappedArray())

The stack in the heartbeat thread at the point where the error is thrown is:

    Daemon Thread [heartbeat-receiver-event-loop-thread] (Suspended (breakpoint at line 124 in LiveListenerBus))
        LiveListenerBus.post(SparkListenerEvent) line: 124
        DAGScheduler.executorHeartbeatReceived(String, Tuple4>[], BlockManagerId) line: 228
        YarnScheduler(TaskSchedulerImpl).executorHeartbeatReceived(String, Tuple2>>[], BlockManagerId) line: 402
        HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp() line: 128
        Utils$.tryLogNonFatalError(Function0) line: 1290
        HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run() line: 127
        Executors$RunnableAdapter.call() line: 511
        ScheduledThreadPoolExecutor$ScheduledFutureTask(FutureTask).run() line: 266
        ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor$ScheduledFutureTask) line: 180
        ScheduledThreadPoolExecutor$ScheduledFutureTask.run() line: 293
        ScheduledThreadPoolExecutor(ThreadPoolExecutor).runWorker(ThreadPoolExecutor$Worker) line: 1142
        ThreadPoolExecutor$Worker.run() line: 617
        Thread.run() line: 745
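For what it's worth, a sketch of a shutdown ordering that tends to avoid these messages, under the assumption that they are caused by executor heartbeats still arriving while the context shuts down (the dropped events themselves carry only metrics updates):

    // Sketch: make sure no asynchronous work is still producing events before
    // stopping the context. Assumes sc is the running SparkContext.
    try {
      sc.cancelAllJobs()
    } finally {
      sc.stop()
    }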
[jira] [Created] (SPARK-10453) There's no way to use spark.dynamicAllocation.enabled with pyspark
Michael Procopio created SPARK-10453:
-----------------------------------------

             Summary: There's no way to use spark.dynamicAllocation.enabled with pyspark
                 Key: SPARK-10453
                 URL: https://issues.apache.org/jira/browse/SPARK-10453
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.4.0
         Environment: When using spark.dynamicAllocation.enabled, the assumption is that memory/core resources will be mediated by the YARN resource manager. Unfortunately, whatever value is used for spark.executor.memory is consumed as JVM heap space by the executor. There's no way to account for the memory requirements of the pyspark worker. Executor JVM heap space should be decoupled from spark.executor.memory.
            Reporter: Michael Procopio
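A partial mitigation that already exists on YARN, sketched here for the Spark 1.4 line (the property name below is my assumption for that release): since the Python workers live outside the executor JVM heap, their memory can be requested through the YARN overhead setting rather than through spark.executor.memory.

    import org.apache.spark.SparkConf

    // Sketch for Spark 1.4 on YARN: reserve container memory for the pyspark
    // workers separately from the executor JVM heap.
    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")                 // executor JVM heap only
      .set("spark.yarn.executor.memoryOverhead", "2048")  // MB of extra container room for Python workers

This does not decouple the two settings as the issue requests; it only lets the container be sized larger than the JVM heap.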
[jira] [Created] (SPARK-10452) Pyspark worker security issue
Michael Procopio created SPARK-10452:
-----------------------------------------

             Summary: Pyspark worker security issue
                 Key: SPARK-10452
                 URL: https://issues.apache.org/jira/browse/SPARK-10452
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.4.0
         Environment: Spark 1.4.0 running on Hadoop 2.5.2.
            Reporter: Michael Procopio
            Priority: Critical


The Python worker launched by the executor is given the credentials used to launch YARN.