Submit & Kill Spark Application programmatically from another application
Hi, I am wondering if it is possible to submit, monitor & kill Spark applications from another service. I have written a service that does this:

1. Parse user commands.
2. Translate them into arguments for an already prepared Spark-SQL application.
3. Submit the application along with its arguments to the Spark cluster, using spark-submit from a ProcessBuilder.
4. Run the generated application's driver in cluster mode.

The above 4 steps are finished, but I have difficulties with these two:

1. Query the application's status, for example, its percentage of completion.
2. Kill queries accordingly.

What I found in the Spark standalone documentation suggests killing an application using:

./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>

and says I should find the driver ID through the standalone Master web UI at http://<master url>:8080.

Is there any programmatic way to get the driver ID of the application submitted by my ProcessBuilder, and to query the status of the query? Any suggestions?

— Best Regards!
Yijie Shen
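[Editor's note] One way to close the loop is sketched below: capture spark-submit's output, scrape the assigned driver ID from it, and reuse the documented Client kill command. This is a minimal sketch, not a supported API: it assumes standalone cluster mode, where spark-submit prints an ID of the form driver-<timestamp>-<seq> (the exact log format can vary between Spark versions), and the object and method names are illustrative.

import scala.collection.mutable.ArrayBuffer
import scala.sys.process._

object SparkDriverControl {
  // Assumed shape of standalone driver IDs, e.g. driver-20150415123456-0001;
  // verify against the output of your spark-submit version before relying on it.
  private val DriverId = """(driver-\d{14}-\d{4})""".r

  // Submit in cluster mode and scrape the driver ID from spark-submit's
  // combined stdout/stderr. Returns None if no ID was printed.
  def submitAndGetDriverId(sparkHome: String, submitArgs: Seq[String]): Option[String] = {
    val out = ArrayBuffer[String]()
    val cmd = Seq(s"$sparkHome/bin/spark-submit", "--deploy-mode", "cluster") ++ submitArgs
    cmd ! ProcessLogger(line => out += line, line => out += line) // blocks until spark-submit exits
    out.flatMap(DriverId.findFirstIn).headOption
  }

  // Kill the driver exactly as the standalone docs describe.
  def kill(sparkHome: String, masterUrl: String, driverId: String): Int =
    Seq(s"$sparkHome/bin/spark-class",
      "org.apache.spark.deploy.Client", "kill", masterUrl, driverId).!
}

For status, the standalone Master web UI also serves a machine-readable summary (commonly at http://<master url>:8080/json) listing running and completed drivers and applications, which a service may find easier to poll than scraping HTML. Fine-grained progress such as percentage completion is easier to obtain from inside the application, e.g. via a SparkListener, than from the outside.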
Re: Spark SQL saveAsParquet failed after a few waves
I have 7 workers for Spark and set SPARK_WORKER_CORES=12, so up to 84 tasks in one job can run simultaneously; I call a set of tasks in a job that start almost simultaneously a "wave" (e.g. a job of 252 tasks runs as 3 waves of 84). While inserting, there is only one job running on Spark; I am not inserting from multiple programs concurrently.

— Best Regards!
Yijie Shen

On April 2, 2015 at 2:05:31 AM, Michael Armbrust (mich...@databricks.com) wrote:

> When few waves (1 or 2) are used in a job, LoadApp could finish after a few failures and retries. But when more waves (3) are involved in a job, the job would terminate abnormally.

Can you clarify what you mean by "waves"? Are you inserting from multiple programs concurrently?
Spark SQL saveAsParquet failed after a few waves
Hi, I am using the spark-1.3 prebuilt release with hadoop2.4 support, on Hadoop 2.4.0. I wrote a Spark application (LoadApp) that generates data in each task and loads the data into HDFS as Parquet files (using "saveAsParquet()" in Spark SQL).

When few waves (1 or 2) are used in a job, LoadApp can finish after a few failures and retries. But when more waves (3) are involved in a job, the job terminates abnormally. All the failures I see are:

"java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN"

and the stack trace is:

java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN
    at parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:137)
    at parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:129)
    at parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:173)
    at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:152)
    at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
    at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
    at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:634)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I have no idea what is happening, since jobs may fail or succeed for no apparent reason.

Thanks.
Yijie Shen
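[Editor's note] For context, here is a minimal sketch of the write path being described, using the Spark 1.3 DataFrame API (where the method is actually named saveAsParquetFile). LoadApp's real schema and data generator are not shown in the thread, so the Record class, row counts, and output path below are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object LoadApp {
  case class Record(id: Long, payload: String) // illustrative schema

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LoadApp"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Generate data in each task; with 84 concurrently runnable task slots,
    // 252 partitions means the write job runs as 3 "waves".
    val df = sc.parallelize(1L to 1000000L, numSlices = 252)
      .map(i => Record(i, s"row-$i"))
      .toDF()

    // Each task writes one Parquet part-file; the IOException in this thread
    // surfaces inside this call, in parquet.hadoop's record writer.
    df.saveAsParquetFile("hdfs:///tmp/loadapp-output")
  }
}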
Re: Read parquet folders recursively
org.apache.spark.deploy.SparkHadoopUtil has a method:

/**
 * Get [[FileStatus]] objects for all leaf children (files) under the given base path. If the
 * given path points to a file, return a single-element collection containing [[FileStatus]] of
 * that file.
 */
def listLeafStatuses(fs: FileSystem, basePath: Path): Seq[FileStatus] = {
  def recurse(path: Path) = {
    val (directories, leaves) = fs.listStatus(path).partition(_.isDir)
    leaves ++ directories.flatMap(f => listLeafStatuses(fs, f.getPath))
  }

  val baseStatus = fs.getFileStatus(basePath)
  if (baseStatus.isDir) recurse(basePath) else Array(baseStatus)
}

— Best Regards!
Yijie Shen

On March 12, 2015 at 2:35:49 PM, Akhil Das (ak...@sigmoidanalytics.com) wrote:

Hi

We have a custom build to read directories recursively. Currently we use it with fileStream like:

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/datadumps/",
  (t: Path) => true, true, true)

Making the 4th argument true to read recursively.

You could give it a try: https://s3.amazonaws.com/sigmoidanalytics-builds/spark-1.2.0-bin-spark-1.2.0-hadoop2.4.0.tgz

Thanks
Best Regards

On Wed, Mar 11, 2015 at 9:45 PM, Masf wrote:

Hi all

Is it possible to read folders recursively to read Parquet files?

Thanks.
--
Saludos.
Miguel Ángel
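[Editor's note] For plain batch (non-streaming) reads, one way to put this helper to work is sketched below, e.g. from spark-shell where sc and sqlContext already exist: collect every leaf file under a base directory and feed the Parquet ones to sqlContext.parquetFile, which accepts multiple paths in Spark 1.3. Note that SparkHadoopUtil is an internal developer API and may change between versions; the base path and the ".parquet" suffix filter are assumptions about your layout.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.deploy.SparkHadoopUtil

// Walk the directory tree once on the driver and keep only Parquet files.
val fs = FileSystem.get(sc.hadoopConfiguration)
val leaves = SparkHadoopUtil.get.listLeafStatuses(fs, new Path("/data/parquet"))
val parquetPaths = leaves.map(_.getPath.toString).filter(_.endsWith(".parquet"))

// parquetFile accepts multiple paths, so all nested files load as one DataFrame.
val df = sqlContext.parquetFile(parquetPaths: _*)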
A way to share an RDD directly using Tachyon?
Hi, I would like to share an RDD across several Spark applications, i.e., create one in application A, publish its ID somewhere, and get the RDD back directly by that ID in application B.

I know I can use Tachyon just as a filesystem, as in s.saveAsTextFile("tachyon://localhost:19998/Y"). But getting an RDD directly from Tachyon, instead of from a file, could avoid parsing the same file repeatedly in different applications, I think.

What am I supposed to do in order to share RDDs and get better performance?

— Best Regards!
Yijie Shen
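[Editor's note] There is no supported way to hand a live RDD object between applications; what can be shared through Tachyon is the data. A common workaround, sketched below, is to save the already-parsed records once (e.g. as an object file) so other applications deserialize them instead of re-parsing the raw input. The Tachyon URL, paths, and the Event/parseLine names are all hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

case class Event(ts: Long, msg: String) // hypothetical record type

// Hypothetical parser for the raw text input.
def parseLine(line: String): Event = {
  val Array(ts, msg) = line.split("\t", 2)
  Event(ts.toLong, msg)
}

// ---- Application A: parse once, persist the parsed records to Tachyon ----
val scA = new SparkContext(new SparkConf().setAppName("producer"))
scA.textFile("hdfs:///raw/events.log")
   .map(parseLine)
   .saveAsObjectFile("tachyon://localhost:19998/shared/parsed-events")

// ---- Application B (a separate driver): reload without re-parsing ----
// The serialized records come out of Tachyon's memory rather than being
// parsed from text again.
val scB = new SparkContext(new SparkConf().setAppName("consumer"))
val events = scB.objectFile[Event]("tachyon://localhost:19998/shared/parsed-events")

Note this shares serialized records, not the RDD lineage itself; as far as I know, OFF_HEAP persistence (which stores blocks in Tachyon in Spark 1.x) is scoped to a single application and does not help across applications.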