Submit & kill Spark applications programmatically from another application

2015-05-02 Thread Yijie Shen
Hi,

I am wondering if it is possible to submit, monitor & kill Spark applications 
from another service.

I have written a service that does the following:

1. Parse user commands.
2. Translate them into arguments understandable by an already prepared Spark-SQL application.
3. Submit the application along with the arguments to the Spark cluster using spark-submit from a ProcessBuilder (a sketch follows below).
4. Run the generated application's driver in cluster mode.
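
For reference, step 3 as I run it looks roughly like the sketch below (Scala, using java.lang.ProcessBuilder; the spark-submit path, jar, main class, master URL and arguments are all placeholder values, not my real ones):

    import scala.collection.JavaConverters._

    // All paths, the main class and the master URL below are illustrative placeholders.
    val submitCmd = Seq(
      "/opt/spark/bin/spark-submit",
      "--master", "spark://master-host:7077",
      "--deploy-mode", "cluster",
      "--class", "com.example.SparkSQLApp",
      "/opt/apps/spark-sql-app.jar",
      "--query", "translated-user-command")

    val submit = new ProcessBuilder(submitCmd.asJava)
      .redirectErrorStream(true)  // merge stderr into stdout so one stream captures everything
      .start()

    // Capture the client output; whatever the submission client prints is currently the
    // only place this service can look for driver-related information.
    val submitOutput = scala.io.Source.fromInputStream(submit.getInputStream).mkString
    submit.waitFor()
    println(submitOutput)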
The above 4 steps have been completed, but I have difficulties with these two:

1. Querying the application's status, for example its percentage of completion.
2. Killing queries accordingly.

What I found in the Spark standalone documentation suggests killing an application using:

./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>
and that I should find the driver ID through the standalone Master web UI at 
http://<master url>:8080.
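
For illustration, invoking that kill command from the same kind of ProcessBuilder would look roughly like this, assuming the driver ID were already known (the ID, paths and master URL below are placeholders):

    import scala.collection.JavaConverters._

    // The driver ID must be obtained separately (e.g. from the Master web UI above);
    // the ID, spark-class path and master URL here are purely illustrative.
    val driverId = "driver-20150502000000-0000"
    val killCmd = Seq(
      "/opt/spark/bin/spark-class",
      "org.apache.spark.deploy.Client", "kill",
      "spark://master-host:7077",
      driverId)

    // Reuse the same process-launching approach as for submission.
    new ProcessBuilder(killCmd.asJava).inheritIO().start().waitFor()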

Are there any programmatic methods I could use to get the driver ID of the application 
submitted by my `ProcessBuilder`, and to query the status of that query?

Any Suggestions?

— 
Best Regards!
Yijie Shen

Re: Spark SQL saveAsParquet failed after a few waves

2015-04-01 Thread Yijie Shen
I have 7 Spark workers and set SPARK_WORKER_CORES=12, so 84 tasks of one job can run 
simultaneously.
I call the set of tasks in a job that start almost simultaneously a wave.

While inserting, there is only one job running on Spark; I am not inserting from multiple 
programs concurrently.

— 
Best Regards!
Yijie Shen

On April 2, 2015 at 2:05:31 AM, Michael Armbrust (mich...@databricks.com) wrote:

When few waves (1 or 2) are used in a job, LoadApp could finish after a few 
failures and retries.
But when more waves (3) are involved in a job, the job would terminate 
abnormally.

Can you clarify what you mean by "waves"?  Are you inserting from multiple 
programs concurrently? 

Spark SQL saveAsParquet failed after a few waves

2015-03-31 Thread Yijie Shen
Hi,

I am using the Spark 1.3 prebuilt release with Hadoop 2.4 support, on Hadoop 2.4.0.

I wrote a Spark application (LoadApp) that generates data in each task and loads the 
data into HDFS as Parquet files (using "saveAsParquet()" in Spark SQL).
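
For context, a minimal sketch of the kind of job I mean (not the real LoadApp; class names and paths are placeholders, and the Spark 1.3 DataFrame method used here is saveAsParquetFile):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical record type standing in for whatever LoadApp really generates.
    case class Record(id: Long, payload: String)

    object LoadAppSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LoadAppSketch"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // 84 partitions corresponds to one "wave" on 7 workers x 12 cores.
        val data = sc.parallelize(0 until 84, 84).flatMap { part =>
          (0 until 1000).map(i => Record(part.toLong * 1000 + i, s"row-$i"))
        }

        // Spark 1.3 writes a DataFrame to parquet via saveAsParquetFile.
        data.toDF().saveAsParquetFile("hdfs:///tmp/loadapp-output")
        sc.stop()
      }
    }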

When few waves (1 or 2) are used in a job, LoadApp could finish after a few 
failures and retries.
But when more waves (3) are involved in a job, the job would terminate 
abnormally.

All the failures I ran into are:
"java.io.IOException: The file being written is in an invalid state. Probably 
caused by an error thrown previously. Current state: COLUMN"

and the stack traces are:

java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN
    at parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:137)
    at parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:129)
    at parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:173)
    at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:152)
    at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
    at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
    at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:634)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:648)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)


I have no idea what happened, since jobs may fail or succeed for no apparent reason.

Thanks.


Yijie Shen

Re: Read parquet folders recursively

2015-03-12 Thread Yijie Shen
org.apache.spark.deploy.SparkHadoopUtil has a method:

  /**
   * Get [[FileStatus]] objects for all leaf children (files) under the given base path. If the
   * given path points to a file, return a single-element collection containing [[FileStatus]] of
   * that file.
   */
  def listLeafStatuses(fs: FileSystem, basePath: Path): Seq[FileStatus] = {
    def recurse(path: Path) = {
      val (directories, leaves) = fs.listStatus(path).partition(_.isDir)
      leaves ++ directories.flatMap(f => listLeafStatuses(fs, f.getPath))
    }

    val baseStatus = fs.getFileStatus(basePath)
    if (baseStatus.isDir) recurse(basePath) else Array(baseStatus)
  }
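
A possible way to use it for this question (a sketch only, assuming a spark-shell style sc and sqlContext, and that the nested directories contain parquet part files of one schema; the base path is a placeholder):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.deploy.SparkHadoopUtil

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val leaves = SparkHadoopUtil.get.listLeafStatuses(fs, new Path("/data/parquet"))
    val parquetPaths = leaves
      .map(_.getPath.toString)
      .filterNot(p => p.endsWith("_SUCCESS") || p.contains("_metadata"))

    // Spark 1.3's SQLContext.parquetFile accepts multiple paths.
    val df = sqlContext.parquetFile(parquetPaths: _*)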

— 
Best Regards!
Yijie Shen

On March 12, 2015 at 2:35:49 PM, Akhil Das (ak...@sigmoidanalytics.com) wrote:

Hi

We have a custom build to read directories recursively. Currently we use it 
with fileStream like:

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/datadumps/",
     (t: Path) => true, true, true)

Making the 4th argument true to read recursively.


You could give it a try 
https://s3.amazonaws.com/sigmoidanalytics-builds/spark-1.2.0-bin-spark-1.2.0-hadoop2.4.0.tgz

Thanks
Best Regards

On Wed, Mar 11, 2015 at 9:45 PM, Masf  wrote:
Hi all

Is it possible to read folders recursively in order to read parquet files?


Thanks.

--


Regards,
Miguel Ángel



A way to share RDD directly using Tachyon?

2015-03-08 Thread Yijie Shen
Hi,

I would like to share an RDD across several Spark applications, 
i.e., create one in application A, publish its ID somewhere, and get the RDD back 
directly using that ID in application B.

I know I can use Tachyon simply as a filesystem, e.g. 
s.saveAsTextFile("tachyon://localhost:19998/Y").

But getting an RDD directly from Tachyon instead of from a file could avoid 
parsing the same file repeatedly in different applications, I think.
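
One workaround sketch for the re-parsing part only (not a true shared-RDD handle; the record type, path and Tachyon address below are placeholders): serialize the parsed RDD into Tachyon with saveAsObjectFile in application A and rebuild it with objectFile in application B.

    // Hypothetical record type; replace with whatever the RDD actually holds.
    case class MyRecord(id: Long, value: String)

    // Application A: write the already-parsed records into Tachyon once.
    rddA.saveAsObjectFile("tachyon://localhost:19998/shared/rddA")

    // Application B: rebuild the RDD from the serialized objects, skipping re-parsing.
    val restored = sc.objectFile[MyRecord]("tachyon://localhost:19998/shared/rddA")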

What am I supposed to do in order to share RDDs and get better performance?


— 
Best Regards!
Yijie Shen