I have started a project using Spark 1.5.1 consisting of several jobs that I
launch (manually, for now) using shell scripts against a small Spark
standalone cluster.
Those jobs generally read a Cassandra table (as a JavaRDD<CassandraRow> or
as plain DataFrames), compute results from that data, and write those
results to another Cassandra table.
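To make the shape of these jobs concrete, here is a minimal sketch of one such read-compute-write job using the DataStax spark-cassandra-connector Java API. The class name, keyspace, table names, and columns are all hypothetical, and it needs a live Spark/Cassandra cluster to actually run:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import com.datastax.spark.connector.japi.CassandraRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import java.io.Serializable;

// Hypothetical job: "ks", "sales", "reports" and the Report bean are
// illustrative names, not from the original post.
public final class SalesReportJob {

    // Simple bean mapped to the output table's columns.
    public static class Report implements Serializable {
        private String region;
        private Double amount;
        public Report() {}
        public Report(String region, Double amount) {
            this.region = region;
            this.amount = amount;
        }
        public String getRegion() { return region; }
        public void setRegion(String region) { this.region = region; }
        public Double getAmount() { return amount; }
        public void setAmount(Double amount) { this.amount = amount; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SalesReportJob");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the source table as an RDD of CassandraRow.
        JavaRDD<CassandraRow> rows = javaFunctions(sc)
                .cassandraTable("ks", "sales");

        // Transform (Java 7: anonymous class instead of a lambda).
        JavaRDD<Report> reports = rows.map(new Function<CassandraRow, Report>() {
            @Override
            public Report call(CassandraRow row) {
                return new Report(row.getString("region"),
                                  row.getDouble("amount"));
            }
        });

        // Write the results to another Cassandra table.
        javaFunctions(reports)
                .writerBuilder("ks", "reports", mapToRow(Report.class))
                .saveToCassandra();

        sc.stop();
    }
}
```

Each job in the project follows roughly this read-transform-write skeleton, differing only in the tables touched and the transformation in the middle.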
The project builds (using Apache Maven) a single shaded uber jar. This
jar contains many main methods, each of which is launched against the
cluster with a specific shell script (basically a spark-submit wrapper).
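One of those wrapper scripts might look like the sketch below. The master URL, jar path, and main class are placeholders for illustration, not the actual values from the project:

```shell
#!/usr/bin/env bash
# Hypothetical spark-submit wrapper: one script per main class in the uber jar.
# The master URL, class name, and jar location below are assumptions.
exec spark-submit \
  --master spark://master-host:7077 \
  --class com.example.jobs.SalesReportJob \
  /opt/jobs/my-jobs-uber.jar "$@"
```

Adding a new job then means adding a new main class to the jar and a new wrapper script that names it via --class.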
The number of jobs I'm writing is constantly increasing, and the code base
is growing in size and becoming a little disorganized. I'm
facing some difficulties in logically organizing the code base, when
all I write are operations (transformations and actions) on RDDs and
DataFrames.
So my question is: how do you generally organize the code base for large
projects? Can you give examples, code snippets, architecture templates,
etc., of the general workflow you use to create a new job?
Any help is really appreciated.
Thanks.
P.S.: I code in Java 7; we're not switching to Java 8 anytime soon, and
Scala is not an option at this time.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org