I have started a project using Spark 1.5.1 consisting of several jobs that I currently launch manually via shell scripts against a small Spark standalone cluster. These jobs generally read a Cassandra table (either as a JavaRDD<CassandraRow> or as a plain DataFrame), compute results from that data, and write those results to another Cassandra table. The project builds (using Apache Maven) a single shaded uber jar containing many main methods; each main method is launched against the cluster with its own shell script (basically a spark-submit wrapper).
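To make that concrete, here is a rough sketch of what one such job looks like, using the spark-cassandra-connector Java API. The class name, keyspace, table names, and the Result bean are just placeholders, not my real schema:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import java.io.Serializable;

public class DailyAggregationJob {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("DailyAggregationJob");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the source Cassandra table as an RDD of CassandraRow
        JavaRDD<CassandraRow> rows =
                javaFunctions(sc).cassandraTable("my_keyspace", "events");

        // Transform each row into a result bean (Java 7 anonymous class, no lambdas)
        JavaRDD<Result> results = rows.map(new Function<CassandraRow, Result>() {
            @Override
            public Result call(CassandraRow row) {
                return new Result(row.getString("id"), row.getInt("value") * 2);
            }
        });

        // Write the results back to another Cassandra table
        javaFunctions(results)
                .writerBuilder("my_keyspace", "event_results", mapToRow(Result.class))
                .saveToCassandra();

        sc.stop();
    }

    // Simple serializable bean whose properties map to the target table's columns
    public static class Result implements Serializable {
        private String id;
        private Integer doubledValue;

        public Result() {}

        public Result(String id, Integer doubledValue) {
            this.id = id;
            this.doubledValue = doubledValue;
        }

        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public Integer getDoubledValue() { return doubledValue; }
        public void setDoubledValue(Integer doubledValue) { this.doubledValue = doubledValue; }
    }
}

Every new job ends up looking roughly like this, just with different tables and different map/filter/aggregate logic, so the number of near-identical main classes keeps growing.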

The number of jobs I'm writing is constantly increasing, and the code base is growing and becoming somewhat disorganized. I'm finding it difficult to organize the code logically when everything I write boils down to operations (transformations and actions) on RDDs and DataFrames.

So my question is: how do you generally organize the code base for large Spark projects? Can you give examples, code snippets, architecture templates, etc. of the general workflow you follow when creating a new job?
Any help is really appreciated.

Thanks.

P.S.: I code in Java 7; we're not switching to Java 8 anytime soon, and Scala is not an option at this time.


