I have started a project using Spark 1.5.1 consisting of several jobs that I
launch (manually, for now) using shell scripts against a small Spark
standalone cluster.
Those jobs generally read a Cassandra table (as a JavaRDD<CassandraRow> or
as plain DataFrames), compute results from that data, and write those
results to another Cassandra table.
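To make the shape of these jobs concrete, here is a minimal sketch of one such read-compute-write job using the DataStax spark-cassandra-connector Java API. The class name, keyspace, table names, and columns are all hypothetical, and it needs a live Spark/Cassandra cluster to actually run:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import com.datastax.spark.connector.japi.CassandraRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import java.io.Serializable;

// Hypothetical job: "ks", "sales", "reports" and the Report bean are
// illustrative names, not from the original post.
public final class SalesReportJob {

    // Simple bean mapped to the output table's columns.
    public static class Report implements Serializable {
        private String region;
        private Double amount;
        public Report() {}
        public Report(String region, Double amount) {
            this.region = region;
            this.amount = amount;
        }
        public String getRegion() { return region; }
        public void setRegion(String region) { this.region = region; }
        public Double getAmount() { return amount; }
        public void setAmount(Double amount) { this.amount = amount; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SalesReportJob");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the source table as an RDD of CassandraRow.
        JavaRDD<CassandraRow> rows = javaFunctions(sc)
                .cassandraTable("ks", "sales");

        // Transform (Java 7: anonymous class instead of a lambda).
        JavaRDD<Report> reports = rows.map(new Function<CassandraRow, Report>() {
            @Override
            public Report call(CassandraRow row) {
                return new Report(row.getString("region"),
                                  row.getDouble("amount"));
            }
        });

        // Write the results to another Cassandra table.
        javaFunctions(reports)
                .writerBuilder("ks", "reports", mapToRow(Report.class))
                .saveToCassandra();

        sc.stop();
    }
}
```

Each job in the project follows roughly this read-transform-write skeleton, differing only in the tables touched and the transformation in the middle.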
The project builds (using Apache Maven) a single shaded uber jar. This
jar contains many main methods, each of which is launched against the
cluster with a specific shell script (basically a spark-submit wrapper).
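One of those wrapper scripts might look like the sketch below. The master URL, jar path, and main class are placeholders for illustration, not the actual values from the project:

```shell
#!/usr/bin/env bash
# Hypothetical spark-submit wrapper: one script per main class in the uber jar.
# The master URL, class name, and jar location below are assumptions.
exec spark-submit \
  --master spark://master-host:7077 \
  --class com.example.jobs.SalesReportJob \
  /opt/jobs/my-jobs-uber.jar "$@"
```

Adding a new job then means adding a new main class to the jar and a new wrapper script that names it via --class.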
The number of jobs I'm writing is constantly increasing, and the code base
is growing in size and becoming a little disorganized. I'm
facing some difficulties in logically organizing the code base, when
all I write are operations (transformations and actions) on RDDs and
DataFrames.
So my question is: how do you generally organize the code base for large
projects? Can you give examples, code snippets, architecture templates,
etc., of the general workflow you use to create a new job?
Any help is really appreciated.
Thanks.
P.S.: I code in Java 7; we're not switching to Java 8 anytime soon, and
Scala is not an option at this time.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org