You need to factor your program so that it’s not just a main(). This is not a
Spark-specific issue; it’s about how you’d unit-test any program. In this case,
your main() creates its own SparkContext, so a test can’t pass one in from
outside, and the code is hard-wired to read its input from a file and write its
output to a file. It would be better to move the code that transforms the data
into a new function:
    def processData(lines: RDD[String]): RDD[String] = {
      // build and return your "res" variable
    }
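As a rough sketch of what that refactoring might look like, here is your pipeline moved into processData, with main() reduced to the file I/O. This assumes parse comes from json4s's Jackson backend (your message doesn't say which backend you use) and that the output format should stay "ID, max timestamp"; adjust both to match your real code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.json4s._
import org.json4s.jackson.JsonMethods.parse // assumption: Jackson backend
// On older Spark 1.x releases you may also need:
//   import org.apache.spark.SparkContext._   (for sortByKey on pair RDDs)

object GetInfo {

  // Pure transformation: no SparkContext creation and no file I/O,
  // so a test can feed it any RDD[String] it likes.
  def processData(lines: RDD[String]): RDD[String] = {
    lines
      .map { line =>
        // Created inside the closure, as in the original code.
        implicit val formats: Formats = DefaultFormats
        val json = parse(line)
        val aid = (json \ "d" \ "TypeID").extract[Int]
        val ts = (json \ "d" \ "TimeStamp").extract[Long]
        val gid = (json \ "d" \ "ID").extract[String]
        (aid, ts, gid)
      }
      .groupBy(_._3)               // group records by ID
      .sortByKey(ascending = true) // order the groups by ID
      .map { case (gid, tuples) =>
        // Keep the largest timestamp seen for each ID.
        "%s, %d".format(gid, tuples.map(_._2).max)
      }
  }

  // main() now only wires up the context, the input and the output.
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GetInfo"))
    try processData(sc.textFile(args(0))).saveAsTextFile(args(1))
    finally sc.stop()
  }
}
```

The point of the split is that nothing in processData knows where the lines came from, so the same function serves both the real job and the tests.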
Then you can unit-test this directly on data you create in your program:
    val myLines = sc.parallelize(Seq("line 1", "line 2"))
    val result = GetInfo.processData(myLines).collect()
    assert(result.toSet === Set("res 1", "res 2"))
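To show where the sc in those lines could come from, here is a minimal ScalaTest sketch that starts a local SparkContext once per suite and stops it afterwards. It assumes ScalaTest 3.1 or later (on older versions the base class is org.scalatest.FunSuite), and GetInfoSketch is a hypothetical stand-in included only so the example compiles on its own; you would call your real GetInfo.processData instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical stand-in for the code under test; replace it with
// your own object that exposes processData.
object GetInfoSketch {
  def processData(lines: RDD[String]): RDD[String] =
    lines.map(_.replace("line", "res"))
}

class GetInfoSuite extends AnyFunSuite with BeforeAndAfterAll {

  // One local context shared by every test in this suite.
  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    val conf = new SparkConf().setMaster("local[2]").setAppName("GetInfoSuite")
    sc = new SparkContext(conf)
  }

  override def afterAll(): Unit = {
    // Always stop the context so later suites can create their own.
    try { if (sc != null) sc.stop() }
    finally super.afterAll()
  }

  test("processData transforms in-memory lines") {
    val myLines = sc.parallelize(Seq("line 1", "line 2"))
    val result = GetInfoSketch.processData(myLines).collect()
    assert(result.toSet === Set("res 1", "res 2"))
  }
}
```

Comparing as a Set avoids depending on the order in which partitions come back; if your function sorts its output, you can compare the collected sequence directly instead.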
Matei
On Jun 13, 2014, at 2:42 PM, SK <[email protected]> wrote:
> Hi,
>
> I have looked through some of the test examples and also the brief
> documentation on unit testing at
> http://spark.apache.org/docs/latest/programming-guide.html#unit-testing, but
> still don't have a good understanding of writing unit tests using the Spark
> framework. Previously, I have written unit tests using the specs2 framework and
> have got them to work in Scalding. I tried to use the specs2 framework with
> Spark, but could not find any simple examples I could follow. I am open to
> specs2 or FunSuite, whichever works best with Spark. I would like some
> additional guidance, or some simple sample code using specs2 or FunSuite. My
> code is provided below.
>
>
> I have the following code in src/main/scala/GetInfo.scala. It reads a JSON
> file and extracts some data. It takes the input file (args(0)) and output
> file (args(1)) as arguments.
>
> object GetInfo {
>
>   def main(args: Array[String]) {
>     val inp_file = args(0)
>     val conf = new SparkConf().setAppName("GetInfo")
>     val sc = new SparkContext(conf)
>     val res = sc.textFile(inp_file)
>       .map(line => { parse(line) })
>       .map(json => {
>         implicit lazy val formats = org.json4s.DefaultFormats
>         val aid = (json \ "d" \ "TypeID").extract[Int]
>         val ts = (json \ "d" \ "TimeStamp").extract[Long]
>         val gid = (json \ "d" \ "ID").extract[String]
>         (aid, ts, gid)
>       })
>       .groupBy(tup => tup._3)
>       .sortByKey(true)
>       .map(g => (g._1, g._2.map(_._2).max))
>     res.map(tuple => "%s, %d".format(tuple._1, tuple._2)).saveAsTextFile(args(1))
>   }
> }
>
>
> I would like to test the above code. My unit test is in src/test/scala. The
> code I have so far for the unit test appears below:
>
> import org.apache.spark._
> import org.specs2.mutable._
>
> class GetInfoTest extends Specification with java.io.Serializable {
>
>   val data = List(
>     """{"d": {"TypeID": 10, "TimeStamp": 1234, "ID": "ID1"}}""",
>     """{"d": {"TypeID": 11, "TimeStamp": 5678, "ID": "ID1"}}""",
>     """{"d": {"TypeID": 10, "TimeStamp": 1357, "ID": "ID2"}}""",
>     """{"d": {"TypeID": 11, "TimeStamp": 2468, "ID": "ID2"}}"""
>   )
>
>   val expected_out = List(
>     ("ID1", 5678),
>     ("ID2", 2468)
>   )
>
>   "A GetInfo job" should {
>     //***** How do I pass "data" defined above as the input, and the output
>     // which GetInfo expects, as arguments? ******
>     val sc = new SparkContext("local", "GetInfo")
>
>     //*** how do I get the output ***
>
>     // assuming out_buffer has the output, I want to match it to the
>     // expected output
>     "match expected output" in {
>       (out_buffer == expected_out) must beTrue
>     }
>   }
>
> }
>
> I would like some help with the tasks marked with "****" in the unit test
> code above. If specs2 is not the right way to go, I am also open to
> FunSuite. I would like to know how to pass the input while calling my
> program from the unit test and get the output.
>
> Thanks for your help.
>