We recently open-sourced mockrdd, a library for testing PySpark code: github.com/LiveRamp/mockrdd
The mockrdd.MockRDD class offers similar behavior to pyspark.RDD, with the following extra benefits:

* Extensive sanity checks to identify invalid inputs
* More meaningful error messages for debugging issues
* Straightforward to run within pdb
* Removes Spark dependencies from development and testing environments
* No Spark overhead when running a large test suite

More details in this blog post: liveramp.com/engineering/introducing-mockrdd-for-testing-pyspark-code

Would anyone find this useful? What other features would make it more useful? Are there benefits to using PySpark in local mode for testing that we're not considering?

Thanks!
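P.S. For anyone wondering what testing an RDD job without Spark looks like in practice, here is a rough sketch of the general idea. Note that TinyMockRDD and double_then_keep_large are made-up names for illustration only. This is not mockrdd's actual API or implementation (see the repo for the real MockRDD). It only assumes that the job under test sticks to methods pyspark.RDD also provides (map, filter, collect).

```python
class TinyMockRDD:
    """Hypothetical in-memory stand-in for an RDD (NOT mockrdd's real code)."""

    def __init__(self, items):
        # Evaluated eagerly into a list; real Spark RDDs are lazy.
        self._items = list(items)

    def map(self, f):
        # Sanity check up front so a bad argument fails at the call site.
        if not callable(f):
            raise TypeError("map() expects a callable, got %r" % (f,))
        return TinyMockRDD(f(x) for x in self._items)

    def filter(self, f):
        if not callable(f):
            raise TypeError("filter() expects a callable, got %r" % (f,))
        return TinyMockRDD(x for x in self._items if f(x))

    def collect(self):
        return list(self._items)


def double_then_keep_large(rdd):
    """Job under test: only uses map/filter, so any RDD-like object works."""
    return rdd.map(lambda x: x * 2).filter(lambda x: x > 3)


# Plain-Python test: no SparkContext, and easy to step through in pdb.
result = double_then_keep_large(TinyMockRDD([1, 2, 3])).collect()
print(result)  # [4, 6]
```

Because the job function only depends on the RDD interface, the same function can be handed a real pyspark.RDD in production and an in-memory stand-in in unit tests.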