Re: Mock spark reads and writes

2020-07-15 Thread Jeff Evans
Why do you need to mock the read/write at all?  Why not have a test CSV
file, invoke your function on it (which will perform the real Spark DataFrame
read of the CSV), let it write the result, and assert on the output?
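
For example, here is a minimal sketch of that approach with unittest and a
local SparkSession. The main(read_file, out_path) stand-in, the out_path
argument and the temp paths are illustrative assumptions, not the poster's
actual code:

import csv
import shutil
import tempfile
import unittest

from pyspark.sql import SparkSession


def main(read_file, out_path):
    # Stand-in with the same shape as the posted snippet
    # (an explicit out_path is added here so the test can find the result).
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv(read_file)
    # ... some other code ...
    df.write.csv(out_path)


class MainJobTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # One local SparkSession shared by every test in the class.
        cls.spark = SparkSession.builder.master("local[1]").appName("csv-test").getOrCreate()
        cls.tmp_dir = tempfile.mkdtemp()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()
        shutil.rmtree(cls.tmp_dir)

    def test_main_writes_expected_rows(self):
        # Write a small input CSV fixture to local disk.
        in_path = f"{self.tmp_dir}/input.csv"
        with open(in_path, "w", newline="") as f:
            csv.writer(f).writerows([["1", "alice"], ["2", "bob"]])

        # Run the real job: real Spark read, real Spark write, no mocks.
        out_path = f"{self.tmp_dir}/output"
        main(in_path, out_path)

        # Read the output back with Spark and assert on its contents.
        rows = self.spark.read.csv(out_path).collect()
        self.assertEqual(sorted(r[0] for r in rows), ["1", "2"])


if __name__ == "__main__":
    unittest.main()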


Re: Mock spark reads and writes

2020-07-15 Thread ed
Hi,

For testing things like this you have a couple of options. You could isolate 
all your business logic from your read/write/Spark code which, in my 
experience, makes the code harder to write and manage.
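
As a rough sketch of that first option (the function and column names here are
illustrative, not from the original post), the transformation is kept pure,
DataFrame in and DataFrame out, so it can be tested without touching disk,
though it still needs a local SparkSession:

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def transform(df: DataFrame) -> DataFrame:
    # All the business logic lives here, free of any read/write calls.
    return df.withColumn("doubled", F.col("value") * 2)


def main(read_file, out_path):
    # The thin I/O wrapper is all that is left untested (or covered elsewhere).
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv(read_file, header=True, inferSchema=True)
    transform(df).write.csv(out_path)


def test_transform_doubles_value():
    # pytest-style test: exercises only transform(), on a tiny in-memory DataFrame.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["value"])
    assert {r["doubled"] for r in transform(df).collect()} == {2, 4}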

The other option is to accept that the tests will be slower than you would 
normally expect unit tests to be, and actually allow the reads and writes to 
happen against a local instance of Spark.

Your tests then become:

- Write the data you need for the test
- spark-submit the test (or something similar)
- Check the results of the test

If you have any truly isolated business logic then you can unit test that as 
you normally would, but most Spark jobs are going to call Spark functions, 
which you either mock out (and if a Spark function is called in a mocked-out 
forest, does anyone hear it fail?) or allow to run and take the performance 
hit.
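
For completeness, mocking the read/write with unittest.mock looks roughly like
this. Here my_job is a hypothetical module holding the poster's main(read_file)
and its module-level spark, and the test assumes main writes out the same
DataFrame it read, which any real transformation breaks quickly:

from unittest import mock

import my_job  # hypothetical module with main(read_file) and a module-level spark


def test_main_calls_read_and_write():
    with mock.patch.object(my_job, "spark") as fake_spark:
        fake_df = fake_spark.read.csv.return_value

        my_job.main("input.csv")

        # This only proves the calls happened; it says nothing about the data.
        fake_spark.read.csv.assert_called_once_with("input.csv")
        fake_df.write.csv.assert_called_once()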

Personally, I have used both approaches and I would favour the second one, 
allowing reads and writes to happen on a local Spark instance, as it tells 
you so much more than just whether functions were called in the right order 
and with the right parameters.



Ed



Mock spark reads and writes

2020-07-14 Thread Dark Crusader
Sorry I wasn't very clear in my last email.

I have a function like this:

def main(read_file):
    df = spark.read.csv(read_file)
    # ... some other code ...
    df.write.csv(path)

I need to write a unit test for this function.
Would Python's unittest.mock help me here?

When I googled this, I mostly saw advice that we shouldn't mock these reads and
writes, but that doesn't solve the problem of how I unit test helper
functions or a main method that has to read and write files.

An example of the proper way to do this in Python would be really helpful.

Thanks a lot.