Re: unit testing for spark code

2021-03-22 Thread Attila Zsolt Piros
Hi!

Let me draw your attention to Holden's *spark-testing-base* project.
The documentation is at https://github.com/holdenk/spark-testing-base/wiki.

As I usually write tests for Spark internal features, I haven't needed to
test at such a high level.
But I am interested in your experiences.

Best regards,
Attila
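
For anyone new to it, here is a minimal sketch of what a spark-testing-base suite typically looks like (assuming ScalaTest and the spark-testing-base test dependency on the classpath; the exact imports and helper names vary by version, so treat this as illustrative rather than authoritative):

```scala
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.scalatest.funsuite.AnyFunSuite

// DataFrameSuiteBase manages a local Spark session/SQLContext for the suite
// and provides DataFrame comparison helpers.
class WordCountSuite extends AnyFunSuite with DataFrameSuiteBase {

  test("counts distinct words") {
    val sqlCtx = sqlContext
    import sqlCtx.implicits._

    val input  = Seq("spark", "test", "spark").toDF("word")
    val counts = input.groupBy("word").count()

    assert(counts.count() === 2)
  }
}
```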

On Mon, Mar 22, 2021 at 4:34 PM Nicholas Gustafson 
wrote:

> I've found pytest works well if you're using PySpark. Though if you have a
> lot of tests, running them all can be pretty slow.
>
> On Mon, Mar 22, 2021 at 6:32 AM Amit Sharma  wrote:
>
>> Hi, can we write unit tests for spark code. Is there any specific
>> framework?
>>
>>
>> Thanks
>> Amit
>>
>


Re: unit testing for spark code

2021-03-22 Thread Nicholas Gustafson
I've found pytest works well if you're using PySpark. Though if you have a
lot of tests, running them all can be pretty slow.

On Mon, Mar 22, 2021 at 6:32 AM Amit Sharma  wrote:

> Hi, can we write unit tests for spark code. Is there any specific
> framework?
>
>
> Thanks
> Amit
>


Re: unit testing for spark code

2021-03-22 Thread Mich Talebzadeh
Are you coding in Scala or Python?

Are you using any IDE (IntelliJ, PyCharm)?


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 22 Mar 2021 at 13:33, Amit Sharma  wrote:

> Hi, can we write unit tests for spark code. Is there any specific
> framework?
>
>
> Thanks
> Amit
>


unit testing for spark code

2021-03-22 Thread Amit Sharma
Hi, can we write unit tests for Spark code? Is there any specific framework?


Thanks
Amit


Re: Unit testing Spark/Scala code with Mockito

2020-05-20 Thread ZHANG Wei
AFAICT, it depends on the testing goal: Unit Test, Integration Test or E2E
Test.

For a Unit Test, mostly, you test an individual class or its methods.
Mockito can help mock and verify dependent instances or methods.
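
A minimal sketch of that mock-and-verify style (all names here are hypothetical, purely for illustration): the class under test depends on a small writer trait rather than on Hive directly, so the test only has to verify the interaction.

```scala
import org.apache.spark.sql.DataFrame
import org.mockito.Mockito.{mock, verify}

// hypothetical collaborator that would normally persist to Hive
trait HiveWriter {
  def write(df: DataFrame, table: String): Unit
}

// hypothetical class under test: routes exception records to a table
class ExceptionRouter(writer: HiveWriter) {
  def route(exceptions: DataFrame): Unit =
    writer.write(exceptions, "exception_table")
}

// in a test body, given some DataFrame `exceptionsDf`:
//   val writer = mock(classOf[HiveWriter])
//   new ExceptionRouter(writer).route(exceptionsDf)
//   verify(writer).write(exceptionsDf, "exception_table")
```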

For an Integration Test, some Spark testing helper methods can set up the
environment, such as `runInterpreter`[1] for running code in the REPL. The
data source can be mocked with `Seq(...).toDS()` or by reading a local file,
so there is no need to access a Hive service.
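
A minimal sketch of that style (names are illustrative): the transformation under test accepts a Dataset, and the test feeds it `Seq(...).toDS()` on a local master instead of reading from Hive or HDFS.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object AgeFilter {
  // function under test: pure transformation logic, unaware of the data source
  def adultsOnly(people: Dataset[(String, Int)]): Dataset[(String, Int)] =
    people.filter(_._2 >= 18)
}

object AgeFilterCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")          // local master, no cluster or Hive needed
      .appName("age-filter-check")
      .getOrCreate()
    import spark.implicits._

    // mock the data source with an in-memory Dataset
    val input  = Seq(("alice", 34), ("bob", 12)).toDS()
    val result = AgeFilter.adultsOnly(input).collect().toSeq

    assert(result == Seq(("alice", 34)))
    spark.stop()
  }
}
```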

For an E2E Test, HDFS and Hive (normally, local mini versions) have
to be set up to serve the real operations from Spark.

Just my 2 cents.

-- 
Cheers,
-z
[1] 
https://github.com/apache/spark/blob/a06768ec4d5059d1037086fe5495e5d23cde514b/repl/src/test/scala/org/apache/spark/repl/ReplSuite.scala#L49

On Wed, 20 May 2020 15:36:06 +0100
Mich Talebzadeh  wrote:

> On a second note with regard Spark and read writes as I understand unit
> tests are not meant to test database connections. This should be done in
> integration tests to check that all the parts work together. Unit tests are
> just meant to test the functional logic, and not spark's ability to read
> from a database.
> 
> I would have thought that if the specific connectivity through third part
> tool (in my case reading XML file using Databricks jar) is required, then
> this should be done through Read Evaluate Print Loop – REPL environment of
> Spark Shell by writing some codec to quickly establish where the API
> successfully reads from the XML file.
> 
> Does this assertion sound correct?
> 
> thanks,
> 
> Mich
> 
> 
> 
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
> 
> 
> 
> 
> 
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
> 
> 
> 
> 
> On Wed, 20 May 2020 at 11:58, Mich Talebzadeh 
> wrote:
> 
> > Hi,
> >
> > I have a spark job that reads an XML file from HDFS, process it and port
> > data to Hive tables, one good and one exception table
> >
> > The Code itself works fine. I need to create Unit Test with Mockito
> > for it.. A unit
> > test should test functionality in isolation. Side effects from other
> > classes or the system should be eliminated for a unit test, if possible. So
> > basically there are three classes.
> >
> >
> >1. Class A, reads XML file and created a DF1 on it plus a DF2 on top
> >of DF1. Test data for XML file is already created
> >2. Class B, reads DF2 and post correct data through TempView and Spark
> >SQL to the underlying Hive table
> >3. Class C, read DF2 and post exception data again through TempView
> >and Spark SQL to the underlying Hive exception table
> >
> > I would like to know for cases covering tests for Class B and Class C what
> > Mockito format needs to be used..
> >
> > Thanks,
> >
> > Mich
> >
> >
> >
> >
> > *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> > The author will in no case be liable for any monetary damages arising from
> > such loss, damage or destruction.
> >
> >
> >

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Unit testing Spark/Scala code with Mockito

2020-05-20 Thread Mich Talebzadeh
On a second note, with regard to Spark reads and writes: as I understand it,
unit tests are not meant to test database connections. This should be done in
integration tests to check that all the parts work together. Unit tests are
just meant to test the functional logic, not Spark's ability to read from a
database.

I would have thought that if the specific connectivity through a third-party
tool (in my case, reading an XML file using the Databricks jar) is required,
then this should be done through the Read-Evaluate-Print Loop (REPL)
environment of the Spark shell, by writing some code to quickly establish
whether the API successfully reads from the XML file.

Does this assertion sound correct?

thanks,

Mich



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 20 May 2020 at 11:58, Mich Talebzadeh 
wrote:

> Hi,
>
> I have a spark job that reads an XML file from HDFS, process it and port
> data to Hive tables, one good and one exception table
>
> The Code itself works fine. I need to create Unit Test with Mockito
> for it.. A unit
> test should test functionality in isolation. Side effects from other
> classes or the system should be eliminated for a unit test, if possible. So
> basically there are three classes.
>
>
>1. Class A, reads XML file and created a DF1 on it plus a DF2 on top
>of DF1. Test data for XML file is already created
>2. Class B, reads DF2 and post correct data through TempView and Spark
>SQL to the underlying Hive table
>3. Class C, read DF2 and post exception data again through TempView
>and Spark SQL to the underlying Hive exception table
>
> I would like to know for cases covering tests for Class B and Class C what
> Mockito format needs to be used..
>
> Thanks,
>
> Mich
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Unit testing Spark/Scala code with Mockito

2020-05-20 Thread Mich Talebzadeh
Hi,

I have a Spark job that reads an XML file from HDFS, processes it and ports
the data to Hive tables, one good and one exception table.

The code itself works fine. I need to create a unit test for it with Mockito. A
unit test should test functionality in isolation. Side effects from other
classes or the system should be eliminated for a unit test, if possible. So
basically there are three classes.


   1. Class A reads the XML file and creates DF1 from it, plus DF2 on top of
   DF1. Test data for the XML file is already created.
   2. Class B reads DF2 and posts the correct data through a TempView and Spark
   SQL to the underlying Hive table.
   3. Class C reads DF2 and posts the exception data, again through a TempView
   and Spark SQL, to the underlying Hive exception table.

I would like to know, for tests covering Class B and Class C, what Mockito
pattern needs to be used.

Thanks,

Mich




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Unit testing PySpark Code and doing assertion

2019-09-03 Thread Rahul Nandi
Hi,
I'm trying to do unit testing of my PySpark DataFrame code. My goal is to
assert on the schema and data of the DataFrames. I'm looking for any known
libraries that I can use for this. Any library which can work on 10-15
records in the DataFrame is good for me.
As of now I'm using the unittest library and its *assertCountEquals* method
to do the assertion. This is quite okay, but it does not do schema-level
validation, and the failure message is not easy to understand.

If any of you are using any special techniques, let me know. Thanks
in advance.

Regards,
Rahul


Re: Configuration for unit testing and sql.shuffle.partitions

2017-09-18 Thread Vadim Semenov
You can create a super class, "FunSuiteWithSparkContext", that creates the
Spark session, Spark context, and SQLContext with all the desired properties.
Then you mix it into all the relevant test suites, and that's pretty much it.
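
A minimal sketch of what such a base suite might look like (assuming ScalaTest; the imports and settings are illustrative, not Vadim's actual code):

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

// One local SparkSession per suite, with test-friendly settings,
// torn down after the suite finishes.
abstract class FunSuiteWithSparkContext extends AnyFunSuite with BeforeAndAfterAll {
  @transient var spark: SparkSession = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    spark = SparkSession.builder()
      .master("local[2]")
      .appName(getClass.getSimpleName)
      .config("spark.sql.shuffle.partitions", "4") // keep shuffles small in tests
      .config("spark.ui.enabled", "false")
      .getOrCreate()
  }

  override def afterAll(): Unit = {
    if (spark != null) spark.stop()
    super.afterAll()
  }
}
```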

The other option is to pass the settings as VM parameters, like
`-Dspark.driver.memory=2g -Xmx3G -Dspark.master=local[3]`

For example, if you run your tests with sbt:

```
SBT_OPTS="-Xmx3G -Dspark.driver.memory=1536m" sbt test
```
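
Relatedly, a sketch of baking those options into build.sbt instead of exporting SBT_OPTS each time (sbt 1.x slash syntax; the values are assumptions to adapt to your build):

```scala
// build.sbt: fork the test JVM so the options below actually apply; the -D
// system properties are picked up by SparkConf when it loads defaults.
Test / fork := true
Test / javaOptions ++= Seq(
  "-Xmx3G",
  "-Dspark.driver.memory=1536m",
  "-Dspark.master=local[3]"
)
```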

On Sat, Sep 16, 2017 at 2:54 PM, Femi Anthony  wrote:

> How are you specifying it, as an option to spark-submit ?
>
> On Sat, Sep 16, 2017 at 12:26 PM, Akhil Das  wrote:
>
>> spark.sql.shuffle.partitions is still used I believe. I can see it in the
>> code
>> 
>>  and
>> in the documentation page
>> 
>> .
>>
>> On Wed, Sep 13, 2017 at 4:46 AM, peay  wrote:
>>
>>> Hello,
>>>
>>> I am running unit tests with Spark DataFrames, and I am looking for
>>> configuration tweaks that would make tests faster. Usually, I use a
>>> local[2] or local[4] master.
>>>
>>> Something that has been bothering me is that most of my stages end up
>>> using 200 partitions, independently of whether I repartition the input.
>>> This seems a bit overkill for small unit tests that barely have 200 rows
>>> per DataFrame.
>>>
>>> spark.sql.shuffle.partitions used to control this I believe, but it
>>> seems to be gone and I could not find any information on what
>>> mechanism/setting replaces it or the corresponding JIRA.
>>>
>>> Has anyone experience to share on how to tune Spark best for very small
>>> local runs like that?
>>>
>>> Thanks!
>>>
>>>
>>
>>
>> --
>> Cheers!
>>
>>
>
>
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>


Re: Configuration for unit testing and sql.shuffle.partitions

2017-09-16 Thread Femi Anthony
How are you specifying it, as an option to spark-submit ?

On Sat, Sep 16, 2017 at 12:26 PM, Akhil Das  wrote:

> spark.sql.shuffle.partitions is still used I believe. I can see it in the
> code
> 
>  and
> in the documentation page
> 
> .
>
> On Wed, Sep 13, 2017 at 4:46 AM, peay  wrote:
>
>> Hello,
>>
>> I am running unit tests with Spark DataFrames, and I am looking for
>> configuration tweaks that would make tests faster. Usually, I use a
>> local[2] or local[4] master.
>>
>> Something that has been bothering me is that most of my stages end up
>> using 200 partitions, independently of whether I repartition the input.
>> This seems a bit overkill for small unit tests that barely have 200 rows
>> per DataFrame.
>>
>> spark.sql.shuffle.partitions used to control this I believe, but it seems
>> to be gone and I could not find any information on what mechanism/setting
>> replaces it or the corresponding JIRA.
>>
>> Has anyone experience to share on how to tune Spark best for very small
>> local runs like that?
>>
>> Thanks!
>>
>>
>
>
> --
> Cheers!
>
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Re: Configuration for unit testing and sql.shuffle.partitions

2017-09-16 Thread Akhil Das
spark.sql.shuffle.partitions is still used, I believe. I can see it in the
code and in the documentation page.
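
A small sketch of setting it on a local test session (the values are illustrative):

```scala
// keep shuffles tiny so small unit-test DataFrames don't fan out to 200 tasks
val spark = org.apache.spark.sql.SparkSession.builder()
  .master("local[2]")
  .appName("small-test-session")
  .config("spark.sql.shuffle.partitions", "4")
  .getOrCreate()
```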

On Wed, Sep 13, 2017 at 4:46 AM, peay  wrote:

> Hello,
>
> I am running unit tests with Spark DataFrames, and I am looking for
> configuration tweaks that would make tests faster. Usually, I use a
> local[2] or local[4] master.
>
> Something that has been bothering me is that most of my stages end up
> using 200 partitions, independently of whether I repartition the input.
> This seems a bit overkill for small unit tests that barely have 200 rows
> per DataFrame.
>
> spark.sql.shuffle.partitions used to control this I believe, but it seems
> to be gone and I could not find any information on what mechanism/setting
> replaces it or the corresponding JIRA.
>
> Has anyone experience to share on how to tune Spark best for very small
> local runs like that?
>
> Thanks!
>
>


-- 
Cheers!


Configuration for unit testing and sql.shuffle.partitions

2017-09-12 Thread peay
Hello,

I am running unit tests with Spark DataFrames, and I am looking for 
configuration tweaks that would make tests faster. Usually, I use a local[2] or 
local[4] master.

Something that has been bothering me is that most of my stages end up using 200
partitions, regardless of whether I repartition the input. This seems a bit of
overkill for small unit tests that barely have 200 rows per DataFrame.

spark.sql.shuffle.partitions used to control this I believe, but it seems to be 
gone and I could not find any information on what mechanism/setting replaces it 
or the corresponding JIRA.

Has anyone experience to share on how to tune Spark best for very small local 
runs like that?

Thanks!

Re: unit testing in spark

2017-04-11 Thread Elliot West
Jörn, I'm interested in your point on coverage. Coverage has been a useful
tool for highlighting areas in the codebase that pose a source of potential
risk. However, generally speaking, I've found that traditional coverage
tools do not provide useful information when applied to distributed data
processing frameworks. Here the code is mostly constructional, comprising
calls to factories, constructors, and the like, and resulting in a
representation of a job that will be executed later on, in some other
environment. One could attain high levels of coverage simply by building
the pipeline and not submitting it. Certainly it is easy to measure coverage
on individual transforms, but for jobs/pipelines it seems somewhat more
elusive.

I'd be keen to hear of your experiences and approaches in this regard as it
sounds as though you are generating more useful coverage metrics.
Personally I've been considering adopting a mutation-testing/chaos-monkey
type approach to pipeline testing in an effort to ascertain which parts of
a pipeline are not covered by a test suite. I describe it here, albeit for
the purpose of reporting code coverage on Hive SQL statements:
https://github.com/klarna/HiveRunner/issues/65#issuecomment-283785351

Thanks,

Elliot.


On 10 April 2017 at 15:32, Jörn Franke <jornfra...@gmail.com> wrote:

>
> I think in the end you need to check the coverage of your application. If
> your application is well covered on the job or pipeline level (depends
> however on how you implement these tests) then it can be fine.
> In the end it really depends on the data and what kind of transformation
> you implement. For example, you have 90% of your job with standard
> transformations, but 10% are more or less complex customized functions,
> then it might be worth to test the function with many different data inputs
> as unit tests and have integrated job/pipeline tests in addition.
>
> On 10. Apr 2017, at 15:46, Gokula Krishnan D <email2...@gmail.com> wrote:
>
> Hello Shiv,
>
> Unit Testing is really helping when you follow TDD approach. And it's a
> safe way to code a program locally and also you can make use those test
> cases during the build process by using any of the continuous integration
> tools ( Bamboo, Jenkins). If so you can ensure that artifacts are being
> tested before deploying into Cluster.
>
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
>
> On Wed, Apr 5, 2017 at 7:32 AM, Shiva Ramagopal <tr.s...@gmail.com> wrote:
>
>> Hi,
>>
>> I've been following this thread for a while.
>>
>> I'm trying to bring in a test strategy in my team to test a number of
>> data pipelines before production. I have watched Lars' presentation and
>> find it great. However I'm debating whether unit tests are worth the effort
>> if there are good job-level and pipeline-level tests. Does anybody have any
>> experiences benefitting from unit-tests in such a case?
>>
>> Cheers,
>> Shiv
>>
>> On Mon, Dec 12, 2016 at 6:00 AM, Juan Rodríguez Hortalá <
>> juan.rodriguez.hort...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I would also would like to participate on that.
>>>
>>> Greetings,
>>>
>>> Juan
>>>
>>> On Fri, Dec 9, 2016 at 6:03 AM, Michael Stratton <
>>> michael.strat...@komodohealth.com> wrote:
>>>
>>>> That sounds great, please include me so I can get involved.
>>>>
>>>> On Fri, Dec 9, 2016 at 7:39 AM, Marco Mistroni <mmistr...@gmail.com>
>>>> wrote:
>>>>
>>>>> Me too as I spent most of my time writing unit/integ tests  pls
>>>>> advise on where I  can start
>>>>> Kr
>>>>>
>>>>> On 9 Dec 2016 12:15 am, "Miguel Morales" <therevolti...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I would be interested in contributing.  Ive created my own library
>>>>>> for this as well.  In my blog post I talk about testing with Spark in 
>>>>>> RSpec
>>>>>> style:
>>>>>> https://medium.com/@therevoltingx/test-driven-development-w-
>>>>>> apache-spark-746082b44941
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Dec 8, 2016, at 4:09 PM, Holden Karau <hol...@pigscanfly.ca>
>>>>>> wrote:
>>>>>>
>>>>>> There are also libraries designed to simplify testing Spark in the
>>>>>> various platforms, spark-testing-base
>>>>>> <http://github.com/holdenk/spark-testing-base> for Scala/Java/

Re: unit testing in spark

2017-04-11 Thread Steve Loughran

(sorry sent an empty reply by accident)

Unit testing is one of the easiest ways to isolate problems in an internal 
class, things you can get wrong. But: time spent writing unit tests is time 
*not* spent writing integration tests. Which biases me towards integration.

What I do find good is writing unit tests to debug things: if something is 
playing up and you can write a unit test to replicate it, then not only can you 
isolate the problem, you can verify it is fixed and stays fixed. And as unit 
tests are fast & often runnable in parallel, they are easy to run repetitively.

But: tests have a maintenance cost, especially if the tests go into the 
internals, making them very brittle to change. Mocking is the real trouble spot 
here. It's good to be able to simulate failures, but given the choice between 
"integration tests against real code" and "something using mocks which produces 
'impossible' stack traces and, after a code rework, fails so badly you can't 
tell if it's a regression or just that the tests are obsolete", I'd go for 
production, even if it runs up some bills.

I really liked Lars's slides; they gave me some ideas. One thing I've been 
exploring is using system metrics in testing, adding more metrics to help note 
what is happening:

https://steveloughran.blogspot.co.uk/2016/04/distributed-testing-making-use-of.html

Strengths: it encourages me to write metrics, which can be used in in-VM tests 
and collected from a distributed SUT in integration tests, both for asserts and 
logging. Weaknesses: 1. it exposes internal state which, again, can be brittle; 
2. in integration tests the results can vary a lot, so you can't really make 
assertions on them. Better there to collect things and use them in test reports.

Which brings me to a real issue with integration tests, which isn't a fault of 
the apps or the tests, but of today's test runners: log capture and reporting 
date from the era when we were running unit tests, so they only address the 
reporting problems of that era: standard out and error for a single process; no 
standard log format, so naive stream capture over structured log entries; test 
runners which don't report much on a failure beyond the stack trace, or, with 
scalatest, half the stack trace (*), missing out on those of the remote 
systems. Systems which, if you are playing with cloud infra, may not be there 
when you get to analyse the test results. You are left trying to compare 9 logs 
across 3 destroyed VMs to work out why the test runner threw an assertion 
failure.

This is tractable, and indeed, the Kafka people have been advocating "use Kafka 
as the collector of test results" to address it: collect the logs, metrics, and 
events raised by the SUT, etc., and then somehow correlate them into test 
reports, or at least provide the ordering of events and state across parts of 
the system so that you can work back from a test failure. Yes, that means 
moving way beyond the usual ant-JUnit XML report everything creates, but like I 
said: that was written for a different era. It's time to move on, generating 
the XML report as one of the outputs if you want, but not as the one you use 
for diagnosing why a test fails.

I'd love to see what people have been up to in that area. If anyone has 
insights there, it'd be a topic for a hangout.

-Steve


(*) Scalatest opinions: 
https://steveloughran.blogspot.co.uk/2016/09/scalatest-thoughts-and-ideas.html




Re: unit testing in spark

2017-04-10 Thread Jörn Franke

I think in the end you need to check the coverage of your application. If your 
application is well covered at the job or pipeline level (it depends, however, 
on how you implement these tests) then it can be fine.
In the end it really depends on the data and what kind of transformations you 
implement. For example, if 90% of your job consists of standard transformations 
but 10% is more or less complex customized functions, then it might be worth 
testing those functions with many different data inputs as unit tests and 
having integrated job/pipeline tests in addition.

> On 10. Apr 2017, at 15:46, Gokula Krishnan D <email2...@gmail.com> wrote:
> 
> Hello Shiv, 
> 
> Unit Testing is really helping when you follow TDD approach. And it's a safe 
> way to code a program locally and also you can make use those test cases 
> during the build process by using any of the continuous integration tools ( 
> Bamboo, Jenkins). If so you can ensure that artifacts are being tested before 
> deploying into Cluster.
> 
> 
> Thanks & Regards, 
> Gokula Krishnan (Gokul)
> 
>> On Wed, Apr 5, 2017 at 7:32 AM, Shiva Ramagopal <tr.s...@gmail.com> wrote:
>> Hi,
>> 
>> I've been following this thread for a while. 
>> 
>> I'm trying to bring in a test strategy in my team to test a number of data 
>> pipelines before production. I have watched Lars' presentation and find it 
>> great. However I'm debating whether unit tests are worth the effort if there 
>> are good job-level and pipeline-level tests. Does anybody have any 
>> experiences benefitting from unit-tests in such a case?
>> 
>> Cheers,
>> Shiv
>> 
>>> On Mon, Dec 12, 2016 at 6:00 AM, Juan Rodríguez Hortalá 
>>> <juan.rodriguez.hort...@gmail.com> wrote:
>>> Hi all, 
>>> 
>>> I would also would like to participate on that. 
>>> 
>>> Greetings, 
>>> 
>>> Juan 
>>> 
>>>> On Fri, Dec 9, 2016 at 6:03 AM, Michael Stratton 
>>>> <michael.strat...@komodohealth.com> wrote:
>>>> That sounds great, please include me so I can get involved.
>>>> 
>>>>> On Fri, Dec 9, 2016 at 7:39 AM, Marco Mistroni <mmistr...@gmail.com> 
>>>>> wrote:
>>>>> Me too as I spent most of my time writing unit/integ tests  pls 
>>>>> advise on where I  can start
>>>>> Kr
>>>>> 
>>>>>> On 9 Dec 2016 12:15 am, "Miguel Morales" <therevolti...@gmail.com> wrote:
>>>>>> I would be interested in contributing.  Ive created my own library for 
>>>>>> this as well.  In my blog post I talk about testing with Spark in RSpec 
>>>>>> style: 
>>>>>> https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941
>>>>>> 
>>>>>> Sent from my iPhone
>>>>>> 
>>>>>>> On Dec 8, 2016, at 4:09 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>>> 
>>>>>>> There are also libraries designed to simplify testing Spark in the 
>>>>>>> various platforms, spark-testing-base for Scala/Java/Python (& video 
>>>>>>> https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck (scala focused 
>>>>>>> property based), pyspark.test (python focused with py.test instead of 
>>>>>>> unittest2) (& blog post from nextdoor 
>>>>>>> https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b#.jw3bdcej9
>>>>>>>  )
>>>>>>> 
>>>>>>> Good luck on your Spark Adventures :)
>>>>>>> 
>>>>>>> P.S.
>>>>>>> 
>>>>>>> If anyone is interested in helping improve spark testing libraries I'm 
>>>>>>> always looking for more people to be involved with spark-testing-base 
>>>>>>> because I'm lazy :p
>>>>>>> 
>>>>>>>> On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson <la...@mapflat.com> 
>>>>>>>> wrote:
>>>>>>>> I wrote some advice in a previous post on the list:
>>>>>>>> http://markmail.org/message/bbs5acrnksjxsrrs
>>>>>>>> 
>>>>>>>> It does not mention python, but the strategy advice is the same. Just
>>>>>>>> replace JUnit/Scalatest with pytest, unittest, or your favourite
>>>>>>>> python test framework.
>>

Re: unit testing in spark

2017-04-10 Thread Gokula Krishnan D
Hello Shiv,

Unit testing really helps when you follow the TDD approach. It's a safe way
to code a program locally, and you can also make use of those test cases
during the build process with any of the continuous integration tools
(Bamboo, Jenkins). That way you can ensure that artifacts are being tested
before deploying to the cluster.


Thanks & Regards,
Gokula Krishnan* (Gokul)*

On Wed, Apr 5, 2017 at 7:32 AM, Shiva Ramagopal <tr.s...@gmail.com> wrote:

> Hi,
>
> I've been following this thread for a while.
>
> I'm trying to bring in a test strategy in my team to test a number of data
> pipelines before production. I have watched Lars' presentation and find it
> great. However I'm debating whether unit tests are worth the effort if
> there are good job-level and pipeline-level tests. Does anybody have any
> experiences benefitting from unit-tests in such a case?
>
> Cheers,
> Shiv
>
> On Mon, Dec 12, 2016 at 6:00 AM, Juan Rodríguez Hortalá <
> juan.rodriguez.hort...@gmail.com> wrote:
>
>> Hi all,
>>
>> I would also would like to participate on that.
>>
>> Greetings,
>>
>> Juan
>>
>> On Fri, Dec 9, 2016 at 6:03 AM, Michael Stratton <
>> michael.strat...@komodohealth.com> wrote:
>>
>>> That sounds great, please include me so I can get involved.
>>>
>>> On Fri, Dec 9, 2016 at 7:39 AM, Marco Mistroni <mmistr...@gmail.com>
>>> wrote:
>>>
>>>> Me too as I spent most of my time writing unit/integ tests  pls
>>>> advise on where I  can start
>>>> Kr
>>>>
>>>> On 9 Dec 2016 12:15 am, "Miguel Morales" <therevolti...@gmail.com>
>>>> wrote:
>>>>
>>>>> I would be interested in contributing.  Ive created my own library for
>>>>> this as well.  In my blog post I talk about testing with Spark in RSpec
>>>>> style:
>>>>> https://medium.com/@therevoltingx/test-driven-development-w-
>>>>> apache-spark-746082b44941
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Dec 8, 2016, at 4:09 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>
>>>>> There are also libraries designed to simplify testing Spark in the
>>>>> various platforms, spark-testing-base
>>>>> <http://github.com/holdenk/spark-testing-base> for Scala/Java/Python
>>>>> (& video https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck
>>>>> <https://github.com/juanrh/sscheck> (scala focused property based),
>>>>> pyspark.test (python focused with py.test instead of unittest2) (&
>>>>> blog post from nextdoor https://engblog.nextd
>>>>> oor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b#.jw3bdcej9
>>>>>  )
>>>>>
>>>>> Good luck on your Spark Adventures :)
>>>>>
>>>>> P.S.
>>>>>
>>>>> If anyone is interested in helping improve spark testing libraries I'm
>>>>> always looking for more people to be involved with spark-testing-base
>>>>> because I'm lazy :p
>>>>>
>>>>> On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson <la...@mapflat.com>
>>>>> wrote:
>>>>>
>>>>>> I wrote some advice in a previous post on the list:
>>>>>> http://markmail.org/message/bbs5acrnksjxsrrs
>>>>>>
>>>>>> It does not mention python, but the strategy advice is the same. Just
>>>>>> replace JUnit/Scalatest with pytest, unittest, or your favourite
>>>>>> python test framework.
>>>>>>
>>>>>>
>>>>>> I recently held a presentation on the subject. There is a video
>>>>>> recording at https://vimeo.com/192429554 and slides at
>>>>>> http://www.slideshare.net/lallea/test-strategies-for-data-pr
>>>>>> ocessing-pipelines-67244458
>>>>>>
>>>>>> You can find more material on test strategies at
>>>>>> http://www.mapflat.com/lands/resources/reading-list/index.html
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Lars Albertsson
>>>>>> Data engineering consultant
>>>>>> www.mapflat.com
>>>>>> https://twitter.com/lalleal
>>>>>> +46 70 7687109
>>>>>> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/lalle@mapf
>>>>>> lat.com
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp <pseudo20...@gmail.com>
>>>>>> wrote:
>>>>>> > somone can tell me how i can make unit test on pyspark ?
>>>>>> > (book, tutorial ...)
>>>>>>
>>>>>> -
>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Cell : 425-233-8271 <(425)%20233-8271>
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>>
>>>
>>
>


Re: unit testing in spark

2017-04-05 Thread Shiva Ramagopal
Hi,

I've been following this thread for a while.

I'm trying to introduce a test strategy in my team to test a number of data
pipelines before production. I have watched Lars' presentation and find it
great. However, I'm debating whether unit tests are worth the effort if
there are good job-level and pipeline-level tests. Does anybody have any
experience of benefitting from unit tests in such a case?

Cheers,
Shiv

On Mon, Dec 12, 2016 at 6:00 AM, Juan Rodríguez Hortalá <
juan.rodriguez.hort...@gmail.com> wrote:

> Hi all,
>
> I would also would like to participate on that.
>
> Greetings,
>
> Juan
>
> On Fri, Dec 9, 2016 at 6:03 AM, Michael Stratton <michael.stratton@
> komodohealth.com> wrote:
>
>> That sounds great, please include me so I can get involved.
>>
>> On Fri, Dec 9, 2016 at 7:39 AM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> Me too as I spent most of my time writing unit/integ tests  pls
>>> advise on where I  can start
>>> Kr
>>>
>>> On 9 Dec 2016 12:15 am, "Miguel Morales" <therevolti...@gmail.com>
>>> wrote:
>>>
>>>> I would be interested in contributing.  Ive created my own library for
>>>> this as well.  In my blog post I talk about testing with Spark in RSpec
>>>> style:
>>>> https://medium.com/@therevoltingx/test-driven-development-w-
>>>> apache-spark-746082b44941
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Dec 8, 2016, at 4:09 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>
>>>> There are also libraries designed to simplify testing Spark in the
>>>> various platforms, spark-testing-base
>>>> <http://github.com/holdenk/spark-testing-base> for Scala/Java/Python
>>>> (& video https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck
>>>> <https://github.com/juanrh/sscheck> (scala focused property based),
>>>> pyspark.test (python focused with py.test instead of unittest2) (&
>>>> blog post from nextdoor https://engblog.nextd
>>>> oor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b#.jw3bdcej9
>>>>  )
>>>>
>>>> Good luck on your Spark Adventures :)
>>>>
>>>> P.S.
>>>>
>>>> If anyone is interested in helping improve spark testing libraries I'm
>>>> always looking for more people to be involved with spark-testing-base
>>>> because I'm lazy :p
>>>>
>>>> On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson <la...@mapflat.com>
>>>> wrote:
>>>>
>>>>> I wrote some advice in a previous post on the list:
>>>>> http://markmail.org/message/bbs5acrnksjxsrrs
>>>>>
>>>>> It does not mention python, but the strategy advice is the same. Just
>>>>> replace JUnit/Scalatest with pytest, unittest, or your favourite
>>>>> python test framework.
>>>>>
>>>>>
>>>>> I recently held a presentation on the subject. There is a video
>>>>> recording at https://vimeo.com/192429554 and slides at
>>>>> http://www.slideshare.net/lallea/test-strategies-for-data-pr
>>>>> ocessing-pipelines-67244458
>>>>>
>>>>> You can find more material on test strategies at
>>>>> http://www.mapflat.com/lands/resources/reading-list/index.html
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Lars Albertsson
>>>>> Data engineering consultant
>>>>> www.mapflat.com
>>>>> https://twitter.com/lalleal
>>>>> +46 70 7687109
>>>>> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>>>>>
>>>>>
>>>>> On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp <pseudo20...@gmail.com>
>>>>> wrote:
>>>>> > somone can tell me how i can make unit test on pyspark ?
>>>>> > (book, tutorial ...)
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Cell : 425-233-8271 <(425)%20233-8271>
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>>
>>
>


Re: unit testing in spark

2016-12-11 Thread Juan Rodríguez Hortalá
Hi all,

I would also would like to participate on that.

Greetings,

Juan

On Fri, Dec 9, 2016 at 6:03 AM, Michael Stratton <
michael.strat...@komodohealth.com> wrote:

> That sounds great, please include me so I can get involved.
>
> On Fri, Dec 9, 2016 at 7:39 AM, Marco Mistroni <mmistr...@gmail.com>
> wrote:
>
>> Me too as I spent most of my time writing unit/integ tests  pls
>> advise on where I  can start
>> Kr
>>
>> On 9 Dec 2016 12:15 am, "Miguel Morales" <therevolti...@gmail.com> wrote:
>>
>>> I would be interested in contributing.  Ive created my own library for
>>> this as well.  In my blog post I talk about testing with Spark in RSpec
>>> style:
>>> https://medium.com/@therevoltingx/test-driven-development-w-
>>> apache-spark-746082b44941
>>>
>>> Sent from my iPhone
>>>
>>> On Dec 8, 2016, at 4:09 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>
>>> There are also libraries designed to simplify testing Spark in the
>>> various platforms, spark-testing-base
>>> <http://github.com/holdenk/spark-testing-base> for Scala/Java/Python (&
>>> video https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck
>>> <https://github.com/juanrh/sscheck> (scala focused property based),
>>> pyspark.test (python focused with py.test instead of unittest2) (& blog
>>> post from nextdoor https://engblog.nextdoor.com/unit-testing-apache-sp
>>> ark-with-py-test-3b8970dc013b#.jw3bdcej9 )
>>>
>>> Good luck on your Spark Adventures :)
>>>
>>> P.S.
>>>
>>> If anyone is interested in helping improve spark testing libraries I'm
>>> always looking for more people to be involved with spark-testing-base
>>> because I'm lazy :p
>>>
>>> On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson <la...@mapflat.com>
>>> wrote:
>>>
>>>> I wrote some advice in a previous post on the list:
>>>> http://markmail.org/message/bbs5acrnksjxsrrs
>>>>
>>>> It does not mention python, but the strategy advice is the same. Just
>>>> replace JUnit/Scalatest with pytest, unittest, or your favourite
>>>> python test framework.
>>>>
>>>>
>>>> I recently held a presentation on the subject. There is a video
>>>> recording at https://vimeo.com/192429554 and slides at
>>>> http://www.slideshare.net/lallea/test-strategies-for-data-pr
>>>> ocessing-pipelines-67244458
>>>>
>>>> You can find more material on test strategies at
>>>> http://www.mapflat.com/lands/resources/reading-list/index.html
>>>>
>>>>
>>>>
>>>>
>>>> Lars Albertsson
>>>> Data engineering consultant
>>>> www.mapflat.com
>>>> https://twitter.com/lalleal
>>>> +46 70 7687109
>>>> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>>>>
>>>>
>>>> On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp <pseudo20...@gmail.com>
>>>> wrote:
>>>> > somone can tell me how i can make unit test on pyspark ?
>>>> > (book, tutorial ...)
>>>>
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271 <(425)%20233-8271>
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>>
>


Re: unit testing in spark

2016-12-09 Thread Michael Stratton
That sounds great, please include me so I can get involved.

On Fri, Dec 9, 2016 at 7:39 AM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Me too as I spent most of my time writing unit/integ tests  pls advise
> on where I  can start
> Kr
>
> On 9 Dec 2016 12:15 am, "Miguel Morales" <therevolti...@gmail.com> wrote:
>
>> I would be interested in contributing.  Ive created my own library for
>> this as well.  In my blog post I talk about testing with Spark in RSpec
>> style:
>> https://medium.com/@therevoltingx/test-driven-development-w-
>> apache-spark-746082b44941
>>
>> Sent from my iPhone
>>
>> On Dec 8, 2016, at 4:09 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>> There are also libraries designed to simplify testing Spark in the
>> various platforms, spark-testing-base
>> <http://github.com/holdenk/spark-testing-base> for Scala/Java/Python (&
>> video https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck
>> <https://github.com/juanrh/sscheck> (scala focused property based),
>> pyspark.test (python focused with py.test instead of unittest2) (& blog
>> post from nextdoor https://engblog.nextdoor.com/unit-testing-apache-
>> spark-with-py-test-3b8970dc013b#.jw3bdcej9 )
>>
>> Good luck on your Spark Adventures :)
>>
>> P.S.
>>
>> If anyone is interested in helping improve spark testing libraries I'm
>> always looking for more people to be involved with spark-testing-base
>> because I'm lazy :p
>>
>> On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson <la...@mapflat.com>
>> wrote:
>>
>>> I wrote some advice in a previous post on the list:
>>> http://markmail.org/message/bbs5acrnksjxsrrs
>>>
>>> It does not mention python, but the strategy advice is the same. Just
>>> replace JUnit/Scalatest with pytest, unittest, or your favourite
>>> python test framework.
>>>
>>>
>>> I recently held a presentation on the subject. There is a video
>>> recording at https://vimeo.com/192429554 and slides at
>>> http://www.slideshare.net/lallea/test-strategies-for-data-pr
>>> ocessing-pipelines-67244458
>>>
>>> You can find more material on test strategies at
>>> http://www.mapflat.com/lands/resources/reading-list/index.html
>>>
>>>
>>>
>>>
>>> Lars Albertsson
>>> Data engineering consultant
>>> www.mapflat.com
>>> https://twitter.com/lalleal
>>> +46 70 7687109
>>> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>>>
>>>
>>> On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp <pseudo20...@gmail.com>
>>> wrote:
>>> > somone can tell me how i can make unit test on pyspark ?
>>> > (book, tutorial ...)
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>>


Re: unit testing in spark

2016-12-09 Thread Marco Mistroni
Me too as I spent most of my time writing unit/integ tests  pls advise
on where I  can start
Kr

On 9 Dec 2016 12:15 am, "Miguel Morales" <therevolti...@gmail.com> wrote:

> I would be interested in contributing.  Ive created my own library for
> this as well.  In my blog post I talk about testing with Spark in RSpec
> style:
> https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-
> 746082b44941
>
> Sent from my iPhone
>
> On Dec 8, 2016, at 4:09 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> There are also libraries designed to simplify testing Spark in the various
> platforms, spark-testing-base
> <http://github.com/holdenk/spark-testing-base> for Scala/Java/Python (&
> video https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck
> <https://github.com/juanrh/sscheck> (scala focused property based),
> pyspark.test (python focused with py.test instead of unittest2) (& blog
> post from nextdoor https://engblog.nextdoor.com/unit-testing-
> apache-spark-with-py-test-3b8970dc013b#.jw3bdcej9 )
>
> Good luck on your Spark Adventures :)
>
> P.S.
>
> If anyone is interested in helping improve spark testing libraries I'm
> always looking for more people to be involved with spark-testing-base
> because I'm lazy :p
>
> On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson <la...@mapflat.com> wrote:
>
>> I wrote some advice in a previous post on the list:
>> http://markmail.org/message/bbs5acrnksjxsrrs
>>
>> It does not mention python, but the strategy advice is the same. Just
>> replace JUnit/Scalatest with pytest, unittest, or your favourite
>> python test framework.
>>
>>
>> I recently held a presentation on the subject. There is a video
>> recording at https://vimeo.com/192429554 and slides at
>> http://www.slideshare.net/lallea/test-strategies-for-data-
>> processing-pipelines-67244458
>>
>> You can find more material on test strategies at
>> http://www.mapflat.com/lands/resources/reading-list/index.html
>>
>>
>>
>>
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> https://twitter.com/lalleal
>> +46 70 7687109
>> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>>
>>
>> On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp <pseudo20...@gmail.com>
>> wrote:
>> > somone can tell me how i can make unit test on pyspark ?
>> > (book, tutorial ...)
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>
>


Re: unit testing in spark

2016-12-08 Thread Miguel Morales
Sure, I'd love to participate. Being new to Scala, things like dependency 
injection are still a bit iffy. Would love to exchange ideas.

Sent from my iPhone

> On Dec 8, 2016, at 4:29 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
> 
> Maybe diverging a bit from the original question - but would it maybe make 
> sense for those of us that all care about testing to try and do a hangout at 
> some point so that we can exchange ideas?
> 
>> On Thu, Dec 8, 2016 at 4:15 PM, Miguel Morales <therevolti...@gmail.com> 
>> wrote:
>> I would be interested in contributing.  Ive created my own library for this 
>> as well.  In my blog post I talk about testing with Spark in RSpec style: 
>> https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941
>> 
>> Sent from my iPhone
>> 
>>> On Dec 8, 2016, at 4:09 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>> 
>>> There are also libraries designed to simplify testing Spark in the various 
>>> platforms, spark-testing-base for Scala/Java/Python (& video 
>>> https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck (scala focused 
>>> property based), pyspark.test (python focused with py.test instead of 
>>> unittest2) (& blog post from nextdoor 
>>> https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b#.jw3bdcej9
>>>  )
>>> 
>>> Good luck on your Spark Adventures :)
>>> 
>>> P.S.
>>> 
>>> If anyone is interested in helping improve spark testing libraries I'm 
>>> always looking for more people to be involved with spark-testing-base 
>>> because I'm lazy :p
>>> 
>>>> On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson <la...@mapflat.com> wrote:
>>>> I wrote some advice in a previous post on the list:
>>>> http://markmail.org/message/bbs5acrnksjxsrrs
>>>> 
>>>> It does not mention python, but the strategy advice is the same. Just
>>>> replace JUnit/Scalatest with pytest, unittest, or your favourite
>>>> python test framework.
>>>> 
>>>> 
>>>> I recently held a presentation on the subject. There is a video
>>>> recording at https://vimeo.com/192429554 and slides at
>>>> http://www.slideshare.net/lallea/test-strategies-for-data-processing-pipelines-67244458
>>>> 
>>>> You can find more material on test strategies at
>>>> http://www.mapflat.com/lands/resources/reading-list/index.html
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Lars Albertsson
>>>> Data engineering consultant
>>>> www.mapflat.com
>>>> https://twitter.com/lalleal
>>>> +46 70 7687109
>>>> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>>>> 
>>>> 
>>>> On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp <pseudo20...@gmail.com> 
>>>> wrote:
>>>> > somone can tell me how i can make unit test on pyspark ?
>>>> > (book, tutorial ...)
>>>> 
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
> 
> 
> 
> -- 
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau


Re: unit testing in spark

2016-12-08 Thread Holden Karau
Maybe diverging a bit from the original question - but would it maybe make
sense for those of us that all care about testing to try and do a hangout
at some point so that we can exchange ideas?

On Thu, Dec 8, 2016 at 4:15 PM, Miguel Morales <therevolti...@gmail.com>
wrote:

> I would be interested in contributing.  Ive created my own library for
> this as well.  In my blog post I talk about testing with Spark in RSpec
> style:
> https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-
> 746082b44941
>
> Sent from my iPhone
>
> On Dec 8, 2016, at 4:09 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> There are also libraries designed to simplify testing Spark in the various
> platforms, spark-testing-base
> <http://github.com/holdenk/spark-testing-base> for Scala/Java/Python (&
> video https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck
> <https://github.com/juanrh/sscheck> (scala focused property based),
> pyspark.test (python focused with py.test instead of unittest2) (& blog
> post from nextdoor https://engblog.nextdoor.com/unit-testing-
> apache-spark-with-py-test-3b8970dc013b#.jw3bdcej9 )
>
> Good luck on your Spark Adventures :)
>
> P.S.
>
> If anyone is interested in helping improve spark testing libraries I'm
> always looking for more people to be involved with spark-testing-base
> because I'm lazy :p
>
> On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson <la...@mapflat.com> wrote:
>
>> I wrote some advice in a previous post on the list:
>> http://markmail.org/message/bbs5acrnksjxsrrs
>>
>> It does not mention python, but the strategy advice is the same. Just
>> replace JUnit/Scalatest with pytest, unittest, or your favourite
>> python test framework.
>>
>>
>> I recently held a presentation on the subject. There is a video
>> recording at https://vimeo.com/192429554 and slides at
>> http://www.slideshare.net/lallea/test-strategies-for-data-
>> processing-pipelines-67244458
>>
>> You can find more material on test strategies at
>> http://www.mapflat.com/lands/resources/reading-list/index.html
>>
>>
>>
>>
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> https://twitter.com/lalleal
>> +46 70 7687109
>> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>>
>>
>> On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp <pseudo20...@gmail.com>
>> wrote:
>> > somone can tell me how i can make unit test on pyspark ?
>> > (book, tutorial ...)
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Cell : 425-233-8271 <(425)%20233-8271>
> Twitter: https://twitter.com/holdenkarau
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: unit testing in spark

2016-12-08 Thread Miguel Morales
I would be interested in contributing.  Ive created my own library for this as 
well.  In my blog post I talk about testing with Spark in RSpec style: 
https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941

Sent from my iPhone

> On Dec 8, 2016, at 4:09 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
> 
> There are also libraries designed to simplify testing Spark in the various 
> platforms, spark-testing-base for Scala/Java/Python (& video 
> https://www.youtube.com/watch?v=f69gSGSLGrY), sscheck (scala focused property 
> based), pyspark.test (python focused with py.test instead of unittest2) (& 
> blog post from nextdoor 
> https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b#.jw3bdcej9
>  )
> 
> Good luck on your Spark Adventures :)
> 
> P.S.
> 
> If anyone is interested in helping improve spark testing libraries I'm always 
> looking for more people to be involved with spark-testing-base because I'm 
> lazy :p
> 
>> On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson <la...@mapflat.com> wrote:
>> I wrote some advice in a previous post on the list:
>> http://markmail.org/message/bbs5acrnksjxsrrs
>> 
>> It does not mention python, but the strategy advice is the same. Just
>> replace JUnit/Scalatest with pytest, unittest, or your favourite
>> python test framework.
>> 
>> 
>> I recently held a presentation on the subject. There is a video
>> recording at https://vimeo.com/192429554 and slides at
>> http://www.slideshare.net/lallea/test-strategies-for-data-processing-pipelines-67244458
>> 
>> You can find more material on test strategies at
>> http://www.mapflat.com/lands/resources/reading-list/index.html
>> 
>> 
>> 
>> 
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> https://twitter.com/lalleal
>> +46 70 7687109
>> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>> 
>> 
>> On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp <pseudo20...@gmail.com> wrote:
>> > somone can tell me how i can make unit test on pyspark ?
>> > (book, tutorial ...)
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
> 
> 
> 
> -- 
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau


Re: unit testing in spark

2016-12-08 Thread Holden Karau
There are also libraries designed to simplify testing Spark in the various
platforms, spark-testing-base <http://github.com/holdenk/spark-testing-base>
for Scala/Java/Python (& video https://www.youtube.com/watch?v=f69gSGSLGrY),
sscheck <https://github.com/juanrh/sscheck> (scala focused property based),
pyspark.test (python focused with py.test instead of unittest2) (& blog
post from nextdoor
https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b#.jw3bdcej9
 )

Good luck on your Spark Adventures :)

P.S.

If anyone is interested in helping improve spark testing libraries I'm
always looking for more people to be involved with spark-testing-base
because I'm lazy :p

On Thu, Dec 8, 2016 at 2:05 PM, Lars Albertsson <la...@mapflat.com> wrote:

> I wrote some advice in a previous post on the list:
> http://markmail.org/message/bbs5acrnksjxsrrs
>
> It does not mention python, but the strategy advice is the same. Just
> replace JUnit/Scalatest with pytest, unittest, or your favourite
> python test framework.
>
>
> I recently held a presentation on the subject. There is a video
> recording at https://vimeo.com/192429554 and slides at
> http://www.slideshare.net/lallea/test-strategies-for-
> data-processing-pipelines-67244458
>
> You can find more material on test strategies at
> http://www.mapflat.com/lands/resources/reading-list/index.html
>
>
>
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> https://twitter.com/lalleal
> +46 70 7687109
> Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com
>
>
> On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp <pseudo20...@gmail.com>
> wrote:
> > somone can tell me how i can make unit test on pyspark ?
> > (book, tutorial ...)
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: unit testing in spark

2016-12-08 Thread Lars Albertsson
I wrote some advice in a previous post on the list:
http://markmail.org/message/bbs5acrnksjxsrrs

It does not mention python, but the strategy advice is the same. Just
replace JUnit/Scalatest with pytest, unittest, or your favourite
python test framework.


I recently held a presentation on the subject. There is a video
recording at https://vimeo.com/192429554 and slides at
http://www.slideshare.net/lallea/test-strategies-for-data-processing-pipelines-67244458

You can find more material on test strategies at
http://www.mapflat.com/lands/resources/reading-list/index.html




Lars Albertsson
Data engineering consultant
www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109
Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com


On Thu, Dec 8, 2016 at 4:14 PM, pseudo oduesp  wrote:
> somone can tell me how i can make unit test on pyspark ?
> (book, tutorial ...)

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: unit testing in spark

2016-12-08 Thread ndjido
Hi Pseudo,

Just use unittest https://docs.python.org/2/library/unittest.html .

> On 8 Dec 2016, at 19:14, pseudo oduesp  wrote:
> 
> somone can tell me how i can make unit test on pyspark ?
> (book, tutorial ...)


unit testing in spark

2016-12-08 Thread pseudo oduesp
Can someone tell me how I can do unit tests on PySpark?
(book, tutorial ...)


Driver/Executor Memory values during Unit Testing

2016-12-07 Thread Aleksander Eskilson
Hi there,

I've been trying to increase spark.driver.memory and
spark.executor.memory during some unit tests. Most of the information I can
find about increasing memory for Spark is based on either flags to
spark-submit or settings in the spark-defaults.conf file. Running unit
tests with Maven on both a local machine and a Jenkins box, and editing
both the .conf file and attempting to set the spark.driver.memory and
spark.executor.memory variables in a SparkConf object in the unit tests'
@BeforeClass method, I still can't seem to change what the Storage Memory
of the executor is; it remains the same on every execution when I check the
UI during the tests. When Spark is invoked on a local machine, and not
through spark-submit in the shell (as during a unit test), are the memory
defaults computed some other way, perhaps based on JVM heap allocation
settings?

Best,
Alek


Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-22 Thread Bedrytski Aliaksandr
Hi Everett,

The HiveContext is initialized only once, as a lazy val, so if you mean
initializing different JVMs for each test (or group of tests), then in
that case the context will obviously not be shared.

But specs2 (by default) launches specs (inside of test classes) in
parallel threads, and in this case the context is shared.

To sum up, tests are launched sequentially, but specs inside of tests
are launched in parallel. We don't have anything specific in our .sbt
file with regard to parallel test execution, and the Hive context is
initialized only once.

In my opinion (correct me if I'm wrong), if you already have more than one
spec per test, the CPU will already be saturated, so fully parallel execution
of tests will not give additional gains.

Regards
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski
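
A minimal sketch (illustrative names, using the Spark 2.x SparkSession API rather than the HiveContext mentioned above) of the approach described in this thread: one lazily created, shared session, and temp view names made unique with an AtomicInteger so parallel specs don't overwrite each other.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.sql.{DataFrame, SparkSession}

object SharedSparkSession {
  // created once per JVM and shared by every suite that references it
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("shared-test-session")
    .getOrCreate()

  private val counter = new AtomicInteger(0)

  // register df under a unique name and return that name for use in SQL
  def registerUniqueTempView(df: DataFrame, prefix: String): String = {
    val name = s"${prefix}_${counter.incrementAndGet()}"
    df.createOrReplaceTempView(name)
    name
  }
}
```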



On Sun, Aug 21, 2016, at 18:30, Everett Anderson wrote:
>
>
> On Sun, Aug 21, 2016 at 3:08 AM, Bedrytski Aliaksandr
> <sp...@bedryt.ski> wrote:
>> __
>> Hi,
>>
>> we share the same spark/hive context between tests (executed in
>> parallel), so the main problem is that the temporary tables are
>> overwritten each time they are created, this may create race
>> conditions
>> as these tempTables may be seen as global mutable shared state.
>>
>> So each time we create a temporary table, we add an unique,
>> incremented,
>> thread safe id (AtomicInteger) to its name so that there are only
>> specific, non-shared temporary tables used for a test.
>
> Makes sense.
>
> But when you say you're sharing the same spark/hive context between
> tests, I'm assuming that's between the same tests within one test
> class, but you're not sharing across test classes (which a build tool
> like Maven or Gradle might have executed in separate JVMs).
>
> Is that right?
>
>
>
>>
>>
>> --
>>   Bedrytski Aliaksandr
>>   sp...@bedryt.ski
>>
>>
>>
>>> On Sat, Aug 20, 2016, at 01:25, Everett Anderson wrote:
>>> Hi!
>>>
>>> Just following up on this --
>>>
>>> When people talk about a shared session/context for testing
>>> like this,
>>> I assume it's still within one test class. So it's still the
>>> case that
>>> if you have a lot of test classes that test Spark-related
>>> things, you
>>> must configure your build system to not run in them in parallel.
>>> You'll get the benefit of not creating and tearing down a Spark
>>> session/context between test cases with a test class, though.
>>>
>>> Is that right?
>>>
>>> Or have people figured out a way to have sbt (or Maven/Gradle/etc)
>>> share Spark sessions/contexts across integration tests in a
>>> safe way?
>>>
>>>
>>> On Mon, Aug 1, 2016 at 3:23 PM, Holden Karau
>>> <hol...@pigscanfly.ca> wrote:
>>> Thats a good point - there is an open issue for spark-testing-
>>> base to
>>> support this shared sparksession approach - but I haven't had the
>>> time ( https://github.com/holdenk/spark-testing-base/issues/123 ).
>>> I'll try and include this in the next release :)
>>>
>>> On Mon, Aug 1, 2016 at 9:22 AM, Koert Kuipers
>>> <ko...@tresata.com> wrote:
>>> we share a single single sparksession across tests, and they can run
>>> in parallel. is pretty fast
>>>
>>> On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson
>>> <ever...@nuna.com.invalid> wrote:
>>> Hi,
>>>
>>> Right now, if any code uses DataFrame/Dataset, I need a test setup
>>> that brings up a local master as in this article[1].
>>>
>>>
>>> That's a lot of overhead for unit testing and the tests can't run
>>> in parallel, so testing is slow -- this is more like what I'd call
>>> an integration test.
>>>
>>> Do people have any tricks to get around this? Maybe using spy mocks
>>> on fake DataFrame/Datasets?
>>>
>>> Anyone know if there are plans to make more traditional unit
>>> testing possible with Spark SQL, perhaps with a stripped down in-
>>> memory implementation? (I admit this does seem quite hard since
>>> there's so much functionality in these classes!)
>>>
>>> Thanks!
>>>
>>>
>>> - Everett
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>


Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-21 Thread Everett Anderson
On Sun, Aug 21, 2016 at 3:08 AM, Bedrytski Aliaksandr <sp...@bedryt.ski>
wrote:

> Hi,
>
> we share the same spark/hive context between tests (executed in
> parallel), so the main problem is that the temporary tables are
> overwritten each time they are created, this may create race conditions
> as these tempTables may be seen as global mutable shared state.
>
> So each time we create a temporary table, we add an unique, incremented,
> thread safe id (AtomicInteger) to its name so that there are only
> specific, non-shared temporary tables used for a test.
>

Makes sense.

But when you say you're sharing the same spark/hive context between tests,
I'm assuming that's between the same tests within one test class, but
you're not sharing across test classes (which a build tool like Maven or
Gradle might have executed in separate JVMs).

Is that right?




>
> --
>   Bedrytski Aliaksandr
>   sp...@bedryt.ski
>
>
>
> On Sat, Aug 20, 2016, at 01:25, Everett Anderson wrote:
> Hi!
>
> Just following up on this --
>
> When people talk about a shared session/context for testing like this,
> I assume it's still within one test class. So it's still the case that
> if you have a lot of test classes that test Spark-related things, you
> must configure your build system to not run in them in parallel.
> You'll get the benefit of not creating and tearing down a Spark
> session/context between test cases with a test class, though.
>
> Is that right?
>
> Or have people figured out a way to have sbt (or Maven/Gradle/etc)
> share Spark sessions/contexts across integration tests in a safe way?
>
>
> On Mon, Aug 1, 2016 at 3:23 PM, Holden Karau
> <hol...@pigscanfly.ca> wrote:
> Thats a good point - there is an open issue for spark-testing-base to
> support this shared sparksession approach - but I haven't had the
> time ( https://github.com/holdenk/spark-testing-base/issues/123 ).
> I'll try and include this in the next release :)
>
> On Mon, Aug 1, 2016 at 9:22 AM, Koert Kuipers
> <ko...@tresata.com> wrote:
> we share a single single sparksession across tests, and they can run
> in parallel. is pretty fast
>
> On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson
> <ever...@nuna.com.invalid> wrote:
> Hi,
>
> Right now, if any code uses DataFrame/Dataset, I need a test setup
> that brings up a local master as in this article[1].
>
> That's a lot of overhead for unit testing and the tests can't run
> in parallel, so testing is slow -- this is more like what I'd call
> an integration test.
>
> Do people have any tricks to get around this? Maybe using spy mocks
> on fake DataFrame/Datasets?
>
> Anyone know if there are plans to make more traditional unit
> testing possible with Spark SQL, perhaps with a stripped down in-
> memory implementation? (I admit this does seem quite hard since
> there's so much functionality in these classes!)
>
> Thanks!
>
>
> - Everett
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
>


Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-21 Thread Bedrytski Aliaksandr
Hi,

we share the same spark/hive context between tests (executed in
parallel), so the main problem is that the temporary tables are
overwritten each time they are created, this may create race conditions
as these tempTables may be seen as global mutable shared state.

So each time we create a temporary table, we add a unique, incremented,
thread-safe id (AtomicInteger) to its name, so that only specific,
non-shared temporary tables are used by each test.
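
As a minimal sketch of what that looks like in practice (all names here are
illustrative, assuming Spark 1.6 and a context shared across parallel specs):

import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object SharedTestContext {
  // Created once per JVM and reused by every spec that references it.
  lazy val sc: SparkContext =
    new SparkContext(new SparkConf().setMaster("local[*]").setAppName("shared-test-context"))
  lazy val sqlContext: SQLContext = new SQLContext(sc)

  private val tableCounter = new AtomicInteger(0)

  // Register the DataFrame under a unique name so parallel specs never
  // overwrite each other's temporary tables.
  def registerUniqueTempTable(df: DataFrame, baseName: String): String = {
    val name = s"${baseName}_${tableCounter.incrementAndGet()}"
    df.registerTempTable(name)
    name
  }
}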

--
  Bedrytski Aliaksandr
  sp...@bedryt.ski



> On Sat, Aug 20, 2016, at 01:25, Everett Anderson wrote:
> Hi!
>
> Just following up on this --
>
> When people talk about a shared session/context for testing like this,
> I assume it's still within one test class. So it's still the case that
> if you have a lot of test classes that test Spark-related things, you
> must configure your build system to not run in them in parallel.
> You'll get the benefit of not creating and tearing down a Spark
> session/context between test cases with a test class, though.
>
> Is that right?
>
> Or have people figured out a way to have sbt (or Maven/Gradle/etc)
> share Spark sessions/contexts across integration tests in a safe way?
>
>
> On Mon, Aug 1, 2016 at 3:23 PM, Holden Karau
> <hol...@pigscanfly.ca> wrote:
> Thats a good point - there is an open issue for spark-testing-base to
> support this shared sparksession approach - but I haven't had the
> time ( https://github.com/holdenk/spark-testing-base/issues/123 ).
> I'll try and include this in the next release :)
>
> On Mon, Aug 1, 2016 at 9:22 AM, Koert Kuipers
> <ko...@tresata.com> wrote:
> we share a single single sparksession across tests, and they can run
> in parallel. is pretty fast
>
> On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson
> <ever...@nuna.com.invalid> wrote:
> Hi,
>
> Right now, if any code uses DataFrame/Dataset, I need a test setup
> that brings up a local master as in this article[1].
>
> That's a lot of overhead for unit testing and the tests can't run
> in parallel, so testing is slow -- this is more like what I'd call
> an integration test.
>
> Do people have any tricks to get around this? Maybe using spy mocks
> on fake DataFrame/Datasets?
>
> Anyone know if there are plans to make more traditional unit
> testing possible with Spark SQL, perhaps with a stripped down in-
> memory implementation? (I admit this does seem quite hard since
> there's so much functionality in these classes!)
>
> Thanks!
>
>
> - Everett
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau


Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-19 Thread Everett Anderson
Hi!

Just following up on this --

When people talk about a shared session/context for testing like this, I
assume it's still within one test class. So it's still the case that if you
have a lot of test classes that test Spark-related things, you must
configure your build system to not run in them in parallel. You'll get the
benefit of not creating and tearing down a Spark session/context between
test cases with a test class, though.

Is that right?

Or have people figured out a way to have sbt (or Maven/Gradle/etc) share
Spark sessions/contexts across integration tests in a safe way?


On Mon, Aug 1, 2016 at 3:23 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Thats a good point - there is an open issue for spark-testing-base to
> support this shared sparksession approach - but I haven't had the time (
> https://github.com/holdenk/spark-testing-base/issues/123 ). I'll try and
> include this in the next release :)
>
> On Mon, Aug 1, 2016 at 9:22 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> we share a single single sparksession across tests, and they can run in
>> parallel. is pretty fast
>>
>> On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson <
>> ever...@nuna.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> Right now, if any code uses DataFrame/Dataset, I need a test setup that
>>> brings up a local master as in this article
>>> <http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/>
>>> .
>>>
>>> That's a lot of overhead for unit testing and the tests can't run in
>>> parallel, so testing is slow -- this is more like what I'd call an
>>> integration test.
>>>
>>> Do people have any tricks to get around this? Maybe using spy mocks on
>>> fake DataFrame/Datasets?
>>>
>>> Anyone know if there are plans to make more traditional unit testing
>>> possible with Spark SQL, perhaps with a stripped down in-memory
>>> implementation? (I admit this does seem quite hard since there's so much
>>> functionality in these classes!)
>>>
>>> Thanks!
>>>
>>> - Everett
>>>
>>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-01 Thread Holden Karau
Thats a good point - there is an open issue for spark-testing-base to
support this shared sparksession approach - but I haven't had the time (
https://github.com/holdenk/spark-testing-base/issues/123 ). I'll try and
include this in the next release :)

On Mon, Aug 1, 2016 at 9:22 AM, Koert Kuipers <ko...@tresata.com> wrote:

> we share a single single sparksession across tests, and they can run in
> parallel. is pretty fast
>
> On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson <
> ever...@nuna.com.invalid> wrote:
>
>> Hi,
>>
>> Right now, if any code uses DataFrame/Dataset, I need a test setup that
>> brings up a local master as in this article
>> <http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/>
>> .
>>
>> That's a lot of overhead for unit testing and the tests can't run in
>> parallel, so testing is slow -- this is more like what I'd call an
>> integration test.
>>
>> Do people have any tricks to get around this? Maybe using spy mocks on
>> fake DataFrame/Datasets?
>>
>> Anyone know if there are plans to make more traditional unit testing
>> possible with Spark SQL, perhaps with a stripped down in-memory
>> implementation? (I admit this does seem quite hard since there's so much
>> functionality in these classes!)
>>
>> Thanks!
>>
>> - Everett
>>
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-01 Thread Koert Kuipers
we share a single SparkSession across tests, and they can run in
parallel. It is pretty fast.
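
As a rough sketch of that setup (names are illustrative, assuming Spark 2.x
and ScalaTest; the shared session lives in an object so every suite running
in the same JVM reuses it):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.length
import org.scalatest.FunSuite

object SharedSparkSession {
  // Lazily created once per JVM; every suite that touches it reuses the same session.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("shared-test-session")
    .getOrCreate()
}

class ExampleSuite extends FunSuite {
  import SharedSparkSession.spark.implicits._

  test("word lengths are computed") {
    val words = Seq("spark", "testing").toDF("word")
    assert(words.filter(length($"word") > 5).count() === 1)
  }
}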

On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson <ever...@nuna.com.invalid>
wrote:

> Hi,
>
> Right now, if any code uses DataFrame/Dataset, I need a test setup that
> brings up a local master as in this article
> <http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/>
> .
>
> That's a lot of overhead for unit testing and the tests can't run in
> parallel, so testing is slow -- this is more like what I'd call an
> integration test.
>
> Do people have any tricks to get around this? Maybe using spy mocks on
> fake DataFrame/Datasets?
>
> Anyone know if there are plans to make more traditional unit testing
> possible with Spark SQL, perhaps with a stripped down in-memory
> implementation? (I admit this does seem quite hard since there's so much
> functionality in these classes!)
>
> Thanks!
>
> - Everett
>
>


Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-01 Thread Everett Anderson
Hi,

Right now, if any code uses DataFrame/Dataset, I need a test setup that
brings up a local master as in this article
<http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/>
.

That's a lot of overhead for unit testing and the tests can't run in
parallel, so testing is slow -- this is more like what I'd call an
integration test.

Do people have any tricks to get around this? Maybe using spy mocks on fake
DataFrame/Datasets?

Anyone know if there are plans to make more traditional unit testing
possible with Spark SQL, perhaps with a stripped down in-memory
implementation? (I admit this does seem quite hard since there's so much
functionality in these classes!)

Thanks!

- Everett


Re: Unit testing framework for Spark Jobs?

2016-05-21 Thread Lars Albertsson
>> >> of Kafka and Cassandra on your local machine, and connect your
>> >> application to them. Then feed input to a Kafka topic, and wait for
>> >> the result to appear in Cassandra.
>> >>
>> >> With this setup, your application still runs in Scalatest, the tests
>> >> run without custom setup in maven/sbt/gradle, and you can easily run
>> >> and debug inside IntelliJ.
>> >>
>> >> Docker is suitable for spinning up external components. If you use
>> >> Kafka, the Docker image spotify/kafka is useful, since it bundles
>> >> Zookeeper.
>> >>
>> >> When waiting for output to appear, don't sleep for a long time and
>> >> then check, since it will slow down your tests. Instead enter a loop
>> >> where you poll for the results and sleep for a few milliseconds in
>> >> between, with a long timeout (~30s) before the test fails with a
>> >> timeout.
>> >
>> > org.scalatest.concurrent.Eventually is your friend there
>> >
>> > eventually(stdTimeout, stdInterval) {
>> > listRestAPIApplications(connector, webUI, true) should
>> contain(expectedAppId)
>> > }
>> >
>> > It has good exponential backoff, for fast initial success without using
>> too much CPU later, and is simple to use
>> >
>> > If it has weaknesses in my tests, they are
>> >
>> > 1. it will retry on all exceptions, rather than assertions. If there's
>> a bug in the test code then it manifests as a timeout. ( I think I could
>> play with Suite.anExceptionThatShouldCauseAnAbort()) here.
>> > 2. it's timeout action is simply to rethrow the fault; I like to exec a
>> closure to grab more diagnostics
>> > 3. It doesn't support some fail-fast exception which your code can
>> raise to indicate that the desired state is never going to be reached, and
>> so the test should fail fast. Here a new exception and another entry in
>> anExceptionThatShouldCauseAnAbort() may be the answer. I should sit down
>> and play with that some more.
>> >
>> >
>> >>
>> >> This poll and sleep strategy both makes tests quick in successful
>> >> cases, but still robust to occasional delays. The strategy does not
>> >> work if you want to test for absence, e.g. ensure that a particular
>> >> message if filtered. You can work around it by adding another message
>> >> afterwards and polling for its effect before testing for absence of
>> >> the first. Be aware that messages can be processed out of order in
>> >> Spark Streaming depending on partitioning, however.
>> >>
>> >>
>> >> I have tested Spark applications with both strategies described above,
>> >> and it is straightforward to set up. Let me know if you want
>> >> clarifications or assistance.
>> >>
>> >> Regards,
>> >>
>> >>
>> >>
>> >> Lars Albertsson
>> >> Data engineering consultant
>> >> www.mapflat.com
>> >> +46 70 7687109
>> >>
>> >>
>> >> On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
>> >>> Hi,
>> >>>
>> >>> What is a good unit testing framework for Spark batch/streaming jobs?
>> I have
>> >>> core spark, spark sql with dataframes and streaming api getting used.
>> Any
>> >>> good framework to cover unit tests for these APIs?
>> >>>
>> >>> Thanks!
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380.html
>> >>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>> >>>
>> >>> -
>> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >>> For additional commands, e-mail: user-h...@spark.apache.org
>> >>>
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >>
>> >
>>
>
>


Re: Unit testing framework for Spark Jobs?

2016-05-18 Thread Todd Nist
Perhaps these may be of some use:

https://github.com/mkuthan/example-spark
http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
https://github.com/holdenk/spark-testing-base
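
For a feel of the spark-testing-base style, a small sketch (assuming the
SharedSparkContext trait that the project documents; adjust the artifact
version to your Spark version):

import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class WordCountSuite extends FunSuite with SharedSparkContext {
  test("counting words") {
    // `sc` is provided by SharedSparkContext and reused across the tests in this suite.
    val counts = sc.parallelize(Seq("a", "b", "a")).countByValue()
    assert(counts("a") === 2)
  }
}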

On Wed, May 18, 2016 at 2:14 PM, swetha kasireddy <swethakasire...@gmail.com
> wrote:

> Hi Lars,
>
> Do you have any examples for the methods that you described for Spark
> batch and Streaming?
>
> Thanks!
>
> On Wed, Mar 30, 2016 at 2:41 AM, Lars Albertsson <la...@mapflat.com>
> wrote:
>
>> Thanks!
>>
>> It is on my backlog to write a couple of blog posts on the topic, and
>> eventually some example code, but I am currently busy with clients.
>>
>> Thanks for the pointer to Eventually - I was unaware. Fast exit on
>> exception would be a useful addition, indeed.
>>
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> +46 70 7687109
>>
>> On Mon, Mar 28, 2016 at 2:00 PM, Steve Loughran <ste...@hortonworks.com>
>> wrote:
>> > this is a good summary -Have you thought of publishing it at the end of
>> a URL for others to refer to
>> >
>> >> On 18 Mar 2016, at 07:05, Lars Albertsson <la...@mapflat.com> wrote:
>> >>
>> >> I would recommend against writing unit tests for Spark programs, and
>> >> instead focus on integration tests of jobs or pipelines of several
>> >> jobs. You can still use a unit test framework to execute them. Perhaps
>> >> this is what you meant.
>> >>
>> >> You can use any of the popular unit test frameworks to drive your
>> >> tests, e.g. JUnit, Scalatest, Specs2. I prefer Scalatest, since it
>> >> gives you choice of TDD vs BDD, and it is also well integrated with
>> >> IntelliJ.
>> >>
>> >> I would also recommend against using testing frameworks tied to a
>> >> processing technology, such as Spark Testing Base. Although it does
>> >> seem well crafted, and makes it easy to get started with testing,
>> >> there are drawbacks:
>> >>
>> >> 1. I/O routines are not tested. Bundled test frameworks typically do
>> >> not materialise datasets on storage, but pass them directly in memory.
>> >> (I have not verified this for Spark Testing Base, but it looks so.)
>> >> I/O routines are therefore not exercised, and they often hide bugs,
>> >> e.g. related to serialisation.
>> >>
>> >> 2. You create a strong coupling between processing technology and your
>> >> tests. If you decide to change processing technology (which can happen
>> >> soon in this fast paced world...), you need to rewrite your tests.
>> >> Therefore, during a migration process, the tests cannot detect bugs
>> >> introduced in migration, and help you migrate fast.
>> >>
>> >> I recommend that you instead materialise input datasets on local disk,
>> >> run your Spark job, which writes output datasets to local disk, read
>> >> output from disk, and verify the results. You can still use Spark
>> >> routines to read and write input and output datasets. A Spark context
>> >> is expensive to create, so for speed, I would recommend reusing the
>> >> Spark context between input generation, running the job, and reading
>> >> output.
>> >>
>> >> This is easy to set up, so you don't need a dedicated framework for
>> >> it. Just put your common boilerplate in a shared test trait or base
>> >> class.
>> >>
>> >> In the future, when you want to replace your Spark job with something
>> >> shinier, you can still use the old tests, and only replace the part
>> >> that runs your job, giving you some protection from regression bugs.
>> >>
>> >>
>> >> Testing Spark Streaming applications is a different beast, and you can
>> >> probably not reuse much from your batch testing.
>> >>
>> >> For testing streaming applications, I recommend that you run your
>> >> application inside a unit test framework, e.g, Scalatest, and have the
>> >> test setup create a fixture that includes your input and output
>> >> components. For example, if your streaming application consumes from
>> >> Kafka and updates tables in Cassandra, spin up single node instances
>> >> of Kafka and Cassandra on your local machine, and connect your
>> >> application to them. Then feed input to a Kafka topic, and wait for
>> >> the result to appear in Cassandra.

Re: Unit testing framework for Spark Jobs?

2016-05-18 Thread swetha kasireddy
> >> where you poll for the results and sleep for a few milliseconds in
> >> between, with a long timeout (~30s) before the test fails with a
> >> timeout.
> >
> > org.scalatest.concurrent.Eventually is your friend there
> >
> > eventually(stdTimeout, stdInterval) {
> > listRestAPIApplications(connector, webUI, true) should
> contain(expectedAppId)
> > }
> >
> > It has good exponential backoff, for fast initial success without using
> too much CPU later, and is simple to use
> >
> > If it has weaknesses in my tests, they are
> >
> > 1. it will retry on all exceptions, rather than assertions. If there's a
> bug in the test code then it manifests as a timeout. ( I think I could play
> with Suite.anExceptionThatShouldCauseAnAbort()) here.
> > 2. it's timeout action is simply to rethrow the fault; I like to exec a
> closure to grab more diagnostics
> > 3. It doesn't support some fail-fast exception which your code can raise
> to indicate that the desired state is never going to be reached, and so the
> test should fail fast. Here a new exception and another entry in
> anExceptionThatShouldCauseAnAbort() may be the answer. I should sit down
> and play with that some more.
> >
> >
> >>
> >> This poll and sleep strategy both makes tests quick in successful
> >> cases, but still robust to occasional delays. The strategy does not
> >> work if you want to test for absence, e.g. ensure that a particular
> >> message if filtered. You can work around it by adding another message
> >> afterwards and polling for its effect before testing for absence of
> >> the first. Be aware that messages can be processed out of order in
> >> Spark Streaming depending on partitioning, however.
> >>
> >>
> >> I have tested Spark applications with both strategies described above,
> >> and it is straightforward to set up. Let me know if you want
> >> clarifications or assistance.
> >>
> >> Regards,
> >>
> >>
> >>
> >> Lars Albertsson
> >> Data engineering consultant
> >> www.mapflat.com
> >> +46 70 7687109
> >>
> >>
> >> On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
> >>> Hi,
> >>>
> >>> What is a good unit testing framework for Spark batch/streaming jobs?
> I have
> >>> core spark, spark sql with dataframes and streaming api getting used.
> Any
> >>> good framework to cover unit tests for these APIs?
> >>>
> >>> Thanks!
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380.html
> >>> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >>>
> >>> -
> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: user-h...@spark.apache.org
> >>>
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >>
> >
>


Re: Scala: Perform Unit Testing in spark

2016-04-06 Thread Shishir Anshuman
I placed the *tests* jars in the *lib* folder; now it's working.

On Wed, Apr 6, 2016 at 7:34 PM, Lars Albertsson <la...@mapflat.com> wrote:

> Hi,
>
> I wrote a longish mail on Spark testing strategy last month, which you
> may find useful:
> http://mail-archives.apache.org/mod_mbox/spark-user/201603.mbox/browser
>
> Let me know if you have follow up questions or want assistance.
>
> Regards,
>
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> +46 70 7687109
>
>
> On Fri, Apr 1, 2016 at 10:31 PM, Shishir Anshuman
> <shishiranshu...@gmail.com> wrote:
> > Hello,
> >
> > I have a code written in scala using Mllib. I want to perform unit
> testing
> > it. I cant decide between Junit 4 and ScalaTest.
> > I am new to Spark. Please guide me how to proceed with the testing.
> >
> > Thank you.
>


Re: Scala: Perform Unit Testing in spark

2016-04-06 Thread Lars Albertsson
Hi,

I wrote a longish mail on Spark testing strategy last month, which you
may find useful:
http://mail-archives.apache.org/mod_mbox/spark-user/201603.mbox/browser

Let me know if you have follow up questions or want assistance.

Regards,


Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109


On Fri, Apr 1, 2016 at 10:31 PM, Shishir Anshuman
<shishiranshu...@gmail.com> wrote:
> Hello,
>
> I have a code written in scala using Mllib. I want to perform unit testing
> it. I cant decide between Junit 4 and ScalaTest.
> I am new to Spark. Please guide me how to proceed with the testing.
>
> Thank you.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Scala: Perform Unit Testing in spark

2016-04-02 Thread Ted Yu
I think you should specify dependencies in this way:

*"org.apache.spark" % "spark-core_2.10" % "1.6.0"* % "tests"

Please refer to http://www.scalatest.org/user_guide/using_scalatest_with_sbt
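
Purely as an illustration (not from the original thread), a build.sbt
fragment along those lines; the "tests" classifier is one common way to pull
in the published -tests artifacts, and the versions are placeholders:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.6.0",
  "org.apache.spark" %% "spark-mllib" % "1.6.0",
  "org.apache.spark" %% "spark-core"  % "1.6.0" % "test" classifier "tests",
  "org.apache.spark" %% "spark-mllib" % "1.6.0" % "test" classifier "tests",
  "org.scalatest"    %% "scalatest"   % "2.2.6" % "test"
)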

On Fri, Apr 1, 2016 at 3:33 PM, Shishir Anshuman <shishiranshu...@gmail.com>
wrote:

> When I added *"org.apache.spark" % "spark-core_2.10" % "1.6.0",  *it
> should include spark-core_2.10-1.6.1-tests.jar.
> Why do I need to use the jar file explicitly?
>
> And how do I use the jars for compiling with *sbt* and running the tests
> on spark?
>
>
> On Sat, Apr 2, 2016 at 3:46 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> You need to include the following jars:
>>
>> jar tvf ./core/target/spark-core_2.10-1.6.1-tests.jar | grep SparkFunSuite
>>   1787 Thu Mar 03 09:06:14 PST 2016
>> org/apache/spark/SparkFunSuite$$anonfun$withFixture$1.class
>>   1780 Thu Mar 03 09:06:14 PST 2016
>> org/apache/spark/SparkFunSuite$$anonfun$withFixture$2.class
>>   3982 Thu Mar 03 09:06:14 PST 2016 org/apache/spark/SparkFunSuite.class
>>
>> jar tvf ./mllib/target/spark-mllib_2.10-1.6.1-tests.jar | grep
>> MLlibTestSparkContext
>>   1447 Thu Mar 03 09:53:54 PST 2016
>> org/apache/spark/mllib/util/MLlibTestSparkContext.class
>>   1704 Thu Mar 03 09:53:54 PST 2016
>> org/apache/spark/mllib/util/MLlibTestSparkContext$class.class
>>
>> On Fri, Apr 1, 2016 at 3:07 PM, Shishir Anshuman <
>> shishiranshu...@gmail.com> wrote:
>>
>>> I got the file ALSSuite.scala and trying to run it. I have copied the
>>> file under *src/test/scala *in my project folder. When I run *sbt test*,
>>> I get errors. I have attached the screenshot of the errors. Befor *sbt
>>> test*, I am building the package with *sbt package*.
>>>
>>> Dependencies of *simple.sbt*:
>>>
>>>>
>>>>
>>>>
>>>>
>>>> *libraryDependencies ++= Seq( "org.apache.spark" % "spark-core_2.10" %
>>>> "1.6.0", "org.apache.spark" % "spark-mllib_2.10" % "1.6.0" )*
>>>
>>>
>>>
>>>
>>> On Sat, Apr 2, 2016 at 2:21 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Assuming your code is written in Scala, I would suggest using
>>>> ScalaTest.
>>>>
>>>> Please take a look at the XXSuite.scala files under mllib/
>>>>
>>>> On Fri, Apr 1, 2016 at 1:31 PM, Shishir Anshuman <
>>>> shishiranshu...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a code written in scala using Mllib. I want to perform unit
>>>>> testing it. I cant decide between Junit 4 and ScalaTest.
>>>>> I am new to Spark. Please guide me how to proceed with the testing.
>>>>>
>>>>> Thank you.
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Shishir Anshuman
When I added *"org.apache.spark" % "spark-core_2.10" % "1.6.0",  *it should
include spark-core_2.10-1.6.1-tests.jar.
Why do I need to use the jar file explicitly?

And how do I use the jars for compiling with *sbt* and running the tests on
spark?


On Sat, Apr 2, 2016 at 3:46 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> You need to include the following jars:
>
> jar tvf ./core/target/spark-core_2.10-1.6.1-tests.jar | grep SparkFunSuite
>   1787 Thu Mar 03 09:06:14 PST 2016
> org/apache/spark/SparkFunSuite$$anonfun$withFixture$1.class
>   1780 Thu Mar 03 09:06:14 PST 2016
> org/apache/spark/SparkFunSuite$$anonfun$withFixture$2.class
>   3982 Thu Mar 03 09:06:14 PST 2016 org/apache/spark/SparkFunSuite.class
>
> jar tvf ./mllib/target/spark-mllib_2.10-1.6.1-tests.jar | grep
> MLlibTestSparkContext
>   1447 Thu Mar 03 09:53:54 PST 2016
> org/apache/spark/mllib/util/MLlibTestSparkContext.class
>   1704 Thu Mar 03 09:53:54 PST 2016
> org/apache/spark/mllib/util/MLlibTestSparkContext$class.class
>
> On Fri, Apr 1, 2016 at 3:07 PM, Shishir Anshuman <
> shishiranshu...@gmail.com> wrote:
>
>> I got the file ALSSuite.scala and trying to run it. I have copied the
>> file under *src/test/scala *in my project folder. When I run *sbt test*,
>> I get errors. I have attached the screenshot of the errors. Befor *sbt
>> test*, I am building the package with *sbt package*.
>>
>> Dependencies of *simple.sbt*:
>>
>>>
>>>
>>>
>>>
>>> *libraryDependencies ++= Seq( "org.apache.spark" % "spark-core_2.10" %
>>> "1.6.0", "org.apache.spark" % "spark-mllib_2.10" % "1.6.0" )*
>>
>>
>>
>>
>> On Sat, Apr 2, 2016 at 2:21 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Assuming your code is written in Scala, I would suggest using ScalaTest.
>>>
>>> Please take a look at the XXSuite.scala files under mllib/
>>>
>>> On Fri, Apr 1, 2016 at 1:31 PM, Shishir Anshuman <
>>> shishiranshu...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a code written in scala using Mllib. I want to perform unit
>>>> testing it. I cant decide between Junit 4 and ScalaTest.
>>>> I am new to Spark. Please guide me how to proceed with the testing.
>>>>
>>>> Thank you.
>>>>
>>>
>>>
>>
>


Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Ted Yu
You need to include the following jars:

jar tvf ./core/target/spark-core_2.10-1.6.1-tests.jar | grep SparkFunSuite
  1787 Thu Mar 03 09:06:14 PST 2016
org/apache/spark/SparkFunSuite$$anonfun$withFixture$1.class
  1780 Thu Mar 03 09:06:14 PST 2016
org/apache/spark/SparkFunSuite$$anonfun$withFixture$2.class
  3982 Thu Mar 03 09:06:14 PST 2016 org/apache/spark/SparkFunSuite.class

jar tvf ./mllib/target/spark-mllib_2.10-1.6.1-tests.jar | grep
MLlibTestSparkContext
  1447 Thu Mar 03 09:53:54 PST 2016
org/apache/spark/mllib/util/MLlibTestSparkContext.class
  1704 Thu Mar 03 09:53:54 PST 2016
org/apache/spark/mllib/util/MLlibTestSparkContext$class.class
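
Once those tests jars are on the test classpath, a suite can mix in the
helpers they contain. A hedged sketch (SparkFunSuite is package-private to
org.apache.spark, at least in the 1.6 line, so the suite has to live under
that package; all names are illustrative):

package org.apache.spark.mllib.util

import org.apache.spark.SparkFunSuite
import org.apache.spark.mllib.linalg.Vectors

class MyMLlibStyleSuite extends SparkFunSuite with MLlibTestSparkContext {
  test("the mixin provides a SparkContext") {
    // MLlibTestSparkContext sets up `sc` (and a SQLContext) before the tests run.
    val data = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
    assert(data.count() === 2)
  }
}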

On Fri, Apr 1, 2016 at 3:07 PM, Shishir Anshuman <shishiranshu...@gmail.com>
wrote:

> I got the file ALSSuite.scala and trying to run it. I have copied the file
> under *src/test/scala *in my project folder. When I run *sbt test*, I get
> errors. I have attached the screenshot of the errors. Befor *sbt test*, I
> am building the package with *sbt package*.
>
> Dependencies of *simple.sbt*:
>
>>
>>
>>
>>
>> *libraryDependencies ++= Seq( "org.apache.spark" % "spark-core_2.10" %
>> "1.6.0", "org.apache.spark" % "spark-mllib_2.10" % "1.6.0" )*
>
>
>
>
> On Sat, Apr 2, 2016 at 2:21 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Assuming your code is written in Scala, I would suggest using ScalaTest.
>>
>> Please take a look at the XXSuite.scala files under mllib/
>>
>> On Fri, Apr 1, 2016 at 1:31 PM, Shishir Anshuman <
>> shishiranshu...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a code written in scala using Mllib. I want to perform unit
>>> testing it. I cant decide between Junit 4 and ScalaTest.
>>> I am new to Spark. Please guide me how to proceed with the testing.
>>>
>>> Thank you.
>>>
>>
>>
>


Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Holden Karau
You can also look at spark-testing-base which works in both Scalatest and
Junit and see if that works for your use case.

On Friday, April 1, 2016, Ted Yu <yuzhih...@gmail.com> wrote:

> Assuming your code is written in Scala, I would suggest using ScalaTest.
>
> Please take a look at the XXSuite.scala files under mllib/
>
> On Fri, Apr 1, 2016 at 1:31 PM, Shishir Anshuman <
> shishiranshu...@gmail.com
> <javascript:_e(%7B%7D,'cvml','shishiranshu...@gmail.com');>> wrote:
>
>> Hello,
>>
>> I have a code written in scala using Mllib. I want to perform unit
>> testing it. I cant decide between Junit 4 and ScalaTest.
>> I am new to Spark. Please guide me how to proceed with the testing.
>>
>> Thank you.
>>
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Ted Yu
Assuming your code is written in Scala, I would suggest using ScalaTest.

Please take a look at the XXSuite.scala files under mllib/

On Fri, Apr 1, 2016 at 1:31 PM, Shishir Anshuman <shishiranshu...@gmail.com>
wrote:

> Hello,
>
> I have a code written in scala using Mllib. I want to perform unit testing
> it. I cant decide between Junit 4 and ScalaTest.
> I am new to Spark. Please guide me how to proceed with the testing.
>
> Thank you.
>


Scala: Perform Unit Testing in spark

2016-04-01 Thread Shishir Anshuman
Hello,

I have code written in Scala using MLlib, and I want to unit test it. I
can't decide between JUnit 4 and ScalaTest.
I am new to Spark. Please guide me on how to proceed with the testing.

Thank you.


Re: Unit testing framework for Spark Jobs?

2016-03-30 Thread Lars Albertsson
> 1. it will retry on all exceptions, rather than assertions. If there's a
> bug in the test code then it manifests as a timeout. ( I think I could play
> with Suite.anExceptionThatShouldCauseAnAbort()) here.
> 2. it's timeout action is simply to rethrow the fault; I like to exec a
closure to grab more diagnostics
> 3. It doesn't support some fail-fast exception which your code can raise
to indicate that the desired state is never going to be reached, and so the
test should fail fast. Here a new exception and another entry in
anExceptionThatShouldCauseAnAbort() may be the answer. I should sit down
and play with that some more.
>
>
>>
>> This poll and sleep strategy both makes tests quick in successful
>> cases, but still robust to occasional delays. The strategy does not
>> work if you want to test for absence, e.g. ensure that a particular
>> message if filtered. You can work around it by adding another message
>> afterwards and polling for its effect before testing for absence of
>> the first. Be aware that messages can be processed out of order in
>> Spark Streaming depending on partitioning, however.
>>
>>
>> I have tested Spark applications with both strategies described above,
>> and it is straightforward to set up. Let me know if you want
>> clarifications or assistance.
>>
>> Regards,
>>
>>
>>
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> +46 70 7687109
>>
>>
>> On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
>>> Hi,
>>>
>>> What is a good unit testing framework for Spark batch/streaming jobs? I
have
>>> core spark, spark sql with dataframes and streaming api getting used.
Any
>>> good framework to cover unit tests for these APIs?
>>>
>>> Thanks!
>>>
>>>
>>>
>>> --
>>> View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Unit testing framework for Spark Jobs?

2016-03-28 Thread Steve Loughran
> work if you want to test for absence, e.g. ensure that a particular
> message if filtered. You can work around it by adding another message
> afterwards and polling for its effect before testing for absence of
> the first. Be aware that messages can be processed out of order in
> Spark Streaming depending on partitioning, however.
> 
> 
> I have tested Spark applications with both strategies described above,
> and it is straightforward to set up. Let me know if you want
> clarifications or assistance.
> 
> Regards,
> 
> 
> 
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> +46 70 7687109
> 
> 
> On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
>> Hi,
>> 
>> What is a good unit testing framework for Spark batch/streaming jobs? I have
>> core spark, spark sql with dataframes and streaming api getting used. Any
>> good framework to cover unit tests for these APIs?
>> 
>> Thanks!
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unit testing framework for Spark Jobs?

2016-03-24 Thread Shiva Ramagopal
Hi Lars,

Very pragmatic ideas around testing of Spark applications end-to-end!

-Shiva

On Fri, Mar 18, 2016 at 12:35 PM, Lars Albertsson <la...@mapflat.com> wrote:

> I would recommend against writing unit tests for Spark programs, and
> instead focus on integration tests of jobs or pipelines of several
> jobs. You can still use a unit test framework to execute them. Perhaps
> this is what you meant.
>
> You can use any of the popular unit test frameworks to drive your
> tests, e.g. JUnit, Scalatest, Specs2. I prefer Scalatest, since it
> gives you choice of TDD vs BDD, and it is also well integrated with
> IntelliJ.
>
> I would also recommend against using testing frameworks tied to a
> processing technology, such as Spark Testing Base. Although it does
> seem well crafted, and makes it easy to get started with testing,
> there are drawbacks:
>
> 1. I/O routines are not tested. Bundled test frameworks typically do
> not materialise datasets on storage, but pass them directly in memory.
> (I have not verified this for Spark Testing Base, but it looks so.)
> I/O routines are therefore not exercised, and they often hide bugs,
> e.g. related to serialisation.
>
> 2. You create a strong coupling between processing technology and your
> tests. If you decide to change processing technology (which can happen
> soon in this fast paced world...), you need to rewrite your tests.
> Therefore, during a migration process, the tests cannot detect bugs
> introduced in migration, and help you migrate fast.
>
> I recommend that you instead materialise input datasets on local disk,
> run your Spark job, which writes output datasets to local disk, read
> output from disk, and verify the results. You can still use Spark
> routines to read and write input and output datasets. A Spark context
> is expensive to create, so for speed, I would recommend reusing the
> Spark context between input generation, running the job, and reading
> output.
>
> This is easy to set up, so you don't need a dedicated framework for
> it. Just put your common boilerplate in a shared test trait or base
> class.
>
> In the future, when you want to replace your Spark job with something
> shinier, you can still use the old tests, and only replace the part
> that runs your job, giving you some protection from regression bugs.
>
>
> Testing Spark Streaming applications is a different beast, and you can
> probably not reuse much from your batch testing.
>
> For testing streaming applications, I recommend that you run your
> application inside a unit test framework, e.g, Scalatest, and have the
> test setup create a fixture that includes your input and output
> components. For example, if your streaming application consumes from
> Kafka and updates tables in Cassandra, spin up single node instances
> of Kafka and Cassandra on your local machine, and connect your
> application to them. Then feed input to a Kafka topic, and wait for
> the result to appear in Cassandra.
>
> With this setup, your application still runs in Scalatest, the tests
> run without custom setup in maven/sbt/gradle, and you can easily run
> and debug inside IntelliJ.
>
> Docker is suitable for spinning up external components. If you use
> Kafka, the Docker image spotify/kafka is useful, since it bundles
> Zookeeper.
>
> When waiting for output to appear, don't sleep for a long time and
> then check, since it will slow down your tests. Instead enter a loop
> where you poll for the results and sleep for a few milliseconds in
> between, with a long timeout (~30s) before the test fails with a
> timeout.
>
> This poll and sleep strategy both makes tests quick in successful
> cases, but still robust to occasional delays. The strategy does not
> work if you want to test for absence, e.g. ensure that a particular
> message if filtered. You can work around it by adding another message
> afterwards and polling for its effect before testing for absence of
> the first. Be aware that messages can be processed out of order in
> Spark Streaming depending on partitioning, however.
>
>
> I have tested Spark applications with both strategies described above,
> and it is straightforward to set up. Let me know if you want
> clarifications or assistance.
>
> Regards,
>
>
>
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> +46 70 7687109
>
>
> On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
> > Hi,
> >
> > What is a good unit testing framework for Spark batch/streaming jobs? I
> have
> > core spark, spark sql with dataframes and streaming api getting used. Any
> > good framework to cover unit tests for these APIs?
> >
> > Thanks!

Re: Unit testing framework for Spark Jobs?

2016-03-19 Thread Vikas Kawadia
I just wrote a blog post on Unit testing Apache Spark with py.test
https://engblog.nextdoor.com/unit-testing-apache-spark-with-py-test-3b8970dc013b

If you prefer using the py.test framework, then it might be useful.

-vikas

On Wed, Mar 2, 2016 at 10:59 AM, radoburansky <radoburan...@gmail.com>
wrote:

> I am sure you have googled this:
> https://github.com/holdenk/spark-testing-base
>
> On Wed, Mar 2, 2016 at 6:54 PM, SRK [via Apache Spark User List] <[hidden
> email]> wrote:
>
>> Hi,
>>
>> What is a good unit testing framework for Spark batch/streaming jobs? I
>> have core spark, spark sql with dataframes and streaming api getting used.
>> Any good framework to cover unit tests for these APIs?
>>
>> Thanks!
>>
>> --
>> If you reply to this email, your message will be added to the discussion
>> below:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380.html
>> To start a new topic under Apache Spark User List, email [hidden email]
>> To unsubscribe from Apache Spark User List, click here.
>>
>
>
> --
> View this message in context: Re: Unit testing framework for Spark Jobs?
> <http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380p26384.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>


Re: Unit testing framework for Spark Jobs?

2016-03-19 Thread Lars Albertsson
I would recommend against writing unit tests for Spark programs, and
instead focus on integration tests of jobs or pipelines of several
jobs. You can still use a unit test framework to execute them. Perhaps
this is what you meant.

You can use any of the popular unit test frameworks to drive your
tests, e.g. JUnit, Scalatest, Specs2. I prefer Scalatest, since it
gives you choice of TDD vs BDD, and it is also well integrated with
IntelliJ.

I would also recommend against using testing frameworks tied to a
processing technology, such as Spark Testing Base. Although it does
seem well crafted, and makes it easy to get started with testing,
there are drawbacks:

1. I/O routines are not tested. Bundled test frameworks typically do
not materialise datasets on storage, but pass them directly in memory.
(I have not verified this for Spark Testing Base, but it looks so.)
I/O routines are therefore not exercised, and they often hide bugs,
e.g. related to serialisation.

2. You create a strong coupling between processing technology and your
tests. If you decide to change processing technology (which can happen
soon in this fast paced world...), you need to rewrite your tests.
Therefore, during a migration process, the tests cannot detect bugs
introduced in migration, and help you migrate fast.

I recommend that you instead materialise input datasets on local disk,
run your Spark job, which writes output datasets to local disk, read
output from disk, and verify the results. You can still use Spark
routines to read and write input and output datasets. A Spark context
is expensive to create, so for speed, I would recommend reusing the
Spark context between input generation, running the job, and reading
output.

This is easy to set up, so you don't need a dedicated framework for
it. Just put your common boilerplate in a shared test trait or base
class.
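
A compact sketch of that setup (assuming ScalaTest and the 1.6-era
SQLContext API; the "job" below is a trivial stand-in and every name is
illustrative):

import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.scalatest.{BeforeAndAfterAll, FunSuite, Suite}

trait SharedSparkFixture extends BeforeAndAfterAll { self: Suite =>
  // One context per suite, reused for writing input, running the job and
  // reading the output back, as described above.
  @transient var sc: SparkContext = _
  @transient var sqlContext: SQLContext = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName(suiteName))
    sqlContext = new SQLContext(sc)
  }

  override def afterAll(): Unit = {
    try sc.stop() finally super.afterAll()
  }
}

class MyJobTest extends FunSuite with SharedSparkFixture {
  // Trivial stand-in for the real job: read JSON input, keep rows with id > 1,
  // write JSON output.
  def runJob(input: String, output: String): Unit =
    sqlContext.read.json(input).filter("id > 1").write.json(output)

  test("job output read back from disk matches expectations") {
    val dir = Files.createTempDirectory("spark-test").toString
    val input = s"$dir/input"
    val output = s"$dir/output"

    // 1. Materialise the input dataset on local disk.
    sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "value")
      .write.json(input)

    // 2. Run the job under test.
    runJob(input, output)

    // 3. Read the output from disk and verify the results.
    assert(sqlContext.read.json(output).count() === 1)
  }
}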

In the future, when you want to replace your Spark job with something
shinier, you can still use the old tests, and only replace the part
that runs your job, giving you some protection from regression bugs.


Testing Spark Streaming applications is a different beast, and you can
probably not reuse much from your batch testing.

For testing streaming applications, I recommend that you run your
application inside a unit test framework, e.g, Scalatest, and have the
test setup create a fixture that includes your input and output
components. For example, if your streaming application consumes from
Kafka and updates tables in Cassandra, spin up single node instances
of Kafka and Cassandra on your local machine, and connect your
application to them. Then feed input to a Kafka topic, and wait for
the result to appear in Cassandra.

With this setup, your application still runs in Scalatest, the tests
run without custom setup in maven/sbt/gradle, and you can easily run
and debug inside IntelliJ.

Docker is suitable for spinning up external components. If you use
Kafka, the Docker image spotify/kafka is useful, since it bundles
Zookeeper.

When waiting for output to appear, don't sleep for a long time and
then check, since it will slow down your tests. Instead enter a loop
where you poll for the results and sleep for a few milliseconds in
between, with a long timeout (~30s) before the test fails with a
timeout.
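
A sketch of such a poll loop using ScalaTest's Eventually (the in-memory
"output store" below is just a stand-in for whatever your job actually
writes to):

import org.scalatest.FunSuite
import org.scalatest.concurrent.Eventually
import org.scalatest.time.{Millis, Seconds, Span}

class StreamingOutputSpec extends FunSuite with Eventually {

  // Stand-in for the external output store: a value that becomes visible a
  // little while after the "job" has been fed its input.
  @volatile private var storedCount = 0L
  new Thread(new Runnable {
    override def run(): Unit = { Thread.sleep(200); storedCount = 1L }
  }).start()

  test("the expected record eventually appears in the output store") {
    // Poll every ~50 ms and give up after 30 s; in a real test the assertion
    // body would query Kafka, Cassandra or whatever the job writes to.
    eventually(timeout(Span(30, Seconds)), interval(Span(50, Millis))) {
      assert(storedCount === 1L)
    }
  }
}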

This poll and sleep strategy both makes tests quick in successful
cases, but still robust to occasional delays. The strategy does not
work if you want to test for absence, e.g. ensure that a particular
message if filtered. You can work around it by adding another message
afterwards and polling for its effect before testing for absence of
the first. Be aware that messages can be processed out of order in
Spark Streaming depending on partitioning, however.


I have tested Spark applications with both strategies described above,
and it is straightforward to set up. Let me know if you want
clarifications or assistance.

Regards,



Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109


On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
> Hi,
>
> What is a good unit testing framework for Spark batch/streaming jobs? I have
> core spark, spark sql with dataframes and streaming api getting used. Any
> good framework to cover unit tests for these APIs?
>
> Thanks!
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread radoburansky
I am sure you have googled this:
https://github.com/holdenk/spark-testing-base

On Wed, Mar 2, 2016 at 6:54 PM, SRK [via Apache Spark User List] <
ml-node+s1001560n2638...@n3.nabble.com> wrote:

> Hi,
>
> What is a good unit testing framework for Spark batch/streaming jobs? I
> have core spark, spark sql with dataframes and streaming api getting used.
> Any good framework to cover unit tests for these APIs?
>
> Thanks!
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380.html
> To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here.
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380p26384.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread Ricardo Paiva
I use plain old JUnit.

Spark batch example:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.junit.AfterClass
import org.junit.Assert.assertEquals
import org.junit.BeforeClass
import org.junit.Test

object TestMyCode {

  var sc: SparkContext = null

  @BeforeClass
  def setup(): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("Test Spark")
      .setMaster("local[*]")
    sc = new SparkContext(sparkConf)
  }

  @AfterClass
  def cleanup(): Unit = {
    sc.stop()
  }
}

class TestMyCode {

  @Test
  def testSaveNumbersToExtractor(): Unit = {
    val sql = new SQLContext(TestMyCode.sc)
    import sql.implicits._

    val numList = List(1, 2, 3, 4, 5)
    val df = TestMyCode.sc.parallelize(numList).toDF
    val numDf = df.select(df("_1").alias("num"))
    assertEquals(5, numDf.count)
  }

}

On Wed, Mar 2, 2016 at 2:54 PM, SRK [via Apache Spark User List] <
ml-node+s1001560n26380...@n3.nabble.com> wrote:

> Hi,
>
> What is a good unit testing framework for Spark batch/streaming jobs? I
> have core spark, spark sql with dataframes and streaming api getting used.
> Any good framework to cover unit tests for these APIs?
>
> Thanks!
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380.html
> To start a new topic under Apache Spark User List, email
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Apache Spark User List, click here.
>



-- 
Ricardo Paiva
Big Data / Semântica
2483-6432
*globo.com* <http://www.globo.com>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380p26383.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread Silvio Fiorito
Please check out the following for some good resources:

https://github.com/holdenk/spark-testing-base


https://spark-summit.org/east-2016/events/beyond-collect-and-parallelize-for-tests/





On 3/2/16, 12:54 PM, "SRK" <swethakasire...@gmail.com> wrote:

>Hi,
>
>What is a good unit testing framework for Spark batch/streaming jobs? I have
>core spark, spark sql with dataframes and streaming api getting used. Any
>good framework to cover unit tests for these APIs?
>
>Thanks!
>
>
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread Yin Yang
Cycling prior bits:

http://search-hadoop.com/m/q3RTto4sby1Cd2rt=Re+Unit+test+with+sqlContext

On Wed, Mar 2, 2016 at 9:54 AM, SRK <swethakasire...@gmail.com> wrote:

> Hi,
>
> What is a good unit testing framework for Spark batch/streaming jobs? I
> have
> core spark, spark sql with dataframes and streaming api getting used. Any
> good framework to cover unit tests for these APIs?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Unit testing framework for Spark Jobs?

2016-03-02 Thread SRK
Hi,

What is a good unit testing framework for Spark batch/streaming jobs? I use
core Spark, Spark SQL with DataFrames, and the Streaming API. Is there a
good framework that covers unit tests for these APIs?

Thanks!






Re: Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Priya Ch
I already tried setting spark.driver.allowMultipleContexts to true, but it
was not successful. I think the problem is that we have different test suites
which of course run in parallel. How do we stop the SparkContext after each
test suite and start it in the next test suite, or is there any way to share
a SparkContext across all test suites?

On Thu, Nov 5, 2015 at 12:36 AM, Bryan Jeffrey 
wrote:

> Priya,
>
> If you're trying to get unit tests running local spark contexts, you can
> just set up your spark context with 'spark.driver.allowMultipleContexts'
> set to true.
>
> Example:
>
> def create(seconds : Int, appName : String): StreamingContext = {
>   val master = "local[*]"
>   val conf = new SparkConf().set("spark.driver.allowMultipleContexts",
> "true").setAppName(appName).setMaster(master)
>   new StreamingContext(conf, Seconds(seconds))
> }
>
> Regards,
>
> Bryan Jeffrey
>
>
> On Wed, Nov 4, 2015 at 9:49 AM, Ted Yu  wrote:
>
>> Are you trying to speed up tests where each test suite uses single 
>> SparkContext
>> ?
>>
>> You may want to read:
>> https://issues.apache.org/jira/browse/SPARK-2243
>>
>> Cheers
>>
>> On Wed, Nov 4, 2015 at 4:59 AM, Priya Ch 
>> wrote:
>>
>>> Hello All,
>>>
>>>   How to use multiple Spark Context in executing multiple test suite of
>>> spark code ???
>>> Can some one throw light on this ?
>>>
>>
>>
>


Re: Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Bryan Jeffrey
Priya,

If you're trying to get unit tests running local spark contexts, you can
just set up your spark context with 'spark.driver.allowMultipleContexts'
set to true.

Example:

def create(seconds : Int, appName : String): StreamingContext = {
  val master = "local[*]"
  val conf = new SparkConf().set("spark.driver.allowMultipleContexts",
"true").setAppName(appName).setMaster(master)
  new StreamingContext(conf, Seconds(seconds))
}

Regards,

Bryan Jeffrey


On Wed, Nov 4, 2015 at 9:49 AM, Ted Yu  wrote:

> Are you trying to speed up tests where each test suite uses single 
> SparkContext
> ?
>
> You may want to read:
> https://issues.apache.org/jira/browse/SPARK-2243
>
> Cheers
>
> On Wed, Nov 4, 2015 at 4:59 AM, Priya Ch 
> wrote:
>
>> Hello All,
>>
>>   How to use multiple Spark Context in executing multiple test suite of
>> spark code ???
>> Can some one throw light on this ?
>>
>
>


Re: Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Ted Yu
Are you trying to speed up tests where each test suite uses a single SparkContext?

You may want to read:
https://issues.apache.org/jira/browse/SPARK-2243
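
If the goal is to reuse one SparkContext across all the tests in a suite (rather than
spinning up several), one rough sketch, assuming ScalaTest's BeforeAndAfterAll, is
something like:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class SharedContextSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var _sc: SparkContext = _
  def sc: SparkContext = _sc

  override def beforeAll(): Unit = {
    super.beforeAll()
    // one context for the whole suite; stopped once all tests have run
    _sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("SharedContextSuite"))
  }

  override def afterAll(): Unit = {
    try _sc.stop() finally super.afterAll()
  }

  test("uses the shared context") {
    assert(sc.parallelize(1 to 10).count() === 10)
  }
}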

Cheers

On Wed, Nov 4, 2015 at 4:59 AM, Priya Ch 
wrote:

> Hello All,
>
>   How to use multiple Spark Context in executing multiple test suite of
> spark code ???
> Can some one throw light on this ?
>


Allow multiple SparkContexts in Unit Testing

2015-11-04 Thread Priya Ch
Hello All,

  How can we use multiple SparkContexts when executing multiple test suites of
Spark code?
Can someone throw light on this?


Re: Mock Cassandra DB Connection in Unit Testing

2015-10-29 Thread Priya Ch
One more question: if I have a function which takes an RDD as a parameter, how
do we mock an RDD?

On Thu, Oct 29, 2015 at 5:20 PM, Priya Ch 
wrote:

> How do we do it for Cassandra..can we use the same Mocking ?
> EmbeddedCassandra Server is available with CassandraUnit. Can this be used
> in Spark Code as well ? I mean with Scala code ?
>
> On Thu, Oct 29, 2015 at 5:03 PM, Василец Дмитрий  > wrote:
>
>> there is example how i mock mysql
>> import org.scalamock.scalatest.MockFactory
>>  val connectionMock = mock[java.sql.Connection]
>>  val statementMock = mock[PreparedStatement]
>> (conMock.prepareStatement(_:
>> String)).expects(sql.toString).returning(statementMock)
>> (statementMock.executeUpdate _).expects()
>>
>>
>> On Thu, Oct 29, 2015 at 12:27 PM, Priya Ch 
>> wrote:
>>
>>> Hi All,
>>>
>>>   For my  Spark Streaming code, which writes the results to Cassandra
>>> DB, I need to write Unit test cases. what are the available test frameworks
>>> to mock the connection to Cassandra DB ?
>>>
>>
>>
>


Re: Mock Cassandra DB Connection in Unit Testing

2015-10-29 Thread Priya Ch
How do we do it for Cassandra? Can we use the same mocking? An embedded
Cassandra server is available with CassandraUnit. Can this be used in Spark
code as well, i.e. with Scala code?

On Thu, Oct 29, 2015 at 5:03 PM, Василец Дмитрий 
wrote:

> there is example how i mock mysql
> import org.scalamock.scalatest.MockFactory
>  val connectionMock = mock[java.sql.Connection]
>  val statementMock = mock[PreparedStatement]
> (conMock.prepareStatement(_:
> String)).expects(sql.toString).returning(statementMock)
> (statementMock.executeUpdate _).expects()
>
>
> On Thu, Oct 29, 2015 at 12:27 PM, Priya Ch 
> wrote:
>
>> Hi All,
>>
>>   For my  Spark Streaming code, which writes the results to Cassandra DB,
>> I need to write Unit test cases. what are the available test frameworks to
>> mock the connection to Cassandra DB ?
>>
>
>


Mock Cassandra DB Connection in Unit Testing

2015-10-29 Thread Priya Ch
Hi All,

  For my Spark Streaming code, which writes the results to a Cassandra DB, I
need to write unit test cases. What are the available test frameworks to
mock the connection to the Cassandra DB?


Re: Mock Cassandra DB Connection in Unit Testing

2015-10-29 Thread Василец Дмитрий
Here is an example of how I mock MySQL (inside a ScalaTest suite that mixes in MockFactory):

import java.sql.PreparedStatement
import org.scalamock.scalatest.MockFactory

val connectionMock = mock[java.sql.Connection]
val statementMock = mock[PreparedStatement]
(connectionMock.prepareStatement(_: String)).expects(sql.toString).returning(statementMock)
(statementMock.executeUpdate _).expects()


On Thu, Oct 29, 2015 at 12:27 PM, Priya Ch 
wrote:

> Hi All,
>
>   For my  Spark Streaming code, which writes the results to Cassandra DB,
> I need to write Unit test cases. what are the available test frameworks to
> mock the connection to Cassandra DB ?
>


Re: Mock Cassandra DB Connection in Unit Testing

2015-10-29 Thread Adrian Tanase
Does it need to be a mock? Can you use sc.parallelize(data)?
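
That is, instead of mocking the RDD, build a real (tiny) one in the test. A rough
sketch, assuming a local SparkContext named sc is available in the test and that the
function under test takes an RDD (countNonEmpty below is just a made-up example):

import org.apache.spark.rdd.RDD

def countNonEmpty(lines: RDD[String]): Long = lines.filter(_.nonEmpty).count()

val input = sc.parallelize(Seq("a", "", "b"))
assert(countNonEmpty(input) == 2)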

From: Priya Ch
Date: Thursday, October 29, 2015 at 2:00 PM
To: Василец Дмитрий
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>", 
"spark-connector-u...@lists.datastax.com<mailto:spark-connector-u...@lists.datastax.com>"
Subject: Re: Mock Cassandra DB Connection in Unit Testing

One more question, if i have a function which takes RDD as a parameter, how do 
we mock an RDD ??

On Thu, Oct 29, 2015 at 5:20 PM, Priya Ch 
<learnings.chitt...@gmail.com> wrote:
How do we do it for Cassandra..can we use the same Mocking ? EmbeddedCassandra 
Server is available with CassandraUnit. Can this be used in Spark Code as well 
? I mean with Scala code ?

On Thu, Oct 29, 2015 at 5:03 PM, Василец Дмитрий 
<pronix.serv...@gmail.com> wrote:
there is example how i mock mysql
import org.scalamock.scalatest.MockFactory
 val connectionMock = mock[java.sql.Connection]
 val statementMock = mock[PreparedStatement]
(conMock.prepareStatement(_: 
String)).expects(sql.toString).returning(statementMock)
(statementMock.executeUpdate _).expects()


On Thu, Oct 29, 2015 at 12:27 PM, Priya Ch 
<learnings.chitt...@gmail.com> wrote:
Hi All,

  For my  Spark Streaming code, which writes the results to Cassandra DB, I 
need to write Unit test cases. what are the available test frameworks to mock 
the connection to Cassandra DB ?





Re: What are best practices from Unit Testing Spark Code?

2015-09-26 Thread ehrlichja
Check out the spark-testing-base project.  I haven't tried it yet, looks good
though:

http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/






Re: Unit Testing

2015-08-13 Thread jay vyas
yes there certainly is, so long as eclipse has the right plugins and so on
to run scala programs.  You're really asking two questions: (1) Can I use a
modern IDE to develop spark apps and (2) can we easily  unit test spark
streaming apps.

the answer is yes to both...

Regarding your IDE:

I like to use intellij with the set plugins for scala development.  It
allows you to run everything from inside the IDE.  I've written up setup
instructions here:
http://jayunit100.blogspot.com/2014/07/set-up-spark-application-devleopment.html

Now, regarding local unit testing:

As an example, here is a unit test for confirming that spark can write to
cassandra.

https://github.com/jayunit100/SparkStreamingApps/blob/master/src/test/scala/TestTwitterCassandraETL.scala

The key here is to just set your local master in the unit test, like so

new SparkConf().setMaster("local[2]")

"local[2]" guarantees that you'll have a producer and a consumer, so that you
don't get a starvation scenario.


On Wed, Aug 12, 2015 at 7:31 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:

 Is there a way to run spark streaming methods in standalone eclipse
 environment to test out the functionality?




-- 
jay vyas


Re: Unit Testing

2015-08-13 Thread Burak Yavuz
I would recommend this spark package for your unit testing needs (
http://spark-packages.org/package/holdenk/spark-testing-base).
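
For streaming code specifically, the package also has a StreamingSuiteBase helper;
roughly like this (a sketch from memory, so check the project's wiki for the exact API):

import com.holdenkarau.spark.testing.StreamingSuiteBase
import org.apache.spark.streaming.dstream.DStream
import org.scalatest.FunSuite

class FilterStreamSuite extends FunSuite with StreamingSuiteBase {
  test("drops empty lines") {
    // each inner List is one batch of input / expected output
    val input = List(List("a", "", "b"), List("", "c"))
    val expected = List(List("a", "b"), List("c"))
    def dropEmpty(lines: DStream[String]): DStream[String] = lines.filter(_.nonEmpty)
    testOperation(input, dropEmpty _, expected, ordered = false)
  }
}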

Best,
Burak

On Thu, Aug 13, 2015 at 5:51 AM, jay vyas jayunit100.apa...@gmail.com
wrote:

 yes there certainly is, so long as eclipse has the right plugins and so on
 to run scala programs.  You're really asking two questions: (1) Can I use a
 modern IDE to develop spark apps and (2) can we easily  unit test spark
 streaming apps.

 the answer is yes to both...

 Regarding your IDE:

 I like to use intellij with the set plugins for scala development.  It
 allows you to run everything from inside the IDE.  I've written up setup
 instructions here:
 http://jayunit100.blogspot.com/2014/07/set-up-spark-application-devleopment.html

 Now, regarding local unit testing:

 As an example, here is a unit test for confirming that spark can write to
 cassandra.


 https://github.com/jayunit100/SparkStreamingApps/blob/master/src/test/scala/TestTwitterCassandraETL.scala

 The key here is to just set your local master in the unit test, like so

 sc.setMaster(local[2])

 local[2] gaurantees that you'll have a producer and a consumer, so that
 you don't get a starvation scenario.


 On Wed, Aug 12, 2015 at 7:31 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 Is there a way to run spark streaming methods in standalone eclipse
 environment to test out the functionality?




 --
 jay vyas



Unit Testing

2015-08-12 Thread Mohit Anchlia
Is there a way to run spark streaming methods in standalone eclipse
environment to test out the functionality?


Unit Testing Spark Transformations/Actions

2015-06-16 Thread Mark Tse
Hi there,

I am looking to use Mockito to mock out some functionality while unit testing a 
Spark application.

I currently have code that happily runs on a cluster, but fails when I try to 
run unit tests against it, throwing a SparkException:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 
1, localhost): java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1.f$14 of type 
org.apache.spark.api.java.function.VoidFunction in instance of 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreach$1
at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2089)

(Full error/stacktrace and description on SO: 
http://stackoverflow.com/q/30871109/2687324).

Has anyone experienced this error before while unit testing?

Thanks,
Mark


Re: Spark Unit Testing

2015-04-21 Thread James King
Hi Emre, thanks for the help will have a look. Cheers!

On Tue, Apr 21, 2015 at 1:46 PM, Emre Sevinc emre.sev...@gmail.com wrote:

 Hello James,

 Did you check the following resources:

  -
 https://github.com/apache/spark/tree/master/streaming/src/test/java/org/apache/spark/streaming

  -
 http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs

 --
 Emre Sevinç
 http://www.bigindustries.be/


 On Tue, Apr 21, 2015 at 1:26 PM, James King jakwebin...@gmail.com wrote:

 I'm trying to write some unit tests for my spark code.

 I need to pass a JavaPairDStreamString, String to my spark class.

 Is there a way to create a JavaPairDStream using Java API?

 Also is there a good resource that covers an approach (or approaches) for
 unit testing using Java.

 Regards
 jk




 --
 Emre Sevinc



Re: Spark Unit Testing

2015-04-21 Thread Emre Sevinc
Hello James,

Did you check the following resources:

 -
https://github.com/apache/spark/tree/master/streaming/src/test/java/org/apache/spark/streaming

 -
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs

--
Emre Sevinç
http://www.bigindustries.be/


On Tue, Apr 21, 2015 at 1:26 PM, James King jakwebin...@gmail.com wrote:

 I'm trying to write some unit tests for my spark code.

 I need to pass a JavaPairDStreamString, String to my spark class.

 Is there a way to create a JavaPairDStream using Java API?

 Also is there a good resource that covers an approach (or approaches) for
 unit testing using Java.

 Regards
 jk




-- 
Emre Sevinc


Spark Unit Testing

2015-04-21 Thread James King
I'm trying to write some unit tests for my spark code.

I need to pass a JavaPairDStreamString, String to my spark class.

Is there a way to create a JavaPairDStream using Java API?

Also is there a good resource that covers an approach (or approaches) for
unit testing using Java.

Regards
jk


Re: Unit testing with HiveContext

2015-04-09 Thread Daniel Siegmann
Thanks Ted, using HiveTest as my context worked. It still left a metastore
directory and Derby log in my current working directory though; I manually
added a shutdown hook to delete them and all was well.
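
For reference, the shutdown hook is nothing fancy; roughly this (a sketch, assuming
commons-io is on the test classpath and that Derby dropped its files in the current
working directory):

import java.io.File
import org.apache.commons.io.FileUtils

sys.addShutdownHook {
  // clean up the Derby metastore and log that the HiveContext leaves behind
  FileUtils.deleteQuietly(new File("metastore_db"))
  FileUtils.deleteQuietly(new File("derby.log"))
}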

On Wed, Apr 8, 2015 at 4:33 PM, Ted Yu yuzhih...@gmail.com wrote:

 Please take a look at
 sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala :

   protected def configure(): Unit = {
 warehousePath.delete()
 metastorePath.delete()
 setConf(javax.jdo.option.ConnectionURL,
   sjdbc:derby:;databaseName=$metastorePath;create=true)
 setConf(hive.metastore.warehouse.dir, warehousePath.toString)
   }

 Cheers

 On Wed, Apr 8, 2015 at 1:07 PM, Daniel Siegmann 
 daniel.siegm...@teamaol.com wrote:

 I am trying to unit test some code which takes an existing HiveContext
 and uses it to execute a CREATE TABLE query (among other things).
 Unfortunately I've run into some hurdles trying to unit test this, and I'm
 wondering if anyone has a good approach.

 The metastore DB is automatically created in the local directory, but it
 doesn't seem to be cleaned up afterward. Is there any way to get Spark to
 clean this up when the context is stopped? Or can I point this to some
 other location, such as a temp directory?

 Trying to create a table fails because it is using the default warehouse
 directory (/user/hive/warehouse). Is there some way to change this without
 hard-coding a directory in a hive-site.xml; again, I'd prefer to point it
 to a temp directory so it will be automatically removed. I tried a couple
 of things that didn't work:

- hiveContext.sql(SET hive.metastore.warehouse.dir=/tmp/dir/xyz)
- hiveContext.setConf(hive.metastore.warehouse.dir, /tmp/dir/xyz)

 Any advice from those who have been here before would be appreciated.





Unit testing with HiveContext

2015-04-08 Thread Daniel Siegmann
I am trying to unit test some code which takes an existing HiveContext and
uses it to execute a CREATE TABLE query (among other things). Unfortunately
I've run into some hurdles trying to unit test this, and I'm wondering if
anyone has a good approach.

The metastore DB is automatically created in the local directory, but it
doesn't seem to be cleaned up afterward. Is there any way to get Spark to
clean this up when the context is stopped? Or can I point this to some
other location, such as a temp directory?

Trying to create a table fails because it is using the default warehouse
directory (/user/hive/warehouse). Is there some way to change this without
hard-coding a directory in a hive-site.xml; again, I'd prefer to point it
to a temp directory so it will be automatically removed. I tried a couple
of things that didn't work:

   - hiveContext.sql("SET hive.metastore.warehouse.dir=/tmp/dir/xyz")
   - hiveContext.setConf("hive.metastore.warehouse.dir", "/tmp/dir/xyz")

Any advice from those who have been here before would be appreciated.


Re: Unit testing with HiveContext

2015-04-08 Thread Ted Yu
Please take a look at
sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala :

  protected def configure(): Unit = {
    warehousePath.delete()
    metastorePath.delete()
    setConf("javax.jdo.option.ConnectionURL",
      s"jdbc:derby:;databaseName=$metastorePath;create=true")
    setConf("hive.metastore.warehouse.dir", warehousePath.toString)
  }

Cheers

On Wed, Apr 8, 2015 at 1:07 PM, Daniel Siegmann daniel.siegm...@teamaol.com
 wrote:

 I am trying to unit test some code which takes an existing HiveContext and
 uses it to execute a CREATE TABLE query (among other things). Unfortunately
 I've run into some hurdles trying to unit test this, and I'm wondering if
 anyone has a good approach.

 The metastore DB is automatically created in the local directory, but it
 doesn't seem to be cleaned up afterward. Is there any way to get Spark to
 clean this up when the context is stopped? Or can I point this to some
 other location, such as a temp directory?

 Trying to create a table fails because it is using the default warehouse
 directory (/user/hive/warehouse). Is there some way to change this without
 hard-coding a directory in a hive-site.xml; again, I'd prefer to point it
 to a temp directory so it will be automatically removed. I tried a couple
 of things that didn't work:

- hiveContext.sql(SET hive.metastore.warehouse.dir=/tmp/dir/xyz)
- hiveContext.setConf(hive.metastore.warehouse.dir, /tmp/dir/xyz)

 Any advice from those who have been here before would be appreciated.



Unit testing and Spark Streaming

2014-12-12 Thread Eric Loots
Hi,

I’ve started my first experiments with Spark Streaming and started with setting 
up an environment using ScalaTest to do unit testing. Poked around on this 
mailing list and googled the topic.

One of the things I wanted to be able to do is to use Scala Sequences as data 
source in the tests (instead of using files for example). For this, queueStream 
on a StreamingContext came in handy.

I now have a setup that allows me to run WordSpec style tests like in:

class StreamTests extends StreamingContextBaseSpec("Some-tests") with Matchers
    with WordsCountsTestData {

  "Running word count" should {
    "produce the correct word counts for a non-empty list of words" in {

      val streamingData = injectData(data1)
      val wordCountsStream = WordCounter.wordCounter(streamingData)
      val wordCounts = startStreamAndExtractResult(wordCountsStream, ssc)
      val sliceSet = wordCounts.toSet

      wordCounts.toSet shouldBe wordCounts1
    }

    "return count = 1 for the empty string" in {

      val streamingData: InputDStream[String] = injectData(data2)
      val wordCountsStream: DStream[(String, Int)] =
        WordCounter.wordCounter(streamingData)
      val wordCounts: Seq[(String, Int)] =
        startStreamAndExtractResult(wordCountsStream, ssc)

      wordCounts.toSet shouldBe wordCounts2
    }

    "return an empty result for an empty list of words" in {

      val streamingData = injectData(data3)
      val wordCountsStream = WordCounter.wordCounter(streamingData)
      val wordCounts = startStreamAndExtractResult(wordCountsStream, ssc)

      wordCounts.toSet shouldBe wordCounts3
    }

  }

  "Running word count with filtering out words with single occurrence" should {
    "produce the correct word counts for a non-empty list of words" in {

      val streamingData = injectData(data1)
      val wordCountsStream = WordCounter.wordCountOverOne(streamingData)
      val wordCounts = startStreamAndExtractResult(wordCountsStream, ssc)

      wordCounts.toSet shouldBe wordCounts1.filter(_._2 > 1)
    }
  }
}

where WordsCountsTestData (added at the end of this message) is a trait that 
contains the test data and the correct results. 

The two methods under test in the above test code (WordCounter.wordCounter and 
WordCounter.wordCountOverOne) are:

object WordCounter {
  def wordCounter(input: InputDStream[String]): DStream[(String, Int)] = {
    val pairs = input.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts
  }

  def wordCountOverOne(input: InputDStream[String]): DStream[(String, Int)] = {
    val pairs = input.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts filter (_._2 > 1)
  }
}

StreamingContextBaseSpec contains the actual test helper methods such as 
injectData and startStreamAndExtractResult.

package spark.testing

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Milliseconds, StreamingContext, Time}
import org.scalatest.{BeforeAndAfter, WordSpec}

import scala.collection.mutable.Queue
import scala.reflect.ClassTag

class StreamingContextBaseSpec(name: String, silenceSpark: Boolean = true)
    extends WordSpec with BeforeAndAfter {

  val BatchDuration = 10  // milliseconds
  val DeltaTBefore  = 20 * BatchDuration
  val DeltaTAfter   = 10 * BatchDuration

  def injectData[T: ClassTag](data: Seq[T]): InputDStream[T] = {
    val dataAsRDD = ssc.sparkContext.parallelize(data)
    val dataAsRDDOnQueue = Queue(dataAsRDD)
    ssc.queueStream(dataAsRDDOnQueue, oneAtATime = false)
  }

  def startStreamAndExtractResult[T: ClassTag](stream: DStream[T], ssc: StreamingContext): Seq[T] = {
    stream.print()
    println(s"~~~ starting execution context $ssc")
    val sTime = System.currentTimeMillis()
    ssc.start()
    val startWindow = new Time(sTime - DeltaTBefore)
    val endWindow = new Time(sTime + DeltaTAfter)
    val sliceRDDs = stream.slice(startWindow, endWindow)
    sliceRDDs.map(rdd => rdd.collect()).flatMap(data => data.toVector)
  }

  var ssc: StreamingContext = _

  before {
    System.clearProperty("spark.driver.port")
    System.clearProperty("spark.driver.host")
    if (silenceSpark) SparkUtil.silenceSpark()
    val conf = new SparkConf().setMaster("local").setAppName(name)
    ssc = new StreamingContext(conf, Milliseconds(BatchDuration))
  }

  after {
    println(s"~~~ stopping execution context $ssc")
    System.clearProperty("spark.driver.port")
    System.clearProperty("spark.driver.host")
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    ssc.awaitTermination()
    ssc = null
  }
}

So far for the prelude, now my questions:
Is this a good way to perform this kind of testing ?
Are there more efficient ways to run this kind of testing ?
To reduce the test run time, I’m running the stream with a batch interval of 
only 10ms and a window that extends to 100ms (This seems to work fine as far as 
I can see. When the batch interval

Re: Unit testing and Spark Streaming

2014-12-12 Thread Emre Sevinc
On Fri, Dec 12, 2014 at 2:17 PM, Eric Loots eric.lo...@gmail.com wrote:
 How can the log level in test mode be reduced (or extended when needed) ?

Hello Eric,

The following might be helpful for reducing the log messages during unit
testing:

 http://stackoverflow.com/a/2736/236007
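
In code that usually boils down to something like the following (a sketch using log4j
directly; a log4j.properties on the test classpath achieves the same thing):

import org.apache.log4j.{Level, Logger}

Logger.getLogger("org").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)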

--
Emre Sevinç
https://be.linkedin.com/in/emresevinc


Re: Unit testing and Spark Streaming

2014-12-12 Thread Jay Vyas
https://github.com/jayunit100/SparkStreamingCassandraDemo
 
On this note, I've built a framework which is mostly pure so that functional 
unit tests can be run composing mock data for Twitter statuses, with just 
regular junit... That might be relevant also.

I think at some point we should come up with a robust test driven  framework 
for building stream apps... And the idea of Scala test with the injection and 
comparison you did might be a good start.

Thanks for starting this dialogue!

 On Dec 12, 2014, at 9:18 AM, Emre Sevinc emre.sev...@gmail.com wrote:
 
 On Fri, Dec 12, 2014 at 2:17 PM, Eric Loots eric.lo...@gmail.com wrote:
  How can the log level in test mode be reduced (or extended when needed) ?
 
 Hello Eric,
 
 The following might be helpful for reducing the log messages during unit 
 testing:
 
  http://stackoverflow.com/a/2736/236007
 
 --
 Emre Sevinç
 https://be.linkedin.com/in/emresevinc
 


embedded spark for unit testing..

2014-11-09 Thread Kevin Burton
What’s the best way to embed spark to run local mode in unit tests?

Some of our jobs are mildly complex and I want to keep verifying that they
work, including during schema changes / migration.

I think for some of this I would just run local mode, read from a few text
files via resources, and then write to /tmp …
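
As a rough sketch of that shape (the resource name and transformation are made up;
assuming a plain local-mode context):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("embedded-test"))
try {
  // read a small fixture that ships with the test sources
  val input = getClass.getResource("/sample-input.txt").getPath
  val out = java.nio.file.Files.createTempDirectory("spark-test-out").resolve("result").toString
  sc.textFile(input).map(_.toUpperCase).saveAsTextFile(out)
  assert(sc.textFile(out).count() > 0)
} finally {
  sc.stop()
}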

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com


Re: embedded spark for unit testing..

2014-11-09 Thread DB Tsai
You can write unittest with a local spark context by mixing
LocalSparkContext trait.

See
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala

https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/util/LocalSparkContext.scala

as an example.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, Nov 9, 2014 at 9:12 PM, Kevin Burton bur...@spinn3r.com wrote:
 What’s the best way to embed spark to run local mode in unit tests?

 Some or our jobs are mildly complex and I want to keep verifying that they
 work including during schema changes / migration.

 I think for some of this I would just run local mode, read from a few text
 files via resources, and then write to /tmp …

 --

 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile




Re: Unit Testing (JUnit) with Spark

2014-10-29 Thread touchdown
add these to your dependencies:

"io.netty" % "netty" % "3.6.6.Final"

and append exclude("io.netty", "netty-all") to the end of the spark and hadoop dependencies

reference: https://spark-project.atlassian.net/browse/SPARK-1138

I am using Spark 1.1 so the akka issue is already fixed
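
In sbt that works out to roughly the following (a sketch; the artifact versions here are
just placeholders for whatever your build already uses):

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.1.0" % "provided" exclude("io.netty", "netty-all"),
  "org.apache.hadoop"  % "hadoop-client" % "2.4.0" % "provided" exclude("io.netty", "netty-all"),
  "io.netty"           % "netty"         % "3.6.6.Final" % "test"
)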






Unit testing: Mocking out Spark classes

2014-10-16 Thread Saket Kumar
Hello all,

I am trying to unit test the classes involved in my Spark job. I am trying to
mock out the Spark classes (like SparkContext and Broadcast) so that I can
unit test my classes in isolation. However I have realised that these are
classes instead of traits. My first question is why?

It is quite hard to mock out classes using ScalaTest+ScalaMock as the
classes which need to be mocked out need to be annotated with
org.scalamock.annotation.mock as per
http://www.scalatest.org/user_guide/testing_with_mock_objects#generatedMocks.
I cannot do that in my case as I am trying to mock out the spark classes.

Am I missing something? Is there a better way to do this?

val sparkContext = mock[SparkInteraction]
val trainingDatasetLoader = mock[DatasetLoader]
val broadcastTrainingDatasetLoader = mock[Broadcast[DatasetLoader]]
def transformerFunction(source: Iterator[(HubClassificationData,
String)]): Iterator[String] = {
  source.map(_._2)
}
val classificationResultsRDD = mock[RDD[String]]
val classificationResults = Array(,,)
val inputRDD = mock[RDD[(HubClassificationData, String)]]

inSequence{
  inAnyOrder{
(sparkContext.broadcast[DatasetLoader]
_).expects(trainingDatasetLoader).returns(broadcastTrainingDatasetLoader)
  }
}

val sparkInvoker = new SparkJobInvoker(sparkContext,
trainingDatasetLoader)

when(inputRDD.mapPartitions(transformerFunction)).thenReturn(classificationResultsRDD)
sparkInvoker.invoke(inputRDD)

Thanks,
Saket


Re: Unit testing: Mocking out Spark classes

2014-10-16 Thread Daniel Siegmann
Mocking these things is difficult; executing your unit tests in a local
Spark context is preferred, as recommended in the programming guide
http://spark.apache.org/docs/latest/programming-guide.html#unit-testing.
I know this may not technically be a unit test, but it is hopefully close
enough.

You can load your test data using SparkContext.parallelize and retrieve the
data (for verification) using RDD.collect.
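
In other words, roughly (a sketch, assuming a local context created just for the test):

import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "unit-test")
try {
  val input = sc.parallelize(Seq(1, 2, 3, 4))
  // the transformation under test would go here; this one just keeps the even numbers
  val result = input.filter(_ % 2 == 0).collect()
  assert(result.toSet == Set(2, 4))
} finally {
  sc.stop()
}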

On Thu, Oct 16, 2014 at 9:07 AM, Saket Kumar saket.ku...@bgch.co.uk wrote:

 Hello all,

 I am trying to unit test my classes involved my Spark job. I am trying to
 mock out the Spark classes (like SparkContext and Broadcast) so that I can
 unit test my classes in isolation. However I have realised that these are
 classes instead of traits. My first question is why?

 It is quite hard to mock out classes using ScalaTest+ScalaMock as the
 classes which need to be mocked out need to be annotated with
 org.scalamock.annotation.mock as per
 http://www.scalatest.org/user_guide/testing_with_mock_objects#generatedMocks.
 I cannot do that in my case as I am trying to mock out the spark classes.

 Am I missing something? Is there a better way to do this?

 val sparkContext = mock[SparkInteraction]
 val trainingDatasetLoader = mock[DatasetLoader]
 val broadcastTrainingDatasetLoader = mock[Broadcast[DatasetLoader]]
 def transformerFunction(source: Iterator[(HubClassificationData,
 String)]): Iterator[String] = {
   source.map(_._2)
 }
 val classificationResultsRDD = mock[RDD[String]]
 val classificationResults = Array(,,)
 val inputRDD = mock[RDD[(HubClassificationData, String)]]

 inSequence{
   inAnyOrder{
 (sparkContext.broadcast[DatasetLoader]
 _).expects(trainingDatasetLoader).returns(broadcastTrainingDatasetLoader)
   }
 }

 val sparkInvoker = new SparkJobInvoker(sparkContext,
 trainingDatasetLoader)

 when(inputRDD.mapPartitions(transformerFunction)).thenReturn(classificationResultsRDD)
 sparkInvoker.invoke(inputRDD)

 Thanks,
 Saket




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Unit testing jar request

2014-10-15 Thread Jean Charles Jabouille

Hi,

we are Spark users and we use some of Spark's test classes for our own application
unit tests. We use LocalSparkContext and SharedSparkContext. But these classes 
are not included in the spark-core library. This is a good option as it's not a 
good idea to include test classes in the runtime jar...

Anyway, do you think it would be possible for the Spark team to publish the
test jar of the spark-core module to the Maven repository?

If I understand correctly, it's just a plugin to add to the spark/core/pom.xml file, as
described here:
http://maven.apache.org/plugins/maven-jar-plugin/examples/create-test-jar.html

Thanks,

jean charles




Unit Testing (JUnit) with Spark

2014-07-29 Thread soumick86
Is there any example out there for unit testing a Spark application in Java?
Even a trivial application like word count would be very helpful. I am very
new to this and I am struggling to understand how I can use JavaSparkContext
with JUnit.





Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread jay vyas
I've been working some on building spark blueprints, and recently tried to
generalize one for easy blueprints of spark apps.

https://github.com/jayunit100/SparkBlueprint.git

It runs the spark app's main method in a unit test, and builds in SBT.

You can easily try it out and improve on it.

Obviously, calling a main method is the wrong kind of coupling for a unit
test, but it works pretty well in a simple CI environment.

I'll improve it eventually by injecting the SparkContext and validating the
RDD directly, in a next iteration.

Pull requests welcome :)





On Tue, Jul 29, 2014 at 11:29 AM, soumick86 sdasgu...@dstsystems.com
wrote:

 Is there any example out there for unit testing a Spark application in
 Java?
 Even a trivial application like word count will be very helpful. I am very
 new to this and I am struggling to understand how I can use JavaSpark
 Context for JUnit







-- 
jay vyas


Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread Kostiantyn Kudriavtsev
Hi, 

try this one 
http://simpletoad.blogspot.com/2014/07/runing-spark-unit-test-on-windows-7.html

it's more about fixing a Windows-specific issue, but the code snippet gives the general
idea: just run the ETL and check the output with Assert(s).

On Jul 29, 2014, at 6:29 PM, soumick86 sdasgu...@dstsystems.com wrote:

 Is there any example out there for unit testing a Spark application in Java?
 Even a trivial application like word count will be very helpful. I am very
 new to this and I am struggling to understand how I can use JavaSpark
 Context for JUnit
 
 
 



Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread Sonal Goyal
You can take a look at
https://github.com/apache/spark/blob/master/core/src/test/java/org/apache/spark/JavaAPISuite.java
and model your junits based on it.

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal




On Tue, Jul 29, 2014 at 10:10 PM, Kostiantyn Kudriavtsev 
kudryavtsev.konstan...@gmail.com wrote:

 Hi,

 try this one
 http://simpletoad.blogspot.com/2014/07/runing-spark-unit-test-on-windows-7.html

 it’s more about fixing windows-specific issue, but code snippet gives
 general idea
 just run etl and check output w/ Assert(s)

 On Jul 29, 2014, at 6:29 PM, soumick86 sdasgu...@dstsystems.com wrote:

  Is there any example out there for unit testing a Spark application in
 Java?
  Even a trivial application like word count will be very helpful. I am
 very
  new to this and I am struggling to understand how I can use JavaSpark
  Context for JUnit
 
 
 




Re: guidance on simple unit testing with Spark

2014-06-16 Thread Daniel Siegmann
If you don't want to refactor your code, you can put your input into a test
file. After the test runs, read the data from the output file you specified
(probably want this to be a temp file and delete on exit). Of course, that
is not really a unit test - Matei's suggestion is preferable (this is how
we test). However, if you have a long and complex flow, you might unit test
different parts, and then have an integration test which reads from the
files and tests the whole flow together (I do this as well).
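
A rough sketch of that file-based style (MyJob.main is a stand-in for whatever entry
point takes an input path and an output path; nothing here is from the original
poster's code):

import java.io.{File, PrintWriter}
import scala.io.Source

val input = File.createTempFile("job-input", ".txt")
val writer = new PrintWriter(input)
writer.println("line 1")
writer.println("line 2")
writer.close()

val output = java.nio.file.Files.createTempDirectory("job-output").resolve("result").toString

MyJob.main(Array(input.getPath, output))   // hypothetical entry point under test

// saveAsTextFile writes part-* files under the output directory
val produced = new File(output).listFiles()
  .filter(_.getName.startsWith("part-"))
  .flatMap(f => Source.fromFile(f).getLines())
  .toList
// compare produced against the expected lines here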




On Fri, Jun 13, 2014 at 10:04 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 You need to factor your program so that it’s not just a main(). This is
 not a Spark-specific issue, it’s about how you’d unit test any program in
 general. In this case, your main() creates a SparkContext, so you can’t
 pass one from outside, and your code has to read data from a file and write
 it to a file. It would be better to move your code for transforming data
 into a new function:

 def processData(lines: RDD[String]): RDD[String] = {
   // build and return your “res” variable
 }

 Then you can unit-test this directly on data you create in your program:

 val myLines = sc.parallelize(Seq(“line 1”, “line 2”))
 val result = GetInfo.processData(myLines).collect()
 assert(result.toSet === Set(“res 1”, “res 2”))

 Matei

 On Jun 13, 2014, at 2:42 PM, SK skrishna...@gmail.com wrote:

  Hi,
 
  I have looked through some of the  test examples and also the brief
  documentation on unit testing at
  http://spark.apache.org/docs/latest/programming-guide.html#unit-testing,
 but
  still dont have a good understanding of writing unit tests using the
 Spark
  framework. Previously, I have written unit tests using specs2 framework
 and
  have got them to work in Scalding.  I tried to use the specs2 framework
 with
  Spark, but could not find any simple examples I could follow. I am open
 to
  specs2 or Funsuite, whichever works best with Spark. I would like some
  additional guidance, or some simple sample code using specs2 or
 Funsuite. My
  code is provided below.
 
 
  I have the following code in src/main/scala/GetInfo.scala. It reads a
 Json
  file and extracts some data. It takes the input file (args(0)) and output
  file (args(1)) as arguments.
 
  object GetInfo{
 
def main(args: Array[String]) {
  val inp_file = args(0)
  val conf = new SparkConf().setAppName(GetInfo)
  val sc = new SparkContext(conf)
  val res = sc.textFile(log_file)
.map(line = { parse(line) })
.map(json =
   {
  implicit lazy val formats =
  org.json4s.DefaultFormats
  val aid = (json \ d \ TypeID).extract[Int]
  val ts = (json \ d \ TimeStamp).extract[Long]
  val gid = (json \ d \ ID).extract[String]
  (aid, ts, gid)
   }
 )
.groupBy(tup = tup._3)
.sortByKey(true)
.map(g = (g._1, g._2.map(_._2).max))
  res.map(tuple= %s, %d.format(tuple._1,
  tuple._2)).saveAsTextFile(args(1))
  }
 
 
  I would like to test the above code. My unit test is in src/test/scala.
 The
  code I have so far for the unit test appears below:
 
  import org.apache.spark._
  import org.specs2.mutable._
 
  class GetInfoTest extends Specification with java.io.Serializable{
 
  val data = List (
   (d: {TypeID = 10, Timestamp: 1234, ID: ID1}),
   (d: {TypeID = 11, Timestamp: 5678, ID: ID1}),
   (d: {TypeID = 10, Timestamp: 1357, ID: ID2}),
   (d: {TypeID = 11, Timestamp: 2468, ID: ID2})
 )
 
  val expected_out = List(
 (ID1,5678),
 (ID2,2468),
  )
 
 A GetInfo job should {
  //* How do I pass data define above as input and output
  which GetInfo expects as arguments? **
  val sc = new SparkContext(local, GetInfo)
 
  //*** how do I get the output ***
 
   //assuming out_buffer has the output I want to match it to
 the
  expected output
  match expected output in {
   ( out_buffer == expected_out) must beTrue
  }
  }
 
  }
 
  I would like some help with the tasks marked with  in the unit test
  code above. If specs2 is not the right way to go, I am also open to
  FunSuite. I would like to know how to pass the input while calling my
  program from the unit test and get the output.
 
  Thanks for your help.
 
 
 
 
 




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: guidance on simple unit testing with Spark

2014-06-14 Thread Gerard Maas
Ll mlll
On Jun 14, 2014 4:05 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

 You need to factor your program so that it’s not just a main(). This is
 not a Spark-specific issue, it’s about how you’d unit test any program in
 general. In this case, your main() creates a SparkContext, so you can’t
 pass one from outside, and your code has to read data from a file and write
 it to a file. It would be better to move your code for transforming data
 into a new function:

 def processData(lines: RDD[String]): RDD[String] = {
   // build and return your “res” variable
 }

 Then you can unit-test this directly on data you create in your program:

 val myLines = sc.parallelize(Seq(“line 1”, “line 2”))
 val result = GetInfo.processData(myLines).collect()
 assert(result.toSet === Set(“res 1”, “res 2”))

 Matei

 On Jun 13, 2014, at 2:42 PM, SK skrishna...@gmail.com wrote:

  Hi,
 
  I have looked through some of the  test examples and also the brief
  documentation on unit testing at
  http://spark.apache.org/docs/latest/programming-guide.html#unit-testing,
 but
  still dont have a good understanding of writing unit tests using the
 Spark
  framework. Previously, I have written unit tests using specs2 framework
 and
  have got them to work in Scalding.  I tried to use the specs2 framework
 with
  Spark, but could not find any simple examples I could follow. I am open
 to
  specs2 or Funsuite, whichever works best with Spark. I would like some
  additional guidance, or some simple sample code using specs2 or
 Funsuite. My
  code is provided below.
 
 
  I have the following code in src/main/scala/GetInfo.scala. It reads a
 Json
  file and extracts some data. It takes the input file (args(0)) and output
  file (args(1)) as arguments.
 
  object GetInfo{
 
def main(args: Array[String]) {
  val inp_file = args(0)
  val conf = new SparkConf().setAppName(GetInfo)
  val sc = new SparkContext(conf)
  val res = sc.textFile(log_file)
.map(line = { parse(line) })
.map(json =
   {
  implicit lazy val formats =
  org.json4s.DefaultFormats
  val aid = (json \ d \ TypeID).extract[Int]
  val ts = (json \ d \ TimeStamp).extract[Long]
  val gid = (json \ d \ ID).extract[String]
  (aid, ts, gid)
   }
 )
.groupBy(tup = tup._3)
.sortByKey(true)
.map(g = (g._1, g._2.map(_._2).max))
  res.map(tuple= %s, %d.format(tuple._1,
  tuple._2)).saveAsTextFile(args(1))
  }
 
 
  I would like to test the above code. My unit test is in src/test/scala.
 The
  code I have so far for the unit test appears below:
 
  import org.apache.spark._
  import org.specs2.mutable._
 
  class GetInfoTest extends Specification with java.io.Serializable{
 
  val data = List (
   (d: {TypeID = 10, Timestamp: 1234, ID: ID1}),
   (d: {TypeID = 11, Timestamp: 5678, ID: ID1}),
   (d: {TypeID = 10, Timestamp: 1357, ID: ID2}),
   (d: {TypeID = 11, Timestamp: 2468, ID: ID2})
 )
 
  val expected_out = List(
 (ID1,5678),
 (ID2,2468),
  )
 
 A GetInfo job should {
  //* How do I pass data define above as input and output
  which GetInfo expects as arguments? **
  val sc = new SparkContext(local, GetInfo)
 
  //*** how do I get the output ***
 
   //assuming out_buffer has the output I want to match it to
 the
  expected output
  match expected output in {
   ( out_buffer == expected_out) must beTrue
  }
  }
 
  }
 
  I would like some help with the tasks marked with  in the unit test
  code above. If specs2 is not the right way to go, I am also open to
  FunSuite. I would like to know how to pass the input while calling my
  program from the unit test and get the output.
 
  Thanks for your help.
 
 
 
 
 




Re: guidance on simple unit testing with Spark

2014-06-13 Thread Matei Zaharia
You need to factor your program so that it’s not just a main(). This is not a 
Spark-specific issue, it’s about how you’d unit test any program in general. In 
this case, your main() creates a SparkContext, so you can’t pass one from 
outside, and your code has to read data from a file and write it to a file. It 
would be better to move your code for transforming data into a new function:

def processData(lines: RDD[String]): RDD[String] = {
  // build and return your “res” variable
}

Then you can unit-test this directly on data you create in your program:

val myLines = sc.parallelize(Seq("line 1", "line 2"))
val result = GetInfo.processData(myLines).collect()
assert(result.toSet === Set("res 1", "res 2"))

Matei

On Jun 13, 2014, at 2:42 PM, SK skrishna...@gmail.com wrote:

 Hi,
 
 I have looked through some of the  test examples and also the brief
 documentation on unit testing at
 http://spark.apache.org/docs/latest/programming-guide.html#unit-testing, but
 still dont have a good understanding of writing unit tests using the Spark
 framework. Previously, I have written unit tests using specs2 framework and
 have got them to work in Scalding.  I tried to use the specs2 framework with
 Spark, but could not find any simple examples I could follow. I am open to
 specs2 or Funsuite, whichever works best with Spark. I would like some
 additional guidance, or some simple sample code using specs2 or Funsuite. My
 code is provided below.
 
 
 I have the following code in src/main/scala/GetInfo.scala. It reads a Json
 file and extracts some data. It takes the input file (args(0)) and output
 file (args(1)) as arguments.
 
 object GetInfo{
 
   def main(args: Array[String]) {
 val inp_file = args(0)
 val conf = new SparkConf().setAppName(GetInfo)
 val sc = new SparkContext(conf)
 val res = sc.textFile(log_file)
   .map(line = { parse(line) })
   .map(json =
  {
 implicit lazy val formats =
 org.json4s.DefaultFormats
 val aid = (json \ d \ TypeID).extract[Int]
 val ts = (json \ d \ TimeStamp).extract[Long]
 val gid = (json \ d \ ID).extract[String]
 (aid, ts, gid)
  }
)
   .groupBy(tup = tup._3)
   .sortByKey(true)
   .map(g = (g._1, g._2.map(_._2).max))
 res.map(tuple= %s, %d.format(tuple._1,
 tuple._2)).saveAsTextFile(args(1))
 }
 
 
 I would like to test the above code. My unit test is in src/test/scala. The
 code I have so far for the unit test appears below:
 
 import org.apache.spark._
 import org.specs2.mutable._
 
 class GetInfoTest extends Specification with java.io.Serializable{
 
 val data = List (
  (d: {TypeID = 10, Timestamp: 1234, ID: ID1}),
  (d: {TypeID = 11, Timestamp: 5678, ID: ID1}),
  (d: {TypeID = 10, Timestamp: 1357, ID: ID2}),
  (d: {TypeID = 11, Timestamp: 2468, ID: ID2})
)
 
 val expected_out = List(
(ID1,5678),
(ID2,2468),
 )
 
A GetInfo job should {
 //* How do I pass data define above as input and output
 which GetInfo expects as arguments? **
 val sc = new SparkContext(local, GetInfo)
 
 //*** how do I get the output ***
 
  //assuming out_buffer has the output I want to match it to the
 expected output
 match expected output in {
  ( out_buffer == expected_out) must beTrue
 }
 }
 
 }
 
 I would like some help with the tasks marked with  in the unit test
 code above. If specs2 is not the right way to go, I am also open to
 FunSuite. I would like to know how to pass the input while calling my
 program from the unit test and get the output.
 
 Thanks for your help.
 
 
 
 
 



Re: Spark unit testing best practices

2014-05-16 Thread Andras Nemeth
Thanks for the answers!

As a concrete example, here is what I did to test my (wrong :) ) hypothesis
before writing my email:

class SomethingNotSerializable {
  def process(a: Int): Int = 2 * a
}

object NonSerializableClosure extends App {
  val sc = new spark.SparkContext(
      "local",
      "SerTest",
      "/home/xandrew/spark-0.9.0-incubating",
      Seq("target/scala-2.10/sparktests_2.10-0.1-SNAPSHOT.jar"))
  val sns = new SomethingNotSerializable
  println(sc.parallelize(Seq(1, 2, 3))
    .map(sns.process(_))
    .reduce(_ + _))
}

This program prints 12 correctly. If I change local to point to my spark
master the code fails on the worker with a NullPointerException in the line
.map(sns.process(_)).

But I have to say that my original assumption that this is a serialization
issue was wrong, as adding extends Serializable to my class does _not_
solve the problem in non-local mode. This seems to be something more
convoluted, the sns reference in my closure is probably not stored by
value, instead I guess it's a by name reference to
NonSerializableClosure.sns. I'm a bit surprised why this results in a
NullPointerException instead of some error when trying to run the
constructor of this object on the worker. Maybe something to do with the
magic of App.

Anyways, while this is indeed an example of an error that doesn't manifest
in local mode, I guess it turns out to be convoluted enough that we won't
worry about it for now, use local in tests, and I'll ask again if we see
some actual prod vs unittest problems.


On using local-cluster, this does sound like exactly what I had in mind.
But it doesn't seem to work for application developers. It seems to assume
you are running within a spark build (it fails while looking for the file
bin/compute-classpath.sh). So maybe that's a reason it's not documented...

Cheers,
Andras





On Wed, May 14, 2014 at 7:58 PM, Mark Hamstra m...@clearstorydata.comwrote:

 Local mode does serDe, so it should expose serialization problems.


 On Wed, May 14, 2014 at 10:53 AM, Philip Ogren philip.og...@oracle.comwrote:

 Have you actually found this to be true?  I have found Spark local mode
 to be quite good about blowing up if there is something non-serializable
 and so my unit tests have been great for detecting this.  I have never seen
 something that worked in local mode that didn't work on the cluster because
 of different serialization requirements between the two.  Perhaps it is
 different when using Kryo



 On 05/14/2014 04:34 AM, Andras Nemeth wrote:

 E.g. if I accidentally use a closure which has something
 non-serializable in it, then my test will happily succeed in local mode but
 go down in flames on a real cluster.






Re: Spark unit testing best practices

2014-05-16 Thread Nan Zhu
+1, at least with current code  

just watch the log printed by DAGScheduler…  

--  
Nan Zhu


On Wednesday, May 14, 2014 at 1:58 PM, Mark Hamstra wrote:

 serDe  



Spark unit testing best practices

2014-05-15 Thread Andras Nemeth
Hi,

Spark's local mode is great to create simple unit tests for our spark
logic. The disadvantage however is that certain types of problems are never
exposed in local mode because things never need to be put on the wire.

E.g. if I accidentally use a closure which has something non-serializable
in it, then my test will happily succeed in local mode but go down in
flames on a real cluster.

Another example is Kryo: I'd like to use setRegistrationRequired(true) to
avoid any hidden performance problems due to forgotten registration. And of
course I'd like things to fail in tests. But it won't happen because we
never actually need to serialize the RDDs in local mode.

So, is there some good solution to the above problems? Is there some
local-like mode which simulates serializations as well? Or is there an easy
way to start up *from code* a standalone spark cluster on the machine
running the unit test?

Thanks,
Andras


Re: Spark unit testing best practices

2014-05-14 Thread Andrew Ash
There's an undocumented mode that looks like it simulates a cluster:

SparkContext.scala:
// Regular expression for simulating a Spark cluster of [N, cores, memory] locally
val LOCAL_CLUSTER_REGEX =
  """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r

can you try running your tests with a master URL of "local-cluster[2,2,512]" to
see if that does serialization?
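
i.e. something along these lines (a sketch; the three numbers are workers, cores per
worker, and memory per worker in MB):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local-cluster[2,1,512]")
  .setAppName("serialization-test")
val sc = new SparkContext(conf)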


On Wed, May 14, 2014 at 3:34 AM, Andras Nemeth 
andras.nem...@lynxanalytics.com wrote:

 Hi,

 Spark's local mode is great to create simple unit tests for our spark
 logic. The disadvantage however is that certain types of problems are never
 exposed in local mode because things never need to be put on the wire.

 E.g. if I accidentally use a closure which has something non-serializable
 in it, then my test will happily succeed in local mode but go down in
 flames on a real cluster.

 Other example is kryo: I'd like to use setRegistrationRequired(true) to
 avoid any hidden performance problems due to forgotten registration. And of
 course I'd like things to fail in tests. But it won't happen because we
 never actually need to serialize the RDDs in local mode.

 So, is there some good solution to the above problems? Is there some
 local-like mode which simulates serializations as well? Or is there an easy
 way to start up *from code* a standalone spark cluster on the machine
 running the unit test?

 Thanks,
 Andras




Re: Spark unit testing best practices

2014-05-14 Thread Philip Ogren
Have you actually found this to be true?  I have found Spark local mode 
to be quite good about blowing up if there is something non-serializable 
and so my unit tests have been great for detecting this.  I have never 
seen something that worked in local mode that didn't work on the cluster 
because of different serialization requirements between the two.  
Perhaps it is different when using Kryo



On 05/14/2014 04:34 AM, Andras Nemeth wrote:
E.g. if I accidentally use a closure which has something 
non-serializable in it, then my test will happily succeed in local 
mode but go down in flames on a real cluster.