Awesome, thanks. Just reading your post.

A few observations:

1) You're giving out Marius's email: "I have been lucky enough to build this
pipeline with the amazing Marius Feteanu". A LinkedIn or GitHub link might be
more helpful.

2) "If you are in Pyspark world sadly Holden’s test base wont work so I
suggest you check out Pytest and pytest-bdd." doesn't read well to me. On
first read I wondered whether spark-testing-base wasn't available in Python.
It took me about 20 seconds to figure out that you probably meant it doesn't
allow for direct BDD semantics. My second observation here is that BDD
semantics can be aped in any given testing framework. You just need to be
flexible :)

3) You're doing a transformation (i.e. JSON input against a JSON schema). You
are testing for the number of rows, which is a good start, but I don't think
that really exercises a test against your JSON schema. I tend to view schemas
as the things that need the most rigorous testing (they're code, after all).
I.e. I would want to confirm that the output matches the expected shape and
values after being loaded against the schema.

I saw a few minor spelling and grammatical issues as well. I put a PR into
your blog for them. I won't be offended if you squish it :)

I should be getting into our testing 'how-to' stuff this week. I'll scrape
our org-specific stuff and put it up on GitHub this week as well. It'll be in
Python, so maybe we'll get both use cases covered with examples :)

G

On 27 April 2017 at 03:46, Sam Elamin <hussam.ela...@gmail.com> wrote:
> Hi
>
> @Lucas I certainly would love to write an integration testing library for
> workflows, I have a few ideas I would love to share with others and they are
> focused around Airflow since that is what we use
>
>
> As promised here is the first blog post in a series of posts I hope to write
> on how we build data pipelines
>
> Please feel free to retweet my original tweet and share because the more
> ideas we have the better!
>
> Feedback is always welcome!
>
> Regards
> Sam
>
> On Tue, Apr 25, 2017 at 10:32 PM, lucas.g...@gmail.com
> <lucas.g...@gmail.com> wrote:
>>
>> Hi all, whoever (Sam I think) was going to do some work on doing a
>> template testing pipeline. I'd love to be involved, I have a current task
>> in my day job (data engineer) to flesh out our testing how-to / best
>> practices for Spark jobs and I think I'll be doing something very similar
>> for the next week or 2.
>>
>> I'll scrape out what i have now in the next day or so and put it up in a
>> gist that I can share too.
>>
>> G
>>
>> On 25 April 2017 at 13:04, Holden Karau <hol...@pigscanfly.ca> wrote:
>>>
>>> Urgh hangouts did something frustrating, updated link
>>> https://hangouts.google.com/hangouts/_/ha6kusycp5fvzei2trhay4uhhqe
>>>
>>> On Mon, Apr 24, 2017 at 12:13 AM, Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>>
>>>> The (tentative) link for those interested is
>>>> https://hangouts.google.com/hangouts/_/oyjvcnffejcjhi6qazf3lysypue .
>>>>
>>>> On Mon, Apr 24, 2017 at 12:02 AM, Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>>
>>>>> So 14 people have said they are available on Tuesday the 25th at 1PM
>>>>> pacific so we will do this meeting then (
>>>>> https://doodle.com/poll/69y6yab4pyf7u8bn ).
>>>>>
>>>>> Since hangouts tends to work ok on the Linux distro I'm running my
>>>>> default is to host this as a "hangouts-on-air" unless there are
>>>>> alternative ideas.
>>>>>
>>>>> I'll record the hangout and if it isn't terrible I'll post it for those
>>>>> who weren't able to make it (and for next time I'll include more European
>>>>> friendly time options - Doodle wouldn't let me update it once posted).
>>>>>
>>>>> On Fri, Apr 14, 2017 at 11:17 AM, Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>>
>>>>>> Hi Spark Users (+ Some Spark Testing Devs on BCC),
>>>>>>
>>>>>> Awhile back on one of the many threads about testing in Spark there
>>>>>> was some interest in having a chat about the state of Spark testing and
>>>>>> what people want/need.
>>>>>>
>>>>>> So if you are interested in joining an online (with maybe an IRL
>>>>>> component if enough people are SF based) chat about Spark testing please
>>>>>> fill out this doodle - https://doodle.com/poll/69y6yab4pyf7u8bn
>>>>>>
>>>>>> I think reasonable topics of discussion could be:
>>>>>>
>>>>>> 1) What is the state of the different Spark testing libraries in the
>>>>>> different core (Scala, Python, R, Java) and extended languages (C#,
>>>>>> Javascript, etc.)?
>>>>>> 2) How do we make these more easily discovered by users?
>>>>>> 3) What are people looking for in their testing libraries that we are
>>>>>> missing? (can be functionality, documentation, etc.)
>>>>>> 4) Are there any examples of well tested open source Spark projects
>>>>>> and where are they?
>>>>>>
>>>>>> If you have other topics that's awesome.
>>>>>>
>>>>>> To clarify this about libraries and best practices for people testing
>>>>>> their Spark applications, and less about testing Spark's internals
>>>>>> (although as illustrated by some of the libraries there is some strong
>>>>>> overlap in what is required to make that work).
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Holden :)
>>>>>>
>>>>>> --
>>>>>> Cell : 425-233-8271
>>>>>> Twitter: https://twitter.com/holdenkarau
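
For anyone following along, observations 2) and 3) in the reply at the top of this thread could look roughly like the sketch below. It is only an illustration under stated assumptions: the schema, the field names and the `load_against_schema` helper are all hypothetical (the real JSON schema from the blog post is not shown in this thread), and it uses plain Python rather than Spark so that it runs anywhere. The Given/When/Then comments show BDD-style structure in an ordinary test function, without pytest-bdd.

```python
import json

# Hypothetical schema: field name -> expected Python type. This stands in for
# the real JSON schema from the blog post, which is not shown in this thread.
EXPECTED_SCHEMA = {"id": int, "name": str, "amount": float}


def load_against_schema(raw_lines, schema):
    """Parse JSON lines, keep only the schema's fields, and cast each value."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


def test_output_matches_expected_shape_and_values():
    # Given: two raw JSON records, one carrying a field the schema does not know
    raw = [
        '{"id": 1, "name": "a", "amount": "2.5"}',
        '{"id": 2, "name": "b", "amount": 3, "extra": true}',
    ]

    # When: the records are loaded against the schema
    rows = load_against_schema(raw, EXPECTED_SCHEMA)

    # Then: the row count is right (the check the post already has) ...
    assert len(rows) == 2
    # ... and so are the shape and types of every row ...
    for row in rows:
        assert set(row) == set(EXPECTED_SCHEMA)
        for field, expected_type in EXPECTED_SCHEMA.items():
            assert isinstance(row[field], expected_type)
    # ... and the actual values
    assert rows == [
        {"id": 1, "name": "a", "amount": 2.5},
        {"id": 2, "name": "b", "amount": 3.0},
    ]
```

pytest will pick up the `test_` function as-is. In a real PySpark job the same three "Then" checks would presumably be made against the DataFrame's schema and its collected rows instead of a list of dicts.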