Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Sofia’s World
Hey,

My 2 cents on CI/CD for PySpark: you can leverage pytest plus Holden
Karau's Spark testing libraries for CI, giving you almost the same
functionality as Scala. I say "almost" because in Scala you have nice,
descriptive FunSpecs.

For me the choice is based on expertise. Having worked with teams that
are 99% Python, the cost of retraining, or even hiring, is too big,
especially if you have an existing project and aggressive deadlines.

Please feel free to object.

Kind regards
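A minimal sketch of that testing approach, using plain pytest conventions rather than spark-testing-base specifically, with the transformation factored into a pure function so the test runs in CI without a cluster (function and field names here are illustrative):

```python
# Keep ETL business logic in pure functions over plain rows so a CI job
# can cover it with pytest alone. In the real pipeline the same function
# is applied to DataFrame rows, while a library like spark-testing-base
# can supply a shared local SparkSession for integration-level tests.

def normalize_row(row: dict) -> dict:
    """Trim/lowercase the name and convert the amount from cents to units."""
    return {
        "name": row["name"].strip().lower(),
        "amount": row["amount_cents"] / 100.0,
    }

def test_normalize_row():
    raw = {"name": "  Alice ", "amount_cents": 1250}
    assert normalize_row(raw) == {"name": "alice", "amount": 12.5}
```

Pytest discovers `test_normalize_row` automatically, so the same `pytest` command can gate the CI stage for both the unit suite and any Spark integration suite.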



Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Mich Talebzadeh
Hi Wim,


I think we are splitting hairs here, but my point about functionality was
based on:

   1. Spark is written in Scala, so knowing the Scala programming language
   helps coders navigate the source code if something does not function as
   expected.
   2. Using Python on top of a JVM-based framework increases the
   probability of issues and bugs, because translation between the two
   different runtimes is difficult.
   3. Using Scala for Spark provides access to the latest features of the
   framework, as they become available in Scala first and are then ported
   to Python.
   4. Some functionality is not available in Python; I have seen this a
   few times in the Spark documentation.

There is an interesting write-up on this, although it does not touch on
CI/CD aspects:

Developing Apache Spark Applications: Scala vs. Python
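Point 2 is concrete in PySpark: every row handed to Python code crosses the JVM boundary via py4j and is serialized on the way (pickle for classic UDFs, Arrow for pandas UDFs). A pure-Python stand-in for that round trip, with an illustrative row:

```python
import pickle

# Stand-in for the JVM <-> Python worker hop that each row of a classic
# Python UDF pays for in PySpark: serialize, ship, deserialize.
row = {"id": 1, "name": "alice", "amount": 12.5}

wire_bytes = pickle.dumps(row)      # leaving the "JVM" side
decoded = pickle.loads(wire_bytes)  # arriving on the Python worker

# Values survive intact, but the CPU cost is paid per row, which is why
# built-in DataFrame expressions (executed inside the JVM) are preferred.
assert decoded == row
```

Built-in column expressions never cross this boundary, which is why the Python-vs-Scala gap mostly disappears for DataFrame-only code.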



Regards,


Mich





Re: Scala vs Python for ETL with Spark

2020-10-23 Thread William R
It's really a very big discussion around PySpark vs Scala. I have a
little experience with automating CI/CD when it's a JVM-based language.
I would like to take this as an opportunity to understand the end-to-end
CI/CD flow for PySpark-based ETL pipelines.

Could someone please list the steps of how pipeline automation works for
PySpark-based pipelines in production?

//William


-- 
Regards,
William R
+919037075164


Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Wim Van Leuven
I think Sean is right, but in your argumentation you mention that
'functionality is sacrificed in favour of the availability of resources'.
That's where I disagree with you but agree with Sean: that is mostly not
true.

You also mentioned this in your previous posts. The only reason we
sometimes have to bail out to Scala is performance with certain UDFs.
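The UDF case usually hurts because a classic Python UDF is called once per row, while Spark's `pandas_udf` hands Python whole batches instead. A pure-Python sketch of the two call patterns (illustrative functions, no pyspark dependency; a list stands in for a column batch):

```python
# Classic UDF shape: one Python call (plus serialization) per row.
def discount_scalar(amount: float) -> float:
    return amount * 0.5

# pandas_udf shape: one Python call per batch; with real pandas this
# would be a vectorized Series operation over the whole column chunk.
def discount_batch(amounts: list) -> list:
    return [a * 0.5 for a in amounts]

amounts = [10.0, 20.0, 30.0]
per_row = [discount_scalar(a) for a in amounts]  # N calls into Python
per_batch = discount_batch(amounts)              # 1 call into Python

assert per_row == per_batch == [5.0, 10.0, 15.0]
```

Same result either way; the batch shape simply amortizes the boundary-crossing cost, which is often enough to avoid dropping to Scala at all.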



Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Mich Talebzadeh
Thanks for the feedback Sean.

Kind regards,

Mich



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.







Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Sean Owen
I don't find this trolling; I agree with the observation that 'the skills
you have' are a valid and important determiner of what tools you pick.
I disagree that you just have to pick the optimal tool for everything.
Sounds good until that comes in contact with the real world.
For Spark, Python vs Scala just doesn't matter a lot, especially if you're
doing DataFrame operations. By design. So I can't see there being one
answer to this.



Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Gourav Sengupta
Hi Mich,

this is turning into a troll now, can you please stop this?

No one uses Scala where Python should be used, and no one uses Python where
Scala should be used - it all depends on requirements. Everyone understands
polyglot programming and how to use relevant technologies best to their
advantage.


Regards,
Gourav Sengupta




Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Mich Talebzadeh
Today I had a discussion with a lead developer on a client site regarding
Scala or PySpark with Spark.

They were not doing data science, and he reluctantly agreed that PySpark
was used for ETL.

In mitigation he mentioned that in his team he is the only one who is an
expert in Scala (his words) and the rest are Python-savvy.

It shows again that at times functionality is sacrificed in favour of the
availability of resources, and it reaffirms what some members were saying
about choosing technology based on TCO, favouring Python over Scala.

HTH,

Mich





On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh 
wrote:

> I have come across occasions when teams use Python with Spark for ETL,
> for example processing data from S3 buckets into Snowflake with Spark.
>
> The only reason I think they choose Python as opposed to Scala is that
> they are more familiar with Python. That Spark itself is written in
> Scala is one indication of why I think Scala has an edge.
>
> I have not done a one-to-one comparison of Spark with Scala vs Spark
> with Python. I understand that for data science purposes most libraries
> like TensorFlow etc. are written in Python, but I am at a loss to
> understand the validity of using Python with Spark for ETL purposes.
>
> These are my understandings, not facts, so I would like to get some
> informed views on this if I can.
>
> Many thanks,
>
> Mich


Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Magnus Nilsson
A holy war is a bit dramatic, don't you think? The difference between
Scala and Python will always be very relevant when choosing between Spark
and PySpark. I wouldn't call it irrelevant to the original question.

br,

molotch

>


Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Magnus Nilsson
I'm sorry you were offended. I'm not an expert in Python and I wasn't
trying to attack you personally. It's just an opinion about what makes a
language better or worse, not the single source of truth. You don't have
to take offence. In the end it's about context and what you're trying to
achieve under what circumstances.

I know a little about both programming and ETL; to say I know nothing is
taking it a bit far. I don't know everything worth knowing, that's for
sure, and it goes without saying.

It's fine to love Python, and good for you being able to write Python
programs wiping Java commercial stacks left and right. It's just my
opinion that mutable, dynamically typed languages encourage/enforce bad
habits.

The larger the application and team get, the worse off you are (again,
just an opinion). Not everyone agrees (just look at Python's popularity),
but it's definitely a relevant aspect when deciding between Spark and
PySpark.


br,

molotch

>>


Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Holden Karau
Scala and Python each have their advantages and disadvantages with Spark.
In my experience, when performance is super important you'll end up
needing to do some of your work in the JVM; but in many situations what
matters is what your team and company are familiar with and the ecosystem
of tooling for your domain.

Since that can change so much between people and projects, I think
arguing about the one true language is likely to be unproductive.

We’re all here because we want Spark and more broadly open source data
tooling to succeed — let’s keep that in mind. There is far too much stress
in the world, and I know I’ve sometimes used word choices I regret
especially this year. Let’s all take the weekend to do something we enjoy
away from Spark :)

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
It seems the thread has turned into a holy war that has nothing to do
with the original question. If so, it's super disappointing.

Sent from my iPhone


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Sasha Kacanski
And you are an expert on Python! Idiomatic...
Please do everyone a favor and stop commenting on things you have no idea
about...
I built ETL systems in Python that wiped Java commercial stacks left and
right. PySpark was, is, and will be a second-class citizen in the Spark
world. That has nothing to do with Python.
And as far as Scala is concerned, good luck with it...





>


Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Molotch
I would say the pros and cons of Python vs Scala come down to Spark, the
languages themselves, and what kind of data engineer you will get when
you try to hire for the different solutions.

With PySpark you get less functionality and increased complexity from the
py4j Java interop compared to vanilla Spark. Why would you want that?
Maybe you want the Python ML tools and have a clear use case; then go for
it. If not, avoid the increased complexity and reduced functionality of
PySpark.

Python vs Scala? Idiomatic Python is a lesson in bad programming
habits/ideas; there's no other way to put it. Do you really want
programmers who enjoy coding in such a language hacking away at your
system?

Scala might be far from perfect, with its plethora of ways to express
yourself. But Python < 3.5 is not fit for anything except simple
scripting, IMO.

For doing exploratory data analysis in a Jupyter notebook, PySpark seems
like a fine idea. For coding an entire ETL library, including state
management and the whole kitchen including the sink: Scala, every day of
the week.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Scala vs Python for ETL with Spark

2020-10-15 Thread Mich Talebzadeh
Hi,

I spent a few days converting one of my Spark/Scala scripts to Python. It
was interesting, but at times it felt like trench warfare. There is a lot
of handy stuff in Scala, like case classes for defining column headers,
that does not seem to be available in Python (possibly my lack of
in-depth Python knowledge). However, the Spark documentation frequently
states that features are available in Scala and Java but not Python.
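For what it's worth, the closest Python stand-ins for a Scala case class as a typed row definition are a frozen dataclass (shown below with illustrative fields) and, on the DataFrame side, `pyspark.sql.Row` or an explicit `StructType` schema:

```python
from dataclasses import asdict, dataclass

# Rough analogue of a Scala case class naming and typing the columns of
# a dataset. With pyspark installed, each record converts to a Row via
# Row(**asdict(trade)) for use with spark.createDataFrame.
@dataclass(frozen=True)
class Trade:
    ticker: str
    price: float
    quantity: int

trade = Trade(ticker="ABC", price=101.5, quantity=10)

assert asdict(trade) == {"ticker": "ABC", "price": 101.5, "quantity": 10}
```

`frozen=True` gives the immutability a case class has by default; what Python's dataclasses do not give you is compile-time type checking of the fields.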

Looking around, much of what is written for Spark using Python is a
workaround. I am not considering Python for data science, as my focus has
been on using Python with Spark for ETL; I published a thread on this
today with two examples of the code written in Scala and Python
respectively. OK, I admit lambda functions with map in Python are a great
feature, but that is all; the rest can be achieved better with Scala. So
I buy the view that people tend to use Python with Spark for ETL because
(with great respect) they cannot be bothered to pick up Scala (I trust I
am not being unkind). While converting the code I remembered that I still
use a Nokia 8210 (21-year-old technology) from time to time: old, sturdy,
long battery life and very small. Compare that with an iPhone. That is a
fair comparison between Spark on Scala and Spark on Python :)

HTH




On Sun, 11 Oct 2020 at 20:46, Mich Talebzadeh 
wrote:

> Hi,
>
> With regard to your statement below
>
> "...technology choices are agnostic to use cases according to you"
>
> If I may say, I do not think that was the message implied. What was said
> was that in addition to "best technology fit" there are other factors
> "equally important" that need to be considered, when a company makes a
> decision on a given product use case.
>
> As others have stated, the technology stack you choose may not be the
> best available technology but something that provides an adequate solution
> at a reasonable TCO. Case in point: if Scala in a given use case is the
> best fit but comes at a higher TCO (labour cost), then you may opt for
> Python or another language because you have those resources available
> in-house at lower cost and your Data Scientists are eager to invest in
> Python. Companies these days are very careful about where to spend their
> technology dollars, or may just cancel projects entirely. From my
> experience, the following are crucial in deciding what to invest in:
>
>
>    - Total Cost of Ownership
>    - Internal Supportability & Operability, thus avoiding single points of
>    failure
>    - Maximum leverage, strategic as opposed to tactical (for example, is
>    Python considered more of a strategic product than Scala?)
>    - Agile and DevOps compatible
>    - Cloud-ready, flexible, scale-out
>    - Vendor support
>    - Documentation
>    - Minimal footprint
>
> I trust this answers your point.
>
>
> Mich
>
>
>
>
>
>
> On Sun, 11 Oct 2020 at 17:39, Gourav Sengupta 
> wrote:
>
>> So Mich and rest,
>>
>> technology choices are agnostic to use cases according to you? This is
>> interesting, really interesting. Perhaps I stand corrected.
>>
>> Regards,
>> Gourav
>>
>> On Sun, Oct 11, 2020 at 5:00 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> If we take Spark and its massively parallel processing and in-memory
>>> cache away, then one can argue anything can do the "ETL" job: just write
>>> some Java/Scala/SQL/Perl/Python to read data from one DB and write it to
>>> another, often using JDBC connections. However, we all concur that may not
>>> be good enough with Big Data volumes. Generally speaking, there are two
>>> ways of making a process faster:
>>>
>>>
>>>1. Do more intelligent work by creating indexes, cubes etc thus
>>>reducing the processing time
>>>2. Throw hardware and memory at it using something like Spark
>>>multi-cluster with fully managed cloud service like Google Dataproc
>>>
>>>
>>> In general, one would see an order-of-magnitude performance gain.
>>>
>>>
>>> HTH,
>>>
>>>
>>> 

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
Hi,

With regard to your statement below

"...technology choices are agnostic to use cases according to you"

If I may say, I do not think that was the message implied. What was said
was that in addition to "best technology fit" there are other factors
"equally important" that need to be considered, when a company makes a
decision on a given product use case.

As others have stated, the technology stack you choose may not be the
best available technology but something that provides an adequate solution
at a reasonable TCO. Case in point: if Scala in a given use case is the
best fit but comes at a higher TCO (labour cost), then you may opt for
Python or another language because you have those resources available
in-house at lower cost and your Data Scientists are eager to invest in
Python. Companies these days are very careful about where to spend their
technology dollars, or may just cancel projects entirely. From my
experience, the following are crucial in deciding what to invest in:


   - Total Cost of Ownership
   - Internal Supportability & Operability, thus avoiding single points of
   failure
   - Maximum leverage, strategic as opposed to tactical (for example, is
   Python considered more of a strategic product than Scala?)
   - Agile and DevOps compatible
   - Cloud-ready, flexible, scale-out
   - Vendor support
   - Documentation
   - Minimal footprint

I trust this answers your point.


Mich






On Sun, 11 Oct 2020 at 17:39, Gourav Sengupta 
wrote:

> So Mich and rest,
>
> technology choices are agnostic to use cases according to you? This is
> interesting, really interesting. Perhaps I stand corrected.
>
> Regards,
> Gourav
>
> On Sun, Oct 11, 2020 at 5:00 PM Mich Talebzadeh 
> wrote:
>
>> If we take Spark and its massively parallel processing and in-memory
>> cache away, then one can argue anything can do the "ETL" job: just write
>> some Java/Scala/SQL/Perl/Python to read data from one DB and write it to
>> another, often using JDBC connections. However, we all concur that may not
>> be good enough with Big Data volumes. Generally speaking, there are two
>> ways of making a process faster:
>>
>>
>>1. Do more intelligent work by creating indexes, cubes etc thus
>>reducing the processing time
>>2. Throw hardware and memory at it using something like Spark
>>multi-cluster with fully managed cloud service like Google Dataproc
>>
>>
>> In general, one would see an order-of-magnitude performance gain.
>>
>>
>> HTH,
>>
>>
>> Mich
>>
>>
>>
>>
>>
>>
>>
>> On Sun, 11 Oct 2020 at 13:33, ayan guha  wrote:
>>
>>> But when you have a fairly large volume of data, that is where Spark
>>> comes into the party. And I assume the requirement of using Spark is
>>> already established in the original question, and the discussion is
>>> whether to use Python vs Scala/Java.
>>>
>>> On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski 
>>> wrote:
>>>
 If the org has folks that can do Python seriously, why then Spark in the
 first place? You can do the workflow on your own, streaming or batch or
 whatever you want.
 I would not do anything else aside from Python, but that is me.

 On Sat, Oct 10, 2020, 9:42 PM ayan guha  wrote:

> I have one observation: is "Python UDF is slow due to the deserialization
> penalty" still relevant? Even after Arrow is used for in-memory data
> management, and after heavy investment from the Spark dev community in
> making pandas a first-class citizen, including UDFs.
>
> As I work with multiple clients, my experience is that org culture and
> available people are the most important drivers for this choice, regardless
> of the use case. The use case is relevant only when there is a feature
> disparity.
>
> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Not quite sure how meaningful this discussion is, but in case someone
>> is really faced with this query the question still is 'what is the use
>> case'?
>> I am just a bit confused with the one size fits all deterministic
>> approach here; I thought those days were over almost 10 years ago.
>> Regards
>> Gourav
>>
>> On Sat, 10 Oct 2020, 

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Gourav Sengupta
So Mich and rest,

technology choices are agnostic to use cases according to you? This is
interesting, really interesting. Perhaps I stand corrected.

Regards,
Gourav

On Sun, Oct 11, 2020 at 5:00 PM Mich Talebzadeh 
wrote:

> If we take Spark and its massively parallel processing and in-memory
> cache away, then one can argue anything can do the "ETL" job: just write
> some Java/Scala/SQL/Perl/Python to read data from one DB and write it to
> another, often using JDBC connections. However, we all concur that may not
> be good enough with Big Data volumes. Generally speaking, there are two
> ways of making a process faster:
>
>
>1. Do more intelligent work by creating indexes, cubes etc thus
>reducing the processing time
>2. Throw hardware and memory at it using something like Spark
>multi-cluster with fully managed cloud service like Google Dataproc
>
>
> In general, one would see an order-of-magnitude performance gain.
>
>
> HTH,
>
>
> Mich
>
>
>
>
>
>
>
> On Sun, 11 Oct 2020 at 13:33, ayan guha  wrote:
>
>> But when you have a fairly large volume of data, that is where Spark
>> comes into the party. And I assume the requirement of using Spark is
>> already established in the original question, and the discussion is
>> whether to use Python vs Scala/Java.
>>
>> On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski 
>> wrote:
>>
>>> If the org has folks that can do Python seriously, why then Spark in the
>>> first place? You can do the workflow on your own, streaming or batch or
>>> whatever you want.
>>> I would not do anything else aside from Python, but that is me.
>>>
>>> On Sat, Oct 10, 2020, 9:42 PM ayan guha  wrote:
>>>
 I have one observation: is "Python UDF is slow due to the deserialization
 penalty" still relevant? Even after Arrow is used for in-memory data
 management, and after heavy investment from the Spark dev community in
 making pandas a first-class citizen, including UDFs.

 As I work with multiple clients, my experience is that org culture and
 available people are the most important drivers for this choice, regardless
 of the use case. The use case is relevant only when there is a feature
 disparity.

 On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <
 gourav.sengu...@gmail.com> wrote:

> Not quite sure how meaningful this discussion is, but in case someone
> is really faced with this query the question still is 'what is the use
> case'?
> I am just a bit confused with the one size fits all deterministic
> approach here; I thought those days were over almost 10 years ago.
> Regards
> Gourav
>
> On Sat, 10 Oct 2020, 21:24 Stephen Boesch,  wrote:
>
>> I agree with Wim's assessment of data engineering / ETL vs Data
>> Science. I wrote pipelines/frameworks for large companies, and Scala was
>> a much better choice. But for ad-hoc work interfacing directly with data
>> science experiments, PySpark presents less friction.
>>
>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Many thanks everyone for their valuable contribution.
>>>
>>> We all started with Spark a few years ago where Scala was the talk
>>> of the town. I agree with the note that as long as Spark stayed niche and
>>> elite, then someone with Scala knowledge was attracting premiums. In
>>> fairness in 2014-2015, there was not much talk of Data Science input (I 
>>> may
>>> be wrong). But the world has moved on so to speak. Python itself has 
>>> been
>>> around a long time (long being relative here). Most people either knew 
>>> UNIX
>>> Shell, C, Python or Perl or a combination of all these. I recall we had 
>>> a
>>> director a few years ago who asked our Hadoop admin for root password to
>>> log in to the edge node. Later he became head of machine learning
>>> somewhere else and he loved C and Python. So Python was a gift in 
>>> disguise.
>>> I think Python appeals to those who are very familiar with CLI and shell
>>> programming (Not GUI fan). As some members alluded to there are more 
>>> people
>>> around with Python knowledge. Most managers choose Python as the 
>>> unifying
>>> development tool because they feel comfortable with it. Frankly I have 
>>> not
>>> seen a manager who feels at home with Scala. So in summary it is a bit
>>> disappointing to abandon Scala and switch to Python just for the sake 
>>> of it.
>>>
>>> Disclaimer: These are opinions and not facts so to speak :)
>>>
>>> Cheers,
>>>
>>>
>>> Mich
>>>
>>>
>>>
>>>
>>>

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
If we take Spark and its massively parallel processing and in-memory
cache away, then one can argue anything can do the "ETL" job: just write
some Java/Scala/SQL/Perl/Python to read data from one DB and write it to
another, often using JDBC connections. However, we all concur that may not
be good enough with Big Data volumes. Generally speaking, there are two
ways of making a process faster:


   1. Do more intelligent work by creating indexes, cubes, etc., thus
   reducing the processing time
   2. Throw hardware and memory at it, using something like a Spark
   multi-cluster setup with a fully managed cloud service like Google Dataproc


In general, one would see an order-of-magnitude performance gain.
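As a toy, stdlib-only illustration of option 1 ("do more intelligent work"): compare a full scan against a binary search over sorted data, counting comparisons. The data and function names are invented for the example; real warehouse indexes and cubes are of course far more involved, but the shape of the saving is the same.

```python
def linear_search(data, target):
    """Full scan: O(n) comparisons -- the 'no index' path."""
    steps = 0
    for x in data:
        steps += 1
        if x == target:
            return True, steps
    return False, steps

def indexed_search(data, target):
    """Binary search over sorted data: O(log n) comparisons --
    a miniature of what an index buys you."""
    lo, hi, steps = 0, len(data), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if data[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(data) and data[lo] == target, steps

data = list(range(0, 2_000_000, 2))          # 1M sorted even numbers
found_a, steps_a = linear_search(data, 1_999_998)
found_b, steps_b = indexed_search(data, 1_999_998)
assert found_a and found_b
print(steps_a, steps_b)  # 1000000 comparisons vs about 20
```

Option 2 (more hardware) instead divides the same work across machines; the two options compose, which is why both appear in the list above.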


HTH,


Mich







On Sun, 11 Oct 2020 at 13:33, ayan guha  wrote:

> But when you have a fairly large volume of data, that is where Spark comes
> into the party. And I assume the requirement of using Spark is already
> established in the original question, and the discussion is whether to use
> Python vs Scala/Java.
>
> On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski 
> wrote:
>
>> If the org has folks that can do Python seriously, why then Spark in the
>> first place? You can do the workflow on your own, streaming or batch or
>> whatever you want.
>> I would not do anything else aside from Python, but that is me.
>>
>> On Sat, Oct 10, 2020, 9:42 PM ayan guha  wrote:
>>
>>> I have one observation: is "Python UDF is slow due to the deserialization
>>> penalty" still relevant? Even after Arrow is used for in-memory data
>>> management, and after heavy investment from the Spark dev community in
>>> making pandas a first-class citizen, including UDFs.
>>>
>>> As I work with multiple clients, my experience is that org culture and
>>> available people are the most important drivers for this choice,
>>> regardless of the use case. The use case is relevant only when there is a
>>> feature disparity.
>>>
>>> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
 Not quite sure how meaningful this discussion is, but in case someone
 is really faced with this query the question still is 'what is the use
 case'?
 I am just a bit confused with the one size fits all deterministic
 approach here; I thought those days were over almost 10 years ago.
 Regards
 Gourav

 On Sat, 10 Oct 2020, 21:24 Stephen Boesch,  wrote:

> I agree with Wim's assessment of data engineering / ETL vs Data
> Science. I wrote pipelines/frameworks for large companies, and Scala was
> a much better choice. But for ad-hoc work interfacing directly with data
> science experiments, PySpark presents less friction.
>
> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Many thanks everyone for their valuable contribution.
>>
>> We all started with Spark a few years ago where Scala was the talk
>> of the town. I agree with the note that as long as Spark stayed niche and
>> elite, then someone with Scala knowledge was attracting premiums. In
>> fairness in 2014-2015, there was not much talk of Data Science input (I 
>> may
>> be wrong). But the world has moved on so to speak. Python itself has been
>> around a long time (long being relative here). Most people either knew 
>> UNIX
>> Shell, C, Python or Perl or a combination of all these. I recall we had a
>> director a few years ago who asked our Hadoop admin for root password to
>> log in to the edge node. Later he became head of machine learning
>> somewhere else and he loved C and Python. So Python was a gift in 
>> disguise.
>> I think Python appeals to those who are very familiar with CLI and shell
>> programming (Not GUI fan). As some members alluded to there are more 
>> people
>> around with Python knowledge. Most managers choose Python as the unifying
>> development tool because they feel comfortable with it. Frankly I have 
>> not
>> seen a manager who feels at home with Scala. So in summary it is a bit
>> disappointing to abandon Scala and switch to Python just for the sake of 
>> it.
>>
>> Disclaimer: These are opinions and not facts so to speak :)
>>
>> Cheers,
>>
>>
>> Mich
>>
>>
>>
>>
>>
>>
>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> I have come across occasions when the teams use Python with Spark
>>> for ETL, for example processing data from S3 buckets into Snowflake with
>>> Spark.
>>>
>>> The only reason I think they are choosing Python as opposed to Scala
>>> is because they are 

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread ayan guha
But when you have a fairly large volume of data, that is where Spark comes
into the party. And I assume the requirement of using Spark is already
established in the original question, and the discussion is whether to use
Python vs Scala/Java.

On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski  wrote:

> If the org has folks that can do Python seriously, why then Spark in the
> first place? You can do the workflow on your own, streaming or batch or
> whatever you want.
> I would not do anything else aside from Python, but that is me.
>
> On Sat, Oct 10, 2020, 9:42 PM ayan guha  wrote:
>
>> I have one observation: is "Python UDF is slow due to the deserialization
>> penalty" still relevant? Even after Arrow is used for in-memory data
>> management, and after heavy investment from the Spark dev community in
>> making pandas a first-class citizen, including UDFs.
>>
>> As I work with multiple clients, my experience is that org culture and
>> available people are the most important drivers for this choice,
>> regardless of the use case. The use case is relevant only when there is a
>> feature disparity.
>>
>> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Not quite sure how meaningful this discussion is, but in case someone is
>>> really faced with this query the question still is 'what is the use case'?
>>> I am just a bit confused with the one size fits all deterministic
>>> approach here; I thought those days were over almost 10 years ago.
>>> Regards
>>> Gourav
>>>
>>> On Sat, 10 Oct 2020, 21:24 Stephen Boesch,  wrote:
>>>
 I agree with Wim's assessment of data engineering / ETL vs Data
 Science. I wrote pipelines/frameworks for large companies, and Scala was
 a much better choice. But for ad-hoc work interfacing directly with data
 science experiments, PySpark presents less friction.

 On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Many thanks everyone for their valuable contribution.
>
> We all started with Spark a few years ago where Scala was the talk
> of the town. I agree with the note that as long as Spark stayed niche and
> elite, then someone with Scala knowledge was attracting premiums. In
> fairness in 2014-2015, there was not much talk of Data Science input (I 
> may
> be wrong). But the world has moved on so to speak. Python itself has been
> around a long time (long being relative here). Most people either knew 
> UNIX
> Shell, C, Python or Perl or a combination of all these. I recall we had a
> director a few years ago who asked our Hadoop admin for root password to
> log in to the edge node. Later he became head of machine learning
> somewhere else and he loved C and Python. So Python was a gift in 
> disguise.
> I think Python appeals to those who are very familiar with CLI and shell
> programming (Not GUI fan). As some members alluded to there are more 
> people
> around with Python knowledge. Most managers choose Python as the unifying
> development tool because they feel comfortable with it. Frankly I have not
> seen a manager who feels at home with Scala. So in summary it is a bit
> disappointing to abandon Scala and switch to Python just for the sake of 
> it.
>
> Disclaimer: These are opinions and not facts so to speak :)
>
> Cheers,
>
>
> Mich
>
>
>
>
>
>
> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> I have come across occasions when the teams use Python with Spark for
>> ETL, for example processing data from S3 buckets into Snowflake with 
>> Spark.
>>
>> The only reason I think they are choosing Python as opposed to Scala
>> is because they are more familiar with Python. Since Spark is written in
>> Scala, itself is an indication of why I think Scala has an edge.
>>
>> I have not done one to one comparison of Spark with Scala vs Spark
>> with Python. I understand for data science purposes most libraries like
>> TensorFlow etc. are written in Python but I am at loss to understand the
>> validity of using Python with Spark for ETL purposes.
>>
>> These are my understanding but they are not facts so I would like to
>> get some informed views on this if I can?
>>
>> Many thanks,
>>
>> Mich
>>
>>
>>
>>

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
Thanks Ayan.

I am not qualified to answer your first point. However, my experience with
Spark with Scala or Spark with Python agrees with your assertion that use
cases do not come into it. Most DEV/OPS work dealing with ETL is provided
by service companies whose workforce is very familiar with Java, IntelliJ,
Maven and, latterly, Scala. Scala is their first choice: they create uber
JAR files with IntelliJ and Maven on a MacBook and ship them into sandboxes
for continuous tests. I believe this will remain a trend for some time, as
considerable investment has already been made there. Then I came across
another consultancy tasked with getting raw files from S3 and putting them
into Snowflake; they wanted to use Spark with Python. So your mileage
varies.


Cheers,


Mich



On Sun, 11 Oct 2020 at 02:41, ayan guha  wrote:

> I have one observation: is "Python UDF is slow due to the deserialization
> penalty" still relevant? Even after Arrow is used for in-memory data
> management, and after heavy investment from the Spark dev community in
> making pandas a first-class citizen, including UDFs.
>
> As I work with multiple clients, my experience is that org culture and
> available people are the most important drivers for this choice, regardless
> of the use case. The use case is relevant only when there is a feature
> disparity.
>
> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta 
> wrote:
>
>> Not quite sure how meaningful this discussion is, but in case someone is
>> really faced with this query the question still is 'what is the use case'?
>> I am just a bit confused with the one size fits all deterministic
>> approach here; I thought those days were over almost 10 years ago.
>> Regards
>> Gourav
>>
>> On Sat, 10 Oct 2020, 21:24 Stephen Boesch,  wrote:
>>
>>> I agree with Wim's assessment of data engineering / ETL vs Data
>>> Science. I wrote pipelines/frameworks for large companies, and Scala was
>>> a much better choice. But for ad-hoc work interfacing directly with data
>>> science experiments, PySpark presents less friction.
>>>
>>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh 
>>> wrote:
>>>
 Many thanks everyone for their valuable contribution.

 We all started with Spark a few years ago where Scala was the talk
 of the town. I agree with the note that as long as Spark stayed niche and
 elite, then someone with Scala knowledge was attracting premiums. In
 fairness in 2014-2015, there was not much talk of Data Science input (I may
 be wrong). But the world has moved on so to speak. Python itself has been
 around a long time (long being relative here). Most people either knew UNIX
 Shell, C, Python or Perl or a combination of all these. I recall we had a
 director a few years ago who asked our Hadoop admin for root password to
 log in to the edge node. Later he became head of machine learning
 somewhere else and he loved C and Python. So Python was a gift in disguise.
 I think Python appeals to those who are very familiar with CLI and shell
 programming (Not GUI fan). As some members alluded to there are more people
 around with Python knowledge. Most managers choose Python as the unifying
 development tool because they feel comfortable with it. Frankly I have not
 seen a manager who feels at home with Scala. So in summary it is a bit
 disappointing to abandon Scala and switch to Python just for the sake of 
 it.

 Disclaimer: These are opinions and not facts so to speak :)

 Cheers,


 Mich






 On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh 
 wrote:

> I have come across occasions when the teams use Python with Spark for
> ETL, for example processing data from S3 buckets into Snowflake with 
> Spark.
>
> The only reason I think they are choosing Python as opposed to Scala
> is because they are more familiar with Python. Since Spark is written in
> Scala, itself is an indication of why I think Scala has an edge.
>
> I have not done one to one comparison of Spark with Scala vs Spark
> with Python. I understand for data science purposes most libraries like
> TensorFlow etc. are written in Python but I am at loss to understand the
> validity of using Python with Spark for ETL purposes.
>
> These are my understanding but they are not facts so I would like to
> get some informed views on this if I can?
>
> Many thanks,
>
> Mich
>
>
>
>

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread ayan guha
I have one observation: is "Python UDF is slow due to the deserialization
penalty" still relevant? Even after Arrow is used for in-memory data
management, and after heavy investment from the Spark dev community in
making pandas a first-class citizen, including UDFs.

As I work with multiple clients, my experience is that org culture and
available people are the most important drivers for this choice, regardless
of the use case. The use case is relevant only when there is a feature
disparity.
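For readers hitting this question cold: the "deserialization penalty" refers to classic Python UDFs paying one JVM-to-Python serialization round-trip per row, while Arrow-backed pandas UDFs move whole column batches and amortize that cost. A stdlib-only sketch of the difference in shape (no Spark or Arrow involved; `cross_boundary` is an invented stand-in for the round-trip):

```python
crossings = {"classic": 0, "vectorized": 0}

def cross_boundary(kind):
    # Invented stand-in for one JVM<->Python serialization round-trip.
    crossings[kind] += 1

def classic_udf(x):
    cross_boundary("classic")      # paid once per row
    return x * 2 + 1

def vectorized_udf(batch):
    cross_boundary("vectorized")   # paid once per batch
    return [x * 2 + 1 for x in batch]

data = list(range(10_000))
out_classic = [classic_udf(x) for x in data]
out_vectorized = vectorized_udf(data)

assert out_classic == out_vectorized
print(crossings)  # {'classic': 10000, 'vectorized': 1}
```

In real PySpark the batch path is what `pyspark.sql.functions.pandas_udf` provides (Spark 2.3+), which is why the blanket "Python UDFs are slow" claim now needs qualifying.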

On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta 
wrote:

> Not quite sure how meaningful this discussion is, but in case someone is
> really faced with this query the question still is 'what is the use case'?
> I am just a bit confused with the one size fits all deterministic approach
> here; I thought those days were over almost 10 years ago.
> Regards
> Gourav
>
> On Sat, 10 Oct 2020, 21:24 Stephen Boesch,  wrote:
>
>> I agree with Wim's assessment of data engineering / ETL vs Data Science.
>>   I wrote pipelines/frameworks for large companies, and Scala was a much
>> better choice. But for ad-hoc work interfacing directly with data science
>> experiments, PySpark presents less friction.
>>
>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh 
>> wrote:
>>
>>> Many thanks everyone for their valuable contribution.
>>>
>>> We all started with Spark a few years ago where Scala was the talk
>>> of the town. I agree with the note that as long as Spark stayed niche and
>>> elite, then someone with Scala knowledge was attracting premiums. In
>>> fairness in 2014-2015, there was not much talk of Data Science input (I may
>>> be wrong). But the world has moved on so to speak. Python itself has been
>>> around a long time (long being relative here). Most people either knew UNIX
>>> Shell, C, Python or Perl or a combination of all these. I recall we had a
>>> director a few years ago who asked our Hadoop admin for root password to
>>> log in to the edge node. Later he became head of machine learning
>>> somewhere else and he loved C and Python. So Python was a gift in disguise.
>>> I think Python appeals to those who are very familiar with CLI and shell
>>> programming (Not GUI fan). As some members alluded to there are more people
>>> around with Python knowledge. Most managers choose Python as the unifying
>>> development tool because they feel comfortable with it. Frankly I have not
>>> seen a manager who feels at home with Scala. So in summary it is a bit
>>> disappointing to abandon Scala and switch to Python just for the sake of it.
>>>
>>> Disclaimer: These are opinions and not facts so to speak :)
>>>
>>> Cheers,
>>>
>>>
>>> Mich
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh 
>>> wrote:
>>>
 I have come across occasions when the teams use Python with Spark for
 ETL, for example processing data from S3 buckets into Snowflake with Spark.

 The only reason I think they are choosing Python as opposed to Scala is
 because they are more familiar with Python. Since Spark is written in
 Scala, itself is an indication of why I think Scala has an edge.

 I have not done one to one comparison of Spark with Scala vs Spark with
 Python. I understand for data science purposes most libraries like
 TensorFlow etc. are written in Python but I am at loss to understand the
 validity of using Python with Spark for ETL purposes.

 These are my understanding but they are not facts so I would like to
 get some informed views on this if I can?

 Many thanks,

 Mich







>>> --
Best Regards,
Ayan Guha


Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
Not quite sure how meaningful this discussion is, but in case someone is
really faced with this query, the question still is: "what is the use case?"
I am just a bit confused by the one-size-fits-all deterministic approach
here; I thought those days were over almost 10 years ago.
Regards
Gourav

On Sat, 10 Oct 2020, 21:24 Stephen Boesch,  wrote:

> I agree with Wim's assessment of data engineering / ETL vs Data Science.
>   I wrote pipelines/frameworks for large companies, and Scala was a much
> better choice. But for ad-hoc work interfacing directly with data science
> experiments, PySpark presents less friction.
>
> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh 
> wrote:
>
>> Many thanks everyone for their valuable contribution.
>>
>> We all started with Spark a few years ago where Scala was the talk of the
>> town. I agree with the note that as long as Spark stayed niche and elite,
>> then someone with Scala knowledge was attracting premiums. In fairness in
>> 2014-2015, there was not much talk of Data Science input (I may be wrong).
>> But the world has moved on so to speak. Python itself has been around
>> a long time (long being relative here). Most people either knew UNIX Shell,
>> C, Python or Perl or a combination of all these. I recall we had a director
>> a few years ago who asked our Hadoop admin for root password to log in to
>> the edge node. Later he became head of machine learning somewhere else and
>> he loved C and Python. So Python was a gift in disguise. I think Python
>> appeals to those who are very familiar with CLI and shell programming (Not
>> GUI fan). As some members alluded to there are more people around with
>> Python knowledge. Most managers choose Python as the unifying development
>> tool because they feel comfortable with it. Frankly I have not seen a
>> manager who feels at home with Scala. So in summary it is a bit
>> disappointing to abandon Scala and switch to Python just for the sake of it.
>>
>> Disclaimer: These are opinions and not facts so to speak :)
>>
>> Cheers,
>>
>>
>> Mich
>>
>>
>>
>>
>>
>>
>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh 
>> wrote:
>>
>>> I have come across occasions when the teams use Python with Spark for
>>> ETL, for example processing data from S3 buckets into Snowflake with Spark.
>>>
>>> The only reason I think they are choosing Python as opposed to Scala is
>>> because they are more familiar with Python. Since Spark is written in
>>> Scala, itself is an indication of why I think Scala has an edge.
>>>
>>> I have not done one to one comparison of Spark with Scala vs Spark with
>>> Python. I understand for data science purposes most libraries like
>>> TensorFlow etc. are written in Python but I am at loss to understand the
>>> validity of using Python with Spark for ETL purposes.
>>>
>>> These are my understanding but they are not facts so I would like to get
>>> some informed views on this if I can?
>>>
>>> Many thanks,
>>>
>>> Mich
>>>
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>


Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Stephen Boesch
I agree with Wim's assessment of data engineering / ETL vs Data Science.
I wrote pipelines/frameworks for large companies and Scala was a much
better choice. But for ad-hoc work interfacing directly with data science
experiments, PySpark presents less friction.



Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Mich Talebzadeh
Many thanks everyone for their valuable contribution.

We all started with Spark a few years ago when Scala was the talk of the
town. I agree with the note that as long as Spark stayed niche and elite,
someone with Scala knowledge attracted a premium. In fairness, in
2014-2015 there was not much talk of Data Science input (I may be wrong).
But the world has moved on, so to speak. Python itself has been around
a long time (long being relative here). Most people knew UNIX shell,
C, Python or Perl, or a combination of these. I recall a director
a few years ago who asked our Hadoop admin for the root password to log in to
the edge node. Later he became head of machine learning somewhere else, and
he loved C and Python. So Python was a blessing in disguise. I think Python
appeals to those who are very familiar with CLI and shell programming (not
GUI fans). As some members alluded to, there are more people around with
Python knowledge. Most managers choose Python as the unifying development
tool because they feel comfortable with it. Frankly, I have not seen a
manager who feels at home with Scala. So in summary it is a bit
disappointing to abandon Scala and switch to Python just for the sake of it.

Disclaimer: These are opinions and not facts so to speak :)

Cheers,


Mich








Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Jacek Pliszka
I would not leave it to data scientists unless they will maintain it.

The key decision in the cases I've seen was usually people
cost/availability, with ETL operations cost taken into account.

Often the ETL cloud cost is small and you will not save much there;
then it is just skills cost/availability.
Python skills cost less, you can pick people with other useful skills,
and you can more easily train the people you already have
internally.

Often you have some simple ETL scripts before moving to Spark, and
these scripts are usually written in Python.

Best Regards,

Jacek



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Jörn Franke
It really depends on what language your data scientists speak. I don’t think it makes 
sense for ad hoc data science things to impose a language on them; let them 
choose.
For more complex AI engineering things, though, you can apply different standards 
and criteria. And then it really depends on architecture aspects etc.



Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Wim Van Leuven
Hey Mich,

This is a very fair question. I've seen many data engineering teams start
out with Scala because technically it is the best choice for many good
reasons, and basically it is what Spark itself is written in.

On the other hand, almost all use cases we see these days are data science
use cases where people mostly do Python. So, if you need those two worlds
to collaborate and even hand over code, you don't want the ideological battle
of Scala vs Python. We chose Python for the sake of everybody speaking the
same language.

That holds up especially if you do Spark DataFrames, because then PySpark is a thin
layer around everything on the JVM. Even the discussion of Python UDFs
doesn't hold up: if it works as a Python function (and most of the time it
does), why do Scala? If, however, performance characteristics show you
otherwise, implement those UDFs on the JVM.

The problem with Python? Good engineering practices translated into tools are
much rarer ... a build tool like Maven for Java or SBT for Scala doesn't
exist ... yet? You can look at PyBuilder for this.

So, referring to the website you mention ... in practice, because of the
many data science use cases out there, I see many Spark shops prefer Python
over Scala, because Spark gravitates to DataFrames, where the downsides of
Python do not stack up. The performance of Python as the driver program, which is
just the glue code, becomes irrelevant compared to the processing you are
doing on the JVM. We even notice that Python is much easier, and we hear
echoes that finding (good?) Scala engineers is hard(er).

So, conclusion: Python brings data engineers and data science together. If
you only do data engineering, Scala can be the better choice. It depends on
the context.

Hope this helps
-wim


Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
What is the use case?
Unless you have unlimited funding and time to waste, you would usually start
with that question.

Regards,
Gourav



Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Russell Spitzer
Spark in Scala (or Java) is much more performant if you are using RDDs;
those operations basically force you to pass lambdas, hit serialization
between Java and Python types, and yes, hit the Global Interpreter Lock. But
none of those things apply to DataFrames, which will generate Java code
regardless of what language you use to describe the DataFrame operations, as
long as you don't use Python lambdas. A DataFrame operation without Python
lambdas should not require any remote Python code execution.

TL;DR: if you are using DataFrames it doesn't matter whether you use Scala, Java,
Python, R or SQL; the planning and work will all happen in the JVM.

As for a REPL, you can run PySpark, which will start up a REPL. There are
also a slew of notebooks which provide interactive Python environments as
well.




Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Mich Talebzadeh
Thanks

So, ignoring Python lambdas, is an individual's familiarity with
the language the most important factor? Also, I have noticed that
the Spark documentation has switched from Scala to Python as the
first example. However, some code, for example JDBC calls, is the same for
Scala and Python.

Some examples like this website

claim that Scala performance is an order of magnitude better than Python's,
and that when it comes to concurrency Scala is a better choice. Maybe that is
because it is pretty old (2018)?

Also (and it may be my ignorance, I have not researched it), does Spark offer
a REPL in the form of spark-shell for Python?


Regards,

Mich





Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Russell Spitzer
As long as you don't use Python lambdas in your Spark job there should be
almost no difference between the Scala and Python DataFrame code. Once you
introduce Python lambdas you will hit some significant serialization
penalties, as well as have to run actual work code in Python. As long as no
lambdas are used, everything will operate with Catalyst-compiled Java code,
so there won't be a big difference between Python and Scala.

On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh 
wrote:

> I have come across occasions when teams use Python with Spark for ETL,
> for example processing data from S3 buckets into Snowflake with Spark.
>
> The only reason I think they are choosing Python as opposed to Scala is
> that they are more familiar with Python. The fact that Spark itself is
> written in Scala is an indication of why I think Scala has an edge.
>
> I have not done a one-to-one comparison of Spark with Scala vs Spark with
> Python. I understand that for data science purposes most libraries like
> TensorFlow etc. are written in Python, but I am at a loss to understand the
> validity of using Python with Spark for ETL purposes.
>
> This is my understanding, but these are not facts, so I would like to get
> some informed views on this if I can.
>
> Many thanks,
>
> Mich