Hi,

I spent a few days converting one of my Spark/Scala scripts to Python. It
was interesting, but at times it felt like trench warfare. There is a lot
of handy stuff in Scala, like case classes for defining column headers,
that does not seem to be available in Python (possibly my lack of in-depth
Python knowledge). Indeed, the Spark documentation frequently states that
features are available in Scala and Java but not in Python.
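For what it is worth, plain Python does have a rough analogue of a Scala case class in `dataclasses` (and in PySpark proper one would typically describe columns with `pyspark.sql.types.StructType` or `Row`). A minimal sketch, with a hypothetical `Trade` record whose field names are purely illustrative:

```python
from dataclasses import dataclass

# Rough Python analogue of a Scala case class used to name columns.
# (Hypothetical example; the Trade fields are not from any real schema.)
@dataclass(frozen=True)  # frozen ~ immutable, like a Scala case class
class Trade:
    ticker: str
    price: float

row = Trade("IBM", 125.40)
print(row.ticker)  # field access by name, as with a case class
print(row)         # auto-generated repr, similar to case-class toString
```

It is not a like-for-like replacement (no pattern matching, no `copy` with positional elegance), but it covers the "named, typed record" use case.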

Looking around, much of what is written for Spark using Python reads like a
work-around. I am not considering Python for data science, as my focus has
been on using Python with Spark for ETL; I published a thread on this today
with two examples of the code written in Scala and Python respectively. OK,
I admit lambda functions with map in Python are a great feature, but that
is all; the rest can be achieved better with Scala. So I buy the view that
people tend to use Python with Spark for ETL because (with great respect)
they cannot be bothered to pick up Scala (I trust I am not being unkind).
So that is it. When I was converting the code, I remembered that I still
use a Nokia 8210 (21-year-old technology) from time to time. Old, sturdy,
long battery life and very small. Compare that with an iPhone. That is a
fair comparison between Spark on Scala and Spark on Python :)
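The lambda-with-map pattern conceded above can be sketched in plain Python; in Spark the same lambda would be passed to `rdd.map(...)`. The prices and VAT rate here are made-up illustration values:

```python
# Plain-Python illustration of the lambda-with-map pattern;
# with Spark one would write rdd.map(lambda p: ...) instead.
prices = [10.0, 20.0, 30.0]          # hypothetical input values
with_vat = list(map(lambda p: round(p * 1.2, 2), prices))
print(with_vat)  # [12.0, 24.0, 36.0]
```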

HTH











LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 11 Oct 2020 at 20:46, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> With regard to your statement below
>
> ".technology choices are agnostic to use cases according to you...."
>
> If I may say, I do not think that was the message implied. What was said
> was that in addition to "best technology fit" there are other factors
> "equally important" that need to be considered, when a company makes a
> decision on a given product use case.
>
> As others have stated, the technology stack you choose may not be the
> best available technology but something that provides an adequate solution
> at a reasonable TCO. Case in point: if Scala is the best fit for a given
> use case but comes at a higher TCO (labour cost), then you may opt for
> Python or another language because you have those resources available
> in-house at lower cost and your Data Scientists are eager to invest in
> Python. Companies these days are very careful about where to spend their
> technology dollars, or they cancel projects entirely. From my experience,
> the following are crucial in deciding what to invest in:
>
>
>    - Total Cost of Ownership
>    - Internal supportability and operability, thus avoiding a single
>    point of failure
>    - Maximum leverage, strategic as opposed to tactical (for example, is
>    Python or Scala considered more of a strategic product?)
>    - Agile- and DevOps-compatible
>    - Cloud-ready, flexible, scale-out
>    - Vendor support
>    - Documentation
>    - Minimal footprint
>
> I trust this answers your point.
>
>
> Mich
>
>
>
>
>
>
>
>
>
>
>
> On Sun, 11 Oct 2020 at 17:39, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> So Mich and rest,
>>
>> technology choices are agnostic to use cases, according to you? This is
>> interesting, really interesting. Perhaps I stand corrected.
>>
>> Regards,
>> Gourav
>>
>> On Sun, Oct 11, 2020 at 5:00 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> If we take Spark and its massively parallel processing and in-memory
>>> cache away, then one can argue that anything can do the "ETL" job: just
>>> write some Java/Scala/SQL/Perl/Python to read data from one DB and
>>> write it to another, often using JDBC connections. However, we all
>>> concur that may not be good enough with Big Data volumes. Generally
>>> speaking, there are two ways of making a process faster:
>>>
>>>
>>>    1. Do more intelligent work by creating indexes, cubes, etc., thus
>>>    reducing the processing time
>>>    2. Throw hardware and memory at it, using something like a
>>>    multi-node Spark cluster on a fully managed cloud service such as
>>>    Google Dataproc
>>>
>>>
>>> In general, one would see an order-of-magnitude performance gain.
>>>
>>>
>>> HTH,
>>>
>>>
>>> Mich
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, 11 Oct 2020 at 13:33, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> But when you have a fairly large volume of data, that is where Spark
>>>> comes into the party. And I assume the requirement to use Spark is
>>>> already established in the original question, and the discussion is
>>>> about using Python vs Scala/Java.
>>>>
>>>> On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski <skacan...@gmail.com>
>>>> wrote:
>>>>
>>>>> If an org has folks who can do Python seriously, why then Spark in
>>>>> the first place? You can build the workflow on your own: streaming,
>>>>> batch, or whatever you want.
>>>>> I would not do anything else aside from Python, but that is me.
>>>>>
>>>>> On Sat, Oct 10, 2020, 9:42 PM ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>>> I have one observation: is "Python UDFs are slow due to the
>>>>>> deserialization penalty" still relevant, even after Arrow is used for
>>>>>> in-memory data management and after heavy investment from the Spark
>>>>>> dev community in making pandas a first-class citizen, including UDFs?
>>>>>>
>>>>>> As I work with multiple clients, my experience is that org culture
>>>>>> and the available people are the most important drivers for this
>>>>>> choice, regardless of the use case. The use case is relevant only
>>>>>> when there is a feature disparity.
>>>>>>
>>>>>> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <
>>>>>> gourav.sengu...@gmail.com> wrote:
>>>>>>
>>>>>>> Not quite sure how meaningful this discussion is, but in case
>>>>>>> someone is really faced with this query, the question still is:
>>>>>>> what is the use case?
>>>>>>> I am just a bit confused by the one-size-fits-all deterministic
>>>>>>> approach here; I thought those days were over almost 10 years ago.
>>>>>>> Regards,
>>>>>>> Gourav
>>>>>>>
>>>>>>> On Sat, 10 Oct 2020, 21:24 Stephen Boesch, <java...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I agree with Wim's assessment of data engineering/ETL vs Data
>>>>>>>> Science. I wrote pipelines/frameworks for large companies, and
>>>>>>>> Scala was a much better choice. But for ad-hoc work interfacing
>>>>>>>> directly with data science experiments, PySpark presents less
>>>>>>>> friction.
>>>>>>>>
>>>>>>>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <
>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Many thanks to everyone for their valuable contributions.
>>>>>>>>>
>>>>>>>>> We all started with Spark a few years ago, when Scala was the
>>>>>>>>> talk of the town. I agree with the note that as long as Spark
>>>>>>>>> stayed niche and elite, someone with Scala knowledge attracted a
>>>>>>>>> premium. In fairness, in 2014-2015 there was not much talk of
>>>>>>>>> Data Science input (I may be wrong). But the world has moved on,
>>>>>>>>> so to speak. Python itself has been around a long time (long
>>>>>>>>> being relative here). Most people knew UNIX shell, C, Python or
>>>>>>>>> Perl, or a combination of these. I recall a director a few years
>>>>>>>>> ago who asked our Hadoop admin for the root password to log in to
>>>>>>>>> the edge node. He later became head of machine learning somewhere
>>>>>>>>> else, and he loved C and Python. So Python was a blessing in
>>>>>>>>> disguise. I think Python appeals to those who are very familiar
>>>>>>>>> with the CLI and shell programming (not GUI fans). As some members
>>>>>>>>> alluded to, there are more people around with Python knowledge.
>>>>>>>>> Most managers choose Python as the unifying development tool
>>>>>>>>> because they feel comfortable with it; frankly, I have not seen a
>>>>>>>>> manager who feels at home with Scala. So, in summary, it is a bit
>>>>>>>>> disappointing to abandon Scala and switch to Python just for the
>>>>>>>>> sake of it.
>>>>>>>>>
>>>>>>>>> Disclaimer: These are opinions and not facts so to speak :)
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Mich
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh <
>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I have come across occasions when the teams use Python with Spark
>>>>>>>>>> for ETL, for example processing data from S3 buckets into Snowflake 
>>>>>>>>>> with
>>>>>>>>>> Spark.
>>>>>>>>>>
>>>>>>>>>> The only reason I think they are choosing Python as opposed to
>>>>>>>>>> Scala is that they are more familiar with Python. The fact that
>>>>>>>>>> Spark is written in Scala is itself an indication of why I think
>>>>>>>>>> Scala has an edge.
>>>>>>>>>>
>>>>>>>>>> I have not done a one-to-one comparison of Spark with Scala vs
>>>>>>>>>> Spark with Python. I understand that for data science purposes
>>>>>>>>>> most libraries like TensorFlow etc. are written in Python, but I
>>>>>>>>>> am at a loss to understand the validity of using Python with
>>>>>>>>>> Spark for ETL purposes.
>>>>>>>>>>
>>>>>>>>>> These are my understandings rather than facts, so I would like
>>>>>>>>>> to get some informed views on this if I can.
>>>>>>>>>>
>>>>>>>>>> Many thanks,
>>>>>>>>>>
>>>>>>>>>> Mich
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
