Re: Scala vs Python for ETL with Spark

ayan guha Sun, 11 Oct 2020 05:33:32 -0700

But when you have a fairly large volume of data, that is where Spark comes
to the party. And I assume the requirement to use Spark is already
established in the original question, and the discussion is about using
Python vs Scala/Java.


On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski <[email protected]> wrote:

> If the org has folks who can do Python seriously, why use Spark in the first
> place? You can build the workflow on your own, streaming or batch or whatever
> you want.
> I would not do anything else aside from Python, but that is me.
>
> On Sat, Oct 10, 2020, 9:42 PM ayan guha <[email protected]> wrote:
>
>> I have one observation: is "Python UDFs are slow due to the deserialization
>> penalty" still relevant, even now that Arrow is used for in-memory data
>> management and the Spark dev community has invested so heavily in making
>> pandas a first-class citizen, including UDFs?
>>
>> As I work with multiple clients, my experience is that org culture and the
>> available people are the most important drivers of this choice, regardless
>> of the use case. The use case is relevant only when there is a feature
>> disparity.
>>
>> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <
>> [email protected]> wrote:
>>
>>> Not quite sure how meaningful this discussion is, but in case someone is
>>> really faced with this query, the question still is 'what is the use case?'
>>> I am just a bit confused by the one-size-fits-all deterministic
>>> approach here; I thought those days were over almost 10 years ago.
>>> Regards
>>> Gourav
>>>
>>> On Sat, 10 Oct 2020, 21:24 Stephen Boesch, <[email protected]> wrote:
>>>
>>>> I agree with Wim's assessment of data engineering / ETL vs Data
>>>> Science. I wrote pipelines/frameworks for large companies, and Scala was
>>>> a much better choice. But for ad-hoc work interfacing directly with data
>>>> science experiments, PySpark presents less friction.
>>>>
>>>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <
>>>> [email protected]> wrote:
>>>>
>>>>> Many thanks everyone for their valuable contribution.
>>>>>
>>>>> We all started with Spark a few years ago, when Scala was the talk
>>>>> of the town. I agree with the note that as long as Spark stayed niche and
>>>>> elite, someone with Scala knowledge could command a premium. In
>>>>> fairness, in 2014-2015 there was not much talk of Data Science input (I
>>>>> may be wrong). But the world has moved on, so to speak. Python itself has
>>>>> been around a long time (long being relative here). Most people knew
>>>>> UNIX shell, C, Python or Perl, or some combination of these. I recall we
>>>>> had a director a few years ago who asked our Hadoop admin for the root
>>>>> password to log in to the edge node. Later he became head of machine
>>>>> learning somewhere else, and he loved C and Python. So Python was a gift
>>>>> in disguise. I think Python appeals to those who are very familiar with
>>>>> the CLI and shell programming (not GUI fans). As some members alluded to,
>>>>> there are more people around with Python knowledge. Most managers choose
>>>>> Python as the unifying development tool because they feel comfortable
>>>>> with it. Frankly, I have not seen a manager who feels at home with Scala.
>>>>> So in summary, it is a bit disappointing to abandon Scala and switch to
>>>>> Python just for the sake of it.
>>>>>
>>>>> Disclaimer: These are opinions and not facts so to speak :)
>>>>>
>>>>> Cheers,
>>>>>
>>>>>
>>>>> Mich
>>>>>
>>>>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> I have come across occasions when teams use Python with Spark for
>>>>>> ETL, for example processing data from S3 buckets into Snowflake with
>>>>>> Spark.
>>>>>>
>>>>>> The only reason I think they are choosing Python as opposed to Scala
>>>>>> is that they are more familiar with Python. The fact that Spark itself
>>>>>> is written in Scala is an indication of why I think Scala has an edge.
>>>>>>
>>>>>> I have not done a one-to-one comparison of Spark with Scala vs Spark
>>>>>> with Python. I understand that for data science purposes most libraries,
>>>>>> like TensorFlow etc., are written in Python, but I am at a loss to
>>>>>> understand the validity of using Python with Spark for ETL purposes.
>>>>>>
>>>>>> This is my understanding, not established fact, so I would like to
>>>>>> get some informed views on this if I can.
>>>>>>
>>>>>> Many thanks,
>>>>>>
>>>>>> Mich
>> Best Regards,
>> Ayan Guha
>>
> --
Best Regards,
Ayan Guha
