Re: Scala vs Python for ETL with Spark

Gourav Sengupta Sun, 11 Oct 2020 09:39:36 -0700

So Mich and rest,

technology choices are agnostic to use cases according to you? This is
interesting, really interesting. Perhaps I stand corrected.


Regards,
Gourav

On Sun, Oct 11, 2020 at 5:00 PM Mich Talebzadeh <[email protected]>
wrote:

> If we take away Spark's massively parallel processing and in-memory
> cache, then one can argue that anything can do the "ETL" job: just write
> some Java/Scala/SQL/Perl/Python to read data from one DB and write it to
> another, often using JDBC connections. However, we all concur that this
> may not be good enough with Big Data volumes. Generally speaking, there
> are two ways of making a process faster:
>
>
>    1. Do more intelligent work by creating indexes, cubes etc., thus
>    reducing the processing time
>    2. Throw hardware and memory at it using something like a multi-node
>    Spark cluster on a fully managed cloud service like Google Dataproc
>
>
> In general, one would see an order-of-magnitude performance gain.
>
>
> HTH,
>
>
> Mich
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 11 Oct 2020 at 13:33, ayan guha <[email protected]> wrote:
>
>> But when you have a fairly large volume of data, that is where Spark
>> comes to the party. I assume the requirement to use Spark is already
>> established in the original question, and the discussion is about
>> Python vs Scala/Java.
>>
>> On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski <[email protected]>
>> wrote:
>>
>>> If the org has folks who can do Python seriously, why use Spark in the
>>> first place? You can build the workflow on your own, streaming or batch
>>> or whatever you want.
>>> I would not do anything else aside from Python, but that is me.
>>>
>>> On Sat, Oct 10, 2020, 9:42 PM ayan guha <[email protected]> wrote:
>>>
>>>> I have one observation: is "Python UDFs are slow due to the
>>>> deserialization penalty" still relevant, even now that Arrow is used
>>>> for in-memory data management and the Spark dev community has invested
>>>> so heavily in making pandas a first-class citizen, including UDFs?
>>>>
>>>> As I work with multiple clients, my experience is that org culture and
>>>> available people are the most important drivers for this choice,
>>>> regardless of the use case. The use case is relevant only when there
>>>> is a feature disparity.
>>>>
>>>> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <[email protected]> wrote:
>>>>
>>>>> Not quite sure how meaningful this discussion is, but in case someone
>>>>> is really faced with this query, the question still is 'what is the
>>>>> use case?'
>>>>> I am just a bit confused by the one-size-fits-all deterministic
>>>>> approach here; I thought those days were over almost 10 years ago.
>>>>> Regards
>>>>> Gourav
>>>>>
>>>>> On Sat, 10 Oct 2020, 21:24 Stephen Boesch, <[email protected]> wrote:
>>>>>
>>>>>> I agree with Wim's assessment of data engineering / ETL vs Data
>>>>>> Science. I wrote pipelines/frameworks for large companies and Scala
>>>>>> was a much better choice. But for ad-hoc work interfacing directly
>>>>>> with data science experiments, PySpark presents less friction.
>>>>>>
>>>>>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <[email protected]> wrote:
>>>>>>
>>>>>>> Many thanks everyone for their valuable contribution.
>>>>>>>
>>>>>>> We all started with Spark a few years ago, when Scala was the talk
>>>>>>> of the town. I agree with the note that as long as Spark stayed
>>>>>>> niche and elite, someone with Scala knowledge attracted a premium.
>>>>>>> In fairness, in 2014-2015 there was not much talk of Data Science
>>>>>>> input (I may be wrong). But the world has moved on, so to speak.
>>>>>>> Python itself has been around a long time (long being relative
>>>>>>> here). Most people knew UNIX Shell, C, Python or Perl, or some
>>>>>>> combination of these. I recall we had a director a few years ago
>>>>>>> who asked our Hadoop admin for the root password to log in to the
>>>>>>> edge node. Later he became head of machine learning somewhere else,
>>>>>>> and he loved C and Python. So Python was a gift in disguise. I think
>>>>>>> Python appeals to those who are very familiar with the CLI and shell
>>>>>>> programming (not GUI fans). As some members alluded to, there are
>>>>>>> more people around with Python knowledge. Most managers choose
>>>>>>> Python as the unifying development tool because they feel
>>>>>>> comfortable with it. Frankly, I have not seen a manager who feels at
>>>>>>> home with Scala. So in summary, it is a bit disappointing to abandon
>>>>>>> Scala and switch to Python just for the sake of it.
>>>>>>>
>>>>>>> Disclaimer: These are opinions and not facts so to speak :)
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>>
>>>>>>> Mich
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh <[email protected]> wrote:
>>>>>>>
>>>>>>>> I have come across occasions when teams use Python with Spark for
>>>>>>>> ETL, for example processing data from S3 buckets into Snowflake
>>>>>>>> with Spark.
>>>>>>>>
>>>>>>>> The only reason I think they are choosing Python as opposed to
>>>>>>>> Scala is that they are more familiar with Python. The fact that
>>>>>>>> Spark is written in Scala is itself an indication of why I think
>>>>>>>> Scala has an edge.
>>>>>>>>
>>>>>>>> I have not done a one-to-one comparison of Spark with Scala vs
>>>>>>>> Spark with Python. I understand that for data science purposes most
>>>>>>>> libraries like TensorFlow etc. are used from Python, but I am at a
>>>>>>>> loss to understand the validity of using Python with Spark for ETL
>>>>>>>> purposes.
>>>>>>>>
>>>>>>>> This is my understanding, not established fact, so I would like to
>>>>>>>> get some informed views on this if I can.
>>>>>>>>
>>>>>>>> Many thanks,
>>>>>>>>
>>>>>>>> Mich
>>>>>>>>
>>>>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>> --
>> Best Regards,
>> Ayan Guha
>>
>
