Re: Scala vs Python for ETL with Spark

Mich Talebzadeh Sun, 11 Oct 2020 09:00:32 -0700

If we take away Spark with its massively parallel processing and in-memory
cache, then one can argue that anything can do the "ETL" job: just write
some Java/Scala/SQL/Perl/Python to read data from one DB and write it to
another, often using JDBC connections. However, we all concur that this may
not be good enough with Big Data volumes. Generally speaking, there are two
ways of making a process faster:



   1. Do more intelligent work by creating indexes, cubes etc., thus reducing
   the processing time
   2. Throw hardware and memory at it, using something like a multi-node Spark
   cluster on a fully managed cloud service such as Google Dataproc


In general, one would see an order-of-magnitude performance gain.
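
As a rough illustration of the first option (doing more intelligent work),
here is a minimal sketch using Python's built-in sqlite3 module. The orders
table, its columns and the index name are hypothetical and only serve the
example; the point is that the same lookup goes from a full table scan to an
index search once the index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the plan detail strings SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable plan text is the last column of each row.
    return [row[-1] for row in rows]


# Hypothetical in-memory table standing in for a source database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(100_000)],
)

sql = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index, SQLite has to scan the whole table.
plan_before = query_plan(conn, sql, (42,))

# With an index on the filter column, it can search the index instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, so it is
best read by eye rather than parsed. The second option would look quite
different: the work is spread across executors rather than avoided.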


HTH,


Mich



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 11 Oct 2020 at 13:33, ayan guha <guha.a...@gmail.com> wrote:

> But when you have a fairly large volume of data, that is where Spark comes
> to the party. And I assume the requirement of using Spark is already
> established in the original question, and the discussion is about using
> Python vs Scala/Java.
>
> On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski <skacan...@gmail.com>
> wrote:
>
>> If the org has folks who can do Python seriously, why then Spark in the
>> first place? You can do the workflow on your own, streaming or batch or
>> whatever you want.
>> I would not do anything else aside from Python, but that is me.
>>
>> On Sat, Oct 10, 2020, 9:42 PM ayan guha <guha.a...@gmail.com> wrote:
>>
>>> I have one observation: is "Python UDFs are slow due to the
>>> deserialization penalty" still relevant, even now that Arrow is used for
>>> in-memory data management and the Spark dev community has invested so
>>> heavily in making pandas a first-class citizen, including UDFs?
>>>
>>> As I work with multiple clients, my experience is that org culture and
>>> available people are the most important drivers of this choice,
>>> regardless of the use case. The use case is relevant only when there is
>>> a feature disparity.
>>>
>>> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> Not quite sure how meaningful this discussion is, but in case someone
>>>> is really faced with this query, the question still is 'what is the use
>>>> case?'
>>>> I am just a bit confused by the one-size-fits-all deterministic
>>>> approach here; I thought those days were over almost 10 years ago.
>>>> Regards
>>>> Gourav
>>>>
>>>> On Sat, 10 Oct 2020, 21:24 Stephen Boesch, <java...@gmail.com> wrote:
>>>>
>>>>> I agree with Wim's assessment of data engineering / ETL vs Data
>>>>> Science. I wrote pipelines/frameworks for large companies and Scala was
>>>>> a much better choice. But for ad-hoc work interfacing directly with data
>>>>> science experiments, PySpark presents less friction.
>>>>>
>>>>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Many thanks everyone for their valuable contribution.
>>>>>>
>>>>>> We all started with Spark a few years ago, when Scala was the talk
>>>>>> of the town. I agree with the note that as long as Spark stayed niche
>>>>>> and elite, someone with Scala knowledge could command a premium. In
>>>>>> fairness, in 2014-2015 there was not much talk of Data Science input
>>>>>> (I may be wrong). But the world has moved on, so to speak. Python
>>>>>> itself has been around a long time (long being relative here). Most
>>>>>> people knew UNIX Shell, C, Python or Perl, or a combination of these.
>>>>>> I recall we had a director a few years ago who asked our Hadoop admin
>>>>>> for the root password to log in to the edge node. Later he became head
>>>>>> of machine learning somewhere else, and he loved C and Python. So
>>>>>> Python was a gift in disguise. I think Python appeals to those who are
>>>>>> very familiar with the CLI and shell programming (not GUI fans). As
>>>>>> some members alluded to, there are more people around with Python
>>>>>> knowledge. Most managers choose Python as the unifying development
>>>>>> tool because they feel comfortable with it. Frankly, I have not seen a
>>>>>> manager who feels at home with Scala. So in summary, it is a bit
>>>>>> disappointing to abandon Scala and switch to Python just for the sake
>>>>>> of it.
>>>>>>
>>>>>> Disclaimer: These are opinions and not facts so to speak :)
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>
>>>>>> Mich
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> I have come across occasions when the teams use Python with Spark
>>>>>>> for ETL, for example processing data from S3 buckets into Snowflake with
>>>>>>> Spark.
>>>>>>>
>>>>>>> The only reason I think they are choosing Python as opposed to Scala
>>>>>>> is that they are more familiar with Python. The fact that Spark itself
>>>>>>> is written in Scala is an indication of why I think Scala has an edge.
>>>>>>>
>>>>>>> I have not done a one-to-one comparison of Spark with Scala vs Spark
>>>>>>> with Python. I understand that for data science purposes most libraries
>>>>>>> like TensorFlow etc. are mainly used from Python, but I am at a loss to
>>>>>>> understand the validity of using Python with Spark for ETL purposes.
>>>>>>>
>>>>>>> This is my understanding, not established fact, so I would like to
>>>>>>> get some informed views on this, if I can.
>>>>>>>
>>>>>>> Many thanks,
>>>>>>>
>>>>>>> Mich
>>>>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>> --
> Best Regards,
> Ayan Guha
>
