But when you have a fairly large volume of data, that is where Spark comes to the party. I assume the requirement to use Spark was already established in the original question, and the discussion is about using Python vs Scala/Java.

On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski <skacan...@gmail.com> wrote:

> If an org has folks who can do Python seriously, why then Spark in the
> first place? You can do the workflow on your own, streaming or batch or
> whatever you want.
> I would not do anything else aside from Python, but that is me.
>
> On Sat, Oct 10, 2020, 9:42 PM ayan guha <guha.a...@gmail.com> wrote:
>
>> I have one observation: is "python udf is slow due to deserialization
>> penalty" still relevant? Even after Arrow is used for in-memory data
>> management, and after such heavy investment from the Spark dev community
>> in making pandas a first-class citizen, including UDFs?
>>
>> As I work with multiple clients, my experience is that org culture and
>> available people are the most important drivers for this choice,
>> regardless of the use case. The use case is relevant only when there is
>> a feature disparity.
>>
>> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta
>> <gourav.sengu...@gmail.com> wrote:
>>
>>> Not quite sure how meaningful this discussion is, but in case someone is
>>> really faced with this query, the question still is 'what is the use
>>> case'? I am just a bit confused by the one-size-fits-all deterministic
>>> approach here; I thought those days were over almost 10 years ago.
>>> Regards
>>> Gourav
>>>
>>> On Sat, 10 Oct 2020, 21:24 Stephen Boesch, <java...@gmail.com> wrote:
>>>
>>>> I agree with Wim's assessment of data engineering / ETL vs Data
>>>> Science. I wrote pipelines/frameworks for large companies and Scala was
>>>> a much better choice. But for ad-hoc work interfacing directly with
>>>> data science experiments, PySpark presents less friction.
>>>>
>>>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh
>>>> <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Many thanks everyone for their valuable contribution.
>>>>>
>>>>> We all started with Spark a few years ago, when Scala was the talk
>>>>> of the town.
>>>>> I agree with the note that as long as Spark stayed niche and
>>>>> elite, someone with Scala knowledge attracted a premium. In
>>>>> fairness, in 2014-2015 there was not much talk of Data Science input
>>>>> (I may be wrong). But the world has moved on, so to speak. Python
>>>>> itself has been around a long time (long being relative here). Most
>>>>> people knew UNIX Shell, C, Python or Perl, or a combination of these.
>>>>> I recall we had a director a few years ago who asked our Hadoop admin
>>>>> for the root password to log in to the edge node. Later he became
>>>>> head of machine learning somewhere else, and he loved C and Python.
>>>>> So Python was a gift in disguise.
>>>>> I think Python appeals to those who are very familiar with the CLI
>>>>> and shell programming (not GUI fans). As some members alluded to,
>>>>> there are more people around with Python knowledge. Most managers
>>>>> choose Python as the unifying development tool because they feel
>>>>> comfortable with it. Frankly, I have not seen a manager who feels at
>>>>> home with Scala. So in summary, it is a bit disappointing to abandon
>>>>> Scala and switch to Python just for the sake of it.
>>>>>
>>>>> Disclaimer: These are opinions and not facts, so to speak :)
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Mich
>>>>>
>>>>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh
>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> I have come across occasions when teams use Python with Spark for
>>>>>> ETL, for example processing data from S3 buckets into Snowflake with
>>>>>> Spark.
>>>>>>
>>>>>> The only reason I think they are choosing Python as opposed to Scala
>>>>>> is that they are more familiar with Python. The fact that Spark is
>>>>>> written in Scala is itself an indication of why I think Scala has an
>>>>>> edge.
>>>>>>
>>>>>> I have not done a one-to-one comparison of Spark with Scala vs Spark
>>>>>> with Python.
>>>>>> I understand that for data science purposes most libraries like
>>>>>> TensorFlow etc. are written in Python, but I am at a loss to
>>>>>> understand the validity of using Python with Spark for ETL purposes.
>>>>>>
>>>>>> This is my understanding, not fact, so I would like to get some
>>>>>> informed views on this if I can?
>>>>>>
>>>>>> Many thanks,
>>>>>>
>>>>>> Mich
>>>>>>
>>>>>> LinkedIn:
>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property
>>>>>> which may arise from relying on this email's technical content is
>>>>>> explicitly disclaimed. The author will in no case be liable for any
>>>>>> monetary damages arising from such loss, damage or destruction.
>>>>>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
--
Best Regards,
Ayan Guha