So, Mich and the rest: according to you, technology choices are agnostic to use cases? This is interesting, really interesting. Perhaps I stand corrected.
Regards,
Gourav

On Sun, Oct 11, 2020 at 5:00 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> If we take Spark and its massively parallel processing and in-memory
> cache away, then one can argue anything can do the "ETL" job: just write
> some Java/Scala/SQL/Perl/Python to read data from one DB and write it to
> another, often using JDBC connections. However, we all concur that may not
> be good enough with Big Data volumes. Generally speaking, there are two
> ways of making a process faster:
>
>    1. Do more intelligent work by creating indexes, cubes etc., thus
>       reducing the processing time
>    2. Throw hardware and memory at it using something like a Spark
>       multi-cluster with a fully managed cloud service like Google Dataproc
>
> In general, one would see an order-of-magnitude performance gain.
>
> HTH,
>
> Mich
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Sun, 11 Oct 2020 at 13:33, ayan guha <guha.a...@gmail.com> wrote:
>
>> But when you have a fairly large volume of data, that is where Spark
>> comes to the party. And I assume the requirement of using Spark is already
>> established in the original question and the discussion is about using
>> Python vs Scala/Java.
>>
>> On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski <skacan...@gmail.com>
>> wrote:
>>
>>> If the org has folks that can do Python seriously, why then Spark in
>>> the first place? You can do workflow on your own, streaming or batch or
>>> whatever you want.
>>> I would not do anything else aside from Python, but that is me.
>>>
>>> On Sat, Oct 10, 2020, 9:42 PM ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> I have one observation: is "Python UDF is slow due to deserialization
>>>> penalty" still relevant? Even after Arrow is used for in-memory data
>>>> management, and after such heavy investment from the Spark dev community
>>>> in making pandas a first-class citizen, including UDFs?
>>>>
>>>> As I work with multiple clients, my experience is that org culture and
>>>> available people are the most important drivers for this choice,
>>>> regardless of the use case. The use case is relevant only when there is
>>>> a feature disparity.
>>>>
>>>> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <
>>>> gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Not quite sure how meaningful this discussion is, but in case someone
>>>>> is really faced with this query, the question still is 'what is the use
>>>>> case'?
>>>>> I am just a bit confused by the one-size-fits-all deterministic
>>>>> approach here; I thought that those days were over almost 10 years ago.
>>>>> Regards
>>>>> Gourav
>>>>>
>>>>> On Sat, 10 Oct 2020, 21:24 Stephen Boesch, <java...@gmail.com> wrote:
>>>>>
>>>>>> I agree with Wim's assessment of data engineering / ETL vs Data
>>>>>> Science. I wrote pipelines/frameworks for large companies and Scala
>>>>>> was a much better choice. But for ad-hoc work interfacing directly
>>>>>> with data science experiments, PySpark presents less friction.
>>>>>>
>>>>>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Many thanks everyone for their valuable contribution.
>>>>>>>
>>>>>>> We all started with Spark a few years ago when Scala was the talk
>>>>>>> of the town. I agree with the note that as long as Spark stayed niche
>>>>>>> and elite, someone with Scala knowledge attracted a premium. In
>>>>>>> fairness, in 2014-2015 there was not much talk of Data Science input
>>>>>>> (I may be wrong). But the world has moved on, so to speak.
>>>>>>> Python itself has been around a long time (long being relative
>>>>>>> here). Most people knew UNIX shell, C, Python or Perl, or a
>>>>>>> combination of these. I recall we had a director a few years ago who
>>>>>>> asked our Hadoop admin for the root password to log in to the edge
>>>>>>> node. Later he became head of machine learning somewhere else, and he
>>>>>>> loved C and Python. So Python was a gift in disguise.
>>>>>>> I think Python appeals to those who are very familiar with CLI and
>>>>>>> shell programming (not GUI fans). As some members alluded to, there
>>>>>>> are more people around with Python knowledge. Most managers choose
>>>>>>> Python as the unifying development tool because they feel comfortable
>>>>>>> with it. Frankly, I have not seen a manager who feels at home with
>>>>>>> Scala. So in summary it is a bit disappointing to abandon Scala and
>>>>>>> switch to Python just for the sake of it.
>>>>>>>
>>>>>>> Disclaimer: These are opinions and not facts, so to speak :)
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Mich
>>>>>>>
>>>>>>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I have come across occasions when teams use Python with Spark
>>>>>>>> for ETL, for example processing data from S3 buckets into Snowflake
>>>>>>>> with Spark.
>>>>>>>>
>>>>>>>> The only reason I think they are choosing Python as opposed to
>>>>>>>> Scala is that they are more familiar with Python. That Spark is
>>>>>>>> written in Scala is itself an indication of why I think Scala has an
>>>>>>>> edge.
>>>>>>>>
>>>>>>>> I have not done a one-to-one comparison of Spark with Scala vs Spark
>>>>>>>> with Python. I understand that for data science purposes most
>>>>>>>> libraries like TensorFlow etc.
>>>>>>>> are written in Python, but I am at a loss to understand the
>>>>>>>> validity of using Python with Spark for ETL purposes.
>>>>>>>>
>>>>>>>> This is my understanding, not fact, so I would like to get some
>>>>>>>> informed views on this if I can.
>>>>>>>>
>>>>>>>> Many thanks,
>>>>>>>>
>>>>>>>> Mich
>>>>>>>>
>>>>>>>> LinkedIn
>>>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>> --
>> Best Regards,
>> Ayan Guha
>>