The use of Python ParamSpec in PySpark

2025-04-08 Thread Rafał Wojdyła
Hi, I wanted to highlight the usefulness of the now-closed (unmerged) PR `[SPARK-49008][PYTHON] Use ParamSpec to propagate func signature in transform` - https://github.com/apache/spark/pull/47493. This change would add type-checking for the DataFrame `transform` method in PySpark using Python
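For context, a minimal sketch of the ParamSpec technique on a toy DataFrame class (not the actual PR code; requires Python 3.10+). The wrapper's extra arguments are validated by mypy/pyright against the signature of the function being applied:

    from __future__ import annotations
    from typing import Callable, Concatenate, ParamSpec

    P = ParamSpec("P")

    class DataFrame:
        def transform(self, func: Callable[Concatenate[DataFrame, P], DataFrame],
                      *args: P.args, **kwargs: P.kwargs) -> DataFrame:
            # A type checker now matches *args/**kwargs against func's own signature.
            return func(self, *args, **kwargs)

    def add_column(df: DataFrame, name: str) -> DataFrame:
        return df

    DataFrame().transform(add_column, "x")     # OK
    # DataFrame().transform(add_column, 123)   # flagged by mypy/pyright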

Re: Drop Python 2 support from GraphFrames?

2025-02-01 Thread Russell Jurney
No, I’m pulling that code from Python/run-tests.sh and anywhere else I see it. On Sat, Feb 1, 2025 at 9:13 AM Ángel wrote: > Well, if Spark no longer supports Python 2, is there anything left to > discuss? > > On Sat, 1 Feb 2025 at 8:06, Mich Talebzadeh () > wrote: >

Re: Drop Python 2 support from GraphFrames?

2025-02-01 Thread Ángel
Well, if Spark no longer supports Python 2, is there anything left to discuss? On Sat, 1 Feb 2025 at 8:06, Mich Talebzadeh () wrote: > +1 long overdue > > Dr Mich Talebzadeh, > Architect | Data Science | Financial Crime | Forensic Analysis | GDPR > >view my Linkedi

Re: Drop Python 2 support from GraphFrames?

2025-01-31 Thread Mich Talebzadeh
settle it, then. I am now converting all > the nose tests to pytests which will allow GraphFrames to run Python 3.11. > > Russell > > On Fri, Jan 31, 2025 at 5:20 PM Holden Karau > wrote: > >> We no longer support Python 2 in Spark >> >> Twitter: https://t

Re: Drop Python 2 support from GraphFrames?

2025-01-31 Thread Russell Jurney
Oh, wonderful! That should about settle it, then. I am now converting all the nose tests to pytests which will allow GraphFrames to run Python 3.11. Russell On Fri, Jan 31, 2025 at 5:20 PM Holden Karau wrote: > We no longer support Python 2 in Spark > > Twitter: https://twitter.com/ho

Re: Drop Python 2 support from GraphFrames?

2025-01-31 Thread Jules Damji
On Fri, 31 Jan 2025 at 5:21 PM, Holden Karau wrote: We no longer support Python 2 in Spark > +1 > Excuse the thumb typos > > Twitter: https://twitter.com/holdenkarau > Fight Health Insurance: https://www.fighthealthinsurance.com/ > <https://www.fighthealthinsurance.com

Re: Drop Python 2 support from GraphFrames?

2025-01-31 Thread Holden Karau
We no longer support Python 2 in Spark Twitter: https://twitter.com/holdenkarau Fight Health Insurance: https://www.fighthealthinsurance.com/ <https://www.fighthealthinsurance.com/?q=hk_email> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 <https://amzn.t

Drop Python 2 support from GraphFrames?

2025-01-31 Thread Russell Jurney
So... including the Spark user list for a broader perspective on Python 2 PySpark users. I want to remove Python 2 support from GraphFrames so I don't have to think about it or work in Python 2... I wrote this up in the issue Drop support for Python 2 <https://github.com/graphframes/gra

AWS Glue and Python

2024-06-26 Thread Perez
Hi Team, I am facing an issue here: https://stackoverflow.com/questions/78673228/unable-to-read-text-file-in-glue-job TIA

Re: Python for the kids and now PySpark

2024-04-28 Thread Meena Rajani
> Thanks for sharing. > > On Sat, 27 Apr 2024, 22:26 Mich Talebzadeh, > wrote: > >> Python for the kids. Slightly off-topic but worthwhile sharing. >> >> One of the things that may benefit kids is starting to learn something >> new. Basically anything that can f

Re: Python for the kids and now PySpark

2024-04-27 Thread Farshid Ashouri
Mich, this is absolutely amazing. Thanks for sharing. On Sat, 27 Apr 2024, 22:26 Mich Talebzadeh, wrote: > Python for the kids. Slightly off-topic but worthwhile sharing. > > One of the things that may benefit kids is starting to learn something > new. Basically anything that can

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Varun Shah
, Varun Shah On Fri, Mar 15, 2024, 03:10 Mich Talebzadeh wrote: > Hi, > > When you create a DataFrame from Python objects using > spark.createDataFrame, here it goes: > > > *Initial Local Creation:* > The DataFrame is initially created in the memory of the driver nod

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Mon, Mar 18, 2024 at 1:16 PM Mich Talebzadeh wrote: > > "I may need something like that for synthetic data for testing. Any way to > do that ?" > > Have a look at this. > > https://github.com/joke2k/faker > No I was not actually referring to data that can be faked. I want data to actually res

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Mich Talebzadeh
Yes, transformations are indeed executed on the worker nodes, but they are only performed when necessary, usually when an action is called. This lazy evaluation helps in optimizing the execution of Spark jobs by allowing Spark to optimize the execution plan and perform optimizations such as pipelin

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh wrote: > > No Data Transfer During Creation: --> Data transfer occurs only when an > action is triggered. > Distributed Processing: --> DataFrames are distributed for parallel > execution, not stored entirely on the driver node. > Lazy Evaluation Op

Python library that generates fake data using Faker

2024-03-16 Thread Mich Talebzadeh
I came across this a few weeks ago. In a nutshell, you can use it for generating test data and other scenarios where you need realistic-looking but not necessarily real data. With so many regulations and copyrights etc. it is a viable alternative. I used it to generate 1000 lines of mixed true and f
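For illustration, a minimal sketch of that kind of usage (assumes `pip install faker` and an active SparkSession named `spark`; the columns are my own choice, not from the original mail):

    from faker import Faker

    fake = Faker()
    # Build 1000 rows of realistic-looking but synthetic records on the driver.
    rows = [(fake.name(), fake.city(), fake.date_of_birth().isoformat())
            for _ in range(1000)]
    df = spark.createDataFrame(rows, ["name", "city", "dob"])
    df.show(5, truncate=False)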

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Mich Talebzadeh
Hi, When you create a DataFrame from Python objects using spark.createDataFrame, here it goes: *Initial Local Creation:* The DataFrame is initially created in the memory of the driver node. The data is not yet distributed to executors at this point. *The role of lazy Evaluation:* Spark
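A small sketch of that lifecycle (assumes an active SparkSession named `spark`):

    data = [("alice", 1), ("bob", 2)]                  # plain Python objects in driver memory
    df = spark.createDataFrame(data, ["name", "id"])   # plan created; no cluster work yet
    filtered = df.filter(df.id > 1)                    # transformation: lazy, plan only
    filtered.show()                                    # action: triggers distribution and execution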

pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Sreyan Chakravarty
I am trying to understand Spark Architecture. For Dataframes that are created from python objects, i.e. that are *created in memory, where are they stored?* Take the following example: from pyspark.sql import Row; import datetime; courses = [ { 'course_id': 1, '

Re: [FYI] SPARK-45981: Improve Python language test coverage

2023-12-02 Thread Hyukjin Kwon
Awesome! On Sat, Dec 2, 2023 at 2:33 PM Dongjoon Hyun wrote: > Hi, All. > > As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community > starts to have test coverage for all supported Python versions from Today. > > - https://github.com/apache/spark/actio

[FYI] SPARK-45981: Improve Python language test coverage

2023-12-01 Thread Dongjoon Hyun
Hi, All. As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community starts to have test coverage for all supported Python versions from today. - https://github.com/apache/spark/actions/runs/7061665420 Here is a summary. 1. Main CI: All PRs and commits on `master` branch are

Re: pyspark.ml.recommendation is using the wrong python version

2023-09-04 Thread Mich Talebzadeh
Hi, Have you set the python environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON correctly? You can print the environment variables within your PySpark script to verify this: import os; print("PYTHONPATH:", os.environ.get("PYTHONPATH")); print("PYSPARK
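A hedged sketch of pinning the interpreter before the session starts (paths are examples, not from the original thread):

    import os
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.7"         # interpreter for executors
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.7"  # interpreter for the driver

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("check-python").getOrCreate()
    print(spark.sparkContext.pythonVer)  # e.g. "3.7" if the pinning took effect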

Re: pyspark.ml.recommendation is using the wrong python version

2023-09-04 Thread Harry Jamison
That did not paste well, let me try again. I am using python3.7 and spark 2.4.7. I am trying to figure out why my job is using the wrong python version. This is how it is starting up; the logs confirm that I am using python 3.7. But I later see the error message showing it is trying to use 3.8, and I

pyspark.ml.recommendation is using the wrong python version

2023-09-04 Thread Harry Jamison
I am using python3.7 and spark 2.4.7. I am trying to figure out why my job is using the wrong python version. This is how it is starting up; the logs confirm that I am using python 3.7. But I later see the error message showing it is trying to use 3.8, and I am not sure where it is picking that up

Managing python modules in docker for PySpark?

2023-08-16 Thread Mich Talebzadeh
Hi, This is a bit of an old hat but worth getting opinions on it. Current options that I believe apply are: 1. Installing them individually via pip in the docker build process 2. Installing them together via pip in the build process via requirements.txt 3. Installing them to a volume

Re: [PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread lnxpgn
che/bigdata/appcache/application_1691548913900_0002/container_1691548913900_0002_01_01/pyspark.zip/pyspark/context.py:350: RuntimeWarning: Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path: If I use HDFS file: spark

Re: [PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread Mich Talebzadeh
:350: > > RuntimeWarning: Failed to add file [file:///tmp/app-submodules.zip] > specified in 'spark.submit.pyFiles' to Python path: > > If I use HDFS file: > > spark-submit --master yarn --deploy-mode cluster --py-files > hdfs://hadoop-namenode:9000/tmp/app-submodul

[PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread lnxpgn
-local-dir/usercache/bigdata/appcache/application_1691548913900_0002/container_1691548913900_0002_01_01/pyspark.zip/pyspark/context.py:350: RuntimeWarning: Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path: If I use HDFS file: sp

Custom Session Windowing in Spark using Scala/Python

2023-08-03 Thread Ravi Teja
Hi, I am new to Spark and looking for help regarding the session windowing in Spark. I want to create session windows on a user activity stream with a gap duration of `x` minutes and also have a
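Worth noting: Spark 3.2+ ships a built-in session_window; a minimal batch sketch (toy data and column names are mine, not from the original mail) is below. Fully custom gap logic would likely need arbitrary stateful processing instead.

    from pyspark.sql import functions as F

    # Assumed toy input: (user_id, event_time); requires an active SparkSession `spark`.
    events = spark.createDataFrame(
        [("u1", "2023-08-03 10:00:00"), ("u1", "2023-08-03 10:05:00"),
         ("u1", "2023-08-03 11:00:00")],
        ["user_id", "event_time"],
    ).withColumn("event_time", F.to_timestamp("event_time"))

    # Rows closer together than the 15-minute gap fall into the same session window.
    sessions = (events
                .groupBy("user_id", F.session_window("event_time", "15 minutes"))
                .agg(F.count("*").alias("events_in_session")))
    sessions.show(truncate=False)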

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-14 Thread Mich Talebzadeh
OK, I managed to load the Python zipped file and the runner .py file onto s3 for AWS EKS to work. It is a bit of a nightmare compared to the same on the Google SDK, which is simpler. Anyhow, you will require additional jar files to be added to $SPARK_HOME/jars. These two files will be picked up after you build

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Mich Talebzadeh
bmit to eks cluster, I use the standard code to submit to >> the cluster as below: >> >> spark-submit --verbose \ >>--master k8s://$KUBERNETES_MASTER_IP:443 \ >>--deploy-mode cluster \ >>--name sparkOnEks \ >> --py-files local://$CO

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Bjørn Jørgensen
>--master k8s://$KUBERNETES_MASTER_IP:443 \ >--deploy-mode cluster \ >--name sparkOnEks \ >--py-files local://$CODE_DIRECTORY/spark_on_eks.zip \ > local:///home/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py > > In Google Kubernetes Engine (GKE

Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Mich Talebzadeh
/hduser/dba/bin/python/spark_on_eks/src/RandomDataBigQuery.py In Google Kubernetes Engine (GKE) I simply load them from a gs:// storage bucket and it works fine. I am getting the following error in the driver pod + CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.dri

Fwd: Auto-reply: Re: [DISCUSS] Show Python code examples first in Spark documentation

2023-02-26 Thread Mich Talebzadeh
ary damages arising from such loss, damage or destruction. -- Forwarded message - From: xyqiao Date: Sun, 26 Feb 2023 at 22:42 Subject: Auto-reply: Re: [DISCUSS] Show Python code examples first in Spark documentation To: Mich Talebzadeh This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and unable to reply to your email in person. I will reply as soon as possible after the vacation ends.

How to update TaskMetrics from Python?

2022-06-16 Thread Shay Elbaz
Hi All, I have some data output source which can only be written to by a specific Python API. For that I am (ab)using foreachPartition(writing_func) from PySpark, which works pretty well. I wonder if it's possible to somehow update the task metrics - specifically setBytesWritten - at the end
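As far as I know TaskMetrics is not exposed for writing from Python; a hedged workaround sketch using the pattern described plus an accumulator (all names illustrative; assumes an active `spark` and a DataFrame `df`):

    bytes_written = spark.sparkContext.accumulator(0)

    def writing_func(rows):
        written = 0
        for row in rows:
            payload = str(row).encode("utf-8")  # stand-in for the real output API
            written += len(payload)
        bytes_written.add(written)              # visible on the driver after the action

    df.rdd.foreachPartition(writing_func)
    print("bytes written:", bytes_written.value)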

Re: How do I read parquet with python object

2022-05-09 Thread Sean Owen
That's a parquet library error. It might be this: https://issues.apache.org/jira/browse/PARQUET-1633 That's fixed in recent versions of Parquet. You didn't say what versions of libraries you are using, but try the latest Spark. On Mon, May 9, 2022 at 8:49 AM wrote: > #

How do I read parquet with python object

2022-05-09 Thread ben
# python:
import pandas as pd
a = pd.DataFrame([[1, [2.3, 1.2]]], columns=['a', 'b'])
a.to_parquet('a.parquet')

# pyspark:
d2 = spark.read.parquet('a.parquet')

will return error: An error was encountered: An error occurred while calling o277.showString

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-20 Thread Xavier Gervilla
Thank you for the flatten function; it has more functionality than I need for my project, but the examples (which were really, really useful) helped me find a solution. Instead of accessing the confidence and entity attributes (metadata.confidence and metadata.entity) I was accessing

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-20 Thread Bjørn Jørgensen
Glad to hear that it works :) Your dataframe is nested with map, array and struct types. I'm using this function to flatten a nested dataframe to rows and columns.

    from pyspark.sql.types import *
    from pyspark.sql.functions import *

    def flatten_test(df, sep="_"):
        """Returns a flattened dataf
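The preview cuts the function off; a compact sketch of the same idea (my condensed version, not the original function from this mail):

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, MapType, StructType

    def flatten(df, sep="_"):
        """Repeatedly unpack structs and explode arrays/maps until all columns are flat."""
        while True:
            complex_fields = [(f.name, f.dataType) for f in df.schema.fields
                              if isinstance(f.dataType, (StructType, ArrayType, MapType))]
            if not complex_fields:
                return df
            name, dtype = complex_fields[0]
            if isinstance(dtype, StructType):
                # Promote each struct child to a top-level column.
                expanded = [F.col(name + "." + c.name).alias(name + sep + c.name)
                            for c in dtype.fields]
                df = df.select("*", *expanded).drop(name)
            elif isinstance(dtype, ArrayType):
                # One output row per array element (nulls kept).
                df = df.withColumn(name, F.explode_outer(name))
            else:
                # MapType: explode into key/value columns.
                df = df.select("*", F.explode_outer(name).alias(
                    name + sep + "key", name + sep + "value")).drop(name)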

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Bjørn Jørgensen
https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet *change spark = sparknlp.start()* to spark = sparknlp.start(spark32=True) On Tue, 19 Apr 2022 at 21:10, Bjørn Jørgensen wrote: > Yes, there are some that have that issue. > > Please open a new issue at > https://github.com/JohnSnowLa

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Bjørn Jørgensen
Yes, there are some that have that issue. Please open a new issue at https://github.com/JohnSnowLabs/spark-nlp/issues and they will help you. On Tue, 19 Apr 2022 at 20:33, Xavier Gervilla < xavier.gervi...@datapta.com> wrote: > Thank you for your advice, I had little knowledge of Spark NLP and

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-19 Thread Jungtaek Lim
I have no context on ML, but your "streaming" query exposes the possibility of memory issues.

    flattenedNER.registerTempTable("df")
    querySelect = "SELECT col as entity, avg(sentiment) as sentiment, count(col) as count FROM df GROUP BY col"
    finalDF = spark.sql(querySele

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-18 Thread Bjørn Jørgensen
When did SpaCy get support for Spark? Try Spark NLP, it's made for Spark. They have a lot of notebooks at https://github.com/JohnSnowLabs/spark-nlp and they publish user guides at https://towardsdatascience.com/introduction-to-spark-nlp-foundations-and-basic-component

Re: [Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-18 Thread Sean Owen
It looks good; are you sure it even starts? The problem I see is that you send a copy of the model from the driver for every task. Try broadcasting the model instead. I'm not sure if that resolves it, but it would be a good practice. On Mon, Apr 18, 2022 at 9:10 AM Xavier Gervilla wrote: > Hi Team,
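The suggestion in sketch form (names are illustrative; the fitted `model`, its predict API, and a DataFrame `df` with a text column are assumptions):

    bc_model = spark.sparkContext.broadcast(model)   # serialized once from the driver

    def predict_partition(rows):
        m = bc_model.value             # fetched once per executor, not shipped in every task closure
        for row in rows:
            yield m.predict(row.text)  # assumed model API

    predictions = df.rdd.mapPartitions(predict_partition)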

[Spark Streaming] [Debug] Memory error when using NER model in Python

2022-04-18 Thread Xavier Gervilla
Hi Team, https://stackoverflow.com/questions/71841814/is-there-a-way-to-prevent-excessive-ram-consumption-with-the-spark-configuration I'm developing a project that retrieves tweets on a 'host' app, streams them to Spark and, with different operations on DataFrames, obtains the Sentiment of th

python API in Spark-streaming-kafka spark 3.2.1

2022-03-07 Thread Wiśniewski Michał
Hi, I've read in the documentation, that since spark 3.2.1 python API for spark-streaming-kafka is back in the game. https://spark.apache.org/docs/3.2.1/streaming-programming-guide.html#advanced-sources But in the Kafka Integration Guide there is no documentation for the python API.

Spark 3.1.3 docker pre-built with Python Data science packages

2022-02-23 Thread Mich Talebzadeh
Some people asked me whether it was possible to create a docker file (spark 3.1.3) with Python packages geared towards DS etc., having the following pre-built packages pyyaml TensorFlow Theano Pandas Keras NumPy SciPy Scrapy SciKit-Learn XGBoost Matplotlib Seaborn Bokeh Plotly pydot Statsmodels

Re: Python performance

2022-02-06 Thread Hinko Kocevar
Thanks for your input guys! //hinko On 4 Feb 2022, at 14:58, Sean Owen wrote:  Yes, in the sense that any transformation that can be expressed in the SQL-like DataFrame API will push down to the JVM, and take advantage of other optimizations, avoiding the data movement to/from Python and

Re: Python performance

2022-02-04 Thread Sean Owen
Yes, in the sense that any transformation that can be expressed in the SQL-like DataFrame API will push down to the JVM, and take advantage of other optimizations, avoiding the data movement to/from Python and more. But you can't do this if you're expressing operations that are

Re: Python performance

2022-02-04 Thread Bitfox
Please see this test of mine: https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/ Don’t use Python RDDs; use DataFrames instead. Regards On Fri, Feb 4, 2022 at 5:02 PM Hinko Kocevar wrote: > I'm looking into using the Python interface with Spark and came across t
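To make the contrast concrete, a hedged word-count sketch in both styles (file path is a placeholder; assumes an active `spark`):

    from pyspark.sql import functions as F

    lines = spark.read.text("big.txt")

    # DataFrame style: executes in the JVM and goes through the Catalyst optimizer.
    counts_df = (lines
                 .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
                 .groupBy("word").count())

    # RDD style: every record crosses the JVM <-> Python boundary.
    counts_rdd = (lines.rdd
                  .flatMap(lambda row: row.value.split())
                  .map(lambda w: (w, 1))
                  .reduceByKey(lambda a, b: a + b))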

Python performance

2022-02-04 Thread Hinko Kocevar
I'm looking into using the Python interface with Spark and came across this [1] chart showing some performance hit when going with Python RDDs. The data is ~7 years old and for an older version of Spark. Is this still the case with more recent Spark releases? I'm trying to understand what to e

triggering spark python app using native REST api

2022-01-24 Thread Michael Williams (SSI)
Hello, for a couple of weeks I've been trying, without success, to work out how to replicate spark-submit CLI execution of a python app using the native spark REST api (http://localhost:6066/v1/submissions/create). The environment is docker using the latest docker for spar
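For reference, a hedged sketch of the kind of JSON this undocumented endpoint expects (field names are drawn from community examples, not official docs, and may differ across Spark versions; paths, version, and master URL are placeholders):

    import json, urllib.request

    payload = {
        "action": "CreateSubmissionRequest",
        "appResource": "file:/opt/app/main.py",
        "appArgs": ["/opt/app/main.py"],
        "clientSparkVersion": "3.0.0",
        "mainClass": "org.apache.spark.deploy.SparkSubmit",
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "sparkProperties": {
            "spark.app.name": "rest-submitted-app",
            "spark.master": "spark://spark-master:7077",
            "spark.submit.deployMode": "cluster",
        },
    }
    req = urllib.request.Request(
        "http://localhost:6066/v1/submissions/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    print(urllib.request.urlopen(req).read().decode())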

Re: Conda Python Env in K8S

2021-12-24 Thread Hyukjin Kwon
Can you share the logs, settings, environment, etc. and file a JIRA? There are integration test cases for K8S support, and I myself also tested it before. It would be helpful if you try what I did at https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html and see

Re: Conda Python Env in K8S

2021-12-06 Thread Mich Talebzadeh
SPARK-33530 > > https://issues.apache.org/jira/browse/SPARK-33615 > > > > Best, > > Meikel > > > > > > *From:* Mich Talebzadeh > *Sent:* Saturday, 4 December 2021 18:36 > *To:* Bode, Meikel, NMA-CFD > *Cc:* dev ; user@spark.apache.org > *Subject:* R

RE: Conda Python Env in K8S

2021-12-06 Thread Bode, Meikel, NMA-CFD
SPARK-33615 Best, Meikel From: Mich Talebzadeh Sent: Saturday, 4 December 2021 18:36 To: Bode, Meikel, NMA-CFD Cc: dev ; user@spark.apache.org Subject: Re: Conda Python Env in K8S Hi Meikel In the past I tried with --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/

Re: Conda Python Env in K8S

2021-12-04 Thread Mich Talebzadeh
image size. It is also advisable to install all packages in one line. Every time you put a RUN statement it creates an intermediate container and hence increases the build time. So reduce it by putting all packages in one line. Log in to the docker image and check for the Python packages installed

Re: Conda Python Env in K8S

2021-12-04 Thread Gourav Sengupta
ist and I want to understand what the issue > is… > > Any hints on that? > > > > Best, > > Meikel > > > > *From:* Mich Talebzadeh > *Sent:* Friday, 3 December 2021 13:27 > *To:* Bode, Meikel, NMA-CFD > *Cc:* dev ; user@spark.apache.org > *Subject:* Re

RE: Conda Python Env in K8S

2021-12-03 Thread Bode, Meikel, NMA-CFD
these options exist and I want to understand what the issue is... Any hints on that? Best, Meikel From: Mich Talebzadeh Sent: Friday, 3 December 2021 13:27 To: Bode, Meikel, NMA-CFD Cc: dev ; user@spark.apache.org Subject: Re: Conda Python Env in K8S Build python packages into the docker

Re: Conda Python Env in K8S

2021-12-03 Thread Mich Talebzadeh
Build python packages into the docker image itself first with pip install: RUN pip install pandas ... --no-cache HTH On Fri, 3 Dec 2021 at 11:58, Bode, Meikel, NMA-CFD < meikel.b...@bertelsmann.de> wrote: > Hello, > > > > I am trying to run spark jobs using the Spark Kubernetes O

Conda Python Env in K8S

2021-12-03 Thread Bode, Meikel, NMA-CFD
Hello, I am trying to run spark jobs using the Spark Kubernetes Operator. But when I try to bundle a conda python environment using the following resource description, the python interpreter is only unpacked on the driver and not on the executors. apiVersion: "sparkoperator.k8s.io/v1beta2"

Re: Well balanced Python code with Pandas compared to PySpark

2021-07-29 Thread Mich Talebzadeh
ng, > use Pandas.. > > -- ND > > On 7/29/21 9:02 AM, ashok34...@yahoo.com.INVALID wrote: > > Hello team > > Someone asked me about well-developed Python code with Pandas dataframes > and comparing that to PySpark. > > Under what situations would one choose PySpark instead of Python and Pandas? > > Appreciate > > AK

Re: Well balanced Python code with Pandas compared to PySpark

2021-07-29 Thread Artemis User
use Pandas.. -- ND On 7/29/21 9:02 AM, ashok34...@yahoo.com.INVALID wrote: Hello team Someone asked me about well-developed Python code with Pandas dataframes and comparing that to PySpark. Under what situations would one choose PySpark instead of Python and Pandas? Appreciate AK

Well balanced Python code with Pandas compared to PySpark

2021-07-29 Thread ashok34...@yahoo.com.INVALID
Hello team, Someone asked me about well-developed Python code with Pandas dataframes and comparing that to PySpark. Under what situations would one choose PySpark instead of Python and Pandas? Appreciate AK

Re: Python level of knowledge for Spark and PySpark

2021-04-14 Thread Mich Talebzadeh
Hi, I believe both Ayan and Jayesh have made valuable points. From my experience the companies that traditionally have a Python team in house would like to add Spark capabilities to their inventory not least because 1. There is more Python code running around that will benefit from modern

Re: Python level of knowledge for Spark and PySpark

2021-04-14 Thread Lalwani, Jayesh
are paying you to learn more XYZ. If you want to know whether you know Python well enough to do PySpark, look at what companies are asking for. Go for interviews. Just speaking from experience, most jobs that call for Python + Spark tend to be data science jobs. These jobs also require you to have a data

Re: Python level of knowledge for Spark and PySpark

2021-04-14 Thread ayan guha
LID wrote: > Hi gurus, > > I have knowledge of Java, Scala and good enough knowledge of Spark, Spark > SQL and Spark Functional programing with Scala. > > I have started using Python with Spark PySpark. > > Wondering, in order to be proficient in PySpark, how much good knowledg

Python level of knowledge for Spark and PySpark

2021-04-14 Thread ashok34...@yahoo.com.INVALID
Hi gurus, I have knowledge of Java and Scala, and good enough knowledge of Spark, Spark SQL and Spark functional programming with Scala. I have started using Python with Spark (PySpark). Wondering, in order to be proficient in PySpark, how much good knowledge of Python programming is needed? I know the

Re: Issue while installing dependencies Python Spark

2020-12-18 Thread Patrick McCarthy
ubleshooting-importerror.html > > Please note and check the following: > > * The Python version is: Python3.7 from "/usr/bin/python3" > * The NumPy version is: "1.19.4" > > and make sure that they are the versions you expect. > Please carefully study the d

Re: Issue while installing dependencies Python Spark

2020-12-18 Thread Sachit Murarka
troubleshooting tips at: https://numpy.org/devdocs/user/troubleshooting-importerror.html Please note and check the following: * The Python version is: Python3.7 from "/usr/bin/python3" * The NumPy version is: "1.19.4" and make sure that they are the versions you expe

Re: Issue while installing dependencies Python Spark

2020-12-17 Thread Artemis User
PYTHONPATH for Python apps. In other words, you can't run spark-submit in a virtual environment like a regular python program, since it is NOT a regular python script.  But you can package your python spark project (including all dependency libs) as a zip or egg file and make it available to
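One hedged way to do the zip approach from inside the app itself (path and module name are illustrative, not from this thread):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("deps-demo").getOrCreate()
    spark.sparkContext.addPyFile("/path/to/deps.zip")  # shipped to every executor

    import mymodule  # assumed to live at the top level of deps.zip; import after addPyFile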

Re: Issue while installing dependencies Python Spark

2020-12-17 Thread Patrick McCarthy
I'm not very familiar with the environments on cloud clusters, but in general I'd be reluctant to lean on setuptools or other python install mechanisms. In the worst case, you might encounter /usr/bin/pip not having permissions to install new packages, or even if you do a package mig

Issue while installing dependencies Python Spark

2020-12-17 Thread Sachit Murarka
Hi Users, I have a wheel file; while creating it I mentioned the dependencies in the setup.py file. Now I have 2 virtual envs: one was already there, and another one I created just now. I have switched to the new virtual env, and I want spark to download the dependencies while doing spark-submit using the wheel. Co

Pyspark application hangs (no error messages) on Python RDD .map

2020-11-10 Thread Daniel Stojanov
Hi, This code will hang indefinitely at the last line (the .map()). Interestingly, if I run the same code at the beginning of my application (removing the .write step) it executes as expected. Otherwise, the code appears further along in my application which is where it hangs. The debugging messag

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Sofia’s World
which are 99% python... the cost of retraining - or even hiring - is too big, especially if you have an existing project and aggressive deadlines. Please feel free to object. Kind Regards On Fri, Oct 23, 2020, 1:01 PM William R wrote: > It's really a very big discussion around Pyspark Vs Scala.

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Mich Talebzadeh
Python increases the probability for more issues and bugs because translation between these two different languages is difficult. 3. Using Scala for Spark provides access to the latest features of the Spark framework as they are first available in Scala and then ported to Python. 4

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread William R
>> >> On Thu, 22 Oct 2020 at 20:34, Sean Owen wrote: >> >>> I don't find this trolling; I agree with the observation that 'the >>> skills you have' are a valid and important determiner of what tools you >>> pick. >>> I disagree that you

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Wim Van Leuven
I don't find this trolling; I agree with the observation that 'the skills >> you have' are a valid and important determiner of what tools you pick. >> I disagree that you just have to pick the optimal tool for everything. >> Sounds good until that comes in contact with the

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Mich Talebzadeh
il that comes in contact with the real world. > For Spark, Python vs Scala just doesn't matter a lot, especially if you're > doing DataFrame operations. By design. So I can't see there being one > answer to this. > > On Thu, Oct 22, 2020 at 2:23 PM Gourav Sengupta

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Sean Owen
I don't find this trolling; I agree with the observation that 'the skills you have' are a valid and important determiner of what tools you pick. I disagree that you just have to pick the optimal tool for everything. Sounds good until that comes in contact with the real world. For S

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Gourav Sengupta
Hi Mich, this is turning into a troll now, can you please stop this? No one uses Scala where Python should be used, and no one uses Python where Scala should be used - it all depends on requirements. Everyone understands polyglot programming and how to use relevant technologies best to their

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Mich Talebzadeh
the rest are Python savvys. It shows again that at times functionality is sacrificed in favour of the availability of resources and reaffirms what some members were saying regarding the choice of the technology based on TCO, favouring Python over Spark. HTH, Mich *Disclaimer:* Use it at your

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Magnus Nilsson
Holy war is a bit dramatic don't you think? 🙂 The difference between Scala and Python will always be very relevant when choosing between Spark and Pyspark. I wouldn't call it irrelevant to the original question. br, molotch On Sat, 17 Oct 2020 at 16:57, "Yuri Oleynikov (‫יו

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Magnus Nilsson
I'm sorry you were offended. I'm not an expert in Python and I wasn't trying to attack you personally. It's just an opinion about what makes a language better or worse; it's not the single source of truth. You don't have to take offense. In the end it's about con

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Holden Karau
Scala and Python have their advantages and disadvantages with Spark. In my experience, when performance is super important you’ll end up needing to do some of your work in the JVM, but in many situations what matters is what your team and company are familiar with and the ecosystem of tooling

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
It seems that the thread converted to a holy war that has nothing to do with the original question. If so, it’s super disappointing. Sent from my iPhone > On 17 Oct 2020, at 15:53, Molotch wrote: > > I would say the pros and cons of Python vs Scala is both down to Spark, the >

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Sasha Kacanski
And you are an expert on python! Idiomatic... Please do everyone a favor and stop commenting on things you have no idea... I build ETL systems python that wiped java commercial stacks left and right. Pyspark was and is and will be a second class citizen in spark world. That has nothing to do with

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Molotch
I would say the pros and cons of Python vs Scala is both down to Spark, the languages in themselves and what kind of data engineer you will get when you try to hire for the different solutions. With Pyspark you get less functionality and increased complexity with the py4j java interop compared

Re: Scala vs Python for ETL with Spark

2020-10-15 Thread Mich Talebzadeh
Hi, I spent a few days converting one of my Spark/Scala scripts to Python. It was interesting but at times looked like trench warfare. There is a lot of handy stuff in Scala, like case classes for defining column headers etc., that doesn't seem to be available in Python (possibly my lack of in-

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
se case is the best fit but at higher TCO (labour cost), then you may opt to use Python or another because you have those resources available in-house at lower cost and your Data Scientists are eager to invest in Python. Companies these days are very careful where to spend their technology dolla

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Gourav Sengupta
ory > cache away, then one can argue anything can do the "ETL" job. just write > some Java/Scala/SQL/Perl/python to read data and write to from one DB to > another often using JDBC connections. However, we all concur that may not > be good enough with Big Data volumes. Generall

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
if we take Spark and its massive parallel processing and in-memory cache away, then one can argue anything can do the "ETL" job. just write some Java/Scala/SQL/Perl/python to read data and write to from one DB to another often using JDBC connections. However, we all concur that may n

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread ayan guha
But when you have a fairly large volume of data, that is where spark comes into the party. And I assume the requirement of using spark is already established in the original question, and the discussion is whether to use python vs scala/java. On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski wrote: > If org

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
Thanks Ayan. I am not qualified to answer your first point. However, my experience with Spark with Scala or Spark with Python agrees with your assertion that use cases do not come into it. Most DEV/OPS work dealing with ETL is provided by service companies that have a workforce very familiar with

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread ayan guha
I have one observation: is "python udf is slow due to deserialization penalty" still relevant? Even after Arrow is used for in-memory data management, and with such heavy investment from the spark dev community in making pandas a first-class citizen, including UDFs. As I work with multiple clients, my
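For the record, the Arrow-backed pandas UDF style being referenced (Spark 3.0+ type-hint form; assumes PyArrow installed and an active SparkSession `spark`):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def plus_one(v: pd.Series) -> pd.Series:
        # Operates on whole Arrow batches instead of pickling row by row.
        return v + 1.0

    spark.range(10).withColumn("x", plus_one("id")).show()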

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
agree with the note that as long as Spark stayed niche and elite, >> then someone with Scala knowledge was attracting premiums. In fairness in >> 2014-2015, there was not much talk of Data Science input (I may be wrong). >> But the world has moved on so to speak. Python itself has be

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Stephen Boesch
remiums. In fairness in > 2014-2015, there was not much talk of Data Science input (I may be wrong). > But the world has moved on so to speak. Python itself has been around > a long time (long being relative here). Most people either knew UNIX Shell, > C, Python or Perl or a combination

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Mich Talebzadeh
not much talk of Data Science input (I may be wrong). But the world has moved on so to speak. Python itself has been around a long time (long being relative here). Most people either knew UNIX Shell, C, Python or Perl or a combination of all these. I recall we had a director a few years ago who asked

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Jacek Pliszka
cost/availability. For Python skills you pay less and you can pick people with other useful skills and also you can more easily train people you have internally. Often you have some simple ETL scripts before moving to spark and these scripts are usually written in Python. Best Regards, Jacek

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Jörn Franke
architecture aspects etc. > On 09.10.2020 at 22:57, Mich Talebzadeh wrote: > > > I have come across occasions when the teams use Python with Spark for ETL, > for example processing data from S3 buckets into Snowflake with Spark. > > The only reason I think they are choosing

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Wim Van Leuven
where people mostly do python. So, if you need those two worlds to collaborate and even hand over code, you don't want the ideological battle of Scala vs Python. We chose python for the sake of everybody speaking the same language. But it is true, if you do Spark DataFrames, because then PySpark is a

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
y force you to pass lambdas, hit serialization > between java and python types and yes hit the Global Interpreter Lock. But, > none of those things apply to Data Frames which will generate Java code > regardless of what language you use to describe the Dataframe operations as > long
