RE: [PySpark] [Beginner] [Debug] Does Spark ReadStream support reading from a MinIO bucket?

2025-08-05 Thread Bhatt, Kashyap
>> option("path", "s3://bucketname") Shouldn’t the schema prefix be s3a instead of s3? Information Classification: General From: 刘唯 Sent: Tuesday, August 5, 2025 5:34 PM To: Kleckner, Jade Cc: user@spark.apache.org Subject: Re: [PySpark] [Beginner] [Debug] Doe

Re: [PySpark] [Beginner] [Debug] Does Spark ReadStream support reading from a MinIO bucket?

2025-08-05 Thread 刘唯
This is not necessarily about the readStream / read API. As long as you correctly imported the needed dependencies and set up spark config, you should be able to readStream from s3 path. See https://stackoverflow.com/questions/46740670/no-filesystem-for-scheme-s3-with-pyspark Kleckner, Jade 于202
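For reference, a minimal PySpark sketch of the kind of setup that reply alludes to; the MinIO endpoint, credentials, bucket name, and schema below are placeholders, and the hadoop-aws / AWS SDK jars are assumed to be on the classpath:

```
# Sketch only: endpoint, keys, bucket and schema are assumptions, not values from the thread.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("minio-readstream")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio-host:9000")   # hypothetical MinIO endpoint
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")            # MinIO typically needs path-style access
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

# Note the s3a:// scheme rather than s3://
stream_df = (
    spark.readStream
    .format("json")
    .schema("id INT, value STRING")   # streaming file sources require an explicit schema
    .load("s3a://bucketname/input/")
)
```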

Re: [PYSPARK] df.collect throws exception for MapType with ArrayType as key

2025-05-23 Thread Soumasish
This looks like a bug to me. https://github.com/apache/spark/blob/master/python/pyspark/serializers.py When cloudpickle.loads() tries to deserialize {["A", "B"]: "foo"}, a list used as a dict key will break. Tuple ("A", "B"): Python input -> ArrayData -> works fine; ArrayData -> List ["A", "B"] -> breaks

Re: pyspark dataframe join with two different data type

2024-05-16 Thread Karthick Nk
Hi All, I have tried to get the same result with PySpark and with a SQL query by creating a tempView; I was able to achieve it that way, but I have to do it in the PySpark code itself. Could you help with this? incoming_data = [["a"], ["b"], ["d"]] column_names = ["column1"] df = spark.createDataFrame(incoming_data

Re: pyspark dataframe join with two different data type

2024-05-15 Thread Karthick Nk
Thanks Mich, I have tried this solution, but I want all the columns from the dataframe df_1; if I explode df_1 I get only the data column. The result should contain all the columns from df_1 with distinct results, like below. Results in *df:* +---+ |column1| +---+ | a

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Mich Talebzadeh
You can use a combination of explode and distinct before joining. from pyspark.sql import SparkSession from pyspark.sql.functions import explode # Create a SparkSession spark = SparkSession.builder \ .appName("JoinExample") \ .getOrCreate() sc = spark.sparkContext # Set the log level to
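For reference, a minimal sketch of the explode + distinct approach described in that reply; df_1, df_2 and the column names below are illustrative, not the original poster's schema:

```
# Sketch only: dataframes and column names are assumptions based on the thread.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.appName("JoinExample").getOrCreate()

df_1 = spark.createDataFrame([(["a", "b"],), (["d"],)], ["data"])      # array column
df_2 = spark.createDataFrame([("a",), ("b",), ("d",)], ["column1"])    # string column

# Flatten the array side, de-duplicate, then join on the scalar value
exploded = df_1.select(explode(col("data")).alias("value")).distinct()
result = df_2.join(exploded, df_2.column1 == exploded.value, "inner")
result.show()
```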

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Karthick Nk
Hi All, Does anyone have an idea or suggestion of an alternate way to achieve this scenario? Thanks. On Sat, May 11, 2024 at 6:55 AM Damien Hawes wrote: > Right now, with the structure of your data, it isn't possible. > > The rows aren't duplicates of each other. "a" and "b" both exist in t

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Damien Hawes
Right now, with the structure of your data, it isn't possible. The rows aren't duplicates of each other. "a" and "b" both exist in the array. So Spark is correctly performing the join. It looks like you need to find another way to model this data to get what you want to achieve. Are the values of

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Karthick Nk
Hi Mich, Thanks for the solution, but I am getting duplicate results by using array_contains. I have explained the scenario below; could you help me with it and how we can achieve this? I have tried different ways but could not achieve it. For example data = [ ["a"], ["b"], ["d"], ] column_na

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Varun Shah
Hi @Mich Talebzadeh, community, where can I find such insights on the Spark architecture? I found a few sites below which cover the internals: 1. https://github.com/JerryLead/SparkInternals 2. https://books.japila.pl/apache-spark-internals/overview/ 3. https://stackoverflow.com/questions/3

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Mon, Mar 18, 2024 at 1:16 PM Mich Talebzadeh wrote: > > "I may need something like that for synthetic data for testing. Any way to > do that ?" > > Have a look at this. > > https://github.com/joke2k/faker > No I was not actually referring to data that can be faked. I want data to actually res

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Mich Talebzadeh
Yes, transformations are indeed executed on the worker nodes, but they are only performed when necessary, usually when an action is called. This lazy evaluation helps in optimizing the execution of Spark jobs by allowing Spark to optimize the execution plan and perform optimizations such as pipelin

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh wrote: > > No Data Transfer During Creation: --> Data transfer occurs only when an > action is triggered. > Distributed Processing: --> DataFrames are distributed for parallel > execution, not stored entirely on the driver node. > Lazy Evaluation Op

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Mich Talebzadeh
Hi, When you create a DataFrame from Python objects using spark.createDataFrame, here it goes: *Initial Local Creation:* The DataFrame is initially created in the memory of the driver node. The data is not yet distributed to executors at this point. *The role of lazy Evaluation:* Spark applies

Re: pyspark dataframe join with two different data type

2024-02-29 Thread Mich Talebzadeh
This is what you want, how to join two DFs with a string column in one and an array of strings in the other, keeping only rows where the string is present in the array. from pyspark.sql import SparkSession from pyspark.sql import Row from pyspark.sql.functions import expr spark = SparkSession.bui
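A minimal sketch of the join that reply describes, keeping all columns of the array-side DataFrame; the schemas below are illustrative:

```
# Sketch only: df_1/df_2 and their columns are assumptions, not the original data.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

df_1 = spark.createDataFrame([(1, ["a", "b"]), (2, ["d"])], ["id", "data"])   # array of strings
df_2 = spark.createDataFrame([("a",), ("b",), ("d",)], ["column1"])           # plain strings

# Keep only rows where the string appears in the array, retaining every df_1 column
joined = df_1.join(df_2, expr("array_contains(data, column1)"), "inner")
joined.show()
```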

Re: Pyspark UDF as a data source for streaming

2024-01-08 Thread Mich Talebzadeh
Hi, Have you come back with some ideas for implementing this? Specifically integrating Spark Structured Streaming with REST API? FYI, I did some work on it as it can have potential wider use cases, i.e. the seamless integration of Spark Structured Streaming with Flask REST API for real-time data i

Re: Pyspark UDF as a data source for streaming

2023-12-29 Thread Mich Talebzadeh
On Thu, Dec 28, 2023 at 4:53 PM Поротиков Станислав Вячеславович > wrote: > >> Yes, it's actual data. >> >> >> >> Best regards, >> >> Stanislav Porotikov >> >> >> >> *From:* Mich Talebzadeh >> *Sent:* Wednesday, Dec

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
; > *From:* Mich Talebzadeh > *Sent:* Thursday, December 28, 2023 5:14 PM > *To:* Hyukjin Kwon > *Cc:* Поротиков Станислав Вячеславович ; > user@spark.apache.org > *Subject:* Re: Pyspark UDF as a data source for streaming > > > > You can work around this issue by trying to

RE: Pyspark UDF as a data source for streaming

2023-12-28 Thread Поротиков Станислав Вячеславович
Ok. Thank you very much! Best regards, Stanislav Porotikov From: Mich Talebzadeh Sent: Thursday, December 28, 2023 5:14 PM To: Hyukjin Kwon Cc: Поротиков Станислав Вячеславович ; user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming You can work around this issue by

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
ков Станислав Вячеславович > wrote: > >> Yes, it's actual data. >> >> >> >> Best regards, >> >> Stanislav Porotikov >> >> >> >> *From:* Mich Talebzadeh >> *Sent:* Wednesday, December 27, 2023 9:43 PM >> *Cc:* us

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Hyukjin Kwon
av Porotikov > > > > *From:* Mich Talebzadeh > *Sent:* Wednesday, December 27, 2023 9:43 PM > *Cc:* user@spark.apache.org > *Subject:* Re: Pyspark UDF as a data source for streaming > > > > Is this generated data actual data or you are testing the application? >

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Yes, it's actual data. Best regards, Stanislav Porotikov From: Mich Talebzadeh Sent: Wednesday, December 27, 2023 9:43 PM Cc: user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming Is this generated data actual data or you are testing the application? Sounds like a

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
tikov From: Mich Talebzadeh Sent: Wednesday, December 27, 2023 6:17 PM To: Поротиков Станислав Вячеславович Cc: user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming Ok so you want to generate some random data and load it into Kafka on a regular interval and the

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
v Porotikov From: Mich Talebzadeh Sent: Wednesday, December 27, 2023 6:17 PM To: Поротиков Станислав Вячеславович Cc: user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming Ok so you want to generate some random data and load it into Kafka on a regular interval and the

Re: Pyspark UDF as a data source for streaming

2023-12-27 Thread Mich Talebzadeh
Ok so you want to generate some random data and load it into Kafka on a regular interval and the rest? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https:/

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-11 Thread Михаил Кулаков
Hey Enrico, it does help to understand it, thanks for explaining. Regarding this comment > PySpark and Scala should behave identically here Is it ok that Scala and PySpark optimization work differently in this case? On Tue, 5 Dec 2023 at 20:08, Enrico Minack wrote: > Hi Michail, > > with spark.conf

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-05 Thread Enrico Minack
Hi Michail, with spark.conf.set("spark.sql.planChangeLog.level", "WARN") you can see how Spark optimizes the query plan. In PySpark, the plan is optimized into Project ...   +- CollectMetrics 2, [count(1) AS count(1)#200L]   +- LocalTableScan , [col1#125, col2#126L, col3#127, col4#132L] The en
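As a rough illustration of the behaviour being discussed, a minimal sketch of an Observation attached to a DataFrame (PySpark 3.3+); the metric name and data are placeholders:

```
# Sketch only: metrics are only populated for rows that are actually consumed downstream.
from pyspark.sql import SparkSession, Observation
from pyspark.sql.functions import count, lit

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.planChangeLog.level", "WARN")   # log how the plan gets optimized

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
obs = Observation("my_metrics")
observed = df.observe(obs, count(lit(1)).alias("rows_seen"))

observed.collect()      # an action that consumes the rows
print(obs.get)          # e.g. {'rows_seen': 2}
```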

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-04 Thread Enrico Minack
Hi Michail, observations as well as ordinary accumulators only observe / process rows that are iterated / consumed by downstream stages. If the query plan decides to skip one side of the join, that one will be removed from the final plan completely. Then, the Observation will not retrieve any met

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-05 Thread Mich Talebzadeh
The fact that you have 60 partitions or brokers in Kafka is not directly correlated to Spark Structured Streaming (SSS) executors by itself. See below. Spark starts with 200 partitions. However, by default, Spark/PySpark creates partitions that are equal to the number of CPU cores in the node, th
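For reference, a minimal sketch of the knobs that reply refers to; the broker, topic, and partition counts are placeholders, and the spark-sql-kafka package is assumed to be available:

```
# Sketch only: values below are illustrative, not a recommendation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sss-tuning").getOrCreate()

# Shuffle parallelism (defaults to 200); often aligned with the Kafka partition
# count or the total executor cores available.
spark.conf.set("spark.sql.shuffle.partitions", "60")

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .repartition(60)   # explicit repartition before the sink, as in the question
)
```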

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-05 Thread Perez
You can try the 'optimize' command of delta lake. That will help you for sure. It merges small files. Also, it depends on the file format. If you are working with Parquet then still small files should not cause any issues. P. On Thu, Oct 5, 2023 at 10:55 AM Shao Yang Hong wrote: > Hi Raghavendr
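A minimal sketch of the Delta Lake OPTIMIZE compaction being suggested, assuming an active Delta-enabled SparkSession named spark and a recent delta-spark release; the table path is a placeholder:

```
# Sketch only: the path is hypothetical.
from delta.tables import DeltaTable

# SQL form
spark.sql("OPTIMIZE delta.`/mnt/data/events`")

# Equivalent programmatic form
DeltaTable.forPath(spark, "/mnt/data/events").optimize().executeCompaction()
```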

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Shao Yang Hong
Hi Raghavendra, Yes, we are trying to reduce the number of files in delta as well (the small file problem [0][1]). We already have a scheduled app to compact files, but the number of files is still large, at 14K files per day. [0]: https://docs.delta.io/latest/optimizations-oss.html#language-pyt

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Raghavendra Ganesh
Hi, What is the purpose for which you want to use repartition() .. to reduce the number of files in delta? Also note that there is an alternative option of using coalesce() instead of repartition(). -- Raghavendra On Thu, Oct 5, 2023 at 10:15 AM Shao Yang Hong wrote: > Hi all on user@spark: > >
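For reference, a minimal sketch of the repartition() vs coalesce() distinction mentioned here; the DataFrame and partition counts are illustrative:

```
# Sketch only: repartition(N) does a full shuffle and can increase or decrease
# partitions; coalesce(N) merges existing partitions without a shuffle when reducing.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)   # stand-in DataFrame

print(df.repartition(8).rdd.getNumPartitions())   # 8, via shuffle
print(df.coalesce(8).rdd.getNumPartitions())      # at most 8, no shuffle
```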

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
Oh, I saw it now. Thanks! On Wed, Sep 20, 2023 at 1:04 PM Sean Owen wrote: > [ External sender. Exercise caution. ] > > I think the announcement mentioned there were some issues with pypi and > the upload size this time. I am sure it's intended to be there when > possible. > > On Wed, Sep 20, 20

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Sean Owen
I think the announcement mentioned there were some issues with pypi and the upload size this time. I am sure it's intended to be there when possible. On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong wrote: > Hi, > > Are there any plans to upload PySpark 3.5.0 to PyPI ( > https://pypi.org/project/pyspar

Re: [PySpark][UDF][PickleException]

2023-08-10 Thread Bjørn Jørgensen
I pasted your text to ChatGPT and this is what I got back: Your problem arises due to how Apache Spark serializes Python objects to be used in Spark tasks. When a User-Defined Function (UDF) is defined, Spark uses Python's `pickle` library to serialize the Python function and any required objects s

Re: [PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread lnxpgn
Yes, ls -l /tmp/app-submodules.zip, hdfs dfs -ls /tmp/app-submodules.zip can show the file. On 2023/8/9 22:48, Mich Talebzadeh wrote: If you are running in the cluster mode, that zip file should exist in all the nodes! Is that the case? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead

Re: [PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread Mich Talebzadeh
If you are running in the cluster mode, that zip file should exist in all the nodes! Is that the case? HTH Mich Talebzadeh, Solutions Architect/Engineering Lead London United Kingdom view my Linkedin profile https://en.everybodyw

Re: PySpark error java.lang.IllegalArgumentException

2023-07-10 Thread elango vaidyanathan
Finally I was able to solve this issue by setting this conf. "spark.driver.extraJavaOptions=-Dorg.xerial.snappy.tempdir=/my_user/temp_ folder" Thanks all! On Sat, 8 Jul 2023 at 3:45 AM, Brian Huynh wrote: > Hi Khalid, > > Elango mentioned the file is working fine in our another environment wi

Re: PySpark error java.lang.IllegalArgumentException

2023-07-07 Thread Brian Huynh
Hi Khalid, Elango mentioned the file is working fine in our another environment with the same driver and executor memory. Brian. On Jul 7, 2023, at 10:18 AM, Khalid Mammadov wrote: Perhaps that parquet file is corrupted or got that is in that folder? To check, try to read that file with pandas or other

Re: PySpark error java.lang.IllegalArgumentException

2023-07-07 Thread Khalid Mammadov
Perhaps that parquet file is corrupted or got that is in that folder? To check, try to read that file with pandas or other tools to see if you can read without Spark. On Wed, 5 Jul 2023, 07:25 elango vaidyanathan, wrote: > > Hi team, > > Any updates on this below issue > > On Mon, 3 Jul 2023 at

Re: PySpark error java.lang.IllegalArgumentException

2023-07-04 Thread elango vaidyanathan
Hi team, Any updates on this below issue On Mon, 3 Jul 2023 at 6:18 PM, elango vaidyanathan wrote: > > > Hi all, > > I am reading a parquet file like this and it gives > java.lang.IllegalArgumentException. > However i can work with other parquet files (such as nyc taxi parquet > files) without

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Oliver Ruebenacker
Sorry, I didn't try that. On Fri, Feb 24, 2023 at 4:13 PM Russell Jurney wrote: > Oliver, just curious: did you get a clean error message when you broke it > out into separate statements? > > Thanks, > Russell Jurney @rjurney > russell.jur...@gmail.com LI

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Russell Jurney
Oliver, just curious: did you get a clean error message when you broke it out into separate statements? Thanks, Russell Jurney @rjurney russell.jur...@gmail.com LI FB datasyndrome.com Book a time on Ca

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-24 Thread Oliver Ruebenacker
Hello, Thanks for the advice. First of all, it looks like I used the wrong *max* function, but *pyspark.sql.functions.max* isn't right either, because it finds the maximum of a given column over groups of rows. To find the maximum among multiple columns, I need *pyspark.sql.functions.greate
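A minimal sketch of the row-wise greatest() approach the poster is describing; the column names follow the thread (start, end, position) but the data is illustrative:

```
# Sketch only: greatest() compares columns row-wise, unlike the aggregate max().
from pyspark.sql import SparkSession
from pyspark.sql.functions import greatest, col, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(100, 200, 150), (100, 200, 250)],
                           ["start", "end", "position"])

distances = df.withColumn(
    "distance",
    greatest(col("start") - col("position"), col("position") - col("end"), lit(0))
)
distances.show()
```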

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That's pretty impressive. I'm not sure it's quite right - not clear that the intent is taking a minimum of absolute values (is it? that'd be wild). But I think it might have pointed in the right direction. I'm not quite sure why that error pops out, but I think 'max' is the wrong function. That's a

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Bjørn Jørgensen
I'm trying to learn how to use ChatGPT for coding. So after a little chat I got this. The code you provided seems to calculate the distance between a gene and a variant by finding the maximum value between the difference of the variant position and the gene start position, the difference of the ge

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Russell Jurney
Usually, the solution to these problems is to do less per line, break it out and perform each minute operation as a field, then combine those into a final answer. Can you do that here? Thanks, Russell Jurney @rjurney russell.jur...@gmail.com LI

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Oliver Ruebenacker
Here is the complete error: ``` Traceback (most recent call last): File "nearest-gene.py", line 74, in main() File "nearest-gene.py", line 62, in main distances = joined.withColumn("distance", max(col("start") - col("position"), col("position") - col("end"), 0)) File "/mnt/yarn/user

Re: [PySpark SQL] New column with the maximum of multiple terms?

2023-02-23 Thread Sean Owen
That error sounds like it's from pandas not spark. Are you sure it's this line? On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello, > > I'm trying to calculate the distance between a gene (with start and end) > and a variant (with position), so

Re: [PySPark] How to check if value of one column is in array of another column

2023-01-18 Thread Oliver Ruebenacker
Awesome, thanks, this was exactly what I needed! On Tue, Jan 17, 2023 at 5:23 PM Sean Owen wrote: > I think you want array_contains: > > https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_contains.html > > On Tue, Jan 17, 2023 at 4:18 PM Oliver

Re: [PySPark] How to check if value of one column is in array of another column

2023-01-17 Thread Sean Owen
I think you want array_contains: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_contains.html On Tue, Jan 17, 2023 at 4:18 PM Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello, > > I have data originally stored as JSON.
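For reference, a minimal sketch of array_contains used to test one column's value against another column's array; the column names are illustrative, not the original JSON schema:

```
# Sketch only: keep rows whose scalar value appears in the array column.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", ["a", "b"]), ("c", ["a", "b"])],
    ["value", "values"],
)

df.filter(expr("array_contains(values, value)")).show()
```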

Re: [pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Sean Owen
Right, nothing wrong with a for loop here. Seems like just the right thing. On Fri, Jan 6, 2023, 3:20 PM Joris Billen wrote: > Hello Community, > I am working in pyspark with sparksql and have a very similar very complex > list of dataframes that Ill have to execute several times for all the > “

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
So I think now that my problem is Spark-related after all. It looks like my bootstrap script installs SciPy just fine in a regular environment, but somehow interaction with PySpark breaks it. On Fri, Jan 6, 2023 at 12:39 PM Bjørn Jørgensen wrote: > Create a Dockerfile > > FROM fedora > > RUN sud

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
Create a Dockerfile FROM fedora RUN sudo yum install -y python3-devel RUN sudo pip3 install -U Cython && \ sudo pip3 install -U pybind11 && \ sudo pip3 install -U pythran && \ sudo pip3 install -U numpy && \ sudo pip3 install -U scipy docker build --pull --rm -f "Dockerfile" -t fedoratest:l

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Mich Talebzadeh
https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own ri

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
Thank you for the link. I already tried most of what was suggested there, but without success. On Fri, Jan 6, 2023 at 11:35 AM Bjørn Jørgensen wrote: > > > > https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp > > > > > fre.

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp On Fri, 6 Jan 2023 at 16:01, Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello, > > I'm trying to install SciPy using a bootstrap script and then use it

Re: [PySpark] Getting the best row from each group

2022-12-21 Thread Oliver Ruebenacker
Wow, thank you so much! On Wed, Dec 21, 2022 at 10:27 AM Mich Talebzadeh wrote: > OK let us try this > > 1) we have a csv file as below called cities.csv > > country,city,population > Germany,Berlin,3520031 > Germany,Hamburg,1787408 > Germany,Munich,1450381 > Turkey,Ankara,4587558 > Turkey,Istan

Re: [PySpark] Getting the best row from each group

2022-12-21 Thread Mich Talebzadeh
OK let us try this 1) we have a csv file as below called cities.csv country,city,population Germany,Berlin,3520031 Germany,Hamburg,1787408 Germany,Munich,1450381 Turkey,Ankara,4587558 Turkey,Istanbul,14025646 Turkey,Izmir,2847691 United States,Chicago IL,2670406 United States,Los Angeles CA,08501
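A minimal sketch of the "best row per group" pattern this thread converges on, using a window function over the cities data quoted above; the ranking details are illustrative:

```
# Sketch only: keeps the most populous city per country.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Germany", "Berlin", 3520031), ("Germany", "Hamburg", 1787408),
     ("Turkey", "Ankara", 4587558), ("Turkey", "Istanbul", 14025646)],
    ["country", "city", "population"],
)

w = Window.partitionBy("country").orderBy(col("population").desc())
best = (df.withColumn("rn", row_number().over(w))
          .filter(col("rn") == 1)
          .drop("rn"))
best.show()
```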

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Artemis User
Try this one:  "select country, city, max(population) from your_table group by country" Please note this returns a table of three columns, instead of two. This is a standard SQL query, and supported by Spark as well. On 12/20/22 3:35 PM, Oliver Ruebenacker wrote: Hello,   Let's say th

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Bjørn Jørgensen
https://github.com/apache/spark/pull/39134 On Tue, 20 Dec 2022 at 22:42, Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > Thank you for the suggestion. This would, however, involve converting my > Dataframe to an RDD (and back later), which involves additional costs. > > On Tue, Dec 20, 2022

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Oliver Ruebenacker
Thank you for the suggestion. This would, however, involve converting my Dataframe to an RDD (and back later), which involves additional costs. On Tue, Dec 20, 2022 at 7:30 AM Raghavendra Ganesh wrote: > you can groupBy(country). and use mapPartitions method in which you can > iterate over all r

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Oliver Ruebenacker
Hello, Let's say the data is like this: +---+---++ | country | city | population | +---+---++ | Germany | Berlin| 3520031| | Germany | Hamburg | 1787408

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Raghavendra Ganesh
you can groupBy(country). and use mapPartitions method in which you can iterate over all rows keeping 2 variables for maxPopulationSoFar and corresponding city. Then return the city with max population. I think as others suggested, it may be possible to use Bucketing, it would give a more friendly

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Mich Talebzadeh
Hi, Windowing functions were invented to avoid doing lengthy group by etc. As usual there is a lot of heat but little light. Please provide: 1. Sample input. I gather this data is stored in some csv, tsv, table format 2. The output that you would like to see. Have a look at this arti

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
Post an example dataframe and the result you would like to get. On Mon, 19 Dec 2022 at 20:36, Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > Thank you, that is an interesting idea. Instead of finding the maximum > population, we are finding the maximum (population, city name) tuple. > > On

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
Thank you, that is an interesting idea. Instead of finding the maximum population, we are finding the maximum (population, city name) tuple. On Mon, Dec 19, 2022 at 2:10 PM Bjørn Jørgensen wrote: > We have pandas API on spark >

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
We have pandas API on spark which is very good. from pyspark import pandas as ps You can use pdf = df.pandas_api() Where df is your pyspark dataframe. [image: image.png] Does this help you? df.groupby(['Count
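For reference, a minimal sketch of the pandas-on-Spark route mentioned in that reply; the data is illustrative and only mirrors the visible part of the truncated snippet:

```
# Sketch only: converts a PySpark DataFrame to pandas-on-Spark and aggregates.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [("Germany", "Berlin", 3520031), ("Germany", "Hamburg", 1787408),
     ("Turkey", "Istanbul", 14025646)],
    ["Country", "City", "Population"],
)

pdf = sdf.pandas_api()                               # PySpark DataFrame -> pandas-on-Spark
print(pdf.groupby("Country")["Population"].max())    # max population per country
```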

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Patrick Tucci
Window functions don't work like traditional GROUP BYs. They allow you to partition data and pull any relevant column, whether it's used in the partition or not. I'm not sure what the syntax is for PySpark, but the standard SQL would be something like this: WITH InputData AS ( SELECT 'USA' Coun
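A minimal sketch of the standard-SQL form that reply outlines, run through spark.sql(); the table name, columns, and sample rows are illustrative:

```
# Sketch only: ROW_NUMBER() over a partition picks one row per group.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(
    [("USA", "Chicago IL", 2670406), ("Germany", "Berlin", 3520031),
     ("Germany", "Hamburg", 1787408)],
    ["Country", "City", "Population"],
).createOrReplaceTempView("InputData")

spark.sql("""
    SELECT Country, City, Population
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Population DESC) AS rn
        FROM InputData
    )
    WHERE rn = 1
""").show()
```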

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
If we only wanted to know the biggest population, max function would suffice. The problem is I also want the name of the city with the biggest population. On Mon, Dec 19, 2022 at 11:58 AM Sean Owen wrote: > As Mich says, isn't this just max by population partitioned by country in > a window func

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Sean Owen
As Mich says, isn't this just max by population partitioned by country in a window function? On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker wrote: > > Hello, > > Thank you for the response! > > I can think of two ways to get the largest city by country, but both > seem to be inefficie

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
Hello, Thank you for the response! I can think of two ways to get the largest city by country, but both seem to be inefficient: (1) I could group by country, sort each group by population, add the row number within each group, and then retain only cities with a row number equal to 1.

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Mich Talebzadeh
In Spark you can use windowing functions to achieve this HTH view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Oliver Ruebenacker
Thank you, yes, it would be great if this could be extended to use an index. In our case, we're reading files from Amazon S3. S3 does offer the option to request only a chunk out of a file, and any efficient solution would need to use this rather than downloading the file multiple times. On

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Holden Karau
Take a look at https://github.com/nielsbasjes/splittablegzip :D On Tue, Dec 6, 2022 at 7:46 AM Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello Holden, > > Thank you for the response, but what is "splittable gzip"? > > Best, Oliver > > On Tue, Dec 6, 2022 at 9:22 AM H

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Oliver Ruebenacker
Hello Holden, Thank you for the response, but what is "splittable gzip"? Best, Oliver On Tue, Dec 6, 2022 at 9:22 AM Holden Karau wrote: > There is the splittable gzip Hadoop input format, maybe someone could > extend that to use support bgzip? > > On Tue, Dec 6, 2022 at 1:43 PM Ol

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Holden Karau
There is the splittable gzip Hadoop input format, maybe someone could extend that to use support bgzip? On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello Chris, > > Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but > to s

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Oliver Ruebenacker
Hello Chris, Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but to start reading from somewhere other than the beginning of the file, you would need to use an index to tell you where the blocks start. Originally, a Tabix index was used and is still the popular choice, a

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-05 Thread Chris Nauroth
Sorry, I misread that in the original email. This is my first time looking at bgzip. I see from the documentation that it is putting some additional framing around gzip and producing a series of small blocks, such that you can create an index of the file and decompress individual blocks instead of

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-05 Thread Oliver Ruebenacker
Hello, Thanks for the response, but I mean compressed with bgzip , not bzip2. Best, Oliver On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth wrote: > Hello Oliver, > > Yes, Spark makes this possible using the Hadoop compression codecs and the > Hado

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-02 Thread Chris Nauroth
Hello Oliver, Yes, Spark makes this possible using the Hadoop compression codecs and the Hadoop-compatible FileSystem interface [1]. Here is an example of reading: df = spark.read.text("gs:///data/shakespeare-bz2") df.show(10) This is using a test data set of the complete works of Shakespeare, s

Re: [PySpark] Join using condition where each record may be joined multiple times

2022-11-28 Thread Oliver Ruebenacker
Hello, Thanks, I can do that. What I was hoping to hear is whether what I'm trying to do is even considered possible, and what would be the correct 'how' parameter? Best, Oliver On Sun, Nov 27, 2022 at 2:50 PM Artemis User wrote: > What if you just do a join with the first conditio

Re: [PySpark] Join using condition where each record may be joined multiple times

2022-11-27 Thread Artemis User
What if you just do a join with the first condition (equal chromosome) and append a select with the rest of the conditions after join?  This will allow you to test your query step by step, maybe with a visual inspection to figure out what the problem is. It may be a data quality problem as well

Re: Pyspark ML model Save Error

2022-11-16 Thread Raja bhupati
Share more details on the error to help suggest solutions. On Wed, Nov 16, 2022, 22:13 Artemis User wrote: > What problems did you encounter? Most likely your problem may be > related to saving the model object in different partitions. If that the > case, just apply the dataframe's coalesce(1) m

Re: Pyspark ML model Save Error

2022-11-16 Thread Artemis User
What problems did you encounter?  Most likely your problem may be related to saving the model object in different partitions.  If that is the case, just apply the dataframe's coalesce(1) method before saving the model to a shared disk drive... On 11/16/22 1:51 AM, Vajiha Begum S A wrote: Hi, Thi

Re: pyspark connect to spark thrift server port

2022-10-21 Thread Artemis User
I guess there are some confusions here between the metastore and the actual Hive database.  Spark (as well as Apache Hive) requires two databases for Hive DB operations.  Metastore is used for storing metadata only (e.g., schema info), whereas the actual Hive database, accessible through Thrift

Re: pyspark connect to spark thrift server port

2022-10-20 Thread second_co...@yahoo.com.INVALID
Hello Artemis, Understood. If I give the Hive metastore URI to anyone to connect using PySpark, port 9083 is open to anyone without any authentication feature. The only way PySpark is able to connect to Hive is through 9083 and not through port 1. On Friday, October 21, 2022 at 04:06:38 A

Re: pyspark connect to spark thrift server port

2022-10-20 Thread Artemis User
By default, Spark uses Apache Derby (running in embedded mode with store content defined in local files) for hosting the Hive metastore.  You can externalize the metastore on a JDBC-compliant database (e.g., PostgreSQL) and use the database authentication provided by the database.  The JDBC con

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-02 Thread Sean Owen
That isn't the issue - the table does not exist anyway, but the storage path does. On Tue, Aug 2, 2022 at 6:48 AM Stelios Philippou wrote: > HI Kumba. > > SQL Structure is a bit different for > CREATE OR REPLACE TABLE > > > You can only do the following > CREATE TABLE IF NOT EXISTS > > > > > htt

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-02 Thread Stelios Philippou
HI Kumba. SQL Structure is a bit different for CREATE OR REPLACE TABLE You can only do the following CREATE TABLE IF NOT EXISTS https://spark.apache.org/docs/3.3.0/sql-ref-syntax-ddl-create-table-datasource.html On Tue, 2 Aug 2022 at 14:38, Sean Owen wrote: > I don't think "CREATE OR REPLA

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-02 Thread Sean Owen
I don't think "CREATE OR REPLACE TABLE" exists (in SQL?); this isn't a VIEW. Delete the path first; that's simplest. On Tue, Aug 2, 2022 at 12:55 AM Kumba Janga wrote: > Thanks Sean! That was a simple fix. I changed it to "Create or Replace > Table" but now I am getting the following error. I am

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-02 Thread ayan guha
Hi, I strongly suggest printing the prepared SQLs and trying them in raw form. The error you posted points to a syntax error. On Tue, 2 Aug 2022 at 3:56 pm, Kumba Janga wrote: > Thanks Sean! That was a simple fix. I changed it to "Create or Replace > Table" but now I am getting the following error.

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-01 Thread Kumba Janga
Thanks Sean! That was a simple fix. I changed it to "Create or Replace Table" but now I am getting the following error. I am still researching solutions but so far no luck. ParseException: mismatched input '' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'A

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-01 Thread Sean Owen
Pretty much what it says? you are creating a table over a path that already has data in it. You can't do that without mode=overwrite at least, if that's what you intend. On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga wrote: > > >- Component: Spark Delta, Spark SQL >- Level: Beginner >- S

Re: PySpark cores

2022-07-29 Thread Gourav Sengupta
Hi, Agree with the above response, but in case you are using Arrow and transferring data from the JVM to Python and back, then please try to check how things are getting executed in Python. Please let me know what processing you are trying to do while using Arrow. Regards, Gourav Sengupta On Fri

Re: PySpark cores

2022-07-29 Thread Jacob Lynn
I think you are looking for the spark.task.cpus configuration parameter. On Fri, 29 Jul 2022 at 07:41, Andrew Melo wrote: > Hello, > > Is there a way to tell Spark that PySpark (arrow) functions use > multiple cores? If we have an executor with 8 cores, we would like to > have a single PySpark f
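For reference, a minimal sketch of setting spark.task.cpus at session build time; the values are illustrative, not a recommendation:

```
# Sketch only: with 8 executor cores and spark.task.cpus=4, two tasks run concurrently
# per executor, each free to use multiple cores.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("multicore-udf")
    .config("spark.executor.cores", "8")
    .config("spark.task.cpus", "4")
    .getOrCreate()
)
```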

Re: Pyspark and multiprocessing

2022-07-21 Thread Khalid Mammadov
Pool.map requires 2 arguments. 1st a function and 2nd an iterable i.e. list, set etc. Check out examples from official docs how to use it: https://docs.python.org/3/library/multiprocessing.html On Thu, 21 Jul 2022, 21:25 Bjørn Jørgensen, wrote: > Thank you. > The reason for using spark local is
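A minimal sketch of Pool.map with a function and an iterable, as described; the worker function and file list are placeholders standing in for the thread's json_to_norm_with_null use case:

```
# Sketch only: process a list of files across a pool of worker processes.
from multiprocessing import Pool

def process_file(path: str) -> str:
    # stand-in for the per-file work (e.g. the thread's json_to_norm_with_null)
    return f"processed {path}"

if __name__ == "__main__":
    files = [f"/tmp/file_{i}.json" for i in range(10)]
    with Pool(processes=4) as pool:
        results = pool.map(process_file, files)   # map(function, iterable)
    print(results)
```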

Re: Pyspark and multiprocessing

2022-07-21 Thread Bjørn Jørgensen
Thank you. The reason for using spark local is to test the code, and as in this case I find the bottlenecks and fix them before I spin up a K8S cluster. I did test it now with 16 cores and 10 files import time tic = time.perf_counter() json_to_norm_with_null("/home/jovyan/notebooks/falk/test",

Re: Pyspark and multiprocessing

2022-07-21 Thread Khalid Mammadov
One quick observation is that you allocate all your local CPUs to Spark, then execute that app with 10 threads, i.e. 10 Spark apps, and so you will need 160 cores in total as each will need 16 CPUs IMHO. Wouldn't that create a CPU bottleneck? Also, on a side note, why do you need Spark if you use that on lo

Re: [Pyspark] [Linear Regression] Can't Fit Data

2022-03-17 Thread Sean Owen
The error points you to the answer. Somewhere in your code you are parsing dates, and the date format is no longer valid / supported. These changes are doc'ed in the docs it points you to. It is not related to the regression itself. On Thu, Mar 17, 2022 at 11:35 AM Bassett, Kenneth wrote: > Hell
