No Data Transfer During Creation: --> Data transfer occurs only when an
> action is triggered.
> Distributed Processing: --> DataFrames are distributed for parallel
> execution, not stored entirely on the driver node.
> Lazy Evaluation Optimization: --> Delaying data transfer until an action is
> triggered allows Spark to optimize the execution plan.
On Mon, Mar 18, 2024 at 1:16 PM Mich Talebzadeh
wrote:
>
> "I may need something like that for synthetic data for testing. Any way to
> do that ?"
>
> Have a look at this.
>
> https://github.com/joke2k/faker
>
No I was not actually referring to data that can be faked. I want data to
actually res
disk. Look at the
Stages tab in the UI (port 4040).
*In summary:*
No Data Transfer During Creation: --> Data transfer occurs only when an
action is triggered.
Distributed Processing: --> DataFrames are distributed for parallel
execution, not stored entirely on the driver node.
Lazy Evaluation Optimization: --> Delaying data transfer until an action is
triggered allows Spark to optimize the execution plan.
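A minimal sketch of that behaviour (assuming a local SparkSession named spark; the data and column names are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Defining the DataFrame from a local Python list only builds a logical plan;
# no Spark job runs and nothing is distributed to executors yet.
df = spark.createDataFrame([(1, "spark"), (2, "arrow")], ["id", "name"])
filtered = df.filter("id > 1")   # still lazy, just extends the plan

filtered.explain()               # inspect the plan without executing it
print(filtered.count())          # the action is what triggers a job (visible in the 4040 UI)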
I am trying to understand Spark Architecture.
For DataFrames that are created from Python objects, i.e. that are *created
in memory*, where are they stored?
Take following example:
from pyspark.sql import Row
import datetime
courses = [
{
'course_id': 1,
m Verma
Sent: Monday, December 26, 2022 8:08 PM
To: Russell Jurney
Cc: Gurunandan ; user@spark.apache.org
Subject: EXT: Re: Check if shuffle is caused for repartitioned pyspark
dataframes
I tried sorting the repartitioned dataframes on the pa
I tried sorting the repartitioned dataframes on the partition key before
saving them as parquet files, however, when I read those
repartitioned-sorted dataframes
and join them on the partition key, the spark plan still shows `Exchange
hashpartitioning` step, which I want to avoid
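One approach that may help here (a sketch only, with a made-up join key and table names, not verified against your data) is to write both sides as bucketed tables rather than plain parquet files, so the partitioning survives the write and Spark can skip the shuffle when both sides share the same bucketing:

df1.write.format("parquet").bucketBy(64, "part_key").sortBy("part_key").saveAsTable("t1")
df2.write.format("parquet").bucketBy(64, "part_key").sortBy("part_key").saveAsTable("t2")

# When both tables are bucketed identically on the join key, the plan should
# no longer need an Exchange hashpartitioning step for this join.
spark.table("t1").join(spark.table("t2"), "part_key").explain()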
Hi Gurunandan,
Thanks for the reply!
I do see the exchange operator in the SQL tab, but I can see it in both the
experiments:
1. Using repartitioned dataframes
2. Using initial dataframes
Does that mean that the repartitioned dataframes are not actually
"co-partitioned"?
If that's the case, I have two more questions:
Hello folks,
I have a use case where I save two pyspark dataframes as parquet files and
then use them later to join with each other or with other tables and
perform multiple aggregations.
Since I know the column being used in the downstream joins and groupby, I
was hoping I could use co
is the reason you got the OOM and AnalysisException.
My suggestion is to checkpoint the dataframe once the 200 dataframes are joined,
so you can truncate the lineage and the optimizer only has to analyze the 200
dataframes. This will reduce the pressure on the Spark engine.
Hollis
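A rough sketch of that suggestion (assuming spark is the active SparkSession and big_joined is the already-joined DataFrame; both names are placeholders):

spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

# checkpoint() materializes the data and truncates the lineage, so the
# optimizer no longer has to re-analyze the whole chain of 200 joins.
big_joined = big_joined.checkpoint()   # eager by default

big_joined.groupBy("key").count().show()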
dataframes may apply and RDD
are used, but for UDF's I prefer SQL as well, but that may be a
personal idiosyncrasy. The O'Reilly book on data algorithms using Spark and
PySpark uses the DataFrame and RDD APIs :)
Regards,
Gourav Sengupta
On Fri, Dec 24, 2021 at 6:11 PM Sean Owen wrote:
operators will be faster.
Sometimes you have to go outside SQL where necessary, like in UDFs or
complex aggregation logic. Then you can't use SQL.
On Fri, Dec 24, 2021 at 12:05 PM Gourav Sengupta
wrote:
Hi,
yeah I think that in practice you will always find that dataframes can give
issues regarding a lot of things, and then you can argue. In the SPARK
conference, I think last year, it was shown that more than 92% or 95% use
the SPARK SQL API, if I am not mistaken.
I think that you can do the
Hi Sean and Gourav
Thanks for the suggestions. I thought that both the SQL and DataFrame APIs are
wrappers around the same framework, i.e. Catalyst?
I tend to mix and match my code. Sometimes I find it easier to write using SQL,
sometimes dataframes. What is considered best practice?
Here
the same
>> ordering, and a sort/hash merge doesn't need to be done?
>>
>> Thanks
>> Andrew
>>
>> On Wed, May 12, 2021 at 11:07 AM Sean Owen wrote:
>> >
>> > Yeah I don't think that's going to work - you aren't guarantee
> On Tue, 18 May 2021 at 16:39, kushagra deep
> wrote:
>
>> The use case is to calculate PSI/CSI values. And yes the union is one to
>> one row as you showed.
>>
>> On Tue, May 18, 2021, 20:39 Mich Talebzadeh
>> wrote:
>>
>>>
>>>
Kushagra,
>>
>> A bit late on this but what is the business use case for this merge?
>>
>> You have two data frames each with one column and you want to UNION them
>> in a certain way but the correlation is not known. In other words this
>> UNION is as is?
>>
>>
shagra deep
> wrote:
>
>> Hi All,
>>
>> I have two dataframes
>>
>> df1
>>
>> amount_6m
>> 100
>> 200
>> 300
>> 400
>> 500
>>
>> And a second data df2 below
>>
>> amount_9m
>>
500
200 600
HTH
On Wed, 12 May 2021 at 13:51, kushagra deep
wrote:
> Hi All,
>
> I have two dataframes
>
> df1
>
> amount_6m
> 100
> 200
> 300
> 400
> 500
>
> And a second data df2 below
>
> amount_9m
> 500
> 600
ation, we had a
series of r one per rule. For N rules, we created N dataframes that had the
rows that satisfied the rules. Then we unioned the N data frames. Horrible
performance that didn't scale with N. We reimplemented to add N Boolean
columns; one per rule; that indicated if the rule was sat
In our case, these UDFs are quite expensive and worked on in an
iterative manner, so being able to cache the two "sides" of the graphs
independently will speed up the development cycle. Otherwise, if you
modify foo() here, then you have to recompute bar and baz, even though
they're unchanged.
df.w
Why join here - just add two columns to the DataFrame directly?
On Mon, May 17, 2021 at 1:04 PM Andrew Melo wrote:
> Anyone have ideas about the below Q?
>
> It seems to me that given that "diamond" DAG, that spark could see
> that the rows haven't been shuffled/filtered, it could do some type o
Yeah I don't think that's going to work - you aren't guaranteed to get 1,
2, 3, etc. I think row_number() might be what you need to generate a join
ID.
RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You
could .zip two RDDs you get from DataFrames and manually convert the Rows
back to a single Row and back to DataFrame.
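A sketch of the row_number() idea for this thread's data (column names taken from the thread; note the window has no partitioning, so Spark will warn that all rows move through a single partition):

from pyspark.sql import Window, functions as F

w = Window.orderBy(F.monotonically_increasing_id())

a = df1.withColumn("rn", F.row_number().over(w))
b = df2.withColumn("rn", F.row_number().over(w))

df3 = a.join(b, "rn").drop("rn")   # amount_6m | amount_9m paired by row position
df3.show()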
Thanks Raghavendra
Will the ids for corresponding columns always be the same? Since
monotonically_increasing_id() returns a number based on the partition id and the
row number within the partition, will it be the same for corresponding columns?
Also, is it guaranteed that the two dataframes will be divided into
ot;), "inner").drop("id").show()
+-+-+
|amount_6m|amount_9m|
+-+-+
| 100| 500|
| 200| 600|
| 300| 700|
| 400| 800|
| 500| 900|
+-+-+
--
Raghavendra
On Wed, May 12, 2021 at 6:20 PM kushagra deep
w
Hi All,
I have two dataframes
df1
amount_6m
100
200
300
400
500
And a second data df2 below
amount_9m
500
600
700
800
900
The number of rows is the same in both dataframes.
Can I merge the two dataframes to achieve the df below?
df3
amount_6m | amount_9m
100       | 500
200       | 600
300       | 700
400       | 800
500       | 900
I am creating a spark structured streaming job, where I need to find the
difference between two dataframes.
Dataframe 1 :
[1, item1, value1]
[2, item2, value2]
[3, item3, value3]
[4, item4, value4]
[5, item5, value5]
Dataframe 2:
[4, item4, value4]
[5, item5, value5]
New Dataframe with the rows of Dataframe 1 that are not in Dataframe 2:
[1, item1, value1]
[2, item2, value2]
[3, item3, value3]
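For the batch version of this, a left anti join (or subtract) expresses the difference; a sketch assuming the first column is named id (streaming adds its own restrictions on such operations depending on the Spark version):

# Rows of dataframe 1 whose id does not appear in dataframe 2.
diff = df1.join(df2, on="id", how="left_anti")

# Or, comparing entire rows:
diff = df1.subtract(df2)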
incurring IO overhead on every microbatch.
From: Arti Pande
Date: Friday, November 13, 2020 at 2:19 PM
To: "Lalwani, Jayesh"
Cc: "user@spark.apache.org"
Subject: RE: [EXTERNAL] Refreshing Data in Spark Memory (DataFrames)
old
> computation, your results don’t change.
> There might be scenarios where you want to correct old reference data. In
> this case you update your reference table, and rerun your computation.
>
>
>
> Now, if you are talking about streaming applications, then it’s a
> diff
reference data. Spark reloads the dataframes
from batch sources at the beginning of every microbatch. As long as you are
reading the data from a non-streaming source, it will get refreshed in
every microbatch. Alternatively, you can send updates to reference data through
a stream, and then merge
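A sketch of the first option (paths, schema and key names are placeholders): the reference side is read as a plain batch DataFrame, so it is re-evaluated on each micro-batch of the stream-static join.

events = (spark.readStream
          .format("json")
          .schema(event_schema)            # event_schema assumed defined elsewhere
          .load("/data/events"))

reference = spark.read.parquet("/data/reference")   # batch source, re-read per micro-batch

enriched = events.join(reference, "ref_key", "left")

query = (enriched.writeStream
         .format("parquet")
         .option("path", "/data/out")
         .option("checkpointLocation", "/data/checkpoints")
         .start())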
Hi
In the financial systems world, if some data is being updated too
frequently, and that data is to be used as reference data by a Spark job
that runs for 6/7 hours, most likely the Spark job will read that data at the
beginning, keep it in memory as a DataFrame, and keep running for
remaining 6/
Hello,
I need to filter one huge table using other huge tables.
I could not avoid a sort operation using `WHERE IN` or `INNER JOIN`.
Can this be avoided?
As I'm OK with false positives, maybe a Bloom filter is an alternative.
I saw that Scala has a builtin dataframe function
(https://spark.apache.o
Hi Jorge,
Thank you. This union function is better alternative for my work.
Regards,
Tanveer Ahmad
From: Jorge Machado
Sent: Monday, May 25, 2020 3:56:04 PM
To: Tanveer Ahmad - EWI
Cc: Spark Group
Subject: Re: Arrow RecordBatches/Pandas Dataframes to (Arrow
Hey, from what I know you can try to Union them df.union(df2)
Not sure if this is what you need
> On 25. May 2020, at 13:53, Tanveer Ahmad - EWI wrote:
>
> Hi all,
>
> I need some help regarding Arrow RecordBatches/Pandas Dataframes to (Arrow
> enabled) Spark Dataframe c
Hi all,
I need some help regarding Arrow RecordBatches/Pandas Dataframes to (Arrow
enabled) Spark Dataframe conversions.
Here the example explains very well how to convert a single Pandas Dataframe to
Spark Dataframe [1].
But in my case, some external applications are generating Arrow
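For the pandas-to-Spark part referenced in [1], a minimal sketch (record_batches is a placeholder for whatever the external application produces; the Arrow flag shown is the Spark 2.x name):

import pyarrow as pa
from functools import reduce
from pyspark.sql import DataFrame

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Convert each incoming Arrow RecordBatch to pandas, then to a Spark DataFrame,
# and union the pieces into a single DataFrame.
pdfs = [pa.Table.from_batches([rb]).to_pandas() for rb in record_batches]
sdfs = [spark.createDataFrame(pdf) for pdf in pdfs]
combined = reduce(DataFrame.union, sdfs)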
fle. Spark automatically caches when a data shuffle
happens.
Let me know if you get it to work.
Regards
On Mon, May 18, 2020 at 10:27 PM Mohit Durgapal
wrote:
> Dear All,
>
> I would like to know how, in spark 2.0, can I split a dataframe into two
> dataframes when I know the exact cou
Dear All,
I would like to know how, in Spark 2.0, I can split a dataframe into two
dataframes when I know the exact counts the two dataframes should have. I
tried using limit but got quite weird results. Also, I am looking for exact
counts in child dfs, not the approximate % based split
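One way to get exact counts (a sketch; the global row numbering funnels data through a single partition, so it can be slow on large inputs):

from pyspark.sql import Window, functions as F

n_first = 100000   # exact number of rows wanted in the first child dataframe

w = Window.orderBy(F.monotonically_increasing_id())
numbered = df.withColumn("rn", F.row_number().over(w))

df_a = numbered.filter(F.col("rn") <= n_first).drop("rn")
df_b = numbered.filter(F.col("rn") > n_first).drop("rn")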
Hi all,
I'm using ML Pipeline to construct a flow of transformation. I'm wondering
if it is possible to set multiple dataframes as the input of a transformer?
For example I need to join two dataframes together in a transformer, then
feed into the estimator for training. If not, is ther
Hi all!
Is it possible that Spark creates under certain circumstances duplicate
rows when doing multiple joins?
What I did:
buse.count
res0: Long = 20554365
buse.alias("buse").join(bdef.alias("bdef"), $"buse._c4"===$"bdef._c4").count
res1: Long = 20554365
buse.alias("buse").join(bdef.alia
> either materialize the Dataframe on HDFS (e.g. parquet or checkpoint)
I wonder if Avro is a better candidate for this; because it's row-oriented,
it should be faster to write/read for such a task. I had never heard
about checkpoint.
Enrico Minack writes:
> It is not about very large or small, it is
It is not about very large or small, it is about how large your cluster
is w.r.t. your data. Caching is only useful if you have the respective
memory available across your executors. Otherwise you could either
materialize the Dataframe on HDFS (e.g. parquet or checkpoint) or indeed
have to do t
> .dropDuplicates() \ .cache()
> Since df_actions is cached, you can count inserts and updates quickly
> with only that one join in df_actions:
Hi Enrico. I am wondering if this is ok for very large tables ? Is
caching faster than recomputing both insert/update ?
Thanks
Enrico Minack writes
Hi,
Thank you both for your suggestions! These have been eyeopeners for me.
Just to clarify, I need the counts for logging and auditing purposes
otherwise I would exclude the step. I should have also mentioned that
while I am processing around 30 GB of raw data, the individual outputs are
relat
Ashley,
I want to suggest a few optimizations. The problem might go away but at
least performance should improve.
The freeze problems could have many reasons, the Spark UI SQL pages and
stages detail pages would be useful. You can send them privately, if you
wish.
1. the repartition(1) shoul
Hi ashley,
Apologies, reading this on my phone as my work laptop doesn't let me access
personal email.
Are you actually doing anything with the counts (printing to log, writing
to table?)
If you're not doing anything with them get rid of them and the caches
entirely.
If you do want to do somethin
Thanks David,
I did experiment with the .cache() keyword and have to admit I didn't see
any marked improvement on the sample that I was running, so yes I am a bit
apprehensive including it (not even sure why I actually left it in).
When you say "do the count as the final step", are you referring
Hi Ashley,
I'm not an expert but think this is because spark does lazy execution and
doesn't actually perform any actions until you do some kind of write, count
or other operation on the dataframe.
If you remove the count steps it will work out a more efficient execution
plan reducing the number
Hi,
I am currently working on an app using PySpark to produce an insert and
update daily delta capture, being outputted as Parquet. This is running on
a 8 core 32 GB Linux server in standalone mode (set to 6 worker cores of
2GB memory each) running Spark 2.4.3.
This is being achieved by reading
Dear Yeikel
I checked my code and it uses getOrCreate to create a SparkSession.
Therefore, I should be retrieving the same SparkSession instance every time I
call that method.
Thanks for your reminding.
Best regard
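A quick way to convince yourself of that within a single Python process (a sketch):

from pyspark.sql import SparkSession

s1 = SparkSession.builder.appName("job").getOrCreate()
s2 = SparkSession.builder.getOrCreate()
assert s1 is s2   # getOrCreate returns the already-active session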
HDFS with given parameters. This function returns a pyspark dataframe and
the SparkContext it used.
With the client's increasing demands, I need to merge data from multiple queries.
I tested using the "union" function to merge the pyspark dataframes returned by
different function calls directly and it worked. This surprised me that
pyspark dataframe can actually union dataframes from different SparkSession.
I am using pyspark 2.3.1 and Python 3.5.
I wonder if this is a good practice or I
ions
there.
On Sun, Oct 6, 2019 at 2:49 PM KhajaAsmath Mohammed
wrote:
Hi,
What is the best approach to loop through 3 dataframes in scala based on
some keys instead of using collect.
Thanks,
Asmath
Hi ,
I have a use case where I have to cogroup two streams using cogroup in
streaming. However when I do so I get an exception that “Cogrouping in
streaming is not supported in DataFrame/Dataset”. Please clarify.
Regards ,
Kushagra Deep
t; ..."* Could you elaborate a bit more on what does this static piece of
> data refer to? Are you referring to the 10 records that had already arrived
> at T1 and are now sitting as old static data in the unbounded table?
>
> Regards
> Sheel
>
>
> On Thu, May 16, 2019 at 3
Dataframes describe the calculation to be done, but the underlying
implementation is an "Incremental Query". That is that the dataframe code
is executed repeatedly with Catalyst adjusting the final execution plan on
each run. Some parts of the plan refer to static pieces of data, other
Hi
Structured Streaming treats a stream as an unbounded table in the form of a
DataFrame. Continuously flowing data from the stream keeps getting added to
this DataFrame (which is the unbounded table) which warrants a change to
the DataFrame, which violates the very basic nature of a DataFrame since
Hi,
I am new to spark and want to start contributing to Apache spark to know
more about it.
I found this JIRA to have "Standardized Join Types for DataFrames", which I
feel could be a good starter task for me. I wanted to confirm if this is a
relevant/actionable task and if I can start
: What are the alternatives to nested DataFrames?
2 options I can think of:
1- Can you perform a union of the dfs returned by the Elasticsearch queries? It
would still be distributed but I don't know if you will run out of how many
union operations you can perform at a time.
2- Can you used
m...@yeikel.com
> *Cc:* Shahab Yunus ; user
> *Subject:* Re: What are the alternatives to nested DataFrames?
>
>
>
> Could you join() the DFs on a common key?
>
>
>
> On Fri, Dec 28, 2018 at 18:35 wrote:
>
> Shahab, I am not sure what you are trying to say. Could
iginal DF and returns a new dataframe including all the
matching terms
From: Andrew Melo
Sent: Friday, December 28, 2018 8:48 PM
To: em...@yeikel.com
Cc: Shahab Yunus ; user
Subject: Re: What are the alternatives to nested DataFrames?
Could you join() the DFs on a common key?
uery("name", city).operator(Operator.AND)
print(qb.toString)
val dfs = sqlContext.esDF("cities/docs", qb.toString) // null pointer
dfs.show()
})
From: Shahab Yunus
Sent: Friday, December 28, 2018 12:34 PM
To: em...@yeikel.com
Cc: user
Sub
Hi community ,
As shown in other answers online , Spark does not support the nesting of
DataFrames , but what are the options?
I have the following scenario :
dataFrame1 = List of Cities
dataFrame2 = Created after searching in ElasticSearch for each city in
dataFrame1
I
s added to a python array like
the_dfs.append(df.select(cols).toDF(*cols).cache())
the_dfs[-1].count()  # count the last appended dataframe; indexing with len(the_dfs) would be out of range
The dataframes are finally combined using
df_all = reduce(DataFrame.union, the_dfs).cache()  # reduce comes from functools
df_all.count()
THE CURRENT STATE:
The proof of concept works on a smaller amount of data
e where P1 = key._1 and P2 = key._2")
> }
>
> Regards,
> Nirav
>
>
> On Wed, Aug 1, 2018 at 4:18 PM, Koert Kuipers wrote:
>
>> this works for dataframes with spark 2.3 by changing a global setting,
>> and will be configurable per write in 2.4
>>
this works for dataframes with spark 2.3 by changing a global setting, and
will be configurable per write in 2.4
see:
https://issues.apache.org/jira/browse/SPARK-20236
https://issues.apache.org/jira/browse/SPARK-24860
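The setting referred to is the partition overwrite mode; a sketch (path and partition column are placeholders):

# Spark 2.3+: global setting (SPARK-20236)
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Spark 2.4+: per-write option (SPARK-24860)
(df.write
   .mode("overwrite")
   .option("partitionOverwriteMode", "dynamic")
   .partitionBy("date")
   .parquet("/data/output"))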
On Wed, Aug 1, 2018 at 3:11 PM, Nirav Patel wrote:
> Hi Peay,
>
>
Hi Peay,
Have you found a better solution yet? I am having the same issue.
The following says it works with Spark 2.1 onward, but only when you use
sqlContext and not DataFrame:
https://medium.com/@anuvrat/writing-into-dynamic-partitions-using-spark-2e2b818a007a
Thanks,
Nirav
On Mon, Oct 2, 2017 at 4:37 AM,
Hi guys,
I was wondering if there is a way to compress files using zstd. It seems
zstd compression can be used for shuffle data, is there a way to use it for
output data as well?
Thanks
Nikhil
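For file output this is controlled by the data source's compression option; a sketch (whether zstd is actually accepted here depends on the Spark/Parquet/Hadoop versions in use, so treat it as something to test rather than a guarantee):

# Per-write option
df.write.option("compression", "zstd").parquet("/data/out")

# Or as a session-wide default for Parquet output
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")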
d Streaming(SSS) and explode
>>> the data and flatten all data into single record using DataFrame joins
>>> and
>>> land into a relational database table(DB2).
>>>
>>> But we are getting the following error when we write into d
Thanks for the quick response...I'm able to inner join the dataframes with
regular spark session. The issue is only with the spark streaming session.
BTW I'm using Spark 2.2.0 version...
table(DB2).
But we are getting the following error when we write into db using JDBC.
“org.apache.spark.sql.AnalysisException: Inner join between two streaming
DataFrames/Datasets is not supported;”
Any help would be greatly appreciated.
Thanks,
Thomas Thomas
Mastermind Solutions LLC.
I suppose that it's hardly possible that this issue is connected with
the string encoding, because
- "pr^?files.10056.10040" should be "profiles.10056.10040" and is
defined as constant in the source code
-
"profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@
Encoding issue of the data? Eg spark uses utf-8 , but source encoding is
different?
> On 28. Mar 2018, at 20:25, Sergey Zhemzhitsky wrote:
>
> Hello guys,
>
> I'm using Spark 2.2.0 and from time to time my job fails printing into
> the log the following errors
>
> scala.MatchError:
> profiles
Hello guys,
I'm using Spark 2.2.0 and from time to time my job fails printing into
the log the following errors
scala.MatchError:
profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@
scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)
sc
ne B
df_t = df.groupby("ID").agg(F.countDistinct("LABEL").alias("nL")).filter(F.col("nL") > 1)
df = df.join(df_t.select("ID"), ["ID"])
df_sw = df.groupby(["ID", "kk"]).count().withColumnRenamed("count", "cnt1"
Hi Spark-users:
I have a dataframe "df_t" which was generated from other dataframes by
several transformations. And then I did something very simple, just
counting the rows, that is the following code:
(A)
df_t_1 = df_t.groupby(["Id","key"]).count().withColumnR
I have been interested in finding out why I am getting strange behavior
when running a certain spark job. The job will error out if I place an
action (A .show(1) method) either right after caching the DataFrame or
right before writing the dataframe back to hdfs. There is a very similar
post to Stac
You are using Spark Streaming Kafka package. The correct package name is "
org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0"
On Mon, Nov 20, 2017 at 4:15 PM, salemi wrote:
> Yes, we are using --packages
>
> $SPARK_HOME/bin/spark-submit --packages
> org.apache.spark:spark-streaming-kafka-0-10_2.1
Yes, we are using --packages
$SPARK_HOME/bin/spark-submit --packages
org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 --py-files shell.py
org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 ...
On Mon, Nov 20, 2017 at 3:07 PM, salemi wrote:
> Hi All,
>
> we are trying to use DataFrames approach with Kafka 0.10 and PySpark 2.2.0.
> We followed the instruction on the wiki
> https://spark.apache.org/docs/latest/struc
Hi All,
we are trying to use DataFrames approach with Kafka 0.10 and PySpark 2.2.0.
We followed the instruction on the wiki
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html.
We coded something similar to the code below using Python:
df = spark \
.read \
.format
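A sketch of the batch-style read from the structured-streaming Kafka guide, assuming the spark-sql-kafka-0-10 package mentioned above is on the classpath (broker address and topic name are placeholders):

df = (spark
      .read
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "topic1")
      .load())

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()

# For a continuous stream, use spark.readStream / writeStream instead of spark.read.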
Is union of 2 Structured streaming dataframes from different sources supported
in 2.2?
We have done a union of 2 streaming dataframes that are from the same source. I
wanted to know if multiple streams are supported or going to be supported in
the future