I am trying to read a .csv.gz file in S3 and write it to S3 in Parquet format
through a Spark job. Currently I am using the AWS EMR service for this. The Spark
job is executed as a step in the EMR cluster. For some .csv.gz files I have
encountered the issue below. I have used both Spark 3.4 and 3.5 versions and
still
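For reference, a minimal sketch of that kind of job (bucket and key names here are placeholders, and header/schema options depend on the actual files):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CsvGzToParquet").getOrCreate()
// Spark picks the gzip codec from the .gz extension; note each .csv.gz file is
// read by a single task because gzip is not splittable.
val df = spark.read
  .option("header", "true")
  .csv("s3://my-bucket/input/data.csv.gz")                      // hypothetical input path
df.write.mode("overwrite").parquet("s3://my-bucket/output/")    // hypothetical output path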
Hello,
We are hoping someone can help us understand the Spark behavior
for the scenarios listed below.
Q. Will running Spark queries fail when an S3 Parquet object changes
underneath, with S3A remote file change detection enabled? Is that guaranteed (100%)?
Our understanding is that S3A has a
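For anyone comparing notes: S3A's change detection is a Hadoop-level setting rather than a Spark one. A hedged sketch of the relevant knobs (property names as in recent Hadoop 3.x releases; please verify against your Hadoop version):

val spark = org.apache.spark.sql.SparkSession.builder()
  // how S3A decides the object changed underneath: "etag" or "versionid"
  .config("spark.hadoop.fs.s3a.change.detection.source", "etag")
  // what to do when a change is detected: "server", "client", "warn" or "none"
  .config("spark.hadoop.fs.s3a.change.detection.mode", "server")
  .getOrCreate()

Whether every running query then fails deterministically also depends on caching and retry behaviour, so this is the switch that controls detection rather than a guarantee.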
On Mon, 24 Jul 2023 at 10:56, Pralabh Kumar wrote:
> Spark 3.3 in OSS is built with Parquet 1.12. Just compiling with Parquet
> 1.10 results in a build failure, so just wondering if a
Hi Spark Users.
I have a quick question about Spark 3.3. Currently Spark 3.3 is
built with Parquet 1.12.
However, has anyone tried Spark 3.3 with Parquet 1.10?
We are at Uber, planning to migrate to Spark 3.3, but we are limited to
using Parquet 1.10. Has anyone tried building Spark
Hello,
I'm trying to write an RDD[T] to Parquet, where T is a protobuf message,
in Scala.
I am wondering what the best option is to do this, and I would be
interested in your insights.
So far, I see two possibilities:
- use the PairRDD method saveAsNewAPIHadoopFile, and I guess I need to
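For the first option, a hedged sketch using the parquet-protobuf module's ProtoParquetOutputFormat (class and method names from parquet-protobuf; MyProto and the output path are placeholders, so please verify against the library version you use):

import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.proto.ProtoParquetOutputFormat

val job = Job.getInstance(sc.hadoopConfiguration)
// tell the Parquet write support which protobuf class describes the records
ProtoParquetOutputFormat.setProtobufClass(job, classOf[MyProto])

rdd.map(msg => (null.asInstanceOf[Void], msg))   // the key is ignored by the Parquet output format
  .saveAsNewAPIHadoopFile(
    "hdfs:///tmp/proto-parquet",                 // placeholder output path
    classOf[Void],
    classOf[MyProto],
    classOf[ProtoParquetOutputFormat[MyProto]],
    job.getConfiguration)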
pretty effortless operation.
Hi Evyatar,
Yes, directly reading the parquet data works. Since we use hive metastore
to obfuscate the underlying datastore details, we want to avoid directly
accessing the files.
I guess then the only option is to either change the data or change the
schema of the hive metastore as you suggested
Hi Naresh,
Have you tried any of the following in order to resolve your issue:
1. Reading the Parquet files directly, not via Hive (i.e.,
spark.read.parquet()), casting to LongType and creating the Hive
table based on this DataFrame? Hive's BigInt and Spark's Long should h
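A minimal sketch of that suggestion (path, table and column names below are placeholders):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// read the files directly, bypassing the metastore schema
val df = spark.read.parquet("/warehouse/db.db/my_table")       // placeholder path
  .withColumn("id", col("id").cast(LongType))                  // int -> long/bigint
// re-create the Hive table from the corrected DataFrame
df.write.mode("overwrite").saveAsTable("db.my_table_fixed")    // placeholder table name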
Hi all,
I am trying to read data (using Spark SQL) via a Hive metastore which has a
column of type bigint. The underlying Parquet data has int as the data type for
the same column. I am getting the following error while trying to read the
data using Spark SQL -
java.lang.ClassCastException
Solved this issue. Thank you.
On 2022/10/19 05:26:51 Nipuna Shantha wrote:
Hi all,
I am writing a Parquet file using Impala 3.4.0 and trying to read the
same Parquet file from Spark 3.3.0 into a DataFrame.
var df = spark.read.parquet(parquet_file_name)
But when I show the DataFrame it has some encoded values for string data
types, as below.
For the string "TAX"
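If the strings were written as plain BINARY without the UTF8 annotation (which some Impala versions do), one setting worth trying is Spark's binaryAsString flag; a small sketch:

// interpret un-annotated BINARY columns as strings instead of raw bytes
spark.conf.set("spark.sql.parquet.binaryAsString", "true")
var df = spark.read.parquet(parquet_file_name)
df.show()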
Hi,
The below sounds like something someone will have experienced...
I have external tables of Parquet files with a Hive table defined on top of the
data. I don't manage/know the details of how the data lands.
For some tables there are no issues when querying through Spark.
But for others there is an issue
Hi.
I’ve spent the last couple of hours trying to chase down an issue with
writing/reading parquet files. I was trying to save (and then read in) a
parquet file with a schema that sets my non-nullability details correctly.
After having no success for some time, I posted to Stack Overflow
That's a parquet library error. It might be this:
https://issues.apache.org/jira/browse/PARQUET-1633 That's fixed in recent
versions of Parquet. You didn't say what versions of libraries you are
using, but try the latest Spark.
On Mon, May 9, 2022 at 8:49 AM wrote:
# python:
import pandas as pd
a = pd.DataFrame([[1, [2.3, 1.2]]], columns=['a', 'b'])
a.to_parquet('a.parquet')
# pyspark:
d2 = spark.read.parquet('a.parquet')
will return error:
An error was encountered: An error occurred while calling o277.showString. :
org.apache.spark.SparkException: Job
I am not sure how to set the records limit. Let me check. I couldn't find
a Parquet row group size configuration in Spark.
For now, I increased the number of shuffle partitions to reduce the records
processed per task, to avoid OOM.
Regards,
Anil
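For what it's worth, the Parquet row group size is a parquet-mr/Hadoop property (parquet.block.size) rather than a Spark SQL conf; a hedged sketch of setting it before the write (the 64 MB value is only an example):

// row group size in bytes, read by the Parquet writer from the Hadoop configuration
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)
df.write.mode("overwrite").parquet("s3://bucket/output/")   // placeholder path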
From: Gourav Sengupta
Date: Saturday, March 5
Hi Gourav,
I tried increasing the number of shuffle partitions and using higher executor
memory. Neither worked.
Regards
From: Gourav Sengupta
Date: Thursday, March 3, 2022 at 2:24 AM
To: Anil Dasari
Cc: Yang,Jie(INF) , user@spark.apache.org
Subject: Re: {EXT} Re: Spark Parquet write OOM
Hi,
I do not
Answers in the context. Thanks.
From: Gourav Sengupta
Date: Thursday, March 3, 2022 at 12:13 AM
To: Anil Dasari
Cc: Yang,Jie(INF) , user@spark.apache.org
Subject: Re: {EXT} Re: Spark Parquet write OOM
Hi Anil,
I was trying to work out things for a while yesterday, but may need your kind
Hi Anil,
I was trying to work out things for a while yesterday, but may need your
kind help.
Can you please share the code for the following steps?
- Create DF from Hive (from step #c)
- Deduplicate the Spark DF by primary key
- Write the DF to S3 in Parquet format
- Write metadata to S3
Regards,
Gourav
2nd attempt...
Any suggestions to troubleshoot and fix the problem? Thanks in advance.
Regards,
Anil
From: Anil Dasari
Date: Wednesday, March 2, 2022 at 7:00 AM
To: Gourav Sengupta , Yang,Jie(INF)
Cc: user@spark.apache.org
Subject: Re: {EXT} Re: Spark Parquet write OOM
Hi Gourav and Yang
(from step #c)
5. Deduplicate spark DF by primary key
6. Write DF to s3 in parquet format
7. Write metadata to s3
The failure is from spark batch job
3. Is your pipeline going to change or evolve soon, or the data volumes going
to vary, or particularly increase, over time?
[AD] : Data volume
the length of memory to be allocated, because `-XX:MaxDirectMemorySize` and
`-Xmx` should have the same capacity by default.
From: Anil Dasari
Date: Wednesday, 2 March 2022, 09:45
To: "user@spark.apache.org"
Subject: Spark Parquet write OOM
Hello everyone,
We are writing a Spark DataFrame to S3 in Parquet format and it is failing with the
exception below.
I wanted to try the following to avoid the OOM:
1. Increase the default SQL shuffle partitions to reduce the load on Parquet
writer tasks, and
2. Increase user memory (reduce memory
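A hedged sketch of what those two knobs look like (the values are only illustrative):

// 1. more shuffle partitions => fewer rows per Parquet writer task
spark.conf.set("spark.sql.shuffle.partitions", "1000")
// 2. user memory is what is left outside spark.memory.fraction; it is set at submit time, e.g.
//    spark-submit --conf spark.memory.fraction=0.5 --conf spark.executor.memory=8g ...
df.write.mode("overwrite").parquet("s3://bucket/output/")   // placeholder path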
Hello all,
We use Apache Spark 3.2.0 and our data is stored on Apache Hadoop in Parquet
format.
One of the advantages of the Parquet format is the predicate
pushdown filter feature, which allows only the necessary data to be read. This
feature is well supported by Spark. For
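As a small illustration of that behaviour (path and column names are made up), a pushed-down predicate shows up in the physical plan under PushedFilters:

import org.apache.spark.sql.functions.col

// on by default; the flag is spark.sql.parquet.filterPushdown
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
val df = spark.read.parquet("/data/events")            // placeholder path
  .filter(col("event_date") === "2022-01-01")
df.explain()   // the Parquet scan node should list the filter under PushedFilters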
Hi Spark Team,
I've written a library for Apache Spark to flatten JSON/Avro/Parquet/XML
using a DSL (Domain Specific Language) in Apache Spark. You actually don't
even need to write the DSL; you can generate it as well :)
I've written an article showing how to use it:
htt
Hi,
I am trying to convert a JSON file into Parquet format using Spark, and the JSON file
contains a map where key and value are defined and the actual key is scriptId. It
fails with the exception below:
java.lang.ClassCastException: optional binary scriptId (UTF8) is not a group
at
projects dealing with transactional writes from
> multiple writers, (alphabetically) Apache Iceberg, Delta Lake, and so on.
> You may want to check them out.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Thu, Sep 2, 2021 at 10:04 PM wrote:
All,
This is very surprising and I am sure I might be doing something wrong. The
issue is that the following code is taking 8 hours. It reads a CSV file, takes
the phone number column, extracts the first four digits and then
partitions based on those four digits (phoneseries) and writes to Parquet.
Any
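For context, a minimal sketch of what such a pipeline usually looks like (column and path names are placeholders):

import org.apache.spark.sql.functions.{col, substring}

val df = spark.read.option("header", "true").csv("/data/phones.csv")         // placeholder input
  .withColumn("phoneseries", substring(col("phone_number"), 1, 4))           // first four digits
df.write.partitionBy("phoneseries").mode("overwrite").parquet("/data/out")   // one directory per prefix

With thousands of possible four-digit prefixes, partitionBy can generate a very large number of small output files, which is often where the time goes.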
Hi all,
I recently stumbled upon a rather strange problem with streaming
sources in one of my tests. I am writing a Parquet file from a
streaming source and subsequently try to append the same data, but this
time from a static DataFrame. Surprisingly, the number of rows in the
Parquet file
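A hedged sketch of the two writes being compared (paths and DataFrame names are placeholders):

// 1. write from the streaming source
val query = streamingDf.writeStream
  .format("parquet")
  .option("path", "/tmp/out")                 // placeholder sink path
  .option("checkpointLocation", "/tmp/chk")   // required for the file sink
  .start()
query.processAllAvailable()
query.stop()

// 2. append the same rows from a static DataFrame
staticDf.write.mode("append").parquet("/tmp/out")
spark.read.parquet("/tmp/out").count()

One known wrinkle: the streaming file sink keeps a _spark_metadata log in the output directory, and a later batch read may list files from that log rather than from the directory, so rows appended by a plain batch write can be invisible to it.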
CORRECTED")
But otherwise, it's a change for the better. Performance seems better.
Also, there were bugs in 3.0.1 which have been addressed in 3.1.1.
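For reference, the setting being quoted appears to be the Parquet datetime rebase mode; a hedged sketch (in Spark 3.0/3.1 the key carried a "legacy" prefix, which later releases dropped):

// read dates/timestamps written by Spark 2.x without rebasing to the hybrid calendar
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")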
From: Gourav Sengupta
Sent: 05 August 2021 10:17
To: user @spark
Subject: [EXTERNAL] [Marketing Mail] Reading SPA
Hi everyone,
I am using spark-core-2.4 and spark-sql-2.4 (Java Spark). While reading 40K
Parquet part files from a single HDFS directory, somehow Spark is spawning
only 20037 parallel tasks, which is weird.
My initial experience with Spark is that while reading, the number of total
tasks is equal to
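One possible explanation (an assumption, since the file sizes aren't given): for splittable formats like Parquet, the number of read tasks is driven by size-based bin-packing of files rather than by the file count, governed by these confs:

// each task reads up to this many bytes (default 128 MB)
spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
// cost charged per file, limiting how many small files are packed into one task (default 4 MB)
spark.conf.set("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)
val df = spark.read.parquet("hdfs:///data/dir")   // placeholder path
println(df.rdd.getNumPartitions)                  // roughly the number of read tasks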
Hi,
we are trying to migrate some of the data lake pipelines to run on Spark
3.x, whereas the dependent pipelines using those tables will still be
running on Spark 2.4.x for some time to come.
Does anyone know of any issues that can happen:
1. when reading Parquet files written in 3.1.x in SPARK
Hi folks,
Maybe not the right audience, but maybe you have come across such a requirement.
Is it possible to define a Parquet schema that contains technical column names
and a list of translations of a given column name into different languages?
I give an example:
Technical: "custnr" would translate to { EN: "Custom
for and launch a job per directory.
What am I missing?
Boris
From: Eric Beabes
Sent: Wednesday, 26 May 2021 0:34
To: Sean Owen
Cc: Silvio Fiorito ; spark-user
Subject: Re: Reading parquet files in parallel on the cluster
Right... but the problem is still the same, no? Those N Jobs (aka
was, this will distribute the load amongst Spark Executors &
>>>> will scale better. But this throws the NullPointerException shown in the
>>>> original email.
>>>>
>>>> Is there a better way to do this?
>>>>
>>>>
>>>> On Tue, Ma
Why not just read from Spark as normal? Do these files have different or
incompatible schemas?
val df = spark.read.option("mergeSchema", "true").load(listOfPaths)
From: Eric Beabes
Date: Tuesday, May 25, 2021 at 1:24 PM
To: spark-user
Subject: Reading parquet files in parallel on the cluster
Right, you can't use Spark within Spark.
Do you actually need to read Parquet like this vs spark.read.parquet?
that's also parallel of course.
You'd otherwise be reading the files directly in your function with the
Parquet APIs.
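In other words, a sketch of the single-read approach (assuming the directories share a compatible schema; the path list is a placeholder):

val paths: Seq[String] = (1 to 1000).map(i => s"s3://bucket/dir$i/")   // placeholder directories
val df = spark.read.parquet(paths: _*)   // one distributed scan across all directories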
On Tue, May 25, 2021 at 12:24 PM Eric Beabes
wrote
I have a use case in which I need to read Parquet files in parallel from over
1000+ directories. I am doing something like this:
val df = list.toList.toDF()
df.foreach(c => {
  val config = getConfigs()
  doSomething(spark, config)
})
In the doSomething method, when
I keep getting the following exception when I am trying to read a Parquet
file from a Path on S3 in Spark/Scala. Note: I am running this on EMR.
java.lang.NullPointerException
at
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
at
I found that a filter like 'item[0].id > 1' cannot be pushed down. Does anyone
know why?
--
Penglei Shi
Hello Team
I am new to Spark and this question may be a possible duplicate of the
issue highlighted here: https://issues.apache.org/jira/browse/SPARK-9347
We have a large dataset partitioned by calendar date, and within each date
partition we are storing the data as parquet files in 128 parts.
We are trying to run aggregation on this dataset for 366 dates at a time
with Spark SQL on Spark 2.3.0, hence our Spark
Hi, I know Spark 3.0 has added Parquet predicate pushdown for nested structures
(SPARK-17636). Does it also support predicate pushdown for an array of structs?
For example, say I have a Spark table 'individuals' (in Parquet format) with
the following schema:
root |-- individual_
Hi,
This sounds like a bug. It works if I put an arbitrary LIMIT on the insert:
INSERT INTO TABLE test.randomData
SELECT
ID
, CLUSTERED
, SCATTERED
, RANDOMISED
, RANDOM_STRING
, SMALL_VC
, PADDING
FROM tmp
LIMIT 1000
This works fi
Hi,
Upgraded Spark from 2.11.12 to Spark 3.0.1
Hive version 3.1.1 and Hadoop version 3.1.1
The following used to work with Spark 2.11.12
scala> sqltext = s"""
| INSERT INTO TABLE ${fullyQualifiedTableName}
| SELECT
| ID
| , CLUSTERED
| ,
Hi Chetan
I'm having the exact same issue with spark structured streaming and kafka
trying to write to HDFS.
Can you please tell me how you fixed it?
I'm using Spark 3.0.1 and Hadoop 3.3.0
Thanks!
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
-
Hi all,
I am implementing a custom DataSource V1 and would like to enforce a pushdown
filter for every query. But when I run a simple count query, df.count(), Spark
will ignore the filter and use the metadata in the Parquet footer to accumulate
the count of each block directly, which will
Hi,
We have many Spark jobs that create multiple small files. We would like to
improve read performance for analysts, so I'm testing the optimal Parquet
file size.
I've found that the optimal file size should be around 1GB, and not less
than 128MB, depending on the size of the dat
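A rough sketch of aiming for ~1GB output files by deriving the partition count from the input size (paths and the target size are illustrative, and compression makes the result approximate):

import org.apache.hadoop.fs.{FileSystem, Path}

val inputPath = "hdfs:///data/small_files"                     // placeholder path
val targetFileBytes = 1024L * 1024 * 1024                      // ~1GB per output file
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path(inputPath)).getLength
val numFiles = math.max(1, (totalBytes / targetFileBytes).toInt)

spark.read.parquet(inputPath)
  .repartition(numFiles)                                       // full shuffle, evenly sized files
  .write.mode("overwrite").parquet("hdfs:///data/compacted")   // placeholder output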
Hi,
When migrating to Spark 3, I'm getting a NoSuchElementException
when getting partitions for a Parquet DataFrame.
The code I'm trying to execute is:
val df = sparkSession.read.parquet(inputFilePath)
val partitions = df.rdd.partitions
and the Spark session
al terabytes is bad.
> It depends on your use case, but you might also look at partitions etc.
> On 31 Aug 2020 at 16:17, Tzahi File wrote:
Hi,
I would like to develop a process that merges Parquet files.
My first intention was to develop it with PySpark, using coalesce(1) to
create only one file.
This process is going to run on a huge number of files.
I wanted your advice on the best way to implement it (PySpark isn't
a
> Customer
> - name - String
> - accounts - List
> Account
> - type - String
> - Addresses - List
> Address
> - name - String
> And it goes on.
> These files are in Parquet,
> All 3 input datasets
are in Parquet.
All 3 input datasets have some details which need to be merged
to build one dataset that has all the information (I know the files
which need to be merged).
I want to know how I should proceed on this.
- my approach is to build a case class of the actual output and parse
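A hedged sketch of one way to proceed, assuming the three datasets share a customer key (every name below is made up, and nested lists would need their own handling):

case class CustomerOut(custId: String, name: String, accountType: String, addressName: String)

val customers = spark.read.parquet("/data/customers")   // placeholder paths
val accounts  = spark.read.parquet("/data/accounts")
val addresses = spark.read.parquet("/data/addresses")

import spark.implicits._
val merged = customers
  .join(accounts, Seq("custId"), "left")
  .join(addresses, Seq("custId"), "left")
  .select($"custId", $"name", $"accountType", $"addressName")
  .as[CustomerOut]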
Hi All,
Just following up on the below to see if anyone has any suggestions. I appreciate
your help in advance.
Thanks,
Rishi
On Mon, Jun 1, 2020 at 9:33 AM Rishi Shah wrote:
Hi All,
I use the following to read a set of parquet file paths when files are
scattered across many many partitions.
paths = ['p1', 'p2', ... 'p1']
df = spark.read.parquet(*paths)
The above method feels like it is sequentially reading those files and not really
pa
Thanks Mich, Nilesh.
What also works is creating a schema object and providing it via .schema(X) in the
spark.read statement.
Thanks a lot.
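A small sketch of that approach (the schema below is a placeholder):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))   // placeholder schema

// with an explicit schema Spark does not have to infer one from the files, so a path
// that exists but holds no data still yields an empty DataFrame with this schema
val df = spark.read.schema(schema).parquet("s3://bucket/maybe-empty-path/")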
On Sun, May 10, 2020 at 2:37 AM Nilesh Kuchekar
wrote:
Hi Chetan,
You can have a static Parquet file created, and when you
create a DataFrame you can pass the locations of both files with the
option mergeSchema set to true. This will always get you a DataFrame even if the
original file is not present.
Kuchekar, Nilesh
On Sat, May 9
Hi Spark Users,
I have a Spark job where I am reading a Parquet path, and the data at that path
is generated by other systems. Some of the Parquet paths don't
contain any data, which is possible. Is there any way to read the Parquet
such that, if no data is found, I can create a dummy DataFr
Hi,
Spark documentation says:
"When writing Parquet files, all columns are automatically converted to be
nullable for compatibility reasons."
Could you elaborate on the reasons for this choice?
Is this for a similar reason as Protobuf which gets rid of "required"
fields
Hi,
The advantage of Parquet is that it only scans the required columns; it is a
columnar storage format.
The fewer columns you select, the less memory is required.
Developers do not need to care about the details of loading the data; these are
well designed and imperceptible to users
I have a Parquet file with millions of records and hundreds of fields that I
will be extracting from a cluster with more resources. I need to take that
data, derive a set of tables from only some of the fields, and import them
using a smaller cluster.
The smaller cluster cannot load into memory the
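A minimal sketch of leaning on Parquet's column pruning for this (field and path names are placeholders):

// only the selected columns are read from the Parquet files, so the smaller
// cluster never has to materialise the hundreds of unused fields
val slim = spark.read.parquet("/data/huge_table")      // placeholder path
  .select("customer_id", "event_date", "amount")       // placeholder columns
slim.write.mode("overwrite").parquet("/data/derived")  // placeholder output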
Or AWS glue catalog if you're in AWS
On Wed, 4 Mar 2020 at 10:35, Magnus Nilsson wrote:
> Google hive metastore.
>
> On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li wrote:
>
>> Hi all,
>>
>> Has anyone explored efforts to have a centralized storage of schemas of
Google hive metastore.
On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li wrote:
> Hi all,
>
> Has anyone explored efforts to have a centralized storage of schemas of
> different parquet files? I know there is schema management for Avro, but
> couldn’t find solutions for parquet sc
Hi all,
Has anyone explored efforts to have a centralized storage of schemas of
different parquet files? I know there is schema management for Avro, but
couldn’t find solutions for parquet schema management. Thanks!
--
Cheers,
Ruijing Li
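For the metastore route suggested above, a hedged sketch of registering an existing Parquet location so its schema is held centrally (database, table and path are placeholders):

spark.sql("""
  CREATE TABLE IF NOT EXISTS analytics.events
  USING PARQUET
  LOCATION 's3://bucket/events/'
""")
// the schema is now discoverable from the metastore
spark.sql("DESCRIBE TABLE analytics.events").show()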
> either materialize the Dataframe on HDFS (e.g. parquet or checkpoint)
I wonder if Avro is a better candidate for this: because it's row
oriented, it should be faster to write/read for such a task. I had never heard
about checkpoint.
Enrico Minack writes:
> It is not about very large or s
It is not about very large or small, it is about how large your cluster
is w.r.t. your data. Caching is only useful if you have the respective
memory available across your executors. Otherwise you could either
materialize the Dataframe on HDFS (e.g. parquet or checkpoint) or indeed
have to do
ates() \
>   .cache()
>
> Since df_actions is cached, you can count inserts and updates quickly
> with only that one join in df_actions:
>
> inserts_count = df_actions.where(col('_action') === 'Insert').count()
> updates_count = df_actions.where(col('_action
Enrico, I have tried your suggestions and I can see some wins as well. I
have to re-design and rebuild some of my solution to get them to work.
When this project was started, I was asked to provide single partitioned
parquet files (in the same sort of way you would see being outputted by
Pandas)
join in df_actions:
inserts_count = df_actions.where(col('_action') === 'Insert').count()
updates_count = df_actions.where(col('_action') === 'Update').count()
And you can get rid of the union:
df_output = df_actions.where(col('_action')
want to do something with the counts, you could try removing the
individual counts and caches.
Put a single cache on the df_output:
df_output = df_inserts.union(df_updates).cache()
Then output a count grouped by type on this df before writing out the Parquet.
Hope that helps,
Dave
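In code, Dave's suggestion looks roughly like this (sketched in Scala with placeholder names; dfInserts and dfUpdates stand for the two branches from the thread):

import org.apache.spark.sql.functions.col

// one cache on the combined output instead of separate counts and caches per branch
val dfOutput = dfInserts.union(dfUpdates).cache()
// a single pass produces both counts before the write
dfOutput.groupBy(col("_action")).count().show()
dfOutput.write.mode("overwrite").parquet("/data/delta_out")   // placeholder path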
On Thu, 13 Feb 2020, 06
e you referring to getting
the counts of the individual data frames, or from the already outputted
parquet?
Thanks and I appreciate your reply
On Thu, Feb 13, 2020 at 4:15 PM David Edwards
wrote:
> Hi Ashley,
>
> I'm not an expert but think this is because spark does lazy exe
Hi,
I am currently working on an app using PySpark to produce an insert and
update daily delta capture, being output as Parquet. This is running on
an 8-core, 32 GB Linux server in standalone mode (set to 6 worker cores of
2GB memory each) running Spark 2.4.3.
This is being achieved by reading