Spark Job Fails while writing data to a S3 location in Parquet

2024-08-18 Thread Nipuna Shantha
I am trying to read a .csv.gz file in S3 and write it to S3 in Parquet format through a Spark job. Currently I am using the AWS EMR service for this. The Spark job is executed as a step in an EMR cluster. For some .csv.gz files I have encountered the issue below. I have used both Spark 3.4 and 3.5 versions and still
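
A minimal sketch of the read-gzipped-CSV-and-write-Parquet pattern described above (bucket names, paths, and CSV options are illustrative assumptions, not taken from the thread):

```scala
import org.apache.spark.sql.SparkSession

object CsvGzToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-gz-to-parquet").getOrCreate()

    // Gzip is handled transparently by the CSV reader, but .csv.gz files are
    // not splittable, so each file is decompressed by a single task.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://source-bucket/input/*.csv.gz")  // hypothetical input path

    df.write
      .mode("overwrite")
      .parquet("s3://target-bucket/output/")     // hypothetical output path

    spark.stop()
  }
}
```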

Remote File change detection in S3 when spark queries are running and parquet files in S3 changes

2024-05-22 Thread Raghvendra Yadav
Hello, We are hoping someone can help us understand the Spark behavior for the scenarios listed below. Q. *Will running Spark queries fail when an S3 parquet object changes underneath, with S3A remote file change detection enabled? Is it 100%? * Our understanding is that S3A has a

Re: Spark 3.3 + parquet 1.10

2023-07-24 Thread Mich Talebzadeh
On Mon, 24 Jul 2023 at 10:56, Pralabh Kumar wrote: > Spark3.3 in OSS built with parquet 1.12. Just compiling with parquet 1.10 results in build failure, so just wondering if a

Spark3.3 with parquet 1.10.x

2023-07-24 Thread Pralabh Kumar
Hi Spark Users. I have a quick question with respect to Spark 3.3. Currently Spark 3.3 is built with parquet 1.12. However, has anyone tried Spark 3.3 with parquet 1.10? We are at Uber, planning to migrate to Spark 3.3, but we have the limitation of using parquet 1.10. Has anyone tried building Spark

Writing protobuf RDD to parquet

2023-01-20 Thread David Diebold
Hello, I'm trying to write to parquet some RDD[T] where T is a protobuf message, in Scala. I am wondering what the best option to do this is, and I would be interested in your thoughts. So far, I see two possibilities: - use the PairRDD method *saveAsNewAPIHadoopFile*, and I guess I need to

Re: ClassCastException while reading parquet data via Hive metastore

2022-11-07 Thread Naresh Peshwe
pretty effortless operation. > > Best, > Evyatar > > On Mon, 7 Nov 2022 at 13:37, Naresh Peshwe > wrote: > >> Hi Evyatar, >> Yes, directly reading the parquet data works. Since we use hive metastore >> to obfuscate the underlying datastore details, we want to a

Re: ClassCastException while reading parquet data via Hive metastore

2022-11-07 Thread Naresh Peshwe
Hi Evyatar, Yes, directly reading the parquet data works. Since we use hive metastore to obfuscate the underlying datastore details, we want to avoid directly accessing the files. I guess then the only option is to either change the data or change the schema of the hive metastore as you suggested

Re: ClassCastException while reading parquet data via Hive metastore

2022-11-07 Thread Evy M
r, > Yes, directly reading the parquet data works. Since we use hive metastore > to obfuscate the underlying datastore details, we want to avoid directly > accessing the files. > I guess then the only option is to either change the data or change the > schema of the hive metastore as

Re: ClassCastException while reading parquet data via Hive metastore

2022-11-07 Thread Evy M
Hi Naresh, Have you tried any of the following in order to resolve your issue: 1. Reading the Parquet files (directly, not via Hive [i.e, spark.read.parquet()]), casting to LongType and creating the hive table based on this dataframe? Hive's BigInt and Spark's Long should h
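
A rough sketch of the first suggestion above (read the Parquet files directly, cast the mismatched column to LongType, and recreate the Hive table from that DataFrame); the path, table, and column names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

object FixBigintMismatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fix-bigint-mismatch")
      .enableHiveSupport()
      .getOrCreate()

    // Read the files directly (bypassing the metastore) and widen int -> long
    // so the data matches the table's bigint column type.
    val df = spark.read
      .parquet("/warehouse/db.db/events")                     // hypothetical path
      .withColumn("event_id", col("event_id").cast(LongType)) // hypothetical column

    // Rewrite the data and re-register the table with the widened schema.
    df.write
      .mode("overwrite")
      .saveAsTable("db.events_fixed")                         // hypothetical table
  }
}
```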

ClassCastException while reading parquet data via Hive metastore

2022-11-06 Thread Naresh Peshwe
Hi all, I am trying to read data (using spark sql) via a hive metastore which has a column of type bigint. Underlying parquet data has int as the datatype for the same column. I am getting the following error while trying to read the data using spark sql - java.lang.ClassCastException

RE: Encoded data retrieved when reading Parquet file

2022-10-19 Thread Nipuna Shantha
Solved this issue. Thank You On 2022/10/19 05:26:51 Nipuna Shantha wrote: > Hi all, > > I am writing to a parquet file using Impala version 3.4.0 and try to read > same parquet file from Spark 3.3.0 to a DataFrame. > > var df = spark.read.parquet(parquet_file_name) > > But when

Encoded data retrieved when reading Parquet file

2022-10-19 Thread Nipuna Shantha
Hi all, I am writing to a parquet file using Impala version 3.4.0 and trying to read the same parquet file from Spark 3.3.0 into a DataFrame. var df = spark.read.parquet(parquet_file_name) But when I show the DataFrame it has some encoded values for the string data type, as below. For string "TAX"
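
The follow-up above says the issue was solved but not how. A setting that is commonly relevant when Parquet written by Impala (or older Hive) stores strings as plain binary is `spark.sql.parquet.binaryAsString`; the sketch below shows it as one thing to try, not as the poster's confirmed fix, and the file path is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object ReadImpalaParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-impala-parquet")
      // Some Parquet writers emit string columns as plain binary; this flag
      // tells Spark SQL to interpret such binary columns as strings.
      .config("spark.sql.parquet.binaryAsString", "true")
      .getOrCreate()

    val df = spark.read.parquet("/data/impala_table")  // hypothetical path
    df.show(10, truncate = false)
  }
}
```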

Re: external table with parquet files: problem querying in sparksql since data is stored as integer while hive schema expects a timestamp

2022-07-24 Thread Gourav Sengupta
Hi, > below sounds like something that someone will have experienced... > I have external tables of parquet files with a hive table defined on top > of the data. I dont manage/know the details of how the data lands. > For some tables no issues when querying through spark. > But for

external table with parquet files: problem querying in sparksql since data is stored as integer while hive schema expects a timestamp

2022-07-20 Thread Joris Billen
Hi, the below sounds like something that someone will have experienced... I have external tables of parquet files with a hive table defined on top of the data. I don't manage/know the details of how the data lands. For some tables there are no issues when querying through spark. But for others there is an issue

Reading parquet strips non-nullability from schema

2022-07-05 Thread Greg Kopff
Hi. I’ve spent the last couple of hours trying to chase down an issue with writing/reading parquet files. I was trying to save (and then read in) a parquet file with a schema that sets my non-nullability details correctly. After having no success for some time, I posted to Stack Overflow

Re: How do I read parquet with python object

2022-05-09 Thread Sean Owen
That's a parquet library error. It might be this: https://issues.apache.org/jira/browse/PARQUET-1633 That's fixed in recent versions of Parquet. You didn't say what versions of libraries you are using, but try the latest Spark. On Mon, May 9, 2022 at 8:49 AM wrote: > #

How do I read parquet with python object

2022-05-09 Thread ben
# python: import pandas as pd a = pd.DataFrame([[1, [2.3, 1.2]]], columns=['a', 'b']) a.to_parquet('a.parquet') # pyspark: d2 = spark.read.parquet('a.parquet') will return error: An error was encountered: An error occurred while calling o277.showString. : org.apache.spark.SparkException: Job

Re: {EXT} Re: Spark Parquet write OOM

2022-03-05 Thread Gourav Sengupta
: > I am not sure how to set the records limit. Let me check. I couldn’t find > parquet row group size configuration in spark. > > For now, I increased the number if shuffle partitions to reduce the > records processed by task to avoid OOM. > > > > Regards, > > Ani

Re: {EXT} Re: Spark Parquet write OOM

2022-03-05 Thread Anil Dasari
I am not sure how to set the records limit. Let me check. I couldn't find a parquet row group size configuration in spark. For now, I increased the number of shuffle partitions to reduce the records processed per task to avoid OOM. Regards, Anil From: Gourav Sengupta Date: Saturday, March 5

Re: {EXT} Re: Spark Parquet write OOM

2022-03-05 Thread Gourav Sengupta
Regards *From: *Gourav Sengupta *Date: *Thursday, March 3, 2022 at 2:24 AM *To: *Anil Dasari *Cc: *Yang,Jie(INF), user@spark.apache.org *Subject: *Re: {EXT} Re: Spark Parquet write OOM Hi,

Re: {EXT} Re: Spark Parquet write OOM

2022-03-03 Thread Anil Dasari
Hi Gourav, Tried increasing the number of shuffle partitions and using higher executor memory. Both didn't work. Regards From: Gourav Sengupta Date: Thursday, March 3, 2022 at 2:24 AM To: Anil Dasari Cc: Yang,Jie(INF) , user@spark.apache.org Subject: Re: {EXT} Re: Spark Parquet write OOM Hi, I do not

Re: {EXT} Re: Spark Parquet write OOM

2022-03-03 Thread Gourav Sengupta
Cc: Yang,Jie(INF), user@spark.apache.org *Subject: *Re: {EXT} Re: Spark Parquet write OOM > Hi Anil, > I was trying to work out things for a while yesterday, but may need your kind help. > Can you please shar

Re: {EXT} Re: Spark Parquet write OOM

2022-03-03 Thread Anil Dasari
Answers in the context. Thanks. From: Gourav Sengupta Date: Thursday, March 3, 2022 at 12:13 AM To: Anil Dasari Cc: Yang,Jie(INF) , user@spark.apache.org Subject: Re: {EXT} Re: Spark Parquet write OOM Hi Anil, I was trying to work out things for a while yesterday, but may need your kind

Re: {EXT} Re: Spark Parquet write OOM

2022-03-03 Thread Gourav Sengupta
Hi Anil, I was trying to work out things for a while yesterday, but may need your kind help. Can you please share the code for the following steps? - Create DF from hive (from step #c) - Deduplicate spark DF by primary key - Write DF to s3 in parquet format - Write metadata to s3 Regards, Gourav

Re: {EXT} Re: Spark Parquet write OOM

2022-03-02 Thread Anil Dasari
2nd attempt.. Any suggestions to troubleshoot and fix the problem ? thanks in advance. Regards, Anil From: Anil Dasari Date: Wednesday, March 2, 2022 at 7:00 AM To: Gourav Sengupta , Yang,Jie(INF) Cc: user@spark.apache.org Subject: Re: {EXT} Re: Spark Parquet write OOM Hi Gourav and Yang

Re: {EXT} Re: Spark Parquet write OOM

2022-03-02 Thread Anil Dasari
(from step #c) 5. Deduplicate spark DF by primary key 6. Write DF to s3 in parquet format 7. Write metadata to s3 The failure is from spark batch job 3. Is your pipeline going to change or evolve soon, or the data volumes going to vary, or particularly increase, over time? [AD] : Data volume

Re: Spark Parquet write OOM

2022-03-02 Thread Gourav Sengupta
the length of memory to be allocated, because `-XX:MaxDirectMemorySize` and `-Xmx` should have the same capacity by default. *From:* Anil Dasari *Date:* Wednesday, March 2, 2022 09:45 *To:* "user@spark.apache.org" *Subject:* Spark Parquet write OOM

Re: Spark Parquet write OOM

2022-03-01 Thread Yang,Jie(INF)
should have the same capacity by default. From: Anil Dasari Date: Wednesday, March 2, 2022 09:45 To: "user@spark.apache.org" Subject: Spark Parquet write OOM Hello everyone, We are writing a Spark DataFrame to s3 in parquet and it is failing with the below exception. I wanted to try the following to

Spark Parquet write OOM

2022-03-01 Thread Anil Dasari
Hello everyone, We are writing a Spark DataFrame to s3 in parquet and it is failing with the below exception. I wanted to try the following to avoid OOM: 1. increase the default sql shuffle partitions to reduce the load on parquet writer tasks to avoid OOM, and 2. increase user memory (reduce memory
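
A hedged sketch of the two mitigations discussed in this thread: raising `spark.sql.shuffle.partitions` so each Parquet writer task handles fewer records, and lowering `spark.memory.fraction` to leave more user memory per executor. The values, paths, and key column are illustrative assumptions, not settings from the thread:

```scala
import org.apache.spark.sql.SparkSession

object ParquetWriteTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-write-tuning")
      // More shuffle partitions => fewer records per Parquet writer task.
      .config("spark.sql.shuffle.partitions", "2000")  // illustrative value
      // A smaller unified-memory fraction leaves more "user" memory per executor.
      .config("spark.memory.fraction", "0.5")          // illustrative value
      .getOrCreate()

    // Hypothetical pipeline shape: read, deduplicate by a primary key, write.
    spark.read.parquet("s3://bucket/input/")           // hypothetical path
      .dropDuplicates("record_id")                     // shuffle sized by the setting above
      .write.mode("overwrite")
      .parquet("s3://bucket/output/")                  // hypothetical path
  }
}
```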

[Spark Core] Does Spark support parquet predicate pushdown for big lists?

2021-12-16 Thread Amin Borjian
Hello all, We use Apache Spark 3.2.0 and our data stored on Apache Hadoop with parquet format. One of the advantages of the parquet format is the presence of the predicate pushdown filter feature, which allows only the necessary data to be read. This feature is well provided by Spark. For
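
A quick way to answer this kind of question for a concrete query is to inspect the physical plan and check whether the predicate appears under the scan's pushed filters. A sketch with a hypothetical column, path, and id list:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CheckInListPushdown {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("check-pushdown").getOrCreate()

    // Hypothetical column name and list of ids; the point is only to see
    // whether the predicate shows up as a pushed filter on the Parquet scan.
    val bigList: Seq[Int] = (1 to 10000)
    val df = spark.read
      .parquet("hdfs:///data/events")                  // hypothetical path
      .filter(col("customer_id").isin(bigList: _*))

    // The physical plan lists which predicates were pushed to the Parquet scan.
    df.explain(true)
  }
}
```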

Re: How to Flatten JSON/XML/Parquet/etc in Apache Spark With Quenya-DSL

2021-11-22 Thread Mich Talebzadeh
On Mon, 22 Nov 2021 at 16:00, Daniel de Oliveira Mantovani <daniel.oliveira.mantov...@gmail.com> wrote:

Re: How to Flatten JSON/XML/Parquet/etc in Apache Spark With Quenya-DSL

2021-11-22 Thread Daniel de Oliveira Mantovani
2021 at 16:00, Daniel de Oliveira Mantovani < > daniel.oliveira.mantov...@gmail.com> wrote: > >> Hi Spark Team, >> >> I've written a library for Apache Spark to flatten JSON/Avro/Parquet/XML >> using a DSL(Domain Specific Language) in Apache Spark. You actually

Re: How to Flatten JSON/XML/Parquet/etc in Apache Spark With Quenya-DSL

2021-11-22 Thread Mich Talebzadeh
a Mantovani < daniel.oliveira.mantov...@gmail.com> wrote: > Hi Spark Team, > > I've written a library for Apache Spark to flatten JSON/Avro/Parquet/XML > using a DSL(Domain Specific Language) in Apache Spark. You actually don't > even need to write the DSL, you can generate it as well :) &

How to Flatten JSON/XML/Parquet/etc in Apache Spark With Quenya-DSL

2021-11-22 Thread Daniel de Oliveira Mantovani
Hi Spark Team, I've written a library for Apache Spark to flatten JSON/Avro/Parquet/XML using a DSL(Domain Specific Language) in Apache Spark. You actually don't even need to write the DSL, you can generate it as well :) I've written an article to teach how to use: htt

json to parquet failure

2021-10-10 Thread Deepak Goel
Hi, I am trying to convert a json file into parquet format using spark. The json file contains a map where key and value are defined, and the actual key is scriptId. It fails with the below exception: java.lang.ClassCastException: optional binary scriptId (UTF8) is not a group    at
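
One way this kind of mismatch is often avoided is to declare the dynamic-keyed field explicitly as a MapType instead of relying on JSON schema inference. A hedged sketch with hypothetical field names, value type, and paths:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object JsonMapToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("json-map-to-parquet").getOrCreate()

    // Treat the dynamic scriptId keys as a proper MapType instead of letting
    // schema inference turn each key into its own struct field.
    val schema = StructType(Seq(
      StructField("scripts", MapType(StringType, StringType), nullable = true)
    ))

    val df = spark.read.schema(schema).json("/data/input.json")  // hypothetical path
    df.write.mode("overwrite").parquet("/data/output_parquet")   // hypothetical path
  }
}
```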

Re: Appending a static dataframe to a stream create Parquet file fails

2021-09-06 Thread Jungtaek Lim
projects dealing with transactional write from > multiple writes, (alphabetically) Apache Iceberg, Delta Lake, and so on. > You may want to check them out. > > Thanks, > Jungtaek Lim (HeartSaVioR) > > On Thu, Sep 2, 2021 at 10:04 PM wrote: > > Hi all, > I recen

Re: Appending a static dataframe to a stream create Parquet file fails

2021-09-05 Thread eugen . wintersberger
;   I recently stumbled about a rather strange  problem with > > streaming sources in one of my tests. I am writing a Parquet file > > from a streaming source and subsequently try to append the same > > data but this time from a static dataframe. Surprisingly, the > > n

Re: Appending a static dataframe to a stream create Parquet file fails

2021-09-02 Thread Jungtaek Lim
10:04 PM wrote: > Hi all, > I recently stumbled about a rather strange problem with streaming > sources in one of my tests. I am writing a Parquet file from a streaming > source and subsequently try to append the same data but this time from a > static dataframe. Surprisingly, t

Reading CSV and Transforming to Parquet Issue

2021-09-02 Thread ☼ R Nair
All, This is very surprising and I am sure I might be doing something wrong. The issue is, the following code is taking 8 hours. It reads a CSV file, takes the phone number column, extracts the first four digits and then partitions based on the four digits (phoneseries) and writes to Parquet. Any
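
A sketch of the shape of the job being described (column names, paths, and CSV options are assumptions); repartitioning on the derived prefix before partitionBy keeps each output partition from being written as one small file per task:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

object PhonePrefixToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("phone-prefix-to-parquet").getOrCreate()

    // Derive the 4-digit prefix used as the partition column.
    val df = spark.read.option("header", "true").csv("/data/phones.csv") // hypothetical path
      .withColumn("phoneseries", substring(col("phone_number"), 1, 4))   // hypothetical column

    df.repartition(col("phoneseries"))
      .write
      .partitionBy("phoneseries")
      .mode("overwrite")
      .parquet("/data/phones_parquet")                                   // hypothetical path
  }
}
```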

Appending a static dataframe to a stream create Parquet file fails

2021-09-02 Thread eugen . wintersberger
Hi all, I recently stumbled upon a rather strange problem with streaming sources in one of my tests. I am writing a Parquet file from a streaming source and subsequently try to append the same data, but this time from a static dataframe. Surprisingly, the number of rows in the Parquet file

Re: [EXTERNAL] [Marketing Mail] Reading SPARK 3.1.x generated parquet in SPARK 2.4.x

2021-08-12 Thread Gourav Sengupta
were bugs in 3.0.1 which have been addressed in 3.1.1. > -- > *From:* Gourav Sengupta > *Sent:* 05 August 2021 10:17 > *To:* user @spark > *Subject:* [EXTERNAL] [Marketing Mail] Reading SPARK 3.1.x generated > parquet in SPARK 2.4.x > > *Cauti

Re: [EXTERNAL] [Marketing Mail] Reading SPARK 3.1.x generated parquet in SPARK 2.4.x

2021-08-12 Thread Saurabh Gulati
CORRECTED") But otherwise, it's a change for good. Performance seems better. Also, there were bugs in 3.0.1 which have been addressed in 3.1.1. From: Gourav Sengupta Sent: 05 August 2021 10:17 To: user @spark Subject: [EXTERNAL] [Marketing Mail] Reading SPA

Facing weird problem while reading Parquet

2021-08-10 Thread Prateek Rajput
Hi everyone, I am using spark-core-2.4 and spark-sql-2.4 (Java Spark). While reading 40K parquet part files from a single HDFS directory, somehow spark is spawning only 20037 parallel tasks, which is weird. My initial experience with spark is that, while reading, the total number of tasks is equal to
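
The task count can be lower than the file count because Spark packs small files into shared input splits, bounded by `spark.sql.files.maxPartitionBytes` plus a per-file `spark.sql.files.openCostInBytes` charge. A sketch showing how those settings influence the scan parallelism; the values and path are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

object InputSplitTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("input-split-tuning")
      .config("spark.sql.files.maxPartitionBytes", "67108864") // 64 MB splits => more tasks
      .config("spark.sql.files.openCostInBytes", "4194304")    // per-file packing cost
      .getOrCreate()

    val df = spark.read.parquet("hdfs:///data/forty_thousand_parts") // hypothetical path
    println(s"Input tasks for the scan: ${df.rdd.getNumPartitions}")
  }
}
```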

Reading SPARK 3.1.x generated parquet in SPARK 2.4.x

2021-08-05 Thread Gourav Sengupta
Hi, we are trying to migrate some of the data lake pipelines to run in SPARK 3.x, where as the dependent pipelines using those tables will be still running in SPARK 2.4.x for sometime to come. Does anyone know of any issues that can happen: 1. when reading Parquet files written in 3.1.x in SPARK

Re: Parquet Metadata

2021-06-23 Thread Sam
along such a requirement. > > Is it possible to define a parquet schema, that contains technical column > names and a list of translations for a certain column name into different > languages? > > > > I give an example: > > Technical: “custnr” would translate to { EN:”Custom

Parquet Metadata

2021-06-23 Thread Bode, Meikel, NMA-CFD
Hi folks, Maybe not the right audience, but maybe you came along such a requirement. Is it possible to define a parquet schema that contains technical column names and a list of translations for a certain column name into different languages? I give an example: Technical: "custnr"
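
Parquet itself has no notion of per-column translations, but Spark lets you attach arbitrary key/value metadata to a StructField, and that metadata travels with the Spark schema stored alongside the data. A hedged sketch extending the "custnr" example; the translation strings are illustrative assumptions:

```scala
import org.apache.spark.sql.types._

object ColumnTranslationMetadata {
  def main(args: Array[String]): Unit = {
    // Attach language -> label translations as field metadata (values assumed).
    val custnrMeta = new MetadataBuilder()
      .putString("EN", "Customer number")
      .putString("DE", "Kundennummer")
      .build()

    val schema = StructType(Seq(
      StructField("custnr", StringType, nullable = true, metadata = custnrMeta)
    ))

    println(schema.prettyJson) // the metadata is serialized with the schema
  }
}
```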

RE: Reading parquet files in parallel on the cluster

2021-05-30 Thread Boris Litvak
for and launch a job per directory. What am I missing? Boris From: Eric Beabes Sent: Wednesday, 26 May 2021 0:34 To: Sean Owen Cc: Silvio Fiorito ; spark-user Subject: Re: Reading parquet files in parallel on the cluster Right... but the problem is still the same, no? Those N Jobs (aka

Re: NullPointerException in SparkSession while reading Parquet files on S3

2021-05-25 Thread YEONWOO BAEK
unsubscribe On Wed, May 26, 2021 at 12:31 AM, Eric Beabes wrote: > I keep getting the following exception when I am trying to read a Parquet > file from a Path on S3 in Spark/Scala. Note: I am running this on EMR. > > java.lang.NullPointerException

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
was, this will distribute the load amongst Spark Executors & >>>> will scale better. But this throws the NullPointerException shown in the >>>> original email. >>>> >>>> Is there a better way to do this? >>>> >>>> >>>> On Tue, Ma

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen
r. But this throws the NullPointerException shown in the >>> original email. >>> >>> Is there a better way to do this? >>> >>> >>> On Tue, May 25, 2021 at 1:10 PM Silvio Fiorito < >>> silvio.fior...@granturing.com> wrote: >>> >&

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
>> >> On Tue, May 25, 2021 at 1:10 PM Silvio Fiorito < >> silvio.fior...@granturing.com> wrote: >> >>> Why not just read from Spark as normal? Do these files have different or >>> incompatible schemas? >>>

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen
Tue, May 25, 2021 at 1:10 PM Silvio Fiorito < > silvio.fior...@granturing.com> wrote: > >> Why not just read from Spark as normal? Do these files have different or >> incompatible schemas? >> >> >> >> val df = spark.read.option(“mergeSchema”, “true”

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
> > val df = spark.read.option(“mergeSchema”, “true”).load(listOfPaths) > > > > *From: *Eric Beabes > *Date: *Tuesday, May 25, 2021 at 1:24 PM > *To: *spark-user > *Subject: *Reading parquet files in parallel on the cluster > > > > I've a use case in which

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Silvio Fiorito
Why not just read from Spark as normal? Do these files have different or incompatible schemas? val df = spark.read.option(“mergeSchema”, “true”).load(listOfPaths) From: Eric Beabes Date: Tuesday, May 25, 2021 at 1:24 PM To: spark-user Subject: Reading parquet files in parallel on the cluster

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Sean Owen
Right, you can't use Spark within Spark. Do you actually need to read Parquet like this vs spark.read.parquet? that's also parallel of course. You'd otherwise be reading the files directly in your function with the Parquet APIs. On Tue, May 25, 2021 at 12:24 PM Eric Beabes wrote

Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
I have a use case in which I need to read Parquet files in parallel from 1000+ directories. I am doing something like this: val df = list.toList.toDF() df.foreach(c => { val config = *getConfigs()* doSomething(spark, config) }) In the doSomething method, when
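
As the replies in this thread point out, Spark cannot be used inside executor-side code such as `df.foreach`, which is where the NullPointerException comes from. A minimal sketch of the alternative: pass all directories to a single reader call, which is itself parallel (directory names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object ReadManyDirectories {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-many-directories").getOrCreate()

    // Hypothetical list of 1000+ directories.
    val dirs: Seq[String] = (1 to 1000).map(i => s"s3://bucket/dataset/dir_$i")

    // One scan over all paths; Spark parallelizes the file reads across executors.
    val df = spark.read.parquet(dirs: _*)
    println(df.count())
  }
}
```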

NullPointerException in SparkSession while reading Parquet files on S3

2021-05-25 Thread Eric Beabes
I keep getting the following exception when I am trying to read a Parquet file from a Path on S3 in Spark/Scala. Note: I am running this on EMR. java.lang.NullPointerException at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144) at

Does spark3.1.1 support parquet nested column predicate pushdown for array type and map type column

2021-05-18 Thread 石鹏磊
I found that a filter like 'item[0].id > 1' cannot be pushed down. Does anyone know? -- Penglei Shi

Re: [jira] [Commented] (SPARK-34648) Reading Parquet F =?utf-8?Q?iles_in_Spark_Extremely_Slow_for_Large_Number_of_Files??=

2021-03-10 Thread Kent Yao
to Spark and this question may be a possible duplicate of the issue highlighted here: https://issues.apache.org/jira/browse/SPARK-9347 We have a large dataset partitioned by calendar date, and within each date partition, we are storing the data as parquet files in 128 parts. We are trying to run ag

Re: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-10 Thread 钟雨
; >> Hello Team >> >> I am new to Spark and this question may be a possible duplicate of the >> issue highlighted here: https://issues.apache.org/jira/browse/SPARK-9347 >> >> We have a large dataset partitioned by calendar date, and within each >> date parti

Re: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-09 Thread Pankaj Bhootra
gt; > We have a large dataset partitioned by calendar date, and within each date > partition, we are storing the data as *parquet* files in 128 parts. > > We are trying to run aggregation on this dataset for 366 dates at a time > with Spark SQL on spark version 2.3.0, hence our Spark

Fwd: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-06 Thread Pankaj Bhootra
Hello Team I am new to Spark and this question may be a possible duplicate of the issue highlighted here: https://issues.apache.org/jira/browse/SPARK-9347 We have a large dataset partitioned by calendar date, and within each date partition, we are storing the data as *parquet* files in 128 parts

[SPARK-SQL] Does Spark 3.0 support parquet predicate pushdown for array of structs?

2021-02-14 Thread Haijia Zhou
Hi, I know Spark 3.0 has added Parquet predicate pushdown for nested structures (SPARK-17636) Does it also support predicate pushdown for an array of structs?   For example, say I have a spark table 'individuals' (in parquet format) with the following schema  root |-- individual_

Does Spark 3.0 support parquet predicate pushdown for array of structures

2021-02-14 Thread Haijia Zhou
Hi, I know Spark 3.0 has added Parquet predicate pushdown for nested structures (SPARK-17636) Does it also support predicate pushdown for an array of structures?  For example, say I have a spark table 'individuals' (in parquet format) with the following schema  root |-- individual_

Re: Spark 3.0.1 fails to insert into Hive Parquet table but Spark 2.11.12 used to work

2020-12-21 Thread Mich Talebzadeh
Hi, This sounds like a bug. It works if I put an *arbitrary limit on insert* INSERT INTO TABLE test.randomData SELECT ID , CLUSTERED , SCATTERED , RANDOMISED , RANDOM_STRING , SMALL_VC , PADDING FROM tmp LIMIT 1000 This works fi

Spark 3.0.1 fails to insert into Hive Parquet table but Spark 2.11.12 used to work

2020-12-19 Thread Mich Talebzadeh
Hi, Upgraded Spark from 2.11.12 to Spark 3.0.1 Hive version 3.1.1 and Hadoop version 3.1.1 The following used to work with Spark 2.11.12 scala> sqltext = s""" | INSERT INTO TABLE ${fullyQualifiedTableName} | SELECT | ID | , CLUSTERED | ,

Re: Kafka Topic to Parquet HDFS with Structured Streaming

2020-11-19 Thread AlbertoMarq
Hi Chetan, I'm having the exact same issue with spark structured streaming and kafka trying to write to HDFS. Can you please tell me how you fixed it? I'm using spark 3.0.1 and hadoop 3.3.0 Thanks! -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Disable parquet metadata for count

2020-11-17 Thread Gary Li
Hi all, I am implementing a custom data sourceV1 and would like to enforce a pushdown filter for every query. But when I run a simple count query df.count(), Spark will ignore the filter and use the metadata in the parquet footer to accumulate the count of each block directly, which will

Spark Parquet file size

2020-11-10 Thread Tzahi File
Hi, We have many Spark jobs that create multiple small files. We would like to improve analysts' reading performance, so I'm testing for the optimal parquet file size. I've found that the optimal file size should be around 1GB, and not less than 128MB, depending on the size of the dat

Re: Merging Parquet Files

2020-09-03 Thread Michael Segel
>> On 31.08.2020 at 16:17, Tzahi File wrote: >> >> Hi, >> >> I would like to develop a process that merges parquet files. >> My first intention was to develop it with PySpark using coalesce(1) - to >> create only 1 file. >> This proc

Error while getting RDD partitions for a parquet dataframe in Spark 3

2020-09-01 Thread Albert Butterscotch
Hi, When migrating to Spark 3, I'm getting a NoSuchElementException when getting partitions for a parquet dataframe. The code I'm trying to execute is: val df = sparkSession.read.parquet(inputFilePath) val partitions = df.rdd.partitions and the spark session

Re: Merging Parquet Files

2020-08-31 Thread Tzahi File
al terabytes is bad. > > It depends on your use case but you might also look at partitions etc. > > On 31.08.2020 at 16:17, Tzahi File wrote: > > Hi, > > I would like to develop a process that merges parquet files. > My first intention was to develop it with PySpa

Re: Merging Parquet Files

2020-08-31 Thread Jörn Franke
. > On 31.08.2020 at 16:17, Tzahi File wrote: > > Hi, > > I would like to develop a process that merges parquet files. > My first intention was to develop it with PySpark using coalesce(1) - to > create only 1 file. > This process is going to run on a huge amoun

Merging Parquet Files

2020-08-31 Thread Tzahi File
Hi, I would like to develop a process that merges parquet files. My first intention was to develop it with PySpark using coalesce(1) - to create only 1 file. This process is going to run on a huge number of files. I wanted your advice on what the best way to implement it is (PySpark isn't a
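
The replies warn against funnelling large inputs through coalesce(1). A hedged sketch of one alternative: size the output file count from the bytes on disk so each compacted file lands near a target size. The paths and target size are assumptions, not recommendations from the thread:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object CompactParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-parquet").getOrCreate()

    val input  = "s3://bucket/small_files/"  // hypothetical path
    val output = "s3://bucket/compacted/"    // hypothetical path
    val targetFileBytes = 512L * 1024 * 1024 // aim for ~512 MB per output file

    // Derive the output file count from the input size rather than hard-coding
    // coalesce(1), which pushes the whole write through a single task.
    val inPath = new Path(input)
    val fs = inPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
    val totalBytes = fs.getContentSummary(inPath).getLength
    val numFiles = math.max(1, (totalBytes / targetFileBytes).toInt)

    spark.read.parquet(input)
      .repartition(numFiles)
      .write.mode("overwrite")
      .parquet(output)
  }
}
```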

Re: help on use case - spark parquet processing

2020-08-13 Thread Amit Sharma
t. > > Customer > - name - string > - accounts - List > > Account > - type - String > - Addresses - List > > Address > - name - String > > ---- > > --- > > > And it goes on. > > These files are in parquet, > > > All 3 input datasets

help on use case - spark parquet processing

2020-08-13 Thread manjay kumar
are in parquet. All 3 input datasets have some details which need to be merged to build one dataset which has all the information (I know the files which need to merge). I want to know how I should proceed on this? - my approach is to build a case class of the actual output and parse

Re: [PySpark 2.3+] Reading parquet entire path vs a set of file paths

2020-06-03 Thread Rishi Shah
Hi All, Just following up on below to see if anyone has any suggestions. Appreciate your help in advance. Thanks, Rishi On Mon, Jun 1, 2020 at 9:33 AM Rishi Shah wrote: > Hi All, > > I use the following to read a set of parquet file paths when files are > scattered across many man

[PySpark 2.3+] Reading parquet entire path vs a set of file paths

2020-06-01 Thread Rishi Shah
Hi All, I use the following to read a set of parquet file paths when files are scattered across many, many partitions. paths = ['p1', 'p2', ... 'p1'] df = spark.read.parquet(*paths) The above method feels like it is sequentially reading those files & not really pa

Re: AnalysisException - Infer schema for the Parquet path

2020-05-11 Thread Chetan Khatri
Thanks Mich, Nilesh. What also works is creating a schema object and providing it via .schema(X) in the spark.read statement. Thanks a lot. On Sun, May 10, 2020 at 2:37 AM Nilesh Kuchekar wrote: > Hi Chetan, > > You can have a static parquet file created, and when you > c

Re: AnalysisException - Infer schema for the Parquet path

2020-05-09 Thread Nilesh Kuchekar
Hi Chetan, You can have a static parquet file created, and when you create a data frame you can pass the location of both the files, with option mergeSchema true. This will always fetch you a dataframe even if the original file is not present. Kuchekar, Nilesh On Sat, May 9

Re: AnalysisException - Infer schema for the Parquet path

2020-05-09 Thread Mich Talebzadeh
On Sat, 9 May 2020 at 22:51, Chetan Khatri wrote: > Hi Spark Users, > > I've a spark job

AnalysisException - Infer schema for the Parquet path

2020-05-09 Thread Chetan Khatri
Hi Spark Users, I have a spark job where I am reading a parquet path, and that parquet path's data is generated by other systems; some of the parquet paths don't contain any data, which is possible. Is there any way to read the parquet if no data is found? I can create a dummy datafr
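
Alongside the mergeSchema and explicit-schema suggestions in the replies, one way to handle possibly empty paths is to check for the path up front and fall back to an empty DataFrame with an explicit schema. A hedged sketch; the schema, path, and helper name are hypothetical:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._

object ReadParquetOrEmpty {
  // Returns the data if the path exists and has files, otherwise an empty
  // DataFrame with the expected schema.
  def readOrEmpty(spark: SparkSession, path: String, schema: StructType): DataFrame = {
    val p = new Path(path)
    val fs = p.getFileSystem(spark.sparkContext.hadoopConfiguration)
    if (fs.exists(p) && fs.listStatus(p).nonEmpty)
      spark.read.schema(schema).parquet(path)
    else
      spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-or-empty").getOrCreate()
    val schema = StructType(Seq(
      StructField("id", LongType), StructField("name", StringType)))
    readOrEmpty(spark, "s3://bucket/maybe-empty/", schema).show() // hypothetical path
  }
}
```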

Why when writing Parquet files, columns are converted to nullable?

2020-04-24 Thread Julien Benoit
Hi, Spark documentation says: "When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons." Could you elaborate on the reasons for this choice? Is this for a similar reason as Protobuf which gets rid of "required" fields

Re:Question about how parquet files are read and processed

2020-04-15 Thread Kelvin Qin
Hi, The advantage of Parquet is that it only scans the required columns; it is a file in a columnar storage format. The fewer columns you select, the less memory is required. Developers do not need to care about the details of loading data; they are well designed and imperceptible to users

Question about how parquet files are read and processed

2020-04-15 Thread Yeikel
I have a parquet file with millions of records and hundreds of fields that I will be extracting from a cluster with more resources. I need to take that data, derive a set of tables from only some of the fields and import them using a smaller cluster. The smaller cluster cannot load in memory the

Re: Schema store for Parquet

2020-03-09 Thread Ruijing Li
lucas.g...@gmail.com> wrote: >> >>> Or AWS glue catalog if you're in AWS >>> >>> On Wed, 4 Mar 2020 at 10:35, Magnus Nilsson wrote: >>> >>>> Google hive metastore. >>>> >>>> On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li

Re: Schema store for Parquet

2020-03-04 Thread Magnus Nilsson
>>> On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li wrote: >>> >>>> Hi all, >>>> >>>> Has anyone explored efforts to have a centralized storage of schemas of >>>> different parquet files? I know there is schema management for Avro, but >>>> couldn’t find solutions for parquet schema management. Thanks! >>>> -- >>>> Cheers, >>>> Ruijing Li >>>> >>> -- > Cheers, > Ruijing Li >

Re: Schema store for Parquet

2020-03-04 Thread Ruijing Li
10:35, Magnus Nilsson wrote: > >> Google hive metastore. >> >> On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li wrote: >> >>> Hi all, >>> >>> Has anyone explored efforts to have a centralized storage of schemas of >>> different parquet files

Re: Schema store for Parquet

2020-03-04 Thread lucas.g...@gmail.com
Or AWS glue catalog if you're in AWS On Wed, 4 Mar 2020 at 10:35, Magnus Nilsson wrote: > Google hive metastore. > > On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li wrote: > >> Hi all, >> >> Has anyone explored efforts to have a centralized storage of schemas of

Re: Schema store for Parquet

2020-03-04 Thread Magnus Nilsson
Google hive metastore. On Wed, Mar 4, 2020 at 7:29 PM Ruijing Li wrote: > Hi all, > > Has anyone explored efforts to have a centralized storage of schemas of > different parquet files? I know there is schema management for Avro, but > couldn’t find solutions for parquet sc

Schema store for Parquet

2020-03-04 Thread Ruijing Li
Hi all, Has anyone explored efforts to have a centralized storage of schemas of different parquet files? I know there is schema management for Avro, but couldn’t find solutions for parquet schema management. Thanks! -- Cheers, Ruijing Li

Re: Questions about count() performance with dataframes and parquet files

2020-02-18 Thread Nicolas PARIS
> either materialize the Dataframe on HDFS (e.g. parquet or checkpoint) I wonder if avro is a better candidate for this; because it's row oriented, it should be faster to write/read for such a task. I never heard about checkpoint. Enrico Minack writes: > It is not about very large or s

Re: Questions about count() performance with dataframes and parquet files

2020-02-17 Thread Enrico Minack
It is not about very large or small, it is about how large your cluster is w.r.t. your data. Caching is only useful if you have the respective memory available across your executors. Otherwise you could either materialize the Dataframe on HDFS (e.g. parquet or checkpoint) or indeed have to do
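
A short sketch of the two alternatives mentioned above when cache() does not fit in executor memory: materialize to Parquet and read it back, or use a reliable checkpoint to cut the lineage. Paths and the example data are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object MaterializeInsteadOfCache {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("materialize-example").getOrCreate()
    import spark.implicits._

    // Reliable checkpoints need a checkpoint directory (hypothetical path).
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

    val df = (1 to 1000000).toDF("id").withColumn("doubled", $"id" * 2)

    // Option 1: materialize to Parquet and read it back.
    df.write.mode("overwrite").parquet("hdfs:///tmp/materialized")
    val materialized = spark.read.parquet("hdfs:///tmp/materialized")

    // Option 2: reliable checkpoint (eager by default), which truncates lineage.
    val checkpointed = df.checkpoint()

    println(s"${materialized.count()} ${checkpointed.count()}")
  }
}
```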

Re: Questions about count() performance with dataframes and parquet files

2020-02-17 Thread Nicolas PARIS
ates() \ .cache() > > Since df_actions is cached, you can count inserts and updates quickly > with only that one join in df_actions: > > inserts_count = df_actions.where(col('_action') === 'Insert').count() updates_count = df_actions.where(col('_action

Re: Questions about count() performance with dataframes and parquet files

2020-02-13 Thread Ashley Hoff
Enrico, I have tried your suggestions and I can see some wins as well. I have to re-design and rebuild some of my solution to get them to work. When this project was started, I was asked to provide single partitioned parquet files (in the same sort of way you would see being outputted by Pandas)

Re: Questions about count() performance with dataframes and parquet files

2020-02-13 Thread Enrico Minack
join in df_actions: inserts_count = df_actions.where(col('_action') === 'Insert').count() updates_count = df_actions.where(col('_action') === 'Update').count() And you can get rid of the union: df_output = df_actions.where(col('_action')

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread David Edwards
want to do something with the counts you could try removing the individual counts and caches. Put a single cache on the df_output df_output = df_inserts.union(df_updates).cache() Then output a count group by type on this df before writing out the parquet. Hope that helps Dave On Thu, 13 Feb 2020, 06
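
The thread itself is PySpark; rendered in Scala here, a hedged sketch of the suggestion above: tag the two frames, union once, cache the combined frame, take the counts from a single grouped aggregation, and only then write the Parquet. Column names and the output path are assumptions:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

object SingleCacheCounts {
  // Assumes dfInserts and dfUpdates share the same schema.
  def writeWithCounts(dfInserts: DataFrame, dfUpdates: DataFrame, out: String): Unit = {
    val dfOutput = dfInserts.withColumn("_action", lit("Insert"))
      .union(dfUpdates.withColumn("_action", lit("Update")))
      .cache()

    // One pass over the cached data gives both counts.
    dfOutput.groupBy("_action").count().show()

    dfOutput.drop("_action").write.mode("append").parquet(out)
    dfOutput.unpersist()
  }
}
```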

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread Ashley Hoff
Are you referring to getting the counts of the individual data frames, or from the already outputted parquet? Thanks and I appreciate your reply On Thu, Feb 13, 2020 at 4:15 PM David Edwards wrote: > Hi Ashley, > > I'm not an expert but think this is because spark does lazy exe

Re: Questions about count() performance with dataframes and parquet files

2020-02-12 Thread David Edwards
Hi, > > I am currently working on an app using PySpark to produce an insert and > update daily delta capture, being outputted as Parquet. This is running on > a 8 core 32 GB Linux server in standalone mode (set to 6 worker cores of > 2GB memory each) running Spark 2.4.3. > > Th

Questions about count() performance with dataframes and parquet files

2020-02-12 Thread Ashley Hoff
Hi, I am currently working on an app using PySpark to produce an insert and update daily delta capture, being outputted as Parquet. This is running on a 8 core 32 GB Linux server in standalone mode (set to 6 worker cores of 2GB memory each) running Spark 2.4.3. This is being achieved by reading
