Re: Splittable or not?

2022-09-19 Thread Jack Goodson
SparkContext.parallelize(yourgzipfile) Hope this helps > On 19/09/2022, at 9:45 PM, Sid wrote: > > Cool. Thanks, everyone for the reply. > > On Sat, Sep 17, 2022 at 9:50 PM Enrico Minack wrote: > If with "won't affect the performance"
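
A minimal sketch of what this suggestion amounts to, assuming the gzipped file is small enough to collect on the driver; the path and partition count below are illustrative, not taken from the thread:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("gzip-redistribute").getOrCreate()
  val sc = spark.sparkContext

  // A .gz file is not splittable, so textFile() yields a single partition.
  val lines = sc.textFile("/data/events.json.gz").collect()

  // Redistribute the collected lines across the cluster (8 partitions here).
  val rdd = sc.parallelize(lines.toSeq, 8)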

Re: Splittable or not?

2022-09-19 Thread Sid
Cool. Thanks, everyone for the reply. On Sat, Sep 17, 2022 at 9:50 PM Enrico Minack wrote: > If with "won't affect the performance" you mean "parquet is splittable > though it uses snappy", then yes. Splittable files allow for optimal > parallelization, wh

Re: Splittable or not?

2022-09-17 Thread Enrico Minack
If with "won't affect the performance" you mean "parquet is splittable though it uses snappy", then yes. Splittable files allow for optimal parallelization, which "won't affect performance". Spark writing data will split the data into multiple files already (here

Re: Splittable or not?

2022-09-14 Thread Sid
Okay, so you mean to say that parquet compresses the denormalized data using snappy, so it won't affect the performance. Only using snappy will affect the performance. Am I correct? On Thu, 15 Sep 2022, 01:08 Amit Joshi, wrote: > Hi Sid, > > Snappy itself is not splittable. But t

Re: Splittable or not?

2022-09-14 Thread Amit Joshi
Hi Sid, Snappy itself is not splittable. But the format that contains the actual data, like parquet (which is basically divided into row groups), can be compressed using snappy. This works because blocks (pages of the parquet format) inside the parquet can be independently compressed using snappy
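
A hedged sketch of the mechanics described here: snappy is applied to pages/column chunks inside each row group, so the row-group size is the natural split unit. The parquet.block.size setting below is the standard parquet-hadoop knob for row-group size; the 128 MB value and path are illustrative.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("rowgroup-demo").getOrCreate()

  // Row-group size controls how finely readers can split a snappy parquet file.
  spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

  spark.range(10000000)
    .toDF("id")
    .write
    .option("compression", "snappy")
    .parquet("/tmp/rowgroup_demo")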

Splittable or not?

2022-09-14 Thread Sid
Hello experts, I know that Gzip and snappy files are not splittable, i.e., the data won't be distributed into multiple blocks; rather, it would try to load the data into a single partition/block. So, my question is: when I write the parquet data via Spark, it gets stored at the destination with something like
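
An illustrative sketch of the contrast behind this question (the paths are made up): a single gzipped text/JSON file cannot be split, while a snappy parquet dataset can.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("splittable-check").getOrCreate()

  val gz = spark.read.json("/data/events.json.gz")      // single .gz file: not splittable
  val pq = spark.read.parquet("/data/events_parquet")   // snappy parquet: splittable

  println(gz.rdd.getNumPartitions)   // 1 -- the whole gzip file becomes one task
  println(pq.rdd.getNumPartitions)   // usually many -- at least one per part file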

Re: Processing a splittable file from a single executor

2017-11-16 Thread Jeroen Miller
On 16 Nov 2017, at 10:22, Michael Shtelma wrote: > you call repartition(1) before starting processing your files. This > will ensure that you end up with just one partition. One question and one remark: Q) val ds = sqlContext.read.parquet(path).repartition(1) Am I
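A hedged sketch around the snippet quoted above, using the newer SparkSession API rather than the sqlContext from the original mail; the path is illustrative. Both calls end with a single partition, but coalesce(1) avoids the full shuffle that repartition(1) performs.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("single-partition").getOrCreate()
  val path = "/data/input_parquet"   // illustrative path

  // repartition(1) shuffles all data into one partition; coalesce(1) merges existing
  // partitions without a shuffle and is often cheaper when one partition is the goal.
  val ds1 = spark.read.parquet(path).repartition(1)
  val ds2 = spark.read.parquet(path).coalesce(1)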

Processing a splittable file from a single executor

2017-11-16 Thread Jeroen Miller
Dear Sparkers, A while back, I asked how to process non-splittable files in parallel, one file per executor. Vadim's suggested "scheduling within an application" approach worked out beautifully. I am now facing the 'opposite' problem: - I have a bunch of parquet files to proce
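
A hedged sketch of the "scheduling within an application" idea mentioned here: each file becomes an independent Spark job submitted from its own thread, so the scheduler can run them concurrently. File names and the per-file processing are illustrative, not from the original thread.

  import scala.concurrent.{Await, Future}
  import scala.concurrent.duration.Duration
  import scala.concurrent.ExecutionContext.Implicits.global
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("per-file-jobs").getOrCreate()
  val files = Seq("/data/a.parquet", "/data/b.parquet", "/data/c.parquet")

  // One Spark job per file, launched concurrently (FAIR scheduling across jobs helps).
  val jobs = files.map { path =>
    Future {
      spark.read.parquet(path).count()   // stand-in for the real per-file processing
    }
  }
  Await.result(Future.sequence(jobs), Duration.Inf)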

Re: Reading lzo+index with spark-csv (Splittable reads)

2016-01-31 Thread Hyukjin Kwon
om/databricks/spark/csv/util/TextFile.scala#L34-L36 > -- > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reading-lzo-index-with-spark-csv-Splittable-reads-tp26103p26105.html > Sent from the Ap

Re: Reading lzo+index with spark-csv (Splittable reads)

2016-01-29 Thread syepes

Reading lzo+index with spark-csv (Splittable reads)

2016-01-29 Thread syepes
Hello, I have managed to speed up the read stage when loading CSV files using the classic "newAPIHadoopFile" method; the issue is that I would like to use the spark-csv package, and it seems that it's not taking into consideration the LZO index file / splittable reads. /# Using the clas
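
A sketch of the "classic" newAPIHadoopFile approach mentioned here, assuming the hadoop-lzo classes (com.hadoop.mapreduce.LzoTextInputFormat) are on the classpath and a .index file sits next to the .lzo data; the path and class choice are assumptions, not taken from the original mail.

  import com.hadoop.mapreduce.LzoTextInputFormat
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("lzo-splittable-read"))

  // With an LZO index present, this input format can split the file,
  // so the read is distributed across many tasks instead of one.
  val lines = sc
    .newAPIHadoopFile(
      "/data/large.csv.lzo",
      classOf[LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])
    .map(_._2.toString)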