Re: Merging Parquet Files

2020-09-03 Thread Michael Segel
Hi, 

I think you’re asking the right question; however, you’re assuming he’s in the 
cloud, and he never mentioned the size of the files. 

It could be that he’s got a lot of small-ish data sets. 1 GB is fairly small in 
relative terms.  

Again YMMV. 

Personally, if you’re going to use Spark for data engineering: Scala first, 
Java second, then Python, unless you’re a Python developer, in which case go 
with Python. 

I agree that wanting to have a single file needs to be explained. 


> On Aug 31, 2020, at 10:52 AM, Jörn Franke  wrote:
> 
> Why only one file?
> I would go more for files of a specific size, e.g. data split into 1 GB files. 
> Another reason is that if you need to transfer it (e.g. to other clouds), 
> having a single file of several terabytes is bad.
> 
> It depends on your use case, but you might also look at partitioning etc.
> 
> On 31.08.2020 at 16:17, Tzahi File wrote:
>> 
>> 
>> Hi, 
>> 
>> I would like to develop a process that merges Parquet files. 
>> My first intention was to develop it with PySpark, using coalesce(1) to 
>> create only one file. 
>> This process is going to run on a huge number of files.
>> I wanted your advice on the best way to implement it (PySpark isn't 
>> a must).  
>> 
>> 
>> Thanks,
>> Tzahi



Re: Merging Parquet Files

2020-08-31 Thread Tzahi File
You are right.

In general this job should deal with very small files and create an output
file of less than 100 MB.
In other cases I would need to create multiple files of around 100 MB.
The issue with repartitioning is that decreasing the number of partitions
would reduce the ETL's performance, while this job should only be a side job.
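
A minimal sketch of that size-targeted merge, assuming a Spark 2.x+ session
named `spark`; the paths, the ~100 MB target, and the use of the Hadoop
FileSystem API to measure the input are assumptions, not details from the job
described above:

// Sketch: derive the number of output files from the total input size and a
// ~100 MB-per-file target, then rewrite the small files. Paths are hypothetical.
import org.apache.hadoop.fs.{FileSystem, Path}

val inputPath  = "/warehouse/events/small_files"   // hypothetical input directory
val outputPath = "/warehouse/events/merged"        // hypothetical output directory
val targetBytes = 100L * 1024 * 1024               // aim for roughly 100 MB per file

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val totalBytes = fs.getContentSummary(new Path(inputPath)).getLength
val numFiles = math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)

spark.read.parquet(inputPath)
  .repartition(numFiles)        // coalesce(numFiles) also works and avoids a full shuffle
  .write.mode("overwrite")
  .parquet(outputPath)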




On Mon, Aug 31, 2020 at 5:52 PM Jörn Franke  wrote:

> Why only one file?
> I would go more for files of a specific size, e.g. data split into 1 GB files.
> Another reason is that if you need to transfer it (e.g. to other clouds),
> having a single file of several terabytes is bad.
>
> It depends on your use case, but you might also look at partitioning etc.
>
> On 31.08.2020 at 16:17, Tzahi File wrote:
>
> 
> Hi,
>
> I would like to develop a process that merges Parquet files.
> My first intention was to develop it with PySpark, using coalesce(1) to
> create only one file.
> This process is going to run on a huge number of files.
> I wanted your advice on the best way to implement it (PySpark
> isn't a must).
>
>
> Thanks,
> Tzahi
>
>



Re: Merging Parquet Files

2020-08-31 Thread Jörn Franke
Why only one file?
I would go more for files of a specific size, e.g. data split into 1 GB files.
Another reason is that if you need to transfer it (e.g. to other clouds),
having a single file of several terabytes is bad.

It depends on your use case, but you might also look at partitioning etc.
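
A sketch of the partitioned layout hinted at above, assuming the Spark 2.x+
DataFrame API; the partition column event_date and both paths are hypothetical:

// Sketch: write the data partitioned by a date column, so each partition
// directory holds a manageable number of reasonably sized files.
import org.apache.spark.sql.functions.col

spark.read.parquet("/warehouse/events/small_files")   // hypothetical input
  .repartition(col("event_date"))                     // group rows of the same date together
  .write
  .mode("overwrite")
  .partitionBy("event_date")                          // yields .../event_date=2020-08-31/ directories
  .parquet("/warehouse/events/by_date")               // hypothetical output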

> On 31.08.2020 at 16:17, Tzahi File wrote:
> 
> 
> Hi, 
> 
> I would like to develop a process that merges Parquet files. 
> My first intention was to develop it with PySpark, using coalesce(1) to 
> create only one file. 
> This process is going to run on a huge number of files.
> I wanted your advice on the best way to implement it (PySpark isn't a 
> must).  
> 
> 
> Thanks,
> Tzahi


Merging Parquet Files

2020-08-31 Thread Tzahi File
Hi,

I would like to develop a process that merges Parquet files.
My first intention was to develop it with PySpark, using coalesce(1) to
create only one file.
This process is going to run on a huge number of files.
I wanted your advice on the best way to implement it (PySpark isn't
a must).


Thanks,
Tzahi


Re: Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Thanks, Hyukjin.

I’ll try using the Parquet tools for 1.9.

On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon  wrote:

Hi Benjamin,


As you might already know, I believe the Hadoop command does not actually merge
column-based formats such as ORC or Parquet; it simply concatenates them.

I haven't tried this by myself but I remember I saw a JIRA in Parquet -
https://issues.apache.org/jira/browse/PARQUET-460

It seems parquet-tools allows merging small Parquet files into one.


Also, I believe there are command-line tools in Kite -
https://github.com/kite-sdk/kite

This might be useful.


Thanks!

2016-12-23 7:01 GMT+09:00 Benjamin Kim :

Has anyone tried to merge *.gz.parquet files before? I'm trying to merge
them into one file after they are output from Spark. Doing a coalesce(1) on
the Spark cluster will not work; it just does not have the resources to do
it. I'm trying to do it from the command line, without Spark, so I can use
the command in a shell script. I tried "hdfs dfs -getmerge", but the resulting
file becomes unreadable by Spark, with a gzip footer error.





Thanks,


Ben




Re: Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Thanks, Hyukjin.

I’ll try using the Parquet tools for 1.9 based on the jira. If that doesn’t 
work, I’ll try Kite.

Cheers,
Ben


> On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon  wrote:
> 
> Hi Benjamin,
> 
> 
> As you might already know, I believe the Hadoop command does not actually 
> merge column-based formats such as ORC or Parquet; it simply 
> concatenates them.
> 
> I haven't tried this by myself but I remember I saw a JIRA in Parquet - 
> https://issues.apache.org/jira/browse/PARQUET-460 
> 
> 
> It seems parquet-tools allows merging small Parquet files into one. 
> 
> 
> Also, I believe there are command-line tools in Kite - 
> https://github.com/kite-sdk/kite 
> 
> This might be useful.
> 
> 
> Thanks!
> 
> 2016-12-23 7:01 GMT+09:00 Benjamin Kim  >:
> Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them 
> into one file after they are output from Spark. Doing a coalesce(1) on the 
> Spark cluster will not work; it just does not have the resources to do it. 
> I'm trying to do it from the command line, without Spark, so I can use the 
> command in a shell script. I tried "hdfs dfs -getmerge", but the resulting 
> file becomes unreadable by Spark, with a gzip footer error.
> 
> Thanks,
> Ben



Re: Merging Parquet Files

2016-12-22 Thread Hyukjin Kwon
Hi Benjamin,


As you might already know, I believe the Hadoop command does not actually merge
column-based formats such as ORC or Parquet; it simply concatenates them.

I haven't tried this by myself but I remember I saw a JIRA in Parquet -
https://issues.apache.org/jira/browse/PARQUET-460

It seems parquet-tools allows merging small Parquet files into one.


Also, I believe there are command-line tools in Kite -
https://github.com/kite-sdk/kite

This might be useful.


Thanks!

2016-12-23 7:01 GMT+09:00 Benjamin Kim :

> Has anyone tried to merge *.gz.parquet files before? I'm trying to merge
> them into one file after they are output from Spark. Doing a coalesce(1) on
> the Spark cluster will not work; it just does not have the resources to do
> it. I'm trying to do it from the command line, without Spark, so I can use
> the command in a shell script. I tried "hdfs dfs -getmerge", but the resulting
> file becomes unreadable by Spark, with a gzip footer error.
>
> Thanks,
> Ben


Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them 
into one file after they are output from Spark. Doing a coalesce(1) on the Spark 
cluster will not work; it just does not have the resources to do it. I'm trying 
to do it from the command line, without Spark, so I can use the command in a 
shell script. I tried "hdfs dfs -getmerge", but the resulting file becomes 
unreadable by Spark, with a gzip footer error.

Thanks,
Ben



Re: Merging Parquet Files

2014-11-25 Thread Michael Armbrust
You'll need to be running a very recent version of Spark SQL as this
feature was just added.

On Tue, Nov 25, 2014 at 1:01 AM, Daniel Haviv danielru...@gmail.com wrote:

 Hi,
 Thanks for your reply. I'm trying to do what you suggested but I get:

 scala> sqlContext.sql("CREATE TEMPORARY TABLE data USING
 org.apache.spark.sql.parquet OPTIONS (path '/requests_parquet.toomany')")

 java.lang.RuntimeException: Failed to load class for data source:
 org.apache.spark.sql.parquet
   at scala.sys.package$.error(package.scala:27)

 any idea why?

 Thanks,
 Daniel

 On Mon, Nov 24, 2014 at 11:30 PM, Michael Armbrust mich...@databricks.com
  wrote:

 Parquet does a lot of serial metadata operations on the driver which
 makes it really slow when you have a very large number of files (especially
 if you are reading from something like S3).  This is something we are aware
 of and that I'd really like to improve in 1.3.

 You might try the (brand new and very experimental) new parquet support
 that I added into 1.2 at the last minute in an attempt to make our metadata
 handling more efficient.

 Basically you load the parquet files using the new data source API
 instead of using parquetFile:

 CREATE TEMPORARY TABLE data
 USING org.apache.spark.sql.parquet
 OPTIONS (
   path 'path/to/parquet'
 )

 This will at least parallelize the retrieval of file status objects, but
 there is a lot more optimization that I hope to do.

 On Sat, Nov 22, 2014 at 1:53 PM, Daniel Haviv danielru...@gmail.com
 wrote:

 Hi,
 I'm ingesting a lot of small JSON files and converting them to unified
 parquet files, but even the unified files are fairly small (~10 MB).
 I want to run a merge operation every hour on the existing files, but it
 takes a lot of time for such a small amount of data: about 3 GB spread over
 3000 parquet files.

 Basically what I'm doing is loading the files in the existing directory,
 coalescing them, and saving to the new dir:
 val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")

 parquetFiles.coalesce(2).saveAsParquetFile(s"/requests_merged/$currday")

 Doing this takes over an hour on my 3-node cluster...

 Is there a better way to achieve this?
 Any idea what can cause such a simple operation to take so long?

 Thanks,
 Daniel






Re: Merging Parquet Files

2014-11-24 Thread Michael Armbrust
Parquet does a lot of serial metadata operations on the driver which makes
it really slow when you have a very large number of files (especially if
you are reading from something like S3).  This is something we are aware of
and that I'd really like to improve in 1.3.

You might try the (brand new and very experimental) new parquet support
that I added into 1.2 at the last minute in an attempt to make our metadata
handling more efficient.

Basically you load the parquet files using the new data source API instead
of using parquetFile:

CREATE TEMPORARY TABLE data
USING org.apache.spark.sql.parquet
OPTIONS (
  path 'path/to/parquet'
)

This will at least parallelize the retrieval of file status objects, but
there is a lot more optimization that I hope to do.
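
Putting that together with the coalesce-and-save step from the quoted message
below, a rough sketch (Spark 1.2-era API; the table name and the currday value
are illustrative, and the paths follow the quoted example):

// Sketch: load via the data source API so file-status retrieval is parallelized,
// then rewrite the data as a small number of larger Parquet files.
sqlContext.sql("""
  CREATE TEMPORARY TABLE requests
  USING org.apache.spark.sql.parquet
  OPTIONS (path '/requests_merged/inproc')
""")

val currday = "2014-11-25"   // illustrative stand-in for the $currday used below
val merged = sqlContext.sql("SELECT * FROM requests")
merged.coalesce(2).saveAsParquetFile(s"/requests_merged/$currday")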

On Sat, Nov 22, 2014 at 1:53 PM, Daniel Haviv danielru...@gmail.com wrote:

 Hi,
 I'm ingesting a lot of small JSON files and converting them to unified
 parquet files, but even the unified files are fairly small (~10 MB).
 I want to run a merge operation every hour on the existing files, but it
 takes a lot of time for such a small amount of data: about 3 GB spread over
 3000 parquet files.

 Basically what I'm doing is loading the files in the existing directory,
 coalescing them, and saving to the new dir:
 val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")

 parquetFiles.coalesce(2).saveAsParquetFile(s"/requests_merged/$currday")

 Doing this takes over an hour on my 3-node cluster...

 Is there a better way to achieve this?
 Any idea what can cause such a simple operation to take so long?

 Thanks,
 Daniel



Merging Parquet Files

2014-11-22 Thread Daniel Haviv
Hi,
I'm ingesting a lot of small JSON files and converting them to unified parquet
files, but even the unified files are fairly small (~10 MB).
I want to run a merge operation every hour on the existing files, but it
takes a lot of time for such a small amount of data: about 3 GB spread over
3000 parquet files.

Basically what I'm doing is loading the files in the existing directory,
coalescing them, and saving to the new dir:
val parquetFiles = sqlContext.parquetFile("/requests_merged/inproc")

parquetFiles.coalesce(2).saveAsParquetFile(s"/requests_merged/$currday")

Doing this takes over an hour on my 3-node cluster...

Is there a better way to achieve this?
Any idea what can cause such a simple operation to take so long?

Thanks,
Daniel


Merging Parquet Files

2014-11-19 Thread Daniel Haviv
Hello,
I'm writing a process that ingests json files and saves them as parquet
files.
The process is as such:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jsonRequests = sqlContext.jsonFile("/requests")
val parquetRequests = sqlContext.parquetFile("/requests_parquet")

jsonRequests.registerTempTable("jsonRequests")
parquetRequests.registerTempTable("parquetRequests")

val unified_requests = sqlContext.sql("select * from jsonRequests union select * from parquetRequests")

unified_requests.saveAsParquetFile("/tempdir")

and then I delete /requests_parquet and rename /tempdir as /requests_parquet.

Is there a better way to achieve that?

Another problem I have is that I get a lot of small json files and, as a
result, a lot of small parquet files. I'd like to merge the json files into
a few parquet files. How do I do that?

Thank you,
Daniel


Re: Merging Parquet Files

2014-11-19 Thread Marius Soutier
You can also insert into existing tables via .insertInto(tableName, overwrite). 
You just have to import sqlContext._
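
A rough sketch of what that could look like for the hourly ingest described in
the quoted message below (Spark 1.x SchemaRDD API; whether a temporary table
registered from parquetFile accepts inserts this way is my assumption):

// Sketch: append the newly ingested JSON rows into the existing Parquet-backed
// table instead of rewriting /requests_parquet from scratch every hour.
import sqlContext._   // brings in the implicits needed for insertInto

val parquetRequests = sqlContext.parquetFile("/requests_parquet")
parquetRequests.registerTempTable("parquetRequests")

val jsonRequests = sqlContext.jsonFile("/requests")
jsonRequests.registerTempTable("jsonRequests")

// second argument is the overwrite flag: false appends rather than replaces
sqlContext.sql("select * from jsonRequests").insertInto("parquetRequests", false)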

On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote:

 Hello,
 I'm writing a process that ingests json files and saves them as parquet files.
 The process is as such:
 
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val jsonRequests = sqlContext.jsonFile("/requests")
 val parquetRequests = sqlContext.parquetFile("/requests_parquet")
 
 jsonRequests.registerTempTable("jsonRequests")
 parquetRequests.registerTempTable("parquetRequests")
 
 val unified_requests = sqlContext.sql("select * from jsonRequests union select * from parquetRequests")
 
 unified_requests.saveAsParquetFile("/tempdir")
 
 and then I delete /requests_parquet and rename /tempdir as /requests_parquet.
 
 Is there a better way to achieve that? 
 
 Another problem I have is that I get a lot of small json files and, as a 
 result, a lot of small parquet files. I'd like to merge the json files into a 
 few parquet files. How do I do that?
 
 Thank you,
 Daniel
 
 



Re: Merging Parquet Files

2014-11-19 Thread Daniel Haviv
Very cool thank you!


On Wed, Nov 19, 2014 at 11:15 AM, Marius Soutier mps@gmail.com wrote:

 You can also insert into existing tables via .insertInto(tableName,
 overwrite). You just have to import sqlContext._

 On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote:

 Hello,
 I'm writing a process that ingests json files and saves them as parquet
 files.
 The process is as such:

 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val jsonRequests = sqlContext.jsonFile("/requests")
 val parquetRequests = sqlContext.parquetFile("/requests_parquet")

 jsonRequests.registerTempTable("jsonRequests")
 parquetRequests.registerTempTable("parquetRequests")

 val unified_requests = sqlContext.sql("select * from jsonRequests union select * from parquetRequests")

 unified_requests.saveAsParquetFile("/tempdir")

 and then I delete /requests_parquet and rename /tempdir as
 /requests_parquet.

 Is there a better way to achieve that?

 Another problem I have is that I get a lot of small json files and, as a
 result, a lot of small parquet files. I'd like to merge the json files into
 a few parquet files. How do I do that?

 Thank you,
 Daniel






Re: Merging Parquet Files

2014-11-19 Thread Michael Armbrust
On Wed, Nov 19, 2014 at 12:41 AM, Daniel Haviv danielru...@gmail.com
wrote:

 Another problem I have is that I get a lot of small json files and, as a
 result, a lot of small parquet files. I'd like to merge the json files into
 a few parquet files. How do I do that?


You can use `coalesce` on any RDD to merge files.
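
For instance, applied to the pipeline from the original message, a sketch might
look like this (the partition count of 4 is only an illustration):

// Sketch: coalesce the unioned result down to a few partitions before saving,
// so the output is a few larger Parquet files rather than many small ones.
val unified_requests = sqlContext.sql("select * from jsonRequests union select * from parquetRequests")
unified_requests.coalesce(4).saveAsParquetFile("/tempdir")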