Hi guys,

Regarding Parquet files: with Spark 1.2.0, reading 27 Parquet files (250 MB/file) takes 4 minutes.
I have a cluster with 4 nodes and that seems too slow to me. The "load"
function is not available in Spark 1.2, so I can't test it.

Regards,
Miguel

On Mon, Apr 13, 2015 at 8:12 PM, Eric Eijkelenboom <
eric.eijkelenb...@gmail.com> wrote:

> Hi guys
>
> Does anyone know how to stop Spark from opening all Parquet files before
> starting a job? This is quite a show stopper for me, since I have 5000
> Parquet files on S3.
>
> Recap of what I tried:
>
> 1. Disable schema merging with:
>    sqlContext.load("parquet", Map("mergeSchema" -> "false", "path" -> "s3://path/to/folder"))
>    This opens most files in the folder (17 out of 21 in my small
>    example). For 5000 files on S3, sqlContext.load() takes 30 minutes to
>    complete.
>
> 2. Use the old API with:
>    sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
>    Now sqlContext.parquetFile() only opens a few files and prints the
>    schema: so far so good! However, as soon as I run e.g. a count() on the
>    dataframe, Spark still opens all files _before_ starting a job/stage.
>    Effectively this moves the delay from load() to count() (or any other
>    action, I presume).
>
> 3. Run Spark 1.3.1-rc2.
>    sqlContext.load() took about 30 minutes for 5000 Parquet files on S3,
>    the same as 1.3.0.
>
> Any help would be greatly appreciated!
>
> Thanks a lot.
> Eric
>
>
> On 10 Apr 2015, at 16:46, Eric Eijkelenboom <eric.eijkelenb...@gmail.com>
> wrote:
>
> Hi Ted
>
> Ah, I guess the term 'source' confused me :)
>
> Doing:
>
> sqlContext.load("parquet", Map("mergeSchema" -> "false", "path" -> "path
> to a single day of logs"))
>
> for 1 directory with 21 files, Spark opens 17 files:
>
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening '
> s3n://mylogs/logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000072'
> for reading
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key
> 'logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000072' for
> reading at position '261573524'
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening '
> s3n://mylogs/logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000074'
> for reading
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening '
> s3n://mylogs/logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000077'
> for reading
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening '
> s3n://mylogs/logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000062'
> for reading
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key
> 'logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000074' for
> reading at position '259256807'
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key
> 'logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000077' for
> reading at position '260002042'
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key
> 'logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000062' for
> reading at position '260875275'
> etc.
>
> I can't seem to pass a comma-separated list of directories to load(), so
> in order to load multiple days of logs, I have to point to the root folder
> and depend on auto-partition discovery (unless there's a smarter way).
>
> Doing:
>
> sqlContext.load("parquet", Map("mergeSchema" -> "false", "path" -> "path
> to root log dir"))
>
> starts opening what seems like all files (I killed the process after a
> couple of minutes).
>
> Thanks for helping out.
> Eric


--
Regards,
Miguel Ángel
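
[Follow-up] One way to avoid pointing load() at the root folder (and the partition discovery that triggers) is to read each day directory separately with the old parquetFile() API and union the results. This is only a sketch under assumptions: the paths below are placeholders modelled on the log output above, and it relies on the pre-1.3 Parquet code path (parquetFile returns a SchemaRDD in 1.2 / DataFrame in 1.3, both of which have unionAll). It has not been timed against 5000 files on S3.

```scala
// Sketch only: assumes a spark-shell with sqlContext in scope,
// Spark 1.2/1.3-era APIs, and placeholder s3n:// paths.

// Prefer the old (non data source) Parquet path, as in attempt 2 above.
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")

// Enumerate the day directories explicitly instead of scanning the root.
val days = Seq(
  "s3n://mylogs/logs/yyyy=2015/mm=2/dd=1",
  "s3n://mylogs/logs/yyyy=2015/mm=2/dd=2"
)

// parquetFile() takes a single path here, so read each directory
// and combine them with unionAll.
val logs = days.map(sqlContext.parquetFile).reduce(_ unionAll _)
logs.registerTempTable("logs")
```

Note that because partition discovery is skipped, the yyyy/mm/dd partition columns will not appear in the schema; if those values are needed, they would have to be added back per directory before the union.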