Hi guys,

Regarding Parquet files: with Spark 1.2.0, reading 27 Parquet files (250 MB/file) takes 4 minutes.
I have a cluster with 4 nodes and that seems too slow to me. The "load"
function is not available in Spark 1.2, so I can't test it.

Regards,
Miguel

On Mon, Apr 13, 2015 at 8:12 PM, Eric Eijkelenboom <
eric.eijkelenb...@gmail.com> wrote:

> Hi guys
>
> Does anyone know how to stop Spark from opening all Parquet files before
> starting a job? This is quite a show stopper for me, since I have 5000
> Parquet files on S3.
>
> Recap of what I tried:
>
> 1. Disable schema merging with:
>    sqlContext.load("parquet", Map("mergeSchema" -> "false", "path" -> "s3://path/to/folder"))
>    This opens most files in the folder (17 out of 21 in my small
>    example). For 5000 files on S3, sqlContext.load() takes 30 minutes to
>    complete.
>
> 2. Use the old API with:
>    sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
>    Now sqlContext.parquetFile() only opens a few files and prints the
>    schema: so far so good! However, as soon as I run e.g. a count() on the
>    dataframe, Spark still opens all files _before_ starting a job/stage.
>    Effectively this moves the delay from load() to count() (or any other
>    action, I presume).
>
> 3. Run Spark 1.3.1-rc2.
>    sqlContext.load() took about 30 minutes for 5000 Parquet files on S3,
>    the same as 1.3.0.
>
> Any help would be greatly appreciated!
>
> Thanks a lot.
> Eric
>
>
> On 10 Apr 2015, at 16:46, Eric Eijkelenboom <eric.eijkelenb...@gmail.com>
> wrote:
>
> Hi Ted
>
> Ah, I guess the term 'source' confused me :)
>
> Doing:
>
> sqlContext.load("parquet", Map("mergeSchema" -> "false", "path" -> "path
> to a single day of logs"))
>
> for 1 directory with 21 files, Spark opens 17 files:
>
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening '
> s3n://mylogs/logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000072'
> for reading
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key
> 'logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000072' for
> reading at position '261573524'
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening '
> s3n://mylogs/logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000074'
> for reading
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening '
> s3n://mylogs/logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000077'
> for reading
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening '
> s3n://mylogs/logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000062'
> for reading
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key
> 'logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000074' for
> reading at position '259256807'
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key
> 'logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000077' for
> reading at position '260002042'
> 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key
> 'logs/yyyy=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-000062' for
> reading at position '260875275'
> etc.
>
> I can't seem to pass a comma-separated list of directories to load(), so
> in order to load multiple days of logs, I have to point to the root folder
> and depend on auto-partition discovery (unless there's a smarter way).
>
> Doing:
>
> sqlContext.load("parquet", Map("mergeSchema" -> "false", "path" -> "path
> to root log dir"))
>
> starts opening what seems like all files (I killed the process after a
> couple of minutes).
>
> Thanks for helping out.
> Eric


--
Regards,
Miguel Ángel
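
[Follow-up] One way to avoid pointing load() at the root folder (and the partition discovery that triggers) is to read each day directory separately with the old parquetFile() API and union the results. This is only a sketch under assumptions: the paths below are placeholders modelled on the log output above, and it relies on the pre-1.3 Parquet code path (parquetFile returns a SchemaRDD in 1.2 / DataFrame in 1.3, both of which have unionAll). It has not been timed against 5000 files on S3.

```scala
// Sketch only: assumes a spark-shell with sqlContext in scope,
// Spark 1.2/1.3-era APIs, and placeholder s3n:// paths.

// Prefer the old (non data source) Parquet path, as in attempt 2 above.
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")

// Enumerate the day directories explicitly instead of scanning the root.
val days = Seq(
  "s3n://mylogs/logs/yyyy=2015/mm=2/dd=1",
  "s3n://mylogs/logs/yyyy=2015/mm=2/dd=2"
)

// parquetFile() takes a single path here, so read each directory
// and combine them with unionAll.
val logs = days.map(sqlContext.parquetFile).reduce(_ unionAll _)
logs.registerTempTable("logs")
```

Note that because partition discovery is skipped, the yyyy/mm/dd partition columns will not appear in the schema; if those values are needed, they would have to be added back per directory before the union.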