Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Deenar Toraskar
Daniel, can you elaborate on why you are using a broadcast variable to concatenate many Avro files into a single ORC file? Look at wholeTextFiles on the Spark context. SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, conten
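
To make the (filename, content) pair shape concrete, here is a minimal plain-Python sketch that mimics it on a local directory. It does not use Spark at all; `read_whole_files` and the `record-N.txt` file names are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path


def read_whole_files(directory):
    """Return one (path, content) pair per file in `directory`, sorted by path.

    Hypothetical local helper: it only mimics the pair shape described
    above and is not Spark code.
    """
    pairs = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            # Each small file becomes a single pair: whole file in, one record out.
            pairs.append((str(path), path.read_text(encoding="utf-8")))
    return pairs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Create a few tiny one-record files (assumed names and contents).
        for i in range(3):
            Path(tmp, f"record-{i}.txt").write_text(f"payload {i}", encoding="utf-8")
        for name, content in read_whole_files(tmp):
            print(Path(name).name, "->", content)
```

In Spark itself the corresponding call would be `sc.wholeTextFiles(path)` on an existing SparkContext. Whether that suits Avro input is a separate question, since Avro files are binary rather than text.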

RE: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread java8964
broadcasting small files from the Spark driver. This sounds like a good, normal way to handle small files, but I cannot find a configuration to force Spark to disable it. Yong From: daniel.ha...@veracity-group.com Subject: Re: spark-avro takes a lot time to load thousands of files Date: Tue, 22 Sep 2015 16

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Daniel Haviv
I agree, but it's a constraint I have to deal with. The idea is to load these files and merge them into ORC. When using Hive on Tez it takes less than a minute. Daniel > On 22 Sep 2015, at 16:00, Jonathan Coveney wrote: > > having a file per record is pretty inefficient on almost any file system

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Jonathan Coveney
Having a file per record is pretty inefficient on almost any file system. On Tuesday, 22 September 2015, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > We are trying to load around 10k Avro files (each file holds only one > record) using spark-avro but it takes over 15 min