Daniel
Can you elaborate on why you are using a broadcast variable to concatenate
many Avro files into a single ORC file? Look at wholeTextFiles on
SparkContext.
SparkContext.wholeTextFiles lets you read a directory containing multiple
small text files, and returns each of them as (filename, conten
broadcasting small files from the Spark Driver. This sounds like a good,
normal way to handle small files, but I cannot find a configuration to force
Spark to disable it.
Yong
From: daniel.ha...@veracity-group.com
Subject: Re: spark-avro takes a lot time to load thousands of files
Date: Tue, 22 Sep 2015 16
I agree, but it's a constraint I have to deal with.
The idea is to load these files and merge them into ORC.
When using Hive on Tez it takes less than a minute.
Daniel
> On 22 Sep 2015, at 16:00, Jonathan Coveney wrote:
>
> having a file per record is pretty inefficient on almost any file system
having a file per record is pretty inefficient on almost any file system
On Tuesday, 22 September 2015, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> We are trying to load around 10k Avro files (each file holds only one
> record) using spark-avro but it takes over 15 min