What I understand is that you have a source location where files are dropped and
never removed? If that is the case, you may want to keep track of which files
your program has already processed and read only the "new" files.
On 3 Aug 2016 22:03, "Yana Kadiyska" <yana.kadiy...@gmail.com> wrote:

> Hi folks, I have an ETL pipeline that drops a file every 1/2 hour. When
> spark reads these files, I end up with 315K tasks for a dataframe reading a
> few days worth of data.
>
> I know with a regular Spark job, I can use coalesce to come to a lower
> number of tasks. Is there a way to tell HiveThriftServer2 to coalesce? I
> have a line in hive-conf that says to use CombinedInputFormat but I'm not
> sure it's working.
>
> (Obviously having fewer large files is better but I don't control the file
> generation side of this)
>
> Tips much appreciated
>
