Hi, HiveThriftServer2 itself has no such functionality. Have you tried adaptive execution in Spark? https://issues.apache.org/jira/browse/SPARK-9850 I have not used it yet myself, but this experimental feature seems to tune the number of tasks depending on partition size.
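If it helps, here is a minimal sketch of turning it on; the property names below are the ones documented for the experimental feature in Spark 2.x, so please verify them against your version:

import org.apache.spark.sql.SparkSession

// Sketch: enable the experimental adaptive execution from SPARK-9850.
val spark = SparkSession.builder()
  .appName("adaptive-execution-demo")
  // Let Spark decide post-shuffle partition counts at runtime.
  .config("spark.sql.adaptive.enabled", "true")
  // Target bytes per post-shuffle partition (64 MB here).
  .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
    (64L * 1024 * 1024).toString)
  .getOrCreate()

Note this tunes the post-shuffle partition count, so it may not reduce the task count of the initial file scan itself.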
// maropu

On Thu, Aug 4, 2016 at 1:13 AM, Chanh Le <giaosu...@gmail.com> wrote:
> I believe there is no way to reduce the number of tasks on the Hive side
> using coalesce, because by the time it gets to Hive it just reads the
> files, and the task count depends on how many files you put there. So
> what I did was coalesce at the ETL layer: write out as few files as
> possible to cut the I/O time spent reading them.
>
> > On Aug 3, 2016, at 7:03 PM, Yana Kadiyska <yana.kadiy...@gmail.com>
> > wrote:
> >
> > Hi folks, I have an ETL pipeline that drops a file every 1/2 hour.
> > When Spark reads these files, I end up with 315K tasks for a dataframe
> > reading a few days' worth of data.
> >
> > I know that with a regular Spark job I can use coalesce to get down to
> > a lower number of tasks. Is there a way to tell HiveThriftServer2 to
> > coalesce? I have a line in hive-conf that says to use
> > CombinedInputFormat, but I'm not sure it's working.
> >
> > (Obviously having fewer large files is better, but I don't control the
> > file-generation side of this.)
> >
> > Tips much appreciated

--
---
Takeshi Yamamuro
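For reference, the ETL-side compaction Chanh describes could look like the sketch below; the input/output paths and the partition count are hypothetical:

import org.apache.spark.sql.SparkSession

// Sketch: compact many small half-hourly files into a few large ones at the
// ETL layer, so downstream reads (e.g. through the Thrift server) spawn
// fewer tasks. Paths and the partition count are hypothetical.
val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

// Read one day's worth of small files.
val df = spark.read.parquet("/data/events/2016-08-03/")

// coalesce(8) merges the read partitions down to 8 without a full shuffle,
// so the write produces at most 8 output files.
df.coalesce(8)
  .write
  .mode("overwrite")
  .parquet("/data/events-compacted/2016-08-03/")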