Hi Gopal - actually no, the table is not partitioned/bucketed.
Every day the whole table gets cleaned up and repopulated with the last 120
days' data...

What other properties can I try to improve the performance of the reduce
steps...?

Suresh V
http://www.justbirds.in


On Sat, Jan 9, 2016 at 8:52 AM, Gopal Vijayaraghavan <gop...@apache.org>
wrote:

> Hi,
>
> > The job completes fine if we reduce the # of rows processed by reducing
> > the # of days' data being processed.
> >
> > It just gets stuck after all maps are completed. We checked the logs and
> > it says the containers are released.
>
> Looks like you're inserting into a bucketed & partitioned table and facing
> connection timeouts due to GC pauses?
>
> The optimization that helps here is disabled by default, because it slows
> down the one-partition-at-a-time ETL case.
>
> If your data load writes to more than one partition and the table is
> bucketed, you need to set
>
> set hive.optimize.sort.dynamic.partition=true;
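>
> For example, a minimal sketch of such a load with the setting enabled (the
> table and column names here are made up for illustration):
>
>   set hive.optimize.sort.dynamic.partition=true;
>   set hive.exec.dynamic.partition=true;
>   set hive.exec.dynamic.partition.mode=nonstrict;
>   set hive.enforce.bucketing=true;
>
>   -- sales_bucketed / sales_staging are hypothetical table names;
>   -- with dynamic partitioning the partition column (ds) goes last
>   insert overwrite table sales_bucketed partition (ds)
>   select txn_id, cust_id, amount, ds
>   from sales_staging
>   where ds >= '2015-09-12';   -- roughly the last 120 days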
>
>
> The largest data load done using a single SQL statement was the 100 TB ETL
> load for TPC-DS.
>
> In hive-11, people had workarounds using explicit "DISTRIBUTE BY" or "SORT
> BY", which didn't scale as well.
>
> If you have those in your query, remove them.
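>
> As a rough (hypothetical) example of the pattern to drop:
>
>   -- old workaround: shuffle by the partition key by hand
>   insert overwrite table sales_bucketed partition (ds)
>   select txn_id, cust_id, amount, ds
>   from sales_staging
>   distribute by ds;
>
> With hive.optimize.sort.dynamic.partition=true the optimizer inserts the
> shuffle/sort on the partition (and bucket) keys itself, so the trailing
> "distribute by ds" is unnecessary.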
>
> > 2016-01-08 19:33:33,119 INFO [Socket Reader #1 for port 43451]
> > org.apache.hadoop.ipc.Server: Socket Reader #1 for port 43451:
> > readAndProcess from client 39.0.8.17 threw exception
> > [java.io.IOException: Connection reset by peer]
>
> Whether that fixes it or not, there are other low-level issues which
> trigger similar errors as you scale your cluster to 300+ nodes [1].
>
> https://github.com/t3rmin4t0r/notes/wiki/Hadoop-Tuning-notes
>
>
>
> Cheers,
> Gopal
> [1] -
> <http://www.slideshare.net/Hadoop_Summit/w-1205p230-aradhakrishnan-v3/10>