Amit,

The local fetch optimization is enabled by default in Tez-0.7. It reduces the number of connections a bit, and files generated on the same box end up being read directly.
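For a CI comparison, a rough sketch of the scenario matrix might look like the following. It only builds the on/off configurations to run; it does not launch a Tez job. The property name tez.runtime.optimize.local.fetch is the one I would expect from TezRuntimeConfiguration, so please verify it against your Tez version. The class name, the test.bytes.per.node key, and the data sizes are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LocalFetchTestMatrix {
    // Expected Tez property for the local fetch optimization; verify for your version.
    public static final String LOCAL_FETCH_KEY = "tez.runtime.optimize.local.fetch";
    // Hypothetical key, used only by this sketch to carry the per-node data size.
    public static final String BYTES_PER_NODE_KEY = "test.bytes.per.node";

    // One scenario per (data size, local fetch on/off) combination.
    public static List<Map<String, String>> scenarios(long[] bytesPerNode) {
        List<Map<String, String>> out = new ArrayList<>();
        for (long size : bytesPerNode) {
            for (boolean enabled : new boolean[] {true, false}) {
                Map<String, String> conf = new LinkedHashMap<>();
                conf.put(LOCAL_FETCH_KEY, Boolean.toString(enabled));
                conf.put(BYTES_PER_NODE_KEY, Long.toString(size));
                out.add(conf);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Illustrative sizes: a small and a large amount of data per node.
        for (Map<String, String> conf : scenarios(new long[] {1 * mb, 200 * mb})) {
            System.out.println(conf);
        }
    }
}
```

Each map would then be applied to the job configuration of whatever test job you run, and the runtimes compared per data size with the optimization on and off.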
Another optimization, which is far more useful, is the shared fetch optimization. This tries to avoid copying the same data onto the same host multiple times. We've seen fairly good gains when fetching data to 10K reducers from a single source - 28 minutes down to 2 minutes. There's an example - BroadcastLoadGen - which can be used to try out this feature.

For the local fetch optimization, you could use the same job (it may need some modification) to control the amount of data generated and fetched by each node, i.e. measure the advantage of a local read over a fetch with 1MB and with 200MB.

HTH
- Sid

On Fri, May 15, 2015 at 11:19 AM, Amit Tiwari <[email protected]> wrote:

> Hey guys,
> Local fetch optimization seems like an awesome feature. I'd like to add
> some tests for our CI/CD pipeline that exercise this feature.
> Any thoughts on what kind of setup, data etc I may need for this?
> thanks
> --amit
