[ https://issues.apache.org/jira/browse/PIG-5365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Satish Subhashrao Saley updated PIG-5365:
-----------------------------------------
    Description:
It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to 512MB or 1GB when they are reading TBs of data, to avoid launching too many map tasks (50-100K) for loading data. That many tasks adds unnecessary container launch overhead and wastes a lot of resources.

It would be good to have a new setting to configure the max number of tasks, which will override pig.maxCombinedSplitSize and combine more splits into one task. For example, with pig.max.input.splits=30000 and a data size of 2TB, it will combine more than 128MB (default pig.maxCombinedSplitSize) per task to have a maximum of 30K tasks. That will go as default into pig-default.properties and apply to all users.

Thank you [~rohini] for filing the issue.

  was:
It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to 512MB or 1GB when they are reading TBs of data, to avoid launching too many map tasks (50-100K) for loading data. That many tasks adds unnecessary container launch overhead and wastes a lot of resources.

It would be good to have a new setting to configure the max number of tasks, which will override pig.maxCombinedSplitSize and combine more splits into one task. For example, with pig.max.input.splits=30000 and a data size of 2TB, it will combine more than 128MB (default pig.maxCombinedSplitSize) per task to have a maximum of 30K tasks. That will go as default into pig-default.properties and apply to all users.


> Add support for PARALLEL clause in LOAD statement
> -------------------------------------------------
>
>                 Key: PIG-5365
>                 URL: https://issues.apache.org/jira/browse/PIG-5365
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Satish Subhashrao Saley
>            Assignee: Satish Subhashrao Saley
>            Priority: Major
>
> It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to
> 512MB or 1GB when they are reading TBs of data, to avoid launching too many
> map tasks (50-100K) for loading data. That many tasks adds unnecessary
> container launch overhead and wastes a lot of resources.
> It would be good to have a new setting to configure the max number of tasks,
> which will override pig.maxCombinedSplitSize and combine more splits into one
> task. For example, with pig.max.input.splits=30000 and a data size of 2TB, it
> will combine more than 128MB (default pig.maxCombinedSplitSize) per task to
> have a maximum of 30K tasks. That will go as default into
> pig-default.properties and apply to all users.
> Thank you [~rohini] for filing the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
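
For illustration only, the arithmetic behind the proposed cap can be sketched as below. This is not Pig code: the function names are hypothetical, and the rule used (raise the per-task combined size just enough that the estimated task count stays at or under pig.max.input.splits, never going below pig.maxCombinedSplitSize) is an assumption about how such a setting might behave. Real task counts also depend on file and block boundaries, which this ignores.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding on large byte counts."""
    return -(-a // b)


def estimated_tasks(total_input_bytes: int, split_size_bytes: int) -> int:
    """Rough task count if input were packed perfectly into splits of this size."""
    return ceil_div(total_input_bytes, split_size_bytes)


def effective_combined_split_size(total_input_bytes: int,
                                  max_combined_split_size: int,
                                  max_input_splits: int) -> int:
    """Hypothetical rule: keep the configured combined split size unless it
    would produce more than max_input_splits tasks, in which case grow it
    to the smallest size that respects the cap."""
    size_needed_for_cap = ceil_div(total_input_bytes, max_input_splits)
    return max(max_combined_split_size, size_needed_for_cap)


if __name__ == "__main__":
    total = 10 * TB
    default_size = 128 * MB  # default pig.maxCombinedSplitSize per the issue text
    cap = 30000              # pig.max.input.splits value used in the issue text

    size = effective_combined_split_size(total, default_size, cap)
    print("tasks at default size:", estimated_tasks(total, default_size))
    print("combined size under cap (MB):", round(size / MB, 1))
    print("tasks under cap:", estimated_tasks(total, size))
```

Under these assumptions, 10TB of input at 128MB per task is about 82K tasks, and the cap raises the combined size to roughly 350MB to land at 30K tasks. For 2TB, the same arithmetic gives about 16K tasks at 128MB, which is already under a 30K cap, so in this sketch the cap only takes effect for larger inputs.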