[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897368#action_12897368 ]

Alan Gates commented on PIG-1518:
---------------------------------

bq. For mapside cogroup or mapside group by, though, the splits can be combined because the splits are only required to contain the all duplicate keys per instance and combination of splits will still preserve that invariant.

You are correct for mapside group, but not for mapside cogroup. Mapside cogroup does require all files being grouped to be processed in an ordered fashion.

multi file input format for loaders
-----------------------------------

Key: PIG-1518
URL: https://issues.apache.org/jira/browse/PIG-1518
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
Fix For: 0.8.0

We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient.
It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible.
There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API.
We at least want to do a feasibility study for Pig 0.8.0.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-965:
-----------------------------

Status: Patch Available (was: Reopened)

PERFORMANCE: optimize common case in matches (PORegex)
------------------------------------------------------

Key: PIG-965
URL: https://issues.apache.org/jira/browse/PIG-965
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
Fix For: 0.8.0
Attachments: automaton.jar, poregex2.patch

Some frequently seen use cases of the 'matches' comparison operator have the following properties:
1. The rhs is a constant string, e.g. c1 matches 'abc%'.
2. Regexes that look for a matching prefix, suffix, etc. are very common, e.g. 'abc%', '%abc', '%abc%'.

To optimize for these common cases, PORegex.java can be changed to:
1. Compile the pattern (rhs of matches) and re-use it if the pattern string has not changed.
2. Use string comparisons for simple common regexes (in 2 above).

The implementation of the Hive LIKE clause uses similar optimizations.
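The two optimizations proposed in PIG-965 can be sketched roughly as follows. This is an illustration in Python, not the Java code of PORegex.java or of poregex2.patch. It assumes, as the examples in the description do, that '%' stands for "any run of characters" and that everything else in the pattern is literal. The class name CachedMatcher, its methods, and the builds counter (present only to make the caching observable) are hypothetical.

```python
import re


class CachedMatcher:
    """Hypothetical helper illustrating the PIG-965 idea; not Pig code."""

    def __init__(self):
        self._pattern_text = None  # rhs string seen on the previous call
        self._match = None         # predicate built for that rhs
        self.builds = 0            # how many times a predicate was (re)built

    def _build(self, rhs):
        leading = rhs.startswith("%")
        trailing = rhs.endswith("%")
        body = rhs.strip("%")
        if "%" in body:
            # General case: translate to a regex and compile it once.
            parts = [re.escape(p) for p in rhs.split("%")]
            regex = re.compile(".*".join(parts), re.DOTALL)
            return lambda s: regex.fullmatch(s) is not None
        # Simple cases: plain string comparisons, no regex engine involved.
        if leading and trailing:
            return lambda s: body in s          # '%abc%'
        if trailing:
            return lambda s: s.startswith(body)  # 'abc%'
        if leading:
            return lambda s: s.endswith(body)    # '%abc'
        return lambda s: s == body               # 'abc'

    def matches(self, value, rhs):
        # Rebuild only when the pattern string differs from the previous call.
        if rhs != self._pattern_text:
            self._match = self._build(rhs)
            self._pattern_text = rhs
            self.builds += 1
        return self._match(value)
```

With a constant rhs, every row after the first reuses the predicate built for the first row, so the per-row cost is a single string comparison in the simple cases.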
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-965:
-----------------------------

Fix Version/s: 0.8.0

Ankit is right, the patch is not present in trunk. I will apply it to trunk.

PERFORMANCE: optimize common case in matches (PORegex)
------------------------------------------------------

Key: PIG-965
URL: https://issues.apache.org/jira/browse/PIG-965
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: Thejas M Nair
Fix For: 0.8.0
Attachments: automaton.jar, poregex2.patch

Some frequently seen use cases of the 'matches' comparison operator have the following properties:
1. The rhs is a constant string, e.g. c1 matches 'abc%'.
2. Regexes that look for a matching prefix, suffix, etc. are very common, e.g. 'abc%', '%abc', '%abc%'.

To optimize for these common cases, PORegex.java can be changed to:
1. Compile the pattern (rhs of matches) and re-use it if the pattern string has not changed.
2. Use string comparisons for simple common regexes (in 2 above).

The implementation of the Hive LIKE clause uses similar optimizations.
[jira] Assigned: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair reassigned PIG-965:
--------------------------------

Assignee: Ankit Modi

PERFORMANCE: optimize common case in matches (PORegex)
------------------------------------------------------

Key: PIG-965
URL: https://issues.apache.org/jira/browse/PIG-965
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
Fix For: 0.8.0
Attachments: automaton.jar, poregex2.patch

Some frequently seen use cases of the 'matches' comparison operator have the following properties:
1. The rhs is a constant string, e.g. c1 matches 'abc%'.
2. Regexes that look for a matching prefix, suffix, etc. are very common, e.g. 'abc%', '%abc', '%abc%'.

To optimize for these common cases, PORegex.java can be changed to:
1. Compile the pattern (rhs of matches) and re-use it if the pattern string has not changed.
2. Use string comparisons for simple common regexes (in 2 above).

The implementation of the Hive LIKE clause uses similar optimizations.
[jira] Commented: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897451#action_12897451 ]

Richard Ding commented on PIG-1458:
-----------------------------------

The proposal is to run another map-reduce job to merge the small files before the replicated join. This additional job will be added to the MR plan at compile time.

We consider three cases of a replicated join:

# The right input is a map-only job and the input files exist at compile time.
# The right input is a map-only job and the input files do not exist at compile time.
# The right input is a map-reduce job.

For 1., if the number of files exceeds the threshold specified in the property file (_pig.frjoin.merge.files.threshold_), a merge job is added between the right input job and the FR join job.

For 3., if the number of reducers exceeds the threshold specified in the property file (_pig.frjoin.merge.files.threshold_), a merge job is added between the right input job and the FR join job.

For 2., if the flag specified in the property file (_pig.frjoin.merge.files.optimistic_) is false, a merge job is added between the right input job and the FR join job. The default value of this flag is false.

aggregate files for replicated join
-----------------------------------

Key: PIG-1458
URL: https://issues.apache.org/jira/browse/PIG-1458
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
Fix For: 0.8.0

We have noticed that if the smaller data in a replicated join has many files, this puts an unneeded burden on the name node. Pre-aggregating the files can improve the situation.
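The three cases in the PIG-1458 proposal above reduce to a small decision function. The sketch below is in Python and only restates the rules from the comment; the function name and parameters are hypothetical, "exceeds" is read as strictly greater than, and the threshold values used with it are made up since the comment gives no default.

```python
def needs_merge_job(is_map_reduce, files_exist, count, threshold, optimistic=False):
    """Decide whether a merge job goes between the right input job and the FR join.

    is_map_reduce: True if the right input is a map-reduce job (case 3).
    files_exist:   for a map-only right input, whether its input files
                   exist at compile time (case 1) or not (case 2).
    count:         number of input files (case 1) or reducers (case 3);
                   ignored in case 2, where it cannot be known.
    threshold:     value of pig.frjoin.merge.files.threshold.
    optimistic:    value of pig.frjoin.merge.files.optimistic (default false).
    """
    if is_map_reduce:
        return count > threshold   # case 3: compare the number of reducers
    if files_exist:
        return count > threshold   # case 1: compare the number of files
    return not optimistic          # case 2: merge unless told to be optimistic
```

Because the flag defaults to false, case 2 adds the merge job unless the user opts out.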
[jira] Updated: (PIG-103) Shared Job /tmp location should be configurable
[ https://issues.apache.org/jira/browse/PIG-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-103:
----------------------------

Tags: documentation

Shared Job /tmp location should be configurable
-----------------------------------------------

Key: PIG-103
URL: https://issues.apache.org/jira/browse/PIG-103
Project: Pig
Issue Type: Improvement
Components: impl
Environment: Partially shared file:// filesystem (eg NFS)
Reporter: Craig Macdonald
Assignee: niraj rai
Fix For: 0.8.0
Attachments: conf_tmp_dir.patch, conf_tmp_dir_2.patch

Hello,
I'm investigating running Pig in an environment where various parts of the file:// filesystem are available on all nodes. I can tell Hadoop to use a file:// filesystem location for its default by setting fs.default.name=file://path/to/shared/folder
However, this creates issues for Pig, as Pig writes its job information in a folder that it assumes is on a shared FS (eg DFS). In this scenario, /tmp is not shared across the machines.
So /tmp should either be configurable, or Hadoop should tell you the actual full location set in fs.default.name?
A straightforward solution is to make /tmp/ a property in src/org/apache/pig/impl/io/FileLocalizer.java init(PigContext).
Any suggestions of property names?
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897455#action_12897455 ]

Thejas M Nair commented on PIG-1501:
------------------------------------

Why was TFile chosen over SequenceFile? I am wondering if the additional unused features of TFile (index, metadata) result in any overhead compared to SequenceFile.

need to investigate the impact of compression on pig performance
----------------------------------------------------------------

Key: PIG-1501
URL: https://issues.apache.org/jira/browse/PIG-1501
Project: Pig
Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
Fix For: 0.8.0
Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch

We would like to understand how compressing map results as well as reducer output in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation.
[jira] Commented: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897484#action_12897484 ]

Richard Ding commented on PIG-1458:
-----------------------------------

For 1. and 2. above, another approach is to do nothing and rely on MultiFileInputFormat (PIG-1518) to merge small files.

aggregate files for replicated join
-----------------------------------

Key: PIG-1458
URL: https://issues.apache.org/jira/browse/PIG-1458
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
Fix For: 0.8.0

We have noticed that if the smaller data in a replicated join has many files, this puts an unneeded burden on the name node. Pre-aggregating the files can improve the situation.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897493#action_12897493 ]

Yan Zhou commented on PIG-1518:
-------------------------------

Right, map-side cogroup needs the input to be sorted, but only the side inputs need to be able to seek on a key; the base input only needs all duplicate keys to be present in one mapper. I'll mark the side inputs as non-combinable.

multi file input format for loaders
-----------------------------------

Key: PIG-1518
URL: https://issues.apache.org/jira/browse/PIG-1518
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
Fix For: 0.8.0

We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient.
It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible.
There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API.
We at least want to do a feasibility study for Pig 0.8.0.
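To make the idea in PIG-1518 concrete, here is a minimal sketch of packing small files into combined splits while leaving inputs marked non-combinable at one file per split. It is written in Python and is not the Pig or Hadoop implementation. The size cap max_split_bytes, the greedy in-order packing policy, and the function name are all assumptions for illustration; the thread does not say how the real input format chooses what to combine.

```python
def combine_splits(files, max_split_bytes, combinable=True):
    """Group files into splits.

    files: list of (path, size_in_bytes) pairs, in input order.
    Returns a list of splits, each a list of paths.
    """
    if not combinable:
        # e.g. the side inputs of a map-side cogroup: one file per split.
        return [[path] for path, _ in files]

    splits, current, current_size = [], [], 0
    for path, size in files:
        # Close the current split if adding this file would exceed the cap.
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        splits.append(current)
    return splits
```

Each combined split is handled by a single map task, so many small files no longer mean one map per file. A file larger than the cap still gets a split of its own in this sketch.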
[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance
[ https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897496#action_12897496 ]

Yan Zhou commented on PIG-1501:
-------------------------------

Please refer to HADOOP-3315 for an overall SequenceFile vs TFile comparison. It appears that for compressed data, TFile performs better than SequenceFile.

need to investigate the impact of compression on pig performance
----------------------------------------------------------------

Key: PIG-1501
URL: https://issues.apache.org/jira/browse/PIG-1501
Project: Pig
Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
Fix For: 0.8.0
Attachments: compress_perf_data.txt, compress_perf_data_2.txt, PIG-1501.patch

We would like to understand how compressing map results as well as reducer output in a chain of MR jobs impacts performance. We can use PigMix queries for this investigation.
[jira] Created: (PIG-1542) log level not propogated to MR task loggers
log level not propogated to MR task loggers
-------------------------------------------

Key: PIG-1542
URL: https://issues.apache.org/jira/browse/PIG-1542
Project: Pig
Issue Type: Bug
Reporter: Thejas M Nair
Fix For: 0.8.0

Specifying -d DEBUG does not affect the logging of the MR tasks.
This was fixed earlier in PIG-882.
[jira] Updated: (PIG-1448) Detach tuple from inner plans of physical operator
[ https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1448:
------------------------------

Attachment: PIG-1448.1.patch

Detach tuple from inner plans of physical operator
--------------------------------------------------

Key: PIG-1448
URL: https://issues.apache.org/jira/browse/PIG-1448
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
Fix For: 0.8.0
Attachments: multi_oom_filt.pig, PIG-1448.1.patch

This is a follow-up on PIG-1446, which only addresses this general problem for a specific instance of For Each. In general, all the physical operators which can have inner plans are vulnerable to this. A few of them are POLocalRearrange, POFilter, POCollectedGroup, etc. We need to fix all of these.
[jira] Updated: (PIG-1448) Detach tuple from inner plans of physical operator
[ https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1448:
------------------------------

Status: Patch Available (was: Open)

Detach tuple from inner plans of physical operator
--------------------------------------------------

Key: PIG-1448
URL: https://issues.apache.org/jira/browse/PIG-1448
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
Fix For: 0.8.0
Attachments: multi_oom_filt.pig, PIG-1448.1.patch

This is a follow-up on PIG-1446, which only addresses this general problem for a specific instance of For Each. In general, all the physical operators which can have inner plans are vulnerable to this. A few of them are POLocalRearrange, POFilter, POCollectedGroup, etc. We need to fix all of these.
[jira] Updated: (PIG-1541) FR Join shouldn't match null values
[ https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1541:
-----------------------------

Status: Patch Available (was: Open)

FR Join shouldn't match null values
-----------------------------------

Key: PIG-1541
URL: https://issues.apache.org/jira/browse/PIG-1541
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1541.patch

Here is an example. Data input:

{code}
1	1
	2
{code}

The script

{code}
a = load 'input';
b = load 'input';
c = join a by $0, b by $0 using 'repl';
dump c;
{code}

generates results that match null values:

{code}
(1,1,1,1)
(,2,,2)
{code}

The regular join, on the other hand, gives the correct results.
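The behaviour expected in PIG-1541 can be illustrated with a small hash join in which null keys never match. This is a sketch in Python, not the code in PIG-1541.patch. It assumes the replicated input is held in an in-memory hash table keyed on the join column, uses None to stand for a null field, and uses integers where Pig would load untyped values; the function name is hypothetical.

```python
from collections import defaultdict


def replicated_join(left, right, key=0):
    """Join two lists of tuples on column `key`; null (None) keys never match."""
    # Build phase: hash the replicated (right) input, skipping null keys.
    table = defaultdict(list)
    for row in right:
        if row[key] is not None:
            table[row[key]].append(row)

    # Probe phase: rows whose key is null are dropped without a lookup.
    out = []
    for row in left:
        k = row[key]
        if k is None:
            continue
        for match in table.get(k, []):
            out.append(row + match)
    return out


rows = [(1, 1), (None, 2)]   # the example input: second row has a null $0
result = replicated_join(rows, rows)
```

For the example input, result contains only (1, 1, 1, 1); the row with the null key joins with nothing, which is what the regular join produces.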
[jira] Updated: (PIG-1541) FR Join shouldn't match null values
[ https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1541:
-----------------------------

Attachment: PIG-1541.patch

FR Join shouldn't match null values
-----------------------------------

Key: PIG-1541
URL: https://issues.apache.org/jira/browse/PIG-1541
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1541.patch

Here is an example. Data input:

{code}
1	1
	2
{code}

The script

{code}
a = load 'input';
b = load 'input';
c = join a by $0, b by $0 using 'repl';
dump c;
{code}

generates results that match null values:

{code}
(1,1,1,1)
(,2,,2)
{code}

The regular join, on the other hand, gives the correct results.