[jira] Commented: (PIG-1466) Improve log messages for memory usage
[ https://issues.apache.org/jira/browse/PIG-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898296#action_12898296 ]

Thejas M Nair commented on PIG-1466:

bq. It would also be nice to know when GC is called but we can make message to reflect that

Olga,
Are you suggesting that we should log every time the memory manager handler is called, or only when the memory manager invokes GC after spilling enough memory? I am not sure it is useful to log every call to the memory manager handler. Maybe we can log the first time each type of threshold is exceeded, and then every time we actually spill something to disk.

Improve log messages for memory usage

Key: PIG-1466
URL: https://issues.apache.org/jira/browse/PIG-1466
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
Priority: Minor
Fix For: 0.8.0

For anything more than a moderately sized dataset, Pig usually emits the following messages:
{code}
2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Usage threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed = 954466304(932096K) max = 954466304(932096K)
2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: low memory handler called (Collection threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed = 954466304(932096K) max = 954466304(932096K)
{code}
This seems to confuse users a lot. Once these messages are printed, users tend to believe that Pig is having a hard time with memory, is spilling to disk, etc., when in fact Pig might be cruising along at ease. We should be a little more careful about what we print in the logs. Currently these messages are printed when the JVM sends a notification and some other conditions are met, which does not necessarily indicate a low-memory condition. Furthermore, with {{InternalCachedBag}} embraced everywhere in favor of {{DefaultBag}}, these messages have lost their usefulness. At the very least, we should lower the log level at which they are printed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
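The logging policy suggested in the comment above (report the first occurrence of each threshold type, then only actual spills, and keep everything else at a lower level) could be sketched roughly as follows. This is only an illustration: {{SpillLogPolicy}}, {{Threshold}} and {{levelFor}} are hypothetical names that do not exist in Pig, and {{java.util.logging.Level}} stands in for whatever logging API SpillableMemoryManager actually uses.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.logging.Level;

/** Hypothetical helper, not part of Pig: decides how loudly to log a low-memory notification. */
class SpillLogPolicy {
    /** The two notification kinds that appear in the log messages quoted in the issue. */
    enum Threshold { USAGE, COLLECTION }

    // Threshold types that have already been reported once at INFO.
    private final Set<Threshold> alreadyReported = EnumSet.noneOf(Threshold.class);

    /**
     * Returns INFO the first time a threshold type fires and whenever data
     * was actually spilled; otherwise FINE, so routine notifications stay quiet.
     */
    synchronized Level levelFor(Threshold type, boolean spilled) {
        boolean firstTime = alreadyReported.add(type);
        return (firstTime || spilled) ? Level.INFO : Level.FINE;
    }

    public static void main(String[] args) {
        SpillLogPolicy policy = new SpillLogPolicy();
        System.out.println(policy.levelFor(Threshold.USAGE, false));      // first USAGE notification
        System.out.println(policy.levelFor(Threshold.USAGE, false));      // repeat, nothing spilled
        System.out.println(policy.levelFor(Threshold.USAGE, true));       // repeat, but a spill happened
        System.out.println(policy.levelFor(Threshold.COLLECTION, false)); // first COLLECTION notification
    }
}
```

Whether the quiet level should be debug or something else is exactly the open question in this thread; the sketch only shows that the decision can be isolated in one small place.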
[jira] Commented: (PIG-1404) PigUnit - Pig script testing simplified.
[ https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898382#action_12898382 ]

Alan Gates commented on PIG-1404:

Unless there are any objections, I'm going to commit the latest patch and the doc patch in the next day or two. I want to get it in in time for the 0.8 branch.

PigUnit - Pig script testing simplified.

Key: PIG-1404
URL: https://issues.apache.org/jira/browse/PIG-1404
Project: Pig
Issue Type: New Feature
Reporter: Romain Rigaux
Assignee: Romain Rigaux
Fix For: 0.8.0
Attachments: commons-lang-2.4.jar, PIG-1404-2.patch, PIG-1404-3-doc.patch, PIG-1404-3.patch, PIG-1404-4-doc.patch, PIG-1404-4.patch, PIG-1404.patch

The goal is to provide a simple xUnit framework that enables our Pig scripts to be easily:
- unit tested
- regression tested
- quickly prototyped

No cluster set up is required.

For example:

TestCase
{code}
@Test
public void testTop3Queries() {
  // Parameters substituted into the script ($n).
  String[] args = {
    "n=3",
  };
  test = new PigTest("top_queries.pig", args);

  // Tab-separated rows that replace the data loaded by alias "data".
  String[] input = {
    "yahoo\t10",
    "twitter\t7",
    "facebook\t10",
    "yahoo\t15",
    "facebook\t5",
  };

  // Expected content of alias "queries_limit".
  String[] output = {
    "(yahoo,25L)",
    "(facebook,15L)",
    "(twitter,7L)",
  };

  test.assertOutput("data", input, "queries_limit", output);
}
{code}

top_queries.pig
{code}
data = LOAD '$input' AS (query:CHARARRAY, count:INT);
...
queries_sum = FOREACH queries_group GENERATE group AS query, SUM(queries.count) AS count;
...
queries_limit = LIMIT queries_ordered $n;
STORE queries_limit INTO '$output';
{code}

There are 3 modes:
* LOCAL (if pigunit.exectype.local properties is present)
* MAPREDUCE (use the cluster specified in the classpath, same as HADOOP_CONF_DIR)
** automatic mini cluster (is the default and the HADOOP_CONF_DIR to have in the class path will be: ~/pigtest/conf)
** pointing to an existing cluster (if pigunit.exectype.cluster properties is present)

For now, it would be nice to see how this idea could be integrated in Piggybank and if PigParser/PigServer could improve their interfaces in order to make PigUnit simple.

Other components based on PigUnit could be built later:
- standalone MiniCluster
- notion of workspaces for each test
- standalone utility that reads test configuration and generates a test report...

It is a first prototype, open to suggestions, and can definitely benefit from feedback.

How to test, in pig_trunk:
{code}
Apply patch
$pig_trunk ant compile-test
$pig_trunk ant
$pig_trunk/contrib/piggybank/java ant test -Dtest.timeout=99
{code}
(It takes 15 min in the MAPREDUCE minicluster; tests will need to be split in the future between 'unit' and 'integration'.)

Many examples are in:
{code}
contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/pigunit/TestPigTest.java
{code}

When used as a standalone, do not forget commons-lang-2.4.jar and the HADOOP_CONF_DIR to your cluster in your CLASSPATH.
[jira] Updated: (PIG-1520) Remove Owl from Pig contrib
[ https://issues.apache.org/jira/browse/PIG-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1520:

Status: Resolved (was: Patch Available)
Resolution: Fixed

Patch committed.

Remove Owl from Pig contrib

Key: PIG-1520
URL: https://issues.apache.org/jira/browse/PIG-1520
Project: Pig
Issue Type: Task
Components: impl
Affects Versions: 0.8.0
Reporter: Alan Gates
Assignee: Alan Gates
Fix For: 0.8.0
Attachments: PIG-1520.patch

Yahoo has transitioned work on Owl to Howl (which will not be a Pig contrib project). Since no one else is working on Owl and there will be no one to support it, we should remove it from our contrib before releasing 0.8.
[jira] Updated: (PIG-1541) FR Join shouldn't match null values
[ https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1541:

Attachment: PIG-1541_1.patch

New patch to address the general case where the join key is a tuple.

FR Join shouldn't match null values

Key: PIG-1541
URL: https://issues.apache.org/jira/browse/PIG-1541
Project: Pig
Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1541.patch, PIG-1541_1.patch

Here is an example.

Data input (tab-separated; the first field of the second row is empty, i.e. null):
{code}
1	1
	2
{code}

The script
{code}
a = load 'input';
b = load 'input';
c = join a by $0, b by $0 using 'repl';
dump c;
{code}
generates results that match null values:
{code}
(1,1,1,1)
(,2,,2)
{code}

The regular join, on the other hand, gives the correct results.
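The expected behavior can be illustrated with a small in-memory hash join in which a null key never matches anything, not even another null. This is a sketch of the semantics only, not the code in the attached patches: {{NullSafeReplicatedJoin}} is a hypothetical class, rows are modeled as plain {{String[]}} arrays, and the join key is assumed to be column 0.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a replicated (map-side) join that never matches null keys. */
class NullSafeReplicatedJoin {
    /** Joins on column 0; the replicated side is assumed small enough to hold in memory. */
    static List<String[]> join(List<String[]> left, List<String[]> replicated) {
        // Build phase: index the replicated input by key, skipping rows whose key is null.
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] row : replicated) {
            if (row[0] == null) continue;
            index.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        // Probe phase: a null key on the left side produces no output either.
        List<String[]> out = new ArrayList<>();
        for (String[] l : left) {
            if (l[0] == null) continue;
            for (String[] r : index.getOrDefault(l[0], Collections.<String[]>emptyList())) {
                String[] joined = new String[l.length + r.length];
                System.arraycopy(l, 0, joined, 0, l.length);
                System.arraycopy(r, 0, joined, l.length, r.length);
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> input = new ArrayList<>();
        input.add(new String[] {"1", "1"});
        input.add(new String[] {null, "2"});
        for (String[] row : join(input, input)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

With the two-row input from the issue, only the row with key 1 survives the join, which is what the regular join produces.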
[jira] Commented: (PIG-1448) Detach tuple from inner plans of physical operator
[ https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898404#action_12898404 ]

Thejas M Nair commented on PIG-1448:

All tests are successful. Patch is ready for review.

Detach tuple from inner plans of physical operator

Key: PIG-1448
URL: https://issues.apache.org/jira/browse/PIG-1448
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
Fix For: 0.8.0
Attachments: multi_oom_filt.pig, PIG-1448.1.patch

This is a follow-up on PIG-1446, which only addresses this general problem for a specific instance of For Each. In general, all the physical operators which can have inner plans are vulnerable to this. A few of them are POLocalRearrange, POFilter, POCollectedGroup, etc. We need to fix all of these.
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-965:

Status: Resolved (was: Patch Available)
Resolution: Fixed

Committed to trunk.

PERFORMANCE: optimize common case in matches (PORegex)

Key: PIG-965
URL: https://issues.apache.org/jira/browse/PIG-965
Project: Pig
Issue Type: Improvement
Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
Fix For: 0.8.0
Attachments: automaton.jar, poregex2.patch

Some frequently seen use cases of the 'matches' comparison operator have the following properties:
1. The rhs is a constant string, e.g. c1 matches 'abc%'
2. Regexes that look for a matching prefix, suffix, etc. are very common, e.g. 'abc%', '%abc', '%abc%'

To optimize for these common cases, PORegex.java can be changed to:
1. Compile the pattern (rhs of matches) and re-use it if the pattern string has not changed.
2. Use string comparisons for simple common regexes (in 2 above).

The implementation of Hive's LIKE clause uses similar optimizations.
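The two optimizations listed in the issue can be sketched as below. This is not the committed PORegex change (which, judging by the attachment, relies on an automaton library); {{SimpleMatchOptimizer}} is a hypothetical class, and it assumes Java regex syntax, so the prefix, suffix and substring cases are written as {{abc.*}}, {{.*abc}} and {{.*abc.*}} rather than with the {{%}} notation used in the description.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: cache the compiled pattern and shortcut trivial regexes. */
class SimpleMatchOptimizer {
    private String cachedRegex;     // pattern string seen on the previous call
    private Pattern cachedPattern;  // its compiled form

    boolean matches(String input, String regex) {
        // Peel off a leading and/or trailing ".*" and see whether a plain literal remains.
        String body = regex;
        boolean openStart = body.startsWith(".*");
        if (openStart) body = body.substring(2);
        boolean openEnd = body.endsWith(".*");
        if (openEnd) body = body.substring(0, body.length() - 2);

        // Fast path: plain string comparison. Skipped when the input contains a line
        // terminator, because '.' does not match those by default in java.util.regex.
        if (isPlainLiteral(body) && !hasLineTerminator(input)) {
            if (openStart && openEnd) return input.contains(body);
            if (openStart) return input.endsWith(body);
            if (openEnd) return input.startsWith(body);
            return input.equals(body);
        }

        // Slow path: recompile only when the pattern string has changed.
        if (!regex.equals(cachedRegex)) {
            cachedPattern = Pattern.compile(regex);
            cachedRegex = regex;
        }
        return cachedPattern.matcher(input).matches();
    }

    /** Conservative check: only letters and digits count as literal text. */
    private static boolean isPlainLiteral(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetterOrDigit(s.charAt(i))) return false;
        }
        return true;
    }

    private static boolean hasLineTerminator(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r' || c == '\u0085' || c == '\u2028' || c == '\u2029') return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SimpleMatchOptimizer m = new SimpleMatchOptimizer();
        System.out.println(m.matches("abcdef", "abc.*"));   // prefix case, string comparison
        System.out.println(m.matches("a1", "a\\d"));        // general case, compiled regex
    }
}
```

The literal check is deliberately conservative: anything that might be a regex metacharacter falls through to the compiled pattern, so the fast path can only be taken when it gives the same answer as the regex.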
[jira] Updated: (PIG-1392) Parser fails to recognize valid field
[ https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

niraj rai updated PIG-1392:

Status: Patch Available (was: Open)
Release Note: The issue is fixed, but while fixing it I encountered another problem: "can't open iterator for C". Created jira# PIG-1545. There was also an issue in the secondary optimizer, where it calls system.setProperties to set pig.exec.nosecondarykey. I changed it to use pigContext properties.

Parser fails to recognize valid field

Key: PIG-1392
URL: https://issues.apache.org/jira/browse/PIG-1392
Project: Pig
Issue Type: Bug
Reporter: Ankur
Assignee: niraj rai
Fix For: 0.8.0

Using the script below, the parser fails to recognize a valid field in the relation and throws an error:

A = LOAD '/tmp' as (a:int, b:chararray, c:int);
B = GROUP A BY (a, b);
C = FOREACH B { bg = A.(b,c); GENERATE group, bg; } ;

The error thrown is:

2010-04-23 10:16:20,610 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: c in {group: (a: int,b: chararray),A: {a: int,b: chararray,c: int}}
[jira] Updated: (PIG-1392) Parser fails to recognize valid field
[ https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

niraj rai updated PIG-1392:

Attachment: nested_parser.patch
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1295:

Attachment: PIG-1295_0.16.patch

Discussed with Alan; we agreed to put getRawComparator into TupleFactory. Attached PIG-1295_0.16.patch.

Binary comparator for secondary sort

Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
Fix For: 0.8.0
Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.16.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch

When the Hadoop framework does the sorting, it will try to use a binary version of the comparator if one is available. The benefit of a binary comparator is that we do not need to instantiate the objects before we compare them. We see a ~30% speedup after we switch to a binary comparator. Currently, Pig uses a binary comparator in the following cases:

1. When the semantics of the order do not matter. For example, in distinct, we need to sort in order to filter out duplicate values; however, we do not care how the comparator orders the keys. Group by shares this characteristic. In this case, we rely on Hadoop's default binary comparator.
2. When the semantics of the order matter, but the key is of a simple type. In this case, we have implementations for simple types such as integer, long, float, chararray, databytearray, string.

However, if the key is a tuple and the sort semantics matter, we do not have a binary comparator implementation. This especially matters when we switch to secondary sort. In secondary sort, we convert the inner sort of a nested foreach into the secondary key and rely on Hadoop to sort on both the main key and the secondary key. The sorting key becomes a two-item tuple. Since the secondary key is the sorting key of the nested foreach, the sorting semantics matter. It turns out we do not have a binary comparator once we use secondary sort, and we see a significant slowdown.

A binary comparator for tuples should be doable once we understand the binary structure of the serialized tuple. We can focus on the most common use case first, which is a group by followed by a nested sort. In this case, we will use secondary sort. The semantics of the first key do not matter, but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself. Then, if the first keys are equal, we use a binary comparator to compare the secondary keys. The secondary key can also be a complex data type, but as a first step we focus on simple secondary keys, which is the most common use case.

We mark this issue as a candidate project for the Google Summer of Code 2010 program.
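The idea described above, finding the boundary between main key and secondary key in the serialized bytes and comparing without instantiating a tuple, might look roughly like this. The byte layout is an assumption made purely for illustration (a 4-byte big-endian length, the main key bytes, then a 4-byte int secondary key); it is not Pig's actual tuple serialization, and {{TwoKeyRawComparator}} and {{encode}} are hypothetical names. Hadoop's {{RawComparator}} interface additionally passes offsets and lengths, which are left out here to keep the sketch short.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a binary comparator over (main key, secondary key) records. */
class TwoKeyRawComparator {
    /** Compares two serialized records without deserializing them into objects. */
    static int compare(byte[] a, byte[] b) {
        ByteBuffer bufA = ByteBuffer.wrap(a);
        ByteBuffer bufB = ByteBuffer.wrap(b);
        int lenA = bufA.getInt(0);  // length prefix marks where the main key ends
        int lenB = bufB.getInt(0);

        // Main key: any consistent order will do for grouping, so compare raw unsigned bytes.
        int common = Math.min(lenA, lenB);
        for (int i = 0; i < common; i++) {
            int diff = (a[4 + i] & 0xff) - (b[4 + i] & 0xff);
            if (diff != 0) return diff;
        }
        if (lenA != lenB) return lenA - lenB;

        // Main keys are equal: the secondary key order matters, so compare it as a signed int.
        return Integer.compare(bufA.getInt(4 + lenA), bufB.getInt(4 + lenB));
    }

    /** Test helper that produces the assumed layout. */
    static byte[] encode(String mainKey, int secondaryKey) {
        byte[] key = mainKey.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + key.length + 4)
                .putInt(key.length)
                .put(key)
                .putInt(secondaryKey)
                .array();
    }

    public static void main(String[] args) {
        System.out.println(compare(encode("yahoo", 10), encode("yahoo", 15)) < 0);
    }
}
```

Note the asymmetry the issue describes: the main key is compared byte-wise because only equality matters for grouping, while the secondary key has to be interpreted according to its type (a raw byte comparison would order negative integers after positive ones).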
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1295:

Status: Open (was: Patch Available)
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1295:

Status: Patch Available (was: Open)
[jira] Commented: (PIG-1448) Detach tuple from inner plans of physical operator
[ https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898450#action_12898450 ]

Richard Ding commented on PIG-1448:

+1. Looks good.
[jira] Commented: (PIG-348) -j command line option doesn't work
[ https://issues.apache.org/jira/browse/PIG-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898464#action_12898464 ]

Corinne Chandel commented on PIG-348:

Olga - We document the help command, but not the output generated when the help is issued (the list of Pig commands). So, there's nothing to update in the docs.

Thanks/C

-j command line option doesn't work

Key: PIG-348
URL: https://issues.apache.org/jira/browse/PIG-348
Project: Pig
Issue Type: Improvement
Components: documentation
Reporter: Amir Youssefi
Assignee: Corinne Chandel
Fix For: 0.8.0
Attachments: PIG-348.path, PIG-348_1.patch

According to:
$ pig --help
...
-j, -jar jarfile load jarfile
...
yet
$pig -j my.jar
doesn't work in place of:
register my.jar
in Pig script.
[jira] Updated: (PIG-1448) Detach tuple from inner plans of physical operator
[ https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-1448:

Status: Resolved (was: Patch Available)
Resolution: Fixed

Patch committed to trunk.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898490#action_12898490 ]

Yan Zhou commented on PIG-1518:

There is a bigger question at hand. The semantics of OrderedLoadFunc is that the splits are totally ordered. BinStorage, InterStorage and PigStorage all implement that interface through FileInputLoadFunc. Since the combination of splits as conceived here will definitely destroy the split ordering, if the combination is disabled for these storages, the feature would be virtually useless for a majority of use cases.

On the other hand, I'm seeing no use of the comparison capability except for MergeJoinIndexer's getNext() method, which makes me wonder if OrderedLoadFunc can be removed from FileInputLoadFunc. Semantically, FileInputLoadFunc should not support the ordering of splits, as Hadoop's FileInputFormat doesn't. When a need arises, as in MergeJoinIndexer, we can add that extension on.

But the change may incur some backward compatibility issues. I'm now soliciting comments in this area.

multi file input format for loaders

Key: PIG-1518
URL: https://issues.apache.org/jira/browse/PIG-1518
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
Fix For: 0.8.0

We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which can be very inefficient.

It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible.

There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API.

We at least want to do a feasibility study for Pig 0.8.0.
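To make the discussion concrete, combining small inputs into fewer splits can be sketched as a simple sequential packing of file sizes under a size cap. This is not the design proposed for PIG-1518 nor Hadoop's own combining input format: {{SplitCombiner}}, {{combine}} and the {{maxSplitBytes}} parameter are hypothetical, and a real implementation would also have to consider block locality and, as the comment above points out, the split ordering that OrderedLoadFunc promises.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: pack small files into combined splits of bounded size. */
class SplitCombiner {
    /**
     * Walks the files in their given order and starts a new combined split whenever
     * adding the next file would exceed maxSplitBytes. A file larger than the cap
     * ends up alone in its own split.
     */
    static List<List<Long>> combine(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) splits.add(current);
        return splits;
    }

    public static void main(String[] args) {
        // Five input files; one map task per combined split instead of one per file.
        System.out.println(combine(Arrays.asList(10L, 20L, 30L, 100L, 5L), 64L));
    }
}
```

Each combined split would then be handed to a single map task, which is where the saving for many small files comes from.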
[jira] Commented: (PIG-1404) PigUnit - Pig script testing simplified.
[ https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898509#action_12898509 ]

Romain Rigaux commented on PIG-1404:

In the latest patch, PigUnit is stored into 'pigunit' at the root of the project and a pigunit.jar is created. Some people would prefer to have it in test or maybe piggybank.

Should we move it or keep it like this?
[jira] Work started: (PIG-1404) PigUnit - Pig script testing simplified.
[ https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on PIG-1404 started by Romain Rigaux.
[jira] Work stopped: (PIG-1404) PigUnit - Pig script testing simplified.
[ https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on PIG-1404 stopped by Romain Rigaux.