[jira] Commented: (PIG-1466) Improve log messages for memory usage

2010-08-13 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898296#action_12898296
 ] 

Thejas M Nair commented on PIG-1466:


bq. It would also be nice to know when GC is called but we can make message to 
reflect that 
Olga, Are you suggesting that we should log everytime the memory manager 
handler is called or when the memory manager invokes GC after spilling enough 
memory ?
I am not sure if it is  useful to log every call to the memory manager handler, 
maybe we can log the first time for each type of threshold has been exceeded 
and then every time we actually spill something to disk.


 Improve log messages for memory usage
 -

 Key: PIG-1466
 URL: https://issues.apache.org/jira/browse/PIG-1466
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
Priority: Minor
 Fix For: 0.8.0


 For anything more then a moderately sized dataset Pig usually spits following 
 messages:
 {code}
 2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: 
 low memory handler called (Usage
 threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed 
 = 954466304(932096K) max =
 954466304(932096K)
 2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: 
 low memory handler called (Collection
 threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed 
 = 954466304(932096K) max =
 954466304(932096K)
 {code}
 This seems to confuse users a lot. Once these messages are printed, users 
 tend to believe that Pig is having hard time with memory, is spilling to disk 
 etc. but in fact Pig might be cruising along at ease. We should be little 
 more careful what to print in logs. Currently these are printed when a 
 notification is sent by JVM and some other conditions are met which may not 
 necessarily indicate low memory condition. Furthermore, with 
 {{InternalCachedBag}} embraced everywhere in favor of {{DefaultBag}}, these 
 messages have lost their usefulness. At the every least, we should lower the 
 log level at which these are printed. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1404) PigUnit - Pig script testing simplified.

2010-08-13 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898382#action_12898382
 ] 

Alan Gates commented on PIG-1404:
-

Unless there's any objections I'm going to commit the latest patch and the doc 
patch in the next day or two.  I want to get it in in time for the 0.8 branch.

 PigUnit - Pig script testing simplified. 
 -

 Key: PIG-1404
 URL: https://issues.apache.org/jira/browse/PIG-1404
 Project: Pig
  Issue Type: New Feature
Reporter: Romain Rigaux
Assignee: Romain Rigaux
 Fix For: 0.8.0

 Attachments: commons-lang-2.4.jar, PIG-1404-2.patch, 
 PIG-1404-3-doc.patch, PIG-1404-3.patch, PIG-1404-4-doc.patch, 
 PIG-1404-4.patch, PIG-1404.patch


 The goal is to provide a simple xUnit framework that enables our Pig scripts 
 to be easily:
   - unit tested
   - regression tested
   - quickly prototyped
 No cluster set up is required.
 For example:
 TestCase
 {code}
   @Test
   public void testTop3Queries() {
 String[] args = {
 n=3,
 };
 test = new PigTest(top_queries.pig, args);
 String[] input = {
 yahoo\t10,
 twitter\t7,
 facebook\t10,
 yahoo\t15,
 facebook\t5,
 
 };
 String[] output = {
 (yahoo,25L),
 (facebook,15L),
 (twitter,7L),
 };
 test.assertOutput(data, input, queries_limit, output);
   }
 {code}
 top_queries.pig
 {code}
 data =
 LOAD '$input'
 AS (query:CHARARRAY, count:INT);
  
 ... 
 
 queries_sum = 
 FOREACH queries_group 
 GENERATE 
 group AS query, 
 SUM(queries.count) AS count;
 
 ...
 
 queries_limit = LIMIT queries_ordered $n;
 STORE queries_limit INTO '$output';
 {code}
 They are 3 modes:
 * LOCAL (if pigunit.exectype.local properties is present)
 * MAPREDUCE (use the cluster specified in the classpath, same as 
 HADOOP_CONF_DIR)
 ** automatic mini cluster (is the default and the HADOOP_CONF_DIR to have in 
 the class path will be: ~/pigtest/conf)
 ** pointing to an existing cluster (if pigunit.exectype.cluster properties 
 is present)
 For now, it would be nice to see how this idea could be integrated in 
 Piggybank and if PigParser/PigServer could improve their interfaces in order 
 to make PigUnit simple.
 Other components based on PigUnit could be built later:
   - standalone MiniCluster
   - notion of workspaces for each test
   - standalone utility that reads test configuration and generates a test 
 report...
 It is a first prototype, open to suggestions and can definitely take 
 advantage of feedbacks.
 How to test, in pig_trunk:
 {code}
 Apply patch
 $pig_trunk ant compile-test
 $pig_trunk ant
 $pig_trunk/contrib/piggybank/java ant test -Dtest.timeout=99
 {code}
 (it takes 15 min in MAPREDUCE minicluster, tests will need to be split in the 
 future between 'unit' and 'integration')
 Many examples are in:
 {code}
 contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/pigunit/TestPigTest.java
 {code}
 When used as a standalone, do not forget commons-lang-2.4.jar and the 
 HADOOP_CONF_DIR to your cluster in your CLASSPATH.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1520) Remove Owl from Pig contrib

2010-08-13 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1520:


Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed.

 Remove Owl from Pig contrib
 ---

 Key: PIG-1520
 URL: https://issues.apache.org/jira/browse/PIG-1520
 Project: Pig
  Issue Type: Task
  Components: impl
Affects Versions: 0.8.0
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.8.0

 Attachments: PIG-1520.patch


 Yahoo has transitioned work on Owl to Howl (which will not be a Pig contrib 
 project).  Since no one else is working on Owl and there will be no one to 
 support it we should remove it from our contrib before releasing 0.8.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1541) FR Join shouldn't match null values

2010-08-13 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1541:
--

Attachment: PIG-1541_1.patch

New patch to address the general case where the join key is tuple.

 FR Join shouldn't match null values
 ---

 Key: PIG-1541
 URL: https://issues.apache.org/jira/browse/PIG-1541
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1541.patch, PIG-1541_1.patch


 Here is an example:
 Data input:
 {code}
 1   1
 2
 {code}
 the script 
 {code}
 a = load 'input';
 b = load 'input';
 c = join a by $0, b by $0 using 'repl';
 dump c; 
 {code}
 generates results that matches null values:
 {code}
 (1,1,1,1)
 (,2,,2)
 {code}
 The regular join, on the other hand, gives the correct results.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1448) Detach tuple from inner plans of physical operator

2010-08-13 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898404#action_12898404
 ] 

Thejas M Nair commented on PIG-1448:


All tests are successful. Patch is ready for review.


 Detach tuple from inner plans of physical operator 
 ---

 Key: PIG-1448
 URL: https://issues.apache.org/jira/browse/PIG-1448
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: multi_oom_filt.pig, PIG-1448.1.patch


 This is a follow-up on PIG-1446 which only addresses this general problem for 
 a specific instance of For Each. In general, all the physical operators which 
 can have inner plans are vulnerable to this. Few of them include 
 POLocalRearrange, POFilter, POCollectedGroup etc.  Need to fix all of these.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2010-08-13 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-965:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Committed to trunk.


 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Fix For: 0.8.0

 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of 'matches' comparison operator have follow 
 properties -
 1. The rhs is a constant string . eg c1 matches 'abc%' 
 2. Regexes such that look for matching prefix , suffix etc are very common. 
 eg - abc%', %abc, '%abc%' 
 To optimize for these common cases , PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) re-use it if the pattern string has 
 not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of Hive like clause uses similar optimizations.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1392) Parser fails to recognize valid field

2010-08-13 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1392:
---

  Status: Patch Available  (was: Open)
Release Note: 
The issue is fixed but while fixing this, I encountered another problem of 
can't open iterator for C. Created jira# PIG-1545. There was also issue in the 
secondary optimizer, where it calls system.setProperties to set the 
pig.exec.nosecondarykey . I changed to use pigContext properties.


 Parser fails to recognize valid field
 -

 Key: PIG-1392
 URL: https://issues.apache.org/jira/browse/PIG-1392
 Project: Pig
  Issue Type: Bug
Reporter: Ankur
Assignee: niraj rai
 Fix For: 0.8.0


 Using this script below, parser fails to recognize a valid field in the 
 relation and throws error
 A = LOAD '/tmp' as (a:int, b:chararray, c:int);
 B = GROUP A BY (a, b);
 C = FOREACH B { bg = A.(b,c); GENERATE group, bg; } ;
 The error thrown is
 2010-04-23 10:16:20,610 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Invalid alias: c in {group: (a: int,b: 
 chararray),A: {a: int,b: chararray,c: int}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1392) Parser fails to recognize valid field

2010-08-13 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1392:
---

Attachment: nested_parser.patch

 Parser fails to recognize valid field
 -

 Key: PIG-1392
 URL: https://issues.apache.org/jira/browse/PIG-1392
 Project: Pig
  Issue Type: Bug
Reporter: Ankur
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: nested_parser.patch


 Using this script below, parser fails to recognize a valid field in the 
 relation and throws error
 A = LOAD '/tmp' as (a:int, b:chararray, c:int);
 B = GROUP A BY (a, b);
 C = FOREACH B { bg = A.(b,c); GENERATE group, bg; } ;
 The error thrown is
 2010-04-23 10:16:20,610 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Invalid alias: c in {group: (a: int,b: 
 chararray),A: {a: int,b: chararray,c: int}}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-13 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1295:


Attachment: PIG-1295_0.16.patch

Discuss with Alan, we agree to put getRawComparator into TupleFactory. Attach 
PIG-1295_0.16.patch. 

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
 PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, 
 PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.16.patch, 
 PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, 
 PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, 
 PIG-1295_0.8.patch, PIG-1295_0.9.patch


 When hadoop framework doing the sorting, it will try to use binary version of 
 comparator if available. The benefit of binary comparator is we do not need 
 to instantiate the object before we compare. We see a ~30% speedup after we 
 switch to binary comparator. Currently, Pig use binary comparator in 
 following case:
 1. When semantics of order doesn't matter. For example, in distinct, we need 
 to do a sort in order to filter out duplicate values; however, we do not care 
 how comparator sort keys. Groupby also share this character. In this case, we 
 rely on hadoop's default binary comparator
 2. Semantics of order matter, but the key is of simple type. In this case, we 
 have implementation for simple types, such as integer, long, float, 
 chararray, databytearray, string
 However, if the key is a tuple and the sort semantics matters, we do not have 
 a binary comparator implementation. This especially matters when we switch to 
 use secondary sort. In secondary sort, we convert the inner sort of nested 
 foreach into the secondary key and rely on hadoop to sorting on both main key 
 and secondary key. The sorting key will become a two items tuple. Since the 
 secondary key the sorting key of the nested foreach, so the sorting semantics 
 matters. It turns out we do not have binary comparator once we use secondary 
 sort, and we see a significant slow down.
 Binary comparator for tuple should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on most common use cases 
 first, which is group by followed by a nested sort. In this case, we will 
 use secondary sort. Semantics of the first key does not matter but semantics 
 of secondary key matters. We need to identify the boundary of main key and 
 secondary key in the binary tuple buffer without instantiate tuple itself. 
 Then if the first key equals, we use a binary comparator to compare secondary 
 key. Secondary key can also be a complex data type, but for the first step, 
 we focus on simple secondary key, which is the most common use case.
 We mark this issue to be a candidate project for Google summer of code 2010 
 program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-13 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1295:


Status: Open  (was: Patch Available)

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
 PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, 
 PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.16.patch, 
 PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, 
 PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, 
 PIG-1295_0.8.patch, PIG-1295_0.9.patch


 When hadoop framework doing the sorting, it will try to use binary version of 
 comparator if available. The benefit of binary comparator is we do not need 
 to instantiate the object before we compare. We see a ~30% speedup after we 
 switch to binary comparator. Currently, Pig use binary comparator in 
 following case:
 1. When semantics of order doesn't matter. For example, in distinct, we need 
 to do a sort in order to filter out duplicate values; however, we do not care 
 how comparator sort keys. Groupby also share this character. In this case, we 
 rely on hadoop's default binary comparator
 2. Semantics of order matter, but the key is of simple type. In this case, we 
 have implementation for simple types, such as integer, long, float, 
 chararray, databytearray, string
 However, if the key is a tuple and the sort semantics matters, we do not have 
 a binary comparator implementation. This especially matters when we switch to 
 use secondary sort. In secondary sort, we convert the inner sort of nested 
 foreach into the secondary key and rely on hadoop to sorting on both main key 
 and secondary key. The sorting key will become a two items tuple. Since the 
 secondary key the sorting key of the nested foreach, so the sorting semantics 
 matters. It turns out we do not have binary comparator once we use secondary 
 sort, and we see a significant slow down.
 Binary comparator for tuple should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on most common use cases 
 first, which is group by followed by a nested sort. In this case, we will 
 use secondary sort. Semantics of the first key does not matter but semantics 
 of secondary key matters. We need to identify the boundary of main key and 
 secondary key in the binary tuple buffer without instantiate tuple itself. 
 Then if the first key equals, we use a binary comparator to compare secondary 
 key. Secondary key can also be a complex data type, but for the first step, 
 we focus on simple secondary key, which is the most common use case.
 We mark this issue to be a candidate project for Google summer of code 2010 
 program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-13 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1295:


Status: Patch Available  (was: Open)

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
 PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, 
 PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.16.patch, 
 PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, 
 PIG-1295_0.5.patch, PIG-1295_0.6.patch, PIG-1295_0.7.patch, 
 PIG-1295_0.8.patch, PIG-1295_0.9.patch


 When hadoop framework doing the sorting, it will try to use binary version of 
 comparator if available. The benefit of binary comparator is we do not need 
 to instantiate the object before we compare. We see a ~30% speedup after we 
 switch to binary comparator. Currently, Pig use binary comparator in 
 following case:
 1. When semantics of order doesn't matter. For example, in distinct, we need 
 to do a sort in order to filter out duplicate values; however, we do not care 
 how comparator sort keys. Groupby also share this character. In this case, we 
 rely on hadoop's default binary comparator
 2. Semantics of order matter, but the key is of simple type. In this case, we 
 have implementation for simple types, such as integer, long, float, 
 chararray, databytearray, string
 However, if the key is a tuple and the sort semantics matters, we do not have 
 a binary comparator implementation. This especially matters when we switch to 
 use secondary sort. In secondary sort, we convert the inner sort of nested 
 foreach into the secondary key and rely on hadoop to sorting on both main key 
 and secondary key. The sorting key will become a two items tuple. Since the 
 secondary key the sorting key of the nested foreach, so the sorting semantics 
 matters. It turns out we do not have binary comparator once we use secondary 
 sort, and we see a significant slow down.
 Binary comparator for tuple should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on most common use cases 
 first, which is group by followed by a nested sort. In this case, we will 
 use secondary sort. Semantics of the first key does not matter but semantics 
 of secondary key matters. We need to identify the boundary of main key and 
 secondary key in the binary tuple buffer without instantiate tuple itself. 
 Then if the first key equals, we use a binary comparator to compare secondary 
 key. Secondary key can also be a complex data type, but for the first step, 
 we focus on simple secondary key, which is the most common use case.
 We mark this issue to be a candidate project for Google summer of code 2010 
 program. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1448) Detach tuple from inner plans of physical operator

2010-08-13 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898450#action_12898450
 ] 

Richard Ding commented on PIG-1448:
---

+1. Looks good.

 Detach tuple from inner plans of physical operator 
 ---

 Key: PIG-1448
 URL: https://issues.apache.org/jira/browse/PIG-1448
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: multi_oom_filt.pig, PIG-1448.1.patch


 This is a follow-up on PIG-1446 which only addresses this general problem for 
 a specific instance of For Each. In general, all the physical operators which 
 can have inner plans are vulnerable to this. Few of them include 
 POLocalRearrange, POFilter, POCollectedGroup etc.  Need to fix all of these.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-348) -j command line option doesn't work

2010-08-13 Thread Corinne Chandel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898464#action_12898464
 ] 

Corinne Chandel commented on PIG-348:
-

Olga - We document the help command, but not the output generated when the help 
is issued (the list of Pig commands). So, there's nothing to update in the 
docs.  Thanks/C

 -j command line option doesn't work
 ---

 Key: PIG-348
 URL: https://issues.apache.org/jira/browse/PIG-348
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Reporter: Amir Youssefi
Assignee: Corinne Chandel
 Fix For: 0.8.0

 Attachments: PIG-348.path, PIG-348_1.patch


 According to:
 $ pig --help 
 ...
 -j, -jar jarfile load jarfile
 ...
 yet 
 $pig -j my.jar
 doesn't work in place of:
 register my.jar 
 in Pig script. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1448) Detach tuple from inner plans of physical operator

2010-08-13 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1448:
---

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed to trunk.


 Detach tuple from inner plans of physical operator 
 ---

 Key: PIG-1448
 URL: https://issues.apache.org/jira/browse/PIG-1448
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: multi_oom_filt.pig, PIG-1448.1.patch


 This is a follow-up on PIG-1446 which only addresses this general problem for 
 a specific instance of For Each. In general, all the physical operators which 
 can have inner plans are vulnerable to this. Few of them include 
 POLocalRearrange, POFilter, POCollectedGroup etc.  Need to fix all of these.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-13 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898490#action_12898490
 ] 

Yan Zhou commented on PIG-1518:
---

There is a bigger question at hand. The semantics of OrderedLoadFunc is that 
the splits are totally ordered. And BinStorage, InterStorage and PigStorage all 
implement that interface through FileInputLoadFunc. Since the combination of 
splits as conceived here will definitely destroy the split ordering, if the 
combination is disabled for these storages, the feature would be virtually 
useless for a majority of use cases.

On the other hand, I'm seeing no use of the comparison capability except for 
MergeJoinIndexer's getNext() method, which makes me wonder if the 
OrderedLoadFunc can be removed from the FileInputLoadFunc.  Semantically, 
FileInputLoadFunc should not support the ordering of splits, as Hadoop's 
FileInputFormat doesn't. When a need arises like in MergeJoinIndexer, we can 
add that extension on. But the change may incur some backward compatibility 
issues.
I'm now soliciting comments in this area.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0


 We frequently run in the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file which 
 could be very inefficient. 
 It would be greate to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; howevere, neither works 
 with ne Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1404) PigUnit - Pig script testing simplified.

2010-08-13 Thread Romain Rigaux (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898509#action_12898509
 ] 

Romain Rigaux commented on PIG-1404:


In the latest patch, PigUnit is stored into 'pigunit' at the root of the 
project and a pigunit.jar is created. 

Some people would prefer to have it in test or maybe piggybank. Should we move 
it or keep it like this?

 PigUnit - Pig script testing simplified. 
 -

 Key: PIG-1404
 URL: https://issues.apache.org/jira/browse/PIG-1404
 Project: Pig
  Issue Type: New Feature
Reporter: Romain Rigaux
Assignee: Romain Rigaux
 Fix For: 0.8.0

 Attachments: commons-lang-2.4.jar, PIG-1404-2.patch, 
 PIG-1404-3-doc.patch, PIG-1404-3.patch, PIG-1404-4-doc.patch, 
 PIG-1404-4.patch, PIG-1404.patch


 The goal is to provide a simple xUnit framework that enables our Pig scripts 
 to be easily:
   - unit tested
   - regression tested
   - quickly prototyped
 No cluster set up is required.
 For example:
 TestCase
 {code}
   @Test
   public void testTop3Queries() {
 String[] args = {
 n=3,
 };
 test = new PigTest(top_queries.pig, args);
 String[] input = {
 yahoo\t10,
 twitter\t7,
 facebook\t10,
 yahoo\t15,
 facebook\t5,
 
 };
 String[] output = {
 (yahoo,25L),
 (facebook,15L),
 (twitter,7L),
 };
 test.assertOutput(data, input, queries_limit, output);
   }
 {code}
 top_queries.pig
 {code}
 data =
 LOAD '$input'
 AS (query:CHARARRAY, count:INT);
  
 ... 
 
 queries_sum = 
 FOREACH queries_group 
 GENERATE 
 group AS query, 
 SUM(queries.count) AS count;
 
 ...
 
 queries_limit = LIMIT queries_ordered $n;
 STORE queries_limit INTO '$output';
 {code}
 They are 3 modes:
 * LOCAL (if pigunit.exectype.local properties is present)
 * MAPREDUCE (use the cluster specified in the classpath, same as 
 HADOOP_CONF_DIR)
 ** automatic mini cluster (is the default and the HADOOP_CONF_DIR to have in 
 the class path will be: ~/pigtest/conf)
 ** pointing to an existing cluster (if pigunit.exectype.cluster properties 
 is present)
 For now, it would be nice to see how this idea could be integrated in 
 Piggybank and if PigParser/PigServer could improve their interfaces in order 
 to make PigUnit simple.
 Other components based on PigUnit could be built later:
   - standalone MiniCluster
   - notion of workspaces for each test
   - standalone utility that reads test configuration and generates a test 
 report...
 It is a first prototype, open to suggestions and can definitely take 
 advantage of feedbacks.
 How to test, in pig_trunk:
 {code}
 Apply patch
 $pig_trunk ant compile-test
 $pig_trunk ant
 $pig_trunk/contrib/piggybank/java ant test -Dtest.timeout=99
 {code}
 (it takes 15 min in MAPREDUCE minicluster, tests will need to be split in the 
 future between 'unit' and 'integration')
 Many examples are in:
 {code}
 contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/pigunit/TestPigTest.java
 {code}
 When used as a standalone, do not forget commons-lang-2.4.jar and the 
 HADOOP_CONF_DIR to your cluster in your CLASSPATH.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work started: (PIG-1404) PigUnit - Pig script testing simplified.

2010-08-13 Thread Romain Rigaux (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on PIG-1404 started by Romain Rigaux.

 PigUnit - Pig script testing simplified. 
 -

 Key: PIG-1404
 URL: https://issues.apache.org/jira/browse/PIG-1404
 Project: Pig
  Issue Type: New Feature
Reporter: Romain Rigaux
Assignee: Romain Rigaux
 Fix For: 0.8.0

 Attachments: commons-lang-2.4.jar, PIG-1404-2.patch, 
 PIG-1404-3-doc.patch, PIG-1404-3.patch, PIG-1404-4-doc.patch, 
 PIG-1404-4.patch, PIG-1404.patch


 The goal is to provide a simple xUnit framework that enables our Pig scripts 
 to be easily:
   - unit tested
   - regression tested
   - quickly prototyped
 No cluster set up is required.
 For example:
 TestCase
 {code}
   @Test
   public void testTop3Queries() {
 String[] args = {
 n=3,
 };
 test = new PigTest(top_queries.pig, args);
 String[] input = {
 yahoo\t10,
 twitter\t7,
 facebook\t10,
 yahoo\t15,
 facebook\t5,
 
 };
 String[] output = {
 (yahoo,25L),
 (facebook,15L),
 (twitter,7L),
 };
 test.assertOutput(data, input, queries_limit, output);
   }
 {code}
 top_queries.pig
 {code}
 data =
 LOAD '$input'
 AS (query:CHARARRAY, count:INT);
  
 ... 
 
 queries_sum = 
 FOREACH queries_group 
 GENERATE 
 group AS query, 
 SUM(queries.count) AS count;
 
 ...
 
 queries_limit = LIMIT queries_ordered $n;
 STORE queries_limit INTO '$output';
 {code}
 They are 3 modes:
 * LOCAL (if pigunit.exectype.local properties is present)
 * MAPREDUCE (use the cluster specified in the classpath, same as 
 HADOOP_CONF_DIR)
 ** automatic mini cluster (is the default and the HADOOP_CONF_DIR to have in 
 the class path will be: ~/pigtest/conf)
 ** pointing to an existing cluster (if pigunit.exectype.cluster properties 
 is present)
 For now, it would be nice to see how this idea could be integrated in 
 Piggybank and if PigParser/PigServer could improve their interfaces in order 
 to make PigUnit simple.
 Other components based on PigUnit could be built later:
   - standalone MiniCluster
   - notion of workspaces for each test
   - standalone utility that reads test configuration and generates a test 
 report...
 It is a first prototype, open to suggestions and can definitely take 
 advantage of feedbacks.
 How to test, in pig_trunk:
 {code}
 Apply patch
 $pig_trunk ant compile-test
 $pig_trunk ant
 $pig_trunk/contrib/piggybank/java ant test -Dtest.timeout=99
 {code}
 (it takes 15 min in MAPREDUCE minicluster, tests will need to be split in the 
 future between 'unit' and 'integration')
 Many examples are in:
 {code}
 contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/pigunit/TestPigTest.java
 {code}
 When used as a standalone, do not forget commons-lang-2.4.jar and the 
 HADOOP_CONF_DIR to your cluster in your CLASSPATH.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work stopped: (PIG-1404) PigUnit - Pig script testing simplified.

2010-08-13 Thread Romain Rigaux (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on PIG-1404 stopped by Romain Rigaux.

 PigUnit - Pig script testing simplified. 
 -

 Key: PIG-1404
 URL: https://issues.apache.org/jira/browse/PIG-1404
 Project: Pig
  Issue Type: New Feature
Reporter: Romain Rigaux
Assignee: Romain Rigaux
 Fix For: 0.8.0

 Attachments: commons-lang-2.4.jar, PIG-1404-2.patch, 
 PIG-1404-3-doc.patch, PIG-1404-3.patch, PIG-1404-4-doc.patch, 
 PIG-1404-4.patch, PIG-1404.patch


 The goal is to provide a simple xUnit framework that enables our Pig scripts 
 to be easily:
   - unit tested
   - regression tested
   - quickly prototyped
 No cluster set up is required.
 For example:
 TestCase
 {code}
   @Test
   public void testTop3Queries() {
 String[] args = {
 n=3,
 };
 test = new PigTest(top_queries.pig, args);
 String[] input = {
 yahoo\t10,
 twitter\t7,
 facebook\t10,
 yahoo\t15,
 facebook\t5,
 
 };
 String[] output = {
 (yahoo,25L),
 (facebook,15L),
 (twitter,7L),
 };
 test.assertOutput(data, input, queries_limit, output);
   }
 {code}
 top_queries.pig
 {code}
 data =
 LOAD '$input'
 AS (query:CHARARRAY, count:INT);
  
 ... 
 
 queries_sum = 
 FOREACH queries_group 
 GENERATE 
 group AS query, 
 SUM(queries.count) AS count;
 
 ...
 
 queries_limit = LIMIT queries_ordered $n;
 STORE queries_limit INTO '$output';
 {code}
 They are 3 modes:
 * LOCAL (if pigunit.exectype.local properties is present)
 * MAPREDUCE (use the cluster specified in the classpath, same as 
 HADOOP_CONF_DIR)
 ** automatic mini cluster (is the default and the HADOOP_CONF_DIR to have in 
 the class path will be: ~/pigtest/conf)
 ** pointing to an existing cluster (if pigunit.exectype.cluster properties 
 is present)
 For now, it would be nice to see how this idea could be integrated in 
 Piggybank and if PigParser/PigServer could improve their interfaces in order 
 to make PigUnit simple.
 Other components based on PigUnit could be built later:
   - standalone MiniCluster
   - notion of workspaces for each test
   - standalone utility that reads test configuration and generates a test 
 report...
 It is a first prototype, open to suggestions and can definitely take 
 advantage of feedbacks.
 How to test, in pig_trunk:
 {code}
 Apply patch
 $pig_trunk ant compile-test
 $pig_trunk ant
 $pig_trunk/contrib/piggybank/java ant test -Dtest.timeout=99
 {code}
 (it takes 15 min in MAPREDUCE minicluster, tests will need to be split in the 
 future between 'unit' and 'integration')
 Many examples are in:
 {code}
 contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/pigunit/TestPigTest.java
 {code}
 When used as a standalone, do not forget commons-lang-2.4.jar and the 
 HADOOP_CONF_DIR to your cluster in your CLASSPATH.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.