[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-12 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1295:


Attachment: PIG-1295_0.14.patch

I reviewed and regenerated the patch. A couple of notes:
1. All unit tests and end-to-end tests pass, and the Hudson warnings are addressed.
2. I see a consistent performance improvement (around 20%) in PigMix query L16 
(using 10 reducers, on a cluster of 10 nodes).
3. Did some refactoring: renamed some classes, moved some code around, and moved 
getRawComparatorClass to Tuple instead of TupleFactory.

Gianmarco, can you take a look and see if my changes are good?

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
 PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, 
 PIG-1295_0.14.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, 
 PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, 
 PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch


 When the Hadoop framework does the sorting, it will try to use a binary version 
 of the comparator if one is available. The benefit of a binary comparator is that 
 we do not need to instantiate the objects before we compare them. We saw a ~30% 
 speedup after switching to a binary comparator. Currently, Pig uses a binary 
 comparator in the following cases:
 1. When the sort semantics do not matter. For example, in distinct we need to 
 sort in order to filter out duplicate values, but we do not care how the 
 comparator orders the keys. Group-by shares this characteristic. In these cases 
 we rely on Hadoop's default binary comparator.
 2. When the sort semantics matter but the key is of a simple type. For these we 
 have implementations for simple types such as integer, long, float, chararray, 
 databytearray, and string.
 However, if the key is a tuple and the sort semantics matter, we do not have a 
 binary comparator implementation. This especially matters when we switch to 
 secondary sort. In secondary sort, we convert the inner sort of a nested foreach 
 into the secondary key and rely on Hadoop to sort on both the main key and the 
 secondary key. The sort key then becomes a two-item tuple. Since the secondary 
 key is the sort key of the nested foreach, its sort semantics matter. As a 
 result, we have no binary comparator once we use secondary sort, and we see a 
 significant slowdown.
 A binary comparator for tuples should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on the most common use case 
 first: a group-by followed by a nested sort, which is when we use secondary 
 sort. The semantics of the first key do not matter, but the semantics of the 
 secondary key do. We need to identify the boundary between the main key and the 
 secondary key in the binary tuple buffer without instantiating the tuple itself. 
 Then, if the first keys are equal, we use a binary comparator to compare the 
 secondary keys. The secondary key can also be of a complex data type, but as a 
 first step we focus on simple secondary keys, which are the most common case.
 We mark this issue as a candidate project for the Google Summer of Code 2010 
 program.
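
To make the discussion concrete, here is a minimal sketch of a raw comparator over a 
serialized (main key, secondary key) pair. It assumes, purely for illustration, a fixed 
layout of two 4-byte big-endian ints; the FixedLayoutRawComparator name and the layout are 
hypothetical, since the real BinInterSedes format is type-tagged and variable-length and 
the actual comparator has to walk the serialized type codes to find the key boundary.
{code}
import org.apache.hadoop.io.WritableComparator;

// Illustrative only: assumes each record is two 4-byte big-endian ints
// (main key, then secondary key). This is not the BinInterSedes layout.
public class FixedLayoutRawComparator {
    public int compare(byte[] b1, int s1, byte[] b2, int s2) {
        int main1 = WritableComparator.readInt(b1, s1);
        int main2 = WritableComparator.readInt(b2, s2);
        if (main1 != main2) {
            return main1 < main2 ? -1 : 1;   // main-key order; its semantics do not matter
        }
        // Main keys are equal: order by the secondary key, whose semantics do matter.
        int sec1 = WritableComparator.readInt(b1, s1 + 4);
        int sec2 = WritableComparator.readInt(b2, s2 + 4);
        return sec1 < sec2 ? -1 : (sec1 == sec2 ? 0 : 1);
    }
}
{code}
The point of the exercise is that no Tuple object is ever instantiated: both keys are read 
straight out of the serialized buffers.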

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1442) java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)

2010-08-12 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair resolved PIG-1442.


Resolution: Duplicate

 java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)
 ---

 Key: PIG-1442
 URL: https://issues.apache.org/jira/browse/PIG-1442
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0, 0.7.0
 Environment: Apache-Hadoop 0.20.2 + Pig 0.7.0 and also for 0.8.0-dev 
 (18/may)
 Hadoop-0.18.3 (cloudera RPMs) + PIG 0.2.0
Reporter: Dirk Schmid
Assignee: Thejas M Nair
 Fix For: 0.8.0


 As mentioned by Ashutosh, this is a reopen of 
 https://issues.apache.org/jira/browse/PIG-766 because there is still a 
 problem that causes Pig to scale only with memory.
 For convenience, here is the last entry of the PIG-766 JIRA ticket:
 {quote}1. Are you getting the exact same stack trace as mentioned in the 
 jira?{quote} Yes, the same and some similar traces:
 {noformat}
 java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOf(Arrays.java:2786)
   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
   at java.io.DataOutputStream.write(DataOutputStream.java:90)
   at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
   at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:279)
   at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
   at org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:249)
   at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:214)
   at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
   at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:209)
   at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
   at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123)
   at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
   at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
   at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179)
   at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:880)
   at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1201)
   at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:199)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
   at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563)
   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
 java.lang.OutOfMemoryError: Java heap space
   at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:58)
   at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
   at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61)
   at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
   at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
   at org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263)
   at org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71)
   at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145)
   at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
   at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:63)
   at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
   at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
   at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:284)
   at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
   at 
 

[jira] Commented: (PIG-1541) FR Join shouldn't match null values

2010-08-12 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897866#action_12897866
 ] 

Richard Ding commented on PIG-1541:
---


Results of test-patch:

{code}
 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 6 new or modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
{code}

 FR Join shouldn't match null values
 ---

 Key: PIG-1541
 URL: https://issues.apache.org/jira/browse/PIG-1541
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1541.patch


 Here is an example:
 Data input:
 {code}
 1   1
 2
 {code}
 The script
 {code}
 a = load 'input';
 b = load 'input';
 c = join a by $0, b by $0 using 'repl';
 dump c; 
 {code}
 generates results that match on null values:
 {code}
 (1,1,1,1)
 (,2,,2)
 {code}
 The regular join, on the other hand, gives the correct results.
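
 For illustration only, a sketch of the kind of null-key guard this fix implies when 
 building the in-memory (replicated) side of the join; the ReplicatedSideBuilder class 
 and its types are made up here and are not Pig's actual FR-join code.
 {code}
 import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;

 public class ReplicatedSideBuilder {
     // Build the key -> records table for the replicated input, skipping null keys
     // so that a null key on the streaming side can never find a match.
     public Map<Object, List<Object[]>> build(Iterable<Object[]> replicatedInput, int keyIndex) {
         Map<Object, List<Object[]>> table = new HashMap<Object, List<Object[]>>();
         for (Object[] record : replicatedInput) {
             Object key = record[keyIndex];
             if (key == null) {
                 continue;   // join semantics: null does not match null
             }
             List<Object[]> bucket = table.get(key);
             if (bucket == null) {
                 bucket = new ArrayList<Object[]>();
                 table.put(key, bucket);
             }
             bucket.add(record);
         }
         return table;
     }
 }
 {code}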

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1486) update ant eclipse-files target to include new jar and remove contrib dirs from build path

2010-08-12 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1486:
---

Attachment: PIG-1486.1.patch

Updated patch which includes guava and jython. It might need more changes after 
PIG-1452 is committed.


 update ant eclipse-files target to include new jar and remove contrib dirs 
 from build path
 --

 Key: PIG-1486
 URL: https://issues.apache.org/jira/browse/PIG-1486
 Project: Pig
  Issue Type: Bug
  Components: tools
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1486.1.patch, PIG-1486.patch


  .eclipse.templates/.classpath needs to be updated to address the following:
 1. There is a new jar used by the code: guava-r03.jar.
 2. The jar ANT_HOME/lib/ant.jar gives an 'unbounded jar' error in Eclipse.
 3. Remove the contrib projects from the classpath, as discussed in PIG-1390, 
 until all libs necessary for the contribs are included in the classpath.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-12 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897887#action_12897887
 ] 

Yan Zhou commented on PIG-1518:
---

During the merge process, any empty splits will be skipped. Currently, empty 
splits are generated for empty files, which is not necessary in the first 
place.
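
To illustrate the behaviour described above (this is a sketch, not the actual patch; the 
SplitMerger class and the targetSplitSize parameter are hypothetical), zero-length files 
are dropped before small files are grouped into combined splits:
{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

public class SplitMerger {
    // Group small files into combined "splits", skipping empty files so that
    // no empty split is ever generated.
    public List<List<Path>> group(List<FileStatus> files, long targetSplitSize) {
        List<List<Path>> splits = new ArrayList<List<Path>>();
        List<Path> current = new ArrayList<Path>();
        long currentSize = 0;
        for (FileStatus f : files) {
            if (f.getLen() == 0) {
                continue;   // empty file: would only yield an empty split
            }
            current.add(f.getPath());
            currentSize += f.getLen();
            if (currentSize >= targetSplitSize) {
                splits.add(current);
                current = new ArrayList<Path>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
{code}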

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this work with 
 different data formats if possible.
 There are already a couple of input formats doing something similar: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-08-12 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897922#action_12897922
 ] 

Thejas M Nair commented on PIG-1295:


Comments about PIG-1295_0.14.patch -
- The comparison logic for BinInterSedes relies on the serialization format, so 
I think it is better to have it closer to where the serialization format is 
implemented, i.e., add a function to the InterSedes interface (getComparator() ?) 
and move the implementation logic to the BinInterSedes class.
- I think TupleFactory is a better place for getRawComparatorClass(), for the 
following reasons:
-- TupleFactory is a singleton class; Tuple is not. Having it in Tuple implies 
that different instances can return different values.
-- Adding it to the Tuple interface breaks backward compatibility: all Tuple 
implementations will need to add this function. Also, it does not make sense for 
load functions that return a custom tuple to implement this method, because it 
is not related to that tuple implementation.
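
For concreteness, one possible shape of the placement being discussed; this is a sketch 
only, not the committed API, and the *Sketch names below are local stand-ins for the real 
org.apache.pig.data.InterSedes and TupleFactory classes.
{code}
import org.apache.hadoop.io.RawComparator;

// The serialization format owner hands out the matching raw comparator class...
interface InterSedesSketch {
    // ... existing (de)serialization methods omitted ...
    Class<? extends RawComparator<?>> getTupleRawComparatorClass();
}

// ...and the singleton factory exposes it to the rest of Pig. Returning null
// means "no binary comparator available; fall back to the deserializing one".
abstract class TupleFactorySketch {
    public Class<? extends RawComparator<?>> getTupleRawComparatorClass() {
        return null;
    }
}
{code}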

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
 PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, 
 PIG-1295_0.14.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, 
 PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, 
 PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1543) IsEmpty returns the wrong value after using LIMIT

2010-08-12 Thread Justin Hu (JIRA)
IsEmpty returns the wrong value after using LIMIT
-

 Key: PIG-1543
 URL: https://issues.apache.org/jira/browse/PIG-1543
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Justin Hu


1. Two input files:

1a: limit_empty.input_a
1
1
1

1b: limit_empty.input_b
2
2

2.
The pig script: limit_empty.pig

-- A contains only 1's and B contains only 2's
A = load 'limit_empty.input_a' as (a1:int);
B = load 'limit_empty.input_b' as (b1:int);

C = COGROUP A by a1, B by b1;
D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), 
COUNT(B);
store D into 'limit_empty.output/d';
-- After the script is done, we see the right results:
-- {(1),(1),(1)}   {}  1   0   3   0
-- {} {(2),(2)}  0   1   0   2

C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; }
D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? 0:1), 
COUNT(Alim), COUNT(Blim);
store D1 into 'limit_empty.output/d1';
-- After the script is done, we see unexpected results:
-- {(1)}   {}1   1   1   0
-- {}  {(2)} 1   1   0   1

dump D;
dump D1;

3. Run the script and redirect stdout (the two dumps) to a file. There are two issues:

The major one:

IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while 
IsEmpty() returns the correct value in limit_empty.output/d/*.

The difference is that LIMIT was applied before IsEmpty() in the first case.

The minor one:

The redirected output only contains the first dump:

({(1),(1),(1)},{},1,0,3L,0L)
({},{(2),(2)},0,1,0L,2L)

We expect two more lines like:
({(1)},{},1,1,1L,0L)
({},{(2)},1,1,0L,1L)

Besides, there is an error that says:

[main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - 
java.lang.ClassCastException: java.lang.Integer cannot be cast to 
org.apache.pig.data.Tuple


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1544) proactive-spill bags should share the memory alloted for it

2010-08-12 Thread Thejas M Nair (JIRA)
proactive-spill bags should share the memory alloted for it
---

 Key: PIG-1544
 URL: https://issues.apache.org/jira/browse/PIG-1544
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair


Initially, proactive-spill bags were designed for use in (co)group 
(InternalCachedBag); they knew the total number of proactive bags present and 
shared the memory limit specified by the property pig.cachedbag.memusage.
But two proactive bag implementations added later - InternalDistinctBag and 
InternalSortedBag - are not aware of the actual number of bags in use; their 
users always assume total-numbags = 3.

This needs to be fixed so that all proactive-spill bags share the memory limit.
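
A hedged sketch of the direction this suggests: every proactive-spill bag takes its share 
of the pig.cachedbag.memusage budget from one shared registry instead of assuming a fixed 
count of three bags. The SpillBagMemoryBudget class and its method names are illustrative, 
not Pig's actual implementation.
{code}
import java.util.concurrent.atomic.AtomicInteger;

public class SpillBagMemoryBudget {
    private final long totalBudgetBytes;
    private final AtomicInteger liveBags = new AtomicInteger(0);

    public SpillBagMemoryBudget(float memUsageFraction, long heapBytes) {
        // pig.cachedbag.memusage is a fraction of the heap shared by all bags
        this.totalBudgetBytes = (long) (memUsageFraction * heapBytes);
    }

    public void register()   { liveBags.incrementAndGet(); }
    public void unregister() { liveBags.decrementAndGet(); }

    // The per-bag limit shrinks as more proactive-spill bags are created,
    // instead of each bag assuming there are exactly three of them.
    public long perBagLimitBytes() {
        int n = Math.max(1, liveBags.get());
        return totalBudgetBytes / n;
    }
}
{code}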

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-08-12 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12897958#action_12897958
 ] 

Thejas M Nair commented on PIG-1295:


bq. Conceptually comparator is in the logic of Tuple.
This comparator is part of only the *default* tuple implementation used 
internally within Pig. So the class that is the source of truth for the default 
internal tuple implementation seems a good place for this function. A tuple 
returned by a load function has nothing to do with the comparator logic.

bq. Ideally it should be a static method of Tuple, however the Tuple interface 
does not allow me to do that.
Yes, a static method can't be overridden. Since this is supposed to return only 
one value per Pig query, the singleton TupleFactory is a better place.

bq. For backward compatibility, first, we will break either Tuple or 
TupleFactory, the impact is equivalent;
No. TupleFactory is an abstract class, while Tuple is an interface. Users will 
not be forced to change their implementations if we add a function to 
TupleFactory. Also, users are more likely to have a custom Tuple than a custom 
TupleFactory, because they might implement different tuples as part of their 
load function implementation, but they are unlikely to change the default Tuple 
implementation used internally in Pig.

bq. second, in both PigSecondaryKeyComparator and PigTupleSortComparator, we 
will check if Tuple does not implement the new method, we fall back to the 
default serialize version.
If the Tuple interface is going to have this function, I think we should state 
in the javadoc that it only makes sense to implement it for the default internal 
tuple implementation, and that null can be returned if the user chooses not to 
implement it.
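
A small sketch of the fallback wiring described in the last point; both comparator classes 
are passed in here to avoid guessing package names, and the SortComparatorSetup class 
itself is illustrative, not Pig's actual job-setup code.
{code}
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.mapreduce.Job;

public class SortComparatorSetup {
    // If a raw (binary) comparator is available, use it; otherwise keep the
    // existing comparator that deserializes tuples before comparing.
    public static void configure(Job job,
                                 Class<? extends RawComparator> rawComparator,
                                 Class<? extends RawComparator> deserializingFallback) {
        job.setSortComparatorClass(rawComparator != null ? rawComparator : deserializingFallback);
    }
}
{code}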



 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
 PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, 
 PIG-1295_0.14.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, 
 PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch, 
 PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-12 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1295:


Attachment: PIG-1295_0.15.patch

Attaching another patch to address Thejas's first point.

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
 PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, 
 PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.2.patch, 
 PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, 
 PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-12 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1295:


Status: Patch Available  (was: Open)

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
 PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, 
 PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.2.patch, 
 PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, 
 PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-12 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1295:


Status: Open  (was: Patch Available)

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
 PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.13.patch, 
 PIG-1295_0.14.patch, PIG-1295_0.15.patch, PIG-1295_0.2.patch, 
 PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, 
 PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1466) Improve log messages for memory usage

2010-08-12 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12898020#action_12898020
 ] 

Thejas M Nair commented on PIG-1466:


I am planning to make the following changes:
1. Change the string "low memory handler called" to "cleaner notified".
2. Print an additional log message stating the total spill-object memory freed 
and the total number of objects freed. This will be printed the first time a 
candidate to be freed is found, and then every 10th time GC is invoked from the 
code. (GC is invoked after a threshold of memory suitable to be freed has been 
accumulated.)
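
A hedged sketch of what that logging change could look like; the SpillLogThrottle class 
and its fields are illustrative, not the actual SpillableMemoryManager code.
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class SpillLogThrottle {
    private static final Log LOG = LogFactory.getLog(SpillLogThrottle.class);

    private long gcInvocations = 0;
    private long totalBytesFreed = 0;
    private long totalObjectsFreed = 0;

    // Called each time the manager triggers a GC after enough spillable memory
    // has been identified; logs on the first call and then on every 10th call.
    public void onGcInvoked(long bytesFreed, long objectsFreed) {
        totalBytesFreed += bytesFreed;
        totalObjectsFreed += objectsFreed;
        gcInvocations++;
        if (gcInvocations == 1 || gcInvocations % 10 == 0) {
            LOG.info("cleaner notified: freed " + totalBytesFreed + " bytes across "
                    + totalObjectsFreed + " spillable objects so far");
        }
    }
}
{code}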



 Improve log messages for memory usage
 -

 Key: PIG-1466
 URL: https://issues.apache.org/jira/browse/PIG-1466
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
Priority: Minor
 Fix For: 0.8.0


 For anything more than a moderately sized dataset, Pig usually emits the following 
 messages:
 {code}
 2010-05-27 18:28:31,659 INFO org.apache.pig.impl.util.SpillableMemoryManager: 
 low memory handler called (Usage
 threshold exceeded) init = 4194304(4096K) used = 672012960(656262K) committed 
 = 954466304(932096K) max =
 954466304(932096K)
 2010-05-27 18:10:52,653 INFO org.apache.pig.impl.util.SpillableMemoryManager: 
 low memory handler called (Collection
 threshold exceeded) init = 4194304(4096K) used = 954466304(932096K) committed 
 = 954466304(932096K) max =
 954466304(932096K)
 {code}
 This seems to confuse users a lot. Once these messages are printed, users 
 tend to believe that Pig is having a hard time with memory, is spilling to disk, 
 etc., but in fact Pig might be cruising along at ease. We should be a little 
 more careful about what we print in the logs. Currently these messages are 
 printed when a notification is sent by the JVM and some other conditions are 
 met, which may not necessarily indicate a low-memory condition. Furthermore, 
 with {{InternalCachedBag}} embraced everywhere in place of {{DefaultBag}}, these 
 messages have lost their usefulness. At the very least, we should lower the 
 log level at which they are printed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.