[jira] Created: (PIG-1442) java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)
java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)
---
Key: PIG-1442
URL: https://issues.apache.org/jira/browse/PIG-1442
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0, 0.2.0
Environment: Apache Hadoop 0.20.2 + Pig 0.7.0, and also 0.8.0-dev (18/May); Hadoop 0.18.3 (Cloudera RPMs) + Pig 0.2.0
Reporter: Dirk Schmid

As mentioned by Ashutosh, this is a reopen of https://issues.apache.org/jira/browse/PIG-766, because there is still a problem that makes Pig scale only with the available memory. For convenience, here is the last entry of the PIG-766 jira ticket:
{quote}1. Are you getting the exact same stack trace as mentioned in the jira?{quote}
Yes, the same and some similar traces:
{noformat}
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
    at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:279)
    at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
    at org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:249)
    at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:214)
    at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
    at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:209)
    at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
    at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179)
    at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:880)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1201)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:199)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)

java.lang.OutOfMemoryError: Java heap space
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:58)
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263)
    at org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:63)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:284)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
    at
{noformat}
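Both traces die while serializing or deserializing large bags in one piece. The usual mitigation is to bound what a bag holds on the heap and spill the rest to disk. The following is a toy illustration of that idea only, not Pig's actual DefaultAbstractBag spill code; the class name, string "tuples", and threshold are all made up:

```java
import java.io.*;
import java.util.*;

// Toy "spillable bag": keeps elements in memory up to a threshold, then
// spills them to a temp file, so the bag's footprint is bounded by disk
// rather than heap.
public class SpillableBag {
    private final int memLimit;                       // max in-memory elements
    private final List<String> memory = new ArrayList<>();
    private File spillFile;                           // created lazily on first spill
    private long spilled = 0;

    public SpillableBag(int memLimit) { this.memLimit = memLimit; }

    public void add(String tuple) throws IOException {
        memory.add(tuple);
        if (memory.size() >= memLimit) spill();
    }

    private void spill() throws IOException {
        if (spillFile == null) {
            spillFile = File.createTempFile("bag", ".spill");
            spillFile.deleteOnExit();
        }
        try (PrintWriter w = new PrintWriter(new FileWriter(spillFile, true))) {
            for (String t : memory) { w.println(t); spilled++; }
        }
        memory.clear();                               // free the heap
    }

    public long size() { return spilled + memory.size(); }

    // Streams all elements back: spilled ones first, then in-memory ones.
    public List<String> contents() throws IOException {
        List<String> out = new ArrayList<>();
        if (spillFile != null) {
            try (BufferedReader r = new BufferedReader(new FileReader(spillFile))) {
                String line;
                while ((line = r.readLine()) != null) out.add(line);
            }
        }
        out.addAll(memory);
        return out;
    }
}
```

The point of the sketch is only that `write`/`readFields` style code which materializes a whole bag at once cannot be rescued by spilling later; the bound has to be enforced where elements are accumulated.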
[jira] Commented: (PIG-1429) Add Boolean Data Type to Pig
[ https://issues.apache.org/jira/browse/PIG-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876326#action_12876326 ]

Alan Gates commented on PIG-1429:
---------------------------------

Is this patch ready for review, or does it need more work?

Add Boolean Data Type to Pig
---
Key: PIG-1429
URL: https://issues.apache.org/jira/browse/PIG-1429
Project: Pig
Issue Type: New Feature
Components: data
Affects Versions: 0.7.0
Reporter: Russell Jurney
Assignee: Russell Jurney
Fix For: 0.8.0
Attachments: working_boolean.patch
Original Estimate: 8h
Remaining Estimate: 8h

Pig needs a Boolean data type. PIG-1097 depends on this. I volunteer. Is there anything beyond the work in src/org/apache/pig/data/ plus unit tests to make this work?

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
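For context, the work in src/org/apache/pig/data/ largely amounts to giving the new type a tag in the tuple serialization format and handling it on both the write and read paths. The sketch below shows tagged round-tripping of a boolean in the style of DataReaderWriter.writeDatum/readDatum; the tag constants and class are illustrative assumptions, not Pig's actual DataType values:

```java
import java.io.*;

// Sketch of tagged serialization with a BOOLEAN type added alongside an
// existing type. Each datum is written as a one-byte type tag followed
// by its payload, so readDatum can dispatch without external schema.
public class BooleanDatum {
    static final byte INTEGER = 1, BOOLEAN = 2;   // hypothetical tag values

    static void writeDatum(DataOutput out, Object val) throws IOException {
        if (val instanceof Boolean) {
            out.writeByte(BOOLEAN);
            out.writeBoolean((Boolean) val);
        } else if (val instanceof Integer) {
            out.writeByte(INTEGER);
            out.writeInt((Integer) val);
        } else {
            throw new IOException("unsupported type: " + val);
        }
    }

    static Object readDatum(DataInput in) throws IOException {
        byte tag = in.readByte();
        switch (tag) {
            case BOOLEAN: return in.readBoolean();
            case INTEGER: return in.readInt();
            default: throw new IOException("bad tag " + tag);
        }
    }

    // Serialize then deserialize one value, returning the reconstructed copy.
    public static Object roundTrip(Object val) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        writeDatum(new DataOutputStream(bos), val);
        return readDatum(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
    }
}
```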
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876354#action_12876354 ]

Daniel Dai commented on PIG-1295:
---------------------------------

I briefly reviewed the patch; it looks good. This is the approach we expected. Can we do some initial performance tests first?

Binary comparator for secondary sort
---
Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Attachments: PIG-1295_0.1.patch

When the hadoop framework does the sorting, it tries to use a binary version of the comparator if one is available. The benefit of a binary comparator is that we do not need to instantiate the objects before comparing them. We saw a ~30% speedup after switching to binary comparators. Currently, Pig uses a binary comparator in the following cases:
1. When the semantics of the order do not matter. For example, in distinct we need to sort in order to filter out duplicate values, but we do not care how the comparator orders the keys. Group-by shares this characteristic. In these cases, we rely on hadoop's default binary comparator.
2. When the semantics of the order matter, but the key is of a simple type. Here we have implementations for simple types such as integer, long, float, chararray, databytearray, and string.
However, if the key is a tuple and the sort semantics matter, we do not have a binary comparator implementation. This especially matters when we switch to secondary sort. In secondary sort, we convert the inner sort of a nested foreach into the secondary key and rely on hadoop to sort on both the main key and the secondary key, so the sort key becomes a two-item tuple. Since the secondary key is the sort key of the nested foreach, its sort semantics matter. As a result, we have no binary comparator once we use secondary sort, and we see a significant slowdown.
A binary comparator for tuples should be doable once we understand the binary structure of a serialized tuple. We can focus on the most common use case first: a group-by followed by a nested sort, which uses secondary sort. The semantics of the first key do not matter, but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself; then, if the first keys are equal, we use a binary comparator on the secondary key. The secondary key can also be a complex data type, but for the first step we focus on simple secondary keys, which are the most common case. We mark this issue as a candidate project for the Google Summer of Code 2010 program.
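The core trick the issue describes, ordering serialized keys without instantiating them, can be sketched in a few lines of plain Java. This is an assumption-laden illustration only: the keys here are single big-endian non-negative ints, whereas Pig's real comparator must also walk type tags, signs, nulls, and the main/secondary key boundary:

```java
import java.io.*;

// Raw ("binary") comparison: compare two serialized keys byte by byte
// without deserializing. For keys written as big-endian non-negative
// ints, unsigned lexicographic byte order coincides with numeric order.
public class RawCompare {
    static byte[] serialize(int key) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeInt(key);   // DataOutput is big-endian
        return bos.toByteArray();
    }

    // Unsigned lexicographic comparison, the core of any raw comparator.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;                // shorter prefix sorts first
    }
}
```

The speedup comes entirely from never calling `readFields`: no tuple objects are allocated on the compare path.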
[jira] Assigned: (PIG-490) Combiner not used when group elements referred to in tuple notation instead of flatten.
[ https://issues.apache.org/jira/browse/PIG-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-490:
----------------------------------
Assignee: Thejas M Nair

Combiner not used when group elements referred to in tuple notation instead of flatten.
---
Key: PIG-490
URL: https://issues.apache.org/jira/browse/PIG-490
Project: Pig
Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Alan Gates
Assignee: Thejas M Nair
Fix For: 0.8.0

Given a query like:
{code}
A = load 'myfile';
B = group A by ($0, $1);
C = foreach B generate group.$0, group.$1, COUNT(A);
{code}
the combiner will not be invoked. But if the last line is changed to:
{code}
C = foreach B generate flatten(group), COUNT(A);
{code}
it will be. The reason for the discrepancy is that the CombinerOptimizer checks that all of the projections are simple; if they are not, it does not use the combiner. group.$0 is not a simple projection, so the check fails. However, this case is common enough that the CombinerOptimizer should detect it and still use the combiner.
[jira] Assigned: (PIG-1435) make sure dependent jobs fail when a job in multiquery fails
[ https://issues.apache.org/jira/browse/PIG-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1435:
-----------------------------------
Assignee: Richard Ding

make sure dependent jobs fail when a job in multiquery fails
---
Key: PIG-1435
URL: https://issues.apache.org/jira/browse/PIG-1435
Project: Pig
Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Richard Ding
Fix For: 0.8.0

Currently, if one of the multiquery jobs fails, Pig tries to run all remaining jobs. As a result, if data was partially generated by the failed job, you might get incorrect results from the dependent jobs.
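A minimal sketch of the intended behavior: walk the job graph in dependency order and mark every job downstream of a failure as skipped, instead of running it against partial output. The class, status strings, and scheduling model below are hypothetical, not Pig's actual JobControl logic:

```java
import java.util.*;

// Toy multi-query scheduler: when a job fails, all transitively
// dependent jobs are marked SKIPPED rather than run on partial data.
public class JobDag {
    // job name -> names of the jobs it depends on; insertion order is
    // assumed to be a valid topological order (a simplification).
    private final Map<String, List<String>> deps = new LinkedHashMap<>();

    public void addJob(String name, String... dependsOn) {
        deps.put(name, Arrays.asList(dependsOn));
    }

    // Returns each job's final status, given the set of jobs that would
    // fail if actually run.
    public Map<String, String> run(Set<String> failing) {
        Map<String, String> status = new LinkedHashMap<>();
        for (String job : deps.keySet()) {
            boolean parentFailed = false;
            for (String d : deps.get(job)) {
                if (!"OK".equals(status.get(d))) parentFailed = true;
            }
            if (parentFailed) {
                status.put(job, "SKIPPED");          // never run on partial input
            } else {
                status.put(job, failing.contains(job) ? "FAILED" : "OK");
            }
        }
        return status;
    }
}
```

Independent branches of the graph still run to completion; only the failed job's descendants are cut off.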
[jira] Assigned: (PIG-1436) Print number of records outputted at each step of a Pig script
[ https://issues.apache.org/jira/browse/PIG-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1436:
-----------------------------------
Assignee: Richard Ding

I think Richard is already doing this as part of his stats work.

Print number of records outputted at each step of a Pig script
---
Key: PIG-1436
URL: https://issues.apache.org/jira/browse/PIG-1436
Project: Pig
Issue Type: New Feature
Components: grunt
Affects Versions: 0.7.0
Reporter: Russell Jurney
Assignee: Richard Ding
Priority: Minor
Fix For: 0.8.0

I often run a script multiple times, or have to go look through Hadoop task logs, to figure out where I broke a long script in such a way that I get 0 records out of it. I think this is a common problem. If someone can point me in the right direction, I can make a pass at this.
[jira] Assigned: (PIG-1434) Allow casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1434:
-----------------------------------
Assignee: Aniket Mokashi

Allow casting relations to scalars
---
Key: PIG-1434
URL: https://issues.apache.org/jira/browse/PIG-1434
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
Fix For: 0.8.0

This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801. The proposal is to allow casting relations to scalar types in foreach. Example:
A = load 'data' as (x, y, z);
B = group A all;
C = foreach B generate COUNT(A);
...
X = ...
Y = foreach X generate $1/(long) C;
A couple of additional comments:
(1) You can only cast relations containing a single value, or an error will be reported.
(2) Name resolution is needed, since relation X might have a field named C, in which case that field takes precedence.
(3) Y will look for the C closest to it.
Implementation thoughts: the idea is to store C into a file and then convert it into a scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to (1) store C and (2) convert the cast to the UDF.
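The store-then-read-as-scalar step in the implementation thoughts can be sketched as follows. The helper below merely stands in for the UDF mentioned in the jira; the one-value-per-line file layout and the method name are assumptions for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch of the "relation to scalar" mechanism: the single-row relation
// C has been stored to a file, and this helper reads it back as a
// scalar for use in later expressions.
public class ScalarReader {
    // Errors out unless the stored relation holds exactly one row, per
    // comment (1) in the proposal.
    public static long readLongScalar(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file);
        if (lines.size() != 1) {
            throw new IOException(
                "scalar cast requires exactly one row, got " + lines.size());
        }
        return Long.parseLong(lines.get(0).trim());
    }
}
```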
[jira] Assigned: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-928:
----------------------------------
Assignee: Aniket Mokashi

UDFs in scripting languages
---
Key: PIG-928
URL: https://issues.apache.org/jira/browse/PIG-928
Project: Pig
Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
Fix For: 0.8.0
Attachments: calltrace.png, package.zip, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, scripting.tgz, scripting.tgz, test.zip

It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java.
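On the Java side, the minimum such a feature needs is a way to treat a UDF as an opaque callable, so that its body can come from any script engine (Jython, JRuby, etc.) rather than from a compiled Java class. The registry below is a hypothetical conceptual sketch, not Pig's actual design from any of the attached patches:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical UDF registry: maps a UDF name to a callable. A script
// engine would register its compiled functions here, and the executor
// invokes them by name without knowing which language they came from.
public class UdfRegistry {
    private final Map<String, Function<Object[], Object>> udfs = new HashMap<>();

    public void register(String name, Function<Object[], Object> fn) {
        udfs.put(name, fn);
    }

    public Object invoke(String name, Object... args) {
        Function<Object[], Object> fn = udfs.get(name);
        if (fn == null) throw new IllegalArgumentException("no UDF: " + name);
        return fn.apply(args);
    }
}
```

The design point is indirection: once invocation goes through a uniform callable interface, adding a new scripting language only means writing an adapter that registers functions, with no change to the executor.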
[jira] Assigned: (PIG-1405) Need to move many standard functions from piggybank into Pig
[ https://issues.apache.org/jira/browse/PIG-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1405:
-----------------------------------
Assignee: Aniket Mokashi

Need to move many standard functions from piggybank into Pig
---
Key: PIG-1405
URL: https://issues.apache.org/jira/browse/PIG-1405
Project: Pig
Issue Type: Improvement
Reporter: Alan Gates
Assignee: Aniket Mokashi
Fix For: 0.8.0

There are currently a number of functions in Piggybank that represent features commonly supported by languages and database engines. We need to decide which of these Pig should support as built-in functions and put them in org.apache.pig.builtin. This will also mean adding unit tests and javadocs for some UDFs. The existing classes will be left in Piggybank for some time for backward compatibility.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1295:
------------------------------------------------
Attachment: PIG-1295_0.2.patch

I added some simple performance tests. The tests generate 1 million tuples by modifying a prototypical tuple and compare them to the prototype. One test uses the new comparator and the other uses the old one. I generate exactly the same tuples using a fixed seed, and I also check the correctness of the comparisons against the normal compareTo() method of the tuples. The logic to generate the tuples is a bit involved: I tried to exercise all the datatype comparisons in a uniform manner, so I mutate the first elements of the tuple less often, to make it more likely that the comparison proceeds further down the tuple. The probabilities are totally made up and do not make much sense. As a first approximation, I see a slight overall speedup in the test. I will do some profiling to see what margins of improvement we have.

Binary comparator for secondary sort
---
Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch
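The correctness check Gianmarco describes, fixed-seed data with the raw comparison verified against the object-level comparison, can be sketched like this. It uses single int keys instead of mutated prototype tuples, so it shows only the shape of the real test, not its coverage:

```java
import java.io.*;
import java.util.Random;

// Sketch of a comparator-agreement harness: generate key pairs from a
// fixed seed, then verify that comparing serialized bytes agrees in
// sign with comparing the deserialized values.
public class CompareHarness {
    static byte[] ser(int v) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeInt(v);     // big-endian, non-negative keys
        return bos.toByteArray();
    }

    static int rawCompare(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return 0;                                  // equal-length keys here
    }

    // True when every sampled pair orders identically both ways.
    public static boolean agree(long seed, int trials) throws IOException {
        Random rnd = new Random(seed);             // fixed seed: reproducible data
        for (int i = 0; i < trials; i++) {
            int x = rnd.nextInt(1 << 20), y = rnd.nextInt(1 << 20);
            int raw = Integer.signum(rawCompare(ser(x), ser(y)));
            int obj = Integer.signum(Integer.compare(x, y));
            if (raw != obj) return false;
        }
        return true;
    }
}
```

Timing each side of the same loop would give the performance half of the test; the fixed seed ensures both comparators see identical data.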
distributed cache in pig
Hi all,
I notice that whether Pig uses the distributed cache depends on the context (local or mapreduce). When running in mapreduce mode, the distributed cache is always enabled (e.g. for replicated join). However, I can't find the method DistributedCache.getLocalCacheFiles(job), which retrieves the cached files from the local disk, being called anywhere. So how does Pig read these files from the local disk? I am looking at the pig 0.7 source code.
Thanks,
-Gang
[jira] Updated: (PIG-1441) New test targets: unit and smoke
[ https://issues.apache.org/jira/browse/PIG-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1441:
--------------------------------
Attachment: PIG-1441.patch

New test targets: unit and smoke
---
Key: PIG-1441
URL: https://issues.apache.org/jira/browse/PIG-1441
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Olga Natkovich
Fix For: 0.8.0
Attachments: PIG-1441.patch

As we get more and more tests, adding more structure would help us minimize the time spent on testing. Here are 2 new targets I propose we add (Hadoop has the same targets for the same purposes):
unit - runs all true unit tests (those that truly test APIs and internal functionality, rather than running e2e tests through junit). This target should run relatively quickly, 10-15 minutes, and if we are good at adding unit tests it will give good coverage.
smoke - a set of a few e2e tests that provide good overall coverage within about 30 minutes.
I would say that for simple patches we would still require only the commit tests, while for more involved patches the developers should run both unit and smoke before submitting the patch.
[jira] Commented: (PIG-1441) New test targets: unit and smoke
[ https://issues.apache.org/jira/browse/PIG-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876537#action_12876537 ]

Olga Natkovich commented on PIG-1441:
-------------------------------------

I uploaded the patch. The unit test target runs in about 35 minutes and executes all non-e2e tests. The smoke test target runs for 15 minutes, so we can add more. So far, I have:
- a simple script (load + store)
- a fairly complex script
- multiquery (in local mode, because the MR mode one takes 45 minutes; this also tests local mode, so I think it is a good combination)
- streaming
Please suggest whether we should add any other e2e tests. Also, please review the patch. It does not need to go through patch test, since no code is modified.

New test targets: unit and smoke
---
Key: PIG-1441
URL: https://issues.apache.org/jira/browse/PIG-1441
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Olga Natkovich
Fix For: 0.8.0
Attachments: PIG-1441.patch
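For reference, a target of this kind would look roughly as follows in Ant. The target name comes from the jira, but the `depends` value, property names, and exclude patterns are assumptions; Pig's real build.xml defines its own:

```xml
<!-- Hypothetical sketch of the proposed "unit" target. -->
<target name="unit" depends="compile-test"
        description="run true unit tests only (no e2e suites)">
  <junit printsummary="yes" haltonfailure="no" fork="yes">
    <classpath refid="test.classpath"/>
    <batchtest todir="${test.log.dir}">
      <!-- include unit suites, exclude e2e-style tests (patterns assumed) -->
      <fileset dir="${test.src.dir}" includes="**/Test*.java"
               excludes="**/TestMultiQuery*.java"/>
    </batchtest>
  </junit>
</target>
```

A "smoke" target would have the same shape with an explicit include list of the few chosen e2e suites.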
Re: distributed cache in pig
Thanks Olga. But what about running in mapreduce mode? Once the distributed cache is enabled in that mode, there should still be some way to read the cached files. Actually, searching all the source files in pig-0.7, I can't find 'DistributedCache.getLocalCacheFiles' anywhere, so I suppose there is no other way to read the cached files. This is what confuses me. Any other ideas?
-Gang

----- Original Message -----
From: Olga Natkovich ol...@yahoo-inc.com
To: pig-dev@hadoop.apache.org
Sent: Monday, 2010/6/7, 6:50:01 PM
Subject: RE: distributed cache in pig

This is because Hadoop 20 does not support distributed cache in local mode. My understanding is that it would be part of Hadoop 22.
Olga

-----Original Message-----
From: Gang Luo [mailto:lgpub...@yahoo.com.cn]
Sent: Monday, June 07, 2010 3:40 PM
To: pig-dev@hadoop.apache.org
Subject: distributed cache in pig

HI all, I notice that whether pig use distributed cache depends on the context (local or mapreduce). When running in mapreduce mode, the distributed cache is always enable (e.g. replicated join). However, I never find such method, DistributedCache.getLocalCacheFiles(job), which get the cached file from the local disk. So, how does pig read these files from local disk? I am looking at the pig 0.7 source code. Thanks, -Gang
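To make the mechanism under discussion concrete, here is a toy model of cache localization in plain Java: the framework copies a shared file into a task-local directory before the task runs, and the task reads only the local copy. In real Hadoop the analogous calls are DistributedCache.addCacheFile() at job-submission time and DistributedCache.getLocalCacheFiles() (or a symlink in the task's working directory) at task time; the class and paths below are made up:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Toy model of distributed-cache localization.
public class CacheModel {
    // "Framework" side: localize the shared file for this task.
    public static Path localize(Path shared, Path taskDir) throws IOException {
        Files.createDirectories(taskDir);
        Path local = taskDir.resolve(shared.getFileName());
        Files.copy(shared, local, StandardCopyOption.REPLACE_EXISTING);
        return local;
    }

    // "Task" side: read only from the local copy, never the shared path.
    public static String readLocal(Path local) throws IOException {
        return new String(Files.readAllBytes(local));
    }
}
```

The point is that a task never needs the shared location at runtime: by the time it starts, the file already exists somewhere local, and the only question (the one raised in this thread) is which API or convention hands the task that local path.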