[jira] Commented: (PIG-766) java.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871046#action_12871046 ]

Dirk Schmid commented on PIG-766:
---------------------------------

{quote}Many memory changes went in. Please, reopen if this is still a problem.{quote}

I found that the problem described by Vadim still exists with the following configuration:
- Apache Hadoop 0.20.2
- Pig 0.7.0, and also 0.8.0-dev (18 May)

java.lang.OutOfMemoryError: Java heap space
-------------------------------------------
            Key: PIG-766
            URL: https://issues.apache.org/jira/browse/PIG-766
        Project: Pig
     Issue Type: Bug
     Components: impl
Affects Versions: 0.2.0, 0.7.0
    Environment: Hadoop-0.18.3 (Cloudera RPMs). mapred.child.java.opts=-Xmx1024m
       Reporter: Vadim Zaliva

My pig script always fails with the following error:
{noformat}
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2786)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
        at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:213)
        at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291)
        at org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:233)
        at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:162)
        at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291)
        at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:83)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
        at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:156)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spillSingleRecord(MapTask.java:857)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:467)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:101)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:219)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:208)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:86)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
{noformat}
--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-766) java.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dirk Schmid updated PIG-766:
----------------------------
    Affects Version/s: 0.7.0
[jira] Commented: (PIG-766) java.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871253#action_12871253 ]

Ashutosh Chauhan commented on PIG-766:
--------------------------------------

Dirk,
1. Are you getting the exact same stack trace as mentioned in the jira?
2. Which operations are you doing in your query: join, group-by, any others?
3. What load/store func are you using to read and write data? PigStorage or your own?
4. What is your data size, and how much memory is available to your tasks?
5. Do you have very large records in your dataset, e.g. hundreds of MB for one record?

It would be great if you could paste here the script from which you get this exception.
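The question about very large records follows from the stack trace: it bottoms out in ByteArrayOutputStream.write via Arrays.copyOf, meaning one record is being serialized into a single in-memory byte buffer that grows by doubling. During each resize, the old buffer and the doubled new buffer are live simultaneously, so a record of size n can transiently need roughly 3n bytes. A minimal sketch of that growth pattern (plain Python, illustrative only, not Pig code):

```python
def buffer_growth(record_size, initial=32):
    """Simulate ByteArrayOutputStream-style capacity doubling and
    report the peak transient allocation for a record_size-byte write."""
    capacity = initial
    peak = capacity
    while capacity < record_size:
        # Arrays.copyOf allocates the doubled buffer while the old
        # one is still referenced, so both are resident at once.
        peak = max(peak, capacity + 2 * capacity)
        capacity *= 2
    return capacity, peak

# A 600 MB record can transiently need ~1.5 GB during the final copy,
# which alone exceeds a -Xmx1024m heap.
cap, peak = buffer_growth(600 * 1024 * 1024)
```

This is why a -Xmx1024m child heap can fail on records far smaller than 1 GB, and why question 5 above is the key diagnostic.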
[jira] Assigned: (PIG-1347) Clear up output directory for a failed job
[ https://issues.apache.org/jira/browse/PIG-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashitosh Darbarwar reassigned PIG-1347:
---------------------------------------
    Assignee: Ashitosh Darbarwar (was: Daniel Dai)

Clear up output directory for a failed job
------------------------------------------
            Key: PIG-1347
            URL: https://issues.apache.org/jira/browse/PIG-1347
        Project: Pig
     Issue Type: Bug
     Components: impl
Affects Versions: 0.7.0
       Reporter: Daniel Dai
       Assignee: Ashitosh Darbarwar
        Fix For: 0.8.0

FileLocalizer.deleteOnFail is supposed to track the output files that need to be deleted in case the job fails. However, in the current code base, deleteOnFail is dangling: registerDeleteOnFail and triggerDeleteOnFail are called by nobody. We need to bring it back.
[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871358#action_12871358 ]

Alan Gates commented on PIG-1249:
---------------------------------

1. In this code, what happens if a loader is not loading from a file (like an HBase loader)? It looks to me like it will end up throwing an IOException when it tries to stat the 'file', which won't exist, and that will cause Pig to die. Ideally in this case it should decide that it cannot make a rational estimate and not try to estimate.
{color:blue} It won't throw an IOException when the file doesn't exist; getTotalInputFileSize will return 0 if we are not loading from a file or the file doesn't exist, and the final estimated reducer number will be 1. {color}
{color:red} Could we add a test for this? I think it would be good to assure it works in this situation. Maybe you could take one of the tests that uses the HBase loader. {color}

2. I'm curious where the values of ~1GB per reducer and 999 reducers came from.
{color:blue} These two numbers are what Hive uses; I'm not sure where they came from. Maybe from their experience. {color}
{color:red} OK, good enough. We can adjust them later if we need to. {color}

3. Does this estimate apply only to the first job or to all jobs?
{color:blue} It will apply to all the jobs. {color}
{color:red} Eventually we should change this to do the estimation on the fly in the JobControlCompiler. Since most queries tend to aggregate data down after a number of steps, I suspect that using the initial input to estimate the entire query will mean that the final results are parallelized too widely. But this is better than the current situation, where they aren't parallelized at all. {color}

4. How does this work in the case of joins, where there are multiple inputs to a job?
{color:blue} It will estimate the reducer number according to the total size of all the input files. {color}
{color:red} Cool. {color}

So other than testing the non-file case I'm +1 on this patch.
Safe-guards against misconfigured Pig scripts without PARALLEL keyword
----------------------------------------------------------------------
            Key: PIG-1249
            URL: https://issues.apache.org/jira/browse/PIG-1249
        Project: Pig
     Issue Type: Improvement
Affects Versions: 0.8.0
       Reporter: Arun C Murthy
       Assignee: Jeff Zhang
       Priority: Critical
        Fix For: 0.8.0
    Attachments: PIG-1249.patch, PIG_1249_2.patch

It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of the PARALLEL keyword. We've seen a fair number of instances where naive users process huge data-sets (10TB) with badly mis-configured #reduces, e.g. 1 reduce.
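The heuristic discussed above (~1 GB of input per reducer, capped at 999 reducers, falling back to 1 when the input size is unknown) can be sketched as follows. This is an illustrative Python sketch, not Pig's actual implementation; the constant names are assumptions:

```python
import math

BYTES_PER_REDUCER = 1_000_000_000  # ~1 GB per reducer (the Hive default cited above)
MAX_REDUCERS = 999                 # the cap cited above

def estimate_reducers(total_input_bytes):
    """Estimate reducer parallelism from total input size.

    Returns 1 when the size is unknown or zero, e.g. a non-file
    loader such as an HBase loader, where getTotalInputFileSize
    reports 0."""
    if total_input_bytes <= 0:
        return 1
    return min(MAX_REDUCERS, math.ceil(total_input_bytes / BYTES_PER_REDUCER))
```

With multiple join inputs, total_input_bytes would be the sum of all inputs' sizes, matching the behavior described in point 4 of the review. Note that a 10 TB input, the problem case cited below, would estimate 10,000 reducers and hit the 999 cap.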
[jira] Commented: (PIG-1419) Remove user.name from JobConf
[ https://issues.apache.org/jira/browse/PIG-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871389#action_12871389 ]

Pradeep Kamath commented on PIG-1419:
-------------------------------------

+1

Minor observation in GruntParser.java:
{noformat}
if (path == null) {
    if (mDfs instanceof HDataStorage) {
        container = mDfs.asContainer(((HDataStorage)mDfs).
                getHFS().getHomeDirectory().toString());
    } else
        container = mDfs.asContainer("/user/" + System.getProperty("user.name"));
{noformat}
Would the else ever get executed? (I think currently mDfs is always an instance of HDataStorage, right?) If this is just to make it future-proof, then I am fine keeping it. Minor style comment: it would be good to enclose the else body in {} even though it is a single statement; there is another statement right below the container = ... statement, so it would be more readable with a {} block.

Remove user.name from JobConf
-----------------------------
            Key: PIG-1419
            URL: https://issues.apache.org/jira/browse/PIG-1419
        Project: Pig
     Issue Type: Bug
     Components: impl
Affects Versions: 0.7.0
       Reporter: Daniel Dai
       Assignee: Daniel Dai
        Fix For: 0.8.0
    Attachments: PIG-1419-1.patch

With hadoop security, hadoop will use the kerberos id instead of the unix id. Pig should not set the user.name entry in the jobconf; this should be decided by hadoop.
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871448#action_12871448 ]

Ashutosh Chauhan commented on PIG-928:
--------------------------------------

Arnab,
Thanks for putting together a patch for this. One question I have is about register vs. define. Currently you are auto-registering all the functions in the script file, which then become available for later use in the script. But I am not sure how we will handle the case of inlined functions. For inline functions, {{define}} seems to be the natural choice, as noted in previous comments on the jira, and if so, then we need to modify define to support that use case. I wonder whether, to remain consistent, we should always use {{define}} to define non-native functions instead of auto-registering them. I also didn't get why there would be a need for separate interpreter instances in that case.

UDFs in scripting languages
---------------------------
            Key: PIG-928
            URL: https://issues.apache.org/jira/browse/PIG-928
        Project: Pig
     Issue Type: New Feature
       Reporter: Alan Gates
        Fix For: 0.8.0
    Attachments: calltrace.png, package.zip, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, scripting.tgz, scripting.tgz, test.zip

It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java.
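For context, a scripting-language UDF of the kind under discussion is just a plain function plus some metadata telling Pig the output schema. The sketch below is illustrative only: the outputSchema decorator is a hypothetical design, stubbed here in plain Python so the snippet runs standalone, and is not a committed Pig API:

```python
# Stub of a hypothetical schema-declaring decorator; in a real
# scripting-UDF runtime this metadata would tell Pig the output type.
def outputSchema(schema):
    def wrap(fn):
        fn.output_schema = schema  # attach the declared schema to the function
        return fn
    return wrap

@outputSchema("word:chararray")
def to_upper(word):
    """A trivial UDF body: uppercase one chararray field."""
    return None if word is None else word.upper()
```

Under a register-based design all such functions in a file become callable at once; under a define-based design each would be bound to a Pig alias explicitly, which is the consistency question raised in the comment above.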
[jira] Updated: (PIG-1419) Remove user.name from JobConf
[ https://issues.apache.org/jira/browse/PIG-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1419:
----------------------------
    Attachment: PIG-1419-2.patch

Regarding Pradeep's review comment: I think we can safely assume HDataStorage is the only data storage, so we can remove this check and make it simpler. Reattaching the patch.
[jira] Updated: (PIG-1419) Remove user.name from JobConf
[ https://issues.apache.org/jira/browse/PIG-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1419:
----------------------------
    Attachment: (was: PIG-1419-2.patch)
[jira] Updated: (PIG-1419) Remove user.name from JobConf
[ https://issues.apache.org/jira/browse/PIG-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1419:
----------------------------
    Attachment: PIG-1419-2.patch
[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871511#action_12871511 ]

Daniel Dai commented on PIG-1343:
---------------------------------

This script will reproduce the issue:
{code}
a = load '1.txt' as (a0:int);
b = foreach a generate StringSize(a0);
store b into '111';
{code}
However, if we replace the store with a dump, we do get a log file.

pig_log file missing even though Main tells it is creating one and an M/R job fails
-----------------------------------------------------------------------------------
            Key: PIG-1343
            URL: https://issues.apache.org/jira/browse/PIG-1343
        Project: Pig
     Issue Type: Bug
     Components: impl
Affects Versions: 0.6.0
       Reporter: Viraj Bhat

There is a particular case I ran into while running with the latest trunk of Pig:
{code}
$ java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
[main] INFO org.apache.pig.Main - Logging error messages to: /homes/viraj/pig_1263420012601.log

$ ls -l pig_1263420012601.log
ls: pig_1263420012601.log: No such file or directory
{code}
The job failed and the log file did not contain anything; the only way to debug was to look into the JobTracker logs. Here are some reasons which could have caused this behavior:
1) The underlying filer/NFS had some issues. In that case, should we not report an error on stdout?
2) There are some errors from the backend which are not being captured.

Viraj
[jira] Commented: (PIG-1347) Clear up output directory for a failed job
[ https://issues.apache.org/jira/browse/PIG-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871514#action_12871514 ]

Daniel Dai commented on PIG-1347:
---------------------------------

In the current code, we use StoreFunc.cleanupOnFailure for this purpose, and FileLocalizer.deleteOnFail should be removed. So this issue is fixed in trunk; we should just remove the redundant code.
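Whichever mechanism is kept (FileLocalizer.deleteOnFail or StoreFunc.cleanupOnFailure), the underlying pattern is the same: record output paths as stores are set up, and delete them only if the job fails. A language-neutral sketch in Python; the class and method names are illustrative, not Pig's actual API:

```python
class OutputCleaner:
    """Track the output locations of a running job and remove them on failure."""

    def __init__(self, delete_fn):
        self._delete = delete_fn   # e.g. a wrapper around FileSystem.delete
        self._paths = []

    def register_delete_on_fail(self, path):
        # Called once per store as the job plan is compiled.
        self._paths.append(path)

    def trigger_delete_on_fail(self):
        # Called only when the job fails; a successful job never
        # triggers this, so its output is left alone.
        deleted = []
        for p in self._paths:
            self._delete(p)
            deleted.append(p)
        self._paths.clear()
        return deleted
```

The bug described in the issue is precisely that the register/trigger calls existed but had no callers, so failed jobs left partial output behind.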