[jira] Commented: (PIG-766) java.lang.OutOfMemoryError: Java heap space

2010-05-25 Dirk Schmid (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871046#action_12871046
 ] 

Dirk Schmid commented on PIG-766:
---------------------------------

bq. Many memory changes went in. Please reopen if this is still a problem.

I found that the problem described by Vadim still exists with the following
configuration:

- Apache Hadoop 0.20.2
- Pig 0.7.0 and also 0.8.0-dev (May 18)

 java.lang.OutOfMemoryError: Java heap space
 -------------------------------------------

 Key: PIG-766
 URL: https://issues.apache.org/jira/browse/PIG-766
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0, 0.7.0
 Environment: Hadoop-0.18.3 (cloudera RPMs).
 mapred.child.java.opts=-Xmx1024m
Reporter: Vadim Zaliva

 My pig script always fails with the following error:
 java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2786)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:213)
at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291)
at org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:233)
at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:162)
at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291)
at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:83)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:156)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spillSingleRecord(MapTask.java:857)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:467)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:101)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:219)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:208)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:86)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-766) java.lang.OutOfMemoryError: Java heap space

2010-05-25 Dirk Schmid (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dirk Schmid updated PIG-766:


Affects Version/s: 0.7.0




[jira] Commented: (PIG-766) java.lang.OutOfMemoryError: Java heap space

2010-05-25 Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871253#action_12871253
 ] 

Ashutosh Chauhan commented on PIG-766:
--------------------------------------

Dirk,

1. Are you getting the exact same stack trace as mentioned in the jira?
2. Which operations are you doing in your query - join, group-by, any others?
3. What load/store func are you using to read and write data? PigStorage or
your own?
4. What is your data size, and how much memory is available to your tasks?
5. Do you have very large records in your dataset, like hundreds of MB for one
record?

It would be great if you could paste here the script from which you get this
exception.
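
Question 5 points at the usual culprit for this particular trace. As a JDK-only
illustration (record and field sizes here are hypothetical, not from the
report), serializing one record of several hundred MB through a growing
in-memory buffer reproduces the top frames above: ByteArrayOutputStream doubles
its backing array via Arrays.copyOf, so writing N bytes can transiently need
two to three times N of heap.

{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sizes: a ~600 MB record serialized into an in-memory buffer,
// the same pattern as the map-side spill of a single oversized record.
public class BigRecordDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        byte[] field = new byte[1 << 20];   // 1 MB per field
        for (int i = 0; i < 600; i++) {
            out.write(field);               // each capacity doubling calls Arrays.copyOf
        }
        System.out.println("Serialized " + buffer.size() + " bytes");
    }
}
{code}

Run with -Xmx1024m (the setting in the environment above) and this typically
dies in Arrays.copyOf: growing the buffer from 512 MB to 1 GB needs both arrays
live at once.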




[jira] Assigned: (PIG-1347) Clear up output directory for a failed job

2010-05-25 Ashitosh Darbarwar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashitosh Darbarwar reassigned PIG-1347:
---------------------------------------

Assignee: Ashitosh Darbarwar  (was: Daniel Dai)

 Clear up output directory for a failed job
 -------------------------------------------

 Key: PIG-1347
 URL: https://issues.apache.org/jira/browse/PIG-1347
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Ashitosh Darbarwar
 Fix For: 0.8.0


 FileLocalizer.deleteOnFail is supposed to track the output files that need to
 be deleted in case the job fails. However, in the current code base,
 deleteOnFail is dangling: registerDeleteOnFail and triggerDeleteOnFail are
 called by nobody. We need to bring it back.




[jira] Commented: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword

2010-05-25 Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871358#action_12871358
 ] 

Alan Gates commented on PIG-1249:
---------------------------------

1. In this code, what happens if a loader is not loading from a file (like an
HBase loader)? It looks to me like it will end up throwing an IOException when
it tries to stat the 'file', which won't exist, and that will cause Pig to die.
Ideally, in this case it should decide that it cannot make a rational estimate
and not try to estimate.

{color:blue}
It won't throw an IOException when the file doesn't exist; getTotalInputFileSize
will return 0 if the load is not from a file or the file doesn't exist, and the
final estimated reducer count will be 1.
{color}
{color:red}
Could we add a test for this?  I think it would be good to ensure it works
in this situation.  Maybe you could take one of the tests that uses the HBase
loader.
{color}

2. I'm curious where the values of ~1GB per reducer and 999 reducers came from.

{color:blue}
These two numbers are what Hive uses; I'm not sure where they came from. Maybe
from their experience.
{color}
{color:red}
ok, good enough.  We can adjust them later if we need to.
{color}

3. Does this estimate apply only to the first job or to all jobs?

{color:blue}
It will apply to all the jobs.
{color}
{color:red}
Eventually we should change this to do the estimation on the fly in the
JobControlCompiler.  Since most queries tend to aggregate data down after a
number of steps, I suspect that using the initial input to estimate the entire
query will mean that the final results are parallelized too widely.  But this
is better than the current situation, where they aren't parallelized at all.
{color}

4. How does this work in the case of joins, where there are multiple inputs to 
a job?

{color:blue}
It will estimate the reducer count according to the total size of all the input
files.
{color}
{color:red}
cool
{color}

So other than testing the non-file case I'm +1 on this patch.
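
Putting the numbers from this exchange together, here is a hedged sketch of the
estimation rule as described (method and constant names are illustrative, not
taken from the patch):

{code}
// Sketch of the reducer estimate discussed above: one reducer per ~1 GB of
// input, capped at 999, with a floor of 1 when the input size is unknown
// (e.g. a non-file loader reports 0).
public final class ReducerEstimate {
    static final long BYTES_PER_REDUCER = 1000L * 1000 * 1000; // ~1 GB (Hive-derived default)
    static final int  MAX_REDUCERS      = 999;                 // Hive-derived cap

    static int estimateReducers(long totalInputBytes) {
        if (totalInputBytes <= 0) {
            return 1; // cannot estimate: fall back to a single reducer
        }
        long estimate = (totalInputBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER; // ceil
        return (int) Math.min(estimate, MAX_REDUCERS);
    }

    public static void main(String[] args) {
        System.out.println(estimateReducers(0));                               // 1
        System.out.println(estimateReducers(5L * 1000 * 1000 * 1000));         // 5
        System.out.println(estimateReducers(10L * 1000 * 1000 * 1000 * 1000)); // 999 (10 TB hits the cap)
    }
}
{code}

For a join, totalInputBytes would be the sum of all the inputs' sizes, per
point 4 above.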


 Safe-guards against misconfigured Pig scripts without PARALLEL keyword
 -----------------------------------------------------------------------

 Key: PIG-1249
 URL: https://issues.apache.org/jira/browse/PIG-1249
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Arun C Murthy
Assignee: Jeff Zhang
Priority: Critical
 Fix For: 0.8.0

 Attachments: PIG-1249.patch, PIG_1249_2.patch


 It would be *very* useful for Pig to have safe-guards against naive scripts
 which process a *lot* of data without the use of the PARALLEL keyword.
 We've seen a fair number of instances where naive users process huge
 data-sets (10TB) with a badly mis-configured number of reduces, e.g. 1 reducer.




[jira] Commented: (PIG-1419) Remove user.name from JobConf

2010-05-25 Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871389#action_12871389
 ] 

Pradeep Kamath commented on PIG-1419:
-------------------------------------

+1

Minor observation in GruntParser.java:
{noformat}
if (path == null) {
    if (mDfs instanceof HDataStorage) {
        container = mDfs.asContainer(((HDataStorage)mDfs).
                getHFS().getHomeDirectory().toString());
    } else
        container = mDfs.asContainer("/user/" +
                System.getProperty("user.name"));
{noformat}

Would the else branch ever get executed? (I think currently mDfs is always an
instance of HDataStorage, right?) If this is just to make it future-proof, then
I am fine keeping it. Minor style comment - it would be good to enclose the else
body in {} even though it is a single statement; there is another statement
right below the container = ... statement, so it would be more readable with a
{} block.
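
For illustration, the braced form suggested here would read as follows (a
sketch against the excerpt above, not the committed code):

{noformat}
if (path == null) {
    if (mDfs instanceof HDataStorage) {
        container = mDfs.asContainer(
                ((HDataStorage) mDfs).getHFS().getHomeDirectory().toString());
    } else {
        container = mDfs.asContainer(
                "/user/" + System.getProperty("user.name"));
    }
}
{noformat}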

 Remove user.name from JobConf
 -----------------------------

 Key: PIG-1419
 URL: https://issues.apache.org/jira/browse/PIG-1419
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1419-1.patch


 With Hadoop security, Hadoop will use the Kerberos id instead of the unix id.
 Pig should not set the user.name entry in the JobConf; this should be decided
 by Hadoop.




[jira] Commented: (PIG-928) UDFs in scripting languages

2010-05-25 Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871448#action_12871448
 ] 

Ashutosh Chauhan commented on PIG-928:
--------------------------------------

Arnab,

Thanks for putting together a patch for this. One question I have is about
register vs. define. Currently you are auto-registering all the functions in the
script file, and then they are available for later use in the script. But I am
not sure how we will handle the case of inlined functions. For inline functions,
{{define}} seems to be the natural choice, as noted in previous comments of the
jira. And if so, then we need to modify define to support that use case. I am
wondering whether, to remain consistent, we should always use {{define}} to
define non-native functions instead of auto-registering them. I also didn't get
why there would be a need for separate interpreter instances in that case.


 UDFs in scripting languages
 ---------------------------

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
 Fix For: 0.8.0

 Attachments: calltrace.png, package.zip, pig-greek.tgz, 
 pig.scripting.patch.arnab, pyg.tgz, scripting.tgz, scripting.tgz, test.zip


 It should be possible to write UDFs in scripting languages such as Python,
 Ruby, etc.  This frees users from needing to compile Java, generate a jar,
 etc.  It also opens Pig to programmers who prefer scripting languages over
 Java.




[jira] Updated: (PIG-1419) Remove user.name from JobConf

2010-05-25 Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1419:


Attachment: PIG-1419-2.patch

Regarding Pradeep's review comment: I think we can safely assume HDataStorage
is the only data storage, so we can remove this check and make the code
simpler. Reattaching the patch.




[jira] Updated: (PIG-1419) Remove user.name from JobConf

2010-05-25 Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1419:


Attachment: (was: PIG-1419-2.patch)




[jira] Updated: (PIG-1419) Remove user.name from JobConf

2010-05-25 Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1419:


Attachment: PIG-1419-2.patch




[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-05-25 Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871511#action_12871511
 ] 

Daniel Dai commented on PIG-1343:
---------------------------------

This script will reproduce the issue:
{code}
a = load '1.txt' as (a0:int);
b = foreach a generate StringSize(a0);
store b into '111';
{code}

However, if we replace the store with a dump, we get the log file.

 pig_log file missing even though Main tells it is creating one and an M/R job fails
 ------------------------------------------------------------------------------------

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat

 There is a particular case where I was running with the latest trunk of Pig.
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything; the only way to
 debug was to look into the JobTracker logs.
 Here are some reasons that could have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case, do we not error on
 stdout?
 2) There are some errors from the backend which are not being captured.
 Viraj




[jira] Commented: (PIG-1347) Clear up output directory for a failed job

2010-05-25 Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871514#action_12871514
 ] 

Daniel Dai commented on PIG-1347:
---------------------------------

In the current code, we use StoreFunc.cleanupOnFailure for this purpose, and
FileLocalizer.deleteOnFail should be removed. So this issue is fixed in trunk;
we just need to remove the redundant code.
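
For anyone following along, the hook works roughly like this; a minimal sketch
assuming the 0.7-era StoreFunc API (the class and log message are illustrative,
and the other abstract methods are omitted):

{code}
import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.pig.StoreFunc;

// Declared abstract so the sketch stays focused on the cleanup hook; a real
// store function would also implement getOutputFormat, setStoreLocation, etc.
public abstract class LoggingStoreFunc extends StoreFunc {
    @Override
    public void cleanupOnFailure(String location, Job job) throws IOException {
        System.err.println("Job failed; removing partial output at " + location);
        // The default implementation deletes the output location.
        super.cleanupOnFailure(location, job);
    }
}
{code}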
