[jira] Created: (PIG-1427) Monitor and kill runaway UDFs
Monitor and kill runaway UDFs - Key: PIG-1427 URL: https://issues.apache.org/jira/browse/PIG-1427 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy As a safety measure, it is sometimes useful to monitor UDFs as they execute. It is often preferable to time out a runaway evaluation and return null or some other default value rather than let it kill the job. We have in the past seen complex regular expressions lead to job failures due to just half a dozen (out of millions) particularly obnoxious strings. It would be great to give Pig users a lightweight way of enabling UDF monitoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-766) java.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871719#action_12871719 ] Dirk Schmid commented on PIG-766: - {quote}1. Are you getting the exact same stack trace as mentioned in the jira?{quote} Yes the same and some similar traces:
{noformat}
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
    at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:279)
    at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
    at org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:249)
    at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:214)
    at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
    at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:209)
    at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
    at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179)
    at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:880)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1201)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:199)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)
java.lang.OutOfMemoryError: Java heap space
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:58)
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263)
    at org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:63)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:284)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:155)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:242)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:170)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161) at
{noformat}
[jira] Updated: (PIG-1249) Safe-guards against misconfigured Pig scripts without PARALLEL keyword
[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-1249: Attachment: PIG_1249_3.patch Updated the patch to include a test case for non-DFS input and to do path checking when estimating. Safe-guards against misconfigured Pig scripts without PARALLEL keyword -- Key: PIG-1249 URL: https://issues.apache.org/jira/browse/PIG-1249 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Arun C Murthy Assignee: Jeff Zhang Priority: Critical Fix For: 0.8.0 Attachments: PIG-1249.patch, PIG_1249_2.patch, PIG_1249_3.patch It would be *very* useful for Pig to have safe-guards against naive scripts which process a *lot* of data without the use of the PARALLEL keyword. We've seen a fair number of instances where naive users process huge data-sets (10TB) with a badly mis-configured number of reduces, e.g. 1 reduce.
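The estimation mentioned in the patch can be illustrated with a size-based heuristic; the 1 GB-per-reducer figure and the 999-reducer cap below are illustrative assumptions for the sketch, not numbers taken from PIG-1249:

```java
// Sketch of size-based reducer estimation for scripts that omit PARALLEL.
// BYTES_PER_REDUCER and MAX_REDUCERS are assumed defaults, chosen only to
// make the idea concrete.
public final class ReducerEstimator {
    static final long BYTES_PER_REDUCER = 1L << 30; // 1 GB per reducer (assumed)
    static final int MAX_REDUCERS = 999;            // safety cap (assumed)

    public static int estimate(long totalInputBytes) {
        // Round up, then clamp into [1, MAX_REDUCERS].
        long n = (totalInputBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER;
        return (int) Math.min(Math.max(1L, n), MAX_REDUCERS);
    }

    public static void main(String[] args) {
        // A 10 TB input would want far more than 1 reducer.
        System.out.println(estimate(10L << 40)); // 999 (capped)
    }
}
```

With such a guard in place, the misconfigured "10 TB through 1 reduce" case from the description would be caught automatically whenever the user leaves PARALLEL unset.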
[jira] Updated: (PIG-1381) Need a way for Pig to take an alternative property file
[ https://issues.apache.org/jira/browse/PIG-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1381: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Manual test passes. Patch committed. Thanks V.V.Chaitanya! Need a way for Pig to take an alternative property file --- Key: PIG-1381 URL: https://issues.apache.org/jira/browse/PIG-1381 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: V.V.Chaitanya Krishna Fix For: 0.8.0 Attachments: PIG-1381-1.patch, PIG-1381-2.patch, PIG-1381-3.patch, PIG-1381-4.patch, PIG-1381-5.patch, PIG-1381_cli_1.patch, PIG-1381_cli_2.patch Currently, Pig reads the first pig.properties found in the CLASSPATH. Pig ships a default pig.properties, and if the user has a different pig.properties there will be a conflict, since we can only read one. There are a couple of ways to solve it: 1. Give a command line option for the user to pass an additional property file. 2. Rename the default pig.properties to pig-default.properties, so the user can supply a pig.properties to override it. 3. Further, we could consider using pig-default.xml/pig-site.xml, which seems more natural for the hadoop community. If so, we shall provide backward compatibility to also read pig.properties and pig-cluster-hadoop-site.xml.
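Option 2 above (a shipped pig-default.properties overridden by a user pig.properties) mirrors Hadoop's default/site layering; a minimal sketch of that behavior, with the file names taken from the proposal and the class itself purely illustrative:

```java
import java.io.InputStream;
import java.util.Properties;

// Sketch of layered property loading: defaults load first, then an
// optional user file overrides them. Not Pig code; just the shape of
// option 2 from the issue description.
public final class LayeredProps {

    // Pure merge step: user entries shadow default entries.
    public static Properties merge(Properties defaults, Properties user) {
        Properties merged = new Properties();
        merged.putAll(defaults);
        merged.putAll(user); // later put wins, so user values override
        return merged;
    }

    // Load both files from the classpath; either may be absent.
    public static Properties load(ClassLoader cl) throws Exception {
        Properties defaults = new Properties();
        Properties user = new Properties();
        try (InputStream in = cl.getResourceAsStream("pig-default.properties")) {
            if (in != null) defaults.load(in);
        }
        try (InputStream in = cl.getResourceAsStream("pig.properties")) {
            if (in != null) user.load(in);
        }
        return merge(defaults, user);
    }
}
```

This keeps the "only one pig.properties on the CLASSPATH" limitation from mattering: the default file has a distinct name, so a user-supplied pig.properties can never collide with it.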
[jira] Updated: (PIG-1419) Remove user.name from JobConf
[ https://issues.apache.org/jira/browse/PIG-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1419: Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Manual test passes. Patch committed. Remove user.name from JobConf --- Key: PIG-1419 URL: https://issues.apache.org/jira/browse/PIG-1419 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1419-1.patch, PIG-1419-2.patch With Hadoop security, Hadoop uses the Kerberos id instead of the Unix id. Pig should not set the user.name entry in the jobconf; that should be decided by Hadoop.
[jira] Updated: (PIG-1347) Clear up output directory for a failed job
[ https://issues.apache.org/jira/browse/PIG-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1347: Attachment: PIG-1347-1.patch Removed redundant code. Clear up output directory for a failed job -- Key: PIG-1347 URL: https://issues.apache.org/jira/browse/PIG-1347 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Ashitosh Darbarwar Fix For: 0.8.0 Attachments: PIG-1347-1.patch FileLocalizer.deleteOnFail is supposed to track the output files that need to be deleted in case the job fails. However, in the current code base, deleteOnFail is dangling: registerDeleteOnFail and triggerDeleteOnFail are called by nobody. We need to bring it back.
[jira] Updated: (PIG-1347) Clear up output directory for a failed job
[ https://issues.apache.org/jira/browse/PIG-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1347: Attachment: (was: PIG-1347-1.patch) Clear up output directory for a failed job -- Key: PIG-1347 URL: https://issues.apache.org/jira/browse/PIG-1347 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Ashitosh Darbarwar Fix For: 0.8.0 Attachments: PIG-1347-1.patch FileLocalizer.deleteOnFail is supposed to track the output files that need to be deleted in case the job fails. However, in the current code base, deleteOnFail is dangling: registerDeleteOnFail and triggerDeleteOnFail are called by nobody. We need to bring it back.
[jira] Updated: (PIG-1347) Clear up output directory for a failed job
[ https://issues.apache.org/jira/browse/PIG-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-1347: Attachment: PIG-1347-1.patch Clear up output directory for a failed job -- Key: PIG-1347 URL: https://issues.apache.org/jira/browse/PIG-1347 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Ashitosh Darbarwar Fix For: 0.8.0 Attachments: PIG-1347-1.patch FileLocalizer.deleteOnFail is supposed to track the output files that need to be deleted in case the job fails. However, in the current code base, deleteOnFail is dangling: registerDeleteOnFail and triggerDeleteOnFail are called by nobody. We need to bring it back.
[jira] Commented: (PIG-1419) Remove user.name from JobConf
[ https://issues.apache.org/jira/browse/PIG-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871845#action_12871845 ] Pradeep Kamath commented on PIG-1419: - +1 Remove user.name from JobConf --- Key: PIG-1419 URL: https://issues.apache.org/jira/browse/PIG-1419 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.8.0 Attachments: PIG-1419-1.patch, PIG-1419-2.patch With Hadoop security, Hadoop uses the Kerberos id instead of the Unix id. Pig should not set the user.name entry in the jobconf; that should be decided by Hadoop.
[jira] Commented: (PIG-1347) Clear up output directory for a failed job
[ https://issues.apache.org/jira/browse/PIG-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871862#action_12871862 ] Ashutosh Chauhan commented on PIG-1347: --- The patch is pretty straightforward and harmless, as it only removes code and does not add anything new. The only concern I have is that FileLocalizer.registerDeleteOnFail() is a public method, so it's possible that someone using Pig's Java API was previously using it to do the cleanup himself; this could be considered a backward-incompatible change. But Daniel explained to me that this method was meant for Pig's internal usage, and cleanup was in any case taken care of by Pig before the recent store func changes, so users did not need to worry about it. It's therefore extremely unlikely that someone is using it. +1 on committing. Clear up output directory for a failed job -- Key: PIG-1347 URL: https://issues.apache.org/jira/browse/PIG-1347 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Ashitosh Darbarwar Fix For: 0.8.0 Attachments: PIG-1347-1.patch FileLocalizer.deleteOnFail is supposed to track the output files that need to be deleted in case the job fails. However, in the current code base, deleteOnFail is dangling: registerDeleteOnFail and triggerDeleteOnFail are called by nobody. We need to bring it back.
[jira] Resolved: (PIG-1347) Clear up output directory for a failed job
[ https://issues.apache.org/jira/browse/PIG-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-1347. - Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed. Clear up output directory for a failed job -- Key: PIG-1347 URL: https://issues.apache.org/jira/browse/PIG-1347 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Ashitosh Darbarwar Fix For: 0.8.0 Attachments: PIG-1347-1.patch FileLocalizer.deleteOnFail is supposed to track the output files that need to be deleted in case the job fails. However, in the current code base, deleteOnFail is dangling: registerDeleteOnFail and triggerDeleteOnFail are called by nobody. We need to bring it back.
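The registerDeleteOnFail/triggerDeleteOnFail pair described in this issue amounts to a registry of output paths that gets flushed when a job fails. A self-contained sketch using local files; the method names echo FileLocalizer's, but the class and mechanics here are illustrative, not the committed patch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Sketch: output locations registered before the job runs are deleted
// if the job fails, so a failed run leaves no partial output behind.
public final class DeleteOnFail {
    private final Set<Path> outputs = new HashSet<>();

    // Called when an output location is created for a job.
    public void registerDeleteOnFail(Path p) {
        outputs.add(p);
    }

    // Called from the failure path of job execution.
    public void triggerDeleteOnFail() {
        for (Path p : outputs) {
            try {
                Files.deleteIfExists(p);
            } catch (IOException e) {
                // best-effort cleanup; a leftover file is better than a crash here
            }
        }
        outputs.clear();
    }
}
```

The bug in the issue is precisely that nothing on the failure path ever calls the trigger, so the registry (however it is implemented) is dead code.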
Does EvalFunc always generate the entire bag?
Hey guys, how are bags passed to EvalFunc stored? I was looking at the Accumulator interface, and it says the reason it is needed for COUNT and SUM is that EvalFunc always gives you the entire bag when it is run on a bag. I always thought that if I did COUNT(TABLE) or SUM(TABLE.FIELD), the code inside that does for (Tuple entry : inputDataBag) { stuff } was an actual iterator that walked the bag sequentially, without necessarily holding the entire bag in memory at once. It's an iterator, after all, so there's no way to do anything other than stream through it. I'm looking at this because Accumulator has no way of telling Pig "I've seen enough": it streams through the entire bag no matter what happens. Hypothetically speaking, if I were writing a "5th item of a sorted bag" UDF, then after I see the 5th item of a 5-million-entry bag I want to stop executing if possible. Is there an easy way to make this happen?
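The "stop after the 5th item" idea from the question can be sketched over a plain Java iterator. Unlike the Accumulator contract described above, pull-style iteration lets the consumer simply stop pulling; the class below is illustrative (a real Pig UDF would iterate a DataBag of Tuples instead of a generic Iterable):

```java
import java.util.Iterator;
import java.util.List;

// Sketch of a "5th item of a bag" computation that stops early:
// once the fifth element is seen, the rest of the bag is never read.
public final class FifthItem {
    public static <T> T fifth(Iterable<T> bag) {
        int seen = 0;
        for (Iterator<T> it = bag.iterator(); it.hasNext(); ) {
            T t = it.next();
            if (++seen == 5) {
                return t; // early exit: no need to stream the remaining items
            }
        }
        return null; // bag had fewer than five items
    }

    public static void main(String[] args) {
        System.out.println(fifth(List.of(10, 20, 30, 40, 50, 60))); // 50
    }
}
```

The catch, as the question notes, is that Accumulator's accumulate/getValue/cleanup cycle offers no equivalent of this early return: Pig keeps feeding batches until the bag is exhausted.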
[jira] Commented: (PIG-1424) Error logs of streaming should not be placed in output location
[ https://issues.apache.org/jira/browse/PIG-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871902#action_12871902 ] Ashutosh Chauhan commented on PIG-1424: --- Till we figure out a proper solution for this, one possibility is to wrap the code in my previous comment in a try-catch block. That will unblock PIG-1229 for commit. We can leave this ticket open if we feel there is a need for a better solution. Error logs of streaming should not be placed in output location --- Key: PIG-1424 URL: https://issues.apache.org/jira/browse/PIG-1424 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Fix For: 0.8.0 This becomes a problem when the output location is anything other than a filesystem. Output will be written to the DB, but where should the logs generated by streaming go? Clearly, they can't be written into the DB. This blocks PIG-1229, which introduces writing to a DB from Pig.
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872007#action_12872007 ] Arnab Nandi commented on PIG-928: - Thanks for looking into the patch, Ashutosh! Very good question; short answer: I couldn't come up with an elegant solution using {{define}} :) I spent a bunch of time thinking about the right thing to do before going this way. As Woody mentioned, my initial instinct was to do this in {{define}}, but I kept hitting roadblocks when working with {{define}}: # I came up with the analogy that register is like import in java, and define is like alias in bash. In this interpretation, whenever you want to introduce new code, you {{register}} it with Pig. Whenever you want to alias anything for convenience or to add meta-information, you {{define}} it. # Define is not amenable to multiple functions in the same script. #* For example, to follow the {{stream}} convention, {quote} \{define X 'x.py' [inputoutputspec][schemaspec];\}. {quote} Which function is the input/output spec for? A solution like {quote} \{[func1():schemaspec1,func2:schemaspec2]} {quote} is... ugly. #* Further, how do we access these functions? One solution is to have the namespace as a codeblock, e.g. X.func1(), which is doable by registering functions as X.func1, but we're (mis)leading the user to believe there is some sort of real namespacing going on. I foresee multi-function files as a very common use case; people could have a util.py with their commonly used suite of functions instead of being forced into 1 file per 2-3 line function. #* Note that Julien's @decorator idea cleanly solves this problem, and I think it'll work for all languages. 
# With inline {{define}}, most languages already have a convention for declaring a function: the function name, its inputs, and its return value. It seems redundant to force the user to break this convention and have something like {quote} \{define x as script('def X(a,b): return a + b;');}, {quote} and have x.X(). Lambdas can solve this problem halfway, but you'll then need to worry about the schema spec, and we're back at a kludgy solution! # My plan for inline functions is to write them all to a temp file (1 per script engine) and then deal with them as a registered file. # Jython code runs in its own interpreter because I couldn't figure out how to load Jython bytecode into Java; this has something to do with the lack of a jythonc, afaik (I may be wrong). There will be one interpreter per non-compilable script engine; for others (Janino, Groovy), we load the class directly into the runtime. # From a code-writing perspective, overloading {{define}} to tack on a third use-case would involve an overhaul of the POStream physical operator and felt very inelegant; register, on the other hand, is well contained to a single purpose -- including files for UDFs. # Consider the use of Janino as a ScriptEngine. Unlike the Jython script engine, this loads java UDFs into the native runtime and doesn't translate objects, so we're looking at potentially _zero_ loss of performance for inline UDFs (or register 'UDF.java'; ). The difference between native and script code gets blurry here... [tl;dr] ...and then I thought fair enough, let's just go with {{register}}! :D UDFs in scripting languages --- Key: PIG-928 URL: https://issues.apache.org/jira/browse/PIG-928 Project: Pig Issue Type: New Feature Reporter: Alan Gates Fix For: 0.8.0 Attachments: calltrace.png, package.zip, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, scripting.tgz, scripting.tgz, test.zip It should be possible to write UDFs in scripting languages such as python, ruby, etc. 
This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java.
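The "X.func1()" pseudo-namespacing discussed in the comment above can be sketched as a flat registry keyed by dotted names, which makes the "(mis)leading the user" point concrete: there is no real namespace, just a naming convention. The class and API below are illustrative, not Pig code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: functions from one script file are registered under
// "<file-alias>.<name>" keys. This is a flat map dressed up with dots,
// not true namespacing; looking up an unknown name just fails.
public final class UdfRegistry {
    private final Map<String, Function<Object[], Object>> fns = new HashMap<>();

    public void register(String alias, String name, Function<Object[], Object> fn) {
        fns.put(alias + "." + name, fn); // e.g. "util.concat"
    }

    public Object call(String qualifiedName, Object... args) {
        Function<Object[], Object> fn = fns.get(qualifiedName);
        if (fn == null) {
            throw new IllegalArgumentException("no UDF registered as " + qualifiedName);
        }
        return fn.apply(args);
    }
}
```

Under this scheme a util.py with many small functions registers each one under the same alias prefix, which is the multi-function-file use case the comment predicts will be common.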
[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs
[ https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872031#action_12872031 ] Ashutosh Chauhan commented on PIG-1427: --- A useful feature. A couple of comments: 1. Currently, in case of timeouts and errors you always return null. It would be useful if the user could specify, in the definition of his annotation, a default return value to be returned in those cases. For example, if my regex fails on an input String, I want to get an empty String back. Something like: {code} @MonitoredUDF(timeUnit = TimeUnit.MILLISECONDS, duration = 500, defaultReturnValue = ) {code} 2. It seems that the PigHadoopLogger.getReporter() method accidentally got removed in 0.7 and trunk. This needs to be restored. It would be really cool to see on the UI how many of my input records are faulty. Since it is a small change, I think you can add that getter method back in and then update the appropriate counters. Monitor and kill runaway UDFs - Key: PIG-1427 URL: https://issues.apache.org/jira/browse/PIG-1427 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: monitoredUdf.patch As a safety measure, it is sometimes useful to monitor UDFs as they execute. It is often preferable to time out a runaway evaluation and return null or some other default value rather than let it kill the job. We have in the past seen complex regular expressions lead to job failures due to just half a dozen (out of millions) particularly obnoxious strings. It would be great to give Pig users a lightweight way of enabling UDF monitoring.
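The monitoring-plus-default-value idea can be sketched with plain JDK concurrency primitives, independent of the actual monitoredUdf.patch: run the evaluation in an executor and substitute the caller's default on timeout. Names below are illustrative; the real patch would wrap the UDF's exec call rather than take a Callable:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of timeout-guarded evaluation with a caller-supplied default,
// the behavior suggested in comment 1 above.
public final class MonitoredCall {
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true); // runaway evaluations must not keep the JVM alive
        return t;
    });

    public static <T> T callWithTimeout(Callable<T> udf, long millis, T defaultValue) {
        Future<T> f = POOL.submit(udf);
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true);      // interrupt the runaway evaluation
            return defaultValue; // e.g. "" for a regex that never finishes
        } catch (Exception e) {
            return defaultValue; // errors also map to the default, per the comment
        }
    }
}
```

With this shape, the half-dozen pathological strings from the issue description cost one timeout each instead of a failed job.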