[jira] Assigned: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-832: -- Assignee: Daniel Dai (was: Olga Natkovich) Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721353#action_12721353 ] Olga Natkovich commented on PIG-832: As part of this fix we should also expand the default list to include piggybank functions Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721361#action_12721361 ] Milind Bhandarkar commented on PIG-832: --- If we include the piggybank functions in the default import list, we need to make sure that they are compiled and tested in the default build, and that the releases will be blocked due to them not compiling etc. Is that the intention ? Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721373#action_12721373 ] Olga Natkovich commented on PIG-832: In response to Milind. I don't think we are committing to more support for piggybank. All this does is, if you do use UDFs from piggybank, you don't need to use full package name. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721387#action_12721387 ] Olga Natkovich commented on PIG-832: Milind, Not quite sure what you are saying. We currently don't have any way to pass the list in. import.list does not exist in pig as far as I know. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721396#action_12721396 ] Milind Bhandarkar commented on PIG-832: --- Olga, what I am saying is to have a default import list: which contains default UDFs (tokenize, Max, Min, flatten), followed by piggybank contribs. And the same list can be added to / overridden on the command-line. This has several advantages. Pig built-ins do not have to be reserved words, and can be overridden. For example, recent mails on pig-users have mentioned that tokenize+flatten should be a single udf. This can be done by providing a flatten (which is null), and tokenize, which does tokenize+flatten, and existing scripts will still work. This simplifies pig grammar as well. Users can create udf libraries, and use them with: {code} java -Dimport.list += `cat my-udf-lib.import` {code} Thoughts ? Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721413#action_12721413 ] Milind Bhandarkar commented on PIG-832: --- Instead of a list, if you make it a map (i.e. short name - fully qualified class name), it will be much easier, as it will guarantee that each name has exactly one udf class associated with it. It will also allow users to use udfs that have class names which are pig reserved words. For example, If I have an existing UDF with a class name such as load or store, I can still use them with a different name like myload, without having to rename the class. So, I suggest: {code} java -jar pig.jar -Dimport.list+=MyLoad:com..Load,Flatten:com..Flatten,... {code} If I do not specify -Dimport.list on the pig command line, then the default import.list is used. Thoughts ? Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
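The alias-map idea in the comment above (short name to fully qualified class name, with each name bound to exactly one class) could be sketched roughly as follows. This is only an illustration of the proposal, not code from Pig: the class name UdfAliasMap, the parse method, and the com.example.udf package are hypothetical, and the "Alias:class,Alias:class" format is simply taken from the command line suggested in the comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Pig: parses "Alias:fully.qualified.Class,..." into a map.
final class UdfAliasMap {

    // Returns alias -> class name, preserving the order given on the command line.
    static Map<String, String> parse(String spec) {
        Map<String, String> aliases = new LinkedHashMap<>();
        if (spec == null || spec.trim().isEmpty()) {
            return aliases; // nothing supplied: caller falls back to its default list
        }
        for (String entry : spec.split(",")) {
            // Split on the first ':' only, so the class name is kept whole.
            String[] parts = entry.trim().split(":", 2);
            if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
                throw new IllegalArgumentException("Expected alias:class, got: " + entry);
            }
            // A map guarantees one class per alias; reject a second binding outright.
            if (aliases.putIfAbsent(parts[0], parts[1]) != null) {
                throw new IllegalArgumentException("Duplicate alias: " + parts[0]);
            }
        }
        return aliases;
    }

    public static void main(String[] args) {
        // com.example.udf.* are made-up class names used only for illustration.
        System.out.println(parse("MyLoad:com.example.udf.Load,Flatten:com.example.udf.Flatten"));
    }
}
```

Rejecting duplicates at parse time is one possible design choice; letting a later entry override an earlier one would be another, and the thread does not settle which is intended.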
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721414#action_12721414 ] Olga Natkovich commented on PIG-832: Milind, A couple of comments and clarifications: (1) Builtin UDFs are not reserved words. (Flatten is reserved, but it is not a UDF.) The issue we have seen is users creating UDFs that had reserved words in the package name, and if the package name is registered as proposed in this JIRA, their problem will go away. (2) I don't think we need to allow overwriting the defaults. We are not planning to expand the list beyond the default distribution (builtins + piggybank). The plan is to hardwire these values in the code since they are not likely to change. (3) Our plan is to keep it simple and to just allow users to add packages based on what they use in their UDFs. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721423#action_12721423 ] Olga Natkovich commented on PIG-832: Also, I think you are suggesting UDF aliasing on the command line, which I am not sure is the right place for it. The scope of this work is just to make it easier for users to refer to their UDFs. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[VOTE] Release Pig 0.3.0 (candidate 0)
Hi, I created a candidate build for the Pig 0.3.0 release. The main feature of this release is support for multiquery, which allows computation to be shared across multiple queries within the same script. We see significant performance improvements (up to an order of magnitude) as a result of this optimization. I ran the rat report and made sure that all the source files contain proper headers. (Not attaching the report since it caused trouble with the last release.) Keys used to sign the release candidate are at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS. Please download and try the release candidate: http://people.apache.org/~olga/pig-0.3.0-candidate-0/. Please vote by Wednesday, June 24th. Olga
[jira] Created: (PIG-857) Pig should implement Tool interface from Hadoop
Pig should implement Tool interface from Hadoop --- Key: PIG-857 URL: https://issues.apache.org/jira/browse/PIG-857 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.3.0 Environment: All Reporter: Milind Bhandarkar Hadoop, Hadoop Streaming, and Hadoop Pipes all use Tool interface, which provides support for parsing generic options. This has resulted in consistent options for all three hadoop launch mechanisms. Pig should also implement Tool (or use GenericOptionsParser directly.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721456#action_12721456 ] Daniel Dai commented on PIG-832: Hi, Milind, in the use case you mentioned, the user can write his/her own PigStorage and put the jar in the import list. Pig will take the user-supplied UDF first, thus overriding the built-in PigStorage. How is this? Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-734) Non-string keys in maps
[ https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-734: --- Status: Open (was: Patch Available) Non-string keys in maps --- Key: PIG-734 URL: https://issues.apache.org/jira/browse/PIG-734 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.3.0 Attachments: PIG-734.patch With the addition of types to pig, maps were changed to allow any atomic type to be a key. However, in practice we do not see people using keys other than strings. And allowing multiple types is causing us issues in serializing data (we have to check what every key type is) and in the design for non-java UDFs (since many scripting languages include associative arrays such as Perl's hash). So I propose we scope back maps to only have string keys. This would be a non-compatible change. But I am not aware of anyone using non-string keys, so hopefully it would have little or no impact. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-734) Non-string keys in maps
[ https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-734: --- Attachment: PIG-734_2.patch New version of the patch, brought up to date with current trunk. Non-string keys in maps --- Key: PIG-734 URL: https://issues.apache.org/jira/browse/PIG-734 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.3.0 Attachments: PIG-734.patch, PIG-734_2.patch With the addition of types to pig, maps were changed to allow any atomic type to be a key. However, in practice we do not see people using keys other than strings. And allowing multiple types is causing us issues in serializing data (we have to check what every key type is) and in the design for non-java UDFs (since many scripting languages include associative arrays such as Perl's hash). So I propose we scope back maps to only have string keys. This would be a non-compatible change. But I am not aware of anyone using non-string keys, so hopefully it would have little or no impact. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721481#action_12721481 ] Olga Natkovich commented on PIG-832: Milind, we have parameter substitution for what you are mentioning as example. My proposal would be to keep this issue strictly for the packaging thing. This will already make a lot of people happy and users asked for just that. We can discuss and understand more user requirements regarding aliases in a separate thread. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721490#action_12721490 ] Milind Bhandarkar commented on PIG-832: --- Daniel: For that to work, the user's class will have to be called PigStorage. Also, inserting the user's jars before the pig jar for looking up methods can have major unintended consequences. pig.jar should always be first in the classpath. Olga: My use case cannot use parameter substitution, because the PigMix scripts do not specify PigStorage as, say, $storage. The solution I proposed is as simple to implement as Daniel's original proposal (+= is syntactic sugar; even = can be used with the same effect), it fixes a specific ask, and it also allows for extensibility. Am I missing something here ? Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721503#action_12721503 ] Daniel Dai commented on PIG-832: Hi, Milind, For your first comment, yes, the user's class has to be PigStorage. For your second comment, we do not put the user's jar before pig.jar. We put their UDF search path first. Let's say the user passes -Dudf.import.list=com.xxx.udf1:com.xxx.udf2. When we see an unknown UDF, we first search in the package com.xxx.udf1, then com.xxx.udf2, then org.apache.pig.builtin. We build this policy into our code. It does not put user.jar in front of pig.jar. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
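The lookup order described in the comment above (user-supplied packages first, org.apache.pig.builtin last) might look roughly like the sketch below. It is an illustration only and is not taken from PigContext: the class UdfResolver and its methods are hypothetical, and the colon-separated list format is assumed from the -Dudf.import.list example in the comment. The main method uses JDK packages in place of UDF packages so that it runs without Pig on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of package-prefix UDF resolution; not Pig's actual code.
final class UdfResolver {

    // Package named in the thread as the last place to look.
    static final String BUILTIN_PACKAGE = "org.apache.pig.builtin";

    // Builds the search path: user packages in the order given, then the builtin package.
    static List<String> searchPath(String importList) {
        List<String> path = new ArrayList<>();
        if (importList != null) {
            for (String pkg : importList.split(":")) {
                String trimmed = pkg.trim();
                if (!trimmed.isEmpty()) {
                    path.add(trimmed);
                }
            }
        }
        path.add(BUILTIN_PACKAGE);
        return path;
    }

    // Prepends each package to the short name and returns the first class that loads.
    static Optional<Class<?>> resolve(String shortName, List<String> packages) {
        ClassLoader loader = UdfResolver.class.getClassLoader();
        for (String pkg : packages) {
            try {
                // initialize=false: only check that the class exists, do not run static init.
                return Optional.of(Class.forName(pkg + "." + shortName, false, loader));
            } catch (ClassNotFoundException e) {
                // Not in this package; fall through to the next one on the path.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // java.util and java.io stand in for user UDF packages in this demo.
        List<String> path = searchPath("java.util:java.io");
        System.out.println(resolve("ArrayList", path));
        System.out.println(resolve("NoSuchUdf", path));
    }
}
```

Note that the class path itself is untouched here; only the order in which package prefixes are tried changes, which is the distinction the comment draws between a search path and putting user.jar ahead of pig.jar.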
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721506#action_12721506 ] Olga Natkovich commented on PIG-832: Milind, The issue is not the complexity of implementation but that I am not sure we want to support command-line aliasing, and I want to discuss and understand the use cases for it separately. And we can parameterize PigMix if we need to; that was just an example of an alternative solution for the issue you specified. I am looking for a list of requirements, not a solution. Another comment is that I don't think the solution you are proposing would work. The list is used by prepending the package name to the function name to see if the function exists. It does not do anything with the function name itself. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721514#action_12721514 ] Milind Bhandarkar commented on PIG-832: --- Olga, specifying a list of packages as a path list will have the same issues as {code} import com.xyz.package.*; {code} in java, where it is considered to be a bad practice. So, in the solution that I have proposed, I am assuming the class name is specified on the commandline and not the package name. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721519#action_12721519 ] Daniel Dai commented on PIG-832: Hi, Milind, If a user wrote 10 UDFs, I guess he/she is not supposed to put 10 entries on the command line, right? Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-852) pig -version or pig -help returns exit code of 1
[ https://issues.apache.org/jira/browse/PIG-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-852: --- Fix Version/s: 0.3.0 pig -version or pig -help returns exit code of 1 Key: PIG-852 URL: https://issues.apache.org/jira/browse/PIG-852 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.3.0 Environment: All Reporter: Milind Bhandarkar Assignee: Milind Bhandarkar Fix For: 0.3.0 Attachments: rc.patch {code} java -jar pig.jar -x local [-version|-help] {code} returns an exit code of 1 to the shell. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-819) run -param -param; is a valid grunt command
[ https://issues.apache.org/jira/browse/PIG-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-819: --- Fix Version/s: 0.3.0 run -param -param; is a valid grunt command --- Key: PIG-819 URL: https://issues.apache.org/jira/browse/PIG-819 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.3.0 Environment: all Reporter: Milind Bhandarkar Assignee: Milind Bhandarkar Fix For: 0.3.0 Attachments: invalidparam.patch By mistake, I typed {code} run -param -param; {code} in grunt. And was surprised to find it to be a valid grunt command. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-818) Explain doesn't handle PODemux properly
[ https://issues.apache.org/jira/browse/PIG-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-818: --- Fix Version/s: 0.3.0 Explain doesn't handle PODemux properly --- Key: PIG-818 URL: https://issues.apache.org/jira/browse/PIG-818 Project: Pig Issue Type: Bug Reporter: Gunther Hagleitner Assignee: Gunther Hagleitner Fix For: 0.3.0 Attachments: explain.patch The PODemux operator has nested plans but they are not expanded in the -dot version of explain. Also, both split and demux are displayed as clusters of nodes, but it really makes more sense to just show them as multi output operators. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-564) Parameter Substitution using -param option does not seem to work when parameters contain special characters such as +,=,-,?,'
[ https://issues.apache.org/jira/browse/PIG-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-564: --- Fix Version/s: 0.3.0 Parameter Substitution using -param option does not seem to work when parameters contain special characters such as +,=,-,?,' --- Key: PIG-564 URL: https://issues.apache.org/jira/browse/PIG-564 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Viraj Bhat Assignee: Olga Natkovich Fix For: 0.3.0 Attachments: PIG-564.patch Consider the following Pig script which uses parameter substitution {code} %default qual '/user/viraj' %default mydir 'mydir_myextraqual' VISIT_LOGS = load '$qual/$mydir' as (a,b,c); dump VISIT_LOGS; {code} If you run the script as: == java -cp pig.jar:${HADOOP_HOME}/conf/ -Dhod.server='' org.apache.pig.Main -param mydir=mydir-myextraqual mypigparamsub.pig == You get the following error: == 2008-12-15 19:49:43,964 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - java.io.IOException: /user/viraj/mydir does not exist at org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:109) at org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59) at org.apache.pig.impl.io.ValidatingInputFileSpec.init(ValidatingInputFileSpec.java:44) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:200) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:742) at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:370) at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247) at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279) at java.lang.Thread.run(Thread.java:619) java.io.IOException: Unable to open iterator for alias: VISIT_LOGS [Job terminated with anomalous status FAILED] at org.apache.pig.PigServer.openIterator(PigServer.java:389) at 
org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64) at org.apache.pig.Main.main(Main.java:306) Caused by: java.io.IOException: Job terminated with anomalous status FAILED ... 6 more == Also tried using: -param mydir='mydir\-myextraqual' This behavior occurs if the parameter value contains characters such as +,=, ?. A workaround for this behavior is using a param_file which contains param_name=param_value on each line, with the param_value enclosed in quotes. For example: mydir='mydir-myextraqual' and then running the pig script as: java -cp pig.jar:${HADOOP_HOME}/conf/ -Dhod.server='' org.apache.pig.Main -param_file myparamfile mypigparamsub.pig The following issues need to be fixed: 1) In the -param option, if the parameter value contains special characters, it is truncated 2) In param_file, if param_value contains special characters, it should be enclosed in quotes 3) If 2 is a known issue then it should be documented in http://wiki.apache.org/pig/ParameterSubstitution -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
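To make the param_file workaround above concrete, here is a minimal sketch of reading one param_name=param_value line where the value is enclosed in single quotes. It is an assumption-laden illustration, not Pig's parameter-substitution parser: the class ParamLine and its parse method are hypothetical, and the only behavior shown is splitting on the first '=' and stripping one pair of surrounding single quotes, so that characters such as -, +, = and ? inside the quotes survive intact.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper illustrating the quoted param_file format; not Pig's parser.
final class ParamLine {

    // Parses "name='value'" (or "name=value") into a name/value pair.
    static Map.Entry<String, String> parse(String line) {
        // Split on the FIRST '=' only, so '=' inside the value is preserved.
        int eq = line.indexOf('=');
        if (eq <= 0) {
            throw new IllegalArgumentException("Expected param_name=param_value, got: " + line);
        }
        String name = line.substring(0, eq).trim();
        String value = line.substring(eq + 1).trim();
        // Strip one pair of enclosing single quotes, if present.
        if (value.length() >= 2 && value.startsWith("'") && value.endsWith("'")) {
            value = value.substring(1, value.length() - 1);
        }
        return new SimpleEntry<>(name, value);
    }

    public static void main(String[] args) {
        // The example line from the issue description.
        Map.Entry<String, String> p = parse("mydir='mydir-myextraqual'");
        System.out.println(p.getKey() + " -> " + p.getValue());
    }
}
```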
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721523#action_12721523 ] Milind Bhandarkar commented on PIG-832: --- Daniel, "Hi, Milind, If a user wrote 10 UDFs, I guess he/she does not suppose to put 10 entries in the command line, right?" No, that's why I have a `cat myudflist` allowed on the command line. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization
[ https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-627: --- Fix Version/s: 0.3.0 PERFORMANCE: multi-query optimization - Key: PIG-627 URL: https://issues.apache.org/jira/browse/PIG-627 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Fix For: 0.3.0 Attachments: doc-fix.patch, error_handling_0415.patch, error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, merge-041409.patch, merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, multiquery-phase3_0423.patch, multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, non_reversible_store_load_dependencies_2.patch, noop_filter_absolute_path_flag.patch, noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch Currently, if your Pig script contains multiple stores and some shared computation, Pig will execute several independent queries. For instance: A = load 'data' as (a, b, c); B = filter A by a > 5; store B into 'output1'; C = group B by b; store C into 'output2'; This script will result in a map-only job that generates output1, followed by a map-reduce job that generates output2. As a result, the data is read, parsed, and filtered twice, which is unnecessary and costly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-850) Dump produce wrong result while store into is ok
[ https://issues.apache.org/jira/browse/PIG-850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-850: --- Fix Version/s: (was: 0.3.0) 0.4.0 Dump produce wrong result while store into is ok -- Key: PIG-850 URL: https://issues.apache.org/jira/browse/PIG-850 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.4.0 Attachments: PIG-850.patch The following script wrongly produces 20 output records; however, if we change dump to store into, the result is correct. Not sure if the problem occurs only in the limited sort case. A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = order A by gpa parallel 2; C = limit B 10; dump C; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-852) pig -version or pig -help returns exit code of 1
[ https://issues.apache.org/jira/browse/PIG-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-852: --- Fix Version/s: (was: 0.3.0) 0.4.0 pig -version or pig -help returns exit code of 1 Key: PIG-852 URL: https://issues.apache.org/jira/browse/PIG-852 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.3.0 Environment: All Reporter: Milind Bhandarkar Assignee: Milind Bhandarkar Fix For: 0.4.0 Attachments: rc.patch {code} java -jar pig.jar -x local [-version|-help] {code} returns an exit code of 1 to the shell. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-849) Local engine loses records in splits
[ https://issues.apache.org/jira/browse/PIG-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-849: --- Fix Version/s: (was: 0.3.0) 0.4.0 Local engine loses records in splits Key: PIG-849 URL: https://issues.apache.org/jira/browse/PIG-849 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Gunther Hagleitner Fix For: 0.4.0 Attachments: local_engine.patch, local_engine.patch When there is a split in the physical plan records can be dropped in certain circumstances. The local split operator puts all records in a databag and turns over iterators to the POSplitOutput operators. The problem is that the local split also adds STATUS_NULL records to the bag. That will cause the databag's iterator to prematurely return false on the hasNext call (so a STATUS_NULL becomes a STATUS_EOP in the split output operators). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Build failed in Hudson: Pig-Patch-minerva.apache.org #92
See http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/92/ -- [...truncated 93989 lines...] [exec] [junit] 09/06/18 22:16:02 INFO dfs.DataNode: PacketResponder 1 for block blk_-3610909769110207607_1010 terminating [exec] [junit] 09/06/18 22:16:03 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:42932 is added to blk_-3610909769110207607_1010 size 6 [exec] [junit] 09/06/18 22:16:03 INFO dfs.DataNode: Received block blk_-3610909769110207607_1010 of size 6 from /127.0.0.1 [exec] [junit] 09/06/18 22:16:03 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:45477 is added to blk_-3610909769110207607_1010 size 6 [exec] [junit] 09/06/18 22:16:03 INFO dfs.DataNode: PacketResponder 2 for block blk_-3610909769110207607_1010 terminating [exec] [junit] 09/06/18 22:16:03 INFO dfs.StateChange: BLOCK* NameSystem.allocateBlock: /user/hudson/input2.txt. blk_-3111693600154221798_1011 [exec] [junit] 09/06/18 22:16:03 INFO dfs.DataNode: Receiving block blk_-3111693600154221798_1011 src: /127.0.0.1:49669 dest: /127.0.0.1:42956 [exec] [junit] 09/06/18 22:16:03 INFO dfs.DataNode: Receiving block blk_-3111693600154221798_1011 src: /127.0.0.1:45986 dest: /127.0.0.1:54021 [exec] [junit] 09/06/18 22:16:03 INFO dfs.DataNode: Receiving block blk_-3111693600154221798_1011 src: /127.0.0.1:58580 dest: /127.0.0.1:45477 [exec] [junit] 09/06/18 22:16:03 INFO dfs.DataNode: Received block blk_-3111693600154221798_1011 of size 6 from /127.0.0.1 [exec] [junit] 09/06/18 22:16:03 INFO dfs.DataNode: PacketResponder 0 for block blk_-3111693600154221798_1011 terminating [exec] [junit] 09/06/18 22:16:03 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:45477 is added to blk_-3111693600154221798_1011 size 6 [exec] [junit] 09/06/18 22:16:03 INFO dfs.DataNode: Received block blk_-3111693600154221798_1011 of size 6 from /127.0.0.1 [exec] [junit] 09/06/18 22:16:03 INFO dfs.DataNode: 
PacketResponder 1 for block blk_-3111693600154221798_1011 terminating [exec] [junit] 09/06/18 22:16:03 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:54021 is added to blk_-3111693600154221798_1011 size 6 [exec] [junit] 09/06/18 22:16:03 INFO dfs.DataNode: Received block blk_-3111693600154221798_1011 of size 6 from /127.0.0.1 [exec] [junit] 09/06/18 22:16:03 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:42956 is added to blk_-3111693600154221798_1011 size 6 [exec] [junit] 09/06/18 22:16:03 INFO dfs.DataNode: PacketResponder 2 for block blk_-3111693600154221798_1011 terminating [exec] [junit] 09/06/18 22:16:03 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: hdfs://localhost:35520 [exec] [junit] 09/06/18 22:16:03 INFO executionengine.HExecutionEngine: Connecting to map-reduce job tracker at: localhost:49012 [exec] [junit] 09/06/18 22:16:03 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1 [exec] [junit] 09/06/18 22:16:03 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1 [exec] [junit] 09/06/18 22:16:04 INFO mapReduceLayer.JobControlCompiler: Setting up single store job [exec] [junit] 09/06/18 22:16:04 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. [exec] [junit] 09/06/18 22:16:04 INFO dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/hadoop-hudson/mapred/system/job_200906182215_0002/job.jar. 
blk_-3197336206391371647_1012 [exec] [junit] 09/06/18 22:16:04 INFO dfs.DataNode: Receiving block blk_-3197336206391371647_1012 src: /127.0.0.1:45988 dest: /127.0.0.1:54021 [exec] [junit] 09/06/18 22:16:04 INFO dfs.DataNode: Receiving block blk_-3197336206391371647_1012 src: /127.0.0.1:34269 dest: /127.0.0.1:42932 [exec] [junit] 09/06/18 22:16:04 INFO dfs.DataNode: Receiving block blk_-3197336206391371647_1012 src: /127.0.0.1:58583 dest: /127.0.0.1:45477 [exec] [junit] 09/06/18 22:16:04 INFO dfs.DataNode: Received block blk_-3197336206391371647_1012 of size 1415240 from /127.0.0.1 [exec] [junit] 09/06/18 22:16:04 INFO dfs.DataNode: PacketResponder 0 for block blk_-3197336206391371647_1012 terminating [exec] [junit] 09/06/18 22:16:04 INFO dfs.DataNode: Received block blk_-3197336206391371647_1012 of size 1415240 from /127.0.0.1 [exec] [junit] 09/06/18 22:16:04 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:45477 is added to blk_-3197336206391371647_1012 size 1415240 [exec] [junit] 09/06/18 22:16:04 INFO dfs.DataNode:
[jira] Commented: (PIG-734) Non-string keys in maps
[ https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721527#action_12721527 ] Hadoop QA commented on PIG-734: ---
-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12411133/PIG-734_2.patch against trunk revision 785450.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 63 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/92/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/92/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/92/console
This message is automatically generated.
Non-string keys in maps --- Key: PIG-734 URL: https://issues.apache.org/jira/browse/PIG-734 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.3.0 Attachments: PIG-734.patch, PIG-734_2.patch With the addition of types to pig, maps were changed to allow any atomic type to be a key. However, in practice we do not see people using keys other than strings. And allowing multiple types is causing us issues in serializing data (we have to check what every key type is) and in the design for non-java UDFs (since many scripting languages include associative arrays such as Perl's hash).
So I propose we scope back maps to only have string keys. This would be a non-compatible change. But I am not aware of anyone using non-string keys, so hopefully it would have little or no impact.
[jira] Updated: (PIG-781) Error reporting for failed MR jobs
[ https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-781: --- Fix Version/s: 0.3.0 Error reporting for failed MR jobs -- Key: PIG-781 URL: https://issues.apache.org/jira/browse/PIG-781 Project: Pig Issue Type: Improvement Reporter: Gunther Hagleitner Fix For: 0.3.0 Attachments: partial_failure.patch, partial_failure.patch, partial_failure.patch, partial_failure.patch If we have multiple MR jobs to run and some of them fail, the behavior of the system is to not stop on the first failure but to keep going. That way jobs that do not depend on the failed job might still succeed. The question is how best to report this scenario to a user. How do we tell which jobs failed and which didn't? One way could be to tie jobs to stores and report which store locations won't have data and which ones do.
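The store-based reporting idea in PIG-781 can be pictured with a minimal sketch. It assumes each MR job lists the jobs it depends on and the store locations it writes, and that the dependencies form a DAG. The `Job` structure, the job names, and `report_stores` are hypothetical illustrations, not part of Pig or of the attached patch.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical description of one MR job in a plan (not a Pig class)."""
    name: str
    depends_on: list = field(default_factory=list)  # names of upstream jobs
    stores: list = field(default_factory=list)      # output locations written


def report_stores(jobs, failed):
    """Split store locations into (have_data, no_data).

    A job counts as unsuccessful if it failed itself or if any job it
    depends on, directly or transitively, was unsuccessful.
    Assumes the dependency graph has no cycles.
    """
    by_name = {job.name: job for job in jobs}
    cache = {}

    def unsuccessful(name):
        if name not in cache:
            job = by_name[name]
            cache[name] = name in failed or any(
                unsuccessful(dep) for dep in job.depends_on
            )
        return cache[name]

    have_data, no_data = [], []
    for job in jobs:
        target = no_data if unsuccessful(job.name) else have_data
        target.extend(job.stores)
    return have_data, no_data
```

With three jobs where `j2` depends on `j1` and `j3` is independent, a failure of `j1` would leave only the store of `j3` populated, which is the kind of per-store summary the issue asks about.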
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721528#action_12721528 ] Daniel Dai commented on PIG-832: Yes, `cat myudflist` is a workaround. However, in my humble opinion, this syntax is not very intuitive to the ordinary user. Many users may get the impression that they have to list their UDFs one by one. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext.
[jira] Updated: (PIG-734) Non-string keys in maps
[ https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-734: --- Fix Version/s: (was: 0.3.0) 0.4.0 Status: Patch Available (was: Open) Non-string keys in maps --- Key: PIG-734 URL: https://issues.apache.org/jira/browse/PIG-734 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.4.0 Attachments: PIG-734.patch, PIG-734_2.patch, PIG-734_3.patch With the addition of types to pig, maps were changed to allow any atomic type to be a key. However, in practice we do not see people using keys other than strings. And allowing multiple types is causing us issues in serializing data (we have to check what every key type is) and in the design for non-java UDFs (since many scripting languages include associative arrays such as Perl's hash). So I propose we scope back maps to only have string keys. This would be a non-compatible change. But I am not aware of anyone using non-string keys, so hopefully it would have little or no impact.
[jira] Updated: (PIG-734) Non-string keys in maps
[ https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-734: --- Attachment: PIG-734_3.patch Attaching a version of the file that fixes some of the introduced compiler warnings. The findbugs warnings have to do with naming convention. All of the function names in QueryParser start with upper case, so I am only following the convention there. Non-string keys in maps --- Key: PIG-734 URL: https://issues.apache.org/jira/browse/PIG-734 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.4.0 Attachments: PIG-734.patch, PIG-734_2.patch, PIG-734_3.patch With the addition of types to pig, maps were changed to allow any atomic type to be a key. However, in practice we do not see people using keys other than strings. And allowing multiple types is causing us issues in serializing data (we have to check what every key type is) and in the design for non-java UDFs (since many scripting languages include associative arrays such as Perl's hash). So I propose we scope back maps to only have string keys. This would be a non-compatible change. But I am not aware of anyone using non-string keys, so hopefully it would have little or no impact.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721550#action_12721550 ] Milind Bhandarkar commented on PIG-832: --- Daniel, Pig streaming already uses backquotes for executing external programs, so users are familiar with this syntax. I believe an ordinary Pig user already knows about doing such things in Unix shells. But anyway, as Olga said, she is looking for requirements and not solutions, so here is a requirement: I have two jars, xyz.jar and abc.jar. I am using two UDFs in my scripts. I want to use function1 from xyz.jar and function2 from abc.jar. How do I use function2 from abc.jar with full confidence that xyz.jar does not contain a UDF named function2? How do you propose I do that without modifying a whole bunch of Pig scripts that I am testing for my functions? In the solution that I proposed, I can just change the function2 mapping by including -Dimport.list=function2:com.yahoo.milind.function2 on the command line. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext.
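The ambiguity described in this comment can be illustrated with a small sketch of name resolution against an ordered import list, plus an optional explicit name-to-class mapping in the spirit of the proposed `-Dimport.list=function2:com.yahoo.milind.function2`. This is an assumption-laden model, not Pig's actual resolution code: `resolve_udf`, its parameters, and the `com.example.xyz` package are hypothetical.

```python
def resolve_udf(name, import_list, available, overrides=None):
    """Return the fully qualified class name a UDF name resolves to.

    name        -- name as written in the script, short or fully qualified
    import_list -- ordered package prefixes, e.g. ["", "com.example.xyz."]
    available   -- set of class names that can actually be loaded
    overrides   -- optional explicit short-name -> class-name mapping
                   (hypothetical; models the mapping proposed in the thread)
    """
    overrides = overrides or {}
    if name in overrides:
        # An explicit mapping wins over any prefix search.
        return overrides[name]
    for prefix in import_list:
        candidate = prefix + name
        if candidate in available:
            # First matching prefix wins, so list order decides ties.
            return candidate
    raise LookupError("could not resolve UDF " + name)
```

In this model, if both `com.example.xyz.function2` and `com.yahoo.milind.function2` are loadable, the short name `function2` silently resolves to whichever prefix comes first. Writing the fully qualified name in the script, or supplying an explicit mapping, removes the dependence on list order.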
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721551#action_12721551 ] Olga Natkovich commented on PIG-832: You use a fully qualified name for the other one. I would like us to continue with our original plan. It might not solve all the issues, but it certainly helps and it is a very small change to the current implementation. We can discuss improvements in a separate JIRA. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Currently, it is hardwired in PigContext.
[jira] Commented: (PIG-753) Provide support for UDFs without parameters
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721559#action_12721559 ] Alan Gates commented on PIG-753: +1. I tested the patch, and the issue was just with the bzip tests. I'd like to have Santhosh's opinion on this, as he is the expert in the logical plan and type checker area where these changes are. Provide support for UDFs without parameters --- Key: PIG-753 URL: https://issues.apache.org/jira/browse/PIG-753 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Jeff Zhang Attachments: Pig_753_Patch.txt Pig does not support UDFs without parameters; it forces me to provide one. For example, the following statement generates an error: B = FOREACH A GENERATE bagGenerator(); I have to provide a parameter, as in the following: B = FOREACH A GENERATE bagGenerator($0);
[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas
[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721558#action_12721558 ] Olga Natkovich commented on PIG-856: The number of replicas can be set via the dfs.replication parameter in Hadoop's JobConf. PERFORMANCE: reduce number of replicas -- Key: PIG-856 URL: https://issues.apache.org/jira/browse/PIG-856 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Currently Pig uses the default number of replicas between MR jobs. Currently, the number is 3. Given the temp nature of the data, we should never need more than 2 and should explicitly set it to improve performance and to be nicer to the name node.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721562#action_12721562 ] Milind Bhandarkar commented on PIG-832: --- Olga, as long as the suggested improvements do not result in redundancy or make the original solutions obsolete, it's fine. But I believe that the core issue, namely how Pig resolves UDFs, is not addressed properly by the small change to the current implementation. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Currently, it is hardwired in PigContext.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721563#action_12721563 ] Olga Natkovich commented on PIG-832: I don't believe this prevents future improvements. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Currently, it is hardwired in PigContext.
[jira] Commented: (PIG-753) Provide support for UDFs without parameters
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721564#action_12721564 ] Santhosh Srinivasan commented on PIG-753: - +1 for the code changes. The license header and the unit tests that failed have to be checked. Provide support for UDFs without parameters --- Key: PIG-753 URL: https://issues.apache.org/jira/browse/PIG-753 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Jeff Zhang Attachments: Pig_753_Patch.txt Pig does not support UDFs without parameters; it forces me to provide one. For example, the following statement generates an error: B = FOREACH A GENERATE bagGenerator(); I have to provide a parameter, as in the following: B = FOREACH A GENERATE bagGenerator($0);
[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas
[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721570#action_12721570 ] Olga Natkovich commented on PIG-856: Hi Milind, yes, these are very good points. I was hoping that we could set the flag only for jobs that produce temporary results, but I did not clearly state this in the bug. I am also considering replication of 1, as I agree it should yield much better performance gains. My plan is to run a test on a large query (join + order by) with replication factors of 1, 2, and the default and see what the perf differences are. PERFORMANCE: reduce number of replicas -- Key: PIG-856 URL: https://issues.apache.org/jira/browse/PIG-856 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Currently Pig uses the default number of replicas between MR jobs. Currently, the number is 3. Given the temp nature of the data, we should never need more than 2 and should explicitly set it to improve performance and to be nicer to the name node.
[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas
[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721574#action_12721574 ] Milind Bhandarkar commented on PIG-856: --- +1 on seeing performance differences. But is there code in Pig to determine that the output of a previous map-reduce stage is not accessible because of datanode failures (as opposed to some other reason), and to repeat the map-reduce stage? A single datanode failure with replication 1 will cause temporary data to be unavailable, and such a failure is very likely for long-running queries. PERFORMANCE: reduce number of replicas -- Key: PIG-856 URL: https://issues.apache.org/jira/browse/PIG-856 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Currently Pig uses the default number of replicas between MR jobs. Currently, the number is 3. Given the temp nature of the data, we should never need more than 2 and should explicitly set it to improve performance and to be nicer to the name node.
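The reliability concern raised here can be made concrete with a back-of-the-envelope model. The sketch below assumes datanodes fail independently with the same probability during a query and that each replica of a block lives on a different node. Those are simplifying assumptions, and `loss_probability` and its parameters are hypothetical names rather than anything in Pig or Hadoop.

```python
def loss_probability(p_node_failure, replicas, blocks):
    """Probability that at least one temporary block becomes unavailable.

    p_node_failure -- chance that a given datanode fails during the query
    replicas       -- replication factor used for the temporary data
    blocks         -- number of temporary blocks the query depends on

    Model assumption: a block is lost only if every node holding one of
    its replicas fails, and node failures are independent.
    """
    p_block_lost = p_node_failure ** replicas
    return 1.0 - (1.0 - p_block_lost) ** blocks
```

Under this model the loss probability with one replica grows roughly linearly with the number of blocks, while each extra replica multiplies the per-block risk by the node failure probability. Real clusters have correlated failures, so the output should be read as an intuition aid only.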
[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas
[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721577#action_12721577 ] Olga Natkovich commented on PIG-856: If a job fails, the store connected to this job will fail as well. Pig has no retries beyond what Hadoop provides. That's why no replication seems a little risky, but I want to see what the perf difference is and whether it is worth the risk. PERFORMANCE: reduce number of replicas -- Key: PIG-856 URL: https://issues.apache.org/jira/browse/PIG-856 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Currently Pig uses the default number of replicas between MR jobs. Currently, the number is 3. Given the temp nature of the data, we should never need more than 2 and should explicitly set it to improve performance and to be nicer to the name node.
[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas
[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721581#action_12721581 ] Milind Bhandarkar commented on PIG-856: --- +1. I will file a separate Jira (if replication of 1 is decided upon) so that Pig retries a map-reduce stage if it fails for *external* reasons. PERFORMANCE: reduce number of replicas -- Key: PIG-856 URL: https://issues.apache.org/jira/browse/PIG-856 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Currently Pig uses the default number of replicas between MR jobs. Currently, the number is 3. Given the temp nature of the data, we should never need more than 2 and should explicitly set it to improve performance and to be nicer to the name node.
[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas
[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721590#action_12721590 ] Alan Gates commented on PIG-856: My $0.02, based on the assumption that we see a significant performance improvement using only 1 replica instead of 2 or 3: In the long term we might want Pig to retry jobs if they fail for this reason. But in the short term, I would think some users would be willing to trade reliability for performance and some would not, so we should let them choose. PERFORMANCE: reduce number of replicas -- Key: PIG-856 URL: https://issues.apache.org/jira/browse/PIG-856 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Currently Pig uses the default number of replicas between MR jobs. Currently, the number is 3. Given the temp nature of the data, we should never need more than 2 and should explicitly set it to improve performance and to be nicer to the name node.
[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas
[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721593#action_12721593 ] Olga Natkovich commented on PIG-856: Yes, I agree: we should let users choose. I was thinking perhaps even for their final output. PERFORMANCE: reduce number of replicas -- Key: PIG-856 URL: https://issues.apache.org/jira/browse/PIG-856 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Currently Pig uses the default number of replicas between MR jobs. Currently, the number is 3. Given the temp nature of the data, we should never need more than 2 and should explicitly set it to improve performance and to be nicer to the name node.
[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas
[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721592#action_12721592 ] Santhosh Srinivasan commented on PIG-856: - Would that be through a configuration parameter? What would be the default, 1 or 2? PERFORMANCE: reduce number of replicas -- Key: PIG-856 URL: https://issues.apache.org/jira/browse/PIG-856 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Currently Pig uses the default number of replicas between MR jobs. Currently, the number is 3. Given the temp nature of the data, we should never need more than 2 and should explicitly set it to improve performance and to be nicer to the name node.
[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas
[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721594#action_12721594 ] Milind Bhandarkar commented on PIG-856: --- +1 to both Alan and Olga. The default should still be Hadoop's default dfs.replication. PERFORMANCE: reduce number of replicas -- Key: PIG-856 URL: https://issues.apache.org/jira/browse/PIG-856 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Currently Pig uses the default number of replicas between MR jobs. Currently, the number is 3. Given the temp nature of the data, we should never need more than 2 and should explicitly set it to improve performance and to be nicer to the name node.
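The behavior being converged on in this thread (users may lower replication for temporary data, otherwise Hadoop's `dfs.replication` applies) could be sketched as follows. The configuration is modeled as a plain dictionary, and the key `pig.temp.replication` is a hypothetical name chosen for illustration; the thread does not name the knob. The fallback of 3 reflects the default stated in the issue description.

```python
def replication_for(conf, is_temp):
    """Pick the replication factor for one job's output.

    conf    -- dict of configuration properties (string or int values)
    is_temp -- True when the output is intermediate data between MR jobs

    "pig.temp.replication" is a hypothetical property name; only
    "dfs.replication" is an actual Hadoop setting.
    """
    default = int(conf.get("dfs.replication", 3))
    if is_temp and "pig.temp.replication" in conf:
        # The user explicitly opted in to a different factor for temp data.
        return int(conf["pig.temp.replication"])
    # Final outputs, and temp data without an override, keep the default.
    return default
```

Keeping the override opt-in means a user who sets nothing sees no change in reliability, which matches the suggestion that the default remain Hadoop's own.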
[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas
[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721596#action_12721596 ] Santhosh Srinivasan commented on PIG-856: - Essentially, are we adding more knobs to tune Pig? We should document these knobs and explain how they interact with each other. PERFORMANCE: reduce number of replicas -- Key: PIG-856 URL: https://issues.apache.org/jira/browse/PIG-856 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Currently Pig uses the default number of replicas between MR jobs. Currently, the number is 3. Given the temp nature of the data, we should never need more than 2 and should explicitly set it to improve performance and to be nicer to the name node.
Build failed in Hudson: Pig-Patch-minerva.apache.org #93
See http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/93/ -- [...truncated 94294 lines...] [exec] [junit] 09/06/19 01:06:15 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:49808 is added to blk_2088255647506135164_1011 size 6 [exec] [junit] 09/06/19 01:06:15 INFO dfs.DataNode: Received block blk_2088255647506135164_1011 of size 6 from /127.0.0.1 [exec] [junit] 09/06/19 01:06:15 INFO dfs.DataNode: PacketResponder 1 for block blk_2088255647506135164_1011 terminating [exec] [junit] 09/06/19 01:06:15 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:54748 is added to blk_2088255647506135164_1011 size 6 [exec] [junit] 09/06/19 01:06:15 INFO dfs.DataNode: Received block blk_2088255647506135164_1011 of size 6 from /127.0.0.1 [exec] [junit] 09/06/19 01:06:15 INFO dfs.DataNode: PacketResponder 2 for block blk_2088255647506135164_1011 terminating [exec] [junit] 09/06/19 01:06:15 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:40871 is added to blk_2088255647506135164_1011 size 6 [exec] [junit] 09/06/19 01:06:15 INFO executionengine.HExecutionEngine: Connecting to hadoop file system at: hdfs://localhost:55595 [exec] [junit] 09/06/19 01:06:15 INFO executionengine.HExecutionEngine: Connecting to map-reduce job tracker at: localhost:38969 [exec] [junit] 09/06/19 01:06:15 INFO dfs.DataNode: Deleting block blk_2919053229063530843_1005 file dfs/data/data2/current/blk_2919053229063530843 [exec] [junit] 09/06/19 01:06:15 INFO dfs.DataNode: Deleting block blk_6688640043981499581_1006 file dfs/data/data1/current/blk_6688640043981499581 [exec] [junit] 09/06/19 01:06:15 INFO dfs.DataNode: Deleting block blk_6773019531096958866_1004 file dfs/data/data1/current/blk_6773019531096958866 [exec] [junit] 09/06/19 01:06:15 INFO mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1 [exec] [junit] 09/06/19 01:06:15 INFO mapReduceLayer.MultiQueryOptimizer: 
MR plan size after optimization: 1 [exec] [junit] 09/06/19 01:06:16 INFO dfs.StateChange: BLOCK* ask 127.0.0.1:40871 to delete blk_6773019531096958866_1004 blk_6688640043981499581_1006 [exec] [junit] 09/06/19 01:06:16 INFO dfs.StateChange: BLOCK* ask 127.0.0.1:53215 to delete blk_2919053229063530843_1005 blk_6688640043981499581_1006 [exec] [junit] 09/06/19 01:06:16 INFO mapReduceLayer.JobControlCompiler: Setting up single store job [exec] [junit] 09/06/19 01:06:16 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. [exec] [junit] 09/06/19 01:06:16 INFO dfs.StateChange: BLOCK* NameSystem.allocateBlock: /tmp/hadoop-hudson/mapred/system/job_200906190105_0002/job.jar. blk_-557104969554073193_1012 [exec] [junit] 09/06/19 01:06:16 INFO dfs.DataNode: Receiving block blk_-557104969554073193_1012 src: /127.0.0.1:48141 dest: /127.0.0.1:40871 [exec] [junit] 09/06/19 01:06:16 INFO dfs.DataNode: Receiving block blk_-557104969554073193_1012 src: /127.0.0.1:60050 dest: /127.0.0.1:53215 [exec] [junit] 09/06/19 01:06:16 INFO dfs.DataNode: Receiving block blk_-557104969554073193_1012 src: /127.0.0.1:49230 dest: /127.0.0.1:54748 [exec] [junit] 09/06/19 01:06:17 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:54748 is added to blk_-557104969554073193_1012 size 1415279 [exec] [junit] 09/06/19 01:06:17 INFO dfs.DataNode: Received block blk_-557104969554073193_1012 of size 1415279 from /127.0.0.1 [exec] [junit] 09/06/19 01:06:17 INFO dfs.DataNode: PacketResponder 0 for block blk_-557104969554073193_1012 terminating [exec] [junit] 09/06/19 01:06:17 INFO dfs.DataNode: Received block blk_-557104969554073193_1012 of size 1415279 from /127.0.0.1 [exec] [junit] 09/06/19 01:06:17 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:53215 is added to blk_-557104969554073193_1012 size 1415279 [exec] [junit] 09/06/19 01:06:17 INFO dfs.DataNode: PacketResponder 
1 for block blk_-557104969554073193_1012 terminating [exec] [junit] 09/06/19 01:06:17 INFO dfs.DataNode: Received block blk_-557104969554073193_1012 of size 1415279 from /127.0.0.1 [exec] [junit] 09/06/19 01:06:17 INFO dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:40871 is added to blk_-557104969554073193_1012 size 1415279 [exec] [junit] 09/06/19 01:06:17 INFO dfs.DataNode: PacketResponder 2 for block blk_-557104969554073193_1012 terminating [exec] [junit] 09/06/19 01:06:17 INFO fs.FSNamesystem: Increasing replication for file
[jira] Commented: (PIG-734) Non-string keys in maps
[ https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721598#action_12721598 ] Hadoop QA commented on PIG-734: ---
-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12411160/PIG-734_3.patch against trunk revision 785450.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 63 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/93/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/93/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/93/console
This message is automatically generated.
Non-string keys in maps --- Key: PIG-734 URL: https://issues.apache.org/jira/browse/PIG-734 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.4.0 Attachments: PIG-734.patch, PIG-734_2.patch, PIG-734_3.patch With the addition of types to pig, maps were changed to allow any atomic type to be a key. However, in practice we do not see people using keys other than strings. And allowing multiple types is causing us issues in serializing data (we have to check what every key type is) and in the design for non-java UDFs (since many scripting languages include associative arrays such as Perl's hash).
So I propose we scope back maps to only have string keys. This would be a non-compatible change. But I am not aware of anyone using non-string keys, so hopefully it would have little or no impact.
[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas
[ https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721599#action_12721599 ] Milind Bhandarkar commented on PIG-856: --- +1 to Santhosh on documenting knobs. It is better to add and document knobs than to modify the language like this:
{code}
%TempReplicate 2
store A into PigStorage('\t') with replication 2;
{code}
PERFORMANCE: reduce number of replicas -- Key: PIG-856 URL: https://issues.apache.org/jira/browse/PIG-856 Project: Pig Issue Type: Improvement Affects Versions: 0.3.0 Reporter: Olga Natkovich Currently Pig uses the default number of replicas between MR jobs. Currently, the number is 3. Given the temp nature of the data, we should never need more than 2 and should explicitly set it to improve performance and to be nicer to the name node.