[jira] Commented: (PIG-1064) Behaviour of COGROUP with and without schema when using * operator
[ https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778430#action_12778430 ]

Daniel Dai commented on PIG-1064:
---------------------------------

With this patch, group by * without a schema no longer works. I think there could be some valid use cases for that; e.g., people may want to use it to count each distinct value with the statements group by *; foreach generate group, COUNT(*);. It is much safer to allow group by * to work, and only disallow cogroup by *.

Behaviour of COGROUP with and without schema when using * operator
-------------------------------------------------------------------
Key: PIG-1064
URL: https://issues.apache.org/jira/browse/PIG-1064
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Pradeep Kamath
Fix For: 0.6.0
Attachments: PIG-1064-2.patch, PIG-1064-3.patch, PIG-1064-4.patch, PIG-1064.patch

I have 2 tab-separated files, 1.txt and 2.txt:

$ cat 1.txt
1	2
2	3
$ cat 2.txt
1	2
2	3

I use the COGROUP feature of Pig in the following way:

$ java -cp pig.jar:$HADOOP_HOME org.apache.pig.Main

{code}
grunt> A = load '1.txt';
grunt> B = load '2.txt' as (b0, b1);
grunt> C = cogroup A by *, B by *;
{code}

2009-10-29 12:46:04,150 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1012: Each COGroup input has to have the same number of inner plans
Details at logfile: pig_1256845224752.log

==

If I reverse the order of the schemas:

{code}
grunt> A = load '1.txt' as (a0, a1);
grunt> B = load '2.txt';
grunt> C = cogroup A by *, B by *;
{code}

2009-10-29 12:49:27,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1013: Grouping attributes can either be star (*) or a list of expressions, but not both.
Details at logfile: pig_1256845224752.log

==

Now running without schemas:
{code}
grunt> A = load '1.txt';
grunt> B = load '2.txt';
grunt> C = cogroup A by *, B by *;
grunt> dump C;
{code}

2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: file:/tmp/temp-319926700/tmp-1990275961
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 2
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 154
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-10-29 12:55:37,202 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!

((1,2),{(1,2)},{(1,2)})
((2,3),{(2,3)},{(2,3)})

==

Is this a bug or a feature?

Viraj

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
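Daniel's distinct-count use case can be sketched as a complete script (file and alias names are illustrative):

{code}
-- count occurrences of each distinct row, with no schema declared
a = load '1.txt';
b = group a by *;
-- COUNT(a) counts the tuples in each group's bag
c = foreach b generate group, COUNT(a);
dump c;
{code}

This is the pattern Daniel argues should keep working even if cogroup by * without schemas is disallowed.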
[jira] Updated: (PIG-1085) Pass JobConf and UDF specific configuration information to UDFs
[ https://issues.apache.org/jira/browse/PIG-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-1085:
----------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)

Patch checked in.

Pass JobConf and UDF specific configuration information to UDFs
----------------------------------------------------------------
Key: PIG-1085
URL: https://issues.apache.org/jira/browse/PIG-1085
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Alan Gates
Assignee: Alan Gates
Attachments: udfconf-2.patch, udfconf.patch

Users have long asked for a way to get the JobConf structure in their UDFs. It would also be nice to have a way to pass properties between the front end and back end so that UDFs can store state during parse time and use it at runtime. This patch does part of what is proposed in PIG-602, but not all of it. It does not provide a way to give user-specified configuration files to UDFs. So I will mark 602 as depending on this bug, but it isn't a duplicate.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Build failed in Hudson: Pig-trunk #621
See http://hudson.zones.apache.org/hudson/job/Pig-trunk/621/changes

Changes:

[gates] PIG-1085: Pass JobConf and UDF specific configuration information to UDFs.

--
[...truncated 1266 lines...]
Re: optimizer hints in Pig
In general I think optimizer hints fit well with Pig's approach to data processing, as expressed in our philosophy statement that Pigs are domestic animals (see http://hadoop.apache.org/pig/philosophy.html). At least in the examples you give, I don't see 'with' as binding. The user is giving Pig information; it can choose how to use it, or not use it at all. I would like 'using' to continue to be binding, since in that case the user is explicitly telling Pig to do something in a particular way.

Alan.

On Nov 14, 2009, at 2:07 PM, Ashutosh Chauhan wrote:

Hi All,

We would like to know how Pig devs feel about optimizer hints. Traditionally, optimizer hints have been received with mixed reactions in the RDBMS world. Oracle provides lots of knobs [1][2] to turn and tune, while Postgres [3][4] has tried to stay away from them. MySQL has a few (e.g., straight_join). Surajit Chaudhuri [5] (Microsoft) makes a case in favor of them. More specifically, I am talking of hints like the following:

a = filter 'mydata' by myudf($1) with selectivity 0.5; // lets the user tell Pig that myudf filters out nearly half of the tuples of 'mydata'

c = join a by $0, b by $0 with selectivity a.$0 = b.$0, 0.1; // lets the user tell Pig that only 10% of the keys in a will match those in b

The exact syntax isn't important and could be adapted. The question is whether this seems a useful enough idea to add to Pig Latin. Pig's case is slightly different from other SQL engines: while other systems treat these as hints and are thus free to ignore them, Pig treats hints as commands, in the sense that it will fail even if it can figure out that the hint will result in failure of the query. Perhaps Pig can interpret 'using' as a command and 'with' as a hint. Thoughts?
Ashutosh

[1] http://www.dba-oracle.com/art_otn_cbo_p7.htm
[2] http://www.dba-oracle.com/oracle11g/oracle_11g_extended_optimizer_statistics.htm
[3] http://archives.postgresql.org/pgsql-hackers/2006-10/msg00663.php
[4] http://archives.postgresql.org/pgsql-hackers/2006-08/msg00506.php
[5] http://portal.acm.org/ft_gateway.cfm?id=1559955&type=pdf
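Alan's distinction is already concrete in today's syntax: the 'using' clause commands a specific join strategy, which Pig will apply even if it leads to failure. A minimal sketch (aliases and file names are illustrative):

{code}
a = load 'left.txt'  as (k, v);
b = load 'right.txt' as (k, w);
-- 'using' is binding: Pig must execute a skewed join here,
-- rather than treating the clause as advice it may ignore
c = join a by k, b by k using 'skewed';
{code}

Under the proposal above, a 'with selectivity' clause on the same join would instead be advisory.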
[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778654#action_12778654 ]

Pradeep Kamath commented on PIG-1062:
-------------------------------------

Review comments:

In SampleLoader.java:

Isn't the idea of SampleLoader only to carry common code for RandomSampleLoader and PoissonSampleLoader and add a computeSamples() method? It looks like it now has the getNext() implementation needed by RandomSampleLoader. Should we move that to RandomSampleLoader instead?

{code}
134 System.err.println("Sample " + samples[nextSampleIdx]);
{code}

The debug statement above should be removed.

Why is skipNext() needed? Can't loader.getNext() == null be used instead? If so, is recordReader needed?

In RandomSampleLoader.java:

The XXX FIXME comment (put in by me :)) should be removed. I think we should move the actual getNext() implementation code from SampleLoader to here.

In PoissonSampleLoader.java:

{code}
40 // this will be value of first column in the special row
{code}

I think this is no longer the case - it should be removed.

{code}
58 // memory per sample. divide this by avgTupleMemSize to get skipInterval
59 private long memPerSample=0;
{code}

Should the above be called memToSkipPerSample?

{code}
104 if(skipInterval == -1){
{code}

It doesn't look like skipInterval is initialized to -1.

Instead of keeping track of the max number of columns in the different rows and then appending the special marker string and number of rows at the end, would it be better to just have these as the first two fields of the last tuple emitted, and then introduce a split-union combination to ensure that the foreach pipeline gets only the regular tuples (excluding the special tuple)?
load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
---------------------------------------------------------------------------------------------------
Key: PIG-1062
URL: https://issues.apache.org/jira/browse/PIG-1062
Project: Pig
Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Attachments: PIG-1062.patch, PIG-1062.patch.3

This is part of the effort to implement the new load-store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal . PigStorage and BinStorage are now working. SampleLoader and its subclasses - RandomSampleLoader and PoissonSampleLoader - need to be changed to work with the new LoadFunc interface. Fixing SampleLoader and RandomSampleLoader will get order-by queries working. PoissonSampleLoader is used by skew join.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778666#action_12778666 ]

Arun C Murthy commented on PIG-1062:
------------------------------------

bq. It looks like ReduceContext has a getCounter() method. Am I missing a subtlety?

The counters you get from a {Map|Reduce}Context are specific to that task alone. One would have to jump through a whole set of hoops (i.e., create a new JobClient, or the equivalent in the new context-object APIs), query the JobTracker for rolled-up counters, and even then they aren't guaranteed to be completely accurate until job completion. Thus I wouldn't recommend that we rely upon them.

load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
---------------------------------------------------------------------------------------------------
Key: PIG-1062
URL: https://issues.apache.org/jira/browse/PIG-1062
Project: Pig
Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Attachments: PIG-1062.patch, PIG-1062.patch.3

This is part of the effort to implement the new load-store interfaces as laid out in http://wiki.apache.org/pig/LoadStoreRedesignProposal . PigStorage and BinStorage are now working. SampleLoader and its subclasses - RandomSampleLoader and PoissonSampleLoader - need to be changed to work with the new LoadFunc interface. Fixing SampleLoader and RandomSampleLoader will get order-by queries working. PoissonSampleLoader is used by skew join.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
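For reference, the order-by path this sub-task unblocks is exercised by a query as simple as the sketch below (file and alias names are illustrative); the sort triggers a preliminary sampling job built on RandomSampleLoader:

{code}
A = load 'input.txt' as (name, score:int);
-- ordering a relation makes Pig run a sampling pass first
-- to pick balanced partition boundaries for the sort
B = order A by score desc;
store B into 'sorted_output';
{code}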
[jira] Updated: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-872:
------------------------------------
Attachment: PIG_872.patch

I have verified that the job.xml has mapred.cache.files set to the replicated files.

use distributed cache for the replicated data set in FR join
-------------------------------------------------------------
Key: PIG-872
URL: https://issues.apache.org/jira/browse/PIG-872
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Attachments: PIG_872.patch

Currently, the replicated file is read directly from DFS by all maps. If the number of concurrent maps is huge, we can overwhelm the NameNode with open calls. Using the distributed cache will address the issue and might also give a performance boost, since the file will be copied locally once and then reused by all tasks running on the same machine. The basic approach would be to use cacheArchive to place the file into the cache on the frontend; on the backend, the tasks would need to refer to the data using the path from the cache. Note that cacheArchive does not work in Hadoop local mode. (Not a problem for us right now as we don't use it.)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
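The "FR join" in question is Pig's fragment-replicate join, where the smaller input is loaded in full by every map task - the read pattern this issue wants to serve from the distributed cache. A minimal sketch (file and alias names are illustrative):

{code}
big   = load 'big_data'   as (k, v);
small = load 'small_data' as (k, w);
-- every map reads 'small' in its entirety; with this patch the copy
-- would come from the distributed cache instead of direct DFS opens
j = join big by k, small by k using 'replicated';
{code}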
[jira] Created: (PIG-1095) [zebra] Schema support of anonymous fields in COLLECTION fails
[zebra] Schema support of anonymous fields in COLLECTION fails
---------------------------------------------------------------
Key: PIG-1095
URL: https://issues.apache.org/jira/browse/PIG-1095
Project: Pig
Issue Type: Bug
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor

The schema parser fails on schemas of COLLECTION columns like c:collection(int).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778732#action_12778732 ]

Hadoop QA commented on PIG-872:
-------------------------------

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12425174/PIG_872.patch
against trunk revision 881008.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/console

This message is automatically generated.

use distributed cache for the replicated data set in FR join
-------------------------------------------------------------
Key: PIG-872
URL: https://issues.apache.org/jira/browse/PIG-872
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
Attachments: PIG_872.patch

Currently, the replicated file is read directly from DFS by all maps. If the number of concurrent maps is huge, we can overwhelm the NameNode with open calls. Using the distributed cache will address the issue and might also give a performance boost, since the file will be copied locally once and then reused by all tasks running on the same machine.
The basic approach would be to use cacheArchive to place the file into the cache on the frontend; on the backend, the tasks would need to refer to the data using the path from the cache. Note that cacheArchive does not work in Hadoop local mode. (Not a problem for us right now as we don't use it.)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.