[jira] Commented: (PIG-1064) Behaviour of COGROUP with and without schema when using * operator

2009-11-16 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778430#action_12778430
 ] 

Daniel Dai commented on PIG-1064:
-

With this patch, group by * without a schema does not work anymore. I think 
there could be some valid use cases for that; e.g., people may want to use this 
to do a count for each distinct value using a statement like group by *; 
foreach generate group, COUNT(*);. It is much safer to allow group by * to 
work, and only disallow cogroup by *.
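
For illustration, the count-distinct idiom referred to above might look like the sketch below (file name is hypothetical; note that in Pig Latin the COUNT argument is the grouped bag rather than a literal *):

{code}
A = load 'data.txt';                     -- no schema declared
B = group A by *;                        -- group on the entire tuple
C = foreach B generate group, COUNT(A);  -- one count per distinct tuple
{code}

Disallowing group by * on schema-less relations would break scripts of this shape.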

 Behaviour of COGROUP with and without schema when using * operator
 

 Key: PIG-1064
 URL: https://issues.apache.org/jira/browse/PIG-1064
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Pradeep Kamath
 Fix For: 0.6.0

 Attachments: PIG-1064-2.patch, PIG-1064-3.patch, PIG-1064-4.patch, 
 PIG-1064.patch


 I have 2 tab separated files, 1.txt and 2.txt
 $ cat 1.txt 
 
 1   2
 2   3
 
 $ cat 2.txt 
 1   2
 2   3
 I use COGROUP feature of Pig in the following way:
 $ java -cp pig.jar:$HADOOP_HOME org.apache.pig.Main
 {code}
 grunt> A = load '1.txt';
 grunt> B = load '2.txt' as (b0, b1);
 grunt> C = cogroup A by *, B by *;
 {code}
 2009-10-29 12:46:04,150 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1012: Each COGroup input has to have the same number of inner plans
 Details at logfile: pig_1256845224752.log
 ==
 If I reverse the order of the schemas:
 {code}
 grunt> A = load '1.txt' as (a0, a1);
 grunt> B = load '2.txt';
 grunt> C = cogroup A by *, B by *;
 {code}
 2009-10-29 12:49:27,869 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1013: Grouping attributes can either be star (*) or a list of expressions, 
 but not both.
 Details at logfile: pig_1256845224752.log
 ==
 Now, running without any schema:
 {code}
 grunt> A = load '1.txt';
 grunt> B = load '2.txt';
 grunt> C = cogroup A by *, B by *;
 grunt> dump C;
 {code}
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully 
 stored result in: file:/tmp/temp-319926700/tmp-1990275961
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records 
 written : 2
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written 
 : 154
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
 2009-10-29 12:55:37,202 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
 ((1,2),{(1,2)},{(1,2)})
 ((2,3),{(2,3)},{(2,3)})
 ==
 Is this a bug or a feature?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1085) Pass JobConf and UDF specific configuration information to UDFs

2009-11-16 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1085:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch checked in.

 Pass JobConf and UDF specific configuration information to UDFs
 ---

 Key: PIG-1085
 URL: https://issues.apache.org/jira/browse/PIG-1085
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates
 Attachments: udfconf-2.patch, udfconf.patch


 Users have long asked for a way to get the JobConf structure in their UDFs.  
 It would also be nice to have a way to pass properties between the front end 
 and back end so that UDFs can store state during parse time and use it at 
 runtime.
 This patch does part of what is proposed in PIG-602, but not all of it.  It 
 does not provide a way to give user specified configuration files to UDFs.  
 So I will mark 602 as depending on this bug, but it isn't a duplicate.




Build failed in Hudson: Pig-trunk #621

2009-11-16 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Pig-trunk/621/changes

Changes:

[gates] PIG-1085:  Pass JobConf and UDF specific configuration information to 
UDFs.

--
[...truncated 1266 lines...]
AU tutorial/scripts/script1-local.pig
AU tutorial/scripts/script2-hadoop.pig
AU tutorial/scripts/script2-local.pig
A tutorial/src
A tutorial/src/org
A tutorial/src/org/apache
A tutorial/src/org/apache/pig
A tutorial/src/org/apache/pig/tutorial
A tutorial/src/org/apache/pig/tutorial/TutorialUtil.java
A tutorial/src/org/apache/pig/tutorial/ScoreGenerator.java
A tutorial/src/org/apache/pig/tutorial/TutorialTest.java
A tutorial/src/org/apache/pig/tutorial/NonURLDetector.java
A tutorial/src/org/apache/pig/tutorial/ExtractHour.java
A tutorial/src/org/apache/pig/tutorial/NGramGenerator.java
A tutorial/src/org/apache/pig/tutorial/ToLower.java
A tutorial/data
AU tutorial/data/excite-small.log
AU tutorial/data/excite.log.bz2
AU tutorial/build.xml
A RELEASE_NOTES.txt
A ivy.xml
A lib
A lib/jdiff
A lib/jdiff/pig_0.3.1.xml
AU lib/hbase-0.20.0-test.jar
AU lib/hadoop20.jar
A lib/hadoop-LICENSE.txt
AU lib/hbase-0.20.0.jar
AU lib/zookeeper-hbase-1329.jar
AU lib/hadoop18.jar
A ivy
A ivy/ivysettings.xml
A ivy/libraries.properties
A ivy/pig.pom
A bin
AU bin/pig
A README.txt
A KEYS
 U.
Fetching 'http://svn.apache.org/repos/asf/hadoop/nightly/test-patch' at -1 into 
'http://hudson.zones.apache.org/hudson/job/Pig-trunk/ws/trunk/test/bin'
A test/bin/test-patch.sh
At revision 880993
At revision 880992
no change for http://svn.apache.org/repos/asf/hadoop/nightly/test-patch since 
the previous build
+ export JAVA_HOME=/homes/hudson/tools/java/latest1.6
+ JAVA_HOME=/homes/hudson/tools/java/latest1.6
+ export 
PATH=/homes/hudson/tools/java/latest1.6/bin:/home/hudson/tools/java/latest1.6/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
+ 
PATH=/homes/hudson/tools/java/latest1.6/bin:/home/hudson/tools/java/latest1.6/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
++ pwd
+ TRUNK=http://hudson.zones.apache.org/hudson/job/Pig-trunk/ws/trunk
+ export ANT_HOME=/homes/hudson/tools/ant/apache-ant-1.7.0
+ ANT_HOME=/homes/hudson/tools/ant/apache-ant-1.7.0
+ cd trunk
+ /homes/hudson/tools/ant/apache-ant-1.7.0/bin/ant 
-Dversion=2009-11-16_22-05-51 -Dtest.junit.output.format=xml -Dtest.output=yes 
-Dfindbugs.home=/homes/gkesavan/tools/findbugs/latest 
-Dforrest.home=/homes/gkesavan/tools/forrest/latest 
-Djava5.home=/homes/hudson/tools/java/jdk1.5.0_17-32 docs tar findbugs
Buildfile: build.xml

java5.check:

forrest.check:

ivy-download:
  [get] Getting: 
http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.0.0-rc2/ivy-2.0.0-rc2.jar
  [get] To: 
http://hudson.zones.apache.org/hudson/job/Pig-trunk/ws/trunk/ivy/ivy-2.0.0-rc2.jar

ivy-init-dirs:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Pig-trunk/ws/trunk/build/ivy
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Pig-trunk/ws/trunk/build/ivy/lib
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Pig-trunk/ws/trunk/build/ivy/report
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Pig-trunk/ws/trunk/build/ivy/maven

ivy-probe-antlib:

ivy-init-antlib:

ivy-init:
[ivy:configure] :: Ivy 2.0.0-rc2 - 20081028224207 :: http://ant.apache.org/ivy/ 
::
:: loading settings :: file = 
http://hudson.zones.apache.org/hudson/job/Pig-trunk/ws/trunk/ivy/ivysettings.xml

ivy-compile:
[ivy:resolve] :: resolving dependencies :: 
org.apache.pig#Pig;2009-11-16_22-05-51
[ivy:resolve]   confs: [compile]
[ivy:resolve]   found com.jcraft#jsch;0.1.38 in maven2
[ivy:resolve]   found jline#jline;0.9.94 in maven2
[ivy:resolve]   found net.java.dev.javacc#javacc;4.2 in maven2
[ivy:resolve] :: resolution report :: resolve 188ms :: artifacts dl 8ms
-
|  |modules||   artifacts   |
|   conf   | number| search|dwnlded|evicted|| number|dwnlded|
-
|  compile |   3   |   0   |   0   |   0   ||   3   |   0   |
-
[ivy:retrieve] :: retrieving :: org.apache.pig#Pig
[ivy:retrieve]  confs: [compile]
[ivy:retrieve]  3 artifacts copied, 0 already retrieved (546kB/11ms)
No ivy:settings found for the default reference 'ivy.instance'.  A default 
instance will be used
DEPRECATED: 'ivy.conf.file' is deprecated, use 'ivy.settings.file' instead
:: loading settings :: file = 

Re: optimizer hints in Pig

2009-11-16 Thread Alan Gates
In general I think optimizer hints fit well with Pig's approach to 
data processing, as expressed in our philosophic statement that Pigs 
are domestic animals (see http://hadoop.apache.org/pig/philosophy.html).


At least in the examples you give, I don't see 'with' as binding. The 
user is giving Pig information; it can choose how to use it, or not to 
use it at all. I would like 'using' to continue to be binding, as in that 
case the user is explicitly telling Pig to do something in a 
particular way.


Alan.

On Nov 14, 2009, at 2:07 PM, Ashutosh Chauhan wrote:


Hi All,

We would like to know what Pig devs feel about optimizer hints.
Traditionally, optimizer hints have been received with mixed reactions
in the RDBMS world. Oracle provides lots of knobs [1][2] to turn and tune,
while Postgres [3][4] has tried to stay away from them. MySQL has a few
of them (e.g., straight_join). Surajit Chaudhuri [5] (Microsoft) makes
a case in favor of them.
More specifically, I am talking about hints like the following:

a = filter 'mydata' by myudf ($1) with selectivity 0.5;
// This lets the user tell Pig that myudf filters out nearly half of
// the tuples of 'mydata'.

c = join a by $0, b by $0 with selectivity a.$0 = b.$0, 0.1;
// This lets the user tell Pig that only 10% of the keys in a will
// match those in b.

The exact syntax isn't important; it could be adapted. But the question is
whether it seems a useful enough idea to add to Pig Latin.
Pig's case is slightly different from other SQL engines in that, while
other systems treat these as hints and are thus free to ignore them,
Pig treats hints as commands, in the sense that it will fail even if it
can figure out that the hint will result in failure of the query. Perhaps
Pig can interpret 'using' as a command and 'with' as a hint.

Thoughts?

Ashutosh

[1] http://www.dba-oracle.com/art_otn_cbo_p7.htm
[2] 
http://www.dba-oracle.com/oracle11g/oracle_11g_extended_optimizer_statistics.htm
[3] http://archives.postgresql.org/pgsql-hackers/2006-10/msg00663.php
[4] http://archives.postgresql.org/pgsql-hackers/2006-08/msg00506.php
[5] portal.acm.org/ft_gateway.cfm?id=1559955&type=pdf




[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

2009-11-16 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778654#action_12778654
 ] 

Pradeep Kamath commented on PIG-1062:
-

Review comments:
In SampleLoader.java

Isn't the idea of SampleLoader only to carry common code for RandomSampleLoader 
and PoissonSampleLoader and add a computeSamples() method? It looks like it now 
also has the getNext() implementation needed by RandomSampleLoader. Should we 
move that to RandomSampleLoader instead?


{code}
134 System.err.println("Sample " + samples[nextSampleIdx]);
{code}
Debug statement above should be removed.


Why is skipNext() needed? Can't loader.getNext() == null be used instead? If 
so, is recordReader
needed?

In RandomSampleLoader.java
==
The XXX FIXME comment (put in by me :)) should be removed.

I think we should move the actual getNext() implementation code from 
SampleLoader to here

In PoissonSampleLoader.java


{code}
 40 // this will be value of first column in the special row   
{code}
I think this is no longer the case - should be removed.


{code}
58 // memory per sample. divide this by avgTupleMemSize to get 
skipInterval 
 59 private long memPerSample=0;
 60 
{code}
Should the above be called memToSkipPerSample?


{code}
 104 if(skipInterval == -1){
{code}
It doesn't look like skipInterval is initialized to -1


Instead of keeping track of the max number of columns in the different rows and 
then appending the special marker string and number of rows at the end, would 
it be better to just have these as the first two fields of the last tuple 
emitted, and then introduce a split-union combination to ensure that the 
foreach pipeline gets the regular tuples (excluding the special tuple)?



 load-store-redesign branch: change SampleLoader and subclasses to work with 
 new LoadFunc interface 
 ---

 Key: PIG-1062
 URL: https://issues.apache.org/jira/browse/PIG-1062
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Attachments: PIG-1062.patch, PIG-1062.patch.3


 This is part of the effort to implement new load store interfaces as laid out 
 in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
 PigStorage and BinStorage are now working.
 SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to 
 be changed to work with new LoadFunc interface.  
 Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
 PoissonSampleLoader is used by skew join. 
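
 For context, SampleLoader and RandomSampleLoader sit behind ordinary order-by 
 scripts such as the sketch below (file and field names are hypothetical), so 
 until they are ported, order-by queries on this branch cannot run:

 {code}
 A = load 'data' as (name, score);
 B = order A by score;   -- compiles to a sampling job followed by the sort job
 store B into 'sorted';
 {code}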




[jira] Commented: (PIG-1062) load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface

2009-11-16 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778666#action_12778666
 ] 

Arun C Murthy commented on PIG-1062:


bq. It looks like ReduceContext has a getCounter() method. Am I missing a 
subtlety?

The counters you get from a {Map|Reduce}Context are specific only to that 
task. One would have to jump through a whole set of hoops (i.e., create a 
new JobClient or equivalent in the new context object APIs), query the 
JobTracker for rolled-up counters, and even then they aren't guaranteed to be 
completely accurate until job completion, so I wouldn't recommend that we 
rely upon them.

 load-store-redesign branch: change SampleLoader and subclasses to work with 
 new LoadFunc interface 
 ---

 Key: PIG-1062
 URL: https://issues.apache.org/jira/browse/PIG-1062
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Attachments: PIG-1062.patch, PIG-1062.patch.3


 This is part of the effort to implement new load store interfaces as laid out 
 in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
 PigStorage and BinStorage are now working.
 SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to 
 be changed to work with new LoadFunc interface.  
 Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
 PoissonSampleLoader is used by skew join. 




[jira] Updated: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-16 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-872:


Attachment: PIG_872.patch

I have verified that the job.xml has mapred.cache.files set to the replicated 
files.

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
 Attachments: PIG_872.patch


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using the distributed cache will address the issue and might also give a 
 performance boost, since the file will be copied locally once and then reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend; on the backend, the tasks would need to refer to 
 the data using its path in the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)
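
 For reference, scripts request a fragmented-replicate join as in the sketch 
 below (relation and file names are hypothetical); the proposal changes only 
 how the small file is shipped, not this syntax:

 {code}
 big   = load 'big.txt'   as (k, v);
 small = load 'small.txt' as (k, w);
 J = join big by k, small by k using 'replicated';  -- small relation is replicated to every map
 {code}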




[jira] Created: (PIG-1095) [zebra] Schema support of anonymous fields in COLLECTION fails

2009-11-16 Thread Yan Zhou (JIRA)
[zebra] Schema support of anonymous fields in COLLECTION fails
-

 Key: PIG-1095
 URL: https://issues.apache.org/jira/browse/PIG-1095
 Project: Pig
  Issue Type: Bug
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Minor


The schema parser fails on schemas of COLLECTION columns like c:collection(int).




[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778732#action_12778732
 ] 

Hadoop QA commented on PIG-872:
---

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12425174/PIG_872.patch
  against trunk revision 881008.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/console

This message is automatically generated.

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: PIG_872.patch


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using the distributed cache will address the issue and might also give a 
 performance boost, since the file will be copied locally once and then reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend; on the backend, the tasks would need to refer to 
 the data using its path in the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)
