[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter substitution using -param_file or embedded in the Pig script) correctly

2009-05-19 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710628#action_12710628
 ] 

Daniel Dai commented on PIG-774:


You get the point, Viraj. 

Actually we can have two different configurations:
1. LANG=UTF-8: all data files, script files, and parameter files are UTF-8
2. LANG=GB2312: data files are UTF-8; script files and parameter files are GB2312
However, the RHEL default setting is LANG=POSIX, which does not handle 
Chinese characters well. 

So for simplicity, we can make everything UTF-8 (case 1). This is the default 
setting on Ubuntu. 
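
For illustration, a minimal sketch (not the actual utf8.patch) of why the 
default charset matters when Pig reads a script or parameter file: decoding 
with the platform default breaks under LANG=POSIX, while forcing UTF-8 is 
stable regardless of locale. The file name is taken from this issue; 
everything else here is assumption:

{code}
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadParamFile {
    public static void main(String[] args) throws IOException {
        // Platform-default charset: under LANG=POSIX, Chinese characters
        // in the file are mangled before Pig ever parses them.
        BufferedReader dflt = new BufferedReader(new InputStreamReader(
                new FileInputStream("nextgen_paramfile")));
        // Explicit UTF-8: decodes the same bytes correctly no matter
        // what LANG is set to.
        BufferedReader utf8 = new BufferedReader(new InputStreamReader(
                new FileInputStream("nextgen_paramfile"), "UTF-8"));
        System.out.println(dflt.readLine());
        System.out.println(utf8.readLine());
        dflt.close();
        utf8.close();
    }
}
{code}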

 Pig does not handle Chinese characters (in both the parameter substitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
Priority: Critical
 Fix For: 0.3.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, 
 pig_1240967860835.log, utf8.patch, utf8_parser-1.patch, utf8_parser-2.patch


 I created a very small test case in which I did the following.
 1) Created a UTF-8 file which contained a query string in Chinese and wrote 
 it to HDFS. I used this dfs file as an input for the tests.
 2) Created a parameter file which also contained the same query string as in 
 Step 1.
 3) Created a Pig script which takes in the parameterized query string and a 
 hard-coded Chinese character.
 
 Pig script: chinese_data.pig
 
 {code}
 rmf chineseoutput;
 I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
 J = filter I by $0 == '$querystring';
 --J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 store J into 'chineseoutput';
 dump J;
 {code}
 =
 Parameter file: nextgen_paramfile
 =
 queryid=20090311
 querystring='   歌手香港情牽女人心演唱會'
 =
 Input file: /user/viraj/chinese.txt
 =
 shell$ hadoop fs -cat /user/viraj/chinese.txt
 歌手香港情牽女人心演唱會
 =
 I ran the above set of inputs in the following ways:
 Run 1:
 =
 {code}
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
 {code}
 =
 2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:31:40,700 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 Run 2: removed the parameter substitution in the Pig script instead used the 
 following statement.
 =
 {code}
 J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 {code}
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main chinese_data_withoutparam.pig
 =
 2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:35:27,399 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 In both cases:
 =
 {code}
 shell $ hadoop 

[jira] Updated: (PIG-765) to implement jdiff

2009-05-19 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-765:
---

Hadoop Flags: [Reviewed]
  Status: Patch Available  (was: In Progress)

 to implement jdiff
 --

 Key: PIG-765
 URL: https://issues.apache.org/jira/browse/PIG-765
 Project: Pig
  Issue Type: Improvement
  Components: build
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Attachments: pig-765.patch, pig-765.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-765) to implement jdiff

2009-05-19 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-765:
---

Status: In Progress  (was: Patch Available)

 to implement jdiff
 --

 Key: PIG-765
 URL: https://issues.apache.org/jira/browse/PIG-765
 Project: Pig
  Issue Type: Improvement
  Components: build
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Attachments: pig-765.patch, pig-765.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-765) to implement jdiff

2009-05-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710724#action_12710724
 ] 

Hadoop QA commented on PIG-765:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12407876/pig-765.patch
  against trunk revision 776106.

-1 @author.  The patch appears to contain 5 @author tags which the Pig 
community has agreed to not allow in code contributions.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/48/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/48/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/48/console

This message is automatically generated.

 to implement jdiff
 --

 Key: PIG-765
 URL: https://issues.apache.org/jira/browse/PIG-765
 Project: Pig
  Issue Type: Improvement
  Components: build
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Attachments: pig-765.patch, pig-765.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Build failed in Hudson: Pig-Patch-minerva.apache.org #48

2009-05-19 Thread Apache Hudson Server
See 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/48/changes

Changes:

[sms] PIG-697: Proposed improvements to pig's optimizer

--
[...truncated 91447 lines...]
 [exec] [junit] 09/05/19 06:29:42 INFO dfs.DataNode: Received block 
blk_-6409077214663205636_1011 of size 6 from /127.0.0.1
 [exec] [junit] 09/05/19 06:29:42 INFO dfs.DataNode: Received block 
blk_-6409077214663205636_1011 of size 6 from /127.0.0.1
 [exec] [junit] 09/05/19 06:29:42 INFO dfs.DataNode: PacketResponder 1 
for block blk_-6409077214663205636_1011 terminating
 [exec] [junit] 09/05/19 06:29:42 INFO dfs.DataNode: PacketResponder 2 
for block blk_-6409077214663205636_1011 terminating
 [exec] [junit] 09/05/19 06:29:42 INFO dfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:44808 is added to 
blk_-6409077214663205636_1011 size 6
 [exec] [junit] 09/05/19 06:29:42 INFO dfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50164 is added to 
blk_-6409077214663205636_1011 size 6
 [exec] [junit] 09/05/19 06:29:42 INFO 
executionengine.HExecutionEngine: Connecting to hadoop file system at: 
hdfs://localhost:44060
 [exec] [junit] 09/05/19 06:29:42 INFO 
executionengine.HExecutionEngine: Connecting to map-reduce job tracker at: 
localhost:48467
 [exec] [junit] 09/05/19 06:29:42 INFO 
mapReduceLayer.MultiQueryOptimizer: MR plan size before optimization: 1
 [exec] [junit] 09/05/19 06:29:42 INFO 
mapReduceLayer.MultiQueryOptimizer: MR plan size after optimization: 1
 [exec] [junit] 09/05/19 06:29:43 INFO dfs.StateChange: BLOCK* ask 
127.0.0.1:56490 to delete  blk_-8848395961846024087_1006 
blk_2864616809689055598_1004
 [exec] [junit] 09/05/19 06:29:43 INFO dfs.StateChange: BLOCK* ask 
127.0.0.1:46139 to delete  blk_-8848395961846024087_1006 
blk_986403042756521851_1005
 [exec] [junit] 09/05/19 06:29:43 INFO 
mapReduceLayer.JobControlCompiler: Setting up single store job
 [exec] [junit] 09/05/19 06:29:43 WARN mapred.JobClient: Use 
GenericOptionsParser for parsing the arguments. Applications should implement 
Tool for the same.
 [exec] [junit] 09/05/19 06:29:43 INFO dfs.StateChange: BLOCK* 
NameSystem.allocateBlock: 
/tmp/hadoop-hudson/mapred/system/job_200905190628_0002/job.jar. 
blk_-1187686921115561227_1012
 [exec] [junit] 09/05/19 06:29:43 INFO dfs.DataNode: Receiving block 
blk_-1187686921115561227_1012 src: /127.0.0.1:34222 dest: /127.0.0.1:46139
 [exec] [junit] 09/05/19 06:29:43 INFO dfs.DataNode: Receiving block 
blk_-1187686921115561227_1012 src: /127.0.0.1:50480 dest: /127.0.0.1:56490
 [exec] [junit] 09/05/19 06:29:43 INFO dfs.DataNode: Receiving block 
blk_-1187686921115561227_1012 src: /127.0.0.1:33721 dest: /127.0.0.1:50164
 [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: Received block 
blk_-1187686921115561227_1012 of size 1393210 from /127.0.0.1
 [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: PacketResponder 0 
for block blk_-1187686921115561227_1012 terminating
 [exec] [junit] 09/05/19 06:29:44 INFO dfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50164 is added to 
blk_-1187686921115561227_1012 size 1393210
 [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: Deleting block 
blk_-8848395961846024087_1006 file 
dfs/data/data5/current/blk_-8848395961846024087
 [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: Received block 
blk_-1187686921115561227_1012 of size 1393210 from /127.0.0.1
 [exec] [junit] 09/05/19 06:29:44 WARN dfs.DataNode: Unexpected error 
trying to delete block blk_2864616809689055598_1004. BlockInfo not found in 
volumeMap.
 [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: PacketResponder 1 
for block blk_-1187686921115561227_1012 terminating
 [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: Received block 
blk_-1187686921115561227_1012 of size 1393210 from /127.0.0.1
 [exec] [junit] 09/05/19 06:29:44 WARN dfs.DataNode: 
java.io.IOException: Error in deleting blocks.
 [exec] [junit] at 
org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:1146)
 [exec] [junit] at 
org.apache.hadoop.dfs.DataNode.processCommand(DataNode.java:793)
 [exec] [junit] at 
org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:663)
 [exec] [junit] at 
org.apache.hadoop.dfs.DataNode.run(DataNode.java:2888)
 [exec] [junit] at java.lang.Thread.run(Thread.java:619)
 [exec] [junit] 
 [exec] [junit] 09/05/19 06:29:44 INFO dfs.DataNode: PacketResponder 2 
for block blk_-1187686921115561227_1012 terminating
 [exec] [junit] 09/05/19 06:29:44 INFO dfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:46139 is added to 

[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)

2009-05-19 Thread Yiping Han (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710818#action_12710818
 ] 

Yiping Han commented on PIG-807:


David, the syntax B = foreach A generate SUM(m); is confusing for both 
developers and the parser.

I like the idea of removing the explicit GROUP ALL, but would rather use a 
different keyword for it, i.e., B = FOR A GENERATE SUM(m);

Adding a new keyword for this purpose would also work as a hint for the 
parser to treat this as direct access to the hadoop iterator.

 PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the 
 Hadoop values iterator)
 

 Key: PIG-807
 URL: https://issues.apache.org/jira/browse/PIG-807
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
 Fix For: 0.3.0


 Currently all bags resulting from a group or cogroup are materialized as bags 
 containing all of the contents. The issue with this is that if a particular 
 key has many corresponding values, all these values get stuffed into a bag 
 which may run out of memory and hence spill, causing a slowdown in 
 performance and sometimes memory exceptions. In many cases, the udfs which 
 use these bags coming out of a group or cogroup only need to iterate over 
 the bag in a unidirectional, read-once manner. This can be implemented by 
 having the bag implement its iterator by simply iterating over the 
 underlying hadoop iterator provided in the reduce. This kind of bag is also 
 needed in http://issues.apache.org/jira/browse/PIG-802, so the code can be 
 reused for this issue too. The other part of this issue is to have some way 
 for the udfs to communicate to Pig that any input bags they need are 
 read-once bags. This can be achieved by having an interface - say 
 UsesReadOnceBags - which serves as a tag to indicate the intent to Pig. Pig 
 can then rewire its execution plan to use ReadOnceBags where feasible.
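
A minimal sketch of the idea, with hypothetical names (ReadOnceBag and the 
generic element type stand in for Pig's actual bag and tuple types); this is 
not the implementation proposed in any attached patch:

{code}
import java.util.Iterator;

// Hypothetical read-once bag: instead of materializing tuples, it hands
// out the reduce-side values iterator directly, so at most one traversal
// is possible and nothing needs to be buffered or spilled.
public class ReadOnceBag<T> implements Iterable<T> {
    private final Iterator<T> reduceValues; // Hadoop's values iterator
    private boolean consumed = false;

    public ReadOnceBag(Iterator<T> reduceValues) {
        this.reduceValues = reduceValues;
    }

    @Override
    public Iterator<T> iterator() {
        if (consumed) {
            throw new IllegalStateException(
                    "a read-once bag can be iterated only once");
        }
        consumed = true;
        return reduceValues;
    }
}
{code}

A UDF that only streams its input (a SUM, for example) could consume such a 
bag as the values arrive from the reducer, instead of forcing Pig to build 
and possibly spill a full bag first.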

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-656) Use of eval word in the package hierarchy of a UDF causes parse exception

2009-05-19 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710862#action_12710862
 ] 

Viraj Bhat commented on PIG-656:


Another pig parse issue: a udf was defined within a package which had the 
matches keyword in its path.

So something like:
define DISTANCE_SCORE mypackage.pig.udf.matches.LevensteinMatchUDF();

gives a parse error:

ERROR 1000: Error during parsing. Encountered "matches matches" at 
line 11, column 42.
Was expecting:
 <IDENTIFIER> ...

It is possible to have pig keywords within package names or even udf names - 
shouldn't pig be robust to simple grammar disambiguation of this sort?

 Use of eval word in the package hierarchy of a UDF causes parse exception
 -

 Key: PIG-656
 URL: https://issues.apache.org/jira/browse/PIG-656
 Project: Pig
  Issue Type: Bug
  Components: documentation, grunt
Affects Versions: 0.2.0
Reporter: Viraj Bhat
 Fix For: 0.2.0

 Attachments: mywordcount.txt, TOKENIZE.jar


 Consider a Pig script which does something similar to a word count. It uses 
 the built-in TOKENIZE function, but packages it inside a class hierarchy such 
 as mypackage.eval
 {code}
 register TOKENIZE.jar
 my_src  = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t')  AS 
 (mlist: chararray);
 modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
 describe modules;
 grouped = GROUP modules BY $0;
 describe grouped;
 counts  = FOREACH grouped GENERATE COUNT(modules), group;
 ordered = ORDER counts BY $0;
 dump ordered;
 {code}
 The parser complains:
 ===
 2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}
 ===
 I looked at the following source code 
 (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt) and it seems 
 that EVAL is a keyword in Pig. Here are some questions:
 1) Is there documentation on what the EVAL keyword actually is?
 2) Is the EVAL keyword actually implemented?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (PIG-656) Use of eval word in the package hierarchy of a UDF causes parse exception

2009-05-19 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat reopened PIG-656:



Documentation should be updated to explain the eval keyword and what it 
actually does; otherwise the user can get lost trying to track down the 
error.

 Use of eval word in the package hierarchy of a UDF causes parse exception
 -

 Key: PIG-656
 URL: https://issues.apache.org/jira/browse/PIG-656
 Project: Pig
  Issue Type: Bug
  Components: documentation, grunt
Affects Versions: 0.2.0
Reporter: Viraj Bhat
 Fix For: 0.2.0

 Attachments: mywordcount.txt, TOKENIZE.jar


 Consider a Pig script which does something similar to a word count. It uses 
 the built-in TOKENIZE function, but packages it inside a class hierarchy such 
 as mypackage.eval
 {code}
 register TOKENIZE.jar
 my_src  = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t')  AS 
 (mlist: chararray);
 modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
 describe modules;
 grouped = GROUP modules BY $0;
 describe grouped;
 counts  = FOREACH grouped GENERATE COUNT(modules), group;
 ordered = ORDER counts BY $0;
 dump ordered;
 {code}
 The parser complains:
 ===
 2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}
 ===
 I looked at the following source code 
 (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt) and it seems 
 that EVAL is a keyword in Pig. Here are some questions:
 1) Is there documentation on what the EVAL keyword actually is?
 2) Is the EVAL keyword actually implemented?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)

2009-05-19 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710901#action_12710901
 ] 

Mridul Muralidharan commented on PIG-807:
-

I think I am missing something here.


If I did not get it wrong, two (different?) usecases seem to be mentioned 
here:


1) Avoid materializing bags for a record when they can be streamed from the 
underlying data.
Bags currently created through (co)group output seem to fall inside this.
As in:
B = GROUP A by id;
C = FOREACH B generate SUM($1.field);

This does not require the $1.field bag to be created explicitly - through an 
iterator interface, just stream the values from the underlying reducer 
output.

2) The GROUP ALL based construct seems to be a way to directly stream an 
entire relation through udfs, as a shorthand for:
A_tmp = GROUP A all;
B = FOREACH A_tmp GENERATE algUdf($1);



If I am right in splitting this, then:

The first usecase has tremendous potential for improving performance - 
particularly to remove the annoying OOMs or spills which happen - but I am 
not sure how it interacts with pig's current pipeline design (if at all).


Since there are alternatives (though more cryptic) to do it, I don't have 
any particular opinion about 2.
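
To make usecase 1 concrete, here is a hedged sketch of a streaming SUM that 
consumes values one at a time from an iterator rather than from a 
materialized bag; the names and types are illustrative only:

{code}
import java.util.Arrays;
import java.util.Iterator;

public class StreamingSum {
    // Consumes the reduce-side values exactly once. Nothing is buffered,
    // so no bag has to be built or spilled no matter how many values
    // the grouping key has.
    public static long sum(Iterator<Long> values) {
        long total = 0;
        while (values.hasNext()) {
            total += values.next();
        }
        return total;
    }

    public static void main(String[] args) {
        Iterator<Long> values = Arrays.asList(1L, 2L, 3L).iterator();
        System.out.println(sum(values)); // 6
    }
}
{code}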

Regards,
Mridul

 PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the 
 Hadoop values iterator)
 

 Key: PIG-807
 URL: https://issues.apache.org/jira/browse/PIG-807
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
 Fix For: 0.3.0


 Currently all bags resulting from a group or cogroup are materialized as bags 
 containing all of the contents. The issue with this is that if a particular 
 key has many corresponding values, all these values get stuffed into a bag 
 which may run out of memory and hence spill, causing a slowdown in 
 performance and sometimes memory exceptions. In many cases, the udfs which 
 use these bags coming out of a group or cogroup only need to iterate over 
 the bag in a unidirectional, read-once manner. This can be implemented by 
 having the bag implement its iterator by simply iterating over the 
 underlying hadoop iterator provided in the reduce. This kind of bag is also 
 needed in http://issues.apache.org/jira/browse/PIG-802, so the code can be 
 reused for this issue too. The other part of this issue is to have some way 
 for the udfs to communicate to Pig that any input bags they need are 
 read-once bags. This can be achieved by having an interface - say 
 UsesReadOnceBags - which serves as a tag to indicate the intent to Pig. Pig 
 can then rewire its execution plan to use ReadOnceBags where feasible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: A proposal for changing pig's memory management

2009-05-19 Thread Alan Gates
The claims in the paper I was interested in were not issues like 
non-blocking I/O etc.  The claim that is of interest to pig is that a 
memory allocation and garbage collection scheme that is beyond the control 
of the programmer is a bad fit for a large data processing system.  This is 
a fundamental design choice in Java, and fits it well for the vast majority 
of its uses.  But for systems like Pig there seems to be no choice but to 
work around Java's memory management.  I'll clarify this point in the 
document.


I took a closer look at NIO.  My concern is that it does not give the 
level of control I want.  NIO allows you to force a buffer to disk and 
request a buffer to load, but you cannot force a page out of memory.  
It doesn't even guarantee that after you load a page it will really be 
loaded.  One of the biggest issues in pig right now is that we run out of 
memory or get the garbage collector in a situation where it can't make 
sufficient progress.  Perhaps switching to large buffers instead of 
having many individual objects will address this.  But I'm concerned 
that if we cannot explicitly force data out of memory onto disk then 
we'll be back in the same boat of trusting the Java memory manager.
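
For reference, a small sketch of the NIO calls in question - force() and 
load() on a memory-mapped buffer. The file name and size are made up; the 
point is that the API has no call to evict a page, which is exactly the 
control described above as missing:

{code}
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class NioControl {
    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("spill.dat", "rw");
        FileChannel channel = raf.getChannel();
        MappedByteBuffer buf =
                channel.map(FileChannel.MapMode.READ_WRITE, 0, 64 * 1024);
        buf.force(); // force the buffer's contents out to disk
        buf.load();  // *request* that the pages be resident - only a hint
        // There is no buf.unload() or evict(): you cannot force pages out
        // of memory, so eviction stays in the hands of the OS and the JVM.
        raf.close();
    }
}
{code}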


Alan.

On May 14, 2009, at 7:43 PM, Ted Dunning wrote:


That Telegraph dataflow paper is pretty long in the tooth.  Certainly 
several of their claims have little force any more (lack of non-blocking 
I/O, poor thread performance, no unmap, very expensive synchronization 
for uncontested locks).  It is worth noting that they did all of their 
tests on the 1.3 JVM and things have come an enormous way since then.

Certainly, it is worth having opaque containers based on byte arrays, but 
isn't that pretty much what the NIO byte buffers are there to provide? 
Wouldn't a virtual tuple type that was nothing more than a byte buffer, 
type and an offset do almost all of what is proposed here?

On Thu, May 14, 2009 at 5:33 PM, Alan Gates ga...@yahoo-inc.com  
wrote:



http://wiki.apache.org/pig/PigMemory

Alan.





Re: A proposal for changing pig's memory management

2009-05-19 Thread Ted Dunning
If you have a small number of long-lived large objects and a large number of
small ephemeral objects then the java collector should be in pig-heaven (as
it were).  The long-lived objects will take no time to collect and the
ephemeral objects won't be around to collect by the time the full GC
happens.

On Tue, May 19, 2009 at 3:44 PM, Alan Gates ga...@yahoo-inc.com wrote:

 Perhaps switching to large buffers instead of having many individual
 objects will address this.  But I'm concerned that if we cannot explicitly
 force data out of memory onto disk then we'll be back in the same boat of
 trusting the Java memory manager.


-- 
Ted Dunning, CTO
DeepDyve


[jira] Updated: (PIG-802) PERFORMANCE: not creating bags for ORDER BY

2009-05-19 Thread Rakesh Setty (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Setty updated PIG-802:
-

Attachment: (was: OrderByOptimization.patch)

 PERFORMANCE: not creating bags for ORDER BY
 ---

 Key: PIG-802
 URL: https://issues.apache.org/jira/browse/PIG-802
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: OrderByOptimization.patch


 Order by should be changed to not use POPackage to put all of the tuples in a 
 bag on the reduce side, as the bag is just immediately flattened. It can 
 instead work like join does for the last input in the join. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception

2009-05-19 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-656:
---

Summary: Use of eval or any other keyword in the package hierarchy of a UDF 
causes parse exception  (was: Use of eval word in the package hierarchy of a 
UDF causes parse exception)

 Use of eval or any other keyword in the package hierarchy of a UDF causes 
 parse exception
 -

 Key: PIG-656
 URL: https://issues.apache.org/jira/browse/PIG-656
 Project: Pig
  Issue Type: Bug
  Components: documentation, grunt
Affects Versions: 0.2.0
Reporter: Viraj Bhat
 Fix For: 0.2.0

 Attachments: mywordcount.txt, TOKENIZE.jar


 Consider a Pig script which does something similar to a word count. It uses 
 the built-in TOKENIZE function, but packages it inside a class hierarchy such 
 as mypackage.eval
 {code}
 register TOKENIZE.jar
 my_src  = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t')  AS 
 (mlist: chararray);
 modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
 describe modules;
 grouped = GROUP modules BY $0;
 describe grouped;
 counts  = FOREACH grouped GENERATE COUNT(modules), group;
 ordered = ORDER counts BY $0;
 dump ordered;
 {code}
 The parser complains:
 ===
 2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}
 ===
 I looked at the following source code 
 (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt) and it seems 
 that EVAL is a keyword in Pig. Here are some questions:
 1) Is there documentation on what the EVAL keyword actually is?
 2) Is the EVAL keyword actually implemented?
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-05-19 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Attachment: OptimizerPhase3_parrt1.patch

Part 1 of the Phase3 patch. It implements the requiredFields feature in all the 
relational operators. New unit tests have been added.

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1.patch


 I propose the following changes to pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that 
 they only match exact patterns instead of allowing missing elements in the 
 pattern. This has the downside that if a given rule applies to two patterns 
 (say Load->Filter->Group, Load->Group) you have to write two rules.  But it 
 has the upside that the resulting rules know exactly what they are getting. 
 The original intent of this was to reduce the number of rules that needed 
 to be written.  But the resulting rules have to do a lot of work to 
 understand the operators they are working with.  With exact matches only, 
 each rule will know exactly the operators it is working on and can apply 
 the logic of shifting the operators around.  All four of the existing rules 
 set all entries of required to true, so removing this will have no effect 
 on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     for (Rule rule : mRules) {
         if (matcher.match(rule)) {
             // It matches the pattern.  Now check if the transformer
             // approves as well.
             List<List<O>> matches = matcher.getAllMatches();
             for (List<O> match : matches) {
                 if (rule.transformer.check(match)) {
                     // The transformer approves.
                     rule.transformer.transform(match);
                 }
             }
         }
     }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     boolean sawMatch;
     int numIterations = 0;
     do {
         sawMatch = false;
         for (Rule rule : mRules) {
             List<List<O>> matches = matcher.getAllMatches();
             for (List<O> match : matches) {
                 // It matches the pattern.  Now check if the transformer
                 // approves as well.
                 if (rule.transformer.check(match)) {
                     // The transformer approves.
                     sawMatch = true;
                     rule.transformer.transform(match);
                 }
             }
         }
         // Not sure if 1000 is the right number of iterations, maybe it
         // should be configurable so that large scripts don't stop too
         // early.
     } while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite 
 loops.  The reason for iterating over the rules is so that each rule can be 
 applied multiple times as necessary.  This allows us to write simple rules, 
 mostly swaps between neighboring operators, without worrying about getting 
 the plan right in one pass.
 For example, we might have a plan that looks like 
 Load->Join->Filter->Foreach, and we want to optimize it to 
 Load->Foreach->Filter->Join.  With two simple rules (swap filter and join, 
 and swap foreach and filter), applied iteratively, we can get from the 
 initial to the final plan without needing to understand the big picture of 
 the entire plan.
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
 ...
 }
 /**
  * Push one operator in front of another.  This function is for use when
  * the first operator has multiple inputs.  The caller can specify
  * which input of the first operator the second operator should be pushed to.
  * @param first 

[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-05-19 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: In Progress  (was: Patch Available)

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1.patch


 I propose the following changes to pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that 
 they only match exact patterns instead of allowing missing elements in the 
 pattern. This has the downside that if a given rule applies to two patterns 
 (say Load->Filter->Group, Load->Group) you have to write two rules.  But it 
 has the upside that the resulting rules know exactly what they are getting. 
 The original intent of this was to reduce the number of rules that needed 
 to be written.  But the resulting rules have to do a lot of work to 
 understand the operators they are working with.  With exact matches only, 
 each rule will know exactly the operators it is working on and can apply 
 the logic of shifting the operators around.  All four of the existing rules 
 set all entries of required to true, so removing this will have no effect 
 on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     for (Rule rule : mRules) {
         if (matcher.match(rule)) {
             // It matches the pattern.  Now check if the transformer
             // approves as well.
             List<List<O>> matches = matcher.getAllMatches();
             for (List<O> match : matches) {
                 if (rule.transformer.check(match)) {
                     // The transformer approves.
                     rule.transformer.transform(match);
                 }
             }
         }
     }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     boolean sawMatch;
     int numIterations = 0;
     do {
         sawMatch = false;
         for (Rule rule : mRules) {
             List<List<O>> matches = matcher.getAllMatches();
             for (List<O> match : matches) {
                 // It matches the pattern.  Now check if the transformer
                 // approves as well.
                 if (rule.transformer.check(match)) {
                     // The transformer approves.
                     sawMatch = true;
                     rule.transformer.transform(match);
                 }
             }
         }
         // Not sure if 1000 is the right number of iterations, maybe it
         // should be configurable so that large scripts don't stop too
         // early.
     } while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite 
 loops.  The reason for iterating over the rules is so that each rule can be 
 applied multiple times as necessary.  This allows us to write simple rules, 
 mostly swaps between neighboring operators, without worrying about getting 
 the plan right in one pass.
 For example, we might have a plan that looks like 
 Load->Join->Filter->Foreach, and we want to optimize it to 
 Load->Foreach->Filter->Join.  With two simple rules (swap filter and join, 
 and swap foreach and filter), applied iteratively, we can get from the 
 initial to the final plan without needing to understand the big picture of 
 the entire plan.
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
 ...
 }
 /**
  * Push one operator in front of another.  This function is for use when
  * the first operator has multiple inputs.  The caller can specify
  * which input of the first operator the second operator should be pushed to.
  * @param first operator, assumed to have multiple inputs.
  * @param second operator, will be pushed in front of first
  * @param inputNum, 

[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-05-19 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: Patch Available  (was: In Progress)

 Proposed improvements to pig's optimizer
 

 Key: PIG-697
 URL: https://issues.apache.org/jira/browse/PIG-697
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
 Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
 OptimizerPhase2.patch, OptimizerPhase3_parrt1.patch


 I propose the following changes to pig optimizer, plan, and operator 
 functionality to support more robust optimization:
 1) Remove the required array from Rule.  This will change rules so that 
 they only match exact patterns instead of allowing missing elements in the 
 pattern. This has the downside that if a given rule applies to two patterns 
 (say Load->Filter->Group, Load->Group) you have to write two rules.  But it 
 has the upside that the resulting rules know exactly what they are getting. 
 The original intent of this was to reduce the number of rules that needed 
 to be written.  But the resulting rules have to do a lot of work to 
 understand the operators they are working with.  With exact matches only, 
 each rule will know exactly the operators it is working on and can apply 
 the logic of shifting the operators around.  All four of the existing rules 
 set all entries of required to true, so removing this will have no effect 
 on them.
 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
 conversions or a certain number of iterations has been reached.  Currently the
 function is:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     for (Rule rule : mRules) {
         if (matcher.match(rule)) {
             // It matches the pattern.  Now check if the transformer
             // approves as well.
             List<List<O>> matches = matcher.getAllMatches();
             for (List<O> match : matches) {
                 if (rule.transformer.check(match)) {
                     // The transformer approves.
                     rule.transformer.transform(match);
                 }
             }
         }
     }
 }
 {code}
 It would change to be:
 {code}
 public final void optimize() throws OptimizerException {
     RuleMatcher matcher = new RuleMatcher();
     boolean sawMatch;
     int numIterations = 0;
     do {
         sawMatch = false;
         for (Rule rule : mRules) {
             List<List<O>> matches = matcher.getAllMatches();
             for (List<O> match : matches) {
                 // It matches the pattern.  Now check if the transformer
                 // approves as well.
                 if (rule.transformer.check(match)) {
                     // The transformer approves.
                     sawMatch = true;
                     rule.transformer.transform(match);
                 }
             }
         }
         // Not sure if 1000 is the right number of iterations, maybe it
         // should be configurable so that large scripts don't stop too
         // early.
     } while (sawMatch && numIterations++ < 1000);
 }
 {code}
 The reason for limiting the number of iterations is to avoid infinite 
 loops.  The reason for iterating over the rules is so that each rule can be 
 applied multiple times as necessary.  This allows us to write simple rules, 
 mostly swaps between neighboring operators, without worrying about getting 
 the plan right in one pass.
 For example, we might have a plan that looks like 
 Load->Join->Filter->Foreach, and we want to optimize it to 
 Load->Foreach->Filter->Join.  With two simple rules (swap filter and join, 
 and swap foreach and filter), applied iteratively, we can get from the 
 initial to the final plan without needing to understand the big picture of 
 the entire plan.
 3) Add three calls to OperatorPlan:
 {code}
 /**
  * Swap two operators in a plan.  Both of the operators must have single
  * inputs and single outputs.
  * @param first operator
  * @param second operator
  * @throws PlanException if either operator is not single input and output.
  */
 public void swap(E first, E second) throws PlanException {
 ...
 }
 /**
  * Push one operator in front of another.  This function is for use when
  * the first operator has multiple inputs.  The caller can specify
  * which input of the first operator the second operator should be pushed to.
  * @param first operator, assumed to have multiple inputs.
  * @param second operator, will be pushed in front of first
  * @param inputNum, 

[jira] Created: (PIG-812) COUNT(*) does not work

2009-05-19 Thread Viraj Bhat (JIRA)
COUNT(*) does not work 
---

 Key: PIG-812
 URL: https://issues.apache.org/jira/browse/PIG-812
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.2.0


Pig script to count the number of rows in a studenttab10k file which contains 
10k records.
{code}
studenttab = LOAD 'studenttab10k' AS (name:chararray, age:int,gpa:float);

X2 = GROUP studenttab ALL;

describe X2;

Y2 = FOREACH X2 GENERATE COUNT(*);

explain Y2;

DUMP Y2;

{code}

returns the following error

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator 
for alias Y2
Details at logfile: /homes/viraj/pig-svn/trunk/pig_1242783700970.log


If you look at the log file:

Caused by: java.lang.ClassCastException
at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:76)
at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:68)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:201)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:223)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:88)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-812) COUNT(*) does not work

2009-05-19 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-812:
---

Attachment: studenttab10k

Input file

 COUNT(*) does not work 
 ---

 Key: PIG-812
 URL: https://issues.apache.org/jira/browse/PIG-812
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.2.0

 Attachments: studenttab10k


 Pig script to count the number of rows in a studenttab10k file which contains 
 10k records.
 {code}
 studenttab = LOAD 'studenttab10k' AS (name:chararray, age:int,gpa:float);
 X2 = GROUP studenttab ALL;
 describe X2;
 Y2 = FOREACH X2 GENERATE COUNT(*);
 explain Y2;
 DUMP Y2;
 {code}
 returns the following error
 
 ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator 
 for alias Y2
 Details at logfile: /homes/viraj/pig-svn/trunk/pig_1242783700970.log
 
 If you look at the log file:
 
 Caused by: java.lang.ClassCastException
 at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:76)
 at org.apache.pig.builtin.COUNT$Initial.exec(COUNT.java:68)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:201)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:235)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:254)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:223)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:88)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: A proposal for changing pig's memory management

2009-05-19 Thread Mridul Muralidharan


I am still not very convinced about the value of this implementation - 
particularly considering the advances made since 1.3 in memory allocators 
and garbage collection.


The side effects of this proposal are many, and sometimes non-obvious: 
implicitly moving young-generation data into the older generation, causing 
much more memory pressure for gc; fragmentation of memory blocks, causing 
quite a bit of memory pressure; replicating quite a bit of garbage 
collection functionality; the possibility of bugs with ref counting; etc.


If the assumption is that the current working set of bags/tuples does not 
need to be spilled, and anything else can be, then this will pretty much 
deteriorate to the current impl in the worst case.





A much simpler method to gain benefits would be to handle primitives as ... 
primitives, and not through the java wrapper classes for them.
It should be possible to write schema-aware tuples which make use of the 
specified primitives to take a fraction of the memory required (4 bytes + a 
null-check boolean for an int, plus an offset mapping, instead of the 24/32 
bytes it currently takes, etc).
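
A toy sketch of that idea - a schema-aware tuple with a single int column 
stored as a primitive plus a null flag, instead of a boxed Integer; the 
class name and layout are hypothetical:

{code}
// One int column: 4 bytes of value plus a small null flag, instead of a
// reference to a boxed Integer (roughly 24-32 bytes once the object
// header, padding and the pointer itself are counted).
public class IntFieldTuple {
    private int value;             // the primitive payload
    private boolean isNull = true; // null flag replaces the null reference

    public void set(int v) {
        value = v;
        isNull = false;
    }

    public boolean isNull() {
        return isNull;
    }

    public int get() {
        if (isNull) {
            throw new IllegalStateException("field is null");
        }
        return value;
    }
}
{code}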




Regards,
Mridul

Alan Gates wrote:

http://wiki.apache.org/pig/PigMemory

Alan.