[jira] Commented: (PIG-849) Local engine loses records in splits

2009-06-16 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720306#action_12720306
 ] 

Gunther Hagleitner commented on PIG-849:


Same errors as before. Ran manually and passed. The issue with the automated 
patch testing still seems to be there.

> Local engine loses records in splits
> 
>
> Key: PIG-849
> URL: https://issues.apache.org/jira/browse/PIG-849
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
> Attachments: local_engine.patch, local_engine.patch
>
>
> When there is a split in the physical plan records can be dropped in certain 
> circumstances.
> The local split operator puts all records in a databag and turns over 
> iterators to the POSplitOutput operators. The problem is that the local split 
> also adds STATUS_NULL records to the bag. That will cause the databag's 
> iterator to prematurely return false on the hasNext call (so a STATUS_NULL 
> becomes a STATUS_EOP in the split output operators).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-849) Local engine loses records in splits

2009-06-16 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719968#action_12719968
 ] 

Gunther Hagleitner commented on PIG-849:


The new patch has a unit test; otherwise it's the same.

> Local engine loses records in splits
> 
>
> Key: PIG-849
> URL: https://issues.apache.org/jira/browse/PIG-849
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
> Attachments: local_engine.patch, local_engine.patch
>
>
> When there is a split in the physical plan records can be dropped in certain 
> circumstances.
> The local split operator puts all records in a databag and turns over 
> iterators to the POSplitOutput operators. The problem is that the local split 
> also adds STATUS_NULL records to the bag. That will cause the databag's 
> iterator to prematurely return false on the hasNext call (so a STATUS_NULL 
> becomes a STATUS_EOP in the split output operators).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-849) Local engine loses records in splits

2009-06-16 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-849:
---

Attachment: local_engine.patch

> Local engine loses records in splits
> 
>
> Key: PIG-849
> URL: https://issues.apache.org/jira/browse/PIG-849
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
> Attachments: local_engine.patch, local_engine.patch
>
>
> When there is a split in the physical plan records can be dropped in certain 
> circumstances.
> The local split operator puts all records in a databag and turns over 
> iterators to the POSplitOutput operators. The problem is that the local split 
> also adds STATUS_NULL records to the bag. That will cause the databag's 
> iterator to prematurely return false on the hasNext call (so a STATUS_NULL 
> becomes a STATUS_EOP in the split output operators).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-849) Local engine loses records in splits

2009-06-16 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-849:
---

Status: Patch Available  (was: Open)

> Local engine loses records in splits
> 
>
> Key: PIG-849
> URL: https://issues.apache.org/jira/browse/PIG-849
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
> Attachments: local_engine.patch, local_engine.patch
>
>
> When there is a split in the physical plan records can be dropped in certain 
> circumstances.
> The local split operator puts all records in a databag and turns over 
> iterators to the POSplitOutput operators. The problem is that the local split 
> also adds STATUS_NULL records to the bag. That will cause the databag's 
> iterator to prematurely return false on the hasNext call (so a STATUS_NULL 
> becomes a STATUS_EOP in the split output operators).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-849) Local engine loses records in splits

2009-06-15 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-849:
---

Status: Patch Available  (was: Open)

> Local engine loses records in splits
> 
>
> Key: PIG-849
> URL: https://issues.apache.org/jira/browse/PIG-849
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
> Attachments: local_engine.patch
>
>
> When there is a split in the physical plan records can be dropped in certain 
> circumstances.
> The local split operator puts all records in a databag and turns over 
> iterators to the POSplitOutput operators. The problem is that the local split 
> also adds STATUS_NULL records to the bag. That will cause the databag's 
> iterator to prematurely return false on the hasNext call (so a STATUS_NULL 
> becomes a STATUS_EOP in the split output operators).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-849) Local engine loses records in splits

2009-06-15 Thread Gunther Hagleitner (JIRA)
Local engine loses records in splits


 Key: PIG-849
 URL: https://issues.apache.org/jira/browse/PIG-849
 Project: Pig
  Issue Type: Bug
Reporter: Gunther Hagleitner
 Attachments: local_engine.patch

When there is a split in the physical plan, records can be dropped in certain 
circumstances.

The local split operator puts all records in a databag and turns over iterators 
to the POSplitOutput operators. The problem is that the local split also adds 
STATUS_NULL records to the bag. That will cause the databag's iterator to 
prematurely return false on the hasNext call (so a STATUS_NULL becomes a 
STATUS_EOP in the split output operators).
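
A minimal, self-contained sketch of the fix idea described above (not the actual 
local_engine.patch): while the split fills the shared bag, results with STATUS_NULL 
are skipped so the bag's iterator only ever sees real tuples. The Result type and 
fillBag helper below are illustrative stand-ins, not Pig's classes.

{code}
import java.util.ArrayList;
import java.util.List;

class SplitBagSketch {
    static final byte STATUS_OK = 0;
    static final byte STATUS_NULL = 1;
    static final byte STATUS_EOP = 2;

    // Simplified stand-in for the (status, tuple) pairs the operators exchange.
    static class Result {
        final byte status;
        final Object tuple;
        Result(byte status, Object tuple) { this.status = status; this.tuple = tuple; }
    }

    // Drain the split's input into the shared bag, keeping only STATUS_OK records.
    // Storing STATUS_NULL results is what made hasNext() stop early in the bug above.
    static List<Object> fillBag(Iterable<Result> input) {
        List<Object> bag = new ArrayList<Object>();
        for (Result r : input) {
            if (r.status == STATUS_EOP) break;      // end of input
            if (r.status == STATUS_NULL) continue;  // drop null results instead of adding them
            if (r.status == STATUS_OK) bag.add(r.tuple);
        }
        return bag;
    }
}
{code}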


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-849) Local engine loses records in splits

2009-06-15 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-849:
---

Attachment: local_engine.patch

> Local engine loses records in splits
> 
>
> Key: PIG-849
> URL: https://issues.apache.org/jira/browse/PIG-849
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
> Attachments: local_engine.patch
>
>
> When there is a split in the physical plan records can be dropped in certain 
> circumstances.
> The local split operator puts all records in a databag and turns over 
> iterators to the POSplitOutput operators. The problem is that the local split 
> also adds STATUS_NULL records to the bag. That will cause the databag's 
> iterator to prematurely return false on the hasNext call (so a STATUS_NULL 
> becomes a STATUS_EOP in the split output operators).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-839) incorrect return codes on failure when using -f or -e flags

2009-06-07 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-839:
---

Status: Patch Available  (was: Open)

> incorrect return codes on failure when using -f or -e flags
> ---
>
> Key: PIG-839
> URL: https://issues.apache.org/jira/browse/PIG-839
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
>Assignee: Gunther Hagleitner
> Attachments: fix_return_code.patch
>
>
> To repro: pig -e "a = load '' ; b = stream a through \`false\` ; 
> store b into '';"
> Both the -e and -f flags do not return the right code upon exit. Running the 
> script w/o using -f works fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-839) incorrect return codes on failure when using -f or -e flags

2009-06-07 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-839:
---

Attachment: fix_return_code.patch

> incorrect return codes on failure when using -f or -e flags
> ---
>
> Key: PIG-839
> URL: https://issues.apache.org/jira/browse/PIG-839
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
>Assignee: Gunther Hagleitner
> Attachments: fix_return_code.patch
>
>
> To repro: pig -e "a = load '' ; b = stream a through \`false\` ; 
> store b into '';"
> Both the -e and -f flags do not return the right code upon exit. Running the 
> script w/o using -f works fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-839) incorrect return codes on failure when using -f or -e flags

2009-06-07 Thread Gunther Hagleitner (JIRA)
incorrect return codes on failure when using -f or -e flags
---

 Key: PIG-839
 URL: https://issues.apache.org/jira/browse/PIG-839
 Project: Pig
  Issue Type: Bug
Reporter: Gunther Hagleitner
Assignee: Gunther Hagleitner


To repro: pig -e "rmf keep99; a = load '' ; b = stream a through 
\`false\` ; store b into '';"

Both the -e and -f flags fail to return the right exit code. Running the 
script without using -f works fine.
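
A minimal sketch of the behaviour the fix aims for, not Pig's actual Main class: the 
status of the -e/-f run has to reach System.exit so the shell sees a non-zero code 
on failure. The runScript helper is a hypothetical placeholder.

{code}
public class ExitCodeSketch {
    public static void main(String[] args) {
        int rc = runScript(args);  // hypothetical runner for the -e/-f input
        System.exit(rc);           // propagate the failure instead of always exiting 0
    }

    // Placeholder for the real script execution: 0 on success, non-zero on failure.
    private static int runScript(String[] args) {
        return args.length > 0 ? 0 : 1;
    }
}
{code}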

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-839) incorrect return codes on failure when using -f or -e flags

2009-06-07 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-839:
---

Description: 
To repro: pig -e "a = load '' ; b = stream a through \`false\` ; 
store b into '';"

Both the -e and -f flags do not return the right code upon exit. Running the 
script w/o using -f works fine.

  was:
To repro: pig -e "rmf keep99; a = load '' ; b = stream a through 
\`false\` ; store b into '';"

Both the -e and -f flags do not return the right code upon exit. Running the 
script w/o using -f works fine.


> incorrect return codes on failure when using -f or -e flags
> ---
>
> Key: PIG-839
> URL: https://issues.apache.org/jira/browse/PIG-839
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
>Assignee: Gunther Hagleitner
>
> To repro: pig -e "a = load '' ; b = stream a through \`false\` ; 
> store b into '';"
> Both the -e and -f flags do not return the right code upon exit. Running the 
> script w/o using -f works fine.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-818) Explain doesn't handle PODemux properly

2009-05-26 Thread Gunther Hagleitner (JIRA)
Explain doesn't handle PODemux properly
---

 Key: PIG-818
 URL: https://issues.apache.org/jira/browse/PIG-818
 Project: Pig
  Issue Type: Bug
Reporter: Gunther Hagleitner
Assignee: Gunther Hagleitner
 Attachments: explain.patch

The PODemux operator has nested plans but they are not expanded in the -dot 
version of explain.

Also, both split and demux are displayed as clusters of nodes, but it really 
makes more sense to just show them as multi output operators.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-818) Explain doesn't handle PODemux properly

2009-05-26 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-818:
---

Status: Patch Available  (was: Open)

> Explain doesn't handle PODemux properly
> ---
>
> Key: PIG-818
> URL: https://issues.apache.org/jira/browse/PIG-818
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
>Assignee: Gunther Hagleitner
> Attachments: explain.patch
>
>
> The PODemux operator has nested plans but they are not expanded in the -dot 
> version of explain.
> Also, both split and demux are displayed as clusters of nodes, but it really 
> makes more sense to just show them as multi output operators.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-818) Explain doesn't handle PODemux properly

2009-05-26 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-818:
---

Attachment: explain.patch

> Explain doesn't handle PODemux properly
> ---
>
> Key: PIG-818
> URL: https://issues.apache.org/jira/browse/PIG-818
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
>Assignee: Gunther Hagleitner
> Attachments: explain.patch
>
>
> The PODemux operator has nested plans but they are not expanded in the -dot 
> version of explain.
> Also, both split and demux are displayed as clusters of nodes, but it really 
> makes more sense to just show them as multi output operators.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-811) Globs with "?" in the pattern are broken in local mode

2009-05-21 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-811:
---

Status: Patch Available  (was: Open)

> Globs with "?" in the pattern are broken in local mode
> --
>
> Key: PIG-811
> URL: https://issues.apache.org/jira/browse/PIG-811
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Gunther Hagleitner
> Fix For: 0.3.0
>
> Attachments: local_engine_glob.patch
>
>
> Script:
> a = load 'studenttab10?';
> dump a;
> Actual file name: studenttab10k
> Stack trace:
> ERROR 2081: Unable to setup the load function.
> org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to 
> setup the load function.
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNext(POLoad.java:128)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117)
> at 
> org.apache.pig.backend.local.executionengine.LocalPigLauncher.runPipeline(LocalPigLauncher.java:129)
> at 
> org.apache.pig.backend.local.executionengine.LocalPigLauncher.launchPig(LocalPigLauncher.java:102)
> at 
> org.apache.pig.backend.local.executionengine.LocalExecutionEngine.execute(LocalExecutionEngine.java:163)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:763)
> at org.apache.pig.PigServer.execute(PigServer.java:756)
> at org.apache.pig.PigServer.access$100(PigServer.java:88)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:923)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:242)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:110)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:151)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:123)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:372)
> Caused by: java.io.IOException: 
> file:/home/y/share/pigtest/local/data/singlefile/studenttab10 does not exist
> at 
> org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:188)
> at 
> org.apache.pig.impl.io.FileLocalizer.openLFSFile(FileLocalizer.java:244)
> at org.apache.pig.impl.io.FileLocalizer.open(FileLocalizer.java:299)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.setUp(POLoad.java:96)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNext(POLoad.java:124)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-811) Globs with "?" in the pattern are broken in local mode

2009-05-21 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-811:
---

Attachment: local_engine_glob.patch

This patch should fix the problem. Globs are working again in local engine mode.
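
For illustration only, a minimal sketch of how local-mode loading can expand a glob 
such as 'studenttab10?' before opening files, assuming Hadoop's FileSystem API 
(already a Pig dependency); this is not the content of local_engine_glob.patch.

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalGlobSketch {
    // Expand a (possibly glob) pattern against the local file system and
    // return the concrete paths that match, instead of opening the raw string.
    public static List<Path> expand(String pattern) throws IOException {
        FileSystem localFs = FileSystem.getLocal(new Configuration());
        List<Path> matches = new ArrayList<Path>();
        FileStatus[] statuses = localFs.globStatus(new Path(pattern));
        if (statuses != null) {
            for (FileStatus status : statuses) {
                matches.add(status.getPath());
            }
        }
        return matches;
    }
}
{code}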


> Globs with "?" in the pattern are broken in local mode
> --
>
> Key: PIG-811
> URL: https://issues.apache.org/jira/browse/PIG-811
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>Assignee: Gunther Hagleitner
> Fix For: 0.3.0
>
> Attachments: local_engine_glob.patch
>
>
> Script:
> a = load 'studenttab10?';
> dump a;
> Actual file name: studenttab10k
> Stack trace:
> ERROR 2081: Unable to setup the load function.
> org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to 
> setup the load function.
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNext(POLoad.java:128)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:117)
> at 
> org.apache.pig.backend.local.executionengine.LocalPigLauncher.runPipeline(LocalPigLauncher.java:129)
> at 
> org.apache.pig.backend.local.executionengine.LocalPigLauncher.launchPig(LocalPigLauncher.java:102)
> at 
> org.apache.pig.backend.local.executionengine.LocalExecutionEngine.execute(LocalExecutionEngine.java:163)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:763)
> at org.apache.pig.PigServer.execute(PigServer.java:756)
> at org.apache.pig.PigServer.access$100(PigServer.java:88)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:923)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:242)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:110)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:151)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:123)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:372)
> Caused by: java.io.IOException: 
> file:/home/y/share/pigtest/local/data/singlefile/studenttab10 does not exist
> at 
> org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:188)
> at 
> org.apache.pig.impl.io.FileLocalizer.openLFSFile(FileLocalizer.java:244)
> at org.apache.pig.impl.io.FileLocalizer.open(FileLocalizer.java:299)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.setUp(POLoad.java:96)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLoad.getNext(POLoad.java:124)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-777) Code refactoring: Create optimization out of store/load post processing code

2009-05-14 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709603#action_12709603
 ] 

Gunther Hagleitner commented on PIG-777:


There is no new code. I just fixed an indentation issue in addition to the new 
log message.

> Code refactoring: Create optimization out of store/load post processing code
> 
>
> Key: PIG-777
> URL: https://issues.apache.org/jira/browse/PIG-777
> Project: Pig
>  Issue Type: Improvement
>Reporter: Gunther Hagleitner
> Attachments: log_message.patch
>
>
> The postProcessing method in the pig server checks whether a logical graph 
> contains stores to and loads from the same location. If so, it will either 
> connect the store and load, or optimize by throwing out the load and 
> connecting the store predecessor with the successor of the load.
> Ideally the introduction of the store and load connection should happen in 
> the query compiler, while the optimization should then happen in a separate 
> optimizer step as part of the optimizer framework.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-810) Scripts failing with NPE

2009-05-14 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-810:
---

Attachment: null_pointer.patch

Ran into the same issue. I have a similar fix, but I also added a unit test, in 
case you're interested.
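
A minimal sketch of the kind of guard such a fix typically adds (the JobLike type 
below is an illustrative stand-in, not Hadoop's or Pig's job class): stats 
accumulation skips jobs that never produced counters instead of dereferencing null, 
which is what the stack trace in the description below shows.

{code}
import java.util.List;

class StatsSketch {
    interface JobLike {
        Object getCounters(); // may be null if the job failed before producing stats
    }

    // Count the jobs that actually contributed stats; failed jobs are skipped
    // rather than causing a NullPointerException during accumulation.
    static int jobsWithStats(List<JobLike> jobs) {
        int counted = 0;
        for (JobLike job : jobs) {
            if (job.getCounters() == null) {
                continue; // nothing to accumulate for this job
            }
            counted++; // placeholder for the real per-job accumulation
        }
        return counted;
    }
}
{code}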

> Scripts failing with NPE
> 
>
> Key: PIG-810
> URL: https://issues.apache.org/jira/browse/PIG-810
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Alan Gates
>Assignee: Alan Gates
> Fix For: 0.3.0
>
> Attachments: null_pointer.patch, PIG-810.patch
>
>
> Scripts such as:
> {code}
> a = load 'nosuchfile';
> b = store a into 'bla';
> {code}
> are failing with
> {code}
> ERROR 2043: Unexpected error during execution.
> org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected 
> error during execution.
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:275)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:757)
> at org.apache.pig.PigServer.execute(PigServer.java:750)
> at org.apache.pig.PigServer.access$100(PigServer.java:88)
> at org.apache.pig.PigServer$Graph.execute(PigServer.java:917)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:242)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:110)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:151)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:123)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
> at org.apache.pig.Main.main(Main.java:372)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.pig.tools.pigstats.PigStats.accumulateMRStats(PigStats.java:175)
> at 
> org.apache.pig.tools.pigstats.PigStats.accumulateStats(PigStats.java:94)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:148)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:262)
> ... 10 more
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-777) Code refactoring: Create optimization out of store/load post processing code

2009-05-14 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-777:
---

Status: Patch Available  (was: Open)

> Code refactoring: Create optimization out of store/load post processing code
> 
>
> Key: PIG-777
> URL: https://issues.apache.org/jira/browse/PIG-777
> Project: Pig
>  Issue Type: Improvement
>Reporter: Gunther Hagleitner
> Attachments: log_message.patch
>
>
> The postProcessing method in the pig server checks whether a logical graph 
> contains stores to and loads from the same location. If so, it will either 
> connect the store and load, or optimize by throwing out the load and 
> connecting the store predecessor with the successor of the load.
> Ideally the introduction of the store and load connection should happen in 
> the query compiler, while the optimization should then happen in a separate 
> optimizer step as part of the optimizer framework.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-781) Error reporting for failed MR jobs

2009-05-14 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-781:
---

Attachment: partial_failure.patch

Fixing the findbugs warning.

> Error reporting for failed MR jobs
> --
>
> Key: PIG-781
> URL: https://issues.apache.org/jira/browse/PIG-781
> Project: Pig
>  Issue Type: Improvement
>Reporter: Gunther Hagleitner
> Attachments: partial_failure.patch, partial_failure.patch, 
> partial_failure.patch, partial_failure.patch
>
>
> If we have multiple MR jobs to run and some of them fail the behavior of the 
> system is to not stop on the first failure but to keep going. That way jobs 
> that do not depend on the failed job might still succeed.
> The question is to how best report this scenario to a user. How do we tell 
> which jobs failed and which didn't?
> One way could be to tie jobs to stores and report which store locations won't 
> have data and which ones do.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-781) Error reporting for failed MR jobs

2009-05-14 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-781:
---

Attachment: partial_failure.patch

The latest patch is against the latest code base. It also includes the test 
with the "done" file. Finally, I was wrong about the log files. It's already 
the case that all the errors are logged into the same pig file.

> Error reporting for failed MR jobs
> --
>
> Key: PIG-781
> URL: https://issues.apache.org/jira/browse/PIG-781
> Project: Pig
>  Issue Type: Improvement
>Reporter: Gunther Hagleitner
> Attachments: partial_failure.patch, partial_failure.patch, 
> partial_failure.patch
>
>
> If we have multiple MR jobs to run and some of them fail the behavior of the 
> system is to not stop on the first failure but to keep going. That way jobs 
> that do not depend on the failed job might still succeed.
> The question is to how best report this scenario to a user. How do we tell 
> which jobs failed and which didn't?
> One way could be to tie jobs to stores and report which store locations won't 
> have data and which ones do.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-781) Error reporting for failed MR jobs

2009-05-14 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-781:
---

Status: Patch Available  (was: Open)

> Error reporting for failed MR jobs
> --
>
> Key: PIG-781
> URL: https://issues.apache.org/jira/browse/PIG-781
> Project: Pig
>  Issue Type: Improvement
>Reporter: Gunther Hagleitner
> Attachments: partial_failure.patch, partial_failure.patch, 
> partial_failure.patch
>
>
> If we have multiple MR jobs to run and some of them fail the behavior of the 
> system is to not stop on the first failure but to keep going. That way jobs 
> that do not depend on the failed job might still succeed.
> The question is to how best report this scenario to a user. How do we tell 
> which jobs failed and which didn't?
> One way could be to tie jobs to stores and report which store locations won't 
> have data and which ones do.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-781) Error reporting for failed MR jobs

2009-05-10 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-781:
---

Attachment: partial_failure.patch

The new patch does the same as above (report on the failed and succeeded jobs), but 
also:

   * Returns a list of exec jobs, one for each store, so that embedded programs 
can iterate through the results and determine successes and failures
   * Adds a flag "-F" or "-stop_on_failure" that throws an exception on the 
first failure, which causes the processing to stop.
   * Returns 2 when all jobs fail or when the stop_on_failure flag is 
specified; returns 3 if some jobs passed and others failed (see the sketch below).
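
A minimal sketch of that exit-code policy; the constant and method names are 
illustrative, not the ones used in partial_failure.patch.

{code}
public class ReturnCodeSketch {
    static final int ALL_OK = 0;
    static final int ALL_FAILED = 2;   // also used when -stop_on_failure aborts the run
    static final int SOME_FAILED = 3;

    static int returnCode(int succeededJobs, int failedJobs, boolean stopOnFailure) {
        if (failedJobs == 0) return ALL_OK;
        if (stopOnFailure || succeededJobs == 0) return ALL_FAILED;
        return SOME_FAILED;
    }
}
{code}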

> Error reporting for failed MR jobs
> --
>
> Key: PIG-781
> URL: https://issues.apache.org/jira/browse/PIG-781
> Project: Pig
>  Issue Type: Improvement
>Reporter: Gunther Hagleitner
> Attachments: partial_failure.patch, partial_failure.patch
>
>
> If we have multiple MR jobs to run and some of them fail the behavior of the 
> system is to not stop on the first failure but to keep going. That way jobs 
> that do not depend on the failed job might still succeed.
> The question is to how best report this scenario to a user. How do we tell 
> which jobs failed and which didn't?
> One way could be to tie jobs to stores and report which store locations won't 
> have data and which ones do.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (PIG-781) Error reporting for failed MR jobs

2009-05-05 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705951#action_12705951
 ] 

Gunther Hagleitner edited comment on PIG-781 at 5/5/09 1:19 AM:


This fix associates stores with MR jobs. At the end of the execution it will 
print out which stores have passed and which ones have failed.

Example:

{noformat}
50% complete
100% complete
1 map reduce job(s) failed!
Failed to produce result in: "/user/hagleitn/baz"
Successfully stored result in: "/user/hagleitn/bar"
Successfully stored result in: "/user/hagleitn/foo"
Some jobs have failed!
{noformat}


  was (Author: hagleitn):
This fix associates stores with MR jobs. At the end of the execution it 
will print out which stores have passed and which ones have failed.

Example:

{noformat}
50% complete
100% complete
1 map reduce job(s) failed!
Failed to produce result in: 
"hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/baz"
Successfully stored result in: 
"hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/bar"
Successfully stored result in: 
"hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/foo"
Some jobs have failed!
{noformat}

  
> Error reporting for failed MR jobs
> --
>
> Key: PIG-781
> URL: https://issues.apache.org/jira/browse/PIG-781
> Project: Pig
>  Issue Type: Improvement
>Reporter: Gunther Hagleitner
> Attachments: partial_failure.patch
>
>
> If we have multiple MR jobs to run and some of them fail the behavior of the 
> system is to not stop on the first failure but to keep going. That way jobs 
> that do not depend on the failed job might still succeed.
> The question is to how best report this scenario to a user. How do we tell 
> which jobs failed and which didn't?
> One way could be to tie jobs to stores and report which store locations won't 
> have data and which ones do.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-781) Error reporting for failed MR jobs

2009-05-05 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-781:
---

Attachment: partial_failure.patch

This fix associates stores with MR jobs. At the end of the execution it will 
print out which stores have passed and which ones have failed.

Example:

{noformat}
50% complete
100% complete
1 map reduce job(s) failed!
Failed to produce result in: 
"hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/baz"
Successfully stored result in: 
"hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/bar"
Successfully stored result in: 
"hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/hagleitn/foo"
Some jobs have failed!
{noformat}


> Error reporting for failed MR jobs
> --
>
> Key: PIG-781
> URL: https://issues.apache.org/jira/browse/PIG-781
> Project: Pig
>  Issue Type: Improvement
>Reporter: Gunther Hagleitner
> Attachments: partial_failure.patch
>
>
> If we have multiple MR jobs to run and some of them fail the behavior of the 
> system is to not stop on the first failure but to keep going. That way jobs 
> that do not depend on the failed job might still succeed.
> The question is to how best report this scenario to a user. How do we tell 
> which jobs failed and which didn't?
> One way could be to tie jobs to stores and report which store locations won't 
> have data and which ones do.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-789) coupling load and store in script no longer works

2009-04-30 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-789:
---

Attachment: dump_bug.patch

Both dump (openIterator) and illustrate (getExamples) show this problem. 
dump_bug.patch contains a fix; the patch is for the trunk.

> coupling load and store in script no longer works
> -
>
> Key: PIG-789
> URL: https://issues.apache.org/jira/browse/PIG-789
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.3.0
>Reporter: Alan Gates
>Assignee: Gunther Hagleitner
> Attachments: dump_bug.patch
>
>
> Many user's pig script do something like this:
> a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
> c = filter a by age > 500;
> e = group c by (name, age);
> f = foreach e generate group, COUNT($1);
> store f into 'bla';
> f1 = load 'bla';
> g = order f1 by $1;
> dump g;
> With the inclusion of the multi-query phase2 patch this appears to no longer 
> work.  You get an error:
> 2009-04-28 18:24:50,776 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
> 2100: hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/gates/bla does not exist.
> We shouldn't be checking for bla's existence here because it will be created 
> eventually by the script.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-759) HBaseStorage scheme for Load/Slice function

2009-04-28 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704011#action_12704011
 ] 

Gunther Hagleitner commented on PIG-759:


Looking at the code, it seems you can already specify columns in the load 
statement:

{noformat}
table = load 'hbase://foo' using 
org.apache.pig.backend.hadoop.hbase.HBaseStorage('bar:c bar:d') as (c:int, 
d:int);
{noformat}

Is the suggestion to change the syntax of that? Or did I misunderstand the code?

> HBaseStorage scheme for Load/Slice function
> ---
>
> Key: PIG-759
> URL: https://issues.apache.org/jira/browse/PIG-759
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
>
> We would like to change the HBaseStorage function to use a scheme when 
> loading a table in pig. The scheme we are thinking of is: "hbase". So in 
> order to load an hbase table in a pig script the statement should read:
> {noformat}
> table = load 'hbase://' using HBaseStorage();
> {noformat}
> If the scheme is omitted pig would assume the tablename to be an hdfs path 
> and the storage function would use the last component of the path as a table 
> name and output a warning.
> For details on why see jira issue: PIG-758

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-777) Code refactoring: Create optimization out of store/load post processing code

2009-04-28 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-777:
---

Attachment: log_message.patch

log_message.patch adds the message "Removing unnecessary load operation ..." 
when we remove the load from the logical plan.

> Code refactoring: Create optimization out of store/load post processing code
> 
>
> Key: PIG-777
> URL: https://issues.apache.org/jira/browse/PIG-777
> Project: Pig
>  Issue Type: Improvement
>Reporter: Gunther Hagleitner
> Attachments: log_message.patch
>
>
> The postProcessing method in the pig server checks whether a logical graph 
> contains stores to and loads from the same location. If so, it will either 
> connect the store and load, or optimize by throwing out the load and 
> connecting the store predecessor with the successor of the load.
> Ideally the introduction of the store and load connection should happen in 
> the query compiler, while the optimization should then happen in a separate 
> optimizer step as part of the optimizer framework.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-777) Code refactoring: Create optimization out of store/load post processing code

2009-04-28 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704002#action_12704002
 ] 

Gunther Hagleitner commented on PIG-777:


David,

Per PIG-627, the first example you gave will result in a single map-reduce job 
that processes both store operations, with no duplication of steps A through D. 
So, yes, you shouldn't need to introduce "D = load". PIG-627 also introduced 
an optimization that throws the "D = load" out - basically transforming 
your second example into the first.

This bug is mostly about the way the optimization is written. Some code should 
be moved around to align it with the optimization framework.

Adding a log message when this happens is a good idea though. Let me add that. 

> Code refactoring: Create optimization out of store/load post processing code
> 
>
> Key: PIG-777
> URL: https://issues.apache.org/jira/browse/PIG-777
> Project: Pig
>  Issue Type: Improvement
>Reporter: Gunther Hagleitner
>
> The postProcessing method in the pig server checks whether a logical graph 
> contains stores to and loads from the same location. If so, it will either 
> connect the store and load, or optimize by throwing out the load and 
> connecting the store predecessor with the successor of the load.
> Ideally the introduction of the store and load connection should happen in 
> the query compiler, while the optimization should then happen in a separate 
> optimizer step as part of the optimizer framework.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-652) Need to give user control of OutputFormat

2009-04-26 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-652:
---

Attachment: PIG-652-v5.patch

The v5 patch includes the multiquery-related changes.

> Need to give user control of OutputFormat
> -
>
> Key: PIG-652
> URL: https://issues.apache.org/jira/browse/PIG-652
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Affects Versions: 0.2.0
>Reporter: Alan Gates
>Assignee: Pradeep Kamath
> Attachments: PIG-652-v2.patch, PIG-652-v3.patch, PIG-652-v4.patch, 
> PIG-652-v5.patch, PIG-652.patch
>
>
> Pig currently allows users some control over InputFormat via the Slicer and 
> Slice interfaces.  It does not allow any control over OutputFormat and 
> RecordWriter interfaces.  It just allows the user to implement a storage 
> function that controls how the data is serialized.  For hadoop tables, we 
> will need to allow custom OutputFormats that prepare output information and 
> objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-781) Error reporting for failed MR jobs

2009-04-24 Thread Gunther Hagleitner (JIRA)
Error reporting for failed MR jobs
--

 Key: PIG-781
 URL: https://issues.apache.org/jira/browse/PIG-781
 Project: Pig
  Issue Type: Improvement
Reporter: Gunther Hagleitner


If we have multiple MR jobs to run and some of them fail the behavior of the 
system is to not stop on the first failure but to keep going. That way jobs 
that do not depend on the failed job might still succeed.

The question is to how best report this scenario to a user. How do we tell 
which jobs failed and which didn't?

One way could be to tie jobs to stores and report which store locations won't 
have data and which ones do.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-780) Code refactoring: PlanPrinters

2009-04-23 Thread Gunther Hagleitner (JIRA)
Code refactoring: PlanPrinters
--

 Key: PIG-780
 URL: https://issues.apache.org/jira/browse/PIG-780
 Project: Pig
  Issue Type: Improvement
Reporter: Gunther Hagleitner
Priority: Minor


There seems to be quite a bit of duplicated code/functionality across all the 
PlanPrinters in the system. It would make things easier if that were 
consolidated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-779) Warning from javacc

2009-04-23 Thread Gunther Hagleitner (JIRA)
Warning from javacc
---

 Key: PIG-779
 URL: https://issues.apache.org/jira/browse/PIG-779
 Project: Pig
  Issue Type: Improvement
Reporter: Gunther Hagleitner


This warning needs fixing:

 Reading from file 
.../src/org/apache/pig/tools/pigscript/parser/PigScriptParser.jj . . .
   [javacc] Warning: Choice conflict in (...)* construct at line 560, column 9.
   [javacc]  Expansion nested within construct and expansion following 
construct
   [javacc]  have common prefixes, one of which is: "-param"
   [javacc]  Consider using a lookahead of 2 or more for nested 
expansion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-778) ReversibleLoadStore Semantics

2009-04-23 Thread Gunther Hagleitner (JIRA)
ReversibleLoadStore Semantics
-

 Key: PIG-778
 URL: https://issues.apache.org/jira/browse/PIG-778
 Project: Pig
  Issue Type: Improvement
Reporter: Gunther Hagleitner


The question about how to use the ReversibleLoadStore function came up in 2 
scenarios recently:

a) Can we generate a load operator from a store by simply taking the same store 
function string, if the store function is a ReversibleLoadStore function? I 
would like to use that to remove unnecessary compiler generated stores, if we 
can change the depending load operators to load from a different store. 

b) Is it sufficient to check whether a pair of store and load operations on the 
same location is reversible to know whether we can eliminate it without 
changing the data? This is done in the pig server for logical plans right now.

If I go by PigStorage, then the answer to (a) is yes. The answer to (b) is no: 
we also need to check that both the load and the store use the same parameters to 
the reversible function.
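
A minimal sketch of check (b): a store/load pair on the same location is only safe 
to eliminate if the function is reversible, the locations match, and both sides use 
the same function with the same constructor arguments (e.g. the same PigStorage 
delimiter). The FuncSpecLike type is a simplified stand-in for Pig's function-spec 
representation, not its actual API.

{code}
import java.util.Arrays;

class FuncSpecLike {
    final String className;
    final String[] ctorArgs;
    FuncSpecLike(String className, String... ctorArgs) {
        this.className = className;
        this.ctorArgs = ctorArgs;
    }
}

class ReversibilitySketch {
    static boolean canEliminate(String storeLocation, FuncSpecLike storeFunc,
                                String loadLocation, FuncSpecLike loadFunc,
                                boolean funcIsReversible) {
        return funcIsReversible
                && storeLocation.equals(loadLocation)
                && storeFunc.className.equals(loadFunc.className)
                && Arrays.equals(storeFunc.ctorArgs, loadFunc.ctorArgs);
    }
}
{code}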

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-777) Code refactoring: Create optimization out of store/load post processing code

2009-04-23 Thread Gunther Hagleitner (JIRA)
Code refactoring: Create optimization out of store/load post processing code


 Key: PIG-777
 URL: https://issues.apache.org/jira/browse/PIG-777
 Project: Pig
  Issue Type: Improvement
Reporter: Gunther Hagleitner


The postProcessing method in the pig server checks whether a logical graph 
contains stores to and loads from the same location. If so, it will either 
connect the store and load, or optimize by throwing out the load and connecting 
the store predecessor with the successor of the load.

Ideally the introduction of the store and load connection should happen in the 
query compiler, while the optimization should then happen in a separate 
optimizer step as part of the optimizer framework.
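
A minimal sketch of the decision that step makes for each store/load pair on the 
same location; Plan and the operator arguments are simplified stand-ins, not Pig's 
logical-plan classes.

{code}
class StoreLoadSketch {
    interface Plan {
        void connect(Object from, Object to);
        void removeAndReconnect(Object op); // drop op, wiring its predecessor to its successor
    }

    static void handlePair(Plan plan, Object store, Object load, boolean safeToEliminate) {
        if (safeToEliminate) {
            // e.g. both sides use the same reversible function with the same arguments
            plan.removeAndReconnect(load);
        } else {
            // keep both operators but record the execution dependency
            plan.connect(store, load);
        }
    }
}
{code}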

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-776) Code refactoring: Move "moveResults" code from JobControlCompiler to MapReduceLauncher

2009-04-23 Thread Gunther Hagleitner (JIRA)
Code refactoring: Move "moveResults" code from JobControlCompiler to 
MapReduceLauncher
--

 Key: PIG-776
 URL: https://issues.apache.org/jira/browse/PIG-776
 Project: Pig
  Issue Type: Improvement
Reporter: Gunther Hagleitner
Priority: Minor


It makes more sense for the moveResults code to live in the launcher rather 
than the compiler.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-16 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: error_handling_0416.patch

Fixed some issues with the error handling patch (0415):

   * Duplicated error code 2129
   * Unclear string "splitter"
   * Added native exception message to error msg in store operator.

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
> Attachments: doc-fix.patch, error_handling_0415.patch, 
> error_handling_0416.patch, file_cmds-0305.patch, fix_store_prob.patch, 
> merge-041409.patch, merge_741727_HEAD__0324.patch, 
> merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
> multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
> multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
> multiquery_0306.patch, multiquery_explain_fix.patch, 
> non_reversible_store_load_dependencies.patch, 
> non_reversible_store_load_dependencies_2.patch, 
> noop_filter_absolute_path_flag.patch, 
> noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed and 
> filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-15 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: error_handling_0415.patch

This patch contains:

   * Error codes/msg
   * Javadoc changes
   * fix the merge error in parser ("aliases" cmd)
   * updated golden files

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
> Attachments: doc-fix.patch, error_handling_0415.patch, 
> file_cmds-0305.patch, fix_store_prob.patch, merge-041409.patch, 
> merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
> merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, 
> multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, 
> multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, 
> multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, 
> non_reversible_store_load_dependencies_2.patch, 
> noop_filter_absolute_path_flag.patch, 
> noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed and 
> filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-15 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: doc-fix.patch

javadoc changes only. doc-fix.patch contains "fixes" to silence javadoc 
warnings.

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
> Attachments: doc-fix.patch, file_cmds-0305.patch, 
> fix_store_prob.patch, merge-041409.patch, merge_741727_HEAD__0324.patch, 
> merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
> multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
> multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
> multiquery_0306.patch, multiquery_explain_fix.patch, 
> non_reversible_store_load_dependencies.patch, 
> non_reversible_store_load_dependencies_2.patch, 
> noop_filter_absolute_path_flag.patch, 
> noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed and 
> filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-14 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: merge-041409.patch

merge-041409.patch contains the latest merge from trunk to branch.

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
> merge-041409.patch, merge_741727_HEAD__0324.patch, 
> merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
> multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
> multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
> multiquery_0306.patch, multiquery_explain_fix.patch, 
> non_reversible_store_load_dependencies.patch, 
> non_reversible_store_load_dependencies_2.patch, 
> noop_filter_absolute_path_flag.patch, 
> noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed and 
> filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-13 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: streaming-fix.patch

Some fixes in the patch "streaming-fix.patch":

   * The split operator wasn't always playing nicely with the way we run the 
pipeline one extra time in the mapper's or reducer's close function if there's 
a stream operator present
   * Moved the MR optimizer that sets "stream in map" and "stream in reduce" to 
the end of the queue.
   * PhyPlanVisitor forgets to pop some walkers it pushed on the stack. That 
can result in the NoopFilterRemoval stage failing, because it's looking in the 
wrong plan.
   * Setting the jobname by default to the scriptname came in through the last 
merge, but no longer worked

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
> merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
> merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, 
> multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, 
> multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, 
> multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, 
> non_reversible_store_load_dependencies_2.patch, 
> noop_filter_absolute_path_flag.patch, 
> noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed and 
> filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-759) HBaseStorage scheme for Load/Slice function

2009-04-08 Thread Gunther Hagleitner (JIRA)
HBaseStorage scheme for Load/Slice function
---

 Key: PIG-759
 URL: https://issues.apache.org/jira/browse/PIG-759
 Project: Pig
  Issue Type: Bug
Reporter: Gunther Hagleitner


We would like to change the HBaseStorage function to use a scheme when loading 
a table in pig. The scheme we are thinking of is: "hbase". So in order to load 
an hbase table in a pig script the statement should read:

{noformat}
table = load 'hbase://' using HBaseStorage();
{noformat}

If the scheme is omitted, pig would assume the table name to be an hdfs path, 
and the storage function would use the last component of the path as the table 
name and output a warning.

For details on why, see JIRA issue PIG-758.
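
A rough sketch of that behaviour (a hypothetical helper, not the actual 
HBaseStorage code): strip the scheme when it is present, otherwise fall back to 
the last component of the path and warn.

{code}
// Hypothetical sketch of the behaviour described above; not HBaseStorage code.
class HBaseTableName {
    static String tableName(String location) {
        String scheme = "hbase://";
        if (location.startsWith(scheme)) {
            return location.substring(scheme.length());
        }
        // No scheme: treat the string as an hdfs path, use its last component
        // as the table name and warn.
        System.err.println("WARN: no hbase:// scheme in '" + location
                + "', using the last path component as the table name");
        int slash = location.lastIndexOf('/');
        return slash < 0 ? location : location.substring(slash + 1);
    }

    public static void main(String[] args) {
        System.out.println(tableName("hbase://users"));     // users
        System.out.println(tableName("/data/hbase/users")); // users (plus a warning)
    }
}
{code}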

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-757) Using schemes in load and store paths

2009-04-08 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner resolved PIG-757.


Resolution: Duplicate

> Using schemes in load and store paths
> -
>
> Key: PIG-757
> URL: https://issues.apache.org/jira/browse/PIG-757
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
>
> As part of the multiquery optimization work there's a need to use absolute 
> paths for load and store operations (because the current directory changes 
> during the execution of the script). In order to do so, the suggestion is to 
> change the semantics of the location/filename string used in LoadFunc and 
> Slicer/Slice.
> The proposed change is:
>* Load locations without a scheme part are expected to be hdfs (mapreduce 
> mode) or local (local mode) paths
>* Any hdfs or local path will be translated to a fully qualified absolute 
> path before it is handed to either a LoadFunc or Slicer
>    * Any scheme other than file or hdfs will result in the load path being 
> passed through to the LoadFunc or Slicer without any modification.
> Example:
> If you have a LoadFunc that reads from a database, right now the following 
> could be used:
> {{{
> a = load 'table' using DBLoader();
> }}}
> With the proposed changes table would be translated into an hdfs path though 
> ("hdfs:///table"). Probably not what the loader wants to see. So in order 
> to make this work one would use:
> {{{
> a = load 'sql://table' using DBLoader();
> }}}
> Now the DBLoader would see the unchanged string "sql://table". And pig will 
> not use the string as an hdfs location.
> This is an incompatible change, but hopefully only a few existing 
> Slicers/Loaders are affected. This behavior is part of the multiquery 
> work and can be turned off (reverted back) by using the "no_multiquery" flag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-758) Converting load/store locations into fully qualified absolute paths

2009-04-08 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-758:
---

Description: 
As part of the multiquery optimization work there is a need to use absolute 
paths for load and store operations (because the current directory changes 
during the execution of the script). In order to do so, we are suggesting a 
change to the semantics of the location/filename string used in LoadFunc and 
Slicer/Slice.

The proposed change is:

   * Load locations without a scheme part are expected to be hdfs (mapreduce 
mode) or local (local mode) paths
   * Any hdfs or local path will be translated to a fully qualified absolute 
path before it is handed to either a LoadFunc or Slicer
   * Any scheme other than "file" or "hdfs" will result in the load path to be 
passed through to the LoadFunc or Slicer without any modification.

Example:

If you have a LoadFunc that reads from a database, in the current system the 
following could be used:

{noformat}
a = load 'table' using DBLoader();
{noformat}

With the proposed changes table would be translated into an hdfs path though 
("hdfs:///table"). Probably not what the DBLoader would want to see. In 
order to make it work one could use:

{noformat}
a = load 'sql://table' using DBLoader();
{noformat}

Now the DBLoader would see the unchanged string "sql://table".

This is an incompatible change, but hopefully not affecting many existing 
Loaders/Slicers. Since this is needed with the multiquery feature, the behavior 
can be reverted back by using the "no_multiquery" pig flag.
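
For illustration, here is a hedged sketch of the proposed rule (hypothetical 
code using java.net.URI, not Pig's implementation; the default filesystem URI 
and working directory are made-up parameters): scheme-less, file and hdfs 
locations are fully qualified, everything else is passed through untouched.

{code}
import java.net.URI;

// Hypothetical sketch of the proposed location handling; not Pig's code.
class LocationQualifier {
    static String qualify(String location, URI defaultFs, String cwd) {
        URI uri = URI.create(location);
        String scheme = uri.getScheme();
        if (scheme != null && !scheme.equals("file") && !scheme.equals("hdfs")) {
            return location; // e.g. "sql://table" reaches the LoadFunc unchanged
        }
        String path = uri.getPath() == null ? location : uri.getPath();
        if (!path.startsWith("/")) {
            path = cwd + "/" + path; // make relative paths absolute
        }
        String fsScheme = scheme == null ? defaultFs.getScheme() : scheme;
        String authority = scheme == null ? defaultFs.getAuthority() : uri.getAuthority();
        return fsScheme + "://" + (authority == null ? "" : authority) + path;
    }

    public static void main(String[] args) {
        URI fs = URI.create("hdfs://namenode:8020");
        // hdfs://namenode:8020/user/alice/table
        System.out.println(qualify("table", fs, "/user/alice"));
        // sql://table (passed through)
        System.out.println(qualify("sql://table", fs, "/user/alice"));
    }
}
{code}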

  was:
As part of the multiquery optimization work there is a need to use absolute 
paths for load and store operations (because the current directory changes 
during the execution of the script). In order to do so, we are suggesting a 
change to the semantics of the location/filename string used in LoadFunc and 
Slicer/Slice.

The proposed change is:

   * Load locations without a scheme part are expected to be hdfs (mapreduce 
mode) or local (local mode) paths
   * Any hdfs or local path will be translated to a fully qualified absolute 
path before it is handed to either a LoadFunc or Slicer
   * Any scheme other than "file" or "hdfs" will result in the load path to be 
passed through to the LoadFunc or Slicer without any modification.

Example:

If you have a LoadFunc that reads from a database, in the current system the 
following could be used:

{code}
a = load 'table' using DBLoader();
{code}

With the proposed changes table would be translated into an hdfs path though 
("hdfs:///table"). Probably not what the DBLoader would want to see. In 
order to make it work one could use:

{code}
a = load 'sql://table' using DBLoader();
{code}

Now the DBLoader would see the unchanged string "sql://table".

This is an incompatible change, but hopefully not affecting many existing 
Loaders/Slicers. Since this is needed with the multiquery feature, the behavior 
can be reverted back by using the "no_multiquery" pig flag.


> Converting load/store locations into fully qualified absolute paths
> ---
>
> Key: PIG-758
> URL: https://issues.apache.org/jira/browse/PIG-758
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
>
> As part of the multiquery optimization work there is a need to use absolute 
> paths for load and store operations (because the current directory changes 
> during the execution of the script). In order to do so, we are suggesting a 
> change to the semantics of the location/filename string used in LoadFunc and 
> Slicer/Slice.
> The proposed change is:
>* Load locations without a scheme part are expected to be hdfs (mapreduce 
> mode) or local (local mode) paths
>* Any hdfs or local path will be translated to a fully qualified absolute 
> path before it is handed to either a LoadFunc or Slicer
>* Any scheme other than "file" or "hdfs" will result in the load path to 
> be passed through to the LoadFunc or Slicer without any modification.
> Example:
> If you have a LoadFunc that reads from a database, in the current system the 
> following could be used:
> {noformat}
> a = load 'table' using DBLoader();
> {noformat}
> With the proposed changes table would be translated into an hdfs path though 
> ("hdfs:///table"). Probably not what the DBLoader would want to see. In 
> order to make it work one could use:
> {noformat}
> a = load 'sql://table' using DBLoader();
> {noformat}
> Now the DBLoader would see the unchanged string "sql://table".
> This is an incompatible change, but hopefully not affecting many existing 
> Loaders/Slicers. Since this is needed with the multiquery feature, the 
> behavior can be reverted back by using the "no_multiquery" pig flag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-758) Converting load/store locations into fully qualified absolute paths

2009-04-08 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-758:
---

Description: 
As part of the multiquery optimization work there is a need to use absolute 
paths for load and store operations (because the current directory changes 
during the execution of the script). In order to do so, we are suggesting a 
change to the semantics of the location/filename string used in LoadFunc and 
Slicer/Slice.

The proposed change is:

   * Load locations without a scheme part are expected to be hdfs (mapreduce 
mode) or local (local mode) paths
   * Any hdfs or local path will be translated to a fully qualified absolute 
path before it is handed to either a LoadFunc or Slicer
   * Any scheme other than "file" or "hdfs" will result in the load path to be 
passed through to the LoadFunc or Slicer without any modification.

Example:

If you have a LoadFunc that reads from a database, in the current system the 
following could be used:

{code}
a = load 'table' using DBLoader();
{code}

With the proposed changes table would be translated into an hdfs path though 
("hdfs:///table"). Probably not what the DBLoader would want to see. In 
order to make it work one could use:

{code}
a = load 'sql://table' using DBLoader();
{code}

Now the DBLoader would see the unchanged string "sql://table".

This is an incompatible change, but hopefully not affecting many existing 
Loaders/Slicers. Since this is needed with the multiquery feature, the behavior 
can be reverted back by using the "no_multiquery" pig flag.

  was:
As part of the multiquery optimization work there is a need to use absolute 
paths for load and store operations (because the current directory changes 
during the execution of the script). In order to do so, we are suggesting a 
change to the semantics of the location/filename string used in LoadFunc and 
Slicer/Slice.

The proposed change is:

   * Load locations without a scheme part are expected to be hdfs (mapreduce 
mode) or local (local mode) paths
   * Any hdfs or local path will be translated to a fully qualified absolute 
path before it is handed to either a LoadFunc or Slicer
   * Any scheme other than "file" or "hdfs" will result in the load path to be 
passed through to the LoadFunc or Slicer without any modification.

Example:

If you have a LoadFunc that reads from a database, in the current system the 
following could be used:

{{{
a = load 'table' using DBLoader();
}}}

With the proposed changes table would be translated into an hdfs path though 
("hdfs:///table"). Probably not what the DBLoader would want to see. In 
order to make it work one could use:

{{{
a = load 'sql://table' using DBLoader();
}}}

Now the DBLoader would see the unchanged string "sql://table".

This is an incompatible change, but hopefully not affecting many existing 
Loaders/Slicers. Since this is needed with the multiquery feature, the behavior 
can be reverted back by using the "no_multiquery" pig flag.


> Converting load/store locations into fully qualified absolute paths
> ---
>
> Key: PIG-758
> URL: https://issues.apache.org/jira/browse/PIG-758
> Project: Pig
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
>
> As part of the multiquery optimization work there is a need to use absolute 
> paths for load and store operations (because the current directory changes 
> during the execution of the script). In order to do so, we are suggesting a 
> change to the semantics of the location/filename string used in LoadFunc and 
> Slicer/Slice.
> The proposed change is:
>* Load locations without a scheme part are expected to be hdfs (mapreduce 
> mode) or local (local mode) paths
>* Any hdfs or local path will be translated to a fully qualified absolute 
> path before it is handed to either a LoadFunc or Slicer
>* Any scheme other than "file" or "hdfs" will result in the load path to 
> be passed through to the LoadFunc or Slicer without any modification.
> Example:
> If you have a LoadFunc that reads from a database, in the current system the 
> following could be used:
> {code}
> a = load 'table' using DBLoader();
> {code}
> With the proposed changes table would be translated into an hdfs path though 
> ("hdfs:///table"). Probably not what the DBLoader would want to see. In 
> order to make it work one could use:
> {code}
> a = load 'sql://table' using DBLoader();
> {code}
> Now the DBLoader would see the unchanged string "sql://table".
> This is an incompatible change, but hopefully not affecting many existing 
> Loaders/Slicers. Since this is needed with the multiquery feature, the 
> behavior can be reverted back by using the "no_multiquery" pig flag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (PIG-758) Converting load/store locations into fully qualified absolute paths

2009-04-08 Thread Gunther Hagleitner (JIRA)
Converting load/store locations into fully qualified absolute paths
---

 Key: PIG-758
 URL: https://issues.apache.org/jira/browse/PIG-758
 Project: Pig
  Issue Type: Bug
Reporter: Gunther Hagleitner


As part of the multiquery optimization work there is a need to use absolute 
paths for load and store operations (because the current directory changes 
during the execution of the script). In order to do so, we are suggesting a 
change to the semantics of the location/filename string used in LoadFunc and 
Slicer/Slice.

The proposed change is:

   * Load locations without a scheme part are expected to be hdfs (mapreduce 
mode) or local (local mode) paths
   * Any hdfs or local path will be translated to a fully qualified absolute 
path before it is handed to either a LoadFunc or Slicer
   * Any scheme other than "file" or "hdfs" will result in the load path to be 
passed through to the LoadFunc or Slicer without any modification.

Example:

If you have a LoadFunc that reads from a database, in the current system the 
following could be used:

{{{
a = load 'table' using DBLoader();
}}}

With the proposed changes table would be translated into an hdfs path though 
("hdfs:///table"). Probably not what the DBLoader would want to see. In 
order to make it work one could use:

{{{
a = load 'sql://table' using DBLoader();
}}}

Now the DBLoader would see the unchanged string "sql://table".

This is an incompatible change, but hopefully not affecting many existing 
Loaders/Slicers. Since this is needed with the multiquery feature, the behavior 
can be reverted back by using the "no_multiquery" pig flag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-757) Using schemes in load and store paths

2009-04-08 Thread Gunther Hagleitner (JIRA)
Using schemes in load and store paths
-

 Key: PIG-757
 URL: https://issues.apache.org/jira/browse/PIG-757
 Project: Pig
  Issue Type: Bug
Reporter: Gunther Hagleitner


As part of the multiquery optimization work there's a need to use absolute 
paths for load and store operations (because the current directory changes 
during the execution of the script). In order to do so, the suggestion is to 
change the semantics of the location/filename string used in LoadFunc and 
Slicer/Slice.

The proposed change is:

   * Load locations without a scheme part are expected to be hdfs (mapreduce 
mode) or local (local mode) paths
   * Any hdfs or local path will be translated to a fully qualified absolute 
path before it is handed to either a LoadFunc or Slicer
   * Any scheme other than file or hdfs will result in the load path being 
passed through to the LoadFunc or Slicer without any modification.

Example:

If you have a LoadFunc that reads from a database, right now the following 
could be used:

{{{
a = load 'table' using DBLoader();
}}}

With the proposed changes table would be translated into an hdfs path though 
("hdfs:///table"). Probably not what the loader wants to see. So in order 
to make this work one would use:

{{{
a = load 'sql://table' using DBLoader();
}}}

Now the DBLoader would see the unchanged string "sql://table". And pig will not 
use the string as an hdfs location.

This is an incompatible change, but hopefully only a few existing 
Slicers/Loaders are affected. This behavior is part of the multiquery work and 
can be turned off (reverted back) by using the "no_multiquery" flag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-07 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: merge_trunk_to_branch.patch

Merge latest trunk changes to branch

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
> merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
> merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, 
> multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, 
> multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, 
> multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, 
> non_reversible_store_load_dependencies_2.patch, 
> noop_filter_absolute_path_flag.patch, 
> noop_filter_absolute_path_flag_0401.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-04 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: non_reversible_store_load_dependencies_2.patch

Same as above plus:

   * Fix for explain when a script has execution points inside. 

Like:

{{{
a = load ...
...
store a
exec;
b = load ...
...
}}}

This will run explain once for each execution block.


> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
> merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
> multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
> multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
> multiquery_0306.patch, multiquery_explain_fix.patch, 
> non_reversible_store_load_dependencies.patch, 
> non_reversible_store_load_dependencies_2.patch, 
> noop_filter_absolute_path_flag.patch, 
> noop_filter_absolute_path_flag_0401.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-02 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: non_reversible_store_load_dependencies.patch

This patch takes care of two things:

   * Cases where a script has a store followed by a load and the 
Load/StoreFunc is either not reversible or the two are different functions.
   * PlanSetter for physical plans in the JobControlCompiler (right now only 
the outermost plan's elements are set)


> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
> merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
> multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
> multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
> multiquery_0306.patch, multiquery_explain_fix.patch, 
> non_reversible_store_load_dependencies.patch, 
> noop_filter_absolute_path_flag.patch, 
> noop_filter_absolute_path_flag_0401.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-01 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: noop_filter_absolute_path_flag_0401.patch

This one is the same as before, but:

   * Added some comments
   * Reversed the multiquery flag (on by default)
   * HBase stuff works without the "hbase://" scheme but will print a warning
   * Fixed a problem in NoopStoreRemover

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
> merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
> multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
> multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
> multiquery_0306.patch, multiquery_explain_fix.patch, 
> noop_filter_absolute_path_flag.patch, 
> noop_filter_absolute_path_flag_0401.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-30 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: noop_filter_absolute_path_flag.patch

This patch contains three items:

- Removes the noop stores as described above
- Makes load and store paths absolute and canonical
- Introduces a flag that turns multiquery on and off (default is off)

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
> merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
> multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
> multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
> multiquery_0306.patch, multiquery_explain_fix.patch, 
> noop_filter_absolute_path_flag.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-25 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: fix_store_prob.patch

This patch addresses an issue with the way we deal with scripts that do:
{{{
...
store a into 'foo';
a = load 'foo';
...
}}}

In the logical plan this will end up as a split with one branch storing into 
'foo' and the other continuing the processing after the load. The actual load 
is removed.

This works well but has an unfortunate side effect. If the store/load marks the 
boundary between two map-reduce jobs, the MRCompiler has to insert a tmp 
store-load bridge - which means that we now end up with two stores.

This fix detects this case in the optimizing phase after compilation. It 
removes the unnecessary store and has the following job load from the other one.
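
As a toy rendering of that optimization (invented names and data structures, 
not the MRCompiler's): when the boundary job already carries the user's 
explicit store, the redundant tmp store is dropped and the follow-up job loads 
from the user's file instead.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the store-load bridge cleanup; not the actual MRCompiler code.
class StoreLoadBridgeFix {
    static class Job {
        List<String> stores = new ArrayList<>();
        List<String> loads = new ArrayList<>();
    }

    // If the producing job stores both the user's file and a tmp bridge file,
    // drop the tmp store and point the consuming job's load at the user's file.
    static void collapseBridge(Job producer, Job consumer, String userFile, String tmpFile) {
        if (producer.stores.contains(userFile) && producer.stores.contains(tmpFile)) {
            producer.stores.remove(tmpFile);
            consumer.loads.replaceAll(f -> f.equals(tmpFile) ? userFile : f);
        }
    }

    public static void main(String[] args) {
        Job mapJob = new Job();
        mapJob.stores.addAll(Arrays.asList("foo", "pigtmp/bridge-0001"));
        Job reduceJob = new Job();
        reduceJob.loads.add("pigtmp/bridge-0001");

        collapseBridge(mapJob, reduceJob, "foo", "pigtmp/bridge-0001");
        System.out.println(mapJob.stores + " / " + reduceJob.loads); // [foo] / [foo]
    }
}
{code}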


> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
> merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
> multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
> multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
> multiquery_0306.patch, multiquery_explain_fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-726) Stop printing scope as part of Operator.toString()

2009-03-25 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689268#action_12689268
 ] 

Gunther Hagleitner commented on PIG-726:


I've made a simpler change that had similar effects in the multiquery branch. I 
basically set the scope to an integer (the scope is not really used right now 
as I understand it. It's a leftover from times when pig was designed as a 
standalone server). That way each operator will say: ForEach 1-4 (or 2-4 
depending on how many instances of the pig server you have in your jvm.)

The alternative is to change all the logical operators' name() functions. They 
look something like: return ... + mKey.scope + "-" + mKey.id; For 
physical operators we could get away with the proposed change to the 
key.toString() function.

That seems more painful.
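
To make the effect concrete, here is a toy modelled loosely on OperatorKey (not 
the actual Pig class; the operator-name prefix is invented): with the scope set 
to a small per-PigServer integer, the same key prints as "ForEach 1-4" instead 
of dragging along the user name and date.

{code}
// Toy modelled on OperatorKey's scope/id pair; not the actual Pig class.
class ToyOperatorKey {
    final String scope;
    final long id;

    ToyOperatorKey(String scope, long id) {
        this.scope = scope;
        this.id = id;
    }

    @Override
    public String toString() {
        return scope + "-" + id;
    }

    public static void main(String[] args) {
        // Old behaviour: the scope carries the user name and grunt start time.
        ToyOperatorKey verbose =
                new ToyOperatorKey("tejas-Thu Mar 19 11:25:23 PDT 2009", 4);
        System.out.println("ForEach " + verbose);

        // With an integer scope per PigServer instance the name gets compact.
        ToyOperatorKey compact = new ToyOperatorKey("1", 4);
        System.out.println("ForEach " + compact); // ForEach 1-4
    }
}
{code}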


> Stop printing scope as part of Operator.toString()
> --
>
> Key: PIG-726
> URL: https://issues.apache.org/jira/browse/PIG-726
> Project: Pig
>  Issue Type: Improvement
>Reporter: Thejas M Nair
>
> When an operator is printed in pig, it prints a string with the user name and 
> date at which the grunt shell was started. This information is not useful and 
> makes the output very verbose.
> For example, a line in explain is like -
> ForEach tejas-Thu Mar 19 11:25:23 PDT 2009-4 Schema: {themap: map[ ]} Type: 
> bag
> I am proposing that it should change to -
> ForEach (id:4) Schema: {themap: map[ ]} Type: bag
> That string comes from scope in OperatorKey class. We don't use make use of 
> it anywhere, so we should stop printing it. The change is only in 
> OperatorKey.toString();

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-24 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: merge_741727_HEAD__0324_2.patch

Seems like the last merge patch didn't correctly contain the entire new 
TestFinish.java file. Well, this one does.

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, merge_741727_HEAD__0324.patch, 
> merge_741727_HEAD__0324_2.patch, multi-store-0303.patch, 
> multi-store-0304.patch, multiquery-phase2_0313.patch, 
> multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
> multiquery_0306.patch, multiquery_explain_fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-24 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: merge_741727_HEAD__0324.patch

Merge of trunk (741727:HEAD) into multiquery branch. Aka merge from hell :-)

I ran all unit tests, the multiquery tests and the nightly tests and everything 
looks fine (no errors).



> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, merge_741727_HEAD__0324.patch, 
> multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
> multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
> multiquery_0306.patch, multiquery_explain_fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-19 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: multiquery_explain_fix.patch

Fixes three issues with explain:

a) This is not a bug. Splits in interactive mode still need this branch.
b) explain needs to discard batch iff it was loading a script
c) Split is now a nested operator (and explain needs to know)

This patch doesn't have any overlapping files with Richard's last patch.

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Olga Natkovich
> Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
> multi-store-0304.patch, multiquery-phase2_0313.patch, multiquery_0223.patch, 
> multiquery_0224.patch, multiquery_0306.patch, multiquery_explain_fix.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-05 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679500#action_12679500
 ] 

Gunther Hagleitner commented on PIG-627:


Oh, I also took out the restriction on openIterator in batch mode. That was 
no longer needed.

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: types_branch
>Reporter: Olga Natkovich
> Fix For: types_branch
>
> Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
> multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-05 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: file_cmds-0305.patch

This patch is for the multi query branch again. It mostly fixes the problem 
with certain commands in the script that require immediate execution (in batch 
mode).

So if you do stuff like:

...
store a into 'tmp_foo';
...
rm tmp_foo
...

The rm will trigger execution and the file will be there for you to delete, 
copyToLocal, move, etc. You can also use the "exec" statement without params in 
a script now, to force execution of what we've seen so far.

This patch also contains a minor fix with the computation of progress in MR 
jobs (which I screwed up in the last patch).



> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: types_branch
>Reporter: Olga Natkovich
> Fix For: types_branch
>
> Attachments: file_cmds-0305.patch, multi-store-0303.patch, 
> multi-store-0304.patch, multiquery_0223.patch, multiquery_0224.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-04 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: multi-store-0304.patch

Same as the other one except: 

- Documented the createStoreFunction method some more.
- Removed unnecessary fields in the path parsing.
- Moved the tear-down of stores below the extra streaming run (in PigMapBase's 
and PigMapReduce's close functions).

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: types_branch
>Reporter: Olga Natkovich
> Fix For: types_branch
>
> Attachments: multi-store-0303.patch, multi-store-0304.patch, 
> multiquery_0223.patch, multiquery_0224.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-03-03 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: multi-store-0303.patch

This patch introduces the functionality to support multiple stores in a single 
MR job. It's for the multiquery branch and it is needed to unblock concurrent 
dev on the split operator.

There aren't enough unit tests in this patch yet. They will be provided once 
the split operator can use multi stores (right now, nothing actually uses these 
stores, so testing is difficult). In order to test the patch, I had temporarily 
turned multi store on for all queries (even if they only have one store) and 
then ran all the unit tests. All tests passed.

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: types_branch
>Reporter: Olga Natkovich
> Fix For: types_branch
>
> Attachments: multi-store-0303.patch, multiquery_0223.patch, 
> multiquery_0224.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-02-24 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: multiquery_0224.patch

This patch includes the multiquery unit test cases.

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: types_branch
>Reporter: Olga Natkovich
> Fix For: types_branch
>
> Attachments: multiquery_0223.patch, multiquery_0224.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-02-23 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: multiquery_0223.patch

This is for the multiquery branch. It's phase 1. It contains a lot of 
infrastructural work to be able to look at entire scripts during evaluation 
(batch mode). It will look at a script plan and insert splits whenever there is 
a shared sequence of operations. The split execution is still the same as it 
was before (load-store bridge).
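
As a toy picture of the split insertion (an invented representation, not the 
actual plan classes): two store pipelines that share a prefix of operations get 
a split where they diverge.

{code}
import java.util.Arrays;
import java.util.List;

// Toy illustration of inserting a split after a shared operation prefix;
// not the actual logical plan code.
class SharedPrefixSplit {
    static int sharedPrefix(List<String> p1, List<String> p2) {
        int n = Math.min(p1.size(), p2.size());
        int i = 0;
        while (i < n && p1.get(i).equals(p2.get(i))) {
            i++;
        }
        return i; // a split would be inserted after these shared operations
    }

    public static void main(String[] args) {
        // Pipelines for the two stores in the example script of this issue.
        List<String> storeB = Arrays.asList("load 'data'", "filter a > 5", "store 'output1'");
        List<String> storeC = Arrays.asList("load 'data'", "filter a > 5", "group by b", "store 'output2'");
        System.out.println("split after " + sharedPrefix(storeB, storeC) + " shared operations");
    }
}
{code}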

> PERFORMANCE: multi-query optimization
> -
>
> Key: PIG-627
> URL: https://issues.apache.org/jira/browse/PIG-627
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: types_branch
>Reporter: Olga Natkovich
> Fix For: types_branch
>
> Attachments: multiquery_0223.patch
>
>
> Currently, if your Pig script contains multiple stores and some shared 
> computation, Pig will execute several independent queries. For instance:
> A = load 'data' as (a, b, c);
> B = filter A by a > 5;
> store B into 'output1';
> C = group B by b;
> store C into 'output2';
> This script will result in a map-only job that generates output1 followed by a 
> map-reduce job that generates output2. As a result, the data is read, parsed 
> and filtered twice, which is unnecessary and costly. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-574) run command for grunt

2009-02-11 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-574:
---

Attachment: run_command_params_021109.patch

Good point. I felt it was a little strange to specify "-param" on the grunt 
shell, but it is easier to remember if you're using it outside the shell already.

So, this patch does the same as the last one, but the syntax is:

run myscript.pig -param LIMIT=5 -param FILE=/foo/bar.txt -param_file 
myparams.ppf

> run command for grunt
> -
>
> Key: PIG-574
> URL: https://issues.apache.org/jira/browse/PIG-574
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt
>Reporter: David Ciemiewicz
>Priority: Minor
> Attachments: run_command.patch, run_command_params.patch, 
> run_command_params_021109.patch
>
>
> This is a request for a "run file" command in grunt which will read a script 
> from the local file system and execute the script interactively while in the 
> grunt shell.
> One of the things that slows down iterative development of large, complicated 
> Pig scripts that must operate on hadoop fs data is that the edit, run, debug 
> cycle is slow because I must wait to allocate a Hadoop-on-Demand (hod) 
> cluster for each iteration.  I would prefer not to preallocate a cluster of 
> nodes (though I could).
> Instead, I'd like to have one window open and edit my Pig script using vim or 
> emacs, write it, and then type "run myscript.pig" at the grunt shell until I 
> get things right.
> I'm used to doing similar things with Oracle, MySQL, and R. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-574) run command for grunt

2009-02-11 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672570#action_12672570
 ] 

Gunther Hagleitner commented on PIG-574:


Oh, I also ran the unit tests. They pass.

> run command for grunt
> -
>
> Key: PIG-574
> URL: https://issues.apache.org/jira/browse/PIG-574
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt
>Reporter: David Ciemiewicz
>Priority: Minor
> Attachments: run_command.patch, run_command_params.patch
>
>
> This is a request for a "run file" command in grunt which will read a script 
> from the local file system and execute the script interactively while in the 
> grunt shell.
> One of the things that slows down iterative development of large, complicated 
> Pig scripts that must operate on hadoop fs data is that the edit, run, debug 
> cycle is slow because I must wait to allocate a Hadoop-on-Demand (hod) 
> cluster for each iteration.  I would prefer not to preallocate a cluster of 
> nodes (though I could).
> Instead, I'd like to have one window open and edit my Pig script using vim or 
> emacs, write it, and then type "run myscript.pig" at the grunt shell until I 
> get things right.
> I'm used to doing similar things with Oracle, MySQL, and R. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-574) run command for grunt

2009-02-11 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-574:
---

Attachment: run_command_params.patch

Thanks for reviewing the patch!

I tried to address the 3 issues you pointed out:

1) You can now specify parameters and param files in both the exec and run 
commands

grunt> run myscript.pig using param_file myparams.ppf
or:
grunt> run myscript.pig using param LIMIT=5 param_file myparams.ppf

The syntax mimics what you can do on the command line when executing a script 
without the "-"s.

2) The script lines are now added to the command history in interactive mode

3) The double grunt... That's actually harder to fix than I thought, but I 
added a newline, so it won't say:

grunt> grunt>

but:

grunt>
grunt>

Let's just tell everyone that that's because they have extra newlines in their 
scripts. Maybe they won't find out. ;-)

> run command for grunt
> -
>
> Key: PIG-574
> URL: https://issues.apache.org/jira/browse/PIG-574
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt
>Reporter: David Ciemiewicz
>Priority: Minor
> Attachments: run_command.patch, run_command_params.patch
>
>
> This is a request for a "run file" command in grunt which will read a script 
> from the local file system and execute the script interactively while in the 
> grunt shell.
> One of the things that slows down iterative development of large, complicated 
> Pig scripts that must operate on hadoop fs data is that the edit, run, debug 
> cycle is slow because I must wait to allocate a Hadoop-on-Demand (hod) 
> cluster for each iteration.  I would prefer not to preallocate a cluster of 
> nodes (though I could).
> Instead, I'd like to have one window open and edit my Pig script using vim or 
> emacs, write it, and then type "run myscript.pig" at the grunt shell until I 
> get things right.
> I'm used to doing similar things with Oracle, MySQL, and R. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-574) run command for grunt

2009-02-10 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-574:
---

Status: Patch Available  (was: Open)

> run command for grunt
> -
>
> Key: PIG-574
> URL: https://issues.apache.org/jira/browse/PIG-574
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt
>Reporter: David Ciemiewicz
>Priority: Minor
> Attachments: run_command.patch
>
>
> This is a request for a "run file" command in grunt which will read a script 
> from the local file system and execute the script interactively while in the 
> grunt shell.
> One of the things that slows down iterative development of large, complicated 
> Pig scripts that must operate on hadoop fs data is that the edit, run, debug 
> cycle is slow because I must wait to allocate a Hadoop-on-Demand (hod) 
> cluster for each iteration.  I would prefer not to preallocate a cluster of 
> nodes (though I could).
> Instead, I'd like to have one window open and edit my Pig script using vim or 
> emacs, write it, and then type "run myscript.pig" at the grunt shell until I 
> get things right.
> I'm used to doing similar things with Oracle, MySQL, and R. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-574) run command for grunt

2009-02-10 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-574:
---

Attachment: run_command.patch

Introduces run and exec command

> run command for grunt
> -
>
> Key: PIG-574
> URL: https://issues.apache.org/jira/browse/PIG-574
> Project: Pig
>  Issue Type: New Feature
>  Components: grunt
>Reporter: David Ciemiewicz
>Priority: Minor
> Attachments: run_command.patch
>
>
> This is a request for a "run file" command in grunt which will read a script 
> from the local file system and execute the script interactively while in the 
> grunt shell.
> One of the things that slows down iterative development of large, complicated 
> Pig scripts that must operate on hadoop fs data is that the edit, run, debug 
> cycle is slow because I must wait to allocate a Hadoop-on-Demand (hod) 
> cluster for each iteration.  I would prefer not to preallocate a cluster of 
> nodes (though I could).
> Instead, I'd like to have one window open and edit my Pig script using vim or 
> emacs, write it, and then type "run myscript.pig" at the grunt shell until I 
> get things right.
> I'm used to doing similar things with Oracle, MySQL, and R. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.