[jira] Commented: (PIG-729) Use of default parallelism

2009-03-31 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694144#action_12694144
 ] 

Milind Bhandarkar commented on PIG-729:
---

+1 for option 3. Make parallel keyword mandatory on all statements that require 
it.

To elaborate:

Option 1. There can be no default that satisfies the majority.
Option 2. Unless it is an error that terminates execution, messages are usually 
ignored.
Option 3. Making the parallel keyword mandatory increases awareness of its 
relation to the number of reducers and the number of part files.

 Use of default parallelism
 --

 Key: PIG-729
 URL: https://issues.apache.org/jira/browse/PIG-729
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.1
 Environment: Hadoop 0.20
Reporter: Santhosh Srinivasan
 Fix For: 0.2.1


 Currently, if the user does not specify the number of reduce slots using the 
 parallel keyword, Pig lets Hadoop decide on the default number of reducers. 
 This model worked well with dynamically allocated clusters using HOD and for 
 static clusters where the default number of reduce slots was explicitly set. 
 With Hadoop 0.20, a single static cluster will be shared amongst a number of 
 queues. As a result, a common scenario is to end up with the default number 
 of reducers set to one (1).
 When users migrate to Hadoop 0.20, they might see a dramatic change in the 
 performance of their queries if they had not used the parallel keyword to 
 specify the number of reducers. In order to mitigate such circumstances, Pig 
 can support one of the following:
 1. Specify a default parallelism for the entire script.
 This option will allow users to use the same parallelism for all operators 
 that do not have the explicit parallel keyword. This will ensure that the 
 scripts utilize more reducers than the default of one reducer. On the down 
 side, due to data transformations, operations that are performed towards the 
 end of the script will usually need a smaller number of reducers than the 
 operators that appear at the beginning of the script.
 2. Display a warning message for each reduce-side operator that does not use 
 the explicit parallel keyword. Proceed with the execution.
 3. Display an error message indicating the operator that does not have the 
 explicit use of the parallel keyword. Stop the execution.
 Other suggestions/thoughts/solutions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception

2009-05-22 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-656:
--

Attachment: reserved.patch

This patch allows the use of reserved words in function names. To avoid parsing 
ambiguity, the first part of the fully qualified function name (i.e. the text 
before the first .) cannot be a reserved word. But the remaining parts of a 
fully qualified function name can be any identifier, including a reserved word.

So, for example, with this patch, the statement:

{code}
define X com.yahoo.load();
{code}

or

{code}
modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
{code}

now compiles and runs perfectly well.
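
The naming rule described above can be illustrated with a small Java sketch. It does not reproduce reserved.patch; the class name QualifiedNameSketch, the method isAllowed, the short RESERVED set, and the case-insensitive comparison are all assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of the rule; not Pig's parser code.
final class QualifiedNameSketch {
    // Small illustrative subset of reserved words (assumed, not the real list).
    private static final Set<String> RESERVED = new HashSet<>(
            Arrays.asList("load", "store", "eval", "define", "foreach", "generate"));

    /** Returns true when only the part before the first '.' is checked and is not reserved. */
    static boolean isAllowed(String qualifiedName) {
        int dot = qualifiedName.indexOf('.');
        String first = dot < 0 ? qualifiedName : qualifiedName.substring(0, dot);
        // Later parts are deliberately not checked: they may be reserved words.
        return !first.isEmpty() && !RESERVED.contains(first.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("com.yahoo.load"));          // reserved word in last part
        System.out.println(isAllowed("mypackage.eval.TOKENIZE")); // reserved word in the middle
        System.out.println(isAllowed("eval.mypackage.TOKENIZE")); // reserved word first
    }
}
```

Under these assumptions the first two names are accepted and the third is rejected, because only the leading component is compared against the reserved set.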

 Use of eval or any other keyword in the package hierarchy of a UDF causes 
 parse exception
 -

 Key: PIG-656
 URL: https://issues.apache.org/jira/browse/PIG-656
 Project: Pig
  Issue Type: Bug
  Components: documentation, grunt
Affects Versions: 0.2.1
Reporter: Viraj Bhat
Assignee: Milind Bhandarkar
 Fix For: 0.3.0

 Attachments: mywordcount.txt, reserved.patch, TOKENIZE.jar


 Consider a Pig script which does something similar to a word count. It uses 
 the built-in TOKENIZE function, but packages it inside a class hierarchy such 
 as mypackage.eval
 {code}
 register TOKENIZE.jar
 my_src  = LOAD '/user/viraj/mywordcount.txt' USING PigStorage('\t')  AS 
 (mlist: chararray);
 modules = FOREACH my_src GENERATE FLATTEN(mypackage.eval.TOKENIZE(mlist));
 describe modules;
 grouped = GROUP modules BY $0;
 describe grouped;
 counts  = FOREACH grouped GENERATE COUNT(modules), group;
 ordered = ORDER counts BY $0;
 dump ordered;
 {code}
 The parser complains:
 ===
 2009-02-05 01:17:29,231 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1000: Error during parsing. Invalid alias: mypackage in {mlist: chararray}
 ===
 I looked at the parser source 
 (src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt) and it seems 
 that EVAL is a keyword in Pig. I have two questions:
 1) Is there documentation on what the EVAL keyword actually is?
 2) Is the EVAL keyword actually implemented?
 Viraj




[jira] Updated: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception

2009-05-23 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-656:
--

Status: Open  (was: Patch Available)

modifying patch to include test case.




[jira] Updated: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception

2009-05-23 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-656:
--

Attachment: (was: reserved.patch)




[jira] Updated: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception

2009-05-23 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-656:
--

Attachment: reserved.patch

Uploading a modified patch that now includes a test case. The findbugs warning 
is not new to this patch.




[jira] Updated: (PIG-656) Use of eval or any other keyword in the package hierarchy of a UDF causes parse exception

2009-05-23 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-656:
--

Status: Patch Available  (was: Open)




[jira] Created: (PIG-819) run -param -param; is a valid grunt command

2009-05-26 Thread Milind Bhandarkar (JIRA)
run -param -param; is a valid grunt command
---

 Key: PIG-819
 URL: https://issues.apache.org/jira/browse/PIG-819
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.3.0
 Environment: all
Reporter: Milind Bhandarkar
Assignee: Milind Bhandarkar


By mistake, I typed 

{code}
run -param -param;
{code}

in grunt, and was surprised to find that it is a valid grunt command.




[jira] Updated: (PIG-819) run -param -param; is a valid grunt command

2009-05-26 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-819:
--

Status: Patch Available  (was: Open)




[jira] Updated: (PIG-819) run -param -param; is a valid grunt command

2009-05-26 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-819:
--

Attachment: invalidparam.patch

With this patch, dash-prefixed options to pig commands that take a value must 
actually be followed by one; commands such as run -param -param are no longer 
accepted.
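
The check described above can be sketched in plain Java. This is only an illustration of the rule, not invalidparam.patch: the class OptionArgCheckSketch, the method isValid, and the assumed set of value-taking options (-param, -param_file) are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; the real validation belongs to the command parser.
final class OptionArgCheckSketch {
    // Options assumed (for this sketch) to require a value after them.
    private static final Set<String> NEEDS_VALUE =
            new HashSet<>(Arrays.asList("-param", "-param_file"));

    /** Returns false if a value-taking option is last or is followed by another option. */
    static boolean isValid(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            if (NEEDS_VALUE.contains(tokens.get(i))) {
                if (i + 1 >= tokens.size() || tokens.get(i + 1).startsWith("-")) {
                    return false; // e.g. "run -param -param"
                }
                i++; // skip the value that was just consumed
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid(Arrays.asList("run", "-param", "-param")));
        System.out.println(isValid(Arrays.asList("run", "-param", "x=1", "script.pig")));
    }
}
```

With these assumptions, the mistyped command from the issue is rejected while a command that supplies a value after -param is accepted.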




[jira] Updated: (PIG-819) run -param -param; is a valid grunt command

2009-05-26 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-819:
--

Attachment: (was: invalidparam.patch)




[jira] Updated: (PIG-819) run -param -param; is a valid grunt command

2009-05-26 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-819:
--

Status: Open  (was: Patch Available)

cancelling patch, so as to fix the testcase.




[jira] Resolved: (PIG-779) Warning from javacc

2009-05-26 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar resolved PIG-779.
---

Resolution: Duplicate

Fix for PIG-819 fixes this.

 Warning from javacc
 ---

 Key: PIG-779
 URL: https://issues.apache.org/jira/browse/PIG-779
 Project: Pig
  Issue Type: Improvement
Reporter: Gunther Hagleitner

 This warning needs fixing:
  Reading from file 
 .../src/org/apache/pig/tools/pigscript/parser/PigScriptParser.jj . . .
[javacc] Warning: Choice conflict in (...)* construct at line 560, column 
 9.
[javacc]  Expansion nested within construct and expansion 
 following construct
[javacc]  have common prefixes, one of which is: -param
[javacc]  Consider using a lookahead of 2 or more for nested 
 expansion.




[jira] Updated: (PIG-819) run -param -param; is a valid grunt command

2009-05-26 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-819:
--

Attachment: invalidparam.patch

Attaching fixed patch with testcase.




[jira] Updated: (PIG-819) run -param -param; is a valid grunt command

2009-05-26 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-819:
--

Status: Patch Available  (was: Open)




[jira] Updated: (PIG-819) run -param -param; is a valid grunt command

2009-05-26 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-819:
--

Attachment: (was: invalidparam.patch)




[jira] Updated: (PIG-819) run -param -param; is a valid grunt command

2009-05-26 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-819:
--

Status: Open  (was: Patch Available)

somehow uploaded the old patch again :-(




[jira] Updated: (PIG-819) run -param -param; is a valid grunt command

2009-05-26 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-819:
--

Status: Patch Available  (was: Open)

yes, it is the right patch :-)




[jira] Updated: (PIG-819) run -param -param; is a valid grunt command

2009-05-26 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-819:
--

Attachment: invalidparam.patch

Hoping to upload the right patch this time :-)




[jira] Commented: (PIG-796) support conversion from numeric types to chararray

2009-05-28 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714233#action_12714233
 ] 

Milind Bhandarkar commented on PIG-796:
---

Can't the user simply do:

{code}
foreach input generate (chararray)((int)mymap#'key') as myvalue;
{code}

Minimizing implicit casting is a good thing (tm) anyway.

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich






[jira] Commented: (PIG-796) support conversion from numeric types to chararray

2009-05-28 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714278#action_12714278
 ] 

Milind Bhandarkar commented on PIG-796:
---

So, can we live with the classcastexception generated by the front end ? I 
recall reading somewhere that pigs do what they are told. If they are told to 
do things that are impossible even for humans to comprehend, i.e. somehow 
interpret a byte array to be an integer, and then to convert it to a string, 
how would they cope ?

IMHO, eliminating such implicit casts would reduce the complexity of pig, and 
would fit in the pig philosophy. But that means being able to convert 
everything to a chararray at most. If someone requests a chararray cast of a 
bytearray, give them a hex representation, and have them write a UDF to convert 
hex string to string (e.g. toInt('0x'+myvalue) in the above code.)

thoughts ?
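
The hex idea above can be sketched in Java as follows. This is only an illustration under stated assumptions: HexCastSketch, toHex and toInt are hypothetical names, not existing Pig functions, and the sketch assumes at most four bytes when converting back to an int.

```java
// Hypothetical illustration of "bytearray -> hex chararray -> user conversion".
final class HexCastSketch {
    /** Renders each byte as two lowercase hex digits. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    /** What a user-written conversion might do: parse up to 8 hex digits as an int. */
    static int toInt(String hex) {
        return Integer.parseUnsignedInt(hex, 16);
    }

    public static void main(String[] args) {
        String hex = toHex(new byte[] {0x00, 0x00, 0x01, 0x2a});
        System.out.println(hex);        // 0000012a
        System.out.println(toInt(hex)); // 298
    }
}
```

The design point is that the cast itself never guesses at a numeric interpretation; the user's own function decides how the hex text is read.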




[jira] Commented: (PIG-796) support conversion from numeric types to chararray

2009-05-28 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714279#action_12714279
 ] 

Milind Bhandarkar commented on PIG-796:
---

Modifying my earlier comment:

 So, can we live with the classcastexception generated by the front end ?

I meant the back end of course.




[jira] Created: (PIG-852) pig -version or pig -help returns exit code of 1

2009-06-15 Thread Milind Bhandarkar (JIRA)
pig -version or pig -help returns exit code of 1


 Key: PIG-852
 URL: https://issues.apache.org/jira/browse/PIG-852
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.3.0
 Environment: All
Reporter: Milind Bhandarkar
Assignee: Milind Bhandarkar


{code}
java -jar pig.jar -x local [-version|-help]
{code}

returns an exit code of 1 to the shell.




[jira] Updated: (PIG-852) pig -version or pig -help returns exit code of 1

2009-06-15 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-852:
--

Attachment: rc.patch




[jira] Updated: (PIG-852) pig -version or pig -help returns exit code of 1

2009-06-15 Thread Milind Bhandarkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milind Bhandarkar updated PIG-852:
--

Status: Patch Available  (was: Open)

Making patch available. Manual testing done.




[jira] Commented: (PIG-852) pig -version or pig -help returns exit code of 1

2009-06-16 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719942#action_12719942
 ] 

Milind Bhandarkar commented on PIG-852:
---

Note that JUnit tests that test the return code of a completely different JVM 
are kludgy at best. Therefore writing a test case for checking the 
System.exit() return value is insane. Hadoop folks have *mostly* fixed this 
issue by having a static public run() method that can be invoked directly in 
the tests, and can have its return value (which is what main() uses as an exit 
code) checked.

So, committers, please ignore the no-tests warning.
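
The run()/main() split mentioned above can be sketched as follows. This is a minimal illustration, not Pig's or Hadoop's actual code: the class ExitCodeSketch and its handling of unrecognized options are assumptions made for this example.

```java
// Hypothetical illustration of keeping System.exit() out of the testable logic.
final class ExitCodeSketch {
    /** Does the work and returns the exit code instead of terminating the JVM. */
    public static int run(String[] args) {
        for (String arg : args) {
            if (arg.equals("-version") || arg.equals("-help")) {
                return 0; // informational options succeed
            }
            if (arg.startsWith("-")) {
                return 1; // assumed for this sketch: any other option is an error
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // main() only forwards run()'s value, so a test can call run() in-process.
        System.exit(run(args));
    }
}
```

A unit test can then assert on `ExitCodeSketch.run(new String[] {"-help"})` directly, without launching a second JVM to observe its exit status.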




[jira] Commented: (PIG-832) Make import list configurable

2009-06-18 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721361#action_12721361
 ] 

Milind Bhandarkar commented on PIG-832:
---

If we include the piggybank functions in the default import list, we need to 
make sure that they are compiled and tested in the default build, and that the 
releases will be blocked due to them not compiling etc. Is that the intention ?

 Make import list configurable
 -

 Key: PIG-832
 URL: https://issues.apache.org/jira/browse/PIG-832
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.3.0


 Currently, it is hardwired in PigContext.




[jira] Commented: (PIG-832) Make import list configurable

2009-06-18 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721396#action_12721396
 ] 

Milind Bhandarkar commented on PIG-832:
---

Olga, what I am saying is to have a default import list: which contains default 
UDFs (tokenize, Max, Min, flatten), followed by piggybank contribs. And the 
same list can be added to / overridden on the command-line. This has several 
advantages. Pig built-ins do not have to be reserved words, and can be 
overridden. For example, recent mails on pig-users have mentioned that 
tokenize+flatten should be a single udf. This can be done by providing a 
flatten (which is null), and tokenize, which does tokenize+flatten, and 
existing scripts will still work. This simplifies pig grammar as well. Users 
can create udf libraries, and use them with:

{code}
java -Dimport.list += `cat my-udf-lib.import`
{code}

Thoughts ?




[jira] Commented: (PIG-832) Make import list configurable

2009-06-18 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721413#action_12721413
 ] 

Milind Bhandarkar commented on PIG-832:
---

Instead of a list, if you make it a map (i.e. short name -> fully qualified 
class name), it will be much easier, as it will guarantee that each name has 
exactly one UDF class associated with it. It will also allow users to use UDFs 
whose class names are Pig reserved words. For example, if I have an 
existing UDF with a class name such as load or store, I can still use it under 
a different name like myload, without having to rename the class.

So, I suggest:

{code}
java -jar pig.jar 
-Dimport.list+=MyLoad:com..Load,Flatten:com..Flatten,... 
{code}

If I do not specify -Dimport.list on the pig command line, then the default 
import.list is used.
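
To make the map idea concrete, here is a minimal Java sketch of how such a 
short-name-to-class map might be parsed and merged with defaults. It is only 
an illustration: the class ImportMapSketch, its method names, and the 
org.example / com.example package names are hypothetical, and nothing here 
reflects how PigContext actually resolves functions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ImportMapSketch {
    // Parses "Short:fully.qualified.Class,Other:pkg.Other" into an ordered map.
    static Map<String, String> parse(String spec) {
        Map<String, String> map = new LinkedHashMap<>();
        if (spec == null || spec.trim().isEmpty()) {
            return map;
        }
        for (String entry : spec.split(",")) {
            String[] kv = entry.trim().split(":", 2);
            if (kv.length != 2 || kv[0].isEmpty() || kv[1].isEmpty()) {
                throw new IllegalArgumentException("Bad import entry: " + entry);
            }
            // A later entry for the same short name replaces the earlier one,
            // so each name maps to exactly one class.
            map.put(kv[0], kv[1]);
        }
        return map;
    }

    // Starts from the defaults, then lets user entries add to or override them.
    static Map<String, String> effective(String defaults, String userSpec) {
        Map<String, String> result = parse(defaults);
        result.putAll(parse(userSpec));
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical default and user-supplied lists.
        String defaults = "Flatten:org.example.builtin.Flatten,"
                + "Tokenize:org.example.builtin.Tokenize";
        String user = "MyLoad:com.example.udf.Load,Flatten:com.example.udf.Flatten";
        effective(defaults, user)
                .forEach((name, cls) -> System.out.println(name + " -> " + cls));
    }
}
```

With a map, an override is just a put on an existing key, so a user-supplied 
Flatten replaces the default one without touching any script.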

Thoughts ?




[jira] Created: (PIG-857) Pig should implement Tool interface from Hadoop

2009-06-18 Thread Milind Bhandarkar (JIRA)
Pig should implement Tool interface from Hadoop
---

 Key: PIG-857
 URL: https://issues.apache.org/jira/browse/PIG-857
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
 Environment: All
Reporter: Milind Bhandarkar


Hadoop, Hadoop Streaming, and Hadoop Pipes all use the Tool interface, which 
provides support for parsing generic options. This has resulted in consistent 
options for all three Hadoop launch mechanisms. Pig should also implement Tool 
(or use GenericOptionsParser directly).
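
As a rough illustration of what "parsing generic options" means, the sketch 
below separates -D key=value settings from the remaining arguments in plain 
Java. It is a simplified stand-in written for this note, not Hadoop's 
GenericOptionsParser; the class name GenericOptionsSketch is hypothetical, and 
the real parser handles more options than -D.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class GenericOptionsSketch {
    final Properties conf = new Properties();          // collected -D settings
    final List<String> remaining = new ArrayList<>();  // tool-specific arguments

    GenericOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String keyValue = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                keyValue = args[++i];          // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                keyValue = arg.substring(2);   // form: -Dkey=value
            }
            if (keyValue == null) {
                remaining.add(arg);            // left for the tool itself
                continue;
            }
            int eq = keyValue.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value, got: " + keyValue);
            }
            conf.setProperty(keyValue.substring(0, eq), keyValue.substring(eq + 1));
        }
    }

    public static void main(String[] args) {
        GenericOptionsSketch parsed = new GenericOptionsSketch(
                new String[] {"-D", "mapred.reduce.tasks=10", "script.pig"});
        System.out.println(parsed.conf + " " + parsed.remaining);
    }
}
```

The point of sharing one parser is that every launch mechanism accepts the 
same settings in the same form, and the tool only sees what is left over.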




[jira] Commented: (PIG-832) Make import list configurable

2009-06-18 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721490#action_12721490
 ] 

Milind Bhandarkar commented on PIG-832:
---

Daniel: For that to work, the user's class will have to be called PigStorage. 
Also, inserting the user's jars before the pig jar for looking up methods can 
have major unintended consequences. pig.jar should always be first in the 
classpath.

Olga: My use case cannot use parameter substitution, because the PigMix scripts 
do not specify PigStorage as, say, $storage. The solution I proposed is as 
simple to implement as Daniel's original proposal (+= is syntactic sugar; even 
= can be used with the same effect), it fixes a specific ask, and it also 
allows for extensibility. Am I missing something here?




[jira] Commented: (PIG-832) Make import list configurable

2009-06-18 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721514#action_12721514
 ] 

Milind Bhandarkar commented on PIG-832:
---

Olga, specifying a list of packages as a path list will have the same issues as

{code}
import com.xyz.package.*;
{code}

in Java, where it is considered bad practice. So, in the solution that 
I have proposed, I am assuming that the class name is specified on the command 
line, not the package name.
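
The ambiguity can be reproduced with plain JDK classes. The sketch below is a 
hypothetical package-list resolver (the class PackageListResolution is made up 
for this note and is not Pig code): it tries each package in order and returns 
the first class it finds. Because both java.net and java.lang.reflect contain 
a class named Proxy, the result depends entirely on the order of the list.

```java
import java.util.Arrays;
import java.util.List;

class PackageListResolution {
    // Returns the first class found by prefixing each package to the short name.
    static Class<?> resolve(String shortName, List<String> packages) {
        for (String pkg : packages) {
            try {
                // initialize=false: only locate the class, do not run static initializers.
                return Class.forName(pkg + "." + shortName, false,
                        PackageListResolution.class.getClassLoader());
            } catch (ClassNotFoundException e) {
                // Not in this package; try the next one.
            }
        }
        throw new IllegalArgumentException("Cannot resolve " + shortName);
    }

    public static void main(String[] args) {
        // Same short name, two different classes, depending only on list order.
        System.out.println(
                resolve("Proxy", Arrays.asList("java.net", "java.lang.reflect")).getName());
        System.out.println(
                resolve("Proxy", Arrays.asList("java.lang.reflect", "java.net")).getName());
    }
}
```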





[jira] Commented: (PIG-832) Make import list configurable

2009-06-18 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721523#action_12721523
 ] 

Milind Bhandarkar commented on PIG-832:
---

Daniel,

"Hi, Milind, If a user wrote 10 UDFs, I guess he/she does not suppose to put 
10 entries in the command line, right?"

No, that's why I allow a `cat myudflist` on the command line.






[jira] Commented: (PIG-832) Make import list configurable

2009-06-18 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721550#action_12721550
 ] 

Milind Bhandarkar commented on PIG-832:
---

Daniel,

Pig streaming already uses backquotes for executing external programs, so 
users are familiar with this syntax. I believe an ordinary Pig user already 
knows about doing such things in Unix shells. But anyway, as Olga said, she is 
looking for requirements, not solutions, so here is a requirement:

I have two jars: xyz.jar and abc.jar. I am using two UDFs in my scripts. I 
want to use function1 from xyz.jar and function2 from abc.jar. How do I use 
function2 from abc.jar with full confidence that xyz.jar does not contain a UDF 
named function2? How do you propose I do that without modifying a whole bunch 
of Pig scripts that I am testing for my functions?

In the solution that I proposed, I can just change the function2 mapping by 
including -Dimport.list=function2:com.yahoo.milind.function2 on the 
command line.




[jira] Commented: (PIG-832) Make import list configurable

2009-06-18 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721562#action_12721562
 ] 

Milind Bhandarkar commented on PIG-832:
---

Olga,

As long as the suggested improvements do not result in redundancy or make the 
original solutions obsolete, it's fine. But I believe that the core issue, 
namely how Pig resolves UDFs, is not addressed properly by the small 
change to the current implementation.




[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas

2009-06-18 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721574#action_12721574
 ] 

Milind Bhandarkar commented on PIG-856:
---

+1 on seeing performance differences. But is there code in Pig to determine 
that the output of a previous map-reduce stage is inaccessible because of 
datanode failures (as opposed to some other reason), and to repeat that 
map-reduce stage? A single datanode failure with replication 1 will make 
temporary data unavailable, and such a failure is very likely for long-running 
queries.

 PERFORMANCE: reduce number of replicas
 --

 Key: PIG-856
 URL: https://issues.apache.org/jira/browse/PIG-856
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.0
Reporter: Olga Natkovich

 Currently Pig uses the default number of replicas between MR jobs, which is 
 3. Given the temp nature of the data, we should never need more 
 than 2 and should explicitly set it to improve performance and to be nicer 
 to the name node.




[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas

2009-06-18 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721581#action_12721581
 ] 

Milind Bhandarkar commented on PIG-856:
---

+1. I will file a separate Jira (if replication of 1 is decided upon) so that 
Pig retries a map-reduce stage if it fails for *external* reasons.
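
A minimal sketch of what such a retry might look like is given below. It 
assumes that external failures (for example, temporary data lost to a datanode 
failure) can be told apart from other errors and are signalled with a 
dedicated exception type. Both ExternalFailureException and runWithRetry are 
hypothetical names invented for this note; Pig has no such API as far as this 
thread shows.

```java
import java.util.function.Supplier;

class StageRetrySketch {
    // Hypothetical marker for failures caused by the environment rather than the query.
    static class ExternalFailureException extends RuntimeException {
        ExternalFailureException(String message) {
            super(message);
        }
    }

    // Re-runs the stage only when it fails for an external reason.
    // Any other exception propagates immediately without a retry.
    static <T> T runWithRetry(Supplier<T> stage, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ExternalFailureException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return stage.get();
            } catch (ExternalFailureException e) {
                last = e;  // remember the failure and try the stage again
            }
        }
        throw last;  // every attempt failed for external reasons
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = runWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new ExternalFailureException("temporary output unavailable");
            }
            return "stage output";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The hard part, as noted earlier in this thread, is the classification itself: 
the retry loop is only safe if the failure really was external.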




[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas

2009-06-18 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721594#action_12721594
 ] 

Milind Bhandarkar commented on PIG-856:
---

+1 to both Alan and Olga. Default should still be hadoop's default 
dfs.replication.
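
One way to read this: a separate knob for temporary data that falls back to 
dfs.replication when unset. The Java sketch below shows that lookup order 
using java.util.Properties. The key pig.temp.replication is a made-up name for 
illustration only, and the final fallback of 3 is taken from the issue 
description rather than read from a cluster.

```java
import java.util.Properties;

class TempReplicationKnob {
    static final String TEMP_KEY = "pig.temp.replication";  // hypothetical knob name
    static final String DFS_KEY = "dfs.replication";        // Hadoop's replication setting

    // Uses the temp-data knob if set, otherwise the value of dfs.replication,
    // otherwise 3 (the default mentioned in the issue description).
    static int tempReplication(Properties conf) {
        String dfsDefault = conf.getProperty(DFS_KEY, "3");
        return Integer.parseInt(conf.getProperty(TEMP_KEY, dfsDefault).trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("nothing set: " + tempReplication(conf));
        conf.setProperty(TEMP_KEY, "2");
        System.out.println("knob set:    " + tempReplication(conf));
    }
}
```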




[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas

2009-06-18 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12721599#action_12721599
 ] 

Milind Bhandarkar commented on PIG-856:
---

+1 to Santhosh on documenting knobs. Better to add and document knobs rather 
than modify the language like this:

{code}
%TempReplicate 2
store A into PigStorage('\t') with replication 2;
{code}




[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-07-06 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12727660#action_12727660
 ] 

Milind Bhandarkar commented on PIG-872:
---

A couple of things:

First, as Pradeep says, only the hadoop job that performs the FR join needs to 
add the replicated dataset to the distributed cache.

Second, make sure that the replicated dataset has high replication, such as 10 
(or the same replication as job.jar). For an already materialized dataset, Pig 
need not do anything except warn if the replication factor is small (e.g. 3). 
But if the replicated dataset is being produced as an intermediate output by 
Pig, it needs to be generated with a high replication factor.

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich

 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using the distributed cache will address the issue and might also give a 
 performance boost, since the file will be copied locally once and then reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend; on the backend, the tasks would need to refer to 
 the data using the path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)




[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader

2009-07-10 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12729859#action_12729859
 ] 

Milind Bhandarkar commented on PIG-879:
---

I see some long term issues with all the approaches/options.

First, not all loaders require a path (e.g. DBLoader). Some paths (e.g. hftp:// 
or hsftp://) do not have a notion of relative or absolute. Indeed, the right 
way to fix this is to change the syntax of load and store statements, so that 
the loader itself deals with the path handling, and not Pig. Second, take out 
copyToLocal, cp, mv, and all the dfs shell functionality from Pig. These are 
side effects and impose a barrier to optimization. In their current form, they 
do not belong in a dataflow language. Grunt could still support them.

 Pig should provide a way for input location string in load statement to be 
 passed as-is to the Loader
 -

 Key: PIG-879
 URL: https://issues.apache.org/jira/browse/PIG-879
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath

  Due to multiquery optimization, Pig always converts the filenames to 
 absolute URIs (see 
 http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section 
 about Incompatible Changes - Path Names and Schemes). This is necessary since 
 the script may have cd .. statements between load or store statements and 
 if the load statements have relative paths, we would need to convert to 
 absolute paths to know where to load/store from. To do this 
 QueryParser.massageFilename() has the code below[1] which basically gives the 
 fully qualified hdfs path
  
 However the issue with this approach is that if the filename string is 
 something like 
 hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2,
  the code below[1] actually translates this to 
 hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2
  and throws an exception that it is an incorrect path.
  
 Some loaders may want to interpret the filenames (the input location string 
 in the load statement) in any way they wish and may want Pig to not make 
 absolute paths out of them.
  
 There are a few options to address this:
 1) A command line switch to indicate to Pig that pathnames in the script 
 are all absolute and hence Pig should not alter them and should pass them 
 as-is to Loaders and Storers.
 2) A keyword in the load and store statements to indicate the same intent 
 to Pig.
 3) A property which users can supply on the command line or in 
 pig.properties to indicate the same intent.
 4) A method in LoadFunc, relativeToAbsolutePath(String filename, String 
 curDir), which does the conversion to absolute paths; this way a Loader can 
 choose to implement it as a no-op.
 Thoughts?
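
 Option 4 could be sketched roughly as follows. This is only an illustration 
 in present-day Java: the interface PathAwareLoader stands in for LoadFunc, 
 the two loader classes are hypothetical, and the default conversion shown 
 here is a simplification, not the logic in QueryParser.massageFilename().

```java
class LoaderPathSketch {
    // Stand-in for LoadFunc carrying the proposed hook.
    interface PathAwareLoader {
        // Default behaviour: resolve relative names against curDir and leave
        // absolute paths and anything with a URI scheme untouched.
        default String relativeToAbsolutePath(String filename, String curDir) {
            if (filename.startsWith("/") || filename.contains("://")) {
                return filename;
            }
            return curDir.endsWith("/") ? curDir + filename : curDir + "/" + filename;
        }
    }

    // A loader that is happy with the default conversion.
    static class DefaultLoader implements PathAwareLoader {
    }

    // A loader that interprets the location string itself, so the hook is a no-op.
    static class PassThroughLoader implements PathAwareLoader {
        @Override
        public String relativeToAbsolutePath(String filename, String curDir) {
            return filename;
        }
    }

    public static void main(String[] args) {
        System.out.println(new DefaultLoader().relativeToAbsolutePath("data/in", "/user/bla"));
        System.out.println(new PassThroughLoader().relativeToAbsolutePath("data/in", "/user/bla"));
    }
}
```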
  
