[jira] Created: (PIG-1400) add option for null field JOIN semantics

2010-04-30 Thread David Ciemiewicz (JIRA)
add option for null field JOIN semantics


 Key: PIG-1400
 URL: https://issues.apache.org/jira/browse/PIG-1400
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: David Ciemiewicz


Currently JOIN supports SQL semantics for joining null values in fields - they 
aren't matched.

However, GROUP ... and COGROUP ... semantics DO match on null values in fields.

This violated the principle of least astonishment for me - I expected JOIN on 
null value fields to work.

As a work around, I must now go through ALL of my code to convert chararray 
null values to empty strings to get the JOIN to work appropriately.

{code}
A = foreach A generate
((a is not null) ? a : '') as a,
((b is not null) ? b : '') as b,
...
{code}

This is not really a satisfactory workaround.


My preference is that JOIN support an option (a la FULL, LEFT, RIGHT, OUTER) 
that directs JOIN to use null-matching join semantics just as COGROUP does.

Something like:

{code}
AB = JOIN A by ( key, subkey ) FULL OUTER MATCHNULLS, B by ( key, subkey );
{code}

I don't know whether it should be called JOIN_NULLS, MATCHNULLS, NULLS, 
NULLSEMANTICS, or what have you.

I just think it would be much cleaner for the end user to be able to get these 
semantics.

We might also consider being explicit about the SQL null semantics by adding 
the option SQLNULLS or NONULLMATCH.
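
For illustration only, hedged sketches of both spellings (all option names here 
are placeholders, as above):

{code}
-- hypothetical syntax: nulls match, as in COGROUP
AB = JOIN A by ( key, subkey ) MATCHNULLS, B by ( key, subkey );

-- hypothetical syntax: explicit SQL semantics, nulls never match (today's default)
AB = JOIN A by ( key, subkey ) SQLNULLS, B by ( key, subkey );
{code}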

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-42) Pig should be able to split Gzip files like it can split Bzip files

2010-04-06 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854208#action_12854208
 ] 

David Ciemiewicz commented on PIG-42:
-

Hadoop Archives are not really the solution here.  I want my code to work with 
exactly the same file name references whether I have 100 gzip compressed (or 
bzip2 compressed) part files or a single concatenation of the individually 
compressed part files.

I have to change all my filename references to use a har.

What we really want are simple concatenations of gzip files and bzip2 files 
that work with map reduce.
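
As a hedged illustration of the goal (alias and paths hypothetical): the load 
statement should stay the same whether the input is a directory of individually 
compressed part files or a single concatenation of them.

{code}
-- works today: a directory of part files, e.g. events/part-00000.gz ... part-00099.gz
A = load 'events' using PigStorage();

-- desired: one file produced by concatenating the compressed parts, still splittable
A = load 'events.gz' using PigStorage();
{code}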





 Pig should be able to split Gzip files like it can split Bzip files
 ---

 Key: PIG-42
 URL: https://issues.apache.org/jira/browse/PIG-42
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Benjamin Reed
Assignee: Benjamin Reed
 Attachments: gzip.patch


 It would be nice to be able to split gzip files like we can split bzip files. 
 Unfortunately, we don't have a sync point for the split in the gzip format.
 Gzip file format supports the notion of concatenated gzipped files. When 
 gzipped files are concatenated together they are treated as a single file. So 
 to make a gzipped file splittable we can use an empty compressed file with 
 some salt in the headers as a sync signature. Then we can make the gzip file 
 splittable by using this sync signature between compressed segments of the 
 file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-282) Custom Partitioner

2010-03-24 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849279#action_12849279
 ] 

David Ciemiewicz commented on PIG-282:
--

How will the custom partitioner be used in Pig?

Is this for map partitioning and/or output partitioning?

For instance, I'd love to have something that created separate directories 
based on the value of some key.
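
As a purely hypothetical sketch of how such syntax might look (the keyword 
placement and partitioner class name are placeholders, not a committed design):

{code}
-- the partitioner returns a partition number between 0 and n-1
B = group A by country PARTITION BY com.example.CountryPartitioner PARALLEL 10;
{code}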

 Custom Partitioner
 --

 Key: PIG-282
 URL: https://issues.apache.org/jira/browse/PIG-282
 Project: Pig
  Issue Type: New Feature
Reporter: Amir Youssefi
Priority: Minor

 By adding a custom partitioner we can give control over which output partition 
 a key (/value) goes to. We can add keywords to the language, e.g. 
 PARTITION BY UDF(...)
 or a similar syntax. The UDF returns a number between 0 and n-1, where n is the 
 number of output partitions.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (PIG-1182) Pig reference manual does not mention syntax for comments

2010-02-19 Thread David Ciemiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ciemiewicz reopened PIG-1182:
---


Corinne, I'm not sure why you are so resistant to following the basic principle 
of documenting ALL syntax, including comments, in the reference manual. If the 
document is open to the community to edit, I'm more than willing to do the work 
myself, since I have contributed as a technical writer for programming language 
reference manuals in the past, as well as having been a developer of compilers 
and software development tools.

Also, I think the passage you cited could use a little work on the English: 

Using Comments in Scripts
If you place Pig Latin statements in a script, the script can include comments.

For multi-line comments use /*  */
For single line comments use --

{code}
/* myscript.pig
My script includes three simple Pig Latin Statements.
*/

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float); -- load statement
B = FOREACH A GENERATE name;  -- foreach statement
DUMP B;  --dump statement
{code}


 Pig reference manual does not mention syntax for comments
 -

 Key: PIG-1182
 URL: https://issues.apache.org/jira/browse/PIG-1182
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.5.0
Reporter: David Ciemiewicz
Assignee: Corinne Chandel
 Fix For: 0.7.0


 The Pig 0.5.0 reference manual does not mention how to write comments in your 
 pig code using -- (two dashes).
 http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html
 Also, does /* */ work?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-752) local mode doesn't read bzip2 and gzip compressed data files

2010-01-22 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803795#action_12803795
 ] 

David Ciemiewicz commented on PIG-752:
--

Jeff,

What do you mean when you say local mode has been removed?

Does this mean that the option -exectype local has been removed?
Or does this mean that the local mode execution code has been replaced, or will 
be replaced, by an M/R execution engine that operates on the user's local 
computer without the need for an HDFS grid?

If the former (no local execution), this is nuts.
If the latter (M/R execution for local mode), and it will supply the means of 
reading and writing bzip compression, then this isn't a WON'T FIX - it's FIXED 
by the change in execution engine?

So which is it?

 local mode doesn't read bzip2 and gzip compressed data files
 

 Key: PIG-752
 URL: https://issues.apache.org/jira/browse/PIG-752
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: David Ciemiewicz
Assignee: Jeff Zhang
 Attachments: Pig_752.Patch


 Problem 1)  use of .bz2 file extension does not store results bzip2 
 compressed in Local mode (-exectype local)
 If I use the .bz2 filename extension in a STORE statement on HDFS, the 
 results are stored with bzip2 compression.
 If I use the .bz2 filename extension in a STORE statement on local file 
 system, the results are NOT stored with bzip2 compression.
 compact.bz2.pig:
 {code}
 A = load 'events.test' using PigStorage();
 store A into 'events.test.bz2' using PigStorage();
 C = load 'events.test.bz2' using PigStorage();
 C = limit C 10;
 dump C;
 {code}
 {code}
 -bash-3.00$ pig -exectype local compact.bz2.pig
 -bash-3.00$ file events.test
 events.test: ASCII English text, with very long lines
 -bash-3.00$ file events.test.bz2
 events.test.bz2: ASCII English text, with very long lines
 -bash-3.00$ cat events.test | bzip2 > events.test.bz2
 -bash-3.00$ file events.test.bz2
 events.test.bz2: bzip2 compressed data, block size = 900k
 {code}
 The output format in local mode is definitely not bzip2, but it should be.
 Problem 2) pig in local mode does not decompress bzip2 compressed files, but 
 should, to be consistent with HDFS
 read.bz2.pig:
 {code}
 A = load 'events.test.bz2' using PigStorage();
 A = limit A 10;
 dump A;
 {code}
 The output should be human readable but is instead garbage, indicating no 
 decompression took place during the load:
 {code}
 -bash-3.00$ pig -exectype local read.bz2.pig
 USING: /grid/0/gs/pig/current
 2009-04-03 18:26:30,455 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
 2009-04-03 18:26:30,456 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
 (BZh91AYsyoz?u?...@{x_?d?|u-??mK???;??4?C??)
 ((R? 6?*mg, 
 ?6?Zj?k,???0?QT?d???hY?#mJ?[j???z?m?t?u?K)??K5+??)?m?E7j?X?8a??
 ??U?p@@MT?$?B?P??N??=???(z}gk...@c$\??i]?g:?J)
 a(R?,?u?v???...@?i@??J??!D?)???A?PP?IY??m?
 (mP(i?4,#F[?I)@?...@??|7^?}U??wwg,?u?$?T???((Q!D?=`*?}hP??_|??=?(??2???m=?xG?(?rC?B?(33??:4?N???t|??T?*??k??NT?x???=?fyv?wf??4z???4t?)
 (?oou?t???Kwl?3?nCM?WS?;l???P?s?x
 a???e)B??9?  ?44
 ((?...@4?)
 (f)
 (?...@+?d?0@?U)
 (Q?SR)
 -bash-3.00$ 
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1182) Pig reference manual does not mention syntax for comments

2010-01-08 Thread David Ciemiewicz (JIRA)
Pig reference manual does not mention syntax for comments
-

 Key: PIG-1182
 URL: https://issues.apache.org/jira/browse/PIG-1182
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.5.0
Reporter: David Ciemiewicz


The Pig 0.5.0 reference manual does not mention how to write comments in your 
pig code using -- (two dashes).
http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html

Also, does /* */ work?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1182) Pig reference manual does not mention syntax for comments

2010-01-08 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798128#action_12798128
 ] 

David Ciemiewicz commented on PIG-1182:
---

Corinne, I made no changes.

I'm pointing out that it is an omission to not have the comment syntax 
documented in the reference manual.

Reference manuals for programming languages SHOULD ALWAYS have information on 
ALL syntax including comment syntax.

Once you are done learning things in the User's Guide, most of the time 
programmers just go back to the Reference Manual for quick lookup of 
information and syntax.

So the documentation on comment syntax should be in BOTH the User's Guide AND 
the Reference Manual.



 Pig reference manual does not mention syntax for comments
 -

 Key: PIG-1182
 URL: https://issues.apache.org/jira/browse/PIG-1182
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.5.0
Reporter: David Ciemiewicz
Assignee: Corinne Chandel
 Fix For: 0.7.0


 The Pig 0.5.0 reference manual does not mention how to write comments in your 
 pig code using -- (two dashes).
 http://hadoop.apache.org/pig/docs/r0.5.0/piglatin_reference.html
 Also, does /* */ work?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1097) Pig do not support group by boolean type

2009-11-19 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780223#action_12780223
 ] 

David Ciemiewicz commented on PIG-1097:
---

I think one could argue that Filter functions are REALLY just EvalBoolean 
functions in disguise - that Filter functions were a way of adding a Boolean 
return type to Pig back when Pig had no types.

Further, I'd argue that now that Pig does have data types, Filter should be 
deprecated and all Filter functions should become EvalBoolean.

In other words, I believe it was an oversight in the types migration not to 
migrate Filter to EvalBoolean.
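
A hedged sketch of what that unification, together with the boolean group-key 
support this issue asks for, would allow (IsValid is a hypothetical 
boolean-returning UDF):

{code}
B = filter A by IsValid($0);          -- usable as a filter predicate
C = foreach A generate IsValid($0);   -- the same function as an eval returning boolean
D = group C by $0;                    -- and its result as a group key
{code}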

 Pig do not support group by boolean type
 

 Key: PIG-1097
 URL: https://issues.apache.org/jira/browse/PIG-1097
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Jeff Zhang
Assignee: Jeff Zhang
Priority: Minor
 Fix For: 0.6.0


 My script is as follows; the TestUDF returns a boolean type.
 {color:blue}
 DEFINE testUDF org.apache.pig.piggybank.util.TestUDF();
 raw = LOAD 'data/input';
 raw = FOREACH raw GENERATE testUDF();
 raw = GROUP raw BY $0;
 DUMP raw;
 {color}
 *The above script will throw exception:*
 Exception in thread "main" 
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias raw
   at org.apache.pig.PigServer.openIterator(PigServer.java:481)
   at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:539)
   at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
   at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
   at org.apache.pig.PigServer.registerScript(PigServer.java:409)
   at PigExample.main(PigExample.java:13)
 Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: 
 Unable to store alias raw
   at org.apache.pig.PigServer.store(PigServer.java:536)
   at org.apache.pig.PigServer.openIterator(PigServer.java:464)
   ... 5 more
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: 
 Unexpected error during execution.
   at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:269)
   at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:780)
   at org.apache.pig.PigServer.store(PigServer.java:528)
   ... 6 more
 Caused by: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException:
  ERROR 2036: Unhandled key type boolean
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.selectComparator(JobControlCompiler.java:856)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:561)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:251)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:128)
   at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:249)
   ... 8 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1034) Pig does not support ORDER ... BY group alias

2009-10-21 Thread David Ciemiewicz (JIRA)
Pig does not support ORDER ... BY group alias
-

 Key: PIG-1034
 URL: https://issues.apache.org/jira/browse/PIG-1034
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz


GROUP ... ALL and GROUP ... BY produce an alias named group.

Pig produces a syntax error if you attempt to ORDER ... BY group.

Ordering by group does seem like a perfectly reasonable thing to do.

The workaround is to create an alias for group using an AS clause.  But I think 
this workaround should be unnecessary.
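
For reference, a minimal sketch of that AS-clause workaround (the alias name 
grp is arbitrary):

{code}
C = foreach B generate
group as grp,
COUNT(A) as count;

D = order C by grp parallel 1;
{code}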

Here's sample code which elicits the syntax error:

{code}
A = load 'one.txt' using PigStorage as (one: int);

B = group A all;

C = foreach B generate
group,
COUNT(A) as count;

D = order C by group parallel 1; -- group is one of the aliases in C, why does this throw a syntax error?

dump D;
{code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-979) Acummulator Interface for UDFs

2009-09-25 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12759813#action_12759813
 ] 

David Ciemiewicz commented on PIG-979:
--

This JIRA doesn't quite get the gist of why I believe the Accumulator interface 
is of interest.  It isn't just about performance and avoiding retreading over 
the same data over and over again.

It is also about providing an interface to support CUMULATIVE_SUM, RANK, and 
other functions of that ilk.

A better code example for justifying this would be:

{code}
A = load 'data' using PigStorage() as ( query: chararray, count: int );
B = order A by count desc parallel 1;
C = foreach B generate
query,
count,
CUMULATIVE_SUM(count) as cumulative_count,
RANK(count) as rank;
{code}

These functions, RANK and CUMULATIVE_SUM, would have persistent state and yet 
would emit a value per tuple passed.  Bags would not be appropriate as coded.

Additionally, the reason for the Accumulator interface is to avoid multiple 
passes over the same data.

For instance, consider the example:

{code}
A = load 'data' using PigStorage() as ( query: chararray, count: int );
B = group A all;
C = foreach B generate
group,
SUM(A.count),
AVG(A.count),
VAR(A.count),
STDEV(A.count),
MIN(A.count),
MAX(A.count),
MEDIAN(A.count);
{code}

Repeatedly shuffling the same values just isn't an optimal way to process data.



 Acummulator Interface for UDFs
 --

 Key: PIG-979
 URL: https://issues.apache.org/jira/browse/PIG-979
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Ying He

 Add an accumulator interface for UDFs that would allow them to take a set 
 number of records at a time instead of the entire bag.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-900) ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY

2009-08-03 Thread David Ciemiewicz (JIRA)
ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER 
BY
-

 Key: PIG-900
 URL: https://issues.apache.org/jira/browse/PIG-900
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz


With GROUP BY, you must put parentheses around the aliases in the BY clause:

{code}
B = group A by ( a, b, c );
{code}

With FILTER BY, you can optionally put parentheses around the aliases in the BY 
clause:

{code}
B = filter A by ( a is not null and b is not null and c is not null );
{code}

However, with ORDER BY, if you put parentheses around the BY clause, you get a 
syntax error:

{code}
 A = order A by ( a, b, c);
{code}

Produces the error:

{code}
2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Encountered  , ,  at line 3, column 19.
Was expecting:
) ...
{code}

This is an annoyance really.

{code}
A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
chararray );

A = order A by ( a, b, c );

dump A;
{code}


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-900) ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY

2009-08-03 Thread David Ciemiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ciemiewicz updated PIG-900:
-

Description: 
With GROUP BY, you must put parentheses around the aliases in the BY clause:

{code}
B = group A by ( a, b, c );
{code}

With FILTER BY, you can optionally put parentheses around the aliases in the BY 
clause:

{code}
B = filter A by ( a is not null and b is not null and c is not null );
{code}

However, with ORDER BY, if you put parentheses around the BY clause, you get a 
syntax error:

{code}
 A = order A by ( a, b, c);
{code}

Produces the error:

{code}
2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing. Encountered  , ,  at line 3, column 19.
Was expecting:
) ...
{code}

This is an annoyance really.

Here's my full code example ...

{code}
A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
chararray );

A = order A by ( a, b, c );

dump A;
{code}


  was:
With GROUP BY, you must put parentheses around the aliases in the BY clause:

{code}
B = group A by ( a, b, c );
{code}

With FILTER BY, you can optionally put parentheses around the aliases in the BY 
clause:

{code}
B = filter A by ( a is not null and b is not null and c is not null );
{code}

However, with ORDER BY, if you put parentheses around the BY clause, you get a 
syntax error:

{code}
 A = order A by ( a, b, c);
{code}

Produces the error:

{code}
2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Encountered  , ,  at line 3, column 19.
Was expecting:
) ...
{code}

This is an annoyance really.

{code}
A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
chararray );

A = order A by ( a, b, c );

dump A;
{code}



 ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and 
 FILTER BY
 -

 Key: PIG-900
 URL: https://issues.apache.org/jira/browse/PIG-900
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz

 With GROUP BY, you must put parentheses around the aliases in the BY clause:
 {code}
 B = group A by ( a, b, c );
 {code}
 With FILTER BY, you can optionally put parentheses around the aliases in the 
 BY clause:
 {code}
 B = filter A by ( a is not null and b is not null and c is not null );
 {code}
 However, with ORDER BY, if you put parentheses around the BY clause, you get 
 a syntax error:
 {code}
  A = order A by ( a, b, c);
 {code}
 Produces the error:
 {code}
 2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt -
 ERROR 1000: Error during parsing. Encountered  , ,  at line 3, column 
 19.
 Was expecting:
 ) ...
 {code}
 This is an annoyance really.
 Here's my full code example ...
 {code}
 A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
 chararray );
 A = order A by ( a, b, c );
 dump A;
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-900) ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and FILTER BY

2009-08-03 Thread David Ciemiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ciemiewicz updated PIG-900:
-

Description: 
With GROUP BY, you must put parentheses around the aliases in the BY clause:

{code}
B = group A by ( a, b, c );
{code}

With FILTER BY, you can optionally put parentheses around the aliases in the BY 
clause:

{code}
B = filter A by ( a is not null and b is not null and c is not null );
{code}

However, with ORDER BY, if you put parentheses around the BY clause, you get a 
syntax error:

{code}
 A = order A by ( a, b, c );
{code}

Produces the error:

{code}
2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing. Encountered  , ,  at line 3, column 19.
Was expecting:
) ...
{code}

This is an annoyance really.

Here's my full code example ...

{code}
A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
chararray );

A = order A by ( a, b, c );

dump A;
{code}


  was:
With GROUP BY, you must put parentheses around the aliases in the BY clause:

{code}
B = group A by ( a, b, c );
{code}

With FILTER BY, you can optionally put parentheses around the aliases in the BY 
clause:

{code}
B = filter A by ( a is not null and b is not null and c is not null );
{code}

However, with ORDER BY, if you put parentheses around the BY clause, you get a 
syntax error:

{code}
 A = order A by ( a, b, c);
{code}

Produces the error:

{code}
2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing. Encountered  , ,  at line 3, column 19.
Was expecting:
) ...
{code}

This is an annoyance really.

Here's my full code example ...

{code}
A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
chararray );

A = order A by ( a, b, c );

dump A;
{code}



 ORDER BY syntax wrt parentheses is somewhat different than GROUP BY and 
 FILTER BY
 -

 Key: PIG-900
 URL: https://issues.apache.org/jira/browse/PIG-900
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz

 With GROUP BY, you must put parentheses around the aliases in the BY clause:
 {code}
 B = group A by ( a, b, c );
 {code}
 With FILTER BY, you can optionally put parentheses around the aliases in the 
 BY clause:
 {code}
 B = filter A by ( a is not null and b is not null and c is not null );
 {code}
 However, with ORDER BY, if you put parentheses around the BY clause, you get 
 a syntax error:
 {code}
  A = order A by ( a, b, c );
 {code}
 Produces the error:
 {code}
 2009-08-03 18:26:29,544 [main] ERROR org.apache.pig.tools.grunt.Grunt -
 ERROR 1000: Error during parsing. Encountered  , ,  at line 3, column 
 19.
 Was expecting:
 ) ...
 {code}
 This is an annoyance really.
 Here's my full code example ...
 {code}
 A = load 'data.txt' using PigStorage as (a: chararray, b: chararray, c: 
 chararray );
 A = order A by ( a, b, c );
 dump A;
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-875) Making COUNT and AVG semantics SQL compliant

2009-07-20 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12733400#action_12733400
 ] 

David Ciemiewicz commented on PIG-875:
--

Can I suggest that the default behavior be to not count nulls, but that we 
might want a way for nulls to be counted, with AVG_WITH_NULLS and 
COUNT_WITH_NULLS, or a DEFINE statement to set an option that turns null-count 
behavior on and off.
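
Hedged sketches of the two suggestions (all names below are placeholders, not 
existing functions or options):

{code}
-- explicit _WITH_NULLS variants
C = foreach B generate COUNT_WITH_NULLS(A.x), AVG_WITH_NULLS(A.x);

-- or a script-level switch (syntax purely illustrative)
set pig.aggregate.count.nulls true;
C = foreach B generate COUNT(A.x), AVG(A.x);
{code}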

 Making COUNT and AVG semantics SQL compliant
 

 Key: PIG-875
 URL: https://issues.apache.org/jira/browse/PIG-875
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.4.0


 Currently both AVG and COUNT count NULLs

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-752) local mode doesn't read bzip2 and gzip compressed data files

2009-07-03 Thread David Ciemiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ciemiewicz updated PIG-752:
-

Summary: local mode doesn't read bzip2 and gzip compressed data files  
(was: bzip2 compression and local mode bugs)

 local mode doesn't read bzip2 and gzip compressed data files
 

 Key: PIG-752
 URL: https://issues.apache.org/jira/browse/PIG-752
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz

 Problem 1)  use of .bz2 file extension does not store results bzip2 
 compressed in Local mode (-exectype local)
 If I use the .bz2 filename extension in a STORE statement on HDFS, the 
 results are stored with bzip2 compression.
 If I use the .bz2 filename extension in a STORE statement on local file 
 system, the results are NOT stored with bzip2 compression.
 compact.bz2.pig:
 {code}
 A = load 'events.test' using PigStorage();
 store A into 'events.test.bz2' using PigStorage();
 C = load 'events.test.bz2' using PigStorage();
 C = limit C 10;
 dump C;
 {code}
 {code}
 -bash-3.00$ pig -exectype local compact.bz2.pig
 -bash-3.00$ file events.test
 events.test: ASCII English text, with very long lines
 -bash-3.00$ file events.test.bz2
 events.test.bz2: ASCII English text, with very long lines
 -bash-3.00$ cat events.test | bzip2 > events.test.bz2
 -bash-3.00$ file events.test.bz2
 events.test.bz2: bzip2 compressed data, block size = 900k
 {code}
 The output format in local mode is definitely not bzip2, but it should be.
 Problem 2) pig in local mode does not decompress bzip2 compressed files, but 
 should, to be consistent with HDFS
 read.bz2.pig:
 {code}
 A = load 'events.test.bz2' using PigStorage();
 A = limit A 10;
 dump A;
 {code}
 The output should be human readable but is instead garbage, indicating no 
 decompression took place during the load:
 {code}
 -bash-3.00$ pig -exectype local read.bz2.pig
 USING: /grid/0/gs/pig/current
 2009-04-03 18:26:30,455 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
 2009-04-03 18:26:30,456 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
 (BZh91AYsyoz?u?...@{x_?d?|u-??mK???;??4?C??)
 ((R? 6?*mg, 
 ?6?Zj?k,???0?QT?d???hY?#mJ?[j???z?m?t?u?K)??K5+??)?m?E7j?X?8a??
 ??U?p@@MT?$?B?P??N??=???(z}gk...@c$\??i]?g:?J)
 a(R?,?u?v???...@?i@??J??!D?)???A?PP?IY??m?
 (mP(i?4,#F[?I)@?...@??|7^?}U??wwg,?u?$?T???((Q!D?=`*?}hP??_|??=?(??2???m=?xG?(?rC?B?(33??:4?N???t|??T?*??k??NT?x???=?fyv?wf??4z???4t?)
 (?oou?t???Kwl?3?nCM?WS?;l???P?s?x
 a???e)B??9?  ?44
 ((?...@4?)
 (f)
 (?...@+?d?0@?U)
 (Q?SR)
 -bash-3.00$ 
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-06-26 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724703#action_12724703
 ] 

David Ciemiewicz commented on PIG-793:
--

Alan,

This sounds good, but it sounds like you are only saving 12 out of 174 bytes, 
or less than 10%.

Amdahl's law says this isn't sufficient in the grand scheme of things, so I 
wouldn't expect a huge payback.

It seems like an optimal encoding of the same tuple would be something like:

1 or 2 bytes for an index to the structure describing the contents of the tuple 
(keep a list of these tuple structures)
4 bytes for the int
8 bytes for the double
1 or 2 bytes for string length in fixed positions
20 bytes for string

Total is 36 bytes or an 80% reduction in memory versus 174 bytes.

If memory and not CPU is what is slowing down Pig processing, then Hong Tang's 
LazyTuple or something like it is ultimately going to be what is needed.


 Improving memory efficiency of Tuple implementation
 ---

 Key: PIG-793
 URL: https://issues.apache.org/jira/browse/PIG-793
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Alan Gates

 Currently, our tuple is a real pig and uses a lot of extra memory. 
 There are several places where we can improve memory efficiency:
 (1) Laying out memory for the fields rather than using java objects since 
 since each object for a numeric field takes 16 bytes
 (2) For the cases where we know the schema using Java arrays rather than 
 ArrayList.
 There might be more.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-863) Function (UDF) automatic namespace resolution is really needed

2009-06-24 Thread David Ciemiewicz (JIRA)
Function (UDF) automatic namespace resolution is really needed
--

 Key: PIG-863
 URL: https://issues.apache.org/jira/browse/PIG-863
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz


The Apache PiggyBank documentation says that to reference a function, I need to 
specify a function as:

org.apache.pig.piggybank.evaluation.string.UPPER(text)

As in the example:

{code}
REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ;
TweetsInaug  = FILTER Tweets BY 
org.apache.pig.piggybank.evaluation.string.UPPER(text) MATCHES 
'.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ;
{code}

Why can't we implement automatic namespace resolution so that we can just 
reference UPPER without namespace qualifiers?

{code}
REGISTER /public/share/pig/contrib/piggybank/java/piggybank.jar ;
TweetsInaug  = FILTER Tweets BY UPPER(text) MATCHES 
'.*(INAUG|OBAMA|BIDEN|CHENEY|BUSH).*' ;
{code}

I know about the workaround:

{code}
define UPPER org.apache.pig.piggybank.evaluation.string.UPPER();
{code}

But this is really a pain to do if I have lots of functions.
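
To make the pain concrete, a hedged sketch (the function list is illustrative): 
every script needs one define per function used.

{code}
define UPPER org.apache.pig.piggybank.evaluation.string.UPPER();
define LOWER org.apache.pig.piggybank.evaluation.string.LOWER();
define LENGTH org.apache.pig.piggybank.evaluation.string.LENGTH();
{code}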

Just warn if there is a collision and suggest I use the define workaround in 
the warning messages.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-826) DISTINCT as Function/Operator rather than statement/operator - High Level Pig

2009-06-02 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715726#action_12715726
 ] 

David Ciemiewicz commented on PIG-826:
--

Alan, thanks!  But what if I want to do the following:

{code}
foreach Grouped {
   dcountryurl = distinct Logs.(country,url);
   generate COUNT(dcountryurl);
};
{code}

Projecting multiple aliases doesn't seem to work. I also tried the following 
and it doesn't work either.

{code}
foreach Grouped {
   dcountryurl = distinct Logs.country, Logs.url;
   generate COUNT(dcountryurl);
};
{code}

 DISTINCT as Function/Operator rather than statement/operator - High Level 
 Pig
 ---

 Key: PIG-826
 URL: https://issues.apache.org/jira/browse/PIG-826
 Project: Pig
  Issue Type: New Feature
Reporter: David Ciemiewicz

 In SQL, a user would think nothing of doing something like:
 {code}
 select
 COUNT(DISTINCT(user)) as user_count,
 COUNT(DISTINCT(country)) as country_count,
 COUNT(DISTINCT(url)) as url_count
 from
 server_logs;
 {code}
 But in Pig, we'd need to do something like the following.  And this is about 
 the most
 compact version I could come up with.
 {code}
 Logs = load 'log' using PigStorage()
 as ( user: chararray, country: chararray, url: chararray);
 DistinctUsers = distinct (foreach Logs generate user);
 DistinctCountries = distinct (foreach Logs generate country);
 DistinctUrls = distinct (foreach Logs generate url);
 DistinctUsersCount = foreach (group DistinctUsers all) generate
 group, COUNT(DistinctUsers) as user_count;
 DistinctCountriesCount = foreach (group DistinctCountries all) generate
 group, COUNT(DistinctCountries) as country_count;
 DistinctUrlCount = foreach (group DistinctUrls all) generate
 group, COUNT(DistinctUrls) as url_count;
 AllDistinctCounts = cross
 DistinctUsersCount, DistinctCountriesCount, DistinctUrlCount;
 Report = foreach AllDistinctCounts generate
 DistinctUsersCount::user_count,
 DistinctCountriesCount::country_count,
 DistinctUrlCount::url_count;
 store Report into 'log_report' using PigStorage();
 {code}
 It would be good if there was a higher level version of Pig that permitted 
 code to be written as:
 {code}
 Logs = load 'log' using PigStorage()
 as ( user: chararray, country: chararray, url: chararray);
 Report = overall Logs generate
 COUNT(DISTINCT(user)) as user_count,
 COUNT(DISTINCT(country)) as country_count,
 COUNT(DISTINCT(url)) as url_count;
 store Report into 'log_report' using PigStorage();
 {code}
 I do want this in Pig and not as SQL.  I'd expect High Level Pig to generate 
 Lower Level Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-826) DISTINCT as Function rather than statement - High Level Pig

2009-05-31 Thread David Ciemiewicz (JIRA)
DISTINCT as Function rather than statement - High Level Pig
-

 Key: PIG-826
 URL: https://issues.apache.org/jira/browse/PIG-826
 Project: Pig
  Issue Type: New Feature
Reporter: David Ciemiewicz


In SQL, a user would think nothing of doing something like:

{code}
select
COUNT(DISTINCT(user)) as user_count,
COUNT(DISTINCT(country)) as country_count,
COUNT(DISTINCT(url)) as url_count
from
server_logs;
{code}

But in Pig, we'd need to do something like the following.  And this is about 
the most
compact version I could come up with.

{code}
Logs = load 'log' using PigStorage()
as ( user: chararray, country: chararray, url: chararray);

DistinctUsers = distinct (foreach Logs generate user);
DistinctCountries = distinct (foreach Logs generate country);
DistinctUrls = distinct (foreach Logs generate url);

DistinctUsersCount = foreach (group DistinctUsers all) generate
group, COUNT(DistinctUsers) as user_count;
DistinctCountriesCount = foreach (group DistinctCountries all) generate
group, COUNT(DistinctCountries) as country_count;
DistinctUrlCount = foreach (group DistinctUrls all) generate
group, COUNT(DistinctUrls) as url_count;

AllDistinctCounts = cross
DistinctUsersCount, DistinctCountriesCount, DistinctUrlCount;

Report = foreach AllDistinctCounts generate
DistinctUsersCount::user_count,
DistinctCountriesCount::country_count,
DistinctUrlCount::url_count;

store Report into 'log_report' using PigStorage();
{code}

It would be good if there was a higher level version of Pig that permitted code 
to be written as:

{code}
Logs = load 'log' using PigStorage()
as ( user: chararray, country: chararray, url: chararray);

Report = overall Logs generate
COUNT(DISTINCT(user)) as user_count,
COUNT(DISTINCT(country)) as country_count,
COUNT(DISTINCT(url)) as url_count;

store Report into 'log_report' using PigStorage();
{code}

I do want this in Pig and not as SQL.  I'd expect High Level Pig to generate 
Lower Level Pig.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-801) Pig needs to handle scalar aliases to improve programmer and code execution efficiency

2009-05-31 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714972#action_12714972
 ] 

David Ciemiewicz commented on PIG-801:
--

I'm very much beginning to like the idea of introducing some syntactic sugar 
in Pig for a forall or overall statement that would allow one to write the 
high-level Pig for this case as:

{code}
Total = forall CountryPopulations generate SUM(CountryPopulations.population) 
as population;
{code}

or as:

{code}
Total = overall CountryPopulations generate SUM(CountryPopulations.population) 
as population;
{code}

Yeah, I know I could use the construct:

{code}
Total = foreach (group CountryPopulations all) generate 
SUM(CountryPopulations.population) as population;
{code}

 But I like syntactic sugar.

Then again, it would be really good if Pig just supported the following.  Since 
this would need to be done for SQL, it could be done for Pig as well.

{code}
CountryPopulations = load 'country.dat' using PigStorage() as ( country: 
chararray, population: long );
PopulationProportions = foreach CountryPopulations generate
country, population, (double)population / (double)SUM(population) as 
global_proportion;
{code}








 Pig needs to handle scalar aliases to improve programmer and code execution 
 efficiency
 --

 Key: PIG-801
 URL: https://issues.apache.org/jira/browse/PIG-801
 Project: Pig
  Issue Type: New Feature
Reporter: David Ciemiewicz

 In Pig, it is often the case that the result of an operation is a scalar 
 value that needs to be applied to the next step of processing.
 For example:
 * FILTER by MAX of group -- See: PIG-772
 * Compute proportions by dividing by total (SUM) of grouped alias
 Today Pig programmers need to go through distasteful and slow contortions of 
 using FLATTEN or CROSS to propagate the scalar computation to EVERY row of 
 data to perform these operations creating needless copies of data.  Or, the 
 user must write the global sum to a file, then read it back in to gain the 
 efficiency.
 If the language were simply extended to have the notion of scalar aliases, 
 then coding would be simplified without contortions for the programmer and, I 
 believe, execution of the code would be faster too.
 For instance, to compute global proportions, I want to do the following:
 {code}
 CountryPopulations = load 'country.dat' using PigStorage() as ( country: 
 chararray, population: long );
 AllCountryPopulations = group CountryPopulations all;
 Total = foreach AllCountryPopulations generate 
 SUM(CountryPopulations.population) as population;
 PopulationProportions = foreach CountryPopulations generate
 country, population, (double)population / (double)Total.population as 
 global_proportion;
 {code}
 One of the very distasteful workarounds for this is to do something like:
 {code}
 CountryPopulations = load 'country.dat' using PigStorage() as ( country: 
 chararray, population: long );
 AllCountryPopulations = group CountryPopulations all;
 Total = foreach AllCountryPopulations generate 
 SUM(CountryPopulations.population) as population;
 CountryPopulationsTotal = cross CountryPopulations, Total;
 PopulationProportions = foreach CountryPopulations generate
 CountryPopulations::country,
 CountryPopulations::population,
 (double)CountryPopulations::population / (double)Total::population as 
 global_proportion;
 {code}
 This just makes me cringe every time I have to do it.  Constructing new rows 
 of data simply to apply
 the same scalar value row after row after row for potentially billions of 
 rows of data just feels horribly wrong
 and inefficient both from the coding standpoint and from the execution 
 standpoint.
 In SQL, I'd just code this as:
 {code}
 select
  country,
  population,
  population / SUM(population)
 from
  CountryPopulations;
 {code}
 In writing a SQL to Pig translator, it would seem that this construct or 
 idiom would need to be supported, so why not create a higher level of Pig 
 which would support the notion of scalars efficiently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-753) Provide support for UDFs without parameters

2009-05-31 Thread David Ciemiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ciemiewicz updated PIG-753:
-

Summary: Provide support for UDFs without parameters  (was: Do not support 
UDF not providing parameter)

 Provide support for UDFs without parameters
 ---

 Key: PIG-753
 URL: https://issues.apache.org/jira/browse/PIG-753
 Project: Pig
  Issue Type: Improvement
Reporter: Jeff Zhang

 Pig does not support UDFs without parameters; it forces me to provide a 
 parameter, as in the following statement:
  B = FOREACH A GENERATE bagGenerator();  this will generate an error. I have to 
 provide a parameter, like the following:
  B = FOREACH A GENERATE bagGenerator($0);
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)

2009-05-28 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714227#action_12714227
 ] 

David Ciemiewicz commented on PIG-807:
--

@Yiping

I see what you mean.  Maybe we should have FOREACH and FORALL as in B = FORALL 
A GENERATE SUM(m);

Another version of this might be B = OVER A GENERATE SUM(m); or B = OVERALL A 
GENERATE SUM(m);


There was a hallway conversation about the situation of:

{code}
B = GROUP A BY key;
C = FOREACH B {
SORTED = ORDER A BY value;
GENERATE
COUNT(SORTED) as count,
QUANTILES(SORTED.value, 0.0, 0.5, 0.75, 0.9, 1.0) as quantiles: 
(p00, p50, p75, p90, p100);
};
{code}

I was told that a ReadOnce bag would not solve this problem because we'd need 
to pass through SORTED twice because there were two UDFs.

I disagree.  It is possible to pass over this data once and only once if we 
create a class of Accumulating or Running functions that differs from the 
current DataBag and AlgebraicDataBag functions.

First, functions like SUM, COUNT, AVG, VAR, MIN, MAX, STDEV, ReservoirSampling, 
and statistics.SUMMARY can all be computed on a ReadOnce / Streaming DataBag of 
unknown length or size.  For each of these functions, we simply add or 
accumulate the values one row at a time, we can invoke a combiner for 
intermediate results across partitions, and we can produce a final result, all 
without materializing a DataBag as implemented today.

QUANTILES is a different beast.  To compute quantiles, the data must be sorted, 
which I prefer to do outside the UDF at this time.  Also, the COUNT of the data 
is needed a priori.  Fortunately, sorting COULD produce a ReadOnce / Streaming 
DataBag of KNOWN as opposed to unknown length or size, so only two scans through 
the data (sorting and quantiles) are needed, without needing three scans (sort, 
count, quantiles).

So, if Pig could understand two additional data types:

ReadOnceSizeUnknown -- COUNT() counts all individual rows
ReadOnceSizeKnown -- COUNT() just returns size attribute of ReadOnce data 
reference

And if Pig had RunningEval and RunningAlgebraicEval classes of functions which 
accumulate values a row at a time, many computations in Pig could be much much 
more efficient.

In case anyone doesn't get what I mean by having running functions, here's 
some Perl code that implements what I'm suggesting. I'll leave it as an 
exercise for the Pig development team to figure out the RunningAlgebraicEval 
versions of these functions/classes. :^)

runningsums.pl
{code}
#! /usr/bin/perl

use RunningSum;
use RunningCount;

$a_count = RunningCount->new();
$a_sum = RunningSum->new();
$b_sum = RunningSum->new();
$c_sum = RunningSum->new();

while (<>)
{
s/\r*\n*//g;

($a, $b, $c) = split(/\t/);

$a_count->accumulate($a);
$a_sum->accumulate($a);
$b_sum->accumulate($b);
$c_sum->accumulate($c);
}

print join("\t",
$a_count->final(),
$a_sum->final(),
$b_sum->final(),
$c_sum->final()
), "\n";
{code}

RunningCount.pm
{code}
package RunningCount;

sub new
{
my $class = shift;
my $self = {};
bless $self, $class;
return $self;
}

sub accumulate
{
my $self = shift;
my $value = shift;

$self->{'count'} ++;
}

sub final
{
my $self = shift;
return $self->{'count'};
}

1;
{code}

RunningSum.pm
{code}
package RunningSum;

sub new
{
my $class = shift;
my $self = {};
bless $self, $class;
return $self;
}

sub accumulate
{
my $self = shift;
my $value = shift;

$self->{'sum'} += $value;
}

sub final
{
my $self = shift;
return $self->{'sum'};
}

1;
{code}








 PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the 
 Hadoop values iterator)
 

 Key: PIG-807
 URL: https://issues.apache.org/jira/browse/PIG-807
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
 Fix For: 0.3.0


 Currently all bags resulting from a group or cogroup are materialized as bags 
 containing all of the contents. The issue with this is that if a particular 
 key has many corresponding values, all these values get stuffed in a bag 
 which may run out of memory and hence spill causing slow down in performance 
 and sometime memory exceptions. In many cases, the udfs which use these bags 
 coming out a group and cogroup only need to iterate over the bag in a 
 unidirectional read-once manner. This can be implemented by having the bag 
 implement its iterator by simply iterating over the underlying hadoop 
 iterator provided in the reduce. This kind of a bag is also needed in 
 

[jira] Commented: (PIG-807) PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)

2009-05-13 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12709007#action_12709007
 ] 

David Ciemiewicz commented on PIG-807:
--

Certainly SUM, COUNT, AVG could all use this.

In fact, technically speaking, SUM, COUNT, and AVG shouldn't even necessarily 
need a prior GROUP ... ALL statement.  How would this factor into the 
thinking on this?

While you're thinking about this, we might also consider another optimization 
as well ... what if I have 10 to 100 SUM operations in the same FOREACH ... 
GENERATE statement?

Materializing a DataBag or even a ReadOnce Bag for each column of data is 
REALLY slow.  In working through this, would providing access to the underlying 
hadoop iterators permit a single scan through the data rather than multiple 
scans, one for each column?

Example:

{code}
A = load ...

B = group A all;

C = foreach B generate
COUNT(A),
SUM(A.m),
SUM(A.n),
SUM(A.o),
SUM(A.p),
SUM(A.q),
SUM(A.r),
...
{code}

 PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the 
 Hadoop values iterator)
 

 Key: PIG-807
 URL: https://issues.apache.org/jira/browse/PIG-807
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.1
Reporter: Pradeep Kamath
 Fix For: 0.3.0


 Currently all bags resulting from a group or cogroup are materialized as bags 
 containing all of the contents. The issue with this is that if a particular 
 key has many corresponding values, all these values get stuffed in a bag 
 which may run out of memory and hence spill causing slow down in performance 
 and sometime memory exceptions. In many cases, the udfs which use these bags 
 coming out a group and cogroup only need to iterate over the bag in a 
 unidirectional read-once manner. This can be implemented by having the bag 
 implement its iterator by simply iterating over the underlying hadoop 
 iterator provided in the reduce. This kind of a bag is also needed in 
 http://issues.apache.org/jira/browse/PIG-802. So the code can be reused for 
 this issue too. The other part of this issue is to have some way for the udfs 
 to communicate to Pig that any input bags that they need are read-once bags. 
 This can be achieved by having an Interface - say UsesReadOnceBags - which 
 serves as a tag to indicate the intent to Pig. Pig can then rewire its 
 execution plan to use ReadOnceBags if feasible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-734) Non-string keys in maps

2009-05-08 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707377#action_12707377
 ] 

David Ciemiewicz commented on PIG-734:
--

Alan, I don't think this is going to be that problematic.

Even if I try to pass in a map dereference with an integer, such as mymap#1, 
would Pig automagically convert the 1 to the string equivalent, mymap#'1'?  If 
so, I think this would be quite acceptable.
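
A minimal sketch of the coercion being discussed (assuming string-only keys):

{code}
B = foreach A generate mymap#1;    -- an integer key would be treated as ...
C = foreach A generate mymap#'1';  -- ... its string equivalent
{code}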

 Non-string keys in maps
 ---

 Key: PIG-734
 URL: https://issues.apache.org/jira/browse/PIG-734
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Minor
 Fix For: 0.3.0

 Attachments: PIG-734.patch


 With the addition of types to pig, maps were changed to allow any atomic type 
 to be a key.  However, in practice we do not see people using keys other than 
 strings.  And allowing multiple types is causing us issues in serializing 
 data (we have to check what every key type is) and in the design for non-java 
 UDFs (since many scripting languages include associative arrays such as 
 Perl's hash).
 So I propose we scope back maps to only have string keys.  This would be a 
 non-compatible change.  But I am not aware of anyone using non-string keys, 
 so hopefully it would have little or no impact.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-801) Pig needs to handle scalar aliases to improve programmer and code execution efficiency

2009-05-07 Thread David Ciemiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ciemiewicz updated PIG-801:
-

Summary: Pig needs to handle scalar aliases to improve programmer and code 
execution efficiency  (was: Pig needs to handle scalar aliases to programmer 
and code execution efficiency)

 Pig needs to handle scalar aliases to improve programmer and code execution 
 efficiency
 --

 Key: PIG-801
 URL: https://issues.apache.org/jira/browse/PIG-801
 Project: Pig
  Issue Type: New Feature
Reporter: David Ciemiewicz

 In Pig, it is often the case that the result of an operation is a scalar 
 value that needs to be applied to the next step of processing.
 For example:
 * FILTER by MAX of group -- See: PIG-772
 * Compute proportions by dividing by total (SUM) of grouped alias
 Today Pig programmers need to go through distasteful and slow contortions of 
 using FLATTEN or CROSS to propagate the scalar computation to EVERY row of 
 data to perform these operations creating needless copies of data.  Or, the 
 user must write the global sum to a file, then read it back in to gain the 
 efficiency.
 If the language were simply extended to have the notion of scalar aliases, 
 then coding would be simplified without contortions for the programmer and, I 
 believe, execution of the code would be faster too.
 For instance, to compute global proportions, I want to do the following:
 {code}
 CountryPopulations = load 'country.dat' using PigStorage() as ( country: 
 chararray, population: long );
 AllCountryPopulations = group CountryPopulations all;
 Total = foreach AllCountryPopulations generate 
 SUM(CountryPopulations.population) as population;
 PopulationProportions = foreach CountryPopulations generate
 country, population, (double)population / (double)Total.population as 
 global_proportion;
 {code}
 One of the very distasteful workarounds for this is to do something like:
 {code}
 CountryPopulations = load 'country.dat' using PigStorage() as ( country: 
 chararray, population: long );
 AllCountryPopulations = group CountryPopulations all;
 Total = foreach AllCountryPopulations generate 
 SUM(CountryPopulations.population) as population;
 CountryPopulationsTotal = cross CountryPopulations, Total;
 PopulationProportions = foreach CountryPopulations generate
 CountryPopulations::country,
 CountryPopulations::population,
 (double)CountryPopulations::population / (double)Total::population as 
 global_proportion;
 {code}
 This just makes me cringe every time I have to do it.  Constructing new rows 
 of data simply to apply
 the same scalar value row after row after row for potentially billions of 
 rows of data just feels horribly wrong
 and inefficient both from the coding standpoint and from the execution 
 standpoint.
 In SQL, I'd just code this as:
 {code}
 select
  country,
  population,
  population / SUM(population)
 from
  CountryPopulations;
 {code}
 In writing a SQL to Pig translator, it would seem that this construct or 
 idiom would need to be supported, so why not create a higher level of Pig 
 which would support the notion of scalars efficiently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-602) Pass global configurations to UDF

2009-05-04 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12705812#action_12705812
 ] 

David Ciemiewicz commented on PIG-602:
--

JIRA PIG-477 is related to this, I think.

 Pass global configurations to UDF
 -

 Key: PIG-602
 URL: https://issues.apache.org/jira/browse/PIG-602
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Yiping Han
Assignee: Alan Gates

 We are seeking an easy way to pass a large number of global configurations to 
 UDFs.
 Since our application contains many pig jobs, and has a large number of 
 configurations. Passing configurations through command line is not an ideal 
 way (i.e. modifying single parameter needs to change multiple command lines). 
 And to put everything into the hadoop conf is not an ideal way either.
 We would like to see if Pig can provide such a facility that allows us to 
 pass a configuration file in some format(XML?) and then make it available 
 through out all the UDFs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-477) passing properties from command line to the backend

2009-05-04 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12705814#action_12705814
 ] 

David Ciemiewicz commented on PIG-477:
--

PIG-602 is related to this, I think.

 passing properties from command line to the backend
 ---

 Key: PIG-477
 URL: https://issues.apache.org/jira/browse/PIG-477
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich

 We have users that would like to be able to pass paramters from command line 
 to their UDFs.
 A natural way to do that would be pass them as properties from the client to 
 the compute node and make them available through System.getProperties on the 
 backend.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-741) Add LIMIT as a statement that works in nested FOREACH

2009-04-29 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704452#action_12704452
 ] 

David Ciemiewicz commented on PIG-741:
--

Thanks Alan!

The fact that LIMIT in this case doesn't use the combiner is probably not an 
issue.  In most of the instances I have, I don't have more than a million 
things in the grouped databag; most of the time I only have under 1000 to 
10000 things, so the combiner won't have much value.

 Add LIMIT as a statement that works in nested FOREACH
 -

 Key: PIG-741
 URL: https://issues.apache.org/jira/browse/PIG-741
 Project: Pig
  Issue Type: New Feature
Reporter: David Ciemiewicz
Assignee: Alan Gates
 Fix For: 0.3.0

 Attachments: PIG-741.patch


 I'd like to compute the top 10 results in each group.
 The natural way to express this in Pig would be:
 {code}
 A = load '...' using PigStorage() as (
 date: int,
 count: int,
 url: chararray
 );
 B = group A by ( date );
 C = foreach B {
 D = order A by count desc;
 E = limit D 10;
 generate
 FLATTEN(E);
 };
 dump C;
 {code}
 Yeah, I could write a UDF / PiggyBank function to take the top n results. But 
 since LIMIT already exists as a statement, it seems like it should also work 
 in the nested foreach context.
 Example workaround code.
 {code}
 C = foreach B {
 D = order A by count desc;
 E = util.TOP(D, 10);
 generate
 FLATTEN(E);
 };
 dump C;
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-777) Code refactoring: Create optimization out of store/load post processing code

2009-04-28 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703764#action_12703764
 ] 

David Ciemiewicz commented on PIG-777:
--

Another thing ...

If you eliminate the D = load statement, could you provide some information to 
the user that this optimization is taking place?

It would help me immensely with code maintenance if I could eliminate the D = 
load steps which often require recoding the AS clause schema.
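To make the pattern concrete, here is a sketch of the kind of store/load pair 
in question (aliases, path, and schema are hypothetical):

{code}
store C into 'intermediate' using PigStorage();
D = load 'intermediate' using PigStorage() as ( a: int, b: chararray );  -- AS clause schema maintained by hand
E = group D by a;
{code}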

 Code refactoring: Create optimization out of store/load post processing code
 

 Key: PIG-777
 URL: https://issues.apache.org/jira/browse/PIG-777
 Project: Pig
  Issue Type: Improvement
Reporter: Gunther Hagleitner

 The postProcessing method in the pig server checks whether a logical graph 
 contains stores to and loads from the same location. If so, it will either 
 connect the store and load, or optimize by throwing out the load and 
 connecting the store predecessor with the successor of the load.
 Ideally the introduction of the store and load connection should happen in 
 the query compiler, while the optimization should then happen in an separate 
 optimizer step as part of the optimizer framework.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-771) PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??

2009-04-27 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703219#action_12703219
 ] 

David Ciemiewicz commented on PIG-771:
--

I'm just using Mac OS terminal to connect to a RHEL-4 gateway server to a 
RHEL-4 grid.

I changed the code to use PigDump() storage format for the STORE statement and 
reran the code, trying to eliminate the terminal aspect.  Pig itself is writing 
the question marks ('?', 0x3f).

{code}
-bash-3.00$ cat ch2.pig
A = load 'ch.txt' using PigStorage() as (str: chararray);
store A into 'ch.dmp' using PigDump();

-bash-3.00$ hadoop fs -cat ch.dmp/*
()

-bash-3.00$ hadoop fs -cat ch.dmp/* | od -xc
0000000 3f28 3f3f 293f 000a
          (   ?   ?   ?   ?   )  \n  \0
0000007
{code}

 PigDump does not properly output Chinese UTF8 characters - they are displayed 
 as question marks ??
 --

 Key: PIG-771
 URL: https://issues.apache.org/jira/browse/PIG-771
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz

 PigDump does not properly output Chinese UTF8 characters.
 The reason for this is that the function Tuple.toString() is called.
 DefaultTuple implements Tuple.toString() and it calls Object.toString() on 
 the opaque object d.
 Instead, I think that the code should be changed instead to call the new 
 DataType.toString() function.
 {code}
 @Override
 public String toString() {
     StringBuilder sb = new StringBuilder();
     sb.append('(');
     for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
         Object d = it.next();
         if (d != null) {
             if (d instanceof Map) {
                 sb.append(DataType.mapToString((Map<Object, Object>)d));
             } else {
                 sb.append(DataType.toString(d));  // <-- Change this one line
                 if (d instanceof Long) {
                     sb.append("L");
                 } else if (d instanceof Float) {
                     sb.append("F");
                 }
             }
         } else {
             sb.append("");
         }
         if (it.hasNext())
             sb.append(",");
     }
     sb.append(')');
     return sb.toString();
 }
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-786) Default job.names - script name, load file pattern, store file pattern, sub task type

2009-04-26 Thread David Ciemiewicz (JIRA)
Default job.names - script name, load file pattern, store file pattern, sub 
task type
-

 Key: PIG-786
 URL: https://issues.apache.org/jira/browse/PIG-786
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz
Priority: Trivial


I have very complex Pig scripts which are often concatenations and iterations 
of a large number of map reduce tasks.

I've gotten into the habit of using the following construct in my code:

{code}set job.name '$DIR/$DATE/summary.bz';

A = load ...
...
store Z into '$DIR/$DATE/summary.bz' using PigStorage();{code}

But it would be really useful if Pig script parsing automagically set these 
job.name values.

Ideally I'd like to have Pig just automagically construct job names for me so I 
can trace execution of multihour jobs in the HOD progress pages.  Something 
like:

{code}process-dates.pig
A = LOAD /data/logs/daily/20090408
...
STORE Z into mysummary/20090408/summary.bz
map-group-combiner-sort{code}

Okay you say, I could construct this kind of job.name myself if this is what I 
want.

Well:

1) I'd really like to have a default constructed by Pig so I don't have to
2) Pig has information about what is happening that I don't have such as:
* The name of the script passed to Pig
* The glob expansion of the file pathname in the LOAD statement
* The execution plan of pig that would tell me what the 
map-group-combine-sort-reduce group looks like
* The name of intermediate STORE operations that are being performed


 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-786) Default job.names - script name, load file pattern, store file pattern, sub task type

2009-04-26 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702931#action_12702931
 ] 

David Ciemiewicz commented on PIG-786:
--

I think the desire for global configuration information and the notion of 
environment variables might be related: PIG-602



 Default job.names - script name, load file pattern, store file pattern, sub 
 task type
 -

 Key: PIG-786
 URL: https://issues.apache.org/jira/browse/PIG-786
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz
Priority: Trivial

 I have very complex Pig scripts which are often concatenations and iterations 
 of a large number of map reduce tasks.
 I've gotten into the habit of using the following construct in my code:
 {code}set job.name '$DIR/$DATE/summary.bz';
 A = load ...
 ...
 store Z into '$DIR/$DATE/summary.bz' using PigStorage();{code}
 But it would be really useful if Pig script parsing automagically set these 
 job.name values.
 Ideally I'd like to have Pig just automagically construct job names for me so 
 I can trace execution of multihour jobs in the HOD progress pages.  Something 
 like:
 {code}process-dates.pig
 A = LOAD /data/logs/daily/20090408
 ...
 STORE Z into mysummary/20090408/summary.bz
 map-group-combiner-sort{code}
 Okay you say, I could construct this kind of job.name myself if this is what 
 I want.
 Well:
 1) I'd really like to have a default constructed by Pig so I don't have to
 2) Pig has information about what is happening that I don't have such as:
 * The name of the script passed to Pig
 * The glob expansion of the file pathname in the LOAD statement
 * The execution plan of pig that would tell me what the 
 map-group-combine-sort-reduce group looks like
 * The name of intermediate STORE operations that are being performed
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-755) Difficult to debug parameter substitution problems based on the error messages when running in local mode

2009-04-25 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702703#action_12702703
 ] 

David Ciemiewicz commented on PIG-755:
--

Thanks.  Didn't know about the dry-run option.

Hopefully it will someday produce UTF-8 text given some of the parameters will 
be in Chinese or Japanese characters. :^)

 Difficult to debug parameter substitution problems based on the error 
 messages when running in local mode
 -

 Key: PIG-755
 URL: https://issues.apache.org/jira/browse/PIG-755
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.3.0
Reporter: Viraj Bhat
 Fix For: 0.3.0

 Attachments: inputfile.txt, localparamsub.pig


 I have a script in which I do a parameter substitution for the input file. I 
 have a use case where I find it difficult to debug based on the error 
 messages in local mode.
 {code}
 A = load '$infile' using PigStorage() as
  (
date: chararray,
count   : long,
gmean   : double
 );
 dump A;
 {code}
 1) I run it in local mode with the input file in the current working directory
 {code}
 prompt  $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main 
 -exectype local -param infile='inputfile.txt' localparamsub.pig
 {code}
 2009-04-07 00:03:51,967 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore
  - Received error from storer function: 
 org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to 
 setup the load function.
 2009-04-07 00:03:51,970 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - Failed jobs!!
 2009-04-07 00:03:51,971 [main] INFO  
 org.apache.pig.backend.local.executionengine.LocalPigLauncher - 1 out of 1 
 failed!
 2009-04-07 00:03:51,974 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1066: Unable to open iterator for alias A
 
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062631414.log
 
 ERROR 1066: Unable to open iterator for alias A
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias A
 at org.apache.pig.PigServer.openIterator(PigServer.java:439)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:88)
 at org.apache.pig.Main.main(Main.java:352)
 Caused by: java.io.IOException: Job terminated with anomalous status FAILED
 at org.apache.pig.PigServer.openIterator(PigServer.java:433)
 ... 5 more
 
 2) I run it in map reduce mode
 {code}
 prompt  $ java -cp pig.jar:/path/to/hadoop/conf/ org.apache.pig.Main -param 
 infile='inputfile.txt' localparamsub.pig
 {code}
 2009-04-07 00:07:31,660 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: hdfs://localhost:9000
 2009-04-07 00:07:32,074 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to map-reduce job tracker at: localhost:9001
 2009-04-07 00:07:34,543 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the arguments. Applications should 
 implement Tool for the same.
 2009-04-07 00:07:39,540 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 0% complete
 2009-04-07 00:07:39,540 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - Map reduce job failed
 2009-04-07 00:07:39,563 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2100: inputfile does not exist.
 
 Details at logfile: /home/viraj/pig-svn/trunk/pig_1239062851400.log
 
 ERROR 2100: inputfile does not exist.
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias A
 at org.apache.pig.PigServer.openIterator(PigServer.java:439)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:359)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:193)
 at 
 

[jira] Commented: (PIG-506) Does pig need a NATIVE keyword?

2009-04-25 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702704#action_12702704
 ] 

David Ciemiewicz commented on PIG-506:
--

Alan,

This seems a much cleaner way to set up native Hadoop map-reduce jobs than the 
command line interfaces people use today.  It might be worth it just for that 
alone.

I think you'd need to gather some examples from non-Pig users and prototype 
them as Pig/NATIVE scripts to demonstrate what the advantages would be.

For me, as a primarily Pig user, there is some appeal because I could benefit 
from borrowing others' code.

 Does pig need a NATIVE keyword?
 ---

 Key: PIG-506
 URL: https://issues.apache.org/jira/browse/PIG-506
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates
Priority: Minor

 Assume a user had a job that broke easily into three pieces.  Further assume 
 that pieces one and three were easily expressible in pig, but that piece two 
 needed to be written in map reduce for whatever reason (performance, 
 something that pig could not easily express, legacy job that was too 
 important to change, etc.).  Today the user would either have to use map 
 reduce for the entire job or manually handle the stitching together of pig 
 and map reduce jobs.  What if instead pig provided a NATIVE keyword that 
 would allow the script to pass off the data stream to the underlying system 
 (in this case map reduce).  The semantics of NATIVE would vary by underlying 
 system.  In the map reduce case, we would assume that this indicated a 
 collection of one or more fully contained map reduce jobs, so that pig would 
 store the data, invoke the map reduce jobs, and then read the resulting data 
 to continue.  It might look something like this:
 {code}
 A = load 'myfile';
 X = load 'myotherfile';
 B = group A by $0;
 C = foreach B generate group, myudf(B);
 D = native (jar='mymr.jar', infile='frompig' outfile='topig');
 E = join D by $0, X by $0;
 ...
 {code}
 This differs from streaming in that it allows the user to insert an arbitrary 
 amount of native processing, whereas streaming allows the insertion of one 
 binary.  It also differs in that, for streaming, data is piped directly into 
 and out of the binary as part of the pig pipeline.  Here the pipeline would 
 be broken, data written to disk, and the native block invoked, then data read 
 back from disk.
 Another alternative is to say this is unnecessary because the user can do the 
 coordination from java, using the PigServer interface to run pig and calling 
 the map reduce job explicitly.  The advantages of the native keyword are that 
 the user need not worry about coordination between the jobs; pig will 
 take care of it.  Also the user can make use of existing java applications 
 without being a java programmer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-602) Pass global configurations to UDF

2009-04-25 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702707#action_12702707
 ] 

David Ciemiewicz commented on PIG-602:
--

This sounds a lot like shell script environment variables.
As such maybe it should follow the same rich level of operations and semantics 
that you get with environment variables.

How is PigConf different from set properties in Pig?
Why can't both use the same mechanism?
Should they use the same mechanism?

Can / should this same mechanism let my UDFs know when Pig is in local mode 
versus hdfs mode? [JIRA PIG-756] (Or should something different be used?)

When in grunt, how can I inspect what the current PigConf values are? (Useful 
for logging and debugging)

By what mechanism can I set or override these values from within my Pig script?
Can I set the values to be one thing at one point in the Pig script and change 
it later to a new value in the Pig script?
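To illustrate the last two questions with Pig's existing set syntax (the 
property name here is hypothetical):

{code}
set my.property 'value1';
A = load 'input1' using PigStorage();
...
set my.property 'value2';  -- would UDFs running after this point see the new value?
B = load 'input2' using PigStorage();
{code}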

 Pass global configurations to UDF
 -

 Key: PIG-602
 URL: https://issues.apache.org/jira/browse/PIG-602
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Yiping Han
Assignee: Alan Gates

 We are seeking an easy way to pass a large number of global configurations to 
 UDFs.
 Since our application contains many pig jobs, and has a large number of 
 configurations. Passing configurations through command line is not an ideal 
 way (i.e. modifying single parameter needs to change multiple command lines). 
 And to put everything into the hadoop conf is not an ideal way either.
 We would like to see if Pig can provide such a facility that allows us to 
 pass a configuration file in some format(XML?) and then make it available 
 through out all the UDFs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-784) PigStorage() - need ability to turn off Attempt to access field warnings

2009-04-24 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702528#action_12702528
 ] 

David Ciemiewicz commented on PIG-784:
--

@Santhosh

Hmmm.  I'm running Pig in local mode with the latest published build and I get 
lots of warnings and they are not aggregated:

-bash-3.00$ pig -exectype local -latest cat.pig
USING: /grid/0/gs/pig/current
2009-04-24 20:02:55,666 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,667 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,668 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,669 [main] WARN  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigHadoopLogger - 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject:
 Attempt to access field which was not found in the input
2009-04-24 20:02:55,672 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-04-24 20:02:55,672 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(a,1,42.0F)
(,,)
(,,)
(,,)
(,,)

[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702533#action_12702533
 ] 

David Ciemiewicz commented on PIG-774:
--

A somewhat related bug is JIRA PIG-755 - the difficulty of debugging issues 
related to passed parameters.

If Pig produced an output file of the code with parameter substitutions made, 
we could have more rapidly isolated the problem.
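For example, for the script quoted below, a substituted-output file would have 
shown the filter comparison value at a glance.  A sketch of the one line that 
mattered, with $querystring expanded in place:

{code}
J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
{code}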

 Pig does not handle Chinese characters (in both the parameter subsitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.0.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile


 I created a very small test case in which I did the following.
 1) Created a UTF-8 file which contained a query string in Chinese and wrote 
 it to HDFS. I used this dfs file as an input for the tests.
 2) Created a parameter file which also contained the same query string as in 
 Step 1.
 3) Created a Pig script which takes in the parametrized query string and hard 
 coded Chinese character.
 
 Pig script: chinese_data.pig
 
 {code}
 rmf chineseoutput;
 I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
 J = filter I by $0 == '$querystring';
 --J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 store J into 'chineseoutput';
 dump J;
 {code}
 =
 Parameter file: nextgen_paramfile
 =
 queryid=20090311
 querystring='   歌手香港情牽女人心演唱會'
 =
 Input file: /user/viraj/chinese.txt
 =
 shell$ hadoop fs -cat /user/viraj/chinese.txt
 歌手香港情牽女人心演唱會
 =
 I ran the above set of inputs in the following ways:
 Run 1:
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
 =
 2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:31:40,700 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 Run 2: removed the parameter substitution in the Pig script instead used the 
 following statement.
 =
 J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main chinese_data_withoutparam.pig
 =
 2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:35:27,399 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 In both cases:
 =
 ucdev6 01:39:22 ~/pig-svn/trunk $ hadoop fs -ls /user/viraj/chineseoutput
 Found 2 items
 drwxr-xr-x   - viraj supergroup  0 2009-04-22 01:37 
 /user/viraj/chineseoutput/_logs
 -rw-r--r--   3 viraj supergroup  0 2009-04-22 01:37 
 /user/viraj/chineseoutput/part-0
 

[jira] Commented: (PIG-759) HBaseStorage scheme for Load/Slice function

2009-04-24 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702535#action_12702535
 ] 

David Ciemiewicz commented on PIG-759:
--

If hbase has named columns in its schema, why wouldn't it be appropriate to 
say something like:

table = load '$tablename/$subsection' using HBaseStorage() as (a, b);

Since HBaseStorage() is specified:

1) Isn't hbase:// implicit?
2) Shouldn't I be able to just specify the names in the AS clause?



 HBaseStorage scheme for Load/Slice function
 ---

 Key: PIG-759
 URL: https://issues.apache.org/jira/browse/PIG-759
 Project: Pig
  Issue Type: Bug
Reporter: Gunther Hagleitner

 We would like to change the HBaseStorage function to use a scheme when 
 loading a table in pig. The scheme we are thinking of is: hbase://. So in 
 order to load an hbase table in a pig script the statement should read:
 {noformat}
 table = load 'hbase://tablename' using HBaseStorage();
 {noformat}
 If the scheme is omitted pig would assume the tablename to be an hdfs path 
 and the storage function would use the last component of the path as a table 
 name and output a warning.
 For details on why see jira issue: PIG-758

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-771) PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??

2009-04-20 Thread David Ciemiewicz (JIRA)
PigDump does not properly output Chinese UTF8 characters - they are displayed 
as question marks ??
--

 Key: PIG-771
 URL: https://issues.apache.org/jira/browse/PIG-771
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz


PigDump does not properly output Chinese UTF8 characters.

The reason for this is that the function Tuple.toString() is called.

DefaultTuple implements Tuple.toString() and it calls Object.toString() on the 
opaque object d.

Instead, I think that the code should be changed instead to call the new 
DataType.toString() function.

{code}
@Override
public String toString() {
    StringBuilder sb = new StringBuilder();
    sb.append('(');
    for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
        Object d = it.next();
        if (d != null) {
            if (d instanceof Map) {
                sb.append(DataType.mapToString((Map<Object, Object>)d));
            } else {
                sb.append(DataType.toString(d));  // <-- Change this one line
                if (d instanceof Long) {
                    sb.append("L");
                } else if (d instanceof Float) {
                    sb.append("F");
                }
            }
        } else {
            sb.append("");
        }
        if (it.hasNext())
            sb.append(",");
    }
    sb.append(')');
    return sb.toString();
}
{code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-771) PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??

2009-04-20 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12700845#action_12700845
 ] 

David Ciemiewicz commented on PIG-771:
--

I was going to submit a patch for this one line change, but I discovered in 
compiling the code that DataType.toString(d) throws an ExecException.

Oddly, DataType.mapToString DOES NOT throw any Exceptions which is inconsistent 
with the other DataType.to... functions.

I am not sure how to best implement the try / catch / throw for this particular 
case.

Also, in doing the code review of DataType.mapToString(...) I discovered that 
it will also have problems with correctly dumping the data contained within it 
because it too uses Object.toString() on opaque data handles.

So, the code for DataType.mapToString(...) should also use 
DataType.toString(Object);

But now I witness a recursion problem.  DataType.toString(Object) does not work 
for complex types.  So maps of maps will not be recursed properly.

So DataType.toString(Object) should probably be enhanced to work on Maps as 
well.

But now we have another problem ... PigDump wants to append L and F for Long 
values and Float values.  But this won't work for nested structures.

 PigDump does not properly output Chinese UTF8 characters - they are displayed 
 as question marks ??
 --

 Key: PIG-771
 URL: https://issues.apache.org/jira/browse/PIG-771
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz

 PigDump does not properly output Chinese UTF8 characters.
 The reason for this is that the function Tuple.toString() is called.
 DefaultTuple implements Tuple.toString() and it calls Object.toString() on 
 the opaque object d.
 Instead, I think that the code should be changed instead to call the new 
 DataType.toString() function.
 {code}
 @Override
 public String toString() {
     StringBuilder sb = new StringBuilder();
     sb.append('(');
     for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
         Object d = it.next();
         if (d != null) {
             if (d instanceof Map) {
                 sb.append(DataType.mapToString((Map<Object, Object>)d));
             } else {
                 sb.append(DataType.toString(d));  // <-- Change this one line
                 if (d instanceof Long) {
                     sb.append("L");
                 } else if (d instanceof Float) {
                     sb.append("F");
                 }
             }
         } else {
             sb.append("");
         }
         if (it.hasNext())
             sb.append(",");
     }
     sb.append(')');
     return sb.toString();
 }
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-750) Use combiner when a mix of algebraic and non-algebraic functions are used

2009-04-17 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12700195#action_12700195
 ] 

David Ciemiewicz commented on PIG-750:
--

Also consider the application of a scalar function to the result of an 
aggregation function:

3) foreach X generate EXP(AVG(b))

 Use combiner when a mix of algebraic and non-algebraic functions are used
 -

 Key: PIG-750
 URL: https://issues.apache.org/jira/browse/PIG-750
 Project: Pig
  Issue Type: Improvement
Reporter: Amir Youssefi
Priority: Minor

 Currently Pig uses combiner when all a,b, c,... are algebraic (e.g. SUM, AVG 
 etc.) in foreach:
 foreach X generate a,b,c,... 
  It's a performance improvement if it uses combiner when a mix of algebraic 
 and non-algebraic functions are used as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-760) Serialize schemas for PigStorage() and other storage types.

2009-04-09 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697631#action_12697631
 ] 

David Ciemiewicz commented on PIG-760:
--

Sure, you could do that, create PigStorageSchema.

The thing is, I don't think it is necessary and it is possible to do this in a 
backward compatible way.

First, if the user specifies a LOAD ... AS clause schema, then PigStorage could 
simply use that casting to override what is in the .schema.  Of course, 
PigStorage might want to warn that there is an override at run time or do a 
smart warning only if there are incompatible differences between the 
serialized schema and the explicit AS clause schema.
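A sketch of the intended behavior (the file name echoes the example in the 
issue description below; the override types are hypothetical):

{code}
store A into 'data-2' using PigStorage();                            -- would also write data-2/.schema
B = load 'data-2' using PigStorage();                                -- schema read back from .schema
C = load 'data-2' using PigStorage() as ( a: long, b: chararray );   -- explicit AS clause overrides, perhaps with a warning
{code}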

Next, is there really any harm in creating the serialized schema file on each 
and every STORE?

Finally, why subclass when we could parameterize?

In other words, instead of writing:

store A into 'file' using PigStorageSchema();

Why not do:

store A into 'file' using PigStorage('schema=yes');  -- redundant; schema=yes is the default

I think it would be more useful to have single classes with parameterized 
options than a proliferation of classes.

Or, better yet, why can't I just define the behavior of PigStorage() for all of 
the instances in my script:

define PigStorage PigStorage(
'sep=\t',
'schema=yes',
'erroronmissingcolumn=no'
);

I have recently done similar things for other functions and it turns out to be 
a nice way of capturing global parameterizations for cleaner Pig code.
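With a define like the one above in effect, the rest of the script would not 
have to repeat the options.  A sketch of the intended effect, not of current 
behavior:

{code}
A = load 'data-1' using PigStorage();        -- would pick up the sep/schema settings from the define
store A into 'data-2' using PigStorage();
{code}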




 Serialize schemas for PigStorage() and other storage types.
 ---

 Key: PIG-760
 URL: https://issues.apache.org/jira/browse/PIG-760
 Project: Pig
  Issue Type: New Feature
Reporter: David Ciemiewicz

 I'm finding PigStorage() really convenient for storage and data interchange 
 because it compresses well and imports into Excel and other analysis 
 environments well.
 However, it is a pain when it comes to maintenance because the columns are in 
 fixed locations and I'd like to add columns in some cases.
 It would be great if load PigStorage() could read a default schema from a 
 .schema file stored with the data and if store PigStorage() could store a 
 .schema file with the data.
 I have tested this out and both Hadoop HDFS and Pig in -exectype local mode 
 will ignore a file called .schema in a directory of part files.
 So, for example, if I have a chain of Pig scripts I execute such as:
 A = load 'data-1' using PigStorage() as ( a: int , b: int );
 store A into 'data-2' using PigStorage();
 B = load 'data-2' using PigStorage();
 describe B;
 describe B should output something like { a: int, b: int }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-729) Use of default parallelism

2009-04-09 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697641#action_12697641
 ] 

David Ciemiewicz commented on PIG-729:
--

Ah wait, I just read what Olga wrote again.  I think there might be a hybrid 
solution that handles both cases without having to do -param.

We should add to Pig a -set option that lets us set values for things that we 
would set in our scripts.

pig -set parallelism=5

is equivalent to the following idiom in my pig script:

set parallelism 5;

Command line -set options should override explicit set statements in the pig 
script with a warning of the override.

I think this generalized mechanism would satisfy both my desires as a developer 
and Olga's desire to reduce pig development team code maintenance headaches.

 Use of default parallelism
 --

 Key: PIG-729
 URL: https://issues.apache.org/jira/browse/PIG-729
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.1
 Environment: Hadoop 0.20
Reporter: Santhosh Srinivasan
 Fix For: 0.2.1


 Currently, if the user does not specify the number of reduce slots using the 
 parallel keyword, Pig lets Hadoop decide on the default number of reducers. 
 This model worked well with dynamically allocated clusters using HOD and for 
 static clusters where the default number of reduce slots was explicitly set. 
 With Hadoop 0.20, a single static cluster will be shared amongst a number of 
 queues. As a result, a common scenario is to end up with default number of 
 reducers set to one (1).
 When users migrate to Hadoop 0.20, they might see a dramatic change in the 
 performance of their queries if they had not used the parallel keyword to 
 specify the number of reducers. In order to mitigate such circumstances, Pig 
 can support one of the following:
 1. Specify a default parallelism for the entire script.
 This option will allow users to use the same parallelism for all operators 
 that do not have the explicit parallel keyword. This will ensure that the 
 scripts utilize more reducers than the default of one reducer. On the down 
 side, due to data transformations, usually operations that are performed 
 towards the end of the script will need smaller number of reducers compared 
 to the operators that appear at the beginning of the script.
 2. Display a warning message for each reduce side operator that does have the 
 use of the explicit parallel keyword. Proceed with the execution.
 3. Display an error message indicating the operator that does not have the 
 explicit use of the parallel keyword. Stop the execution.
 Other suggestions/thoughts/solutions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path

2009-04-08 Thread David Ciemiewicz (JIRA)
UDFs should have API for transparently opening and reading files from HDFS or 
from local file system with only relative path


 Key: PIG-756
 URL: https://issues.apache.org/jira/browse/PIG-756
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz


I have a utility function util.INSETFROMFILE() that I pass a file name during 
initialization.

{code}
define inQuerySet util.INSETFROMFILE('analysis/queries');
A = load 'logs' using PigStorage() as ( date int, query chararray );
B = filter A by inQuerySet(query);
{code}

This provides a computationally inexpensive way to effect map-side joins for 
small sets; plus, functions of this style provide the ability to encapsulate 
more complex matching rules.

For rapid development and debugging purposes, I want this code to run without 
modification both on my local file system, when I do pig -exectype local, and 
on HDFS.

Pig needs to provide an API for UDFs that allows them to either:

1) know when they are in local or HDFS mode and let them open and read from 
files as appropriate
2) just provide a file name and read statements, and have pig transparently 
manage local or HDFS opens and reads for the UDF

UDFs need to read configuration information off the filesystem and it 
simplifies the process if one can just flip the switch of -exectype local.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path

2009-04-08 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697043#action_12697043
 ] 

David Ciemiewicz commented on PIG-756:
--

BTW, there used to be a mechanism to do this in early versions of Pig that was 
lost in the transition to the new execution system.

 UDFs should have API for transparently opening and reading files from HDFS or 
 from local file system with only relative path
 

 Key: PIG-756
 URL: https://issues.apache.org/jira/browse/PIG-756
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz

 I have a utility function util.INSETFROMFILE() that I pass a file name during 
 initialization.
 {code}
 define inQuerySet util.INSETFROMFILE('analysis/queries');
 A = load 'logs' using PigStorage() as ( date int, query chararray );
 B = filter A by inQuerySet(query);
 {code}
 This provides a computationally inexpensive way to effect map-side joins for 
 small sets plus functions of this style provide the ability to encapsulate 
 more complex matching rules.
 For rapid development and debugging purposes, I want this code to run without 
 modification on both my local file system when I do pig -exectype local and 
 on HDFS.
 Pig needs to provide an API for UDFs which allow them to either:
 1) know  when they are in local or HDFS mode and let them open and read 
 from files as appropriate
 2) just provide a file name and read statements and have pig transparently 
 manage local or HDFS opens and reads for the UDF
 UDFs need to read configuration information off the filesystem and it 
 simplifies the process if one can just flip the switch of -exectype local.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-745) Please add DataTypes.toString() conversion function

2009-04-08 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697094#action_12697094
 ] 

David Ciemiewicz commented on PIG-745:
--

Alan,

I realized several things.

1) The question of what to do about the BOOLEAN case.  My original suggestion 
was to convert the BOOLEAN case to 1 and 0 but in the patch, I just used the 
Boolean.toString() function.  Not sure if that matters or not.

2) I didn't see other test cases for the other DataType.toInteger(), ... 
conversions so I didn't create one for DataType.toString().

3) We are just using the default conversion of Float.toString() and 
Double.toString().  I don't know if this is actually best since I don't know 
if these operations present the floating-point values in full precision or not. 
 At this point, it may not really matter so much as the primary reason for 
creating DataType.toString() is to allow string functions to operate on any 
data type (like in Perl) without generating cast errors.



 Please add DataTypes.toString() conversion function
 ---

 Key: PIG-745
 URL: https://issues.apache.org/jira/browse/PIG-745
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz
 Attachments: PIG-745.patch


 I'm doing some work in string manipulation UDFs and I've found that it would 
 be very convenient if I could always convert the argument to a chararray 
 (internally a Java String).
 For example TOLOWERCASE(arg) shouldn't really care whether arg is a 
 bytearray, chararray, int, long, double, or float, it should be treated as a 
 string and operated on.
 The simplest and most foolproof method would be if the DataTypes added a 
 static function of  DataTypes.toString which did all of the argument type 
 checking and provided consistent translation.
 I believe that this function might be coded as:
 public static String toString(Object o) throws ExecException {
     try {
         switch (findType(o)) {
         case BOOLEAN:
             if (((Boolean)o) == true) return new String("1");
             else return new String("0");
         case BYTE:
             return ((Byte)o).toString();
         case INTEGER:
             return ((Integer)o).toString();
         case LONG:
             return ((Long)o).toString();
         case FLOAT:
             return ((Float)o).toString();
         case DOUBLE:
             return ((Double)o).toString();
         case BYTEARRAY:
             return ((DataByteArray)o).toString();
         case CHARARRAY:
             return (String)o;
         case NULL:
             return null;
         case MAP:
         case TUPLE:
         case BAG:
         case UNKNOWN:
         default:
             int errCode = 1071;
             String msg = "Cannot convert a " + findTypeName(o) + " to a String";
             throw new ExecException(msg, errCode, PigException.INPUT);
         }
     } catch (ExecException ee) {
         throw ee;
     } catch (Exception e) {
         int errCode = 2054;
         String msg = "Internal error. Could not convert " + o + " to String.";
         throw new ExecException(msg, errCode, PigException.BUG);
     }
 }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-753) Do not support UDF not providing parameter

2009-04-08 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697108#action_12697108
 ] 

David Ciemiewicz commented on PIG-753:
--

I think Jeff means that Pig does not support UDFs without parameters, but 
should.

I agree.

 Do not support UDF not providing parameter
 --

 Key: PIG-753
 URL: https://issues.apache.org/jira/browse/PIG-753
 Project: Pig
  Issue Type: Improvement
Reporter: Jeff Zhang

 Pig does not support UDFs without parameters; it forces me to provide a parameter,
 like in the following statement:
  B = FOREACH A GENERATE bagGenerator();  this will generate an error. I have to
 provide a parameter like the following:
  B = FOREACH A GENERATE bagGenerator($0);
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-04-08 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697146#action_12697146
 ] 

David Ciemiewicz commented on PIG-697:
--

Some thoughts on optimization problems and patterns from SQL and from coding 
Pig, and my desire for a higher level version of Pig than we have today.

I know this may come off as a distraction but hopefully you'll have some time 
to hear me out.

These thoughts follow from:
* a conversation with Santhosh about the SQL to Pig translation work
* multiple issues I have encountered with nested foreach statements, including 
redundant function execution
* nested FOREACH statement assignment computation bugs
* hand coding chains of foreach statements so I can get the Algebraic combiner 
to kick in
* hand coding chains of foreach statements and grouping statements rather than 
using a single statement

I think I might have stumbled on a potentially improved model for Pig to Pig 
execution plan generation:

{code}
High Level Pig to Low Level Pig translation
{code}

I think this would potentially benefit the SQL to Pig efforts and provide for 
programmer coding efficiency in Pig as well.

This will be a bit protracted, but I hope you have some time to consider it.

Take the following SQL idiom that the SQL to Pig translator will need to 
support:

{code}
select
EXP(AVG(LN(time+0.1))) as geomean_time
from
events
where
time is not null and
time >= 0;
{code}

In high level pig, I have wanted to code this as
 
{code}
A = load 'events' using PigStorage() as ( time: int );
B = filter A by time is not null and time >= 0;
C = group B all;
D = foreach C generate EXP(AVG(LN(B.time+0.1))) as geomean_time;
{code}

In fact, this would seem to provide a nice translation path from SQL to low 
level pig via high level pig.

Unfortunately, this won't work.  We developers must write Pig scripts at a 
lower level and break all of this apart into various steps.

An additional issue is that, because of some, um, workarounds, in the execution 
plan optimizations, the combiner won't kick in if we don't do further steps.

So the most performant version of the desired pig script is the following 
really low level pig where D is broken into 3 steps, merging one with B and 
the remaining 2 steps as separate D steps:

 
{code}
A = load 'events' using PigStorage() as ( time: int );
B = filter A by time is not null and time >= 0;
B = foreach B generate LN(time+0.1) as log_time;
C = group B all;
D = foreach C generate group, AVG(B.log_time) as mean_log_time;
-- note that the group alias is required for the Algebraic combiner to kick in
D = foreach D generate EXP(mean_log_time) as geomean_time;
{code}

If we can figure out how to translate SQL into this last low-level set of 
statements, why couldn't we or shouldn't we have high level pig as well and 
permit more efficient code writing and optimization?


Next example

I do a bunch of nested intermediate computations in a nested FOREACH statement:

{code}
C = foreach C {
    curr_mean_log_timetonextevent   = curr_sum_log_timetonextevent / (double)count;
    curr_meansq_log_timetonextevent = curr_sumsq_log_timetonextevent / (double)count;
    curr_var_log_timetonextevent    = curr_meansq_log_timetonextevent -
        (curr_mean_log_timetonextevent * curr_mean_log_timetonextevent);
    curr_sterr_log_timetonextevent  = math.SQRT(curr_var_log_timetonextevent / (double)count);

    curr_geomean_timetonextevent    = math.EXP(curr_mean_log_timetonextevent);
    curr_geosterr_timetonextevent   = math.EXP(curr_sterr_log_timetonextevent);

    curr_mean_timetonextevent       = curr_sum_timetonextevent / (double)count;
    curr_meansq_timetonextevent     = curr_sumsq_timetonextevent / (double)count;
    curr_var_timetonextevent        = curr_meansq_timetonextevent -
        (curr_mean_timetonextevent * curr_mean_timetonextevent);
    curr_sterr_timetonextevent      = math.SQRT(curr_var_timetonextevent / (double)count);

    generate
    ...
{code}

The code for nested statements in Pig has been particularly problematic and 
buggy including problems such as:

* redundant execution of functions such as SUM, AVG
* nested function problems
* mathematical operator problems (illustrated in this bug)
* no type propagation
* the need to use AS clauses to name nested alias assignments projected in the 
GENERATE clauses

What if, instead of trying to do all of these operations in some specialized 
execution code, this was treated as high level pig that translated all of 
these intermediate statements into two or more low level foreach expansions?

[jira] Commented: (PIG-564) Parameter Substitution using -param option does not seem to work when parameters contain special characters such as +,=,-,?,'

2009-04-06 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12696266#action_12696266
 ] 

David Ciemiewicz commented on PIG-564:
--

Period (.) is also a special character that seems to cause problems.

See related JIRA PIG-754

 Parameter Substitution using -param option does not seem to work when 
 parameters contain special characters such as +,=,-,?,' 
 ---

 Key: PIG-564
 URL: https://issues.apache.org/jira/browse/PIG-564
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat

 Consider the following Pig script which uses parameter substitution
 {code}
 %default qual '/user/viraj'
 %default mydir 'mydir_myextraqual'
 VISIT_LOGS = load '$qual/$mydir' as (a,b,c);
 dump VISIT_LOGS;
 {code}
 If you run the script as:
 ==
 java -cp pig.jar:${HADOOP_HOME}/conf/ -Dhod.server='' org.apache.pig.Main 
 -param mydir=mydir-myextraqual mypigparamsub.pig
 ==
 You get the following error:
 ==
 2008-12-15 19:49:43,964 [main] ERROR 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - java.io.IOException: /user/viraj/mydir does not exist
 at 
 org.apache.pig.backend.executionengine.PigSlicer.validate(PigSlicer.java:109)
 at 
 org.apache.pig.impl.io.ValidatingInputFileSpec.validate(ValidatingInputFileSpec.java:59)
 at 
 org.apache.pig.impl.io.ValidatingInputFileSpec.init(ValidatingInputFileSpec.java:44)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:200)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:742)
 at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:370)
 at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
 at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
 at java.lang.Thread.run(Thread.java:619)
 java.io.IOException: Unable to open iterator for alias: VISIT_LOGS [Job 
 terminated with anomalous status FAILED]
 at org.apache.pig.PigServer.openIterator(PigServer.java:389)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:269)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:178)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
 at org.apache.pig.Main.main(Main.java:306)
 Caused by: java.io.IOException: Job terminated with anomalous status FAILED
 ... 6 more
 ==
 Also tried using:  -param mydir='mydir\-myextraqual'
 This behavior occurs if the parameter value contains characters such as +,=, 
 ?. 
 A workaround for this behavior is using a param_file which contains 
 param_name=param_value on each line, with the param_value enclosed by 
 quotes. For example:
 mydir='mydir-myextraqual' and then running the pig script as:
 java -cp pig.jar:${HADOOP_HOME}/conf/ -Dhod.server='' org.apache.pig.Main 
 -param_file myparamfile mypigparamsub.pig
 The following issues need to be fixed:
 1) In the -param option, if the parameter value contains special characters, it is 
 truncated
 2) In param_file, if param_value contains special characters, it should be 
 enclosed in quotes
 3) If 2 is a known issue then it should be documented in 
 http://wiki.apache.org/pig/ParameterSubstitution

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-745) Please add DataTypes.toString() conversion function

2009-04-05 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695921#action_12695921
 ] 

David Ciemiewicz commented on PIG-745:
--

The more I think about this one, the more I realize that not having 
DataType.toString() is an oversight for the DataType package.


 Please add DataTypes.toString() conversion function
 ---

 Key: PIG-745
 URL: https://issues.apache.org/jira/browse/PIG-745
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz

 I'm doing some work in string manipulation UDFs and I've found that it would 
 be very convenient if I could always convert the argument to a chararray 
 (internally a Java String).
 For example TOLOWERCASE(arg) shouldn't really care whether arg is a 
 bytearray, chararray, int, long, double, or float, it should be treated as a 
 string and operated on.
 The simplest and most foolproof method would be if the DataTypes added a 
 static function of  DataTypes.toString which did all of the argument type 
 checking and provided consistent translation.
 I believe that this function might be coded as:
 public static String toString(Object o) throws ExecException {
 try {
   switch (findType(o)) {
   case BOOLEAN:
   if (((Boolean)o) == true) return "1";
   else return "0";
   case BYTE:
   return ((Byte)o).toString();
   case INTEGER:
   return ((Integer)o).toString();
   case LONG:
   return ((Long)o).toString();
   case FLOAT:
   return ((Float)o).toString();
   case DOUBLE:
   return ((Double)o).toString();
   case BYTEARRAY:
   return ((DataByteArray)o).toString();
   case CHARARRAY:
   return (String)o;
   case NULL:
   return null;
   case MAP:
   case TUPLE:
   case BAG:
   case UNKNOWN:
   default:
   int errCode = 1071;
   String msg = "Cannot convert a " + findTypeName(o) + " to a String";
   throw new ExecException(msg, errCode, 
 PigException.INPUT);
   }
   } catch (ExecException ee) {
   throw ee;
   } catch (Exception e) {
   int errCode = 2054;
   String msg = "Internal error. Could not convert " + o + " to String.";
   throw new ExecException(msg, errCode, PigException.BUG);
   }
 }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-745) Please add DataTypes.toString() conversion function

2009-04-05 Thread David Ciemiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ciemiewicz updated PIG-745:
-

Attachment: PIG-745.patch

PIG-745.patch attached.

Patch for consideration to add DataTypes.toString() function.

 Please add DataTypes.toString() conversion function
 ---

 Key: PIG-745
 URL: https://issues.apache.org/jira/browse/PIG-745
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz
 Attachments: PIG-745.patch


 I'm doing some work in string manipulation UDFs and I've found that it would 
 be very convenient if I could always convert the argument to a chararray 
 (internally a Java String).
 For example TOLOWERCASE(arg) shouldn't really care whether arg is a 
 bytearray, chararray, int, long, double, or float, it should be treated as a 
 string and operated on.
 The simplest and most foolproof method would be if the DataTypes added a 
 static function of  DataTypes.toString which did all of the argument type 
 checking and provided consistent translation.
 I believe that this function might be coded as:
 public static String toString(Object o) throws ExecException {
 try {
   switch (findType(o)) {
   case BOOLEAN:
   if (((Boolean)o) == true) return "1";
   else return "0";
   case BYTE:
   return ((Byte)o).toString();
   case INTEGER:
   return ((Integer)o).toString();
   case LONG:
   return ((Long)o).toString();
   case FLOAT:
   return ((Float)o).toString();
   case DOUBLE:
   return ((Double)o).toString();
   case BYTEARRAY:
   return ((DataByteArray)o).toString();
   case CHARARRAY:
   return (String)o;
   case NULL:
   return null;
   case MAP:
   case TUPLE:
   case BAG:
   case UNKNOWN:
   default:
   int errCode = 1071;
   String msg = "Cannot convert a " + findTypeName(o) + " to a String";
   throw new ExecException(msg, errCode, 
 PigException.INPUT);
   }
   } catch (ExecException ee) {
   throw ee;
   } catch (Exception e) {
   int errCode = 2054;
   String msg = "Internal error. Could not convert " + o + " to String.";
   throw new ExecException(msg, errCode, PigException.BUG);
   }
 }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-745) Please add DataTypes.toString() conversion function

2009-04-05 Thread David Ciemiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ciemiewicz updated PIG-745:
-

Status: Patch Available  (was: Open)

PIG-745.patch adds DataType.toString() function to DataType package.

 Please add DataTypes.toString() conversion function
 ---

 Key: PIG-745
 URL: https://issues.apache.org/jira/browse/PIG-745
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz
 Attachments: PIG-745.patch


 I'm doing some work in string manipulation UDFs and I've found that it would 
 be very convenient if I could always convert the argument to a chararray 
 (internally a Java String).
 For example TOLOWERCASE(arg) shouldn't really care whether arg is a 
 bytearray, chararray, int, long, double, or float, it should be treated as a 
 string and operated on.
 The simplest and most foolproof method would be if the DataTypes added a 
 static function of  DataTypes.toString which did all of the argument type 
 checking and provided consistent translation.
 I believe that this function might be coded as:
 public static String toString(Object o) throws ExecException {
 try {
   switch (findType(o)) {
   case BOOLEAN:
   if (((Boolean)o) == true) return "1";
   else return "0";
   case BYTE:
   return ((Byte)o).toString();
   case INTEGER:
   return ((Integer)o).toString();
   case LONG:
   return ((Long)o).toString();
   case FLOAT:
   return ((Float)o).toString();
   case DOUBLE:
   return ((Double)o).toString();
   case BYTEARRAY:
   return ((DataByteArray)o).toString();
   case CHARARRAY:
   return (String)o;
   case NULL:
   return null;
   case MAP:
   case TUPLE:
   case BAG:
   case UNKNOWN:
   default:
   int errCode = 1071;
   String msg = "Cannot convert a " + findTypeName(o) + " to a String";
   throw new ExecException(msg, errCode, 
 PigException.INPUT);
   }
   } catch (ExecException ee) {
   throw ee;
   } catch (Exception e) {
   int errCode = 2054;
   String msg = "Internal error. Could not convert " + o + " to String.";
   throw new ExecException(msg, errCode, PigException.BUG);
   }
 }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-754) Bugs with load and store and filenames passed with -param containing periods

2009-04-05 Thread David Ciemiewicz (JIRA)
Bugs with load and store and filenames passed with -param containing periods


 Key: PIG-754
 URL: https://issues.apache.org/jira/browse/PIG-754
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz


This one drove me batty.

I have two files file and file.right.

file:
{code}
WRONG 
This is file, not file.right.
{code}

file.right:
{code}
RIGHT
This is file.right..
{code}

infile.pig:
{code}
A = load '$infile' using PigStorage();
dump A;
{code}

When I pass in file.right as the infile parameter value, the wrong file is read:

{code}
-bash-3.00$ pig -exectype local -param infile=file.right infile.pig
USING: /grid/0/gs/pig/current
2009-04-05 23:18:36,291 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-04-05 23:18:36,292 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(WRONG )
(This is file, not file.right.)
{code}

However, if I pass in infile as ./file.right, the script magically works.

{code}
-bash-3.00$ pig -exectype local -param infile=./file.right infile.pig
USING: /grid/0/gs/pig/current
2009-04-05 23:20:46,735 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-04-05 23:20:46,736 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(RIGHT)
(This is file.right.)
{code}

I do not have this problem if I use the file name with a period in the script 
itself:

infile2.pig
{code}
A = load 'file.right' using PigStorage();
dump A;
{code}

{code}
-bash-3.00$ pig -exectype local infile2.pig
USING: /grid/0/gs/pig/current
2009-04-05 23:22:47,022 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-04-05 23:22:47,023 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(RIGHT)
(This is file.right.)
{code}

I also experience similar problems when I try to pass in param outfile in a 
store statement.
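
To make the store-side case concrete, here is a hypothetical sketch 
(outfile.pig and the parameter value are made up, analogous to infile.pig 
above):

{code}
A = load 'file' using PigStorage();
store A into '$outfile' using PigStorage();
{code}

invoked as something like:

{code}
-bash-3.00$ pig -exectype local -param outfile=out.right outfile.pig
{code}

Presumably the period in the parameter value gets mangled the same way it does 
for the load.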



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-752) bzip2 compression and local mode bugs

2009-04-03 Thread David Ciemiewicz (JIRA)
bzip2 compression and local mode bugs
-

 Key: PIG-752
 URL: https://issues.apache.org/jira/browse/PIG-752
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz


Problem 1)  use of .bz2 file extension does not store results bzip2 compressed 
in Local mode (-exectype local)

If I use the .bz2 filename extension in a STORE statement on HDFS, the results 
are stored with bzip2 compression.
If I use the .bz2 filename extension in a STORE statement on local file system, 
the results are NOT stored with bzip2 compression.

compact.bz2.pig:
{code}
A = load 'events.test' using PigStorage();
store A into 'events.test.bz2' using PigStorage();

C = load 'events.test.bz2' using PigStorage();
C = limit C 10;

dump C;
{code}

{code}
-bash-3.00$ pig -exectype local compact.bz2.pig

-bash-3.00$ file events.test
events.test: ASCII English text, with very long lines
-bash-3.00$ file events.test.bz2
events.test.bz2: ASCII English text, with very long lines

-bash-3.00$ cat events.test | bzip2 > events.test.bz2
-bash-3.00$ file events.test.bz2
events.test.bz2: bzip2 compressed data, block size = 900k
{code}

The output format in local mode is definitely not bzip2, but it should be.


Problem 2) pig in local mode does not decompress bzip2 compressed files, but 
should, to be consistent with HDFS

read.bz2.pig:
{code}
A = load 'events.test.bz2' using PigStorage();
A = limit A 10;
dump A;
{code}

The output should be human readable but is instead garbage, indicating no 
decompression took place during the load:

{code}
-bash-3.00$ pig -exectype local read.bz2.pig
USING: /grid/0/gs/pig/current
2009-04-03 18:26:30,455 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-04-03 18:26:30,456 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(BZh91AYsyoz?u?...@{x_?d?|u-??mK???;??4?C??)
((R? 6?*mg, 
?6?Zj?k,???0?QT?d???hY?#mJ?[j???z?m?t?u?K)??K5+??)?m?E7j?X?8a??
??U?p@@MT?$?B?P??N??=???(z}gk...@c$\??i]?g:?J)
a(R?,?u?v???...@?i@??J??!D?)???A?PP?IY??m?
(mP(i?4,#F[?I)@?...@??|7^?}U??wwg,?u?$?T???((Q!D?=`*?}hP??_|??=?(??2???m=?xG?(?rC?B?(33??:4?N???t|??T?*??k??NT?x???=?fyv?wf??4z???4t?)
(?oou?t???Kwl?3?nCM?WS?;l???P?s?x
a???e)B??9?  ?44
((?...@4?)
(f)
(?...@+?d?0@?U)
(Q?SR)
-bash-3.00$ 
{code}



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-729) Use of default parallelism

2009-04-03 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695599#action_12695599
 ] 

David Ciemiewicz commented on PIG-729:
--

I've been through this battle before.  And I write LOTS of Pig scripts.

Here's what I want:

1) Use default parallelism of 1 reducer.  BUT WARN ME that I've got a default 
parallelism of 1 reducer. (I'd actually prefer whatever works on a single 
node).

2) Allow me a command line option such as -parallel # or -mappers # -reducers #.

3) Allow me a set parameter inside my Pig scripts such as:

set parallel #
set mappers #
set reducers #

4) DO NOT require me to add a PARALLEL clause to each and every one of my 
reducer statements.
PARALLEL clauses are a code maintenance nightmare. 
Sometimes the grid is fat on available nodes and so I want to take advantage of 
this and run my job across as many nodes as possible.
Sometimes the grid is scarce on available nodes and so I want back off on the 
parallelism.

I DO NOT WANT to change EVERY PARALLEL clause in my code each time I run my 
script.
I DO NOT WANT to change parameter values for the PARALLEL clause each time I 
run my script.

I really, really, really want to make this a run-time decision on the execution 
of the script at the time that I invoke the script and I want this to be the 
default behavior in Pig.
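
To make the request concrete, here is a sketch using the hypothetical syntax 
from points 2) and 3) above (neither the set names nor the command line option 
is implemented; they are placeholders):

{code}
set reducers 40

B = group A by key;      -- would run with 40 reducers, no PARALLEL clause
C = order B by group;    -- the same script-wide setting would apply
{code}

or, at invocation time, something like: pig -reducers 40 myscript.pig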

 Use of default parallelism
 --

 Key: PIG-729
 URL: https://issues.apache.org/jira/browse/PIG-729
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.1
 Environment: Hadoop 0.20
Reporter: Santhosh Srinivasan
 Fix For: 0.2.1


 Currently, if the user does not specify the number of reduce slots using the 
 parallel keyword, Pig lets Hadoop decide on the default number of reducers. 
 This model worked well with dynamically allocated clusters using HOD and for 
 static clusters where the default number of reduce slots was explicitly set. 
 With Hadoop 0.20, a single static cluster will be shared amongst a number of 
 queues. As a result, a common scenario is to end up with default number of 
 reducers set to one (1).
 When users migrate to Hadoop 0.20, they might see a dramatic change in the 
 performance of their queries if they had not used the parallel keyword to 
 specify the number of reducers. In order to mitigate such circumstances, Pig 
 can support one of the following:
 1. Specify a default parallelism for the entire script.
 This option will allow users to use the same parallelism for all operators 
 that do not have the explicit parallel keyword. This will ensure that the 
 scripts utilize more reducers than the default of one reducer. On the down 
 side, due to data transformations, usually operations that are performed 
 towards the end of the script will need smaller number of reducers compared 
 to the operators that appear at the beginning of the script.
 2. Display a warning message for each reduce side operator that does not have 
 the explicit parallel keyword. Proceed with the execution.
 3. Display an error message indicating the operator that does not have the 
 explicit use of the parallel keyword. Stop the execution.
 Other suggestions/thoughts/solutions are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-745) Please add DataTypes.toString() conversion function

2009-04-02 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694893#action_12694893
 ] 

David Ciemiewicz commented on PIG-745:
--

Actually, the proposed function DataTypes.toString() is the following:
{code}
public static String toString(Object o) throws ExecException {
try {
switch (findType(o)) {
                case BOOLEAN: if (((Boolean)o) == true) return "1"; else return "0";
case BYTE: return ((Byte)o).toString();
case INTEGER: return ((Integer)o).toString();
case LONG: return ((Long)o).toString();
case FLOAT: return ((Float)o).toString();
case DOUBLE: return ((Double)o).toString();
case BYTEARRAY: return ((DataByteArray)o).toString();
case CHARARRAY: return (String)o;
case NULL: return null;
case MAP:
case TUPLE:
case BAG:
case UNKNOWN:
default:
int errCode = 1071;
                        String msg = "Cannot convert a " + findTypeName(o) + " to a String";
throw new ExecException(msg, errCode, 
PigException.INPUT);
}
}
catch (ExecException ee) { throw ee; }
catch (Exception e) {
                int errCode = 2054; String msg = "Internal error. Could not convert " + o + " to String.";
throw new ExecException(msg, errCode, PigException.BUG);
}
}
{code}

 Please add DataTypes.toString() conversion function
 ---

 Key: PIG-745
 URL: https://issues.apache.org/jira/browse/PIG-745
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz

 I'm doing some work in string manipulation UDFs and I've found that it would 
 be very convenient if I could always convert the argument to a chararray 
 (internally a Java String).
 For example TOLOWERCASE(arg) shouldn't really care whether arg is a 
 bytearray, chararray, int, long, double, or float, it should be treated as a 
 string and operated on.
 The simplest and most foolproof method would be if the DataTypes added a 
 static function of  DataTypes.toString which did all of the argument type 
 checking and provided consistent translation.
 I believe that this function might be coded as:
 public static String toString(Object o) throws ExecException {
 try {
   switch (findType(o)) {
   case BOOLEAN:
   if (((Boolean)o) == true) return "1";
   else return "0";
   case BYTE:
   return ((Byte)o).toString();
   case INTEGER:
   return ((Integer)o).toString();
   case LONG:
   return ((Long)o).toString();
   case FLOAT:
   return ((Float)o).toString();
   case DOUBLE:
   return ((Double)o).toString();
   case BYTEARRAY:
   return ((DataByteArray)o).toString();
   case CHARARRAY:
   return (String)o;
   case NULL:
   return null;
   case MAP:
   case TUPLE:
   case BAG:
   case UNKNOWN:
   default:
   int errCode = 1071;
   String msg = "Cannot convert a " + findTypeName(o) + " to a String";
   throw new ExecException(msg, errCode, 
 PigException.INPUT);
   }
   } catch (ExecException ee) {
   throw ee;
   } catch (Exception e) {
   int errCode = 2054;
   String msg = "Internal error. Could not convert " + o + " to String.";
   throw new ExecException(msg, errCode, PigException.BUG);
   }
 }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-746) Works in --exectype local, fails on grid - ERROR 2113: SingleTupleBag should never be serialized

2009-04-02 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695198#action_12695198
 ] 

David Ciemiewicz commented on PIG-746:
--

I'd still like to use the combiner in other instances in my combined Pig 
scripts (I concatenate several Pig scripts together to create compound Pig 
scripts).

It would be nice if Pig had a per statement option to turn off or force on the 
combiner.
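
(As an aside, a script-wide switch would presumably look something like the 
sketch below; pig.exec.nocombiner is the property name later Pig releases use 
for this, so treat its availability here as an assumption. It still would not 
give the per-statement control requested above.)

{code}
set pig.exec.nocombiner true;
{code}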

In the meantime, I discovered a feature (flaw?) in Pig that turns off the 
combiner - performing a scalar operation (such as +0L) on the Algebraic 
aggregation function:

{code}
D = foreach B generate
        group,
        SUM(A.matched) + 0L as matchedcount, -- the +0L flaw turns off the combiner
        A;
describe D;
{code}

I have tried this workaround and it works, at least in the current version of 
Pig, until someone figures out how to permit use of the combiner for combined 
Algebraic and scalar operations.

 Works in --exectype local, fails on grid - ERROR 2113: SingleTupleBag should 
 never be serialized
 

 Key: PIG-746
 URL: https://issues.apache.org/jira/browse/PIG-746
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz

 The script below works on Pig 2.0 local mode but fails when I run the same 
 program on the grid.
 I was attempting to create a workaround for PIG-710.
 Here's the error:
 {code}
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2113: 
 SingleTupleBag should never be serialized
 or serialized.
 at org.apache.pig.data.SingleTupleBag.write(SingleTupleBag.java:129)
 at 
 org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:147)
 at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291)
 at 
 org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:83)
 at
 org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
 at
 org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:439)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:101)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:219)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:208)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:86)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
 {code}
 Here's the program:
 {code}
 A = load 'filterbug.data' using PigStorage() as ( id, str );
 A = foreach A generate
 id,
 str,
 (
 str matches 'hello' or
 str matches 'hello'
 ? 1 : 0
 )   as matched;
 describe A;
 B = group A by ( id );
 describe B;
 D = foreach B generate
 group,
 SUM(A.matched)  as matchedcount,
 A;
 describe D;
 E = filter D by matchedcount > 0;
 describe E;
 F = foreach E generate
 FLATTEN(A);
 describe F;
 dump F;
 {code}
 Here's the data filterbug.data
 {code}
 a   hello
 a   goodbye
 b   goodbye
 c   hello
 c   hello
 c   hello
 e   what
 {code}
   

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-747) Logical to Physical Plan Translation fails when temporary alias are created within foreach

2009-04-02 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695201#action_12695201
 ] 

David Ciemiewicz commented on PIG-747:
--

Another workaround is to split this into a chain of foreach statements:

{code}
B = foreach A generate
*,
(double)col1 / (double)col2 as d,
(double)col3 / (double)col2 as e;

B = foreach B generate
e - d * d as newcol;

dump B;
{code}

 Logical to Physical Plan Translation fails when temporary alias are created 
 within foreach
 --

 Key: PIG-747
 URL: https://issues.apache.org/jira/browse/PIG-747
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Viraj Bhat
 Fix For: 0.3.0

 Attachments: physicalplan.txt, physicalplanprob.pig


 Consider a the pig script which calculates a new column F inside the foreach 
 as:
 {code}
 A = load 'physicalplan.txt' as (col1,col2,col3);
 B = foreach A {
D = col1/col2;
E = col3/col2;
F = E - (D*D);
generate
F as newcol;
 };
 dump B;
 {code}
 This gives the following error:
 ===
 Caused by: 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException:
  ERROR 2015: Invalid physical operators in the physical plan
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:377)
 at 
 org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:63)
 at 
 org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:29)
 at 
 org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:68)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:908)
 at 
 org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:122)
 at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:41)
 at 
 org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
 at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:246)
 ... 10 more
 Caused by: org.apache.pig.impl.plan.PlanException: ERROR 0: Attempt to give 
 operator of type 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide
  multiple outputs.  This operator does not support multiple outputs.
 at 
 org.apache.pig.impl.plan.OperatorPlan.connect(OperatorPlan.java:158)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan.connect(PhysicalPlan.java:89)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:373)
 ... 19 more
 ===

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-745) Please add DataTypes.toString() conversion function

2009-04-01 Thread David Ciemiewicz (JIRA)
Please add DataTypes.toString() conversion function
---

 Key: PIG-745
 URL: https://issues.apache.org/jira/browse/PIG-745
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz


I'm doing some work in string manipulation UDFs and I've found that it would be 
very convenient if I could always convert the argument to a chararray 
(internally a Java String).

For example TOLOWERCASE(arg) shouldn't really care whether arg is a bytearray, 
chararray, int, long, double, or float, it should be treated as a string and 
operated on.
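
As a sketch of the usage this would enable (util.TOLOWERCASE here is a 
hypothetical UDF built on the proposed DataTypes.toString(), and the input 
columns are made up):

{code}
A = load 'data' using PigStorage() as ( s: chararray, n: int, d: double );

-- each argument would be coerced through DataTypes.toString() inside the UDF,
-- so all three calls behave as string operations
B = foreach A generate util.TOLOWERCASE(s), util.TOLOWERCASE(n), util.TOLOWERCASE(d);
{code}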

The simplest and most foolproof method would be if the DataTypes added a static 
function of  DataTypes.toString which did all of the argument type checking and 
provided consistent translation.

I believe that this function might be coded as:

{code}
public static String toString(Object o) throws ExecException {
try {
switch (findType(o)) {
case BOOLEAN:
if (((Boolean)o) == true) return "1";
else return "0";

case BYTE:
return ((Byte)o).toString();

case INTEGER:
return ((Integer)o).toString();

case LONG:
return ((Long)o).toString();

case FLOAT:
return ((Float)o).toString();

case DOUBLE:
return ((Double)o).toString();

case BYTEARRAY:
return ((DataByteArray)o).toString();

case CHARARRAY:
return (String)o;

case NULL:
return null;

case MAP:
case TUPLE:
case BAG:
case UNKNOWN:
default:
int errCode = 1071;
String msg = "Cannot convert a " + findTypeName(o) + " to a String";
throw new ExecException(msg, errCode, 
PigException.INPUT);
}
} catch (ExecException ee) {
throw ee;
} catch (Exception e) {
int errCode = 2054;
String msg = "Internal error. Could not convert " + o + " to String.";
throw new ExecException(msg, errCode, PigException.BUG);
}
}
{code}



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-741) Add LIMIT as a statement that works in nested FOREACH

2009-03-31 Thread David Ciemiewicz (JIRA)
Add LIMIT as a statement that works in nested FOREACH
-

 Key: PIG-741
 URL: https://issues.apache.org/jira/browse/PIG-741
 Project: Pig
  Issue Type: New Feature
Reporter: David Ciemiewicz


I'd like to compute the top 10 results in each group.

The natural way to express this in Pig would be:

{code}
A = load '...' using PigStorage() as (
date: int,
count: int,
url: chararray
);

B = group A by ( date );

C = foreach B {
D = order A by count desc;
E = limit D 10;
generate
FLATTEN(E);
};

dump C;
{code}

Yeah, I could write a UDF / PiggyBank function to take the top n results. But 
since LIMIT already exists as a statement, it seems like it should also work in 
the nested foreach context.

Example workaround code.

{code}
C = foreach B {
D = order A by count desc;
E = util.TOP(D, 10);
generate
FLATTEN(E);
};

dump C;
{code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-710) Filtering bag in nested foreach does not produce expected results

2009-03-09 Thread David Ciemiewicz (JIRA)
Filtering bag in nested foreach does not produce expected results
-

 Key: PIG-710
 URL: https://issues.apache.org/jira/browse/PIG-710
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz


I have an idiom I used to use in older versions of pig (prior to types branch) 
which would group into a collection and then filter the output if any of the 
collection contained a particular string.

This relies on FILTER statements within a FOREACH ... { ... GENERATE ... } 
statement.

ORDER ... BY in the FOREACH ... { ... GENERATE ... } statement does not seem to 
have a problem so it seems to be something isolated to the FILTER.

{code}
A = load 'filterbug.data' using PigStorage() as ( id, str );

B = group A by ( id );
describe B;
dump B;

D = foreach B generate
group,
COUNT(A),
A.str;
describe D;
dump D;

C = foreach B {
D = order A by str;
matchedcount = COUNT(D);
generate
group,
matchedcount as matchedcount,
D.str;
};
describe C;
dump C;

Cfiltered = foreach B {
D = filter A by (
str matches 'hello'
);
matchedcount = COUNT(D);
generate
group,
matchedcount as matchedcount,
A.str;
};
describe Cfiltered;
dump Cfiltered;
{code}

Here's the output:

{code}
-bash-3.00$ pig -exectype local -latest filterbug.pig
USING: /grid/0/gs/pig/current

B: {group: bytearray,A: {id: bytearray,str: bytearray}}
2009-03-10 03:14:14,838 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-03-10 03:14:14,839 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(a,{(a,hello),(a,goodbye)})
(b,{(b,goodbye)})
(c,{(c,hello),(c,hello),(c,hello)})
(d,{(d,what)})

D: {group: bytearray,long,str: {str: bytearray}}
2009-03-10 03:14:14,920 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-03-10 03:14:14,920 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(a,2L,{(hello),(goodbye)})
(b,1L,{(goodbye)})
(c,3L,{(hello),(hello),(hello)})
(d,1L,{(what)})

C: {group: bytearray,matchedcount: long,str: {str: bytearray}}
2009-03-10 03:14:14,985 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-03-10 03:14:14,985 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(a,2L,{(goodbye),(hello)})
(b,1L,{(goodbye)})
(c,3L,{(hello),(hello),(hello)})
(d,1L,{(what)})
2009-03-10 03:14:15,018 [main] WARN  org.apache.pig.PigServer - Encountered 
Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).

Cfiltered: {group: bytearray,matchedcount: long,str: {str: bytearray}}
2009-03-10 03:14:15,044 [main] WARN  org.apache.pig.PigServer - Encountered 
Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
2009-03-10 03:14:15,057 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2009-03-10 03:14:15,057 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(a,1L,{(hello),(goodbye)})
{code}

What I expect for the output of Cfiltered is actually:

{code}
(a,1L,{(hello),(goodbye)})
(b,0L,{(goodbye)})
(c,3L,{(hello),(hello),(hello)})
(d,0L,{(what)})
{code}


The data file is:

{code}
a   hello
a   goodbye
b   goodbye
c   hello
c   hello
c   hello
d   what
{code}



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-574) run command for grunt

2009-02-11 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12672655#action_12672655
 ] 

David Ciemiewicz commented on PIG-574:
--

Thanks!

This will make iterative development so much faster and less painful than 
preallocating a HOD subcluster and then forgetting to delete it.

 run command for grunt
 -

 Key: PIG-574
 URL: https://issues.apache.org/jira/browse/PIG-574
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Reporter: David Ciemiewicz
Priority: Minor
 Attachments: run_command.patch, run_command_params.patch


 This is a request for a run file command in grunt which will read a script 
 from the local file system and execute the script interactively while in the 
 grunt shell.
 One of the things that slows down iterative development of large, complicated 
 Pig scripts that must operate on hadoop fs data is that the edit, run, debug 
 cycle is slow because I must wait to allocate a Hadoop-on-Demand (hod) 
 cluster for each iteration.  I would prefer not to preallocate a cluster of 
 nodes (though I could).
 Instead, I'd like to have one window open and edit my Pig script using vim or 
 emacs, write it, and then type run myscript.pig at the grunt shell until I 
 get things right.
 I'm used to doing similar things with Oracle, MySQL, and R. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-600) PiggyBank compilation instructions don't work

2009-01-07 Thread David Ciemiewicz (JIRA)
PiggyBank compilation instructions don't work
-

 Key: PIG-600
 URL: https://issues.apache.org/jira/browse/PIG-600
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: types_branch
Reporter: David Ciemiewicz


I know that PiggyBank is provided as-is, but the instructions are incomplete; 
they should include all of the steps required to compile PiggyBank.

http://wiki.apache.org/pig/PiggyBank

I checked out the types branch version of PiggyBank by modifying the 
instructions to check out:

svn co 
http://svn.apache.org/repos/asf/hadoop/pig/branches/types/contrib/piggybank/

At step 2 it says:

To build a jar file that contains all available user defined functions (UDFs), 
please follow the steps:

1. Checkout UDF code: svn co 
http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank
2. Build the jar file: from trunk/contrib/piggybank/java directory run ant. 
This will generate piggybank.jar in the same directory.


So I went into the piggybank/java directory, ran ant, and got the 
following errors:

{code}
-bash-3.00$ ant
Buildfile: build.xml

init:

compile:
 [echo]  *** Compiling Pig UDFs ***
[javac] Compiling 70 source files to 
/homes/ciemo/piggybank/java/build/classes
[javac] 
/homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:25:
 cannot find symbol
[javac] symbol  : class EvalFunc
[javac] location: package org.apache.pig
[javac] import org.apache.pig.EvalFunc;
[javac]  ^
[javac] 
/homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:26:
 cannot find symbol
[javac] symbol  : class FuncSpec
[javac] location: package org.apache.pig
[javac] import org.apache.pig.FuncSpec;
[javac]  ^
[javac] 
/homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:27:
 package org.apache.pig.data does not exist
[javac] import org.apache.pig.data.Tuple;
[javac]   ^
[javac] 
/homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:28:
 package org.apache.pig.impl.logicalLayer.schema does not exist
[javac] import org.apache.pig.impl.logicalLayer.schema.Schema;
[javac]   ^
[javac] 
/homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:29:
 package org.apache.pig.data does not exist
[javac] import org.apache.pig.data.DataType;
[javac]   ^
[javac] 
/homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:30:
 package org.apache.pig.impl.logicalLayer does not exist
[javac] import org.apache.pig.impl.logicalLayer.FrontendException;
[javac]^
[javac] 
/homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:31:
 package org.apache.pig.impl.util does not exist
[javac] import org.apache.pig.impl.util.WrappedIOException;
[javac]^
[javac] 
/homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:61:
 cannot find symbol
[javac] symbol: class EvalFunc
[javac] public class ABS extends EvalFunc<Double>{
[javac]  ^
[javac] 
/homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:67:
 cannot find symbol
[javac] symbol  : class Tuple
[javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
[javac] public Double exec(Tuple input) throws IOException {
[javac]^
[javac] 
/homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:85:
 cannot find symbol
[javac] symbol  : class Schema
[javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
[javac] public Schema outputSchema(Schema input) {
[javac]^
[javac] 
/homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:85:
 cannot find symbol
[javac] symbol  : class Schema
[javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
[javac] public Schema outputSchema(Schema input) {
[javac]^
[javac] 
/homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:93:
 cannot find symbol
[javac] symbol  : class FuncSpec
[javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
[javac] public List<FuncSpec> getArgToFuncMapping() throws 
FrontendException {
[javac] ^
[javac] 
/homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:93:
 cannot 

[jira] Commented: (PIG-600) PiggyBank compilation instructions don't work

2009-01-07 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12661641#action_12661641
 ] 

David Ciemiewicz commented on PIG-600:
--

I think the problem is this ... the build.xml file is looking for pig.jar 
several directories up:

<property name="pigjar" value="../../../pig.jar" />

The thing is, I'm relying on another build of pig and not the whole pig 
directory.

If you take the instructions for PiggyBank literally (as I did) you will not 
get a successful build.
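
A workaround sketch (assuming the usual Ant behavior that properties set on 
the command line override those in build.xml; the path is a placeholder):

{code}
-bash-3.00$ ant -Dpigjar=/path/to/your/pig.jar
{code}

Alternatively, copy a built pig.jar to the ../../../pig.jar location the build 
file expects.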

 PiggyBank compilation instructions don't work
 -

 Key: PIG-600
 URL: https://issues.apache.org/jira/browse/PIG-600
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: types_branch
Reporter: David Ciemiewicz

 I know that PiggyBank is provided as-is, but the instructions are incomplete; 
 they should include all of the steps required to compile PiggyBank.
 http://wiki.apache.org/pig/PiggyBank
 I checked out the types branch version of PiggyBank by modifying the 
 instructions to check out:
 svn co 
 http://svn.apache.org/repos/asf/hadoop/pig/branches/types/contrib/piggybank/
 At step 2 it says:
 To build a jar file that contains all available user defined functions 
 (UDFs), please follow the steps:
 1. Checkout UDF code: svn co 
 http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/piggybank
 2. Build the jar file: from trunk/contrib/piggybank/java directory run ant. 
 This will generate piggybank.jar in the same directory.
 So I went into the piggybank/java directory, ran ant, and got the 
 following errors:
 {code}
 -bash-3.00$ ant
 Buildfile: build.xml
 init:
 compile:
  [echo]  *** Compiling Pig UDFs ***
 [javac] Compiling 70 source files to 
 /homes/ciemo/piggybank/java/build/classes
 [javac] 
 /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:25:
  cannot find symbol
 [javac] symbol  : class EvalFunc
 [javac] location: package org.apache.pig
 [javac] import org.apache.pig.EvalFunc;
 [javac]  ^
 [javac] 
 /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:26:
  cannot find symbol
 [javac] symbol  : class FuncSpec
 [javac] location: package org.apache.pig
 [javac] import org.apache.pig.FuncSpec;
 [javac]  ^
 [javac] 
 /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:27:
  package org.apache.pig.data does not exist
 [javac] import org.apache.pig.data.Tuple;
 [javac]   ^
 [javac] 
 /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:28:
  package org.apache.pig.impl.logicalLayer.schema does not exist
 [javac] import org.apache.pig.impl.logicalLayer.schema.Schema;
 [javac]   ^
 [javac] 
 /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:29:
  package org.apache.pig.data does not exist
 [javac] import org.apache.pig.data.DataType;
 [javac]   ^
 [javac] 
 /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:30:
  package org.apache.pig.impl.logicalLayer does not exist
 [javac] import org.apache.pig.impl.logicalLayer.FrontendException;
 [javac]^
 [javac] 
 /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:31:
  package org.apache.pig.impl.util does not exist
 [javac] import org.apache.pig.impl.util.WrappedIOException;
 [javac]^
 [javac] 
 /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:61:
  cannot find symbol
 [javac] symbol: class EvalFunc
 [javac] public class ABS extends EvalFunc<Double>{
 [javac]  ^
 [javac] 
 /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:67:
  cannot find symbol
 [javac] symbol  : class Tuple
 [javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
 [javac] public Double exec(Tuple input) throws IOException {
 [javac]^
 [javac] 
 /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:85:
  cannot find symbol
 [javac] symbol  : class Schema
 [javac] location: class org.apache.pig.piggybank.evaluation.math.ABS
 [javac] public Schema outputSchema(Schema input) {
 [javac]^
 [javac] 
 /homes/ciemo/piggybank/java/src/main/java/org/apache/pig/piggybank/evaluation/math/ABS.java:85:
  cannot find symbol

[jira] Created: (PIG-598) Parameter substitution ($PARAMETER) should not be performed in comments

2009-01-05 Thread David Ciemiewicz (JIRA)
Parameter substitution ($PARAMETER) should not be performed in comments
---

 Key: PIG-598
 URL: https://issues.apache.org/jira/browse/PIG-598
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: types_branch
Reporter: David Ciemiewicz
Priority: Minor


Compiling the following code example will generate an error that 
$NOT_A_PARAMETER is an Undefined Parameter.

This is problematic as sometimes you want to comment out parts of your code, 
including parameters so that you don't have to define them.

Thus I think it would be really good if parameter substitution was not 
performed in comments.

{code}
-- $NOT_A_PARAMETER
{code}

{code}
-bash-3.00$ pig -exectype local -latest comment.pig
USING: /grid/0/gs/pig/current
java.lang.RuntimeException: Undefined parameter : NOT_A_PARAMETER
at 
org.apache.pig.tools.parameters.PreprocessorContext.substitute(PreprocessorContext.java:221)
at 
org.apache.pig.tools.parameters.ParameterSubstitutionPreprocessor.parsePigFile(ParameterSubstitutionPreprocessor.java:106)
at 
org.apache.pig.tools.parameters.ParameterSubstitutionPreprocessor.genSubstitutedFile(ParameterSubstitutionPreprocessor.java:86)
at org.apache.pig.Main.runParamPreprocessor(Main.java:394)
at org.apache.pig.Main.main(Main.java:296)
{code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-596) Anonymous tuples in bags create ParseExceptions

2009-01-01 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660201#action_12660201
 ] 

David Ciemiewicz commented on PIG-596:
--

Note that specifying the tuple without the tuple designator doesn't work either.

{code}
One = load 'one.txt' using PigStorage() as ( one: int );

LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: 
tuple ( a, b ) };

AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { ( a, b ) };

Tuples = union LabelledTupleInBag, AnonymousTupleInBag;

dump Tuples;
{code}

 Anonymous tuples in bags create ParseExceptions
 ---

 Key: PIG-596
 URL: https://issues.apache.org/jira/browse/PIG-596
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: David Ciemiewicz

 {code}
 One = load 'one.txt' using PigStorage() as ( one: int );
 LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: 
 tuple ( a, b ) };
 AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { tuple ( a, 
 b ) }; -- Anonymous tuple creates bug
 Tuples = union LabelledTupleInBag, AnonymousTupleInBag;
 dump Tuples;
 {code}
 java.io.IOException: Encountered { tuple at line 6, column 66.
 Was expecting one of:
 parallel ...
 ; ...
 , ...
 : ...
 ( ...
 { IDENTIFIER ...
 { } ...
 [ ...
 
 at org.apache.pig.PigServer.parseQuery(PigServer.java:298)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:263)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:439)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:249)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
 at org.apache.pig.Main.main(Main.java:306)
 Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: 
 Encountered { tuple at line 6, column 66.
 Why can't there be an anonymous tuple at the top level of a bag?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-596) Anonymous tuples in bags create ParseExceptions

2009-01-01 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12660202#action_12660202
 ] 

David Ciemiewicz commented on PIG-596:
--

The reason I think it is important to be able to create anonymous tuples is 
that the tuples are anonymous in the LOAD statements. If you FLATTEN a bag 
such as mybag, any intermediate tuple label is immediately lost and the 
results of the flatten are mybag::a and mybag::b, not 
mybag::tuplelabel::a and mybag::tuplelabel::b.
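
A sketch of the naming behavior described above (the describe output shown is 
what this comment claims, not something verified here):

{code}
One = load 'one.txt' using PigStorage() as ( one: int );

LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: 
tuple ( a, b ) };

Flat = foreach LabelledTupleInBag generate FLATTEN(mybag);
describe Flat;
-- per the comment above, the flattened fields are named
-- mybag::a and mybag::b, not mybag::tuplelabel::a and mybag::tuplelabel::b
{code}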

 Anonymous tuples in bags create ParseExceptions
 ---

 Key: PIG-596
 URL: https://issues.apache.org/jira/browse/PIG-596
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: David Ciemiewicz

 {code}
 One = load 'one.txt' using PigStorage() as ( one: int );
 LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: 
 tuple ( a, b ) };
 AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { tuple ( a, 
 b ) }; -- Anonymous tuple creates bug
 Tuples = union LabelledTupleInBag, AnonymousTupleInBag;
 dump Tuples;
 {code}
 java.io.IOException: Encountered { tuple at line 6, column 66.
 Was expecting one of:
 parallel ...
 ; ...
 , ...
 : ...
 ( ...
 { IDENTIFIER ...
 { } ...
 [ ...
 
 at org.apache.pig.PigServer.parseQuery(PigServer.java:298)
 at org.apache.pig.PigServer.registerQuery(PigServer.java:263)
 at 
 org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:439)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:249)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
 at org.apache.pig.Main.main(Main.java:306)
 Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: 
 Encountered { tuple at line 6, column 66.
 Why can't there be an anonymous tuple at the top level of a bag?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-596) Anonymous tuples in bags create ParseExceptions

2008-12-31 Thread David Ciemiewicz (JIRA)
Anonymous tuples in bags create ParseExceptions
---

 Key: PIG-596
 URL: https://issues.apache.org/jira/browse/PIG-596
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: David Ciemiewicz


{code}
One = load 'one.txt' using PigStorage() as ( one: int );

LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: 
tuple ( a, b ) };

AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { tuple ( a, b 
) }; -- Anonymous tuple creates bug

Tuples = union LabelledTupleInBag, AnonymousTupleInBag;

dump Tuples;
{code}

java.io.IOException: Encountered { tuple at line 6, column 66.
Was expecting one of:
parallel ...
; ...
, ...
: ...
( ...
{ IDENTIFIER ...
{ } ...
[ ...

at org.apache.pig.PigServer.parseQuery(PigServer.java:298)
at org.apache.pig.PigServer.registerQuery(PigServer.java:263)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:439)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:249)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
at org.apache.pig.Main.main(Main.java:306)
Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered 
{ tuple at line 6, column 66.

Why can't there be an anonymous tuple at the top level of a bag?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-579) Adding newlines to format foreach statement with constants causes parse errors

2008-12-24 Thread David Ciemiewicz (JIRA)
Adding newlines to format foreach statement with constants causes parse errors
--

 Key: PIG-579
 URL: https://issues.apache.org/jira/browse/PIG-579
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: types_branch
Reporter: David Ciemiewicz


The following code example fails with parse errors on step D:

{code}
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);

B = LOAD 'voter_data' AS (name: chararray, age: int, registration: chararray, 
contributions: float);

C = COGROUP A BY name, B BY name;

D = FOREACH C GENERATE
group,
flatten((not IsEmpty(A) ? A : (bag{tuple(chararray, int, 
float)}){(null, null, null)})),
flatten((not IsEmpty(B) ? B : (bag{tuple(chararray, int, chararray, 
float)}){(null,null,null, null)}));

dump D;
{code}

I get the parse error:
Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered 
not IsEmpty ( A ) ? A : ( bag { tuple ( chararray , int , float ) } ; at line 
9, column 18.
Was expecting one of:
( ...
- ...
tuple ...
bag ...
map ...
int ...
long ...
...
However, if I simply remove the newlines from statement D and make it the 
following, the script parses without error:

{code}
D = FOREACH C GENERATE group, flatten((not IsEmpty(A) ? A : 
(bag{tuple(chararray, int, float)}){(null, null, null)})), flatten((not 
IsEmpty(B) ? B : (bag{tuple(chararray, int, chararray, 
float)}){(null,null,null, null)}));
{code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-574) run command for grunt

2008-12-22 Thread David Ciemiewicz (JIRA)
run command for grunt
-

 Key: PIG-574
 URL: https://issues.apache.org/jira/browse/PIG-574
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Reporter: David Ciemiewicz
Priority: Minor


This is a request for a run file command in grunt which will read a script 
from the local file system and execute the script interactively while in the 
grunt shell.

One of the things that slows down iterative development of large, complicated 
Pig scripts that must operate on hadoop fs data is that the edit, run, debug 
cycle is slow because I must wait to allocate a Hadoop-on-Demand (hod) cluster 
for each iteration.  I would prefer not to preallocate a cluster of nodes 
(though I could).

Instead, I'd like to have one window open and edit my Pig script using vim or 
emacs, write it, and then type run myscript.pig at the grunt shell until I 
get things right.

I'm used to doing similar things with Oracle, MySQL, and R. 



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-575) Please extend FieldSchema class with getSchema() member function for iterating over complex Schemas in Pig UDF outputSchema

2008-12-22 Thread David Ciemiewicz (JIRA)
Please extend FieldSchema class with getSchema() member function for iterating 
over complex Schemas in Pig UDF outputSchema
---

 Key: PIG-575
 URL: https://issues.apache.org/jira/browse/PIG-575
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz


I have discovered that it is not possible to recurse through parts of the input 
Schema in the UDF outputSchema function.

I have a function that operates on an input bag of tuples and then creates 
sequential pairings of the rows.

A = foreach One generate { 
( 1, 'a' ),
( 2, 'b' )
}   as  bag { tuple ( seq: int, value: chararray ) };

The output of the PAIRS(A) should be:

{
( ( 1, a ), ( 2, b ) ),
( ( 2, b ), ( null, null ) )
}

The default output schema for the function should be:

bag { tuple ( tuple ( order: int, value: chararray ), tuple ( order: int, 
value: chararray ) ) }

The problem I have is that I'm not able to recurse into the internal Schema of 
the FieldSchema in my outputSchema function to get at the tuple within the 
input bag.

Here's my sample outputSchema for PAIRS:

public Schema outputSchema(Schema input) {
try {
System.out.println("input: " + input.toString());

Schema databagSchema = new Schema();
Schema tupleSchema = new Schema();

Schema inputDataBag = new Schema(input.getFields().get(0));
System.out.println("inputDataBag: " + 
input.getFields().get(0).toString());

//
//  RIGHT HERE IS WHERE I WANT TO DO inputDataBag.getFields.get(0).getSchema
//
Schema.FieldSchema inputTuple = inputDataBag.getFields().get(0);  // 
Here's where I want to say  
System.out.println("inputTuple: " + inputTuple.toString());

databagSchema.add(new Schema.FieldSchema(null, DataType.TUPLE));
System.out.println("databagSchema: " + databagSchema.toString());

return new Schema(
new Schema.FieldSchema(
getSchemaName( this.getClass().getName().toLowerCase(), input),
databagSchema,
DataType.BAG
)
);
} catch (Exception e) {
return null;
}
}

Here's the execution output from outputSchema:

input: {A: {seq: int,value: chararray},int,int}
inputDataBag: A: bag({seq: int,value: chararray})
inputTuple: A: bag({seq: int,value: chararray})   <= what I want to see is ( 
seq: int, value: chararray )
rowSchema: A: bag({seq: int,value: chararray})
rowSchema: A: bag({seq: int,value: chararray})


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-575) Please extend FieldSchema class with getSchema() member function for iterating over complex Schemas in Pig UDF outputSchema

2008-12-22 Thread David Ciemiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ciemiewicz updated PIG-575:
-

Component/s: impl
   Priority: Minor  (was: Major)

 Please extend FieldSchema class with getSchema() member function for 
 iterating over complex Schemas in Pig UDF outputSchema
 ---

 Key: PIG-575
 URL: https://issues.apache.org/jira/browse/PIG-575
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: David Ciemiewicz
Priority: Minor

 I have discovered that it is not possible to recurse through parts of the 
 input Schema in the UDF outputSchema function.
 I have a function that operates on an input bag of tuples and then creates 
 sequential pairings of the rows.
 A = foreach One generate { 
 ( 1, 'a' ),
 ( 2, 'b' )
 }   as  bag { tuple ( seq: int, value: chararray ) };
 The output of the PAIRS(A) should be:
 {
 ( ( 1, a ), ( 2, b ) ),
 ( ( 2, b ), ( null, null ) )
 }
 The default output schema for the function should be:
 bag { tuple ( tuple ( order: int, value: chararray ), tuple ( order: int, 
 value: chararray ) ) }
 The problem I have is that I'm not able to recurse into the internal Schema 
 of the FieldSchema in my outputSchema function to get at the tuple within the 
 input bag.
 Here's my sample outputSchema for PAIRS:
 public Schema outputSchema(Schema input) {
 try {
 System.out.println("input: " + input.toString());
 Schema databagSchema = new Schema();
 Schema tupleSchema = new Schema();
 Schema inputDataBag = new Schema(input.getFields().get(0));
 System.out.println("inputDataBag: " + 
 input.getFields().get(0).toString());
 //
 //  RIGHT HERE IS WHERE I WANT TO DO inputDataBag.getFields.get(0).getSchema
 //
 Schema.FieldSchema inputTuple = inputDataBag.getFields().get(0);  // 
 Here's where I want to say  
 System.out.println("inputTuple: " + inputTuple.toString());
 databagSchema.add(new Schema.FieldSchema(null, DataType.TUPLE));
 System.out.println("databagSchema: " + databagSchema.toString());
 return new Schema(
 new Schema.FieldSchema(
 getSchemaName( this.getClass().getName().toLowerCase(), 
 input),
 databagSchema,
 DataType.BAG
 )
 );
 } catch (Exception e) {
 return null;
 }
 }
 Here's the execution output from outputSchema:
 input: {A: {seq: int,value: chararray},int,int}
 inputDataBag: A: bag({seq: int,value: chararray})
 inputTuple: A: bag({seq: int,value: chararray})   <= what I want to see is ( 
 seq: int, value: chararray )
 rowSchema: A: bag({seq: int,value: chararray})
 rowSchema: A: bag({seq: int,value: chararray})

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.