[jira] Created: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path

2009-04-08 Thread David Ciemiewicz (JIRA)
UDFs should have API for transparently opening and reading files from HDFS or 
from local file system with only relative path


 Key: PIG-756
 URL: https://issues.apache.org/jira/browse/PIG-756
 Project: Pig
  Issue Type: Bug
Reporter: David Ciemiewicz


I have a utility function util.INSETFROMFILE() that I pass a file name during 
initialization.

{code}
define inQuerySet util.INSETFROMFILE('analysis/queries');
A = load 'logs' using PigStorage() as ( date int, query chararray );
B = filter A by inQuerySet(query);
{code}

This provides a computationally inexpensive way to effect map-side joins for 
small sets, and functions of this style also make it possible to encapsulate 
more complex matching rules.

For rapid development and debugging purposes, I want this code to run without 
modification both on my local file system (pig -exectype local) and on HDFS.

Pig needs to provide an API for UDFs that allows them to either:

1) know when they are in local or HDFS mode and let them open and read from 
files as appropriate, or
2) just provide a file name and read statements and have Pig transparently 
manage local or HDFS opens and reads for the UDF.

UDFs need to read configuration information off the filesystem, and the 
process is much simpler if one can just flip the -exectype local switch.
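Option 2 above could take the shape of a small path-qualifying helper: the UDF supplies only a relative name, and Pig (which knows the exec type) qualifies it before opening. The sketch below is purely illustrative; the `resolve` helper and the exec-type strings are assumptions, not an existing Pig API.

```java
import java.net.URI;

// Illustrative sketch of option 2: the UDF hands over only a relative
// path; the framework qualifies it against the current exec type, so
// the same script runs unchanged with -exectype local and on HDFS.
public class UdfPathResolver {

    // In a real implementation the exec type would come from Pig's
    // context rather than being passed in explicitly.
    public static String resolve(String execType, String path) {
        if (path.contains("://")) {
            return path; // already fully qualified; leave untouched
        }
        String scheme = "local".equals(execType) ? "file" : "hdfs";
        return URI.create(scheme + ":///" + path).toString();
    }
}
```

With this shape, util.INSETFROMFILE('analysis/queries') would open file:///analysis/queries locally and hdfs:///analysis/queries on the cluster, without the UDF itself branching on the mode.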




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path

2009-04-08 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697043#action_12697043
 ] 

David Ciemiewicz commented on PIG-756:
--

BTW, there used to be a mechanism to do this in early versions of Pig that was 
lost in the transition to the new execution system.





[jira] Commented: (PIG-724) Treating integers and strings in PigStorage

2009-04-08 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697054#action_12697054
 ] 

Alan Gates commented on PIG-724:


Currently Pig doesn't require that all keys and values in a map share the same 
type.  There is a proposal to change it so that key types can only be chararray 
(see PIG-734), as we don't see anyone using anything but chararray and the 
generality is causing us some other issues.  But we still wouldn't require that 
all values in a given map be of the same type.  Are you proposing allowing 
users to put a constraint on a given map so that all values in that particular 
map must be of that type?
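In Java terms (an analogy only, not Pig's internal representation), today's Pig map behaves like a `Map<String, Object>` whose values may mix types freely, while the constraint under discussion would behave like a `Map<String, Integer>`:

```java
import java.util.HashMap;
import java.util.Map;

public class MapTypeDemo {
    // Today: values in one map may mix types freely.
    public static Map<String, Object> mixedValueMap() {
        Map<String, Object> m = new HashMap<>();
        m.put("key01", 35);          // integer value
        m.put("key02", "value01");   // chararray value in the same map
        return m;
    }

    // With a per-map value-type constraint, every value shares one type.
    public static Map<String, Integer> constrainedValueMap() {
        Map<String, Integer> m = new HashMap<>();
        m.put("key01", 35);
        m.put("key02", 42);
        return m;
    }
}
```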

 Treating integers and strings in PigStorage
 ---

 Key: PIG-724
 URL: https://issues.apache.org/jira/browse/PIG-724
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.1
Reporter: Santhosh Srinivasan
 Fix For: 0.2.1


 Currently, PigStorage treats the materialized string "123" as an integer 
 with the value 123. If the user intended this to be the string "123", 
 PigStorage cannot deal with it. This reasoning also applies to doubles. Due 
 to this issue, for maps that contain values which are of the same type but 
 manifest the issue discussed at the beginning of the paragraph, Pig throws 
 its hands up at runtime. An example will help illustrate the problem.
 In the example below a sample row in the data (map.txt) contains the 
 following:
 [key01#35,key02#value01]
 When Pig tries to convert the stream to a map, it creates a Map<Object, 
 Object> where the key is a string and the value is an integer. Running the 
 script shown below results in a run-time error.
 {code}
 grunt> a = load 'map.txt' as (themap: map[]);
 grunt> b = filter a by (chararray)(themap#'key01') == 'hello';
 grunt> dump b;
 2009-03-18 15:19:03,773 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
 2009-03-18 15:19:28,797 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Map reduce job failed
 2009-03-18 15:19:28,817 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1081: Cannot cast to chararray. Expected bytearray but received: int
 {code}
 There are two ways to resolve this issue:
 1. Change the conversion routine for bytesToMap to return a map where the 
 value is a bytearray and not the actual type. This change breaks backward 
 compatibility.
 2. Introduce checks in POCast where conversions that are legal in the type 
 checking world are allowed, i.e., run-time checks will be made for 
 compatible casts. In the above example, an int can be converted to a 
 chararray and the cast will be made. If, on the other hand, it was a 
 chararray to int conversion, then an exception would be thrown.
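Option 2 can be sketched as a plain-Java runtime check that mirrors the type checker's rules: numeric values may be cast to chararray, while a chararray-to-int request raises an error. This is only an illustration of the rule, not the actual POCast code; the method names are hypothetical.

```java
public class RuntimeCastCheck {
    // Sketch of option 2: allow casts the type checker would allow
    // (numeric -> chararray), reject incompatible ones at runtime.
    public static String castToChararray(Object value) {
        if (value == null) return null;
        if (value instanceof String) return (String) value;
        if (value instanceof Integer || value instanceof Long
                || value instanceof Float || value instanceof Double) {
            return value.toString();
        }
        throw new ClassCastException(
            "Cannot cast " + value.getClass().getSimpleName() + " to chararray");
    }

    public static int castToInt(Object value) {
        if (value instanceof Integer) return (Integer) value;
        if (value instanceof String) {
            // chararray -> int is not a legal implicit cast; fail loudly.
            throw new ClassCastException("Cannot cast chararray to int");
        }
        throw new ClassCastException("Unsupported cast to int");
    }
}
```

Under this rule, the map value 35 from the example above casts cleanly to 'hello'-style chararray comparisons, while the reverse direction fails with an explicit error instead of ERROR 1081.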




[jira] Commented: (PIG-745) Please add DataTypes.toString() conversion function

2009-04-08 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697056#action_12697056
 ] 

Alan Gates commented on PIG-745:


I'm reviewing this patch.

 Please add DataTypes.toString() conversion function
 ---

 Key: PIG-745
 URL: https://issues.apache.org/jira/browse/PIG-745
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz
 Attachments: PIG-745.patch


 I'm doing some work in string manipulation UDFs and I've found that it would 
 be very convenient if I could always convert the argument to a chararray 
 (internally a Java String).
 For example TOLOWERCASE(arg) shouldn't really care whether arg is a 
 bytearray, chararray, int, long, double, or float, it should be treated as a 
 string and operated on.
 The simplest and most foolproof method would be if the DataTypes added a 
 static function of  DataTypes.toString which did all of the argument type 
 checking and provided consistent translation.
 I believe that this function might be coded as:
 {code}
 public static String toString(Object o) throws ExecException {
     try {
         switch (findType(o)) {
         case BOOLEAN:
             if (((Boolean)o) == true) return new String("1");
             else return new String("0");
         case BYTE:
             return ((Byte)o).toString();
         case INTEGER:
             return ((Integer)o).toString();
         case LONG:
             return ((Long)o).toString();
         case FLOAT:
             return ((Float)o).toString();
         case DOUBLE:
             return ((Double)o).toString();
         case BYTEARRAY:
             return ((DataByteArray)o).toString();
         case CHARARRAY:
             return (String)o;
         case NULL:
             return null;
         case MAP:
         case TUPLE:
         case BAG:
         case UNKNOWN:
         default:
             int errCode = 1071;
             String msg = "Cannot convert a " + findTypeName(o) + " to a String";
             throw new ExecException(msg, errCode, PigException.INPUT);
         }
     } catch (ExecException ee) {
         throw ee;
     } catch (Exception e) {
         int errCode = 2054;
         String msg = "Internal error. Could not convert " + o + " to String.";
         throw new ExecException(msg, errCode, PigException.BUG);
     }
 }
 {code}




Re: Ajax library for Pig

2009-04-08 Thread Alan Gates
Sorry if these are silly questions, but I'm not very familiar with 
some of these technologies.  So what you propose is that Pig would be 
installed on some dedicated server machine and a web server would be 
placed in front of it.  Then client libraries would be developed that 
made calls to the web server.  Would these client-side libraries 
include presentation in the browser, both for users submitting 
queries and receiving results?  Also, Pig currently does not have a 
server mode, so any web server would have to spin off threads that 
run a Pig job.


If the above is what you're proposing, I think it would be great.   
Opening up pig to more users by making it browser accessible would be  
nice.


Alan.

On Apr 3, 2009, at 5:36 AM, nitesh bhatia wrote:


Hi
Since pig is getting a lot of usage in industry and universities,
how about adding front-end support for Pig? The plan is to write a
jquery/dojo type of general JavaScript/AJAX library which can be used
with any server technology (php, jsp, asp, etc.) to call pig
functions over the web.

Direct Web Remoting (DWR- http://directwebremoting.org ), an open
source project at Java.net gives a functionality that allows
JavaScript in a browser to interact with Java on a server. Can we
write a JavaScript library exclusively for Pig using DWR? I am not
sure about licensing issues.

The major advantages I can point out are:
-Use of Pig over HTTP rather than SSH.
-User management will become easy, as this can be handled 
using any CMS.


--nitesh

--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

Life is never perfect. It just depends where you draw the line.

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun




[jira] Commented: (PIG-712) Need utilities to create schemas for bags and tuples

2009-04-08 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697075#action_12697075
 ] 

Alan Gates commented on PIG-712:


Jeff,

Thanks for the patch.  I'll take a look at this, but it may be tomorrow before 
I get to it.

 Need utilities to create schemas for bags and tuples
 

 Key: PIG-712
 URL: https://issues.apache.org/jira/browse/PIG-712
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Santhosh Srinivasan
Priority: Minor
 Fix For: 0.3.0

 Attachments: Pig_712_Patch_Merged.txt


 Pig should provide utilities to create bag and tuple schemas. Currently, 
 users return schemas in outputSchema method and end up with very verbose 
 boiler plate code. It will be very nice if Pig encapsulates the boiler plate 
 code in utility methods.
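As a sketch of the kind of helper being requested (the names and the string-based representation here are invented for illustration; the real utility would build Pig `Schema` objects, not strings), nested bag/tuple schema boilerplate could collapse into calls such as:

```java
// Hypothetical helper illustrating PIG-712's request: hide the verbose
// nested-schema boilerplate behind small utility methods. The real
// utility would return Pig Schema objects; this sketch only builds the
// equivalent schema strings to show the intended API shape.
public class SchemaUtil {

    public static String tupleSchema(String alias, String... fields) {
        return alias + ": tuple(" + String.join(", ", fields) + ")";
    }

    public static String bagSchema(String alias, String tupleSchema) {
        return alias + ": bag{" + tupleSchema + "}";
    }
}
```

A UDF's outputSchema method could then express "a bag of (x: int, y: chararray) tuples" in one line, e.g. bagSchema("b", tupleSchema("t", "x: int", "y: chararray")), instead of hand-assembling each nested FieldSchema.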




[jira] Commented: (PIG-745) Please add DataTypes.toString() conversion function

2009-04-08 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697094#action_12697094
 ] 

David Ciemiewicz commented on PIG-745:
--

Alan,

I realized several things.

1) The question of what to do about the BOOLEAN case.  My original suggestion 
was to convert the BOOLEAN case to "1" and "0", but in the patch I just used 
the Boolean.toString() function.  Not sure if that matters or not.

2) I didn't see other test cases for the other DataType.toInteger(), ... 
conversions so I didn't create one for DataType.toString().

3) We are just using the default conversion of Float.toString() and 
Double.toString().  I don't know if this is actually best since I don't know 
if these operations present the floating-point values in full precision or not. 
 At this point, it may not really matter so much as the primary reason for 
creating DataType.toString() is to allow string functions to operate on any 
data type (like in Perl) without generating cast errors.







[jira] Commented: (PIG-753) Do not support UDF not providing parameter

2009-04-08 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697108#action_12697108
 ] 

David Ciemiewicz commented on PIG-753:
--

I think Jeff means that Pig does not support UDFs without parameters, but 
should.

I agree.

 Do not support UDF not providing parameter
 --

 Key: PIG-753
 URL: https://issues.apache.org/jira/browse/PIG-753
 Project: Pig
  Issue Type: Improvement
Reporter: Jeff Zhang

 Pig does not support UDFs without parameters; it forces me to provide a 
 parameter. The following statement:
  B = FOREACH A GENERATE bagGenerator();
 will generate an error. I have to provide a parameter like the following:
  B = FOREACH A GENERATE bagGenerator($0);




[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer

2009-04-08 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697146#action_12697146
 ] 

David Ciemiewicz commented on PIG-697:
--

Some thoughts on optimization problems and patterns from SQL and coding Pig, 
and my desire for a higher level version of Pig than we have today.

I know this may come off as a distraction, but hopefully you'll have some time 
to hear me out.

After:

* a conversation with Santhosh about the SQL to Pig translation work 
* multiple issues I have encountered with nested foreach statements, including 
redundant function execution 
* nested FOREACH statement assignment computation bugs 
* hand coding chains of foreach statements so I can get the Algebraic combiner 
to kick in 
* hand coding chains of foreach statements and grouping statements rather than 
using a single statement

I think I might have stumbled on a potentially improved model for Pig to Pig 
execution plan generation:

{code}
High Level Pig to Low Level Pig translation
{code}

I think this would potentially benefit the SQL to Pig efforts and provide for 
programmer coding efficiency in Pig as well.

This will be a bit protracted, but I hope you have some time to consider it.

Take the following SQL idiom that the SQL to Pig translator will need to 
support:

{code}
select
    EXP(AVG(LN(time + 0.1))) as geomean_time
from
    events
where
    time is not null and
    time >= 0;
{code}

In high level pig, I have wanted to code this as
 
{code}
A = load 'events' using PigStorage() as ( time: int );
B = filter A by time is not null and time >= 0;
C = group B all;
D = foreach C generate EXP(AVG(LN(B.time+0.1))) as geomean_time;
{code}

In fact, this would seem to provide a nice translation path from SQL to low 
level pig via high level pig.

Unfortunately, this won't work.  We developers must write Pig scripts at a 
lower level and break all of this apart into various steps.

An additional issue is that, because of some, um, workarounds in the execution 
plan optimizations, the combiner won't kick in unless we take further steps.

So the most performant version of the desired pig script is the following 
really low level pig, where D is broken into 3 steps: merging one with B and 
keeping the remaining 2 steps as separate D steps:

 
{code}
A = load 'events' using PigStorage() as ( time: int );
B = filter A by time is not null and time >= 0;
B = foreach B generate LOG(time+0.1) as log_time;
C = group B all;
D = foreach C generate group, AVG(B.log_time) as mean_log_time;
-- note that the group alias is required for the Algebraic combiner to kick in
D = foreach D generate EXP(mean_log_time) as geomean_time;
{code}

If we can figure out how to translate SQL into this last low-level set of 
statements, why couldn't we or shouldn't we have high level pig as well and 
permit more efficient code writing and optimization?


Next example

I do a bunch of nested intermediate computations in a nested FOREACH statement:

{code}
C = foreach C {
    curr_mean_log_timetonextevent = curr_sum_log_timetonextevent / (double)count;
    curr_meansq_log_timetonextevent = curr_sumsq_log_timetonextevent / (double)count;
    curr_var_log_timetonextevent = curr_meansq_log_timetonextevent - (curr_mean_log_timetonextevent * curr_mean_log_timetonextevent);
    curr_sterr_log_timetonextevent = math.SQRT(curr_var_log_timetonextevent / (double)count);

    curr_geomean_timetonextevent = math.EXP(curr_mean_log_timetonextevent);
    curr_geosterr_timetonextevent = math.EXP(curr_sterr_log_timetonextevent);

    curr_mean_timetonextevent = curr_sum_timetonextevent / (double)count;
    curr_meansq_timetonextevent = curr_sumsq_timetonextevent / (double)count;
    curr_var_timetonextevent = curr_meansq_timetonextevent - (curr_mean_timetonextevent * curr_mean_timetonextevent);
    curr_sterr_timetonextevent = math.SQRT(curr_var_timetonextevent / count);

    generate
        ...
{code}

The code for nested statements in Pig has been particularly problematic and 
buggy including problems such as:

* redundant execution of functions such as SUM, AVG
* nested function problems
* mathematical operator problems (illustrated in this bug)
* no type propagation
* the need to use AS clauses to name nested alias assignments projected in the 
GENERATE clauses

What if, instead of trying to do all of these operations in some specialized 
execution code, this was treated as high level pig that translated all of 
these intermediate statements into two or more low level foreach expansions?

[jira] Created: (PIG-757) Using schemes in load and store paths

2009-04-08 Thread Gunther Hagleitner (JIRA)
Using schemes in load and store paths
-

 Key: PIG-757
 URL: https://issues.apache.org/jira/browse/PIG-757
 Project: Pig
  Issue Type: Bug
Reporter: Gunther Hagleitner


As part of the multiquery optimization work there's a need to use absolute 
paths for load and store operations (because the current directory changes 
during the execution of the script). In order to do so, the suggestion is to 
change the semantics of the location/filename string used in LoadFunc and 
Slicer/Slice.

The proposed change is:

   * Load locations without a scheme part are expected to be hdfs (mapreduce 
mode) or local (local mode) paths
   * Any hdfs or local path will be translated to a fully qualified absolute 
path before it is handed to either a LoadFunc or Slicer
   * Any scheme other than 'file' or 'hdfs' will result in the load path being 
passed through to the LoadFunc or Slicer without any modification.

Example:

If you have a LoadFunc that reads from a database, right now the following 
could be used:

{{{
a = load 'table' using DBLoader();
}}}

With the proposed changes, 'table' would be translated into an hdfs path 
(hdfs:///table). Probably not what the loader wants to see. So in order 
to make this work one would use:

{{{
a = load 'sql://table' using DBLoader();
}}}

Now the DBLoader would see the unchanged string 'sql://table', and Pig will 
not use the string as an hdfs location.
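The dispatch rule above boils down to inspecting the scheme portion of the location string. A minimal sketch with `java.net.URI` (the helper name is hypothetical, not part of the proposed patch):

```java
import java.net.URI;

public class LoadLocationDispatch {
    // Sketch of the proposed rule: file/hdfs (or scheme-less) locations
    // are qualified by Pig; any other scheme is passed through untouched
    // to the LoadFunc or Slicer.
    public static boolean isPassThrough(String location) {
        URI uri = URI.create(location);
        String scheme = uri.getScheme();
        return scheme != null
                && !scheme.equals("file")
                && !scheme.equals("hdfs");
    }
}
```

Here isPassThrough("sql://table") is true (hand the string through unchanged), while "table" and "hdfs:///table" are false (Pig qualifies them to absolute paths).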

This is an incompatible change, but hopefully few existing Slicers/Loaders 
are affected. This behavior is part of the multiquery work and can be 
turned off (reverted back) by using the no_multiquery flag.




[jira] Created: (PIG-758) Converting load/store locations into fully qualified absolute paths

2009-04-08 Thread Gunther Hagleitner (JIRA)
Converting load/store locations into fully qualified absolute paths
---

 Key: PIG-758
 URL: https://issues.apache.org/jira/browse/PIG-758
 Project: Pig
  Issue Type: Bug
Reporter: Gunther Hagleitner


As part of the multiquery optimization work there is a need to use absolute 
paths for load and store operations (because the current directory changes 
during the execution of the script). In order to do so, we are suggesting a 
change to the semantics of the location/filename string used in LoadFunc and 
Slicer/Slice.

The proposed change is:

   * Load locations without a scheme part are expected to be hdfs (mapreduce 
mode) or local (local mode) paths
   * Any hdfs or local path will be translated to a fully qualified absolute 
path before it is handed to either a LoadFunc or Slicer
   * Any scheme other than 'file' or 'hdfs' will result in the load path being 
passed through to the LoadFunc or Slicer without any modification.

Example:

If you have a LoadFunc that reads from a database, in the current system the 
following could be used:

{{{
a = load 'table' using DBLoader();
}}}

With the proposed changes, 'table' would be translated into an hdfs path 
(hdfs:///table). Probably not what the DBLoader would want to see. In 
order to make it work one could use:

{{{
a = load 'sql://table' using DBLoader();
}}}

Now the DBLoader would see the unchanged string 'sql://table'.

This is an incompatible change, but hopefully it will not affect many existing 
Loaders/Slicers. Since this is needed for the multiquery feature, the behavior 
can be reverted by using the no_multiquery pig flag.




[jira] Updated: (PIG-758) Converting load/store locations into fully qualified absolute paths

2009-04-08 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-758:
---

Description: 
As part of the multiquery optimization work there is a need to use absolute 
paths for load and store operations (because the current directory changes 
during the execution of the script). In order to do so, we are suggesting a 
change to the semantics of the location/filename string used in LoadFunc and 
Slicer/Slice.

The proposed change is:

   * Load locations without a scheme part are expected to be hdfs (mapreduce 
mode) or local (local mode) paths
   * Any hdfs or local path will be translated to a fully qualified absolute 
path before it is handed to either a LoadFunc or Slicer
   * Any scheme other than file or hdfs will result in the load path to be 
passed through to the LoadFunc or Slicer without any modification.

Example:

If you have a LoadFunc that reads from a database, in the current system the 
following could be used:

{code}
a = load 'table' using DBLoader();
{code}

With the proposed changes table would be translated into an hdfs path though 
(hdfs:///table). Probably not what the DBLoader would want to see. In 
order to make it work one could use:

{code}
a = load 'sql://table' using DBLoader();
{code}

Now the DBLoader would see the unchanged string sql://table.

This is an incompatible change, but hopefully not affecting many existing 
Loaders/Slicers. Since this is needed with the multiquery feature, the behavior 
can be reverted back by using the no_multiquery pig flag.

  was:
As part of the multiquery optimization work there is a need to use absolute 
paths for load and store operations (because the current directory changes 
during the execution of the script). In order to do so, we are suggesting a 
change to the semantics of the location/filename string used in LoadFunc and 
Slicer/Slice.

The proposed change is:

   * Load locations without a scheme part are expected to be hdfs (mapreduce 
mode) or local (local mode) paths
   * Any hdfs or local path will be translated to a fully qualified absolute 
path before it is handed to either a LoadFunc or Slicer
   * Any scheme other than file or hdfs will result in the load path to be 
passed through to the LoadFunc or Slicer without any modification.

Example:

If you have a LoadFunc that reads from a database, in the current system the 
following could be used:

{{{
a = load 'table' using DBLoader();
}}}

With the proposed changes table would be translated into an hdfs path though 
(hdfs:///table). Probably not what the DBLoader would want to see. In 
order to make it work one could use:

{{{
a = load 'sql://table' using DBLoader();
}}}

Now the DBLoader would see the unchanged string sql://table.

This is an incompatible change, but hopefully not affecting many existing 
Loaders/Slicers. Since this is needed with the multiquery feature, the behavior 
can be reverted back by using the no_multiquery pig flag.






[jira] Updated: (PIG-758) Converting load/store locations into fully qualified absolute paths

2009-04-08 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-758:
---

Description: 
As part of the multiquery optimization work there is a need to use absolute 
paths for load and store operations (because the current directory changes 
during the execution of the script). In order to do so, we are suggesting a 
change to the semantics of the location/filename string used in LoadFunc and 
Slicer/Slice.

The proposed change is:

   * Load locations without a scheme part are expected to be hdfs (mapreduce 
mode) or local (local mode) paths
   * Any hdfs or local path will be translated to a fully qualified absolute 
path before it is handed to either a LoadFunc or Slicer
   * Any scheme other than file or hdfs will result in the load path to be 
passed through to the LoadFunc or Slicer without any modification.

Example:

If you have a LoadFunc that reads from a database, in the current system the 
following could be used:

{noformat}
a = load 'table' using DBLoader();
{noformat}

With the proposed changes table would be translated into an hdfs path though 
(hdfs:///table). Probably not what the DBLoader would want to see. In 
order to make it work one could use:

{noformat}
a = load 'sql://table' using DBLoader();
{noformat}

Now the DBLoader would see the unchanged string sql://table.

This is an incompatible change, but hopefully not affecting many existing 
Loaders/Slicers. Since this is needed with the multiquery feature, the behavior 
can be reverted back by using the no_multiquery pig flag.


 Converting load/store locations into fully qualified absolute paths
 ---

 Key: PIG-758
 URL: https://issues.apache.org/jira/browse/PIG-758
 Project: Pig
  Issue Type: Bug
Reporter: Gunther Hagleitner


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue 

[jira] Commented: (PIG-745) Please add DataTypes.toString() conversion function

2009-04-08 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697237#action_12697237
 ] 

Alan Gates commented on PIG-745:


Responses to comments:

1) Java's Boolean.toString() is probably the best choice.

2) Unit tests would be nice, but this is pretty basic and you're just calling 
various Java .toString functions.

3) If you're happy with it, it's good enough for now.  We can improve it later 
if people ask for it.

4) Noted.


 Please add DataTypes.toString() conversion function
 ---

 Key: PIG-745
 URL: https://issues.apache.org/jira/browse/PIG-745
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz
 Attachments: PIG-745.patch


 I'm doing some work in string manipulation UDFs and I've found that it would 
 be very convenient if I could always convert the argument to a chararray 
 (internally a Java String).
 For example TOLOWERCASE(arg) shouldn't really care whether arg is a 
 bytearray, chararray, int, long, double, or float, it should be treated as a 
 string and operated on.
 The simplest and most foolproof method would be if the DataTypes added a 
 static function of  DataTypes.toString which did all of the argument type 
 checking and provided consistent translation.
 I believe that this function might be coded as:
 public static String toString(Object o) throws ExecException {
     try {
         switch (findType(o)) {
         case BOOLEAN:
             return ((Boolean)o) ? "1" : "0";
         case BYTE:
             return ((Byte)o).toString();
         case INTEGER:
             return ((Integer)o).toString();
         case LONG:
             return ((Long)o).toString();
         case FLOAT:
             return ((Float)o).toString();
         case DOUBLE:
             return ((Double)o).toString();
         case BYTEARRAY:
             return ((DataByteArray)o).toString();
         case CHARARRAY:
             return (String)o;
         case NULL:
             return null;
         case MAP:
         case TUPLE:
         case BAG:
         case UNKNOWN:
         default:
             int errCode = 1071;
             String msg = "Cannot convert a " + findTypeName(o) +
                 " to a String";
             throw new ExecException(msg, errCode, PigException.INPUT);
         }
     } catch (ExecException ee) {
         throw ee;
     } catch (Exception e) {
         int errCode = 2054;
         String msg = "Internal error. Could not convert " + o +
             " to String.";
         throw new ExecException(msg, errCode, PigException.BUG);
     }
 }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-745) Please add DataTypes.toString() conversion function

2009-04-08 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-745:
---

   Resolution: Fixed
Fix Version/s: 0.3.0
   Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks Ciemo for the contribution.

 Please add DataTypes.toString() conversion function
 ---

 Key: PIG-745
 URL: https://issues.apache.org/jira/browse/PIG-745
 Project: Pig
  Issue Type: Improvement
Reporter: David Ciemiewicz
 Fix For: 0.3.0

 Attachments: PIG-745.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-759) HBaseStorage scheme for Load/Slice function

2009-04-08 Thread Gunther Hagleitner (JIRA)
HBaseStorage scheme for Load/Slice function
---

 Key: PIG-759
 URL: https://issues.apache.org/jira/browse/PIG-759
 Project: Pig
  Issue Type: Bug
Reporter: Gunther Hagleitner


We would like to change the HBaseStorage function to use a scheme when loading 
a table in pig. The scheme we are thinking of is hbase. So, in order to load 
an hbase table in a pig script, the statement would read:

{noformat}
table = load 'hbase://tablename' using HBaseStorage();
{noformat}

If the scheme is omitted, pig would assume the location to be an hdfs path; the 
storage function would then use the last component of the path as the table 
name and output a warning.
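The fallback described above could look roughly like the following. This is an illustrative sketch only; the class and method names are invented for the example and are not HBaseStorage's actual implementation.

```java
public class HBaseLocation {
    // Strip an explicit hbase:// scheme; otherwise assume an hdfs path,
    // warn, and fall back to the last path component as the table name.
    static String tableName(String location) {
        if (location.startsWith("hbase://")) {
            return location.substring("hbase://".length());
        }
        System.err.println("Warning: no hbase:// scheme in '" + location
                + "'; using last path component as table name");
        int slash = location.lastIndexOf('/');
        return slash < 0 ? location : location.substring(slash + 1);
    }

    public static void main(String[] args) {
        System.out.println(tableName("hbase://users"));  // prints users
        System.out.println(tableName("/data/users"));    // prints users
    }
}
```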

For details on why, see JIRA issue PIG-758.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-761) ERROR 2086 on simple JOIN

2009-04-08 Thread Vadim Zaliva (JIRA)
ERROR 2086 on simple JOIN
-

 Key: PIG-761
 URL: https://issues.apache.org/jira/browse/PIG-761
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
 Environment: mapreduce mode
Reporter: Vadim Zaliva


ERROR 2086: Unexpected problem during optimization. Could not find all 
LocalRearrange operators.org.apache.pig.impl.logicalLayer.FrontendException: 
ERROR 1002: Unable to store alias 109

I am doing a pretty straightforward join in one of my pig scripts. I am able to 
'dump' both relations involved in this join, but when I try to join them I get 
this error.

Here is a full log:


ERROR 2086: Unexpected problem during optimization. Could not find all
LocalRearrange operators.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable
to store alias 109
   at org.apache.pig.PigServer.registerQuery(PigServer.java:296)
   at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:529)
   at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:280)
   at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99)
   at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
   at org.apache.pig.Main.main(Main.java:319)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR
2043: Unexpected error during execution.
   at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:274)
   at 
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:700)
   at org.apache.pig.PigServer.execute(PigServer.java:691)
   at org.apache.pig.PigServer.registerQuery(PigServer.java:292)
   ... 5 more
Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException:
ERROR 2086: Unexpected problem during optimization. Could not find all
LocalRearrange operators.
   at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.POPackageAnnotator.handlePackage(POPackageAnnotator.java:116)
   at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.POPackageAnnotator.visitMROp(POPackageAnnotator.java:88)
   at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:194)
   at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:43)
   at 
org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:65)
   at 
org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67)
   at 
org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67)
   at 
org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67)
   at 
org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:50)
   at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
   at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:198)
   at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:80)
   at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:261)
   ... 8 more
ERROR 1002: Unable to store alias 398
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable
to store alias 398
   at org.apache.pig.PigServer.registerQuery(PigServer.java:296)
   at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:529)
   at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:280)
   at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99)
   at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
   at org.apache.pig.Main.main(Main.java:319)
Caused by: java.lang.NullPointerException
   at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:669)
   at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:330)
   at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:41)
   at 
org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
   at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
   at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:246)
   at org.apache.pig.PigServer.compilePp(PigServer.java:771)
   at 
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:697)
   at org.apache.pig.PigServer.execute(PigServer.java:691)
   at org.apache.pig.PigServer.registerQuery(PigServer.java:292)
   ... 5 more


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.