[jira] Created: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path
UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path

Key: PIG-756
URL: https://issues.apache.org/jira/browse/PIG-756
Project: Pig
Issue Type: Bug
Reporter: David Ciemiewicz

I have a utility function util.INSETFROMFILE() that I pass a file name during initialization.

{code}
define inQuerySet util.INSETFROMFILE('analysis/queries');
A = load 'logs' using PigStorage() as ( date: int, query: chararray );
B = filter A by inQuerySet(query);
{code}

This provides a computationally inexpensive way to effect map-side joins for small sets, and functions of this style also make it possible to encapsulate more complex matching rules. For rapid development and debugging purposes, I want this code to run without modification both on my local file system when I do pig -exectype local and on HDFS. Pig needs to provide an API for UDFs which allows them to either:

1) know whether they are in local or HDFS mode and open and read files as appropriate, or
2) just provide a file name and read statements, and have Pig transparently manage local or HDFS opens and reads for the UDF.

UDFs need to read configuration information off the filesystem, and it simplifies the process if one can just flip the switch of -exectype local.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
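To make the request concrete, here is a minimal, hypothetical sketch of the lookup such a UDF performs once the file is readable. The class and method names (InSetFromFile, contains) are invented for this illustration; a real Pig UDF would extend EvalFunc<Boolean>, and reading from a generic Reader stands in for the HDFS-or-local stream that the issue asks Pig to hand the UDF transparently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the set-membership UDF described above.
// Only the core logic is shown; by accepting any Reader, the same
// code works whether Pig opened a local file or an HDFS stream.
public class InSetFromFile {
    private final Set<String> set = new HashSet<String>();

    // Load one entry per line; Pig would hand us an opened stream
    // regardless of whether the path was local or on HDFS.
    public InSetFromFile(Reader source) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            set.add(line.trim());
        }
    }

    // The exec() equivalent: the membership test used by the filter.
    public boolean contains(String query) {
        return set.contains(query);
    }

    public static void main(String[] args) throws IOException {
        InSetFromFile inQuerySet =
            new InSetFromFile(new StringReader("pig\nhadoop\n"));
        System.out.println(inQuerySet.contains("pig"));    // true
        System.out.println(inQuerySet.contains("python")); // false
    }
}
```

With an API like the one requested, only the construction of the Reader would differ between -exectype local and HDFS mode; the UDF body would not change.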
[jira] Commented: (PIG-756) UDFs should have API for transparently opening and reading files from HDFS or from local file system with only relative path
[ https://issues.apache.org/jira/browse/PIG-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697043#action_12697043 ]

David Ciemiewicz commented on PIG-756:
--

BTW, there used to be a mechanism to do this in early versions of Pig, but it was lost in the transition to the new execution system.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-724) Treating integers and strings in PigStorage
[ https://issues.apache.org/jira/browse/PIG-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697054#action_12697054 ]

Alan Gates commented on PIG-724:
--

Currently Pig doesn't require that all keys and values in a map share the same type. There is a proposal to change it so that key types can only be chararray (see PIG-734), as we don't see anyone using anything but chararray and the generality is causing us some other issues. But we still wouldn't require that all values in a given map be of the same type. Are you proposing allowing users to put a constraint on a given map so that all values in that particular map must be of that type?

Treating integers and strings in PigStorage

Key: PIG-724
URL: https://issues.apache.org/jira/browse/PIG-724
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.2.1
Reporter: Santhosh Srinivasan
Fix For: 0.2.1

Currently, PigStorage treats the materialized string 123 as an integer with the value 123. If the user intended this to be the string 123, PigStorage cannot deal with it. This reasoning also applies to doubles. Due to this issue, with maps that contain values which are of the same type but manifest the issue discussed at the beginning of the paragraph, Pig throws its hands up at runtime. An example will help illustrate the problem. In the example below, a sample row in the data (map.txt) contains the following:

[key01#35,key02#value01]

When Pig tries to convert the stream to a map, it creates a Map<Object, Object> where the key is a string and the value is an integer. Running the script shown below results in a run-time error.
{code}
grunt> a = load 'map.txt' as (themap: map[]);
grunt> b = filter a by (chararray)(themap#'key01') == 'hello';
grunt> dump b;
2009-03-18 15:19:03,773 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2009-03-18 15:19:28,797 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Map reduce job failed
2009-03-18 15:19:28,817 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1081: Cannot cast to chararray. Expected bytearray but received: int
{code}

There are two ways to resolve this issue:

1. Change the conversion routine bytesToMap to return a map where the value is a bytearray and not the actual type. This change breaks backward compatibility.
2. Introduce checks in POCast so that conversions that are legal in the type-checking world are allowed, i.e., run-time checks will be made for compatible casts. In the above example, an int can be converted to a chararray and the cast will be made. If, on the other hand, it was a chararray-to-int conversion, then an exception will be thrown.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
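Option 2 above can be sketched in plain Java. This is a hypothetical illustration, not the actual POCast code: the method name castToChararray is invented, and instanceof checks stand in for Pig's run-time type dispatch; the point is that a cast succeeds only for types the type checker considers convertible to chararray.

```java
// Hypothetical sketch of the run-time check proposed in option 2:
// perform the cast to chararray only when the run-time type is one
// that is legally convertible; otherwise raise an error, mirroring
// the ERROR 1081 behavior shown in the log above.
public class CastCheck {
    public static String castToChararray(Object value) {
        if (value == null) {
            return null;
        }
        // Numeric and string types are legal chararray sources.
        if (value instanceof Integer || value instanceof Long
                || value instanceof Float || value instanceof Double
                || value instanceof String) {
            return value.toString();
        }
        throw new RuntimeException("Cannot cast "
            + value.getClass().getSimpleName() + " to chararray");
    }

    public static void main(String[] args) {
        // The map value for 'key01' materialized as an int at run time:
        Object v = Integer.valueOf(35);
        System.out.println(castToChararray(v)); // prints "35"
    }
}
```

Under this scheme the script in the example would succeed, since int-to-chararray is a compatible conversion, while a chararray-to-int request on non-numeric text would still fail at run time.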
[jira] Commented: (PIG-745) Please add DataTypes.toString() conversion function
[ https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697056#action_12697056 ]

Alan Gates commented on PIG-745:
--

I'm reviewing this patch.

Please add DataTypes.toString() conversion function

Key: PIG-745
URL: https://issues.apache.org/jira/browse/PIG-745
Project: Pig
Issue Type: Improvement
Reporter: David Ciemiewicz
Attachments: PIG-745.patch

I'm doing some work on string manipulation UDFs and I've found that it would be very convenient if I could always convert the argument to a chararray (internally a Java String). For example, TOLOWERCASE(arg) shouldn't really care whether arg is a bytearray, chararray, int, long, double, or float; it should be treated as a string and operated on. The simplest and most foolproof method would be if DataTypes added a static function DataTypes.toString which did all of the argument type checking and provided consistent translation. I believe that this function might be coded as:

{code}
public static String toString(Object o) throws ExecException {
    try {
        switch (findType(o)) {
        case BOOLEAN:
            if (((Boolean)o) == true) return "1";
            else return "0";
        case BYTE:      return ((Byte)o).toString();
        case INTEGER:   return ((Integer)o).toString();
        case LONG:      return ((Long)o).toString();
        case FLOAT:     return ((Float)o).toString();
        case DOUBLE:    return ((Double)o).toString();
        case BYTEARRAY: return ((DataByteArray)o).toString();
        case CHARARRAY: return (String)o;
        case NULL:      return null;
        case MAP:
        case TUPLE:
        case BAG:
        case UNKNOWN:
        default:
            int errCode = 1071;
            String msg = "Cannot convert a " + findTypeName(o) + " to a String";
            throw new ExecException(msg, errCode, PigException.INPUT);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2054;
        String msg = "Internal error. Could not convert " + o + " to String.";
        throw new ExecException(msg, errCode, PigException.BUG);
    }
}
{code}

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
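For reference, the conversion logic above can be exercised in a standalone form. This is a sketch, not Pig's code: instanceof checks replace DataType.findType(o) so that the example compiles without Pig's classes, and it keeps the "1"/"0" boolean convention from the original proposal.

```java
// Standalone, testable rendition of the proposed conversion. Pig's
// real version dispatches on findType(o) and handles bytearray, map,
// tuple, and bag cases; here instanceof checks stand in so the sketch
// runs without Pig on the classpath.
public class ToStringSketch {
    public static String asString(Object o) {
        if (o == null) return null;
        if (o instanceof Boolean) return ((Boolean) o) ? "1" : "0";
        if (o instanceof String)  return (String) o;
        if (o instanceof Byte || o instanceof Integer || o instanceof Long
                || o instanceof Float || o instanceof Double) {
            return o.toString();
        }
        // Maps, tuples, and bags are rejected, as in the proposal.
        throw new IllegalArgumentException("Cannot convert a "
            + o.getClass().getSimpleName() + " to a String");
    }

    public static void main(String[] args) {
        System.out.println(asString(123));  // "123"
        System.out.println(asString(true)); // "1"
        System.out.println(asString(1.5));  // "1.5"
    }
}
```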
Re: Ajax library for Pig
Sorry if these are silly questions, but I'm not very familiar with some of these technologies. So what you propose is that Pig would be installed on some dedicated server machine and a web server would be placed in front of it. Then client libraries would be developed that made calls to the web server. Would these client side libraries include presentation in the browser, both for users submitting queries and for receiving results? Also, Pig currently does not have a server mode, thus any web server would have to spin off threads that ran a Pig job. If the above is what you're proposing, I think it would be great. Opening up Pig to more users by making it browser accessible would be nice.

Alan.

On Apr 3, 2009, at 5:36 AM, nitesh bhatia wrote:

Hi

Since Pig is getting a lot of usage in industry and universities, how about adding front-end support for Pig? The plan is to write a jquery/dojo type of general JavaScript/AJAX library which can be used over any server technology (php, jsp, asp, etc.) to call Pig functions over the web. Direct Web Remoting (DWR, http://directwebremoting.org ), an open source project at Java.net, gives functionality that allows JavaScript in a browser to interact with Java on a server. Can we write a JavaScript library exclusively for Pig using DWR? I am not sure about licensing issues. The major advantages I can point to are:

- Use of Pig over HTTP rather than SSH.
- User management will become easy, as this can be handled using any CMS.

--nitesh

--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information and Communication Technology
Gandhinagar
Gujarat

Life is never perfect. It just depends where you draw the line.

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
[jira] Commented: (PIG-712) Need utilities to create schemas for bags and tuples
[ https://issues.apache.org/jira/browse/PIG-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697075#action_12697075 ]

Alan Gates commented on PIG-712:
--

Jeff, Thanks for the patch. I'll take a look at this, but it may be tomorrow before I get to it.

Need utilities to create schemas for bags and tuples

Key: PIG-712
URL: https://issues.apache.org/jira/browse/PIG-712
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.2.0
Reporter: Santhosh Srinivasan
Priority: Minor
Fix For: 0.3.0
Attachments: Pig_712_Patch_Merged.txt

Pig should provide utilities to create bag and tuple schemas. Currently, users return schemas in the outputSchema method and end up with very verbose boilerplate code. It would be very nice if Pig encapsulated the boilerplate code in utility methods.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
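To show the nesting such utilities would hide, here is a hypothetical plain-Java helper that assembles the textual form of a bag schema wrapping a tuple schema. The names (tupleSchema, bagSchema) and the string-based approach are invented for this example; the real utilities would build and return Schema objects rather than strings, but the nesting they spare the UDF author is the same.

```java
// Hypothetical illustration of the boilerplate the requested utilities
// would encapsulate: declaring a bag whose tuples have named, typed
// fields. In a real outputSchema implementation this nesting is built
// out of Schema and FieldSchema objects by hand.
public class SchemaStrings {
    // Render a tuple schema, e.g. t:tuple(query:chararray,count:long)
    public static String tupleSchema(String name, String... fields) {
        return name + ":tuple(" + String.join(",", fields) + ")";
    }

    // Wrap a tuple schema in a bag schema, e.g. b:bag{...}
    public static String bagSchema(String name, String tuple) {
        return name + ":bag{" + tuple + "}";
    }

    public static void main(String[] args) {
        String t = tupleSchema("t", "query:chararray", "count:long");
        System.out.println(bagSchema("b", t));
        // b:bag{t:tuple(query:chararray,count:long)}
    }
}
```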
[jira] Commented: (PIG-745) Please add DataTypes.toString() conversion function
[ https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697094#action_12697094 ]

David Ciemiewicz commented on PIG-745:
--

Alan, I realized several things.

1) The question of what to do about the BOOLEAN case. My original suggestion was to convert the BOOLEAN case to 1 and 0, but in the patch I just used the Boolean.toString() function. Not sure if that matters or not.
2) I didn't see test cases for the other DataType.toInteger(), ... conversions, so I didn't create one for DataType.toString().
3) We are just using the default conversions Float.toString() and Double.toString(). I don't know if this is actually best, since I don't know whether these operations present the floating-point values in full precision. At this point it may not really matter much, as the primary reason for creating DataType.toString() is to allow string functions to operate on any data type (as in Perl) without generating cast errors.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-753) Do not support UDF not providing parameter
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697108#action_12697108 ]

David Ciemiewicz commented on PIG-753:
--

I think Jeff means that Pig does not support UDFs without parameters, but should. I agree.

Do not support UDF not providing parameter

Key: PIG-753
URL: https://issues.apache.org/jira/browse/PIG-753
Project: Pig
Issue Type: Improvement
Reporter: Jeff Zhang

Pig does not support UDFs without parameters; it forces me to provide a parameter, as in the following statement:

B = FOREACH A GENERATE bagGenerator();

This will generate an error. I have to provide a parameter like the following:

B = FOREACH A GENERATE bagGenerator($0);

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697146#action_12697146 ]

David Ciemiewicz commented on PIG-697:
--

Some thoughts on optimization problems and patterns from SQL and from coding Pig, and my desire for a higher-level version of Pig than we have today. I know this may come off as a distraction, but hopefully you'll have some time to hear me out. After:

* a conversation with Santhosh about the SQL to Pig translation work
* multiple issues I have encountered with nested foreach statements, including redundant function execution
* nested FOREACH statement assignment computation bugs
* hand coding chains of foreach statements so I can get the Algebraic combiner to kick in
* hand coding chains of foreach statements and grouping statements rather than using a single statement

I think I might have stumbled on a potentially improved model for Pig to Pig execution plan generation:

{code}
High Level Pig to Low Level Pig translation
{code}

I think this would benefit the SQL to Pig efforts and provide for programmer coding efficiency in Pig as well. This will be a bit protracted, but I hope you have some time to consider it. Take the following SQL idiom that the SQL to Pig translator will need to support:

{code}
select EXP(AVG(LN(time+0.1))) as geomean_time
from events
where time is not null and time >= 0;
{code}

In high level pig, I have wanted to code this as:

{code}
A = load 'events' using PigStorage() as ( time: int );
B = filter A by time is not null and time >= 0;
C = group B all;
D = foreach C generate EXP(AVG(LN(B.time+0.1))) as geomean_time;
{code}

In fact, this would seem to provide a nice translation path from SQL to low level pig via high level pig. Unfortunately, this won't work. We developers must write Pig scripts at a lower level and break all of this apart into various steps.
An additional issue is that, because of some, um, workarounds in the execution plan optimizations, the combiner won't kick in unless we take further steps. So the most performant version of the desired pig script is the following really low level pig, where D is broken into 3 steps, merging one with B and leaving the remaining 2 steps as separate D steps:

{code}
A = load 'events' using PigStorage() as ( time: int );
B = filter A by time is not null and time >= 0;
B = foreach B generate LN(time+0.1) as log_time;
C = group B all;
D = foreach C generate group, AVG(B.log_time) as mean_log_time;
-- note that the group alias is required for the Algebraic combiner to kick in
D = foreach D generate EXP(mean_log_time) as geomean_time;
{code}

If we can figure out how to translate SQL into this last low-level set of statements, why couldn't we or shouldn't we have high level pig as well, and permit more efficient code writing and optimization?

Next example. I do a bunch of nested intermediate computations in a nested FOREACH statement:

{code}
C = foreach C {
    curr_mean_log_timetonextevent = curr_sum_log_timetonextevent / (double)count;
    curr_meansq_log_timetonextevent = curr_sumsq_log_timetonextevent / (double)count;
    curr_var_log_timetonextevent = curr_meansq_log_timetonextevent - (curr_mean_log_timetonextevent * curr_mean_log_timetonextevent);
    curr_sterr_log_timetonextevent = math.SQRT(curr_var_log_timetonextevent / (double)count);
    curr_geomean_timetonextevent = math.EXP(curr_mean_log_timetonextevent);
    curr_geosterr_timetonextevent = math.EXP(curr_sterr_log_timetonextevent);
    curr_mean_timetonextevent = curr_sum_log_timetonextevent / (double)count;
    curr_meansq_timetonextevent = curr_sumsq_log_timetonextevent / (double)count;
    curr_var_timetonextevent = curr_meansq_timetonextevent - (curr_mean_timetonextevent * curr_mean_timetonextevent);
    curr_sterr_timetonextevent = math.SQRT(curr_var_timetonextevent / count);
    generate ...
{code}

The code for nested statements in Pig has been particularly problematic and buggy, including problems such as:

* redundant execution of functions such as SUM, AVG
* nested function problems
* mathematical operator problems (illustrated in this bug)
* no type propagation
* the need to use AS clauses to name nested alias assignments projected in the GENERATE clauses

What if, instead of trying to do all of these operations in some specialized execution code, this was treated as high level pig that translated all of these intermediate statements into two or more low level foreach expansions?
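The geometric-mean idiom being decomposed above can be sanity-checked in plain Java. This sketch only verifies the arithmetic, not Pig's plan generation: the per-row LN, the AVG over the group, and the final EXP compute the same value as the single nested expression EXP(AVG(LN(time+0.1))), which is why the three-step rewrite is safe for the combiner to exploit.

```java
// Plain-Java check of EXP(AVG(LN(time+0.1))): the combiner-friendly
// decomposition (per-row LN, then AVG, then EXP) equals the nested
// expression. The 0.1 offset is the script's guard against ln(0).
public class GeoMean {
    public static double geomean(double[] times) {
        double sumLog = 0.0;
        for (double t : times) {
            sumLog += Math.log(t + 0.1);        // the foreach over B
        }
        double meanLog = sumLog / times.length; // AVG (algebraic, combinable)
        return Math.exp(meanLog);               // the final foreach over D
    }

    public static void main(String[] args) {
        double[] times = { 0.0, 1.0, 9.9 };
        System.out.println(geomean(times));
    }
}
```

AVG is algebraic (a sum and a count can be merged across map tasks), which is exactly why hoisting the LN out of the aggregate lets the combiner kick in.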
[jira] Created: (PIG-757) Using schemes in load and store paths
Using schemes in load and store paths

Key: PIG-757
URL: https://issues.apache.org/jira/browse/PIG-757
Project: Pig
Issue Type: Bug
Reporter: Gunther Hagleitner

As part of the multiquery optimization work there is a need to use absolute paths for load and store operations (because the current directory changes during the execution of the script). In order to do so, the suggestion is to change the semantics of the location/filename string used in LoadFunc and Slicer/Slice. The proposed change is:

* Load locations without a scheme part are expected to be hdfs (mapreduce mode) or local (local mode) paths
* Any hdfs or local path will be translated to a fully qualified absolute path before it is handed to either a LoadFunc or Slicer
* Any scheme other than file or hdfs will result in the load path being passed through to the LoadFunc or Slicer without any modification.

Example: If you have a LoadFunc that reads from a database, right now the following could be used:

{{{ a = load 'table' using DBLoader(); }}}

With the proposed changes, though, table would be translated into an hdfs path (hdfs:///table). Probably not what the loader wants to see. So in order to make this work one would use:

{{{ a = load 'sql://table' using DBLoader(); }}}

Now the DBLoader would see the unchanged string sql://table, and Pig will not use the string as an hdfs location. This is an incompatible change, but hopefully only a few existing Slicers/Loaders are affected. This behavior is part of the multiquery work and can be turned off (reverted back) by using the no_multiquery flag.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
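The proposed dispatch on the location string's scheme can be sketched with java.net.URI. This is an illustration of the rule, not Pig's implementation: passThrough is an invented name, and the qualification of schemeless paths into absolute hdfs or local paths is only described in a comment.

```java
import java.net.URI;

// Sketch of the proposed rule: locations with no scheme, or with the
// file/hdfs schemes, get qualified into fully absolute paths; any
// other scheme (such as the sql:// example above) is handed to the
// LoadFunc or Slicer untouched.
public class LoadLocation {
    public static boolean passThrough(String location) {
        String scheme = URI.create(location).getScheme();
        return scheme != null
            && !scheme.equals("file")
            && !scheme.equals("hdfs");
    }

    public static void main(String[] args) {
        System.out.println(passThrough("table"));         // false: gets qualified
        System.out.println(passThrough("sql://table"));   // true: handed through
        System.out.println(passThrough("hdfs:///table")); // false: hdfs path
    }
}
```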
[jira] Created: (PIG-758) Converting load/store locations into fully qualified absolute paths
Converting load/store locations into fully qualified absolute paths

Key: PIG-758
URL: https://issues.apache.org/jira/browse/PIG-758
Project: Pig
Issue Type: Bug
Reporter: Gunther Hagleitner

As part of the multiquery optimization work there is a need to use absolute paths for load and store operations (because the current directory changes during the execution of the script). In order to do so, we are suggesting a change to the semantics of the location/filename string used in LoadFunc and Slicer/Slice. The proposed change is:

* Load locations without a scheme part are expected to be hdfs (mapreduce mode) or local (local mode) paths
* Any hdfs or local path will be translated to a fully qualified absolute path before it is handed to either a LoadFunc or Slicer
* Any scheme other than file or hdfs will result in the load path being passed through to the LoadFunc or Slicer without any modification.

Example: If you have a LoadFunc that reads from a database, in the current system the following could be used:

{{{ a = load 'table' using DBLoader(); }}}

With the proposed changes, though, table would be translated into an hdfs path (hdfs:///table). Probably not what the DBLoader would want to see. In order to make it work one could use:

{{{ a = load 'sql://table' using DBLoader(); }}}

Now the DBLoader would see the unchanged string sql://table. This is an incompatible change, but hopefully not affecting many existing Loaders/Slicers. Since this is needed for the multiquery feature, the behavior can be reverted back by using the no_multiquery pig flag.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-758) Converting load/store locations into fully qualified absolute paths
[ https://issues.apache.org/jira/browse/PIG-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-758:
---

Description: As part of the multiquery optimization work there is a need to use absolute paths for load and store operations (because the current directory changes during the execution of the script). In order to do so, we are suggesting a change to the semantics of the location/filename string used in LoadFunc and Slicer/Slice. The proposed change is:

* Load locations without a scheme part are expected to be hdfs (mapreduce mode) or local (local mode) paths
* Any hdfs or local path will be translated to a fully qualified absolute path before it is handed to either a LoadFunc or Slicer
* Any scheme other than file or hdfs will result in the load path being passed through to the LoadFunc or Slicer without any modification.

Example: If you have a LoadFunc that reads from a database, in the current system the following could be used:

{code}
a = load 'table' using DBLoader();
{code}

With the proposed changes, though, table would be translated into an hdfs path (hdfs:///table). Probably not what the DBLoader would want to see. In order to make it work one could use:

{code}
a = load 'sql://table' using DBLoader();
{code}

Now the DBLoader would see the unchanged string sql://table. This is an incompatible change, but hopefully not affecting many existing Loaders/Slicers. Since this is needed for the multiquery feature, the behavior can be reverted back by using the no_multiquery pig flag.

was: (the same description, with the code snippets marked up as {{{ ... }}} instead of {code} blocks)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-758) Converting load/store locations into fully qualified absolute paths
[ https://issues.apache.org/jira/browse/PIG-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gunther Hagleitner updated PIG-758: --- Description: As part of the multiquery optimization work there is a need to use absolute paths for load and store operations (because the current directory changes during the execution of the script). In order to do so, we are suggesting a change to the semantics of the location/filename string used in LoadFunc and Slicer/Slice. The proposed change is: * Load locations without a scheme part are expected to be hdfs (mapreduce mode) or local (local mode) paths * Any hdfs or local path will be translated to a fully qualified absolute path before it is handed to either a LoadFunc or Slicer * Any scheme other than file or hdfs will result in the load path to be passed through to the LoadFunc or Slicer without any modification. Example: If you have a LoadFunc that reads from a database, in the current system the following could be used: {noformat} a = load 'table' using DBLoader(); {noformat} With the proposed changes table would be translated into an hdfs path though (hdfs:///table). Probably not what the DBLoader would want to see. In order to make it work one could use: {noformat} a = load 'sql://table' using DBLoader(); {noformat} Now the DBLoader would see the unchanged string sql://table. This is an incompatible change, but hopefully not affecting many existing Loaders/Slicers. Since this is needed with the multiquery feature, the behavior can be reverted back by using the no_multiquery pig flag. was: As part of the multiquery optimization work there is a need to use absolute paths for load and store operations (because the current directory changes during the execution of the script). In order to do so, we are suggesting a change to the semantics of the location/filename string used in LoadFunc and Slicer/Slice. 
The proposed change is: * Load locations without a scheme part are expected to be hdfs (mapreduce mode) or local (local mode) paths * Any hdfs or local path will be translated to a fully qualified absolute path before it is handed to either a LoadFunc or Slicer * Any scheme other than file or hdfs will result in the load path to be passed through to the LoadFunc or Slicer without any modification. Example: If you have a LoadFunc that reads from a database, in the current system the following could be used: {code} a = load 'table' using DBLoader(); {code} With the proposed changes table would be translated into an hdfs path though (hdfs:///table). Probably not what the DBLoader would want to see. In order to make it work one could use: {code} a = load 'sql://table' using DBLoader(); {code} Now the DBLoader would see the unchanged string sql://table. This is an incompatible change, but hopefully not affecting many existing Loaders/Slicers. Since this is needed with the multiquery feature, the behavior can be reverted back by using the no_multiquery pig flag. Converting load/store locations into fully qualified absolute paths --- Key: PIG-758 URL: https://issues.apache.org/jira/browse/PIG-758 Project: Pig Issue Type: Bug Reporter: Gunther Hagleitner As part of the multiquery optimization work there is a need to use absolute paths for load and store operations (because the current directory changes during the execution of the script). In order to do so, we are suggesting a change to the semantics of the location/filename string used in LoadFunc and Slicer/Slice. The proposed change is: * Load locations without a scheme part are expected to be hdfs (mapreduce mode) or local (local mode) paths * Any hdfs or local path will be translated to a fully qualified absolute path before it is handed to either a LoadFunc or Slicer * Any scheme other than file or hdfs will result in the load path to be passed through to the LoadFunc or Slicer without any modification. 
Example: If you have a LoadFunc that reads from a database, in the current system the following could be used:
{noformat}
a = load 'table' using DBLoader();
{noformat}
With the proposed change, however, 'table' would be translated into an hdfs path (hdfs:///table), which is probably not what the DBLoader would want to see. In order to make it work one could use:
{noformat}
a = load 'sql://table' using DBLoader();
{noformat}
Now the DBLoader would see the unchanged string sql://table.

This is an incompatible change, but hopefully one that does not affect many existing Loaders/Slicers. Since this is needed for the multiquery feature, the behavior can be reverted by using the no_multiquery pig flag.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
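The resolution rule proposed above could be sketched roughly as follows. This is a hypothetical standalone helper, not Pig's actual implementation; the class and method names are made up for illustration:

```java
import java.net.URI;

public class LoadPathResolver {
    // Sketch of the proposed rule: locations with no scheme, or with an
    // hdfs/file scheme, are qualified against the default file system and
    // the current working directory; any other scheme (e.g. sql://) is
    // passed through to the LoadFunc/Slicer untouched.
    public static String resolve(String location, String defaultFs, String cwd) {
        URI uri = URI.create(location);
        String scheme = uri.getScheme();
        if (scheme != null && !scheme.equals("hdfs") && !scheme.equals("file")) {
            return location; // e.g. sql://table reaches DBLoader unchanged
        }
        String path = (uri.getPath() != null) ? uri.getPath() : location;
        if (!path.startsWith("/")) {
            path = cwd + "/" + path; // qualify relative paths against cwd
        }
        return defaultFs + path; // fully qualified absolute path
    }
}
```

Under this sketch, `load 'table'` in a script running with working directory `/user/alice` would hand the LoadFunc `hdfs://nn:8020/user/alice/table`, while `load 'sql://table'` would still see the literal string `sql://table`.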
[jira] Commented: (PIG-745) Please add DataTypes.toString() conversion function
[ https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12697237#action_12697237 ]

Alan Gates commented on PIG-745:
--------------------------------

Responses to comments:
1) Java's Boolean.toString() is probably the best choice.
2) Unit tests would be nice, but this is pretty basic and you're just calling the various Java .toString() functions.
3) If you're happy with it, it's good enough for now. We can improve it later if people ask for it.
4) Noted.

Please add DataTypes.toString() conversion function
---------------------------------------------------

                 Key: PIG-745
                 URL: https://issues.apache.org/jira/browse/PIG-745
             Project: Pig
          Issue Type: Improvement
            Reporter: David Ciemiewicz
        Attachments: PIG-745.patch

I'm doing some work in string manipulation UDFs and I've found that it would be very convenient if I could always convert the argument to a chararray (internally a Java String). For example, TOLOWERCASE(arg) shouldn't really care whether arg is a bytearray, chararray, int, long, double, or float; it should be treated as a string and operated on. The simplest and most foolproof method would be for DataTypes to add a static DataTypes.toString() function which did all of the argument type checking and provided consistent translation.
I believe that this function might be coded as:
{code}
public static String toString(Object o) throws ExecException {
    try {
        switch (findType(o)) {
        case BOOLEAN:
            return ((Boolean)o) ? "1" : "0";
        case BYTE: return ((Byte)o).toString();
        case INTEGER: return ((Integer)o).toString();
        case LONG: return ((Long)o).toString();
        case FLOAT: return ((Float)o).toString();
        case DOUBLE: return ((Double)o).toString();
        case BYTEARRAY: return ((DataByteArray)o).toString();
        case CHARARRAY: return (String)o;
        case NULL: return null;
        case MAP:
        case TUPLE:
        case BAG:
        case UNKNOWN:
        default:
            int errCode = 1071;
            String msg = "Cannot convert a " + findTypeName(o) + " to a String";
            throw new ExecException(msg, errCode, PigException.INPUT);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2054;
        String msg = "Internal error. Could not convert " + o + " to String.";
        throw new ExecException(msg, errCode, PigException.BUG);
    }
}
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
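To illustrate the intended behavior, here is a cut-down standalone version covering only the scalar cases. The class name is hypothetical, and it mirrors the '1'/'0' boolean convention of the proposed code (rather than Boolean.toString(), as discussed in the comments):

```java
public class ToStringDemo {
    // Hypothetical standalone sketch of the proposed DataTypes.toString,
    // scalar cases only: booleans map to "1"/"0", numbers and strings
    // use their natural representation, null passes through as null.
    public static String toString(Object o) {
        if (o == null) return null;
        if (o instanceof Boolean) return ((Boolean) o) ? "1" : "0";
        if (o instanceof String) return (String) o;
        if (o instanceof Number) return o.toString();
        throw new IllegalArgumentException(
            "Cannot convert a " + o.getClass().getSimpleName() + " to a String");
    }
}
```

For instance, `ToStringDemo.toString(42)` yields `"42"` and `ToStringDemo.toString(Boolean.TRUE)` yields `"1"`, so a UDF like TOLOWERCASE can accept any scalar argument uniformly.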
[jira] Updated: (PIG-745) Please add DataTypes.toString() conversion function
[ https://issues.apache.org/jira/browse/PIG-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-745:
---------------------------

       Resolution: Fixed
    Fix Version/s: 0.3.0
           Status: Resolved  (was: Patch Available)

Patch checked in. Thanks Ciemo for the contribution.

Please add DataTypes.toString() conversion function
---------------------------------------------------

                 Key: PIG-745
                 URL: https://issues.apache.org/jira/browse/PIG-745
             Project: Pig
          Issue Type: Improvement
            Reporter: David Ciemiewicz
             Fix For: 0.3.0
        Attachments: PIG-745.patch

I'm doing some work in string manipulation UDFs and I've found that it would be very convenient if I could always convert the argument to a chararray (internally a Java String). For example, TOLOWERCASE(arg) shouldn't really care whether arg is a bytearray, chararray, int, long, double, or float; it should be treated as a string and operated on. The simplest and most foolproof method would be for DataTypes to add a static DataTypes.toString() function which did all of the argument type checking and provided consistent translation.

I believe that this function might be coded as:
{code}
public static String toString(Object o) throws ExecException {
    try {
        switch (findType(o)) {
        case BOOLEAN:
            return ((Boolean)o) ? "1" : "0";
        case BYTE: return ((Byte)o).toString();
        case INTEGER: return ((Integer)o).toString();
        case LONG: return ((Long)o).toString();
        case FLOAT: return ((Float)o).toString();
        case DOUBLE: return ((Double)o).toString();
        case BYTEARRAY: return ((DataByteArray)o).toString();
        case CHARARRAY: return (String)o;
        case NULL: return null;
        case MAP:
        case TUPLE:
        case BAG:
        case UNKNOWN:
        default:
            int errCode = 1071;
            String msg = "Cannot convert a " + findTypeName(o) + " to a String";
            throw new ExecException(msg, errCode, PigException.INPUT);
        }
    } catch (ExecException ee) {
        throw ee;
    } catch (Exception e) {
        int errCode = 2054;
        String msg = "Internal error. Could not convert " + o + " to String.";
        throw new ExecException(msg, errCode, PigException.BUG);
    }
}
{code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-759) HBaseStorage scheme for Load/Slice function
HBaseStorage scheme for Load/Slice function
-------------------------------------------

                 Key: PIG-759
                 URL: https://issues.apache.org/jira/browse/PIG-759
             Project: Pig
          Issue Type: Bug
            Reporter: Gunther Hagleitner

We would like to change the HBaseStorage function to use a scheme when loading a table in pig. The scheme we are thinking of is hbase. So in order to load an hbase table in a pig script, the statement should read:
{noformat}
table = load 'hbase://tablename' using HBaseStorage();
{noformat}
If the scheme is omitted, pig would assume the tablename to be an hdfs path; the storage function would use the last component of the path as the table name and output a warning. For details on why, see JIRA issue PIG-758.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
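The fallback described above (hbase:// scheme versus hdfs-path-with-warning) could be sketched like this. This is a hypothetical helper for illustration, not the actual HBaseStorage code:

```java
import java.net.URI;

public class HBaseLocation {
    // Sketch of the proposed table-name resolution: an hbase:// location
    // yields the table name directly; any other location is treated as an
    // hdfs path whose last component names the table, with a warning.
    public static String tableName(String location) {
        URI uri = URI.create(location);
        if ("hbase".equals(uri.getScheme())) {
            // for hbase://tablename the table name lands in the authority part
            return (uri.getAuthority() != null) ? uri.getAuthority() : uri.getPath();
        }
        System.err.println("Warning: no hbase:// scheme on '" + location
                + "'; using last path component as the table name");
        int slash = location.lastIndexOf('/');
        return (slash >= 0) ? location.substring(slash + 1) : location;
    }
}
```

So `tableName("hbase://users")` and `tableName("/some/hdfs/path/users")` both resolve to the table `users`, but only the latter triggers the warning.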
[jira] Created: (PIG-761) ERROR 2086 on simple JOIN
ERROR 2086 on simple JOIN
-------------------------

                 Key: PIG-761
                 URL: https://issues.apache.org/jira/browse/PIG-761
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.2.0
         Environment: mapreduce mode
            Reporter: Vadim Zaliva

ERROR 2086: Unexpected problem during optimization. Could not find all LocalRearrange operators.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias 109

I am doing a pretty straightforward join in one of my pig scripts. I am able to 'dump' both relationships involved in this join, but when I try to join them I am getting this error. Here is the full log:

ERROR 2086: Unexpected problem during optimization. Could not find all LocalRearrange operators.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias 109
        at org.apache.pig.PigServer.registerQuery(PigServer.java:296)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:529)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:280)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:319)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution.
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:274)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:700)
        at org.apache.pig.PigServer.execute(PigServer.java:691)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:292)
        ... 5 more
Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2086: Unexpected problem during optimization. Could not find all LocalRearrange operators.
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.POPackageAnnotator.handlePackage(POPackageAnnotator.java:116)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.plans.POPackageAnnotator.visitMROp(POPackageAnnotator.java:88)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:194)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:43)
        at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:65)
        at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67)
        at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67)
        at org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:67)
        at org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:50)
        at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:198)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:80)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:261)
        ... 8 more

ERROR 1002: Unable to store alias 398
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias 398
        at org.apache.pig.PigServer.registerQuery(PigServer.java:296)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:529)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:280)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:99)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:319)
Caused by: java.lang.NullPointerException
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:669)
        at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:330)
        at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:41)
        at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
        at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:246)
        at org.apache.pig.PigServer.compilePp(PigServer.java:771)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:697)
        at org.apache.pig.PigServer.execute(PigServer.java:691)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:292)
        ... 5 more

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.