[jira] Updated: (PIG-893) support cast of chararray to other simple types
[ https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-893: --- Resolution: Fixed Release Note: PIG-893: Added casts from chararray to int, long, float, and double. Status: Resolved (was: Patch Available) Patch checked in. Thanks Jeff for your work on this. support cast of chararray to other simple types --- Key: PIG-893 URL: https://issues.apache.org/jira/browse/PIG-893 Project: Pig Issue Type: New Feature Affects Versions: 0.4.0 Reporter: Thejas M Nair Assignee: Jeff Zhang Fix For: 0.4.0 Attachments: Pig_893.Patch Pig should support casting of chararray to integer, long, float, double, and bytearray. If the conversion fails for reasons such as overflow, the cast should return null and log a warning. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
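The behaviour requested in PIG-893 (return null and log a warning when a chararray cannot be converted) can be sketched in plain Java as follows. This is only an illustration of the described semantics, not the code from Pig_893.Patch; the class name `ChararrayCastSketch` and its method names are hypothetical.

```java
import java.util.logging.Logger;

// Hypothetical illustration of "cast returns null and logs a warning on failure".
// Not taken from Pig; names are made up for this sketch.
final class ChararrayCastSketch {
    private static final Logger LOG = Logger.getLogger("ChararrayCastSketch");

    private ChararrayCastSketch() {}

    // Returns null for null input, malformed text, or values outside the int range.
    static Integer toInteger(String s) {
        if (s == null) return null;
        try {
            return Integer.valueOf(s.trim());
        } catch (NumberFormatException e) {
            LOG.warning("Unable to interpret '" + s + "' as int; returning null");
            return null;
        }
    }

    // Same contract for long.
    static Long toLong(String s) {
        if (s == null) return null;
        try {
            return Long.valueOf(s.trim());
        } catch (NumberFormatException e) {
            LOG.warning("Unable to interpret '" + s + "' as long; returning null");
            return null;
        }
    }

    // Same contract for double; only malformed text is rejected here.
    static Double toDouble(String s) {
        if (s == null) return null;
        try {
            return Double.valueOf(s.trim());
        } catch (NumberFormatException e) {
            LOG.warning("Unable to interpret '" + s + "' as double; returning null");
            return null;
        }
    }
}
```

In this sketch, integer overflow surfaces as a `NumberFormatException` from `Integer.valueOf`, so an oversized value such as "99999999999" yields null rather than a wrapped number.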
[jira] Commented: (PIG-911) [Piggybank] SequenceFileLoader
[ https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742239#action_12742239 ]

Alan Gates commented on PIG-911:

Dmitry, first, this is great. We've had requests to read Sequence files. Being able to write them also would be great. A few thoughts:

1) This should not extend UTF8StorageConverter. This loader will be returning actual data types, not bytes that need to be interpreted. I would think instead that it should implement the bytesToX() methods itself and just throw an exception saying it didn't expect to do any conversion.

2) The getSampledTuple looks fine if skip is handling getting the stream to the point that reading the next tuple is viable.

3) In the bindTo call, where you obtain the key and value by reflection, should there be a try/catch block there in case the cast to Writable fails? In the same way, in describe schema you're asking how to suppress warnings from the cast in reader.getKeyClass(). But don't you want to check that what you got really is a Writable, since there is no guarantee?

[Piggybank] SequenceFileLoader --- Key: PIG-911 URL: https://issues.apache.org/jira/browse/PIG-911 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Attachments: pig_sequencefile.patch The proposed piggybank contribution adds a SequenceFileLoader to the piggybank.
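Point 3 of the comment above (verify that a reflectively obtained class really is a Writable instead of casting and suppressing the warning) could look roughly like the sketch below. To stay self-contained it uses a stand-in `Writable` interface rather than org.apache.hadoop.io.Writable, and every name in it (`WritableCheckSketch`, `TextLike`, `instantiate`) is hypothetical; it is not the patch's code.

```java
// Hypothetical sketch: check the type before instantiating by reflection.
final class WritableCheckSketch {
    private WritableCheckSketch() {}

    // Stand-in for org.apache.hadoop.io.Writable so the example needs no Hadoop jar.
    interface Writable {}

    // A trivial Writable-like class with a no-argument constructor.
    static final class TextLike implements Writable {}

    // Instantiates cls only if it is a Writable; otherwise fails with a clear message.
    static Writable instantiate(Class<?> cls) {
        if (!Writable.class.isAssignableFrom(cls)) {
            throw new IllegalArgumentException(
                "Expected a Writable but got " + cls.getName());
        }
        try {
            // asSubclass performs a checked cast, so no unchecked warning is needed.
            return cls.asSubclass(Writable.class).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + cls.getName(), e);
        }
    }
}
```

The explicit `isAssignableFrom` test turns a late `ClassCastException` into an early error that names the offending class, which is the guarantee the comment asks for.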
[jira] Commented: (PIG-833) Storage access layer
[ https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742318#action_12742318 ] Hudson commented on PIG-833: Integrated in Pig-trunk #520 (See [http://hudson.zones.apache.org/hudson/job/Pig-trunk/520/]) : Added Zebra, new columnar storage mechanism for HDFS. Storage access layer Key: PIG-833 URL: https://issues.apache.org/jira/browse/PIG-833 Project: Pig Issue Type: New Feature Reporter: Jay Tang Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz A layer is needed to provide a high level data access abstraction and a tabular view of data in Hadoop, and could free Pig users from implementing their own data storage/retrieval code. This layer should also include a columnar storage format in order to provide fast data projection, CPU/space-efficient data serialization, and a schema language to manage physical storage metadata. Eventually it could also support predicate pushdown for further performance improvement. Initially, this layer could be a contrib project in Pig and become a hadoop subproject later on.
[jira] Commented: (PIG-893) support cast of chararray to other simple types
[ https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742319#action_12742319 ] Hudson commented on PIG-893: Integrated in Pig-trunk #520 (See [http://hudson.zones.apache.org/hudson/job/Pig-trunk/520/]) : Added string -> integer, long, float, and double casts. support cast of chararray to other simple types --- Key: PIG-893 URL: https://issues.apache.org/jira/browse/PIG-893 Project: Pig Issue Type: New Feature Affects Versions: 0.4.0 Reporter: Thejas M Nair Assignee: Jeff Zhang Fix For: 0.4.0 Attachments: Pig_893.Patch Pig should support casting of chararray to integer, long, float, double, and bytearray. If the conversion fails for reasons such as overflow, the cast should return null and log a warning.
[jira] Updated: (PIG-845) PERFORMANCE: Merge Join
[ https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-845: - Attachment: (was: merge-join-1.patch) PERFORMANCE: Merge Join --- Key: PIG-845 URL: https://issues.apache.org/jira/browse/PIG-845 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Ashutosh Chauhan This join would work if the data for both tables is sorted on the join key.
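For readers unfamiliar with the technique, the reason a merge join needs both inputs sorted on the join key is that it walks the two inputs in lockstep and never looks backwards past the current run of equal keys. The following is a generic in-memory sketch of that idea, assuming integer keys held in sorted arrays; it is not the implementation in merge-join.patch, and `MergeJoinSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Generic sort-merge join over two arrays of keys that are already sorted ascending.
// Illustration only; Pig's merge join works on files and an index, not arrays.
final class MergeJoinSketch {
    private MergeJoinSketch() {}

    // Returns {leftIndex, rightIndex} pairs for every pair of rows with equal keys.
    static List<int[]> join(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++;                       // left key too small, advance left
            } else if (left[i] > right[j]) {
                j++;                       // right key too small, advance right
            } else {
                int key = left[i];
                int runStart = j;          // start of the equal-key run on the right
                while (i < left.length && left[i] == key) {
                    j = runStart;          // replay the right-hand run for each left row
                    while (j < right.length && right[j] == key) {
                        out.add(new int[] {i, j});
                        j++;
                    }
                    i++;
                }
            }
        }
        return out;
    }
}
```

Each input is scanned once apart from replaying runs of duplicate keys, which is why no hash table or shuffle of the full data set is required when the sort order already exists.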
[jira] Commented: (PIG-845) PERFORMANCE: Merge Join
[ https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742562#action_12742562 ]

Dmitriy V. Ryaboy commented on PIG-845:

Alan, Ashutosh -- maybe I am misunderstanding where null keys come from in the Indexer. I assumed this was due to the processing that happens in the plan the indexer deserializes and attaches to its POLocalRearrange.

With regard to errors, I was referring to this:

{code}
catch (PlanException e) {
    int errCode = 2034;
    String msg = "Error compiling operator " + joinOp.getClass().getCanonicalName();
    throw new MRCompilerException(msg, errCode, PigException.BUG, e);
}
{code}

The only central place for error codes seems to be the Wiki. A class with a bunch of static final error codes would be a better place.

Ashutosh, I completely disagree with you on changing all tests to run in MR mode. The tests are already impossible to run on a laptop (people, myself included, actually submit patches to jira just to see if tests pass). Running in MR mode will incur significant overhead per test. Only things that actually rely on the MR bits should be tested in MR (and use mock objects if possible; there's been some advancement on that front in Hadoop 20, but I haven't looked at it yet).

Would love to see a more efficient indexing MR job (which will reduce load on the JT, keep schedules less busy, and incur less overhead in task startups by requiring fewer tasks), but perhaps not before 0.4 is out the door with existing functionality. Just to be clear, I don't think more than 1 record per block is necessary, but more than one block per task would probably be a good thing.

Any thoughts on how to choose which of two relations to index? We get locality on the non-indexed relation, but not on the indexed one, which probably throws a kink in the normal way of thinking about this.

PERFORMANCE: Merge Join --- Key: PIG-845 URL: https://issues.apache.org/jira/browse/PIG-845 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Ashutosh Chauhan Attachments: merge-join.patch This join would work if the data for both tables is sorted on the join key.
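The suggestion in the comment above, to keep error codes in one class of static final constants rather than only on the Wiki, might look like the sketch below. The value 2034 and its meaning come from the quoted snippet; the class name `ErrorCodesSketch`, the constant name, and the `format` helper are hypothetical and are not part of Pig.

```java
// Hypothetical central registry of error codes, as proposed in the comment.
final class ErrorCodesSketch {
    private ErrorCodesSketch() {}

    // "Error compiling operator <class>", the code used in the quoted catch block.
    static final int ERROR_COMPILING_OPERATOR = 2034;

    // Illustrative helper that renders a code and message in one consistent shape.
    static String format(int code, String message) {
        return "ERROR " + code + ": " + message;
    }
}
```

With such a class the quoted catch block could refer to `ErrorCodesSketch.ERROR_COMPILING_OPERATOR` instead of the literal 2034, so a search for usages shows every site that raises the code.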
[jira] Commented: (PIG-911) [Piggybank] SequenceFileLoader
[ https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742565#action_12742565 ] Dmitriy V. Ryaboy commented on PIG-911: --- Alan, Thanks for the feedback. I'll add the try/catch. Regarding the UTF8StorageConverter: I think I added it because, before that, the code broke if you didn't declare a schema at load time (so, a = load 'foo' using SequenceFileLoader() as (a,b) instead of a = load 'foo' using SequenceFileLoader() as (a:chararray, b:double)). I'll figure out exactly what is going on with that and remove the UTF8StorageConverter. Will add Store as time allows. [Piggybank] SequenceFileLoader --- Key: PIG-911 URL: https://issues.apache.org/jira/browse/PIG-911 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Attachments: pig_sequencefile.patch The proposed piggybank contribution adds a SequenceFileLoader to the piggybank.
[jira] Created: (PIG-918) [zebra] LOAD call will hang if only the first column group is queried
[zebra] LOAD call will hang if only the first column group is queried - Key: PIG-918 URL: https://issues.apache.org/jira/browse/PIG-918 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Yan Zhou Fix For: 0.2.0 Zebra's LOAD call hangs when the projection includes only column(s) in the first column group. The random index into the array of column groups is drawn from an improper range that always skips the first element, so if none of the other column groups is used, the loop keeps running without a chance to break.
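One way such a hang can arise is shown in the sketch below: a random index drawn from 1..n-1 can never be 0, so a loop waiting to land on a used column group never terminates when only group 0 is used. This is a guess at the general shape of the bug for illustration, not the Zebra source; `RandomIndexSketch`, `pickUsed`, and the attempt cap (added so the demonstration terminates) are all hypothetical.

```java
import java.util.Random;

// Hypothetical reproduction of "random index range skips the first element".
final class RandomIndexSketch {
    private RandomIndexSketch() {}

    // Searches for a used column group by random probing.
    // Returns its index, or -1 after maxAttempts (standing in for an endless loop).
    // Requires used.length >= 2 so that the buggy range is non-empty.
    static int pickUsed(boolean[] used, Random random, boolean buggy, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int index = buggy
                ? 1 + random.nextInt(used.length - 1)   // 1..n-1: index 0 is unreachable
                : random.nextInt(used.length);          // 0..n-1: every group reachable
            if (used[index]) {
                return index;
            }
        }
        return -1;
    }
}
```

With `used = {true, false, false}` the buggy range can only probe indices 1 and 2, so without the cap the loop would spin forever, which matches the symptom described in the report.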
[jira] Updated: (PIG-917) [zebra]some issues on compression
[ https://issues.apache.org/jira/browse/PIG-917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Huang updated PIG-917: Affects Version/s: (was: 0.1.0) 0.3.0 Fix Version/s: (was: 0.2.0) 0.4.0

[zebra]some issues on compression - Key: PIG-917 URL: https://issues.apache.org/jira/browse/PIG-917 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Jing Huang Fix For: 0.4.0

These are Zebra compression-related issues:

1. ColumnGoupParser only recognizes gzip, not gz. For example, if the user specifies compress by gz, it throws org.apache.hadoop.zebra.types.ParseException.

2. BasicTable.dumpInfo is wrong. It always prints Compressor: lzo2, even if the default compressor is gz or the user specifies compress by gzip. So we cannot verify whether the default compressor can actually be overridden.
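For the first item, a parser that accepts both spellings would typically normalise the name before looking it up. The sketch below shows that idea only; it is not Zebra's parser, the choice of "gz" as the canonical name is an assumption, and `CompressorNameSketch` is a hypothetical name.

```java
import java.util.Locale;

// Hypothetical alias handling for compressor names; not taken from Zebra.
final class CompressorNameSketch {
    private CompressorNameSketch() {}

    // Maps user-supplied spellings to one canonical name (assumed here to be "gz").
    static String canonical(String name) {
        switch (name.trim().toLowerCase(Locale.ROOT)) {
            case "gz":
            case "gzip":
                return "gz";
            default:
                throw new IllegalArgumentException("Unknown compressor: " + name);
        }
    }
}
```

Normalising once at parse time would also give a reporting routine such as dumpInfo a single stored value to print, rather than a hard-coded name.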
[jira] Updated: (PIG-918) [zebra] LOAD call will hang if only the first column group is queried
[ https://issues.apache.org/jira/browse/PIG-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Zhou updated PIG-918: - Affects Version/s: (was: 0.2.0) 0.3.0 Fix Version/s: (was: 0.2.0) 0.4.0 [zebra] LOAD call will hang if only the first column group is queried - Key: PIG-918 URL: https://issues.apache.org/jira/browse/PIG-918 Project: Pig Issue Type: Bug Affects Versions: 0.3.0 Reporter: Yan Zhou Fix For: 0.4.0 Attachments: pig-zebra.patch Zebra's LOAD call hangs when the projection includes only column(s) in the first column group. The random index into the array of column groups is drawn from an improper range that always skips the first element, so if none of the other column groups is used, the loop keeps running without a chance to break.
[jira] Created: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group
Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group -- Key: PIG-919 URL: https://issues.apache.org/jira/browse/PIG-919 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.3.0

I have a Pig script, which takes in a student file and generates a bag of maps. I later want to group on the value of the key name0, which corresponds to the first name of the student.

{code}
register mymapudf.jar;
data = LOAD '/user/viraj/studenttab10k' AS (somename:chararray,age:long,marks:float);
genmap = foreach data generate flatten(mymapudf.GenHashList(somename,' ')) as bp:map[], age, marks;
getfirstnames = foreach genmap generate bp#'name0' as firstname, age, marks;
filternonnullfirstnames = filter getfirstnames by firstname is not null;
groupgenmap = group filternonnullfirstnames by firstname;
dump groupgenmap;
{code}

When I execute this code, I get an error in the Map Phase:

===
java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:242)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
===
[jira] Commented: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group
[ https://issues.apache.org/jira/browse/PIG-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742668#action_12742668 ]

Viraj Bhat commented on PIG-919:

This problem can be solved simply by casting the firstname to chararray!! Why??

{code}
groupgenmap = group filternonnullfirstnames by (chararray)firstname;
dump groupgenmap;
{code}

Is there a problem with the UDF??

Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group -- Key: PIG-919 URL: https://issues.apache.org/jira/browse/PIG-919 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.3.0 Attachments: GenHashList.java, mapscript.pig, mymapudf.jar

I have a Pig script, which takes in a student file and generates a bag of maps. I later want to group on the value of the key name0, which corresponds to the first name of the student.

{code}
register mymapudf.jar;
data = LOAD '/user/viraj/studenttab10k' AS (somename:chararray,age:long,marks:float);
genmap = foreach data generate flatten(mymapudf.GenHashList(somename,' ')) as bp:map[], age, marks;
getfirstnames = foreach genmap generate bp#'name0' as firstname, age, marks;
filternonnullfirstnames = filter getfirstnames by firstname is not null;
groupgenmap = group filternonnullfirstnames by firstname;
dump groupgenmap;
{code}

When I execute this code, I get an error in the Map Phase:

===
java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:242)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
===