[jira] Updated: (PIG-893) support cast of chararray to other simple types

2009-08-12 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-893:
---

  Resolution: Fixed
Release Note: PIG-893:  Added casts from chararray to int, long, float, and 
double.
  Status: Resolved  (was: Patch Available)

Patch checked in.  Thanks Jeff for your work on this.

 support cast of chararray to other simple types
 ---

 Key: PIG-893
 URL: https://issues.apache.org/jira/browse/PIG-893
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.4.0
Reporter: Thejas M Nair
Assignee: Jeff Zhang
 Fix For: 0.4.0

 Attachments: Pig_893.Patch


 Pig should support casting chararray to int, long, float, double, and 
 bytearray. If the conversion fails, for reasons such as overflow, the cast 
 should return null and log a warning.
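
The requested behaviour (return null and log a warning when the conversion 
fails) can be sketched roughly as follows. This is only an illustration, not 
the code from the attached patch; the class and method names are hypothetical, 
and java.util.logging stands in for whatever logger Pig actually uses.

```java
import java.util.logging.Logger;

// Hypothetical helper for illustration; not Pig's actual cast implementation.
final class ChararrayCastSketch {
    private static final Logger LOG = Logger.getLogger("ChararrayCastSketch");

    private ChararrayCastSketch() {}

    /** Returns the parsed int, or null (after logging a warning) if parsing fails. */
    static Integer toInteger(String s) {
        if (s == null) {
            return null;
        }
        try {
            // Integer.valueOf throws NumberFormatException on malformed input
            // and on values outside the int range (overflow).
            return Integer.valueOf(s.trim());
        } catch (NumberFormatException e) {
            LOG.warning("Unable to cast chararray '" + s + "' to int; returning null");
            return null;
        }
    }

    /** Same pattern for long. */
    static Long toLong(String s) {
        if (s == null) {
            return null;
        }
        try {
            return Long.valueOf(s.trim());
        } catch (NumberFormatException e) {
            LOG.warning("Unable to cast chararray '" + s + "' to long; returning null");
            return null;
        }
    }
}
```

Float and double would follow the same pattern, with one caveat: 
Double.valueOf reports overflow as Infinity rather than throwing, so an 
explicit check would be needed if overflow should also yield null.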

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-911) [Piggybank] SequenceFileLoader

2009-08-12 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742239#action_12742239
 ] 

Alan Gates commented on PIG-911:


Dmitry,

First, this is great.  We've had requests to read SequenceFiles.  Being able to 
write them as well would be great.

A few thoughts:

1) This should not extend UTF8StorageConverter.  This loader will be returning 
actual data types, not bytes that need to be interpreted.  I would think 
instead that it should implement the bytesToX() methods itself and just throw 
an exception saying it didn't expect to do any conversion.

2) getSampledTuple looks fine, provided skip gets the stream to a point where 
reading the next tuple is viable.

3) In the bindTo call, where you obtain the key and value by reflection, should 
there be a try/catch block in case the cast to Writable fails?  Similarly, in 
describe schema you're asking how to suppress warnings from the cast in 
reader.getKeyClass().  But don't you want to check that what you got really is 
a Writable, since there is no guarantee?
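
Points 1 and 3 could look roughly like the sketch below. It is self-contained 
so that it compiles without Pig or Hadoop on the classpath: WritableLike is a 
hypothetical stand-in for org.apache.hadoop.io.Writable, and the class and 
method names are assumptions rather than anything taken from the patch.

```java
import java.io.IOException;

// Hypothetical stand-in for org.apache.hadoop.io.Writable.
interface WritableLike {}

// Hypothetical stand-in for a concrete Writable such as Text.
final class TextLike implements WritableLike {}

final class SequenceLoaderSketch {

    // Point 1: the loader already returns typed values, so being asked to
    // interpret raw bytes indicates a problem and is reported as an error.
    Integer bytesToInteger(byte[] b) throws IOException {
        throw new IOException("SequenceFileLoader did not expect to convert bytes to Integer");
    }

    // Point 3: check the class obtained by reflection before casting, instead
    // of suppressing the unchecked-cast warning.
    static Class<? extends WritableLike> requireWritable(Class<?> cls) throws IOException {
        if (cls == null || !WritableLike.class.isAssignableFrom(cls)) {
            throw new IOException("Expected a Writable but got " + cls);
        }
        // asSubclass performs a checked narrowing, so no warning is generated.
        return cls.asSubclass(WritableLike.class);
    }
}
```

With a check like requireWritable in place, a SequenceFile whose key or value 
class is not a Writable produces a clear error instead of a ClassCastException 
later on.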



 [Piggybank] SequenceFileLoader 
 ---

 Key: PIG-911
 URL: https://issues.apache.org/jira/browse/PIG-911
 Project: Pig
  Issue Type: New Feature
Reporter: Dmitriy V. Ryaboy
 Attachments: pig_sequencefile.patch


 The proposed piggybank contribution adds a SequenceFileLoader to the 
 piggybank.




[jira] Commented: (PIG-833) Storage access layer

2009-08-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742318#action_12742318
 ] 

Hudson commented on PIG-833:


Integrated in Pig-trunk #520 (See 
[http://hudson.zones.apache.org/hudson/job/Pig-trunk/520/])
: Added Zebra, new columnar storage mechanism for HDFS.


 Storage access layer
 

 Key: PIG-833
 URL: https://issues.apache.org/jira/browse/PIG-833
 Project: Pig
  Issue Type: New Feature
Reporter: Jay Tang
 Attachments: hadoop20.jar.bz2, PIG-833-zebra.patch, 
 PIG-833-zebra.patch.bz2, PIG-833-zebra.patch.bz2, 
 TEST-org.apache.hadoop.zebra.pig.TestCheckin1.txt, test.out, zebra-javadoc.tgz


 A layer is needed to provide a high level data access abstraction and a 
 tabular view of data in Hadoop, and could free Pig users from implementing 
 their own data storage/retrieval code.  This layer should also include a 
 columnar storage format in order to provide fast data projection, 
 CPU/space-efficient data serialization, and a schema language to manage 
 physical storage metadata.  Eventually it could also support predicate 
 pushdown for further performance improvement.  Initially, this layer could be 
 a contrib project in Pig and become a hadoop subproject later on.




[jira] Commented: (PIG-893) support cast of chararray to other simple types

2009-08-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742319#action_12742319
 ] 

Hudson commented on PIG-893:


Integrated in Pig-trunk #520 (See 
[http://hudson.zones.apache.org/hudson/job/Pig-trunk/520/])
:  Added string -> integer, long, float, and double casts.





[jira] Commented: (PIG-833) Storage access layer

2009-08-12 Thread Amr Awadallah (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742321#action_12742321
 ] 

Amr Awadallah commented on PIG-833:
---

I am out of office until Aug 14th. I will be checking my email
intermittently. If this is urgent then please call my cell phone,
otherwise I will reply to your email when I get back.

Thanks for your patience,

-- amr





[jira] Updated: (PIG-845) PERFORMANCE: Merge Join

2009-08-12 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-845:
-

Attachment: (was: merge-join-1.patch)

 PERFORMANCE: Merge Join
 ---

 Key: PIG-845
 URL: https://issues.apache.org/jira/browse/PIG-845
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Ashutosh Chauhan

 This join would work if the data for both tables is sorted on the join key.
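
 As a rough illustration of why sorted inputs matter, a merge join over two 
 key arrays sorted in ascending order can be sketched as below. This is a 
 generic in-memory version with hypothetical names, not the MapReduce 
 implementation in the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Generic merge join sketch; names are hypothetical.
final class MergeJoinSketch {

    private MergeJoinSketch() {}

    /**
     * Joins two arrays of keys, both sorted ascending, and returns the
     * matching {leftIndex, rightIndex} pairs.
     */
    static List<int[]> join(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++; // left key is too small to match anything remaining on the right
            } else if (left[i] > right[j]) {
                j++; // right key is too small to match anything remaining on the left
            } else {
                // Equal keys: emit the cross product of the two runs of this key.
                int key = left[i];
                int runStart = j;
                while (i < left.length && left[i] == key) {
                    for (j = runStart; j < right.length && right[j] == key; j++) {
                        out.add(new int[] {i, j});
                    }
                    i++;
                }
                // j now points just past the run of `key` on the right side.
            }
        }
        return out;
    }
}
```

 Because both cursors only move forward, each input is scanned once; if 
 either side were unsorted, matches could be skipped.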




[jira] Commented: (PIG-845) PERFORMANCE: Merge Join

2009-08-12 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742562#action_12742562
 ] 

Dmitriy V. Ryaboy commented on PIG-845:
---

Alan, Ashutosh -- maybe I am misunderstanding where null keys come from in the 
Indexer. I assumed this was due to the processing that happens in the plan the 
indexer deserializes and attaches to its POLocalRearrange.

In regards to errors, I was referring to this:
{code}
catch(PlanException e){
    int errCode = 2034;
    String msg = "Error compiling operator " + 
        joinOp.getClass().getCanonicalName();
    throw new MRCompilerException(msg, errCode, PigException.BUG, e);
}
{code}

The only central place for error codes seems to be the Wiki.  A class with a 
bunch of static+final error codes would be a better place.
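
A minimal sketch of what such a class might look like is shown here. The class 
name, constant name, and describe helper are hypothetical; only the value 2034 
and its message come from the snippet above.

```java
// Hypothetical central registry of error codes; not an existing Pig class.
final class ErrorCodesSketch {

    private ErrorCodesSketch() {}

    /** Error compiling operator (the code used in the catch block above). */
    static final int ERROR_COMPILING_OPERATOR = 2034;

    /** Returns a short description for a known code. */
    static String describe(int code) {
        switch (code) {
            case ERROR_COMPILING_OPERATOR:
                return "Error compiling operator";
            default:
                return "Unknown error code " + code;
        }
    }
}
```

Call sites would then refer to ErrorCodesSketch.ERROR_COMPILING_OPERATOR 
instead of a bare 2034, so the list of codes lives next to the code that uses 
it rather than only on the Wiki.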


Ashutosh, I completely disagree with you on changing all tests to run in MR 
mode.  The tests are already impossible to run on a laptop (people, myself 
included, actually submit patches to jira just to see if tests pass).  Running 
in MR mode will incur significant overhead per test. Only things that actually 
rely on the MR bits should be tested in MR (and use mock objects if possible.. 
there's been some advancement on that front in Hadoop 20, I haven't looked at 
it yet).

Would love to see a more efficient indexing MR job (which will reduce load on 
the JT, keep schedules less busy, and incur less overhead in task startups by 
requiring fewer tasks), but perhaps not before 0.4 is out the door with 
existing functionality.  Just to be clear, I don't think more than 1 record per 
block is necessary, but more than one block per task would probably be a good 
thing.

Any thoughts on how to choose which of two relations to index? We get locality 
on the non-indexed relation, but not on the indexed one, which probably throws 
a kink in the normal way of thinking about this.



 PERFORMANCE: Merge Join
 ---

 Key: PIG-845
 URL: https://issues.apache.org/jira/browse/PIG-845
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Ashutosh Chauhan
 Attachments: merge-join.patch


 This join would work if the data for both tables is sorted on the join key.




[jira] Commented: (PIG-911) [Piggybank] SequenceFileLoader

2009-08-12 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742565#action_12742565
 ] 

Dmitriy V. Ryaboy commented on PIG-911:
---

Alan, 
Thanks for the feedback.

I'll add the try/catch.

Regarding the UTF8StorageConverter -- I think I added it because, without it, 
the code broke if you didn't declare a schema at load time (that is, a=load 
'foo' using SequenceFileLoader() as (a,b) instead of a=load 'foo' using 
SequenceFileLoader() as (a:chararray, b:double)).

I'll figure out what exactly is going on with that and remove the 
UTF8StorageConverter 

Will add Store as time allows.






[jira] Created: (PIG-918) [zebra] LOAD call will hang if only the first column group is queried

2009-08-12 Thread Yan Zhou (JIRA)
[zebra] LOAD call will hang if only the first column group is queried
-

 Key: PIG-918
 URL: https://issues.apache.org/jira/browse/PIG-918
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Yan Zhou
 Fix For: 0.2.0


Zebra's LOAD call with projections that include only column(s) in the first 
column group will hang. An improper range of random numbers used to index the 
array of column groups always skips the first element, so if no other column 
group is used, the loop keeps running without a chance to break.
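
The failure mode can be illustrated with the sketch below. It is not the Zebra 
code, which is not shown in this report; the names are hypothetical and the 
loop is bounded by maxTries so the example terminates. It assumes at least two 
column groups.

```java
import java.util.Random;

// Hypothetical illustration of an off-by-one random index range.
final class ColumnGroupPickSketch {

    private ColumnGroupPickSketch() {}

    /** Improper range: yields 1..n-1, so index 0 is never chosen (requires n >= 2). */
    static int pickSkippingFirst(Random rnd, int n) {
        return 1 + rnd.nextInt(n - 1);
    }

    /** Full range: yields 0..n-1. */
    static int pick(Random rnd, int n) {
        return rnd.nextInt(n);
    }

    /**
     * Draws random indexes until one lands on a used column group.
     * Returns that index, or -1 if none was found within maxTries draws.
     */
    static int findUsed(boolean[] used, Random rnd, boolean skipFirst, int maxTries) {
        for (int attempt = 0; attempt < maxTries; attempt++) {
            int idx = skipFirst ? pickSkippingFirst(rnd, used.length)
                                : pick(rnd, used.length);
            if (used[idx]) {
                return idx;
            }
        }
        return -1; // an unbounded loop would spin forever at this point
    }
}
```

When only the first group is used, the skipping variant can never return a 
used index, which corresponds to the hang described above.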




[jira] Updated: (PIG-917) [zebra]some issues on compression

2009-08-12 Thread Jing Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Huang updated PIG-917:
---

Affects Version/s: (was: 0.1.0)
   0.3.0
Fix Version/s: (was: 0.2.0)
   0.4.0

 [zebra]some issues on compression
 -

 Key: PIG-917
 URL: https://issues.apache.org/jira/browse/PIG-917
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Jing Huang
 Fix For: 0.4.0


 These are zebra compression related issues:
 1. ColumnGoupParser only recognizes gzip, not gz. For example, if the user 
 specifies compress by gz, it throws 
 org.apache.hadoop.zebra.types.ParseException.
 2. BasicTable.dumpInfo is wrong. It always prints Compressor: lzo2, even 
 if the default compressor is gz or the user specifies compress by gzip, 
 so we cannot verify whether the default compressor can actually be overridden.
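
 One way to address the first item would be to normalize the compressor name 
 before it is looked up. The sketch below is only an assumption about how 
 that could be done; the class name is hypothetical and the choice of gz as 
 the canonical spelling is a guess based on the default mentioned above.

```java
import java.util.Locale;

// Hypothetical alias handling; not taken from the Zebra parser.
final class CompressorNameSketch {

    private CompressorNameSketch() {}

    /** Maps accepted spellings to one canonical name; other names pass through lowercased. */
    static String normalize(String name) {
        String n = name.trim().toLowerCase(Locale.ROOT);
        switch (n) {
            case "gz":
            case "gzip":
                return "gz"; // assumed canonical spelling
            default:
                return n;
        }
    }
}
```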




[jira] Updated: (PIG-918) [zebra] LOAD call will hang if only the first column group is queried

2009-08-12 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-918:
-

Affects Version/s: (was: 0.2.0)
   0.3.0
Fix Version/s: (was: 0.2.0)
   0.4.0

 [zebra] LOAD call will hang if only the first column group is queried
 -

 Key: PIG-918
 URL: https://issues.apache.org/jira/browse/PIG-918
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Yan Zhou
 Fix For: 0.4.0

 Attachments: pig-zebra.patch


 Zebra's LOAD call with projections that include only column(s) in the first 
 column group will hang. An improper range of random numbers used to index 
 the array of column groups always skips the first element, so if no other 
 column group is used, the loop keeps running without a chance to break.




[jira] Created: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group

2009-08-12 Thread Viraj Bhat (JIRA)
Type mismatch in key from map: expected 
org.apache.pig.impl.io.NullableBytesWritable, recieved 
org.apache.pig.impl.io.NullableText when doing simple group
--

 Key: PIG-919
 URL: https://issues.apache.org/jira/browse/PIG-919
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Viraj Bhat
 Fix For: 0.3.0


I have a Pig script that takes in a student file and generates a bag of maps. 
 I later want to group on the value of the key name0, which corresponds to the 
first name of the student.
{code}
register mymapudf.jar;

data = LOAD '/user/viraj/studenttab10k' AS 
(somename:chararray,age:long,marks:float);

genmap = foreach data generate flatten(mymapudf.GenHashList(somename,' ')) as 
bp:map[], age, marks;

getfirstnames = foreach genmap generate bp#'name0' as firstname, age, marks;

filternonnullfirstnames = filter getfirstnames by firstname is not null;

groupgenmap = group filternonnullfirstnames by firstname;

dump groupgenmap;
{code}

When I execute this code, I get an error in the Map Phase:
===
java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:242)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
===




[jira] Commented: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group

2009-08-12 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742668#action_12742668
 ] 

Viraj Bhat commented on PIG-919:


This problem can be solved simply by casting the firstname to chararray!! Why??
{code}
groupgenmap = group filternonnullfirstnames by (chararray)firstname;

dump groupgenmap;
{code}

Is there a problem with the UDF??

 Type mismatch in key from map: expected 
 org.apache.pig.impl.io.NullableBytesWritable, recieved 
 org.apache.pig.impl.io.NullableText when doing simple group
 --

 Key: PIG-919
 URL: https://issues.apache.org/jira/browse/PIG-919
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Viraj Bhat
 Fix For: 0.3.0

 Attachments: GenHashList.java, mapscript.pig, mymapudf.jar

