[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725602#action_12725602
 ] 

Hudson commented on PIG-820:


Integrated in Pig-trunk #490 (see 
[http://hudson.zones.apache.org/hudson/job/Pig-trunk/490/]):
Change RandomSampleLoader to take a LoadFunc instead of extending 
BinStorage.  Added a new Samplable interface for loaders to implement, allowing 
them to be used by RandomSampleLoader.


 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch


 Currently a sampling job requires that its input already be stored in 
 BinStorage format, since RandomSampleLoader extends BinStorage.  For 
 order by this has mostly been acceptable, because users tend to use order by 
 at the end of their script, where other MR jobs have already operated on the 
 data and it is therefore already stored in BinStorage format.  For Pig 
 scripts that do only an order by, however, an entire MR job is required just 
 to read the data and write it back out in BinStorage format.
 As we begin work on join algorithms that require sampling, this 
 requirement to read the entire input and write it back out will not be 
 acceptable.
 Join is often the first operation of a script, and is thus much more likely 
 to trigger this useless up-front translation job.
 Instead, RandomSampleLoader can be changed to subsume an existing loader, 
 using the user-specified loader to read the tuples while handling the skipping
 between tuples itself.  This will require the subsumed loader to implement a 
 samplable interface that will look something like:
 {code}
 import java.io.IOException;
 import org.apache.pig.LoadFunc;

 public interface SamplableLoader extends LoadFunc {

     /**
      * Skip ahead in the input stream.
      * @param n number of bytes to skip
      * @return number of bytes actually skipped.  The return semantics are
      * exactly the same as {@link java.io.InputStream#skip(long)}
      */
     public long skip(long n) throws IOException;

     /**
      * Get the current position in the stream.
      * @return position in the stream.
      */
     public long getPosition() throws IOException;
 }
 {code}
 The MRCompiler would then check whether the loader being used to load the data 
 implements the SamplableLoader interface.  If so, rather than creating an 
 initial MR job to do the translation, it would create the sampling job 
 directly, having RandomSampleLoader use the user-specified loader.
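
The planning decision described above can be sketched roughly as follows. This is only an illustration and not the code from the attached patches: Loader is a hypothetical stand-in for Pig's LoadFunc, and ByteLoader, OpaqueLoader, SamplingPlanSketch, planFor and positionAfterSkips are names made up for the example. It assumes the compiler only needs an instanceof test to decide whether the translation job can be dropped.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for Pig's LoadFunc; only what the sketch needs. */
interface Loader {
    void bindTo(InputStream in);
}

/** Mirrors the interface proposed in the issue description. */
interface SamplableLoader extends Loader {
    long skip(long n) throws IOException;
    long getPosition() throws IOException;
}

/** Toy loader over a raw byte stream that tracks how far it has advanced. */
final class ByteLoader implements SamplableLoader {
    private InputStream in;
    private long position;

    @Override
    public void bindTo(InputStream in) {
        this.in = in;
        this.position = 0;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(n); // same return contract as InputStream#skip
        position += skipped;
        return skipped;
    }

    @Override
    public long getPosition() {
        return position;
    }
}

/** A loader that does not support skipping, so it cannot be sampled directly. */
final class OpaqueLoader implements Loader {
    @Override
    public void bindTo(InputStream in) {
        // nothing to do in this sketch
    }
}

final class SamplingPlanSketch {
    /** Jobs planned for an order by: no translation job if the loader is samplable. */
    static List<String> planFor(Loader loader) {
        return (loader instanceof SamplableLoader)
                ? Arrays.asList("sample", "sort")
                : Arrays.asList("translate", "sample", "sort");
    }

    /** Applies the given skips to a ByteLoader and reports its final position. */
    static long positionAfterSkips(byte[] data, long... skips) {
        ByteLoader loader = new ByteLoader();
        loader.bindTo(new ByteArrayInputStream(data));
        try {
            for (long n : skips) {
                loader.skip(n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return loader.getPosition();
    }
}
```

In this sketch a skip past the end of the input simply stops at the end, as InputStream#skip does, so the sampler has to compare the returned count with what it asked for.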

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-868) indexof / lastindexof / lower / replace / substring udf's

2009-06-30 Thread Bennie Schut (JIRA)
indexof / lastindexof / lower / replace / substring udf's
-

 Key: PIG-868
 URL: https://issues.apache.org/jira/browse/PIG-868
 Project: Pig
  Issue Type: New Feature
Reporter: Bennie Schut
Priority: Trivial


We parse some Apache logs using Pig and use some pretty simple UDFs like this:

B = FOREACH A GENERATE substring(uri, lastindexof(uri, '/')+1, indexof(uri, '.txt')) as lang;

It's pretty simple stuff, but I figured someone else might find it useful.
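
For readers without the patch at hand, the string logic behind that expression can be sketched in plain Java. This is not the attached patch: StringUdfSketch and extractLang are hypothetical names, and returning null for input that does not match is an assumption made for the example, not necessarily what the UDFs do.

```java
/** Plain-Java sketch of substring(uri, lastindexof(uri, '/')+1, indexof(uri, '.txt')). */
final class StringUdfSketch {

    /**
     * Returns the text between the last '/' and the first ".txt",
     * or null when the input is null or the two markers do not line up.
     */
    static String extractLang(String uri) {
        if (uri == null) {
            return null;
        }
        int start = uri.lastIndexOf('/') + 1; // 0 when there is no '/'
        int end = uri.indexOf(".txt");        // -1 when ".txt" is absent
        if (end < start) {
            return null; // also covers end == -1, because start is never negative
        }
        return uri.substring(start, end);
    }
}
```

With this sketch, extractLang("/docs/en.txt") gives "en", and a uri without ".txt" gives null instead of throwing.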




[jira] Updated: (PIG-868) indexof / lastindexof / lower / replace / substring udf's

2009-06-30 Thread Bennie Schut (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bennie Schut updated PIG-868:
-

Attachment: addSomeUDFsPatch.patch

Some UDFs, including unit tests.




[jira] Updated: (PIG-868) indexof / lastindexof / lower / replace / substring udf's

2009-06-30 Thread Bennie Schut (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bennie Schut updated PIG-868:
-

Attachment: dateExtractorPatch.patch

The dateExtractor unit test failed on my machine, which is in the -0600 time 
zone, because it returned the wrong date.  This is a bit unrelated to the other 
patch, but I fixed it to get the piggybank unit tests to pass.




[jira] Updated: (PIG-868) indexof / lastindexof / lower / replace / substring udf's

2009-06-30 Thread Bennie Schut (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bennie Schut updated PIG-868:
-

Status: Patch Available  (was: Open)




[jira] Updated: (PIG-749) No attempt to check if 'flatten(group) as' has the same cardinality as 'group alias by'

2009-06-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-749:
---

Priority: Minor  (was: Major)

 No attempt to check if 'flatten(group) as' has the same cardinality as 'group 
 alias by'
 ---

 Key: PIG-749
 URL: https://issues.apache.org/jira/browse/PIG-749
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.3.0
Reporter: Viraj Bhat
Priority: Minor

 A Pig script that groups by 3 columns and flattens the group as 4 columns 
 runs successfully when in principle it should not; it should probably fail 
 with a front-end error.
 {code}
 A = load 'groupcardinalitycheck.txt' using PigStorage() as (col1:chararray, 
 col2:chararray, col3:int, col4:chararray);
 B = group A by (col1, col2, col3);
 C = foreach B generate
     flatten(group) as (col1, col2, col3, col4),
     SIZE(A) as frequency;
 dump C;
 {code}
 ==
 Data
 ==
 hello   CC  1   there
 hello   YSO 2   out
 ouch    CC  2   hey
 ==
 Result of the preceding script
 ==
 (ouch,CC,2,1L)
 (hello,CC,1,1L)
 (hello,YSO,2,1L)
 ==
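
The missing check amounts to comparing two counts before the job is launched. A minimal sketch, assuming the front end has the group keys and the flatten aliases available as lists; FlattenAliasCheckSketch, sameCardinality and check are hypothetical names and not part of Pig:

```java
import java.util.List;

/** Sketch of a front-end check that 'flatten(group) as (...)' matches 'group ... by (...)'. */
final class FlattenAliasCheckSketch {

    /** True when the number of aliases equals the number of group keys. */
    static boolean sameCardinality(List<String> groupKeys, List<String> flattenAliases) {
        return groupKeys.size() == flattenAliases.size();
    }

    /** Fails early, before any job runs, when the two counts differ. */
    static void check(List<String> groupKeys, List<String> flattenAliases) {
        if (!sameCardinality(groupKeys, flattenAliases)) {
            throw new IllegalArgumentException(
                    "flatten(group) declares " + flattenAliases.size()
                    + " aliases but the group key has " + groupKeys.size() + " columns");
        }
    }
}
```

For the script above, the group key has 3 columns and the alias list has 4 entries, so check would reject it instead of silently producing the result shown.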




[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-820:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

v6 of the patch checked in.  Thanks Ashutosh.




[jira] Resolved: (PIG-788) Proposal to remove float from Pig data types

2009-06-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates resolved PIG-788.


Resolution: Won't Fix

Avro has decided to keep float as a type.

 Proposal to remove float from Pig data types
 

 Key: PIG-788
 URL: https://issues.apache.org/jira/browse/PIG-788
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Alan Gates
Assignee: Alan Gates

 Pig would like to use the new Hadoop Avro serialization package to pass data 
 between MR jobs, and eventually between Pig and UDFs that are not written in 
 Java.  Avro will not be supporting the float data type, but only double (see 
 AVRO-17).  Pig currently supports both float and double.  Double is the 
 default floating point type (so if the user says x + 1.0, 1.0 is taken to be 
 a double, not a float).  Float was initially included in the list of Pig 
 types because Hadoop supported it as one of the Writable types, and we were 
 trying to make sure all of Hadoop's writable types could be represented in 
 Pig.  
 In practice we do not see anyone using the float type.   In order to be able 
 to easily use Avro I propose dropping the float type.  
 Please speak up if you are using the float type and you have a compelling 
 reason not to use double.




[jira] Updated: (PIG-861) POJoinPackage lose tuple in large dataset

2009-06-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-861:
---

Affects Version/s: 0.3.0  (was: 0.2.0)
Status: Patch Available  (was: Open)

 POJoinPackage lose tuple in large dataset
 -

 Key: PIG-861
 URL: https://issues.apache.org/jira/browse/PIG-861
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.4.0

 Attachments: PIG-861-1.patch


 Some scripts using POJoinPackage lose records when processing a large amount 
 of input data. We do not see this problem with smaller inputs. We can 
 reproduce the problem; however, the dataset for the test case is too big to 
 be included here. We suspect that POJoinPackage causes the problem.




[jira] Commented: (PIG-861) POJoinPackage lose tuple in large dataset

2009-06-30 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725880#action_12725880
 ] 

Olga Natkovich commented on PIG-861:


+1, changes look good. Great catch! 

Need to make sure all tests pass before committing.
