
[jira] Commented: (PIG-796) support conversion from numeric types to chararray

2009-05-28 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714209#action_12714209
 ] 

Ashutosh Chauhan commented on PIG-796:
--

Since Pig allows the values in a map to be of different types, caching the type 
may not be safe. There are two possible alternatives:

a) Find the type by introspection every time. This ensures we are always 
correct and can handle all cases (including maps whose values are of 
different types), but it incurs a performance overhead on every cast call.
b) Find the type the first time and cache it for subsequent calls. When Pig 
encounters a different type, it will bail out with a ClassCastException. 
This avoids the performance overhead, but Pig will die when the values in a 
map are of different types.

Given this trade-off between performance and handling all cases, which route 
should we go?
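
To make the two alternatives concrete, here is a minimal sketch in plain Java. It is not Pig code: the class and method names (CastStrategies, castByIntrospection, CachingCaster) are hypothetical, and a plain Object stands in for a map value.

```java
class CastStrategies {
    /** Alternative (a): inspect the runtime type on every call. */
    static String castByIntrospection(Object value) {
        if (value == null) return null;
        if (value instanceof Integer) return Integer.toString((Integer) value);
        if (value instanceof Long) return Long.toString((Long) value);
        if (value instanceof Double) return Double.toString((Double) value);
        if (value instanceof String) return (String) value;
        throw new IllegalArgumentException("unsupported type: " + value.getClass());
    }

    /** Alternative (b): remember the first type seen and trust it afterwards. */
    static final class CachingCaster {
        private Class<?> cachedType; // null until the first non-null value

        String cast(Object value) {
            if (value == null) return null;
            if (cachedType == null) cachedType = value.getClass();
            // Class.cast throws ClassCastException once the type changes.
            return String.valueOf(cachedType.cast(value));
        }
    }

    public static void main(String[] args) {
        System.out.println(castByIntrospection(42));
        System.out.println(castByIntrospection(1.5));

        CachingCaster caster = new CachingCaster();
        System.out.println(caster.cast(7));
        try {
            caster.cast(7L); // a Long after an Integer
        } catch (ClassCastException e) {
            System.out.println("type changed: " + e.getMessage());
        }
    }
}
```

With (a) every call pays for the type checks; with (b) the second call in main fails because the cached type is Integer and the value is a Long.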

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-796) support conversion from numeric types to chararray

2009-05-29 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-796:
-

Attachment: pig-796.patch

This patch implements the fix as suggested by Alan.

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: pig-796.patch





[jira] Updated: (PIG-796) support conversion from numeric types to chararray

2009-06-01 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-796:
-

Attachment: 796.patch

updated patch.

This patch fixes the following issue: sometimes (e.g. for values coming out of 
a map lookup) Pig assumes the type of an element is ByteArray when it is 
actually some other type. In such cases a request for a cast fails. 

When Pig thinks an element is a ByteArray, this patch first finds the actual 
type of the element and then does the cast. It also caches the type. When the 
type changes, a ClassCastException is raised and caught, the cached type is 
updated, and the cast is tried again. This ensures that the type is not 
determined on every cast call while still handling casts whose types change 
from one call to the next. 
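
The cache-and-retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in the patch: RetryingCaster, toChararray and the detections counter are hypothetical names, and value.getClass() stands in for whatever type lookup Pig actually performs.

```java
class RetryingCaster {
    private Class<?> cachedType; // last type seen, null before the first call
    int detections = 0;          // how often the type had to be determined

    String toChararray(Object value) {
        if (value == null) return null;
        if (cachedType == null) detect(value);
        try {
            // Fast path: trust the cached type.
            return String.valueOf(cachedType.cast(value));
        } catch (ClassCastException e) {
            // The type changed: refresh the cache and retry once.
            detect(value);
            return String.valueOf(cachedType.cast(value));
        }
    }

    private void detect(Object value) {
        cachedType = value.getClass();
        detections++;
    }

    public static void main(String[] args) {
        RetryingCaster caster = new RetryingCaster();
        Object[] values = {1, 2, 3L, 4L, "x"};
        for (Object v : values) {
            System.out.println(caster.toChararray(v));
        }
        System.out.println("type determined " + caster.detections + " times");
    }
}
```

The type is determined only when it is unknown or has changed, so a run of same-typed values costs one detection, while mixed-type maps still work.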

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: 796.patch, pig-796.patch





[jira] Updated: (PIG-796) support conversion from numeric types to chararray

2009-06-03 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-796:
-

Attachment: pig-796.patch

Updated patch incorporating suggested changes.

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: 796.patch, pig-796.patch, pig-796.patch





[jira] Updated: (PIG-796) support conversion from numeric types to chararray

2009-06-03 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-796:
-

Attachment: (was: pig-796.patch)

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: 796.patch, pig-796.patch





[jira] Updated: (PIG-796) support conversion from numeric types to chararray

2009-06-03 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-796:
-

Attachment: pig-796.patch

Updated patch incorporating suggested changes.

 support  conversion from numeric types to chararray
 ---

 Key: PIG-796
 URL: https://issues.apache.org/jira/browse/PIG-796
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: 796.patch, pig-796.patch, pig-796.patch





[jira] Updated: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-08 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-773:
-

Status: Patch Available  (was: Open)

Submitting patch

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Priority: Minor
 Attachments: pig-773.patch


 We should be able to create empty bag constant using {}, empty tuple constant 
 using (), empty map constant using [] within a pig script


[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-08 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717505#action_12717505
 ] 

Ashutosh Chauhan commented on PIG-773:
--

FindBugs complains about a function name that starts with an uppercase letter. 
All the method names in QueryParser.jjt start with an uppercase letter, so, 
following that convention, I am leaving the function name as it is. 

FindBugs warning: 
The method name 
org.apache.pig.impl.logicalLayer.parser.QueryParser.UnionClause(LogicalPlan) 
doesn't start with a lower case letter
Bug type NM_METHOD_NAMING_CONVENTION
In class org.apache.pig.impl.logicalLayer.parser.QueryParser
In method 
org.apache.pig.impl.logicalLayer.parser.QueryParser.UnionClause(LogicalPlan)
At QueryParser.java:[lines 2662-2713]  

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Priority: Minor
 Attachments: pig-773.patch



[jira] Created: (PIG-858) Order By followed by replicated join fails while compiling MR-plan from physical plan

2009-06-19 Thread Ashutosh Chauhan (JIRA)

Order By followed by replicated join fails while compiling MR-plan from 
physical plan
---

 Key: PIG-858
 URL: https://issues.apache.org/jira/browse/PIG-858
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Ashutosh Chauhan
 Fix For: 0.4.0


Consider the query:
{code}
A = load 'a';
B = order A by $0;
C = join A by $0, B by $0;
explain C;
{code}
This works. But if a replicated join is used instead:
{code}
A = load 'a';
B = order A by $0;
C = join A by $0, B by $0 using replicated;
explain C;
{code}
it fails with ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2034: Error 
compiling operator POFRJoin
Relevant stack trace:
{code}
Caused by: java.lang.RuntimeException: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompilerException: ERROR 2034: Error compiling operator POFRJoin
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.explain(HExecutionEngine.java:306)
    at org.apache.pig.PigServer.explain(PigServer.java:574)
    ... 8 more
Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompilerException: ERROR 2034: Error compiling operator POFRJoin
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:942)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.visit(POFRJoin.java:173)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:342)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:327)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:233)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:301)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.explain(MapReduceLauncher.java:278)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.explain(HExecutionEngine.java:303)
    ... 9 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:901)
    ... 16 more
{code}


[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-20 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Attachment: pig-820.patch

In addition to the explanation above, this patch introduces SampleOptimizer, 
which visits the compiled MR plan to detect this pattern: an MR operator 
containing only a load-store, followed by an MR operator whose map plan 
contains the sampling job. If the pattern is present, SampleOptimizer deletes 
the unnecessary predecessor MR operator and replaces the POLoad of the sampling 
job with a RandomSampleLoader that uses the loader of its predecessor. 
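
As a rough illustration of the pattern match described above, the following sketch works on a simplified, linear list of jobs. It is not the SampleOptimizer from the patch: Job, its fields and the string-valued loader are hypothetical stand-ins for Pig's MR operators and POLoad.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SampleOptimizerSketch {
    /** Hypothetical stand-in for one compiled MR operator. */
    static final class Job {
        final String name;
        final boolean loadStoreOnly; // the job only loads and stores
        final boolean sampling;      // the map plan runs the sampling job
        String loader;               // loader used by the job's load

        Job(String name, boolean loadStoreOnly, boolean sampling, String loader) {
            this.name = name;
            this.loadStoreOnly = loadStoreOnly;
            this.sampling = sampling;
            this.loader = loader;
        }
    }

    /** Drops a load-store-only job that directly precedes a sampling job. */
    static List<Job> optimize(List<Job> plan) {
        List<Job> out = new ArrayList<Job>();
        for (int i = 0; i < plan.size(); i++) {
            Job cur = plan.get(i);
            boolean samplerFollows = i + 1 < plan.size() && plan.get(i + 1).sampling;
            if (cur.loadStoreOnly && samplerFollows) {
                // The sampler now reads the original input through the
                // predecessor's loader instead of the translated copy.
                plan.get(i + 1).loader = "RandomSampleLoader(" + cur.loader + ")";
                continue; // the predecessor is no longer needed
            }
            out.add(cur);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Job> plan = Arrays.asList(
                new Job("translate", true, false, "PigStorage"),
                new Job("sample", false, true, "BinStorage"),
                new Job("sort", false, false, "BinStorage"));
        for (Job j : optimize(plan)) {
            System.out.println(j.name + " uses " + j.loader);
        }
    }
}
```

A real plan is a graph rather than a list, so the actual optimizer has to walk predecessors and successors; the list keeps the sketch short.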

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Alan Gates
Assignee: Alan Gates
 Attachments: pig-820.patch


 Currently a sampling job requires that data already be stored in 
 BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For 
 order by this has mostly been acceptable, because users tend to use order by 
 at the end of their script, where other MR jobs have already operated on the 
 data and thus it is already stored in BinaryStorage.  For pig scripts that 
 just do an order by, an entire MR job is required to read the data and write 
 it out in BinaryStorage format.
 As we begin work on join algorithms that will require sampling, this 
 requirement to read the entire input and write it back out will not be 
 acceptable.
 Join is often the first operation of a script, and thus is much more likely 
 to trigger this useless up-front translation job.
 Instead RandomSampleLoader can be changed to subsume an existing loader, 
 using the user-specified loader to read the tuples while handling the skipping
 between tuples itself.  This will require the subsumed loader to implement a 
 SamplableLoader interface, which will look something like:
 {code}
 public interface SamplableLoader extends LoadFunc {
 
     /**
      * Skip ahead in the input stream.
      * @param n number of bytes to skip
      * @return number of bytes actually skipped.  The return semantics are
      * exactly the same as {@link java.io.InputStream#skip(long)}
      */
     public long skip(long n) throws IOException;
 
     /**
      * Get the current position in the stream.
      * @return position in the stream.
      */
     public long getPosition() throws IOException;
 }
 {code}
 The MRCompiler would then check whether the loader being used to load data 
 implements the SamplableLoader interface.  If so, rather than create an 
 initial MR job to do the translation, it would create the sampling job, having 
 RandomSampleLoader use the user-specified loader.
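
A small self-contained sketch may help show how a sampler could drive a loader through skip and getPosition. It does not use Pig's LoadFunc: the Samplable interface, LineLoader and the fixed skip distance are all assumptions made for illustration, with newline-delimited text standing in for tuples.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SamplingSketch {
    /** Hypothetical, simplified analogue of the proposed interface. */
    interface Samplable {
        String getNext() throws IOException;  // next record, or null at end
        long skip(long n) throws IOException; // same contract as InputStream.skip
        long getPosition() throws IOException;
    }

    /** Line-oriented loader over an in-memory buffer. */
    static final class LineLoader implements Samplable {
        private final ByteArrayInputStream in;
        private long pos = 0;

        LineLoader(String data) {
            in = new ByteArrayInputStream(data.getBytes(StandardCharsets.US_ASCII));
        }

        public String getNext() {
            StringBuilder sb = new StringBuilder();
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') return sb.toString();
                sb.append((char) b);
            }
            return sb.length() > 0 ? sb.toString() : null;
        }

        public long skip(long n) {
            long skipped = in.skip(n);
            pos += skipped;
            return skipped;
        }

        public long getPosition() {
            return pos;
        }
    }

    /** Reads a record, skips ahead, resynchronizes on a record boundary, repeats. */
    static List<String> sample(Samplable loader, long skipBytes) throws IOException {
        List<String> out = new ArrayList<String>();
        String record;
        while ((record = loader.getNext()) != null) {
            out.add(record);
            if (loader.skip(skipBytes) < skipBytes) break; // ran out of input
            loader.getNext(); // discard the (possibly partial) record we landed in
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        LineLoader loader = new LineLoader("aa\nbb\ncc\ndd\nee\n");
        System.out.println(sample(loader, 4));
        System.out.println("stopped at byte " + loader.getPosition());
    }
}
```

The sampler only needs the three methods of the interface, which is why any loader that can skip and report its position could be wrapped this way.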


[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-20 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Affects Version/s: 0.4.0
   Status: Patch Available  (was: Open)

Submitting for both 0.3 and 0.4 branches.

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Alan Gates
 Attachments: pig-820.patch



[jira] Updated: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-21 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-773:
-

Attachment: pig-773_v2.patch

This patch doesn't add an extra EmptyConstant production; instead it matches 
empty content of a bag / tuple / map in their respective productions. As a 
result it avoids the unintuitive logic that Santhosh pointed out above.

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Priority: Minor
 Attachments: pig-773.patch, pig-773_v2.patch



[jira] Updated: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-21 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-773:
-

Affects Version/s: (was: 0.2.0)
   0.3.0

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Priority: Minor
 Attachments: pig-773.patch, pig-773_v2.patch



[jira] Updated: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-21 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-773:
-

Patch Info:   (was: [Patch Available])

trying to make hudson pick the patch

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Priority: Minor
 Attachments: pig-773.patch, pig-773_v2.patch



[jira] Created: (PIG-859) Optimizer throw error on self-joins

2009-06-22 Thread Ashutosh Chauhan (JIRA)

Optimizer throw error on self-joins
---

 Key: PIG-859
 URL: https://issues.apache.org/jira/browse/PIG-859
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Ashutosh Chauhan
 Fix For: 0.4.0


A self-join results in an exception thrown by the optimizer. Consider the 
following query:
{code}
grunt> A = load 'a';
grunt> B = join A by $0, A by $0;
grunt> explain B;

2009-06-20 15:51:38,303 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1094: Attempt to insert between two nodes that were not connected.
Details at logfile: pig_1245538027026.log
{code}

Relevant stack-trace from log-file:
{code}

Caused by: org.apache.pig.impl.plan.optimizer.OptimizerException: ERROR 2047: Internal error. Unable to introduce split operators.
    at org.apache.pig.impl.logicalLayer.optimizer.ImplicitSplitInserter.transform(ImplicitSplitInserter.java:163)
    at org.apache.pig.impl.logicalLayer.optimizer.LogicalOptimizer.optimize(LogicalOptimizer.java:163)
    at org.apache.pig.PigServer.compileLp(PigServer.java:844)
    at org.apache.pig.PigServer.compileLp(PigServer.java:781)
    at org.apache.pig.PigServer.getStorePlan(PigServer.java:723)
    at org.apache.pig.PigServer.explain(PigServer.java:566)
    ... 8 more
Caused by: org.apache.pig.impl.plan.PlanException: ERROR 1094: Attempt to insert between two nodes that were not connected.
    at org.apache.pig.impl.plan.OperatorPlan.doInsertBetween(OperatorPlan.java:500)
    at org.apache.pig.impl.plan.OperatorPlan.insertBetween(OperatorPlan.java:480)
    at org.apache.pig.impl.logicalLayer.optimizer.ImplicitSplitInserter.transform(ImplicitSplitInserter.java:139)
    ... 13 more
{code}


A possible workaround is:
{code}

grunt> A = load 'a';
grunt> B = load 'a';
grunt> C = join A by $0, B by $0;
grunt> explain C;
{code}



[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-22 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722706#action_12722706
 ] 

Ashutosh Chauhan commented on PIG-820:
--

In the patch RandomSampleLoader is marked as Serializable and its loader field 
is marked as transient. Since loader is initialized in the constructor and 
used later on, FindBugs complains: "This class contains a field that is 
updated at multiple places in the class, thus it seems to be part of the state 
of the class. However, since the field is marked as transient and not set in 
readObject or readResolve, it will contain the default value in any 
deserialized instance of the class." However, there is no need for 
RandomSampleLoader to implement Serializable anyway (and thus for loader to be 
marked as transient), because loader is reconstructed from the FuncSpec later 
on. For the same reason, neither PigStorage nor BinStorage implements 
Serializable. I will be submitting a new patch with the required changes.
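
The situation FindBugs warns about can be reproduced with a few lines of plain Java. The sketch below is not RandomSampleLoader: Sampler, its funcSpec string and the StringBuilder used as a stand-in loader are hypothetical, chosen only to show what happens to a transient field after a serialization round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class TransientFieldDemo {
    static final class Sampler implements Serializable {
        private static final long serialVersionUID = 1L;

        final String funcSpec;          // serialized with the object
        transient StringBuilder loader; // stand-in for the wrapped loader, skipped on write

        Sampler(String funcSpec) {
            this.funcSpec = funcSpec;
            this.loader = new StringBuilder(funcSpec); // set only in the constructor
        }
    }

    /** Serializes the object to a byte array and reads it back. */
    static Sampler roundTrip(Sampler s) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(s);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Sampler) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Sampler copy = roundTrip(new Sampler("PigStorage"));
        // The constructor is not run on deserialization, so loader is null here.
        System.out.println("funcSpec=" + copy.funcSpec + ", loader=" + copy.loader);
    }
}
```

Either restoring the field in readObject or dropping Serializable altogether removes the warning; the comment above chooses the latter.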


 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Alan Gates
 Attachments: pig-820.patch



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-22 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Attachment: pig-820_v2.patch

Patch which fixes findbugs warning.

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-22 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Status: Patch Available  (was: Open)

Submitting to hudson

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-22 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Fix Version/s: 0.4.0
 Assignee: Ashutosh Chauhan  (was: Alan Gates)
   Status: Open  (was: Patch Available)

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch


 Currently a sampling job requires that data already be stored in
 BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For
 order by this has mostly been acceptable, because users tend to use order by
 at the end of their script, where other MR jobs have already operated on the
 data and thus it is already stored in BinaryStorage.  For Pig scripts that
 just do an order by, an entire MR job is required to read the data and write
 it out in BinaryStorage format.
 As we begin work on join algorithms that will require sampling, this
 requirement to read the entire input and write it back out will not be
 acceptable.  Join is often the first operation of a script, and thus is much
 more likely to trigger this useless up-front translation job.
 Instead, RandomSampleLoader can be changed to subsume an existing loader,
 using the user-specified loader to read the tuples while handling the skipping
 between tuples itself.  This will require the subsumed loader to implement a
 samplable interface that will look something like:
 {code}
 public interface SamplableLoader extends LoadFunc {

     /**
      * Skip ahead in the input stream.
      * @param n number of bytes to skip
      * @return number of bytes actually skipped.  The return semantics are
      * exactly the same as {@link java.io.InputStream#skip(long)}
      */
     public long skip(long n) throws IOException;

     /**
      * Get the current position in the stream.
      * @return position in the stream.
      */
     public long getPosition() throws IOException;
 }
 {code}
 The MRCompiler would then check whether the loader being used to load data
 implements the SamplableLoader interface.  If so, rather than creating an
 initial MR job to do the translation, it would create the sampling job,
 having RandomSampleLoader use the user-specified loader.
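
A minimal sketch of the idea in the description above, not the code from any attached patch. It assumes a simplified stand-in interface (SketchSamplableLoader, with a hypothetical getNext() returning one record as a String) instead of Pig's real LoadFunc, and a toy newline-delimited loader (LineLoader). The wrapper reads a record through the wrapped loader, skips ahead a fixed number of bytes, discards whatever partial record it lands in, and repeats. canSampleDirectly mirrors the instanceof check the description proposes for MRCompiler.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the proposed interface; Pig's LoadFunc is left out.
interface SketchSamplableLoader {
    long skip(long n) throws IOException;
    long getPosition() throws IOException;
    /** Assumed helper: returns the next record, or null at end of input. */
    String getNext() throws IOException;
}

// Toy loader over newline-delimited records, for illustration only.
class LineLoader implements SketchSamplableLoader {
    private final InputStream in;
    private long pos = 0;

    LineLoader(byte[] data) {
        this.in = new ByteArrayInputStream(data);
    }

    public long skip(long n) throws IOException {
        long skipped = in.skip(n); // same return semantics as InputStream#skip
        pos += skipped;
        return skipped;
    }

    public long getPosition() {
        return pos;
    }

    public String getNext() throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean sawAny = false;
        int c;
        while ((c = in.read()) != -1) {
            pos++;
            sawAny = true;
            if (c == '\n') {
                break;
            }
            sb.append((char) c);
        }
        return sawAny ? sb.toString() : null;
    }
}

class SamplingWrapper {
    /** The check the compiler would make before choosing the direct sampling path. */
    static boolean canSampleDirectly(Object loader) {
        return loader instanceof SketchSamplableLoader;
    }

    /** Reads one record, skips skipBytes, drops the partial record, repeats. */
    static List<String> sample(SketchSamplableLoader loader, long skipBytes)
            throws IOException {
        List<String> out = new ArrayList<>();
        String rec;
        while ((rec = loader.getNext()) != null) {
            out.add(rec);
            if (skipBytes > 0 && loader.skip(skipBytes) > 0) {
                loader.getNext(); // we may have landed mid-record; discard it
            }
        }
        return out;
    }
}
```

With the input "a\nbb\nccc\ndddd\neeeee\n" and a skip of 3 bytes, this sketch samples the records "a" and "dddd". Discarding the record after each skip can also drop a complete record when the skip lands exactly on a boundary, which is harmless for sampling.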


[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-22 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Attachment: pig-820_v3.patch

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch



[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-23 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723194#action_12723194
 ] 

Ashutosh Chauhan commented on PIG-773:
--

Santhosh, thanks for the review.

1. Will fix it in the next patch.
2. The test passes when it should fail. There seems to be an issue with how Bag 
handles its schema. Will investigate further.
3. Will include test cases that check for the existence of constants in the plan.


 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
Priority: Minor
 Attachments: pig-773.patch, pig-773_v2.patch


 We should be able to create empty bag constant using {}, empty tuple constant 
 using (), empty map constant using [] within a pig script


[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-23 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723316#action_12723316
 ] 

Ashutosh Chauhan commented on PIG-820:
--

Thanks Alan and Pradeep for the review.

Will incorporate the SampleOptimizer changes.
The RandomSampleLoader constructor can only take String arguments, since it is 
instantiated from a FuncSpec on the backend, so the types of its constructor 
arguments can't be changed. However, instead of a String holding the loader's 
class name, the String form of the loader's FuncSpec can be passed, so that the 
loader gets instantiated with the correct constructor.

Will be uploading a new patch soon.
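
To illustrate the point about passing a spec string rather than a bare class name, here is a rough sketch. It is not Pig's FuncSpec. SpecSketch and the spec format it parses (a class name followed by optional single-quoted String arguments in parentheses) are assumptions made for this example, and the naive comma split does not handle arguments that themselves contain commas.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns "pkg.Cls('a','b')" into new pkg.Cls("a", "b").
class SpecSketch {

    /** Class name is everything before the first '(' (or the whole spec). */
    static String className(String spec) {
        int open = spec.indexOf('(');
        return open < 0 ? spec.trim() : spec.substring(0, open).trim();
    }

    /** Extracts the String constructor arguments, stripping single quotes. */
    static String[] ctorArgs(String spec) {
        int open = spec.indexOf('(');
        int close = spec.lastIndexOf(')');
        if (open < 0 || close <= open) {
            return new String[0];
        }
        String inner = spec.substring(open + 1, close).trim();
        if (inner.isEmpty()) {
            return new String[0];
        }
        List<String> args = new ArrayList<>();
        for (String part : inner.split(",")) { // naive: no commas inside args
            String a = part.trim();
            if (a.length() >= 2 && a.startsWith("'") && a.endsWith("'")) {
                a = a.substring(1, a.length() - 1);
            }
            args.add(a);
        }
        return args.toArray(new String[0]);
    }

    /** Picks the all-String constructor whose arity matches the spec. */
    static Object instantiate(String spec) throws ReflectiveOperationException {
        String[] args = ctorArgs(spec);
        Class<?>[] types = new Class<?>[args.length];
        Arrays.fill(types, String.class);
        Constructor<?> ctor = Class.forName(className(spec)).getConstructor(types);
        return ctor.newInstance((Object[]) args);
    }
}
```

A wrapper whose own constructor takes only Strings could receive such a spec as a single argument and call instantiate on it, so the wrapped loader is still built with its intended constructor arguments.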

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-23 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Attachment: pig-820_v4.patch

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-23 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Status: Patch Available  (was: Open)

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-23 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Status: Open  (was: Patch Available)

Due to the change in the LoadFunc interface as part of the PIG-734 commit, my patch 
won't apply cleanly on trunk anymore. Will merge with trunk and regenerate the 
patch.

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-24 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Status: Patch Available  (was: Open)

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch, pig-820_v5.patch



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-24 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Attachment: pig-820_v5.patch

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch, pig-820_v5.patch



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-26 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Attachment: pig-820_v6.patch

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-26 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Status: Patch Available  (was: Open)

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch



[jira] Created: (PIG-865) Performance: Unnnecessary computation in FRJoin

2009-06-26 Thread Ashutosh Chauhan (JIRA)

Performance: Unnnecessary computation in FRJoin
---

 Key: PIG-865
 URL: https://issues.apache.org/jira/browse/PIG-865
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.4.0


In the POFRJoin implementation, POLocalRearrange is used to extract join keys from 
the input tuples. If the keys match, the input tuples are fed to a Foreach, which 
performs the actual join by doing a cross on its inputs. After the keys are 
extracted from the POLocalRearrange output, the function 
getValueTuple(POLocalRearrange lr, Tuple tuple) is called to reconstruct the input 
tuple. This call seems unnecessary, since the input tuple is already available 
at that point. 

This is not a bug, but since the function is called for every tuple, 
eliminating it should improve performance. 
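
A simplified sketch of the point above, not the actual POFRJoin code. Tuples are modelled as plain List<Object> values, and FrJoinSketch, tuple and extractKey are hypothetical names. The key is extracted only to probe the hash table built from the small input; the original input tuple is then passed straight to the cross step, so there is nothing to reconstruct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FrJoinSketch {

    /** Convenience constructor for a toy tuple. */
    static List<Object> tuple(Object... fields) {
        return Arrays.asList(fields);
    }

    /** Stands in for the key extraction done by the local rearrange step. */
    static Object extractKey(List<Object> tuple, int keyCol) {
        return tuple.get(keyCol);
    }

    static List<List<Object>> join(List<List<Object>> big, int bigKeyCol,
                                   List<List<Object>> small, int smallKeyCol) {
        // Build side: hash the small (replicated) input by key,
        // storing the original tuples themselves.
        Map<Object, List<List<Object>>> table = new HashMap<>();
        for (List<Object> t : small) {
            table.computeIfAbsent(extractKey(t, smallKeyCol), k -> new ArrayList<>())
                 .add(t);
        }

        // Probe side: the key is used only for the lookup.
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> input : big) {
            List<List<Object>> matches = table.get(extractKey(input, bigKeyCol));
            if (matches == null) {
                continue;
            }
            for (List<Object> match : matches) {
                // Cross step: reuse the input tuple we already hold instead of
                // rebuilding it from the extracted key and remaining values.
                List<Object> joined = new ArrayList<>(input);
                joined.addAll(match);
                out.add(joined);
            }
        }
        return out;
    }
}
```

The saving in this toy version is one tuple rebuild per probed record, which is the per-tuple cost the issue proposes to remove.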


[jira] Updated: (PIG-865) Performance: Unnnecessary computation in FRJoin

2009-06-27 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-865:
-

Attachment: pig-865.patch

Patch that fixes the issue described above. A useful side effect is that it 
removes code duplication, since the function 
getValueTuple(POLocalRearrange lr, Tuple tuple) is also present in 
POPackage.java.

 Performance: Unnnecessary computation in FRJoin
 ---

 Key: PIG-865
 URL: https://issues.apache.org/jira/browse/PIG-865
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.4.0

 Attachments: pig-865.patch



[jira] Updated: (PIG-865) Performance: Unnnecessary computation in FRJoin

2009-06-27 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-865:
-

Status: Patch Available  (was: Open)

 Performance: Unnnecessary computation in FRJoin
 ---

 Key: PIG-865
 URL: https://issues.apache.org/jira/browse/PIG-865
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.4.0

 Attachments: pig-865.patch



[jira] Commented: (PIG-865) Performance: Unnnecessary computation in FRJoin

2009-06-27 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724921#action_12724921
 ] 

Ashutosh Chauhan commented on PIG-865:
--

The patch contains no new unit tests, as it neither introduces new functionality 
nor modifies existing behavior.

 Performance: Unnnecessary computation in FRJoin
 ---

 Key: PIG-865
 URL: https://issues.apache.org/jira/browse/PIG-865
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.4.0

 Attachments: pig-865.patch



[jira] Updated: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-28 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-773:
-

Status: Open  (was: Patch Available)

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
Priority: Minor
 Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch


 We should be able to create an empty bag constant using {}, an empty tuple 
 constant using (), and an empty map constant using [] within a Pig script.


[jira] Updated: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-28 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-773:
-

Attachment: pig-773_v3.patch

It turned out that there was a bug in DataType.java in the code that computes 
the schema for a Bag. The patch fixes the bug. Test cases are modified to match 
the expected behavior. In addition, the values generated by the parser are now 
checked against the expected values for the parsed constants.

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
Priority: Minor
 Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch



[jira] Updated: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-28 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-773:
-

Fix Version/s: 0.4.0
   Status: Patch Available  (was: Open)

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.4.0

 Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch



[jira] Commented: (PIG-865) Performance: Unnecessary computation in FRJoin

2009-06-29 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725368#action_12725368
 ] 

Ashutosh Chauhan commented on PIG-865:
--

Thanks for the review, Pradeep. 
While looking into the code, I also found that the bags used to hold replicate 
contents are recreated every time. Instead, the same bag object can be cleared 
and reused, which minimizes object overhead. In the extreme case where the join 
key value is different for every tuple of the replicate input but each matches 
tuples of the fragment input, we end up creating as many bags as there are 
tuples where one bag would do. I will include this change and upload a new patch.
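
The bag-reuse idea can be sketched roughly as follows. This is only an 
illustration, not the POFRJoin code: the class ReplicatedBagReuse, the 
Map-based replicated table and the ArrayList standing in for a Pig bag are all 
assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the probe loop of a replicated join; not Pig's POFRJoin. */
final class ReplicatedBagReuse {

    /**
     * Probes the replicated table once per fragment key and returns the number of joined rows.
     * A single scratch list (standing in for a bag) is cleared and refilled for each key
     * instead of being allocated anew.
     */
    static int joinCount(List<String> fragmentKeys, Map<String, List<String>> replicated) {
        List<String> scratchBag = new ArrayList<>();   // allocated once for the whole loop
        int joined = 0;
        for (String key : fragmentKeys) {
            List<String> matches = replicated.get(key);
            if (matches == null) {
                continue;                              // no replicated tuples for this key
            }
            scratchBag.clear();                        // reuse instead of creating a new bag
            scratchBag.addAll(matches);
            joined += scratchBag.size();               // one fragment tuple crossed with the bag
        }
        return joined;
    }
}
```

Reuse like this is only safe if nothing downstream keeps a reference to the 
bag after the next key is processed; that condition is assumed here.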

   

 Performance: Unnecessary computation in FRJoin
 ---

 Key: PIG-865
 URL: https://issues.apache.org/jira/browse/PIG-865
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.4.0

 Attachments: pig-865.patch



[jira] Updated: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-07-02 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-773:
-

Status: Patch Available  (was: Open)

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.4.0

 Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch, 
 pig-773_v4.patch



[jira] Reopened: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-07-02 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reopened PIG-820:
--


The Samplable interface introduced as part of this patch enforces the contract 
of implementing getPosition() and next() on the loaders that implement it. An 
additional requirement for a loader to be usable as a sampler is that it must 
correctly handle getNext() without knowing the position in the file. The 
current patch doesn't include this contract in the interface; it should be 
part of the interface.
Reopening the jira because of this issue.
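
To illustrate what handling getNext() from an unknown position could mean for 
a simple text format, here is a minimal sketch. It assumes newline-delimited 
records; ResyncingLineReader and its methods are hypothetical names and are not 
part of the patch. After landing at an arbitrary offset, the reader discards 
the rest of the current record before returning the next complete one.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical newline-delimited reader; shows resynchronising after an arbitrary jump. */
final class ResyncingLineReader {
    private final InputStream in;

    ResyncingLineReader(InputStream in) {
        this.in = in;
    }

    /** Discards bytes up to and including the next '\n' so the next read starts on a record boundary. */
    void resync() throws IOException {
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            // drop the remainder of the record we landed in
        }
    }

    /** Returns the next complete record, or null at end of input. */
    String getNext() throws IOException {
        int b = in.read();
        if (b == -1) {
            return null;
        }
        ByteArrayOutputStream record = new ByteArrayOutputStream();
        while (b != -1 && b != '\n') {
            record.write(b);
            b = in.read();
        }
        return new String(record.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Example driver: start reading data at byte offset, resync, and return the next record. */
    static String firstRecordAfter(String data, int offset) {
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        int start = Math.max(0, Math.min(offset, bytes.length));
        try {
            ResyncingLineReader reader =
                new ResyncingLineReader(new ByteArrayInputStream(bytes, start, bytes.length - start));
            reader.resync();
            return reader.getNext();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A side effect of always resyncing is that a jump landing exactly on a record 
boundary still drops that record; for sampling that is assumed to be 
acceptable.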

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch


 Currently a sampling job requires that data already be stored in 
 BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For 
 order by this has mostly been acceptable, because users tend to use order by 
 at the end of their script, where other MR jobs have already operated on the 
 data and thus it is already being stored in BinaryStorage.  For pig scripts 
 that just did an order by, an entire MR job is required to read the data and 
 write it out in BinaryStorage format.
 As we begin work on join algorithms that will require sampling, this 
 requirement to read the entire input and write it back out will not be 
 acceptable.
 Join is often the first operation of a script, and thus is much more likely 
 to trigger this useless up-front translation job.
 Instead RandomSampleLoader can be changed to subsume an existing loader, 
 using the user-specified loader to read the tuples while handling the 
 skipping between tuples itself.  This will require the subsumed loader to 
 implement a Samplable interface, which will look something like:
 {code}
 public interface SamplableLoader extends LoadFunc {

     /**
      * Skip ahead in the input stream.
      * @param n number of bytes to skip
      * @return number of bytes actually skipped.  The return semantics are
      * exactly the same as {@link java.io.InputStream#skip(long)}
      */
     public long skip(long n) throws IOException;

     /**
      * Get the current position in the stream.
      * @return position in the stream.
      */
     public long getPosition() throws IOException;
 }
 {code}
 The MRCompiler would then check whether the loader being used to load data 
 implements the SamplableLoader interface.  If so, rather than create an 
 initial MR job to do the translation, it would create the sampling job, 
 having RandomSampleLoader use the user-specified loader.
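
 Because skip() is described as having the same return semantics as 
 java.io.InputStream#skip(long), a caller cannot assume that one call skips 
 the full amount requested. A minimal sketch of a caller that loops until the 
 requested distance is covered is shown below. SkippableSource, ChunkedSource 
 and SkipSketch are hypothetical names for this example only; SkippableSource 
 mirrors just the two methods above and does not extend LoadFunc.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical mirror of the two proposed methods; the interface in the issue also extends LoadFunc. */
interface SkippableSource {
    long skip(long n) throws IOException;
    long getPosition() throws IOException;
}

/** In-memory source that skips at most maxPerCall bytes per call, to exercise short skips. */
final class ChunkedSource implements SkippableSource {
    private final long length;
    private final long maxPerCall;
    private long position;

    ChunkedSource(long length, long maxPerCall) {
        this.length = length;
        this.maxPerCall = maxPerCall;
    }

    @Override
    public long skip(long n) {
        long step = Math.max(0, Math.min(n, Math.min(maxPerCall, length - position)));
        position += step;
        return step;
    }

    @Override
    public long getPosition() {
        return position;
    }
}

final class SkipSketch {
    /** Advances by up to n bytes, looping because a single skip() may skip fewer bytes than requested. */
    static long skipUpTo(SkippableSource source, long n) throws IOException {
        long total = 0;
        while (total < n) {
            long skipped = source.skip(n - total);
            if (skipped <= 0) {
                break; // end of input, or no progress possible
            }
            total += skipped;
        }
        return total;
    }

    /** Example driver: returns {bytes skipped, resulting position}. */
    static long[] demo(long length, long maxPerCall, long request) {
        try {
            SkippableSource source = new ChunkedSource(length, maxPerCall);
            long skipped = skipUpTo(source, request);
            return new long[] {skipped, source.getPosition()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```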


[jira] Updated: (PIG-865) Performance: Unnecessary computation in FRJoin

2009-07-03 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-865:
-

Status: Patch Available  (was: Open)

 Performance: Unnecessary computation in FRJoin
 ---

 Key: PIG-865
 URL: https://issues.apache.org/jira/browse/PIG-865
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.4.0

 Attachments: pig-865.patch, pig-865_v2.patch



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-07-06 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Attachment: pig-820_v7.patch

Submitting the patch for review. Currently running tests. Will update the jira 
with the result.

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch, pig-820_v7.patch



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-07-06 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Attachment: pig-820_v8.patch

Thanks Pradeep for the review. skip(1) is not required, because reading a byte 
(by calling in.read()) already advances the pointer by 1. I updated the comment 
in the interface to note that a loader implementing the interface should not 
assume that the current read position is at the beginning of a tuple. 

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch, pig-820_v7.patch, 
 pig-820_v8.patch



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-07-06 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Status: Patch Available  (was: Reopened)

 PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
 another loader
 -

 Key: PIG-820
 URL: https://issues.apache.org/jira/browse/PIG-820
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0, 0.4.0
Reporter: Alan Gates
Assignee: Ashutosh Chauhan
 Fix For: 0.4.0

 Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch, 
 pig-820_v4.patch, pig-820_v5.patch, pig-820_v6.patch, pig-820_v7.patch, 
 pig-820_v8.patch



[jira] Updated: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-07-18 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-773:
-

Attachment: pig-773_v5.patch

Updated patch.

 Empty complex constants (empty bag, empty tuple and empty map) should be 
 supported
 --

 Key: PIG-773
 URL: https://issues.apache.org/jira/browse/PIG-773
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.4.0

 Attachments: pig-773.patch, pig-773_v2.patch, pig-773_v3.patch, 
 pig-773_v4.patch, pig-773_v5.patch



[jira] Updated: (PIG-513) PERFORMANCE: optimize some of the code in DefaultTuple

2009-07-29 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-513:
-

Status: Patch Available  (was: Reopened)

 PERFORMANCE: optimize some of the code in DefaultTuple
 --

 Key: PIG-513
 URL: https://issues.apache.org/jira/browse/PIG-513
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Attachments: PIG-513.patch, pig-513_2.patch


 The following areas in DefaultTuple.java can be changed:
 The member methods get(), set(), getType() and isNull() all call 
 checkBounds(), which is a redundant call since all four of these functions 
 throw ExecException. Instead of doing a bounds check, we can catch the 
 IndexOutOfBoundsException in a try-catch and rethrow it as an ExecException.
 The write() method has the following unused object (d in the code below):
 {code}
 for (int i = 0; i < sz; i++) {
     try {
         Object d = get(i);
     } catch (ExecException ee) {
         throw new RuntimeException(ee);
     }
     DataReaderWriter.writeDatum(out, mFields.get(i));
 }
 {code}
 {noformat}
 The get(i) call in the try should be replaced by the writeDatum call directly, 
 since d is never used and there is an unnecessary call to get().
 {noformat}
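
 The first suggestion, relying on the list's own bounds check and translating 
 the failure, might look roughly like the sketch below. It is not the 
 DefaultTuple code: SketchTuple and SketchExecException are hypothetical 
 stand-ins for DefaultTuple and ExecException, and the error message wording is 
 made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical checked exception standing in for Pig's ExecException. */
class SketchExecException extends Exception {
    SketchExecException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical tuple; translates the list's own bounds failure instead of pre-checking. */
final class SketchTuple {
    private final List<Object> fields = new ArrayList<>();

    void append(Object value) {
        fields.add(value);
    }

    /** No explicit checkBounds(): ArrayList.get already validates the index. */
    Object get(int fieldNum) throws SketchExecException {
        try {
            return fields.get(fieldNum);
        } catch (IndexOutOfBoundsException e) {
            throw new SketchExecException(
                "Field number " + fieldNum + " is outside tuple of size " + fields.size(), e);
        }
    }

    /** Example helper: true if get(fieldNum) raises the checked exception. */
    static boolean getFails(SketchTuple tuple, int fieldNum) {
        try {
            tuple.get(fieldNum);
            return false;
        } catch (SketchExecException e) {
            return true;
        }
    }
}
```

 The common in-range path then does a single bounds check (the one inside the 
 list) rather than two, and the exceptional path still surfaces as the checked 
 exception that callers already handle.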


[jira] Updated: (PIG-845) PERFORMANCE: Merge Join

2009-07-30 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-845:
-

Attachment: merge-join-for-review.patch

Initial patch for review.

 PERFORMANCE: Merge Join
 ---

 Key: PIG-845
 URL: https://issues.apache.org/jira/browse/PIG-845
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
 Attachments: merge-join-for-review.patch


 This join would work if the data for both tables is sorted on the join key.


[jira] Assigned: (PIG-845) PERFORMANCE: Merge Join

2009-08-09 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-845:


Assignee: Ashutosh Chauhan

 PERFORMANCE: Merge Join
 ---

 Key: PIG-845
 URL: https://issues.apache.org/jira/browse/PIG-845
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Ashutosh Chauhan
 Attachments: merge-join-1.patch, merge-join-for-review.patch



[jira] Commented: (PIG-845) PERFORMANCE: Merge Join

2009-08-11 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12741733#action_12741733
 ] 

Ashutosh Chauhan commented on PIG-845:
--

Hi Dmitriy,

Thanks for the review. Please find my comments inline.

1.
EndOfAllInput flags - could you add comments here about what the point of this 
flag is? You explain what EndOfAllInputSetter does (which is actually rather 
self-explanatory) but not what the meaning of the flag is and how it's used. 
There is a bit of an explanation in PigMapBase, but it really belongs here.
 The EndOfAllInput flag indicates that, on the close() call of a map/reduce 
 task, the pipeline should be run once more. Until now it was used only by 
 POStream, but now POMergeJoin also makes use of it.

2.
Could you explain the relationship between EndOfAllInput and (deleted) POStream?
 POStream is still there; I guess you are referring to MRStreamHandler, which 
 is deleted. That is just a renaming of the class. Now that POMergeJoin also 
 makes use of it, it's better to give it a generic name like EndOfAllInput 
 instead of MRStreamHandler.

3.
Comments in MRCompiler alternate between referring to the left MROp as 
LeftMROper and curMROper. Choose one.
 Ya, will update the comments.

4.
I am curious about the decision to throw compiler exceptions if MergeJoin 
requirements re number of inputs, etc, aren't satisfied. It seems like a better 
user experience would be to log a warning and fall back to a regular join.
 Ya, a good suggestion. It would be straightforward to do it while parsing 
 (e.g. when there are more than two inputs). It's not straightforward to do 
 at logical-to-physical plan and physical-to-MRJobs translation time, though. 

5.
Style notes for visitMergeJoin:

It's a 200-line method. Any way you can break it up into smaller components? As 
is, it's hard to follow.
 I can break it up, but that will bloat the MRCompiler class. A better idea 
 is to have an MRCompilerHelper or some such class where all the low-level 
 helper functions live, so that MRCompiler itself is small and thus easier to 
 read. 

The if statements should be broken up into multiple lines to agree with the 
style guides.

Variable naming: you've got topPrj, prj, pkg, lr, ce, nig.. one at a time they 
are fine, but together in a 200-line method they are unreadable. Please 
consider more descriptive names.
 Will use more descriptive names in the next patch.

6.
Kind of a global comment, since it applies to more than just MergeJoin:

It seems to me like we need a Builder for operators to clean up some of the 
new, set, set, set stuff.

Having the setters return this and a Plan's add() method return the plan, would 
let us replace this:

POProject topPrj = new POProject(new OperatorKey(scope,nig.getNextNodeId(scope)));
topPrj.setColumn(1);
topPrj.setResultType(DataType.TUPLE);
topPrj.setOverloaded(true);
rightMROpr.reducePlan.add(topPrj);
rightMROpr.reducePlan.connect(pkg, topPrj);

with this:

POProject topPrj = new POProject(new OperatorKey(scope,nig.getNextNodeId(scope)))
    .setColumn(1)
    .setResultType(DataType.TUPLE)
    .setOverloaded(true);

rightMROpr.reducePlan.add(topPrj).connect(pkg, topPrj);

I agree. In many places there are too many parameters to set. Setters should 
return the object instead of being void; chaining them this way will help cut 
down the number of lines. 

7.
Is the change to List<List<Byte>> keyTypes in POFRJoin related to MergeJoin or 
just rolled in?
POFRJoin can do without this change, but to avoid code duplication I updated 
POFRJoin to use List<List<Byte>> keyTypes.

8. MergeJoin

break getNext() into components.
 I don't want to do that, because it already has lots of class members which 
 are updated in various places. Spreading those updates across multiple 
 functions will make the logic even harder to follow. Also, I am not sure 
 whether the Java compiler can always inline private methods.

I don't see you supporting Left outer joins. Plans for that? At least document 
the planned approach.
 Ya, outer joins are currently not supported. This is documented in the 
 specification. I will include a comment in the code as well.

Error codes being declared deep inside classes, and documented on the wiki, is 
a poor practice, imo. They should be pulled out into PigErrors (as lightweight 
final objects that have an error code, a name, and a description..) I thought 
Santhosh made progress on this already, no?
 Not sure if I understand you completely. I am using ExecException, 
 FrontEndException etc. Aren't these the lightweight final objects you are 
 referring to?

Could you explain the problem with splits and streams? Why can't this work for 
them?
 Streaming after the join will be supported. There was a bug, which I fixed; 
 the fix will be part of the next patch. Streaming before the join will not be 
 supported because, in the endOfAllInput case, streaming may potentially 
 produce multiple tuples
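
Returning to point 6 above, the chained-setter idea can be sketched in 
isolation as follows. SketchProject and SketchPlan are hypothetical classes 
invented for this example; they are not POProject or a Pig plan, and the 
String result type merely stands in for DataType.TUPLE.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical operator whose setters return this so that calls can be chained. */
final class SketchProject {
    private int column;
    private String resultType = "";
    private boolean overloaded;

    SketchProject setColumn(int column) {
        this.column = column;
        return this;
    }

    SketchProject setResultType(String resultType) {
        this.resultType = resultType;
        return this;
    }

    SketchProject setOverloaded(boolean overloaded) {
        this.overloaded = overloaded;
        return this;
    }

    String describe() {
        return "col=" + column + ",type=" + resultType + ",overloaded=" + overloaded;
    }
}

/** Hypothetical plan whose add() and connect() return the plan itself. */
final class SketchPlan {
    private final List<SketchProject> operators = new ArrayList<>();
    private final List<SketchProject[]> edges = new ArrayList<>();

    SketchPlan add(SketchProject op) {
        operators.add(op);
        return this;
    }

    SketchPlan connect(SketchProject from, SketchProject to) {
        if (!operators.contains(from) || !operators.contains(to)) {
            throw new IllegalArgumentException("both operators must be added before connecting");
        }
        edges.add(new SketchProject[] {from, to});
        return this;
    }

    int size() {
        return operators.size();
    }

    int edgeCount() {
        return edges.size();
    }
}
```

With these, construction and wiring collapse into two statements, e.g. 
new SketchProject().setColumn(1).setResultType("tuple").setOverloaded(true) 
followed by plan.add(pkg).add(topPrj).connect(pkg, topPrj).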

[jira] Updated: (PIG-845) PERFORMANCE: Merge Join

2009-08-12 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-845:
-

Attachment: (was: merge-join-1.patch)

 PERFORMANCE: Merge Join
 ---

 Key: PIG-845
 URL: https://issues.apache.org/jira/browse/PIG-845
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Ashutosh Chauhan


[jira] Updated: (PIG-845) PERFORMANCE: Merge Join

2009-08-13 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-845:
-

Status: Patch Available  (was: Open)

Running through hudson. Release audit warning can be ignored.

 PERFORMANCE: Merge Join
 ---

 Key: PIG-845
 URL: https://issues.apache.org/jira/browse/PIG-845
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Ashutosh Chauhan
 Attachments: merge-join.patch, merge-join.patch



[jira] Updated: (PIG-845) PERFORMANCE: Merge Join

2009-08-13 Thread Ashutosh Chauhan (JIRA)

[
https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ashutosh Chauhan updated PIG-845:
-

Attachment: merge-join.patch

{code}
if(rightMROpr == null || rightMROpr.equals(curMROp))
throw new MRCompilerException(Successor of right input not ...
{code}

Do you also need to check rightMROpr == null here?
I removed the null check because a null there would indicate that two preceding 
MROperators exist but one of them is null. This is highly unlikely, and 
MRCompiler probably would have thrown an exception while compiling those 
preceding physical operators. But I added the check back in any case.

If index is empty it could mean one of the following two things:
1) Data for right input only has null for join key(s)
2) right input is empty
Are there any other reasons why the index would be empty?
In both these cases, join output would be empty - currently the code throws an
exception
Should this change?
A unit test where right side input is empty would be a good one to add.
The exception thrown at that point is correct, because getting a null object 
after reading the index is a bug. But there was nonetheless a problem dealing 
with an empty right file. I fixed that and added a test case for it as well.
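
For reference, the expected behaviour in the two cases above (an empty right 
input, or a right input with only null keys) can be shown with a small 
stand-alone merge join. This is only a sketch under simplifying assumptions: 
both inputs are in-memory lists of Integer keys sorted ascending, only the 
matching key is emitted, and SortedMergeJoinSketch is a hypothetical name. It 
does not use an index and is not the POMergeJoin implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory merge join over two key lists that are each sorted ascending. */
final class SortedMergeJoinSketch {

    /** Inner join: emits the key once per matching (left, right) pair; null keys never match. */
    static List<Integer> join(List<Integer> left, List<Integer> right) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            Integer l = left.get(i);
            Integer r = right.get(j);
            if (l == null) {
                i++;                       // skip null join keys on the left
            } else if (r == null) {
                j++;                       // skip null join keys on the right
            } else if (l.compareTo(r) < 0) {
                i++;
            } else if (l.compareTo(r) > 0) {
                j++;
            } else {
                // Find the run of equal keys on each side and emit their cross product.
                int iEnd = i;
                while (iEnd < left.size() && l.equals(left.get(iEnd))) {
                    iEnd++;
                }
                int jEnd = j;
                while (jEnd < right.size() && r.equals(right.get(jEnd))) {
                    jEnd++;
                }
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add(l);
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }
}
```

With an empty right list, or one holding only nulls, the loop ends without 
emitting anything, so the join output is empty rather than an error.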

Additionally, fixed the findbugs warning.
The release audit warning is due to the addition of a gold file for testing. 
An Apache header can't be added to it, so the warning can be ignored.


[jira] Created: (PIG-926) Merge-Join phase 2

2009-08-16 Thread Ashutosh Chauhan (JIRA)

Merge-Join phase 2
--

 Key: PIG-926
 URL: https://issues.apache.org/jira/browse/PIG-926
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Minor


This jira is created to keep track of phase-2 work for MergeJoin. Various 
limitations exist in phase-1 for Merge Join which are listed on: 
http://wiki.apache.org/pig/PigMergeJoin Those will be addressed here.


[jira] Updated: (PIG-926) Merge-Join phase 2

2009-08-16 Thread Ashutosh Chauhan (JIRA)

[
https://issues.apache.org/jira/browse/PIG-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ashutosh Chauhan updated PIG-926:
-

Attachment: mj_phase2_1.patch

The first patch, attached, runs the full pipeline of the right side in the
indexer before sampling a tuple from each block. This has the following
advantages:
a) It addresses the concern Pradeep pointed out in phase 1: "Strictly we
should not allow LOForeach since it could change sort order or position of join
keys and hence invalidate the index - but we need it so that the Foreach
introduced by the TypeCastInserter when there is a schema for either of the
inputs remains." Since the pipeline is now run before the tuple is sampled,
this becomes a non-issue.
b) Currently, type information doesn't make it to the POSort which sorts the
index entries in the reduce task of the index job. This works for other
reasons, but this patch fixes it.
c) It will improve performance. Instead of always sampling the first record
of the block, the index now contains an entry for the first record in the block
for which a join may happen, thus saving the time spent fetching right tuples
over the network which couldn't be joined in any case.
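
To illustrate how such a per-block index can be consulted, here is a rough sketch. It is not the code in the patch: `BlockIndexSketch`, `add` and `seekOffset` are hypothetical names, keys are integers, and the lookup rule (step back to the last block whose first key is strictly smaller than the probe key) is an assumption chosen to be safe when equal keys span blocks.

```java
import java.util.Map;
import java.util.TreeMap;

class BlockIndexSketch {
    // first join key seen in a block -> byte offset of that block
    // (when several blocks start with the same key, the smallest offset wins)
    private final TreeMap<Integer, Long> entries = new TreeMap<>();

    void add(int firstKeyInBlock, long blockOffset) {
        entries.merge(firstKeyInBlock, blockOffset, Long::min);
    }

    /** Offset to start scanning from for probeKey, or -1 if the index is empty. */
    long seekOffset(int probeKey) {
        if (entries.isEmpty()) {
            return -1L;
        }
        // Rows equal to probeKey may begin in the tail of the preceding block,
        // so step back to the last block whose first key is strictly smaller.
        Map.Entry<Integer, Long> e = entries.lowerEntry(probeKey);
        return (e != null ? e : entries.firstEntry()).getValue();
    }
}
```

The sorted map plays the role of the sorted index produced by the index job; a real index would also carry the file name for each entry.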


[jira] Updated: (PIG-926) Merge-Join phase 2

2009-08-18 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-926:
-

Attachment: (was: mj_phase2_1.patch)


[jira] Updated: (PIG-926) Merge-Join phase 2

2009-08-19 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-926:
-

Attachment: (was: mj_phase2_1.patch)


[jira] Updated: (PIG-926) Merge-Join phase 2

2009-08-19 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-926:
-

Status: Open  (was: Patch Available)



[jira] Updated: (PIG-926) Merge-Join phase 2

2009-08-19 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-926:
-

Attachment: (was: mj_phase2_1.patch)


[jira] Updated: (PIG-926) Merge-Join phase 2

2009-08-19 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-926:
-

Attachment: mj_phase2_1.patch

Updated patch addressing Pradeep's comments.


[jira] Updated: (PIG-926) Merge-Join phase 2

2009-08-19 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-926:
-

Status: Patch Available  (was: Open)


[jira] Commented: (PIG-926) Merge-Join phase 2

2009-08-20 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745510#action_12745510
 ] 

Ashutosh Chauhan commented on PIG-926:
--

The Findbugs warning is about dummyTuple. A dummyTuple is used as an argument 
to select the appropriate overloaded getNext() of a physical operator. Since it 
is just a marker, it is initialized to null and never updated. Findbugs notes 
that it will always be null, which is true, but that has no effect on 
behavior. There is no workaround to get rid of this warning.
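
For readers unfamiliar with the pattern, a small sketch of selecting an overload with a null marker follows. It only illustrates Java overload resolution; the class and method bodies are hypothetical and do not reproduce the physical operator API.

```java
class MarkerOverloadSketch {
    // Each overload uses only the static type of its parameter; the value is ignored.
    static String getNext(Integer marker) { return "integer"; }
    static String getNext(String marker)  { return "chararray"; }

    static String nextAsChararray() {
        String dummyString = null;   // never assigned; exists only to pick the overload
        return getNext(dummyString);
    }

    static String nextAsInteger() {
        Integer dummyInt = null;     // likewise always null
        return getNext(dummyInt);
    }
}
```

The compiler picks the overload from the declared type of the variable, so the marker being null at run time is harmless as long as no overload dereferences it.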


[jira] Assigned: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index

2009-08-28 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-934:


Assignee: Ashutosh Chauhan

 Merge join implementation currently does not seek to right point on the right 
 side input based on the offset provided by the index
 --

 Key: PIG-934
 URL: https://issues.apache.org/jira/browse/PIG-934
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
 Attachments: pig-934.patch


 We use POLoad to seek into right file which has the following code: 
 {noformat}
 public void setUp() throws IOException{
     String filename = lFile.getFileName();
     loader = (LoadFunc)PigContext.instantiateFuncFromSpec(lFile.getFuncSpec());
     is = FileLocalizer.open(filename, pc);
     loader.bindTo(filename, new BufferedPositionedInputStream(is),
                   this.offset, Long.MAX_VALUE);
 }
 {noformat}
 Between opening the stream and bindTo we do not seek to the right offset. 
 bindTo itself does not perform any seek.
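
 The missing step can be sketched with plain java.io streams. This is an assumption-laden illustration, not the fix in the attached patch: `SeekSketch`, `skipFully` and `byteAt` are hypothetical helpers, and a ByteArrayInputStream stands in for the opened file.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class SeekSketch {
    /** Advances the stream by exactly n bytes, or throws EOFException. */
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);   // may skip fewer bytes than asked
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() >= 0) {
                remaining--;                     // skip() made no progress; consume one byte
            } else {
                throw new EOFException("stream ended " + remaining + " bytes before offset");
            }
        }
    }

    /** Demo: returns the byte found at the given offset, or -1 at end of data. */
    static int byteAt(byte[] data, long offset) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            skipFully(in, offset);               // the seek that must precede binding
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

 The loop matters because InputStream.skip() is allowed to skip fewer bytes than requested.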


[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index

2009-08-29 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12749188#action_12749188
 ] 

Ashutosh Chauhan commented on PIG-934:
--

> Seeking to an offset would only work for a single file - hence maybe have a 
> separate function...

Since open() returns an input stream, it is not hard to conceive of a use case 
where one would want to seek into that stream even when the filespec points to 
a directory or a glob. We have to define the semantics here. What does seeking 
in a directory/glob mean? One reasonable answer is to view all the files in the 
directory/glob as one big logical file, treat the offset as an offset into this 
logical file, and then seek into it. Something along the lines of:
{code}
iterator = DataStreamIterator
bytesSeen = 0;
while(iterator.hasNext()){
  open current file pointed to by iterator
  bytesSeen += current file length
  if (bytesSeen > offset)
    bind to adjusted offset in current file and return
  else
    continue;
}
{code} 

But since there is currently no requirement for this, we can catch the case 
where a seek is requested on a directory/glob and throw an exception (as is 
done in this patch). Later on, if we decide to support it instead of throwing 
an exception, we can implement whatever semantics we decide on. If we create a 
new function with a separate name, it will be confusing to make these changes 
later on. Moreover, if there is a different function, the user of the API 
needs to know about it and deal with it (e.g., the need for a special 
constructor in POLoad). The presence or absence of the offset parameter in the 
argument list is, I think, a sufficient indicator of which overloaded version 
of open() to call if there is a need to seek. 
Thoughts?
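
To make the "one big logical file" semantics concrete, here is a small runnable sketch of the offset arithmetic only. `LogicalSeekSketch` and `locate` are hypothetical names, and file lengths are passed in directly rather than read from a DataStreamIterator.

```java
class LogicalSeekSketch {
    /**
     * Treats the files as one concatenated logical file.
     * Returns {fileIndex, offsetWithinThatFile}, or null if offset is past the end.
     */
    static long[] locate(long[] fileLengths, long offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("negative offset: " + offset);
        }
        long bytesSeen = 0;   // total length of the files before the current one
        for (int i = 0; i < fileLengths.length; i++) {
            if (offset < bytesSeen + fileLengths[i]) {
                // Adjusted offset = logical offset minus bytes in earlier files.
                return new long[] {i, offset - bytesSeen};
            }
            bytesSeen += fileLengths[i];
        }
        return null;
    }
}
```

With files of 10, 20 and 30 bytes, logical offset 35 lands in the third file at offset 5.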


[jira] Updated: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index

2009-08-31 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-934:
-

Attachment: (was: pig-934.patch)

 Merge join implementation currently does not seek to right point on the right 
 side input based on the offset provided by the index
 --

 Key: PIG-934
 URL: https://issues.apache.org/jira/browse/PIG-934
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Ashutosh Chauhan
 Attachments: pig-934_2.patch



[jira] Updated: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index

2009-08-31 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-934:
-

Status: Patch Available  (was: Open)


[jira] Commented: (PIG-934) Merge join implementation currently does not seek to right point on the right side input based on the offset provided by the index

2009-09-01 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12749901#action_12749901
 ] 

Ashutosh Chauhan commented on PIG-934:
--

All tests passed on my local box. Not sure why they failed on hudson. 


[jira] Created: (PIG-948) [Usability] Relating pig script with MR jobs

2009-09-07 Thread Ashutosh Chauhan (JIRA)

[Usability] Relating pig script with MR jobs


 Key: PIG-948
 URL: https://issues.apache.org/jira/browse/PIG-948
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Minor


Currently it's hard to relate a pig script to the specific MR jobs it 
launched. On a loaded cluster with multiple simultaneous job submissions, it's 
not easy to figure out which specific MR jobs were launched for a given pig 
script. If Pig can provide this info, it will be useful for debugging and 
monitoring the jobs resulting from a pig script.

At the very least, Pig should be able to provide the user with the following 
information:
1) Job id of the launched job.
2) Complete web url of jobtracker running this job. 



[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs

2009-09-07 Thread Ashutosh Chauhan (JIRA)

[
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Ashutosh Chauhan updated PIG-948:
-

Attachment: pig-948.patch

Attached is a patch which prints the following information in the grunt shell:

{code}
09/09/07 15:11:48 INFO mapReduceLayer.MapReduceLauncher: Submitting job:
job_200908291847_0046 to execution engine.
09/09/07 15:11:48 INFO mapReduceLayer.MapReduceLauncher: More information at:
http://www.jobtracker-site:50030/jobdetails.jsp?jobid=job_200908291847_0046
09/09/07 15:11:48 INFO mapReduceLayer.MapReduceLauncher: To kill this job, use:
kill job_200908291847_0046
{code}


[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs

2009-09-09 Thread Ashutosh Chauhan (JIRA)

[
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753260#action_12753260
]

Ashutosh Chauhan commented on PIG-948:
--

In this string, we determine the job-tracker address, port number and job ids
through APIs, so that's fine.
I agree that hardcoding the other part of the url ( jobdetails.jsp?jobid= ) is
not the best way to do it, as the link will break if that web url changes in
later hadoop releases. But since there is no way to get that url
programmatically, I went ahead with this. If there is a way to get that url
programmatically, let me know. If not, I think it's useful enough to have it
like this and update it if it changes in later hadoop releases.
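
A sketch of what this amounts to, with the hardcoded part isolated in one constant so it is easy to update. The class and method names are hypothetical; in the patch the host, port and job id come from Hadoop APIs, which are not reproduced here and are replaced by plain parameters.

```java
class JobUrlSketch {
    // Hardcoded: this path is the part of the url that cannot be obtained through an API.
    private static final String JOB_DETAILS_PATH = "/jobdetails.jsp?jobid=";

    /** Builds the job details url from values that would come from the cluster. */
    static String jobDetailsUrl(String trackerHost, int trackerPort, String jobId) {
        return "http://" + trackerHost + ":" + trackerPort + JOB_DETAILS_PATH + jobId;
    }
}
```

Keeping the path in a single constant limits a future hadoop url change to a one-line edit.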


[jira] Created: (PIG-951) Reset parallelism to 1 for indexing job in MergeJoin

2009-09-09 Thread Ashutosh Chauhan (JIRA)

Reset parallelism to 1 for indexing job in MergeJoin


 Key: PIG-951
 URL: https://issues.apache.org/jira/browse/PIG-951
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan


After sampling one tuple from every block, one reducer is used to sort the 
index entries in the reduce phase, producing the sorted index used in the 
actual join job. Thus, the parallelism of the index job should be explicitly 
set to 1. Currently, it is not.

Currently, this is a non-issue, since we don't allow any blocking operators in 
the pipeline before merge-join. However, once we do allow blocking operators, 
the parallelism of the indexing job will be that of the preceding blocking 
operator. Even then, the job will complete successfully, because we group on 
only one key (all), so all tuples go to a single reducer. However, it will 
waste cluster resources by starting extra reducers which get no data and thus 
do nothing.
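
The effect can be shown with a few lines. The partition formula below is the one used by Hadoop's default HashPartitioner; whether the index job uses that partitioner is an assumption here, and `SingleKeyPartitionSketch` and `busyReducers` are hypothetical names.

```java
class SingleKeyPartitionSketch {
    /** Same formula as Hadoop's default HashPartitioner. */
    static int partition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Counts how many reducers receive at least one of the given keys. */
    static int busyReducers(String[] keys, int numReducers) {
        boolean[] used = new boolean[numReducers];
        int busy = 0;
        for (String k : keys) {
            int p = partition(k, numReducers);
            if (!used[p]) {
                used[p] = true;
                busy++;
            }
        }
        return busy;
    }
}
```

With every tuple carrying the same key, exactly one reducer is busy no matter how many are started, which is why the remaining reducers are pure overhead.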


[jira] Updated: (PIG-951) Reset parallelism to 1 for indexing job in MergeJoin

2009-09-09 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-951:
-

Attachment: pig-951.patch

One-line patch which fixes this. Also added a test case to catch regressions 
on this.


[jira] Updated: (PIG-948) [Usability] Relating pig script with MR jobs

2009-09-09 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-948:
-

Status: Patch Available  (was: Open)


[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-09-12 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12754491#action_12754491
 ] 

Ashutosh Chauhan commented on PIG-793:
--

In addition to String vs. Text, Alan also mentioned using an array instead of 
ArrayList<Object>. Did anyone take a look at that? I think that change should 
also help. When I benchmarked merge join, nearly 20-30% of CPU time was spent 
in ArrayList operations, which should benefit a lot if an array is used 
instead. So, changing to arrays should help both memory and CPU time, at the 
cost of more expensive appends.
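
As a rough illustration of that trade-off (not Pig's Tuple implementation), an array-backed tuple might look like the sketch below. `ArrayTupleSketch` and its methods are hypothetical.

```java
import java.util.Arrays;

class ArrayTupleSketch {
    private Object[] fields;

    ArrayTupleSketch(int size) {
        fields = new Object[size];   // exactly sized, no spare capacity
    }

    Object get(int i) { return fields[i]; }       // plain array access
    void set(int i, Object v) { fields[i] = v; }
    int size() { return fields.length; }

    /** The expensive operation: every append reallocates and copies the whole array. */
    void append(Object v) {
        fields = Arrays.copyOf(fields, fields.length + 1);
        fields[fields.length - 1] = v;
    }
}
```

When the schema is known up front, the tuple can be allocated at its final size and append() is never needed.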

Also, some small benefits can be gained by very simple changes introduced in 
https://issues.apache.org/jira/browse/PIG-513

 Improving memory efficiency of Tuple implementation
 ---

 Key: PIG-793
 URL: https://issues.apache.org/jira/browse/PIG-793
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Alan Gates

 Currently, our tuple is a real pig and uses a lot of extra memory. 
 There are several places where we can improve memory efficiency:
 (1) Laying out memory for the fields rather than using java objects, since 
 each object for a numeric field takes 16 bytes
 (2) For the cases where we know the schema using Java arrays rather than 
 ArrayList.
 There might be more.


[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-13 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12754671#action_12754671
 ] 

Ashutosh Chauhan commented on PIG-953:
--

1. [Pradeep] zebra store function would basically need to know the sort keys 
in order and which of them are asc/dsc. For this they would iterate over our 
data structure and require that the ordering of the keys match the 
primary/secondary order of the sort keys

[Ashutosh] What about LinkedHashMap? It provides all the properties we are 
seeking here: a single data structure, O(1) lookup and guaranteed iteration 
order. 
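
A minimal sketch of that suggestion, assuming column names as keys and an ascending flag as values. `SortKeysSketch` and its methods are hypothetical and are not part of SortInfo.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SortKeysSketch {
    // column name -> true if ascending; insertion order is primary, secondary, ...
    private final Map<String, Boolean> sortKeys = new LinkedHashMap<>();

    void addKey(String column, boolean ascending) {
        sortKeys.put(column, ascending);
    }

    /** Constant-time lookup; false for descending columns and for non-sort columns. */
    boolean isAscending(String column) {
        return Boolean.TRUE.equals(sortKeys.get(column));
    }

    /** Keys in primary/secondary order, as a store function would iterate them. */
    List<String> keysInSortOrder() {
        return new ArrayList<>(sortKeys.keySet());
    }
}
```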

2. In Utils.java
{code}
public static boolean checkNullAndClass(Object obj1, Object obj2) {
    return checkNullEquals(obj1, obj2, false) && obj1.getClass() == obj2.getClass();
}
{code}

will result in an NPE when both obj1 and obj2 are null. 

A minor detail: suppose obj1 is declared as ArrayList<Integer> and obj2 as 
ArrayList<String>. obj1.getClass() == obj2.getClass() will return true, thanks 
to type erasure by the java compiler at compile time. Not sure whether that's 
OK for the check here. 
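
One possible null-safe shape for such a check, together with a demonstration of the erasure point, is sketched here. Treating two nulls as a match is an assumption, since the intended semantics of checkNullAndClass are not stated; `ClassCheckSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ClassCheckSketch {
    /** True when both are null, or both are non-null and of the same runtime class. */
    static boolean sameClassOrBothNull(Object a, Object b) {
        if (a == null || b == null) {
            return a == b;   // no dereference when either side is null
        }
        return a.getClass() == b.getClass();
    }

    /** Element types are erased, so both lists report the same runtime class. */
    static boolean erasureDemo() {
        List<Integer> ints = new ArrayList<>();
        List<String> strs = new ArrayList<>();
        return sameClassOrBothNull(ints, strs);
    }
}
```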

3. In StoreConfig.java, one of the scenarios in which SortInfo is returned as 
null is
{code}
* 3) the store follows an order by but the schema
* of order by does not have column name(s) for the sort
* column(s)
{code}
I understand that the reason for this additional constraint is that SortInfo 
maintains a list of column names. But even if the schema contains only type 
information and not the column names, that is still sufficient information to 
build indexes. The columns on which the data is sorted can be recorded using 
column positions, can't they? Does zebra require columns to be named? If it 
doesn't, then SortInfo could be changed so that it provides column positions 
instead of names to the loader when columns aren't named.

In POMergeJoin.java
4.
{code}
+currentFileName = lFile.getFileName();
+loader = (LoadFunc)PigContext.instantiateFuncFromSpec(lFile.getFuncSpec());
+is = FileLocalizer.open(currentFileName, offset, pc);
+if (currentFileName.endsWith(".bz") || currentFileName.endsWith(".bz2")) {
+    is = new CBZip2InputStream((SeekableInputStream)is, 9);
+} else if (currentFileName.endsWith(".gz")) {
+    is = new GZIPInputStream(is);
+}
+
{code}

Isn't this blocked on https://issues.apache.org/jira/browse/PIG-930 ?

5.
{code}
default: // We don't deal with ERR/NULL. just pass them down
    return res;
{code}
should be changed to
{code}
default:
    throwProcessingException(false, null);
{code}

because if the status is Error, execution should stop and an exception should 
be thrown as early as possible, instead of continuing to do work which will be 
wasted. If the status is Null, an NPE will occur while doing the join.

6.
{code}
InputStream is = FileLocalizer.open(rightInputFileName, pc);
rightLoader.bindTo(rightInputFileName, new BufferedPositionedInputStream(is),
                   0, Long.MAX_VALUE);
{code}
I don't see any use for this code. I think it's not required and can be removed.

In fact, there is no need for the following function either:
{code}
/**
 * @param rightInputFileName the rightInputFileName to set
 */
public void setRightInputFileName(String rightInputFileName) {
    this.rightInputFileName = rightInputFileName;
}
{code} 
The file name of the right side is obtained from the index, which is contained 
in the index file. The index file is passed directly as a constructor argument 
of indexableLoadFunc, so there is no need to pass rightInputFileName from 
MRCompiler to POMergeJoin.
If this reasoning is correct, then DefaultIndexableLoader.bindTo() should 
throw an IOException, because the contract of DefaultIndexableLoader is that it 
is initialized with all the info it needs in its constructor, and then 
seekNear is called on it to seek to the correct location. bindTo() shouldn't be 
used for this loader. 
Also, seekNear() doesn't sound right. How about seekToClosest()?

7. I think introducing an order-preserving flag on logical operators is a good
idea.
First, it's self-documenting, as the information is contained within the
operator and not checked by doing instanceof elsewhere in the code.
Second, it's useful information which, if present, can help the optimizer make
smart decisions. As an example, the optimizer can rewrite a symmetric hash join
to a merge-sort join if all the logical operators in the query DAG from the
join inputs to the root have this flag set to true. Without this flag, such
optimizations will be hard.

 Enable merge join in pig to work with loaders and store functions which can 
 internally index sorted data 
 -

 Key: PIG-953
 URL: https://issues.apache.org/jira/browse/PIG-953
 Project: Pig

[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-13 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12754811#action_12754811
 ] 

Ashutosh Chauhan commented on PIG-953:
--

And a couple more:
8.
bq. Findbugs complains about passing internal members as is in getters since
the caller can then modify these internal members - hence the copy.

{code}
public List<Boolean> getAscColumns() {
    return Utils.getCopy(ascColumns);
}
{code}

If we instead use the following, we achieve the same thing: Findbugs will not
complain, and there is no need for our own copy method.
{code}
public List<Boolean> getAscColumns() {
    return new ArrayList<Boolean>(ascColumns);
}
{code}
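
A minimal sketch of the copy-in-getter idea, for illustration only. The class name SortColumns and its constructor are hypothetical and not taken from the patch; only the getter mirrors the suggestion above. It uses the diamond operator, so it assumes Java 7 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder class; only getAscColumns() mirrors the getter discussed above.
final class SortColumns {
    private final List<Boolean> ascColumns;

    SortColumns(List<Boolean> ascColumns) {
        // Copy on the way in as well, so later changes by the caller do not leak in.
        this.ascColumns = new ArrayList<>(ascColumns);
    }

    // Returns a fresh list on every call; the internal list is never exposed.
    public List<Boolean> getAscColumns() {
        return new ArrayList<>(ascColumns);
    }

    public static void main(String[] args) {
        SortColumns info = new SortColumns(Arrays.asList(true, false));
        List<Boolean> copy = info.getAscColumns();
        copy.add(true); // mutates only the caller's copy
        System.out.println("internal size: " + info.getAscColumns().size());
        System.out.println("copy size: " + copy.size());
    }
}
```

Note that the copy constructor throws a NullPointerException if the field is null, so a null check would be needed if ascColumns can legitimately be unset.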

9. In POMergeJoin.java
{code}
// we should never get here!
return new Result(POStatus.STATUS_ERR, null);
{code}

could be changed to
{code}
// we should never get here!
throw new ExecException(errMsg,2176);
{code}
because if we ever get there, it will otherwise result in an NPE later on.

 Enable merge join in pig to work with loaders and store functions which can 
 internally index sorted data 
 -

 Key: PIG-953
 URL: https://issues.apache.org/jira/browse/PIG-953
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Attachments: PIG-953.patch


 Currently merge join implementation in pig includes construction of an index 
 on sorted data and use of that index to seek into the right input to 
 efficiently perform the join operation. Some loaders (notably the zebra 
 loader) internally implement an index on sorted data and can perform this 
 seek efficiently using their index. So the use of the index needs to be 
 abstracted in such a way that when the loader supports indexing, pig uses it 
 (indirectly through the loader) and does not construct an index. 


[jira] Assigned: (PIG-858) Order By followed by replicated join fails while compiling MR-plan from physical plan

2009-09-14 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-858:


Assignee: Ashutosh Chauhan

 Order By followed by replicated join fails while compiling MR-plan from 
 physical plan
 ---

 Key: PIG-858
 URL: https://issues.apache.org/jira/browse/PIG-858
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Attachments: pig-858.patch


 Consider the query:
 {code}
 A = load 'a';
 B = order A by $0;
 C = join A by $0, B by $0;
 explain C;
 {code}
 works. But if replicated join is used instead
 {code}
 A = load 'a';
 B = order A by $0;
 C = join A by $0, B by $0 using replicated;
 explain C;
 {code}
 this fails with ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2034: Error 
 compiling operator POFRJoin
 relevant stacktrace:
 {code}
 Caused by: java.lang.RuntimeException: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompilerException:
  ERROR 2034: Error compiling operator POFRJoin
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.explain(HExecutionEngine.java:306)
 at org.apache.pig.PigServer.explain(PigServer.java:574)
 ... 8 more
 Caused by: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompilerException:
  ERROR 2034: Error compiling operator POFRJoin
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:942)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.visit(POFRJoin.java:173)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:342)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:327)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:233)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:301)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.explain(MapReduceLauncher.java:278)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.explain(HExecutionEngine.java:303)
 ... 9 more
 Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:901)
 ... 16 more
 {code}


[jira] Updated: (PIG-858) Order By followed by replicated join fails while compiling MR-plan from physical plan

2009-09-14 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-858:
-

Attachment: pig-858.patch

Patch as discussed in the previous comment. Also included are test cases where
a blocking operator (order-by, distinct) occurs before the FR join.

 Order By followed by replicated join fails while compiling MR-plan from 
 physical plan
 ---

 Key: PIG-858
 URL: https://issues.apache.org/jira/browse/PIG-858
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Ashutosh Chauhan
 Attachments: pig-858.patch



[jira] Assigned: (PIG-959) Merge Join fails when there is a blocking operator before it in query.

2009-09-14 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan reassigned PIG-959:


Assignee: Ashutosh Chauhan

 Merge Join fails when there is a blocking operator before it in query.
 --

 Key: PIG-959
 URL: https://issues.apache.org/jira/browse/PIG-959
 Project: Pig
  Issue Type: Bug
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan

 If there is an order-by, distinct or any other blocking operator in query 
 followed by Merge Join, pig fails to compile it.


[jira] Commented: (PIG-959) Merge Join fails when there is a blocking operator before it in query.

2009-09-14 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12755270#action_12755270
 ] 

Ashutosh Chauhan commented on PIG-959:
--

This issue is blocked on PIG-858

 Merge Join fails when there is a blocking operator before it in query.
 --

 Key: PIG-959
 URL: https://issues.apache.org/jira/browse/PIG-959
 Project: Pig
  Issue Type: Bug
Reporter: Ashutosh Chauhan


[jira] Created: (PIG-959) Merge Join fails when there is a blocking operator before it in query.

2009-09-14 Thread Ashutosh Chauhan (JIRA)

Merge Join fails when there is a blocking operator before it in query.
--

 Key: PIG-959
 URL: https://issues.apache.org/jira/browse/PIG-959
 Project: Pig
  Issue Type: Bug
Reporter: Ashutosh Chauhan


If there is an order-by, distinct or any other blocking operator in query 
followed by Merge Join, pig fails to compile it.


[jira] Commented: (PIG-865) Performance: Unnnecessary computation in FRJoin

2009-09-15 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12755808#action_12755808
 ] 

Ashutosh Chauhan commented on PIG-865:
--

Ouch... It should have been at least on par, if not better! Reading the code, I
can see there are more opportunities to optimize here. I am currently trying
to get access to M45; once I have it, I will run a few benchmarks and report
back if I see improvements.

 Performance: Unnnecessary computation in FRJoin
 ---

 Key: PIG-865
 URL: https://issues.apache.org/jira/browse/PIG-865
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Minor
 Attachments: pig-865.patch, pig-865_v2.patch


 In the POFRJoin implementation, POLocalRearrange is used to extract join keys
 from the input tuples. If the keys match, the input tuples are fed to a
 Foreach, which does a cross on its inputs, to perform the actual join. After
 the keys are extracted from the POLocalRearrange output, the function
 getValueTuple(POLocalRearrange lr, Tuple tuple) is called to reconstruct the
 input tuple. This function call seems unnecessary, since we already have the
 input tuple at that point. This is not a bug, but since the function is called
 for every tuple, eliminating it should help performance.


[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-09-27 Thread Ashutosh Chauhan (JIRA)

[
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12760116#action_12760116
]

Ashutosh Chauhan commented on PIG-953:
--

Changes look good. A couple of points:

bq. I think this internal structure at this point does not need to be optimized
for lookup

Well, it's less about optimization and more about maintainability. First, the
relationship between the two parallel arrays is implicit, so anyone reading
that code has to work out the relationship on their own. With a single
structure, the relationship would be explicit. Second, there is quite a bit
of code around it, which IMO would be simplified if a single data structure
were used instead. That said, either approach works fine, so I will leave it
up to you.
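
To illustrate the single-structure point, here is a rough sketch. All names in it (SortSpec, SortColumn, name, ascending) are hypothetical and chosen only for the example; they are not the fields used in the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical container: one list of pairs instead of two lists kept in step by index.
final class SortSpec {
    private final List<SortColumn> columns = new ArrayList<>();

    void add(String name, boolean ascending) {
        columns.add(new SortColumn(name, ascending));
    }

    // Returns true if the named column is sorted ascending; false if it is
    // descending or not present at all.
    boolean isAscending(String name) {
        for (SortColumn c : columns) {
            if (c.name.equals(name)) {
                return c.ascending;
            }
        }
        return false;
    }

    int size() {
        return columns.size();
    }

    public static void main(String[] args) {
        SortSpec spec = new SortSpec();
        spec.add("a", true);
        spec.add("b", false);
        System.out.println("a ascending: " + spec.isAscending("a"));
        System.out.println("b ascending: " + spec.isAscending("b"));
    }
}

// Hypothetical pair type; the name and its sort direction cannot drift apart.
final class SortColumn {
    final String name;
    final boolean ascending;

    SortColumn(String name, boolean ascending) {
        this.name = name;
        this.ascending = ascending;
    }
}
```

With parallel lists, every add, remove, and lookup has to touch both lists at the same index; here that invariant is carried by the type itself.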

bq. Zebra needs column names and cannot work with positions

That is a limitation of Zebra, which it should overcome at some point in
time. There might be a good reason for it, but I fail to see what extra
information column names provide when the type and position of columns should
be sufficient. It also imposes an additional requirement on the user: if data
is stored using ZebraStorage and later loaded back, the user has to provide
the same column names that were given when storing it. No such constraint
exists for any other load-store function, such as PigStorage.

Enable merge join in pig to work with loaders and store functions which can
internally index sorted data
-

Key: PIG-953
URL: https://issues.apache.org/jira/browse/PIG-953
Project: Pig
Issue Type: Improvement
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Attachments: PIG-953-2.patch, PIG-953.patch


[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs

2009-09-27 Thread Ashutosh Chauhan (JIRA)

[
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12760121#action_12760121
]

Ashutosh Chauhan commented on PIG-948:
--

@Daniel

bq. Also I notice in many cases we cannot get first job id correctly (job id is
null in this case). If I change sleepTime (MapReduceLauncher.java:100) from 500
to 1000 (ms), things look fine. Does anyone else also see that?

The reason is that JobControlCompiler compiles a set of inter-dependent MR jobs
and generates a job-control object, which is then submitted asynchronously to
Hadoop for execution. Since we don't block on that thread, it's possible that
job ids are not yet assigned when we ask for them. Setting the sleep time to a
higher value like 1000 ms should be sufficient for most cases. Note that
increasing this sleep time doesn't affect execution in any way, since we are
sleeping in a thread which only does reporting. Another approach, foolproof
though more complicated, is to sleep for a shorter duration, then check whether
the id is assigned, and if not, sleep again in a while loop until the ids are
assigned.
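
A rough sketch of that polling loop, for illustration only. The class JobIdPoller, the method waitForJobId, and the Supplier used as a stand-in for asking Hadoop for the id are all hypothetical. The timeout is also my addition, so that the loop cannot spin forever if an id is never assigned.

```java
import java.util.function.Supplier;

final class JobIdPoller {
    // Polls idSource until it returns a non-null id or timeoutMillis has elapsed.
    // Returns null on timeout or if the thread is interrupted while sleeping.
    static String waitForJobId(Supplier<String> idSource, long pollMillis, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        String id = idSource.get();
        while (id == null && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return null;
            }
            id = idSource.get();
        }
        return id;
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Fake source: reports no id for the first two calls, then an id.
        Supplier<String> fake = () -> ++calls[0] >= 3 ? "job_0001" : null;
        System.out.println(waitForJobId(fake, 50, 2000));
    }
}
```

Since this runs in the reporting thread only, a short poll interval costs little and avoids guessing a single sleep time that is long enough for every cluster.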

[Usability] Relating pig script with MR jobs

Key: PIG-948
URL: https://issues.apache.org/jira/browse/PIG-948
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Minor
Fix For: 0.6.0

Attachments: pig-948-2.patch, pig-948.patch


[jira] Updated: (PIG-858) Order By followed by replicated join fails while compiling MR-plan from physical plan

2009-09-27 Thread Ashutosh Chauhan (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-858:
-

Fix Version/s: 0.6.0
Affects Version/s: (was: 0.3.0)
   0.4.0
   Status: Patch Available  (was: Open)

 Order By followed by replicated join fails while compiling MR-plan from 
 physical plan
 ---

 Key: PIG-858
 URL: https://issues.apache.org/jira/browse/PIG-858
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.6.0

 Attachments: pig-858.patch



[jira] Commented: (PIG-981) Merge join should restrict join key expressions to simple projects

2009-09-28 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12760406#action_12760406
 ] 

Ashutosh Chauhan commented on PIG-981:
--

The default merge join implementation can handle order-preserving join
expressions, that is, when merge join itself builds the index and doesn't rely
on the underlying storage for the index. When merge join doesn't build the
index itself, this can't be guaranteed, but we don't have to limit all possible
uses of merge join for this reason. Rather, we should check whether merge join
is building its own index: if it is, allow order-preserving expressions; if it
is not, only *then* restrict expressions to projections.

 Merge join should restrict join key expressions to simple projects
 --

 Key: PIG-981
 URL: https://issues.apache.org/jira/browse/PIG-981
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath

 Currently merge join allows join key expressions to be arbitrary expressions
 with the assumption that the expressions keep the sort order. Since currently
 only ascending sort order is supported, the code checks at run time for sort
 order and catches the case where sort order is broken because the join key
 expression is not order preserving. However, there is a reason we should
 restrict the join keys to projections of columns only:
 PIG-953 will enable Pig's merge join to work with loaders and store functions
 which can internally index sorted data. These store functions can only create
 an index (and hence look up on the index) on raw data columns (and not on
 expressions on the columns).
 Hopefully this does not reduce the usability of merge join much, since the
 expressions can always be applied post-join on the join columns, and since
 the expressions are order preserving they do not affect the outcome of the
 join.


[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs

2009-09-30 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12760909#action_12760909
 ] 

Ashutosh Chauhan commented on PIG-948:
--

+1 for the patch.

 [Usability] Relating pig script with MR jobs
 

 Key: PIG-948
 URL: https://issues.apache.org/jira/browse/PIG-948
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.6.0

 Attachments: pig-948-2.patch, pig-948-3.patch, pig-948.patch


 Currently it's hard to relate a Pig script to a specific MR job.
 In a loaded cluster with multiple simultaneous job submissions, it's not easy
 to figure out which specific MR jobs were launched for a given Pig script. If
 Pig can provide this info, it will be useful for debugging and monitoring the
 jobs resulting from a Pig script.
 At the very least, Pig should be able to provide the user with the following
 information:
 1) Job id of the launched job.
 2) Complete web URL of the jobtracker running this job.


[jira] Commented: (PIG-948) [Usability] Relating pig script with MR jobs

2009-10-07 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763169#action_12763169
 ] 

Ashutosh Chauhan commented on PIG-948:
--

+1 
Change looks good. It should be log.info instead of log.error. In local Hadoop
mode, since it's all running in one Java process, there is no job tracker port
address to get.

 [Usability] Relating pig script with MR jobs
 

 Key: PIG-948
 URL: https://issues.apache.org/jira/browse/PIG-948
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
Priority: Minor
 Fix For: 0.6.0

 Attachments: pig-948-2.patch, pig-948-3.patch, PIG-948-4.patch, 
 pig-948.patch



[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-10-07 Thread Ashutosh Chauhan (JIRA)

[
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763218#action_12763218
]

Ashutosh Chauhan commented on PIG-953:
--

Changes look good. I have one comment:

1) In SortInfo.java#equals
We have two lists and we want to check them for equality. I quickly looked at
the JDK sources and it seems that ArrayList doesn't override equals, so an
equals check on the lists would be a reference-equality test, which would be
incorrect. The correct way to do this would be to first check the sizes of the
two lists and, if they are equal, iterate through both lists and check the
equality of the items at the same index in the two lists.

A few nits:
1) TestMergeJoin contains a System.err.println, which we can get rid of.
2) There are a few unused imports in the patch.
3) SortInfo.java#getSortColInfoList may result in a Findbugs warning for
reasons similar to those we discussed earlier in this jira.

Enable merge join in pig to work with loaders and store functions which can
internally index sorted data
-


[jira] Commented: (PIG-953) Enable merge join in pig to work with loaders and store functions which can internally index sorted data

2009-10-07 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763256#action_12763256
 ] 

Ashutosh Chauhan commented on PIG-953:
--

..aah.. I should have dug deeper into the JDK sources. AbstractList, which
ArrayList extends, does override equals and provides the correct behavior. So
my comment is a non-issue. With the nits taken care of, +1 for the patch.
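
For reference, a small check of that behavior. The class and method names are made up for the example; List.equals itself is standard JDK behavior, specified to compare sizes and then elements pairwise in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ListEqualsDemo {
    // Delegates to List.equals, which ArrayList inherits from AbstractList.
    static boolean sameContents(List<?> a, List<?> b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        List<Boolean> first = new ArrayList<>(Arrays.asList(true, false));
        List<Boolean> second = new ArrayList<>(Arrays.asList(true, false));
        // Two distinct objects with the same elements in the same order.
        System.out.println("same reference: " + (first == second));
        System.out.println("equals: " + sameContents(first, second));
    }
}
```

So a plain equals call on the two lists in SortInfo.java#equals is sufficient, provided the element type also implements equals.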

 Enable merge join in pig to work with loaders and store functions which can 
 internally index sorted data 
 -

 Key: PIG-953
 URL: https://issues.apache.org/jira/browse/PIG-953
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Attachments: PIG-953-2.patch, PIG-953-3.patch, PIG-953.patch



[jira] Commented: (PIG-858) Order By followed by replicated join fails while compiling MR-plan from physical plan

2009-10-14 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12765720#action_12765720
 ] 

Ashutosh Chauhan commented on PIG-858:
--

visitUnion has the same changes as the other visit functions, that is, it adds
the MR operator corresponding to POUnion to the phyToMROpMap map. The real
changes are in visitFRJoin. Earlier, visitFRJoin used to look through the
compiledInputs array of MROpers one by one, trying to match each MROper's leaf
PO with POFRJoin using the operator key. Now it doesn't need to do that; it can
simply look it up in phyToMROpMap.
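
A simplified sketch of the lookup idea, not the actual MRCompiler code. PhysicalOp and MROp are hypothetical stand-ins for Pig's physical and MR operator classes, and record/find are made-up method names; only the map name phyToMROpMap comes from the patch.

```java
import java.util.HashMap;
import java.util.Map;

final class OperatorLookupDemo {
    // Hypothetical stand-in for a physical operator. It does not override
    // equals/hashCode, so the map below keys on object identity.
    static final class PhysicalOp {
        final String key;
        PhysicalOp(String key) { this.key = key; }
    }

    // Hypothetical stand-in for a compiled MR operator.
    static final class MROp {
        final String name;
        MROp(String name) { this.name = name; }
    }

    // Filled in by each visit method as it compiles a physical operator.
    private final Map<PhysicalOp, MROp> phyToMROpMap = new HashMap<>();

    void record(PhysicalOp po, MROp mro) {
        phyToMROpMap.put(po, mro);
    }

    // Direct lookup; returns null if the operator was never recorded.
    MROp find(PhysicalOp po) {
        return phyToMROpMap.get(po);
    }

    public static void main(String[] args) {
        OperatorLookupDemo compiler = new OperatorLookupDemo();
        PhysicalOp order = new PhysicalOp("order");
        compiler.record(order, new MROp("job-1"));
        System.out.println(compiler.find(order).name);
    }
}
```

The point is that the lookup no longer depends on where the operator sits inside an MR operator, which is what the leaf-matching scan relied on.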

 Order By followed by replicated join fails while compiling MR-plan from 
 physical plan
 ---

 Key: PIG-858
 URL: https://issues.apache.org/jira/browse/PIG-858
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.6.0

 Attachments: pig-858.patch



[jira] Commented: (PIG-858) Order By followed by replicated join fails while compiling MR-plan from physical plan

2009-10-14 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12765735#action_12765735
 ] 

Ashutosh Chauhan commented on PIG-858:
--

It's been a while since I did that patch, so a bit more clarification: we are
interested in finding the PO which corresponds to the fragment PO input of
POFRJoin. This PO is already compiled and is in one of the MROpers. Earlier we
would iterate through the compiledInputs array trying to match this PO with the
PO contained in each MROper. This fails, as discussed in previous comments.
With this change, since we keep track of the MR operator for each physical
operator, we no longer need to do that and can simply look up the MROper
corresponding to the fragment PO in phyToMROpMap.

 Order By followed by replicated join fails while compiling MR-plan from 
 physical plan
 ---

 Key: PIG-858
 URL: https://issues.apache.org/jira/browse/PIG-858
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.4.0
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Fix For: 0.6.0

 Attachments: pig-858.patch


 Consider the query:
 {code}
 A = load 'a';
 B = order A by $0;
 C = join A by $0, B by $0;
 explain C;
 {code}
 works. But if replicated join is used instead
 {code}
 A = load 'a';
 B = order A by $0;
 C = join A by $0, B by $0 using replicated;
 explain C;
 {code}
 this fails with ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2034: Error 
 compiling operator POFRJoin
 relevant stacktrace:
 {code}
 Caused by: java.lang.RuntimeException: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompilerException:
  ERROR 2034: Error compiling operator POFRJoin
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.explain(HExecutionEngine.java:306)
 at org.apache.pig.PigServer.explain(PigServer.java:574)
 ... 8 more
 Caused by: 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompilerException:
  ERROR 2034: Error compiling operator POFRJoin
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:942)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.visit(POFRJoin.java:173)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:342)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:327)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:233)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:301)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.explain(MapReduceLauncher.java:278)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.explain(HExecutionEngine.java:303)
 ... 9 more
 Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:901)
 ... 16 more
 {code}


[jira] Commented: (PIG-928) UDFs in scripting languages

2009-10-16 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766750#action_12766750
 ] 

Ashutosh Chauhan commented on PIG-928:
--

30x is indeed too slow. But between BSF and direct bindings, I would have 
expected direct bindings to perform better, since BSF adds an extra layer of 
translation. Isn't that so?

 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
 Attachments: package.zip


 It should be possible to write UDFs in scripting languages such as python, 
 ruby, etc.  This frees users from needing to compile Java, generate a jar, 
 etc.  It also opens Pig to programmers who prefer scripting languages over 
 Java.


[jira] Commented: (PIG-928) UDFs in scripting languages

2009-10-16 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766763#action_12766763
 ] 

Ashutosh Chauhan commented on PIG-928:
--

One good lesson from this test, though, is that BSF is not slower than direct 
bindings (this needs additional verification). So this feature could be 
implemented with a lot less code and complexity using BSF, as opposed to using 
a different direct binding for each language. On the other hand, the only 
useful language BSF currently supports is Ruby. I am not sure how many people 
using Pig will also be interested in Groovy, JavaScript, etc. (the other 
languages supported by BSF).



[jira] Commented: (PIG-1025) Should be able to set job priority through Pig Latin

2009-10-16 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766771#action_12766771
 ] 

Ashutosh Chauhan commented on PIG-1025:
---

Useful feature. The patch looks straightforward. Your test case only checks 
whether the statement is parsed correctly; I suggest also testing whether the 
priority is actually set in the jobconf.

 Should be able to set job priority through Pig Latin
 

 Key: PIG-1025
 URL: https://issues.apache.org/jira/browse/PIG-1025
 Project: Pig
  Issue Type: New Feature
  Components: grunt
Affects Versions: 0.4.0
Reporter: Kevin Weil
Priority: Minor
 Fix For: 0.6.0

 Attachments: PIG-1025.patch


 Currently users can set the job name through Pig Latin by saying
 set job.name 'my job name'
 The ability to set the priority would also be nice, and the patch should be 
 small.  The goal is to be able to say
 set job.priority 'high'
 and throw a JobCreationException in the JobControlCompiler if the priority is 
 not one of the allowed string values from the o.a.h.mapred.JobPriority enum: 
 very_low, low, normal, high, very_high.   Case insensitivity makes this a 
 little nicer.
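
A minimal sketch of the case-insensitive validation described above, under 
stated assumptions: the local Priority enum only mirrors the five values listed 
in the issue and stands in for o.a.h.mapred.JobPriority, JobPriorityCheck and 
parse are hypothetical names, and a plain IllegalArgumentException is thrown 
instead of the proposed JobCreationException so that the example stays 
self-contained. It is not taken from the attached patch.

```java
import java.util.Locale;

class JobPriorityCheck {
    // Local stand-in listing the values named in the issue; the real enum is
    // org.apache.hadoop.mapred.JobPriority.
    enum Priority { VERY_LOW, LOW, NORMAL, HIGH, VERY_HIGH }

    // Parses a user-supplied priority case-insensitively and rejects anything
    // that is not one of the allowed values.
    static Priority parse(String value) {
        try {
            return Priority.valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // The issue proposes a JobCreationException here; this sketch
            // rethrows a plain IllegalArgumentException with a clearer message.
            throw new IllegalArgumentException("Invalid job.priority: " + value, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("high"));     // HIGH
        System.out.println(parse("Very_Low")); // VERY_LOW
    }
}
```

A test along the lines suggested in the comment would then read the priority 
back from the job configuration and compare it with the parsed value, rather 
than only checking that the set statement parses.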


[jira] Commented: (PIG-928) UDFs in scripting languages

2009-10-16 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766774#action_12766774
 ] 

Ashutosh Chauhan commented on PIG-928:
--

Right, I overlooked it. I think Ruby and Python are the two most widely used 
scripting languages, and both are supported by BSF. So, comparing BSF with 
direct bindings:
1) Performance: initial tests show they are almost equal.
2) Support for multiple languages.
3) Ease of implementation.
To me, BSF seems to be the way to go for this, at least for the first cut. 
Implementing this feature using BSF will allow us to expose it to users 
quickly, and if many people are using it and find one particular language to 
be slow, then we can explore direct bindings for that particular language. 
Thoughts?



[jira] Commented: (PIG-928) UDFs in scripting languages

2009-10-17 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12766984#action_12766984
 ] 

Ashutosh Chauhan commented on PIG-928:
--

I did some quick benchmarking of the BSF approach for UDFs written in Ruby, 
Python, and Groovy, against the native builtin in Pig. It's a standard 
wordcount example where the UDF tokenizes an input string into a number of 
words. I used the Pig sources (src/org/apache/pig) as input, which have more 
than 210K lines. Since I haven't yet figured out type translation, to be 
consistent across the experiment I passed the data as a String argument and 
used Object[] as the return type in all languages. Following are the numbers I 
got, averaged over 3 runs:

||Language||Time (seconds)||Factor||
|Pig|17|1|
|Ruby|155|9.1|
|Python|178|10.4|
|Groovy|1460|85|

This shows that the Groovy-BSF combination is very slow, and that Ruby and 
Python are much better. These numbers should be seen as an absolute worst case. 
I believe that type translation, and compiling the script in the constructor 
and using the compiled version instead of evaluating the script in every exec() 
call, will give much better performance. There might also be other 
optimizations.
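
The compile-once idea can be sketched as follows. This is a sketch under 
assumptions, not the benchmark code: CompiledOnceUdf is a hypothetical wrapper, 
and "compiling" is modelled as a plain Java function that turns script source 
into a callable, so that no script engine is required to run it. With 
javax.script, the analogous step would presumably be Compilable.compile() in 
the constructor and CompiledScript.eval() per record, but that mapping is 
neither shown nor measured here.

```java
import java.util.function.Function;

// Hypothetical UDF wrapper: "compiling" is modelled as a function that turns
// script source into a callable, so no real script engine is needed here.
class CompiledOnceUdf {
    private final Function<String, Object[]> compiled;

    // Compile in the constructor, once per UDF instance.
    CompiledOnceUdf(String source,
                    Function<String, Function<String, Object[]>> compiler) {
        this.compiled = compiler.apply(source);
    }

    // exec() reuses the compiled form instead of re-evaluating the source.
    Object[] exec(String input) {
        return compiled.apply(input);
    }

    public static void main(String[] args) {
        int[] compileCount = {0};
        Function<String, Function<String, Object[]>> compiler = src -> {
            compileCount[0]++;
            // Stand-in "script": split the input on whitespace.
            return line -> line.trim().split("\\s+");
        };

        CompiledOnceUdf tokenize = new CompiledOnceUdf("tokenize", compiler);
        for (String line : new String[] {"a b c", "d e"}) {
            System.out.println(tokenize.exec(line).length);
        }
        System.out.println("compilations: " + compileCount[0]);
    }
}
```

However many records pass through exec(), the compiler is invoked only once, 
which is where the expected saving over per-call evaluation would come from.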

Sometime next week, I will try to repeat the same experiment with javax.script.



[jira] Commented: (PIG-1012) FINDBUGS: SE_BAD_FIELD: Non-transient non-serializable instance field in serializable class

2009-10-21 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768353#action_12768353
 ] 

Ashutosh Chauhan commented on PIG-1012:
---

We only looked at POFRJoin; this might be happening in other places as well.

 FINDBUGS: SE_BAD_FIELD: Non-transient non-serializable instance field in 
 serializable class
 ---

 Key: PIG-1012
 URL: https://issues.apache.org/jira/browse/PIG-1012
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
 Attachments: PIG-1012.patch


 SeClass org.apache.pig.backend.executionengine.PigSlice defines 
 non-transient non-serializable instance field is
 SeClass org.apache.pig.backend.executionengine.PigSlice defines 
 non-transient non-serializable instance field loader
 Sejava.util.zip.GZIPInputStream stored into non-transient field 
 PigSlice.is
 Seorg.apache.pig.backend.datastorage.SeekableInputStream stored into 
 non-transient field PigSlice.is
 Seorg.apache.tools.bzip2r.CBZip2InputStream stored into non-transient 
 field PigSlice.is
 Seorg.apache.pig.builtin.PigStorage stored into non-transient field 
 PigSlice.loader
 Seorg.apache.pig.backend.hadoop.DoubleWritable$Comparator implements 
 Comparator but not Serializable
 Se
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigBagWritableComparator
  implements Comparator but not Serializable
 Se
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigCharArrayWritableComparator
  implements Comparator but not Serializable
 Se
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigDBAWritableComparator
  implements Comparator but not Serializable
 Se
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigDoubleWritableComparator
  implements Comparator but not Serializable
 Se
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigFloatWritableComparator
  implements Comparator but not Serializable
 Se
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigIntWritableComparator
  implements Comparator but not Serializable
 Se
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigLongWritableComparator
  implements Comparator but not Serializable
 Se
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigTupleWritableComparator
  implements Comparator but not Serializable
 Se
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler$PigWritableComparator
  implements Comparator but not Serializable
 SeClass 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper 
 defines non-transient non-serializable instance field nig
 SeClass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.EqualToExpr
  defines non-transient non-serializable instance field log
 SeClass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GreaterThanExpr
  defines non-transient non-serializable instance field log
 SeClass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GTOrEqualToExpr
  defines non-transient non-serializable instance field log
 SeClass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.LessThanExpr
  defines non-transient non-serializable instance field log
 SeClass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.LTOrEqualToExpr
  defines non-transient non-serializable instance field log
 SeClass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.NotEqualToExpr
  defines non-transient non-serializable instance field log
 SeClass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast
  defines non-transient non-serializable instance field log
 SeClass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
  defines non-transient non-serializable instance field bagIterator
 SeClass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserComparisonFunc
  defines non-transient non-serializable instance field log
 SeClass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc
  defines non-transient non-serializable instance field log
 SeClass 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage
  defines non-transient non-serializable instance field log
 SeClass

[jira] Commented: (PIG-1012) FINDBUGS: SE_BAD_FIELD: Non-transient non-serializable instance field in serializable class

2009-10-21 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768352#action_12768352
 ] 

Ashutosh Chauhan commented on PIG-1012:
---

Marking log in POFRJoin transient causes FRJoin to fail. A transient field is 
not restored when the operator is deserialized at the backend, so log is null 
there, and the log.debug call made while building the hashtables results in an 
NPE. Either it shouldn't be marked transient or it should be instantiated in 
the readObject() method.

Stack Trace:

Pig Stack Trace
---
ERROR 2999: Unexpected internal error. null

java.lang.NullPointerException
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.setUpHashMap(POFRJoin.java:293)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.getNext(POFRJoin.java:197)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)

Thanks to Tejal for pointing this out.
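
The second option, re-instantiating the logger in readObject(), might look 
roughly like this. It is a minimal sketch under assumptions: JoinOp is a 
hypothetical class rather than POFRJoin itself, java.util.logging.Logger stands 
in for whatever logging type Pig uses, and roundTrip is a helper added only to 
demonstrate the behaviour.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.logging.Logger;

// Hypothetical operator; java.util.logging.Logger stands in for Pig's log.
class JoinOp implements Serializable {
    private static final long serialVersionUID = 1L;

    // Transient fields are skipped by serialization and come back as null
    // unless they are restored explicitly.
    private transient Logger log = Logger.getLogger(JoinOp.class.getName());

    // Re-create the logger after the default fields have been restored.
    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        log = Logger.getLogger(JoinOp.class.getName());
    }

    boolean hasLog() {
        return log != null;
    }

    // Serializes and deserializes an object in memory.
    static JoinOp roundTrip(JoinOp op) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(op);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (JoinOp) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        JoinOp copy = roundTrip(new JoinOp());
        System.out.println(copy.hasLog()); // true
    }
}
```

Without the readObject() method, the field initializer would not run on 
deserialization and hasLog() would return false for the copy, which corresponds 
to the NPE reported above.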


[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

2009-11-01 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772404#action_12772404
 ] 

Ashutosh Chauhan commented on PIG-1038:
---

I think it's a useful optimization. I presume this will be implemented as a 
visitor in MapReduceLauncher that visits the compiled MR plan. The design looks 
good. I have a few questions:

bq. 1.1 Discover if we use sort/distinct in nested foreach plan.
How are you planning to discover this? By matching some pattern, like LR in 
the map plan followed by POPackage, POForeach, POSort in the reduce plan?

This is somewhat orthogonal but related to this issue. We have a rule-based 
optimizer framework in the front end; it seems to me that a similar optimizer 
framework is required in the backend too, to refactor all the optimizer 
visitors we currently have and to make it easy to add similar optimizations in 
the future. 
There are seven optimizations in the front end expressed through rules. On the 
other hand, after the addition of this one we will have nine optimization 
visitors in the backend. Maybe we can think about it, to avoid a lot of rework 
every time such an optimization is added.

 Optimize nested distinct/sort to use secondary key
 --

 Key: PIG-1038
 URL: https://issues.apache.org/jira/browse/PIG-1038
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.4.0
Reporter: Olga Natkovich
Assignee: Daniel Dai
 Fix For: 0.6.0


 If nested foreach plan contains sort/distinct, it is possible to use hadoop 
 secondary sort instead of SortedDataBag and DistinctDataBag to optimize the 
 query. 
 Eg1:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = order A by $1;
 generate group, D;
 }
 store C into 'myresult';
 We can specify a secondary sort on A.$1, and drop order A by $1.
 Eg2:
 A = load 'mydata';
 B = group A by $0;
 C = foreach B {
 D = A.$1;
 E = distinct D;
 generate group, E;
 }
 store C into 'myresult';
 We can specify a secondary sort key on A.$1, and simplify D=A.$1; E=distinct 
 D to a special version of distinct, which does not do the sorting.
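
The "special version of distinct, which does not do the sorting" in Eg2 can be 
sketched as below. This is an illustration only: SortedDistinct and 
distinctSorted are hypothetical names, and the sketch assumes the values of 
A.$1 for a group already arrive in sorted order, as they would after a 
secondary sort. It is not Pig's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

class SortedDistinct {
    // Distinct over input that is already sorted: equal values are adjacent,
    // so comparing each value with the previous one is enough and no sorting
    // or hashing is needed.
    static <T> List<T> distinctSorted(Iterable<T> sorted) {
        List<T> out = new ArrayList<>();
        T previous = null;
        boolean first = true;
        for (T value : sorted) {
            if (first || !Objects.equals(value, previous)) {
                out.add(value);
            }
            previous = value;
            first = false;
        }
        return out;
    }

    public static void main(String[] args) {
        // Values of A.$1 for one group, in the order a secondary sort would
        // deliver them.
        List<Integer> values = Arrays.asList(1, 1, 2, 3, 3, 3, 7);
        System.out.println(distinctSorted(values)); // [1, 2, 3, 7]
    }
}
```

Only the previous value is remembered while scanning, so the distinct step no 
longer needs to sort the values itself.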


[jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-11-01 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12772410#action_12772410
 ] 

Ashutosh Chauhan commented on PIG-1037:
---

I am kind of late on this, but I would appreciate it if someone could provide 
a brief description of how this patch improves the memory layout and alleviates 
the spill problem. I took a quick look at the patch. 
According to my understanding, previously, when memory was about to be 
exhausted, Pig would start writing to disk one tuple at a time. With this new 
patch, once the memory limit is hit the whole bag is spilled to disk, at which 
point the in-memory bag contains no tuples. If the in-memory bag fills again, 
all of its contents are again spilled to disk in their entirety, and so on. So 
this patch ensures that we are spilling not one tuple at a time but a full bag 
at a time. Is this correct, or am I missing something?
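
If that reading is right, the behaviour would look roughly like the following 
sketch. Everything here is an assumption for illustration: SpillingBag is a 
hypothetical class, a tuple count stands in for the memory limit, and spilled 
runs are kept as in-memory lists instead of files on disk. It is not taken from 
the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bag: spilled runs are kept as in-memory lists to stand in
// for files on disk, and a tuple count stands in for a memory limit.
class SpillingBag {
    private final int memoryLimit;
    private List<String> inMemory = new ArrayList<>();
    private final List<List<String>> spilledRuns = new ArrayList<>();

    SpillingBag(int memoryLimit) {
        this.memoryLimit = memoryLimit;
    }

    void add(String tuple) {
        inMemory.add(tuple);
        if (inMemory.size() >= memoryLimit) {
            // Spill the whole in-memory content at once, then start empty.
            spilledRuns.add(inMemory);
            inMemory = new ArrayList<>();
        }
    }

    int inMemorySize() { return inMemory.size(); }

    int spillCount() { return spilledRuns.size(); }

    public static void main(String[] args) {
        SpillingBag bag = new SpillingBag(3);
        for (int i = 0; i < 7; i++) {
            bag.add("t" + i);
        }
        System.out.println(bag.spillCount());   // 2
        System.out.println(bag.inMemorySize()); // 1
    }
}
```

Seven tuples with a limit of three produce two full spills and leave one tuple 
in memory, instead of one write per tuple once the limit is reached.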

 better memory layout and spill for sorted and distinct bags
 ---

 Key: PIG-1037
 URL: https://issues.apache.org/jira/browse/PIG-1037
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Ying He
 Fix For: 0.6.0

 Attachments: PIG-1037.patch, PIG-1037.patch2, PIG-1037.patch3




