[jira] Updated: (PIG-554) Fragment Replicate Join

2008-12-02 Thread Shravan Matthur Narayanamurthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-554:
---

Attachment: frjofflat.patch

 Fragment Replicate Join
 ---

 Key: PIG-554
 URL: https://issues.apache.org/jira/browse/PIG-554
 Project: Pig
  Issue Type: New Feature
Affects Versions: types_branch
Reporter: Shravan Matthur Narayanamurthy
Assignee: Shravan Matthur Narayanamurthy
 Fix For: types_branch

 Attachments: frjofflat.patch


 Fragment Replicate Join (FRJ) is useful when we want a join between a huge
 table and a very small table (small enough to fit in memory) and the join
 doesn't expand the data by much. The idea is to distribute the processing of
 the huge file by fragmenting it and replicating the small file to all machines
 receiving a fragment of the huge file. Because the entire small file is
 available on each machine, the join becomes a trivial task that needs no break
 in the pipeline. Exhaustive tests have been done to determine the improvement
 we get out of FRJ. I will post the details in a wiki and add a link here.
 The patch makes changes to parts of the code where new operators are
 introduced. Currently, when a new operator is introduced, its alias is not
 set. For schema computation I have modified this behaviour to set the alias
 of the new operator to that of its predecessor. The logical side of the patch
 mimics the cogroup behavior, as join syntax closely resembles that of cogroup.
 Currently, this patch supports only inner joins.
 The rest of the code has been documented.
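
 The fragment-and-replicate idea described above can be illustrated with a
 minimal in-memory sketch. This is not the code in frjofflat.patch: the
 function names, the tuple-based rows, and the round-robin fragmenting are all
 assumptions made for illustration. Each fragment of the big input is joined
 against a hash index built from the whole small input, so no shuffle or
 grouping step is needed, and only inner-join semantics are shown.

```python
from collections import defaultdict


def build_replicated_index(small_rows, key_index):
    """Index the small table by join key; this is the part replicated to every worker."""
    index = defaultdict(list)
    for row in small_rows:
        index[row[key_index]].append(row)
    return index


def join_fragment(fragment, big_key_index, index):
    """Inner-join one fragment of the big table against the replicated index."""
    for row in fragment:
        # .get() does not create entries, so unmatched keys are simply dropped.
        for match in index.get(row[big_key_index], ()):
            yield row + match


def fragment_replicate_join(big_rows, small_rows, big_key, small_key, num_fragments):
    """Hypothetical driver: split the big table, join each fragment independently."""
    # Round-robin fragmenting stands in for input splits handed to map tasks.
    fragments = [big_rows[i::num_fragments] for i in range(num_fragments)]
    index = build_replicated_index(small_rows, small_key)
    joined = []
    for fragment in fragments:
        joined.extend(join_fragment(fragment, big_key, index))
    return joined


big = [(1, "a"), (2, "b"), (3, "c"), (1, "d")]
small = [(1, "x"), (2, "y")]
result = fragment_replicate_join(big, small, 0, 0, num_fragments=2)
```

 In a real cluster each fragment would be processed on a different machine
 with its own copy of the index; here the loop over fragments only stands in
 for that parallelism.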

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

2008-12-02 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652642#action_12652642
 ] 

Alan Gates commented on PIG-460:


Here's a quick write-up of what will need to be done to change order by from
a 3 MR job process to 2.  Currently sampling is done via
org.apache.pig.impl.builtin.RandomSampleLoader.  Since this loader extends
BinStorage, the first MR job reads the data in whatever format and then stores
it again using BinStorage.  It is then read in the second job using
RandomSampleLoader.  The tuples that are selected by RandomSampleLoader are
grouped into a single reducer and then fed to
org.apache.pig.impl.builtin.FindQuantiles, which builds a side file containing
partitioning information.  The third MR job again reads the data and uses the
side file in the SortPartitioner.  (It may be helpful to do an explain on a
simple order by query to see all this.)

What needs to change is that RandomSampleLoader should instead become an
EvalFunc, RandomSampler.  The logic inside can remain the same.  The MRCompiler
will need to change to create two MR jobs for the sort instead of 3.  The first
job should contain a ForEach operator with the new RandomSampler function in
the map.  Its reduce should look just like the reduce of the second MR job in
the current system (that is, a single reducer with a ForEach operator that
calls FindQuantiles).  The second job should remain exactly the same as the
third job in the current system.  Take a look at MRCompiler.visitSort() for an
idea of how sort jobs are constructed now.  It's this function and the
functions it calls that you'll be changing in MRCompiler.
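
To make the two-job shape concrete, here is a rough in-memory sketch of the
data flow: sample in the map of job 1, find quantiles in a single reducer,
then range-partition and sort in job 2. It is not Pig code. The function names
only echo RandomSampler, FindQuantiles and SortPartitioner, and the sampling
rule (keep each row with a fixed probability) and the quantile selection are
assumptions, not what those classes actually do.

```python
import bisect
import random


def random_sampler(rows, rate, rng):
    """Job 1, map side: keep each row with probability `rate` (assumed rule)."""
    return [row for row in rows if rng.random() < rate]


def find_quantiles(sample_keys, num_partitions):
    """Job 1, single reducer: pick up to num_partitions - 1 boundary keys."""
    ordered = sorted(sample_keys)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def sort_partition(key, boundaries):
    """Job 2 partitioner: map a key to the index of its key range."""
    return bisect.bisect_right(boundaries, key)


def order_by(rows, key, num_partitions, rate=0.1, seed=0):
    """Hypothetical driver chaining the two jobs in memory."""
    rng = random.Random(seed)
    sample = random_sampler(rows, rate, rng)
    boundaries = find_quantiles([key(r) for r in sample], num_partitions)
    # Job 2: route every row to its range, then sort each range locally.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[sort_partition(key(row), boundaries)].append(row)
    result = []
    for part in partitions:
        result.extend(sorted(part, key=key))
    return result


data = list(range(100))
random.Random(42).shuffle(data)
ordered = order_by(data, key=lambda r: r, num_partitions=4, rate=0.2)
```

Because the boundaries are non-decreasing, concatenating the locally sorted
partitions in index order gives a globally sorted result; the sample only
affects how evenly the rows are spread across partitions.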

 PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
 

 Key: PIG-460
 URL: https://issues.apache.org/jira/browse/PIG-460
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Alan Gates
 Fix For: types_branch


 Currently order by is done in three MR jobs:
 job 1: read data in whatever loader the user requests, store using BinStorage
 job 2: load using RandomSampleLoader, find quantiles
 job 3: load data again and sort
 It is done this way because RandomSampleLoader extends BinStorage, and so 
 needs the data in that format to read it.
 If the logic in RandomSampleLoader were made into an operator instead of being
 in a loader, then jobs 1 and 2 could be merged.  On average job 1 takes about
 15% of the time of an order by script.




Re: [VOTE] Release Pig 0.1.1 (candidate 0)

2008-12-02 Thread Arun C Murthy

+1.

I downloaded the release, checked the signatures and checksums. All
unit tests pass.


Arun

On Nov 25, 2008, at 3:58 PM, Olga Natkovich wrote:


Hi,

I have created a candidate build for Pig 0.1.1. This release is  
almost identical to Pig 0.1.0 with a couple of exceptions:


(1) It is integrated with hadoop 18
(2) It has one small bug fix (PIG-253)
(3) Several UDFs were added to piggybank - Pig's UDF repository

The rat report is attached.

Keys used to sign the release are available at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup 
.


Please download, test, and try it out:

http://people.apache.org/~olga/pig-0.1.1-candidate-0

Should we release this? Vote closes on Wednesday, December 3rd.

Olga