date:20081202

[jira] Updated: (PIG-554) Fragment Replicate Join

2008-12-02 Thread Shravan Matthur Narayanamurthy (JIRA)

[
https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shravan Matthur Narayanamurthy updated PIG-554:
---

Attachment: frjofflat.patch

Fragment Replicate Join
---

Key: PIG-554
URL: https://issues.apache.org/jira/browse/PIG-554
Project: Pig
Issue Type: New Feature
Affects Versions: types_branch
Reporter: Shravan Matthur Narayanamurthy
Assignee: Shravan Matthur Narayanamurthy
Fix For: types_branch

Attachments: frjofflat.patch

Fragment Replicate Join (FRJ) is useful when we want a join between a huge
table and a very small table (small enough to fit in memory) and the join
doesn't expand the data by much. The idea is to distribute the processing of
the huge file by fragmenting it and replicating the small file to all machines
receiving a fragment of the huge file. Because the entire small file is
available on each machine, the join becomes a trivial task without needing any
break in the pipeline. Exhaustive tests have been done to determine the
improvement we get out of FRJ. Will post the details in a wiki and add a link
here.

The patch makes changes to parts of the code where new operators are
introduced. Currently, when a new operator is introduced, its alias is not
set. For schema computation I have modified this behaviour to set the alias
of the new operator to that of its predecessor. The logical side of the patch
mimics the cogroup behavior, as join syntax closely resembles that of cogroup.
Currently, this patch doesn't support joins other than inner joins.
The rest of the code has been documented.
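
The fragment-and-replicate idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in frjofflat.patch: rows are plain tuples, the "machines" are simulated by a loop over fragments, and every function name here is hypothetical.

```python
from collections import defaultdict


def build_replicated_index(small_rows, key_index):
    """Hash the small (replicated) table by join key. It must fit in memory."""
    index = defaultdict(list)
    for row in small_rows:
        index[row[key_index]].append(row)
    return index


def join_fragment(fragment_rows, fragment_key_index, replicated_index):
    """Inner-join one fragment of the huge table against the replicated index."""
    for row in fragment_rows:
        # .get() avoids creating empty entries for keys with no match.
        for match in replicated_index.get(row[fragment_key_index], ()):
            yield row + match


def fragment_replicate_join(huge_rows, small_rows, huge_key=0, small_key=0,
                            num_fragments=2):
    """Simulate FRJ: split the huge table, give every fragment the full index."""
    index = build_replicated_index(small_rows, small_key)
    # Round-robin fragmentation stands in for input splits (an assumption).
    fragments = [huge_rows[i::num_fragments] for i in range(num_fragments)]
    joined = []
    for fragment in fragments:
        joined.extend(join_fragment(fragment, huge_key, index))
    return joined
```

Each fragment is joined in a single pass with no shuffle or sort step, which is the "no break in the pipeline" property mentioned above. Only inner joins are handled, matching the current limitation of the patch.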

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

2008-12-02 Thread Alan Gates (JIRA)

[
https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652642#action_12652642
]

Alan Gates commented on PIG-460:

Here's a quick write up of what will need to be done to change order by from
being a 3 mr job process to 2. Currently sampling is done via
org.apache.pig.impl.builtin.RandomSampleLoader. Since this loader extends
BinStorage the first mr job reads the data in whatever format and then stores
it again using BinStorage. It is then read in the second job using
RandomSampleLoader. The tuples that are selected by RandomSampleLoader are
grouped into a single reducer and then fed to
org.apache.pig.impl.builtin.FindQuantiles, which builds a side file containing
partitioning information. The third mr job again reads the data and uses the
side file in the SortPartitioner. (It may be helpful to do an explain on a
simple order by query to see all this.)

What needs to change is that RandomSampleLoader should instead become an
EvalFunc, RandomSampler. The logic inside can remain the same. The MRCompiler
will need to change to create two mr jobs for the sort instead of 3. The first
job should contain a ForEach operator with the new RandomSampler function in
the map. Its reduce should look just like the reduce of the second mr job in
the current system (that is, singular and having a ForEach operator that calls
FindQuantiles). The second job should remain exactly the same as the third job
in the current system. Take a look at MRCompiler.visitSort() for an idea of
how sort jobs are constructed now. It's this function and the functions it
calls that you'll be changing in MRCompiler.
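
As a rough illustration of the two-job plan described above, here is a small Python sketch. It is a simulation under assumptions, not Pig code: sample_map stands in for the proposed RandomSampler EvalFunc, find_quantiles for FindQuantiles, and partition for the SortPartitioner; all signatures, the sample rate, and the in-memory lists are hypothetical.

```python
import bisect
import random


def sample_map(records, sample_rate, rng):
    """Job 1 map: keep a random sample of sort keys."""
    return [r for r in records if rng.random() < sample_rate]


def find_quantiles(samples, num_partitions):
    """Job 1 reduce (single reducer): pick num_partitions - 1 boundary keys."""
    ordered = sorted(samples)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition(key, boundaries):
    """Job 2 partitioner: route a key to the reducer that owns its range."""
    return bisect.bisect_right(boundaries, key)


def order_by(records, num_partitions=3, sample_rate=0.5, seed=0):
    """Run both simulated jobs and return the fully sorted records."""
    rng = random.Random(seed)
    # Job 1: sample in the map, compute quantile boundaries in one reducer.
    boundaries = find_quantiles(sample_map(records, sample_rate, rng),
                                num_partitions)
    # Job 2: read the data again, partition by range, sort within each reducer.
    buckets = [[] for _ in range(num_partitions)]
    for r in records:
        buckets[partition(r, boundaries)].append(r)
    return [r for bucket in buckets for r in sorted(bucket)]
```

The result is correctly ordered whatever the sample contains; the quality of the sample only affects how evenly the keys are spread across reducers.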

PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

Key: PIG-460
URL: https://issues.apache.org/jira/browse/PIG-460
Project: Pig
Issue Type: Bug
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Alan Gates
Fix For: types_branch

Currently order by is done in three MR jobs:
job 1: read data in whatever loader the user requests, store using BinStorage
job 2: load using RandomSampleLoader, find quantiles
job 3: load data again and sort
It is done this way because RandomSampleLoader extends BinStorage, and so
needs the data in that format to read it.
If the logic in RandomSampleLoader were made into an operator instead of being
in a loader then jobs 1 and 2 could be merged. On average job 1 takes about
15% of the time of an order by script.


Re: [VOTE] Release Pig 0.1.1 (candidate 0)

2008-12-02 Thread Arun C Murthy


+1.

I downloaded the release and checked the signatures and checksums. All
unit tests pass.


Arun

On Nov 25, 2008, at 3:58 PM, Olga Natkovich wrote:


Hi,

I have created a candidate build for Pig 0.1.1. This release is  
almost identical to Pig 0.1.0 with a couple of exceptions:


(1) It is integrated with hadoop 18
(2) It has one small bug fix (PIG-253)
(3) Several UDFs were added to piggybank - pig's UDF repository

The rat report is attached.

Keys used to sign the release are available at
http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup


Please download, test, and try it out:

http://people.apache.org/~olga/pig-0.1.1-candidate-0

Should we release this? Vote closes on Wednesday, December 3rd.

Olga
