You can control the number of map tasks by controlling the split size: set "pig.maxCombinedSplitSize",
"mapred.max.split.size", and "mapred.min.split.size". The first one is a Pig
parameter and the last two are Hadoop parameters.
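For illustration, here is one way to set those at the top of a Pig script (a hedged sketch: the 256 MB value is arbitrary, and it assumes a Pig version whose SET command passes these properties through to the job configuration; otherwise they can be supplied on the pig command line with -D):
SET pig.maxCombinedSplitSize 268435456;
-- the two Hadoop properties, also in bytes; larger splits mean fewer map tasks
SET mapred.max.split.size 268435456;
SET mapred.min.split.size 268435456;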
Daniel
On 03/24/2011 06:18 PM, Dexin Wang wrote:
Thanks for your explanation Alex.
In some cases there isn't even a reduce phase. For example, we have some
raw data that, after our custom LOAD function and some filter functions, goes
directly into the DB. And since we don't have control over the number of mappers,
we end up with too many DB writers. That's why I had to add the artificial
reduce phase I mentioned earlier, so that we can throttle it down.
We could also do what someone else suggested - add a post-processing step that
writes the output to HDFS and loads the DB from that. But there are other
considerations that make us prefer not to do that if we don't have to.
On Thu, Mar 17, 2011 at 2:16 PM, Alex Rovner<[email protected]> wrote:
Dexin,
You can control the amount of reducers by adding the following in your pig
script:
SET default_parallel 29;
Pig will run with 29 reducers with the above statement.
As far as the bulk insert goes:
We are using MS-SQL as our database, but MySQL would be able to handle the
bulk insert the same way.
Essentially we are directing the output of the job into a temporary folder
so that we know the output of this particular run. If you set the number of
reducers to 29, you will have 29 files in the temp folder after the job
completes. You can then run a bulk insert SQL command on each of the
resulting files, pointing the SQL server at HDFS either through FUSE (the way we do it),
or by copying the resulting files to a Samba share or NFS and pointing the
SQL server to that location.
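For illustration, a minimal sketch of that setup in Pig (the relation name results, the grouping key f1, and the temp path are hypothetical):
SET default_parallel 29;
-- any reduce-side operation (GROUP here) picks up the 29-reducer setting
grouped = GROUP results BY f1;
flattened = FOREACH grouped GENERATE FLATTEN(results);
-- the final job writes 29 part files under the temp folder
STORE flattened INTO '/tmp/bulk_load_run_001' USING PigStorage('\t');
Each of the 29 part files can then be the target of one bulk insert statement.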
In order to bulk insert you would have to either (a) do this in a post-processing
script or (b) write your own storage func that takes care of it.
The storage func route is tricky since you will need to implement your own
OutputCommitter (see https://issues.apache.org/jira/browse/PIG-1891).
Let me know if you have further questions.
Alex
On Thu, Mar 17, 2011 at 5:00 PM, Dexin Wang<[email protected]> wrote:
Can you describe your bulk insert technique a bit more? And do you also
control the number of reducers by adding an artificial ORDER or GROUP step?
Thanks!
On Thu, Mar 17, 2011 at 1:33 PM, Alex Rovner <[email protected]> wrote:
We use a bulk insert technique after the job completes. You can control the
size of each bulk insert by controlling the number of reducers.
Sent from my iPhone
On Mar 17, 2011, at 2:03 PM, Dexin Wang<[email protected]> wrote:
We do some processing in Hadoop, then as the last step we write the result
to a database. The database is not good at handling hundreds of concurrent
connections and fast writes, so we need to throttle down the number of tasks
that write to the DB. Since we have no control over the number of mappers,
we add an artificial reduce step to achieve that, either by doing a GROUP or
an ORDER, like this:
sorted_data = ORDER data BY f1 PARALLEL 10;
-- then write sorted_data to DB
or
grouped_data = GROUP data BY f1 PARALLEL 10;
data_to_write = FOREACH grouped_data GENERATE FLATTEN($1);
I feel neither is a good approach. They just add unnecessary computing time,
especially the first one. And GROUP may run into the too-large-bags issue.
Any better suggestions?