>  - Do you plan to serve data out of HBase or will you just use it for
> MapReduce? Or will it be a mix (not recommended)?

I am also curious what the recommended deployment would be for this kind
of need (e.g. building multiple Lucene indexes that hold only the row ID,
so index building is MR-intensive and index use spawns many getByKey()
calls).
Perhaps you'd recommend running HDFS across, say, 10 machines, with
region servers on only 5 of them and MR on the other 5.
Or can you achieve better data locality with rack awareness, running the
MR processes on one rack and HBase on the other?  Presumably that has
issues when HDFS nodes fail, though.
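
To put rough numbers on that split idea, here is a back-of-the-envelope
sketch in Python. It assumes HDFS replication factor 3 and uniformly
random replica placement, which ignores rack awareness and HDFS's
preference for a writer-local first replica, so treat it as an
illustration only:

```python
from math import comb

def replica_on_mr_node_prob(datanodes, mr_nodes, replication=3):
    """Chance that at least one replica of a block lands on an MR node,
    assuming uniformly random replica placement (a simplification: real
    HDFS prefers a writer-local first replica and is rack-aware)."""
    # P(all replicas land on non-MR nodes), then take the complement
    none_on_mr = comb(datanodes - mr_nodes, replication) / comb(datanodes, replication)
    return 1 - none_on_mr

# 10 datanodes total, MR running on 5 of them, replication factor 3
print(round(replica_on_mr_node_prob(10, 5), 3))  # 0.917
```

So under those (admittedly rough) assumptions, most map tasks could still
find a local replica even with MR restricted to half the cluster -- but
I'd be happy to be corrected on the real placement behavior.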

Ta,
Tim



On Tue, Jun 8, 2010 at 8:13 AM, Sean Bigdatafun
<[email protected]> wrote:
> On Mon, Jun 7, 2010 at 10:46 AM, Jean-Daniel Cryans 
> <[email protected]>wrote:
>
>> It really depends on your usage pattern, but there's a cost-vs-hardware
>> balance you must strike. At StumbleUpon we run with 2x i7, 24GB, 4x 1TB
>> and it works like a charm. The only thing I would change is maybe more
>> disks per node, but that's pretty much it. Some relevant questions:
>>
>
> I understand that the bottleneck of MapReduce is normally disk bandwidth
> (assuming we have enough mappers to do the work) -- is this what you mean
> here? I would guess 4x 1TB may not be as good as 8x 500GB; normally disk
> bandwidth is precious, but not disk capacity.
>
>
>>
>>  - Do you have any mem-intensive jobs? If so, figure how many tasks
>> you'll run per node and make the RAM fit the load.
>>
>
> By mem-intensive jobs, I guess you mean "random reads", "range scans" and
> "inserts", but not MapReduce work, right?
>
>
>>  - Do you plan to serve data out of HBase or will you just use it for
>> MapReduce? Or will it be a mix (not recommended)?
>>
>
> Actually, I am going to use it in a mixed mode -- Google Analytics seems
> to work this way as well, running MapReduce jobs to calculate statistics
> alongside live queries. Do you have any suggestions on what to pay
> attention to?
>
>
>>
>> Also, keep in mind that losing 1 machine out of 8, compared to 1 out of
>> 16, drastically changes the performance of your system at the time of
>> the failure.
>>
> Agreed.
>
>
>>
>> About virtualization, it doesn't make sense. Also your disks should be in
>> JBOD.
>>
>
>>
>> J-D
>>
>> On Wed, Jun 2, 2010 at 11:12 PM, Sean Bigdatafun
>> <[email protected]> wrote:
>> > I have been thinking about the following problem lately, in the
>> > following context.
>> >
>> > I have a predefined budget and I can either
>> >  -- A) purchase 8 more powerful servers (4 CPUs x 4 cores/CPU +
>> > 128GB mem + 16x 1TB disk), or
>> >  -- B) purchase 16 less powerful servers (2 CPUs x 4 cores/CPU +
>> > 64GB mem + 8x 1TB disk)
>> >          NOTE: I am basically making up a half-horsepower scenario
>> >  -- Let's say I am going to use a 10Gbps network switch and each
>> > machine has a 10Gbps network card
>> >
>> > In the above scenario, does A or B perform better, or relatively the
>> > same? -- I guess this really depends on Hadoop's MapReduce scheduler.
>> >
>> > And then I have a follow-up question: does it make sense to virtualize
>> > a Hadoop datanode at all?  (If the answer to the above question is
>> > "relatively the same", I'd say it does not.)
>> >
>> > Thanks,
>> > Sean
>> >
>> >
>>
>
