Back up for a second.  Why would you want to do this and where does the data 
come from?

Is this a new PB of data every time, or is it a PB in total with some new and
some old data?
Migrating only the deltas could help.

Can the data migration/load have its latency hidden?  Is the PB of data ready
all at once?
Is the first 100TB ready to be loaded long before the last 100TB is
written/gathered/generated?

Is it possible to generate/gather the data into HDFS originally so there is no 
initial load time penalty?

1PB / 10 minutes = about 13 terabits / second throughput ( 3x that, roughly
40 Tb/sec, for naive data redundancy ).  That is a lot.
Not crazy a lot, but a lot.  Today's large core switches/routers can each do
a few Tb/sec, so you'd need a fleet of them or use OpenFlow.
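
A quick back-of-the-envelope check of that throughput figure, as a minimal
sketch.  It assumes decimal units (1 PB = 10^15 bytes) and HDFS-style 3x
replication; the names below are made up for illustration.

```python
# Back-of-the-envelope ingest bandwidth.
# Assumption: decimal units, 1 PB = 10**15 bytes.
PETABYTE_BYTES = 10**15
WINDOW_SECONDS = 10 * 60
REPLICATION = 3  # assumed naive 3x redundancy


def required_tbps(total_bytes, seconds, copies=1):
    """Aggregate throughput in terabits/second needed to move the data."""
    return total_bytes * 8 * copies / seconds / 1e12


ingest = required_tbps(PETABYTE_BYTES, WINDOW_SECONDS)
replicated = required_tbps(PETABYTE_BYTES, WINDOW_SECONDS, REPLICATION)

print(f"ingest only:    {ingest:.1f} Tb/s")      # 13.3 Tb/s
print(f"with 3x copies: {replicated:.1f} Tb/s")  # 40.0 Tb/s
```

With binary units (1 PiB = 2^50 bytes) the ingest number comes out near
15 Tb/sec, so the order of magnitude does not change.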


Redundancy will require going across a node-to-node network of some sort, be
it SAN, Ethernet or whatever.
By building a special-purpose back-end replication network/nodes you may be
able to decrease the network costs.


If you really want to push this much data around that quickly, the only type
of network that makes sense is one that avoids oversubscription.  Look into
fat-tree networks as a start.


You are looking at tens of thousands of nodes running at gigabit, thousands
of nodes running at 10gig, or hundreds of nodes running InfiniBand (40 Gbit).
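
Those node counts can be sanity-checked with a rough sketch like the one
below.  It assumes 1 PB = 10^15 bytes written three times in 10 minutes, that
every node fills its link at nominal speed, and it ignores disk write speed
and protocol overhead, so treat the results as lower bounds.  The helper name
and the efficiency parameter are illustrative only.

```python
import math

# Assumed aggregate target: 1 PB (10**15 bytes) in 10 minutes, written 3 times.
TARGET_BITS_PER_SEC = 10**15 * 8 * 3 / 600

# Nominal link speeds in bits/second; sustained throughput will be lower.
LINKS = {"1 GbE": 1e9, "10 GbE": 10e9, "InfiniBand 40 Gb/s": 40e9}


def nodes_needed(target_bps, link_bps, efficiency=1.0):
    """Minimum node count if every node fills `efficiency` of its link."""
    return math.ceil(target_bps / (link_bps * efficiency))


counts = {name: nodes_needed(TARGET_BITS_PER_SEC, bps)
          for name, bps in LINKS.items()}

for name, n in counts.items():
    print(f"{name}: {n} nodes")
```

Dropping the replication factor to 1 divides each count by three; lowering
the efficiency below 1.0 pushes the counts up.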


The biggest question is whether you can avoid having to do this much data
migration.  Networks aren't getting faster as fast as CPUs are.  A long-term
architecture based on growing datasets and data migration is asking for
trouble.


-gulfie


On Wed, Sep 05, 2012 at 05:51:50PM +0530, prabhu K wrote:
> Hi Users,
> 
> Please clarify the below questions.
> 
> 1. With in 10 minutes one petabyte of data load into HDFS/HIVE , how many
> slave (Data Nodes) machines required.
> 
> 2. With in 10 minutes one petabyte of data load into HDFS/HIVE, what is the
> configuration setup for cloud computing.
> 
> Please suggest and help me on this.
> 
> Thanks&Regards,
> Prabhu.
