Well, I think the question would make more sense if he meant to ask how one 
could load a 1GB file within 10 mins. 

Note that there are 1x10^6 GB in a PB.  (Hence the comment about being off by 
several orders of magnitude.)

Now, were the OP asking how to load a 1GB file in 10 min, then you'd be within 
the realm of 10GbE, SATA drives and a couple of nodes. 
And then the question would make sense.
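
To put rough numbers on that, here is a minimal Python sketch of the sustained rate each target implies. It assumes decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes) and ignores replication and protocol overhead. The function name required_gbps is only illustrative.

```python
# Decimal units are an assumption: 1 GB = 10**9 bytes, 1 PB = 10**15 bytes.
GB = 10**9
PB = 10**15


def required_gbps(num_bytes, seconds):
    """Sustained rate in gigabits per second needed to move num_bytes in seconds."""
    return num_bytes * 8 / seconds / 1e9


window = 10 * 60  # ten minutes, in seconds

gb_rate = required_gbps(1 * GB, window)
pb_rate = required_gbps(1 * PB, window)
# The 100 GB in one hour reported earlier in the thread, for comparison.
measured = required_gbps(100 * GB, 3600)

print(f"1 GB in 10 min: {gb_rate:.4f} Gbps")
print(f"1 PB in 10 min: {pb_rate:,.0f} Gbps")
print(f"100 GB in 1 hr: {measured:.3f} Gbps")
print(f"GB per PB:      {PB // GB:,}")
```

Under those assumptions the 1GB case needs roughly 0.013 Gbps, while the 1PB case needs roughly 13,333 Gbps, which is the six orders of magnitude mentioned above.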

But to your point. What's the incremental load if the data is a single 1PB 
file? 
Either you have the file, or you don't. ;-) 

As to hitting your limits, we all have limits.  Mine is c. ;-) 

On Sep 10, 2012, at 2:22 PM, Siddharth Tiwari <[email protected]> wrote:

> Well, can't you load only the incremental data? The goal seems quite 
> unrealistic. The big guns have already spoken :P
> 
> 
> *------------------------*
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of 
> God." 
> "Maybe other people will try to limit me but I don't limit myself"
> 
> 
> From: [email protected]
> To: [email protected]; [email protected]
> Subject: RE: One petabyte of data loading into HDFS with in 10 min.
> Date: Mon, 10 Sep 2012 16:17:20 +0000
> 
> Well said Mike. Lots of “funny questions” around here lately…
>  
> From: Michael Segel [mailto:[email protected]] 
> Sent: Monday, September 10, 2012 4:50 AM
> To: [email protected]
> Cc: Michael Segel
> Subject: Re: One petabyte of data loading into HDFS with in 10 min.
>  
>  
> On Sep 10, 2012, at 2:40 AM, prabhu K <[email protected]> wrote:
> 
> 
> Hi Users,
>  
> Thanks for the response.
>  
> We loaded 100GB of data into HDFS; it took 1 hour with the configuration 
> below.
> 
> Each node (1 master machine, 2 slave machines):
> 
> 1.    500 GB hard disk.
> 2.    4 GB RAM
> 3.    3 quad-core CPUs.
> 4.    Speed 1333 MHz
>  
> Now, we are planning to load 1 petabyte of data (single file) into Hadoop 
> HDFS and a Hive table within 10-20 minutes. For this we need clarification 
> on the points below.
> 
> Ok...
>  
> Some say that I am sometimes too harsh in my criticisms so take what I say 
> with a grain of salt...
>  
> You loaded 100GB in an hour using woefully underperforming hardware and are 
> now saying you want to load 1PB in 10 mins.
>  
> I would strongly suggest that you first learn more about Hadoop.  No really. 
> Looking at your first machine, it's obvious that you don't really grok Hadoop 
> and what it requires to achieve optimum performance.  You couldn't even 
> extrapolate any meaningful data from your current environment.
>  
> Secondly, I think you need to actually think about the problem. Did you mean 
> PB or TB? Because your math seems to be off by a couple orders of magnitude. 
>  
> A single file measured in PBs? That is currently impossible using today's 
> (2012) technology. In fact, a single file measured in PBs won't exist 
> within the next 5 years, and most likely not within the next decade. [Moore's law 
> is all about CPU power, not disk density.]
>  
> Also take a look at networking. 
> ToR switch designs differ; however, with current technology, the fabric tends 
> to max out at 40Gbps.  What's the widest fabric on a backplane? 
> That's your first bottleneck, because even if you had 1PB of data, you 
> couldn't feed it to the cluster fast enough. 
>  
> Forget disk. Look at PCIe-based memory. (Money no object, right?) 
> You still couldn't populate it fast enough.
>  
> I guess Steve hit this nail on the head when he talked about this being a 
> homework assignment. 
>  
> High school maybe? 
>  
> 
> 
> 1. What system configuration is required for all 3 machines?
> 
> 2. Hard disk size.
> 
> 3. RAM size.
> 
> 4. Mother board
> 
> 5. Network cable
> 
> 6. How many Gbps of InfiniBand are required?
> 
>  For the same setup, do we need a cloud computing environment too?
> 
> Please suggest and help me on this.
> 
>  Thanks,
> 
> Prabhu.
> 
> On Fri, Sep 7, 2012 at 7:30 PM, Michael Segel <[email protected]> 
> wrote:
> Sorry, but you didn't account for the network saturation.
> 
> And why 1GbE and not 10GbE? Also, which version of Hadoop?
> 
> Here MapR works well with bonding two 10GbE ports, and with the right switch 
> you could do OK.
> Also 2 ToR switches... per rack. etc...
> 
> How many machines? 150? 300? more?
> 
> Then you don't talk about how much memory, CPUs, what type of storage...
> 
> Lots of factors.
> 
> I'm sorry to interrupt this mental masturbation about how to load 1PB in 
> 10min.
> There are a lot more questions that should have been asked but weren't.
> 
> Hey, but look. It's a Friday, so I suggest some pizza, beer and then take it to 
> a whiteboard.
> 
> But what do I know? In a different thread, I'm talking about how to tame HR 
> and Accounting so they let me play with my team Ninja!
> :-P
> 
> On Sep 5, 2012, at 9:56 AM, zGreenfelder <[email protected]> wrote:
> 
> > On Wed, Sep 5, 2012 at 10:43 AM, Cosmin Lehene <[email protected]> wrote:
> >> Here's an extremely naïve ballpark estimation: at theoretical hardware
> >> speed, for 3PB representing 1PB with 3x replication
> >>
> >> Over a single 1Gbps connection (and I'm not sure you can actually reach
> >> 1Gbps)
> >> (3 petabytes) / (1 Gbps) = 291.271111 days
> >>
> >> So you'd need at least 40,000 1Gbps network cards to get that in 10 minutes
> >> :) - (3PB/1Gbps)/40000
> >>
> >> The actual number of nodes would depend a lot on the actual network
> >> architecture, the type of storage you use (SSD,  HDD), etc.
> >>
> >> Cosmin
> >
> > ah, I went the other direction with the math, and assumed no
> > replication (completely unsafe and never reasonable for a real,
> > production environment, but since we're all theory and just looking
> > for starting point numbers)
> >
> >
> > 1PB in 10 min ==
> > 1,000,000 GB in 10 min ==
> > 8,000,000 Gb in 600 seconds ==
> >
> > 80,000/6  ~= 13.3k machines running at gigabit, or about 1.3k machines if
> > you get 10Gb connected machines.
> >
> > all assuming there's no network or cluster sync overhead
> > (of course there would be)
> >
> >
> > that seems like some pretty deep pockets to get to < 10 minute load
> > time for that much data.
> >
> > I could also be off, I just threw some stuff together somewhat
> > quickly between conf calls.
> >
> > --
> > Even the Magic 8 ball has an opinion on email clients: Outlook not so good.
> >
> 
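
The two back-of-envelope estimates quoted above (Cosmin's with 3x replication, zGreenfelder's without) can be reproduced with a short Python sketch. This is only an illustration at theoretical line rate, with no protocol, disk or cluster overhead. The helper names are made up for this example, and the binary-unit reading of the 291.27 days figure (2^50 bytes per PB, 2^30 bits per Gb) is an assumption about how it was computed, not something stated in the thread.

```python
SECONDS_PER_DAY = 86400


def days_over_one_link(num_bytes, link_bits_per_s):
    """Days needed to push num_bytes through a single link at line rate."""
    return num_bytes * 8 / link_bits_per_s / SECONDS_PER_DAY


def links_needed(num_bytes, link_bits_per_s, seconds):
    """Links needed to move num_bytes within the window (integer ceiling)."""
    total_bits = num_bytes * 8
    capacity_per_link = link_bits_per_s * seconds
    return -(-total_bits // capacity_per_link)


# Cosmin's case: 3 PB (1 PB with 3x replication) over one 1 Gbps link.
binary_days = days_over_one_link(3 * 2**50, 2**30)    # binary units (assumed)
decimal_days = days_over_one_link(3 * 10**15, 10**9)  # decimal units

# zGreenfelder's case: 1 PB, no replication, 10-minute window.
nics_1g = links_needed(10**15, 10**9, 600)
nics_10g = links_needed(10**15, 10**10, 600)

# Same window, but with 3x replication and decimal units.
nics_1g_3x = links_needed(3 * 10**15, 10**9, 600)

print(f"3 PB over 1 Gbps: {binary_days:.2f} days (binary), "
      f"{decimal_days:.2f} days (decimal)")
print(f"1 PB in 10 min:   {nics_1g:,} x 1 Gbps or {nics_10g:,} x 10 Gbps")
print(f"3 PB in 10 min:   {nics_1g_3x:,} x 1 Gbps")
```

With decimal units the replicated case comes out at 40,000 gigabit links, and the unreplicated case at about 13.3k gigabit links or about 1.3k 10Gb links.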
