Well I think the question would make more sense if he meant to say how one could load a GB file within 10 mins.
Note that there are 1x10^6 GB in a PB. (Hence the comment about being off by several orders of magnitude.)

Now were the OP asking about how to load a 1GB file in 10min, then you're within the realm of 10GbE, SATA drives and a couple of nodes. And then the question would make sense.

But to your point: what's the incremental load if the data is a single 1PB file? Either you have the file, or you don't. ;-)

As to hitting your limits, we all have limits. Mine is c. ;-)

On Sep 10, 2012, at 2:22 PM, Siddharth Tiwari <[email protected]> wrote:

> Well can't you load the incremental data only? as the goal seems quite
> unrealistic. The big guns have already spoken :P
>
>
> *------------------------*
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of
> God."
> "Maybe other people will try to limit me but I don't limit myself"
>
>
> From: [email protected]
> To: [email protected]; [email protected]
> Subject: RE: One petabyte of data loading into HDFS with in 10 min.
> Date: Mon, 10 Sep 2012 16:17:20 +0000
>
> Well said Mike. Lots of "funny questions" around here lately…
>
> From: Michael Segel [mailto:[email protected]]
> Sent: Monday, September 10, 2012 4:50 AM
> To: [email protected]
> Cc: Michael Segel
> Subject: Re: One petabyte of data loading into HDFS with in 10 min.
>
>
> On Sep 10, 2012, at 2:40 AM, prabhu K <[email protected]> wrote:
>
>
> Hi Users,
>
> Thanks for the response.
>
> We have loaded 100GB of data into HDFS; time taken 1 hr. with the below
> configuration.
>
> Each Node (1 machine master, 2 machines are slave)
>
> 1. 500 GB hard disk.
> 2. 4Gb RAM
> 3. 3 quad core CPUs.
> 4. Speed 1333 MHz
>
> Now, we are planning to load 1 petabyte of data (single file) into Hadoop
> HDFS and Hive table within 10-20 minutes. For this we need a clarification
> below.
>
> Ok...
>
> Some say that I am sometimes too harsh in my criticisms so take what I say
> with a grain of salt...
>
> You loaded 100GB in an hour using woefully underperforming hardware and are
> now saying you want to load 1PB in 10 mins.
>
> I would strongly suggest that you first learn more about Hadoop. No really.
> Looking at your first machine, it's obvious that you don't really grok hadoop
> and what it requires to achieve optimum performance. You couldn't even
> extrapolate any meaningful data from your current environment.
>
> Secondly, I think you need to actually think about the problem. Did you mean
> PB or TB? Because your math seems to be off by a couple orders of magnitude.
>
> A single file measured in PBs? That is currently impossible using today's
> (2012) technology. In fact a single file that is measured in PBs wouldn't
> exist within the next 5 years and most likely the next decade. [Moore's law
> is all about CPU power, not disk density.]
>
> Also take a look at networking.
> ToR switch design differs, however with current technology, the fabric tends
> to max out at 40GBs. What's the widest fabric on a backplane?
> That's your first bottleneck because even if you had 1PB of data, you
> couldn't feed it to the cluster fast enough.
>
> Forget disk. Look at PCIe based memory. (Money no object, right?)
> You still couldn't populate it fast enough.
>
> I guess Steve hit this nail on the head when he talked about this being a
> homework assignment.
>
> High school maybe?
>
>
> 1. What system configuration setup is required for all the 3 machines?
>
> 2. Hard disk size.
>
> 3. RAM size.
>
> 4. Mother board
>
> 5. Network cable
>
> 6. How much Gbps Infiniband required.
>
> For the same setup we need cloud computing environment too?
>
> Please suggest and help me on this.
>
> Thanks,
>
> Prabhu.
>
> On Fri, Sep 7, 2012 at 7:30 PM, Michael Segel <[email protected]>
> wrote:
> Sorry, but you didn't account for the network saturation.
>
> And why 1GBe and not 10GBe? Also which version of hadoop?
>
> Here MapR works well with bonding two 10GBe ports and with the right switch,
> you could do ok.
> Also 2 ToR switches... per rack. etc...
>
> How many machines? 150? 300? more?
>
> Then you don't talk about how much memory, CPUs, what type of storage...
>
> Lots of factors.
>
> I'm sorry to interrupt this mental masturbation about how to load 1PB in
> 10min.
> There are a lot more questions that should be asked that weren't.
>
> Hey but look. It's a Friday, so I suggest some pizza, beer and then take it to
> a white board.
>
> But what do I know? In a different thread, I'm talking about how to tame HR
> and Accounting so they let me play with my team Ninja!
> :-P
>
> On Sep 5, 2012, at 9:56 AM, zGreenfelder <[email protected]> wrote:
>
> > On Wed, Sep 5, 2012 at 10:43 AM, Cosmin Lehene <[email protected]> wrote:
> >> Here's an extremely naïve ballpark estimation: at theoretical hardware
> >> speed, for 3PB representing 1PB with 3x replication
> >>
> >> Over a single 1Gbps connection (and I'm not sure, you can actually reach
> >> 1Gbps)
> >> (3 petabytes) / (1 Gbps) = 291.271111 days
> >>
> >> So you'd need at least 40,000 1Gbps network cards to get that in 10 minutes
> >> :) - (3PB/1Gbps)/40000
> >>
> >> The actual number of nodes would depend a lot on the actual network
> >> architecture, the type of storage you use (SSD, HDD), etc.
> >>
> >> Cosmin
> >
> > ah, I went the other direction with the math, and assumed no
> > replication (completely unsafe and never reasonable for a real,
> > production environment, but since we're all theory and just looking
> > for starting point numbers)
> >
> >
> > 1PB in 10 min ==
> > 1,000,000gB in 10 min ==
> > 8,000,000gb in 600 seconds ==
> >
> > 80,000/6 ~= 14k machines running at gigabit or about 1.5k machines if you
> > get 10Gb connected machines.
> >
> > all assuming there's no network or cluster sync overhead
> > (of course there would be)
> >
> >
> > that seems like some pretty deep pockets to get to < 10 minute load
> > time for that much data.
> >
> > I could also be off, I just threw some stuff together somewhat
> > quickly between conf calls.
> >
> > --
> > Even the Magic 8 ball has an opinion on email clients: Outlook not so good.
> >
>
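
For anyone who wants to redo the back-of-envelope numbers quoted in this thread, here is a minimal Python sketch. It is only a sanity check, not a sizing tool: it assumes every link runs at full line rate with no protocol, disk, or cluster overhead, and the names `transfer_days` and `links_needed` are made up for this illustration. It uses decimal (SI) units by default.

```python
# Decimal (SI) units. Swap in 2**50 and 2**30 to get the binary-prefix
# figures that some calculators produce.
PB = 10**15    # bytes in one petabyte
GBPS = 10**9   # bits per second on a 1 Gbps link


def transfer_days(size_bytes, link_bits_per_s):
    """Days needed to push size_bytes over a single link at full line rate."""
    return size_bytes * 8 / link_bits_per_s / 86_400


def links_needed(size_bytes, link_bits_per_s, deadline_s):
    """Minimum number of fully saturated links to finish within deadline_s."""
    total_bits = size_bytes * 8
    bits_per_link = link_bits_per_s * deadline_s
    return -(-total_bits // bits_per_link)  # integer ceiling division


if __name__ == "__main__":
    ten_minutes = 600
    # 1 PB, no replication, gigabit links
    print(links_needed(PB, GBPS, ten_minutes))
    # 1 PB, no replication, 10 Gbps links
    print(links_needed(PB, 10 * GBPS, ten_minutes))
    # 3 PB on the wire (1 PB with 3x replication), gigabit links
    print(links_needed(3 * PB, GBPS, ten_minutes))
    # 3 PB over one gigabit link, binary prefixes
    print(transfer_days(3 * 2**50, 2**30))
```

With decimal units this works out to 13,334 gigabit links for 1 PB without replication, 1,334 links at 10 Gbps, and 40,000 gigabit links for 3 PB, which is close to the rough figures above. The 291.27-day figure corresponds to binary prefixes (2^50 bytes per PB, 2^30 bits per second per Gbps); with decimal prefixes the same transfer takes about 278 days.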
