I'm willing to accept being abstracted from HDFS. If something wants to use HDFS as file storage, fine, as long as I don't have to actually write code against HDFS. I don't consider straight replication to be distributed. You need a shared-nothing architecture where no one node holds the entire database.
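[Editor's note: as a toy illustration of the shared-nothing point above, and nothing more, here is a small Python sketch. The node names and rows are made up and this is not any particular database's mechanism: hash distribution places each row on exactly one node, while replication gives every node a full copy.]

    # Toy illustration of "shared nothing" vs. replication. Invented data.
    NODES = ["node1", "node2", "node3"]

    rows = [{"id": i, "amount": i * 10} for i in range(9)]

    # Shared-nothing: each row lives on exactly one node, chosen by hashing
    # its distribution key, so no single node holds the whole table.
    distributed = {n: [] for n in NODES}
    for row in rows:
        owner = NODES[hash(row["id"]) % len(NODES)]
        distributed[owner].append(row)

    # Replication: every node stores every row (a full copy per node).
    replicated = {n: list(rows) for n in NODES}

    for n in NODES:
        print(n, "distributed:", len(distributed[n]), "rows;",
              "replicated:", len(replicated[n]), "rows")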
Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Rowland Gosling
Sent: Thursday, November 19, 2015 9:07 PM
To: [email protected]
Subject: RE: Can distributed DBs replace HDFS?

You do, by way of an MPP database running on HDFS, from what I've read. But distributed databases come in many flavors, such as Postgres or Oracle. It depends on the interpretation of 'distributed,' I suppose.

From: Adaryl "Bob" Wakefield, MBA [mailto:[email protected]]
Sent: Thursday, November 19, 2015 9:02 PM
To: [email protected]
Subject: Re: Can distributed DBs replace HDFS?

"long term storage, reporting, analytics" Don't you get these things with Hawq, or am I misunderstanding something?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Rowland Gosling
Sent: Thursday, November 19, 2015 8:16 PM
To: [email protected]
Subject: RE: Can distributed DBs replace HDFS?

There isn't an either/or proposition in modern data systems; it's not that you choose either distributed databases or HDFS. If you have complex systems, there's a good chance you need both. For example, there's no way I want my banking system running on anything less than a traditional RDBMS with all its guarantees. Conversely, I can't see large financial institutions not leveraging HDFS in some capacity: long-term storage, reporting, analytics. Both is the right answer in many cases.

Rowland Gosling
Senior Consultant
Pragmatic Works, Inc.

From: Adaryl "Bob" Wakefield, MBA [mailto:[email protected]]
Sent: Thursday, November 19, 2015 8:00 PM
To: [email protected]
Subject: Re: Can distributed DBs replace HDFS?

You had me right up to the last paragraph. I'm coming from the standpoint of engineering big data systems. There is a bewildering array of technologies and techniques currently in use. As a matter of fact, I think either Nathan Marz or Jay Kreps gave a talk literally titled "Reducing complexity in big data systems."

We lived with relational databases for years. We had to move away from them because the data got ridiculous and traditional databases couldn't keep up. Now there are more types of databases than you can shake a stick at: column stores, graph DBs, document DBs, and each one requires a different modeling technique, which means each one has a learning curve for anybody new to NoSQL. If you're going to design and implement a big data system, at a minimum you need to know Java, Git, some flavor of Linux, and some build tool (Ant, Maven, Gradle, etc.), and all that is before we even start storing the data. If you're coming from a non-computer-science background, the amount of stuff you need to put in your head can quite literally blow some people out of the career field (because who wants to learn new stuff at 40?).

So I've been watching with excitement the rise of "NewSQL" databases like MemSQL and VoltDB, because you can use familiar skills to build big data systems. Instead of having to write code to serialize an object to a file format, you can go back to just executing an insert statement (see the sketch below). You can model data the way you're used to modeling. However, those tools are for a transactional use case. When I look at Hawq and Greenplum, I see relational databases that can handle big data and an analytics use case.
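[Editor's note: a minimal Python sketch of the "serialize an object to a file format" versus "just execute an insert statement" contrast above. The table, file path, host, and connection details are invented for illustration, and it assumes pyarrow and psycopg2 are available.]

    # Path A: serialize records to a columnar file yourself, the way many
    # HDFS-centric pipelines require, then move the file into HDFS.
    import pyarrow as pa
    import pyarrow.parquet as pq

    records = pa.table({"order_id": [1, 2], "total": [19.99, 5.00]})
    pq.write_table(records, "orders.parquet")  # then copy/stream into HDFS

    # Path B: the familiar relational route, a plain parameterized INSERT
    # against a distributed SQL engine speaking the PostgreSQL protocol
    # (HAWQ, Greenplum, etc.). Host and database name are placeholders.
    import psycopg2

    conn = psycopg2.connect(host="mpp-master.example.com", dbname="analytics")
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO orders (order_id, total) VALUES (%s, %s)",
                    (1, 19.99))
    conn.close()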
From an analytics standpoint, most analytic tools I'm aware of try to mimic SQL. There is HiveQL, Drill, Spark SQL, and the sqldf package in R. Some of these tools aren't fully SQL compliant and have quirks (more stuff to learn). Hawq/Greenplum gets us back to legit SQL. So when I talk about leaving HDFS for a distributed DB, what I'm talking about is simplifying the work necessary to store data: not having to know Java/MapReduce, not having to worry about file formats and compression schemes, and not having to have another tool that lets analysts query the data as if it were a relational database. Let's just put the data in an actual relational database. The key is to have a distributed database that is up to the challenge of modern data management. Are we there yet, or is there more work to do?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Caleb Welton
Sent: Thursday, November 19, 2015 1:11 PM
To: [email protected]
Subject: Re: Can distributed DBs replace HDFS?

Q: Is anybody using Hawq in production?

Separate answers depending on context.

• Is anybody using HAWQ in production? Yes. Prior to incubating HAWQ in Apache, HAWQ was sold by Pivotal, and there are Pivotal customers that have the pre-Apache code line in production today.
• Is anybody using Apache HAWQ in production? HAWQ was incubated into Apache very recently and we have not yet had an official Apache release. The code in Apache is based on the next major release and is currently in a pre-beta state of stability. We have a release motion in process for a beta release that should be available soon, but there is no one that I am aware of currently in production with code based off the Apache HAWQ code line.

Q: What would be faster, placing stuff in HDFS or inserts directly into a distributed database?

Generally speaking, HDFS does add some overhead over a distributed database sitting on bare metal. However, in both cases data must be replicated so the distributed system has built-in fault tolerance, so the primary cost comparison is the overhead of HDFS's replication mechanisms versus the special-purpose mechanisms in the distributed RDBMS. One of the clearest comparisons would be HAWQ versus the Greenplum database (also recently open sourced), as they are both based on the same fundamental RDBMS architecture, but HAWQ has been adapted to the Hadoop ecosystem while Greenplum has been optimized for maximum bare-metal speed. That said, there are other advantages you get from a Hadoop-based system beyond pure speed: greater elasticity, better integration with other Hadoop components, and built-in cross-system resource management through components such as YARN. If these benefits are not of interest and your only concern is speed, then the Greenplum database may be the better choice.

Q: Does HAWQ store data in plain text format?

No. HAWQ supports multiple data formats for input, including its own format, Parquet, and a variety of other formats via external data access mechanisms including PXF. Our support for the built-in format and Parquet comes complete with MVCC snapshot isolation, which is a significant advantage if you want transactional data loading.
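[Editor's note: a hedged sketch of the PXF external-data idea Caleb mentions, i.e. exposing files already sitting in HDFS as a queryable SQL table. The host, port, HDFS path, column list, and profile name below are placeholders, and exact PXF URI syntax varies by version, so treat this as illustrative rather than working DDL.]

    import psycopg2

    # Placeholder DDL: register an HDFS data set as an external table.
    ddl = """
    CREATE EXTERNAL TABLE ext_events (
        event_id   bigint,
        account_id bigint,
        amount     numeric
    )
    LOCATION ('pxf://pxf-host:51200/data/events?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',');
    """

    conn = psycopg2.connect(host="hawq-master.example.com", dbname="analytics")
    with conn, conn.cursor() as cur:
        cur.execute(ddl)
        cur.execute("SELECT count(*) FROM ext_events;")  # ordinary SQL from here on
        print(cur.fetchone())
    conn.close()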
Q: Can we leave behind HDFS and design high-speed BI systems without all the extra IQ points required to deal with writing and reading to HDFS?

Fundamentally, one of the key advantages of a system designed around open components and a broader ecosystem is that SQL, while an extremely important capability, is just part of the puzzle for many modern businesses. There are things you can do with MapReduce/Pig/Spark/etc. that are not well expressed in SQL, and having a shared data store and data formats that let multiple backend processing systems share data, and be managed by a single resource management system, provides additional flexibility and enhanced capabilities.

Does that help?

Regards,
Caleb

On Thu, Nov 19, 2015 at 10:41 AM, Adaryl Wakefield <[email protected]> wrote:

Is anybody using Hawq in production?

Today I was thinking about speed and what would be faster: placing stuff on HDFS, or inserts into a distributed database. Coming from a structured data background, I haven't entirely wrapped my head around storing data in plain text format. I know you can use stuff like Avro and Parquet to enforce schema, but it's still just binary data on disk without all the guarantees that you've come to expect from relational databases over the past 20 years.

In a perfect world, I'd like to have all the awesomeness of HDFS but the ease of use of relational databases. My question is, are we there yet? Can we leave behind HDFS (or just be abstracted away from it) and design high-speed BI systems without all the extra IQ points required to deal with writing and reading to HDFS?

B.
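[Editor's note: to illustrate Caleb's point that some processing is awkward in SQL and that a shared data store lets several engines work over the same files, here is a hedged PySpark sketch. The HDFS path and column names are invented; it assumes the same Parquet files a SQL engine such as HAWQ might scan through an external table.]

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shared-storage-demo").getOrCreate()

    # Read the shared Parquet data set; a SQL engine could scan these same files.
    events = spark.read.parquet("hdfs:///warehouse/events")

    # Example of processing that is clumsy in SQL: arbitrary Python logic per record.
    suspicious = (events.rdd
                  .map(lambda row: (row["account_id"], row["amount"]))
                  .filter(lambda pair: pair[1] > 10000)
                  .collect())

    print(suspicious[:10])
    spark.stop()

The point of the sketch is only that the files, not any one engine, are the shared contract: the SQL layer and Spark both read the same Parquet on HDFS without copying data between systems.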
