Q: Is anybody using Hawq in production?

There are separate answers depending on context:
• Is anybody using HAWQ in production? Yes. Prior to incubation in Apache, HAWQ was sold by Pivotal, and there are Pivotal customers that have the pre-Apache code line in production today.

• Is anybody using Apache HAWQ in production? HAWQ was incubated into Apache very recently and we have not yet had an official Apache release. The code in Apache is based on the next major release and is currently in a pre-beta state of stability. We have a release motion in process for a beta release that should be available soon, but I am not aware of anyone currently in production with code based off the Apache HAWQ code line.

Q: What would be faster, placing stuff in HDFS or inserts directly into a distributed database?

Generally speaking, HDFS does add some overhead compared to a distributed database sitting on bare metal. However, in both cases the data has to be replicated so that the distributed system has built-in fault tolerance, so the primary cost comparison is between the replication overhead in HDFS and the special-purpose replication mechanisms in the distributed RDBMS. One of the clearest comparisons would be HAWQ against the Greenplum database (also recently open sourced): both are based on the same fundamental RDBMS architecture, but HAWQ has been adapted to the Hadoop ecosystem while Greenplum has been optimized for maximum bare-metal speed. That said, there are other advantages you get from a Hadoop-based system beyond pure speed, including greater elasticity, better integration with other Hadoop components, and built-in cross-system resource management through components such as YARN. If those benefits are not of interest and your only concern is speed, then the Greenplum database may be a better choice.

Q: Does HAWQ store data in plain text format?

No. HAWQ supports multiple data formats for input, including its own native format and Parquet, plus access to a variety of other formats through external data access mechanisms such as PXF. Our support for the native format and Parquet comes complete with MVCC snapshot isolation, which is a significant advantage if you want transactional data loading.
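To make that a bit more concrete, here is a rough sketch of what the two storage paths look like in HAWQ SQL. The table names, columns, HDFS path, and PXF host/port are made up for illustration, and the exact PXF LOCATION syntax and profile names vary by release, so treat this as a sketch rather than copy-paste DDL:

    -- A HAWQ-managed table stored as Parquet on HDFS, with the
    -- MVCC/transactional loading semantics mentioned above.
    CREATE TABLE sales_parquet (
        id      BIGINT,
        amount  NUMERIC(10,2),
        sold_at DATE
    )
    WITH (APPENDONLY=true, ORIENTATION=parquet)
    DISTRIBUTED BY (id);

    -- Delimited text files that already sit in HDFS, exposed through
    -- a PXF external table (no data is moved or copied).
    CREATE EXTERNAL TABLE sales_ext (
        id      BIGINT,
        amount  NUMERIC(10,2),
        sold_at DATE
    )
    LOCATION ('pxf://namenode:51200/data/sales?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',');

    -- Bulk-loading from HDFS into the managed table is then a single
    -- INSERT ... SELECT rather than row-at-a-time inserts.
    INSERT INTO sales_parquet
    SELECT id, amount, sold_at FROM sales_ext;

The first path gives you the transactional guarantees described above; the second lets you query files other systems have written without loading them at all, which is often the simplest answer to the "HDFS versus inserts" question when the data already lives in HDFS.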
Q: Can we leave behind HDFS and design high speed BI systems without all the extra IQ points required to deal with writing and reading to HDFS?

Fundamentally, one of the key advantages of a system designed around open components and a broader ecosystem is that SQL, while an extremely important capability, is just one part of the puzzle for many modern businesses. There are things you can do with MapReduce/Pig/Spark/etc. that are not well expressed in SQL, and having a shared data store and shared data formats that let multiple backend processing systems work off the same data, all managed by a single resource management system, provides additional flexibility and capability. (A rough sketch of what that sharing looks like from the HAWQ side follows the quoted message below.)

Does that help?

Regards,
Caleb

On Thu, Nov 19, 2015 at 10:41 AM, Adaryl Wakefield <[email protected]> wrote:
> Is anybody using Hawq in production? Today I was thinking about speed and
> what would be faster. Placing stuff on HDFS or inserts into a distributed
> database. Coming from a structured data background, I haven't entirely
> wrapped my head around storing data in plain text format. I know you can
> use stuff like Avro and Parquet to enforce schema, but it's still just
> binary data on disk without all the guarantees that you've come to expect
> from relational databases over the past 20 years. In a perfect world, I'd
> like to have all the awesomeness of HDFS but the ease of use of relational
> databases. My question is, are we there yet? Can we leave behind HDFS (or
> just be abstracted away from it) and design high speed BI systems without
> all the extra IQ points required to deal with writing and reading to HDFS?
>
> B.
>
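As promised above, here is a sketch of the shared-data-store idea from the HAWQ side. Suppose another engine (Hive in this hypothetical example; Spark would see the same table through the metastore) already maintains a table in the cluster's shared store. HAWQ can query the very same files through a PXF external table instead of keeping its own copy. The database, table, column, host, and port names are all invented, and the exact Hive-profile syntax depends on the PXF release:

    -- A Hive-managed table, default.web_logs, exposed to HAWQ through
    -- PXF's Hive profile; both engines read the same HDFS files.
    CREATE EXTERNAL TABLE web_logs_hive (
        log_ts  TIMESTAMP,
        url     TEXT,
        user_id BIGINT
    )
    LOCATION ('pxf://namenode:51200/default.web_logs?PROFILE=Hive')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

    -- From here it is ordinary SQL, even though HAWQ never loaded or
    -- copied the data.
    SELECT url, count(*) AS hits
    FROM web_logs_hive
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;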
