I'm willing to accept being abstracted from HDFS. If something wants to use HDFS as file storage, fine, as long as I don't have to actually write code against HDFS. I don't consider straight replication to be distributed. You need a shared-nothing architecture where no one node holds the entire database.
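[Editor's note: as a toy illustration of the shared-nothing point above, and nothing more, here is a small Python sketch. The node names and rows are made up and this is not any particular database's mechanism: hash distribution places each row on exactly one node, while replication gives every node a full copy.]

    # Toy illustration of "shared nothing" vs. replication. Invented data.
    NODES = ["node1", "node2", "node3"]

    rows = [{"id": i, "amount": i * 10} for i in range(9)]

    # Shared-nothing: each row lives on exactly one node, chosen by hashing
    # its distribution key, so no single node holds the whole table.
    distributed = {n: [] for n in NODES}
    for row in rows:
        owner = NODES[hash(row["id"]) % len(NODES)]
        distributed[owner].append(row)

    # Replication: every node stores every row (a full copy per node).
    replicated = {n: list(rows) for n in NODES}

    for n in NODES:
        print(n, "distributed:", len(distributed[n]), "rows;",
              "replicated:", len(replicated[n]), "rows")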
Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Rowland Gosling
Sent: Thursday, November 19, 2015 9:07 PM
To: [email protected]
Subject: RE: Can distributed DBs replace HDFS?

You do, by way of an MPP database running on HDFS, from what I've read. But distributed databases come in many flavors, such as Postgres or Oracle. It depends on the interpretation of 'distributed,' I suppose.

From: Adaryl "Bob" Wakefield, MBA [mailto:[email protected]]
Sent: Thursday, November 19, 2015 9:02 PM
To: [email protected]
Subject: Re: Can distributed DBs replace HDFS?

"long term storage, reporting, analytics" Don't you get these things with Hawq, or am I misunderstanding something?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Rowland Gosling
Sent: Thursday, November 19, 2015 8:16 PM
To: [email protected]
Subject: RE: Can distributed DBs replace HDFS?

There isn't an either/or proposition in modern data systems; it's not that you choose either distributed databases or HDFS. If you have complex systems, there's a good chance you need both. For example, there's no way I want my banking system running on anything less than a traditional RDBMS with all its guarantees. Conversely, I can't see large financial institutions not leveraging HDFS in some capacity: long-term storage, reporting, analytics. Both is the right answer in many cases.

Rowland Gosling
Senior Consultant
Pragmatic Works, Inc.

From: Adaryl "Bob" Wakefield, MBA [mailto:[email protected]]
Sent: Thursday, November 19, 2015 8:00 PM
To: [email protected]
Subject: Re: Can distributed DBs replace HDFS?

You had me right up to the last paragraph. I'm coming from the standpoint of engineering big data systems. There is a bewildering array of technologies and techniques currently in use. As a matter of fact, I think either Nathan Marz or Jay Kreps gave a talk literally titled "Reducing complexity in big data systems."

We lived with relational databases for years. We had to move away from them because the data got ridiculous and traditional databases couldn't keep up. Now there are more types of databases than you can shake a stick at: column stores, graph DBs, document DBs, and each one requires a different modeling technique, which means each one has a learning curve for anybody new to NoSQL. If you're going to design and implement a big data system, at a minimum you need to know Java, Git, some flavor of Linux, and some build tool (Ant, Maven, Gradle, etc.), and all that is before we even start storing the data. If you're coming from a non-computer-science background, the amount of stuff you need to put in your head can quite literally blow some people out of the career field (because who wants to learn new stuff at 40?).

So I've been watching with excitement the rise of "NewSQL" databases like MemSQL and VoltDB, because you can use familiar skills to build big data systems. Instead of having to write code to serialize an object to a file format, you can go back to just executing an insert statement (see the sketch below). You can model data the way you're used to modeling. However, those tools are for a transactional use case. When I look at Hawq and Greenplum, I see relational databases that can handle big data and an analytics use case.
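[Editor's note: a minimal Python sketch of the "serialize an object to a file format" versus "just execute an insert statement" contrast above. The table, file path, host, and connection details are invented for illustration, and it assumes pyarrow and psycopg2 are available.]

    # Path A: serialize records to a columnar file yourself, the way many
    # HDFS-centric pipelines require, then move the file into HDFS.
    import pyarrow as pa
    import pyarrow.parquet as pq

    records = pa.table({"order_id": [1, 2], "total": [19.99, 5.00]})
    pq.write_table(records, "orders.parquet")  # then copy/stream into HDFS

    # Path B: the familiar relational route, a plain parameterized INSERT
    # against a distributed SQL engine speaking the PostgreSQL protocol
    # (HAWQ, Greenplum, etc.). Host and database name are placeholders.
    import psycopg2

    conn = psycopg2.connect(host="mpp-master.example.com", dbname="analytics")
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO orders (order_id, total) VALUES (%s, %s)",
                    (1, 19.99))
    conn.close()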
From an analytics standpoint, most analytic tools I'm aware of try to mimic SQL. There is HiveQL, Drill, Spark SQL, and the sqldf package in R. Some of these tools aren't fully SQL compliant and have quirks (more stuff to learn). Hawq/Greenplum gets us back to legit SQL. So when I talk about leaving HDFS for a distributed DB, what I'm talking about is simplifying the work necessary to store data: not having to know Java/MapReduce, not having to worry about file formats and compression schemes, and not having to have another tool that lets analysts query the data as if it were a relational database. Let's just put the data in an actual relational database. The key is to have a distributed database that is up to the challenge of modern data management. Are we there yet, or is there more work to do?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Caleb Welton
Sent: Thursday, November 19, 2015 1:11 PM
To: [email protected]
Subject: Re: Can distributed DBs replace HDFS?

Q: Is anybody using Hawq in production?

Separate answers depending on context.

• Is anybody using HAWQ in production? Yes. Prior to incubating HAWQ in Apache, HAWQ was sold by Pivotal, and there are Pivotal customers that have the pre-Apache code line in production today.
• Is anybody using Apache HAWQ in production? HAWQ was incubated into Apache very recently and we have not yet had an official Apache release. The code in Apache is based on the next major release and is currently in a pre-beta state of stability. We have a release motion in process for a beta release that should be available soon, but there is no one that I am aware of currently in production with code based off the Apache HAWQ code line.

Q: What would be faster, placing stuff in HDFS or inserts directly into a distributed database?

Generally speaking, HDFS does add some overhead over a distributed database sitting on bare metal. However, in both cases data must be replicated so the distributed system has built-in fault tolerance, so the primary cost comparison is the overhead of HDFS's replication mechanisms versus the special-purpose mechanisms in the distributed RDBMS. One of the clearest comparisons would be HAWQ versus the Greenplum database (also recently open sourced), as they are both based on the same fundamental RDBMS architecture, but HAWQ has been adapted to the Hadoop ecosystem while Greenplum has been optimized for maximum bare-metal speed. That said, there are other advantages you get from a Hadoop-based system beyond pure speed: greater elasticity, better integration with other Hadoop components, and built-in cross-system resource management through components such as YARN. If these benefits are not of interest and your only concern is speed, then the Greenplum database may be the better choice.

Q: Does HAWQ store data in plain text format?

No. HAWQ supports multiple data formats for input, including its own format, Parquet, and a variety of other formats via external data access mechanisms including PXF. Our support for the built-in format and Parquet comes complete with MVCC snapshot isolation, which is a significant advantage if you want transactional data loading.
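[Editor's note: a hedged sketch of the PXF external-data idea Caleb mentions, i.e. exposing files already sitting in HDFS as a queryable SQL table. The host, port, HDFS path, column list, and profile name below are placeholders, and exact PXF URI syntax varies by version, so treat this as illustrative rather than working DDL.]

    import psycopg2

    # Placeholder DDL: register an HDFS data set as an external table.
    ddl = """
    CREATE EXTERNAL TABLE ext_events (
        event_id   bigint,
        account_id bigint,
        amount     numeric
    )
    LOCATION ('pxf://pxf-host:51200/data/events?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',');
    """

    conn = psycopg2.connect(host="hawq-master.example.com", dbname="analytics")
    with conn, conn.cursor() as cur:
        cur.execute(ddl)
        cur.execute("SELECT count(*) FROM ext_events;")  # ordinary SQL from here on
        print(cur.fetchone())
    conn.close()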
Q: Can we leave behind HDFS and design high-speed BI systems without all the extra IQ points required to deal with writing and reading to HDFS?

Fundamentally, one of the key advantages of a system designed around open components and a broader ecosystem is that SQL, while an extremely important capability, is just part of the puzzle for many modern businesses. There are things you can do with MapReduce/Pig/Spark/etc. that are not well expressed in SQL, and having a shared data store and data formats that let multiple backend processing systems share data, and be managed by a single resource management system, provides additional flexibility and enhanced capabilities.

Does that help?

Regards,
Caleb

On Thu, Nov 19, 2015 at 10:41 AM, Adaryl Wakefield <[email protected]> wrote:

Is anybody using Hawq in production?

Today I was thinking about speed and what would be faster: placing stuff on HDFS, or inserts into a distributed database. Coming from a structured data background, I haven't entirely wrapped my head around storing data in plain text format. I know you can use stuff like Avro and Parquet to enforce schema, but it's still just binary data on disk without all the guarantees that you've come to expect from relational databases over the past 20 years.

In a perfect world, I'd like to have all the awesomeness of HDFS but the ease of use of relational databases. My question is, are we there yet? Can we leave behind HDFS (or just be abstracted away from it) and design high-speed BI systems without all the extra IQ points required to deal with writing and reading to HDFS?

B.
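[Editor's note: to illustrate Caleb's point that some processing is awkward in SQL and that a shared data store lets several engines work over the same files, here is a hedged PySpark sketch. The HDFS path and column names are invented; it assumes the same Parquet files a SQL engine such as HAWQ might scan through an external table.]

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shared-storage-demo").getOrCreate()

    # Read the shared Parquet data set; a SQL engine could scan these same files.
    events = spark.read.parquet("hdfs:///warehouse/events")

    # Example of processing that is clumsy in SQL: arbitrary Python logic per record.
    suspicious = (events.rdd
                  .map(lambda row: (row["account_id"], row["amount"]))
                  .filter(lambda pair: pair[1] > 10000)
                  .collect())

    print(suspicious[:10])
    spark.stop()

The point of the sketch is only that the files, not any one engine, are the shared contract: the SQL layer and Spark both read the same Parquet on HDFS without copying data between systems.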
