Q: Is anybody using Hawq in production?

There are separate answers depending on context:
• Is anybody using HAWQ in production? Yes. Prior to incubation in Apache, HAWQ was sold by Pivotal, and there are Pivotal customers that have the pre-Apache code line in production today.

• Is anybody using Apache HAWQ in production? HAWQ was incubated into Apache very recently and we have not yet had an official Apache release. The code in Apache is based on the next major release and is currently in a pre-beta state of stability. We have a release motion in process for a beta release that should be available soon, but I am not aware of anyone currently in production with code based off the Apache HAWQ code line.

Q: What would be faster, placing stuff in HDFS or inserts directly into a distributed database?

Generally speaking, HDFS does add some overhead compared to a distributed database sitting on bare metal. However, in both cases the data has to be replicated so that the distributed system has built-in fault tolerance, so the primary cost comparison is between the replication overhead in HDFS and the special-purpose replication mechanisms in the distributed RDBMS. One of the clearest comparisons would be HAWQ against the Greenplum database (also recently open sourced): both are based on the same fundamental RDBMS architecture, but HAWQ has been adapted to the Hadoop ecosystem while Greenplum has been optimized for maximum bare-metal speed. That said, there are other advantages you get from a Hadoop-based system beyond pure speed, including greater elasticity, better integration with other Hadoop components, and built-in cross-system resource management through components such as YARN. If those benefits are not of interest and your only concern is speed, then the Greenplum database may be a better choice.

Q: Does HAWQ store data in plain text format?

No. HAWQ supports multiple data formats for input, including its own native format and Parquet, plus access to a variety of other formats through external data access mechanisms such as PXF. Our support for the native format and Parquet comes complete with MVCC snapshot isolation, which is a significant advantage if you want transactional data loading.
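To make that a bit more concrete, here is a rough sketch of what the two storage paths look like in HAWQ SQL. The table names, columns, HDFS path, and PXF host/port are made up for illustration, and the exact PXF LOCATION syntax and profile names vary by release, so treat this as a sketch rather than copy-paste DDL:

    -- A HAWQ-managed table stored as Parquet on HDFS, with the
    -- MVCC/transactional loading semantics mentioned above.
    CREATE TABLE sales_parquet (
        id      BIGINT,
        amount  NUMERIC(10,2),
        sold_at DATE
    )
    WITH (APPENDONLY=true, ORIENTATION=parquet)
    DISTRIBUTED BY (id);

    -- Delimited text files that already sit in HDFS, exposed through
    -- a PXF external table (no data is moved or copied).
    CREATE EXTERNAL TABLE sales_ext (
        id      BIGINT,
        amount  NUMERIC(10,2),
        sold_at DATE
    )
    LOCATION ('pxf://namenode:51200/data/sales?PROFILE=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',');

    -- Bulk-loading from HDFS into the managed table is then a single
    -- INSERT ... SELECT rather than row-at-a-time inserts.
    INSERT INTO sales_parquet
    SELECT id, amount, sold_at FROM sales_ext;

The first path gives you the transactional guarantees described above; the second lets you query files other systems have written without loading them at all, which is often the simplest answer to the "HDFS versus inserts" question when the data already lives in HDFS.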
Q: Can we leave behind HDFS and design high speed BI systems without all the extra IQ points required to deal with writing and reading to HDFS?

Fundamentally, one of the key advantages of a system designed around open components and a broader ecosystem is that SQL, while an extremely important capability, is just one part of the puzzle for many modern businesses. There are things you can do with MapReduce/Pig/Spark/etc. that are not well expressed in SQL, and having a shared data store and shared data formats that let multiple backend processing systems work off the same data, all managed by a single resource management system, provides additional flexibility and capability. (A rough sketch of what that sharing looks like from the HAWQ side follows the quoted message below.)

Does that help?

Regards,
Caleb

On Thu, Nov 19, 2015 at 10:41 AM, Adaryl Wakefield <[email protected]> wrote:
> Is anybody using Hawq in production? Today I was thinking about speed and
> what would be faster. Placing stuff on HDFS or inserts into a distributed
> database. Coming from a structured data background, I haven't entirely
> wrapped my head around storing data in plain text format. I know you can
> use stuff like Avro and Parquet to enforce schema, but it's still just
> binary data on disk without all the guarantees that you've come to expect
> from relational databases over the past 20 years. In a perfect world, I'd
> like to have all the awesomeness of HDFS but the ease of use of relational
> databases. My question is, are we there yet? Can we leave behind HDFS (or
> just be abstracted away from it) and design high speed BI systems without
> all the extra IQ points required to deal with writing and reading to HDFS?
>
> B.
>
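As promised above, here is a sketch of the shared-data-store idea from the HAWQ side. Suppose another engine (Hive in this hypothetical example; Spark would see the same table through the metastore) already maintains a table in the cluster's shared store. HAWQ can query the very same files through a PXF external table instead of keeping its own copy. The database, table, column, host, and port names are all invented, and the exact Hive-profile syntax depends on the PXF release:

    -- A Hive-managed table, default.web_logs, exposed to HAWQ through
    -- PXF's Hive profile; both engines read the same HDFS files.
    CREATE EXTERNAL TABLE web_logs_hive (
        log_ts  TIMESTAMP,
        url     TEXT,
        user_id BIGINT
    )
    LOCATION ('pxf://namenode:51200/default.web_logs?PROFILE=Hive')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

    -- From here it is ordinary SQL, even though HAWQ never loaded or
    -- copied the data.
    SELECT url, count(*) AS hits
    FROM web_logs_hive
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;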
