Hi,

We use Avro to store/accumulate/batch streaming data and then migrate it to Parquet. We then use union queries to merge fresh and historical data (Avro + Parquet).
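To give you an idea, here are rough sketches of the Drill SQL involved (all workspace names and paths below are made-up placeholders, not our actual setup). The daily Avro-to-Parquet migration is just a CTAS:

  ALTER SESSION SET `store.format` = 'parquet';
  -- read yesterday's Avro directory, write it back out as Parquet
  CREATE TABLE dfs.tmp.`events_2016_03_07` AS
  SELECT * FROM dfs.`/data/avro/events/2016-03-07`;

The merge of fresh and historical data is a plain UNION ALL (this is where keeping the Avro and Parquet schemas compatible matters):

  SELECT * FROM dfs.`/data/avro/events`        -- fresh, not yet migrated
  UNION ALL
  SELECT * FROM dfs.`/data/parquet/events`;    -- historical

And on the Parquet side we constrain date-range queries with Drill's implicit directory columns (dir0, dir1, ...) so only the relevant files get scanned:

  SELECT * FROM dfs.`/data/parquet/events`
  WHERE dir0 = '2016' AND dir1 = '03';         -- prunes to 2016/03/*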
Things to keep in mind (AFAIK):

- Avro is a lot slower and less efficient than Parquet, both storage-space and performance wise
- We migrate our Avro records to Parquet every 24 hours
- The parquet-avro library (parquet-mr) will not create Drill-compatible Parquet if you are using nested structures in Avro (in some cases) - use Drill itself to convert your Avro files into Parquet
- Avro is missing date support, and maintaining a compatible schema between Avro and Parquet can be a bit tricky (depending on structure)
- The Avro Drill plugin does not support directory pruning - we rely on pruning (the dir0/dir1 filter sketched above) to limit the files scanned in date-range queries
- We have been dealing with a lot of issues with Avro - we hope the remainder of them is fixed in the imminent 1.6 release of Drill
- Parquet files are not suited for frequent updates (streaming inserts)
- If you are getting strange query results, immediately assume it's the Avro plugin - this will hopefully save you some time otherwise spent second-guessing/verifying your data

Hope this helps.

Regards,
 -Stefán

On Tue, Mar 8, 2016 at 11:58 AM, Conrad Crampton <[email protected]> wrote:

> Hi (new here),
> I have a plan to use Drill to provide a SQL abstraction layer (as an
> alternative to Hive). I like what I see so far, but I am a bit in the dark
> on Avro support. Whilst support for Avro is mentioned (almost in passing)
> in the documentation, there is very little detail on its use in practice
> as opposed to the Parquet references. I am using Apache NiFi to move data
> around, with Avro data on HDFS as its final resting place (NiFi supports
> this nicely out of the box). I therefore want to use Drill to query this,
> but the tests I have done so far seem very slow when querying any
> substantial amount of Avro data directly with Drill.
>
> I am looking for some pointers on how best to do this - my idea was to
> have my data in Avro (well-defined schema), partitioned into HDFS
> directories/sub-directories, but a simple select * from `/location` limit
> 100 takes forever (many minutes). Am I to assume that I need to create
> tables/views on top of the raw data for Drill to optimise its queries, and
> if so, doesn't it need to re-run these as batch jobs to update them?
>
> Any pointers/documentation/blog links would be welcome.
>
> Thanks
> Conrad
