Nice articles about Parquet *with* Avro:
- https://dzone.com/articles/understanding-how-parquet
- http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/

Nice video from the good folks at Cloudera on the *differences* between "Avrow" and Parquet:
- https://www.youtube.com/watch?v=AY1dEfyFeHc

2016-03-04 7:12 GMT+01:00 Koert Kuipers <ko...@tresata.com>:

> well, can you use ORC without bringing in the kitchen sink of dependencies
> also known as Hive?
>
> On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim <ilike...@gmail.com> wrote:
>
>> How about ORC? I have experimented briefly with Parquet and ORC, and I
>> liked the fact that ORC has its schema within the file, which makes it
>> handy to work with any other tools.
>>
>> Jong Wook
>>
>> On 3 March 2016 at 23:29, Don Drake <dondr...@gmail.com> wrote:
>>
>>> My tests show Parquet has better performance than Avro in just about
>>> every test. It really shines when you are querying a subset of columns in
>>> a wide table.
>>>
>>> -Don
>>>
>>> On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann <tim.sp...@airisdata.com>
>>> wrote:
>>>
>>>> Which format is the best format for SparkSQL ad hoc queries and general
>>>> data storage?
>>>>
>>>> There are lots of specialized cases, but generally we access some but
>>>> not all of the available columns, with a reasonable subset of the data.
>>>>
>>>> I am leaning towards Parquet as it has great support in Spark.
>>>>
>>>> I also have to consider that any file on HDFS may be accessed from other
>>>> tools like Hive, Impala, HAWQ.
>>>>
>>>> Suggestions?
>>>> --
>>>> airis.DATA
>>>> Timothy Spann, Senior Solutions Architect
>>>> C: 609-250-5894
>>>> http://airisdata.com/
>>>> http://meetup.com/nj-datascience
>>>
>>> --
>>> Donald Drake
>>> Drake Consulting
>>> http://www.drakeconsulting.com/
>>> https://twitter.com/dondrake
>>> 800-733-2143

--
Paul Leclercq | Data engineer
paul.lecle...@tabmo.io | http://www.tabmo.fr/
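
As a rough illustration of Don's point above about querying a subset of columns in a wide table, here is a toy sketch in plain Python. It is not how Parquet or Avro are implemented: real files add encoding, compression and metadata. The table size, the column names (`c3`, `c7`) and the helper functions are all made up for the example; it only counts how many values each layout has to touch to answer the same projection.

```python
# Toy model only: it ignores encoding, compression, and file metadata.
NUM_ROWS = 1000
NUM_COLS = 50
WANTED = ["c3", "c7"]  # hypothetical column names picked for the example

columns = [f"c{i}" for i in range(NUM_COLS)]

# Row-oriented layout (Avro-like): one complete record after another.
row_store = [
    {name: r * NUM_COLS + i for i, name in enumerate(columns)}
    for r in range(NUM_ROWS)
]

# Column-oriented layout (Parquet-like): one contiguous list per column.
column_store = {name: [row[name] for row in row_store] for name in columns}


def scan_rows(store, wanted):
    """Project columns from the row layout; every record is read in full."""
    values_read = 0
    result = {name: [] for name in wanted}
    for record in store:
        values_read += len(record)  # all fields of the record are touched
        for name in wanted:
            result[name].append(record[name])
    return result, values_read


def scan_columns(store, wanted):
    """Project columns from the column layout; only wanted columns are read."""
    values_read = 0
    result = {}
    for name in wanted:
        result[name] = list(store[name])
        values_read += len(store[name])
    return result, values_read


row_result, row_cost = scan_rows(row_store, WANTED)
col_result, col_cost = scan_columns(column_store, WANTED)

print("same answer:", row_result == col_result)
print("values touched, row layout:   ", row_cost)
print("values touched, column layout:", col_cost)
```

With 50 columns and 2 of them requested, the row scan touches 50,000 values and the column scan 2,000, for the same result. The gap grows with the width of the table, which matches the "wide table" case described in the thread.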