Pig and Hive both support compressed sequence files. As for the best format: if it's just text log data (i.e. no types or structure), the best way to keep it is as compressed text. SequenceFiles make the data splittable but add a small overhead in space and efficiency, and none of the good compression codecs are splittable on their own (LZO is good, but needs pre-indexing to be treated as splittable).
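
To illustrate the splittability point, here is a rough Python sketch (not Hadoop code). It assumes a made-up log and a toy container of independently compressed blocks plus an offset index, which is only loosely analogous to what a block-compressed SequenceFile or an LZO index gives you. The names lines, can_decode_from, block_compress and read_block are all hypothetical.

```python
import gzip
import zlib

# Hypothetical daily log: 20,000 short text lines (made-up data).
lines = [f"2013-04-09 10:{i % 60:02d} GET /page/{i}\n".encode() for i in range(20000)]
data = b"".join(lines)

# Case 1: one gzip stream over the whole file.
whole = gzip.compress(data)


def can_decode_from(blob, offset):
    """Return True if gzip decoding succeeds when started at `offset`."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No gzip header at this position, so a reader cannot start here.
        return False


# Case 2: toy "block-compressed" container. Each block of lines is
# compressed on its own and its (offset, length) is recorded in an index.
def block_compress(all_lines, lines_per_block=1000):
    blob = bytearray()
    index = []
    for start in range(0, len(all_lines), lines_per_block):
        chunk = gzip.compress(b"".join(all_lines[start:start + lines_per_block]))
        index.append((len(blob), len(chunk)))
        blob += chunk
    return bytes(blob), index


def read_block(blob, index, i):
    """Decode block i alone, without touching any other block."""
    offset, length = index[i]
    return gzip.decompress(blob[offset:offset + length])


blocked, index = block_compress(lines)

print("whole-file gzip decodes from the middle:", can_decode_from(whole, len(whole) // 2))
print("block 7 decodes on its own:", read_block(blocked, index, 7) == b"".join(lines[7000:8000]))
print("sizes (raw, whole gzip, blocked):", len(data), len(whole), len(blocked))
```

A single gzip stream can only be read from its start, so one mapper ends up with the whole file. With independent blocks, each reader can jump to a block boundary and decode just its share, at the cost of some extra bytes for the per-block framing and the index.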

On Tue, Apr 9, 2013 at 10:21 PM, Mark <[email protected]> wrote:
> Actually, compressed sequence files may not work with Pig or Hive then right?
>
> On Apr 9, 2013, at 9:50 AM, Mark <[email protected]> wrote:
>
>> Forgetting Impala, what format would be best to use with daily logs?
>>
>> Block-compressed sequence files?
>>
>> On Apr 8, 2013, at 8:12 PM, Harsh J <[email protected]> wrote:
>>
>>> Hey Mark,
>>>
>>> Gzip codec creates extension .gz, not .deflate (which is
>>> DeflateCodec). You may want to re-check your settings.
>>>
>>> Impala questions are best resolved at its current user and developer
>>> community at
>>> https://groups.google.com/a/cloudera.org/forum/#!forum/impala-user.
>>> Impala does currently support LZO (and also Indexed LZO) compressed
>>> text files however, so you may want to try that as it's splittable
>>> (compared to Gzip ones).
>>>
>>> On Tue, Apr 9, 2013 at 5:18 AM, Mark <[email protected]> wrote:
>>>> Trying to determine what the best format to use for storing daily logs. We
>>>> recently switched from snappy (.snappy) to gzip (.deflate) but I'm wondering
>>>> if there is something better? Our main clients for these daily logs are
>>>> pig and hive using an external table. We were thinking about testing out
>>>> impala but we see that it doesn't work with compressed text files. Any
>>>> suggestions?
>>>>
>>>> Thanks
>>>
>>>
>>>
>>> --
>>> Harsh J
>>
>

--
Harsh J
