Sequence files are language-neutral like Avro? Yes, but I am not sure about library support in other languages for processing sequence files.
Thanks,
Rahul

On Mon, Sep 30, 2013 at 11:10 PM, Peyman Mohajerian <[email protected]> wrote:

> It is not recommended to keep data at rest in sequence file format,
> because it is Java-specific and you cannot easily share it with non-Java
> systems; it is ideal for running MapReduce jobs. One approach would be to
> bring all the data of the different formats into HDFS as is and then
> convert it to a single format that works best for you, depending on
> whether you will export this data out or not (in addition to many other
> considerations). But, as already mentioned, Hive can directly read any of
> these formats.
>
>
> On Mon, Sep 30, 2013 at 1:08 AM, Raj K Singh <[email protected]> wrote:
>
>> For XML file processing, Hadoop comes with a class for this purpose
>> called StreamXmlRecordReader. You can use it by setting your input format
>> to StreamInputFormat and setting the stream.recordreader.class property to
>> org.apache.hadoop.streaming.StreamXmlRecordReader (a driver sketch follows
>> at the end of this thread).
>>
>> For JSON files, ElephantBird, an open-source project that contains some
>> useful utilities for working with LZO compression, has an
>> LzoJsonInputFormat, which can read JSON but requires that the input file
>> be LZOP-compressed. You can use that code as a template for your own JSON
>> InputFormat that does not have the LZOP compression requirement.
>>
>> If you are dealing with small files, then the sequence file format comes
>> to the rescue: it stores sequences of binary key-value pairs. Sequence
>> files are well suited as a format for MapReduce data since they are
>> splittable and support compression (see the packing sketch at the end of
>> this thread).
>>
>>
>> ::::::::::::::::::::::::::::::::::::::::
>> Raj K Singh
>> http://in.linkedin.com/in/rajkrrsingh
>> http://www.rajkrrsingh.blogspot.com
>> Mobile Tel: +91 (0)9899821370
>>
>>
>> On Mon, Sep 30, 2013 at 1:10 PM, Wolfgang Wyremba <
>> [email protected]> wrote:
>>
>>> Hello,
>>>
>>> The file format topic is still confusing me, and I would appreciate it
>>> if you could share your thoughts and experience with me.
>>>
>>> From reading different books/articles/websites, I understand that
>>> - Sequence files (used frequently, but not only, for binary data),
>>> - Avro,
>>> - RC (developed to work best with Hive - columnar storage), and
>>> - ORC (a successor of RC to give Hive another performance boost - the
>>> Stinger initiative)
>>> are all container file formats that solve the "small files problem" and
>>> all support compression and splitting. Additionally, each file format
>>> was developed with specific features/benefits in mind.
>>>
>>> Imagine I have the following text source data:
>>> - 1 TB of XML documents (some millions of small files)
>>> - 1 TB of JSON documents (some hundreds of thousands of medium-sized files)
>>> - 1 TB of Apache log files (some thousands of bigger files)
>>>
>>> How should I store this data in HDFS to process it using Java MapReduce,
>>> Pig, and Hive? I want to use the best tool for my specific problem -
>>> with the "best" performance, of course - i.e. maybe one problem on the
>>> Apache log data can best be solved using Java MapReduce, another one
>>> using Hive or Pig.
>>>
>>> Should I simply put the data into HDFS as it comes - i.e. as plain text
>>> files? Or should I convert all my data to a container file format like
>>> sequence files, Avro, RC, or ORC?
>>>
>>> Based on this example, I believe
>>> - the XML documents will need to be converted to a container file format
>>> to overcome the "small files problem",
>>> - the JSON documents could/should not be affected by the "small files
>>> problem", and
>>> - the Apache log files should definitely not be affected by the "small
>>> files problem", so they could be stored as plain text files.
>>>
>>> So, some source data needs to be converted to a container file format;
>>> other data not necessarily. But what is really advisable?
>>>
>>> Is it advisable to store all the data (XML, JSON, Apache logs) in one
>>> specific container file format in the cluster - let's say you decide to
>>> use sequence files? Having only one file format in HDFS is of course a
>>> benefit in terms of managing the files and writing Java
>>> MapReduce/Pig/Hive code against them. Sequence files in this case are
>>> certainly not a bad idea, but Hive queries could probably benefit more
>>> from, let's say, RC/ORC.
>>>
>>> Therefore, is it better to use a mix of plain text files and/or one or
>>> more container file formats simultaneously?
>>>
>>> I know that there will be no crystal-clear answer here, as it always
>>> "depends", but what approach should be taken, or what is usually used in
>>> the community out there?
>>>
>>> I welcome any feedback and experiences you have made.
>>>
>>> Thanks
>>>
>>>
>>
>
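To make Raj's StreamXmlRecordReader suggestion concrete, here is a minimal old-API (mapred) driver sketch. It assumes the hadoop-streaming jar is on the job classpath; the <doc> begin/end tags, the XmlIngestDriver/XmlDocMapper names, and the document-counting mapper body are placeholders for illustration only, not something any of the replies above prescribe.

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.streaming.StreamInputFormat;

    public class XmlIngestDriver {

      // Illustrative mapper: with StreamXmlRecordReader the matched XML chunk
      // typically arrives as the key with an empty value; parse it here as needed.
      public static class XmlDocMapper extends MapReduceBase
          implements Mapper<Text, Text, Text, IntWritable> {
        @Override
        public void map(Text xmlDoc, Text empty,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          output.collect(new Text("docs"), new IntWritable(1)); // placeholder: count documents
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(XmlIngestDriver.class);
        conf.setJobName("xml-ingest");

        // Use the streaming XML record reader; <doc>...</doc> is a placeholder
        // for whatever element delimits one record in your documents.
        conf.setInputFormat(StreamInputFormat.class);
        conf.set("stream.recordreader.class",
            "org.apache.hadoop.streaming.StreamXmlRecordReader");
        conf.set("stream.recordreader.begin", "<doc>");
        conf.set("stream.recordreader.end", "</doc>");

        conf.setMapperClass(XmlDocMapper.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

It is submitted like any other Java MapReduce job, e.g. `hadoop jar yourjob.jar XmlIngestDriver /input/xml /output/xml` (jar name and paths are, again, only examples).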
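And for the small-files point: a minimal sketch of packing a directory of small files (for example, the XML documents) into a single SequenceFile, with the original file name as the key and the raw bytes as the value. It uses the option-based SequenceFile.createWriter API from Hadoop 2.x; the class name, paths, and BLOCK compression choice are illustrative, and it assumes each small file fits comfortably in memory.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilesToSequenceFile {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);    // directory of small files
        Path outputFile = new Path(args[1]);  // SequenceFile to create

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(outputFile),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(BytesWritable.class),
            SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {

          for (FileStatus status : fs.listStatus(inputDir)) {
            if (status.isDirectory()) {
              continue; // skip subdirectories in this simple sketch
            }
            // Read the whole file; assumes each small file fits in memory.
            byte[] contents = new byte[(int) status.getLen()];
            try (FSDataInputStream in = fs.open(status.getPath())) {
              IOUtils.readFully(in, contents, 0, contents.length);
            }
            // Key = original file name, value = whole file contents.
            writer.append(new Text(status.getPath().getName()),
                new BytesWritable(contents));
          }
        }
      }
    }

A MapReduce job can then read the packed file through SequenceFileInputFormat and gets splittable, compressible input instead of millions of tiny files.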
