Hi all, I am new to ORC (and to Hadoop as well). I am currently trying to use the ORC format to save data on HDFS from an Apache Storm topology. I receive JSON messages from Kafka, convert each message to a Java object, add a few more properties, convert the result to comma-delimited values, and then save the data to HDFS. After going through some online material, I still cannot figure out how to implement this or what the data will actually look like in ORC format.
The data I need to store on HDFS arrives as a String (I can change it to any Java object if needed). I want to save it along with header information that describes what each attribute in the data is, and the header should be saved only once per file. My questions are:

1. How can I create the header (similar to the header row in a txt or csv file)? From the tutorials I have read, ORC has no header row; instead, a schema definition has to be provided. If that is the case, how do I save the schema definition in the file?
2. Should the stripeSize be the same as the block size of the HDFS file system? What happens if the two are not equal?
3. I have to write to different ORC files based on the timestamp in each JSON message. I will not know which file to write to until I have parsed the message, and the timestamp can be anything. Does closing and reopening the ORC writer for every tuple (JSON message) consume a lot of resources?
4. If I cannot know the target file until I see the timestamp in the JSON message, what is a more effective way to write to the files?

I feel my understanding of the ORC format is wrong, but I am unable to move forward, and I do not have a clear picture of what the ORC format looks like or how to save data in it. We are not using Hive. Could you please give me some input on how to move forward?
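To make the question about choosing the file by timestamp more concrete, here is a rough sketch of the alternative I am considering instead of opening and closing a writer per tuple: keep a small, bounded set of writers open, keyed by a time bucket, and close only the least recently used one when the limit is hit. `RowSink` and `RowSinkFactory` are hypothetical interfaces I made up to stand in for whatever would wrap the real ORC writer; the hourly UTC bucket and the limit on open writers are assumptions too, not anything taken from the ORC or Storm APIs.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the thing that writes rows (e.g. a wrapper around an ORC writer). */
interface RowSink {
    void write(String row);
    void close();
}

/** Hypothetical factory: opens a sink for a bucket and picks the actual file path itself. */
interface RowSinkFactory {
    RowSink open(String bucketKey);
}

/** Keeps at most maxOpen sinks open and routes each row by its timestamp. */
class TimeBucketWriterCache {
    // Assumption: one file per UTC hour; any other granularity would work the same way.
    private static final DateTimeFormatter HOUR =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH").withZone(ZoneOffset.UTC);

    private final int maxOpen;
    private final RowSinkFactory factory;
    // Access-ordered map: iteration starts at the least recently used entry.
    private final LinkedHashMap<String, RowSink> open = new LinkedHashMap<>(16, 0.75f, true);

    TimeBucketWriterCache(int maxOpen, RowSinkFactory factory) {
        if (maxOpen < 1) {
            throw new IllegalArgumentException("maxOpen must be at least 1");
        }
        this.maxOpen = maxOpen;
        this.factory = factory;
    }

    /** Maps an epoch-millisecond timestamp to its hourly bucket key. */
    static String bucketKeyFor(long epochMillis) {
        return HOUR.format(Instant.ofEpochMilli(epochMillis));
    }

    /** Writes a row to the sink for its bucket, opening the sink only if it is not cached. */
    void write(long epochMillis, String row) {
        String key = bucketKeyFor(epochMillis);
        RowSink sink = open.get(key); // a hit also marks the entry as most recently used
        if (sink == null) {
            if (open.size() >= maxOpen) {
                evictLeastRecentlyUsed();
            }
            sink = factory.open(key);
            open.put(key, sink);
        }
        sink.write(row);
    }

    /** Closes and forgets the sink that has gone unused the longest. */
    private void evictLeastRecentlyUsed() {
        Iterator<Map.Entry<String, RowSink>> it = open.entrySet().iterator();
        RowSink eldest = it.next().getValue();
        it.remove();
        eldest.close();
    }

    int openCount() {
        return open.size();
    }

    /** Closes every open sink, e.g. when the bolt shuts down. */
    void closeAll() {
        for (RowSink sink : open.values()) {
            sink.close();
        }
        open.clear();
    }
}
```

In the bolt I would call `write(timestamp, row)` for each tuple and `closeAll()` on shutdown. My assumption (please correct me if it is wrong) is that a closed ORC file cannot be appended to, so if an evicted bucket shows up again the factory would have to create a new file for it rather than reopen the old one.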
