Thanks Dudu and Gopal. I tried HAR files and it works.
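For anyone finding this thread later, the archive command I tried looks roughly like this (the archive name is made up; the paths match my example below):

hadoop archive -archiveName myfiles.har -p /user/myid myfiles /user/myid

This packs /user/myid/myfiles into /user/myid/myfiles.har while still offering a filesystem abstraction over the contents.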
I want to use a SequenceFile because I want to expose the data through a table with filename and content columns. *Can this be done for HAR files?*

This is what I am doing to create the SequenceFile table:

create external table raw_files (raw_data string) location '/user/myid/myfiles';

create table files_seq (key string, value string) stored as sequencefile;

insert overwrite table files_seq
select REGEXP_EXTRACT(INPUT__FILE__NAME, '.*/(.*)/(.*)', 2) as file_name,
       CONCAT_WS(' ', COLLECT_LIST(raw_data)) as raw_data
from raw_files
group by INPUT__FILE__NAME;

It works well, but I am seeing 1MB files in the files_seq directory. I am using the parameters below. *Is there a way to increase the file/block size?*

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;

On Fri, Sep 23, 2016 at 7:16 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> Is there a way to create an external table on a directory, extract 'key'
> as file name and 'value' as file content and write to a sequence file
> table?
>
> Do you care that it is a sequence file?
>
> The HDFS HAR format was invented for this particular problem, check if the
> "hadoop archive" command works for you and offers a filesystem abstraction.
>
> Otherwise, there's always the old Mahout "seqdirectory" job, which is
> great if you have like .jpg files and want to pack them for HDFS to handle
> better (like GPS tiles).
>
> Cheers,
> Gopal
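P.S. For the small-files question above, one thing I plan to experiment with is Hive's merge settings, which run an extra job at the end of the insert to concatenate small outputs into larger files. The size values below are just guesses for my data, not recommendations:

SET hive.merge.mapfiles=true;                 -- merge small files from map-only jobs
SET hive.merge.mapredfiles=true;              -- merge small files from map-reduce jobs
SET hive.merge.smallfiles.avgsize=134217728;  -- trigger the merge when the average output file is below ~128MB
SET hive.merge.size.per.task=268435456;       -- aim for ~256MB per merged file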