Thanks Dudu and Gopal.

I tried HAR files and they work.

I want to use a sequence file, though, because I want to expose the data
through a table (filename and content columns).  *Can this be done for HAR
files?*
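
For example, something like this is what I'd hope to do with a HAR; a
minimal, untested sketch, assuming the archive was created at
/user/myid/myfiles.har (the har:// URI form may need the namenode host on
some setups):

create external table raw_files_har (raw_data string)
location 'har:///user/myid/myfiles.har';

-- hoping INPUT__FILE__NAME still exposes per-file names inside the archive
select INPUT__FILE__NAME, raw_data
from raw_files_har
limit 10;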

This is what I am doing to create a sequence file (note: the table name was
misspelled as "fies_seq" in my first attempt; it should be "files_seq"
throughout):

create external table raw_files (raw_data string)
location '/user/myid/myfiles';

create table files_seq (key string, value string) stored as sequencefile;

insert overwrite table files_seq
select REGEXP_EXTRACT(INPUT__FILE__NAME, '.*/(.*)/(.*)', 2) as file_name,
       CONCAT_WS(' ', COLLECT_LIST(raw_data)) as raw_data
from raw_files
group by INPUT__FILE__NAME;
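
As a quick sanity check on the result (illustrative query only):

-- one row per input file, with the size of the collected content
select key, length(value) as content_length
from files_seq
limit 10;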

It works well, but I am seeing 1MB files in the files_seq directory.  I am
using the parameters below.  *Is there a way to increase the file/block
size?*

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;
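
For what it's worth, the only lead I've found so far is Hive's small-file
merge settings; an untested sketch of what I mean (these are standard Hive
parameters, but the values are just guesses for a ~256MB target, and I'm
not sure the merge step kicks in for sequence file output):

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- hypothetical target size for merged files, in bytes (~256 MB)
SET hive.merge.size.per.task=256000000;
-- only trigger the merge when the average output file is below this size
SET hive.merge.smallfiles.avgsize=16000000;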


On Fri, Sep 23, 2016 at 7:16 PM, Gopal Vijayaraghavan <gop...@apache.org>
wrote:

>
> > Is there a way to create an external table on a directory, extract 'key'
> as file name and 'value' as file content and write to a sequence file table?
>
> Do you care that it is a sequence file?
>
> The HDFS HAR format was invented for this particular problem, check if the
> "hadoop archive" command works for you and offers a filesystem abstraction.
>
> Otherwise, there's always the old Mahout "seqdirectory" job, which is
> great if you have like .jpg files and want to pack them for HDFS to handle
> better (like GPS tiles).
>
> Cheers,
> Gopal
>