What I'm doing is, at the end of each day, I dedupe all my log files and store 
them in LZO format in an archive directory. I thought that since LZO is 
splittable and Hadoop likes larger files, this would be best. Is this not the case?

And to answer your question, there seem to be 2 files, each around 800MB in size.
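For what it's worth, the append-script approach Mike suggests below could be sketched roughly like this (the directory name `output/` and the file name `merged.log` are hypothetical, and the part-file setup here just simulates what a reduce phase would leave behind):

```shell
#!/bin/sh
# Sketch: after the Pig job completes, append the reducer part files
# into a single file. Names (output/, merged.log) are made up for
# illustration, not taken from the actual job.

# Simulate a job output directory with two part files:
mkdir -p output
printf 'line from part 0\n' > output/part-r-00000
printf 'line from part 1\n' > output/part-r-00001

# The actual merge step:
cat output/part-* > merged.log
```

If the output is still sitting on HDFS, `hadoop fs -getmerge <hdfs-dir> <local-file>` does the same concatenation in one step.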

On May 1, 2013, at 10:17 AM, Mike Sukmanowsky <m...@parsely.com> wrote:

> How many output files are you getting?  You can use SET DEFAULT_PARALLEL 1;
> so you don't have to specify parallelism on each reduce phase.
> 
> In general though, I wouldn't recommend forcing your output into one file
> (parallelism is good).  Just write a shell/python/ruby/perl script that
> appends the files after the full job executes.
> 
> 
> On Wed, May 1, 2013 at 12:51 PM, Mark <static.void....@gmail.com> wrote:
> 
>> Thought I understood how to output to a single file, but it doesn't seem to
>> be working. Anything I'm missing here?
>> 
>> 
>> -- Dedupe and store
>> 
>> rows   = LOAD '$input';
>> unique = DISTINCT rows PARALLEL 1;
>> 
>> STORE unique INTO '$output';
>> 
>> 
>> 
> 
> 
> -- 
> Mike Sukmanowsky
> 
> Product Lead, http://parse.ly
> 989 Avenue of the Americas, 3rd Floor
> New York, NY  10018
> p: +1 (416) 953-4248
> e: m...@parsely.com
