What I'm doing is, at the end of each day, I dedupe and store all my log files in LZO format in an archive directory. I thought that since LZO is splittable and Hadoop prefers larger files, this would be best. Is this not the case?
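One caveat worth noting: Hadoop's input formats can only split an LZO file if a matching .index file exists alongside it; without one, each LZO file is processed by a single mapper. A common approach (assuming the hadoop-lzo library is installed; the jar path and archive directory below are placeholder assumptions, not details from this thread) is to run its distributed indexer over the archive directory after the nightly job:

```shell
# Sketch, assuming hadoop-lzo is available on the cluster.
# /path/to/hadoop-lzo.jar and /archive/2013-05-01 are placeholders.
hadoop jar /path/to/hadoop-lzo.jar \
    com.hadoop.compression.lzo.DistributedLzoIndexer \
    /archive/2013-05-01
```

This launches a MapReduce job that writes a .index file next to each .lzo file, making subsequent jobs over the archive splittable.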
And to answer your question, there seem to be two files, each around 800 MB in size.

On May 1, 2013, at 10:17 AM, Mike Sukmanowsky <m...@parsely.com> wrote:

> How many output files are you getting? You can use SET DEFAULT_PARALLEL 1;
> so you don't have to specify parallelism on each reduce phase.
>
> In general, though, I wouldn't recommend forcing your output into one file
> (parallelism is good). Just write a shell/python/ruby/perl script that
> appends the files after the full job executes.
>
>
> On Wed, May 1, 2013 at 12:51 PM, Mark <static.void....@gmail.com> wrote:
>
>> Thought I understood how to output to a single file, but it doesn't seem
>> to be working. Anything I'm missing here?
>>
>>
>> -- Dedupe and store
>>
>> rows = LOAD '$input';
>> unique = DISTINCT rows PARALLEL 1;
>>
>> STORE unique INTO '$output';
>>
>>
>
>
> --
> Mike Sukmanowsky
>
> Product Lead, http://parse.ly
> 989 Avenue of the Americas, 3rd Floor
> New York, NY 10018
> p: +1 (416) 953-4248
> e: m...@parsely.com