Kris,

As logs accumulate over time, the union will get slow, since every run has to read all of the existing cleaned data off disk and write it back to disk.
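To put rough numbers on that, here is a minimal Python sketch (plain arithmetic, not Pig). The unit sizes and the 365-run example are made-up assumptions, and the function names are hypothetical. It compares the total cleaned data moved when every run re-reads and rewrites the whole cleaned log against a scheme where each run writes only its own new data. Reading the raw logfile is ignored, since both approaches pay for that equally.

```python
def total_io_union(run_sizes):
    """Units read + written if each run re-reads the old cleaned log
    and writes old + new back out (the UNION-and-replace approach)."""
    total = 0
    existing = 0
    for new in run_sizes:
        total += existing      # read the existing cleaned data
        existing += new
        total += existing      # write existing + new back to disk
    return total


def total_io_append(run_sizes):
    """Units written if each run stores only its own new cleaned data."""
    return sum(run_sizes)


# Hypothetical workload: one unit of cleaned data per day for a year.
runs = [1] * 365
print(total_io_union(runs))   # 133225
print(total_io_append(runs))  # 365
```

With equal-sized runs the rewrite approach grows with the square of the number of runs, while writing only the new data grows linearly.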
Why not just have a hierarchy in your cleaned log directory? You can do something like

    %declare newdir `date +%s`;
    STORE newclean INTO 'cleaned_files/$newdir/';

Then, to load all the logs, you can just load 'cleaned_files'.

You can also format the date output differently and wind up with your cleaned files nicely organized by year/month/day/hour/ ...

D

On Thu, Jan 27, 2011 at 4:40 PM, Kris Coward <[email protected]> wrote:
> Hi all,
>
> I'm writing a bit of code to grab some logfiles, parse them, and run some
> sanity checks on them (before subjecting them to further analysis).
> Naturally, logfiles being logfiles, they accumulate, and I was wondering
> how efficiently pig would handle a request to add recently accumulated
> log data to a bit of logfile that's already been started.
>
> In particular, two approaches that I'm contemplating are
>
> raw = LOAD 'logfile' ...
> -- snipped parsing/cleaning steps producing a relation with alias
> "cleanfile"
> oldclean = LOAD 'existing_log';
> newclean = UNION oldclean, cleanfile;
> STORE newclean INTO 'tmp_log';
> rm existing_log;
> mv tmp_log existing_log;
>
> ...ALTERNATELY...
>
> raw = LOAD 'logfile' ...
> -- snipped parsing/cleaning steps producing a relation with alias
> "cleanfile"
> STORE cleanfile INTO 'tmp_log';
>
> followed by renumbering all the part files in tmp_log and copying them
> to existing_log.
>
> Is pig clever enough to handle the first set of instructions reasonably
> efficiently (and if not, are there any gotchas I'd have to watch out for
> with the second approach, e.g. a catalogue file that'd have to be edited
> when the new parts are added).
>
> Thanks,
> Kris
>
> --
> Kris Coward http://unripe.melon.org/
> GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
