Hi all,

I'm writing a bit of code to grab some logfiles, parse them, and run some sanity checks on them (before subjecting them to further analysis). Naturally, logfiles being logfiles, they accumulate, and I was wondering how efficiently Pig would handle a request to append recently accumulated log data to a cleaned logfile that has already been started.
In particular, the two approaches that I'm contemplating are

    raw = LOAD 'logfile' ...
    -- snipped parsing/cleaning steps producing a relation with alias "cleanfile"
    oldclean = LOAD 'existing_log';
    newclean = UNION oldclean, cleanfile;
    STORE newclean INTO 'tmp_log';
    rm existing_log;
    mv tmp_log existing_log;

...ALTERNATELY...

    raw = LOAD 'logfile' ...
    -- snipped parsing/cleaning steps producing a relation with alias "cleanfile"
    STORE cleanfile INTO 'tmp_log';

followed by renumbering all the part files in tmp_log and copying them to existing_log.

Is Pig clever enough to handle the first set of instructions reasonably efficiently? And if not, are there any gotchas I'd have to watch out for with the second approach (e.g. a catalogue file that would have to be edited when the new parts are added)?

Thanks,
Kris

-- 
Kris Coward http://unripe.melon.org/
GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
