Kris,
As logs accumulate over time, the union will get slow, since every run has to
read all the existing data off disk and write it back to disk.

Why not just have a hierarchy in your cleaned log directory? You can do
something like
%declare newdir `date +%s`;

store cleanfile into 'cleaned_files/$newdir/';


Then, to load all the logs, you can just load 'cleaned_files'.

You can also format the date output differently and wind up with your
cleaned files nicely organized by year/month/day/hour/ ...
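
As a rough sketch of the date-formatting idea (in Python rather than Pig, and only an illustration): the helper below builds a year/month/day/hour directory name from a timestamp. The function name partition_dir and the default root 'cleaned_files' are made up for the example; the same string should come out of a shell command along the lines of `date +%Y/%m/%d/%H`.

```python
from datetime import datetime, timezone


def partition_dir(ts, root="cleaned_files"):
    """Return a path of the form root/YYYY/MM/DD/HH for the datetime ts.

    Hypothetical helper: the name and the default root are assumptions
    made for this sketch, not anything Pig itself provides.
    """
    # strftime zero-pads month, day and hour, so the directory names
    # sort lexically in chronological order.
    return root + "/" + ts.strftime("%Y/%m/%d/%H")


if __name__ == "__main__":
    # Directory for the batch being cleaned right now (UTC).
    print(partition_dir(datetime.now(timezone.utc)))
    # partition_dir(datetime(2011, 1, 27, 16, 40)) gives
    # "cleaned_files/2011/01/27/16"
```

With a layout like that, a load should also be able to pick out a slice of the logs with a glob, e.g. 'cleaned_files/2011/01/*/*' for one month, instead of always reading everything.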

D

On Thu, Jan 27, 2011 at 4:40 PM, Kris Coward <[email protected]> wrote:

> Hi all,
>
> I'm writing a bit of code to grab some logfiles, parse them, and run some
> sanity checks on them (before subjecting them to further analysis).
> Naturally, logfiles being logfiles, they accumulate, and I was wondering
> how efficiently pig would handle a request to add recently accumulated
> log data to a bit of logfile that's already been started.
>
> In particular, two approaches that I'm contemplating are
>
> raw = LOAD 'logfile' ...
> -- snipped parsing/cleaning steps producing a relation with alias
> "cleanfile"
> oldclean = LOAD 'existing_log';
> newclean = UNION oldclean, cleanfile;
> STORE newclean INTO 'tmp_log';
> rm existing_log;
> mv tmp_log existing_log;
>
> ...ALTERNATELY...
>
> raw = LOAD 'logfile' ...
> -- snipped parsing/cleaning steps producing a relation with alias
> "cleanfile"
> STORE cleanfile INTO 'tmp_log';
>
> followed by renumbering all the part files in tmp_log and copying them
> to existing_log.
>
> Is pig clever enough to handle the first set of instructions reasonably
> efficiently (and if not, are there any gotchas I'd have to watch out for
> with the second approach, e.g. a catalogue file that'd have to be edited
> when the new parts are added).
>
> Thanks,
> Kris
>
> --
> Kris Coward                                     http://unripe.melon.org/
> GPG Fingerprint: 2BF3 957D 310A FEEC 4733  830E 21A4 05C7 1FEB 12B3
>
