I missed the globbing on my previous passes over the documentation for LOAD. Having missed that, my objection would have been that with all the files in a single directory, I can get them with a single LOAD command. That said, a wildcard would also solve that. Thanks for pushing back hard enough to make me re-read that.
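For anyone finding this in the archives, here is a minimal sketch of the pattern under discussion: a date-stamped STORE followed by a glob (or whole-tree) LOAD. Directory names and the tab-delimited schema are illustrative only, and the `%declare` spelling is my reading of Pig's parameter-substitution docs, not something tested here:

```pig
-- Each run stores its cleaned output under a fresh date-stamped directory
-- (Pig's preprocessor runs the backquoted shell command at parse time).
%declare NEWDIR `date +%Y/%m/%d/%H`
STORE cleanfile INTO 'cleaned_files/$NEWDIR';

-- A later job can read the whole tree in one statement; LOAD paths are
-- handed to Hadoop, which descends into subdirectories and expands globs.
all_logs = LOAD 'cleaned_files' USING PigStorage('\t');
jan_only = LOAD 'cleaned_files/2011/01/*/*' USING PigStorage('\t');
```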
Cheers,
Kris

On Fri, Jan 28, 2011 at 01:27:54PM -0800, Dmitriy Ryaboy wrote:
> It's a pain to rename everything, especially since the number of renames
> grows every day. You'll stress out the namenode at some point.
>
> I am not sure why loading data back out of 8760 distinct directories is
> worse than 8760 distinct files. There is no real difference.
>
> That's what we do at Twitter, fwiw, and that's also what the standard setup
> for Hive logs is. Can you explain in greater detail what your objection is
> if this doesn't work for you?
>
> D
>
> On Fri, Jan 28, 2011 at 9:11 AM, Kris Coward <[email protected]> wrote:
>
> > I want to flatten things at least a little, since I'm looking for
> > year-long trends in logfiles that are rotated hourly (and loading the
> > data back out of 8760 distinct directories isn't my idea of a good
> > time).
> >
> > Any reason that moving/renaming the part-nnnn files wouldn't work?
> >
> > Thanks,
> > Kris
> >
> > On Thu, Jan 27, 2011 at 05:57:32PM -0800, Dmitriy Ryaboy wrote:
> > > Kris,
> > > As logs accumulate over time the union will get slow since you have
> > > to read all the data off disk and write it back to disk.
> > >
> > > Why not just have a hierarchy in your cleaned log directory? You can
> > > do something like
> > >
> > >     define newdir `date +%s`
> > >     store newclean into 'cleaned_files/$newdir/'
> > >
> > > then to load all logs you can just load 'cleaned_files'
> > >
> > > You can also format the date output differently and wind up with your
> > > cleaned files nicely organized by year/month/day/hour/ ...
> > >
> > > D
> > >
> > > On Thu, Jan 27, 2011 at 4:40 PM, Kris Coward <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'm writing a bit of code to grab some logfiles, parse them, and
> > > > run some sanity checks on them (before subjecting them to further
> > > > analysis). Naturally, logfiles being logfiles, they accumulate, and
> > > > I was wondering how efficiently pig would handle a request to add
> > > > recently accumulated log data to a bit of logfile that's already
> > > > been started.
> > > >
> > > > In particular, two approaches that I'm contemplating are
> > > >
> > > >     raw = LOAD 'logfile' ...
> > > >     -- snipped parsing/cleaning steps producing a relation with
> > > >     -- alias "cleanfile"
> > > >     oldclean = LOAD 'existing_log';
> > > >     newclean = UNION oldclean, cleanfile;
> > > >     STORE newclean INTO 'tmp_log';
> > > >     rm existing_log;
> > > >     mv tmp_log existing_log;
> > > >
> > > > ...ALTERNATELY...
> > > >
> > > >     raw = LOAD 'logfile' ...
> > > >     -- snipped parsing/cleaning steps producing a relation with
> > > >     -- alias "cleanfile"
> > > >     STORE cleanfile INTO 'tmp_log';
> > > >
> > > > followed by renumbering all the part files in tmp_log and copying
> > > > them to existing_log.
> > > >
> > > > Is pig clever enough to handle the first set of instructions
> > > > reasonably efficiently (and if not, are there any gotchas I'd have
> > > > to watch out for with the second approach, e.g. a catalogue file
> > > > that'd have to be edited when the new parts are added)?
> > > >
> > > > Thanks,
> > > > Kris
> > > >
> > > > --
> > > > Kris Coward               http://unripe.melon.org/
> > > > GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
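A hedged sketch of the part-file renumbering step from the quoted message, assuming the standard `hadoop fs` CLI; the paths are hypothetical, and a plain HDFS output directory has no catalogue file to edit, so only name collisions need handling:

```shell
# Format the next part-file name from a zero-based index,
# matching Hadoop's part-nnnnn naming convention.
next_part() {
  printf 'part-%05d' "$1"
}

# Hypothetical usage against HDFS (requires the hadoop CLI;
# existing_log and tmp_log are illustrative paths):
#   i=$(hadoop fs -ls existing_log | grep -c 'part-')
#   for f in $(hadoop fs -ls tmp_log | awk '/part-/ {print $NF}'); do
#       hadoop fs -mv "$f" "existing_log/$(next_part "$i")"
#       i=$((i + 1))
#   done
```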
