As always, a million thanks.

2010/12/1 Dmitriy Ryaboy <[email protected]>
> 1) Pig (and hadoop) uses bash-style globbing. You can see the details here:
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
>
> 2) Records are processed in a pipeline -- a record is read in, passed
> through all the operators in the given stage (map or reduce), and the
> output written to disk for the next stage to pick up. So if you load and
> then filter, the pipeline will be load->filter, and records will be
> discarded as they are read in, which I think is the behavior you are
> asking for.
>
> -D
>
> On Wed, Dec 1, 2010 at 7:57 AM, Jonathan Coveney <[email protected]> wrote:
>
> > In order to facilitate more robust loading, I have 2 questions.
> >
> > 1) I know that you can use some wildcards in loading... for example, if
> > you have 2 files, dog1.txt and dog2.txt, you can load dog*.txt and it
> > will load more. Is there any way to use regular expressions or anything
> > more powerful in the actual load? For example, if I want to load 10
> > different files with a generally similar name structure but identically
> > structured data, what's the easiest and fastest way to load them all
> > into the same table?
> >
> > 2) Can you filter as you load? If you do a load then a filter right
> > after that, it seems wasteful (unless pig/hadoop are smart enough to
> > realize that it doesn't have to load all the data off the bat)
> >
> > I appreciate your help
> > Jon
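
For anyone reading the thread later, here is a rough Python sketch of the load->filter pipeline described in point 2 above. It is only an analogy, not how Pig or Hadoop is implemented: the `load` and `keep` functions, the `FILES` dictionary, and the tab-separated name/age records are all made up for illustration, and Python's `fnmatch` only approximates the shell-style wildcards from point 1.

```python
from fnmatch import fnmatch


def load(files, pattern):
    """Yield records one at a time from every file whose name matches the glob."""
    for name, lines in files.items():
        if fnmatch(name, pattern):
            for line in lines:
                yield line.split("\t")


def keep(records, predicate):
    """Pass matching records downstream as they arrive; drop the rest."""
    for record in records:
        if predicate(record):
            yield record


# Hypothetical in-memory stand-in for files in a directory.
FILES = {
    "dog1.txt": ["rex\t3", "fido\t7"],
    "dog2.txt": ["spot\t5"],
    "cat1.txt": ["tom\t4"],
}

# load -> filter: each record is read, tested, and kept or discarded
# before the next one is read, so the full input is never held at once.
pipeline = keep(load(FILES, "dog*.txt"), lambda r: int(r[1]) > 4)
result = list(pipeline)
print(result)
```

Because both stages are generators, nothing is read until `list(pipeline)` pulls records through, which mirrors the point that a filter placed right after a load does not require materializing the whole input first.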
