Re: Can you filter and load at the same time?

Jonathan Coveney Wed, 01 Dec 2010 08:57:59 -0800

As always, a million thanks.

2010/12/1 Dmitriy Ryaboy <[email protected]>


> 1) Pig (and Hadoop) uses bash-style globbing. You can see the details here:
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
>
> 2) Records are processed in a pipeline -- a record is read in, passed
> through all the operators in the given stage (map or reduce), and the
> output is written to disk for the next stage to pick up. So if you load
> and then filter, the pipeline will be load->filter, and records will be
> discarded as they are read in, which I think is the behavior you are
> asking for.
>
> -D
>
> On Wed, Dec 1, 2010 at 7:57 AM, Jonathan Coveney <[email protected]>
> wrote:
>
> > In order to facilitate more robust loading, I have 2 questions.
> >
> > 1) I know that you can use some wildcards in loading... for example, if
> > you have 2 files, dog1.txt and dog2.txt, you can load dog*.txt and it
> > will load both. Is there any way to use regular expressions or anything
> > more powerful in the actual load? For example, if I want to load 10
> > different files with a generally similar name structure but identically
> > structured data, what's the easiest and fastest way to load them all
> > into the same table?
> > 2) Can you filter as you load? If you do a load then a filter right after
> > that, it seems wasteful (unless pig/hadoop are smart enough to realize
> > that it doesn't have to load all the data off the bat).
> >
> > I appreciate your help
> > Jon
> >
>
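
For anyone reading this thread later, the two points in the reply can be illustrated with a small sketch. This is not Pig or Hadoop code: it is a plain Python analogy, assuming a hypothetical in-memory FILES table, tab-separated name/age records, and Python's fnmatch as a stand-in for Hadoop's glob matching (Hadoop's globStatus also accepts {a,b} alternation, which fnmatch does not). The helper names match_files, load, keep and run_pipeline are made up for illustration.

```python
import fnmatch
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical in-memory "files": name -> list of tab-separated records.
FILES: Dict[str, List[str]] = {
    "dog1.txt": ["rex\t3", "fido\t7"],
    "dog2.txt": ["spot\t1", "lassie\t9"],
    "cat1.txt": ["tom\t5"],
}


def match_files(names: Iterable[str], pattern: str) -> List[str]:
    """Return the names that match a shell-style glob, sorted."""
    return sorted(n for n in names if fnmatch.fnmatchcase(n, pattern))


def load(files: Dict[str, List[str]], selected: Iterable[str]) -> Iterator[str]:
    """Yield records one at a time; nothing is buffered up front."""
    for name in selected:
        for record in files[name]:
            yield record


def keep(records: Iterable[str], predicate: Callable[[str], bool]) -> Iterator[str]:
    """Pass through only the records accepted by the predicate."""
    for record in records:
        if predicate(record):
            yield record


def run_pipeline(pattern: str, min_age: int) -> List[str]:
    """Chain load -> filter; each record is dropped or kept as it is read."""
    selected = match_files(FILES, pattern)
    records = load(FILES, selected)
    return list(keep(records, lambda r: int(r.split("\t")[1]) >= min_age))


if __name__ == "__main__":
    print(match_files(FILES, "dog*.txt"))
    print(run_pipeline("dog*.txt", 5))
```

Because load and keep are generators, a rejected record is discarded as soon as it is read, which mirrors the load->filter pipeline described in the reply; only the final list() call collects the survivors.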
