I agree with the pre-processing step, BUT if the data is big (i.e. pound
signs scattered over terabytes), you could load each line into a relation
as a single chararray field first, filter out the comment lines, and then
split what is left into columns. I have many similar cases where the
default loader won't handle something, and I have been using this 'design
pattern'. Something like:

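-- Load each whole line as one chararray (TextLoader does not split on tabs).
A = LOAD 'yourfile' USING TextLoader() AS (data:chararray);
-- Drop the lines whose first character is '#'.
B = FILTER A BY SUBSTRING(data, 0, 1) != '#';
-- Split the remaining lines into columns; SOMETOKENIZEUDF is a placeholder.
C = FOREACH B GENERATE SOMETOKENIZEUDF(data) AS ( .. your columns...);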

I've become a big fan of the Python UDFs, and you could easily use one as
your own 'loader' in the third step above.
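
A minimal sketch of what such a Python UDF might look like for the
two-column sample in the quoted message. The function name `tokenize`, the
field names `sender_id` and `recipient_id`, and the schema string are
illustrative assumptions, not something from the original post. The
`outputSchema` decorator is assumed to be supplied by Pig when the file is
registered as a Jython UDF script; the fallback only lets the file run as
plain Python for local testing.

```python
# Sketch of a tokenizer UDF for the third step (a stand-in for SOMETOKENIZEUDF).

# Assumption: Pig's Jython UDF support provides an outputSchema decorator.
# If it is not defined (plain Python), fall back to a no-op decorator.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("t:tuple(sender_id:int, recipient_id:int)")
def tokenize(line):
    """Split one whitespace-separated line into (sender_id, recipient_id).

    Returns None for any line that does not hold exactly two integers.
    """
    if line is None:
        return None
    parts = line.split()
    if len(parts) != 2:
        return None
    try:
        return (int(parts[0]), int(parts[1]))
    except ValueError:
        return None
```

Because malformed lines come back as None, a stray comment line that slips
past the FILTER step yields a null rather than a failed job.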

I will not vouch for the efficiency of the approach.

On Tue, Jun 7, 2011 at 3:12 PM, <[email protected]> wrote:

> Can you stream it through
>
>  grep -v '^#'
>
> ?
>
> William F Dowling
>
> Sr Technical Specialist, Software Engineering
>
> Thomson Reuters
>
> 0 +1 215 823 3853
>
>
>
> From: Moore, Michael A. [mailto:[email protected]]
> Sent: Tuesday, June 07, 2011 3:04 PM
> To: [email protected]
> Subject: Loading Files with Comment Lines
>
>
>
> Hello all-
>
>
>
> I've got a quick question and Google isn't proving to be much help.
>
>
>
> I've got a big file, that has a few lines in it prefaced with a pound sign
> (#) to indicate they are to be ignored.  I would like to LOAD this file
> using PigStorage.  Is there a way to do this, or is it handled
> automatically?
>
>
>
> The data might look something like this:
>
> # Data Source: Project A
> # Contact MMoore with Questions
> # SenderId      RecipientId
> 1          2
> 3          5
> 6          7
> #2        1
> 3          6
> 11        7
>
>
> Thanks!
>
> -Michael
>
>
>
> ______________________________________
>
> Michael Moore :: [email protected] <mailto:[email protected]>
>
> The Johns Hopkins University Applied Physics Laboratory
>
> 0B7B17EE1AE2A80B pgp
>
> BC31 A861 9726 8211 F79F 7E21 0B7B 17EE 1AE2 A80B pgp fingerprint
>
>
>
>
>
>
