I agree with the pre-processing step. But if the data is big (i.e. pound signs scattered over terabytes), you could first load everything into a relation as a single chararray field, filter out the comment lines, and then split what is left into columns. I have many similar cases where the default loader won't handle something, and I have been using this 'design pattern'. Something like:

    A = LOAD 'yourfile' AS (data:chararray);
    B = FILTER A BY SUBSTRING(data,0,1) != '#';
    C = FOREACH B GENERATE SOMETOKENIZEUDF(data) AS ( .. your columns...);

I've become a big fan of the Python UDFs, and you could easily use them as your own 'loader' in the third step above. I will not vouch for the efficiency of the approach.

On Tue, Jun 7, 2011 at 3:12 PM, <[email protected]> wrote:

> Can you stream it through
>
> grep -v '^#'
>
> ?
>
> William F Dowling
> Sr Technical Specialist, Software Engineering
> Thomson Reuters
> 0 +1 215 823 3853
>
> From: Moore, Michael A. [mailto:[email protected]]
> Sent: Tuesday, June 07, 2011 3:04 PM
> To: [email protected]
> Subject: Loading Files with Comment Lines
>
> Hello all-
>
> I've got a quick question and Google isn't proving to be much help.
>
> I've got a big file, that has a few lines in it prefaced with a pound sign
> (#) to indicate they are to be ignored. I would like to LOAD this file
> using PigStorage. Is there a way to do this, or is it handled
> automatically?
>
> The data might look something like this:
>
> # Data Source: Project A
> # Contact MMoore with Questions
> # SenderId RecipientId
> 1 2
> 3 5
> 6 7
> #2 1
> 3 6
> 11 7
>
> Thanks!
> -Michael
>
> ______________________________________
> Michael Moore :: [email protected] <mailto:[email protected]>
> The Johns Hopkins University Applied Physics Laboratory
> 0B7B17EE1AE2A80B pgp
> BC31 A861 9726 8211 F79F 7E21 0B7B 17EE 1AE2 A80B pgp fingerprint
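
For the tokenizing UDF in the third step above, here is a minimal sketch of what the Python side might look like. It assumes whitespace-separated integer columns, as in the sample data in the thread. The names `tokenize` and `num_fields` are made up for illustration, and how the function gets registered with Pig and how its output schema is declared depend on your Pig version, so that part is left out.

```python
def tokenize(line, num_fields=2):
    """Split one raw line into a tuple of ints, or return None if it is malformed.

    Hypothetical stand-in for SOMETOKENIZEUDF; assumes whitespace-delimited
    integer columns (e.g. SenderId, RecipientId).
    """
    if line is None:
        return None
    parts = line.split()  # splits on any run of spaces or tabs
    if len(parts) != num_fields:
        return None  # wrong number of columns
    try:
        return tuple(int(p) for p in parts)
    except ValueError:
        return None  # non-numeric field, e.g. a stray '#2'
```

Returning None for malformed lines rather than raising means one bad record does not kill the whole job. Those rows can then be dropped with another FILTER after step C.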
