I guess you could use globbing to select the files by extension, like this:

$ ls
input.avro  input.txt
$ cat input.avro
avro1
avro2
$ cat input.txt
txt1
txt2
[cloudera@localhost workpig]$ pig -x local
2012-06-15 17:21:09,613 [main] INFO org.apache.pig.Main - Logging error messages to: /home/cloudera/workpig/pig_1339766469585.log
2012-06-15 17:21:09,892 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt> txt = LOAD '*.txt';
grunt> avro = LOAD '*.avro';
grunt> result = UNION txt, avro;
grunt> DUMP result;
(txt1)
(txt2)
(avro1)
(avro2)

Please note that the input.avro file in this example is plain text, not actual Avro. For real Avro files you'll need to specify an Avro loader in the corresponding LOAD statement.

Ruslan

On Fri, Jun 15, 2012 at 4:52 PM, Johannes Schwenk <[email protected]> wrote:
> Hi Ruslan,
>
> thanks for your answer!
>
> I have only the input path, but do not know which file format the
> different files in that path have. All files in the path belong to
> one relation, however, so I want to load them at once. A union of
> separately loaded files would be OK too, if that is possible to
> achieve. What matters is that the LOAD automatically takes care of the
> different formats.
>
> To illustrate further, consider the following scenario:
>
> 1. Our logging system writes log data to LOG_PATH.
> 2. The current format is tab-separated values.
> 3. We LOAD '$LOG_PATH'
> 4. We switch to Avro format and have to migrate.
> 5. The migration cannot happen instantly, so at some point in time
> some files in LOG_PATH may still be in TSV format while others have
> already been switched to Avro.
>
> Thanks,
> Johannes
>
> On 15.06.2012 14:37, Ruslan Al-Fakikh wrote:
>> Hi Johannes,
>>
>> I guess you'd have to write a custom Loader for such a situation, but
>> why do you need to load everything in one pass? You can load different
>> types of files separately (having multiple LOAD statements) and make a
>> join or a union afterwards.
>>
>> Ruslan
>>
>> On Fri, Jun 15, 2012 at 4:13 PM, Johannes Schwenk
>> <[email protected]> wrote:
>>> Hi all,
>>>
>>> is it possible to have an input path (as parameter to a LOAD statement)
>>> that contains several files in *different formats* - say serialized Avro
>>> data and tab-separated values - and make Pig read the data into one alias?
>>> I guess I have to write a UDF for this? How should I start? Can you
>>> sketch out a rough plan on how to proceed?
>>>
>>>
>>> Greetings,
>>> Johannes Schwenk
>>>
>>> --
>>> Software Developer (Reporting)
>>> ________________________________________________________
>>>
>>> ADITION technologies AG
>>> Schwarzwaldstraße 78b
>>> 79117 Freiburg
>>>
>>> http://www.adition.com
>>>
>>> T +49 / (0)761 / 88147 - 30
>>> F +49 / (0)761 / 88147 - 77
>>> SUPPORT +49 / (0)1805 - ADITION
>>>
>>> (landline rate 14 ct/min; mobile rates at most 42 ct/min)
>>>
>>> Registered at Amtsgericht Düsseldorf under HRB 54076
>>> Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
>>> Chairman of the Supervisory Board: Daniel Raimer, attorney-at-law
>>> VAT ID No.: DE 218 858 434
>>>
>>
>
> Johannes Schwenk

--
Best Regards,
Ruslan Al-Fakikh
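
P.S. For real Avro data, the two LOAD statements would each need an explicit loader. Below is a rough sketch only, meant to be run as a script with -param LOG_PATH=... rather than typed into grunt. It assumes the piggybank AvroStorage loader is available; the jar path, the *.txt / *.avro naming, and the field names (ts, message) are placeholders I made up for illustration, not anything from your setup.

```pig
-- Placeholder path; AvroStorage also needs the Avro and JSON jars it depends on.
REGISTER /path/to/piggybank.jar;

-- Tab-separated files: PigStorage with an explicit tab delimiter.
-- The schema (ts, message) is hypothetical.
txt = LOAD '$LOG_PATH/*.txt' USING PigStorage('\t')
      AS (ts:chararray, message:chararray);

-- Avro files: the loader reads the schema from the files themselves.
avro = LOAD '$LOG_PATH/*.avro'
       USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- ONSCHEMA lines fields up by name, so this assumes the Avro records
-- use the same field names as the TSV schema above.
result = UNION ONSCHEMA txt, avro;
DUMP result;
```

This only works while the two formats can be told apart by file name. If they cannot, a custom loader is probably still the way to go.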
