Thanks a lot, Ruslan, that seems like one possible direction! One thing remains to be resolved: I don't know whether the input will contain Avro, CSV, TSV, or all of them... So how can I keep Pig from choking on missing input files?
Johannes

On 15.06.2012 15:24, Ruslan Al-Fakikh wrote:
> I guess you could use globbing to pick out the files by extension,
> like this:
> $ ls
> input.avro input.txt
> $ cat input.avro
> avro1
> avro2
> $ cat input.txt
> txt1
> txt2
>
> [cloudera@localhost workpig]$ pig -x local
> 2012-06-15 17:21:09,613 [main] INFO org.apache.pig.Main - Logging
> error messages to: /home/cloudera/workpig/pig_1339766469585.log
> 2012-06-15 17:21:09,892 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to hadoop file system at: file:///
> grunt> txt = LOAD '*.txt';
> grunt> avro = LOAD '*.avro';
> grunt> result = UNION txt, avro;
> grunt> DUMP result;
> (txt1)
> (txt2)
> (avro1)
> (avro2)
>
> Please note that the input.avro file in this example is not actually
> Avro, so for real Avro data you'll need to use the Avro loader in the
> LOAD statement.
>
> Ruslan
>
> On Fri, Jun 15, 2012 at 4:52 PM, Johannes Schwenk
> <[email protected]> wrote:
>> Hi Ruslan,
>>
>> thanks for your answer!
>>
>> I have only the input path, but I do not know which file formats the
>> different files in that path have. All files in the path belong to
>> one relation, however, so I want to load them at once. A union of
>> separately loaded files would be OK too, if that is possible to
>> achieve. What matters is that the LOAD automatically takes care of
>> the different formats.
>>
>> To illustrate further, consider the following scenario:
>>
>> 1. Our logging system writes log data to LOG_PATH.
>> 2. The current format is tab-separated values.
>> 3. We LOAD '$LOG_PATH'
>> 4. We switch to Avro format and have to migrate.
>> 5. The migration cannot happen instantly, so at some point in time
>> some files in LOG_PATH may still be in TSV format while others have
>> already been switched to Avro.
>>
>> Thanks,
>> Johannes
>>
>> On 15.06.2012 14:37, Ruslan Al-Fakikh wrote:
>>> Hi Johannes,
>>>
>>> I guess you'd have to write a custom Loader for such a situation, but
>>> why do you need to load everything in one pass? You can load different
>>> types of files separately (having multiple LOAD statements) and make a
>>> join or a union afterwards.
>>>
>>> Ruslan
>>>
>>> On Fri, Jun 15, 2012 at 4:13 PM, Johannes Schwenk
>>> <[email protected]> wrote:
>>>> Hi all,
>>>>
>>>> is it possible to have an input path (as parameter to a LOAD statement)
>>>> that contains several files in *different formats* - say, serialized
>>>> Avro data and tab-separated values - and make Pig read the data into
>>>> one alias? I guess I have to write a UDF for this? How should I start?
>>>> Can you sketch out a rough plan on how to proceed?
>>>>
>>>>
>>>> Greetings,
>>>> Johannes Schwenk

Johannes Schwenk

--
Software Developer (Reporting)
________________________________________________________

ADITION technologies AG
Schwarzwaldstraße 78b
79117 Freiburg

http://www.adition.com

T +49 / (0)761 / 88147 - 30
F +49 / (0)761 / 88147 - 77
SUPPORT +49 / (0)1805 - ADITION

(landline rate 14 ct/min; mobile rates at most 42 ct/min)

Registered at Amtsgericht Düsseldorf under HRB 54076
Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus Schlüter
Chairman of the Supervisory Board: Daniel Raimer, attorney
VAT ID No.: DE 218 858 434
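On the open question of globs that match nothing: one possible workaround is to check which file types are present before Pig starts, and to generate LOAD statements only for those. The sketch below is not from the thread and rests on several assumptions: files can be told apart by extension (.avro, .tsv, .csv), the input directory is on the local file system as in the `pig -x local` session above (on HDFS the check would have to go through `hadoop fs` instead), Avro files are read with piggybank's AvroStorage with the needed jars already registered, and the relations are compatible enough to UNION. The function names `has_files` and `build_pig_script` are hypothetical.

```shell
#!/bin/sh
# Sketch: emit Pig LOAD statements only for the file types that exist.
# Assumptions (not from the thread): formats are recognizable by extension,
# the directory is local, and piggybank's AvroStorage is registered.

# has_files DIR EXT: succeeds if DIR contains at least one *.EXT file.
has_files() {
    for f in "$1"/*."$2"; do
        # An unmatched glob stays literal, so test that the path exists.
        [ -e "$f" ] && return 0
    done
    return 1
}

# build_pig_script DIR: print one LOAD per format found, then combine them
# into the alias "result". Returns 1 if no known format is present.
build_pig_script() {
    dir=$1
    aliases=""
    for ext in avro tsv csv; do
        has_files "$dir" "$ext" || continue
        case $ext in
            avro) loader="org.apache.pig.piggybank.storage.avro.AvroStorage()" ;;
            tsv)  loader="PigStorage('\\t')" ;;
            csv)  loader="PigStorage(',')" ;;
        esac
        printf "%s = LOAD '%s/*.%s' USING %s;\n" "$ext" "$dir" "$ext" "$loader"
        aliases="${aliases:+$aliases, }$ext"
    done
    case $aliases in
        "")  echo "no .avro, .tsv or .csv files in $dir" >&2; return 1 ;;
        *,*) printf 'result = UNION %s;\n' "$aliases" ;;
        # UNION needs at least two relations, so pass a single one through.
        *)   printf 'result = FOREACH %s GENERATE *;\n' "$aliases" ;;
    esac
}
```

A wrapper could call `build_pig_script "$LOG_PATH" > load.pig` and put the generated statements in front of the rest of the analysis script, so the remaining Pig code only ever refers to `result`, regardless of which formats were present during the migration.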
