Hey,

You can keep a single empty file per format. That way Pig won't fail when a glob would otherwise match nothing. But in general I recommend avoiding situations that need hacks or custom formats. In my experience you'll soon get into trouble with them.
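
A rough sketch of the placeholder idea, in case it helps. It assumes the input directory is on the local filesystem (as in the `pig -x local` session quoted below); the function name `ensure_placeholder` and the file name `placeholder` are made up for illustration:

```shell
#!/bin/sh
# Sketch only: make sure every glob used in a LOAD statement matches at least
# one file by creating an empty placeholder when none exists.
# ensure_placeholder and the file name "placeholder" are hypothetical.

ensure_placeholder() {
    dir=$1   # input directory (local filesystem in this sketch)
    ext=$2   # file extension without the dot, e.g. txt or avro

    # If the glob matches nothing, the shell leaves the pattern unexpanded,
    # so the -e test fails and we fall through to the placeholder creation.
    for f in "$dir"/*."$ext"; do
        if [ -e "$f" ]; then
            return 0
        fi
    done

    # Create a zero-length file. The name must not start with "_" or ".",
    # because Hadoop's FileInputFormat treats such files as hidden and skips them.
    : > "$dir/placeholder.$ext"
}

# Example calls (the path is a placeholder, adjust to your setup):
#   ensure_placeholder /path/to/logs txt
#   ensure_placeholder /path/to/logs avro
```

On HDFS the last step would be `hadoop fs -touchz` on the placeholder path instead of the redirect. Whether your Avro loader accepts a zero-length file is worth checking for your version before relying on this.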
Thanks

On Fri, Jun 15, 2012 at 5:39 PM, Johannes Schwenk <[email protected]> wrote:
> Thanks a lot Ruslan, that seems like one possible direction!
>
> One thing remains to be resolved: I don't know whether I will get
> Avro in the input, or CSV, TSV, or all of them... So how could I get Pig
> not to choke on missing input files?
>
> Johannes
>
> On 15.06.2012 15:24, Ruslan Al-Fakikh wrote:
>> I guess you could use globbing for extracting the files by extensions,
>> like this:
>> $ ls
>> input.avro input.txt
>> $ cat input.avro
>> avro1
>> avro2
>> $ cat input.txt
>> txt1
>> txt2
>>
>> [cloudera@localhost workpig]$ pig -x local
>> 2012-06-15 17:21:09,613 [main] INFO org.apache.pig.Main - Logging
>> error messages to: /home/cloudera/workpig/pig_1339766469585.log
>> 2012-06-15 17:21:09,892 [main] INFO
>> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
>> Connecting to hadoop file system at: file:///
>> grunt> txt = LOAD '*.txt';
>> grunt> avro = LOAD '*.avro';
>> grunt> result = UNION txt, avro;
>> grunt> DUMP result;
>> (txt1)
>> (txt2)
>> (avro1)
>> (avro2)
>>
>> Please note that the input.avro file in this example is not actually
>> Avro; for real Avro files you'll need to use the Avro loader in the
>> LOAD statement.
>>
>> Ruslan
>>
>> On Fri, Jun 15, 2012 at 4:52 PM, Johannes Schwenk
>> <[email protected]> wrote:
>>> Hi Ruslan,
>>>
>>> thanks for your answer!
>>>
>>> I have only the input path, but do not know which file formats the
>>> different files in that path have. All files in the path belong to
>>> one relation, however, so I want to load them at once. A union of
>>> separately loaded files would be OK too, if that is possible to
>>> achieve. What matters is that the LOAD automatically takes care of
>>> the different formats.
>>>
>>> To illustrate further, consider the following scenario:
>>>
>>> 1. Our logging system writes log data to LOG_PATH.
>>> 2. The current format is tab-separated values.
>>> 3. We LOAD '$LOG_PATH'
>>> 4. We switch to Avro format and have to migrate.
>>> 5. The migration cannot happen instantly, so at some point in time
>>> some files in LOG_PATH might still be in TSV format while others
>>> have already been switched to Avro.
>>>
>>> Thanks,
>>> Johannes
>>>
>>> On 15.06.2012 14:37, Ruslan Al-Fakikh wrote:
>>>> Hi Johannes,
>>>>
>>>> I guess you'd have to write a custom Loader for such a situation, but
>>>> why do you need to load everything in one pass? You can load different
>>>> types of files separately (with multiple LOAD statements) and do a
>>>> join or a union afterwards.
>>>>
>>>> Ruslan
>>>>
>>>> On Fri, Jun 15, 2012 at 4:13 PM, Johannes Schwenk
>>>> <[email protected]> wrote:
>>>>> Hi all,
>>>>>
>>>>> is it possible to have an input path (as parameter to a LOAD statement)
>>>>> that contains several files in *different formats* - say serialized Avro
>>>>> data and tab-separated values - and make Pig read the data into one alias?
>>>>> I guess I have to write a UDF for this? How should I start? Can you
>>>>> sketch out a rough plan for how to proceed?
>>>>>
>>>>>
>>>>> Greetings,
>>>>> Johannes Schwenk
>>>>>
>>>>> --
>>>>> Software Developer (Reporting)
>>>>> ________________________________________________________
>>>>>
>>>>> ADITION technologies AG
>>>>> Schwarzwaldstraße 78b
>>>>> 79117 Freiburg
>>>>>
>>>>> http://www.adition.com
>>>>>
>>>>> T +49 / (0)761 / 88147 - 30
>>>>> F +49 / (0)761 / 88147 - 77
>>>>> SUPPORT +49 / (0)1805 - ADITION
>>>>>
>>>>> (landline rate 14 ct/min; mobile rates at most 42 ct/min)
>>>>>
>>>>> Registered at the Düsseldorf District Court under HRB 54076
>>>>> Executive Board: Andreas Kleiser, Jörg Klekamp, Tihomir Perkovic, Marcus
>>>>> Schlüter
>>>>> Chairman of the Supervisory Board: Daniel Raimer, attorney-at-law
>>>>> VAT ID No.: DE 218 858 434
>>>>>

--
Best Regards,
Ruslan Al-Fakikh
