Hi Markus,

I am currently working on almost the same task, but in Hive. In Hive you can
use the native Avro+Hive integration
(https://issues.apache.org/jira/browse/HIVE-895), or the haivvreo project if
you are not using the latest version of Hive. Hive also has a Dynamic
Partition feature that can separate your data by a column's value.
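For reference, a dynamic-partition insert in Hive looks roughly like the
sketch below. The table and column names (`results`, `aggregated_values`,
`key`, `value`) are invented for illustration; the two SET options are the
standard switches that enable dynamic partitioning.

```
-- Sketch only; table and column names are invented for illustration.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hive creates one directory (partition) per distinct key value under the
-- table's location, e.g. .../results/key=foo/.
INSERT OVERWRITE TABLE results PARTITION (key)
SELECT value, key FROM aggregated_values;
```

Note that the dynamic partition column (`key`) must come last in the SELECT
list.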
As for HCatalog, I decided against it after some investigation, because:
1) it is still incubating, and
2) it is not supported by Cloudera (the distribution provider we are
currently using).

I think it would be perfect if MultiStorage were generic in the way you
described, but I am not familiar with it.

Ruslan

On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[email protected]> wrote:
> I am not aware of any work on adding those features to MultiStorage.
>
> I think the best way to do this is to use HCatalog. (It makes the Hive
> metastore available to all of Hadoop, so you get metadata for your data as
> well.)
> You can associate an outputformat+serde with a table (instead of a file
> name ending), and HCatStorer will automatically pick the right format.
>
> Thanks,
> Thejas
>
>
> On 6/28/12 2:17 AM, Markus Resch wrote:
>>
>> Thanks Thejas,
>>
>> This _really_ helped a lot :)
>> Some additional questions on this:
>> As far as I can see, MultiStorage is currently only capable of writing
>> CSV output, right? Is there any ongoing effort to make this storage more
>> generic regarding the format of the output data? For our needs we would
>> require Avro output, as well as a special proprietary binary encoding for
>> which we have already created our own storage. I'm thinking of a storage
>> that selects a certain writer method depending on the file name's ending.
>>
>> Do you know of any such efforts?
>>
>> Thanks
>>
>> Markus
>>
>>
>> Am Freitag, den 22.06.2012, 11:23 -0700 schrieb Thejas Nair:
>>>
>>> You can use the MultiStorage store func:
>>>
>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>>>
>>> Or, if you want something more flexible that gives you metadata as
>>> well, use HCatalog. Specify the keys on which you want to partition as
>>> the partition keys of the table, then use HCatStorer() to store the
>>> data.
>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
>>>
>>> Thanks,
>>> Thejas
>>>
>>>
>>> On 6/22/12 4:54 AM, Markus Resch wrote:
>>>>
>>>> Hey everyone,
>>>>
>>>> We're doing some aggregation. The result contains a key, and we want
>>>> a single output file for each key. Is it possible to store files like
>>>> this, in particular adjusting the path by the key's value?
>>>>
>>>> Example:
>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
>>>> [.... doing stuff....]
>>>> Output = GROUP AggregatesValues BY Key;
>>>> FOREACH Output STORE * INTO
>>>> '/my/output/path/by/$Output.Key/Result.avro'
>>>>
>>>> I know this example does not work, but is anything similar possible?
>>>> And if, as I assume, it is not: is there some framework in the Hadoop
>>>> world that can do such things?
>>>>
>>>> Thanks
>>>>
>>>> Markus
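For the archive: Markus's goal of one output directory per key value can be
approximated with the MultiStorage store func mentioned above, assuming
plain delimited text output is acceptable (it cannot write Avro). The paths
and relation names below are invented for illustration.

```
-- Sketch only; paths and relation names are invented for illustration.
REGISTER piggybank.jar;

Input = LOAD 'my/data.avro'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- ... doing stuff, producing AggregatesValues with the key in field 0 ...

-- MultiStorage writes one subdirectory per distinct value of the indexed
-- field (here field 0, the key), e.g. /my/output/path/by/<keyvalue>/.
STORE AggregatesValues INTO '/my/output/path/by'
      USING org.apache.pig.piggybank.storage.MultiStorage(
          '/my/output/path/by', '0');
```

The first constructor argument repeats the parent output path, and the
second is the index of the field to split on; optional further arguments
control compression and the field delimiter.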
