On a different topic, I'm interested in why you refuse to use a project in the incubator. Incubation is the Apache process by which a community is built around the code. It says nothing about the maturity of the code.
Alan.

On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:

> Hi Markus,
>
> Currently I am doing almost the same task. But in Hive.
> In Hive you can use the native Avro+Hive integration:
> https://issues.apache.org/jira/browse/HIVE-895
> Or haivvreo project if you are not using the latest version of Hive.
> Also there is a Dynamic Partition feature in Hive that can separate
> your data by a column value.
>
> As for HCatalog - I refused to use it after some investigation, because:
> 1) It is still incubating
> 2) It is not supported by Cloudera (the distribution provider we are
> currently using)
>
> I think it would be perfect if MultiStorage would be generic in the
> way you described, but I am not familiar with it.
>
> Ruslan
>
> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[email protected]> wrote:
>> I am not aware of any work on adding those features to MultiStorage.
>>
>> I think the best way to do this is to use Hcatalog. (It makes the hive
>> metastore available for all of hadoop, so you get metadata for your data as
>> well).
>> You can associate a outputformat+serde for a table (instead of file name
>> ending), and HCatStorage will automatically pick the right format.
>>
>> Thanks,
>> Thejas
>>
>>
>>
>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>>
>>> Thanks Thejas,
>>>
>>> This _really_ helped a lot :)
>>> Some additional question on this:
>>> As far as I see, the MultiStorage is currently just capable to write CSV
>>> output, right? Is there any attempt ongoing currently to make this
>>> storage more generic regarding the format of the output data? For our
>>> needs we would require AVRO output as well as some special proprietary
>>> binary encoding for which we already created our own storage. I'm
>>> thinking about a storage that will select a certain writer method
>>> depending to the file names ending.
>>>
>>> Do you know of such efforts?
>>>
>>> Thanks
>>>
>>> Markus
>>>
>>>
>>> On Friday, 22.06.2012, at 11:23 -0700, Thejas Nair wrote:
>>>>
>>>> You can use MultiStorage store func -
>>>>
>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>>>>
>>>> Or if you want something more flexible, and have metadata as well, use
>>>> hcatalog . Specify the keys on which you want to partition as your
>>>> partition keys in the table. Then use HcatStorer() to store the data.
>>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
>>>>
>>>> Thanks,
>>>> Thejas
>>>>
>>>>
>>>>
>>>> On 6/22/12 4:54 AM, Markus Resch wrote:
>>>>>
>>>>> Hey everyone,
>>>>>
>>>>> We're doing some aggregation. The result contains a key where we want to
>>>>> have a single output file for each key. Is it possible to store files
>>>>> like this? Especially adjusting the path by the key's value.
>>>>>
>>>>> Example:
>>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
>>>>> [.... doing stuff....]
>>>>> Output = GROUP AggregatesValues BY Key;
>>>>> FOREACH Output Store * into
>>>>> '/my/output/path/by/$Output.Key/Result.avro'
>>>>>
>>>>> I know this example does not work. But is there anything similar
>>>>> possible? And, as I assume, not: is there some framework in the hadoop
>>>>> world that can do such stuff?
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>
>>>
>>
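
For reference, here is a minimal sketch of the MultiStorage approach suggested in the quoted thread, showing how a per-key output directory could be produced from an aggregation. The jar path, the input file, and the relation and field names (`raw`, `key`, `value`, `agg`) are assumptions for illustration and do not come from the original messages. As noted in the thread, MultiStorage writes delimited text, so this does not cover the Avro output Markus asked about.

```pig
-- Hypothetical location of the piggybank jar; adjust to your installation.
REGISTER /path/to/piggybank.jar;

-- Hypothetical tab-separated input with a grouping key and a numeric value.
raw = LOAD 'my/data.tsv' AS (key:chararray, value:long);

-- Aggregate per key; the key ends up as field 0 of each output tuple.
grouped = GROUP raw BY key;
agg = FOREACH grouped GENERATE group AS key, SUM(raw.value) AS total;

-- MultiStorage(parentPath, splitFieldIndex): the value of field 0 (the key)
-- selects the subdirectory under the parent path for each tuple.
STORE agg INTO '/my/output/path'
    USING org.apache.pig.piggybank.storage.MultiStorage('/my/output/path', '0');
```

Each distinct key value should get its own subdirectory under `/my/output/path`. The split field is passed as a zero-based index in string form, so the field to partition on has to be placed at a known position in the final relation.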
