"It would give me the list of datasets in one place accessible from all tools,"
And that's exactly why you want it.

D

On Mon, Jul 2, 2012 at 5:57 AM, Ruslan Al-Fakikh <[email protected]> wrote:
> Hey Alan,
>
> I am not familiar with Apache processes, so I could be wrong in my
> point 1; I am sorry.
> Basically, my impression was that Cloudera is pushing the Avro format for
> intercommunication between Hadoop tools like Pig, Hive and MapReduce:
> https://ccp.cloudera.com/display/CDHDOC/Avro+Usage
> http://www.cloudera.com/blog/2011/07/avro-data-interop/
> And if I decide to use Avro, then HCatalog becomes a little redundant.
> It would give me the list of datasets in one place, accessible from all
> tools, but all the columns (names and types) would be stored in Avro
> schemas, and the Hive metastore becomes just a stub for those Avro schemas:
> https://github.com/jghoman/haivvreo#creating-avro-backed-hive-tables
> And having those Avro schemas, I could access the data from Pig and
> MapReduce without HCatalog. Though I haven't figured out how to get by
> without Hive partitions yet.
>
> Best Regards,
> Ruslan
>
> On Fri, Jun 29, 2012 at 9:13 PM, Alan Gates <[email protected]> wrote:
>> On a different topic, I'm interested in why you refuse to use a project in
>> the incubator. Incubation is the Apache process by which a community is built
>> around the code. It says nothing about the maturity of the code.
>>
>> Alan.
>>
>> On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:
>>
>>> Hi Markus,
>>>
>>> Currently I am doing almost the same task, but in Hive.
>>> In Hive you can use the native Avro+Hive integration:
>>> https://issues.apache.org/jira/browse/HIVE-895
>>> or the haivvreo project if you are not using the latest version of Hive.
>>> There is also a Dynamic Partition feature in Hive that can separate
>>> your data by a column value.
>>>
>>> As for HCatalog, I refused to use it after some investigation, because:
>>> 1) It is still incubating
>>> 2) It is not supported by Cloudera (the distribution provider we are
>>> currently using)
>>>
>>> I think it would be perfect if MultiStorage were generic in the
>>> way you described, but I am not familiar with it.
>>>
>>> Ruslan
>>>
>>> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair <[email protected]> wrote:
>>>> I am not aware of any work on adding those features to MultiStorage.
>>>>
>>>> I think the best way to do this is to use HCatalog. (It makes the Hive
>>>> metastore available to all of Hadoop, so you get metadata for your data as
>>>> well.)
>>>> You can associate an OutputFormat+SerDe with a table (instead of a file-name
>>>> ending), and HCatStorer will automatically pick the right format.
>>>>
>>>> Thanks,
>>>> Thejas
>>>>
>>>>
>>>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>>>>
>>>>> Thanks Thejas,
>>>>>
>>>>> This _really_ helped a lot :)
>>>>> Some additional questions on this:
>>>>> As far as I can see, MultiStorage is currently only capable of writing CSV
>>>>> output, right? Is there any ongoing effort to make this
>>>>> storage more generic regarding the format of the output data? For our
>>>>> needs we would require Avro output as well as a special proprietary
>>>>> binary encoding for which we have already created our own storage. I'm
>>>>> thinking of a storage that selects a certain writer method
>>>>> depending on the file name's ending.
>>>>>
>>>>> Do you know of such efforts?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Markus
>>>>>
>>>>>
>>>>> On Friday, 22.06.2012 at 11:23 -0700, Thejas Nair wrote:
>>>>>>
>>>>>> You can use the MultiStorage store func:
>>>>>>
>>>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>>>>>>
>>>>>> Or, if you want something more flexible and have metadata as well, use
>>>>>> HCatalog.
>>>>>> Specify the keys on which you want to partition as your
>>>>>> partition keys in the table. Then use HCatStorer() to store the data.
>>>>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
>>>>>>
>>>>>> Thanks,
>>>>>> Thejas
>>>>>>
>>>>>>
>>>>>> On 6/22/12 4:54 AM, Markus Resch wrote:
>>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> We're doing some aggregation. The result contains a key, and we want to
>>>>>>> have a single output file for each key. Is it possible to store files
>>>>>>> like this, especially adjusting the path by the key's value?
>>>>>>>
>>>>>>> Example:
>>>>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
>>>>>>> [.... doing stuff....]
>>>>>>> Output = GROUP AggregatesValues BY Key;
>>>>>>> FOREACH Output Store * into
>>>>>>> '/my/output/path/by/$Output.Key/Result.avro'
>>>>>>>
>>>>>>> I know this example does not work. But is anything similar
>>>>>>> possible? And, as I assume it is not: is there a framework in the Hadoop
>>>>>>> world that can do such things?
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Markus
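[Editor's sketch of the MultiStorage route discussed above, in Pig Latin. The relation, field and path names are hypothetical, and it assumes piggybank.jar is on the classpath; note that MultiStorage emits delimited text, not Avro, which is exactly the limitation Markus raises.]

```pig
REGISTER piggybank.jar;

-- Hypothetical input: records of (key, value)
Input = LOAD 'my/data' AS (key:chararray, value:long);

-- Aggregate per key
AggregatedValues = FOREACH (GROUP Input BY key)
                   GENERATE group AS key, SUM(Input.value) AS total;

-- MultiStorage(parentDir, splitFieldIndex): writes one subdirectory per
-- distinct value of field 0, e.g. /my/output/path/by/<key>/<key>-0,000
STORE AggregatedValues INTO '/my/output/path/by'
    USING org.apache.pig.piggybank.storage.MultiStorage('/my/output/path/by', '0');
```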

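[Editor's sketch of the HCatalog route Thejas suggests, under the assumptions that the HCatalog 0.4 jars are registered and a table named 'aggregates' already exists in the Hive metastore with 'key' declared as a partition column; all names are hypothetical.]

```pig
-- Hypothetical input: records of (key, value)
A = LOAD 'my/data' AS (key:chararray, value:long);

AggregatedValues = FOREACH (GROUP A BY key)
                   GENERATE group AS key, SUM(A.value) AS total;

-- With no arguments, HCatStorer takes the partition values from the data
-- (dynamic partitioning), giving one partition directory per key value,
-- in whatever storage format the table was declared with.
STORE AggregatedValues INTO 'aggregates'
    USING org.apache.hcatalog.pig.HCatStorer();
```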