Re: Best Practice: store depending on data content

Alan Gates Fri, 29 Jun 2012 10:14:22 -0700

On a different topic, I'm interested in why you refuse to use a project in the 
incubator.  Incubation is the Apache process by which a community is built around 
the code.  It says nothing about the maturity of the code.


Alan.

On Jun 28, 2012, at 10:59 AM, Ruslan Al-Fakikh wrote:

> Hi Markus,
> 
> I am currently doing almost the same task, but in Hive.
> In Hive you can use the native Avro+Hive integration:
> https://issues.apache.org/jira/browse/HIVE-895
> Or use the haivvreo project if you are not using the latest version of Hive.
> Also there is a Dynamic Partition feature in Hive that can separate
> your data by a column value.
> 
> As for HCatalog - I refused to use it after some investigation, because:
> 1) It is still incubating
> 2) It is not supported by Cloudera (the distribution provider we are
> currently using)
> 
> I think it would be perfect if MultiStorage were generic in the
> way you described, but I am not familiar with it.
> 
> Ruslan
> 
> On Thu, Jun 28, 2012 at 9:27 PM, Thejas Nair wrote:
>> I am not aware of any work on adding those features to MultiStorage.
>> 
>> I think the best way to do this is to use HCatalog. (It makes the Hive
>> metastore available to all of Hadoop, so you get metadata for your data as
>> well.)
>> You can associate an OutputFormat+SerDe with a table (instead of a file name
>> ending), and HCatStorer will automatically pick the right format.
>> 
>> Thanks,
>> Thejas
>> 
>> 
>> 
>> On 6/28/12 2:17 AM, Markus Resch wrote:
>>> 
>>> Thanks Thejas,
>>> 
>>> This _really_ helped a lot :)
>>> An additional question on this:
>>> As far as I can see, MultiStorage is currently only capable of writing CSV
>>> output, right? Is there any ongoing effort to make this
>>> storage more generic regarding the format of the output data? For our
>>> needs we would require Avro output as well as a special proprietary
>>> binary encoding for which we have already created our own storage. I'm
>>> thinking about a storage that selects a writer method
>>> depending on the file name's ending.
>>> 
>>> Do you know of such efforts?
>>> 
>>> Thanks
>>> 
>>> Markus
>>> 
>>> 
>>> On Friday, 22.06.2012, 11:23 -0700, Thejas Nair wrote:
>>>> 
>>>> You can use MultiStorage store func -
>>>> 
>>>> http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
>>>> 
>>>> Or if you want something more flexible, and want metadata as well, use
>>>> HCatalog. Specify the keys on which you want to partition as your
>>>> partition keys in the table. Then use HCatStorer() to store the data.
>>>> See http://incubator.apache.org/hcatalog/docs/r0.4.0/index.html
>>>> 
>>>> Thanks,
>>>> Thejas
>>>> 
>>>> 
>>>> 
>>>> On 6/22/12 4:54 AM, Markus Resch wrote:
>>>>> 
>>>>> Hey everyone,
>>>>> 
>>>>> We're doing some aggregation. The result contains a key, and we want to
>>>>> have a single output file for each key value. Is it possible to store files
>>>>> like this, in particular adjusting the path by the key's value?
>>>>> 
>>>>> Example:
>>>>> Input = LOAD 'my/data.avro' USING AvroStorage;
>>>>> [.... doing stuff....]
>>>>> Output = GROUP AggregatesValues BY Key;
>>>>> FOREACH Output Store * into
>>>>> '/my/output/path/by/$Output.Key/Result.avro'
>>>>> 
>>>>> I know this example does not work. But is anything similar
>>>>> possible? And if, as I assume, it is not: is there some framework in the
>>>>> Hadoop world that can do this?
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Markus
>>>>> 
>>>>> 
>>> 
>>> 
>>
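
The MultiStorage suggestion quoted above might look roughly like the
following in Pig Latin. This is only a minimal sketch: the jar path, the
input location, and the relation and field names are hypothetical, and the
input is assumed to be tab-delimited text rather than Avro.

```pig
-- Assumption: piggybank.jar is available at this (hypothetical) path.
REGISTER /path/to/piggybank.jar;

-- Hypothetical aggregated relation; field 0 is the key to split on.
Aggregates = LOAD 'my/aggregates' USING PigStorage('\t')
             AS (key:chararray, total:long);

-- The first argument repeats the output path; the second is the index of
-- the field whose value names the output subdirectory.
STORE Aggregates INTO '/my/output/path/by'
    USING org.apache.pig.piggybank.storage.MultiStorage('/my/output/path/by', '0');
```

With this, each distinct key value should end up in its own subdirectory
under /my/output/path/by. The output is delimited text, not Avro, which is
the limitation discussed in the thread.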
