Re: Avoiding serialization/de-serialization in pig

2010-06-30 Thread Thejas Nair
On 6/28/10 5:51 PM, "Dmitriy Ryaboy" wrote:
> I have a feeling that propagating schemas when known, and using them for
> (de)serialization instead of reflecting every field, would also be a big
> win.
>
> Thoughts on just using Avro for the internal PigStorage?

When I profiled pig quer

Re: Avoiding serialization/de-serialization in pig

2010-06-30 Thread Jeff Zhang
Agreed. I have compared the performance of Hive and Pig using some simple scripts, and Hive performs much better than Pig. The mapper task time of Pig and Hive is almost the same; the time difference is almost entirely caused by the reduce task, and much time is spent on transfer time from ma

Re: Avoiding serialization/de-serialization in pig

2010-06-29 Thread Alan Gates
On Jun 28, 2010, at 5:51 PM, Dmitriy Ryaboy wrote:
> For what it's worth, I saw very significant speed improvements (order of magnitude for wide tables with few projected columns) when I implemented (2) for our protocol buffer-based loaders. I have a feeling that propagating schemas when k

Re: Avoiding serialization/de-serialization in pig

2010-06-28 Thread Russell Jurney
I don't fully understand the repercussions of this, but I like it. We're moving from our VoldemortStorage stuff to Avro and it would be great to pipe Avro all the way through.

Russ

On Mon, Jun 28, 2010 at 5:51 PM, Dmitriy Ryaboy wrote:
> For what it's worth, I saw very significant speed improv

Re: Avoiding serialization/de-serialization in pig

2010-06-28 Thread Dmitriy Ryaboy
For what it's worth, I saw very significant speed improvements (order of magnitude for wide tables with few projected columns) when I implemented (2) for our protocol buffer-based loaders. I have a feeling that propagating schemas when known, and using them for (de)serialization instead of re

Avoiding serialization/de-serialization in pig

2010-06-28 Thread Thejas Nair
I have created a wiki page that puts together some ideas that can help improve performance by avoiding or delaying serialization/de-serialization: http://wiki.apache.org/pig/AvoidingSedes These are ideas that don't involve changes to the optimizer. Most of them involve changes in the load/store functi