[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-09-08 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907280#action_12907280
 ] 

Doug Cutting commented on PIG-794:
--

Jeff, please use current trunk or the 1.4.0 build that I expect to be 
released tomorrow (http://people.apache.org/~cutting/avro-1.4.0-rc4/) instead.  
A bug in the snapshot you're using caused a similar failure.  That bug should 
only appear in multi-threaded applications, which I doubt yours is, but it's 
better to test against either trunk or a release so we don't chase ghosts.

Further, while debugging a DatumWriter and DatumReader, you might use a 
ValidatingEncoder and ValidatingDecoder to ensure that what you write and read 
conforms to your schema.  You might also test by reading and printing your data 
with GenericDatumReader to see that you've written what you meant to write.  If 
you've written data that does not conform to your declared schema then it 
cannot be read correctly.  If this is the case, we should attempt to improve 
the error message here.


 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-09-07 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906871#action_12906871
 ] 

Doug Cutting commented on PIG-794:
--

Jeff, what version of Avro are you using?




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-09-07 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907049#action_12907049
 ] 

Jeff Zhang commented on PIG-794:


Doug, I am using avro trunk revision 988779




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-09-06 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906671#action_12906671
 ] 

Jeff Zhang commented on PIG-794:


Dmitriy,

In my patch I convert InternalMap into an Avro array whose elements are records 
with two fields (one for the key and one for the value).
But it throws a strange exception, and I don't know what's wrong with my code:


{code}
Exception in thread "main" java.lang.NullPointerException
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:86)
    at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:121)
    at org.apache.pig.impl.io.avro.PigDataRecordReader.readRecord(PigDataRecordReader.java:77)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:106)
    at org.apache.pig.impl.io.avro.PigDataRecordReader.readRecord(PigDataRecordReader.java:66)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:106)
    at org.apache.avro.generic.GenericDatumReader.readArray(GenericDatumReader.java:184)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:108)
    at org.apache.pig.impl.io.avro.PigDataRecordReader.readRecord(PigDataRecordReader.java:81)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:106)
    at org.apache.avro.generic.GenericDatumReader.readArray(GenericDatumReader.java:184)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:108)
    at org.apache.pig.impl.io.avro.PigDataRecordReader.readRecord(PigDataRecordReader.java:83)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:106)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:97)
    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:198)
    at org.apache.avro.file.DataFileStream.next(DataFileStream.java:185)
    at org.apache.pig.impl.io.avro.PigData.main(PigData.java:224)

{code}
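
The map-as-array encoding described in this comment can be sketched in plain Java, independent of Avro. This is only an illustration of the data shape, not the code from the patch: `MapAsArray`, `Entry`, `toEntries`, and `fromEntries` are hypothetical names, and in an Avro version each `Entry` would correspond to a record with a key field and a value field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MapAsArray {
    /** One array element: a record holding a key and its value. */
    static final class Entry {
        final Object key;
        final Object value;

        Entry(Object key, Object value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Flattens a map with arbitrary object keys into a list of key/value records. */
    static List<Entry> toEntries(Map<Object, Object> map) {
        List<Entry> entries = new ArrayList<>(map.size());
        for (Map.Entry<Object, Object> e : map.entrySet()) {
            entries.add(new Entry(e.getKey(), e.getValue()));
        }
        return entries;
    }

    /** Rebuilds the map from the list of key/value records, keeping their order. */
    static Map<Object, Object> fromEntries(List<Entry> entries) {
        Map<Object, Object> map = new LinkedHashMap<>();
        for (Entry e : entries) {
            map.put(e.key, e.value);
        }
        return map;
    }
}
```

Because the keys travel as ordinary values inside the records, this shape avoids the string-only key restriction of a map, at the cost of losing direct lookup until the map is rebuilt on the reading side.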




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-09-02 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905612#action_12905612
 ] 

Dmitriy V. Ryaboy commented on PIG-794:
---

Doug and Scott will know better of course, but afaik, Avro doesn't support 
Object keys.

You can cheat and turn Object keys into strings by Base64-encoding their 
serialized representations.  You'd have to know to reverse the process when 
deserializing, though.

Or we can try to get rid of InternalMap.
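
The Base64 workaround mentioned above might look roughly like the following sketch. It assumes the keys implement `java.io.Serializable` and uses Java serialization purely for illustration; keys that are Hadoop Writables would use their own write and read methods instead. `KeyCodec`, `encodeKey`, and `decodeKey` are hypothetical names, and `java.util.Base64` requires Java 8 or later, which is newer than this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Base64;

final class KeyCodec {
    /** Serializes the key and returns its bytes as a Base64 string. */
    static String encodeKey(Serializable key) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    /** Reverses encodeKey: decodes the Base64 string and deserializes the key. */
    static Object decodeKey(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The reader has to know that the string keys are encoded objects, which is the "reverse the process" caveat from the comment; nothing in the stored data marks them as such.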




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-09-02 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905663#action_12905663
 ] 

Doug Cutting commented on PIG-794:
--

Some quick comments on the new patch:
  - you might define a Java enum type for the union elements, using 
Enum#ordinal() for the union indexes
  - instead of name.equals("union"), s.getType()==Type.UNION would be faster, 
but better yet would be to simply call read() recursively, since it already 
handles unions, no?
  - peekArray() can simply return null, and that might be faster
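
The first suggestion, an enum whose ordinals double as union indexes, could be sketched as below. The branch names and their order are hypothetical and would have to match the order of the branches in the actual union schema; `UnionBranches` is an illustrative name, not part of the patch.

```java
final class UnionBranches {
    /** Constants are declared in the same order as the branches of the union schema. */
    enum Branch { NULL, BOOLEAN, INT, LONG, FLOAT, DOUBLE, STRING, BYTES, TUPLE, BAG, MAP }

    // values() allocates a new array on each call, so keep one copy for lookups.
    private static final Branch[] BRANCHES = Branch.values();

    /** The union index to write for a given branch. */
    static int indexOf(Branch branch) {
        return branch.ordinal();
    }

    /** The branch that corresponds to a union index read from the data. */
    static Branch fromIndex(int index) {
        return BRANCHES[index];
    }
}
```

With Avro, a writer would pass `indexOf(branch)` to `Encoder.writeIndex(int)` and a reader would map the result of `Decoder.readIndex()` back through `fromIndex`, so the code can switch on named constants rather than bare integers.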






[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-09-02 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905775#action_12905775
 ] 

Dmitriy V. Ryaboy commented on PIG-794:
---

Jeff, that's what I am saying -- since they are writables, we can turn them 
into strings and not need InternalMap at all.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904551#action_12904551
 ] 

Jeff Zhang commented on PIG-794:


I did some experiments with Avro; AvroStorage_2.patch has the detailed 
implementation.

Here I use Avro as the data storage between MapReduce jobs, replacing 
InterStorage, which has already been optimized compared to BinStorage. 
I use a simple Pig script that is translated into 2 MapReduce jobs:
{code}
a = load '/a.txt';
b = load '/b.txt';
c = join a by $0, b by $0;
d = group c by $0;
dump d;
{code}

The following table shows my experiment results (1 master + 3 slaves):
|| Storage || Time spent on job_1 || Output size of job_1 || Mapper task number of job_2 || Time spent on job_2 || Total time spent on Pig script ||
| AvroStorage | 5min 57 sec | 7.97G | 120 | 16min 50 sec | 22min 47 sec |
| InterStorage | 4min 33 sec | 9.55G | 143 | 17min 17 sec | 21min 50 sec |

The experiment shows that AvroStorage has a more compact format than 
InterStorage (according to the output size of job_1) but more serialization 
overhead (according to the time spent on job_1). I think job_2 takes less time 
with AvroStorage than with InterStorage because its input (the output of 
job_1) is smaller, 7.97G versus 9.55G, so it needs fewer mapper tasks.

Overall, AvroStorage is not as good as expected.
One reason may be that I am not using Avro's API correctly (I hope the Avro 
guys can review my code); another may be that Avro's serialization performance 
is not that good.
BTW, I use Avro trunk.


 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904615#action_12904615
 ] 

Dmitriy V. Ryaboy commented on PIG-794:
---

Jeff, have you checked out Scott Carey's work here: 
https://issues.apache.org/jira/browse/AVRO-592 ?

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904674#action_12904674
 ] 

Scott Carey commented on PIG-794:
-

AVRO-592 creates an AvroStorage class for writing and reading M/R inputs and 
outputs but does not deal with intermediate M/R output.  I have some updates to 
that in progress that simplify it more.   Some aspects may be re-usable for 
this too.   

One thing to note is that Avro cannot be completely optimal for intermediate 
M/R output because the Hadoop API for this has a performance flaw that prevents 
efficient use of buffers and input/output streams there.  This would affect 
InterStorage as well though.

I'll take a look at the patch here and see if I can find any performance 
optimizations.
Note that there are still several performance optimizations left to do in Avro 
itself.  For example, the BinaryDecoder has been optimized, but not the Encoder 
yet.

Also, I'm somewhat blocked with AVRO-592 due to lack of Pig 0.7 maven 
availability. 



 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904680#action_12904680
 ] 

Scott Carey commented on PIG-794:
-

So, a quick summary of the differences I can see:

h5. Schema usage:
This creates a 'generic' Avro schema that can be used for any pig data.  Each 
field in a Tuple is a Union of all possible pig types, and each Tuple is a list 
of fields.  It does not preserve the field names or types -- these are not 
important for intermediate data anyway.

AVRO-592 translates the Pig schema into a specific Avro schema that persists 
the field names and types, so that
STORE foo INTO 'file' USING AvroStorage();
will create a file that
foo2 = LOAD 'file' USING AvroStorage();
will be able to use to re-create the exact schema in a script.

h5. Serialization and Deserialization:
This uses the same style as Avro's GenericRecord, which traverses the schema on 
the fly and writes fields for each record.

AVRO-592 constructs a state machine for each specific schema to optimally 
traverse a Tuple to serialize a record or create a Tuple when deserializing.  
This should be faster but the code is definitely harder to read (but easy to 
unit test -- AVRO-592 has 98% unit test code coverage on that portion).


Integrating these should not be too hard.  I'll try and put my latest version 
of AVRO-592 up there late today or tomorrow.




 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904687#action_12904687
 ] 

Doug Cutting commented on PIG-794:
--

A few comments about the attached code:
 - is there a reason you don't subclass GenericDatumReader and 
GenericDatumWriter, overriding readRecord() and writeRecord()?  That would 
simplify things and better guarantee that you're conforming to a schema.  
Currently, e.g., your writeMap() doesn't appear to write a valid Avro map, 
writeArray() doesn't write a valid Avro array, etc., so the data written is not 
interoperable.
 - my guess is that a lot of time is spent in findSchemaIndex().  if that's 
right, you might improve this in various ways, e.g.:
 -- sort this by the most common types.  the order in Pig's DataType.java is 
probably a good one.
 -- try using a static Map<Class,Integer> cache of indexes
 - have you run this under a profiler?
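
The two findSchemaIndex() suggestions above, ordering by the most common types and caching indexes per class, can be combined as in this sketch. It is not the patch's code: `SchemaIndexCache` and the branch order in `BRANCH_TYPES` are hypothetical, plain Java classes stand in for Pig's types, and null values are assumed to be handled before the lookup.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class SchemaIndexCache {
    // Hypothetical union branch order, with the most common types first.
    private static final List<Class<?>> BRANCH_TYPES = Arrays.asList(
        String.class, Integer.class, Long.class, Double.class, Float.class, byte[].class);

    // Static cache from a datum's class to its union index.
    private static final Map<Class<?>, Integer> CACHE = new ConcurrentHashMap<>();

    /** Returns the union index for a non-null datum, scanning only on the first miss. */
    static int findSchemaIndex(Object datum) {
        return CACHE.computeIfAbsent(datum.getClass(), SchemaIndexCache::scan);
    }

    /** Slow path: linear scan over the branch types in declaration order. */
    private static Integer scan(Class<?> type) {
        for (int i = 0; i < BRANCH_TYPES.size(); i++) {
            if (BRANCH_TYPES.get(i).isAssignableFrom(type)) {
                return i;
            }
        }
        throw new IllegalArgumentException("No union branch for " + type.getName());
    }
}
```

After the first lookup for a class, later lookups are a single hash-map read. Whether that actually removes the bottleneck is something only the profiler run suggested above can confirm.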

I don't see where this specifies an Avro schema for Pig data.  It's possible to 
construct a generic schema for all Pig data.  In this, a Bag should be a record 
with a single field, an array of Tuples.  A Tuple should be a record with a 
single field, an array of a union of all types.  Given such a schema, one could 
then write a DatumReader/Writer using the control logic of Pig's 
DataReaderWriter (i.e., a switch based on the value of DataType.findType()), 
but, instead of calling DataInput/Output methods, use Encoder/Decoder methods 
with a ValidatingEncoder (at least while debugging) to ensure you conform to 
that schema.

Alternately, in Avro 1.4 (snapshot in Maven now, release this week, hopefully) 
Avro arrays can be arbitrary Collection implementations.  Bag already 
implements all of the required Collection methods -- clear(), add(), size(), 
iterator() -- so there's no reason I can see for Bag not to implement 
Collection<Tuple>.  So then one could subclass GenericData, GenericDatumReader 
and GenericDatumWriter, overriding:

{code}
// Treat both Tuples and Bags as Avro records.
protected boolean isRecord(Object datum) {
  return datum instanceof Tuple || datum instanceof Bag;
}

protected void writeRecord(Schema schema, Object datum, Encoder out)
    throws IOException {
  // A Tuple is written as the list of its fields; a Bag is written as-is.
  if (TUPLE_NAME.equals(schema.getFullName()))
    datum = ((Tuple) datum).getAll();
  writeArray(schema.getFields().get(0).schema(), datum, out);
}

protected Object readRecord(Object old, Schema expected, ResolvingDecoder in)
    throws IOException {
  Object result;
  if (TUPLE_NAME.equals(expected.getFullName())) {
    old = new ArrayList();
    result = new Tuple(old);  // sketch: assumes a Tuple that wraps a List
  } else {
    old = result = new Bag();
  }
  // Both records have a single field: an array.
  readArray(old, expected.getFields().get(0).schema(), in);
  return result;
}
{code}
   
Finally, if you knew the schema for the dataset being processed, rather than 
using a fully-general Pig schema, then you could translate that schema to an 
Avro schema.  This schema in most cases would not likely have a huge, 
compute-intensive-to-write union in it.  Or you might use something like what 
Scott's proposed in AVRO-592.


 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-07-02 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884577#action_12884577
 ] 

Jeff Zhang commented on PIG-794:


We can leverage AvroInputFormat and AvroOutputFormat in Avro trunk (see 
AVRO-493).

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-03-19 Thread Allen Wittenauer (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847594#action_12847594
 ] 

Allen Wittenauer commented on PIG-794:
--

What is the latest on getting Avro support in pig?

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-03-19 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847606#action_12847606
 ] 

Alan Gates commented on PIG-794:


It depends on what you mean by support.  As far as Pig using Avro for 
serialization between Map and Reduce and MR jobs, we haven't done anything on 
that front lately.  Last time we tested the performance was comparable to our 
own BinStorage so we weren't motivated to move yet.  Now that Avro has matured 
a bit maybe we should test again.

As far as using Avro to store user data, with Pig 0.7 it should become quite 
easy to write Avro load and store functions.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-03-19 Thread Jeff Hammerbacher (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847607#action_12847607
 ] 

Jeff Hammerbacher commented on PIG-794:
---

bq. Last time we tested the performance was comparable to our own BinStorage so 
we weren't motivated to move yet.

Hey Alan,

There should be benefits to using Avro besides just performance. Either way, 
looking forward to seeing you on the Avro lists when you decide to test again!

Regards,
Jeff

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-03-19 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847613#action_12847613
 ] 

Alan Gates commented on PIG-794:


Jeff,

Beyond performance what do you see as the big wins of using Avro?  I'm just 
thinking here of moving data between MR jobs in a Pig script and between Map 
and Reduce phases.  I see lots of advantages to users using Avro to store their 
data.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-03-19 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847614#action_12847614
 ] 

Dmitriy V. Ryaboy commented on PIG-794:
---

I'll take a crack at it.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-07-10 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729700#action_12729700
 ] 

Alan Gates commented on PIG-794:


I agree with Doug's comments that it's better to use an API to build the schema 
that will give us compile time checking.  I think it will also (hopefully) be 
easier to figure out the schema when reading the code, as it will avoid the 
need to read JSON directly.

I have a general question on the approach.  This is a direct port of Pig's 
BinStorage to use Avro, including the writing of indicator bytes for types.  I 
do not have a deep knowledge of Avro.  But I had assumed that since it was a 
de/serialization framework with types, part of what it would provide was type 
recognition.  That is, can't this code rely on Avro to set the type for it?  Do 
we need to be writing those indicator bytes ourselves?  Perhaps this is the 
same comment that Doug is making about using GenericDatumReader and addField.

In response to Hong's comment, the sync marks are vulnerable as you point out.  
But the loader needs some way to find a proper starting place when it's handed 
any block but the initial block of a file.  I wonder if we could create a new 
sync type.  It would always consist of a 100 byte marker (say the first 25 
prime numbers, or the first 25 digits of pi or something).  We could then write 
a tuple with that sync type every 1000 records in the data.  A loader that 
doesn't start at position 0 could then seek to the first sync type it finds 
before it begins reading.  All loaders would read past the end of their 
assigned range until they see a sync type.
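
The sync-marker proposal above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not an existing Pig or Avro format: the marker is taken to be the first 25 primes written as big-endian 4-byte ints (100 bytes in total), the data is an in-memory byte array rather than a file split, and `SyncMarker`, `marker`, and `findSync` are hypothetical names.

```java
import java.nio.ByteBuffer;

final class SyncMarker {
    /** Builds the 100-byte marker: the first 25 primes as big-endian 4-byte ints. */
    static byte[] marker() {
        ByteBuffer buf = ByteBuffer.allocate(100);
        int found = 0;
        for (int n = 2; found < 25; n++) {
            if (isPrime(n)) {
                buf.putInt(n);
                found++;
            }
        }
        return buf.array();
    }

    private static boolean isPrime(int n) {
        for (int d = 2; d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return n >= 2;
    }

    /** Returns the offset of the first marker at or after 'from', or -1 if there is none. */
    static int findSync(byte[] data, int from, byte[] marker) {
        outer:
        for (int i = Math.max(from, 0); i + marker.length <= data.length; i++) {
            for (int j = 0; j < marker.length; j++) {
                if (data[i + j] != marker[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
```

A loader handed a range that does not start at position 0 would call `findSync` from its start offset and begin reading records after the marker it finds. A fixed marker like this can still collide with record data, which is the vulnerability the comment acknowledges.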

As for this being compatible with non-Pig apps, that isn't the purpose of 
this AvroStorage function.  This is for Pig to pass data between MR jobs for 
itself.  Having a tool-independent storage format is a bigger project, as it 
requires agreeing on things like sync marks, how to represent different Avro 
objects, etc.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Fix For: 0.2.0

 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-06-24 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12723812#action_12723812
 ] 

Alan Gates commented on PIG-794:


PIG-734 has been committed.  This will allow this patch to simplify its 
handling of maps to match avro maps, since Pig maps now only allow strings as 
keys.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Fix For: 0.2.0

 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-23 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712397#action_12712397
 ] 

Hong Tang commented on PIG-794:
---

- It appears that the code added a three-byte sync-mark \1\2\3 before every 
tuple. 
- There is no escaping of sync-mark collisions in user code. 
- The introduction of the sync mark also defeats the purpose of using Avro in 
the first place (sharing a common serialization format).
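
To illustrate the collision concern, here is a small sketch.  It assumes a 
made-up record layout (the three sync bytes followed by a 4-byte big-endian 
int); the layout used by the actual patch may differ, and the class name is 
hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo, not taken from the patch. */
final class SyncCollisionDemo {
    static final byte[] SYNC = {1, 2, 3};

    /** Writes each value as SYNC followed by a 4-byte big-endian int (assumed layout). */
    static byte[] write(int... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * (SYNC.length + 4));
        for (int v : values) {
            buf.put(SYNC);
            buf.putInt(v);
        }
        return buf.array();
    }

    /** Returns every offset at which the three sync bytes appear, payload included. */
    static List<Integer> syncOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i + SYNC.length <= data.length; i++) {
            if (data[i] == SYNC[0] && data[i + 1] == SYNC[1] && data[i + 2] == SYNC[2]) {
                offsets.add(i);
            }
        }
        return offsets;
    }
}
```

Calling syncOffsets(write(7, 0x00010203)) finds candidates at offsets 0, 7 and 
11 for only two records: the payload of the second record contains the bytes 
1, 2, 3, so a reader that resynchronizes on the marker could start in the 
middle of a value.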

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Fix For: 0.2.0

 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-14 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12709608#action_12709608
 ] 

Doug Cutting commented on PIG-794:
--

Looking at the patch, I have a few questions and remarks:
 - Why not name the records Tuple and Bag instead of T and B?  The 
names are not written in the data, so there's little advantage to shorter names.
 - Why not, instead of parsing the schema from JSON, construct the schema using 
the Java Schema API?  Then you would not need to walk the schema afterwards to 
find union indexes, and you'd get compile-time API checking rather than 
potential load-time JSON parse errors.
 - Why not extend GenericDatumReader and override newRecord() to create either 
a Bag or a Tuple, then override addField() to add values to either a bag or 
tuple?  This would make the patch much smaller, and potentially permit you to 
eventually take advantage of GenericDatumReader features like projection and 
object reuse.
 - Finally, since you're using a pre-release version of Avro, you should 
probably name the jar with the subversion revision number.  Also note that, 
since Avro is not yet stable, it should not yet be used for persistent data in 
production systems.


 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Fix For: 0.2.0

 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-13 Thread Rakesh Setty (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12709078#action_12709078
 ] 

Rakesh Setty commented on PIG-794:
--

Olga,

I reproduced the issue. Schema parsing is failing on the Avro side: the Avro 
codebase has changed, which is causing this issue. I will work with the Avro 
team to understand the changes that I need to make for AvroStorage.

Thanks,
Rakesh

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Fix For: 0.2.0

 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-13 Thread Rakesh Setty (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12709081#action_12709081
 ] 

Rakesh Setty commented on PIG-794:
--

I also noticed that the Avro codebase supports only Strings for maps. As 
mentioned earlier I have modified the AvroStorage to have key as any object. 
Does Avro need to have only strings as map keys? 

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Fix For: 0.2.0

 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-12 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12708701#action_12708701
 ] 

Olga Natkovich commented on PIG-794:


I integrated the latest patch and ran the unit tests. All the AVRO unit tests 
failed with the following stack trace:

Could not initialize class org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.AvroTupleSchema
java.lang.NoClassDefFoundError: Could not initialize class org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.AvroTupleSchema
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TupleAvroWriter.writeDatum(AvroStorage.java:359)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TupleAvroWriter.writeTuple(AvroStorage.java:408)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TupleAvroWriter.write(AvroStorage.java:353)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.AvroStorage.putNext(AvroStorage.java:571)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:121)
    at org.apache.pig.backend.local.executionengine.LocalPigLauncher.runPipeline(LocalPigLauncher.java:129)
    at org.apache.pig.backend.local.executionengine.LocalPigLauncher.launchPig(LocalPigLauncher.java:102)
    at org.apache.pig.test.TestAvroStorage.store(TestAvroStorage.java:117)
    at org.apache.pig.test.TestAvroStorage.testLoadStoreComplexDataWithNull(TestAvroStorage.java:178)



 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Fix For: 0.2.0

 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-08 Thread Rakesh Setty (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707570#action_12707570
 ] 

Rakesh Setty commented on PIG-794:
--

The new patch has unit tests. The comments are already in javadoc format. 
Please let me know if I have missed any.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Fix For: 0.2.0

 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-08 Thread Rakesh Setty (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707576#action_12707576
 ] 

Rakesh Setty commented on PIG-794:
--

There was one important change I had to make in AvroStorage to the Avro format to 
get it working. The map keys were stored as String objects. I had to change it 
so that both key and value can be Object instances. Please let me know if this 
is an issue.

Thanks,
Rakesh

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Fix For: 0.2.0

 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-07 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12707063#action_12707063
 ] 

Olga Natkovich commented on PIG-794:


Rakesh,

Could you please convert the comments to javadoc and add unit tests before we 
commit the code? Thanks

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-06 Thread Rakesh Setty (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706473#action_12706473
 ] 

Rakesh Setty commented on PIG-794:
--

The attached jar files should go to the lib directory.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: avro-0.1-dev-java.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Rakesh Setty (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706144#action_12706144
 ] 

Rakesh Setty commented on PIG-794:
--

While trying to address the comment about eliminating the AvroValueReader, I 
noticed that the way pos (the current position in the stream) is being handled 
is wrong. The position in the stream can only be tracked by the ValueReader 
(Avro codebase), because Avro stores data in a non-standard way, that is, 
without using DataOutput's methods. For example, an integer can take anywhere 
from 1 to 5 bytes, while a long can take anywhere from 1 to 10 bytes. 
I think we have to ask the Avro team to support this (current position in the 
stream) for us to proceed with this. 
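
For readers wondering where the 1 to 5 and 1 to 10 byte ranges come from, here 
is a minimal sketch.  It assumes the variable-length zig-zag coding described 
in the Avro specification; the class and method names are hypothetical and 
this is not Avro's own encoder.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration of variable-length zig-zag coding; not Avro's code. */
final class ZigZagVarint {

    /** Encodes a long: zig-zag maps small magnitudes to small unsigned values. */
    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63);              // zig-zag: 0, -1, 1, -2 become 0, 1, 2, 3
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {                 // more than 7 bits remain
            out.write((int) ((z & 0x7F) | 0x80));   // low 7 bits, continuation bit set
            z >>>= 7;
        }
        out.write((int) z);                         // final byte, continuation bit clear
        return out.toByteArray();
    }

    /** An int widened to long yields at most 32 significant bits, hence at most 5 bytes. */
    static byte[] encodeInt(int n) {
        return encodeLong(n);
    }
}
```

Because the encoded length depends on the value, a reader cannot compute its 
position from the number and types of the fields read; the position has to be 
tracked on the underlying stream.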

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: AvroBinStorage.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706197#action_12706197
 ] 

Doug Cutting commented on PIG-794:
--

> I think we have to ask the Avro team to support this (current position in the 
> stream) for us to proceed with this. 

ValueReader performs no buffering, so its position is always the same as the 
InputStream that it wraps.  See DataFileReader#SeekableBufferedInput for an 
example of an input stream that tracks its position.

Note that AVRO-25 proposes to add buffering to ValueWriter, so that the 
position of the underlying stream might be different than that of the 
ValueWriter, but I do not foresee a need to add this to ValueReader, the concern 
here.
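
A generic sketch of an input stream that tracks its own position, using only 
java.io.  This is not the DataFileReader#SeekableBufferedInput class mentioned 
above; the class name is hypothetical, and the sketch assumes that the reader 
wrapping it does no buffering of its own and that mark/reset is not used.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical wrapper that counts the bytes consumed from the wrapped stream. */
final class PositionTrackingInputStream extends FilterInputStream {
    private long position;

    PositionTrackingInputStream(InputStream in) {
        super(in);
    }

    /** Number of bytes read or skipped so far. */
    long getPosition() {
        return position;
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) position++;          // -1 signals end of stream
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) position += n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(n);
        position += skipped;
        return skipped;
    }

    /** mark/reset would make the counter wrong, so it is reported as unsupported. */
    @Override
    public boolean markSupported() {
        return false;
    }
}
```

As long as the reader on top performs no buffering, getPosition() on the 
wrapper matches the reader's logical position in the stream.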

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: AvroBinStorage.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Rakesh Setty (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706220#action_12706220
 ] 

Rakesh Setty commented on PIG-794:
--

This works. Will update the patch.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: AvroBinStorage.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706226#action_12706226
 ] 

Hadoop QA commented on PIG-794:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12407285/AvroStorage.patch
  against trunk revision 771844.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/30/console

This message is automatically generated.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: AvroBinStorage.patch, AvroStorage.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706244#action_12706244
 ] 

Olga Natkovich commented on PIG-794:


Doug, if there is no buffering then the position in the input stream can be 
used for now. However, if you are planning to do buffering in the future, it 
might be good to have an API that just gives the position so that later we 
don't need to change the code.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: AvroBinStorage.patch, AvroStorage.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706278#action_12706278
 ] 

Olga Natkovich commented on PIG-794:


Hi Rakesh,

Thanks for the update. A few comments:

(1) Thanks for adding comments. They need to be in javadoc style so that we get 
free documentation from them. You can see examples in other files.
(2) Looks like there is at least one System.out.println statement that got in, 
I assume by mistake.
(3) Looks like you have some traces as log.error instead of log.debug.
(4) You need to attach the AVRO library separately. Patches don't work well 
with binary data.

Also, I am curious whether removing the wrapper class made a performance difference.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: AvroBinStorage.patch, AvroStorage.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-05 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12706284#action_12706284
 ] 

Olga Natkovich commented on PIG-794:


One more thing: since we are adding the Avro library, let's add some unit tests 
as well.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: AvroBinStorage.patch, AvroStorage.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-02 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12705351#action_12705351
 ] 

Olga Natkovich commented on PIG-794:


Hi Rakesh,

Thanks for the patch. A few comments below.

First, a few general comments:

(1) AVROBinStorage should not be in builtins. We don't want to expose it to the 
end user because in the past we had issues with backward compatibility (with 
BinStorage) when the same function was used both internally and externally.
(2) Every new file needs to have an apache license header. You can get one from 
a file in SVN.
(3) I would just call the class AVROStorage
(4) Once we are fully integrated with AVRO, we should add unit tests, but for 
now this is fine
(5) It would be nice to have javadoc comments in the data. At a minimum a 
header for each class on what it does and each public method. Also, it would be 
good to document any non-obvious code.

Now, code-related comments: what is the reason for having AVROValueReader? It 
seems to be a straight wrapper around ValueReader plus a position, which we can 
keep track of separately. I am concerned about the performance overhead that is 
incurred on each call. 


 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: AvroBinStorage.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.
