[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-19 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675214#action_12675214
 ] 

Santhosh Srinivasan commented on PIG-652:
-

Review comments:

1. In JobControlCompiler, using a static final String is a good way to 
replace repeated string literals. We should probably make this change across 
the board as part of a new JIRA.

2. Will schemas be useful for other operators and not just POStore?

3. StoreConfig implements Serializable, so it should also declare a static final long 
serialVersionUID.

The rest of the code is good. I have not run any of the tests; this was a pure 
code review.
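
Points 1 and 3 can be illustrated with a minimal sketch. The class below is a hypothetical stand-in and not the StoreConfig from the patch: its name, its fields, and the configuration key are assumptions made up for illustration. It only shows a named constant replacing a string literal and an explicit serialVersionUID on a Serializable class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a serializable store configuration.
// Field names are assumptions, not taken from the PIG-652 patch.
final class ExampleStoreConfig implements Serializable {
    // Explicit UID so the serialized form does not depend on the
    // compiler-generated default, which can change when the class changes.
    private static final long serialVersionUID = 1L;

    // Named constant instead of a repeated string literal (made-up key).
    static final String CONFIG_KEY = "example.store.config";

    private final String location;
    private final String schemaText;

    ExampleStoreConfig(String location, String schemaText) {
        this.location = location;
        this.schemaText = schemaText;
    }

    String getLocation() { return location; }

    String getSchemaText() { return schemaText; }

    // Serializes the object to bytes and reads it back.
    static ExampleStoreConfig roundTrip(ExampleStoreConfig config) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(config);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ExampleStoreConfig) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Serialization round trip failed", e);
        }
    }
}
```

Without the explicit field the class would still serialize, but the default UID is derived from the class structure, so an otherwise compatible change could make older serialized instances unreadable.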

 Need to give user control of OutputFormat
 -

 Key: PIG-652
 URL: https://issues.apache.org/jira/browse/PIG-652
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: PIG-652.patch


 Pig currently allows users some control over InputFormat via the Slicer and 
 Slice interfaces.  It does not allow any control over OutputFormat and 
 RecordWriter interfaces.  It just allows the user to implement a storage 
 function that controls how the data is serialized.  For hadoop tables, we 
 will need to allow custom OutputFormats that prepare output information and 
 objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-11 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672690#action_12672690
 ] 

Hong Tang commented on PIG-652:
---

You probably want to provide a utility method for getting back the StoreFunc from 
a JobConf, instead of forcing people to copy and paste internal Pig code.

 Need to give user control of OutputFormat
 -

 Key: PIG-652
 URL: https://issues.apache.org/jira/browse/PIG-652
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates

 Pig currently allows users some control over InputFormat via the Slicer and 
 Slice interfaces.  It does not allow any control over OutputFormat and 
 RecordWriter interfaces.  It just allows the user to implement a storage 
 function that controls how the data is serialized.  For hadoop tables, we 
 will need to allow custom OutputFormats that prepare output information and 
 objects needed by a Table store function.




[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-11 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672711#action_12672711
 ] 

Hong Tang commented on PIG-652:
---

One more thing that is still not clear to me: StoreFunc does not implement any 
serialization interface, and it depends on an all-string constructor to 
properly construct the object. How does my customized TableStoreFunc instance 
convey this information to Pig?
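
To make the all-string-constructor convention concrete, here is a minimal sketch of how a framework could rebuild a store function from a textual spec of the form ClassName('arg1','arg2'). It is an illustration under stated assumptions, not Pig's actual code: ExampleTableStore, FuncSpecSketch, and the spec syntax are hypothetical, and the parser is deliberately naive.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical store function whose state is fully described by String arguments.
class ExampleTableStore {
    final String tableName;
    final String columnGroups;

    ExampleTableStore(String tableName, String columnGroups) {
        this.tableName = tableName;
        this.columnGroups = columnGroups;
    }
}

final class FuncSpecSketch {
    // Parses "ClassName('a','b')" and invokes the constructor that takes
    // as many String parameters as there are arguments.
    // Naive parser: assumes arguments contain no commas or quotes.
    static Object instantiate(String spec) {
        int open = spec.indexOf('(');
        String className = open < 0 ? spec.trim() : spec.substring(0, open).trim();
        List<String> args = new ArrayList<>();
        if (open >= 0) {
            int close = spec.lastIndexOf(')');
            if (close < open) {
                throw new IllegalArgumentException("Unbalanced parentheses: " + spec);
            }
            String inner = spec.substring(open + 1, close).trim();
            if (!inner.isEmpty()) {
                for (String part : inner.split(",")) {
                    String arg = part.trim();
                    if (arg.length() >= 2 && arg.startsWith("'") && arg.endsWith("'")) {
                        arg = arg.substring(1, arg.length() - 1);
                    }
                    args.add(arg);
                }
            }
        }
        try {
            Class<?> cls = Class.forName(className);
            Class<?>[] paramTypes = new Class<?>[args.size()];
            Arrays.fill(paramTypes, String.class);
            Constructor<?> ctor = cls.getDeclaredConstructor(paramTypes);
            ctor.setAccessible(true);
            return ctor.newInstance(args.toArray());
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + spec, e);
        }
    }
}
```

Under this convention, anything the instance needs on the backend has to be expressible as constructor strings, which is the constraint the question above is pointing at.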




[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-10 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672324#action_12672324
 ] 

Alan Gates commented on PIG-652:


As for getting the schema into the output format, it will be the job of your 
StoreFunc to store it somewhere that your OutputFormat can retrieve it.

The path passed to the OutputFormat is the full path of the file to be written, 
not just part-001.
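
One possible reading of "store it somewhere", sketched under loud assumptions: the store function writes the schema text to a side file in the output directory, and the output format, given the full path of the file it is asked to write, walks up the parent directories until it finds that side file. The side-file name, the class, and the use of the local file system via java.nio are all hypothetical; nothing here is taken from the patch or from Hadoop's API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class SchemaSideFileSketch {
    // Hypothetical side-file name.
    static final String SCHEMA_FILE = "_schema";

    // Front end: the store function saves the schema text under the output directory.
    static void saveSchema(Path outputDir, String schemaText) {
        try {
            Files.createDirectories(outputDir);
            Files.write(outputDir.resolve(SCHEMA_FILE),
                    schemaText.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Back end: the output format receives the full path of the file to write
    // and searches its ancestor directories for the side file.
    static String loadSchema(Path fullPartPath) {
        for (Path dir = fullPartPath.toAbsolutePath().getParent();
                dir != null; dir = dir.getParent()) {
            Path candidate = dir.resolve(SCHEMA_FILE);
            if (Files.isRegularFile(candidate)) {
                try {
                    return new String(Files.readAllBytes(candidate), StandardCharsets.UTF_8);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
        throw new IllegalStateException("No schema file found above " + fullPartPath);
    }
}
```

Walking up the ancestors rather than looking only in the immediate parent is a design choice for the sketch: it still works if the file being written sits in a temporary subdirectory of the output location.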




[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-06 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671224#action_12671224
 ] 

Alan Gates commented on PIG-652:


In response to Ben's question about the motivating scenario (comment 
https://issues.apache.org/jira/browse/PIG-652?focusedCommentId=12671009#action_12671009
), the issue is that right now we pass an already 
opened output stream to the store function.  This, and the fact that a fair 
amount of setup is done in the PigRecordWriter, forces all stores to be done to 
an HDFS text file.  If a user wants to store to a different type of HDFS file 
(like Table) or to a non-HDFS store (such as a database, HBase, a socket, 
whatever), there's no option for that.  We don't want to export all of the setup 
to the StoreFunc.  The RecordWriter is the right place to do that setup.




[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-06 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671240#action_12671240
 ] 

Alan Gates commented on PIG-652:


Sorry, forgot to include the schema part.  A second function should be added to 
the StoreFunc interface:

{code}
/**
 * Specify a schema to be used in storing the data.  This can be used by
 * store functions that store the data in a self describing format.  The
 * store function is free to ignore this if it cannot use it.
 * @param s schema of the output data
 */
public void setStorageSchema(Schema s);
{code}

This function would be called during query planning.  The StoreFunc can then 
take the responsibility of storing away the schema so that it (or its 
associated OutputFormat) can access it on the backend.  This schema will also 
include the sortedness of the data.

As for making the JobConf and path available, those are passed to 
OutputFormat.getRecordWriter, so those implementing their own OutputFormats 
will have access to them.  They can then pass them on to their store functions 
as they wish.

For compression, pig right now has no way to communicate compression types 
other than file endings (.bz is the only one we support at the moment).  This 
is a kludge, but I don't want to propose a whole way to coherently communicate 
compression in pig at the moment.  So I vote that we stay with this for the 
time being.
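
The file-ending convention can be sketched in a few lines. This is an illustration only, not Pig's code: the class, the enum, and the method name are hypothetical, and the only suffix recognized is .bz, following the statement above.

```java
final class CompressionBySuffixSketch {
    // Hypothetical compression kinds for this sketch.
    enum Compression { NONE, BZIP }

    // Infers the compression type from the output location's file ending.
    // Only ".bz" is recognized, matching the comment above.
    static Compression fromLocation(String location) {
        return location.endsWith(".bz") ? Compression.BZIP : Compression.NONE;
    }
}
```

For instance, fromLocation("out/data.bz") selects BZIP, while any other ending falls back to NONE.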




[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-06 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671254#action_12671254
 ] 

Hong Tang commented on PIG-652:
---

I might be missing something. How can the OutputFormat class retrieve the schema 
information? The output format is constructed with its default constructor, and 
then its getRecordWriter is called with a name like part-001 or part-002, but not 
the path to the basic table. 




[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-05 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670977#action_12670977
 ] 

Alan Gates commented on PIG-652:


I propose that we add a method to the StoreFunc interface:

{code}
/**
 * Specify a backend specific class to use to prepare for
 * storing output.  In the Hadoop case, this can return an
 * OutputFormat that will be used instead of PigOutputFormat.  The 
 * framework will call this function and if a Class is returned
 * that implements OutputFormat it will be used.
 * @return Backend specific class used to prepare for storing output.
 * @throws IOException if the class does not implement the expected
 * interface(s).
 */
public Class getStorePreparationClass() throws IOException;
{code}

This way we are not forced to write a whole pig copy of OutputFormat and 
RecordWriter interfaces (the way Slicer and Slice copy InputFormat, InputSplit, 
and RecordReader) while still avoiding importing hadoop classes into our 
interface.  It also avoids forcing the StoreFunc to also be RecordWriter (the 
way LoadFunc has to implement Slicer).

The downside of this is that we do not change Pig Latin to allow a 
construct like:

{code}
store A using MyStoreFunc() using format MyOutputFormat()
{code}

There would be an advantage to this.  For example, if one wanted to 
serialize tuples over a socket, one might still want to use PigStorage but 
create a SocketOutputFormat function.  In the currently proposed interface you 
could still accomplish this by writing a StoreFunc that subclasses PigStorage 
and implements getStorePreparationClass(), but this is less elegant.  

As far as I know, no one is currently asking for the ability to specify 
OutputFormat separately from StoreFunc, and doing so would necessitate creating 
pig copies of OutputFormat and RecordWriter.  So rather than create a lot of 
extra interfaces for functionality no one is requesting, I propose this simpler 
solution.  If, in the future, we choose to allow the ability to separate the 
two, we would still want a StoreFunc to be able to specify its OutputFormat, so 
the proposed functionality would not be deprecated.
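
The socket example can be sketched as follows. Everything here is a stand-in so the sketch compiles on its own: the interfaces and classes are hypothetical substitutes for Hadoop's OutputFormat, Pig's StoreFunc, PigStorage, and PigOutputFormat, and treating a null return as "use the default" is an assumption, not something the proposal specifies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for Hadoop's OutputFormat (not the real interface).
interface OutputFormatStandIn {}

// Stand-in for PigOutputFormat, the default.
class DefaultOutputFormatStandIn implements OutputFormatStandIn {}

// Hypothetical output format that would send tuples over a socket.
class SocketOutputFormat implements OutputFormatStandIn {}

// Stand-in for StoreFunc, reduced to the proposed method.
interface StoreFuncSketch {
    Class<?> getStorePreparationClass() throws IOException;
}

// Stand-in for PigStorage: requests no special preparation class.
class PigStorageStandIn implements StoreFuncSketch {
    @Override
    public Class<?> getStorePreparationClass() throws IOException {
        return null; // assumption: null means "use the default"
    }
}

// Reuses the parent's serialization but swaps in a different output format.
class SocketStorage extends PigStorageStandIn {
    @Override
    public Class<?> getStorePreparationClass() {
        return SocketOutputFormat.class;
    }
}

final class StorePreparationSketch {
    // Framework side: use the returned class only if it implements the
    // expected interface; otherwise fall back to the default.
    static Class<?> chooseOutputFormat(StoreFuncSketch func, Class<?> defaultFormat) {
        Class<?> cls;
        try {
            cls = func.getStorePreparationClass();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (cls != null && OutputFormatStandIn.class.isAssignableFrom(cls)) {
            return cls;
        }
        return defaultFormat;
    }
}
```

Because the method returns a plain Class, the StoreFunc interface itself never has to import a backend type; only the framework-side check names the backend interface.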




[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-05 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670981#action_12670981
 ] 

Hong Tang commented on PIG-652:
---

Since this API is supposed to provide backend specific output classes, 
shouldn't the API take a parameter describing the backend?

For the MR backend, would the returned class be implementing OutputFormat<Text, 
Tuple>? Also, the keys in the JobConf object describing path, schema, 
compression, etc. need to be made public.




[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-05 Thread Benjamin Reed (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671009#action_12671009
 ] 

Benjamin Reed commented on PIG-652:
---

Can you explain the motivating scenario in more detail? Is it just to avoid 
creating the outputstream? 

@Hong, you aren't going to be getting the schema from the hadoop jobconf; 
you'll be getting that from pig. Since a pig job may involve multiple hadoop 
jobs, you can't count on passing stuff through hadoop configuration.




[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-05 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671028#action_12671028
 ] 

Hong Tang commented on PIG-652:
---

How do I get the schema information from Pig? I thought you would put the 
schema in the JobConf and pass it to the customized OutputFormat class to 
create the RecordWriter.
