[jira] Commented: (PIG-1115) [zebra] temp files are not cleaned.

2010-02-16 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834373#action_12834373
 ] 

Hong Tang commented on PIG-1115:


Why not request the patch to be backported to Hadoop 0.21? (BTW, do you mean 
Hadoop 0.21 or 0.20?)

 [zebra] temp files are not cleaned.
 ---

 Key: PIG-1115
 URL: https://issues.apache.org/jira/browse/PIG-1115
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Hong Tang
Assignee: Gaurav Jain
 Attachments: PIG-1115.patch


 Temp files created by zebra during table creation are not cleaned up when 
 there is any task failure, which results in wasted disk space.
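
A minimal sketch of the kind of cleanup meant here (hypothetical, not the 
attached patch; the committer wiring and the "zebra.temp.dir" key are 
assumptions): delete the task's temp output in OutputCommitter.abortTask so a 
failed task does not leak disk space.

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputCommitter;
import org.apache.hadoop.mapred.TaskAttemptContext;

public class TempCleaningCommitter extends FileOutputCommitter {
  @Override
  public void abortTask(TaskAttemptContext context) throws IOException {
    super.abortTask(context);
    // "zebra.temp.dir" is a made-up key; zebra knows its own temp location.
    String tempRoot = context.getJobConf().get("zebra.temp.dir");
    if (tempRoot != null) {
      Path temp = new Path(tempRoot, context.getTaskAttemptID().toString());
      FileSystem fs = temp.getFileSystem(context.getJobConf());
      if (fs.exists(temp)) {
        fs.delete(temp, true); // recursively remove the failed task's files
      }
    }
  }
}
{code}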

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1115) [zebra] temp files are not cleaned.

2009-11-30 Thread Hong Tang (JIRA)
[zebra] temp files are not cleaned.
---

 Key: PIG-1115
 URL: https://issues.apache.org/jira/browse/PIG-1115
 Project: Pig
  Issue Type: Bug
Reporter: Hong Tang


Temp files created by zebra during table creation are not cleaned up when 
there is any task failure, which results in wasted disk space.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-992) [zebra] Separate Schema-related files into a Schema package

2009-10-08 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12763439#action_12763439
 ] 

Hong Tang commented on PIG-992:
---

Comments:
- In many places, both types.ParseException and schema.ParseException are 
thrown. Do you really want both?
- In the following
{noformat}
+public enum ColumnType implements Writable {
{noformat}
Is the Writable interface actually used? You have a rather odd pattern of 
asymmetric readFields and write:
{noformat}
+  @Override
+  public void readFields(DataInput in) throws IOException {
+    // no op, instantiated by the caller
+  }
+
+  @Override
+  public void write(DataOutput out) throws IOException {
+    Utils.writeString(out, name);
+  }
{noformat}
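
For contrast, a symmetric pair might look something like the following (sketch 
only; Utils.readString is assumed to mirror Utils.writeString, and the stored 
name is assumed to match the enum constant name):

{code}
@Override
public void write(DataOutput out) throws IOException {
  Utils.writeString(out, name);
}

// An enum cannot reassign itself in readFields, which may explain the
// asymmetry; a static reader restoring the constant from its serialized
// name is one symmetric alternative.
public static ColumnType readFrom(DataInput in) throws IOException {
  return ColumnType.valueOf(Utils.readString(in));
}
{code}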
- In the following code
{noformat}
+  public static class ColumnSchema {
+    public String name;
+    public ColumnType type;
+    public Schema schema;
+    public int index; // field index in schema
{noformat}
Exposing fields as all-public seems like a bad idea.
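
Something along these lines would keep the representation private (illustrative 
only):

{code}
public static class ColumnSchema {
  private final String name;
  private final ColumnType type;
  private final Schema schema;
  private final int index; // field index in schema

  public ColumnSchema(String name, ColumnType type, Schema schema, int index) {
    this.name = name;
    this.type = type;
    this.schema = schema;
    this.index = index;
  }

  public String getName() { return name; }
  public ColumnType getType() { return type; }
  public Schema getSchema() { return schema; }
  public int getIndex() { return index; }
}
{code}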
- Is there a specific use case for allowing the schema to be mutable at any time? 
(minor nit: the comment says "add a field", but the code seems to add a column to 
the schema).
{noformat}
+  /**
+   * add a field
+   */
+  public void add(ColumnSchema f) throws ParseException
+  {
+    add(f, false);
+  }
{noformat}
- Why is Schema.equals(Object) not implemented on top of the static version of 
the method (or vice versa)?
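
For example (sketch; the static version is assumed to have the signature 
equals(Schema, Schema)):

{code}
@Override
public boolean equals(Object obj) {
  // uses instanceof rather than class comparison, per the note further down
  return (obj instanceof Schema) && equals(this, (Schema) obj);
}
{code}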
- In Schema.readFields(), the Version string from the input is not checked for 
compatibility.
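
That is, something like the following inside readFields (sketch only; 
SCHEMA_VERSION is an assumed name for the current-version constant, and the 
string is assumed to be read back the same way write() stores it):

{code}
String version = Utils.readString(in); // already read today, just not checked
if (!SCHEMA_VERSION.equals(version)) {
  throw new IOException("Incompatible schema version: " + version);
}
{code}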
- In the following
{noformat}
+  private void init(String[] columnNames) throws ParseException {
+    // the arg must be of type or they will be treated as the default type
+    // TODO: verify column names don't contain COLUMN_DELIMITER
{noformat}
The TODO does not seem to involve much work; please consider not deferring 
it.
- Need more detailed documentation on the spec of the parameter for 
Schema.getColumnSchema(String name)
{noformat}
+  /**
+   * Get a column's schema
+   */
+  public ColumnSchema getColumnSchema(String name) throws ParseException
+  {
{noformat}
- Schema.getColumnSchemaOnParsedName and Schema.getColumnSchema seem to be 
copy/paste code.
- Schema.getColumnSchema(ParsedName pn) has the side effect of modifying the 
parameter pn. The javadoc reads as cryptic to me.
- There are many classes generated by JavaCC. It is probably better not to 
include them in the patch (and to put the generated source under build/src).

Other minor issues:
- Typically contrib projects should use the same version string as the parent 
project.
- Style: there are some very long lines.
- There are a few whitespace changes; these should be avoided if possible.
- In the following
{noformat}
+} catch (org.apache.hadoop.zebra.schema.ParseException e) {
+  throw new AssertionError("Invalid Projection: " + e.getMessage());
{noformat}
consider changing AssertionError to IllegalArgumentException.
- In the following:
{noformat}
+  /*
+   * helper class to parse a column name string one section at a time and
+   * find the required type for the parsed part.
+   */
+  public static class ParsedName {
+    public String mName;
+    int mKeyOffset; // the offset where the keys string starts
+    public ColumnType mDT = ColumnType.ANY; // parent's type
{noformat}
The description seems to indicate that this should not be a public class. I 
tried to understand the body of the class and do not feel that it serves a 
general purpose.
- The following seems like a useless assignment:
{noformat}
+  private long mVersion = schemaVersion;
{noformat}
- {noformat}
  /**
+   * Normalize the schema string.
+   * 
+   * @param value
+   *  the input string representation of the schema.
+   * @return the normalized string representation.
+   */
+  public static String normalize(String value) {
+    String result = new String();
+
+    if (value == null || value.trim().isEmpty())
+      return result;
+
+    StringBuilder sb = new StringBuilder();
+    String[] parts = value.trim().split(COLUMN_DELIMITER);
+    for (int nx = 0; nx < parts.length; nx++) {
+      if (nx > 0) sb.append(COLUMN_DELIMITER);
+      sb.append(parts[nx].trim());
+    }
+    return sb.toString();
+  }

{noformat}
There is a wasted value.trim().
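
An illustrative rewrite that hoists the repeated value.trim() (and drops the 
unused result allocation):

{code}
public static String normalize(String value) {
  if (value == null) return "";
  String trimmed = value.trim();
  if (trimmed.isEmpty()) return "";

  StringBuilder sb = new StringBuilder();
  String[] parts = trimmed.split(COLUMN_DELIMITER);
  for (int nx = 0; nx < parts.length; nx++) {
    if (nx > 0) sb.append(COLUMN_DELIMITER);
    sb.append(parts[nx].trim());
  }
  return sb.toString();
}
{code}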
- In Schema.equals(Object), instead of comparing class equality, using 
instanceof is typically better.
- Use StringBuilder instead in the following code:
{noformat}
+String merged = new String();
+for (int i = 0; i < columnNames.length; i++) {
+  if (i > 0) merged += ",";
+  merged += columnNames[i];
+}
{noformat}
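
For example:

{code}
StringBuilder merged = new StringBuilder();
for (int i = 0; i < columnNames.length; i++) {
  if (i > 0) merged.append(',');
  merged.append(columnNames[i]);
}
String result = merged.toString();
{code}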
- There are a few indentation problems.

 [zebra] Separate Schema-related files into a Schema package
 -

 Key: PIG-992
 URL: https://issues.apache.org/jira/browse/PIG-992
 Project: Pig
  Issue Type: Improvement

[jira] Resolved: (PIG-526) Order of key, value pairs not preserved in MAP type.

2009-07-30 Thread Hong Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Tang resolved PIG-526.
---

Resolution: Won't Fix

 Order of key, value pairs not preserved in MAP type.
 --

 Key: PIG-526
 URL: https://issues.apache.org/jira/browse/PIG-526
 Project: Pig
  Issue Type: Bug
  Components: data
Affects Versions: 0.2.0
Reporter: Hong Tang

 PIG uses HashMap to deserialize the Pig MAP type, which does not preserve the 
 order of key/value pairs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader

2009-07-10 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729744#action_12729744
 ] 

Hong Tang commented on PIG-879:
---

1) and 3) are kind of equivalent to the user, and are preferred for customized 
loaders that do not wish Pig to do the escaping at all. 


 Pig should provide a way for input location string in load statement to be 
 passed as-is to the Loader
 -

 Key: PIG-879
 URL: https://issues.apache.org/jira/browse/PIG-879
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath

  Due to multiquery optimization, Pig always converts the filenames to 
 absolute URIs (see 
 http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section 
 about Incompatible Changes - Path Names and Schemes). This is necessary since 
 the script may have "cd .." statements between load or store statements, and 
 if the load statements have relative paths, we would need to convert to 
 absolute paths to know where to load/store from. To do this 
 QueryParser.massageFilename() has the code below[1] which basically gives the 
 fully qualified hdfs path
  
 However the issue with this approach is that if the filename string is 
 something like 
 hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2,
  the code below[1] actually translates this to 
 hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2
  and throws an exception that it is an incorrect path.
  
 Some loaders may want to interpret the filenames (the input location string 
 in the load statement) in any way they wish and may want Pig to not make 
 absolute paths out of them.
  
 There are a few options to address this:
 1) A command line switch to indicate to Pig that pathnames in the script 
 are all absolute and hence Pig should not alter them and should pass them 
 as-is to Loaders and Storers. 
 2) A keyword in the load and store statements to indicate the same intent 
 to Pig.
 3) A property which users can supply on the cmdline or in pig.properties to 
 indicate the same intent.
 4) A method in LoadFunc - relativeToAbsolutePath(String filename, String 
 curDir) which does the conversion to absolute - this way the Loader can 
 choose to implement it as a no-op.
 Thoughts?
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-879) Pig should provide a way for input location string in load statement to be passed as-is to the Loader

2009-07-10 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729771#action_12729771
 ] 

Hong Tang commented on PIG-879:
---

Both are valid arguments. The problem with 2) and 4) is that they require 
changes to the load statement syntax or the load-func API and would take 
longer to get there. 

I guess we could structure the fix in two phases. Phase One: support 1) and 
3), so that we have the minimum to move along without having to disable 
multi-query optimization completely. Users should be able to modify their 
scripts to change all relative paths to absolute ones (such usage should be 
rare enough that most people will not be impacted). Phase Two: support either 
2) or 4) (but I do not think we need both). Personally I think 4) would be 
better, because the loader should be the one that interprets the location 
string syntax.
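
A sketch of what 4) could look like on the loader side (hypothetical; the 
signature is taken from the proposal in the description and is not part of the 
LoadFunc API at the time of this comment):

{code}
// In a custom loader that wants the location string passed through as-is:
public String relativeToAbsolutePath(String location, String curDir) {
  // no-op: this loader interprets the location string itself,
  // so Pig should not rewrite it into an absolute path
  return location;
}
{code}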

 Pig should provide a way for input location string in load statement to be 
 passed as-is to the Loader
 -

 Key: PIG-879
 URL: https://issues.apache.org/jira/browse/PIG-879
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.0
Reporter: Pradeep Kamath

  Due to multiquery optimization, Pig always converts the filenames to 
 absolute URIs (see 
 http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification - section 
 about Incompatible Changes - Path Names and Schemes). This is necessary since 
 the script may have "cd .." statements between load or store statements, and 
 if the load statements have relative paths, we would need to convert to 
 absolute paths to know where to load/store from. To do this 
 QueryParser.massageFilename() has the code below[1] which basically gives the 
 fully qualified hdfs path
  
 However the issue with this approach is that if the filename string is 
 something like 
 hdfs://localhost.localdomain:39125/user/bla/1,hdfs://localhost.localdomain:39125/user/bla/2,
  the code below[1] actually translates this to 
 hdfs://localhost.localdomain:38264/user/bla/1,hdfs://localhost.localdomain:38264/user/bla/2
  and throws an exception that it is an incorrect path.
  
 Some loaders may want to interpret the filenames (the input location string 
 in the load statement) in any way they wish and may want Pig to not make 
 absolute paths out of them.
  
 There are a few options to address this:
 1) A command line switch to indicate to Pig that pathnames in the script 
 are all absolute and hence Pig should not alter them and should pass them 
 as-is to Loaders and Storers. 
 2) A keyword in the load and store statements to indicate the same intent 
 to Pig.
 3) A property which users can supply on the cmdline or in pig.properties to 
 indicate the same intent.
 4) A method in LoadFunc - relativeToAbsolutePath(String filename, String 
 curDir) which does the conversion to absolute - this way the Loader can 
 choose to implement it as a no-op.
 Thoughts?
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-833) Storage access layer

2009-06-04 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716470#action_12716470
 ] 

Hong Tang commented on PIG-833:
---

Jeff, just like the SQL effort, the space of columnar storage is also wide 
open, and I think it is more beneficial to the overall health of the Hadoop 
ecosystem.

That being said, I also looked at the patch attached to HIVE-352. It appears 
that what the patch does is a level below our stated objectives. Specifically, 
the guts of the implementation (RCFile) are very close in spirit to TFile as 
described in HADOOP-3315, which seems to have had its first comprehensive 
patch back in December 2008. 

 Storage access layer
 

 Key: PIG-833
 URL: https://issues.apache.org/jira/browse/PIG-833
 Project: Pig
  Issue Type: New Feature
Reporter: Jay Tang

 A layer is needed to provide a high level data access abstraction and a 
 tabular view of data in Hadoop, and could free Pig users from implementing 
 their own data storage/retrieval code.  This layer should also include a 
 columnar storage format in order to provide fast data projection, 
 CPU/space-efficient data serialization, and a schema language to manage 
 physical storage metadata.  Eventually it could also support predicate 
 pushdown for further performance improvement.  Initially, this layer could be 
 a contrib project in Pig and become a hadoop subproject later on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-794) Use Avro serialization in Pig

2009-05-23 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712397#action_12712397
 ] 

Hong Tang commented on PIG-794:
---

- It appears that the code adds a three-byte sync mark \1\2\3 before every 
tuple. 
- There is no escaping of sync-mark collisions in user data. 
- The introduction of the sync mark also defeats the purpose of using Avro in 
the first place (sharing a common serialization format).

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Fix For: 0.2.0

 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-05-01 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705188#action_12705188
 ] 

Hong Tang commented on PIG-793:
---

Two ideas:

# When loading a tuple from serialized data, keep it as a byte array and only 
instantiate datums when get/set calls are made. This would help if we are 
moving tuples from one container to another.
{code}
class LazyTuple implements Tuple {
  ArrayList<Object> fields; // null if not deserialized
  DataByteArray lazyBytes; // e.g. serialized bytes of tuple in avro format.
}
{code} 
# Improving DataByteArray: it may be changed to an interface (needs get(), 
offset(), and length()), with a DataByteArrayFactory used to create instances 
in two ways: 
## DataByteArrayFactory.createPrivate(byte[], offset, length), if we need to 
keep a private copy of the buffer.
## DataByteArrayCreateShared(), if the input buffer can be shared with the 
data byte array object. In this case, the contract would be that the caller 
will no longer access the portion of the byte array from offset to 
offset+length (exclusive).

There could be three different implementations of this:
- The current implementation will be used for createPrivate().
- An implementation for small buffers (offset/length can be represented in 
short/short).
- An implementation for large buffers (offset/length are int/int, and length 
is large enough).

Note that the change to DataByteArray would break the current semantics where 
the offset is always 0, and length is always the length of the buffer.
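
A rough shape of the proposed split, purely illustrative (names taken from the 
comment above):

{code}
public interface DataByteArray {
  byte[] get();
  int offset();
  int length();
}

public class DataByteArrayFactory {
  // keeps a private copy of the buffer
  public static DataByteArray createPrivate(byte[] buf, int offset, int length) {
    byte[] copy = new byte[length];
    System.arraycopy(buf, offset, copy, 0, length);
    return new Wrapper(copy, 0, length);
  }

  // shares the caller's buffer; the caller must no longer touch
  // [offset, offset+length)
  public static DataByteArray createShared(byte[] buf, int offset, int length) {
    return new Wrapper(buf, offset, length);
  }

  private static final class Wrapper implements DataByteArray {
    private final byte[] buf;
    private final int offset, length;
    Wrapper(byte[] buf, int offset, int length) {
      this.buf = buf; this.offset = offset; this.length = length;
    }
    public byte[] get() { return buf; }
    public int offset() { return offset; }
    public int length() { return length; }
  }
}
{code}

(The small-buffer and large-buffer variants mentioned above would just be 
alternative Wrapper implementations storing offset/length as short vs. int.)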


 Improving memory efficiency of Tuple implementation
 ---

 Key: PIG-793
 URL: https://issues.apache.org/jira/browse/PIG-793
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich

 Currently, our tuple is a real pig and uses a lot of extra memory. 
 There are several places where we can improve memory efficiency:
 (1) Laying out memory for the fields rather than using Java objects, since 
 each object for a numeric field takes 16 bytes.
 (2) For the cases where we know the schema, using Java arrays rather than 
 ArrayList.
 There might be more.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (PIG-793) Improving memory efficiency of Tuple implementation

2009-05-01 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705188#action_12705188
 ] 

Hong Tang edited comment on PIG-793 at 5/1/09 4:59 PM:
---

Two ideas:

# When loading a tuple from serialized data, keep it as a byte array and only 
instantiate datums when get/set calls are made. This would help if we are 
moving tuples from one container to another.
{code}
class LazyTuple implements Tuple {
  ArrayList<Object> fields; // null if not deserialized
  DataByteArray lazyBytes; // e.g. serialized bytes of tuple in avro format.
}
{code} 
# Improving DataByteArray: it may be changed to an interface (needs get(), 
offset(), and length()), with a DataByteArrayFactory used to create instances 
in two ways: 
## DataByteArrayFactory.createPrivate(byte[], offset, length), if we need to 
keep a private copy of the buffer.
## DataByteArrayFactory.createShared(byte[], offset, length), if the input 
buffer can be shared with the data byte array object. In this case, the 
contract would be that the caller will no longer access the portion of the 
byte array from offset to offset+length (exclusive).

There could be three different implementations of this:
- The current implementation will be used for createPrivate().
- An implementation for small buffers (offset/length can be represented in 
short/short).
- An implementation for large buffers (offset/length are int/int, and length 
is large enough).

Note that the change to DataByteArray would break the current semantics where 
the offset is always 0, and length is always the length of the buffer.


  was (Author: hong.tang):
Two ideas:

# When loading a tuple from serialized data, keep it as a byte array and only 
instantiate datums when get/set calls are made. This would help if we are 
moving tuples from one container to another.
{code}
class LazyTuple implements Tuple {
  ArrayList<Object> fields; // null if not deserialized
  DataByteArray lazyBytes; // e.g. serialized bytes of tuple in avro format.
}
{code} 
# Improving DataByteArray: it may be changed to an interface (needs get(), 
offset(), and length()), with a DataByteArrayFactory used to create instances 
in two ways: 
## DataByteArrayFactory.createPrivate(byte[], offset, length), if we need to 
keep a private copy of the buffer.
## DataByteArrayCreateShared(), if the input buffer can be shared with the 
data byte array object. In this case, the contract would be that the caller 
will no longer access the portion of the byte array from offset to 
offset+length (exclusive).

There could be three different implementations of this:
- The current implementation will be used for createPrivate().
- An implementation for small buffers (offset/length can be represented in 
short/short).
- An implementation for large buffers (offset/length are int/int, and length 
is large enough).

Note that the change to DataByteArray would break the current semantics where 
the offset is always 0, and length is always the length of the buffer.

  
 Improving memory efficiency of Tuple implementation
 ---

 Key: PIG-793
 URL: https://issues.apache.org/jira/browse/PIG-793
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich

 Currently, our tuple is a real pig and uses a lot of extra memory. 
 There are several places where we can improve memory efficiency:
 (1) Laying out memory for the fields rather than using Java objects, since 
 each object for a numeric field takes 16 bytes.
 (2) For the cases where we know the schema, using Java arrays rather than 
 ArrayList.
 There might be more.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-11 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672690#action_12672690
 ] 

Hong Tang commented on PIG-652:
---

You probably want to provide a utility method for getting back the StoreFunc 
from a JobConf, instead of forcing people to copy/paste internal Pig code...
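
That is, something like the following (sketch; the conf key name is an 
assumption, and instantiateFuncFromSpec stands in for whatever internal helper 
Pig uses to rebuild the func from its spec string):

{code}
public static StoreFunc getStoreFunc(JobConf conf) throws IOException {
  String funcSpec = conf.get("pig.storeFunc"); // assumed key
  if (funcSpec == null) {
    return null;
  }
  try {
    return (StoreFunc) PigContext.instantiateFuncFromSpec(funcSpec);
  } catch (Exception e) {
    throw new IOException("Cannot instantiate store func: " + funcSpec);
  }
}
{code}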

 Need to give user control of OutputFormat
 -

 Key: PIG-652
 URL: https://issues.apache.org/jira/browse/PIG-652
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates

 Pig currently allows users some control over InputFormat via the Slicer and 
 Slice interfaces.  It does not allow any control over OutputFormat and 
 RecordWriter interfaces.  It just allows the user to implement a storage 
 function that controls how the data is serialized.  For hadoop tables, we 
 will need to allow custom OutputFormats that prepare output information and 
 objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-11 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672711#action_12672711
 ] 

Hong Tang commented on PIG-652:
---

One more thing that is still not clear to me: StoreFunc does not implement any 
serialization interface, and it depends on an all-string constructor to 
properly construct the object. How does my customized TableStoreFunc instance 
convey this information to Pig?
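
For illustration, the all-string constructor convention suggests the instance 
travels as a plain spec string (the key name and the TableStoreFunc spec below 
are made up):

{code}
// front end: record the func spec in the job configuration
conf.set("pig.storeFunc", "org.example.TableStoreFunc('/out/table','f1:int,f2:string')");

// back end: rebuild the instance from the spec
StoreFunc func = (StoreFunc) PigContext.instantiateFuncFromSpec(conf.get("pig.storeFunc"));
{code}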

 Need to give user control of OutputFormat
 -

 Key: PIG-652
 URL: https://issues.apache.org/jira/browse/PIG-652
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates

 Pig currently allows users some control over InputFormat via the Slicer and 
 Slice interfaces.  It does not allow any control over OutputFormat and 
 RecordWriter interfaces.  It just allows the user to implement a storage 
 function that controls how the data is serialized.  For hadoop tables, we 
 will need to allow custom OutputFormats that prepare output information and 
 objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-653) Make fieldsToRead work in loader

2009-02-09 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672176#action_12672176
 ] 

Hong Tang commented on PIG-653:
---

My quibble is that the interface uses null to indicate "all required" for 
nested fields, but uses a concrete class for top-level fields. Any 
justification why possible future extensions are only applicable to top-level 
fields but not nested fields?

 Make fieldsToRead work in loader
 

 Key: PIG-653
 URL: https://issues.apache.org/jira/browse/PIG-653
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Pradeep Kamath
 Attachments: PIG-653-2.comment


 Currently pig does not call the fieldsToRead function in LoadFunc, thus it 
 does not provide information to load functions on what fields are needed.  We 
 need to implement a visitor that determines (where possible) which fields in 
 a file will be used and relays that information to the load function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-06 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671254#action_12671254
 ] 

Hong Tang commented on PIG-652:
---

I might be missing something. How can the OutputFormat class retrieve the 
schema information? The output format is constructed with its default 
constructor, and then its getRecordWriter is called with names like part-001 
and part-002, but not with the path to the basic table. 

 Need to give user control of OutputFormat
 -

 Key: PIG-652
 URL: https://issues.apache.org/jira/browse/PIG-652
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates

 Pig currently allows users some control over InputFormat via the Slicer and 
 Slice interfaces.  It does not allow any control over OutputFormat and 
 RecordWriter interfaces.  It just allows the user to implement a storage 
 function that controls how the data is serialized.  For hadoop tables, we 
 will need to allow custom OutputFormats that prepare output information and 
 objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-05 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670981#action_12670981
 ] 

Hong Tang commented on PIG-652:
---

Since this API is supposed to provide backend-specific output classes, 
shouldn't the API take a parameter describing the backend?

For the MR backend, the returned class would implement OutputFormat<Text, 
Tuple>? Also, the keys in the JobConf object describing path, schema, 
compression, etc. need to be made public.
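
For example, with public keys the custom OutputFormat could bootstrap itself 
like this (sketch; the key names and the createWriter helper are made up):

{code}
public RecordWriter<Text, Tuple> getRecordWriter(FileSystem ignored, JobConf job,
    String name, Progressable progress) throws IOException {
  String path = job.get("pig.output.path");         // assumed key
  String schema = job.get("pig.output.schema");     // assumed key
  String codec = job.get("pig.output.compression"); // assumed key
  if (path == null || schema == null) {
    throw new IOException("output path/schema not published in the JobConf");
  }
  return createWriter(new Path(path, name), schema, codec); // hypothetical helper
}
{code}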

 Need to give user control of OutputFormat
 -

 Key: PIG-652
 URL: https://issues.apache.org/jira/browse/PIG-652
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates

 Pig currently allows users some control over InputFormat via the Slicer and 
 Slice interfaces.  It does not allow any control over OutputFormat and 
 RecordWriter interfaces.  It just allows the user to implement a storage 
 function that controls how the data is serialized.  For hadoop tables, we 
 will need to allow custom OutputFormats that prepare output information and 
 objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-652) Need to give user control of OutputFormat

2009-02-05 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671028#action_12671028
 ] 

Hong Tang commented on PIG-652:
---

How do I get the schema information from Pig? I thought you would put the 
schema in the JobConf and pass it to the customized OutputFormat class to 
create the RecordWriter.

 Need to give user control of OutputFormat
 -

 Key: PIG-652
 URL: https://issues.apache.org/jira/browse/PIG-652
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Alan Gates
Assignee: Alan Gates

 Pig currently allows users some control over InputFormat via the Slicer and 
 Slice interfaces.  It does not allow any control over OutputFormat and 
 RecordWriter interfaces.  It just allows the user to implement a storage 
 function that controls how the data is serialized.  For hadoop tables, we 
 will need to allow custom OutputFormats that prepare output information and 
 objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-526) Order of key, value pairs not preserved in MAP type.

2008-11-13 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647432#action_12647432
 ] 

Hong Tang commented on PIG-526:
---

I understand your concern. But just as I rephrased, the issue here is that Pig 
allows the user no control over which concrete Map class to use when 
deserializing the tuples. 

Probably a good compromise is to allow the user to specify which Map class 
should be used when performing Tuple deserialization.
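
For example (sketch; the property name is made up), the deserializer could 
instantiate whatever Map implementation the user configured, with 
java.util.LinkedHashMap preserving insertion order:

{code}
@SuppressWarnings("unchecked")
static Map<Object, Object> newMapInstance(Properties props) throws IOException {
  String clazz = props.getProperty("pig.map.impl", "java.util.HashMap"); // assumed property
  try {
    return (Map<Object, Object>) Class.forName(clazz).newInstance();
  } catch (Exception e) {
    throw new IOException("Cannot instantiate map class: " + clazz);
  }
}
// e.g. -Dpig.map.impl=java.util.LinkedHashMap keeps key insertion order
{code}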

 Order of key, value pairs not preserved in MAP type.
 --

 Key: PIG-526
 URL: https://issues.apache.org/jira/browse/PIG-526
 Project: Pig
  Issue Type: Bug
  Components: data
Affects Versions: types_branch
Reporter: Hong Tang

 PIG uses HashMap to deserialize the Pig MAP type, which does not preserve the 
 order of key/value pairs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.