Re: load files
Thanks, Jeff. In Pig, the file names look like part-m-x (for map results) or part-r-x (for reduce results), which are different from the Hadoop style (part-x). So, can we control the name of each generated file? How? Thanks, -Gang

----- Original Message -----
From: Jeff Zhang zjf...@gmail.com
To: pig-dev@hadoop.apache.org
Sent: Sunday, 2010/6/27 9:22:30 PM
Subject: Re: load files

Hi Gang, The path specified in load can be either a file or a directory; besides, you can also leverage Hadoop's globStatus. The path specified in store is a directory.

On Mon, Jun 28, 2010 at 4:44 AM, Gang Luo lgpub...@yahoo.com.cn wrote: Hi all, when we specify the path of input to a load operator, is it a file or a directory? Similarly, when we use store-load to connect two MR operators, is the path specified in the store and load a directory? Thanks, -Gang

-- Best Regards, Jeff Zhang
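The two naming schemes Gang describes can be distinguished mechanically. A minimal sketch in plain Java; the exact digit width of the numeric suffix is an assumption, so the regex is deliberately loose:

```java
import java.util.regex.Pattern;

public class PartFileNames {
    // Matches both the classic Hadoop style (part-00000) and the
    // Pig/new-MapReduce-API style (part-m-00000 for map output,
    // part-r-00000 for reduce output). Illustrative pattern only.
    static final Pattern PART = Pattern.compile("part-(?:[mr]-)?\\d+");

    public static boolean isPartFile(String name) {
        return PART.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPartFile("part-m-00000")); // map-side output
        System.out.println(isPartFile("part-00000"));   // classic style
    }
}
```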
[jira] Created: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)
map/red jobs fail using G1 GC (Couldn't find heap)
--------------------------------------------------
Key: PIG-1470
URL: https://issues.apache.org/jira/browse/PIG-1470
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.6.0
Environment: OS: 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux
Java: Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Hadoop: 0.20.1
Reporter: Randy Prager

Here is the hadoop map/red configuration (conf/mapred-site.xml) that fails:

{noformat}
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC</value>
</property>
{noformat}

Here is the hadoop map/red configuration that succeeds:

{noformat}
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops</value>
</property>
{noformat}

Here is the exception from the pig script:

{noformat}
Backend error message
---------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to set up the load function.
        at org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:89)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper.makeReader(SliceWrapper.java:144)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getRecordReader(PigInputFormat.java:282)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[,]'
        at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:519)
        at org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:85)
        ... 5 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:487)
        ... 6 more
Caused by: java.lang.RuntimeException: Couldn't find heap
        at org.apache.pig.impl.util.SpillableMemoryManager.<init>(SpillableMemoryManager.java:95)
        at org.apache.pig.data.BagFactory.<init>(BagFactory.java:106)
        at org.apache.pig.data.DefaultBagFactory.<init>(DefaultBagFactory.java:71)
        at org.apache.pig.data.BagFactory.getInstance(BagFactory.java:76)
        at org.apache.pig.builtin.Utf8StorageConverter.<init>(Utf8StorageConverter.java:49)
        at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:69)
        at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:79)
        ... 11 more
{noformat}

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
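The "Couldn't find heap" error is thrown while constructing SpillableMemoryManager, which scans the JVM's memory pool MXBeans for a heap pool whose usage threshold can be monitored; under the early G1 builds no pool passes that check. A minimal sketch of that kind of scan, as an illustration of the mechanism rather than Pig's actual code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class HeapPoolScan {
    // Hypothetical stand-in for the scan SpillableMemoryManager performs:
    // find a heap pool that supports usage-threshold notifications.
    public static MemoryPoolMXBean findMonitorableHeapPool() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP
                    && pool.isUsageThresholdSupported()) {
                return pool;
            }
        }
        return null; // the "Couldn't find heap" situation
    }

    public static void main(String[] args) {
        MemoryPoolMXBean pool = findMonitorableHeapPool();
        System.out.println(pool == null ? "Couldn't find heap" : pool.getName());
    }
}
```

On collectors whose old-generation pool supports usage thresholds (the common case on HotSpot), the scan finds a pool; on the 1.6.0_18 G1 build reported here, it evidently did not.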
[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1389:
------------------------------
Attachment: PIG-1389.patch

Sync with the latest trunk.

Implement Pig counter to track number of rows for each input files
------------------------------------------------------------------
Key: PIG-1389
URL: https://issues.apache.org/jira/browse/PIG-1389
Project: Pig
Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0
Attachments: PIG-1389.patch, PIG-1389.patch

A MR job generated by Pig can have not only multiple outputs (in the case of multiquery) but also multiple inputs (in the case of join or cogroup). In both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a given input or output. PIG-1299 addressed the case of multiple outputs. We need to add new counters for jobs with multiple inputs.
[jira] Commented: (PIG-1467) order by fail when set fs.file.impl.disable.cache to true
[ https://issues.apache.org/jira/browse/PIG-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883230#action_12883230 ]

Richard Ding commented on PIG-1467:
-----------------------------------
+1

order by fail when set fs.file.impl.disable.cache to true
---------------------------------------------------------
Key: PIG-1467
URL: https://issues.apache.org/jira/browse/PIG-1467
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Fix For: 0.7.0, 0.8.0
Attachments: PIG-1467-1.patch, PIG-1467-2.patch

Order by fails with the message:

org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:135)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:551)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:630)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
        at org.apache.hadoop.mapred.Child.main(Child.java:211)

This happens with the following hadoop settings:
fs.file.impl.disable.cache=true
fs.hdfs.impl.disable.cache=true
[jira] Commented: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)
[ https://issues.apache.org/jira/browse/PIG-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883302#action_12883302 ]

Randy Prager commented on PIG-1470:
-----------------------------------
Thanks. We started testing with G1 GC on our hadoop cluster to avoid (which it seems to do) the exceptions

{noformat}
java.io.IOException: Task process exit with nonzero status of 134.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
{noformat}

which occur randomly on 6u18, 6u20 and the default GC. We are going to try some other Java version + GC combinations ... do you have any insight into a stable mix of Java versions and GC?
[jira] Created: (PIG-1471) inline UDFs in scripting languages
inline UDFs in scripting languages
----------------------------------
Key: PIG-1471
URL: https://issues.apache.org/jira/browse/PIG-1471
Project: Pig
Issue Type: New Feature
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi
Fix For: 0.8.0

It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. It should be possible to write these scripts inline as part of pig scripts. This feature is an extension of https://issues.apache.org/jira/browse/PIG-928
[jira] Commented: (PIG-1471) inline UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883327#action_12883327 ]

Aniket Mokashi commented on PIG-1471:
-------------------------------------
The proposed syntax is

{code}
define hellopig using org.apache.pig.scripting.jython.JythonScriptEngine as '@outputSchema(x:{t:(word:chararray)})\ndef helloworld():\n\treturn ('Hello, World')';
{code}
[jira] Created: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs
Optimize serialization/deserialization between Map and Reduce and between MR jobs
---------------------------------------------------------------------------------
Key: PIG-1472
URL: https://issues.apache.org/jira/browse/PIG-1472
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.8.0

In certain types of pig queries, most of the execution time is spent serializing/deserializing (sedes) records between Map and Reduce and between MR jobs. For example, if PigMix queries are modified to specify types for all the fields in the load statement schema, some of the queries (L2, L3, L9, L10 in pigmix v1) that have records with bags and maps being transmitted across map or reduce boundaries run a lot longer (a runtime increase of a few times has been seen).

There are a few optimizations that have been shown to improve the performance of sedes in my tests:

1. Use a smaller number of bytes to store the length of the column. For example, if a bytearray is smaller than 255 bytes, a single byte can be used to store the length instead of the integer that is currently used.
2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and DataInput.readUTF. This reduces the cost of serialization by more than 1/2.

Zebra and BinStorage are known to use DefaultTuple sedes functionality. The serialization format that these loaders use cannot change, so after the optimization their format is going to be different from the format used between M/R boundaries.
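Optimization #1 above can be sketched in a few lines: use a single length byte for short byte arrays, with an escape marker for the long form. The exact marker convention here is an illustrative assumption, not Pig's actual on-the-wire format:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SmallLengthEncoding {
    // Encode a column as [length][payload]: one length byte when the
    // payload is under 255 bytes, else a 255 marker followed by a 4-byte
    // int length. Saves 3 bytes per small column versus always writing
    // an int length.
    public static byte[] encode(byte[] column) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            if (column.length < 255) {
                out.writeByte(column.length); // 1 length byte
            } else {
                out.writeByte(255);           // marker: long form follows
                out.writeInt(column.length);  // 4 length bytes
            }
            out.write(column);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);    // cannot happen in-memory
        }
    }

    public static void main(String[] args) {
        // 1 length byte + 10 payload bytes, instead of 4 + 10
        System.out.println(encode(new byte[10]).length); // prints 11
    }
}
```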
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1295:
------------------------------------------------
Attachment: PIG-1295_0.6.patch

Ok, if the user does not use DefaultTuple we fall back to the default deserialization case. I added handling of nested tuples via recursion, and appropriate unit tests.

Binary comparator for secondary sort
------------------------------------
Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch

When the hadoop framework does the sorting, it will try to use the binary version of the comparator if one is available. The benefit of a binary comparator is that we do not need to instantiate the objects before comparing them. We saw a ~30% speedup after switching to a binary comparator. Currently, Pig uses a binary comparator in the following cases:

1. When the semantics of the order don't matter. For example, in distinct we need to sort in order to filter out duplicate values; however, we do not care how the comparator orders the keys. Group-by also shares this characteristic. In this case, we rely on hadoop's default binary comparator.
2. When the semantics of the order matter but the key is of a simple type. In this case, we have implementations for simple types, such as integer, long, float, chararray, databytearray, string.

However, if the key is a tuple and the sort semantics matter, we do not have a binary comparator implementation. This especially matters when we switch to secondary sort. In secondary sort, we convert the inner sort of a nested foreach into the secondary key and rely on hadoop to sort on both the main key and the secondary key. The sorting key becomes a two-item tuple. Since the secondary key is the sorting key of the nested foreach, the sorting semantics matter. It turns out we do not have a binary comparator once we use secondary sort, and we see a significant slowdown.

A binary comparator for tuples should be doable once we understand the binary structure of the serialized tuple. We can focus on the most common use case first, which is a group-by followed by a nested sort. In that case, we will use secondary sort; the semantics of the first key do not matter, but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself. Then, if the first keys are equal, we use a binary comparator to compare the secondary keys. The secondary key can also be of a complex data type, but for the first step we focus on a simple secondary key, which is the most common use case.

We mark this issue as a candidate project for the Google Summer of Code 2010 program.
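The core trick behind such a binary comparator can be shown on a single int key: serialize it big-endian with the sign bit flipped, and a plain unsigned byte-by-byte comparison then agrees with numeric order, so no object is ever instantiated. A self-contained sketch, illustrative only and not Pig's actual tuple format:

```java
public class RawIntComparator {
    // Big-endian encoding with the sign bit flipped, so that unsigned
    // lexicographic byte order equals signed numeric order.
    public static byte[] serialize(int v) {
        int u = v ^ 0x80000000; // flip sign bit
        return new byte[] {
            (byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u
        };
    }

    // Compare serialized keys directly on the bytes (what a binary
    // comparator does instead of deserializing both keys).
    public static int compareBytes(byte[] a, byte[] b) {
        for (int i = 0; i < a.length && i < b.length; i++) {
            int x = a[i] & 0xff, y = b[i] & 0xff;
            if (x != y) return x < y ? -1 : 1;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // -3 sorts before 7 without instantiating any Integer objects
        System.out.println(compareBytes(serialize(-3), serialize(7)) < 0); // true
    }
}
```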
[jira] Commented: (PIG-1399) Logical Optimizer: Expression optimizor rule
[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883348#action_12883348 ]

Yan Zhou commented on PIG-1399:
-------------------------------
Other expression optimizations include:

3. Erasure of a logically implied expression in AND
Example: B = filter A by (a0 > 5 and a0 > 7); => B = filter A by a0 > 7;

4. Erasure of a logically implied expression in OR
Example: B = filter A by ((a0 > 5) or (a0 > 6 and a1 > 15)); => B = filter A by a0 > 5;

A comprehensive example of optimizations 2, 3 and 4 together:
B = filter A by NOT((a0 > 1 and a0 > 0) or (a1 > 3 and a0 > 5)); => B = filter A by a0 <= 1;

Logical Optimizer: Expression optimizor rule
--------------------------------------------
Key: PIG-1399
URL: https://issues.apache.org/jira/browse/PIG-1399
Project: Pig
Issue Type: Sub-task
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou

We can optimize expressions in several ways:

1. Constant pre-calculation
Example: B = filter A by a0 > 5+7; => B = filter A by a0 > 12;

2. Boolean expression optimization
Example: B = filter A by not (not(a0 > 5) or a1 > 0); => B = filter A by a0 > 5 and a1 <= 0;
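Optimization 3 (erasure of implied terms in AND) can be illustrated on its simplest case, a conjunction of greater-than predicates on one column, where only the largest constant survives. This toy sketch shows the arithmetic behind the rule, not how the optimizer is implemented (it rewrites logical plan nodes, not arrays of constants):

```java
public class FilterSimplifier {
    // For "col > c1 and col > c2 and ..." over a single column, every
    // term is implied by the one with the largest constant, so the whole
    // conjunction collapses to "col > max(c1, c2, ...)".
    public static int simplifyAndOfGreaterThan(int[] constants) {
        int max = constants[0];
        for (int c : constants) {
            if (c > max) max = c;
        }
        return max;
    }

    public static void main(String[] args) {
        // "a0 > 5 and a0 > 7" collapses to "a0 > 7"
        System.out.println(simplifyAndOfGreaterThan(new int[] {5, 7})); // prints 7
    }
}
```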
[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883354#action_12883354 ]

Richard Ding commented on PIG-1389:
-----------------------------------
It seems there is no good solution for Merge Join and Merge Cogroup in this case. So I'm going to treat them the same way as Replicated Join and not add counters for all side files.
[jira] Created: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
-----------------------------------------------------------------------------------------------------
Key: PIG-1473
URL: https://issues.apache.org/jira/browse/PIG-1473
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Fix For: 0.8.0

The cost of serialization/deserialization (sedes) can be very high, and avoiding it will improve performance. Avoid sedes when possible by implementing approach #3 proposed in http://wiki.apache.org/pig/AvoidingSedes . The load function uses subclasses of Map and DataBag which hold the serialized copy. The load function delays deserialization of map and bag types until a member function of java.util.Map or DataBag is called.

Example of a query where this will help:

{CODE}
l = LOAD 'file1' AS (a : int, b : map [ ]);
f = FOREACH l GENERATE udf1(a), b;
fil = FILTER f BY $0 > 5;
dump fil; -- De-serialization of column b can be delayed until here using this approach.
{CODE}
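A minimal sketch of the lazy-deserialization idea in approach #3: hold the serialized bytes and parse them only when a map operation is first invoked. The "k#v" text format and the class shape are illustrative assumptions, not PigStorage's actual wire format (a real version would subclass java.util.Map so existing code sees an ordinary map):

```java
import java.util.HashMap;
import java.util.Map;

public class LazyMap {
    private final String serialized;        // e.g. "k1#v1,k2#v2"
    private Map<String, String> parsed;     // filled in on first use

    public LazyMap(String serialized) {
        this.serialized = serialized;       // no parsing at load time
    }

    // Deserialize on first access only; rows whose map column is never
    // touched (e.g. filtered out earlier) never pay the parsing cost.
    private Map<String, String> force() {
        if (parsed == null) {
            parsed = new HashMap<String, String>();
            for (String kv : serialized.split(",")) {
                String[] parts = kv.split("#", 2);
                parsed.put(parts[0], parts[1]);
            }
        }
        return parsed;
    }

    public String get(String key) { return force().get(key); }
    public int size()             { return force().size(); }
}
```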
[jira] Assigned: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair reassigned PIG-1473:
----------------------------------
Assignee: Thejas M Nair
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883361#action_12883361 ]

Daniel Dai commented on PIG-1295:
---------------------------------
Thanks, is the patch ready for review?
Avoiding serialization/de-serialization in pig
I have created a wiki page that puts together some ideas that can help improve performance by avoiding or delaying serialization/de-serialization: http://wiki.apache.org/pig/AvoidingSedes

These are ideas that don't involve changes to the optimizer. Most of them involve changes to the load/store functions. Your feedback is welcome.

Thanks,
Thejas
[jira] Created: (PIG-1474) Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple
Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple
--------------------------------------------------------------------------------
Key: PIG-1474
URL: https://issues.apache.org/jira/browse/PIG-1474
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Fix For: 0.8.0

Avoid sedes when possible for data loaded using PigStorage by implementing approach #4 proposed in http://wiki.apache.org/pig/AvoidingSedes . The write() and readFields() functions of the tuple returned by TupleFactory are used to serialize data between Map and Reduce. By using a tuple that knows the serialization format of the loader, we avoid sedes at the Map/Reduce boundary and use the load function's serialized format between Map and Reduce.

To use a new custom tuple for this purpose, a custom TupleFactory that returns tuples of this type has to be specified using the property pig.data.tuple.factory.name.

This approach will work only for a set of load functions in the query that share the same serialization format for maps and bags. If this approach proves to be very useful, it will build a case for a more extensible approach.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1295:
----------------------------
Status: Patch Available (was: Open)
Fix Version/s: 0.8.0
[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1389:
------------------------------
Attachment: PIG-1389_1.patch
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883367#action_12883367 ] Gianmarco De Francisci Morales commented on PIG-1295: - I think it is Binary comparator for secondary sort Key: PIG-1295 URL: https://issues.apache.org/jira/browse/PIG-1295 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.7.0 Reporter: Daniel Dai Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch When hadoop framework doing the sorting, it will try to use binary version of comparator if available. The benefit of binary comparator is we do not need to instantiate the object before we compare. We see a ~30% speedup after we switch to binary comparator. Currently, Pig use binary comparator in following case: 1. When semantics of order doesn't matter. For example, in distinct, we need to do a sort in order to filter out duplicate values; however, we do not care how comparator sort keys. Groupby also share this character. In this case, we rely on hadoop's default binary comparator 2. Semantics of order matter, but the key is of simple type. In this case, we have implementation for simple types, such as integer, long, float, chararray, databytearray, string However, if the key is a tuple and the sort semantics matters, we do not have a binary comparator implementation. This especially matters when we switch to use secondary sort. In secondary sort, we convert the inner sort of nested foreach into the secondary key and rely on hadoop to sorting on both main key and secondary key. The sorting key will become a two items tuple. Since the secondary key the sorting key of the nested foreach, so the sorting semantics matters. It turns out we do not have binary comparator once we use secondary sort, and we see a significant slow down. 
A binary comparator for tuples should be doable once we understand the binary structure of the serialized tuple. We can focus on the most common use case first: a group by followed by a nested sort, in which case we use secondary sort. The semantics of the first key do not matter, but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself. Then, if the first keys are equal, we use a binary comparator to compare the secondary keys. The secondary key can also be a complex data type, but as a first step we focus on simple secondary keys, which are the most common use case. We mark this issue as a candidate project for the Google Summer of Code 2010 program. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
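The boundary idea above can be sketched in plain Java. This is only an illustration under a toy assumption: each serialized tuple is two big-endian 4-byte ints (main key, then secondary key). Real Pig tuples use a richer variable-length format, and the class and method names here are hypothetical, not Pig's actual comparator API. The point is that both keys are compared straight out of the byte buffers, with no Tuple object ever instantiated:

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: assumes a toy fixed-width serialization of
// (mainKey:int, secondaryKey:int), both big-endian. Real Pig tuples are
// variable-length; the idea shown is comparing raw bytes without
// deserializing into Tuple objects.
public class RawTupleComparator {

    // Compare two serialized (main, secondary) tuples directly from bytes.
    public static int compare(byte[] b1, byte[] b2) {
        ByteBuffer t1 = ByteBuffer.wrap(b1);
        ByteBuffer t2 = ByteBuffer.wrap(b2);
        // Main key occupies bytes 0..3 in this toy format.
        int main = Integer.compare(t1.getInt(0), t2.getInt(0));
        if (main != 0) {
            return main;                       // main keys differ: done
        }
        // Main keys equal: fall through to the secondary key (bytes 4..7).
        return Integer.compare(t1.getInt(4), t2.getInt(4));
    }

    // Helper to build a serialized tuple in the toy format.
    public static byte[] serialize(int main, int secondary) {
        return ByteBuffer.allocate(8).putInt(main).putInt(secondary).array();
    }
}
```

In the toy format the main/secondary boundary is a fixed offset; the hard part PIG-1295 describes is finding that boundary in Pig's real variable-length tuple encoding.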
[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1389: -- Status: Open (was: Patch Available) Implement Pig counter to track number of rows for each input files --- Key: PIG-1389 URL: https://issues.apache.org/jira/browse/PIG-1389 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1389.patch, PIG-1389.patch An MR job generated by Pig can have not only multiple outputs (in the case of multiquery) but also multiple inputs (in the case of join or cogroup). In both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a given input or output. PIG-1299 addressed the case of multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files
[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1389: -- Status: Patch Available (was: Open) Implement Pig counter to track number of rows for each input files --- Key: PIG-1389 URL: https://issues.apache.org/jira/browse/PIG-1389 Project: Pig Issue Type: Improvement Affects Versions: 0.7.0 Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch An MR job generated by Pig can have not only multiple outputs (in the case of multiquery) but also multiple inputs (in the case of join or cogroup). In both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a given input or output. PIG-1299 addressed the case of multiple outputs. We need to add new counters for jobs with multiple inputs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
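A minimal sketch of what such per-input counters would track. In a real MR job these would be Hadoop Counters, one per input path; here a plain map stands in for the counter group, and the class and method names are illustrative, not Pig's actual API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: one record counter per input path, as PIG-1389 proposes.
// A plain map stands in for Hadoop's Counter mechanism.
public class InputRecordCounters {
    private final Map<String, Long> counters = new LinkedHashMap<>();

    // Called once per record read from the given input path.
    public void increment(String inputPath) {
        counters.merge(inputPath, 1L, Long::sum);
    }

    // Final count for one input; 0 if no records were read from it.
    public long get(String inputPath) {
        return counters.getOrDefault(inputPath, 0L);
    }
}
```

The key design point from the issue: because a join or cogroup job reads several inputs in one map phase, a single MAP_INPUT_RECORDS total cannot be attributed to any one input, so the counter must be keyed by input.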
Re: Avoiding serialization/de-serialization in pig
For what it's worth, I saw very significant speed improvements (an order of magnitude for wide tables with few projected columns) when I implemented (2) for our protocol-buffer-based loaders. I have a feeling that propagating schemas when known, and using them for (de)serialization instead of reflecting on every field, would also be a big win. Thoughts on just using Avro for the internal PigStorage? -D On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair te...@yahoo-inc.com wrote: I have created a wiki that puts together some ideas that can help improve performance by avoiding/delaying serialization/de-serialization: http://wiki.apache.org/pig/AvoidingSedes These are ideas that don't involve changes to the optimizer. Most of them involve changes in the load/store functions. Your feedback is welcome. Thanks, Thejas
Re: Avoiding serialization/de-serialization in pig
I don't fully understand the repercussions of this, but I like it. We're moving from our VoldemortStorage stuff to Avro and it would be great to pipe Avro all the way through. Russ On Mon, Jun 28, 2010 at 5:51 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: For what it's worth, I saw very significant speed improvements (an order of magnitude for wide tables with few projected columns) when I implemented (2) for our protocol-buffer-based loaders. I have a feeling that propagating schemas when known, and using them for (de)serialization instead of reflecting on every field, would also be a big win. Thoughts on just using Avro for the internal PigStorage? -D On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair te...@yahoo-inc.com wrote: I have created a wiki that puts together some ideas that can help improve performance by avoiding/delaying serialization/de-serialization: http://wiki.apache.org/pig/AvoidingSedes These are ideas that don't involve changes to the optimizer. Most of them involve changes in the load/store functions. Your feedback is welcome. Thanks, Thejas
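The schema-propagation idea from the thread above can be sketched as follows. This is a hypothetical illustration, not Pig's loader code: when the loader knows the schema up front, field conversion is dispatched once per column by declared type, instead of inspecting (or reflecting on) every value at runtime:

```java
// Sketch of schema-driven deserialization: the declared schema decides
// each field's conversion, so no per-value type inspection is needed.
// Class, enum, and method names are illustrative, not Pig's API.
public class SchemaDrivenParser {
    enum FieldType { INT, CHARARRAY }

    // Parse one tab-delimited line into typed fields per the schema.
    public static Object[] parse(String line, FieldType[] schema) {
        String[] raw = line.split("\t");
        Object[] out = new Object[schema.length];
        for (int i = 0; i < schema.length; i++) {
            // The schema, not the value, chooses the conversion.
            out[i] = (schema[i] == FieldType.INT)
                    ? (Object) Integer.valueOf(raw[i])
                    : (Object) raw[i];
        }
        return out;
    }
}
```

A schema-aware format like Avro takes this further by fixing the wire layout from the schema, so fields can be decoded positionally with no delimiters at all.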
[jira] Updated: (PIG-1350) [Zebra] Zebra column names cannot have leading _
[ https://issues.apache.org/jira/browse/PIG-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1350: Fix Version/s: (was: 0.8.0) [Zebra] Zebra column names cannot have leading _ -- Key: PIG-1350 URL: https://issues.apache.org/jira/browse/PIG-1350 Project: Pig Issue Type: Improvement Reporter: Xuefu Zhang Assignee: Xuefu Zhang Attachments: pig-1350.patch, pig-1350.patch Disallowing '_' as the leading character of column names in a Zebra schema is too restrictive; this restriction should be lifted. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1120) [zebra] should support using org.apache.hadoop.zebra.pig.TableStorer() if user does not want to specify storage hint
[ https://issues.apache.org/jira/browse/PIG-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1120: Fix Version/s: (was: 0.8.0) [zebra] should support using org.apache.hadoop.zebra.pig.TableStorer() if user does not want to specify storage hint - Key: PIG-1120 URL: https://issues.apache.org/jira/browse/PIG-1120 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Jing Huang If the user doesn't want to specify a storage hint, the current Zebra implementation only supports using org.apache.hadoop.zebra.pig.TableStorer('') (note the empty string in TableStorer('')). We should support the format org.apache.hadoop.zebra.pig.TableStorer(), as we already do for org.apache.hadoop.zebra.pig.TableLoader().

sample pig script:
register /grid/0/dev/hadoopqa/jars/zebra.jar;
a = load '1.txt' as (a:int, b:float, c:long, d:double, e:chararray, f:bytearray, r1(f1:chararray, f2:chararray), m1:map[]);
b = load '2.txt' as (a:int, b:float, c:long, d:double, e:chararray, f:bytearray, r1(f1:chararray, f2:chararray), m1:map[]);
c = join a by a, b by a;
d = foreach c generate a::a, a::b, b::c;
describe d;
dump d;
store d into 'join3' using org.apache.hadoop.zebra.pig.TableStorer('');
-- this will fail:
-- store d into 'join3' using org.apache.hadoop.zebra.pig.TableStorer( );

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1137) [zebra] get* methods of Zebra Map/Reduce APIs need improvements
[ https://issues.apache.org/jira/browse/PIG-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1137: Fix Version/s: (was: 0.8.0) [zebra] get* methods of Zebra Map/Reduce APIs need improvements --- Key: PIG-1137 URL: https://issues.apache.org/jira/browse/PIG-1137 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0 Reporter: Yan Zhou Assignee: Yan Zhou Currently the set* methods take external Zebra objects, namely objects of ZebraStorageHint, ZebraSchema, ZebraSortInfo or ZebraProjection. Correspondingly, the get* methods should return such objects instead of String or Zebra-internal objects like Schema. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1355) [Zebra] Zebra Multiple Outputs should enable application to skip records
[ https://issues.apache.org/jira/browse/PIG-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1355: Fix Version/s: (was: 0.8.0) Description: Applications may not always want to write a record to a table; Zebra should allow applications to skip such records. The Zebra Multiple Outputs interface allows users to stream data to different tables by inspecting the data Tuple. https://issues.apache.org/jira/browse/PIG- So, if ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that record and thus will not write it to any table. However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) will write every record to a table. was: Applications may not always want to write a record to a table; Zebra should allow applications to skip such records. The Zebra Multiple Outputs interface allows users to stream data to different tables by inspecting the data Tuple. https://issues.apache.org/jira/browse/PIG- So, if ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that record and thus will not write it to any table. However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) will write every record to a table. [Zebra] Zebra Multiple Outputs should enable application to skip records - Key: PIG-1355 URL: https://issues.apache.org/jira/browse/PIG-1355 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.8.0 Reporter: Gaurav Jain Assignee: Gaurav Jain Priority: Minor Applications may not always want to write a record to a table; Zebra should allow applications to skip such records. The Zebra Multiple Outputs interface allows users to stream data to different tables by inspecting the data Tuple. https://issues.apache.org/jira/browse/PIG- So, if ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that record and thus will not write it to any table. However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) will write every record to a table. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
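The proposed -1-to-skip contract can be sketched like this. The interface and names are illustrative stand-ins, not Zebra's actual ZebraOutputPartition API: the partitioner returns the index of the destination table, and -1 drops the record entirely:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of PIG-1355's skip semantics: a partitioner picks the output
// table per record, and -1 means the record is written nowhere.
public class SkippingWriter {
    // Stand-in for Zebra's output partitioner contract.
    interface OutputPartitioner {
        int partitionOf(String record);   // table index, or -1 to skip
    }

    // Route each record to its table's bucket, honoring -1 as "skip".
    public static List<String>[] route(List<String> records, int numTables,
                                       OutputPartitioner p) {
        @SuppressWarnings("unchecked")
        List<String>[] tables = new List[numTables];
        for (int i = 0; i < numTables; i++) {
            tables[i] = new ArrayList<>();
        }
        for (String r : records) {
            int idx = p.partitionOf(r);
            if (idx == -1) {
                continue;                 // skipped: not written to any table
            }
            tables[idx].add(r);
        }
        return tables;
    }
}
```

The issue's complaint is that BasicTableOutputFormat has no equivalent of the `continue` branch: every record must land in some table.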
[jira] Updated: (PIG-1411) [Zebra] Can Zebra use HAR to reduce file/block count for namenode
[ https://issues.apache.org/jira/browse/PIG-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1411: Fix Version/s: (was: 0.8.0) Description: Due to its column group structure, Zebra can create extra files for the namenode to track, which means the namenode uses more memory for Zebra-related files. The goal is to reduce the number of files/blocks. The idea, among various options, is to use HAR (Hadoop Archive). Hadoop Archive reduces the block and file count by copying data from small files (1M, 2M, ...) into an hdfs-block of larger size, thus reducing the total number of blocks and files. was: Due to its column group structure, Zebra can create extra files for the namenode to track, which means the namenode uses more memory for Zebra-related files. The goal is to reduce the number of files/blocks. The idea, among various options, is to use HAR (Hadoop Archive). Hadoop Archive reduces the block and file count by copying data from small files (1M, 2M, ...) into an hdfs-block of larger size, thus reducing the total number of blocks and files. [Zebra] Can Zebra use HAR to reduce file/block count for namenode - Key: PIG-1411 URL: https://issues.apache.org/jira/browse/PIG-1411 Project: Pig Issue Type: New Feature Components: impl Affects Versions: 0.8.0 Reporter: Gaurav Jain Assignee: Gaurav Jain Priority: Minor Due to its column group structure, Zebra can create extra files for the namenode to track, which means the namenode uses more memory for Zebra-related files. The goal is to reduce the number of files/blocks. The idea, among various options, is to use HAR (Hadoop Archive). Hadoop Archive reduces the block and file count by copying data from small files (1M, 2M, ...) into an hdfs-block of larger size, thus reducing the total number of blocks and files. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc
[ https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1337: Fix Version/s: (was: 0.8.0) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc -- Key: PIG-1337 URL: https://issues.apache.org/jira/browse/PIG-1337 Project: Pig Issue Type: Improvement Affects Versions: 0.6.0 Reporter: Chao Wang The Zebra storage layer needs to use the distributed cache to reduce namenode load during job runs. To do this, Zebra needs to set up distributed-cache-related configuration information in TableLoader (which extends Pig's LoadFunc). It is doing this within getSchema(conf). The problem is that the conf object here is not the one that is serialized to the map/reduce backend, so the distributed cache is not set up properly. To work around this problem, Pig's LoadFunc needs to provide a way to set up distributed cache information in a conf object that is the one used by the map/reduce backend. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883382#action_12883382 ] Jeff Zhang commented on PIG-1473: - This sounds like the lazy deserialization in Hive. Great! Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation - Key: PIG-1473 URL: https://issues.apache.org/jira/browse/PIG-1473 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 The cost of serialization/deserialization (sedes) can be very high, and avoiding it will improve performance. Avoid sedes when possible by implementing approach #3 proposed in http://wiki.apache.org/pig/AvoidingSedes . The load function uses subclasses of Map and DataBag which hold the serialized copy. The load function delays deserialization of map and bag types until a member function of java.util.Map or DataBag is called. Example of a query where this will help:
{CODE}
l = LOAD 'file1' AS (a : int, b : map [ ]);
f = FOREACH l GENERATE udf1(a), b;
fil = FILTER f BY $0 > 5;
dump fil; -- Serialization of column b can be delayed until here using this approach.
{CODE}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
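A minimal sketch of approach #3: a Map that holds its serialized form and parses it only when a java.util.Map member function is first called. The `key#value,key#value` wire format mimics PigStorage's map notation, but the class itself is purely illustrative, not Pig's actual implementation:

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of a lazily-deserialized map: construction is just a string
// reference; parsing happens on the first real Map access.
public class LazyMap extends AbstractMap<String, String> {
    private final String serialized;          // PigStorage-style "k#v,k#v"
    private Map<String, String> parsed;       // null until first access

    public LazyMap(String serialized) {
        this.serialized = serialized;
    }

    // Deserialize on demand, exactly once.
    private Map<String, String> force() {
        if (parsed == null) {
            parsed = new HashMap<>();
            for (String kv : serialized.split(",")) {
                String[] pair = kv.split("#", 2);
                parsed.put(pair[0], pair[1]);
            }
        }
        return parsed;
    }

    // All AbstractMap operations (get, size, ...) route through entrySet.
    @Override
    public Set<Entry<String, String>> entrySet() {
        return force().entrySet();
    }

    // Exposed for illustration: has deserialization happened yet?
    public boolean isParsed() {
        return parsed != null;
    }
}
```

In the query above, rows filtered out by `$0 > 5` would carry their map column `b` through the pipeline as an unparsed string and never pay the deserialization cost.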