Re: load files

2010-06-28 Thread Gang Luo
Thanks, Jeff.
In Pig, the file names look like this: part-m-x (for map output) or 
part-r-x (for reduce output), which is different from the hadoop style 
(part-x). So, can we control the name of each generated file? How?

Thanks,
-Gang



----- Original Message -----
From: Jeff Zhang zjf...@gmail.com
To: pig-dev@hadoop.apache.org
Sent: Sunday, 2010/6/27, 9:22:30 PM
Subject: Re: load files

Hi Gang,

The path specified in load can be either a file or a directory; besides,
you can also leverage hadoop's globStatus. The path specified in store is
a directory.



On Mon, Jun 28, 2010 at 4:44 AM, Gang Luo lgpub...@yahoo.com.cn wrote:
 Hi all,
 When we specify the input path for a load operator, is it a file or a 
 directory? Similarly, when we use store-load to connect two MR operators, is 
 the path specified in the store and load a directory?

 Thanks,
 -Gang








-- 
Best Regards

Jeff Zhang






[jira] Created: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)

2010-06-28 Thread Randy Prager (JIRA)
map/red jobs fail using G1 GC (Couldn't find heap)
--

 Key: PIG-1470
 URL: https://issues.apache.org/jira/browse/PIG-1470
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
 Environment: OS: 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 
x86_64 x86_64 x86_64 GNU/Linux
Java: Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
Hadoop: 0.20.1

Reporter: Randy Prager


Here is the hadoop map/red configuration (conf/mapred-site.xml) that fails

{noformat}
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx300m -XX:+DoEscapeAnalysis -XX:+UseCompressedOops
  -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC</value>
</property>
{noformat}

Here is the hadoop map/red configuration that succeeds

{noformat}
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx300m -XX:+DoEscapeAnalysis
  -XX:+UseCompressedOops</value>
</property>
{noformat}

Here is the exception from the pig script.

{noformat}
Backend error message
-
org.apache.pig.backend.executionengine.ExecException: ERROR 2081: Unable to set up the load function.
        at org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:89)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SliceWrapper.makeReader(SliceWrapper.java:144)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getRecordReader(PigInputFormat.java:282)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[,]'
        at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:519)
        at org.apache.pig.backend.executionengine.PigSlice.init(PigSlice.java:85)
        ... 5 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:487)
        ... 6 more
Caused by: java.lang.RuntimeException: Couldn't find heap
        at org.apache.pig.impl.util.SpillableMemoryManager.<init>(SpillableMemoryManager.java:95)
        at org.apache.pig.data.BagFactory.<init>(BagFactory.java:106)
        at org.apache.pig.data.DefaultBagFactory.<init>(DefaultBagFactory.java:71)
        at org.apache.pig.data.BagFactory.getInstance(BagFactory.java:76)
        at org.apache.pig.builtin.Utf8StorageConverter.<init>(Utf8StorageConverter.java:49)
        at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:69)
        at org.apache.pig.builtin.PigStorage.<init>(PigStorage.java:79)
        ... 11 more
{noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-28 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1389:
--

Attachment: PIG-1389.patch

Synced with the latest trunk.

 Implement Pig counter to track number of rows for each input files 
 ---

 Key: PIG-1389
 URL: https://issues.apache.org/jira/browse/PIG-1389
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1389.patch, PIG-1389.patch


 An MR job generated by Pig can have not only multiple outputs (in the case of 
 multiquery) but also multiple inputs (in the case of join or cogroup). In both 
 cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
 REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
 given input or output. PIG-1299 addressed the case of multiple outputs. We 
 need to add new counters for jobs with multiple inputs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1467) order by fail when set fs.file.impl.disable.cache to true

2010-06-28 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883230#action_12883230
 ] 

Richard Ding commented on PIG-1467:
---

+1

 order by fail when set fs.file.impl.disable.cache to true
 ---

 Key: PIG-1467
 URL: https://issues.apache.org/jira/browse/PIG-1467
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.7.0, 0.8.0

 Attachments: PIG-1467-1.patch, PIG-1467-2.patch


 Order by fails with the following message:
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:135)
 at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
 at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
 at 
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:551)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:630)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:314)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
 at org.apache.hadoop.mapred.Child.main(Child.java:211)
 This happens with the following hadoop settings:
 fs.file.impl.disable.cache=true
 fs.hdfs.impl.disable.cache=true

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1470) map/red jobs fail using G1 GC (Couldn't find heap)

2010-06-28 Thread Randy Prager (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883302#action_12883302
 ] 

Randy Prager commented on PIG-1470:
---

Thanks. We started testing with G1 GC on our hadoop cluster to avoid the 
exceptions below (which it seems to do):

{noformat}
java.io.IOException: Task process exit with nonzero status of 134.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
{noformat}

These occur randomly on 6u18 and 6u20 with the default GC. We are going to try 
some other Java version + GC combinations ... do you have any insight into a 
stable mix of Java versions and GCs?

 map/red jobs fail using G1 GC (Couldn't find heap)
 --

 Key: PIG-1470
 URL: https://issues.apache.org/jira/browse/PIG-1470
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
 Environment: OS: 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 
 x86_64 x86_64 x86_64 GNU/Linux
 Java: Java(TM) SE Runtime Environment (build 1.6.0_18-b07)
 Hadoop: 0.20.1
Reporter: Randy Prager


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1471) inline UDFs in scripting languages

2010-06-28 Thread Aniket Mokashi (JIRA)
inline UDFs in scripting languages
--

 Key: PIG-1471
 URL: https://issues.apache.org/jira/browse/PIG-1471
 Project: Pig
  Issue Type: New Feature
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi
 Fix For: 0.8.0


It should be possible to write UDFs in scripting languages such as Python, 
Ruby, etc. This frees users from needing to compile Java, generate a jar, etc. 
It also opens Pig to programmers who prefer scripting languages over Java. It 
should be possible to write these scripts inline as part of Pig scripts. This 
feature is an extension of https://issues.apache.org/jira/browse/PIG-928


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1471) inline UDFs in scripting languages

2010-06-28 Thread Aniket Mokashi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883327#action_12883327
 ] 

Aniket Mokashi commented on PIG-1471:
-

The proposed syntax is:
{code}
define hellopig using org.apache.pig.scripting.jython.JythonScriptEngine as 
'@outputSchema(x:{t:(word:chararray)})\ndef helloworld():\n\treturn ('Hello, 
World')';
{code}
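For readability, the inline string above unescapes to an ordinary Jython script along the following lines. This is a sketch: the `outputSchema` decorator is supplied by Pig's JythonScriptEngine, so a no-op stand-in is defined here to keep the fragment self-contained, and the quoting of the schema string is an assumption.

```python
# Stand-in for the decorator that Pig's JythonScriptEngine provides;
# it simply records the declared output schema on the function.
def outputSchema(schema):
    def wrap(func):
        func.output_schema = schema
        return func
    return wrap

@outputSchema("x:{t:(word:chararray)}")
def helloworld():
    return ('Hello, World')
```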

 inline UDFs in scripting languages
 --

 Key: PIG-1471
 URL: https://issues.apache.org/jira/browse/PIG-1471
 Project: Pig
  Issue Type: New Feature
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi
 Fix For: 0.8.0



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2010-06-28 Thread Thejas M Nair (JIRA)
Optimize serialization/deserialization between Map and Reduce and between MR 
jobs
-

 Key: PIG-1472
 URL: https://issues.apache.org/jira/browse/PIG-1472
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


In certain types of pig queries, most of the execution time is spent 
serializing/deserializing (sedes) records between Map and Reduce and between MR 
jobs. 
For example, if the PigMix queries are modified to specify types for all the 
fields in the load statement schema, some of the queries (L2, L3, L9, L10 in 
pigmix v1) that have records with bags and maps transmitted across map or 
reduce boundaries run a lot longer (a runtime increase of a few times has been 
seen).

There are a few optimizations that have been shown to improve the performance 
of sedes in my tests:
1. Use a smaller number of bytes to store the length of a column. For example, 
if a bytearray is shorter than 255 bytes, a single byte can be used to store 
the length instead of the integer that is currently used.
2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
DataInput.readUTF. This reduces the cost of serialization by more than half.
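Optimization 1 can be sketched as follows. This is a minimal illustration of the idea, not Pig's actual wire format: lengths under 255 fit in one byte, with a 0xFF escape marker falling back to the 4-byte int that is currently always used.

```python
import struct

# Sketch (hypothetical framing, not Pig's real serialization format):
# short bytearrays get a 1-byte length prefix instead of a 4-byte int.
def encode_bytearray(data: bytes) -> bytes:
    if len(data) < 255:
        return bytes([len(data)]) + data
    # escape marker, then the full 4-byte big-endian length
    return b"\xff" + struct.pack(">i", len(data)) + data

def decode_bytearray(buf: bytes) -> bytes:
    marker = buf[0]
    if marker < 255:
        return buf[1:1 + marker]
    (n,) = struct.unpack(">i", buf[1:5])
    return buf[5:5 + n]
```

A short column now carries 1 byte of length overhead instead of 4, which adds up across billions of records crossing M/R boundaries.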

Zebra and BinStorage are known to use DefaultTuple's sedes functionality. The 
serialization format that these loaders use cannot change, so after the 
optimization their format will differ from the format used at M/R boundaries.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-06-28 Thread Gianmarco De Francisci Morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gianmarco De Francisci Morales updated PIG-1295:


Attachment: PIG-1295_0.6.patch

OK, if the user does not use DefaultTuple, we fall back to the default 
deserialization case.

I added handling of nested tuples via recursion and appropriate unit tests.

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, 
 PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch


 When the hadoop framework does the sorting, it tries to use a binary version 
 of the comparator if one is available. The benefit of a binary comparator is 
 that we do not need to instantiate the objects before comparing them. We saw a 
 ~30% speedup after switching to a binary comparator. Currently, Pig uses a 
 binary comparator in the following cases:
 1. When the semantics of the order don't matter. For example, in distinct we 
 need to sort in order to filter out duplicate values, but we do not care how 
 the comparator sorts keys. Group-by shares this character. In this case, we 
 rely on hadoop's default binary comparator.
 2. When the semantics of the order matter but the key is of a simple type. In 
 this case, we have implementations for simple types such as integer, long, 
 float, chararray, databytearray, and string.
 However, if the key is a tuple and the sort semantics matter, we do not have a 
 binary comparator implementation. This especially matters when we switch to 
 secondary sort. In secondary sort, we convert the inner sort of a nested 
 foreach into the secondary key and rely on hadoop to sort on both the main key 
 and the secondary key. The sort key becomes a two-item tuple. Since the 
 secondary key is the sort key of the nested foreach, its sort semantics 
 matter. It turns out we have no binary comparator once we use secondary sort, 
 and we see a significant slowdown.
 A binary comparator for tuples should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on the most common use case 
 first: a group-by followed by a nested sort. In this case, we will use 
 secondary sort. The semantics of the first key do not matter but the semantics 
 of the secondary key do. We need to identify the boundary between the main key 
 and the secondary key in the binary tuple buffer without instantiating the 
 tuple itself. Then, if the first keys are equal, we use a binary comparator to 
 compare the secondary keys. The secondary key can also be a complex data type, 
 but for the first step we focus on a simple secondary key, which is the most 
 common use case.
 We mark this issue as a candidate project for the Google Summer of Code 2010 
 program.
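The core idea of comparing the serialized (main key, secondary key) tuple without instantiating it can be sketched as below. The length-prefixed field encoding is a hypothetical stand-in for Pig's real serialized tuple layout; only the overall structure of the comparison is the point.

```python
# Sketch only: assume each field is serialized as a 1-byte length
# followed by the field's bytes (a stand-in for Pig's real format).
def _field(buf, off):
    n = buf[off]
    return buf[off + 1:off + 1 + n], off + 1 + n

def raw_compare(a, b):
    """Compare two serialized (main key, secondary key) tuples without
    instantiating them. The main key only needs a consistent order for
    grouping; order semantics matter only for the secondary key."""
    main_a, off_a = _field(a, 0)
    main_b, off_b = _field(b, 0)
    if main_a != main_b:
        # any consistent order works for the main (grouping) key
        return -1 if main_a < main_b else 1
    sec_a, _ = _field(a, off_a)
    sec_b, _ = _field(b, off_b)
    return (sec_a > sec_b) - (sec_a < sec_b)
```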

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-06-28 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883348#action_12883348
 ] 

Yan Zhou commented on PIG-1399:
---

Other expression optimizations include:

3. Erasure of logically implied expressions in AND
Example:
B = filter A by (a0 > 5 and a0 > 7);
=> B = filter A by a0 > 7;

4. Erasure of logically implied expressions in OR
Example:
B = filter A by ((a0 > 5) or (a0 > 6 and a1 > 15));
=> B = filter A by a0 > 5;

A comprehensive example of optimizations 2, 3 and 4:
B = filter A by NOT((a0 > 1 and a0 > 0) or (a1 > 3 and a0 > 5));
=> B = filter A by a0 <= 1;
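For the single-column case, the erasure rules can be sketched numerically (an illustration of the logic only, not Pig's optimizer code): in a conjunction of `a0 > bound` terms the strongest bound survives, and in a disjunction the weakest one does.

```python
# Illustration of rules 3 and 4 (not Pig's optimizer) for predicates
# of the form "a0 > bound" on a single column:
#   AND: (a0 > b1 and a0 > b2)  simplifies to  a0 > max(b1, b2)
#   OR:  (a0 > b1 or  a0 > b2)  simplifies to  a0 > min(b1, b2)
def simplify_and_gt(bounds):
    return max(bounds)

def simplify_or_gt(bounds):
    return min(bounds)
```

This matches the examples: `a0 > 5 and a0 > 7` keeps only `a0 > 7`, and a disjunct that implies `a0 > 5` is absorbed into it.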

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou

 We can optimize expressions in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0 > 5) or a1 > 0);
 => B = filter A by a0 > 5 and a1 <= 0;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-28 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883354#action_12883354
 ] 

Richard Ding commented on PIG-1389:
---

It seems there is no good solution for Merge Join and Merge Cogroup in this 
case. So I'm going to treat them the same way as Replicated Join and not add 
counters for all side files.

 Implement Pig counter to track number of rows for each input files 
 ---

 Key: PIG-1389
 URL: https://issues.apache.org/jira/browse/PIG-1389
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1389.patch, PIG-1389.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation

2010-06-28 Thread Thejas M Nair (JIRA)
Avoid serialization/deserialization costs for PigStorage data - Use custom Map 
and Bag implementation
-

 Key: PIG-1473
 URL: https://issues.apache.org/jira/browse/PIG-1473
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
 Fix For: 0.8.0


Cost of serialization/deserialization (sedes) can be very high, and avoiding it 
will improve performance.

Avoid sedes when possible by implementing approach #3 proposed in 
http://wiki.apache.org/pig/AvoidingSedes .

The load function uses subclasses of Map and DataBag which hold the serialized 
copy. The load function delays deserialization of map and bag types until a 
member function of java.util.Map or DataBag is called.

Example of a query where this will help:
{CODE}
l = LOAD 'file1' AS (a : int, b : map [ ]);
f = FOREACH l GENERATE udf1(a), b;
fil = FILTER f BY $0 > 5;
dump fil; -- Deserialization of column b can be delayed until here using this
approach.
{CODE}
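The delayed-deserialization wrapper can be sketched like this. It is a hedged illustration in Python rather than Pig's Java DataBag/Map subclasses, and `deserialize` stands in for the loader's real decoder:

```python
class LazyMap:
    """Holds the loader's serialized bytes and only deserializes on the
    first member-function call, mirroring approach #3 above."""
    def __init__(self, raw, deserialize):
        self._raw = raw
        self._deserialize = deserialize  # loader-supplied decoder (assumed)
        self._map = None
    def _force(self):
        if self._map is None:
            self._map = self._deserialize(self._raw)
            self._raw = None  # drop the serialized copy once decoded
        return self._map
    def __getitem__(self, key):
        return self._force()[key]
    def __len__(self):
        return len(self._force())
```

If a query never touches the map column, `_force` never runs and the sedes cost for that column is avoided entirely.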

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation

2010-06-28 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair reassigned PIG-1473:
--

Assignee: Thejas M Nair

 Avoid serialization/deserialization costs for PigStorage data - Use custom 
 Map and Bag implementation
 -

 Key: PIG-1473
 URL: https://issues.apache.org/jira/browse/PIG-1473
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-06-28 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883361#action_12883361
 ] 

Daniel Dai commented on PIG-1295:
-

Thanks, is the patch ready for review?

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, 
 PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Avoiding serialization/de-serialization in pig

2010-06-28 Thread Thejas Nair
I have created a wiki page that puts together some ideas that can help improve
performance by avoiding/delaying serialization/de-serialization.

http://wiki.apache.org/pig/AvoidingSedes

These are ideas that don't involve changes to optimizer. Most of them
involve changes in the load/store functions.

Your feedback is welcome.

Thanks,
Thejas



[jira] Created: (PIG-1474) Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple

2010-06-28 Thread Thejas M Nair (JIRA)
Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple


 Key: PIG-1474
 URL: https://issues.apache.org/jira/browse/PIG-1474
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


Avoid sedes when possible for data loaded using PigStorage by implementing 
approach #4 proposed in http://wiki.apache.org/pig/AvoidingSedes .

The write() and readFields() functions of the tuple returned by the 
TupleFactory are used to serialize data between Map and Reduce. By using a 
tuple that knows the serialization format of the loader, we avoid sedes at the 
Map/Reduce boundary and use the load function's serialized format between Map 
and Reduce.
To use a new custom tuple for this purpose, a custom TupleFactory that returns 
tuples of this type has to be specified using the property 
pig.data.tuple.factory.name .
This approach will work only if all load functions in the query share the same 
serialization format for maps and bags. If this approach proves to be very 
useful, it will build a case for a more extensible approach.
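A sketch of the pass-through idea, as a Python stand-in for the Java Tuple's write()/readFields(); the 4-byte length framing here is an assumption for the example, not Pig's actual intermediate format:

```python
import io

class PassThroughTuple:
    """Keeps the loader's serialized bytes and copies them verbatim at
    the Map/Reduce boundary instead of re-encoding field by field."""
    def __init__(self, raw: bytes):
        self.raw = raw
    def write(self, out) -> None:
        # frame the payload with a 4-byte length, then copy it as-is
        out.write(len(self.raw).to_bytes(4, "big"))
        out.write(self.raw)
    @classmethod
    def read_fields(cls, inp) -> "PassThroughTuple":
        n = int.from_bytes(inp.read(4), "big")
        return cls(inp.read(n))
```

The serialization cost at the boundary becomes a plain byte copy; the trade-off, as the description notes, is that every loader in the query must agree on the payload format.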


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-06-28 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1295:


   Status: Patch Available  (was: Open)
Fix Version/s: 0.8.0

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, 
 PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-28 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1389:
--

Attachment: PIG-1389_1.patch

 Implement Pig counter to track number of rows for each input files 
 ---

 Key: PIG-1389
 URL: https://issues.apache.org/jira/browse/PIG-1389
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch


 An MR job generated by Pig can have not only multiple outputs (in the case of 
 multiquery) but also multiple inputs (in the case of join or cogroup). In both 
 cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
 REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
 given input or output. PIG-1299 addressed the case of multiple outputs. We 
 need to add new counters for jobs with multiple inputs.
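In spirit, the per-input record counters proposed here amount to bookkeeping like the sketch below. In a real Pig/Hadoop job this would go through Hadoop's counter API (e.g. context.getCounter(group, inputName).increment(1)); the class and method names below are hypothetical stand-ins, not the actual patch.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of per-input record counters. In a real Pig/Hadoop job
// this bookkeeping would go through Hadoop's counter API, e.g.
// context.getCounter("MultiInputCounters", inputName).increment(1);
// here a plain map stands in for the counter framework. All names are
// hypothetical.
class PerInputRecordCounter {
    private final Map<String, Long> counters = new HashMap<>();

    // Called once per record read from the named input.
    public void recordRead(String inputName) {
        counters.merge(inputName, 1L, Long::sum);
    }

    // Final per-input count; 0 for inputs that produced no records.
    public long count(String inputName) {
        return counters.getOrDefault(inputName, 0L);
    }
}
```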




[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-06-28 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883367#action_12883367
 ] 

Gianmarco De Francisci Morales commented on PIG-1295:
-

I think it is

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch, 
 PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, PIG-1295_0.6.patch


 When the Hadoop framework does the sorting, it will try to use the binary 
 version of the comparator if one is available. The benefit of a binary 
 comparator is that we do not need to instantiate the objects before we compare 
 them. We saw a ~30% speedup after switching to binary comparators. Currently, 
 Pig uses a binary comparator in the following cases:
 1. When the semantics of the order do not matter. For example, in distinct, we 
 need to do a sort in order to filter out duplicate values; however, we do not 
 care how the comparator sorts keys. Group-by also shares this characteristic. 
 In this case, we rely on Hadoop's default binary comparator.
 2. When the semantics of the order matter, but the key is of a simple type. In 
 this case, we have implementations for simple types such as integer, long, 
 float, chararray, databytearray, and string.
 However, if the key is a tuple and the sort semantics matter, we do not have a 
 binary comparator implementation. This especially matters when we switch to 
 use secondary sort. In secondary sort, we convert the inner sort of a nested 
 foreach into the secondary key and rely on Hadoop to sort on both the main key 
 and the secondary key. The sorting key becomes a two-item tuple. Since the 
 secondary key is the sorting key of the nested foreach, the sorting semantics 
 matter. It turns out we do not have a binary comparator once we use secondary 
 sort, and we see a significant slowdown.
 A binary comparator for tuples should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on the most common use case 
 first, which is a group-by followed by a nested sort. In this case, we will 
 use secondary sort. The semantics of the first key do not matter, but the 
 semantics of the secondary key do. We need to identify the boundary between 
 the main key and the secondary key in the binary tuple buffer without 
 instantiating the tuple itself. Then, if the first keys are equal, we use a 
 binary comparator to compare the secondary keys. The secondary key can also be 
 of a complex data type, but for the first step we focus on simple secondary 
 keys, which is the most common use case.
 We mark this issue as a candidate project for the Google Summer of Code 2010 
 program. 
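As a rough illustration of the comparison described above, the sketch below assumes a toy serialization: a 4-byte big-endian int main key followed by the secondary key's raw bytes. This is not Pig's actual tuple format; it only demonstrates that both keys can be compared without instantiating a Tuple.

```java
import java.util.Arrays;

// Illustrative raw comparator for a (mainKey, secondaryKey) sort key.
// Assumed toy layout: 4-byte big-endian int main key, then the secondary
// key's raw bytes. NOT Pig's real tuple serialization.
class RawTupleComparatorSketch {

    // Read a big-endian int from b starting at off.
    static int readInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    // Compare two serialized sort keys: main key first; only if the main
    // keys are equal, fall through to the secondary key bytes.
    static int compare(byte[] t1, byte[] t2) {
        int k1 = readInt(t1, 0);
        int k2 = readInt(t2, 0);
        if (k1 != k2) {
            return k1 < k2 ? -1 : 1;
        }
        // Lexicographic comparison of the remaining bytes, in the spirit of
        // Hadoop's default WritableComparator byte comparison.
        return Arrays.compare(t1, 4, t1.length, t2, 4, t2.length);
    }
}
```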




[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-28 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1389:
--

Status: Open  (was: Patch Available)

 Implement Pig counter to track number of rows for each input files 
 ---

 Key: PIG-1389
 URL: https://issues.apache.org/jira/browse/PIG-1389
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1389.patch, PIG-1389.patch


 An MR job generated by Pig can have not only multiple outputs (in the case of 
 multiquery) but also multiple inputs (in the case of join or cogroup). In both 
 cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
 REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
 given input or output. PIG-1299 addressed the case of multiple outputs. We 
 need to add new counters for jobs with multiple inputs.




[jira] Updated: (PIG-1389) Implement Pig counter to track number of rows for each input files

2010-06-28 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1389:
--

Status: Patch Available  (was: Open)

 Implement Pig counter to track number of rows for each input files 
 ---

 Key: PIG-1389
 URL: https://issues.apache.org/jira/browse/PIG-1389
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch


 An MR job generated by Pig can have not only multiple outputs (in the case of 
 multiquery) but also multiple inputs (in the case of join or cogroup). In both 
 cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS, 
 REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a 
 given input or output. PIG-1299 addressed the case of multiple outputs. We 
 need to add new counters for jobs with multiple inputs.




Re: Avoiding serialization/de-serialization in pig

2010-06-28 Thread Dmitriy Ryaboy
For what it's worth, I saw very significant speed improvements (an order of
magnitude for wide tables with few projected columns) when I implemented (2)
for our protocol buffer-based loaders.

I have a feeling that propagating schemas when known, and using them for
(de)serialization instead of reflecting on every field, would also be a big
win.

Thoughts on just using Avro for the internal PigStorage?

-D

On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair te...@yahoo-inc.com wrote:

 I have created a wiki which puts together some ideas that can help in
 improving performance by avoiding/delaying serialization/de-serialization .

 http://wiki.apache.org/pig/AvoidingSedes

 These are ideas that don't involve changes to optimizer. Most of them
 involve changes in the load/store functions.

 Your feedback is welcome.

 Thanks,
 Thejas




Re: Avoiding serialization/de-serialization in pig

2010-06-28 Thread Russell Jurney
I don't fully understand the repercussions of this, but I like it.  We're
moving from our VoldemortStorage stuff to Avro and it would be great to pipe
Avro all the way through.

Russ

On Mon, Jun 28, 2010 at 5:51 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 For what it's worth, I saw very significant speed improvements (an order of
 magnitude for wide tables with few projected columns) when I implemented (2)
 for our protocol buffer-based loaders.

 I have a feeling that propagating schemas when known, and using them for
 (de)serialization instead of reflecting on every field, would also be a big
 win.

 Thoughts on just using Avro for the internal PigStorage?

 -D

 On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair te...@yahoo-inc.com wrote:

  I have created a wiki which puts together some ideas that can help in
  improving performance by avoiding/delaying serialization/de-serialization
 .
 
  http://wiki.apache.org/pig/AvoidingSedes
 
  These are ideas that don't involve changes to optimizer. Most of them
  involve changes in the load/store functions.
 
  Your feedback is welcome.
 
  Thanks,
  Thejas
 
 



[jira] Updated: (PIG-1350) [Zebra] Zebra column names cannot have leading _

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1350:


Fix Version/s: (was: 0.8.0)

 [Zebra] Zebra column names cannot have leading _
 --

 Key: PIG-1350
 URL: https://issues.apache.org/jira/browse/PIG-1350
 Project: Pig
  Issue Type: Improvement
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Attachments: pig-1350.patch, pig-1350.patch


 Disallowing '_' as a leading character in column names in Zebra schemas is too 
 restrictive; this restriction should be lifted.




[jira] Updated: (PIG-1120) [zebra] should support using org.apache.hadoop.zebra.pig.TableStorer() if user does not want to specify storage hint

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1120:


Fix Version/s: (was: 0.8.0)

 [zebra] should support  using org.apache.hadoop.zebra.pig.TableStorer() if 
 user does not want to specify storage hint
 -

 Key: PIG-1120
 URL: https://issues.apache.org/jira/browse/PIG-1120
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang

 If the user doesn't want to specify a storage hint, the current Zebra 
 implementation only supports using org.apache.hadoop.zebra.pig.TableStorer('') 
 (note the empty string in TableStorer('')).
 We should also support the format using 
 org.apache.hadoop.zebra.pig.TableStorer(), as we do for using 
 org.apache.hadoop.zebra.pig.TableLoader().
 sample pig script:
 register /grid/0/dev/hadoopqa/jars/zebra.jar;
 a = load '1.txt' as (a:int, 
 b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
 b = load '2.txt' as (a:int, 
 b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
 c = join a by a, b by a;
 d = foreach c generate a::a, a::b, b::c;
 describe d;
 dump d;
 store d into 'join3' using org.apache.hadoop.zebra.pig.TableStorer('');
 --this will fail
 --store d into 'join3' using org.apache.hadoop.zebra.pig.TableStorer( );




[jira] Updated: (PIG-1137) [zebra] get* methods of Zebra Map/Reduce APIs need improvements

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1137:


Fix Version/s: (was: 0.8.0)

 [zebra] get* methods of Zebra Map/Reduce APIs need improvements
 ---

 Key: PIG-1137
 URL: https://issues.apache.org/jira/browse/PIG-1137
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Yan Zhou
Assignee: Yan Zhou

 Currently the set* methods take external Zebra objects, namely objects of 
 ZebraStorageHint, ZebraSchema, ZebraSortInfo, or ZebraProjection. 
 Correspondingly, the get* methods should return such objects instead of String 
 or Zebra-internal objects like Schema.




[jira] Updated: (PIG-1355) [Zebra] Zebra Multiple Outputs should enable application to skip records

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1355:


Fix Version/s: (was: 0.8.0)
  Description: 
Applications may not always want to write a record to a table. Zebra should 
allow applications to skip such records.

The Zebra Multiple Outputs interface allows users to stream data to different 
tables by inspecting the data Tuple. 

https://issues.apache.org/jira/browse/PIG-

So, if ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that 
record and thus will not write it to any table.

However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) 
will write every record to a table.


 [Zebra]  Zebra Multiple Outputs should enable application to skip records
 -

 Key: PIG-1355
 URL: https://issues.apache.org/jira/browse/PIG-1355
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Gaurav Jain
Assignee: Gaurav Jain
Priority: Minor

 Applications may not always want to write a record to a table. Zebra should 
 allow applications to skip such records.
 The Zebra Multiple Outputs interface allows users to stream data to different 
 tables by inspecting the data Tuple. 
 https://issues.apache.org/jira/browse/PIG-
 So, if ZebraOutputPartition returns -1, Zebra Multiple Outputs will skip that 
 record and thus will not write it to any table.
 However, Zebra BasicTableOutputFormat (different from Zebra Multiple Outputs) 
 will write every record to a table.




[jira] Updated: (PIG-1411) [Zebra] Can Zebra use HAR to reduce file/block count for namenode

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1411:


Fix Version/s: (was: 0.8.0)
  Description: 
Due to the column group structure, Zebra can create extra files for the 
namenode to track. That means the namenode takes more memory for Zebra-related 
files.

The goal is to reduce the number of files/blocks.

The idea, among various options, is to use HAR (Hadoop Archive). Hadoop 
Archive reduces the block and file count by copying data from small files 
(1M, 2M, ...) into an HDFS block of larger size, thus reducing the total 
number of blocks and files.


 [Zebra] Can Zebra use HAR to reduce file/block count for namenode
 -

 Key: PIG-1411
 URL: https://issues.apache.org/jira/browse/PIG-1411
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.8.0
Reporter: Gaurav Jain
Assignee: Gaurav Jain
Priority: Minor

 Due to the column group structure, Zebra can create extra files for the 
 namenode to track. That means the namenode takes more memory for Zebra-related 
 files.
 The goal is to reduce the number of files/blocks.
 The idea, among various options, is to use HAR (Hadoop Archive). Hadoop 
 Archive reduces the block and file count by copying data from small files 
 (1M, 2M, ...) into an HDFS block of larger size, thus reducing the total 
 number of blocks and files.




[jira] Updated: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc

2010-06-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1337:


Fix Version/s: (was: 0.8.0)

 Need a way to pass distributed cache configuration information to hadoop 
 backend in Pig's LoadFunc
 --

 Key: PIG-1337
 URL: https://issues.apache.org/jira/browse/PIG-1337
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Chao Wang

 The Zebra storage layer needs to use the distributed cache to reduce namenode 
 load during job runs.
 To do this, Zebra needs to set up distributed-cache-related configuration 
 information in TableLoader (which extends Pig's LoadFunc).
 It is doing this within getSchema(conf). The problem is that the conf object 
 here is not the one that is serialized to the map/reduce backend. As such, the 
 distributed cache is not set up properly.
 To work around this problem, Pig needs to provide a way in its LoadFunc to set 
 up distributed cache information in a conf object that is the one used by the 
 map/reduce backend.




[jira] Commented: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation

2010-06-28 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883382#action_12883382
 ] 

Jeff Zhang commented on PIG-1473:
-

This sounds like the lazy deserialization in Hive. Great!

 Avoid serialization/deserialization costs for PigStorage data - Use custom 
 Map and Bag implementation
 -

 Key: PIG-1473
 URL: https://issues.apache.org/jira/browse/PIG-1473
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


 Cost of serialization/deserialization (sedes) can be very high, and avoiding 
 it will improve performance.
 Avoid sedes when possible by implementing approach #3 proposed in 
 http://wiki.apache.org/pig/AvoidingSedes .
 The load function uses subclasses of Map and DataBag which hold the serialized 
 copy. The load function delays deserialization of map and bag types until a 
 member function of java.util.Map or DataBag is called. 
 Example of a query where this will help -
 {CODE}
 l = LOAD 'file1' AS (a : int, b : map [ ]);
 f = FOREACH l GENERATE udf1(a), b;
 fil = FILTER f BY $0 > 5;
 dump fil; -- Deserialization of column b can be delayed until here using this 
 approach.
 {CODE}
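A minimal sketch of such a lazy map follows. The "key#value,key#value" string format below is a made-up stand-in for PigStorage's real map encoding: the raw field is kept as loaded and parsed only the first time a java.util.Map member is actually invoked.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of approach #3: a Map that keeps the raw serialized field and
// deserializes it lazily. The "k#v,k#v" encoding is hypothetical.
class LazyMapSketch extends AbstractMap<String, String> {
    private final String serialized;    // raw field, as loaded
    private Map<String, String> parsed; // filled in on first access

    LazyMapSketch(String serialized) {
        this.serialized = serialized;
    }

    // Deserialize only when a Map member is actually invoked.
    private Map<String, String> materialize() {
        if (parsed == null) {
            parsed = new HashMap<>();
            for (String kv : serialized.split(",")) {
                String[] parts = kv.split("#", 2);
                parsed.put(parts[0], parts[1]);
            }
        }
        return parsed;
    }

    // AbstractMap derives get(), size(), containsKey(), etc. from this.
    @Override
    public Set<Entry<String, String>> entrySet() {
        return materialize().entrySet();
    }
}
```

A tuple field of this type passes through FOREACH and FILTER untouched; the parsing cost is paid only if some operator or UDF actually looks inside the map.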
