[jira] Created: (PIG-1442) java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)
java.lang.OutOfMemoryError: Java heap space (Reopen of PIG-766)
---
Key: PIG-1442
URL: https://issues.apache.org/jira/browse/PIG-1442
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.7.0, 0.2.0
Environment: Apache Hadoop 0.20.2 + Pig 0.7.0, and also 0.8.0-dev (18/May); Hadoop 0.18.3 (Cloudera RPMs) + Pig 0.2.0
Reporter: Dirk Schmid

As mentioned by Ashutosh, this is a reopen of https://issues.apache.org/jira/browse/PIG-766, because there is still a problem that makes Pig scale only with the available memory. For convenience, here is the last entry of the PIG-766 jira ticket:
{quote}1. Are you getting the exact same stack trace as mentioned in the jira?{quote}
Yes, the same and some similar traces:
{noformat}
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
    at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:279)
    at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
    at org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:249)
    at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:214)
    at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
    at org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:209)
    at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:264)
    at org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:123)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:179)
    at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:880)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1201)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:199)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2563)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2501)

java.lang.OutOfMemoryError: Java heap space
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:58)
    at org.apache.pig.data.DefaultTupleFactory.newTuple(DefaultTupleFactory.java:35)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:61)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DefaultAbstractBag.readFields(DefaultAbstractBag.java:263)
    at org.apache.pig.data.DataReaderWriter.bytesToBag(DataReaderWriter.java:71)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:145)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DataReaderWriter.bytesToTuple(DataReaderWriter.java:63)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:142)
    at org.apache.pig.data.DataReaderWriter.readDatum(DataReaderWriter.java:136)
    at org.apache.pig.data.DefaultTuple.readFields(DefaultTuple.java:284)
    at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
    at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
    at
{noformat}
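Both traces die while serializing or deserializing large bags in one piece. The usual mitigation is to bound what a bag holds on the heap and spill the rest to disk. The following is a toy illustration of that idea only, not Pig's actual DefaultAbstractBag spill code; the class name, string "tuples", and threshold are all made up:

```java
import java.io.*;
import java.util.*;

// Toy "spillable bag": keeps elements in memory up to a threshold, then
// spills them to a temp file, so the bag's footprint is bounded by disk
// rather than heap.
public class SpillableBag {
    private final int memLimit;                       // max in-memory elements
    private final List<String> memory = new ArrayList<>();
    private File spillFile;                           // created lazily on first spill
    private long spilled = 0;

    public SpillableBag(int memLimit) { this.memLimit = memLimit; }

    public void add(String tuple) throws IOException {
        memory.add(tuple);
        if (memory.size() >= memLimit) spill();
    }

    private void spill() throws IOException {
        if (spillFile == null) {
            spillFile = File.createTempFile("bag", ".spill");
            spillFile.deleteOnExit();
        }
        try (PrintWriter w = new PrintWriter(new FileWriter(spillFile, true))) {
            for (String t : memory) { w.println(t); spilled++; }
        }
        memory.clear();                               // free the heap
    }

    public long size() { return spilled + memory.size(); }

    // Streams all elements back: spilled ones first, then in-memory ones.
    public List<String> contents() throws IOException {
        List<String> out = new ArrayList<>();
        if (spillFile != null) {
            try (BufferedReader r = new BufferedReader(new FileReader(spillFile))) {
                String line;
                while ((line = r.readLine()) != null) out.add(line);
            }
        }
        out.addAll(memory);
        return out;
    }
}
```

The point of the sketch is only that `write`/`readFields` style code which materializes a whole bag at once cannot be rescued by spilling later; the bound has to be enforced where elements are accumulated.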
[jira] Commented: (PIG-1429) Add Boolean Data Type to Pig
[ https://issues.apache.org/jira/browse/PIG-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876326#action_12876326 ]

Alan Gates commented on PIG-1429:
---------------------------------

Is this patch ready for review, or does it need more work?

Add Boolean Data Type to Pig
---
Key: PIG-1429
URL: https://issues.apache.org/jira/browse/PIG-1429
Project: Pig
Issue Type: New Feature
Components: data
Affects Versions: 0.7.0
Reporter: Russell Jurney
Assignee: Russell Jurney
Fix For: 0.8.0
Attachments: working_boolean.patch
Original Estimate: 8h
Remaining Estimate: 8h

Pig needs a Boolean data type. PIG-1097 depends on this. I volunteer. Is there anything beyond the work in src/org/apache/pig/data/ plus unit tests to make this work?

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
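For context, the work in src/org/apache/pig/data/ largely amounts to giving the new type a tag in the tuple serialization format and handling it on both the write and read paths. The sketch below shows tagged round-tripping of a boolean in the style of DataReaderWriter.writeDatum/readDatum; the tag constants and class are illustrative assumptions, not Pig's actual DataType values:

```java
import java.io.*;

// Sketch of tagged serialization with a BOOLEAN type added alongside an
// existing type. Each datum is written as a one-byte type tag followed
// by its payload, so readDatum can dispatch without external schema.
public class BooleanDatum {
    static final byte INTEGER = 1, BOOLEAN = 2;   // hypothetical tag values

    static void writeDatum(DataOutput out, Object val) throws IOException {
        if (val instanceof Boolean) {
            out.writeByte(BOOLEAN);
            out.writeBoolean((Boolean) val);
        } else if (val instanceof Integer) {
            out.writeByte(INTEGER);
            out.writeInt((Integer) val);
        } else {
            throw new IOException("unsupported type: " + val);
        }
    }

    static Object readDatum(DataInput in) throws IOException {
        byte tag = in.readByte();
        switch (tag) {
            case BOOLEAN: return in.readBoolean();
            case INTEGER: return in.readInt();
            default: throw new IOException("bad tag " + tag);
        }
    }

    // Serialize then deserialize one value, returning the reconstructed copy.
    public static Object roundTrip(Object val) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        writeDatum(new DataOutputStream(bos), val);
        return readDatum(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
    }
}
```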
[jira] Commented: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876354#action_12876354 ]

Daniel Dai commented on PIG-1295:
---------------------------------

I briefly reviewed the patch; it looks good. This is the approach we expected. Can we do some initial performance tests first?

Binary comparator for secondary sort
---
Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Attachments: PIG-1295_0.1.patch

When the hadoop framework does the sorting, it tries to use a binary version of the comparator if one is available. The benefit of a binary comparator is that we do not need to instantiate the objects before comparing them. We saw a ~30% speedup after switching to binary comparators. Currently, Pig uses a binary comparator in the following cases:
1. When the semantics of the order do not matter. For example, in distinct we need to sort in order to filter out duplicate values, but we do not care how the comparator orders the keys. Group-by shares this characteristic. In these cases, we rely on hadoop's default binary comparator.
2. When the semantics of the order matter, but the key is of a simple type. Here we have implementations for simple types such as integer, long, float, chararray, databytearray, and string.
However, if the key is a tuple and the sort semantics matter, we do not have a binary comparator implementation. This especially matters when we switch to secondary sort. In secondary sort, we convert the inner sort of a nested foreach into the secondary key and rely on hadoop to sort on both the main key and the secondary key, so the sort key becomes a two-item tuple. Since the secondary key is the sort key of the nested foreach, its sort semantics matter. As a result, we have no binary comparator once we use secondary sort, and we see a significant slowdown.
A binary comparator for tuples should be doable once we understand the binary structure of a serialized tuple. We can focus on the most common use case first: a group-by followed by a nested sort, which uses secondary sort. The semantics of the first key do not matter, but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself; then, if the first keys are equal, we use a binary comparator on the secondary key. The secondary key can also be a complex data type, but for the first step we focus on simple secondary keys, which are the most common case. We mark this issue as a candidate project for the Google Summer of Code 2010 program.
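The core trick the issue describes, ordering serialized keys without instantiating them, can be sketched in a few lines of plain Java. This is an assumption-laden illustration only: the keys here are single big-endian non-negative ints, whereas Pig's real comparator must also walk type tags, signs, nulls, and the main/secondary key boundary:

```java
import java.io.*;

// Raw ("binary") comparison: compare two serialized keys byte by byte
// without deserializing. For keys written as big-endian non-negative
// ints, unsigned lexicographic byte order coincides with numeric order.
public class RawCompare {
    static byte[] serialize(int key) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeInt(key);   // DataOutput is big-endian
        return bos.toByteArray();
    }

    // Unsigned lexicographic comparison, the core of any raw comparator.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;                // shorter prefix sorts first
    }
}
```

The speedup comes entirely from never calling `readFields`: no tuple objects are allocated on the compare path.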
[jira] Assigned: (PIG-490) Combiner not used when group elements referred to in tuple notation instead of flatten.
[ https://issues.apache.org/jira/browse/PIG-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-490:
----------------------------------
Assignee: Thejas M Nair

Combiner not used when group elements referred to in tuple notation instead of flatten.
---
Key: PIG-490
URL: https://issues.apache.org/jira/browse/PIG-490
Project: Pig
Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Alan Gates
Assignee: Thejas M Nair
Fix For: 0.8.0

Given a query like:
{code}
A = load 'myfile';
B = group A by ($0, $1);
C = foreach B generate group.$0, group.$1, COUNT(A);
{code}
the combiner will not be invoked. But if the last line is changed to:
{code}
C = foreach B generate flatten(group), COUNT(A);
{code}
it will be. The reason for the discrepancy is that the CombinerOptimizer checks that all of the projections are simple; if they are not, it does not use the combiner. group.$0 is not a simple projection, so the check fails. However, this case is common enough that the CombinerOptimizer should detect it and still use the combiner.
[jira] Assigned: (PIG-1435) make sure dependent jobs fail when a job in multiquery fails
[ https://issues.apache.org/jira/browse/PIG-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1435:
-----------------------------------
Assignee: Richard Ding

make sure dependent jobs fail when a job in multiquery fails
---
Key: PIG-1435
URL: https://issues.apache.org/jira/browse/PIG-1435
Project: Pig
Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Richard Ding
Fix For: 0.8.0

Currently, if one of the multiquery jobs fails, Pig tries to run all remaining jobs. As a result, if data was partially generated by the failed job, you might get incorrect results from the dependent jobs.
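A minimal sketch of the intended behavior: walk the job graph in dependency order and mark every job downstream of a failure as skipped, instead of running it against partial output. The class, status strings, and scheduling model below are hypothetical, not Pig's actual JobControl logic:

```java
import java.util.*;

// Toy multi-query scheduler: when a job fails, all transitively
// dependent jobs are marked SKIPPED rather than run on partial data.
public class JobDag {
    // job name -> names of the jobs it depends on; insertion order is
    // assumed to be a valid topological order (a simplification).
    private final Map<String, List<String>> deps = new LinkedHashMap<>();

    public void addJob(String name, String... dependsOn) {
        deps.put(name, Arrays.asList(dependsOn));
    }

    // Returns each job's final status, given the set of jobs that would
    // fail if actually run.
    public Map<String, String> run(Set<String> failing) {
        Map<String, String> status = new LinkedHashMap<>();
        for (String job : deps.keySet()) {
            boolean parentFailed = false;
            for (String d : deps.get(job)) {
                if (!"OK".equals(status.get(d))) parentFailed = true;
            }
            if (parentFailed) {
                status.put(job, "SKIPPED");          // never run on partial input
            } else {
                status.put(job, failing.contains(job) ? "FAILED" : "OK");
            }
        }
        return status;
    }
}
```

Independent branches of the graph still run to completion; only the failed job's descendants are cut off.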
[jira] Assigned: (PIG-1436) Print number of records outputted at each step of a Pig script
[ https://issues.apache.org/jira/browse/PIG-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1436:
-----------------------------------
Assignee: Richard Ding

I think Richard is already doing this as part of his stats work.

Print number of records outputted at each step of a Pig script
---
Key: PIG-1436
URL: https://issues.apache.org/jira/browse/PIG-1436
Project: Pig
Issue Type: New Feature
Components: grunt
Affects Versions: 0.7.0
Reporter: Russell Jurney
Assignee: Richard Ding
Priority: Minor
Fix For: 0.8.0

I often run a script multiple times, or have to go look through Hadoop task logs, to figure out where I broke a long script in such a way that I get 0 records out of it. I think this is a common problem. If someone can point me in the right direction, I can make a pass at this.
[jira] Assigned: (PIG-1434) Allow casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1434:
-----------------------------------
Assignee: Aniket Mokashi

Allow casting relations to scalars
---
Key: PIG-1434
URL: https://issues.apache.org/jira/browse/PIG-1434
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
Fix For: 0.8.0

This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801. The proposal is to allow casting relations to scalar types in foreach. Example:
A = load 'data' as (x, y, z);
B = group A all;
C = foreach B generate COUNT(A);
...
X = ...
Y = foreach X generate $1/(long) C;
A couple of additional comments:
(1) You can only cast relations containing a single value, or an error will be reported.
(2) Name resolution is needed, since relation X might have a field named C, in which case that field takes precedence.
(3) Y will look for the C closest to it.
Implementation thoughts: the idea is to store C into a file and then convert it into a scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to (1) store C and (2) convert the cast to the UDF.
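The store-then-read-as-scalar step in the implementation thoughts can be sketched as follows. The helper below merely stands in for the UDF mentioned in the jira; the one-value-per-line file layout and the method name are assumptions for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch of the "relation to scalar" mechanism: the single-row relation
// C has been stored to a file, and this helper reads it back as a
// scalar for use in later expressions.
public class ScalarReader {
    // Errors out unless the stored relation holds exactly one row, per
    // comment (1) in the proposal.
    public static long readLongScalar(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file);
        if (lines.size() != 1) {
            throw new IOException(
                "scalar cast requires exactly one row, got " + lines.size());
        }
        return Long.parseLong(lines.get(0).trim());
    }
}
```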
[jira] Assigned: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-928:
----------------------------------
Assignee: Aniket Mokashi

UDFs in scripting languages
---
Key: PIG-928
URL: https://issues.apache.org/jira/browse/PIG-928
Project: Pig
Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
Fix For: 0.8.0
Attachments: calltrace.png, package.zip, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, scripting.tgz, scripting.tgz, test.zip

It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java.
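On the Java side, the minimum such a feature needs is a way to treat a UDF as an opaque callable, so that its body can come from any script engine (Jython, JRuby, etc.) rather than from a compiled Java class. The registry below is a hypothetical conceptual sketch, not Pig's actual design from any of the attached patches:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical UDF registry: maps a UDF name to a callable. A script
// engine would register its compiled functions here, and the executor
// invokes them by name without knowing which language they came from.
public class UdfRegistry {
    private final Map<String, Function<Object[], Object>> udfs = new HashMap<>();

    public void register(String name, Function<Object[], Object> fn) {
        udfs.put(name, fn);
    }

    public Object invoke(String name, Object... args) {
        Function<Object[], Object> fn = udfs.get(name);
        if (fn == null) throw new IllegalArgumentException("no UDF: " + name);
        return fn.apply(args);
    }
}
```

The design point is indirection: once invocation goes through a uniform callable interface, adding a new scripting language only means writing an adapter that registers functions, with no change to the executor.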
[jira] Assigned: (PIG-1405) Need to move many standard functions from piggybank into Pig
[ https://issues.apache.org/jira/browse/PIG-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich reassigned PIG-1405:
-----------------------------------
Assignee: Aniket Mokashi

Need to move many standard functions from piggybank into Pig
---
Key: PIG-1405
URL: https://issues.apache.org/jira/browse/PIG-1405
Project: Pig
Issue Type: Improvement
Reporter: Alan Gates
Assignee: Aniket Mokashi
Fix For: 0.8.0

There are currently a number of functions in Piggybank that represent features commonly supported by languages and database engines. We need to decide which of these Pig should support as built-in functions and put them in org.apache.pig.builtin. This will also mean adding unit tests and javadocs for some UDFs. The existing classes will be left in Piggybank for some time for backward compatibility.
[jira] Updated: (PIG-1295) Binary comparator for secondary sort
[ https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gianmarco De Francisci Morales updated PIG-1295:
------------------------------------------------
Attachment: PIG-1295_0.2.patch

I added some simple performance tests. The tests generate 1 million tuples by modifying a prototypical tuple and compare them to the prototype. One test uses the new comparator and the other uses the old one. I generate exactly the same tuples using a fixed seed, and I also check the correctness of the comparisons against the normal compareTo() method of the tuples. The logic to generate the tuples is a bit involved: I tried to exercise all the datatype comparisons in a uniform manner, so I mutate the first elements of the tuple less often, to make it more likely that the comparison proceeds further down the tuple. The probabilities are totally made up and do not make much sense. As a first approximation, I see a slight overall speedup in the test. I will do some profiling to see what margins of improvement we have.

Binary comparator for secondary sort
---
Key: PIG-1295
URL: https://issues.apache.org/jira/browse/PIG-1295
Project: Pig
Issue Type: Improvement
Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
Attachments: PIG-1295_0.1.patch, PIG-1295_0.2.patch
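The correctness check Gianmarco describes, fixed-seed data with the raw comparison verified against the object-level comparison, can be sketched like this. It uses single int keys instead of mutated prototype tuples, so it shows only the shape of the real test, not its coverage:

```java
import java.io.*;
import java.util.Random;

// Sketch of a comparator-agreement harness: generate key pairs from a
// fixed seed, then verify that comparing serialized bytes agrees in
// sign with comparing the deserialized values.
public class CompareHarness {
    static byte[] ser(int v) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeInt(v);     // big-endian, non-negative keys
        return bos.toByteArray();
    }

    static int rawCompare(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return 0;                                  // equal-length keys here
    }

    // True when every sampled pair orders identically both ways.
    public static boolean agree(long seed, int trials) throws IOException {
        Random rnd = new Random(seed);             // fixed seed: reproducible data
        for (int i = 0; i < trials; i++) {
            int x = rnd.nextInt(1 << 20), y = rnd.nextInt(1 << 20);
            int raw = Integer.signum(rawCompare(ser(x), ser(y)));
            int obj = Integer.signum(Integer.compare(x, y));
            if (raw != obj) return false;
        }
        return true;
    }
}
```

Timing each side of the same loop would give the performance half of the test; the fixed seed ensures both comparators see identical data.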
distributed cache in pig
Hi all,
I notice that whether Pig uses the distributed cache depends on the context (local or mapreduce). When running in mapreduce mode, the distributed cache is always enabled (e.g. for replicated join). However, I can't find the method DistributedCache.getLocalCacheFiles(job), which retrieves the cached files from the local disk, being called anywhere. So how does Pig read these files from the local disk? I am looking at the pig 0.7 source code.
Thanks,
-Gang
[jira] Updated: (PIG-1441) New test targets: unit and smoke
[ https://issues.apache.org/jira/browse/PIG-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-1441:
--------------------------------
Attachment: PIG-1441.patch

New test targets: unit and smoke
---
Key: PIG-1441
URL: https://issues.apache.org/jira/browse/PIG-1441
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Olga Natkovich
Fix For: 0.8.0
Attachments: PIG-1441.patch

As we get more and more tests, adding more structure would help us minimize the time spent on testing. Here are 2 new targets I propose we add (Hadoop has the same targets for the same purposes):
unit - runs all true unit tests (those that truly test APIs and internal functionality, rather than running e2e tests through junit). This target should run relatively quickly, 10-15 minutes, and if we are good at adding unit tests it will give good coverage.
smoke - a set of a few e2e tests that provide good overall coverage within about 30 minutes.
I would say that for simple patches we would still require only the commit tests, while for more involved patches the developers should run both unit and smoke before submitting the patch.
[jira] Commented: (PIG-1441) New test targets: unit and smoke
[ https://issues.apache.org/jira/browse/PIG-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876537#action_12876537 ]

Olga Natkovich commented on PIG-1441:
-------------------------------------

I uploaded the patch. The unit test target runs in about 35 minutes and executes all non-e2e tests. The smoke test target runs for 15 minutes, so we can add more. So far, I have:
- a simple script (load + store)
- a fairly complex script
- multiquery (in local mode, because the MR mode one takes 45 minutes; this also tests local mode, so I think it is a good combination)
- streaming
Please suggest whether we should add any other e2e tests. Also, please review the patch. It does not need to go through patch test, since no code is modified.

New test targets: unit and smoke
---
Key: PIG-1441
URL: https://issues.apache.org/jira/browse/PIG-1441
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Olga Natkovich
Fix For: 0.8.0
Attachments: PIG-1441.patch
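For reference, a target of this kind would look roughly as follows in Ant. The target name comes from the jira, but the `depends` value, property names, and exclude patterns are assumptions; Pig's real build.xml defines its own:

```xml
<!-- Hypothetical sketch of the proposed "unit" target. -->
<target name="unit" depends="compile-test"
        description="run true unit tests only (no e2e suites)">
  <junit printsummary="yes" haltonfailure="no" fork="yes">
    <classpath refid="test.classpath"/>
    <batchtest todir="${test.log.dir}">
      <!-- include unit suites, exclude e2e-style tests (patterns assumed) -->
      <fileset dir="${test.src.dir}" includes="**/Test*.java"
               excludes="**/TestMultiQuery*.java"/>
    </batchtest>
  </junit>
</target>
```

A "smoke" target would have the same shape with an explicit include list of the few chosen e2e suites.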
Re: distributed cache in pig
Thanks Olga. But what about running in mapreduce mode? Once the distributed cache is enabled in that mode, there should still be some way to read the cached files. Actually, searching all the source files in pig-0.7, I can't find 'DistributedCache.getLocalCacheFiles' anywhere, so I suppose there is no other way to read the cached files. This is what confuses me. Any other ideas?
-Gang

----- Original Message -----
From: Olga Natkovich ol...@yahoo-inc.com
To: pig-dev@hadoop.apache.org
Sent: Monday, 2010/6/7, 6:50:01 PM
Subject: RE: distributed cache in pig

This is because Hadoop 20 does not support distributed cache in local mode. My understanding is that it would be part of Hadoop 22.
Olga

-----Original Message-----
From: Gang Luo [mailto:lgpub...@yahoo.com.cn]
Sent: Monday, June 07, 2010 3:40 PM
To: pig-dev@hadoop.apache.org
Subject: distributed cache in pig

HI all, I notice that whether pig use distributed cache depends on the context (local or mapreduce). When running in mapreduce mode, the distributed cache is always enable (e.g. replicated join). However, I never find such method, DistributedCache.getLocalCacheFiles(job), which get the cached file from the local disk. So, how does pig read these files from local disk? I am looking at the pig 0.7 source code. Thanks, -Gang
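To make the mechanism under discussion concrete, here is a toy model of cache localization in plain Java: the framework copies a shared file into a task-local directory before the task runs, and the task reads only the local copy. In real Hadoop the analogous calls are DistributedCache.addCacheFile() at job-submission time and DistributedCache.getLocalCacheFiles() (or a symlink in the task's working directory) at task time; the class and paths below are made up:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Toy model of distributed-cache localization.
public class CacheModel {
    // "Framework" side: localize the shared file for this task.
    public static Path localize(Path shared, Path taskDir) throws IOException {
        Files.createDirectories(taskDir);
        Path local = taskDir.resolve(shared.getFileName());
        Files.copy(shared, local, StandardCopyOption.REPLACE_EXISTING);
        return local;
    }

    // "Task" side: read only from the local copy, never the shared path.
    public static String readLocal(Path local) throws IOException {
        return new String(Files.readAllBytes(local));
    }
}
```

The point is that a task never needs the shared location at runtime: by the time it starts, the file already exists somewhere local, and the only question (the one raised in this thread) is which API or convention hands the task that local path.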