gzip if the compression
was turned on by default. Currently, the compression method has to be specified
explicitly and has no default value. This asks the user to weigh the pros
and cons of either compression method.
> need to investigate the impact of compression on pig perf
t is because the default compression is gzip
which is really slow and most of the time not what you want. Because of the
licensing issue with lzo, users need to set it up on their own. Once they have
done so, they can enable the compression.
> need to investigate the impact of compression
n is there any specific reason to default
pig.tmpfilecompression to false? This seems to be a useful feature, so it
should be true by default, no?
> need to investigate the impact of compression on pig performance
>
>
>
ueryterm' as (query_term); C = join B1 by
query_term, B by query_term using 'skewed' parallel 300; D = distinct C
parallel 300; store D into 'output.lzo'; which is launched as follows: java -cp
/grid/0/gs/conf/current:/grid/0/jars/pig.jar
-Djava.library.path=/grid/0/gs/hadoop
rallel 300; store D into 'output.lzo'; which is launched as follows: java -cp
/grid/0/gs/conf/current:/grid/0/jars/pig.jar
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32
-Dpig.tmpfilecompression=true -Dpig.tmpfilecom
Patch committed to trunk. Thanks Yan!
> need to investigate the impact of compression on pig performance
>
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
>
/grid/0/gs/hadoop/current/lib/native/Linux-i386-32
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo
org.apache.pig.Main ./test.pig
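The launch command quoted above can be assembled programmatically. The following is a minimal Python sketch, not part of Pig: the helper name `build_pig_command` and its parameters are hypothetical, and the paths in the usage section are simply the ones quoted in the message. Only the two properties `pig.tmpfilecompression` and `pig.tmpfilecompression.codec` and the `lzo` value come from the thread.

```python
import shlex


def build_pig_command(script, classpath, native_lib_dir, codec="lzo"):
    """Return the argv list for launching Pig with temp-file compression enabled.

    Hypothetical helper: it only mirrors the command line quoted in the thread.
    """
    return [
        "java",
        "-cp", classpath,
        # The native library directory must contain the codec's native libs.
        f"-Djava.library.path={native_lib_dir}",
        # Compression of intermediate files is off unless explicitly enabled.
        "-Dpig.tmpfilecompression=true",
        # The codec has no default, so it is always passed explicitly.
        f"-Dpig.tmpfilecompression.codec={codec}",
        "org.apache.pig.Main",
        script,
    ]


if __name__ == "__main__":
    argv = build_pig_command(
        script="./test.pig",
        classpath="/grid/0/gs/conf/current:/grid/0/jars/pig.jar",
        native_lib_dir="/grid/0/gs/hadoop/current/lib/native/Linux-i386-32",
    )
    # Print a shell-quoted command line instead of executing it.
    print(shlex.join(argv))
```

Building the command as a list keeps each `-D` property a separate argument, which avoids quoting problems if the list is later passed to `subprocess.run`.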
> need to investigate the impact of compression on pig performance
>
>
Thanks for the quick turnaround, Thejas.
Yan
-----Original Message-----
From: Thejas M Nair (JIRA) [mailto:j...@apache.org]
Sent: Wednesday, August 25, 2010 8:54 AM
To: pig-dev@hadoop.apache.org
Subject: [jira] Commented: (PIG-1501) need to investigate the impact of
compression on pig performance
warnings are on two HTML files, SampleOptimizer.html and
org.apache.pig.impl.util.Utils.html.
> need to investigate the impact of compression on pig performance
>
>
> Key: PIG-1501
> URL: h
e vs TFile comparison. It
appears that, for compressed data, TFile performs better than SeqFile.
> need to investigate the impact of compression on pig performance
>
>
> Key: PIG-1501
> URL: https:
I am wondering if the additional
unused features of TFile (index, metadata) result in any overhead compared to
SequenceFile.
> need to investigate the impact of compression on pig performance
>
>
>
[
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yan Zhou updated PIG-1501:
--
Attachment: PIG-1501.patch
> need to investigate the impact of compression on pig performa
its grids) but you cannot
ship lzo with Hadoop or Pig.
> need to investigate the impact of compression on pig performance
>
>
> Key: PIG-1501
> URL: https://issues.apache.org/j
rs that Hadoop installation has it, at least in
my test cluster.
> need to investigate the impact of compression on pig performance
>
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/P
I'm +1 on going with lzo/TFile. As the lzo libs are GPL, we cannot ship with
that as the default. It wasn't clear to me from your last comment which you were
proposing as the default.
> need to investigate the impact of compress
I'll go with LZO
compression on TFile, with the default option being to disable compression, which
is the old behavior.
> need to investigate the impact of compression on pig performance
>
>
>
in line with
the general observation that gzip compresses better but performs worse.
> need to investigate the impact of compression on pig performance
>
>
> Key: PIG-1501
>
test results
as an attachment.
> need to investigate the impact of compression on pig performance
>
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
>
negligible, within
background noise or within a few percent of the overall run times. But this
is not conclusive yet. Larger, more realistic queries would be more suitable
for the comparison;
5) RCFile
mpression in sequence files. For now we can continue with the same
serialization used in BinStorage, though in the future we may want to change
this as well.
> need to investigate the impact of compression on pig
need to investigate the impact of compression on pig performance
Key: PIG-1501
URL: https://issues.apache.org/jira/browse/PIG-1501
Project: Pig
Issue Type: Test
[
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Dai updated PIG-200:
---
Attachment: (was: pigmix2.patch)
> Pig Performance Benchma
[
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Dai updated PIG-200:
---
Attachment: pigmix2.patch
> Pig Performance Benchmarks
> --
>
>
will use pig.jar and pigperf.jar. Scripts are in test/utils/pigmix/scripts.
To generate data, use generate_data.sh. To run PigMix2, use runpigmix-adhoc.pl.
> Pig Performance Benchmarks
> --
>
> Key: PIG-200
> URL: https://issues.
you using the pig 0.6 release? What error
message did you see?
> Pig Performance Benchmarks
> --
>
> Key: PIG-200
> URL: https://issues.apache.org/jira/browse/PIG-200
> Project: Pig
> Issue Type: Ta
"ant jar compile-test".
What do I need to install before I execute this command?
Thanks
Duncan
> Pig Performance Benchmarks
> --
>
> Key: PIG-200
> URL: https://issues.apache.org/jira/browse/PIG-200
> Pr
[
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846782#action_12846782
]
duncan commented on PIG-200:
Thank you very much, Daniel.
> Pig Performance Ben
to
generate input data for PigMix is:
1. apply perf-0.6.patch to the pig 0.6 release
2. ant jar compile-test
3. export PIG_HOME=.
4. test/utils/pigmix/datagen/generate_data.sh
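Steps 2 to 4 above can be scripted. The following is a minimal Python sketch under stated assumptions: the function names `pigmix_datagen_steps` and `run_steps` are hypothetical, step 1 (applying perf-0.6.patch) is left manual because the thread does not give the exact patch invocation, and the commands are assumed to run from the root of the patched pig 0.6 source tree.

```python
import os
import shlex
import subprocess


def pigmix_datagen_steps(pig_home="."):
    """Return steps 2-4 of the list above as (argv, extra_env) pairs.

    Step 3 (export PIG_HOME=.) is modelled as an environment variable
    that is passed to the data generation script in step 4.
    """
    return [
        (["ant", "jar", "compile-test"], {}),
        (["test/utils/pigmix/datagen/generate_data.sh"], {"PIG_HOME": pig_home}),
    ]


def run_steps(steps, cwd=".", dry_run=True):
    """Run each step in order; with dry_run=True only print what would run."""
    rendered = []
    for argv, extra_env in steps:
        prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in extra_env.items())
        line = (prefix + " " if prefix else "") + shlex.join(argv)
        rendered.append(line)
        if dry_run:
            print(line)
            continue
        # check=True stops the sequence as soon as one step fails.
        subprocess.run(argv, cwd=cwd, env={**os.environ, **extra_env}, check=True)
    return rendered


if __name__ == "__main__":
    run_steps(pigmix_datagen_steps())
```

The dry-run default makes it possible to inspect the commands before running them against a real checkout; passing `dry_run=False` executes them in order.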
> Pig Performance Benchmarks
> --
>
> Key: PIG-200
>
http://www.eli.sdsu.edu/java-SDSU/sdsuLibJKD12.jar, and put it in your lib
> Pig Performance Benchmarks
> --
>
> Key: PIG-200
> URL: https://issues.apache.org/jira/browse/PIG-200
> Project: Pig
> Issue Type: Task
rent things in the perf.patch.
I want to generate a data set and use those 14 pig queries for benchmarking.
Would you mind telling me more about how to use the perf.patch?
Thanks
Duncan
> Pig Performance Benchmarks
> --
>
> Key: PIG-200
>
rate the input file for
the queries.
> Pig Performance Benchmarks
> --
>
> Key: PIG-200
> URL: https://issues.apache.org/jira/browse/PIG-200
> Project: Pig
> Issue Type: Task
>
h in order to run
those 14 queries?
> Pig Performance Benchmarks
> --
>
> Key: PIG-200
> URL: https://issues.apache.org/jira/browse/PIG-200
> Project: Pig
> Issue Type: Task
>
[
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alan Gates reassigned PIG-200:
--
Assignee: Alan Gates
> Pig Performance Benchmarks
> --
>
>
http://wiki.apache.org/pig/DataGeneratorHadoop
> Pig Performance Benchmarks
> --
>
> Key: PIG-200
> URL: https://issues.apache.org/jira/browse/PIG-200
> Project: Pig
> Issue Type: Task
>
be installed on top of perf.patch.
The design doc is here.
http://twiki.corp.yahoo.com/view/Tiger/DataGeneratorHadoop
> Pig Performance Benchmarks
> --
>
> Key: PIG-200
> URL: https://issues.apache.org/jira
the SIGMOD 2009 paper.
https://issues.apache.org/jira/browse/HIVE-396
We also spent a lot of time writing pig programs for those queries, and we
have some preliminary results.
Will somebody from the pig team take a look and help improve the pig queries?
> Pig Performance Ben
That's correct. The 10m in the names wasn't really meant to be
hardcoded into the patch, as the idea is that the tables could be
created at different sizes depending on your cluster size. Sorry for
the incomplete state of things; obviously that patch needs some work
before I commit it.
Hi Alan & Others,
I am using pigmix patch at:
https://issues.apache.org/jira/browse/PIG-200 and want to generate
test data and run pigmix queries on it. As I understand, shell scripts
in the patch are intended to generate data for pigmix queries.
I have been able to adapt the shell scripts, map-re
[
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olga Natkovich resolved PIG-200.
Resolution: Fixed
PigMix is our set of benchmarks going forward.
> Pig Performance Benchma
computations across
multiple stores.
Olga
-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Saturday, December 20, 2008 10:33 AM
To: pig-dev@hadoop.apache.org
Cc: pig-dev@hadoop.apache.org
Subject: Re: Pig performance
I think the key points that Alan brought up
combine computations across
> multiple stores.
>
> Olga
>
> > -Original Message-
> > From: Ted Dunning [mailto:ted.dunn...@gmail.com]
> > Sent: Saturday, December 20, 2008 10:33 AM
> > To: pig-dev@hadoop.apache.org
> > Cc: pig-dev@hadoop.apache.
I think the key points that Alan brought up in his blog comment were
that trunk pig is paradoxically not the most current and that storing
intermediate results can decrease the scope of optimizations.
On Dec 20, 2008, at 10:16, Alan Gates wrote:
I left a comment on the blog addressing som
I left a comment on the blog addressing some of the issues he brought
up.
Alan.
On Dec 20, 2008, at 1:00 AM, Jeff Hammerbacher wrote:
Hey Pig team,
Did anyone check out the recent claims about Pig's poor performance
versus
Cascading? Though I haven't worked extensively with either system,
Hey Pig team,
Did anyone check out the recent claims about Pig's poor performance versus
Cascading? Though I haven't worked extensively with either system, I found
the statements made fairly bold and am curious to hear more about their
validity from the Pig development team:
http://www.manamplifie
benchmarks for pig. It contains a set of 14 queries which are designed to try
to cover a range of ways users use pig. It also includes implementations of
the same queries in java code for map reduce, so that developers can compare
pig performance against map reduce performance. See
http