Yiping,
(1) Any ETA for when that will become available?
(2) Where can we read more about the SQL functionality it will support?
(3) Where is the JIRA for this?
Thanks,
-- amr
Luc Hunt wrote:
Ricky,
One thing to mention: SQL support is on the Pig roadmap this year.
--Yiping
On
If you want to use many small files, they probably have the same
purpose and structure.
Why not use HBase instead of raw HDFS? Many small files would be packed
together and the problem would disappear.
cheers
Piotr
2009/5/7 Jonathan Cao jonath...@rockyou.com
There are at least two design
Hey,
You can read more about why small files are difficult for HDFS at
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.
Regards,
Jeff
2009/5/7 Piotr Praczyk piotr.prac...@gmail.com
If you want to use many small files, they probably have the same
purpose and structure.
Why
This is better directed at the Hadoop mailing lists. I've added the Hadoop
core-user mailing list to your query.
Cheers,
-n
On Thu, May 7, 2009 at 1:11 AM, monty123 mayurchou...@yahoo.com wrote:
My query is how Hadoop manages map files and other such things. What
internal data structure
On Thu, May 7, 2009 at 6:05 AM, Foss User foss...@gmail.com wrote:
Thanks for your response again. I could not understand a few things in
your reply. So, I want to clarify them. Please find my questions
inline.
On Thu, May 7, 2009 at 2:28 AM, Todd Lipcon t...@cloudera.com wrote:
On Wed, May
I believe the JRockit JVM has a slightly higher startup time than the Sun JVM, but
that should not make a lot of difference, especially if JVMs are reused in
0.19.
Which Hadoop version are you using? What Hadoop job are you running? And
what performance do you get?
Thanks,
JQ
I am running the test on 0.18.1 and 0.19.1. Both versions have the same
issue with the JRockit JVM. It is the example sort job, sorting 20 GB of data on
1+2 nodes.
Following is the result (version 0.18.1). The sort job running with the JRockit
JVM took 260 secs more than with the Sun JVM.
My query is how Hadoop manages map files and other such things. What
internal data structure does it use to manage things?
Is it a graph or something?
Please help.
Hi all,
I have an application that needs the rules for sorting and grouping to use
different Comparators.
I have tested 0.19.1 and 0.20.0 for this, but neither works for the
Combiner.
In 0.19.1 I use job.setOutputValueGroupingComparator(), and
in 0.20.0 I use job.setGroupingComparatorClass().
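For reference, a minimal sketch (class name and key layout are my own assumptions, not from the original post) of wiring a separate grouping comparator with the 0.20.0 API; it governs how reducer input is grouped, while the sort order still comes from the key's own comparator:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;

public class GroupingSetup {
    // Groups reducer input by the part of a Text key before the first '#',
    // while the full key still determines the sort order.
    public static class FirstPartGroupingComparator extends WritableComparator {
        protected FirstPartGroupingComparator() {
            super(Text.class, true);
        }
        public int compare(WritableComparable a, WritableComparable b) {
            String left = a.toString().split("#", 2)[0];
            String right = b.toString().split("#", 2)[0];
            return left.compareTo(right);
        }
    }

    public static void configure(Job job) {
        job.setGroupingComparatorClass(FirstPartGroupingComparator.class);
    }
}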
I have two reducers running on two different machines. I ran the
example word count program with some of my own System.out.println()
statements to see what is going on.
There were 2 slaves, each running a datanode as well as a tasktracker.
There was one namenode and one jobtracker. I know there is a
With such a small data set, who knows what will happen: you are
probably hitting minimal limits of some kind.
Repeat this with more data.
Miles
2009/5/7 Foss User foss...@gmail.com:
I have two reducers running on two different machines. I ran the
example word count program with some of my own
Just one more question: does Hadoop handle reassigning failed tasks
to different machines in some way?
Yes. If a task fails, it is retried, preferably on a different machine.
I saw that sometimes, usually at the end, when there are more
processing units available than map() tasks to
SQL has been on Pig's roadmap for some time, see
http://wiki.apache.org/pig/ProposedRoadMap
We would like to add SQL support to Pig sometime this year. We don't
have an ETA or a JIRA for it yet.
Alan.
On May 6, 2009, at 11:20 PM, Amr Awadallah wrote:
Yiping,
(1) Any ETA for when that
2009/5/7 Jeff Hammerbacher ham...@cloudera.com:
Hey,
You can read more about why small files are difficult for HDFS at
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.
Regards,
Jeff
2009/5/7 Piotr Praczyk piotr.prac...@gmail.com
If You want to use many small files, they
Most likely the 3rd mapper ran as a speculative execution, and it is
possible that all of your keys hashed to a single partition. Also, if you
don't specify otherwise, the default is to run a single reduce task.
From JobConf,
/**
* Get configured the number of reduce tasks for this job. Defaults to
*
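To make that concrete, a minimal sketch (old-API style; the reduce count and class name are illustrative) of setting the reduce count explicitly so keys can spread across more than one partition:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class ReduceCountSetup {
    public static void configure(JobConf conf) {
        // Left unset, the job runs a single reduce task, so every key lands in
        // the same partition no matter how it hashes.
        conf.setNumReduceTasks(2);
        // Default partitioner: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
        conf.setPartitionerClass(HashPartitioner.class);
    }
}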
The way I typically address that is to write a zip file using the zip
utilities, commonly for output.
HDFS is not optimized for low latency, but for high throughput for bulk
operations.
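As a sketch of that approach (paths and contents are made up for illustration), many small files can be packed into a single zip written straight to an HDFS stream:

import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PackToZip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/demo/packed.zip"));
        ZipOutputStream zip = new ZipOutputStream(out);
        // One zip entry per small file; HDFS only sees the single large archive.
        zip.putNextEntry(new ZipEntry("small-file-1.txt"));
        zip.write("contents of the first small file\n".getBytes("UTF-8"));
        zip.closeEntry();
        zip.close(); // also flushes and closes the underlying HDFS stream
    }
}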
2009/5/7 Edward Capriolo edlinuxg...@gmail.com
2009/5/7 Jeff Hammerbacher ham...@cloudera.com:
Hey,
a couple of years back we did a lot of experimentation between Sun's
VM and JRockit. We had initially assumed that JRockit was going to
scream since that's what the press were saying. In short, what we
discovered was that certain JDK library usage was a little bit faster
with JRockit, but
It may simply be that your JVMs are spending their time doing garbage
collection instead of running your tasks.
My book, in chapter 6, has a section on how to tune your jobs and how to
determine what to tune. That chapter is available now as an alpha.
On Wed, May 6, 2009 at 1:29 PM, Todd Lipcon
I have used multiple file systems in jobs, but have not used HAR as one of them.
It worked for me in 0.18.
On Wed, May 6, 2009 at 4:07 AM, Tom White t...@cloudera.com wrote:
Hi Ivan,
I haven't tried this combination, but I think it should work. If it
doesn't, it should be treated as a bug.
Tom
On
Chris Collins wrote:
a couple of years back we did a lot of experimentation between Sun's VM
and JRockit. We had initially assumed that JRockit was going to scream
since that's what the press were saying. In short, what we discovered
was that certain JDK library usage was a little bit faster
On Thu, May 7, 2009 at 8:51 PM, jason hadoop jason.had...@gmail.com wrote:
Most likely the 3rd mapper ran as a speculative execution, and it is
possible that all of your keys hashed to a single partition. Also, if you
don't specify otherwise, the default is to run a single reduce task.
As I mentioned in
I have written a rack awareness script which maps the IP addresses to
rack names in this way:
10.31.1.* - /room1/rack1
10.31.2.* - /room1/rack2
10.31.3.* - /room1/rack3
10.31.100.* - /room2/rack1
10.31.200.* - /room2/rack2
10.31.200.* - /room2/rack3
I understand that DFS will try to have
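The usual mechanism is an external topology script, but the same mapping can also be expressed in Java via the DNSToSwitchMapping interface (the class name and fallback rack below are my own assumptions, and only the first of the two 10.31.200.* entries above is reproduced); it is configured with topology.node.switch.mapping.impl:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

public class RoomRackMapping implements DNSToSwitchMapping {
    public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>(names.size());
        for (String name : names) {
            if (name.startsWith("10.31.1.")) {
                racks.add("/room1/rack1");
            } else if (name.startsWith("10.31.2.")) {
                racks.add("/room1/rack2");
            } else if (name.startsWith("10.31.3.")) {
                racks.add("/room1/rack3");
            } else if (name.startsWith("10.31.100.")) {
                racks.add("/room2/rack1");
            } else if (name.startsWith("10.31.200.")) {
                racks.add("/room2/rack2");
            } else {
                racks.add("/default-rack"); // fallback for unmapped addresses
            }
        }
        return racks;
    }
}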
It's over TCP/IP, in a custom protocol. See DataXceiver.java. My sense is
that it's a custom protocol because Hadoop's IPC mechanism isn't optimized
for large messages.
-- Philip
On Thu, May 7, 2009 at 9:11 AM, Foss User foss...@gmail.com wrote:
I understand that the blocks are transferred
The work was done 3 months ago, and the exact query I used may not have been
the one below, but it was functionally the same: two sources, arithmetic aggregation
on each, inner-joined by a small set of values. We wrote a hand-coded map
reduce, a Pig script, and Hive against the same data and
Scott,
Namit is actually correct. If you do an explain on the query that he sent out,
you actually get only 2 map/reduce jobs and not 5 with Hive. We have verified
that and that is consistent with what we should expect in this case. We would
be very interested to know the exact query that you
OK, that explains a lot. When we started Hive, our immediate use case
was to do group-bys on data with a lot of skew on the grouping keys. In that
scenario it is better to do this in 2 map/reduce jobs, using the first one to
randomly distribute data and generate the partial sums
Problem:
I am comparing two jobs. They both have the same input content; however,
in one job the input file has been gzipped, and in the other it has not.
I get far fewer output rows in the gzipped result than I do in the
uncompressed version:
Lines in output:
Gzipped: 86851
Uncompressed:
Hi,
What input format are you using for the GZipped file?
I don't believe there is a GZip input format, although some people have
discussed whether it is feasible...
Cheers
Tim
On Thu, May 7, 2009 at 9:05 PM, Malcolm Matalka
mmata...@millennialmedia.com wrote:
Problem:
I am comparing two
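For what it's worth, a minimal sketch (the input path is an assumption) of asking Hadoop which compression codec, if any, it associates with an input file by its extension:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(new Path("input/data.gz"));
        // Prints the codec class (e.g. GzipCodec) or "none" if the extension is unknown.
        System.out.println(codec == null ? "none" : codec.getClass().getName());
    }
}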
On Thu, May 7, 2009 at 3:10 AM, Owen O'Malley omal...@apache.org wrote:
On May 6, 2009, at 12:15 PM, Foss User wrote:
Is it possible to sort the intermediate values for each key before
the (key, list of values) pair reaches the reducer?
Look at the example SecondarySort.
Where can I find
This is the result of running gzip on the input files. There appears to be
some support, for two reasons:
1) I do get some output in my results. There are 86851 lines in my output
file, and they are valid results.
2) In the job task output I pasted it states:
Philip Zeyliger wrote:
It's over TCP/IP, in a custom protocol. See DataXceiver.java. My sense is
that it's a custom protocol because Hadoop's IPC mechanism isn't optimized
for large messages.
Yes, and job classes are not distributed using this. It is a very simple
protocol used to read
On Fri, May 8, 2009 at 1:20 AM, Raghu Angadi rang...@yahoo-inc.com wrote:
Philip Zeyliger wrote:
It's over TCP/IP, in a custom protocol. See DataXceiver.java. My sense
is
that it's a custom protocol because Hadoop's IPC mechanism isn't optimized
for large messages.
yes, and job classes
I was trying to write Java code to copy a file from the local file system to
another file system (which is also a local file system). This is my code.
package in.fossist.examples;
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
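For what the snippet appears to be aiming at, here is a minimal sketch (file names and paths are assumptions) of copying between file systems with FileUtil while keeping both sides on the local file system:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class LocalCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem localFs = FileSystem.getLocal(conf);  // file:// file system
        Path src = new Path("/tmp/a.txt");               // assumed source
        Path dst = new Path("/tmp/copy-of-a.txt");       // assumed destination
        // Copy within (or between) file systems without going through HDFS.
        FileUtil.copy(localFs, src, localFs, dst, false /* deleteSource */, conf);
    }
}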
On Thu, May 7, 2009 at 1:26 PM, Foss User foss...@gmail.com wrote:
I was trying to write Java code to copy a file from the local file system to
another file system (which is also a local file system). This is my code.
package in.fossist.examples;
import java.io.File;
import java.io.IOException;
import
On May 7, 2009, at 12:38 PM, Foss User wrote:
Where can I find this example. I was not able to find it in the
src/examples directory.
It is in 0.20.
http://svn.apache.org/repos/asf/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/SecondarySort.java
-- Owen
On Fri, May 8, 2009 at 1:59 AM, Todd Lipcon t...@cloudera.com wrote:
On Thu, May 7, 2009 at 1:26 PM, Foss User foss...@gmail.com wrote:
I was trying to write Java code to copy a file from the local file system to
another file system (which is also a local file system). This is my code.
package
On Thu, May 7, 2009 at 1:47 PM, Foss User foss...@gmail.com wrote:
This does not work for me, as you are reading a.txt from the DFS
while I want to read a.txt from the local file system. Also, I
do not want to copy the file to the distributed file system. Instead I
want to copy it to
Albert Sunwoo wrote:
Thanks for the info!
I was hoping to get some more specific information though.
In short: we need more info.
There are typically 4 machines/processes involved in a write: the
client and 3 datanodes writing the replicas. To see what really
happened, you need to
There are a lot of tuning knobs for the JRockit JVM when it comes to
performance; that tuning can make a huge difference. I'm very interested in
whether there are some tuning tips for Hadoop.
Grace, what are the parameters that you used in your testing?
Thanks,
JQ
On Thu, May 7, 2009 at 11:35 PM,
Hi all,
I have a few large files (4 that are 1.8GB+) I'm trying to copy from
HDFS to S3. My micro EC2 cluster is running Hadoop 0.19.1, and has
one master/two slaves.
I first tried using the hadoop fs -cp command, as in:
hadoop fs -cp output/dir/ s3n://bucket/dir/
This seemed to be
Hi Aseem,
Thank you, but after fs.trash.interval I got something else. Maybe my
version is not correct. What is your Eclipse Europa version?
George
Puri, Aseem wrote:
George,
In my Eclipse Europa it is showing the attribute
hadoop.job.ugi. It is after the fs.trash.interval.
Hi Ken,
S3N doesn't work that well with large files. When uploading a file to
S3, S3N saves it to local disk during write() and then uploads to S3
during the close(). Close can take a long time for large files and it
doesn't report progress, so the call can time out.
As a workaround, I'd
On Thu, May 7, 2009 at 1:04 PM, Foss User foss...@gmail.com wrote:
On Fri, May 8, 2009 at 1:20 AM, Raghu Angadi rang...@yahoo-inc.com
wrote:
Philip Zeyliger wrote:
It's over TCP/IP, in a custom protocol. See DataXceiver.java. My sense
is
that it's a custom protocol because
Thanks, all, for your replies.
I have run the job several times with different Java options for the Map/Reduce
tasks. However, there is not much difference.
Following is an example of my test settings:
Test A: -Xmx1024m -server -XXlazyUnlocking -XlargePages
-XgcPrio:deterministic -XXallocPrefetch
Hi All,
Is there any way that I can access the Hadoop API through Python? I am aware
that Hadoop Streaming can be used to create a mapper and reducer in a
different language, but I have not come across any module that helps me apply
functions to manipulate data or control things as is possible in Java.
On Fri, May 8, 2009 at 9:37 AM, Aditya Desai aditya3...@gmail.com wrote:
Hi All,
Is there any way that I can access the Hadoop API through Python? I am aware
that Hadoop Streaming can be used to create a mapper and reducer in a
different language, but I have not come across any module that helps
You should consider using Dumbo to run Python jobs with Hadoop Streaming:
http://wiki.github.com/klbostee/dumbo
Dumbo is already very useful, and it is improving all the time.
Zak
On Fri, May 8, 2009 at 12:07 AM, Aditya Desai aditya3...@gmail.com wrote:
Hi All,
Is there any way that I can
Hi, everyone! I am new to Hadoop and recently set up a small
Hadoop cluster with several users accessing it. However, I notice
that no matter which user logs in to HDFS and does some operations, the
files always belong to the user DrWho in group Supergroup. HDFS
seems to provide no access
I upgraded to 0.20.0 last week and noticed that most everything in
org.apache.hadoop.mapred.* has been deprecated. However, I've not
had any luck getting the new Map-Reduce classes to work.
Hadoop Streaming still seems to expect the old API, and it doesn't seem
that JobClient has
read this doc:
http://hadoop.apache.org/core/docs/r0.20.0/hdfs_permissions_guide.html
On Fri, May 8, 2009 at 12:56 PM, Starry SHI starr...@gmail.com wrote:
Hi, everyone! I am new to Hadoop and recently set up a small
Hadoop cluster with several users accessing it. However, I notice
examples/wordcount has been modified to use the new API. Also, there is a
test case in the mapreduce directory that uses the new API.
Jothi
On 5/8/09 10:59 AM, Brian Ferris bdfer...@cs.washington.edu wrote:
I was upgraded to 0.20.0 last week and I noticed most everything in
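For readers following along, a minimal sketch of a driver written against the new org.apache.hadoop.mapreduce API in 0.20, along the lines of the updated examples/wordcount (the class names here are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewApiWordCount {
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "new-api word count"); // replaces JobConf + JobClient
        job.setJarByClass(NewApiWordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}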
Thanks so much. That did the trick.
On May 7, 2009, at 10:34 PM, Jothi Padmanabhan wrote:
examples/wordcount has been modified to use the new API. Also, there
is a
test case in the mapreduce directory that uses the new API.
Jothi
On 5/8/09 10:59 AM, Brian Ferris