How to Rename & Create DB Table in Hadoop?

2009-05-19 Thread dealmaker
Hi, I want to back up a table and then create a new empty one with the following commands in Hadoop. How do I do it in Java? Thanks. BEGIN; RENAME TABLE my_table TO backup_table; CREATE TABLE my_table LIKE backup_table; COMMIT;
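
A minimal sketch of doing this from Java over plain JDBC, assuming the table lives in MySQL (the driver class, URL, and credentials are hypothetical). Note that RENAME TABLE and CREATE TABLE are DDL statements in MySQL and commit implicitly, so the BEGIN/COMMIT pair adds no real atomicity:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BackupTable {
      public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://dbhost/mydb", "user", "pass");
        Statement st = conn.createStatement();
        st.executeUpdate("RENAME TABLE my_table TO backup_table");
        st.executeUpdate("CREATE TABLE my_table LIKE backup_table");
        st.close();
        conn.close();
      }
    }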

Re: Number of maps and reduces not obeying my configuration

2009-05-19 Thread Foss User
On Wed, May 20, 2009 at 3:39 AM, Chuck Lam wrote: > Can you set the number of reducers to zero and see if it becomes a map-only > job? If it does, then it's able to read in the mapred.reduce.tasks property > correctly but just refuses to have 2 reducers. In that case, it's most likely > you're runn

Re: Optimal Filesystem (and Settings) for HDFS

2009-05-19 Thread jason hadoop
I always disable atime and its ilk. The deadline scheduler helps with the (non-xfs-hanging) du datanode timeout issues, but not much. Ultimately that is a caching failure in the kernel, due to the Hadoop IO patterns. Anshu, any luck getting off the PAE kernels? Is this the xfs lockup, or just the

Re: Hadoop & Python

2009-05-19 Thread Zak Stone
Dumbo certainly makes Python Streaming much nicer; there's more info here: http://wiki.github.com/klbostee/dumbo http://dumbotics.com/ For example, Dumbo makes it easy to implement combiners in Python. Zak On Tue, May 19, 2009 at 8:17 PM, Alex Loddengaard wrote: > You might also check out Dum

Re: Mysql Load Data Infile with Hadoop?

2009-05-19 Thread Alex Loddengaard
DBOutputFormat will very likely put significantly more load on your MySQL server vs. LOAD DATA INFILE. DBOutputFormat will trounce your MySQL server with at least one connection per reducer. This may be OK if you have a small number of reducers and a small amount of output data. LOAD DATA INFILE
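
For context, a sketch of the DBOutputFormat wiring being compared (the old mapred API; the job class, table, columns, and credentials are hypothetical):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBOutputFormat;

    JobConf job = new JobConf(MyJob.class);  // MyJob is a placeholder
    job.setOutputFormat(DBOutputFormat.class);
    DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "user", "pass");
    DBOutputFormat.setOutput(job, "my_table", "col1", "col2");
    // Each reducer opens its own connection, so keep the reducer count small.
    job.setNumReduceTasks(2);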

Re: Hadoop & Python

2009-05-19 Thread Alex Loddengaard
You might also check out Dumbo, which is a Hadoop Python module. Alex On Tue, May 19, 2009 at 10:35 AM, s d wrote: > Thanks. > So in the overall scheme of things, what is the general feeling about using > python for this? I like the ease of de

Re: Optimal Filesystem (and Settings) for HDFS

2009-05-19 Thread Anshuman Sachdeva
Hi Brian, thanks for the mail. I have an issue when we use xfs. Hadoop runs du -sk every 10 min on my cluster, and sometimes it goes into a loop and the machine hangs. Have you seen this issue, or is it only me? I'd really appreciate it if someone could shed some light on this. Anshuman

Re: DFS Access Error

2009-05-19 Thread George Pang
I made a mistake, swapping the port number for the MR master with the one for the HDFS master. After correcting that, it works. George 2009/5/19 George Pang > Dear Users, > > When I tried to access the DFS, an error message appears: > > Error: java.io.IOException: Unknown protocol to job tracker: > org.apache.hado
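
For anyone hitting the same error, a sketch of the distinction expressed with the client-side Configuration API (host names and ports below are hypothetical; the usual fix is the equivalent entries in hadoop-site.xml):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    Configuration conf = new Configuration();
    // HDFS master (NameNode) -- used by DFS clients
    conf.set("fs.default.name", "hdfs://master:9000");
    // MapReduce master (JobTracker) -- used by job clients
    conf.set("mapred.job.tracker", "master:9001");
    FileSystem fs = FileSystem.get(conf);  // now talks to the NameNode port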

Re: Number of maps and reduces not obeying my configuration

2009-05-19 Thread Chuck Lam
Can you set the number of reducers to zero and see if it becomes a map-only job? If it does, then it's able to read in the mapred.reduce.tasks property correctly but just refuses to have 2 reducers. In that case, it's most likely you're running in local mode, which doesn't allow more than 1 reducer.
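
A sketch of Chuck's experiment with the old JobConf API (the job class name is hypothetical):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    JobConf job = new JobConf(MyJob.class);  // MyJob is a placeholder
    job.setNumReduceTasks(0);                // map-only: skip shuffle and reduce
    JobClient.runJob(job);
    // If this produces a map-only job, mapred.reduce.tasks is being read;
    // the 1-reducer cap then most likely comes from running in local mode.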

Re: Mysql Load Data Infile with Hadoop?

2009-05-19 Thread dealmaker
Does DBOutputFormat have performance similar to Load Data Infile? Thanks. TimRobertson100 wrote: > > So you are using a java program to execute a "load data infile" > command on mysql through JDBC? > If so I *think* you would have to copy it onto the mysql machine from > HDFS first, or the ma

Re: Optimal Filesystem (and Settings) for HDFS

2009-05-19 Thread Bryan Duxbury
We use XFS for our data drives, and we've had somewhat mixed results. One of the biggest pros is that XFS has more free space than ext3, even with the reserved space settings turned all the way to 0. Another is that you can format a 1TB drive as XFS in about 0 seconds, versus minutes for ex

DFS Access Error

2009-05-19 Thread George Pang
Dear Users, When I tried to access the DFS, an error message appears: Error: java.io.IOException: Unknown protocol to job tracker: org.apache.hadoop.dfs.ClientProtocol at org.apache.hadoop.mapred.JobTracker.getProtocolVersion(JobTracker.java:163) at sun.reflect.NativeMethodAccessorImpl.invoke0(Nati

Re: Is intermediate data produced by mappers always flushed to disk ?

2009-05-19 Thread Scott Carey
Yes and no. Most OSs/filesystems will get file data to disk within 5 seconds if the files are small. But if it is written, read, and deleted quickly it may not ever hit disk. Applications may request that data is flushed to disk earlier. In a Hadoop environment, smaller or medium sized files mo
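
As an aside, this is roughly what "applications may request that data is flushed to disk earlier" looks like in Java (an illustrative sketch, not something Hadoop's mappers do for intermediate output):

    import java.io.FileOutputStream;
    import java.io.IOException;

    public class SyncExample {
      public static void main(String[] args) throws IOException {
        FileOutputStream out = new FileOutputStream("data.tmp");
        out.write("intermediate data".getBytes());
        out.flush();         // moves buffered bytes into the OS page cache
        out.getFD().sync();  // asks the OS to force them onto the disk itself
        out.close();
      }
    }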

Re: Hadoop & Python

2009-05-19 Thread Peter Skomoroch
Direct link to HADOOP-4842: https://issues.apache.org/jira/browse/HADOOP-4842 On Tue, May 19, 2009 at 5:04 PM, Peter Skomoroch wrote: > Whoops, should have googled it first. Looks like this is now fixed in > trunk, HADOOP-4842. For people stuck using 18.3, a workaround appears to be > adding s

Re: Hadoop & Python

2009-05-19 Thread Peter Skomoroch
Whoops, should have googled it first. Looks like this is now fixed in trunk, HADOOP-4842. For people stuck using 18.3, a workaround appears to be adding something like "| sort | sh combiner.sh" to the call of the mapper script (via Klaas Bosteels) Would be great to get this patched into distribu

Re: Hadoop & Python

2009-05-19 Thread Peter Skomoroch
One area I'm curious about is the requirement that any combiners in Streaming jobs be java classes. Are there any plans to change this in the future? Prototyping streaming jobs in Python is great, and the ability to use a Python combiner would help performance a lot without needing to move to Jav
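
For reference, this is the kind of trivial Java class a Streaming combiner currently requires, passed to the job via -combiner (a minimal sum-combiner sketch against the old mapred API; the class name and the assumption that values are numeric strings are mine):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Streaming keys/values arrive as Text; sum the numeric values per key.
    public class SumCombiner extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        long sum = 0;
        while (values.hasNext()) {
          sum += Long.parseLong(values.next().toString());
        }
        output.collect(key, new Text(Long.toString(sum)));
      }
    }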

Re: Number of maps and reduces not obeying my configuration

2009-05-19 Thread Foss User
On Wed, May 20, 2009 at 1:52 AM, Piotr Praczyk wrote: > After a first mail I understood that you are providing additional job.xml ( > which can be done). > What version of Hadoop do you use ? In 0.20 there was some change in > configuration files - as far as I understood from the messages, > hadoo

Re: Hadoop & Python

2009-05-19 Thread Amr Awadallah
S d, It is totally fine to use Python streaming if it does the job you are after; there will be a slight performance hit, but that is noise assuming your cluster is a small one. If you are operating a large cluster continuously, then once your logic is stabilized using Python it might make s

Re: Number of maps and reduces not obeying my configuration

2009-05-19 Thread Piotr Praczyk
After your first mail I understood that you are providing an additional job.xml (which can be done). What version of Hadoop do you use? In 0.20 there was some change in configuration files - as far as I understood from the messages, hadoop-site.xml was split into a few others... where the overriding se

RE: Mysql Load Data Infile with Hadoop?

2009-05-19 Thread Marc Limotte
You might also try using something like Fuse-dfs (http://wiki.apache.org/hadoop/MountableHDFS) to "mount" the HDFS file system on the mysql machine. You could then use a standard Unix path to specify the load file. Marc

Re: Mysql Load Data Infile with Hadoop?

2009-05-19 Thread tim robertson
So you are using a java program to execute a "load data infile" command on mysql through JDBC? If so I *think* you would have to copy it onto the mysql machine from HDFS first, or the machine running the command, and then try a 'load data local infile'. Or perhaps use the http://hadoop.apache.org/co
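
A sketch of that copy-then-load approach in Java (paths, credentials, and the assumption that the MySQL server allows LOAD DATA LOCAL INFILE are all hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LoadFromHdfs {
      public static void main(String[] args) throws Exception {
        // 1. Pull the file out of HDFS onto the machine running this program.
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyToLocalFile(new Path("/data/test.txt"), new Path("/tmp/test.txt"));

        // 2. Load the local copy into MySQL over JDBC.
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://dbhost/test", "user", "pass");
        Statement st = conn.createStatement();
        st.executeUpdate("LOAD DATA LOCAL INFILE '/tmp/test.txt' INTO TABLE test"
            + " FIELDS TERMINATED BY ',' LINES STARTING BY 'xxx'");
        st.close();
        conn.close();
      }
    }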

Re: Is intermediate data produced by mappers always flushed to disk ?

2009-05-19 Thread Billy Pearson
The only way to do something like this is to get the mappers to use something like /dev/shm as their storage folder; that's 100% memory. Outside of that, everything is flushed, because the mapper exits when it's done; the tasktracker is the one delivering the output to the reduce task. Billy "paula_t

Re: Mysql Load Data Infile with Hadoop?

2009-05-19 Thread Sheldon Neuberger
You could copy the file to your local filesystem with something like `hadoop dfs -copyToLocal test.txt local_test.txt` On Tue, May 19, 2009 at 3:54 PM, dealmaker wrote: > > Hi, >  I want to load data into mysql using a hadoop file similar to the following: > LOAD DATA INFILE 'test.txt' INTO TABLE t

Mysql Load Data Infile with Hadoop?

2009-05-19 Thread dealmaker
Hi, I want to load data into mysql using a hadoop file, similar to the following: LOAD DATA INFILE 'test.txt' INTO TABLE test FIELDS TERMINATED BY ',' LINES STARTING BY 'xxx'; But how do I load the hdfs file in the mysql command above? Do I start the file name with hdfs://test.txt? I am using

difference between bytes read and local bytes read?

2009-05-19 Thread Foss User
When we see the job details on the job tracker web interface, we see "bytes read" as well as "local bytes read". What is the difference between the two?

Re: Hadoop & Python

2009-05-19 Thread Billy Pearson
I used streaming and php before to process a data set of about 1TB without any problems at all. Billy "s d" wrote in message news:24b53fa00905191035w41b115c1q94502ee82be43...@mail.gmail.com... Thanks. So in the overall scheme of things, what is the general feeling ab

My configuration in conf/hadoop-site.xml is not being used. Why?

2009-05-19 Thread Foss User
I ran a job. In the jobtracker web interface, I found 4 maps and 1 reduce running. This is not what I set in my configuration files (hadoop-site.xml). My configuration file, conf/hadoop-site.xml is set as follows: mapred.map.tasks = 2 mapred.reduce.tasks = 2 However, the description of these pro

Hive using EC2/S3

2009-05-19 Thread Joydeep Sen Sarma
Hi folks, I have put up a short tutorial on running SQL queries on EC2 against files in S3 using Hive and Hadoop. Please find it here: http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely Some example data and queries (from TPCH benchmark) are also made available in S3. Cc'ing core-us

Re: Finding where the file blocks are

2009-05-19 Thread Philip Zeyliger
On Tue, May 19, 2009 at 1:00 AM, Foss User wrote: > On Tue, May 19, 2009 at 12:53 PM, Ravi Phulari > wrote: > > If you have hadoop superuser/administrative permissions you can use fsck > > with correct options to view block report and locations for every block. > > > > For further information p

Re: Seattle / PNW Hadoop + Lucene User Group?

2009-05-19 Thread Bradford Stephens
Hello everyone! We (finally) have space secured (it's a tough task!): University of Washington, Allen Center Room 303, at 6:45pm on Wednesday, May 27, 2009. I'm going to put together a map, and a wiki so we can collab. What I'm envisioning is a meetup for about 2 hours: we'll have two in-depth tal

Re: Hadoop & Python

2009-05-19 Thread s d
Thanks. So in the overall scheme of things, what is the general feeling about using python for this? I like the ease of deploying and reading python compared with Java, but want to make sure using python over hadoop is scalable & standard practice, and not something done only for prototyping and s

Re: Hadoop & Python

2009-05-19 Thread Alex Loddengaard
Streaming is slightly slower than native Java jobs. Otherwise Python works great in streaming. Alex On Tue, May 19, 2009 at 8:36 AM, s d wrote: > Hi, > How robust is using hadoop with python over the streaming protocol? Any > disadvantages (performance? flexibility?) ? It just strikes me that

Re: Suspend or scale back hadoop instance

2009-05-19 Thread John Clarke
The jobs will be of different sizes and some may take days to complete with only 5 machines, so yes some will run night and day. By scale back, I mean scale back on system resources (CPU, IO, RAM) so the machine can be used for other tasks during the day. I understand (as you pointed out) I can r

Re: Number of maps and reduces not obeying my configuration

2009-05-19 Thread Foss User
On Tue, May 19, 2009 at 8:23 PM, He Chen wrote: > I think they are not overridden every time. If you do not give any > configuration in your source code, hadoop-site.xml will help you > configure the framework. At the same time, you will not configure all the > parameters of hadoop framewor

Hadoop & Python

2009-05-19 Thread s d
Hi, How robust is using hadoop with python over the streaming protocol? Any disadvantages (performance? flexibility?)? It just strikes me that python is so much more convenient when it comes to deploying and crunching text files. Thanks,

Re: Number of maps and reduces not obeying my configuration

2009-05-19 Thread He Chen
I think they are not overridden every time. If you do not give any configuration in your source code, hadoop-site.xml will help you configure the framework. At the same time, you will not configure all the parameters of the hadoop framework in your program, so hadoop-site.xml helps there. On Tue, M

Re: Number of maps and reduces not obeying my configuration

2009-05-19 Thread Foss User
On Tue, May 19, 2009 at 8:04 PM, He Chen wrote: > change the following parameters > mapred.reduce.max.attempts      4 > mapred.reduce.tasks     1 to > mapred.reduce.max.attempts      2 > mapred.reduce.tasks     2 > in your program source code! If these parameters in hadoop-site.xml are always going to b

Re: Suspend or scale back hadoop instance

2009-05-19 Thread Kevin Weil
Will your jobs be running night and day, or just over a specified period? Depending on your setup, and on what you mean by "scale back" (CPU vs disk IO vs memory), you could potentially restart your cluster with different settings at different times of the day via cron. This will kill any running

Re: Number of maps and reduces not obeying my configuration

2009-05-19 Thread He Chen
Change the following parameters: mapred.reduce.max.attempts 4, mapred.reduce.tasks 1, to mapred.reduce.max.attempts 2, mapred.reduce.tasks 2, in your program source code! On Tue, May 19, 2009 at 9:14 AM, Foss User wrote: > On Tue, May 19, 2009 at 5:32 PM, Piotr Praczyk > wrote: > > Hi >
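
He Chen's suggestion, expressed with the JobConf API (a sketch; the job class name is hypothetical):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);
    conf.setNumReduceTasks(2);     // mapred.reduce.tasks = 2
    conf.setMaxReduceAttempts(2);  // mapred.reduce.max.attempts = 2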

Re: Suspend or scale back hadoop instance

2009-05-19 Thread Steve Loughran
John Clarke wrote: Hi, I am working on a project that is suited to Hadoop and so want to create a small cluster (only 5 machines!) on our servers. The servers are however used during the day and (mostly) idle at night. So, I want Hadoop to run at full throttle at night and either scale back or

Is intermediate data produced by mappers always flushed to disk ?

2009-05-19 Thread paula_ta
Is it possible that some intermediate data produced by mappers and written to the local file system resides in memory in the file system cache and is never flushed to disk? Eventually reducers will retrieve this data via HTTP - possibly without the data ever being written to disk? thanks Paula

Re: Number of maps and reduces not obeying my configuration

2009-05-19 Thread Foss User
On Tue, May 19, 2009 at 5:32 PM, Piotr Praczyk wrote: > Hi > > Your job configuration file specifies exactly the numbers of mappers and > reducers that are running in your system. The job configuration overrides > site configuration ( if parameters are not specified as final ) as far as I > know.

Re: Suspend or scale back hadoop instance

2009-05-19 Thread John Clarke
Hi Piotr, Thanks for the prompt reply. If the cron script shuts down Hadoop, surely it won't pick up where it left off when it is restarted? All the machines will be used during the day, so it is not an option to turn Hadoop off on only some of them. John 2009/5/19 Piotr Praczyk > Hi John >

Re: Suspend or scale back hadoop instance

2009-05-19 Thread Piotr Praczyk
Hi John I don't know if there is Hadoop support for such a thing, but you can do this easily by writing a crontab script. It could start hadoop at a specified hour and shut it down (disable some nodes) at another one. There can be some problems with HDFS however (if you disable all the nodes holdin

Re: Access to local filesystem working folder in map task

2009-05-19 Thread Tom White
Hi Chris, The task-attempt local working folder is actually just the current working directory of your map or reduce task. You should be able to pass your legacy command line exe and other files using the -files option (assuming you are using the Java interface to write your job, and you are imple
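
A sketch of the pattern Tom describes, with a mapper invoking a legacy executable from the task's working directory (old mapred API; all names are hypothetical, and the exe is assumed to have been shipped with -files legacy.exe):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ExecMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // legacy.exe was distributed with -files and lands in the task's
        // current working directory, so a relative path finds it.
        Process p = Runtime.getRuntime().exec(
            new String[] {"./legacy.exe", value.toString()});
        try {
          p.waitFor();
        } catch (InterruptedException e) {
          throw new IOException(e.toString());
        }
        output.collect(new Text(value.toString()),
            new Text(Integer.toString(p.exitValue())));
      }
    }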

Suspend or scale back hadoop instance

2009-05-19 Thread John Clarke
Hi, I am working on a project that is suited to Hadoop and so want to create a small cluster (only 5 machines!) on our servers. The servers are however used during the day and (mostly) idle at night. So, I want Hadoop to run at full throttle at night and either scale back or suspend itself during

Access to local filesystem working folder in map task

2009-05-19 Thread Chris Carman
hi users, I have started writing my first project on Hadoop and am now seeking some guidance from more experienced members. The project is about running some CPU intensive computations in parallel and should be a straightforward application for MapReduce, as the input dataset can easily be par

Re: Number of maps and reduces not obeying my configuration

2009-05-19 Thread Piotr Praczyk
Hi Your job configuration file specifies exactly the number of mappers and reducers that are running in your system. The job configuration overrides the site configuration (if parameters are not specified as final) as far as I know. Piotr 2009/5/19 Foss User > I ran a job. In the jobtracker we

Number of maps and reduces not obeying my configuration

2009-05-19 Thread Foss User
I ran a job. In the jobtracker web interface, I found 4 maps and 1 reduce running. This is not what I set in my configuration files (hadoop-site.xml). My configuration file is set as follows: mapred.map.tasks = 2 mapred.reduce.tasks = 2 However, the description of these properties mentions that t

Re: Finding where the file blocks are

2009-05-19 Thread Foss User
On Tue, May 19, 2009 at 12:53 PM, Ravi Phulari wrote: > If you have hadoop superuser/administrative  permissions you can use fsck > with correct options to view block report and locations for every block. > > For further information please refer - > http://hadoop.apache.org/core/docs/r0.20.0/comma

Re: Shutdown in progress exception

2009-05-19 Thread Stas Oskin
Hi. Does anyone have any idea about this issue? Thanks! 2009/5/17 Stas Oskin > Hi. > > I have an issue where my application, when shutting down (at ShutdownHook > level), is unable to copy files to HDFS. > > Each copy throws the following exception: > > java.lang.IllegalStateException: Shutdown in
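
One plausible explanation (an assumption, not a confirmed diagnosis): Runtime.addShutdownHook throws IllegalStateException("Shutdown in progress") when a hook is registered while the JVM is already shutting down, and Hadoop's DFS client registers its own shutdown hook the first time a FileSystem is obtained. A hedged sketch of a possible workaround is to obtain the FileSystem before shutdown begins (paths are hypothetical; note that hook ordering is nondeterministic, so the FileSystem may still be closed before this hook runs):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Obtain the FileSystem early, so Hadoop registers its internal
    // shutdown hook while hooks can still be added.
    final FileSystem fs = FileSystem.get(new Configuration());

    Runtime.getRuntime().addShutdownHook(new Thread() {
      public void run() {
        try {
          fs.copyFromLocalFile(new Path("/tmp/app.log"),
                               new Path("/backup/app.log"));
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    });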

Re: Finding where the file blocks are

2009-05-19 Thread Arun C Murthy
On May 19, 2009, at 12:13 AM, Foss User wrote: I know that if a file is very large, it will be split into blocks and the blocks would be spread out in various data nodes. I want to know whether I can find out through GUI or logs exactly where which data nodes contain which file blocks of a part

Re: Finding where the file blocks are

2009-05-19 Thread Ravi Phulari
If you have hadoop superuser/administrative permissions you can use fsck with correct options to view block report and locations for every block. For further information please refer - http://hadoop.apache.org/core/docs/r0.20.0/commands_manual.html#fsck On 5/19/09 12:13 AM, "Foss User" wrote
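
For a programmatic route, the FileSystem API can also report block locations (a sketch; the path is hypothetical, and getFileBlockLocations is available in recent releases):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus stat = fs.getFileStatus(new Path("/data/huge.txt"));
        BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
        for (BlockLocation b : blocks) {
          // Each entry lists the datanodes holding one block of the file.
          System.out.println(b.getOffset() + "," + b.getLength() + " -> "
              + java.util.Arrays.toString(b.getHosts()));
        }
      }
    }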

Finding where the file blocks are

2009-05-19 Thread Foss User
I know that if a file is very large, it will be split into blocks and the blocks would be spread out across various data nodes. I want to know whether I can find out, through the GUI or the logs, exactly which data nodes contain which blocks of a particular huge text file?