Re: Native (GZIP) decompress not faster than builtin

2009-05-10 Thread Stefan Podkowinski
Jens, as your test shows, using a native codec won't make much sense for small files, since the JNI overhead involved will likely outweigh any possible gains. With all the performance improvements in Java 5 and 6, it's reasonable to ask whether the native implementation really does improve

Re: large files vs many files

2009-05-10 Thread Stefan Podkowinski
You just can't have many distributed jobs write into the same file without locking/synchronizing these writes, even with append(). It's no different from using a regular file from multiple processes in this respect. Maybe you need to collect your data up front before processing it in Hadoop?

Re: Hadoop / MySQL

2009-04-29 Thread Stefan Podkowinski
If you have trouble loading your data into MySQL using INSERTs or LOAD DATA, consider that MySQL supports CSV directly through the CSV storage engine. The only thing you have to do is copy your Hadoop-produced CSV file into the MySQL data directory and issue a FLUSH TABLES command to have MySQL

Re: hadoop job controller

2009-04-02 Thread Stefan Podkowinski
You can get the job progress and completion status through an instance of org.apache.hadoop.mapred.JobClient. If you really want to use Perl, I guess you still need to write a small Java application that talks to Perl on one side and JobClient on the other. There's also some support for Thrift in the

Re: ANN: Hadoop UI beta

2009-03-31 Thread Stefan Podkowinski
On Tue, Mar 31, 2009 at 1:23 PM, Mikhail Yakshin greycat.na@gmail.com wrote: Couldn't you please explain, what does it do or at least what do you want it to do? Why is it better than default Hadoop web UI? Mikhail. We needed a full-featured HDFS file manager for end users that could be

Re: ANN: Hadoop UI beta

2009-03-31 Thread Stefan Podkowinski
Hi Brian On Tue, Mar 31, 2009 at 3:46 PM, Brian Bockelman bbock...@cse.unl.edu wrote: Hey Stefan, I like it.  I would like to hear a bit how the security policies work.  If I open this up to the world, how does the world authenticate/authorize with my cluster? Not at all. The daemon part of

Re: Join Variation

2009-03-24 Thread Stefan Podkowinski
Have you considered HBase for this particular task? Looks like a simple lookup using the network mask as key would solve your problem. It's also possible to derive the network class (A, B, C) from the IP address concerned. But I guess your search file will cover ranges in more
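As a rough illustration of the classful lookup idea mentioned in this preview, here is a minimal Java sketch that derives the network class from the first octet of an IPv4 address and builds a lookup key from the network portion. The class name NetworkClassLookup, the method names, and the "network/prefix" key format are assumptions made for this example; nothing here is taken from the original message or from HBase itself, and real-world ranges are often not classful.

```java
public class NetworkClassLookup {

    /** Returns 'A', 'B', or 'C' for a classful IPv4 address, or '?' for class D/E. */
    public static char networkClass(String ip) {
        int firstOctet = Integer.parseInt(ip.substring(0, ip.indexOf('.')));
        if (firstOctet < 128) return 'A'; // leading bit 0, default mask /8
        if (firstOctet < 192) return 'B'; // leading bits 10, default mask /16
        if (firstOctet < 224) return 'C'; // leading bits 110, default mask /24
        return '?';                       // multicast or reserved
    }

    /** Builds a hypothetical lookup key from the network portion of the address. */
    public static String networkKey(String ip) {
        String[] o = ip.split("\\.");
        switch (networkClass(ip)) {
            case 'A': return o[0] + ".0.0.0/8";
            case 'B': return o[0] + "." + o[1] + ".0.0/16";
            case 'C': return o[0] + "." + o[1] + "." + o[2] + ".0/24";
            default:  throw new IllegalArgumentException("Not a class A, B, or C address: " + ip);
        }
    }

    public static void main(String[] args) {
        System.out.println(networkKey("10.20.30.40"));  // class A address
        System.out.println(networkKey("172.16.5.9"));   // class B address
        System.out.println(networkKey("192.168.1.77")); // class C address
    }
}
```

A key built this way could serve as the row key for a single get, but only if the search file really is organized by classful networks; arbitrary ranges would need a different key design.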

csv input format handling and mapping

2009-03-13 Thread Stefan Podkowinski
Hi, can anyone share their experience or a solution for the following problem? I have to deal with a lot of different file formats, most of them CSV. They all share similar semantics, i.e. fields in file A exist in file B as well. What I'm not sure of is the exact index of the field in the

Re: Backing up HDFS?

2009-02-12 Thread Stefan Podkowinski
On Tue, Feb 10, 2009 at 2:22 AM, Allen Wittenauer a...@yahoo-inc.com wrote: The key here is to prioritize your data.  Impossible to replicate data gets backed up using whatever means necessary, hard-to-regenerate data, next priority. Easy to regenerate and ok to nuke data, doesn't get backed

Re: How to use DBInputFormat?

2009-02-06 Thread Stefan Podkowinski
awareness is one way, which would let each database/tasktracker-node execute mappers on data where each split is a single database server for example. If you have any ideas on how the current design can be improved, please do share. Fredrik On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski

Re: How to use DBInputFormat?

2009-02-05 Thread Stefan Podkowinski
The 0.19 DBInputFormat class implementation is IMHO only suitable for very simple queries working on only a few datasets. That's due to the fact that it tries to create splits from the query by 1) getting a count of all rows using the specified count query (huge performance impact on large tables) 2)
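The preview is cut off after step 1, so the following is only a sketch of how a row count could be turned into splits, not a reproduction of the message's second step or of the 0.19 source. It assumes even chunking with the remainder assigned to the last split and a LIMIT/OFFSET style query; the class name CountBasedSplits, the method computeSplits, and the table my_table are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class CountBasedSplits {

    /** Returns {offset, length} pairs that together cover rowCount rows. */
    public static List<long[]> computeSplits(long rowCount, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        long chunk = rowCount / numSplits;
        for (int i = 0; i < numSplits; i++) {
            long offset = i * chunk;
            // The last split also takes the rows left over by integer division.
            long length = (i == numSplits - 1) ? rowCount - offset : chunk;
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        // rowCount would come from the count query, which is the expensive part on large tables.
        for (long[] s : computeSplits(10, 3)) {
            // my_table and the ORDER BY column are placeholders for illustration only.
            System.out.println("SELECT * FROM my_table ORDER BY id LIMIT " + s[1] + " OFFSET " + s[0]);
        }
    }
}
```

Under these assumptions every split depends on the up-front count, and each mapper's query has to skip over the rows before its offset, which is one way to see why the approach scales poorly on large tables.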

Re: How to use DBInputFormat?

2009-02-05 Thread Stefan Podkowinski
let each database/tasktracker-node execute mappers on data where each split is a single database server for example. If you have any ideas on how the current design can be improved, please do share. Fredrik On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote: The 0.19 DBInputFormat