Jens,
As your test shows, using a native codec won't make much sense for
small files, since the JNI overhead involved will likely outweigh any
possible gains. With all the performance improvements in Java 5 and 6,
it's reasonable to ask whether the native implementation really
improves anything here.
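If in doubt, just time the codec on a payload of your typical file
size. A rough, untested sketch; the codec, payload size and config
are only placeholders:

import java.io.ByteArrayOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecTimer {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // With hadoop.native.lib=true and the native libraries on
    // java.library.path, GzipCodec goes through the native zlib
    // bindings; otherwise it falls back to java.util.zip.
    CompressionCodec codec =
        ReflectionUtils.newInstance(GzipCodec.class, conf);
    byte[] data = new byte[64 * 1024]; // deliberately small payload
    long start = System.nanoTime();
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    CompressionOutputStream out = codec.createOutputStream(bytes);
    out.write(data);
    out.close();
    System.out.println("compressed " + data.length + " bytes in "
        + (System.nanoTime() - start) / 1000 + " us");
  }
}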
You just can't have many distributed jobs write into the same file
without locking/synchronizing these writes, even with append(). It's
no different from using a regular file from multiple processes in
this respect.
Maybe you need to collect your data up front before processing it in Hadoop?
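One workaround is to have every task write its own part file and
merge those afterwards, e.g. with FileUtil.copyMerge. Untested
sketch, the paths are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeOutputs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Concatenate all files under the job output directory into a
    // single file instead of having the tasks share one file.
    FileUtil.copyMerge(fs, new Path("/user/jens/job-output"),
        fs, new Path("/user/jens/merged.csv"),
        false /* don't delete the sources */, conf, null);
  }
}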
If you have trouble loading your data into MySQL using INSERTs or LOAD
DATA, consider that MySQL supports CSV directly through the CSV
storage engine. The only thing you have to do is copy your
Hadoop-produced CSV file into the MySQL data directory and issue a
FLUSH TABLES command to have MySQL pick up the new contents.
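The SQL side would look roughly like this over JDBC; the table name
and columns are made up, and the file you copy over logs.CSV in the
data directory has to match the column layout:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvEngineExample {
  public static void main(String[] args) throws Exception {
    Class.forName("com.mysql.jdbc.Driver");
    Connection con = DriverManager.getConnection(
        "jdbc:mysql://localhost/test", "user", "password");
    Statement st = con.createStatement();
    // Note: CSV engine tables cannot have nullable columns.
    st.execute("CREATE TABLE logs (host VARCHAR(255) NOT NULL, "
        + "hits INT NOT NULL) ENGINE=CSV");
    // ... now copy the Hadoop-produced file over logs.CSV in the
    // MySQL data directory, then make MySQL re-read it:
    st.execute("FLUSH TABLES");
    con.close();
  }
}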
You can get the job progress and completion status through an instance
of org.apache.hadoop.mapred.JobClient. If you really want to use Perl,
I guess you still need to write a small Java application that talks to
Perl on one side and JobClient on the other.
There's also some support for Thrift in the contrib packages.
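E.g. something along these lines with the old mapred API; the job id
is made up:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class JobProgress {
  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob("job_200903310001_0001");
    while (!job.isComplete()) {
      System.out.printf("map %.0f%% reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);
    }
    System.out.println(job.isSuccessful() ? "done" : "failed");
  }
}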
On Tue, Mar 31, 2009 at 1:23 PM, Mikhail Yakshin
greycat.na@gmail.com wrote:
Could you please explain what it does, or at least what you want it
to do? Why is it better than the default Hadoop web UI?
Mikhail, we needed a full-featured HDFS file manager for end-users
that could be
Hi Brian
On Tue, Mar 31, 2009 at 3:46 PM, Brian Bockelman bbock...@cse.unl.edu wrote:
Hey Stefan,
I like it. I would like to hear a bit about how the security policies work. If I
open this up to the world, how does the world authenticate/authorize
with my cluster?
Not at all. The daemon part of
Have you considered HBase for this particular task?
Looks like a simple lookup using the network mask as key would solve
your problem. It's also possible to derive the network class (A, B, C)
from the IP address in question. But I guess your search file will
cover ranges in more detail than that.
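The lookup itself could be a sorted map keyed by the numeric start of
each range. Untested sketch; the sample entries are made up and would
really come from your search file:

import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class IpRangeLookup {
  // "10.0.0.0" -> 167772160 etc.
  static long ipToLong(String ip) {
    long n = 0;
    for (String part : ip.split("\\.")) {
      n = (n << 8) | Long.parseLong(part);
    }
    return n;
  }

  public static void main(String[] args) {
    // range start -> "rangeEnd|label"
    NavigableMap<Long, String> ranges = new TreeMap<Long, String>();
    ranges.put(ipToLong("10.0.0.0"),
        ipToLong("10.255.255.255") + "|net-A");
    ranges.put(ipToLong("192.168.1.0"),
        ipToLong("192.168.1.255") + "|net-B");

    long ip = ipToLong("192.168.1.42");
    // Greatest range start <= ip, then verify ip is within the range.
    Map.Entry<Long, String> e = ranges.floorEntry(ip);
    if (e != null) {
      String[] v = e.getValue().split("\\|");
      if (ip <= Long.parseLong(v[0])) {
        System.out.println("match: " + v[1]);
      }
    }
  }
}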
Hi,
Can anyone share their experience or a solution for the following problem?
I'm having to deal with a lot of different file formats, most of them CSV.
Each of them shares similar semantics, i.e. fields in file A exist in
file B as well.
What I'm not sure of is the exact index of the field in the
On Tue, Feb 10, 2009 at 2:22 AM, Allen Wittenauer a...@yahoo-inc.com wrote:
The key here is to prioritize your data. Impossible-to-replicate data
gets backed up using whatever means necessary; hard-to-regenerate data
is the next priority. Easy-to-regenerate and OK-to-nuke data doesn't
get backed up.
Data awareness is one way, which would let each
database/tasktracker node execute mappers on data where each split is
a single database server, for example.
If you have any ideas on how the current design can be improved,
please do share.
Fredrik
On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote:
The 0.19 DBInputFormat class implementation is IMHO only suitable for
very simple queries working on only a few datasets. That's due to the
fact that it tries to create splits from the query by
1) getting a count of all rows using the specified count query (huge
performance impact on large tables)
2) executing the actual query with a LIMIT/OFFSET clause for each split
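For reference, this is roughly how a job gets wired up to the 0.19
DBInputFormat; the JDBC URL, table and OrderRecord class are all made
up:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class DbJobSetup {
  // Made-up record class; one instance per table row.
  public static class OrderRecord implements Writable, DBWritable {
    long orderId;
    double amount;

    public void readFields(ResultSet rs) throws SQLException {
      orderId = rs.getLong(1);
      amount = rs.getDouble(2);
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, orderId);
      ps.setDouble(2, amount);
    }
    public void readFields(DataInput in) throws IOException {
      orderId = in.readLong();
      amount = in.readDouble();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(orderId);
      out.writeDouble(amount);
    }
  }

  public static void configure(JobConf job) {
    job.setInputFormat(DBInputFormat.class);
    DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
        "jdbc:mysql://localhost/mydb", "user", "password");
    // setInput drives the split creation described above.
    DBInputFormat.setInput(job, OrderRecord.class, "orders",
        null /* conditions */, "order_id" /* orderBy */,
        "order_id", "amount");
  }
}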