Hi,
Arijit Mukherjee wrote:
Hi
We've been thinking of using Hadoop for a decision-making system which
will analyze telecom-related data from various sources to take certain
decisions. The data can be huge, on the order of terabytes, and can be
stored as CSV files, which I understand will fit
Yes, you can use MultiFileInputFormat.
You can extend the MultiFileInputFormat to return a RecordReader, which
reads a record for each file in the MultiFileSplit.
Enis
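For readers new to the idea, the per-file-record behaviour can be sketched outside Hadoop in plain Python (the function below is illustrative only, not the Hadoop API):

```python
import os
import tempfile

def read_whole_file_records(paths):
    """Yield one (filename, contents) record per file, mimicking a
    RecordReader that treats each file in a MultiFileSplit as a
    single record."""
    for path in paths:
        with open(path) as f:
            yield os.path.basename(path), f.read()

# Usage: two small CSV files, each becoming exactly one record.
tmpdir = tempfile.mkdtemp()
paths = []
for name, body in [("a.csv", "1,2\n"), ("b.csv", "3,4\n")]:
    p = os.path.join(tmpdir, name)
    with open(p, "w") as f:
        f.write(body)
    paths.append(p)

records = dict(read_whole_file_records(paths))
```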
chandra wrote:
hi..
By setting isSplitable to false, we can have one file with n records go to one mapper.
Is there any way to set 1
Thanx Enis.
By workflow, I was trying to mean something like a chain of MapReduce
jobs - the first one will extract a certain amount of data from the
original set and do some computation resulting in a smaller summary,
which will then be the input to a further MR job, and so on...somewhat
similar
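That kind of chain can be sketched with two in-memory map/reduce passes in plain Python (the data and stage logic below are invented; in Hadoop each stage would be a separate job reading the previous job's output directory):

```python
from itertools import groupby
from operator import itemgetter

def run_mr(records, mapper, reducer):
    """Minimal in-memory map/reduce: map, shuffle (sort + group), reduce."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=itemgetter(0))
    out = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        out.extend(reducer(key, [v for _, v in group]))
    return out

# Stage 1: extract (cell, duration) pairs from raw CSV lines and
# sum the durations per cell, producing a smaller summary.
raw = ["cellA,120", "cellB,30", "cellA,60"]
stage1 = run_mr(
    raw,
    mapper=lambda line: [(line.split(",")[0], int(line.split(",")[1]))],
    reducer=lambda cell, durs: [(cell, sum(durs))],
)

# Stage 2: consume the summary, e.g. bucket cells by load.
stage2 = run_mr(
    stage1,
    mapper=lambda kv: [("busy" if kv[1] > 100 else "quiet", kv[0])],
    reducer=lambda label, cells: [(label, sorted(cells))],
)
```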
thanks
is there any built in record reader which performs this function..
Thanx again Enis. I'll have a look at Pig and Hive.
Regards
Arijit
Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com
Nope, not right now. But this has come up before. Perhaps you will
contribute one?
Arijit,
For workflow, check out http://cascading.org -- that works quite well
and fits what you described.
Greenplum and Aster Data have announced support for running MR within
the context of their relational databases, e.g.,
http://www.greenplum.com/resources/mapreduce/
In terms of PIG, Hive,
That's a very good overview Paco - thanx for that. I might get back to
you with more queries about Cascading etc. at some point - I hope you
won't mind.
Regards
Arijit
Hi everybody,
I have a simple test case which creates a file, writes two lines into
the FSDataOutputStream, and then flushes, syncs and closes the stream. I
am using Hadoop 0.18.0 with Cygwin.
What I observe (in contrast to java.io.DataOutputStream) is that
the lines get written to
Certainly. It'd be great to talk with others working in analytics and
statistical computing, who have been evaluating MapReduce as well.
Paco
On Wed, Sep 24, 2008 at 7:45 AM, Arijit Mukherjee
[EMAIL PROTECTED] wrote:
Hi list,
I've generated a sequence file with a reducer, and I will use it to start the
second map step, which needs the record number of that sequence file. How can
I get it quickly?
thanks a lot.
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
Hi Jeremy,
I think the key of the map function is the number of the record in a
sequence file.
I was trying to retrieve the record number in a normal TextInputFormat
but could not find it. The best I could use is the byte offset of
the record, which does not reflect the record number.
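To illustrate the difference, here is a plain-Python reader (not a Hadoop RecordReader) that tracks both the byte offset TextInputFormat would key records by and the record number you actually want:

```python
def records_with_offsets(lines):
    """Yield (record_number, byte_offset, line) for each record.
    TextInputFormat keys records by byte offset; the record number
    has to be counted separately."""
    offset = 0
    for number, line in enumerate(lines):
        yield number, offset, line
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline

data = ["first", "second", "third"]
rows = list(records_with_offsets(data))
```

Note how the offsets (0, 6, 13) depend on the record lengths, while the record numbers (0, 1, 2) are just a running count.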
Hmm.. neither of these filesystems seems to implement flush(). Can you
file a jira on it?
HDFS implements sync(), and data should be on the disk after that, but it
might not be available to a reader yet. How did you test whether the data
is on the disk?
Raghu.
Christoph Graf wrote:
Hi
On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:
We are developing a project and we intend to use Hadoop to
handle the processing of vast amounts of data. But to convince our
customers about using Hadoop in our project, we must show
them the advantages ( and maybe ? the
I got these errors and I don't know what they mean; any help is appreciated.
I suspect that either it's a hardware error or the cluster is out of space to
store intermediate results?
There is still lots of free space left on the cluster.
08/09/24 00:23:31 INFO mapred.JobClient: map 79% reduce 24%
Hi,
Are there any performance numbers related to how HDFS can ingest data?
I am assuming a case where multiple processes outside Hadoop write into
Hadoop in parallel. I understand that it is probably related to
various hardware
constraints, but any existing numbers would be interesting. In
I encountered an interesting situation today. I'm running Hadoop
0.17.1. What happened was that 3 jobs started simultaneously, which
is expected in my workflow, but then resources got very mixed up.
One of the jobs grabbed all the available reducers (5) and got one
map task in before the
If I understand correctly, each mapper is sent a set number of records. Is
there a counter or variable that tells you how many records are sent to a
particular mapper?
Likewise, is there a similar thing for reducers?
Thanks in advance.
-SM
Yes, take a look at
src/mapred/org/apache/hadoop/mapred/Task_Counter.properties
Those are all the counters available for a task.
-Lohit
- Original Message
From: Sandy [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Wednesday, September 24, 2008 5:09:39 PM
Subject: counter for
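As a complement, a Hadoop Streaming job can also bump its own counters by writing reporter lines to stderr. Here is a hedged Python sketch (the counter group and name below are made up) using in-memory streams instead of a real cluster:

```python
import io

def mapper(stdin, stdout, stderr):
    """Identity streaming mapper that counts its input records and
    reports the total through Hadoop Streaming's reporter protocol:
    'reporter:counter:<group>,<counter>,<amount>' on stderr."""
    count = 0
    for line in stdin:
        count += 1
        stdout.write(line)
    stderr.write("reporter:counter:MyJob,MAP_INPUT_RECORDS,%d\n" % count)
    return count

# Usage with in-memory streams; on a cluster you would pass
# sys.stdin, sys.stdout and sys.stderr instead.
out, err = io.StringIO(), io.StringIO()
n = mapper(io.StringIO("rec1\nrec2\n"), out, err)
```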
On Wed, Sep 24, 2008 at 9:24 PM, Elia Mazzawi wrote:
Dear Mr/Mrs Owen O'Malley,
First I would like to thank you very much for your reply; it was exactly
the answer I expected. As I read about the query languages for
Hadoop, there is a combination of Pig/Pig Latin, Hive, HBase, Jaql and
more... and I could see that Hadoop has an advantage: SQL-like
On 9/24/08, Bryan Duxbury [EMAIL PROTECTED] wrote:
Does Hadoop not schedule jobs first-come-first-served?
Yes, Hadoop 0.17 schedules jobs FIFO. If it doesn't, that is a bug.
-- Owen
Hi list,
The default way Hadoop does its sorting is by keys; can it sort by
values rather than keys?
Regards,
Jeremy
--
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.
http://coderplay.javaeye.com
What kind of real-time app?
On Wed, Sep 24, 2008 at 4:50 AM, Stas Oskin [EMAIL PROTECTED] wrote:
Hi.
Is it possible to use Hadoop for real-time app, in video processing field?
Regards.
--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
Hi,
Is there a way to do this with streaming?
I've noticed there is a -partitioner option for streaming; does that mean
I have to write a Java partitioner class to perform total order sorting?
Thanks,
Joseph
On Sun, Sep 21, 2008 at 2:12 AM, lohit [EMAIL PROTECTED] wrote:
Since this is
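For reference, the default Java HashPartitioner simply hashes the key modulo the number of reduce tasks. The same idea in plain Python (illustrative only; Python's hash() stands in for Java's hashCode(), so actual bucket assignments differ from Hadoop's):

```python
def hash_partition(key, num_reducers):
    """Mimics Hadoop's default HashPartitioner:
    (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks."""
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# Every key lands in a stable bucket in [0, num_reducers).
buckets = {k: hash_partition(k, 4) for k in ["a", "b", "c"]}
```

Note that hashing gives balanced but unordered buckets; total order output needs a range-based partitioner instead, so that each reducer receives a contiguous key range.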
Sorting according to keys is a requirement of the map/reduce algorithm. I'd
suggest running a second map/reduce phase on the output files of your
application and using the values as keys in that second phase. I know that
will increase the running time, but this is how I do it when I need to get
my
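The value-as-key swap can be sketched in plain Python (toy data; in a real job the swap happens in the second job's mapper and the framework's shuffle performs the sort):

```python
def sort_by_value(pairs):
    """Second-phase trick: emit (value, key) so the sort runs on the
    old values, then swap back when reading the results."""
    swapped = [(v, k) for k, v in pairs]
    swapped.sort()  # stands in for the framework's shuffle sort
    return [(k, v) for v, k in swapped]

counts = [("apple", 3), ("pear", 1), ("plum", 2)]
by_value = sort_by_value(counts)
```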
As far as I know, there is a Hadoop plug-in for Eclipse but it is not
possible to debug when running on a real cluster. If you want to add watches
and expressions to trace your programs or profile your code, I'd suggest
looking at the log files or use other tracing tools such as xtrace (