Re: seqdirectory command in MapReduce

Claudio Reggiani Sat, 16 Feb 2013 09:09:47 -0800

Let's say the directory contains only one big text file. Logically it's one
file, but on HDFS its data is distributed across the cluster. Now suppose the
file is too big to fit in memory (on any node of the cluster): does
"seqdirectory" still work?


If so, the only way would be to run seqdirectory as a MapReduce job.

The output will be (logically) one key-value record, where (as you said)
the key is the file name and the value is the file content in vector format.
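
To make the concern concrete, here is a minimal plain-Python sketch of the
logical conversion, not Mahout's actual code. The function name
directory_to_records is hypothetical, it reads from the local filesystem
rather than HDFS, and it keeps each value as a plain string (an assumption of
this sketch). The point is that every file becomes exactly one record, so the
whole file content has to be held in memory at once:

```python
import os
import tempfile


def directory_to_records(root):
    """Yield (key, value) pairs: key is the path relative to root, value is the full file text.

    Hypothetical helper for illustration only; it is not part of Mahout.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            key = "/" + os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, encoding="utf-8") as handle:
                # The entire file is read into a single value,
                # so it must fit in the memory of one process.
                value = handle.read()
            yield key, value


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "big.txt"), "w", encoding="utf-8") as handle:
            handle.write("one big text")
        for key, value in directory_to_records(root):
            print(key, len(value))
```

With a single huge file, the loop yields one record whose value is the entire
text, which is exactly the case I am worried about.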

Sorry for my vagueness
Claudio


2013/2/16 Dan Filimon <[email protected]>

> Hi Claudio,
>
> Could you be more specific? What does 'MapReduce style' mean?
> seqdirectory should create sequence files from the documents in a
> folder, where the keys are the document names and the values are the
> documents' content.
>
> What do you need it to do?
>
> On Sat, Feb 16, 2013 at 5:55 PM, Claudio Reggiani <[email protected]> wrote:
> > Hello,
> >
> > I have a text dataset. Running "seqdirectory" command on it I see it's not
> > written in MapReduce style (looking at the source code of
> > SequenceFilesFromDirectory confirms that).
> >
> > What if I have a big dataset stored in HDFS and I would like to convert it
> > in SequenceFile format? Do I need to create my own custom job or
> > seqdirectory does that?
> >
> > Thanks
> > Claudio Reggiani
>
