Re: What about a universal input data handling mechanism for Mahout?

Xiaobo Gu Fri, 05 Aug 2011 22:03:31 -0700

It seems SequenceFiles can only be written by a single thread, so map-reduce
style programming can't be used. Am I right?
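
For what it's worth, as far as I understand a map-reduce job does not share
one writer between tasks: each task writes its own part file (part-00000,
part-00001, ...) into a common output directory, and that directory is then
read as a whole by the next step. The sketch below shows only this pattern in
plain Java, with threads standing in for tasks and text files standing in for
sequence files. PartFileSketch, writePart and writeAll are made-up names for
illustration, not Hadoop or Mahout API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartFileSketch {

    // Writes one split to its own part file; no writer is shared between tasks.
    static Path writePart(Path outputDir, int taskId, List<String> records) {
        Path part = outputDir.resolve(String.format("part-%05d", taskId));
        try {
            Files.write(part, records, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return part;
    }

    // Runs one task per split in parallel and returns the part files in task order.
    static List<Path> writeAll(Path outputDir, List<List<String>> splits) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, splits.size()));
        try {
            List<Future<Path>> futures = new ArrayList<>();
            for (int i = 0; i < splits.size(); i++) {
                final int taskId = i;
                final List<String> split = splits.get(i);
                Callable<Path> task = () -> writePart(outputDir, taskId, split);
                futures.add(pool.submit(task));
            }
            List<Path> parts = new ArrayList<>();
            for (Future<Path> future : futures) {
                parts.add(future.get()); // waits for the task and rethrows its failure
            }
            return parts;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path out = Files.createTempDirectory("csv2seq-out");
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a,1", "b,2"),
                Arrays.asList("c,3"));
        for (Path part : writeAll(out, splits)) {
            System.out.println(part.getFileName() + " lines=" + Files.readAllLines(part).size());
        }
    }
}
```

If that is right, the converter would not need one big sequence file as
output; a directory of part files, one per task, should do.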


On Fri, Aug 5, 2011 at 10:51 PM, Xiaobo Gu <[email protected]> wrote:
> I will try to write a program named Csv2Seq. It will read all the csv
> files under input recursively, encode all the records as vectors,
> and write all the encoded vectors into a sequence file of type
> SequenceFile<Text, VectorWritable>, which can be consumed by
> algorithms such as Naive Bayes. I have planned the following input
> parameters for the program:
>
> input: the root HDFS directory containing csv files to convert
> output: the HDFS path of the target sequence file
> header: the HDFS path of a file containing the header of csv files
> predictors: columns to encode as vector
> types: data types of predictors, numeric, word, or text
> target: the name of the target variable
> categories: the number of target categories to be considered
> features: the number of internal hashed features to use
> key: the column to write as the Key of the target sequence file
>
> If key is specified, then the content of the key column of each record
> will be written as the sequence file key; otherwise the target label will
> be written as the sequence file key.
>
> Then comes the implementation strategy. Since I am not so familiar
> with the Hadoop map-reduce programming style, below are just two
> ideas:
> #1, Csv2Seq will extend SequenceFilesFromDirectoryFilter, like
> SequenceFilesFromCsvFilter does.
>
> Inside the process method, for each record of each file, we can
> process the line using CsvRecordFactory, and write the encoded vector
> and key to the target sequence file using a SequenceFile<Text,
> VectorWritable>.Writer (not ChunkedWriter); that is, we will revise
> SequenceFilesFromDirectoryFilter to accept a SequenceFile<Text,
> VectorWritable>.Writer.
>
> I have a few questions about this one:
> 1.      If we have multiple csv files inside input, will these files be
> processed sequentially?
> 2.      For each CSV file, if it spans multiple HDFS blocks, will all
> the blocks be processed in parallel on all data nodes?
> 3.      When should we create the target sequence file and its writer?
> 4.      When should we create the CsvRecordFactory object?
>
> I’ll post the second idea in a later mail.
>
>
> On Fri, Aug 5, 2011 at 5:22 AM, Lance Norskog <[email protected]> wrote:
>> Universal output options would also be useful. For algorithms that
>> emit a small amount of data, an option for CSV output would be really
>> handy. A perfect example: classifiers create a ConfusionMatrix object.
>> This is a useful item by itself and is very small.  Dumping this in a
>> CSV makes it a lot easier to load into other tools (like R).
>>
>> On Thu, Aug 4, 2011 at 8:36 AM, Xiaobo Gu <[email protected]> wrote:
>>> I think my class can't extend SequenceFilesFromDirectoryFilter,
>>> because SequenceFilesFromDirectoryFilter requires a ChunkedWriter,
>>> which writes to a SequenceFile<Text, Text> that can't be read by
>>> the naive Bayes trainer. Am I right?
>>>
>>> Regards,
>>>
>>> Xiaobo Gu
>>>
>>> On Mon, Aug 1, 2011 at 11:15 PM, Ted Dunning <[email protected]> wrote:
>>>> Take a look at com.tdunning.oscon.DonutEncoder in the oscon-2011 branch.
>>>>
>>>> On Mon, Aug 1, 2011 at 6:57 AM, Xiaobo Gu <[email protected]> wrote:
>>>>
>>>>> Hi Ted,
>>>>>
>>>>> Can you tell me which file to refer to, please?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Xiaobo Gu
>>>>>
>>>>> On Fri, Jul 29, 2011 at 3:14 AM, Ted Dunning <[email protected]> wrote:
>>>>> > If you are looking to revise code, I would suggest that you start from
>>>>> > the example code found in
>>>>> >
>>>>> > [email protected]:tdunning/Chapter-16.git
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Thu, Jul 28, 2011 at 8:22 AM, Xiaobo Gu <[email protected]> wrote:
>>>>> >
>>>>> >> I don't know how CSVVectorIterator is used, but regarding
>>>>> >> SequenceFilesFromCsvFilter I have a few questions:
>>>>> >> 1. Should the csv files be without headers?
>>>>> >> 2. I think the protected void process(FileStatus fst, Path current)
>>>>> >> throws IOException function of SequenceFilesFromCsvFilter is the
>>>>> >> point we can revise to make a csv to sequence file converter. The
>>>>> >> idea is as follows:
>>>>> >> a. We must create a SequenceFile<Text, VectorWritable> file object
>>>>> >> and pass its writer to SequenceFilesFromCsvFilter as a constructor
>>>>> >> parameter; the corresponding sequence file is our destination.
>>>>> >> b. For each line, extract the label value and an encoded vector, then
>>>>> >> call writer.append(label, new VectorWritable(vector)). Which column is
>>>>> >> the label and which columns contribute to the vector can be passed
>>>>> >> through command line arguments.
>>>>> >>
>>>>> >>
>>>>> >> Regards,
>>>>> >>
>>>>> >> Xiaobo Gu
>>>>> >>
>>>>> >> On Tue, Jul 26, 2011 at 5:50 PM, Grant Ingersoll <[email protected]> wrote:
>>>>> >> > We do have:
>>>>> >> > SequenceFilesFromCsvFilter, although it is somewhat basic
>>>>> >> > CSVVectorIterator, which takes a CSV file and produces a dense vector
>>>>> >> >
>>>>> >> >
>>>>> >> > On Jul 26, 2011, at 3:58 AM, Ted Dunning wrote:
>>>>> >> >
>>>>> >> >> The critical design step here is to decide how to express the schema
>>>>> >> >> of the CSV file.  There is a beginning of this in the CsvRecordFactory,
>>>>> >> >> but I was never happy with the (lack of) speed.
>>>>> >> >>
>>>>> >> >> On Tue, Jul 26, 2011 at 12:10 AM, Sebastian Schelter <[email protected]> wrote:
>>>>> >> >>
>>>>> >> >>>> 2. SequenceFile is not a file format that command line users can
>>>>> >> >>>> prepare. Is there a tool for converting CSV files into SequenceFiles?
>>>>> >> >>>>
>>>>> >> >>>
>>>>> >> >>> I don't think we have that yet, but it would be very useful imho.
>>>>> >> >>>
>>>>> >> >
>>>>> >> > --------------------------
>>>>> >> > Grant Ingersoll
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >>
>>>>> >
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>
>
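
The Csv2Seq parameters quoted above (header, predictors, types, target,
features, key) can be sketched as follows. This is only an illustration of
the per-record step in plain Java: it splits on commas naively, hashes each
predictor into a fixed-size double[] and picks the output key. The class and
method names (Csv2SeqSketch, Encoded, slot, encode) are invented for this
mail, and the hashing is a simple stand-in, not CsvRecordFactory or Mahout's
feature encoders. A real converter would wrap the array in a vector and
append it to a SequenceFile<Text, VectorWritable>.

```java
import java.util.Arrays;
import java.util.List;

public class Csv2SeqSketch {

    /** Key plus encoded vector: what one output record would hold. */
    static final class Encoded {
        final String key;
        final double[] vector;

        Encoded(String key, double[] vector) {
            this.key = key;
            this.vector = vector;
        }
    }

    // Maps a feature name to a slot in [0, features); collisions are possible.
    static int slot(String name, int features) {
        return Math.floorMod(name.hashCode(), features);
    }

    // types.get(i) describes predictors.get(i): "numeric", "word" or "text".
    // keyColumn may be null, in which case the target label becomes the key.
    static Encoded encode(String[] header, String line, List<String> predictors,
                          List<String> types, String target, String keyColumn,
                          int features) {
        String[] fields = line.split(",", -1); // naive split: no quoted fields
        double[] vector = new double[features];
        String keySource = (keyColumn != null) ? keyColumn : target;
        String key = null;
        for (int i = 0; i < header.length && i < fields.length; i++) {
            String column = header[i];
            String value = fields[i].trim();
            if (column.equals(keySource)) {
                key = value;
            }
            int p = predictors.indexOf(column);
            if (p < 0) {
                continue; // not a predictor
            }
            String type = types.get(p);
            if ("numeric".equals(type)) {
                // the numeric value is the weight (throws on a non-numeric field)
                vector[slot(column, features)] += Double.parseDouble(value);
            } else if ("word".equals(type)) {
                // the whole field is one categorical value
                vector[slot(column + "=" + value, features)] += 1.0;
            } else {
                // "text": one feature per whitespace-separated token
                for (String token : value.split("\\s+")) {
                    if (!token.isEmpty()) {
                        vector[slot(column + "=" + token, features)] += 1.0;
                    }
                }
            }
        }
        return new Encoded(key, vector);
    }

    public static void main(String[] args) {
        String[] header = {"id", "age", "color", "label"};
        Encoded e = encode(header, "7,30,red,yes",
                Arrays.asList("age", "color"), Arrays.asList("numeric", "word"),
                "label", null, 16);
        System.out.println(e.key + " " + Arrays.toString(e.vector));
    }
}
```

Because encode works on one record at a time and keeps no state, it could be
called from a single-threaded filter or from a mapper without changes.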
