Re: What about a universal input data handling mechanism for Mahout?

Lance Norskog Fri, 05 Aug 2011 23:35:22 -0700

There are two cases here: a simple Java main() program and a map/reduce
job. A simple CSV -> SequenceFile program would be great. I would
suggest writing the simple program first, and then working out what a
map/reduce job should do.
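
A minimal sketch of the per-record step such a simple program might
perform, in plain Java with no Hadoop or Mahout dependency. The class
Csv2SeqSketch and its methods slot, encode, and keyFor are hypothetical
names; String.hashCode stands in for Mahout's feature encoders, and the
naive split on commas assumes no quoted fields. A real converter would
wrap the array in a VectorWritable and append it, with the key, to a
SequenceFile writer:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: encodes one CSV record as a hashed feature vector
// and picks the sequence file key. No Hadoop or Mahout classes are used.
class Csv2SeqSketch {

    // Map a feature name to a slot in [0, features) (the hashing trick).
    static int slot(String name, int features) {
        return Math.floorMod(name.hashCode(), features);
    }

    // Encode the predictor columns of one record.
    // types[i] is "numeric", "word", or "text" for predictors[i].
    static double[] encode(String[] header, String[] record,
                           String[] predictors, String[] types, int features) {
        List<String> columns = Arrays.asList(header);
        double[] vector = new double[features];
        for (int i = 0; i < predictors.length; i++) {
            String value = record[columns.indexOf(predictors[i])];
            switch (types[i]) {
                case "numeric":
                    // One slot per column, weighted by the numeric value.
                    vector[slot(predictors[i], features)] += Double.parseDouble(value);
                    break;
                case "word":
                    // One slot per (column, value) pair.
                    vector[slot(predictors[i] + "=" + value, features)] += 1.0;
                    break;
                case "text":
                    // One slot per (column, token) pair, split on whitespace.
                    for (String token : value.trim().split("\\s+")) {
                        if (!token.isEmpty()) {
                            vector[slot(predictors[i] + "=" + token, features)] += 1.0;
                        }
                    }
                    break;
                default:
                    throw new IllegalArgumentException("unknown type: " + types[i]);
            }
        }
        return vector;
    }

    // Key rule from the proposal: use the key column if one is given,
    // otherwise fall back to the target label.
    static String keyFor(String[] header, String[] record, String key, String target) {
        String column = (key != null) ? key : target;
        return record[Arrays.asList(header).indexOf(column)];
    }

    public static void main(String[] args) {
        String[] header = "id,age,color,label".split(",");
        String[] record = "r1,3.5,red,yes".split(",");
        double[] vector = encode(header, record,
                new String[] {"age", "color"}, new String[] {"numeric", "word"}, 16);
        System.out.println(keyFor(header, record, "id", "label")
                + " -> " + Arrays.toString(vector));
    }
}
```

Keeping the encoding separate from the file writing means the same
methods could later be called from a mapper without change.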


"The best is the enemy of the good:"

Lance

On Fri, Aug 5, 2011 at 10:02 PM, Xiaobo Gu <[email protected]> wrote:
> It seems SequenceFiles can only be written single-threaded, so map-reduce
> style programming can't be used. Am I right?
>
> On Fri, Aug 5, 2011 at 10:51 PM, Xiaobo Gu <[email protected]> wrote:
>> I will try to write a program named Csv2Seq. It will read all the CSV
>> files under input recursively, encode all the records as vectors,
>> and write the encoded vectors into a sequence file of type
>> SequenceFile<Text, VectorWritable>, which can be consumed by
>> algorithms such as Naive Bayes. I have planned the following input
>> parameters for the program:
>>
>> input: the root HDFS directory containing csv files to convert
>> output: the HDFS path of the target sequence file
>> header: the HDFS path of a file containing the header of csv files
>> predictors: columns to encode as vector
>> types: data types of predictors, numeric, word, or text
>> target: the name of the target variable
>> categories: the number of target categories to be considered
>> features: the number of internal hashed features to use
>> key: the column to write as the Key of the target sequence file
>>
>> If key is specified, then the content of the key column of each record
>> will be written as the sequence file key; otherwise the target label
>> will be written as the sequence file key.
>>
>> Then comes the implementation strategy. Since I am not so familiar
>> with the Hadoop map-reduce programming style, the following are just
>> two ideas:
>> #1: Csv2Seq will extend SequenceFilesFromDirectoryFilter, as
>> SequenceFilesFromCsvFilter does.
>>
>> Inside the process method, for each record of each file, we can
>> process the line using CsvRecordFactory, and write the encoded vector
>> and key to the target sequence file using SequenceFile<Text,
>> VectorWritable>.Writer (not ChunkedWriter); that is, we will revise
>> SequenceFilesFromDirectoryFilter to accept a SequenceFile<Text,
>> VectorWritable>.Writer.
>>
>> I have a few questions about this one:
>> 1. If we have multiple CSV files inside input, will these files be
>> processed sequentially?
>> 2. For each CSV file, if it spans multiple HDFS blocks, will all
>> the blocks be processed in parallel on all data nodes?
>> 3. When should we create the target sequence file and its writer?
>> 4. When should we create the CsvRecordFactory object?
>>
>> I'll post the second idea in a later mail.
>>
>>
>> On Fri, Aug 5, 2011 at 5:22 AM, Lance Norskog <[email protected]> wrote:
>>> Universal output options would also be useful. For algorithms that
>>> emit a small amount of data, an option for CSV output would be really
>>> handy. A perfect example: classifiers create a ConfusionMatrix object.
>>> This is a useful item by itself and is very small.  Dumping this in a
>>> CSV makes it a lot easier to load into other tools (like R).
>>>
>>> On Thu, Aug 4, 2011 at 8:36 AM, Xiaobo Gu <[email protected]> wrote:
>>>> I think my class can't extend SequenceFilesFromDirectoryFilter,
>>>> because SequenceFilesFromDirectoryFilter requires a ChunkedWriter
>>>> which writes to a SequenceFile<Text, Text>, which can't be read by
>>>> the Naive Bayes trainer. Am I right?
>>>>
>>>> Regards,
>>>>
>>>> Xiaobo Gu
>>>>
>>>> On Mon, Aug 1, 2011 at 11:15 PM, Ted Dunning <[email protected]> wrote:
>>>>> Take a look at com.tdunning.oscon.DonutEncoder in the oscon-2011 branch.
>>>>>
>>>>> On Mon, Aug 1, 2011 at 6:57 AM, Xiaobo Gu <[email protected]> wrote:
>>>>>
>>>>>> Hi Ted,
>>>>>>
>>>>>> Can you tell me which file to refer to, please?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Xiaobo Gu
>>>>>>
>>>>>> On Fri, Jul 29, 2011 at 3:14 AM, Ted Dunning <[email protected]> wrote:
>>>>>> > If you are looking to revise code, I would suggest that you start from the
>>>>>> > example code found in
>>>>>> >
>>>>>> > [email protected]:tdunning/Chapter-16.git
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Thu, Jul 28, 2011 at 8:22 AM, Xiaobo Gu <[email protected]> wrote:
>>>>>> >
>>>>>> >> I don't know how CSVVectorIterator is used, but regarding
>>>>>> >> SequenceFilesFromCsvFilter I have a few questions:
>>>>>> >> 1. Should the CSV files be without headers?
>>>>>> >> 2. I think the protected void process(FileStatus fst, Path current)
>>>>>> >> throws IOException method of SequenceFilesFromCsvFilter is the
>>>>>> >> point we can revise to make a CSV-to-sequence-file converter. The
>>>>>> >> idea is as follows:
>>>>>> >> a. We must create a SequenceFile<Text, VectorWritable> file object
>>>>>> >> and pass its writer to SequenceFilesFromCsvFilter as a constructor
>>>>>> >> parameter; the corresponding sequence file is our destination.
>>>>>> >> b. For each line, extract the label value and an encoded vector, then
>>>>>> >> call writer.append(label, new VectorWritable(vector)). Which column is
>>>>>> >> the label and which columns contribute to the vector can be passed
>>>>>> >> through command-line arguments.
>>>>>> >>
>>>>>> >>
>>>>>> >> Regards,
>>>>>> >>
>>>>>> >> Xiaobo Gu
>>>>>> >>
>>>>>> >> On Tue, Jul 26, 2011 at 5:50 PM, Grant Ingersoll <[email protected]> wrote:
>>>>>> >> > We do have:
>>>>>> >> > SequenceFilesFromCsvFilter, although it is somewhat basic
>>>>>> >> > CSVVectorIterator, which takes a CSV file and produces a dense vector
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > On Jul 26, 2011, at 3:58 AM, Ted Dunning wrote:
>>>>>> >> >
>>>>>> >> >> The critical design step here is to decide how to express the schema of the
>>>>>> >> >> CSV file.  There is a beginning of this in the CsvRecordFactory, but I was
>>>>>> >> >> never happy with the (lack of) speed.
>>>>>> >> >>
>>>>>> >> >> On Tue, Jul 26, 2011 at 12:10 AM, Sebastian Schelter <[email protected]> wrote:
>>>>>> >> >>
>>>>>> >> >>>> 2. SequenceFile is not a file format that command line users can
>>>>>> >> >>>> prepare. Is there a tool for converting CSV files into SequenceFiles?
>>>>>> >> >>>>
>>>>>> >> >>>
>>>>>> >> >>> I don't think we have that yet, but it would be very useful imho.
>>>>>> >> >>>
>>>>>> >> >
>>>>>> >> > --------------------------
>>>>>> >> > Grant Ingersoll
>>>>>> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >>
>>>>>> >
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> [email protected]
>>>
>>
>



-- 
Lance Norskog
[email protected]
