There are two cases here: a simple Java main() program and a map/reduce job. A simple CSV -> SequenceFile program would be great. I would suggest writing the simple program first, and then working out what a map/reduce job should do.
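To make the "simple program first" suggestion concrete, here is a minimal sketch of the record-encoding core such a program might have. It is an illustration under stated assumptions, not Mahout code: the class name Csv2SeqSketch and its methods are hypothetical, records are split on plain commas (no quoted fields), hashing uses String.hashCode rather than Mahout's feature encoders, and the key/vector pairs are collected in memory at the point where a real converter would append them to a SequenceFile<Text, VectorWritable> writer.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain JDK only, no Hadoop or Mahout calls.
final class Csv2SeqSketch {

    // Map a string to one of `features` slots (the "hashing trick").
    static int slot(String s, int features) {
        return Math.floorMod(s.hashCode(), features);
    }

    // Find a column by name, failing loudly if the header lacks it.
    static int column(List<String> header, String name) {
        int idx = header.indexOf(name);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return idx;
    }

    // Encode one record into a fixed-length hashed feature vector.
    static double[] encode(List<String> header, String[] fields,
                           List<String> predictors, List<String> types, int features) {
        double[] vector = new double[features];
        for (int i = 0; i < predictors.size(); i++) {
            String name = predictors.get(i);
            String value = fields[column(header, name)].trim();
            String type = types.get(i);
            if ("numeric".equals(type)) {
                // Numeric predictors add their value to one slot per column.
                vector[slot(name, features)] += Double.parseDouble(value);
            } else if ("word".equals(type)) {
                // Word predictors add 1 to the slot for column=value.
                vector[slot(name + "=" + value, features)] += 1.0;
            } else if ("text".equals(type)) {
                // Text predictors add 1 per whitespace-separated token.
                for (String token : value.split("\\s+")) {
                    if (!token.isEmpty()) {
                        vector[slot(name + "=" + token, features)] += 1.0;
                    }
                }
            } else {
                throw new IllegalArgumentException("Unknown type: " + type);
            }
        }
        return vector;
    }

    // Convert records to (key, vector) pairs; a real program would
    // append each pair to a sequence file writer instead of a list.
    static List<Map.Entry<String, double[]>> convert(
            String headerLine, List<String> records, List<String> predictors,
            List<String> types, String keyColumn, int features) {
        List<String> header = new ArrayList<>();
        for (String name : headerLine.split(",", -1)) {
            header.add(name.trim());
        }
        int keyIdx = column(header, keyColumn);
        List<Map.Entry<String, double[]>> out = new ArrayList<>();
        for (String line : records) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split(",", -1);
            double[] vector = encode(header, fields, predictors, types, features);
            out.add(new AbstractMap.SimpleEntry<String, double[]>(fields[keyIdx].trim(), vector));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, double[]>> pairs = convert(
                "id,age,color,label",
                Arrays.asList("r1,30,red,yes", "r2,41,blue,no"),
                Arrays.asList("age", "color"),
                Arrays.asList("numeric", "word"),
                "id", 16);
        for (Map.Entry<String, double[]> pair : pairs) {
            System.out.println(pair.getKey() + " -> " + Arrays.toString(pair.getValue()));
        }
    }
}
```

The parameters of convert mirror the header, predictors, types, key, and features options proposed further down the thread. Because each predictor only ever adds to a slot, hash collisions merge features rather than discard them, which is the usual trade-off of hashed encoding.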
"The best is the enemy of the good:" Lance On Fri, Aug 5, 2011 at 10:02 PM, Xiaobo Gu <[email protected]> wrote: > It seems seqenceFiles can only be writen single threaded, map-reduce > style programming can't be used, am I right? > > On Fri, Aug 5, 2011 at 10:51 PM, Xiaobo Gu <[email protected]> wrote: >> I will try to write a program named Csv2Seq, it will read all the csv >> files under input recursively, and encode all the records as vectors, >> and write all the encoded vectors into a sequenceFile of type >> SequenceFile<Text, VectorWritable>, which can be consumed by >> algorithms such as Naïve bayes. I have planed the following input >> parameters for the program: >> >> input: the root HDFS directory containing csv files to convert >> output: the HDFS path of the target sequence file >> header: the HDFS path of a file containing the header of csv files >> predictors: columns to encode as vector >> types: data types of predictors, numeric, word, or text >> target: the name of the target variable >> categories: the number of target categories to be considered >> features: the number of internal hashed features to use >> key: the column to write as the Key of the target sequence file >> >> If key is specified, then the content of the key column of each record >> will be write as the sequence file Key, or the target label will be >> write as the sequence file key. 
>>
>> Then comes the implementation strategy. Since I am not so familiar
>> with the Hadoop map-reduce programming style, the following are just
>> two ideas:
>> #1, Csv2Seq will extend SequenceFilesFromDirectoryFilter, like
>> SequenceFilesFromCsvFilter
>>
>> Inside the process method, for each record of each file, we can
>> process the line using CsvRecordFactory, and write the encoded vector
>> and key to the target sequence file using SequenceFile<Text,
>> VectorWritable>.Writer (not ChunkedWriter); that is, we will revise
>> SequenceFilesFromDirectoryFilter to accept SequenceFile<Text,
>> VectorWritable>.Writer
>>
>> I have a few questions about this one:
>> 1. If we have multiple CSV files inside input, then these files will be
>> processed sequentially, is this right?
>> 2. For each CSV file, if it contains multiple HDFS blocks, will all
>> the blocks be processed in parallel on all data nodes?
>> 3. When should we create the target sequence file and its writer?
>> 4. When should we create the CsvRecordFactory object?
>>
>> I'll post the second idea in a later mail.
>>
>>
>> On Fri, Aug 5, 2011 at 5:22 AM, Lance Norskog <[email protected]> wrote:
>>> Universal output options would also be useful. For algorithms that
>>> emit a small amount of data, an option for CSV output would be really
>>> handy. A perfect example: classifiers create a ConfusionMatrix object.
>>> This is a useful item by itself and is very small. Dumping this in a
>>> CSV makes it a lot easier to load into other tools (like R).
>>>
>>> On Thu, Aug 4, 2011 at 8:36 AM, Xiaobo Gu <[email protected]> wrote:
>>>> I think my class can't extend SequenceFilesFromDirectoryFilter,
>>>> because SequenceFilesFromDirectoryFilter requires a ChunkedWriter
>>>> which writes to a SequenceFile<Text, Text>, which can't be read by
>>>> the naive Bayes trainer, am I right?
>>>>
>>>> Regards,
>>>>
>>>> Xiaobo Gu
>>>>
>>>> On Mon, Aug 1, 2011 at 11:15 PM, Ted Dunning <[email protected]> wrote:
>>>>> Take a look at com.tdunning.oscon.DonutEncoder in the oscon-2011 branch.
>>>>>
>>>>> On Mon, Aug 1, 2011 at 6:57 AM, Xiaobo Gu <[email protected]> wrote:
>>>>>
>>>>>> Hi Ted,
>>>>>>
>>>>>> Can you tell me which file to refer to, please?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Xiaobo Gu
>>>>>>
>>>>>> On Fri, Jul 29, 2011 at 3:14 AM, Ted Dunning <[email protected]> wrote:
>>>>>> > If you are looking to revise code, I would suggest that you start from
>>>>>> > the example code found in
>>>>>> >
>>>>>> > [email protected]:tdunning/Chapter-16.git
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Thu, Jul 28, 2011 at 8:22 AM, Xiaobo Gu <[email protected]> wrote:
>>>>>> >
>>>>>> >> I don't know how CSVVectorIterator is used, but about
>>>>>> >> SequenceFilesFromCsvFilter there are a few questions:
>>>>>> >> 1. Should the CSV files be without headers?
>>>>>> >> 2. I think the protected void process(FileStatus fst, Path current)
>>>>>> >> throws IOException function of SequenceFilesFromCsvFilter is the
>>>>>> >> point we can revise to make a CSV-to-sequence converter. The idea is
>>>>>> >> as follows:
>>>>>> >> a. We must create a SequenceFile<Text, VectorWritable> file object
>>>>>> >> and pass its writer to SequenceFilesFromCsvFilter as a constructor
>>>>>> >> parameter; the corresponding sequence file is our destination.
>>>>>> >> b. For each line, extract the label value and an encoded vector, then
>>>>>> >> call writer.append(label, new VectorWritable(vector)). Which column is
>>>>>> >> the label and which columns contribute to the vector can be passed
>>>>>> >> through command line arguments.
>>>>>> >>
>>>>>> >>
>>>>>> >> Regards,
>>>>>> >>
>>>>>> >> Xiaobo Gu
>>>>>> >>
>>>>>> >> On Tue, Jul 26, 2011 at 5:50 PM, Grant Ingersoll <[email protected]> wrote:
>>>>>> >> > We do have:
>>>>>> >> > SequenceFilesFromCsvFilter, although it is somewhat basic
>>>>>> >> > CSVVectorIterator, which takes a CSV file and produces a dense
>>>>>> >> > vector
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > On Jul 26, 2011, at 3:58 AM, Ted Dunning wrote:
>>>>>> >> >
>>>>>> >> >> The critical design step here is to decide how to express the
>>>>>> >> >> schema of the CSV file. There is a beginning of this in the
>>>>>> >> >> CsvRecordFactory, but I was never happy with the (lack of) speed.
>>>>>> >> >>
>>>>>> >> >> On Tue, Jul 26, 2011 at 12:10 AM, Sebastian Schelter <[email protected]> wrote:
>>>>>> >> >>
>>>>>> >> >>>> 2. SequenceFile is not a file format that command line users can
>>>>>> >> >>>> prepare; is there a tool for converting CSV files into
>>>>>> >> >>>> SequenceFiles?
>>>>>> >> >>>>
>>>>>> >> >>>
>>>>>> >> >>> I don't think we have that yet, but it would be very useful imho.
>>>>>> >> >>>
>>>>>> >> >
>>>>>> >> > --------------------------
>>>>>> >> > Grant Ingersoll
>>>>>> >> >
>>>>>> >> >
>>>>>> >>
>>>>>> >
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> [email protected]
>>>
>>
>

--
Lance Norskog
[email protected]
