It seems SequenceFiles can only be written single-threaded, so map-reduce style programming can't be used. Am I right?
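
One thought on this: a single SequenceFile.Writer is one sequential stream, but as far as I understand a map-reduce job does not have to share one, because each task writes its own part file under the output directory. Below is a minimal plain-Java sketch of that pattern only. It uses threads and plain text files as stand-ins for tasks and SequenceFiles, and the names PartFileSketch, writeParts and writeOnePart are made up for illustration; they are not Hadoop or Mahout APIs.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration: one writer per worker, one part file per worker.
public class PartFileSketch {

    // Writes every shard to its own part file in parallel.
    // No writer is ever shared between two threads.
    static List<Path> writeParts(List<List<String>> shards, Path outDir) {
        try {
            Files.createDirectories(outDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (int i = 0; i < shards.size(); i++) {
                final int shardId = i;
                Callable<Path> task = () -> writeOnePart(shards.get(shardId), outDir, shardId);
                pending.add(pool.submit(task));
            }
            // Collect results in shard order, waiting for each worker to finish.
            List<Path> parts = new ArrayList<>();
            for (Future<Path> f : pending) {
                parts.add(f.get());
            }
            return parts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // A single worker: opens its own file, writes its records sequentially, closes it.
    static Path writeOnePart(List<String> records, Path outDir, int shardId) {
        Path part = outDir.resolve(String.format("part-%05d", shardId));
        try (BufferedWriter writer = Files.newBufferedWriter(part, StandardCharsets.UTF_8)) {
            for (String record : records) {
                writer.write(record);
                writer.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return part;
    }
}
```

If this carries over, the output parameter of Csv2Seq would name a directory of part files rather than a single sequence file, and downstream jobs would read the whole directory.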

On Fri, Aug 5, 2011 at 10:51 PM, Xiaobo Gu <[email protected]> wrote:
> I will try to write a program named Csv2Seq. It will read all the CSV
> files under input recursively, encode all the records as vectors,
> and write the encoded vectors into a sequence file of type
> SequenceFile<Text, VectorWritable>, which can be consumed by
> algorithms such as Naïve Bayes. I have planned the following input
> parameters for the program:
>
> input: the root HDFS directory containing the CSV files to convert
> output: the HDFS path of the target sequence file
> header: the HDFS path of a file containing the header of the CSV files
> predictors: the columns to encode as a vector
> types: the data types of the predictors (numeric, word, or text)
> target: the name of the target variable
> categories: the number of target categories to be considered
> features: the number of internal hashed features to use
> key: the column to write as the key of the target sequence file
>
> If key is specified, the content of the key column of each record
> will be written as the sequence file key; otherwise the target label
> will be written as the sequence file key.
>
> Then comes the implementation strategy. Since I am not so familiar
> with the Hadoop map-reduce programming style, the following are just
> two ideas:
>
> #1, Csv2Seq will extend SequenceFilesFromDirectoryFilter, like
> SequenceFilesFromCsvFilter.
>
> Inside the process method, for each record of each file, we can
> process the line using CsvRecordFactory and write the encoded vector
> and key to the target sequence file using SequenceFile<Text,
> VectorWritable>.Writer (not ChunkedWriter); that is, we will revise
> SequenceFilesFromDirectoryFilter to accept a SequenceFile<Text,
> VectorWritable>.Writer.
>
> I have a few questions about this one:
> 1. If we have multiple CSV files inside input, these files will be
> processed sequentially, is this right?
> 2. For each CSV file, if it contains multiple HDFS blocks, will all
> the blocks be processed in parallel on all data nodes?
> 3. When should we create the target sequence file and its writer?
> 4. When should we create the CsvRecordFactory object?
>
> I'll post the second idea in a later mail.
>
>
> On Fri, Aug 5, 2011 at 5:22 AM, Lance Norskog <[email protected]> wrote:
>> Universal output options would also be useful. For algorithms that
>> emit a small amount of data, an option for CSV output would be really
>> handy. A perfect example: classifiers create a ConfusionMatrix object.
>> This is a useful item by itself and is very small. Dumping this in a
>> CSV makes it a lot easier to load into other tools (like R).
>>
>> On Thu, Aug 4, 2011 at 8:36 AM, Xiaobo Gu <[email protected]> wrote:
>>> I think my class can't extend SequenceFilesFromDirectoryFilter,
>>> because SequenceFilesFromDirectoryFilter requires a ChunkedWriter,
>>> which writes to a SequenceFile<Text, Text>, which can't be read by
>>> the naive Bayes trainer. Am I right?
>>>
>>> Regards,
>>>
>>> Xiaobo Gu
>>>
>>> On Mon, Aug 1, 2011 at 11:15 PM, Ted Dunning <[email protected]> wrote:
>>>> Take a look at com.tdunning.oscon.DonutEncoder in the oscon-2011 branch.
>>>>
>>>> On Mon, Aug 1, 2011 at 6:57 AM, Xiaobo Gu <[email protected]> wrote:
>>>>
>>>>> Hi Ted,
>>>>>
>>>>> Can you tell me which file to refer to, please?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Xiaobo Gu
>>>>>
>>>>> On Fri, Jul 29, 2011 at 3:14 AM, Ted Dunning <[email protected]> wrote:
>>>>> > If you are looking to revise code, I would suggest that you start from
>>>>> > the example code found in
>>>>> >
>>>>> > [email protected]:tdunning/Chapter-16.git
>>>>> >
>>>>> >
>>>>> > On Thu, Jul 28, 2011 at 8:22 AM, Xiaobo Gu <[email protected]> wrote:
>>>>> >
>>>>> >> I don't know how CSVVectorIterator is used, but about
>>>>> >> SequenceFilesFromCsvFilter there are a few questions:
>>>>> >> 1. Should the CSV files be without headers?
>>>>> >> 2. I think the protected void process(FileStatus fst, Path current)
>>>>> >> throws IOException function of SequenceFilesFromCsvFilter is the
>>>>> >> point we can revise to make a CSV to sequence file converter. The
>>>>> >> idea is the following:
>>>>> >> a. We must create a SequenceFile<Text, VectorWritable> file object
>>>>> >> and pass its writer to SequenceFilesFromCsvFilter as a constructor
>>>>> >> parameter; the corresponding sequence file is our destination.
>>>>> >> b. For each line, extract the label value and an encoded vector, then
>>>>> >> call writer.append(label, new VectorWritable(vector)). Which column is
>>>>> >> the label and which columns contribute to the vector can be passed
>>>>> >> through command line arguments.
>>>>> >>
>>>>> >>
>>>>> >> Regards,
>>>>> >>
>>>>> >> Xiaobo Gu
>>>>> >>
>>>>> >> On Tue, Jul 26, 2011 at 5:50 PM, Grant Ingersoll <[email protected]> wrote:
>>>>> >> > We do have:
>>>>> >> > SequenceFilesFromCsvFilter, although it is somewhat basic
>>>>> >> > CSVVectorIterator, which takes a CSV file and produces a dense vector
>>>>> >> >
>>>>> >> >
>>>>> >> > On Jul 26, 2011, at 3:58 AM, Ted Dunning wrote:
>>>>> >> >
>>>>> >> >> The critical design step here is to decide how to express the schema
>>>>> >> >> of the CSV file. There is a beginning of this in the CsvRecordFactory,
>>>>> >> >> but I was never happy with the (lack of) speed.
>>>>> >> >>
>>>>> >> >> On Tue, Jul 26, 2011 at 12:10 AM, Sebastian Schelter
>>>>> >> >> <[email protected]> wrote:
>>>>> >> >>
>>>>> >> >>>> 2. SequenceFile is not a file format that command line users can
>>>>> >> >>>> prepare. Is there a tool for converting CSV files into SequenceFiles?
>>>>> >> >>>>
>>>>> >> >>>
>>>>> >> >>> I don't think we have that yet, but it would be very useful imho.
>>>>> >> >>>
>>>>> >> >
>>>>> >> > --------------------------
>>>>> >> > Grant Ingersoll
>>>>> >> >
>>>>> >>
>>>>> >
>>>>>
>>>>
>>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>
>
