Just to be a picker of nits... this topic is more precisely Hadoop Development 101. I only mention this because I am a newbie Hadoop admin and this was over my head. ;) Admins don't worry as much about key/value pairs and parsing as we do about where to find the script that starts the NameNode. ;)
On Wed, Dec 12, 2012 at 11:16 PM, David Parks <[email protected]> wrote:
> Nothing that I'm aware of for text files, I'd just use standard Unix utils
> to process it outside of Hadoop.
>
> As to getting a reader from any of the Input Formats, here's the typical
> example you'd follow to get the reader for a sequence file, you could
> extrapolate the example to access whichever reader you're interested in.
>
> http://my.safaribooksonline.com/book/databases/hadoop/9780596521974/file-based-data-structures/id3555432
>
> -----Original Message-----
> From: Pat Ferrel [mailto:[email protected]]
> Sent: Wednesday, December 12, 2012 11:37 PM
> To: [email protected]
> Subject: Re: Hadoop 101
>
> Yeah I found the TextInputFormat and KeyValueTextInputFormat and I know how
> to parse text--I'm just too lazy. I was hoping there was a Text equivalent
> of a SequenceFile that was hidden somewhere. As I said there is no mapper,
> this is running outside of Hadoop M/R. So I at least need a line reader and
> I'm not sure how the InputFormat works outside a mapper. But who cares, parsing
> is simple enough from scratch. All the KeyValueTextInputFormat gives me is
> splitting at the tab afaict.
>
> Actually this convinces me to look further into getting the values from
> method calls. They aren't quite what I want to begin with.
>
> Thanks for saving me more fruitless searches.
>
> On Dec 11, 2012, at 10:04 PM, David Parks <[email protected]> wrote:
>
> If you use TextInputFormat, you'll get the following key<LongWritable>,
> value<Text> pairs in your mapper:
>
> file_position, your_input
>
> Example:
> 0, "0\t[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]"
> 100, "8\t[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786037]"
> 200, "25\t[284:0.34821576,286:0.34821576,287:0.34821576,288:0.34821576,289:0.34821576]"
>
> Then just parse it out in your mapper.
>
> -----Original Message-----
> From: Pat Ferrel [mailto:[email protected]]
> Sent: Wednesday, December 12, 2012 7:50 AM
> To: [email protected]
> Subject: Hadoop 101
>
> Stupid question for the day.
>
> I have a file created by a mahout job of the form:
>
> 0    [356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]
> 8    [356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786037]
> 25   [284:0.34821576,286:0.34821576,287:0.34821576,288:0.34821576,289:0.34821576]
> 28   [452:0.34802154,454:0.34802154,453:0.34802154,456:0.34802154,455:0.34802154]
> .
>
> If this were a SequenceFile I could read it and be merrily on my way but
> it's a text file. The classes written are key, value pairs <LongWritable,
> VectorWritable> but the file is tab delimited text.
>
> I was hoping to do something like:
>
> SequenceFile.Reader reader = new SequenceFile.Reader(fs, inputFile, conf);
> Writable userId = new LongWritable();
> VectorWritable recommendations = new VectorWritable();
> while (reader.next(userId, recommendations)) {
>   //do something with each pair
> }
>
> But alas Google fails me. How do you read in key, value pairs from text
> files outside of a map or reduce?
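
For anyone landing on this thread later: the "parse it from scratch" route discussed above can be sketched in plain Java with no Hadoop classes at all. This is only a minimal sketch, assuming every line has the form key<TAB>[index:value,index:value,...] as in the sample data. The class name TabVectorLineParser and the methods parseLine and readAll are hypothetical names made up for illustration, and the vector is held in a plain Map instead of Mahout's VectorWritable.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.AbstractMap.SimpleEntry;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Hadoop or Mahout): parses tab-delimited
// "key<TAB>[index:value,...]" lines without running inside a mapper.
class TabVectorLineParser {

    /** Parses one line into its long key and an index -> value map. */
    static Map.Entry<Long, Map<Integer, Double>> parseLine(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("No tab separator in line: " + line);
        }
        long key = Long.parseLong(line.substring(0, tab).trim());
        String body = line.substring(tab + 1).trim();
        if (!body.startsWith("[") || !body.endsWith("]")) {
            throw new IllegalArgumentException("Vector is not bracketed: " + body);
        }
        body = body.substring(1, body.length() - 1);

        // LinkedHashMap keeps the elements in the order they appear in the file.
        Map<Integer, Double> vector = new LinkedHashMap<>();
        if (!body.isEmpty()) {
            for (String element : body.split(",")) {
                String[] parts = element.split(":", 2);
                if (parts.length != 2) {
                    throw new IllegalArgumentException("Bad vector element: " + element);
                }
                vector.put(Integer.parseInt(parts[0].trim()),
                           Double.parseDouble(parts[1].trim()));
            }
        }
        return new SimpleEntry<>(Long.valueOf(key), vector);
    }

    /** Reads every non-blank line from the reader; a repeated key overwrites the earlier one. */
    static Map<Long, Map<Integer, Double>> readAll(BufferedReader reader) throws IOException {
        Map<Long, Map<Integer, Double>> result = new LinkedHashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue;
            }
            Map.Entry<Long, Map<Integer, Double>> entry = parseLine(line);
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Two shortened lines in the same shape as the sample file in the thread.
        String sample = "0\t[356:0.3481597,359:0.3481597]\n"
                      + "8\t[356:0.34786037,359:0.34786037]\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            readAll(reader).forEach((userId, recommendations) ->
                    System.out.println(userId + " -> " + recommendations));
        }
    }
}
```

For a file on local disk the same readAll call should work with a reader from java.nio.file.Files.newBufferedReader. For a file in HDFS, one option is presumably to wrap the stream returned by FileSystem.open(path) in an InputStreamReader, but that part is untested here.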
