Lots of ways to do this, but I'd use Pig plus elephant-bird-pig to (1) load
the data from TSV format into Pig and (2) convert the Pig tuples to
Writables and store them in a SequenceFile:

{code}
-- params
%default MY_DATA_FILE '/path/to/docs.tsv';
%default OUTPUT_PATH '/path/to/output';

-- constants
%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';

-- pull in EB
REGISTER '/path/to/elephant-bird-pig.jar';

-- load data; PigStorage splits on tab by default, so a docId{tab}text
-- line becomes a (doc_id, text) tuple
doc = LOAD '$MY_DATA_FILE' USING PigStorage() AS (doc_id: long, text: chararray);

-- clear the output dir, then store doc_id as the LongWritable key and
-- text as the Text value
rmf '$OUTPUT_PATH'
STORE doc INTO '$OUTPUT_PATH' USING $SEQFILE_STORAGE (
  '-c $LONG_CONVERTER', '-c $TEXT_CONVERTER'
);
{code}
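Run it with something like `pig -param MY_DATA_FILE=/path/to/docs.tsv
-param OUTPUT_PATH=/path/to/output convert.pig` (the script name is just an
example; the %defaults kick in if you omit the params). If memory serves,
SequenceFileStorage doubles as a loader, so you can sanity-check the output
by reading it back with the same converters; a quick sketch, assuming it's
appended to the script above so the %declare'd names are in scope:

{code}
-- sanity check (sketch): load the sequence file back and eyeball a few rows
pairs = LOAD '$OUTPUT_PATH' USING $SEQFILE_STORAGE (
  '-c $LONG_CONVERTER', '-c $TEXT_CONVERTER'
) AS (doc_id: long, text: chararray);
DUMP pairs;
{code}

One caveat: if this is headed for Mahout's seq2sparse, I believe that step
expects Text keys and values (which is what seqdirectory produces), so you
may want '-c $TEXT_CONVERTER' on the key side as well.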

https://github.com/kevinweil/elephant-bird/

Andy


On Tue, Oct 30, 2012 at 1:00 PM, Nick Woodward <[email protected]> wrote:

>
> I have done a lot of searching on the web for this, but I've found
> nothing, even though I feel like it has to be somewhat common. In the past
> I have used Mahout's 'seqdirectory' command to convert a folder of text
> files (each file is a separate document) to SequenceFile format. But in
> this case there are so many documents (in the 100,000s) that I have one
> very large text file in which each line is a document. How can I convert
> this large file to SequenceFile format so that Mahout understands that
> each line should be considered a separate document? Would it be better if
> the file was structured like so...
>
> docId1 {tab} document text
> docId2 {tab} document text
> docId3 {tab} document text
> ...
>
> Thank you very much for any help.
> Nick
>
