Suneel is right, indeed. I had assumed that everything performed prior to vector generation was done correctly.
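A quick way to check whether the input SequenceFiles actually contain records is Mahout's seqdumper utility. This is just a sketch: it assumes the `mahout` driver script is on your PATH and that the directory names match the Java example below; the `-c` (count-only) flag may differ between Mahout versions.

```shell
# Dump the first few records of the input seqfiles to see if they have any "meat":
mahout seqdumper -i reuters-seqfiles | head -n 20

# Count-only mode (-c) just reports the number of key/value pairs.
# Also worth checking the tokenized documents that feed the vectorizer:
mahout seqdumper -i tokenized-documents -c
```

If either directory reports zero records, the empty tfidf-vectors are expected and the problem is upstream of TFIDFConverter.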
By the way, if the suggestions do not work, could you try running seq2sparse from the command line with the same arguments and see whether that works?

On Sun, Sep 1, 2013 at 7:23 PM, Suneel Marthi <[email protected]> wrote:

> I would first check to see if the input 'seqfiles' for TFIDFConverter have
> any meat in them.
> This could also happen if the input seqfiles are empty.
>
> ________________________________
> From: Taner Diler <[email protected]>
> To: [email protected]
> Sent: Sunday, September 1, 2013 2:24 AM
> Subject: TFIDFConverter generates empty tfidf-vectors
>
> Hi all,
>
> I am trying to run the Reuters KMeans example in Java, but TFIDFConverter
> generates the tfidf-vectors as empty. How can I fix that?
>
>     private static int minSupport = 2;
>     private static int maxNGramSize = 2;
>     private static float minLLRValue = 50;
>     private static float normPower = 2;
>     private static boolean logNormalize = true;
>     private static int numReducers = 1;
>     private static int chunkSizeInMegabytes = 200;
>     private static boolean sequentialAccess = true;
>     private static boolean namedVectors = false;
>     private static int minDf = 5;
>     private static long maxDF = 95;
>
>     Path inputDir = new Path("reuters-seqfiles");
>     String outputDir = "reuters-kmeans-try";
>     HadoopUtil.delete(conf, new Path(outputDir));
>     StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
>     Path tokenizedPath = new Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
>     DocumentProcessor.tokenizeDocuments(inputDir,
>         analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);
>
>     DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>         new Path(outputDir), DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER,
>         conf, minSupport, maxNGramSize, minLLRValue, normPower, logNormalize,
>         numReducers, chunkSizeInMegabytes, sequentialAccess, namedVectors);
>
>     Pair<Long[], List<Path>> features = TFIDFConverter.calculateDF(
>         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>         new Path(outputDir), conf, chunkSizeInMegabytes);
>     TFIDFConverter.processTfIdf(
>         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>         new Path(outputDir), conf, features, minDf, maxDF, normPower,
>         logNormalize, sequentialAccess, false, numReducers);
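For reference, a seq2sparse invocation roughly equivalent to the Java parameters above might look like the sketch below. The option names are taken from SparseVectorsFromSequenceFiles as of Mahout 0.8 and may vary between versions; note also that the `-x` option takes a document-frequency percentage, so double-check how it corresponds to the `maxDF` value passed to processTfIdf.

```shell
# Flags mapped (approximately) from the Java fields:
#   -s 2     -> minSupport          -ng 2      -> maxNGramSize
#   -ml 50   -> minLLRValue         -n 2       -> normPower
#   -lnorm   -> logNormalize        -md 5      -> minDf
#   -x 95    -> maxDF (percent)     -chunk 200 -> chunkSizeInMegabytes
#   -seq     -> sequentialAccess
mahout seq2sparse \
  -i reuters-seqfiles \
  -o reuters-kmeans-try \
  -wt tfidf \
  -s 2 -ng 2 -ml 50 -n 2 -lnorm \
  -md 5 -x 95 -chunk 200 -seq
```

If this produces non-empty tfidf-vectors while the Java code does not, diffing the intermediate outputs of the two runs should narrow down which step in the Java pipeline diverges.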
