I think the chance that Mahout will use the RCV1 data as-is is pretty near zero. The issue is that RCV1 uses the TREC convention of separate files for documents and relevance judgements (largely because relevance to multiple queries is quite plausible in most of the TREC tasks).
That said, it doesn't take more than a few lines of glue to smash RCV1 into one of the several formats that we use in Mahout. The real problem is that input formats are not yet very consistent across the different supervised learning programs in Mahout. Naive Bayes, Random Forests, SGD and SVM all use input formats that they inherited from their original applications. There is a bit of motion afoot to converge these systems, and you can definitely help there.

On Mon, Jun 28, 2010 at 1:06 PM, Brandon Mensing <[email protected]> wrote:

> Has anyone used the LYRL2004 RCV1 data for input to classification? I'm
> trying to determine if it's possible to plug it into the given
> classification training without significant modification to the source or
> the data.
>
> Thanks
> Brandon Mensing
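
For what it's worth, the glue mentioned above could look roughly like the sketch below. It rests on several assumptions: that the judgements file has one "CATEGORY DOCID 1" triple per line, that the token files use ".I <docid>" and ".W" markers followed by lines of tokens (which is how I remember the LYRL2004 distribution), and that the target is a simple "label<TAB>text" line per example. That output layout is only an illustration, not a confirmed Mahout input format, and the function names are made up for this example.

```python
from collections import defaultdict


def read_qrels(lines):
    """Map doc id -> set of categories, assuming 'CATEGORY DOCID 1' lines."""
    labels = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            labels[parts[1]].add(parts[0])
    return labels


def read_token_docs(lines):
    """Yield (doc_id, tokens), assuming '.I <id>' / '.W' / token-line records."""
    doc_id, tokens = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(".I "):
            # A new record starts; emit the previous one if there was one.
            if doc_id is not None:
                yield doc_id, tokens
            doc_id, tokens = line[3:].strip(), []
        elif line == ".W" or not line:
            continue
        elif doc_id is not None:
            tokens.extend(line.split())
    if doc_id is not None:
        yield doc_id, tokens


def to_labeled_lines(token_lines, qrels_lines):
    """Join documents with their judgements, one output line per (label, doc)."""
    labels = read_qrels(qrels_lines)
    for doc_id, tokens in read_token_docs(token_lines):
        # Documents without any judgement are skipped.
        for category in sorted(labels.get(doc_id, ())):
            yield category + "\t" + " ".join(tokens)
```

Both arguments can be open file handles, since the functions only iterate over lines. Emitting one line per (category, document) pair is just one way to deal with documents that carry several labels; a learner that wants a single label per document would need a different policy.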
