Re: LYRL2004/RCV1 input for Classification?

Ted Dunning Mon, 28 Jun 2010 13:23:44 -0700

I think the chance that Mahout will use the RCV1 data as-is is pretty near
zero.  The issue is that RCV1 uses the TREC convention of separate files for
documents and relevance judgements (largely because relevance to multiple
queries is quite plausible in most of the TREC tasks).

That said, it doesn't take more than a few lines of glue to smash RCV1 into
one of the several formats that we use in Mahout.
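
As a rough illustration of that glue, here is a minimal Python sketch. It
assumes the LYRL2004 layout of token files made of ".I <docid>" / ".W" records
followed by token lines, plus a qrels file with "category docid 1" lines. The
function names and the tab-separated "label<TAB>tokens" output are
hypothetical choices for illustration, not a format that any particular Mahout
trainer is confirmed to accept.

```python
from collections import defaultdict


def read_qrels(lines):
    """Map doc id -> list of categories, assuming 'category docid 1' lines."""
    labels = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        category, doc_id = parts[0], parts[1]
        labels[doc_id].append(category)
    return labels


def read_documents(lines):
    """Yield (doc_id, tokens), assuming '.I <id>' / '.W' / token-line records."""
    doc_id, tokens = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(".I "):
            if doc_id is not None:
                yield doc_id, tokens
            doc_id, tokens = line[3:].strip(), []
        elif line == ".W" or not line:
            continue  # record marker or blank separator
        elif doc_id is not None:
            tokens.extend(line.split())
    if doc_id is not None:
        yield doc_id, tokens


def join_labels(doc_lines, qrel_lines):
    """Emit one 'label<TAB>tokens' line per (document, category) pair."""
    labels = read_qrels(qrel_lines)
    for doc_id, tokens in read_documents(doc_lines):
        # Documents without any judgement are dropped in this sketch.
        for category in labels.get(doc_id, []):
            yield f"{category}\t{' '.join(tokens)}"
```

Passing open file handles for the token file and the qrels file to
join_labels() would stream out one line per label. A document with several
categories is simply repeated once per category, which is one way (among
others) to flatten the multi-label judgements into a single-label format.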

The real problem is that input formats are not really consistent yet across
the different supervised learning programs in Mahout.  Naive Bayes, Random
Forests, SGD and SVM all use inputs that they inherited from their original
applications.  There is a bit of motion afoot to converge these systems, and
you can definitely help there.

On Mon, Jun 28, 2010 at 1:06 PM, Brandon Mensing <[email protected]> wrote:

> Has anyone used the LYRL2004 RCV1 data for input to classification? I'm
> trying to determine if it's possible to plug it into the given
> classification training without significant modification to the source or
> the data.
>
> Thanks
> Brandon Mensing
>
>
