Have a look at OnlineLogisticRegressionTest.iris().

There, Collections.shuffle() is combined with List.subList() to split the
data into a training set and a test set.

So you could first read the dataset into a list and then use this trick.
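
Roughly, the trick looks like the sketch below. This is not the code from
the Mahout test; the class name TrainTestSplit, the Split holder and the
trainFraction and seed parameters are made up for illustration. It assumes
the rows are already loaded into a java.util.List.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout.
final class TrainTestSplit {

    /** Simple holder for the two parts of a split. */
    static final class Split<T> {
        final List<T> train;
        final List<T> test;

        Split(List<T> train, List<T> test) {
            this.train = train;
            this.test = test;
        }
    }

    /**
     * Shuffles a copy of the rows and cuts it into a training part and a
     * test part. The input list itself is left untouched.
     */
    static <T> Split<T> split(List<T> rows, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be in [0, 1]");
        }
        // Copy first so the caller's list keeps its original order.
        List<T> shuffled = new ArrayList<>(rows);
        // A fixed seed makes the split reproducible between runs.
        Collections.shuffle(shuffled, new Random(seed));

        int cut = (int) Math.round(shuffled.size() * trainFraction);
        // subList() returns views, so copy them into independent lists.
        List<T> train = new ArrayList<>(shuffled.subList(0, cut));
        List<T> test = new ArrayList<>(shuffled.subList(cut, shuffled.size()));
        return new Split<>(train, test);
    }
}
```

For example, TrainTestSplit.split(rows, 0.8, 42L) would put about 80% of
the rows in train and the rest in test.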

I just pushed an example to GitHub that also uses this approach, but I
wrapped the logic in a utility class.

See: https://github.com/frankscholten/mahout-sgd-bank-marketing and

https://github.com/frankscholten/mahout-sgd-bank-marketing/blob/master/src/main/java/bankmarketing/util/TrainAndTestSetUtil.java

Cheers,

Frank


On Mon, Feb 3, 2014 at 10:01 PM, j.barrett Strausser <
[email protected]> wrote:

> Two part question.
>
> 1. String Descriptor for input data
>
> Can anyone confirm my reasoning on the following -
>
> I believe the code below does the following: it declares the first column
> to be the label (the value to be predicted), and all other columns are
> used in the tree construction, i.e. as variables to split on.
>
> val descriptor = "L N N"
> val trainDataValues = fileAsStringArray("myTrainFile.csv");
> val data = DataLoader.loadData(DataLoader.generateDataset(descriptor,
> false, trainDataValues), trainDataValues);
>
> Where "myTrainFile.csv" has a form like
>
> "A", .45,.55
> ...
> ...
> "B", 33.3, 22.3
>
>
>
> 2. String Descriptor for input data
>
> I'm now provided a new file "myTestData.csv"
>
> This data has no labels, but is otherwise the same as above. So if I
> attempt to create a dataset, an error is thrown complaining that there is
> no label.
>
> All I'm interested in is being able to call forest.classify(..., ...) but
> I'm not sure how to correctly construct my training dataset.
>
> I cannot simply split the original dataset as is done in most examples.
>
>
> Any examples showing test data construction independent of the original
> training set would be appreciated.
>
>
> --
>
>
> https://github.com/bearrito
> @deepbearrito
>
