When running random forest with a feature descriptor loaded from a JSON file
that contains ignored features, the algorithm fails.
The root cause is in the fromJSON(String json) function in Dataset.java:
---------------------------
public static Dataset fromJSON(String json) {
  List<Map<String, Object>> fromJSON;
  try {
    fromJSON = OBJECT_MAPPER.readValue(json,
        new TypeReference<List<Map<String, Object>>>() {});
  } catch (Exception ex) {
    throw new RuntimeException(ex);
  }
  List<Attribute> attributes = Lists.newLinkedList();
  List<Integer> ignored = Lists.newLinkedList();
  String[][] nominalValues = new String[fromJSON.size()][];
  Dataset dataset = new Dataset();
  for (int i = 0; i < fromJSON.size(); i++) {
    Map<String, Object> attribute = fromJSON.get(i);
    if (Attribute.fromString((String) attribute.get(TYPE)) == Attribute.IGNORED) {
      ignored.add(i);
    } else {
      Attribute asAttribute = Attribute.fromString((String) attribute.get(TYPE));
      attributes.add(asAttribute);
      if ((Boolean) attribute.get(LABEL)) {
        dataset.labelId = i - ignored.size();
      }
      if (attribute.get(VALUES) != null) {
        List<String> get = (List<String>) attribute.get(VALUES);
        String[] array = get.toArray(new String[get.size()]);
        // line 400, original (wrong):
        //   nominalValues[i] = array;
        // line 400, new (fixes the problem):
        nominalValues[i - ignored.size()] = array;
      }
    }
  }
  dataset.attributes = attributes.toArray(new Attribute[attributes.size()]);
  dataset.ignored = new int[ignored.size()];
  dataset.values = nominalValues;
  for (int i = 0; i < dataset.ignored.length; i++) {
    dataset.ignored[i] = ignored.get(i);
  }
  return dataset;
}
----------------------------------------------------------
****** nominalValues[i] = array;                    (line 400, original, wrong)
****** nominalValues[i - ignored.size()] = array;   (line 400, new, fixes the problem)
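To illustrate why the index matters, here is a minimal standalone sketch. It is not Mahout code: the class IgnoredIndexSketch and its methods are made up for illustration, the descriptor is shortened to two ignored features, and the lookup returning -1 for a missing entry is an assumption that mirrors the -1 reported earlier in this thread.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the indexing done in Dataset.fromJSON; not Mahout code. */
class IgnoredIndexSketch {

    /**
     * Builds the per-attribute table of nominal values.
     * types[i] is "I" (ignored), "N" (numerical) or "C" (categorical);
     * values[i] holds the categories of attribute i, or null if it has none.
     * With shiftByIgnored == false the table is indexed by the raw descriptor
     * position (original behaviour); with true it is indexed by the position
     * among the kept attributes (proposed fix).
     */
    static String[][] buildValues(String[] types, String[][] values, boolean shiftByIgnored) {
        String[][] nominal = new String[types.length][];
        int ignored = 0;
        for (int i = 0; i < types.length; i++) {
            if ("I".equals(types[i])) {
                ignored++;
            } else if (values[i] != null) {
                nominal[shiftByIgnored ? i - ignored : i] = values[i];
            }
        }
        return nominal;
    }

    /** Looks up a token the way a caller holding a filtered attribute id would. */
    static int valueOf(String[][] nominal, int filteredId, String token) {
        String[] categories = nominal[filteredId];
        return categories == null ? -1 : Arrays.asList(categories).indexOf(token);
    }

    public static void main(String[] args) {
        // Two ignored features, one numerical feature, one categorical label.
        String[] types = {"I", "I", "N", "C"};
        String[][] values = {null, null, null, {"false", "true"}};
        int labelId = 1; // filtered id of the categorical attribute (raw index 3)

        // Raw indexing: the categories sit in slot 3, so slot 1 is empty.
        System.out.println(valueOf(buildValues(types, values, false), labelId, "true"));
        // Shifted indexing: the categories sit in slot 1, where the caller looks.
        System.out.println(valueOf(buildValues(types, values, true), labelId, "true"));
    }
}
```

With raw indexing the lookup misses and yields -1; with the shifted index it finds the token at position 1.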
I ran several tests on my own data, and it works as expected.
I'll file a JIRA, and if nobody picks it up, I'll submit the patch.
Sam
On Sat, Dec 14, 2013 at 4:05 PM, sam wu <[email protected]> wrote:
> Hi Ted,
>
> After some more debugging: my previous statement is not correct, please
> disregard it.
> There is a problem, I am sure. I am using the in-memory mapper, one of the
> ways to load data, and I found the problem there.
> I am going to compare it with the other approaches (partial, Breiman) to see
> what the difference is.
>
> My bad. Well, it's Saturday!
>
> Sam
>
>
> On Sat, Dec 14, 2013 at 1:38 PM, Ted Dunning <[email protected]> wrote:
>
>> Can you file a JIRA at https://issues.apache.org/jira/browse/MAHOUT ?
>>
>> It sounds like you have a test case in mind along with your fix. If you
>> could package that work up as a patch file, then it would be much
>> appreciated.
>>
>>
>> On Sat, Dec 14, 2013 at 9:24 AM, sam wu <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > I am using Mahout's random forest. It works well when the feature
>> > descriptor contains no ignored features (no I flag).
>> >
>> > When the Ignore flag is used, the returned feature value is -1
>> > (in the code, dataset.valueOf(aId, token) returns -1).
>> >
>> > I did some investigation and found some problems in
>> > DataConverter.java.
>> >
>> > Source code:
>> > ------
>> >
>> > for (int attr = 0; attr < nball; attr++) {              // line 51
>> >   if (ArrayUtils.contains(dataset.getIgnored(), attr)) {
>> >     continue; // IGNORED
>> >   }
>> >
>> >   String token = tokens[attr].trim();
>> >
>> >   if ("?".equals(token)) {
>> >     // missing value
>> >     return null;
>> >   }
>> >
>> >   if (dataset.isNumerical(aId)) {                       // line 63
>> >     vector.set(aId++, Double.parseDouble(token));
>> >   } else { // CATEGORICAL
>> >     vector.set(aId, dataset.valueOf(aId, token));       // line 66
>> >     aId++;
>> >   }
>> > }
>> > -------
>> > Let the feature descriptor be 9 I N L (the Breiman example):
>> > 11 features, where 1-9 are ignored, the 10th is numeric, and the 11th is
>> > the label variable.
>> > (Does the Breiman example really work when following the web instructions?)
>> >
>> > line 51 -- attr is the raw feature index, 0-10
>> > aId is the filtered feature index, 0-1 (two non-ignored features)
>> > The problem is in line 66:
>> > if attr=10 (the label feature)
>> > aId=1
>> > token=true
>> > then dataset.valueOf(aId, token) returns -1. In the current code, valueOf()
>> > for CATEGORICAL features mixes up the aId and attr concepts.
>> >
>> > Just changing line 66 from
>> > vector.set(aId, dataset.valueOf(aId, token)); --66
>> > to vector.set(aId, dataset.valueOf(attr, token));
>> > does not work, because some validation then fails (again an attr/aId mix-up).
>> >
>> >
>> >
>> > There might be things that I have overlooked; these are just some thoughts.
>> >
>> >
>> > Sam
>> >
>>
>
>