[jira] [Commented] (IGNITE-20139) RandomForestClassifierTrainer accuracy issue

2024-03-11 Thread Igor Belyakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825164#comment-17825164
 ] 

Igor Belyakov commented on IGNITE-20139:


[~zaleslaw], could you please review the PR?

> RandomForestClassifierTrainer accuracy issue
> 
>
> Key: IGNITE-20139
> URL: https://issues.apache.org/jira/browse/IGNITE-20139
> Project: Ignite
>  Issue Type: Bug
>  Components: ml
>Affects Versions: 2.15
>Reporter: Alexandr Shapkin
>Assignee: Igor Belyakov
>Priority: Major
> Attachments: TreeSample2_Portfolio_Change.png, random-forest.zip
>
>
> We tried to use the machine learning capabilities and discovered a bug in the 
> implementation of Random Forest. When comparing Ignite's output with a Python 
> prototype (the scikit-learn library), we noticed that Ignite's predictions have 
> much lower accuracy despite using the same data set and model parameters. 
> Further investigation showed that Ignite generates decision trees that 
> effectively "loop": the tree keeps checking the same condition over and over 
> until it reaches the maximum tree depth.
> I've attached a standalone reproducer which uses a small excerpt of our data 
> set. 
> It loads data from the CSV file, then trains the model with just 1 tree. The 
> reproducer then finds one of the looping branches and prints it. You will see 
> that every single node in the branch uses the same feature and value and has 
> the same calculated impurity. 
> On my machine the code reproduces this issue 100% of the time.
> I've also attached an example of the tree generated by Python's scikit-learn 
> on the same data set with the same parameters. In Python the tree usually 
> doesn't get deeper than 20 nodes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (IGNITE-20139) RandomForestClassifierTrainer accuracy issue

2024-03-11 Thread Igor Belyakov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825163#comment-17825163
 ] 

Igor Belyakov commented on IGNITE-20139:


The issue happens when a “pure” node (with impurity^*^ = 0) is present in the 
tree. We calculate impurity only for the children nodes, not for the current 
node, and we never check whether the current node is already “pure” (contains 
just one label). Because of that, the “bestSplit” calculation is executed for 
the already “pure” node, and it decides that all items should be moved to the 
left child node and none to the right (leaf node), which produces two “pure” 
children nodes. Since we don’t calculate the impurity for the current (parent) 
node, the 
{{parentNode.getImpurity() - split.get().getImpurity() > minImpurityDelta}} 
check is always true, and we continue to split the already “pure” node until 
the max tree depth is reached.
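The failure mode can be illustrated with a minimal, hypothetical sketch (plain 
Python rather than Ignite's Java trainer code; the {{gini}} and {{build}} names 
and the simplified recursion are illustrative assumptions, not the actual API):

```python
def gini(labels):
    """Gini impurity: 1 - sum(p^2) over class probabilities."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def build(labels, depth, max_depth, check_purity=False):
    """Return the depth at which recursion on one branch stops."""
    if depth == max_depth:
        return depth
    if check_purity and gini(labels) == 0.0:  # the missing stop condition
        return depth
    # On a pure node every candidate split is degenerate: all items go to
    # one child, so the "best" split simply reproduces the parent subset.
    left, right = labels, []
    # Without the parent's own impurity, the impurity-delta check cannot
    # reject this split, and the recursion continues on the identical subset.
    return build(left, depth + 1, max_depth, check_purity)

pure = [1, 1, 1, 1]                                      # single-label subset
print(build(pure, 0, max_depth=10))                      # buggy: 10 (max depth)
print(build(pure, 0, max_depth=10, check_purity=True))   # fixed: 0 (stops at once)
```

With the purity check disabled the recursion runs to the maximum depth on the 
same single-label subset, which is exactly the "looping branch" the reproducer 
prints.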
The following changes were made to resolve the issue:
 # A gain^**^ calculation and a corresponding check for the split were added.
 # A node impurity check was added: once the impurity reaches 0, the node is 
“pure” and no split needs to be calculated for it.
 # The Gini impurity calculation was changed to {{(1 - sum(p^2))}} to produce 
values in the correct range from 0 to 0.5, as required for the Gini index.

^*^ Impurity: a value from 0 to 0.5 that shows whether the node is “pure” 
(impurity = 0, containing just one label) or “impure” (impurity = 0.5, the worst 
case, where the label ratio is 1:1).
^**^ Gain: the difference between the parent node’s impurity and the weighted 
impurity of the children nodes. The split that provides the maximum gain value 
is considered the best. See [https://www.learndatasci.com/glossary/gini-impurity/]
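The two definitions above can be checked with a short, self-contained sketch 
(plain Python, independent of Ignite's implementation; the function names are 
illustrative):

```python
def gini(labels):
    """Gini impurity: 1 - sum(p^2); 0 for a pure node, 0.5 for a 1:1 split."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

print(gini([0, 1, 0, 1]))   # worst case for two classes: 0.5
print(gini([1, 1, 1, 1]))   # pure node: 0.0
# A perfect split of a 1:1 parent recovers the full 0.5 of impurity:
print(gain([0, 0, 1, 1], [0, 0], [1, 1]))           # 0.5
# A degenerate split of a pure node (everything left) yields zero gain,
# so the added gain check rejects it:
print(gain([1, 1, 1, 1], [1, 1, 1, 1], []))         # 0.0
```

This also shows why fix #3 matters: only with {{(1 - sum(p^2))}} does the 
impurity of a 1:1 node come out as 0.5, matching the stated range.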



[jira] [Commented] (IGNITE-20139) RandomForestClassifierTrainer accuracy issue

2023-08-02 Thread Alexandr Shapkin (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-20139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750303#comment-17750303
 ] 

Alexandr Shapkin commented on IGNITE-20139:
---

[~zaleslaw] could you please take a look? It seems to be a valid issue; a 
reproducer is attached.
