[Scikit-learn-general] Dataset question

ChungHung Liu Fri, 25 Oct 2013 04:25:33 -0700

I have a few questions but can't find good answers after searching on the 
internet.


1. How skewed does a dataset normally have to be to count as imbalanced? 
It's clear that an imbalanced dataset consists of majority and minority 
classes, but it's not clear what percentage the minority class usually makes 
up. After googling, I found one post whose imbalanced dataset contains 10% 
minority and 90% majority [1]. What about 20% or 30% minority (80% or 70% 
majority)? 

2. What kind of data distribution is considered good for an imbalanced 
dataset? Or what technique (Principal Component Analysis?) can be applied to 
examine the data distribution? 

I use a balanced random forest to deal with the imbalanced dataset and tested 
it against a dataset found at [2]. The results look OK and stable (see A). 
However, when I switch to a dataset other than those in [2] (say dataset X), 
the confusion matrix changes drastically and is unstable across repeated runs 
(see B). The differences I can think of are the way I encode the values in 
dataset X, and that the dataset found at [2] is a dense matrix (nearly every 
column has a value in every row). Dataset X has these characteristics:

1. it has around a hundred columns 
2. many columns are mostly empty (a sparse matrix) 
3. a few columns (which can be seen as database primary/unique keys) contain 
almost as many distinct values as there are rows. For example, the dataset has 
500 rows and column a contains 489 different values, so 
    len(numpy.unique(a_values_array, return_inverse=True)[0]) == 489
4. many columns are categorical, so most values are encoded using 
numpy.unique(array, return_inverse=True)[1]

The way I encode/factorize the values is:
- remove columns that do not contain many values, i.e. whose value count falls 
below a threshold, e.g. 50. So a column with 500 rows that contains only 49 
values is removed. 
- encode columns with a very large number of distinct values (like item 3 
above) as 0/1; i.e. a row that has a value is encoded as 1, otherwise 0.
- factorize categorical columns with e.g. numpy.unique(column_value_array, 
return_inverse=True)[1] 
- leave numeric columns as they are.
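
The encoding steps listed above can be sketched roughly as follows. This is 
only an illustration under several assumptions that are not in the original 
post: the function name encode_column, the key_ratio cut-off used to detect 
key-like columns, the use of None to mark a missing value, the -1 code for 
missing categorical values, and the choice to test for numeric columns before 
testing for key-like ones are all hypothetical.

```python
import numpy as np

def encode_column(values, min_count=50, key_ratio=0.9):
    """Encode one column; None marks a missing value (an assumption).

    Returns None when the column should be dropped.
    min_count and key_ratio are hypothetical thresholds.
    """
    values = np.asarray(values, dtype=object)
    present = np.array([v is not None for v in values], dtype=bool)

    # Step 1: drop columns with too few values.
    if present.sum() < min_count:
        return None

    filled = values[present]

    # Step 4: leave numeric columns as they are (missing -> NaN).
    if all(isinstance(v, (int, float)) for v in filled):
        out = np.full(len(values), np.nan)
        out[present] = filled.astype(float)
        return out

    # Factorize the non-missing values, as in the post.
    labels, codes = np.unique(filled.astype(str), return_inverse=True)

    # Step 2: key-like columns (almost one distinct value per row)
    # become a 0/1 "has a value" indicator.
    if len(labels) >= key_ratio * len(values):
        return present.astype(int)

    # Step 3: ordinary categorical columns keep the integer codes;
    # missing values get the code -1 (an assumption).
    out = np.full(len(values), -1)
    out[present] = codes
    return out
```

Note that the integer codes from numpy.unique impose an arbitrary order on the 
categories, which a tree-based model may split on as if it were meaningful.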

I would appreciate any suggestions. 

Many thanks. 

A.) UCI dataset
test set -> b label has 50 records, a label has 470 records.

       a     b
a    377    93
b     11    39

test set -> b label has 41 records, a label has 437 records.

       a     b
a    370    67
b      8    33


B.) other dataset
b has 24 records, a has 152 records, percent of b: 13.64%

       a     b
a     95    57
b      8    16

b has 20 records, a has 137 records, percent of b: 12.74%

       a     b
a    118    19
b     15     5

b has 20 records, a has 169 records, percent of b: 10.58%

       a     b
a    148    21
b     17     3
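
To make the difference between A and B easier to compare, the per-class recall 
can be computed from each confusion matrix. This small sketch assumes that 
rows are the actual labels and columns the predicted labels, which matches the 
record counts stated above each matrix (for example 377 + 93 = 470). The 
function name per_class_recall is hypothetical.

```python
import numpy as np

def per_class_recall(cm):
    """Recall per class, assuming rows = actual and columns = predicted."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)

# Confusion matrices copied from the runs listed above.
runs = {
    "A run 1": [[377, 93], [11, 39]],
    "A run 2": [[370, 67], [8, 33]],
    "B run 1": [[95, 57], [8, 16]],
    "B run 2": [[118, 19], [15, 5]],
    "B run 3": [[148, 21], [17, 3]],
}

for name, cm in runs.items():
    recall_a, recall_b = per_class_recall(cm)
    print(f"{name}: recall a = {recall_a:.2f}, recall b = {recall_b:.2f}")
```

Under that assumption, recall on the minority class b is 0.78 and 0.80 in the 
two A runs, but 0.67, 0.25 and 0.15 in the three B runs.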


[1] http://stats.stackexchange.com/questions/60564/should-my-test-set-be-balanced-or-imbalanced
[2] http://www.cs.gsu.edu/~zding/research/benchmark-data.php
