Hey guys,
I guess in the end this is a question about methodology, and I could
write my own functions for sampling and evaluation, but I'm wondering
whether this problem has already been solved in scikit-learn.
I have a dataset in which samples need to be grouped for
cross-validation and evaluation: each row is one tuple out of a group
of related rows, so it shouldn't be considered in isolation.
Let me go over the set-up.
I'm trying to use a binary classifier (maybe logistic regression) to
match elements in set A with elements in set B. The cardinality of B
is much larger than that of A. You can think of the elements in B as a
bunch of imperfect copies of elements in A. The goal is to match each
element in A with its closest imperfect copy in B.
After some preprocessing, each element a in A has a small set of
candidates C_a (a subset of B), and the manually labeled data assigns
a 1 to the best candidate in C_a and a 0 to the rest. Note that the
number of candidates varies from one element of A to the next.
So each row in the data is a feature vector that comes from a tuple
(a in A, c in C_a)
and only one of the candidates c is labeled as the winner (1), e.g.
(a0, c0_0) : 0
(a0, c0_1) : 1
(a0, c0_2) : 0
(a1, c1_0) : 1
(a1, c1_1) : 0
...
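
To make the grouping concrete, a hand-rolled version of the split I
have in mind might look like the sketch below. It assigns whole groups
(all candidate rows for one element of A) to folds, so no group is
divided between train and test. The name grouped_folds and the
round-robin assignment of groups to folds are only illustrative
assumptions, not anything taken from scikit-learn.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Yield (train_idx, test_idx) pairs in which all rows sharing a
    group id end up on the same side of the split."""
    # Collect the row indices of each group, in first-seen order.
    rows_by_group = defaultdict(list)
    for i, g in enumerate(groups):
        rows_by_group[g].append(i)
    group_ids = list(rows_by_group)

    if n_folds < 2 or n_folds > len(group_ids):
        raise ValueError("n_folds must be between 2 and the number of groups")

    for k in range(n_folds):
        # Round-robin: every n_folds-th group goes into test fold k.
        test_groups = set(group_ids[k::n_folds])
        train, test = [], []
        for g in group_ids:
            (test if g in test_groups else train).extend(rows_by_group[g])
        yield train, test


# Rows from the example above: three candidates for a0, two for a1.
groups = ["a0", "a0", "a0", "a1", "a1"]
folds = list(grouped_folds(groups, n_folds=2))
```

Each yielded pair is a pair of row-index lists, so it could be used to
slice the feature matrix and the label vector in the usual way.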

Is there a way to use the scikit-learn functionality for
cross-validation and evaluation given this set-up?
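
On the evaluation side, the metric I would probably want is per group
rather than per row: for each element of A, take the candidate with
the highest classifier score and check whether it is the one labeled
1. A rough sketch of that idea follows. The name top1_group_accuracy
is hypothetical, the scores are made up for illustration, and ties are
broken in favor of the first row seen.

```python
def top1_group_accuracy(groups, labels, scores):
    """Fraction of groups whose highest-scoring row carries label 1."""
    best = {}  # group id -> (score, label) of its top-scoring row so far
    for g, y, s in zip(groups, labels, scores):
        if g not in best or s > best[g][0]:
            best[g] = (s, y)
    if not best:
        raise ValueError("no rows to evaluate")
    hits = sum(y for _, y in best.values())
    return hits / len(best)


groups = ["a0", "a0", "a0", "a1", "a1"]
labels = [0, 1, 0, 1, 0]
scores = [0.2, 0.7, 0.1, 0.4, 0.6]  # made-up classifier scores
acc = top1_group_accuracy(groups, labels, scores)
```

Here the top-scoring candidate for a0 is the labeled winner and the
one for a1 is not, so one of the two groups counts as a hit.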
Thanks!
-- 
 Hector

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
