This is something I've wanted to discuss for a while, and I'd better do it before the new framework is set in stone.
I could make this really long, so I'll try to keep it as concise as I can ;)

One main motivation for the ARL framework was the idea of working with a "linear" data flow rather than encapsulating objects, e.g.

    Standardize -> PCA -> NNet

instead of

    ChainedLearner(ChainedLearner(Standardize, PCA), NNet)

I don't recall the original motivations exactly. What I've personally found nice about the linear approach is:
- it's easier to read, because all (or at least big chunks) of your processing chain sits in a single Python list
- it's easier to add or remove components from such a list

What I didn't like in the first version of this framework is mainly that the linear approach is not used consistently, mostly when it comes to splitting the data and doing a typical train/evaluation setup on those splits. I've also had to resort to ugly hacks in situations where an element in the chain had to access something computed by an element other than the one just before it, which is where linear chains get ugly.

So, I thought it would be nice if the new framework let us do things really smoothly in a linear way, typically in the form:

    Dataset -> Preproc -> Split -> Learner -> Evaluator

but of course without enforcing such a rigid structure. And I thought I had come up with a cool solution: use a stack, so that an object along the line can consume more than just the object immediately before it. For instance, you could write:

    Dataset1 -> Dataset2 -> ConcatDataset

and ConcatDataset would know it should pop the last two things in the chain. (Note how I'm also suggesting that datasets could be used directly as primitive building blocks, but that's another topic.)

But thinking about it more, I decided it was a really bad idea, and a single word should be enough to convince you: VMatLanguage.

So, my current opinion is that there is no clear winner between linearization and encapsulation, and the best solution may be to use both: linearization for "Markovian" chains (i.e. where each component depends only on what's been computed by the previous one), and encapsulation for anything more complex. E.g.:

    ConcatSet(Set1, Set2) -> PCA -> NNet

As far as splitting / testing is concerned, I still think it would be nice to have it linearized, because it looks better, and I believe encapsulation may not be required there. But I don't know what Xavier's plan is for this, so I'll leave it for future discussion.

So, the feedback I'm looking for from this email is whether you think we should try to linearize the data flow as much as possible (and then find ways to keep things consistent and clear when the flow is not naturally linear), or mix in some encapsulation (and if so, suggest how, especially if you don't agree with me ;)

-- Olivier

_______________________________________________
Mailing list: https://launchpad.net/~piaget-dev
Post to     : piaget-dev@lists.launchpad.net
Unsubscribe : https://launchpad.net/~piaget-dev
More help   : https://help.launchpad.net/ListHelp
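For concreteness, here is a minimal sketch of the stack idea discussed in this email: a chain is a flat Python list evaluated left to right, each node popping as many previous results as it declares. All names here (Node, run_chain, ConcatDataset, Scale, n_inputs) are hypothetical illustrations, not part of any actual framework API.

```python
class Node:
    """A chain element; it pops n_inputs previous results off the stack."""
    n_inputs = 1

    def apply(self, *inputs):
        raise NotImplementedError

class Dataset(Node):
    n_inputs = 0  # a source: consumes nothing from the stack

    def __init__(self, rows):
        self.rows = rows

    def apply(self):
        return list(self.rows)

class ConcatDataset(Node):
    n_inputs = 2  # pops the last *two* results, as in Dataset1 -> Dataset2 -> ConcatDataset

    def apply(self, a, b):
        return a + b

class Scale(Node):
    """Stand-in for a one-input preprocessing step such as Standardize or PCA."""
    def __init__(self, factor):
        self.factor = factor

    def apply(self, rows):
        return [x * self.factor for x in rows]

def run_chain(chain):
    """Evaluate a flat chain left to right, using a stack of intermediate results."""
    stack = []
    for node in chain:
        # Pop the inputs this node needs (in the order they were pushed).
        args = [stack.pop() for _ in range(node.n_inputs)][::-1]
        stack.append(node.apply(*args))
    assert len(stack) == 1, "chain should reduce to a single result"
    return stack[0]

# The chain Dataset1 -> Dataset2 -> ConcatDataset -> Scale, written as a flat list:
result = run_chain([Dataset([1, 2]), Dataset([3, 4]), ConcatDataset(), Scale(10)])
print(result)  # [10, 20, 30, 40]
```

When every node has n_inputs = 1 this degenerates into exactly the Markovian pipelines above, so the stack machinery only shows up at the non-linear joints, which is where (per the VMatLanguage objection) the readability of a flat list starts to break down.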