Hi there!

Here is my opinion (other devs might see it differently).

Non-numerical data: Totally, this should be better documented.
- The first thing should be the string kernels, of course; we want those
documented well.
- Random forests, AFAIK, can deal with non-numerical data. I am not sure we
have an example of that, but feel free to add one.
- There might be more (want to contribute a list?)

Labels and data in the same file & preprocessing:
There is room for some of those things in Shogun, but as of now Shogun
is not a preprocessing library. It should load all the common data formats
(for example pandas dataframes, see this year's GSoC project on Arrow),
but we don't want to re-implement the pandas functionality itself.
Think of the other ML libs out there (some of them do a better job than
Shogun at dealing with certain data setups): most of them take the same
approach, where you preprocess your data further upstream before using the
library.
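
To make the "preprocess upstream" idea concrete, here is a minimal sketch in
Python that uses pandas only. The toy data and the column names (age, sex,
port, survived) are made up for illustration, and the sketch stops before any
Shogun call: it just produces a numeric matrix and a label vector that could
then be handed to a feature class such as RealFeatures.

```python
import io

import pandas as pd

# Hypothetical toy data: the features and the label ("survived") share one file.
raw_csv = io.StringIO(
    "age,sex,port,survived\n"
    "22,male,S,0\n"
    "38,female,C,1\n"
    "26,female,S,1\n"
    "35,male,Q,0\n"
)
df = pd.read_csv(raw_csv)

# Split the label column off from the feature columns.
labels = df["survived"].to_numpy(dtype="float64")
features = df.drop(columns=["survived"])

# One-hot encode the categorical columns into dummy variables.
encoded = pd.get_dummies(features, columns=["sex", "port"])

# Dense float64 matrix with one row per sample and one column per feature.
X = encoded.to_numpy(dtype="float64")

print(list(encoded.columns))
print(X.shape, labels.shape)
```

As far as I know, Shogun's dense features expect one sample per column, so a
transpose (X.T) may be needed before passing the matrix on; treat that as an
assumption to check against the docs.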

However, all the different ways of loading data in Shogun should be
documented in the examples, so feel free to send a patch here.
We also want to unify the interface, see the "user experience" GSoC project.

A real-world project that uses Shogun (like reproducing a Kaggle project)
would indeed be cool to see! Want to write one?


2018-02-08 19:32 GMT+00:00 Francisco Navarro <navarromorales...@gmail.com>:

> Hi there! My name is Paco. I'm a computer science student (degree) who
> is starting to learn ML and is interested in getting involved with the
> Shogun project in order to improve my knowledge of ML and open source
> project workflows.
> I'm writing to ask some questions about how to start using Shogun with
> real-world examples (I'm interested in using it, for example, in Kaggle
> competitions).
> It is true that there are lots of examples showing how to use Shogun,
> but most of them don't explain how to preprocess a dataset before
> converting it into a RealFeatures instance, and all the Python notebooks
> and doc examples use numerical data from start to end...
> So, I would like to know how I should proceed with Shogun when I have
> non-numerical data, and also when I have labels and features in the same
> file. Should I modify my dataset before I start using Shogun, in order to
> separate labels and features and to convert categorical data into some
> kind of dummy variables, or even encode each possible value as an
> integer? Or could I use Shogun for this purpose?
> At university I've used pandas for this kind of task and then
> scikit-learn for the algorithms (always in Python), and I think I could
> use pandas + Shogun in the same way. But since there are lots of IO
> classes such as CCSVFile or CStringFeatures, I think maybe I could use
> them instead of pandas (because I can only use pandas when working in
> Python, and it would be nice to use Shogun in other Shogun-supported
> languages like C++ or R).
> That's my question. I would like to know your opinions and, if possible,
> I would like to see real-world projects that use Shogun to solve problems.
> Greetings  --  Francisco (Paco) Navarro Morales.
