Hi there!

Here is my opinion (other devs might see it differently):

Non-numerical data: totally agree, this should be better documented.
- First should be the string kernels, of course; we want to have those
  documented well.
- Random forests can deal with non-numerical data AFAIK. I am not sure we
  have an example of that, but feel free to add one.
- There might be more (want to contribute a list?)
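
To give an idea of what a string kernel computes on raw strings, here is a
minimal pure-Python sketch of a k-mer spectrum kernel. It is an illustration
only, not Shogun's implementation; the function names kmer_counts and
spectrum_kernel are made up for this example.

```python
from collections import Counter


def kmer_counts(s, k):
    """Count every contiguous substring of length k in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(a, b, k=2):
    """Inner product of the k-mer count vectors of strings a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # A Counter returns 0 for k-mers it has not seen, so only shared
    # k-mers contribute to the sum.
    return sum(n * cb[kmer] for kmer, n in ca.items())


# Strings sharing many k-mers get a large value, unrelated ones get 0.
similar = spectrum_kernel("abab", "baba", k=2)
unrelated = spectrum_kernel("aaaa", "bbbb", k=2)
```

The point is that the kernel works directly on the strings, so no manual
conversion to numerical features is needed.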

Labels and data in the same file & preprocessing:
There is room for some of those things in Shogun, but as of now, Shogun
is not a preprocessing library. It should load all the common data formats
(such as pandas dataframes, see this year's GSoC project on
Arrow), but we don't want to re-implement the pandas functionality itself.
Think of the other ML libs out there (some of them do a better job than
Shogun in dealing with certain data setups): most of them take the same
approach, where you preprocess your data further upstream before using the
library.
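
As a rough sketch of what "preprocess upstream" could look like with pandas:
split off the label column, turn categorical columns into dummy variables,
and end up with plain numerical arrays. The column names and the toy data
are made up, and the Shogun calls mentioned in the final comment are an
assumption about the Python interface, so check them against your version.

```python
import io

import pandas as pd

# Hypothetical toy data: features and the label share one CSV file.
csv_text = """age,city,income,label
25,Madrid,30000,1
40,Berlin,52000,-1
31,Madrid,41000,1
58,London,60000,-1
"""

df = pd.read_csv(io.StringIO(csv_text))

# 1) Separate the label column from the feature columns.
labels = df["label"].to_numpy(dtype=float)
features = df.drop(columns=["label"])

# 2) Replace the categorical column by dummy (one-hot) variables.
features = pd.get_dummies(features, columns=["city"])

# 3) Convert to a dense float64 matrix with one row per sample.
X = features.to_numpy(dtype=float)

# Assumption: Shogun's dense features store one column per sample, so the
# matrix would be transposed before handing it over, along the lines of
# RealFeatures(X.T) together with BinaryLabels(labels).
```

Encoding each category as an integer instead is also possible, but it
imposes an arbitrary order on the categories, which is why dummy variables
are the usual default.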

However, all the different ways of loading data in Shogun should be
documented in the examples, so feel free to send a patch here.
We also want to unify the interface, see the "user experience" GSoC project.

A real-world project that uses Shogun (like reproducing a Kaggle project)
would indeed be cool to see! Want to write one?

Best
H


2018-02-08 19:32 GMT+00:00 Francisco Navarro <navarromorales...@gmail.com>:

> Hi there! My name is Paco. I'm a computer science student (degree) who
> is starting to learn ML and is interested in getting involved with the
> Shogun project in order to improve my knowledge of ML and open source
> project workflows.
>
> I am writing to ask some questions about how to start using Shogun
> with real-world examples (I'm interested in using it, for example, in
> Kaggle competitions).
>
> It is true that there are lots of examples showing how to use Shogun,
> but most of them don't explain how to preprocess a dataset before
> converting it into a RealFeatures instance, and all the Python notebooks
> and doc examples use numerical data from start to end...
>
> So, I would like to know how I should proceed with Shogun when I have
> non-numerical data, and also when I have labels and features in the same
> file.
>
> Should I modify my dataset before using Shogun, in order to separate
> labels and features and to convert categorical data into some kind of
> dummy variables, or even encode each possible value as an integer? Or
> could I use Shogun for this purpose?
>
> At university I've used pandas for this kind of task and then
> scikit-learn for the algorithms (always in Python), and I think I could
> use pandas + Shogun in the same way. But since there are lots of IO
> classes, such as CCSVFile or CStringFeatures, I think maybe I could use
> them instead of pandas (because I can use pandas only when working in
> Python, and using Shogun would be nice in other Shogun-supported
> languages like C++ or R).
>
> That's my question. I would like to know your opinions and, if possible,
> I would like to see real-world projects that use Shogun to solve problems.
>
> Greetings  --  Francisco (Paco) Navarro Morales.
>
