Re: Randa Meeting: Notes on Voice Control in KDE

2017-09-19 Thread Aditya Mehra
Hi Frederik,

It's awesome that you are trying out Mycroft. Do check out some of the cool
Plasma skills Mycroft already has to control your workspace; these are
installable directly from the plasmoid.

In addition, I understand that the plasmoid isn't packaged yet and that
installing it manually from git can be a long procedure, but if you are
running Kubuntu 17.04 or higher, KDE Neon, or the Fedora 25/26 Spin, I have
written a small installer to make installation easier for whoever wants to try
out Mycroft and the plasmoid on Plasma. It installs Mycroft and the plasmoid
together, including the Plasma desktop skills.

It's still new and might have bugs. If you want to give it a go, you can get
the AppImage for the Mycroft installer here:
https://github.com/AIIX/mycroft-installer/releases/

I think it would be great if more people in the community gave Mycroft and the
plasmoid a go; it would certainly help with looking at the finer details of
where improvements can be made with Mycroft.

I am also available for a discussion at any time, or to answer any queries,
installation issues, etc. You can ping me on Mycroft's chat channels
(user handle: @aix) or over email.

Regards,
Aditya


From: Frederik Gladhorn <gladh...@kde.org>
Sent: Tuesday, September 19, 2017 2:24:53 AM
To: Aditya Mehra; kde-community@kde.org
Cc: Thomas Pfeiffer
Subject: Re: Randa Meeting: Notes on Voice Control in KDE

Hello Aditya :)

thanks for your mail. I have tried Mycroft a little and am very interested in
it as well (I didn't manage to get the plasmoid up and running, but that's
more due to lack of effort than anything else). Your talk and demo at Akademy
were very impressive.

We did briefly touch on Mycroft, and it certainly is a project that we should
cooperate with in my opinion. I like to start looking at the big picture and
trying to figure out the details from that sometimes; if Mycroft covers a lot
of what we intend to do, then that's perfect. I just started looking around and
simply don't feel like I can recommend anything yet, since I'm pretty new to
the topic.

Your mail added one more component to the list that I didn't think about at
all: networking and several devices working together in some form.

On Saturday 16 September 2017 00:08:10 CEST Aditya Mehra wrote:
> Hi Everyone :),
>
>
> Firstly I would like to start off by introducing myself. I am Aditya, and I
> have been working on the Mycroft - Plasma integration project for some time,
> which includes front-end work like the plasmoid as well as
> back-end integration with various Plasma desktop features (KRunner,
> Activities, KDE Connect, wallpapers, etc.).
>
Nice, I didn't know that there was more than the plasmoid! This is very
interesting to hear; I'll have to have a look at what you did so far.

>
> I have carefully read through the email and would like to add some points to
> this discussion (P.S. Please don't consider me partial to the Mycroft
> project in any way; I am not employed by them but am contributing full time
> out of my romanticism for Linux as a platform and the wish to have voice
> control over my own Plasma desktop environment in general). Apologies in
> advance for the long email, but here are some of my thoughts and points I
> would like to add to the discussion:
>
>
> a) Mycroft AI is an open-source digital assistant trying to bring the kind
> of AI assistant / voice control platform that proprietary operating systems
> ship, such as Google Now, Siri, Cortana and Bixby, to an open-source
> environment.
>
Yes, that does align well.
>
> b) The Mycroft project is based on the same principle: having a
> conversational interface with your computer, but while maintaining privacy
> and independence based on the user's own choice. (explained ahead)
>
>
> c) The basics of how Mycroft works:
>
> Mycroft AI is written in Python and mainly runs four services:
>
> i) A websocket server, more commonly referred to as the messagebus, which is
> responsible for accepting and creating websocket connections used to
> talk between clients (for example: the plasmoid, mobile, hardware, etc.).
>
> ii) The second service is the 'Adapt' intent parser, which acts
> as a platform to understand the user's intent, for example "open firefox",
> "create a new tab" or "dict mode", with multi-language support, and
> performs the action that the user states.

I'd like to learn more about this part; I guess it's under heavy development.
It did work nicely for me with the Raspberry Pi Mycroft version. But glancing
at the code, is this based on a few heuristics at the moment, or is there a
collection of data and machine learning involved?
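
For reference, Adapt is essentially a keyword/rule-based engine rather than a
trained model. A minimal sketch of registering and matching an intent with the
adapt-parser package (the vocabulary and intent names below are made up purely
for illustration):

# Minimal sketch of keyword-based intent parsing with Adapt
# (assumes the adapt-parser package; names are illustrative only).
from adapt.engine import IntentDeterminationEngine
from adapt.intent import IntentBuilder

engine = IntentDeterminationEngine()

# Register the vocabulary the parser should recognise.
for word in ("open", "launch", "start"):
    engine.register_entity(word, "OpenKeyword")
for app in ("firefox", "dolphin", "kate"):
    engine.register_entity(app, "Application")

# An intent is a set of required (and optional) keyword slots.
open_app = IntentBuilder("OpenAppIntent") \
    .require("OpenKeyword") \
    .require("Application") \
    .build()
engine.register_intent_parser(open_app)

# Determine the intent for an utterance coming out of the STT step.
for result in engine.determine_intent("open firefox"):
    if result and result.get("confidence", 0) > 0:
        print(result.get("intent_type"), result.get("Application"))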

>
> iii) The third service is the STT (speech-to-text) service: this service
> is responsible for converting the user's speech to text, which is sent over
> to the Adapt interface to perform the specified intent.

Re: Randa Meeting: Notes on Voice Control in KDE

2017-09-18 Thread Thomas Pfeiffer

> On 16. Sep 2017, at 00:08, Aditya Mehra  wrote:
> 
> Hi Everyone :), 
> 
> Firstly I would like to start off by introducing myself. I am Aditya, and I
> have been working on the Mycroft - Plasma integration project for some time,
> which includes front-end work like the plasmoid as well as back-end
> integration with various Plasma desktop features (KRunner, Activities,
> KDE Connect, wallpapers, etc.).
> 
> I have carefully read through the email and would like to add some points to
> this discussion (P.S. Please don't consider me partial to the Mycroft project
> in any way; I am not employed by them but am contributing full time out of my
> romanticism for Linux as a platform and the wish to have voice control over
> my own Plasma desktop environment in general). Apologies in advance for the
> long email, but here are some of my thoughts and points I would like to add
> to the discussion:
> 
> a) Mycroft AI is an open-source digital assistant trying to bring the kind
> of AI assistant / voice control platform that proprietary operating systems
> ship, such as Google Now, Siri, Cortana and Bixby, to an open-source
> environment.
> 
> b) The Mycroft project is based on the same principle: having a
> conversational interface with your computer, but while maintaining privacy
> and independence based on the user's own choice. (explained ahead)
> 
> c) The basics of how Mycroft works:
> Mycroft AI is written in Python and mainly runs four services:
> i) A websocket server, more commonly referred to as the messagebus, which is
> responsible for accepting and creating websocket connections used to
> talk between clients (for example: the plasmoid, mobile, hardware, etc.).
> ii) The second service is the 'Adapt' intent parser, which acts as
> a platform to understand the user's intent, for example "open firefox",
> "create a new tab" or "dict mode", with multi-language support, and performs
> the action that the user states.
> iii) The third service is the STT (speech-to-text) service: this service
> is responsible for converting the user's speech to text, which is sent over
> to the Adapt interface to perform the specified intent.
> iv) The fourth service is called "Mimic", which, much like the eSpeak TTS
> engine, converts text to speech, except Mimic does it with customized
> voices and support for various formats.
> 
> d) The Mycroft project is under the Apache license, which means it is
> completely open and customizable: every interested party can fork their own
> customized environment or even drastically rewrite parts of the back end that
> they feel would be suitable for their own use case, including the ability to
> host their own instance if they feel mycroft-core upstream cannot reach those
> levels of functionality. Additionally, Mycroft can also be configured to run
> headless.
> 
> e) With regard to privacy concerns and the use of Google STT, the upstream
> Mycroft community is already working towards moving to Mozilla DeepSpeech as
> its main STT engine as it gets more mature (one of their top-ranked goals),
> but on the sidelines there are already forks that are using completely
> offline STT interfaces, for example the "jarbas ai" fork, and everyone in the
> community is trying to integrate with more open-source voice models like
> CMU Sphinx etc. This, sadly, I would call a battle of data availability and
> community contribution to voice versus already having a Google-trained engine
> with the advantages of proprietary multi-language support and highly trained
> voice models.
> 
> f) The upstream Mycroft community is currently very new in terms of larger
> open-source projects, but it is very open to interacting with everyone from
> the KDE community and developers to extend their platform to the Plasma
> desktop environment, and it is committed to providing this effort and its
> support in all ways. That includes myself, constantly looking forward to
> integrating even more with Plasma and KDE applications and projects on all
> fronts, including cool functionality, accessibility, dictation mode, etc.
> 
> g) Some goodies about Mycroft I would like to add: the "hey mycroft" wake
> word is completely customizable and you can change it to whatever suits your
> taste (whatever phonetic names PocketSphinx accepts). Additionally, as a
> community you can also decide not to use Mycroft's servers or services at all
> and can define your own API settings for things like Wolfram Alpha, wake
> words and other API calls, including data telemetry and STT; there is no
> requirement to use Google STT or the default Mycroft Home API services even
> currently.
> 
> h) As the project is based on Python, the best way I have come across for
> interacting with all Plasma services is through D-Bus interfaces, and

Re: Randa Meeting: Notes on Voice Control in KDE

2017-09-15 Thread Aditya Mehra
Mycroft's approach on the technical side is also not limited to D-Bus:
developers who prefer not to interact with D-Bus can choose to directly expose
functionality by using C types in the functions they would like to expose to
voice interaction.
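
As a rough sketch of what the D-Bus route looks like from Python (this assumes
the dbus-python bindings and uses the standard org.freedesktop.Notifications
service only as a stand-in for the KDE-specific interfaces a skill would
actually call):

# Rough sketch of calling a session D-Bus service from Python
# (requires dbus-python; the Notifications service stands in for
# KDE-specific interfaces such as the ones the Plasma skills use).
import dbus

bus = dbus.SessionBus()
proxy = bus.get_object("org.freedesktop.Notifications",
                       "/org/freedesktop/Notifications")
notify = dbus.Interface(proxy, "org.freedesktop.Notifications")

# Notify(app_name, replaces_id, app_icon, summary, body,
#        actions, hints, expire_timeout)
notify.Notify("mycroft-skill", 0, "", "Mycroft",
              "Handled a voice intent", [], {}, 5000)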


i) There are already awesome Mycroft skills being developed by the open-source
community, which include interaction with the Plasma desktop and things like
Home Assistant, Mopidy, Amarok, Wikipedia (migrating to Wikidata), open
weather, other desktop applications, and many cloud services like image
recognition, and more, at: https://github.com/MycroftAI/mycroft-skills
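
To give a feel for what such a skill looks like, here is a minimal sketch
following the usual MycroftSkill pattern; the skill name, vocabulary and
dialog files referenced below are invented for illustration, and module paths
may vary between mycroft-core versions:

# Minimal sketch of a Mycroft skill with an Adapt intent (names are
# invented; vocab/en-us/Hello.voc and Plasma.voc plus a
# dialog/en-us/plasma.hello.dialog file would provide the wording).
from adapt.intent import IntentBuilder
from mycroft.skills.core import MycroftSkill


class PlasmaHelloSkill(MycroftSkill):

    def initialize(self):
        intent = IntentBuilder("PlasmaHelloIntent") \
            .require("Hello").require("Plasma").build()
        self.register_intent(intent, self.handle_plasma_hello)

    def handle_plasma_hello(self, message):
        # speak_dialog() picks a response line from the dialog file.
        self.speak_dialog("plasma.hello")


def create_skill():
    return PlasmaHelloSkill()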


j) I would personally, and on behalf of upstream, like to invite everyone
interested in taking voice control and interaction with digital assistants
forward on the Plasma desktop and Plasma Mobile platforms to come and join the
Mycroft Mattermost chat: https://chat.mycroft.ai where we can create our
own KDE channel and directly discuss and talk to the upstream Mycroft team
(they are more than happy to interact directly with everyone from KDE on a
one-to-one basis about queries and concerns, and also to take voice control
and digital assistance to the next level), or through some IRC channel where
everyone, including myself and upstream, can interact to take this forward.



Regards,

Aditya


Re: Randa Meeting: Notes on Voice Control in KDE

2017-09-15 Thread Thomas Pfeiffer

> On 15. Sep 2017, at 12:54, Sebastian Kügler  wrote:
> 
> Hey!
> 
> Interesting discussion. Did you guys factor in the work done by Mycroft
> on that front? I think there's a great deal of overlap, and already
> some really interesting results shown for example in the Mycroft
> Plasmoid:

Exactly. Please do not reinvent the wheel here. This is a job for Mycroft, 
which has already solved the vast majority of problems you’d need to solve, and 
is already proven to work in Plasma.
Duplicating that work would just be a waste.

The big problem that Mycroft currently has is that it uses Google for the voice 
recognition, but our goal there should be to push for adoption of Mozilla 
Common Voice in Mycroft, instead of redoing everything Mycroft does.

So yea, I’m 1.000% for allowing voice control in KDE applications as well as 
Plasma, but I’m 99% sure that the way to go there is Mycroft.

Cheers,
Thomas


Re: Randa Meeting: Notes on Voice Control in KDE

2017-09-15 Thread Sebastian Kügler
Hey!

Interesting discussion. Did you guys factor in the work done by Mycroft
on that front? I think there's a great deal of overlap, and already
some really interesting results shown for example in the Mycroft
Plasmoid:

https://www.youtube.com/watch?v=sUhvKTq6c40 (somewhat dated, but gives
a decent impression)

Cheers,
-- sebas


Randa Meeting: Notes on Voice Control in KDE

2017-09-15 Thread Frederik Gladhorn
We here at Randa had a little session about voice recognition and control of 
applications.
We tried to roughly define what we mean by that - a way of talking to the 
computer as Siri/Cortana/Alexa/Google Now and other projects demonstrate, 
conversational interfaces. We agreed that we want this and people expect it
more and more.
Striking a balance between privacy and getting some data to enable this is a 
big concern, see later.
While there is general interest (almost everyone here went out of their way to
join the discussion), it didn't seem like anyone here at the moment wanted to
drive this forward themselves, so it may just not go anywhere due to lack of 
people willing to put in time. Otherwise it may be something worth considering 
as a community goal.


The term "intent" seems to be OK for the event that arrives at the 
application. More on that later.

We tried to break down the problem and arrived at two possible scenarios:
1) voice recognition -> string representation in user's language
1.1) translation to English -> string representation in English
2) English sentence -> English string to intent

or alternatively:
1) voice recognition -> string representation in user's language
2) user language sentence -> user language string to intent

3) applications get "intents" and react to them.

So basically one open question is if we need a translation step or if we can 
directly translate from a string in any language to an intent.

We do not think it feasible nor desirable to let every app do its own magic.
Thus a central "daemon" process does step 1, listening to audio and
translating it to a string representation.
Then, assuming we want to do translation step 1.1, we need to find a way to do
the translation.

For step 1, Mozilla DeepSpeech seems like a candidate; it seems to be
progressing quickly.

We assume that mid-term we need machine learning for step 2 - gather sample 
sentences (somewhere between thousands and millions) to enable the step of 
going from sentence to intent.
We might get away with a set of simple heuristics to get this kick-started, 
but over time we would want to use machine learning to do this step. Here it's 
important to gather enough sample sentences to be able to train a model. We 
basically assume we need to encourage people to participate and send us the 
recognized sentences to get enough raw material to work with.
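
Purely as an illustration of what the simplest learned version of that step
could look like once sample sentences are collected (a toy sketch assuming
scikit-learn and a handful of hand-labelled examples, not a proposal for the
real pipeline):

# Toy sketch: learning sentence -> intent from collected samples
# (assumes scikit-learn; the sentences and labels are made up).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "play some music", "pause the song", "next track",
    "write a new email to Volker", "compose a mail",
    "what is the weather like tomorrow", "will it rain today",
]
intents = [
    "media.play", "media.pause", "media.next",
    "email.compose", "email.compose",
    "weather.query", "weather.query",
]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(sentences, intents)

print(model.predict(["send an email to Volker"])[0])  # e.g. email.compose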

One interesting point is that ideally we can keep context, so that users
can do follow-up queries/commands.
Some of the context may be expressed with state machines (talk to Emanuelle
about that).
Clearly the whole topic needs research, we want to build on other people's 
stuff and cooperate as much as possible.

Hopefully we can find some centralized daemon thing to run on Linux and do a 
lot of the work in step 1 and 2 for us.
Step 3 requires work on our side (in Qt?) for sure.
What should intents look like? lists of property bags?
Should apps have a way of saying which intents they support?
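
To make the property-bag idea concrete, one entirely hypothetical shape for an
intent, and for an application declaring which intents it supports, could be:

# Hypothetical sketch only: an intent as a "property bag" plus a
# registry where applications declare the intents they handle.
# None of this is an existing API; it is just to frame the question.
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Intent:
    name: str                                  # e.g. "email.compose"
    properties: Dict[str, str] = field(default_factory=dict)
    confidence: float = 1.0


class IntentRegistry:
    def __init__(self):
        self._handlers: Dict[str, Callable[[Intent], None]] = {}

    def declare(self, name: str, handler: Callable[[Intent], None]):
        # An app announces that it supports an intent.
        self._handlers[name] = handler

    def dispatch(self, intent: Intent) -> bool:
        handler = self._handlers.get(intent.name)
        if handler is None:
            return False                       # nobody declared support
        handler(intent)
        return True


registry = IntentRegistry()
registry.declare("email.compose",
                 lambda i: print("compose to", i.properties.get("recipient")))
registry.dispatch(Intent("email.compose", {"recipient": "Volker"}))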

A starting point could be to use the common media player interface to control 
the media player using voice.
Should exposing intents be a dbus thing to start with?
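
For the media player starting point, the existing common interface is MPRIS
over D-Bus; a small sketch of what an intent handler could call (assuming
dbus-python and some running MPRIS-capable player - the bus name below is just
an example):

# Sketch of driving a media player through the common MPRIS D-Bus
# interface (assumes dbus-python and a running MPRIS player; the
# concrete bus name is only an example).
import dbus

bus = dbus.SessionBus()
player = bus.get_object("org.mpris.MediaPlayer2.vlc",
                        "/org/mpris/MediaPlayer2")
controls = dbus.Interface(player, "org.mpris.MediaPlayer2.Player")


def handle_media_intent(action):
    # Map a recognised intent to the corresponding MPRIS call.
    if action == "play":
        controls.Play()
    elif action == "pause":
        controls.Pause()
    elif action == "next":
        controls.Next()


handle_media_intent("pause")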

For querying data, we may want to interface with Wikipedia, MusicBrainz, etc.,
but is that more part of the central daemon or should there be an app?

We probably want to be able to start applications when the appropriate command
arrives: "write a new email to Volker" launches Kube with the composer open and
ideally the recipient filled out, or it may ask the user "I don't know who that
is, please help me...".
So how do applications define what intents they process?
How can applications ask for details? After receiving an intent they may need
to ask for more data.

There is also the kpurpose framework; I have no idea what it does and should
read up on it.

This is likely to be completely new input while an app is in some state, maybe
with an open modal dialog; will there be new crashes because we're not
prepared?
Are there patterns/building blocks to make it easier when an app is in a
certain state?
Maybe we should look at transactional computing and finite state machines? We
could look at network protocols as an example; they have error recovery etc.

What would integration with online services look like? A lot of this is about
querying information.
Should it be offline by default and delegate to online services when the user
asks for it?

We need to build, for example, public transport app integration.
For centralized AI we should join other projects.
Maybe Qt will provide the connection to 3rd party engines on Windows and
macOS; that would be a good testing ground.

And to end with a less serious idea, we need a big bike-shed discussion about
wake-up words.
We already came up with: OK KDE (try saying that out loud), OK Konqui, or Oh
Kate!

I hope some of this makes sense. I'd love to see more people stepping up to
figure out what is needed and move it forward :)

Cheers,
Frederik