Re: Randa Meeting: Notes on Voice Control in KDE
Hi Frederik,

It's awesome that you are trying out Mycroft. Do check out some of the cool Plasma skills Mycroft already has to control your workspace; these are installable directly from the plasmoid.

I understand that the plasmoid isn't packaged yet and can be a long procedure to install manually from git, but if you are running Kubuntu 17.04 or higher, KDE Neon, or the Fedora 25/26 Spin, I have written a small installer to make installation easier for whoever wants to try out Mycroft and the plasmoid on Plasma. It installs Mycroft and the plasmoid together, including the Plasma desktop skills. It's still new and might have bugs; if you want to give it a go, you can get the AppImage for the Mycroft installer here: https://github.com/AIIX/mycroft-installer/releases/

I think it would be great if more people in the community gave Mycroft and the plasmoid a go; it would certainly help with looking at the finer details of where improvements can be made. I am also available for a discussion at any time, or to answer any queries, installation issues etc. You can ping me on Mycroft's chat channels (user handle: @aix) or over email.

Regards,
Aditya

From: Frederik Gladhorn <gladh...@kde.org>
Sent: Tuesday, September 19, 2017 2:24:53 AM
To: Aditya Mehra; kde-community@kde.org
Cc: Thomas Pfeiffer
Subject: Re: Randa Meeting: Notes on Voice Control in KDE

Hello Aditya :)

Thanks for your mail. I have tried Mycroft a little and am very interested in it as well (I didn't manage to get the plasmoid up and running, but that's more due to lack of effort than anything else). Your talk and demo at Akademy were very impressive.

We did briefly touch on Mycroft, and it certainly is a project that we should cooperate with in my opinion. I sometimes like to start by looking at the big picture and figuring out the details from there; if Mycroft covers a lot of what we intend to do, then that's perfect.
I just started looking around and simply don't feel like I can recommend anything yet, since I'm pretty new to the topic. Your mail added one more component to the list that I didn't think about at all: networking and several devices working together in some form.

On lørdag 16. september 2017 00.08.10 CEST Aditya Mehra wrote:
> Hi Everyone :),
>
> Firstly I would like to start off by introducing myself. I am Aditya; I have
> been working on the Mycroft - Plasma integration project for some time,
> which includes front-end work like the plasmoid as well as back-end
> integration with various Plasma desktop features (KRunner, Activities,
> KDE Connect, wallpapers etc.).

Nice, I didn't know that there was more than the plasmoid! This is very interesting to hear; I'll have to have a look at what you did so far.

> I have carefully read through the email and would like to add some points to
> this discussion. (P.S. Please don't consider me partial to the Mycroft
> project in any way; I am not employed by them but am contributing full time
> out of my romantics for Linux as a platform and the will to have voice
> control over my own Plasma desktop environment.) Apologies for the long
> email in advance, but here are some thoughts and points I would like to add
> to the discussion:
>
> a) Mycroft AI is an open source digital assistant trying to bridge the gap
> between proprietary operating systems and their AI assistant / voice
> control platforms such as Google Now, Siri, Cortana, Bixby etc. in an
> open source environment.

Yes, that does align well.

> b) The Mycroft project is based on the same principles of having a
> conversational interface with your computer, but maintains privacy and
> independence based on the user's own choice.
> (explained ahead)
>
> c) The basics of how Mycroft works:
>
> Mycroft AI is based on Python and mainly runs four services:
>
> i) A websocket server, more commonly referred to as the messagebus, which
> is responsible for accepting and creating websocket connections to talk
> between clients (for example the plasmoid, mobile, hardware etc.).
>
> ii) The second service is the 'Adapt' intent parser, a platform for
> understanding the user's intent, for example "open firefox", "create a new
> tab" or "dict mode", with multi-language support; it performs the action
> that a user states.

I'd like to learn more about this part; I guess it's under heavy development. It did work nicely for me with the Raspberry Pi Mycroft version. But glancing at the code, this is based on a few heuristics at the moment, or is there a collection of data and machine learning involved?

> iii) The third service is the STT (Speech to text service): This service
> is res
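The messagebus in (i) is easy to picture concretely: clients such as the plasmoid exchange JSON envelopes over a websocket. Below is a minimal sketch of that wire format; the default port 8181 and the envelope fields follow Mycroft's conventions as far as I know, and the helper functions are illustrative, not the official client library:

```python
import json

# Mycroft's messagebus (by default ws://localhost:8181/core) exchanges
# JSON envelopes of the form {"type": ..., "data": ..., "context": ...}.
# serialize/deserialize sketch that envelope; a real client would send
# the resulting string over a websocket connection.

def serialize(msg_type, data=None, context=None):
    """Encode one bus message in the JSON wire format."""
    return json.dumps({"type": msg_type,
                       "data": data or {},
                       "context": context or {}})

def deserialize(raw):
    """Decode a wire message back into (type, data)."""
    msg = json.loads(raw)
    return msg["type"], msg.get("data", {})

# Example: ask the TTS service to speak, as a plasmoid client might.
wire = serialize("speak", {"utterance": "Hello from Plasma"})
```

Anything on the bus, skill or front-end alike, speaks this one format, which is why a plasmoid, a phone, and a Raspberry Pi can all act as clients of the same daemon.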
Re: Randa Meeting: Notes on Voice Control in KDE
> On 16. Sep 2017, at 00:08, Aditya Mehra wrote:
>
> Hi Everyone :),
>
> Firstly I would like to start off by introducing myself. I am Aditya; I have
> been working on the Mycroft - Plasma integration project for some time,
> which includes front-end work like the plasmoid as well as back-end
> integration with various Plasma desktop features (KRunner, Activities,
> KDE Connect, wallpapers etc.).
>
> I have carefully read through the email and would like to add some points to
> this discussion. (P.S. Please don't consider me partial to the Mycroft
> project in any way; I am not employed by them but am contributing full time
> out of my romantics for Linux as a platform and the will to have voice
> control over my own Plasma desktop environment.) Apologies for the long
> email in advance, but here are some thoughts and points I would like to add
> to the discussion:
>
> a) Mycroft AI is an open source digital assistant trying to bridge the gap
> between proprietary operating systems and their AI assistant / voice
> control platforms such as Google Now, Siri, Cortana, Bixby etc. in an
> open source environment.
>
> b) The Mycroft project is based on the same principles of having a
> conversational interface with your computer, but maintains privacy and
> independence based on the user's own choice. (explained ahead)
>
> c) The basics of how Mycroft works:
> Mycroft AI is based on Python and mainly runs four services:
> i) A websocket server, more commonly referred to as the messagebus, which
> is responsible for accepting and creating websocket connections to talk
> between clients (for example the plasmoid, mobile, hardware etc.).
> ii) The second service is the 'Adapt' intent parser, a platform for
> understanding the user's intent, for example "open firefox", "create a new
> tab" or "dict mode", with multi-language support; it performs the action
> that a user states.
> iii) The third service is the STT (speech to text) service: it is
> responsible for the speech-to-text conversion; the resulting text is sent
> over to the Adapt interface to perform the specified intent.
> iv) The fourth service is called "Mimic"; much like the espeak TTS engine,
> it performs the conversion of text to speech, except Mimic does it with
> customized voices and support for various formats.
>
> d) The Mycroft project is under the Apache license, which means it is
> completely open: every interested party can fork their own customized
> environment, or even drastically rewrite parts of the back end to suit
> their own use case, including hosting their own instance if they feel
> mycroft-core upstream cannot reach those levels of functionality.
> Additionally, Mycroft can also be configured to run headless.
>
> e) With regards to privacy concerns and the use of Google STT, the upstream
> Mycroft community is already working towards moving to Mozilla DeepSpeech
> as its main STT engine as it matures (one of their top-ranked goals), but
> on the sidelines there are already forks using completely offline STT
> interfaces (for example the "jarbas ai" fork), and everyone in the
> community is trying to integrate more open source voice-trained models like
> CMU Sphinx etc. This, sadly, I would call a battle of data availability and
> community contribution to voice versus an already-trained Google engine
> with the advantages of proprietary multi-language support and highly
> trained voice models.
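Point (e) is essentially about a plug-in seam: if the rest of the stack only ever sees `transcribe(audio) -> text`, then Google STT, DeepSpeech, or an offline CMU Sphinx model can be swapped behind it. A minimal sketch of that seam follows; all class and registry names here are made up for illustration and are not Mycroft's actual API:

```python
from abc import ABC, abstractmethod

# Sketch of the pluggable-STT idea from points (d)/(e): the rest of the
# assistant only depends on this interface, so backends can be swapped
# via configuration without touching the intent pipeline.

class STTBackend(ABC):
    @abstractmethod
    def transcribe(self, audio: bytes) -> str: ...

class EchoBackend(STTBackend):
    """Stand-in backend for testing: 'transcribes' to a fixed string."""
    def __init__(self, text):
        self.text = text
    def transcribe(self, audio):
        return self.text

# A config key like "stt.module" would select an entry from a registry.
BACKENDS = {"echo": lambda: EchoBackend("open firefox")}

def load_stt(name: str) -> STTBackend:
    """Instantiate the backend named in the configuration."""
    return BACKENDS[name]()
```

A privacy-focused fork then only needs to register an offline backend under a new name; nothing downstream changes.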
> f) The upstream Mycroft community is still very new in terms of larger
> open source projects, but is very open to interacting with everyone from
> the KDE community, and its developers want to extend their platform to the
> Plasma desktop environment and are committed to supporting this effort in
> all ways. That includes myself: I am constantly looking forward to
> integrating even more with Plasma and with KDE applications and projects on
> all fronts, including accessibility, a dictation mode, and other cool
> functionality.
>
> g) Some goodies about Mycroft I would like to add: the "hey mycroft" wake
> word is completely customizable, and you can change it to whatever suits
> your taste (whatever phonetic names PocketSphinx accepts). Additionally, as
> a community you can also decide not to use Mycroft's servers or services at
> all, and can define your own API settings for things like Wolfram Alpha,
> wake words and other API calls, including data telemetry and STT; there is
> no requirement to use Google STT or the default Mycroft Home API services
> even currently.
>
> h) As the project is based on Python, the best way I have come across of
> interacting with all Plasma services is through D-Bus interfaces, and
this approach on the technical side is also not limited to D-Bus: developers who prefer not to interact with D-Bus can choose to directly expose functionality to voice interaction by using C types in the functions they would like to expose.

i) There are already awesome Mycroft skills being developed by the open source community, including interaction with the Plasma desktop and things like Home Assistant, Mopidy, Amarok, Wikipedia (migrating to Wikidata), OpenWeather, other desktop applications, and many cloud services like image recognition, at: https://github.com/MycroftAI/mycroft-skills

j) Personally, and on behalf of upstream, I would like to invite everyone interested in taking voice control and interaction with digital assistants forward on the Plasma desktop and Plasma Mobile platforms to come and join the Mycroft Mattermost chat at https://chat.mycroft.ai, where we can create our own KDE channel and talk directly to the upstream Mycroft team (they are more than happy to interact with everyone from KDE on a one-to-one basis, to answer queries and concerns, and to take voice control and digital assistance to the next level), or through an IRC channel where everyone, including myself and upstream, can interact to take this forward.

Regards,
Aditya

From: kde-community <kde-community-boun...@kde.org> on behalf of Frederik Gladhorn <gladh...@kde.org>
Sent: Friday, September 15, 2017 1:09 PM
To: kde-community@kde.org
Subject: Randa Meeting: Notes on Voice Control in KDE

[original message snipped; the full text appears at the end of the thread]
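Point (h) suggests driving Plasma services over D-Bus from Python, and media-player control is the concrete case this thread keeps returning to. The MPRIS2 specification gives every compliant player (Amarok, VLC, etc.) the same bus name pattern, object path, and interface, so one generic helper covers them all. The MPRIS names below come from that spec; the helper functions themselves are a hypothetical sketch, not an existing skill:

```python
import subprocess

# MPRIS2 gives every compliant player the same D-Bus surface:
#   bus name:  org.mpris.MediaPlayer2.<player>
#   path:      /org/mpris/MediaPlayer2
#   interface: org.mpris.MediaPlayer2.Player (Play, Pause, Next, ...)
# A Python voice skill can therefore drive playback generically.

def mpris_command(player: str, method: str) -> list:
    """Build the dbus-send invocation for one Player method call."""
    return [
        "dbus-send", "--session", "--type=method_call",
        "--dest=org.mpris.MediaPlayer2.%s" % player,
        "/org/mpris/MediaPlayer2",
        "org.mpris.MediaPlayer2.Player.%s" % method,
    ]

def pause(player="amarok"):
    """What a 'pause the music' intent handler might actually run."""
    subprocess.run(mpris_command(player, "Pause"), check=False)
```

Because the surface is standardized, the same handler works whether the user runs Amarok, Elisa, or VLC; only the player name changes.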
Re: Randa Meeting: Notes on Voice Control in KDE
> On 15. Sep 2017, at 12:54, Sebastian Kügler wrote:
>
> Hey!
>
> Interesting discussion. Did you guys factor in the work done by Mycroft
> on that front? I think there's a great deal of overlap, and already
> some really interesting results shown for example in the Mycroft
> Plasmoid:

Exactly. Please do not reinvent the wheel here. This is a job for Mycroft, which has already solved the vast majority of the problems you'd need to solve, and is already proven to work in Plasma. Duplicating that work would just be a waste.

The big problem that Mycroft currently has is that it uses Google for the voice recognition, but our goal there should be to push for adoption of Mozilla Common Voice in Mycroft, instead of redoing everything Mycroft does.

So yes, I'm 1.000% for allowing voice control in KDE applications as well as Plasma, but I'm 99% sure that the way to go there is Mycroft.

Cheers,
Thomas

> On Friday, September 15, 2017 9:39:13 AM CEST Frederik Gladhorn wrote:
>> [original message snipped; the full text appears at the end of the thread]
Re: Randa Meeting: Notes on Voice Control in KDE
Hey!

Interesting discussion. Did you guys factor in the work done by Mycroft on that front? I think there's a great deal of overlap, and already some really interesting results shown for example in the Mycroft Plasmoid: https://www.youtube.com/watch?v=sUhvKTq6c40 (somewhat dated, but gives a decent impression)

Cheers,
--
sebas

On Friday, September 15, 2017 9:39:13 AM CEST Frederik Gladhorn wrote:
> [original message snipped; the full text appears at the end of the thread]
Randa Meeting: Notes on Voice Control in KDE
We here at Randa had a little session about voice recognition and control of applications. We tried to roughly define what we mean by that: a way of talking to the computer as Siri/Cortana/Alexa/Google Now and other projects demonstrate, i.e. conversational interfaces. We agreed that we want this, and people expect it more and more. Striking a balance between privacy and getting some data to enable this is a big concern; see later.

While there is general interest (almost everyone here went out of their way to join the discussion), it didn't seem like anyone here at the moment wanted to drive this forward themselves, so it may just not go anywhere due to lack of people willing to put in time. Otherwise it may be something worth considering as a community goal.

The term "intent" seems to be OK for the event that arrives at the application. More on that later.

We tried to break down the problem and arrived at two possible scenarios:
1) voice recognition -> string representation in user's language
1.1) translation to English -> string representation in English
2) English sentence -> English string to intent

or alternatively:
1) voice recognition -> string representation in user's language
2) user language sentence -> user language string to intent

3) applications get "intents" and react to them.

So basically one open question is whether we need a translation step, or whether we can go directly from a string in any language to an intent.

We do not think it feasible nor desirable to let every app do its own magic. Thus a central "daemon" process does step 1, listening to audio and translating it to a string representation. Then, assuming we want translation step 1.1, we need to find a way to do the translation.

For step 1, Mozilla DeepSpeech seems like a candidate; it seems to be progressing quickly.

We assume that mid-term we need machine learning for step 2: gathering sample sentences (somewhere between thousands and millions) to enable the step of going from sentence to intent.
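The two scenarios above differ only in where translation happens; the whole chain is just function composition. A sketch of the shapes involved follows; every function body here is a stub standing in for a real component (step 1 would be a speech recognizer, step 1.1 machine translation, step 2 a sentence-to-intent model):

```python
# The pipeline from the notes as plain function composition.
# Only the stage boundaries matter; the bodies are canned stubs.

def recognize(audio: bytes) -> str:
    """Step 1: speech -> string in the user's language (stub)."""
    return "spiel Musik mit Amarok"

def translate(text: str) -> str:
    """Step 1.1: user language -> English (stub lookup)."""
    return {"spiel Musik mit Amarok": "play music with Amarok"}[text]

def to_intent(sentence: str) -> dict:
    """Step 2: English sentence -> intent (toy heuristic)."""
    if sentence.startswith("play music"):
        return {"intent": "media.play", "app": sentence.split()[-1]}
    return {"intent": "unknown"}

def pipeline(audio: bytes) -> dict:
    """Scenario one: 1 -> 1.1 -> 2; the daemon runs this, apps get step 3."""
    return to_intent(translate(recognize(audio)))
```

Scenario two simply drops `translate` and trains `to_intent` per language; the open question is which of the two seams is cheaper to fill with data.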
We might get away with a set of simple heuristics to get this kick-started, but over time we would want to use machine learning for this step. Here it's important to gather enough sample sentences to be able to train a model. We basically assume we need to encourage people to participate and send us the recognized sentences to get enough raw material to work with.

One interesting point is that ideally we can keep context, so that users can do follow-up queries/commands. Some of the context may be expressed with state machines (talk to Emanuelle about that). Clearly the whole topic needs research; we want to build on other people's work and cooperate as much as possible.

Hopefully we can find some centralized daemon to run on Linux and do a lot of the work in steps 1 and 2 for us. Step 3 requires work on our side (in Qt?) for sure. What should intents look like? Lists of property bags? Should apps have a way of saying which intents they support?

A starting point could be to use the common media player interface to control the media player using voice. Should exposing intents be a D-Bus thing to start with?

For querying data, we may want to interface with Wikipedia, MusicBrainz, etc., but is that more part of the central daemon, or should there be an app?

We probably want to be able to start applications when the appropriate command arrives: "write a new email to Volker" launches Kube with the composer open and ideally the receiver filled out, or it may ask the user "I don't know who that is, please help me...". So how do applications define what intents they process? How can applications ask for details? After receiving an intent they may need to ask for more data.

There is also the KPurpose framework; I have no idea what it does, I should read up on it.

This is likely to be completely new input while an app is in some state; it may have an open modal dialog, and we get new crashes because we're not prepared?
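The "property bag" and "apps say which intents they support" questions can be made concrete with a tiny registry: the intent is a plain dict, and applications declare handlers for the intent names they accept. Everything below is a made-up sketch to illustrate the shape, not a proposed API:

```python
# Sketch: intents as property bags (plain dicts) routed through a
# registry that apps populate to declare which intents they support.

HANDLERS = {}

def supports(intent_name):
    """Decorator an app could use to declare a supported intent."""
    def register(fn):
        HANDLERS[intent_name] = fn
        return fn
    return register

@supports("email.compose")
def compose(intent):
    # "write a new email to Volker" would arrive roughly like
    # {"intent": "email.compose", "to": "Volker"}.
    return "composing mail to %s" % intent.get("to", "?")

def dispatch(intent):
    """Route a property-bag intent to whichever app registered for it."""
    handler = HANDLERS.get(intent["intent"])
    return handler(intent) if handler else None
```

The same registry answers the follow-up question too: an unhandled intent (`dispatch` returning `None`) is exactly the case where the daemon has to ask the user for more detail.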
Are there patterns/building blocks to make it easier when an app is in a certain state? Maybe we should look at transactional computing and finite state machines? We could look at network protocols as an example; they have error recovery etc.

What would integration with online services look like? A lot of this is about querying information. Should it be offline by default, delegating to online services when the user asks for it?

We need to build, for example, public transport app integration. For centralized AI, we should join other projects. Maybe Qt will provide the connection to third-party engines on Windows and macOS; that would be a good testing ground.

And to end with a less serious idea, we need a big bike-shed discussion about wake-up words. We already came up with: OK KDE (try saying that out loud), OK Konqui, or Oh Kate!

I hope some of this makes sense. I'd love to see more people stepping up to figure out what is needed and move it forward :)

Cheers,
Frederik