Re: MXNet - Gluon - Audio

2018-11-20 Thread Sheng Zha
Hi Gaurav,

The performance concern is not just about librosa itself, but also about how
it is integrated. As a Python library, librosa requires holding the GIL when it
is called, which makes asynchronous data preprocessing during training
difficult. Also, the API design hasn't been verified on the more full-fledged
use cases that you outlined. Given that, and given the lack of audio-processing
expertise among those reviewing the design doc, my suggestion is to continue
the work as a Gluon example until other use cases are adopted, which is what
you started in https://github.com/apache/incubator-mxnet/pull/13325. Once you
make more progress and become more familiar with Gluon design, please report
back to this thread and I'd be happy to help more with the review.

-sz

Re: MXNet - Gluon - Audio

2018-11-20 Thread Gaurav Gireesh
Hi All!
Following up on this PR:
https://github.com/apache/incubator-mxnet/pull/13241
I would appreciate comments or feedback on the API design:
https://cwiki.apache.org/confluence/display/MXNET/Gluon+-+Audio

The comments on the PR were mostly about *librosa* and its performance
becoming a blocker if and when the designed API is tested with bigger ASR
models such as DeepSpeech 2 and DeepSpeech 3.
I would appreciate it if the community could share its expertise on how audio
data loading and feature extraction are currently done for bigger ASR
models.
If anything in the design can be changed to improve performance, I'll be
happy to look into it.

Thanks and regards,
Gaurav Gireesh

Re: MXNet - Gluon - Audio

2018-11-15 Thread Gaurav Gireesh
Hi Lai!
Thank you for your comments!
Below are the answers to your comments/queries:
1) That's a good suggestion. I have added an example in the pull request
related to this:
https://github.com/apache/incubator-mxnet/pull/13241/commits/eabb68256d8fd603a0075eafcd8947d92e7df27f
I would also be happy to include a dataset similar to MNIST to support that. I
have come across an example dataset used in a TensorFlow speech-related
example here
<https://www.tensorflow.org/tutorials/sequences/audio_recognition>. This
could be included.

2) Thank you for the suggestion; I shall look into the FFT operator that
you pointed out. However, other kinds of features, such as MFCC and mel
features, are popular in audio feature extraction and would be useful if
implemented. I am not sure whether we have operators for these.

3) The references look good too. I shall look into them. Thank you for
bringing them to my notice.
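
As background for point 2, mel features rest on a simple frequency-warping
formula. One common definition (the HTK-style formula) is
m = 2595 * log10(1 + f / 700). The sketch below uses only the Python standard
library; the function names are illustrative assumptions and are not MXNet or
librosa APIs, and other libraries may use a different mel formula by default.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels, f_min, f_max):
    """Return n_mels filter center frequencies (Hz), evenly spaced in mels."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


# Example: center frequencies of a 10-band filter bank up to 8 kHz.
centers = mel_center_frequencies(10, 0.0, 8000.0)
```

A mel spectrogram pools FFT power into bands centered at these frequencies,
and MFCCs are then obtained by taking a discrete cosine transform of the log
band energies, so an FFT operator alone covers only the first step.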

Regards,
Gaurav


Re: MXNet - Gluon - Audio

2018-11-15 Thread Gaurav Gireesh
Hi Sandeep! Thank you for looking into this. Below are the answers as I
have them now:

1) As of now, I do not have any metrics comparing librosa with the other
libraries currently available; I am working on finding some. As far as
community usage is concerned, I have come across blogs that speak well of
librosa as an audio loading/manipulation library. One of them is here
<https://towardsdatascience.com/urban-sound-classification-part-2-sample-rate-conversion-librosa-ba7bc88f209a>.
Librosa does perform slowly, which I confirmed by consulting other
frameworks that use the library in their use cases. I have also used
scipy.io.wavfile, but it only supports loading audio and offers little in the
way of feature extraction or audio transforms. Librosa's load takes care of a
lot of preprocessing, such as resampling the audio to a standard sampling
rate, converting stereo to mono, scaling the audio samples and so on. So this
library was chosen to start with. I would also welcome feedback from the
community on other libraries that perform these tasks faster.

2) Your suggestion to remove the hard dependency on this library for
users does make sense. It should be installed only when users really
need to perform these audio-related tasks, for which we rely on librosa at
the moment.

3) I have looked into the code for librosa, but I need to understand it
better. Until then, it is too soon to comment on how the operators would be
implemented or how they could be extended to support other languages.

4) Yes, the time difference is huge! However, librosa load (loading the
audio into a numpy array) takes the bulk of the time (80-90%), not the
feature extraction (mfcc, mel etc.). That is why we have disabled
*lazy = True* in the current design by overriding it in the method.
As a result, initializing Gluon's dataloader takes time, but training then
goes quicker. This certainly needs more analysis of ways to do this.
Alternatively, finding another library better suited to this would help too.

5) Yes, I would need comments/suggestions from committers/contributors on
this too.
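
To illustrate the trade-off in point 4, here is a minimal pure-Python sketch.
It is not the Gluon implementation: `TransformedDataset` and `slow_load` are
hypothetical stand-ins. With lazy=False every item is transformed once at
construction, so set-up is slow but each epoch reuses the stored results; with
lazy=True the transform reruns on every access.

```python
class TransformedDataset:
    """Hypothetical stand-in for a dataset with a `lazy` transform flag."""

    def __init__(self, items, fn, lazy=True):
        self._fn = fn
        self._lazy = lazy
        # Eager mode pays the full transform cost once, up front.
        self._data = list(items) if lazy else [fn(x) for x in items]

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        # Lazy mode re-applies the transform on every access (every epoch).
        return self._fn(self._data[idx]) if self._lazy else self._data[idx]


def count_transform_calls(lazy, n_files=4, n_epochs=3):
    """Count how often the (expensive) load runs over several epochs."""
    calls = []

    def slow_load(name):
        # Placeholder for an expensive step such as decoding an audio file.
        calls.append(name)
        return len(name)

    files = ["clip{}.wav".format(i) for i in range(n_files)]
    dataset = TransformedDataset(files, slow_load, lazy=lazy)
    for _ in range(n_epochs):
        for i in range(len(dataset)):
            dataset[i]
    return len(calls)


print(count_transform_calls(lazy=True))   # 12: 4 files x 3 epochs
print(count_transform_calls(lazy=False))  # 4: each file loaded once
```

The eager variant trades memory for time, since every transformed item stays
resident after construction. That is the memory impact that still needs to be
measured for real audio data.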

Appreciate your comments.

Thanks and regards,
Gaurav


Re: MXNet - Gluon - Audio

2018-11-13 Thread Lai Wei
Hi Gaurav,

Thanks for starting this. I see the PR is out
<https://github.com/apache/incubator-mxnet/pull/13241> and have left some
initial reviews. Good work!

In addition to Sandeep's queries, I have the following:
1. Can we include a simple classic audio dataset for users to directly
import and try out, like MNIST in vision? (e.g.:
http://pytorch.org/audio/datasets.html#yesno)
2. Librosa provides some good audio feature extraction, so we can use it for
now. But it's slow, as you have to do conversions between ndarray and numpy.
In the long term, can we make the transforms use mxnet operators and change
your transforms to hybrid blocks? For example, the mxnet FFT
<https://mxnet.apache.org/api/python/ndarray/contrib.html?highlight=fft#mxnet.ndarray.contrib.fft>
operator can be used in a hybrid block transformer, which will be a lot faster.

Here are some additional references from users already using mxnet on audio.
We should aim to make it easier and automate the file
load/preprocess/transform process.
1. https://github.com/chen0040/mxnet-audio
2. https://github.com/shuokay/mxnet-wavenet

Looking forward to seeing this feature out.
Thanks!

Best Regards

Lai



Re: MXNet - Gluon - Audio

2018-11-13 Thread sandeep krishnamurthy
Thanks, Gaurav, for starting this initiative. The design document is
detailed and gives all the information.
Starting to add this in "Contrib" is a good idea, since we expect a few
rough edges and cleanups to follow.

I have the following queries:
1. Is there any analysis comparing LibROSA with other libraries with respect
to features, performance, and community usage in the audio data domain?
2. What is the recommendation for the LibROSA dependency? Make it part of the
MXNet PyPI package, or ask the user to install it if required? I prefer the
latter, similar to protobuf in ONNX-MXNet.
3. I see LibROSA is a fully Python-based library. Will this dependency block
us in future use cases where we want to implement transformations as
operators and allow for cross-language support?
4. In the performance design considerations, the performance difference
between lazy=True and lazy=False is too scary (8 minutes to 4 hours!!). This
requires some more analysis. If we know that turning a flag off/on causes a
24X performance degradation, do we need to provide that control to the user?
What is the impact of this on memory usage?
5. I see LibROSA has an ISC license (
https://github.com/librosa/librosa/blob/master/LICENSE.md), which says it is
free to use with the same license notification. I am not sure if this is OK.
I request other committers/mentors to advise.
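
For query 2, a minimal sketch of the "ask the user to install if required"
option: the dependency is imported only when an audio feature is first used,
and a missing package produces an actionable error. The helper names
`require_optional` and `load_audio` are hypothetical and not part of MXNet.

```python
import importlib


def require_optional(module_name, feature):
    """Import an optional dependency on first use (hypothetical helper)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Fail with an actionable message only when the feature is used,
        # rather than when the main package itself is imported.
        raise ImportError(
            "{} requires the optional package '{}'; install it with "
            "`pip install {}`.".format(feature, module_name, module_name)
        ) from exc


def load_audio(path):
    """Load an audio file; librosa is imported only when this is called."""
    librosa = require_optional("librosa", "Audio loading")
    # librosa.load returns a (samples, sampling_rate) pair.
    return librosa.load(path)
```

With this pattern, users who never touch the audio features do not need
LibROSA installed at all.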

Best,
Sandeep


-- 
Sandeep Krishnamurthy


MXNet - Gluon - Audio

2018-11-09 Thread Gaurav Gireesh
Dear MXNet Community,

I recently started looking into performing some simple multi-class sound
classification tasks with audio data and realized that, as a user, I would
like MXNet to have an out-of-the-box feature that allows us to load audio
data (at least 1 file format), extract features (or apply some common
transforms/feature extraction) and train a model using the audio dataset.
This could be a first step towards building and supporting APIs similar to
what we have for "vision"-related use cases in MXNet.

Below is the design proposal:

Gluon - Audio Design Proposal
<https://cwiki.apache.org/confluence/display/MXNET/Gluon+-+Audio>

I would highly appreciate your taking time to review and provide feedback,
comments/suggestions on this.
Looking forward to your support.


Best Regards,

Gaurav Gireesh