question

2016-06-01 Thread Khurrum Nasim
Hello All,


Seeking some advice regarding  the following: 


I have a JSON ETL task.  We have all done some ETL in our lives before: 
extract data, apply some transformation to it, and load it back.   

I have a fairly large amount of JSON that I need to iterate over, checking for 
the existence of a particular combination; let’s call it, for argument’s 
sake, the “target”.

If the “target” is found and its value is not null, I have to create a 
duplicate of it under a distinct key name.  If the “target” is found but has 
a null value (or no value), I simply create the new key and leave its value 
null.  

Because JSON can be deeply nested and I don’t know how large the data is, a 
recursive routine wouldn’t be ideal for tackling this problem.

The data could be a large set, possibly several gigabytes.  Let’s assume 
50-100 GB. 
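
For illustration, a minimal Python sketch of a non-recursive traversal using an 
explicit stack. It assumes the "target" is a plain key name and that the data 
arrives as one JSON record per line (JSON Lines), so only one record is held in 
memory at a time. The names duplicate_target, process_lines and target_copy are 
hypothetical, chosen only for this example.

```python
import json


def duplicate_target(doc, target="target", new_key="target_copy"):
    """Add new_key next to every occurrence of target, without recursion."""
    stack = [doc]  # explicit stack replaces the call stack
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            # Collect children first so the key added below is not revisited.
            children = list(node.values())
            if target in node:
                # A null value stays null; any other value is duplicated.
                # The same object is referenced twice, which serializes
                # to two identical copies in the JSON output.
                node[new_key] = node[target]
            stack.extend(children)
        elif isinstance(node, list):
            stack.extend(node)
    return doc


def process_lines(lines, target="target", new_key="target_copy"):
    """Transform an iterable of JSON Lines records one at a time."""
    for line in lines:
        if line.strip():
            record = json.loads(line)
            yield json.dumps(duplicate_target(record, target, new_key))
```

Note that json.loads itself still has to parse each record in full, so this 
only helps if individual records are of manageable size; a single 50-100 GB 
document would need an incremental (streaming) parser instead.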


Thanks,

Khurrum







Re: [NEW member] Hi

2016-06-01 Thread Khurrum Nasim
How are you folks getting over the learning curves associated with tools like 
NiFi and Airflow?

> On May 28, 2016, at 9:50 AM, Suneel Marthi  wrote:
> 
> Debo,
> 
> On Tue, May 17, 2016 at 9:18 PM, Andrew Palumbo  wrote:
> 
>> We are certainly interested in  online clustering Algorithms, and
>> clustering of timeseries seems like a great fit.  (our text vectorization
>> pipeline has not yet been reworked for the new Mahout "Samsara" but that is
>> an interest too).  What type of compute platform would you require for this?
>> 
> 
> For the data processing pipeline, the requirements are:
>   (a) it should be agnostic to any distributed processing engine like
> Spark, Flink, etc.
>   (b) it should be able to scale data pipelines and support back
> pressure.
>   (c) it should be able to ingest both batch and streaming data from Spark,
> Flink, Beam, etc.
> 
>   So far Apache NiFi seems to fit the bill for all of the above criteria
> (it doesn't have a Beam interface yet, but one is being worked on), and it
> also has an excellent GUI along with features to define common workflow
> templates that can be imported into custom workflows.
> 
> The other alternatives being considered are Airbnb's Airflow (proposed for
> the Apache Incubator; it defines workflows as a DAG in Python) and
> Apache Beam.
> 
> 
> 
>> 
>> Currently we are not looking at FPGAs.
>> 
> 
> If any of the Math packages handle FPGAs natively out-of-the-box, let's go
> for it. But we need not optimize the heck to get the last bit of
> performance from FPGAs.
> 
> 
>> 
>> The most recent, and only real Documentation for Mahout Samsara is in
>> Apache Mahout: Beyond MapReduce:
>> 
>> 
>> http://www.weatheringthroughtechdays.com/2016/02/mahout-samsara-book-is-out.html.
>> You may want to check that out as a reference.
>> 
>> (I'm sorry for the shameless plug, but it is the only thing that covers almost
>> all Mahout "Samsara" features and architecture up to our previous release)
>> 
> 
> I don't see this as a shameless plug; it's definitely much better than the
> dozen low-grade books that have been churned out by Packt Publishing and
> went nowhere, other than bringing disrepute to the project and community.
> 
> 
>> 
>> Please do let us know if you have any questions about the Samsara platform.
>> 
>> From: Debojyoti Dutta 
>> Sent: Tuesday, May 17, 2016 8:35:04 PM
>> To: dev@mahout.apache.org
>> Subject: Re: [NEW member] Hi
>> 
>> Thanks Andy! Would like to see if there is interest for algorithms such as
>> 1) clustering text in an online fashion (maybe using LSH or sim/min hash)
>> or 2) online clustering of time series. Basically my focus is "online" or
>> real time.
>> 
>> LSH on GPU sounds very interesting and would love to look at the patches.
>> Personally have helped accelerate LSH on TCAMs long ago e.g.
>> http://arxiv.org/abs/1006.3514  Is GPU the only hw accel you are
>> looking at or are you considering PCIe FPGA cards too?
>> 
>> debo
>> 
>> On Tue, May 17, 2016 at 5:27 PM, Andrew Palumbo 
>> wrote:
>> 
>>> Welcome, Debojyoti.
>>> We look forward to your contributions.  We are currently working towards
>>> integrating GPU acceleration for our 0.13 release and LSH sounds like a
>>> great addition. Could you tell us some more about what you would like to
>> do?
>>> 
>>> Let us know if we can help you get familiar with the mahout code base.
>> We
>>> try to implement algorithms in the math-scala module.
>>> 
>>> Thanks,
>>> 
>>> Andy
>>> 
>>> 
>>> 
>>> 
>>> 
>>>  Original message 
>>> From: Debojyoti Dutta 
>>> Date: 05/17/2016 8:11 PM (GMT-05:00)
>>> To: dev@mahout.apache.org
>>> Subject: [NEW member] Hi
>>> 
>>> Hi there,
>>> 
>>> Am very interested in contributing to Mahout especially towards fast ML
>>> kernels that can be used for streaming. Have some experience with LSH
>> based
>>> techniques (including hw accel) for clustering and near neighbors based
>>> stuff in general.
>>> 
>>> Was chatting with Sunil and he suggested I join the merry band.
>>> 
>>> regards
>>> -Debo~
>>> 
>> 
>> 
>> 
>> --
>> -Debo~
>> 



Re: [NEW member] Hi

2016-06-01 Thread Khurrum Nasim
To the community, active committers, etc. 



> On Jun 1, 2016, at 11:01 AM, Suneel Marthi <smar...@apache.org> wrote:
> 
> Was that question directed to the community, or were you asking yourself out loud?
> 
> On Wed, Jun 1, 2016 at 10:48 AM, Khurrum Nasim <khurrum.na...@useitc.com>
> wrote:
> 
>> How are you folks getting over the learning curves associated with things
>> like Nifi and AirFlow ?
>> 
>>> On May 28, 2016, at 9:50 AM, Suneel Marthi <smar...@apache.org> wrote:
>>> 
>>> Debo,
>>> 
>>> On Tue, May 17, 2016 at 9:18 PM, Andrew Palumbo <ap@outlook.com>
>> wrote:
>>> 
>>>> We are certainly interested in  online clustering Algorithms, and
>>>> clustering of timeseries seems like a great fit.  (our text
>> vectorization
>>>> pipeline has not yet been reworked for the new Mahout "Samsara" but
>> that is
>>>> an interest too).  What type of compute platform would you require for
>> this?
>>>> 
>>> 
>>> For data processing pipeline, the requirements are :
>>>   (A) it should be agnostic to any distributed processing engine like
>>> Spark, Flink, etc.
>>>   (b) should be able to scale data pipelines and be able to support back
>>> pressure.
>>>   (c) should be able to ingest both Batch and Streaming data from Spark,
>>> Flink, Beam etc...
>>> 
>>>  So far Apache NiFi seems to fit the bill for all of the above criteria
>>> (they don't have a Beam interface yet but is being worked on) and they
>> also
>>> have an excellent GUI along with features to define common workflow
>>> templates that could be imported into custom workflows.
>>> 
>>> The other alternatives being considered are Airbnb's Airflow - proposed
>> for
>>> Apache incubator and defines workflows as a DAG in python,
>>> Apache Beam.
>>> 
>>> 
>>> 
>>>> 
>>>> Currently we are not looking at FPGAs.
>>>> 
>>> 
>>> If any of the Math packages handle FPGAs natively out-of-the-box, let's
>> go
>>> for it. But we need not optimize the heck to get the last bit of
>>> performance from FPGAs.
>>> 
>>> 
>>>> 
>>>> The most recent, and only real Documentation for Mahout Samsara is in
>>>> Apache Mahout: Beyond MapReduce:
>>>> 
>>>> 
>>>> 
>> http://www.weatheringthroughtechdays.com/2016/02/mahout-samsara-book-is-out.html
>> .
>>>> You may want to check that out as a reference.
>>>> 
>>>> (I'm sorry for the shameless plug, but it is the only thing that covers
>> almost
>>>> all Mahout "Samsara" features and architecture up to our previous
>> release)
>>>> 
>>> 
>>> I don't see this as a shameless plug; it's definitely much better than the
>>> dozen low-grade books that have been churned out by Packt Publishing and
>>> went nowhere, other than bringing disrepute to the project and community.
>>> 
>>> 
>>>> 
>>>> Please do let us know if you have any questions about the Samsara
>> platform.
>>>> 
>>>> From: Debojyoti Dutta <ddu...@gmail.com>
>>>> Sent: Tuesday, May 17, 2016 8:35:04 PM
>>>> To: dev@mahout.apache.org
>>>> Subject: Re: [NEW member] Hi
>>>> 
>>>> Thanks Andy! Would like to see if there is interest for algorithms such
>> as
>>>> 1) clustering text in an online fashion (maybe using LSH or sim/min
>> hash)
>>>> or 2) online clustering of time series. Basically my focus is "online"
>> or
>>>> real time.
>>>> 
>>>> LSH on GPU sounds very interesting and would love to look at the
>> patches.
>>>> Personally have helped accelerate LSH on TCAMs long ago e.g.
>>>> http://arxiv.org/abs/1006.3514  Is GPU the only hw accel you are
>>>> looking at or are you considering PCIe FPGA cards too?
>>>> 
>>>> debo
>>>> 
>>>> On Tue, May 17, 2016 at 5:27 PM, Andrew Palumbo <ap@outlook.com>
>>>> wrote:
>>>> 
>>>>> Welcome, Debojyoti.
>>>>> We look forward to your contributions.  We are currently working
>> towards
>>>>> integrating GPU acceleration for our 0.13 release and LSH sounds like a
>>>>> great addition. Could you tell us some more about what you would like
>> to
>>>> do?
>>>>> 
>>>>> Let us know if we can help you get familiar with the mahout code base.
>>>> We
>>>>> try to implement algorithms in the math-scala module.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Andy
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>  Original message 
>>>>> From: Debojyoti Dutta <ddu...@gmail.com>
>>>>> Date: 05/17/2016 8:11 PM (GMT-05:00)
>>>>> To: dev@mahout.apache.org
>>>>> Subject: [NEW member] Hi
>>>>> 
>>>>> Hi there,
>>>>> 
>>>>> Am very interested in contributing to Mahout especially towards fast ML
>>>>> kernels that can be used for streaming. Have some experience with LSH
>>>> based
>>>>> techniques (including hw accel) for clustering and near neighbors based
>>>>> stuff in general.
>>>>> 
>>>>> Was chatting with Sunil and he suggested I join the merry band.
>>>>> 
>>>>> regards
>>>>> -Debo~
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> -Debo~
>>>> 
>> 
>> 



Re: [jira] [Created] (MAHOUT-1815) dsqDist(X,Y) and dsqDist(X) failing in flink tests.

2016-03-15 Thread Khurrum Nasim
Sounds good - I’ll take a look.

Thanks,
Khurrum



> On Mar 15, 2016, at 5:12 PM, Khurrum Nasim <khurrum.na...@useitc.com> wrote:
> 
> Hi,
> 
> How do I get committer access to this project ?  I am interested in becoming 
> an active contributor.
> 
> 
> Thanks,
> Khurrum
> 



Re: [jira] [Created] (MAHOUT-1815) dsqDist(X,Y) and dsqDist(X) failing in flink tests.

2016-03-15 Thread Khurrum Nasim
Hi,

How do I get committer access to this project ?  I am interested in becoming an 
active contributor.


Thanks,
Khurrum



Re: [jira] [Commented] (MAHOUT-1788) spark-itemsimilarity integration test script cleanup

2016-03-30 Thread Khurrum Nasim
Thanks Dmitriy.  

I’ll take a look and see where I can start pitching in.  Do I need contributor 
access?  How would I create a feature branch for my work? 

Khurrum

> On Mar 30, 2016, at 1:12 PM, Dmitriy Lyubimov  wrote:
> 
> Oh but of course! please do!
> 
> You may work on any issue, this or any other of your choice, or even on any
> new issue you can think of (for sizeable contributions it is recommended to
> start discussion on the @dev list first though, to make sure to benefit
> from experience of others. Please file any new issue first to jira).
> 
> On Wed, Mar 30, 2016 at 9:05 AM, shashi bushan dongur (JIRA) <
> j...@apache.org> wrote:
> 
>> 
>>[
>> https://issues.apache.org/jira/browse/MAHOUT-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218216#comment-15218216
>> ]
>> 
>> shashi bushan dongur commented on MAHOUT-1788:
>> --
>> 
>> Hello. I would like to start contributing to mahout. Can I work on this
>> issue?
>> 
>>> spark-itemsimilarity integration test script cleanup
>>> 
>>> 
>>>Key: MAHOUT-1788
>>>URL: https://issues.apache.org/jira/browse/MAHOUT-1788
>>>Project: Mahout
>>> Issue Type: Improvement
>>> Components: cooccurrence
>>>   Affects Versions: 0.11.0
>>>   Reporter: Pat Ferrel
>>>   Assignee: Pat Ferrel
>>>   Priority: Trivial
>>>Fix For: 1.0.0
>>> 
>>> 
>>> binary release does not contain data for itemsimilarity tests, neither
>> binary nor source versions will run on a cluster unless data is hand copied
>> to hdfs.
>>> Clean this up so it copies data if needed and the data is in both
>> versions.
>> 
>> 
>> 
>> --
>> This message was sent by Atlassian JIRA
>> (v6.3.4#6332)
>> 



Re: [jira] [Commented] (MAHOUT-1788) spark-itemsimilarity integration test script cleanup

2016-03-30 Thread Khurrum Nasim
Thanks for the advice, Dmitriy.  I’m already signed up on ASF JIRA.  My handle 
is “nasimk”.

Do I need to be a linear algebra expert and/or a math PhD to contribute?  
I have 10-plus years of computer programming experience.  My background is 
computer science. 

Khurrum
 




> On Mar 30, 2016, at 2:57 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> 
> PS You may also want to sign up with ASF Jira so we can assign issues to
> you.
> 
> On Wed, Mar 30, 2016 at 11:52 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
> 
>> 
>> 
>> On Wed, Mar 30, 2016 at 11:43 AM, Khurrum Nasim <khurrum.na...@useitc.com>
>> wrote:
>> 
>>> Thanks Dmitriy.
>>> 
>>> I'll take a look and see where I can start pitching in.  Do I need
>>> contributor access?  How would I create a feature branch for my work?
>>> 
>> 
>> Khurrum,
>> 
>> you only need a GitHub account. What you need is to create a fork of
>> mahout's master in your GitHub space and keep it in sync, as much as
>> possible, with master as you go (by doing regular pulls). That way you have
>> the best chance of having the fewest conflicts possible.
>> 
>> At any point in time (I recommend perhaps when you feel you are about
>> 50 to 70% done or just need code advice), you can create a GitHub pull
>> request to the apache/mahout master. Make sure to include the MAHOUT-XXX issue
>> in the head of the pull request; that way ASF will automatically propagate
>> code comments to JIRA, and so all discussion can be done entirely on GitHub.
>> 
>> Again, if you take on a significant contribution (such as a new numerical
>> method contribution), I recommend discussing the proposal on the @dev list
>> 
>> thanks.
>> 
>> 
>>> 
>>> Khurrum
>>> 
>>>> On Mar 30, 2016, at 1:12 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>>> wrote:
>>>> 
>>>> Oh but of course! please do!
>>>> 
>>>> You may work on any issue, this or any other of your choice, or even on
>>> any
>>>> new issue you can think of (for sizeable contributions it is
>>> recommended to
>>>> start discussion on the @dev list first though, to make sure to benefit
>>>> from experience of others. Please file any new issue first to jira).
>>>> 
>>>> On Wed, Mar 30, 2016 at 9:05 AM, shashi bushan dongur (JIRA) <
>>>> j...@apache.org> wrote:
>>>> 
>>>>> 
>>>>>   [
>>>>> 
>>> https://issues.apache.org/jira/browse/MAHOUT-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218216#comment-15218216
>>>>> ]
>>>>> 
>>>>> shashi bushan dongur commented on MAHOUT-1788:
>>>>> --
>>>>> 
>>>>> Hello. I would like to start contributing to mahout. Can I work on this
>>>>> issue?
>>>>> 
>>>>>> spark-itemsimilarity integration test script cleanup
>>>>>> 
>>>>>> 
>>>>>>   Key: MAHOUT-1788
>>>>>>   URL: https://issues.apache.org/jira/browse/MAHOUT-1788
>>>>>>   Project: Mahout
>>>>>>Issue Type: Improvement
>>>>>>Components: cooccurrence
>>>>>>  Affects Versions: 0.11.0
>>>>>>  Reporter: Pat Ferrel
>>>>>>  Assignee: Pat Ferrel
>>>>>>  Priority: Trivial
>>>>>>   Fix For: 1.0.0
>>>>>> 
>>>>>> 
>>>>>> binary release does not contain data for itemsimilarity tests, neither
>>>>> binary nor source versions will run on a cluster unless data is hand
>>> copied
>>>>> to hdfs.
>>>>>> Clean this up so it copies data if needed and the data is in both
>>>>> versions.
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> This message was sent by Atlassian JIRA
>>>>> (v6.3.4#6332)
>>>>> 
>>> 
>>> 
>> 



Re: [jira] [Commented] (MAHOUT-1788) spark-itemsimilarity integration test script cleanup

2016-03-31 Thread Khurrum Nasim
Thanks everyone - I’m glad to be a part of this.  

Khurrum


> On Mar 30, 2016, at 3:10 PM, Suneel Marthi <smar...@apache.org> wrote:
> 
> Thanks Khurrum for stepping up.
> 
> You just need basic programming skills - Java/Scala to be able to
> contribute. We can help you with the algorithms and linear algebra stuff.
> 
> 
> Welcome aboard !!
> 
> 
> On Wed, Mar 30, 2016 at 3:05 PM, Khurrum Nasim <khurrum.na...@useitc.com>
> wrote:
> 
>> Thanks for the advice, Dmitriy.  I’m already signed up on ASF JIRA.  My
>> handle is “nasimk”.
>> 
>> Do I need to be a linear algebra expert and/or a math PhD to contribute?
>> I have 10-plus years of computer programming experience.  My background is
>> computer science.
>> 
>> Khurrum
>> 
>> 
>> 
>> 
>> 
>>> On Mar 30, 2016, at 2:57 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>> 
>>> PS You may also want to sign up with ASF Jira so we can assign issues to
>>> yourself.
>>> 
>>> On Wed, Mar 30, 2016 at 11:52 AM, Dmitriy Lyubimov <dlie...@gmail.com>
>>> wrote:
>>> 
>>>> 
>>>> 
>>>> On Wed, Mar 30, 2016 at 11:43 AM, Khurrum Nasim <
>> khurrum.na...@useitc.com>
>>>> wrote:
>>>> 
>>>>> Thanks Dmitriy.
>>>>> 
>>>>> I'll take a look and see where I can start pitching in.  Do I need
>>>>> contributor access?  How would I create a feature branch for my work?
>>>>> 
>>>> 
>>>> Khurrum,
>>>> 
>>>> you only need github account. What you need is to create mahout's master
>>>> fork in your github space and keep it in sync, as possible, with master
>> as
>>>> you go (by doing regular pulls). That way you have the most chance of
>>>> having least conflicts possible.
>>>> 
>>>> At any point in time (I recommend at perhaps when you feel you are about
>>>> 50 to 70% done or just need a code advice), you can create a github pull
>>>> request to the apache/mahout master. Make sure to include MAHOUT-XXX
>> issue
>>>> in the head of the pull request, that way ASF will automatically
>> propagate
>>>> code comments to jira, and so all discussion can be done entirely on
>> github.
>>>> 
>>>> Again, if you take on a significant contribution (such as a new numerical
>>>> method contribution), I recommend to discuss the proposal on the @dev
>> list
>>>> 
>>>> thanks.
>>>> 
>>>> 
>>>>> 
>>>>> Khurrum
>>>>> 
>>>>>> On Mar 30, 2016, at 1:12 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> Oh but of course! please do!
>>>>>> 
>>>>>> You may work on any issue, this or any other of your choice, or even
>> on
>>>>> any
>>>>>> new issue you can think of (for sizeable contributions it is
>>>>> recommended to
>>>>>> start discussion on the @dev list first though, to make sure to
>> benefit
>>>>>> from experience of others. Please file any new issue first to jira).
>>>>>> 
>>>>>> On Wed, Mar 30, 2016 at 9:05 AM, shashi bushan dongur (JIRA) <
>>>>>> j...@apache.org> wrote:
>>>>>> 
>>>>>>> 
>>>>>>>  [
>>>>>>> 
>>>>> 
>> https://issues.apache.org/jira/browse/MAHOUT-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218216#comment-15218216
>>>>>>> ]
>>>>>>> 
>>>>>>> shashi bushan dongur commented on MAHOUT-1788:
>>>>>>> --
>>>>>>> 
>>>>>>> Hello. I would like to start contributing to mahout. Can I work on
>> this
>>>>>>> issue?
>>>>>>> 
>>>>>>>> spark-itemsimilarity integration test script cleanup
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  Key: MAHOUT-1788
>>>>>>>>  URL:
>> https://issues.apache.org/jira/browse/MAHOUT-1788
>>>>>>>>  Project: Mahout
>>>>>>>>   Issue Type: Improvement
>>>>>>>>   Components: cooccurrence
>>>>>>>> Affects Versions: 0.11.0
>>>>>>>> Reporter: Pat Ferrel
>>>>>>>> Assignee: Pat Ferrel
>>>>>>>> Priority: Trivial
>>>>>>>>  Fix For: 1.0.0
>>>>>>>> 
>>>>>>>> 
>>>>>>>> binary release does not contain data for itemsimilarity tests, neither
>>>>>>> binary nor source versions will run on a cluster unless data is hand
>>>>> copied
>>>>>>> to hdfs.
>>>>>>>> Clean this up so it copies data if needed and the data is in both
>>>>>>> versions.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> This message was sent by Atlassian JIRA
>>>>>>> (v6.3.4#6332)
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> 



Re: About reuters-fkmeans-centroids

2016-04-28 Thread Khurrum Nasim
@Prakash - Albeit I’m a Mahout noob - if you can represent your problem as a 
network with 2D input, then yes, Mahout can be used (so I’ve heard).
IMO, every machine-based computation problem can be represented as a graph, 
although this may not always be optimal.


Taking this notion of fuzzy clustering a bit further: can it be applied to 
topics such as demand prediction?   
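
For anyone new to the term: fuzzy clustering gives each point a degree of 
membership in every cluster instead of a single hard assignment. Below is a 
minimal Python sketch of the textbook fuzzy c-means membership formula. It is 
not Mahout's implementation; the function name fuzzy_memberships and the 
default fuzziness m=2.0 are assumptions made for this example.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Return the membership of point in each cluster; values sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    on_center = [d == 0.0 for d in dists]
    if any(on_center):
        n = sum(on_center)
        return [1.0 / n if hit else 0.0 for hit in on_center]
    exponent = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists]
```

Closer centers receive higher membership, and a larger m makes the 
memberships more evenly spread.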




> On Apr 28, 2016, at 4:37 PM, Prakash Poudyal  wrote:
> 
> Dear Suneel, Dmitriy and Ted,
> 
> This is just a gentle reminder to answer the confusion that I mentioned in my
> previous email. It would be great if you could respond soon, so that
> I can go ahead.
> 
> Thank you so much.
> 
> Prakash
> 
> On Thu, Apr 28, 2016 at 8:02 PM, Prakash Poudyal 
> wrote:
> 
>> Hi!
>> 
>> Thank you for your emails !!
>> 
>> Actually, I need to use fuzzy clustering to cluster sentences in my
>> research. This is my goal.
>> 
>> I started to use the Fuzzy K-means clustering of Mahout last week! I
>> found several blog links and many other helpful documents.  I was
>> going through them because, being new, I realized this is the best, easiest
>> and fastest way to learn how Mahout works. In my opinion, many newcomers do
>> the same as I do. Only after getting used to the tools do people focus on
>> the work and go deeper.
>> 
>> I had gone through many blogs and sites to know about Mahout, some of them
>> are below :
>> 
>> http://technobium.com/introduction-to-clustering-using-apache-mahout/
>> 
>> http://tuxdna.github.io/pages/mahout.html
>> 
>> 
>> https://github.com/tdunning/MiA/blob/master/src/main/java/mia/clustering/ch09/FuzzyKMeansExample.java
>> 
>> http://www.programering.com/a/MDNwgTMwATI.html
>> 
>> 
>> https://www.safaribooksonline.com/library/view/apache-mahout-clustering/9781783284436/ch04.html
>> 
>> https://ymnliu.wordpress.com/2015/11/05/install-apache-mahout-in-eclipse/
>> 
>> https://mahout.apache.org/
>> 
>> What do you say about these sites? Are these sites not appropriate?
>> 
>> I raised my problem several times, on the mailing list and even on IRC, but
>> I only got a response today :(
>> 
>> So finally, it would be great if you could answer my
>> following questions.
>> 
>> Is Apache Mahout an appropriate tool for clustering sentences through
>> fuzzy clustering?
>> 
>> If the answer is "YES":
>> 
>>    Which version of Mahout?
>> 
>>    Can you write the steps that I need to follow, or give me
>> appropriate documentation (links)?
>> 
>> 
>> Thanks
>> Prakash Poudyal
>> Portugal
>> 
>> On Thu, Apr 28, 2016 at 7:13 PM, Suneel Marthi  wrote:
>> 
>>> That's correct, deprecated as of Feb 2014 and will be completely purged in
>>> one of the upcoming releases (0.13.0)
>>> 
>>> On Thu, Apr 28, 2016 at 2:10 PM, Dmitriy Lyubimov 
>>> wrote:
>>> 
>>>> Prakash,
>>>> 
>>>> if you are using any Mahout Mapreduce algorithm for research, please
>>> make
>>>> sure to make this disclosure:
>>>> 
>>>> all Mahout MapReduce algorithms are officially not supported and
>>> deprecated
>>>> since February, 2014 (IIRC). I can dig up a specific issue regarding
>>> this.
>>>> There also has been an announcement.
>>>> 
>>>> So before you really start drawing any comparisons, please be advised
>>> that
>>>> you are starting with algoritms 2+ years even since their EOL (let alone
>>>> inception).
>>>> 
>>>> Thanks.
>>>> -D
>>>> 
>>>> On Thu, Apr 28, 2016 at 11:05 AM, Prakash Poudyal <
>>>> prakashpoud...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi! Ted,
>>>>> 
>>>>> You mean Mahout is no more supporting "fuzzy K clustering for the
>>>>> sentences". Can you clarify in more detail . :(
>>>>> 
>>>>> Prakash
>>>>> 
>>>>> On Thu, Apr 28, 2016 at 6:58 PM, Ted Dunning 
>>>>> wrote:
>>>>> 
>>>>>> On Thu, Apr 28, 2016 at 10:54 AM, Prakash Poudyal <
>>>>>> prakashpoud...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Actually, I need to use fuzzy clustering to cluster the sentence
>>> in
>>>> my
>>>>>>> research. I found  fuzzy k clustering algorithm in Apache Mahout,
>>>>> thus, I
>>>>>>> am trying to use it for my purpose.
>>>>>>> 
>>>>>> 
>>>>>> That's great.
>>>>>> 
>>>>>> But that code is no longer supported.
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> 
>>>>> Regards
>>>>> Prakash Poudyal
>>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> 
>> Regards
>> Prakash Poudyal
>> 
>> 
> 
> 
> -- 
> 
> Regards
> Prakash Poudyal



Re: Mahout contributions

2016-04-28 Thread Khurrum Nasim
I agree with Andrew.   Mahout should remain indigenous.  


Prakash - you may want to create your own project on github using the mahout 
library.   


> On Apr 28, 2016, at 5:43 PM, Andrew Palumbo <ap@outlook.com> wrote:
> 
> I don't think that this sort of integration work would be a good fit 
> directly to the Mahout project.  Mahout is more about math, algorithms and an 
> environment to develop algorithms.  We stay away from direct platform 
> integration.  In the past we did have some elasticsearch/mahout integration 
> work that is not in the code base for this exact reason.  I would suggest 
> that better places to contribute something like this may be: PIO 
> (https://prediction.io/), or even directly as a package for spark 
> http://spark-packages.org/ .
> 
> Projects integrating Mahout have recently been added to PIO: 
> https://github.com/PredictionIO/template-scala-parallel-universal-recommendation.
>   
> 
> I think that the project that you are proposing would be a better fit there.
> 
> Thanks,
> 
> Andy
> 
> 
> 
> From: Saikat Kanjilal <sxk1...@hotmail.com>
> Sent: Thursday, April 28, 2016 1:50 PM
> To: dev@mahout.apache.org
> Subject: Re: Mahout contributions
> 
> I want to start with social data as an example, such as data returned 
> from the FB Graph API as well as user Twitter data; I will send some samples 
> later if you're interested.
> 
> Sent from my iPhone
> 
>> On Apr 28, 2016, at 10:41 AM, Khurrum Nasim <khurrum.na...@useitc.com> wrote:
>> 
>> 
>> What type of JSON payload size are we talking about here ?
>> 
>>> On Apr 28, 2016, at 1:32 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>>> 
>>> Because EL gives you the visualization and non Lucene type query constructs 
>>> as well and also that it already has a rest API that I plan on tying into 
>>> mahout.  I plan on wrapping some of the clustering algorithms that I 
>>> implement using Mahout and Spark as a service which can then make calls 
>>> into other services (namely elasticsearch and neo4j graph service).
>>> 
>>> Sent from my iPhone
>>> 
>>>> On Apr 28, 2016, at 10:22 AM, Khurrum Nasim <khurrum.na...@useitc.com> 
>>>> wrote:
>>>> 
>>>> @Saikat- why use EL instead of Lucene directly.
>>>> 
>>>> 
>>>> 
>>>>> On Apr 28, 2016, at 12:08 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>>>>> 
>>>>> This is great information, thank you. Based on this recommendation I won't 
>>>>> create a JIRA but will start work on my project, and when the code 
>>>>> approaches the percentages you are describing I will create the 
>>>>> appropriate JIRAs and put together a proposal to send to the list; sound 
>>>>> ok?  Based on your latest updates to the wiki I will work on a handful of 
>>>>> the clustering algorithms, since I see that the Spark implementations for 
>>>>> these are not yet complete.
>>>>> Thank you again
>>>>> 
>>>>>> From: ap@outlook.com
>>>>>> To: dev@mahout.apache.org
>>>>>> Subject: Re: Mahout contributions
>>>>>> Date: Thu, 28 Apr 2016 01:31:09 +
>>>>>> 
>>>>>> Saikat,
>>>>>> 
>>>>>> One other thing that I should say is that you do not need clearance or 
>>>>>> input from the committers to begin work on your project, and the 
>>>>>> interest can and should come from the community as a whole. You can 
>>>>>> write a proposal as you've done, and if you don't see any "+1"s or 
>>>>>> responses from the community as a whole within a few days, you may want 
>>>>>> to explain in more detail, give examples and use cases.  If you are 
>>>>>> still not seeing +1s or any responses from others then I think you can 
>>>>>> assume that there may not be interest; this is usually how things work.
>>>>>> 
>>>>>> However, if it's something that you're passionate about and you feel like 
>>>>>> you can deliver, this should not stop you.  People do not always read 
>>>>>> the dev@ emails or have time to respond.  You can still move forward 
>>>>>> with your proposed contribution by following the steps laid out in my 
>>>>>> previous email; follow the protocol at:
>>>>>> 
>>>>>

Re: Mahout contributions

2016-04-28 Thread Khurrum Nasim
@Saikat- why use EL instead of Lucene directly. 



> On Apr 28, 2016, at 12:08 PM, Saikat Kanjilal  wrote:
> 
> This is great information, thank you. Based on this recommendation I won't 
> create a JIRA but will start work on my project, and when the code approaches 
> the percentages you are describing I will create the appropriate JIRAs and put 
> together a proposal to send to the list; sound ok?  Based on your latest 
> updates to the wiki I will work on a handful of the clustering algorithms, 
> since I see that the Spark implementations for these are not yet complete.
> Thank you again
> 
>> From: ap@outlook.com
>> To: dev@mahout.apache.org
>> Subject: Re: Mahout contributions
>> Date: Thu, 28 Apr 2016 01:31:09 +
>> 
>> Saikat, 
>> 
>> One other thing that I should say is that you do not need clearance or input 
>> from the committers to begin work on your project, and the interest can and 
>> should come from the community as a whole. You can write a proposal as you've 
>> done, and if you don't see any "+1"s or responses from the community as a 
>> whole within a few days, you may want to explain in more detail, give 
>> examples and use cases.  If you are still not seeing +1s or any responses 
>> from others then I think you can assume that there may not be interest; this 
>> is usually how things work.  
>> 
>> However, if it's something that you're passionate about and you feel like you 
>> can deliver, this should not stop you.  People do not always read the dev@ 
>> emails or have time to respond.  You can still move forward with your 
>> proposed contribution by following the steps laid out in my previous email; 
>> follow the protocol at:
>> 
>> http://mahout.apache.org/developers/how-to-contribute.html
>> 
>> and create a JIRA.  When you have reached a significant amount of completion 
>> (around 70-80%), open a PR for review, this way you can explain in more 
>> detail. 
>> 
>> But please realize that when you open a JIRA for a new issue there is some 
>> expectation of a commitment on your part to complete it. 
>> 
>> For example, I am currently investigating some new plotting features.  I 
>> have spent a good deal of time this week and last already and am even 
>> mocking up code as a sketch of what may become an implementation before I 
>> open a "New Feature" JIRA for it.
>> 
>> My point is absolutely not to discourage you or anybody else from opening 
>> JIRAs for new features, but rather to let you know that when you open a JIRA 
>> for a new issue, it tells others that you are working on it, and thus may 
>> discourage another person with a similar idea from contributing this 
>> feature.  So it is best to open it once you've begun your work and are 
>> committed to it.
>> 
>> Andy
>> 
>> 
>> From: Saikat Kanjilal 
>> Sent: Wednesday, April 27, 2016 8:24 PM
>> To: dev@mahout.apache.org
>> Subject: RE: Mahout contributions
>> 
>> Andrew,Thank you very much for your input, I actually want to start a new 
>> set of JIRAs, here's what I want to work on, I want to build a framework 
>> that ties together search/visualization capability with some machine 
>> learning algorithms, so essentially think of it as tying in elasticsearch 
>> and kibana  into mahout , the user can search for their data with 
>> elasticsearch and for deeper analysis on that data they can feed that data 
>> into one or more mahout backends for analysis.  Another interesting tie in 
>> might be to hack kibana to render ggplot like graphics based on the output 
>> of mahout algorithms (assuming this can be a kibana plugin).
>> Before I go hog wild to create a bunch of JIRA's I'd like to know if there's 
>> interest in this initiative.  The tool will bring together the ELK stack 
>> with dynamic machine learning algorithms.  I can go into a lot more detail 
>> around use cases if there's enough interest.
>> Looking forward to your and other committers' input. Thanks
>> 
>>> From: ap@outlook.com
>>> To: dev@mahout.apache.org
>>> Subject: Re: Mahout contributions
>>> Date: Wed, 27 Apr 2016 20:16:38 +
>>> 
>>> Hello Saikat,
>>> 
>>> #1 and #2 above are already implemented.  #4 is tricky so i would not 
>>> recommend without a strong knowledge of the codebase, and #5 is now 
>>> deprecated.  (I've just updated the algorithms grid to reflect this).  The 
>>> algorithms page includes both algorithms implemented in the math-scala 
>>> library and algorithms which have CLI drivers written for them.
>>> 
>>> Please see: http://mahout.apache.org/developers/how-to-contribute.html
>>> 
>>> And please note that per that documentation, it is in everybody's best 
>>> interest to keep messages on list, contacting committers directly is 
>>> discouraged.
>>> 
>>> The best way to contribute (if you have not found a new bug or issue) would 
>>> be for you to pick a single open issue in the mahout JIRA which is not 
>>> already assigned, and start work on it.  

Re: Mahout contributions

2016-04-28 Thread Khurrum Nasim

What type of JSON payload size are we talking about here ?

> On Apr 28, 2016, at 1:32 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
> 
> Because EL gives you the visualization and non Lucene type query constructs 
> as well and also that it already has a rest API that I plan on tying into 
> mahout.  I plan on wrapping some of the clustering algorithms that I 
> implement using Mahout and Spark as a service which can then make calls into 
> other services (namely elasticsearch and neo4j graph service).
> 
> Sent from my iPhone
> 
>> On Apr 28, 2016, at 10:22 AM, Khurrum Nasim <khurrum.na...@useitc.com> wrote:
>> 
>> @Saikat- why use EL instead of Lucene directly. 
>> 
>> 
>> 
>>> On Apr 28, 2016, at 12:08 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>>> 
>>> This is great information thank you, based on this recommendation I won't 
>>> create a JIRA but start work on my project and when the code approaches the 
>>> percentages you are describing I will create the appropriate JIRA's and put 
>>> together a proposal to send to the list, sound ok?  Based on your latest 
>>> updates to the wiki i will work on a handful of the clustering algorithms 
>>> since I see that the Spark implementations for these are not yet complete.
>>> Thank you again
>>> 
>>>> From: ap@outlook.com
>>>> To: dev@mahout.apache.org
>>>> Subject: Re: Mahout contributions
>>>> Date: Thu, 28 Apr 2016 01:31:09 +
>>>> 
>>>> Saikat, 
>>>> 
>>>> One other thing that I should say is that you do not need clearance or 
>>>> input from the committers to begin work on your project, and the interest 
>>>> can and should come from the community as a whole. You can write a proposal 
>>>> as you've done, and if you don't see any "+1"s or responses from the 
>>>> community as a whole within a few days, you may want to explain in more 
>>>> detail, give examples and use cases.  If you are still not seeing +1s or 
>>>> any responses from others then I think you can assume that there may not 
>>>> be interest; this is usually how things work.  
>>>> 
>>>> However, if it's something that you're passionate about and you feel like you 
>>>> can deliver, this should not stop you.  People do not always read the 
>>>> dev@ emails or have time to respond.  You can still move forward with your 
>>>> proposed contribution by following the steps laid out in my previous 
>>>> email; follow the protocol at:
>>>> 
>>>> http://mahout.apache.org/developers/how-to-contribute.html
>>>> 
>>>> and create a JIRA.  When you have reached a significant amount of 
>>>> completion (around 70-80%), open a PR for review, this way you can explain 
>>>> in more detail. 
>>>> 
>>>> But please realize that when you open a JIRA for a new issue there is some 
>>>> expectation of a commitment on your part to complete it. 
>>>> 
>>>> For example, I am currently investigating some new plotting features.  I 
>>>> have spent a good deal of time this week and last already and am even 
>>>> mocking up code as a sketch of what may become an implementation before I 
>>>> open a "New Feature" JIRA for it.
>>>> 
>>>> My point is absolutely not to discourage you or anybody else from opening 
>>>> JIRAs for new features, rather to let you know that when you open a JIRA 
>>>> for a new issue, it tells others that you are working on it, and thus may 
>>>> discourage another with a similar idea from contributing this feature.  So it 
>>>> is best to open it once you've begun your work and are committed to it.
>>>> 
>>>> Andy
>>>> 
>>>> 
>>>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>>>> Sent: Wednesday, April 27, 2016 8:24 PM
>>>> To: dev@mahout.apache.org
>>>> Subject: RE: Mahout contributions
>>>> 
>>>> Andrew, thank you very much for your input. I actually want to start a new 
>>>> set of JIRAs, here's what I want to work on, I want to build a framework 
>>>> that ties together search/visualization capability with some machine 
>>>> learning algorithms, so essentially think of it as tying in elasticsearch 
>>>> and kibana  into mahout , the user can search for their data with 

Re: [Hello] from NASa

2016-05-22 Thread Khurrum Nasim
Interesting.


> On May 21, 2016, at 10:30 AM, Steven NASa  wrote:
> 
> Hi Pat,
> 
> Thank you for your reply, I fully understand that core algorithms and data
> are 2 different part of the system, this is why we have 2 major idea: "Big
> data" and "Machine Learning".
> 
> My requirements of Recommenders are just like what Amazon does: Item-based,
> but the number of items and users is very big, so there comes to a very
> huge matrix. So I am still learning using Mahout to make the matrix
> computing on a distributed system. After I am familiar with Mahout, I think
> I can have some works on GPU acceleration for Matrix computing and some
> other mathematical optimization.
> About the data prep, I think we can define an abstraction of
> conventions in data
> prep, data ingestion, and serving components. Users can following some
> conventions to feed data to Mahout.
> 
> Steven NASa
> 2016/05/21
> 
> 2016-05-21 22:06 GMT+08:00 Pat Ferrel :
> 
>> Hi Stephen,
>> 
>> We have implemented SVD, ALS, and CCO for recommender, but these are only
>> core algorithms, not really recommenders as Mahout has done in the past.
>> The reason for this is that there are data prep, data ingestion, and
>> serving components that, in a modern system, must be supplied also. So far
>> Mahout has stayed aways from actually including servers, either for input
>> of output.
>> 
>> That said there is plenty of room for algorithm development in Mahout. I
>> worked on the CCO algorithm, which uses PredictionIO (proposed for the
>> Apache Incubator) to supply the serving components.
>> 
>> Someone with your experience in real-life use of recommenders is certainly
>> welcome.
>> 
>> What type of project did you have in mind?
>> 
>> 
>> On May 20, 2016, at 10:00 AM, Suneel Marthi  wrote:
>> 
>> Welcome to the project Steven!!
>> 
>> On Fri, May 20, 2016 at 10:07 AM, Steven NASa  wrote:
>> 
>>> Hi Folk & Masters,
>>> 
>>> My name is *NASa*. I am now working for an e-commerce B2C company in
>> China,
>>> dealing with Transaction Process development in C++ & Java on Linux
>>> environment.
>>> 
>>> As you know, *Recommender System* is quite valuable and important to an
>>> e-commerce online shopping website like Amazon. I was told and required
>> to
>>> design and implement a Recommender System which can bring some value to
>> my
>>> Company. Our System is based on C++ codes. So I was searching for an
>> robust
>>> Machine Learning framework in C++ which can help me to easily implement a
>>> Recommender System. I did not find any one which can satisfy my
>>> requirements, but only some C++ math libraries.
>>> 
>>> Our system is based on an internal distributed frameworks like RPC and DB
>>> access on Linux environment based on C++ programming language. But I find
>>> it is really inconvenient to implement a Recommender System in C++ from
>>> zero without distributed computing library supporting, like
>>> implementing *Collaborative
>>> Filtering* with SVD in a distributed computing way. So I am trying to
>> find
>>> a framework/library with is designed based on Distributed-System. There I
>>> come to *Mahout*.
>>> 
>>> I wish I can build a library that can help people easily and quickly
>> build
>>> up a Recommender System based on Distributed System and also use the
>>> Machine Learning Algorithms in distributed way. Apache has many amazing
>>> projects which can help people to build up robust distributed system
>>> easily. So I am moving to using “Java” environment.
>>> 
>>> I am new to *Mahout* and *Hadoop*, *Spark*, *Scala* and I learned Andrew
>>> Ng’s “Machine Learning” from Coursera. So I have
>>> the basic knowledge of Machine Learning, and now I am keeping forward to
>>> *Deep
>>> Learning* and *Convex Optimization*, some other Mathematical Optimization
>>> implementation. I am now still learning and getting familiar with
>> Mahout. I
>>> hope I can contribute some codes to Mahout in the early future with
>>> learning by coding and coding by learning.
>>> NASa 2016/05/20
>>> 
>> 
>> 



Re: [Hello] from NASa

2016-05-20 Thread Khurrum Nasim
Sounds more like demand prediction to me.   

However, your system should be able to interact with other non-C/C++ systems.  
Apache Thrift, a cross-language RPC framework, is one way to do that.   

Which brings me to the following: would it be a valuable feature for the Mahout 
library to provide connectivity with other systems using Thrift?   


Thoughts ?

Khurrum

p.s. Andrew Ng can put you to sleep easily. 


> On May 20, 2016, at 10:07 AM, Steven NASa  wrote:
> 
> Hi Folk & Masters,
> 
> My name is *NASa*. I am now working for an e-commerce B2C company in China,
> dealing with Transaction Process development in C++ & Java on Linux
> environment.
> 
> As you know, *Recommender System* is quite valuable and important to an
> e-commerce online shopping website like Amazon. I was told and required to
> design and implement a Recommender System which can bring some value to my
> Company. Our System is based on C++ codes. So I was searching for an robust
> Machine Learning framework in C++ which can help me to easily implement a
> Recommender System. I did not find any one which can satisfy my
> requirements, but only some C++ math libraries.
> 
> Our system is based on an internal distributed frameworks like RPC and DB
> access on Linux environment based on C++ programming language. But I find
> it is really inconvenient to implement a Recommender System in C++ from
> zero without distributed computing library supporting, like
> implementing *Collaborative
> Filtering* with SVD in a distributed computing way. So I am trying to find
> a framework/library with is designed based on Distributed-System. There I
> come to *Mahout*.
> 
> I wish I can build a library that can help people easily and quickly build
> up a Recommender System based on Distributed System and also use the
> Machine Learning Algorithms in distributed way. Apache has many amazing
> projects which can help people to build up robust distributed system
> easily. So I am moving to using “Java” environment.
> 
> I am new to *Mahout* and *Hadoop*, *Spark*, *Scala* and I learned Andrew
> Ng’s “Machine Learning” from Coursera. So I have
> the basic knowledge of Machine Learning, and now I am keeping forward to *Deep
> Learning* and *Convex Optimization*, some other Mathematical Optimization
> implementation. I am now still learning and getting familiar with Mahout. I
> hope I can contribute some codes to Mahout in the early future with
> learning by coding and coding by learning.
> NASa 2016/05/20



Re: LLR quick clarification

2016-05-12 Thread Khurrum Nasim
hey all ,


#1.  Where does the matrix operations code live in Mahout, or rather, which 
packages contain it?  


#2.  I have a fairly large JSON string. How can I apply the Mahout library to 
analyze it by creating a training model for this string, and ideally reuse that 
model on subsequent similar strings?  Your informed response is appreciated.  



Many Thanks,

Khurrum


> On May 12, 2016, at 2:19 PM, Ted Dunning  wrote:
> 
> It just means that there is an association. Causation is much more
> difficult to ascertain.
> 
> 
> 
> On Wed, May 4, 2016 at 6:06 AM, Nikaash Puri  wrote:
> 
>> Hi,
>> 
>> Just wanted to clarify a small doubt. On running LLR with primary
>> indicator as view and secondary indicator as purchase. Say, one line of the
>> cross-cooccurrence matrix looks as follows:
>> 
>> view-purchase cross-cooccurrence matrix:
>> 
>> I1 I2:0.9, I3:0.8, ……..
>> …
>> 
>> This, in very simple terms then means that purchasing I2 should lead to
>> the recommendation of viewing I1, is that correct? Of course, ignoring the
>> other indicators for now.
>> 
>> Thank you,
>> Nikaash Puri
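
For readers following the exchange above: the LLR score behind a cross-cooccurrence entry is a log-likelihood ratio test on a 2x2 table of co-occurrence counts. The sketch below is a plain-Python illustration of the entropy-based form of that test. It is not Mahout's code; the function names (`x_log_x`, `entropy`, `llr`) and the sample counts are assumptions made up for this example. Here `k11` counts users who both viewed I1 and purchased I2, `k12` and `k21` count users who did only one of the two, and `k22` counts users who did neither.

```python
import math


def x_log_x(x):
    """Return x * ln(x), treating 0 * ln(0) as 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table of counts.

    k11: both events, k12: first event only,
    k21: second event only, k22: neither event.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb tiny negative values from rounding.
    return max(0.0, 2.0 * (row + col - mat))


if __name__ == "__main__":
    # Hypothetical counts: strong co-occurrence vs. independent events.
    print("associated :", llr(100, 10, 20, 1000))
    print("independent:", llr(10, 10, 10, 10))
```

Swapping `k12` and `k21` (that is, swapping the roles of the two events) gives the same score, which fits Ted's point: a high value says the two events are associated, not that one leads to the other.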



stochastic nature

2016-05-02 Thread Khurrum Nasim
Hey All,

I’d like to know if Mahout uses any randomized algorithms.   I’m thinking it 
probably does.  Can somebody point me to the packages that use randomized 
algorithms?   

Thanks,

Khurrum



Re: Mahout contributions

2016-05-02 Thread Khurrum Nasim
@Saikat - One thing I will say is that REST is slow.  There is latency because 
of deserialization overhead.  For very large datasets, REST is probably not a 
good choice.  


> On Apr 30, 2016, at 2:35 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
> 
> Andrew et al, I wanted to ask about a few items while I'm researching my dev 
> proposal, so what I'm looking to build is a streaming analytics platform to 
> do things like collaborative filtering and anomaly detection on large amounts 
> of streaming data that are either generated from events (kafka) or through a 
> firehose like Amazon Kinesis, my initial thinking is that this pipe of 
> events/data would be connected to a rest API that sits on top of mahout, the 
> backend underneath mahout would use a hybrid form of spark as well as spark 
> streaming, I'm wondering whether Samsara was designed from the ground up to 
> deal with large amounts of streaming data or whether this is not a use case 
> targeted yet.  My goal is to build a platform with several data sources/sinks 
> and produce intermediate checkpoints where transformations are applied to the 
> data before once again sending to a set of sinks/sources.  Therefore the 
> potential fits into and out of mahout include:
> 1) A rest API that leverages spray and akka and invokes one or more 
> algorithms in mahout
> 2) A runtime environment with scala actors that allows one to either ingest 
> data or perform transformations on data through the use of various 
> classification and clustering algorithms, the runtime environment would 
> ingest algorithms using mahout as a library
> 3) A rich set of actors dealing with various no sql and graph based 
> datastores (cassandra/neo4j/titan/mongo)
> 
> Some insight into Samsara would be great as I'm trying to understand the 
> entry points into mahout.
> Thanks in advance.
> 
>> From: ap@outlook.com
>> To: dev@mahout.apache.org
>> Subject: Re: Mahout contributions
>> Date: Thu, 28 Apr 2016 21:43:19 +
>> 
>> I don't think that this sort of integration work would be a good fit 
>> directly to the Mahout project.  Mahout is more about math, algorithms and 
>> an environment to develop algorithms.  We stay away from direct platform 
>> integration.  In the past we did have some elasticsearch/mahout integration 
>> work that is not in the code base for this exact reason.  I would suggest 
>> that better places to contribute something like this may be: PIO 
>> (https://prediction.io/), or even directly as a package for spark 
>> http://spark-packages.org/ .
>> 
>> Recent projects integrating Mahout have recently been added to PIO: 
>> https://github.com/PredictionIO/template-scala-parallel-universal-recommendation.
>>   
>> 
>> I think that the project that you are proposing would be a better fit there.
>> 
>> Thanks,
>> 
>> Andy
>> 
>> 
>> 
>> From: Saikat Kanjilal <sxk1...@hotmail.com>
>> Sent: Thursday, April 28, 2016 1:50 PM
>> To: dev@mahout.apache.org
>> Subject: Re: Mahout contributions
>> 
>> I want to start with social data as an example, for example data returned 
>> from FB graph API as well user Twitter data, will send some samples later if 
>> you're interested.
>> 
>> Sent from my iPhone
>> 
>>> On Apr 28, 2016, at 10:41 AM, Khurrum Nasim <khurrum.na...@useitc.com> 
>>> wrote:
>>> 
>>> 
>>> What type of JSON payload size are we talking about here ?
>>> 
>>>> On Apr 28, 2016, at 1:32 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>>>> 
>>>> Because EL gives you the visualization and non Lucene type query 
>>>> constructs as well and also that it already has a rest API that I plan on 
>>>> tying into mahout.  I plan on wrapping some of the clustering algorithms 
>>>> that I implement using Mahout and Spark as a service which can then make 
>>>> calls into other services (namely elasticsearch and neo4j graph service).
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On Apr 28, 2016, at 10:22 AM, Khurrum Nasim <khurrum.na...@useitc.com> 
>>>>> wrote:
>>>>> 
>>>>> @Saikat- why use EL instead of Lucene directly.
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Apr 28, 2016, at 12:08 PM, Saikat Kanjilal <sxk1...@hotmail.com> 
>>>>>> wrote:
>>>>>> 
>>>>>> This is great information thank you, based on this recommendation I 

Re: stochastic nature

2016-05-02 Thread Khurrum Nasim
Thanks for the insight, Dmitriy.   I will look further into Spark to understand 
how it handles parallelization and distributed processing.


> On May 2, 2016, at 12:39 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> 
> by probabilistic algorithms i mostly mean inference involving monte carlo
> type mechanisms (Gibbs sampling LDA, which i think might still be part of
> our MR collection, might be an example, as well as its faster counterpart,
> variational Bayes inference).
> 
> the parallelization strategies are just standard spark mechanisms (in
> case of spark), mostly using their standard hash samplers (which in
> math speak are really uniform multinomial samplers).
> 
> On Mon, May 2, 2016 at 9:25 AM, Khurrum Nasim <khurrum.na...@useitc.com>
> wrote:
> 
>> Hey Dimitri -
>> 
>> Yes I meant probabilistic algorithms.  If mahout doesn’t use probabilistic
>> algos then how does it accomplish a degree of optimal parallelization ?
>> Wouldn’t you need randomization to spread out the processing of tasks.
>> 
>>> On May 2, 2016, at 12:13 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>> 
>>> yes mahout has stochastic svd and pca which are described at length in
>> the
>>> samsara book. The book examples in Andrew Palumbo's github also contain
>> an
>>> example of computing k-means|| sketch.
>>> 
>>> if you mean _probabilistic_ algorithms, although i have done some things
>>> outside the public domain, nothing has been contributed.
>>> 
>>> You are very welcome to try something if you don't have big constraints
>> on
>>> oss contribution.
>>> 
>>> -d
>>> 
>>> On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim <khurrum.na...@useitc.com>
>>> wrote:
>>> 
>>>> Hey All,
>>>> 
>>>> I’d like to know if Mahout uses any randomized algorithms.   I’m
>> thinking
>>>> it probably does.  Can somebody point me to the packages that utilized
>>>> randomized algos.
>>>> 
>>>> Thanks,
>>>> 
>>>> Khurrum
>>>> 
>>>> 
>> 
>> 



Re: stochastic nature

2016-05-02 Thread Khurrum Nasim
Hey Dmitriy - 

Yes, I meant probabilistic algorithms.  If Mahout doesn’t use probabilistic 
algos, then how does it achieve an optimal degree of parallelization? 
Wouldn’t you need randomization to spread out the processing of tasks?  

> On May 2, 2016, at 12:13 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> 
> yes mahout has stochastic svd and pca which are described at length in the
> samsara book. The book examples in Andrew Palumbo's github also contain an
> example of computing k-means|| sketch.
> 
> if you mean _probabilistic_ algorithms, although i have done some things
> outside the public domain, nothing has been contributed.
> 
> You are very welcome to try something if you don't have big constraints on
> oss contribution.
> 
> -d
> 
> On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim <khurrum.na...@useitc.com>
> wrote:
> 
>> Hey All,
>> 
>> I’d like to know if Mahout uses any randomized algorithms.   I’m thinking
>> it probably does.  Can somebody point me to the packages that utilized
>> randomized algos.
>> 
>> Thanks,
>> 
>> Khurrum
>> 
>> 



Re: stochastic nature

2016-05-03 Thread Khurrum Nasim
Hi Dmitriy,

Can you please provide a code reference for this in Mahout? 

Thanks,


> On May 2, 2016, at 8:59 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> 
> also, mahout does have optimizer that simply decides on degree of
> parallelism of the _product_. I.e., if it computes C=A'B then it figures
> that final results should be split N ways. but it doesn't apply the
> partition function -- it just uses the usual hash partitioner to forward
> the keys, i don't think we ever override that.
> 
> On Mon, May 2, 2016 at 9:39 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> 
>> by probabilistic algorithms i mostly mean inference involving monte carlo
>> type mechanisms (Gibbs sampling LDA, which i think might still be part of
>> our MR collection, might be an example, as well as its faster counterpart,
>> variational Bayes inference).
>> 
>> the parallelization strategies are just standard spark mechanisms (in
>> case of spark), mostly using their standard hash samplers (which in
>> math speak are really uniform multinomial samplers).
>> 
>> On Mon, May 2, 2016 at 9:25 AM, Khurrum Nasim <khurrum.na...@useitc.com>
>> wrote:
>> 
>>> Hey Dimitri -
>>> 
>>> Yes I meant probabilistic algorithms.  If mahout doesn’t use
>>> probabilistic algos then how does it accomplish a degree of optimal
>>> parallelization ? Wouldn’t you need randomization to spread out the
>>> processing of tasks.
>>> 
>>>> On May 2, 2016, at 12:13 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>>> wrote:
>>>> 
>>>> yes mahout has stochastic svd and pca which are described at length in
>>> the
>>>> samsara book. The book examples in Andrew Palumbo's github also contain
>>> an
>>>> example of computing k-means|| sketch.
>>>> 
>>>> if you mean _probabilistic_ algorithms, although i have done some things
>>>> outside the public domain, nothing has been contributed.
>>>> 
>>>> You are very welcome to try something if you don't have big constraints
>>> on
>>>> oss contribution.
>>>> 
>>>> -d
>>>> 
>>>> On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim <khurrum.na...@useitc.com
>>>> 
>>>> wrote:
>>>> 
>>>>> Hey All,
>>>>> 
>>>>> I’d like to know if Mahout uses any randomized algorithms.   I’m
>>> thinking
>>>>> it probably does.  Can somebody point me to the packages that utilized
>>>>> randomized algos.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Khurrum
>>>>> 
>>>>> 
>>> 
>>> 
>> 



Re: stochastic nature

2016-05-03 Thread Khurrum Nasim
Thank you Andrew and Dmitriy for your informed responses.  





> On May 2, 2016, at 8:59 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> 
> also, mahout does have optimizer that simply decides on degree of
> parallelism of the _product_. I.e., if it computes C=A'B then it figures
> that final results should be split N ways. but it doesn't apply the
> partition function -- it just uses the usual hash partitioner to forward
> the keys, i don't think we ever override that.
> 
> On Mon, May 2, 2016 at 9:39 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> 
>> by probabilistic algorithms i mostly mean inference involving monte carlo
>> type mechanisms (Gibbs sampling LDA, which i think might still be part of
>> our MR collection, might be an example, as well as its faster counterpart,
>> variational Bayes inference).
>> 
>> the parallelization strategies are just standard spark mechanisms (in
>> case of spark), mostly using their standard hash samplers (which in
>> math speak are really uniform multinomial samplers).
>> 
>> On Mon, May 2, 2016 at 9:25 AM, Khurrum Nasim <khurrum.na...@useitc.com>
>> wrote:
>> 
>>> Hey Dimitri -
>>> 
>>> Yes I meant probabilistic algorithms.  If mahout doesn’t use
>>> probabilistic algos then how does it accomplish a degree of optimal
>>> parallelization ? Wouldn’t you need randomization to spread out the
>>> processing of tasks.
>>> 
>>>> On May 2, 2016, at 12:13 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>>> wrote:
>>>> 
>>>> yes mahout has stochastic svd and pca which are described at length in
>>> the
>>>> samsara book. The book examples in Andrew Palumbo's github also contain
>>> an
>>>> example of computing k-means|| sketch.
>>>> 
>>>> if you mean _probabilistic_ algorithms, although i have done some things
>>>> outside the public domain, nothing has been contributed.
>>>> 
>>>> You are very welcome to try something if you don't have big constraints
>>> on
>>>> oss contribution.
>>>> 
>>>> -d
>>>> 
>>>> On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim <khurrum.na...@useitc.com
>>>> 
>>>> wrote:
>>>> 
>>>>> Hey All,
>>>>> 
>>>>> I’d like to know if Mahout uses any randomized algorithms.   I’m
>>> thinking
>>>>> it probably does.  Can somebody point me to the packages that utilized
>>>>> randomized algos.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Khurrum
>>>>> 
>>>>> 
>>> 
>>> 
>> 



Re: [jira] [Commented] (MAHOUT-1788) spark-itemsimilarity integration test script cleanup

2016-04-18 Thread Khurrum Nasim
Hi Guys,

Can Mahout be used for things like face detection?  Also, which unit tests or 
integration tests do you recommend I run to get a better feel for 
the execution flow?  

I’m still slowly acclimating to the project.  But hopefully should come up to 
speed soon.   


Many Thanks,

Khurrum




> On Mar 30, 2016, at 3:10 PM, Suneel Marthi <smar...@apache.org> wrote:
> 
> Thanks Khurrum for stepping up.
> 
> You just need basic programming skills - Java/Scala to be able to
> contribute. We can help you with the algorithms and linear algebra stuff.
> 
> 
> Welcome aboard !!
> 
> 
> On Wed, Mar 30, 2016 at 3:05 PM, Khurrum Nasim <khurrum.na...@useitc.com>
> wrote:
> 
>> Thanks for the advice Dimitry.  I’m already signed up on ASF jira.  My
>> handle is “nasimk”
>> 
>> Do I need to be a linear algebra expert and or math phd  to contribute ?
>> I have 10 plus years of computer programming experience.  my background is
>> comp sci.
>> 
>> Khurrum
>> 
>> 
>> 
>> 
>> 
>>> On Mar 30, 2016, at 2:57 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>> 
>>> PS You may also want to sign up with ASF Jira so we can assign issues to
>>> yourself.
>>> 
>>> On Wed, Mar 30, 2016 at 11:52 AM, Dmitriy Lyubimov <dlie...@gmail.com>
>>> wrote:
>>> 
>>>> 
>>>> 
>>>> On Wed, Mar 30, 2016 at 11:43 AM, Khurrum Nasim <
>> khurrum.na...@useitc.com>
>>>> wrote:
>>>> 
>>>>> Thanks Dimirtry.
>>>>> 
>>>>> I'll take a look and see where I can start pitching in.  Do I need
>>>>> contributor access ? how  would I create feature branch of my work ?
>>>>> 
>>>> 
>>>> Khurrum,
>>>> 
>>>> you only need github account. What you need is to create mahout's master
>>>> fork in your github space and keep it in sync, as possible, with master
>> as
>>>> you go (by doing regular pulls). That way you have the most chance of
>>>> having least conflicts possible.
>>>> 
>>>> At any point in time (I recommend at perhaps when you feel you are about
>>>> 50 to 70% done or just need a code advice), you can create a github pull
>>>> request to the apache/mahout master. Make sure to include MAHOUT-XXX
>> issue
>>>> in the head of the pull request, that way ASF will automatically
>> propagate
>>>> code comments to jira, and so all discussion can be done entirely on
>> github.
>>>> 
>>>> Again, if you take on a significant contribution (such as a new numerical
>>>> method contribution), I recommend to discuss the proposal on the @dev
>> list
>>>> 
>>>> thanks.
>>>> 
>>>> 
>>>>> 
>>>>> Khurrum
>>>>> 
>>>>>> On Mar 30, 2016, at 1:12 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> Oh but of course! please do!
>>>>>> 
>>>>>> You may work on any issue, this or any other of your choice, or even
>> on
>>>>> any
>>>>>> new issue you can think of (for sizeable contributions it is
>>>>> recommended to
>>>>>> start discussion on the @dev list first though, to make sure to
>> benefit
>>>>>> from experience of others. Please file any new issue first to jira).
>>>>>> 
>>>>>> On Wed, Mar 30, 2016 at 9:05 AM, shashi bushan dongur (JIRA) <
>>>>>> j...@apache.org> wrote:
>>>>>> 
>>>>>>> 
>>>>>>>  [
>>>>>>> 
>>>>> 
>> https://issues.apache.org/jira/browse/MAHOUT-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218216#comment-15218216
>>>>>>> ]
>>>>>>> 
>>>>>>> shashi bushan dongur commented on MAHOUT-1788:
>>>>>>> --
>>>>>>> 
>>>>>>> Hello. I would like to start contributing to mahout. Can I work on
>> this
>>>>>>> issue?
>>>>>>> 
>>>>>>>> spark-itemsimilarity integration test script cleanup
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  Key: MAHOUT-1788
>>>>>>>>  URL:
>> https://issues.apache.org/jira/browse/MAHOUT-1788
>>>>>>>>  Project: Mahout
>>>>>>>>   Issue Type: Improvement
>>>>>>>>   Components: cooccurrence
>>>>>>>> Affects Versions: 0.11.0
>>>>>>>> Reporter: Pat Ferrel
>>>>>>>> Assignee: Pat Ferrel
>>>>>>>> Priority: Trivial
>>>>>>>>  Fix For: 1.0.0
>>>>>>>> 
>>>>>>>> 
>>>>>>>> binary release does not contain data for itemsimilarity tests, neither
>>>>>>> binary nor source versions will run on a cluster unless data is hand
>>>>> copied
>>>>>>> to hdfs.
>>>>>>>> Clean this up so it copies data if needed and the data is in both
>>>>>>> versions.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> This message was sent by Atlassian JIRA
>>>>>>> (v6.3.4#6332)
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>> 
>> 



Re: [jira] [Commented] (MAHOUT-1788) spark-itemsimilarity integration test script cleanup

2016-04-19 Thread Khurrum Nasim
okay thanks - i’ll run those tests. i actually ran a few others as well like 
the MatrixWritableTest.  

> On Apr 18, 2016, at 8:22 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> 
> I am not sure of your question about tests...
> 
> there are in-memory tests which you can run by 'mvn test' in /math-scala
> module; distributed tests are done per engine under 'spark', 'h2o' or
> 'flink' modules.
> 
> 
> On Mon, Apr 18, 2016 at 5:19 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> 
>> i meant "not so much a library"
>> 
>> On Mon, Apr 18, 2016 at 5:18 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>> wrote:
>> 
>>> Khurrum,
>>> 
>>> mahout is so much  a library at this point.
>>> 
>>> if you mean if it can be used to build networks with 2d inputs, yes i did
>>> some of that. multi-epoch SGD based systems should be easy enough to build,
>>> and will probably have a reasonable performance -- although I think
>>> dedicated CNN systems like Caffe would still run faster at this point. Full
>>> batch trainers are somewhat slow for larger problems though, my
>>> investigation points that  there are architectural problems in spark that
>>> are hard to overcome at this point for high IO algorithms.
>>> 
>>> On Mon, Apr 18, 2016 at 11:49 AM, Khurrum Nasim <khurrum.na...@useitc.com
>>>> wrote:
>>> 
>>>> Hi Guys,
>>>> 
>>>> Can Mahout be used for things like face detection ?  Also which unit
>>>> tests or integration tests do you recommend I should run just to get a
>>>> better feel of the execution flow.
>>>> 
>>>> I’m still slowly acclimating to the project.  But hopefully should come
>>>> up to speed soon.
>>>> 
>>>> 
>>>> Many Thanks,
>>>> 
>>>> Khurrum
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Mar 30, 2016, at 3:10 PM, Suneel Marthi <smar...@apache.org> wrote:
>>>>> 
>>>>> Thanks Khurrum for stepping up.
>>>>> 
>>>>> You just need basic programming skills - Java/Scala to be able to
>>>>> contribute. We can help you with the algorithms and linear algebra
>>>> stuff.
>>>>> 
>>>>> 
>>>>> Welcome aboard !!
>>>>> 
>>>>> 
>>>>> On Wed, Mar 30, 2016 at 3:05 PM, Khurrum Nasim <
>>>> khurrum.na...@useitc.com>
>>>>> wrote:
>>>>> 
>>>>>> Thanks for the advice Dmitriy.  I’m already signed up on ASF jira.  My
>>>>>> handle is “nasimk”
>>>>>> 
>>>>>> Do I need to be a linear algebra expert and/or math PhD to contribute?
>>>>>> I have 10 plus years of computer programming experience.  My background is
>>>>>> comp sci.
>>>>>> 
>>>>>> Khurrum
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Mar 30, 2016, at 2:57 PM, Dmitriy Lyubimov <dlie...@gmail.com>
>>>> wrote:
>>>>>>> 
>>>>>>> PS You may also want to sign up with ASF Jira so we can assign issues
>>>>>>> to you.
>>>>>>> 
>>>>>>> On Wed, Mar 30, 2016 at 11:52 AM, Dmitriy Lyubimov <
>>>> dlie...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Mar 30, 2016 at 11:43 AM, Khurrum Nasim <
>>>>>> khurrum.na...@useitc.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Thanks Dmitriy.
>>>>>>>>> 
>>>>>>>>> I'll take a look and see where I can start pitching in.  Do I need
>>>>>>>>> contributor access? How would I create a feature branch of my work?
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Khurrum,
>>>>>>>> 
>>>>>>>> you only need github account. What you need is to create mahout's
>>>> master
>>>>>>>> fork in your github space and keep it in sync, as possible, with
>>>> master
>>>&

Re: Congratulations to our new Chair

2016-04-20 Thread khurrum . nasim
Congrats.  

Sent from my iPhone

> On Apr 20, 2016, at 8:33 PM, Andrew Palumbo  wrote:
> 
> Thank you guys!
> 
>  Original message 
> From: Andrew Musselman 
> Date: 04/20/2016 8:14 PM (GMT-05:00)
> To: dev@mahout.apache.org, u...@mahout.apache.org
> Subject: Re: Congratulations to our new Chair
> 
> Suneel, thanks for your great work as Chair and thank you Andy for stepping in!
> 
>> On Wed, Apr 20, 2016 at 5:00 PM, Dmitriy Lyubimov  wrote:
>> 
>> congrats!
>> 
>>> On Wed, Apr 20, 2016 at 4:55 PM, Suneel Marthi  wrote:
>>> 
>>> Please join me in congratulating Andrew Palumbo on becoming our new
>>> Project Chair.
>>> 
>>> As for me, it was a pleasure to serve as Chair starting with the Mahout
>>> 0.10.0 release and ending with the recent 0.12.0 release, and perhaps we
>>> will do it again someday


Re: Congratulations to our new Chair

2016-04-21 Thread Khurrum Nasim
andy is the popular guy !


> On Apr 21, 2016, at 11:21 AM, Pat Ferrel  wrote:
> 
> Congratulations Andy, well deserved. 
> 
> On Apr 21, 2016, at 6:01 AM, Shannon Quinn  wrote:
> 
> Thanks Suneel for your excellent leadership.
> 
> Congratulations Andrew!
> 
> On 4/21/16 3:38 AM, Alessandro Negro wrote:
>> Congratulations!
>> 
>> On 21 Apr 2016, at 02:36, khurrum.na...@useitc.com wrote:
>> 
>>> Congrats.
>>> 
>>> Sent from my iPhone
>>> 
>>>> On Apr 20, 2016, at 8:33 PM, Andrew Palumbo  wrote:
>>>>
>>>> Thank you guys!
>>>>
>>>>  Original message 
>>>> From: Andrew Musselman 
>>>> Date: 04/20/2016 8:14 PM (GMT-05:00)
>>>> To: dev@mahout.apache.org, u...@mahout.apache.org
>>>> Subject: Re: Congratulations to our new Chair
>>>>
>>>> Suneel, thanks for your great work as Chair and thank you Andy for stepping in!
>>>>
>>>>> On Wed, Apr 20, 2016 at 5:00 PM, Dmitriy Lyubimov  wrote:
>>>>>
>>>>> congrats!
>>>>>
>>>>>> On Wed, Apr 20, 2016 at 4:55 PM, Suneel Marthi  wrote:
>>>>>>
>>>>>> Please join me in congratulating Andrew Palumbo on becoming our new
>>>>>> Project Chair.
>>>>>>
>>>>>> As for me, it was a pleasure to serve as Chair starting with the Mahout
>>>>>> 0.10.0 release and ending with the recent 0.12.0 release, and perhaps we
>>>>>> will do it again someday
> 
> 



Re: FOSDEM 2017 Open Source Conference - Brussels

2017-01-31 Thread Khurrum Nasim
yes - stickers would be nice.

Thanks,

Khurrum.

On Jan 31, 2017, 6:28 AM -0500, Sharan F , wrote:
> Hi All
>
> Just for info - I've been talking to Andrew Palumbo about getting some
> Mahout stickers printed for the community to use and also generally to
> see if there was anyone from Mahout coming to FOSDEM that could pick up
> some ASF swag to use at any Mahout european based presentations.
>
> Please let me know if someone from Mahout will be there and if so, I
> can plan to bring some extra stuff for you.
>
> Thanks
> Sharan
>
> On 31/01/17 12:16, Isabel Drost-Fromm wrote:
> > Hi,
> >
> > On Thu, Jan 12, 2017 at 01:12:10PM +0100, Sharan F wrote:
> > > Attending FOSDEM is completely free and the ASF will again be running a
> > > booth there. Our main focus will be on talking to people about the ASF, our
> > > projects and communities.
> > Anyone from the Mahout community planning to be there?
> >
> >
> > Isabel
>


code review

2016-10-04 Thread Khurrum Nasim
Codacy is free for open source projects.  And does a decent job of reviewing 
your code. 

Might be worthwhile to have it review mahout forks and branches.

Khurrum

> On Sep 26, 2016, at 1:21 PM, Suneel Marthi  wrote:
> 
> @Tiramisu most sparse networks like DBNs are modeled as graphs and hence
> Dmitriy had mentioned a graph-based solution.
> 
> The question for you is "What/Which platform is of most interest to u - a
> graph-based solution or an algebraic solution? "
> 
> If it's an algebraic solution you are looking for, Mahout provides the
> physical and logical operators for that and we are also in the process of
> rolling out native physical operators in the next release.  I believe
> (correct me here) that you are trying to use matrix multiplications for ur
> DBN solution, if so the suggestion would be to create a javacpp - MPI
> bridge.
> 
> Based on your interest and your requirements, we can take this conversation
> further.
> 
> Thanks for reaching out.
> 
> On Mon, Sep 26, 2016 at 11:53 AM, Tiramisu Ling 
> wrote:
> 
>> where is the graph based solution in here?
>> 
>> 2016-09-26 23:40 GMT+08:00 Dmitriy Lyubimov :
>> 
>>> Do you want to approach these problems from mostly algebraic solution vs.
>>> e.g. graph based solution?
>>> 
>>> On Wed, Sep 21, 2016 at 10:08 PM, Tiramisu Ling 
>>> wrote:
>>> 
>>>> Hi Dmitriy,
>>>>
>>>> Thank you for your reply! I'm a postgraduate student of computer science
>>>> and my research direction is deep learning. The focus of my research is
>>>> using DBNs to do link prediction (between network nodes), which is the
>>>> major reason I want to get involved in mahout and make some
>>>> contributions. Most of my programming knowledge is in Python and Matlab
>>>> and, honestly, I only have a basic level of Java programming skill. But I
>>>> believe I could learn more about how to use Java by reading the codebase
>>>> of mahout, trust me ;).
>>>>
>>>> Best Regards,
>>>> MikeLing
>>>>
>>>> 2016-09-22 6:12 GMT+08:00 Dmitriy Lyubimov :
>>>>
>>>>> ps another way to approach it, which in fact seems to be most common
>>>>> motivator here, is to start with a pragmatic problem one already has at
>>>>> hand. Abstract tinkering  rarely produces strategically useful
>>>>> contributions, it seems.
>>>>>
>>>>> On Wed, Sep 21, 2016 at 3:09 PM, Dmitriy Lyubimov
>>>>> wrote:
>>>>>
>>>>>> if you can tell us about your background a little bit, perhaps we could
>>>>>> have ideas. frankly we have a pretty sprawling roadmap. At least a set of
>>>>>> ideas. It's frankly more than we can realistically do, we can use help,
>>>>>> yes.
>>>>>>
>>>>>> On Sat, Sep 17, 2016 at 8:52 AM, Tiramisu Ling <saberge...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey everyone, I'm new to mahout and I would like to contribute to it. In
>>>>>>> general, I had read the how to contribute page in [1], and I had clone the
>>>>>>> repo from github. So what should I do next? Are there any issue like 'good
>>>>>>> first bug' to work with? Thank you very much!:)
>>>>>>>
>>>>>>> [1]http://mahout.apache.org/developers/how-to-contribute.html
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> MikeLing
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>> 
>> 



Re: Trying to write the KMeans Clustering Using "Apache Mahout Samsara"

2017-04-25 Thread Khurrum Nasim
Can mahout be used for self driving tech ?

Thanks,

Khurrum.

On Apr 24, 2017, 10:34 PM -0400, KHATWANI PARTH BHARAT 
, wrote:
> @Trevor and @Dmitriy
>
> The tough bug in the aggregating transpose is fixed. One issue is still left
> which is keeping me from completing the KMeans code:
> assigning the "Closest Cluster Index" found to the row keys of the DRM.
> Consider the Matrix of Data points given as follows
>
> {
> 0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
> 1 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
> 2 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
> 3 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
> }
> Here 0, 1, 2 and 3 (the values before each "=>") are the row keys. The
> zeroth column (0) contains the values which will be used to store the
> count of points assigned to each cluster, and columns 1 to 3 contain the
> co-ordinates of the data points.
>
> So now after cluster assignment step of Kmeans algorithm which @Dmitriy has
> Outlined in the beginning of this mail chain,
>
> the above Matrix should look like this(Assuming that the 0th and 1st data
> points are assigned to the cluster with index 0 and 2nd and 3rd data points
> are assigned to cluster with index 1)
>
> {
> 0 => {0:1.0, 1: 1.0, 2: 1.0, 3: 3.0}
> 0 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
> 1 => {0:1.0, 1: 3.0, 2: 4.0, 3: 5.0}
> 1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
> }
>
> To achieve the above-mentioned result I am using the following lines of code
>
> //11. Iterating over the Data Matrix(in DrmLike[Int] format)
> dataDrmX.mapBlock() {
>   case (keys, block) =>
>     for (row <- 0 until block.nrow) {
>       var dataPoint = block(row, ::)
>
>       //12. findTheClosestCentriod finds the closest centriod to the data
>       // point specified by "dataPoint"
>       val closesetIndex = findTheClosestCentriod(dataPoint, centriods)
>
>       //13. assigning closest index to key
>       keys(row) = closesetIndex
>     }
>     keys -> block
> }
>
> But it turns out to be
>
> {
> 0 => {0:1.0, 1: 2.0, 2: 3.0, 3: 4.0}
> 1 => {0:1.0, 1: 4.0, 2: 5.0, 3: 6.0}
> }
>
>
> So is there anything wrong with the syntax of the above code? I am unable
> to find any reference for the way in which I should assign a value to the
> row keys.
>
> @Trevor as per what you have mentioned in the above mail chain
> "Got it- in short no.
>
> Think of the keys like a dictionary or HashMap.
>
> That's why everything is ending up on row 1."
>
> But according to the algorithm outlined by @Dmitriy at the start of the mail
> chain, assigning the same key to multiple rows is possible.
> The same is also mentioned in the book written by Dmitriy and Andrew.
> It is mentioned that rows having the same row key are summed up when we
> take the aggregating transpose.
>
> I am now confused whether it is possible to achieve what I have mentioned
> above, whether it is not possible, or whether it is a bug in the API.
>
>
>
> Thanks & Regards
> Parth
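The two behaviours discussed in the message above (keys acting "like a dictionary or HashMap" versus rows with the same key being summed) can be contrasted with a small sketch in plain Scala collections. This is only an illustration and does not use or describe the Mahout API: the object name ClusterKeySketch, the helper names and the two centroids are hypothetical, chosen so that rows 0 and 1 fall into cluster 0 and rows 2 and 3 into cluster 1, as assumed in the message. Column 0 is treated as the count column and is ignored when measuring distance.

```scala
// Plain-Scala sketch; all names here are illustrative assumptions, not Mahout API.
object ClusterKeySketch {
  type Row = Vector[Double]

  // Data points from the message: column 0 is the count, columns 1..3 are coordinates.
  val rows: Seq[Row] = Seq(
    Vector(1.0, 1.0, 1.0, 3.0),
    Vector(1.0, 2.0, 3.0, 4.0),
    Vector(1.0, 3.0, 4.0, 5.0),
    Vector(1.0, 4.0, 5.0, 6.0)
  )

  // Hypothetical centroids (not from the thread), same column layout as the rows.
  val centroids: Seq[Row] = Seq(
    Vector(1.0, 1.5, 2.0, 3.5),
    Vector(1.0, 3.5, 4.5, 5.5)
  )

  // Squared Euclidean distance over the coordinate columns only (column 0 skipped).
  def sqDist(a: Row, b: Row): Double =
    a.tail.zip(b.tail).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to the given data point.
  def closestCentroid(point: Row, cs: Seq[Row]): Int =
    cs.indices.minBy(i => sqDist(point, cs(i)))

  // Re-key every row with its closest centroid index; duplicate keys are kept.
  def assign(data: Seq[Row], cs: Seq[Row]): Seq[(Int, Row)] =
    data.map(r => closestCentroid(r, cs) -> r)

  // Dictionary-like storage: only the last row seen for each key survives.
  def asMap(keyed: Seq[(Int, Row)]): Map[Int, Row] = keyed.toMap

  // Aggregation by key: rows sharing a key are summed element-wise,
  // so column 0 ends up holding the number of points per cluster.
  def sumByKey(keyed: Seq[(Int, Row)]): Map[Int, Row] =
    keyed.groupBy(_._1).map { case (k, rs) =>
      k -> rs.map(_._2).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    }
}
```

With this sample data, asMap keeps only rows 1 and 3, which is the same shape as the two-row result reported in the message, while sumByKey keeps one summed row per cluster with the point count in column 0. Whether a given Mahout operation behaves like the first or the second function is exactly the question raised in the thread, and the sketch does not settle it.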


Re: New Website is Staged

2017-05-09 Thread Khurrum Nasim
I do like the idea of blending in a video (might be extra work). The site 
needs some pizzazz.

Thanks,

Khurrum.

On May 8, 2017, 7:23 PM -0400, Trevor Grant <trevor.d.gr...@gmail.com>, wrote:
> Khurrum,
>
> Thanks for the feed back, anything more specific?
>
>
>
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things." -Virgil*
>
>
> On Mon, May 8, 2017 at 4:57 PM, Andrew Palumbo <ap@outlook.com> wrote:
>
> > I disagree with it being too bland- I find the open space and the
> > formatting much easier to navigate and read docs from.
> >
> >
> > 
> > From: Khurrum Nasim <khurrum.na...@useitc.com>
> > Sent: Monday, May 8, 2017 2:36:54 PM
> > To: Mahout Dev List; u...@mahout.apache.org; dev@mahout.apache.org
> > Subject: Re: New Website is Staged
> >
> > Too bland looking
> >
> > Thanks,
> >
> > Khurrum.
> >
> > On May 8, 2017, 1:53 PM -0400, Trevor Grant <trevor.d.gr...@gmail.com>,
> > wrote:
> > > Hey all,
> > >
> > > The new website is staged. You can view it here
> > >
> > > http://mahout.staging.apache.org/
> > >
> > > Won't be publishing for a bit yet- there are still a few JIRAs left to do
> > > before its ready, but you can check it out there anyway.
> > >
> > > A couple of admin things:
> > > 1- New developer and community pages are linked from the landing site and
> > > new navbar, the landing page isn't done yet btw (one of the last todos)
> > >
> > > 2- All linkbacks from the old site should continue to work, pages were
> > > maintained however, they have had new skin applied to them.
> > >
> > > 3- The current website is also available in
> > > http://mahout.staging.apache.org/docs/0.13.0/
> > > and will be preserved for posterity.
> > >
> > > 4- new style docs, which I recommend everyone check out are available in
> > > http://mahout.staging.apache.org/docs/0.13.1-SNAPSHOT/
> > >
> > >
> > > We have 6 high level talks coming up in the next 2 weeks and would like
> > > to have the shiny new website fielded if possible, working hard on
> > > getting it ready.
> > >
> > > If you have any updates recommendations, etc, feel free to open a PR (all
> > > website code is contained in master now).
> > >
> > >
> > > Trevor Grant
> > > Data Scientist
> > > https://github.com/rawkintrevo
> > > http://stackexchange.com/users/3002022/rawkintrevo
> > > http://trevorgrant.org
> > >
> > > *"Fortunate is he, who is able to know the causes of things." -Virgil*
> >


Re: Looking for help with a talk

2017-05-28 Thread Khurrum Nasim
Where is the conference?

Sent from my iPhone

> On May 28, 2017, at 2:33 PM, Andrew Palumbo  wrote:
> 
> I won't be attending but would be happy to help any way I can, given the 
> timeline, and my schedule ..  (I have some time constraints over the next 6 - 
> 8 weeks, so probably can't be of much use)
> 
> 
> 
> Sent from my Verizon Wireless 4G LTE smartphone
> 
> 
>  Original message 
> From: Isabel Drost-Fromm 
> Date: 05/27/2017 2:02 PM (GMT-08:00)
> To: dev@mahout.apache.org
> Subject: Looking for help with a talk
> 
> Hi,
> 
> I've been invited to give a machine learning centric keynote at FrOSCon (free 
> and open source conference, the little sister of FOSDEM, roughly 2500 
> attendees of all skill levels) in August this year. Content should be less 
> technical but focus on big picture, implications and some such.
> 
> Would be great to get some help from you.
> 
> Anyone here who has time and interest to help out? Anyone planning to attend 
> the event already?
> 
> Isabel
> 
> --
> This message was sent from my Android phone with K-9 Mail.


Re: New Website is Staged

2017-05-08 Thread Khurrum Nasim
Too bland looking

Thanks,

Khurrum.

On May 8, 2017, 1:53 PM -0400, Trevor Grant , wrote:
> Hey all,
>
> The new website is staged. You can view it here
>
> http://mahout.staging.apache.org/
>
> Won't be publishing for a bit yet- there are still a few JIRAs left to do
> before its ready, but you can check it out there anyway.
>
> A couple of admin things:
> 1- New developer and community pages are linked from the landing site and
> new navbar, the landing page isn't done yet btw (one of the last todos)
>
> 2- All linkbacks from the old site should continue to work, pages were
> maintained however, they have had new skin applied to them.
>
> 3- The current website is also available in
> http://mahout.staging.apache.org/docs/0.13.0/
> and will be preserved for posterity.
>
> 4- new style docs, which I recommend everyone check out are available in
> http://mahout.staging.apache.org/docs/0.13.1-SNAPSHOT/
>
>
> We have 6 high level talks coming up in the next 2 weeks and would like to
> have the shiny new website fielded if possible, working hard on getting
> it ready.
>
> If you have any updates recommendations, etc, feel free to open a PR (all
> website code is contained in master now).
>
>
> Trevor Grant
> Data Scientist
> https://github.com/rawkintrevo
> http://stackexchange.com/users/3002022/rawkintrevo
> http://trevorgrant.org
>
> *"Fortunate is he, who is able to know the causes of things." -Virgil*