Re: Research ideas using spark

2015-07-16 Thread Michael Segel
Ok… 

After some off-line exchanges with Shashidhar Rao, I came up with an idea… 

Apply machine learning to either implement or improve autoscaling up or down 
within a Storm/Akka cluster. 

While I don’t know what constitutes an acceptable PhD thesis or senior project 
for undergrads… this is a real-life problem that actually has some real value. 

First, Storm doesn’t scale down. Unless there have been some improvements in the 
last year, you really can’t easily scale down the number of workers and 
transfer state to another worker. 
Looking at Akka, that would be an easier task because of the actor model. 
However, I don’t know Akka that well, so I can’t say if this is already 
implemented. 

So besides the mechanism to scale (up and down), you then have the machine 
learning question: how do you model load and decide when to scale? 
This could be as simple as a PID controller that watches the queues between 
spout/bolt and bolt/bolt, or something more advanced. This is where the 
research part of the project comes in. (What do you monitor, and how do you 
determine when to scale up or down, weighing in the cost(s) of the scaling 
action itself?) 
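
As a rough sketch of the simplest version of that idea (plain Scala, no Storm or 
Akka APIs; every name, gain, queue sample and threshold below is made up, so 
treat it as an illustration rather than a design):

object AutoscalePidSketch {

  // Simple PID controller over one scalar metric (e.g. pending tuples in a queue).
  final class Pid(kp: Double, ki: Double, kd: Double, setPoint: Double) {
    private var integral  = 0.0
    private var lastError = 0.0

    // Positive output suggests "scale up", negative suggests "scale down".
    def update(measured: Double, dtSeconds: Double): Double = {
      val error      = measured - setPoint
      integral      += error * dtSeconds
      val derivative = (error - lastError) / dtSeconds
      lastError      = error
      kp * error + ki * integral + kd * derivative
    }
  }

  def main(args: Array[String]): Unit = {
    // Target: keep roughly 1000 pending tuples between spout and bolt.
    val pid = new Pid(kp = 0.002, ki = 0.0005, kd = 0.001, setPoint = 1000.0)

    // Fake queue-depth samples standing in for real Storm/Akka metrics.
    val samples = Seq(1200.0, 1800.0, 2500.0, 2100.0, 900.0, 400.0)
    var workers = 4
    for (depth <- samples) {
      val signal = pid.update(depth, dtSeconds = 10.0)
      // Hysteresis: only act on a clear signal, to weigh in the cost of scaling.
      if (signal > 1.0) workers += 1
      else if (signal < -1.0 && workers > 1) workers -= 1
      println(f"queueDepth=$depth%.0f signal=$signal%+.2f workers=$workers")
    }
  }
}

Everything this toy skips - sampling the real spout/bolt queues, migrating worker 
state, and pricing the scaling action itself - is exactly where the research is.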

Again, it’s a worthwhile project, something that actually has business value, 
especially in terms of Lambda and the other groovy Greek-lettered names for 
cluster designs (Zeta? ;-) ), where you have both M/R (computational) and 
near-real-time (including micro-batch) workloads occurring either on the same 
cluster or within the same DC infrastructure. 


Again, I don’t know if this is worthy of a PhD thesis, Master’s thesis, or senior 
project, but it is something one could sink one’s teeth into, and it could 
potentially lead to a commercial-grade project if done properly. 

Good luck with it.

HTH 

-Mike




 On Jul 15, 2015, at 12:40 PM, vaquar khan vaquar.k...@gmail.com wrote:
 
 I would suggest study spark ,flink,strom and based on your understanding and 
 finding prepare your research paper.
 
 May be you will invented new spark ☺
 
 Regards, 
 Vaquar khan
 
 On 16 Jul 2015 00:47, Michael Segel msegel_had...@hotmail.com wrote:
 Silly question… 
 
 When thinking about a PhD thesis… do you want to tie it to a specific 
 technology or do you want to investigate an idea but then use a specific 
 technology. 
 Or is this an outdated way of thinking? 
 
 I am doing my PHD thesis on large scale machine learning e.g  Online 
 learning, batch and mini batch learning.”
 
 So before we look at technologies like Spark… could the OP break down a more 
 specific concept or idea that he wants to pursue? 
 
 Looking at what Jorn said… 
 
 Using machine learning to better predict workloads in terms of managing 
 clusters… This could be interesting… but is it enough for a PhD thesis, or of 
 interest to the OP? 
 
 
 On Jul 15, 2015, at 9:43 AM, Jörn Franke jornfra...@gmail.com wrote:
 
 Well one of the strength of spark is standardized general distributed 
 processing allowing many different types of processing, such as graph 
 processing, stream processing etc. The limitation is that it is less 
 performant than one system focusing only on one type of processing (eg graph 
 processing). I miss - and this may not be spark specific - some artificial 
 intelligence to manage a cluster, e.g. Predicting workloads, how long a job 
 may run based on previously executed similar jobs etc. Furthermore, many 
 optimizations you have do to manually, e.g. Bloom filters, partitioning etc 
 - if you find here as well some intelligence that does this automatically 
 based on previously executed jobs taking into account that optimizations 
 themselves change over time would be great... You may also explore feature 
 interaction
 
 On Tue, 14 Jul 2015 at 7:19, Shashidhar Rao raoshashidhar...@gmail.com wrote:
 Hi,
 
 I am doing my PHD thesis on large scale machine learning e.g  Online 
 learning, batch and mini batch learning.
 
 Could somebody help me with ideas especially in the context of Spark and to 
 the above learning methods. 
 
 Some ideas like improvement to existing algorithms, implementing new 
 features especially the above learning methods and algorithms that have not 
 been implemented etc.
 
 If somebody could help me with some ideas it would really accelerate my work.
 
 Plus few ideas on research papers regarding Spark or Mahout.
 
 Thanks in advance.
 
 Regards 
 
 



Re: Research ideas using spark

2015-07-15 Thread Vineel Yalamarthy
Hi Daniel

Well said

Regards
Vineel

On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos 
daniel.dara...@lynxanalytics.com wrote:

 Hi Shahid,
 To be honest I think this question is better suited for Stack Overflow
 than for a PhD thesis.

 On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf sha...@trialx.com wrote:

 hi

 I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
 partitions i get is 9. I am running a spark application , it gets stuck on
 one of tasks, looking at the UI it seems application is not using all nodes
 to do calculations. attached is the screen shot of tasks, it seems tasks
 are put on each node more then once. looking at tasks 8 tasks get completed
 under 7-8 minutes and one task takes around 30 minutes so causing the delay
 in results.


 On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Hi,

 I am doing my PHD thesis on large scale machine learning e.g  Online
 learning, batch and mini batch learning.

 Could somebody help me with ideas especially in the context of Spark and
 to the above learning methods.

 Some ideas like improvement to existing algorithms, implementing new
 features especially the above learning methods and algorithms that have not
 been implemented etc.

 If somebody could help me with some ideas it would really accelerate my
 work.

 Plus few ideas on research papers regarding Spark or Mahout.

 Thanks in advance.

 Regards




 --
 with Regards
 Shahid Ashraf







Re: Research ideas using spark

2015-07-15 Thread Akhil Das
Try repartitioning to a higher number of partitions (at least 3-4 times the total
number of CPU cores). What operation are you doing? If you are doing a
join/groupBy-style operation, the task that takes so long may be receiving all
the values for a skewed key; in that case you need a Partitioner that
distributes the keys evenly across machines to speed things up.
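
For what it's worth, a minimal sketch of that advice in Scala - the object and
partitioner names, the skewed sample data, the salt factor, the partition count
and the local master are all illustrative, so only the shape should be taken
literally:

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

object RepartitionSketch {

  // A trivial hash-modulo Partitioner, shown only to mark where a smarter,
  // skew-aware partitioner would plug in.
  class EvenPartitioner(override val numPartitions: Int) extends Partitioner {
    override def getPartition(key: Any): Int = {
      val h = key.hashCode % numPartitions
      if (h < 0) h + numPartitions else h
    }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("repartition-sketch").setMaster("local[4]"))

    // Fake skewed data: one hot key plus many small ones.
    val raw = sc.parallelize(
      Seq.fill(100000)(("hotKey", 1)) ++ (1 to 1000).map(i => (s"key$i", 1)))

    // 1) Use roughly 3-4x the number of cores as the partition count.
    val numParts = sc.defaultParallelism * 4

    // 2) Salt the key so the hot key's values spread over many tasks, aggregate
    //    per salted key, then strip the salt and aggregate again.
    val salted  = raw.map { case (k, v) => (s"$k#${scala.util.Random.nextInt(numParts)}", v) }
    val partial = salted.reduceByKey(new EvenPartitioner(numParts), _ + _)
    val totals  = partial
      .map { case (k, v) => (k.substring(0, k.lastIndexOf('#')), v) }
      .reduceByKey(_ + _)

    totals.take(5).foreach(println)
    sc.stop()
  }
}

The two-step aggregation is what keeps a single task from holding every value of
the hot key; the custom Partitioner here does nothing HashPartitioner wouldn't,
it just shows where a non-trivial one would go.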

Thanks
Best Regards

On Tue, Jul 14, 2015 at 11:12 AM, shahid ashraf sha...@trialx.com wrote:

 hi

 I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
 partitions i get is 9. I am running a spark application , it gets stuck on
 one of tasks, looking at the UI it seems application is not using all nodes
 to do calculations. attached is the screen shot of tasks, it seems tasks
 are put on each node more then once. looking at tasks 8 tasks get completed
 under 7-8 minutes and one task takes around 30 minutes so causing the delay
 in results.


 On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Hi,

 I am doing my PHD thesis on large scale machine learning e.g  Online
 learning, batch and mini batch learning.

 Could somebody help me with ideas especially in the context of Spark and
 to the above learning methods.

 Some ideas like improvement to existing algorithms, implementing new
 features especially the above learning methods and algorithms that have not
 been implemented etc.

 If somebody could help me with some ideas it would really accelerate my
 work.

 Plus few ideas on research papers regarding Spark or Mahout.

 Thanks in advance.

 Regards




 --
 with Regards
 Shahid Ashraf





Re: Research ideas using spark

2015-07-15 Thread shahid ashraf
Sorry, guys!

I mistakenly added my question to this thread (Research ideas using spark).
In any case, people can ask any question here; this Spark user group is for that.

Cheers!


On Wed, Jul 15, 2015 at 9:43 PM, Robin East robin.e...@xense.co.uk wrote:

 Well said Will. I would add that you might want to investigate GraphChi
 which claims to be able to run a number of large-scale graph processing
 tasks on a workstation much quicker than a very large Hadoop cluster. It
 would be interesting to know how widely applicable the approach GraphChi
 takes and what implications it has for parallel/distributed computing
 approaches. A rich seam to mine indeed.

 Robin

 On 15 Jul 2015, at 14:48, William Temperley willtemper...@gmail.com
 wrote:

 There seems to be a bit of confusion here - the OP (doing the PhD) had the
 thread hijacked by someone with a similar name asking a mundane question.

 It would be a shame to send someone away so rudely, who may do valuable
 work on Spark.

 Sashidar (not Sashid!) I'm personally interested in running graph
 algorithms for image segmentation using MLib and Spark.  I've got many
 questions though - like is it even going to give me a speed-up?  (
 http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)

 It's not obvious to me which classes of graph algorithms can be
 implemented correctly and efficiently in a highly parallel manner.  There's
 tons of work to be done here, I'm sure. Also, look at parallel geospatial
 algorithms - there's a lot of work being done on this.

 Best, Will



 On 15 July 2015 at 09:01, Vineel Yalamarthy vineelyalamar...@gmail.com
 wrote:

 Hi Daniel

 Well said

 Regards
 Vineel

 On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos 
 daniel.dara...@lynxanalytics.com wrote:

 Hi Shahid,
 To be honest I think this question is better suited for Stack Overflow
 than for a PhD thesis.

 On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf sha...@trialx.com
 wrote:

 hi

 I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
 partitions i get is 9. I am running a spark application , it gets stuck on
 one of tasks, looking at the UI it seems application is not using all nodes
 to do calculations. attached is the screen shot of tasks, it seems tasks
 are put on each node more then once. looking at tasks 8 tasks get completed
 under 7-8 minutes and one task takes around 30 minutes so causing the delay
 in results.


 On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Hi,

 I am doing my PHD thesis on large scale machine learning e.g  Online
 learning, batch and mini batch learning.

 Could somebody help me with ideas especially in the context of Spark
 and to the above learning methods.

 Some ideas like improvement to existing algorithms, implementing new
 features especially the above learning methods and algorithms that have 
 not
 been implemented etc.

 If somebody could help me with some ideas it would really accelerate
 my work.

 Plus few ideas on research papers regarding Spark or Mahout.

 Thanks in advance.

 Regards




 --
 with Regards
 Shahid Ashraf









-- 
with Regards
Shahid Ashraf


Re: Research ideas using spark

2015-07-15 Thread Ravindra
Look at this:
http://www.forbes.com/sites/lisabrownlee/2015/07/10/the-11-trillion-internet-of-things-big-data-and-pattern-of-life-pol-analytics/

On Wed, Jul 15, 2015 at 10:19 PM shahid ashraf sha...@trialx.com wrote:

 Sorry Guys!

 I mistakenly added my question to this thread( Research ideas using
 spark). Moreover people can ask any question , this spark user group is for
 that.

 Cheers!
 

 On Wed, Jul 15, 2015 at 9:43 PM, Robin East robin.e...@xense.co.uk
 wrote:

 Well said Will. I would add that you might want to investigate GraphChi
 which claims to be able to run a number of large-scale graph processing
 tasks on a workstation much quicker than a very large Hadoop cluster. It
 would be interesting to know how widely applicable the approach GraphChi
 takes and what implications it has for parallel/distributed computing
 approaches. A rich seam to mine indeed.

 Robin

 On 15 Jul 2015, at 14:48, William Temperley willtemper...@gmail.com
 wrote:

 There seems to be a bit of confusion here - the OP (doing the PhD) had
 the thread hijacked by someone with a similar name asking a mundane
 question.

 It would be a shame to send someone away so rudely, who may do valuable
 work on Spark.

 Sashidar (not Sashid!) I'm personally interested in running graph
 algorithms for image segmentation using MLib and Spark.  I've got many
 questions though - like is it even going to give me a speed-up?  (
 http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)

 It's not obvious to me which classes of graph algorithms can be
 implemented correctly and efficiently in a highly parallel manner.  There's
 tons of work to be done here, I'm sure. Also, look at parallel geospatial
 algorithms - there's a lot of work being done on this.

 Best, Will



 On 15 July 2015 at 09:01, Vineel Yalamarthy vineelyalamar...@gmail.com
 wrote:

 Hi Daniel

 Well said

 Regards
 Vineel

 On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos 
 daniel.dara...@lynxanalytics.com wrote:

 Hi Shahid,
 To be honest I think this question is better suited for Stack Overflow
 than for a PhD thesis.

 On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf sha...@trialx.com
 wrote:

 hi

 I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
 partitions i get is 9. I am running a spark application , it gets stuck on
 one of tasks, looking at the UI it seems application is not using all 
 nodes
 to do calculations. attached is the screen shot of tasks, it seems tasks
 are put on each node more then once. looking at tasks 8 tasks get 
 completed
 under 7-8 minutes and one task takes around 30 minutes so causing the 
 delay
 in results.


 On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Hi,

 I am doing my PHD thesis on large scale machine learning e.g  Online
 learning, batch and mini batch learning.

 Could somebody help me with ideas especially in the context of Spark
 and to the above learning methods.

 Some ideas like improvement to existing algorithms, implementing new
 features especially the above learning methods and algorithms that have 
 not
 been implemented etc.

 If somebody could help me with some ideas it would really accelerate
 my work.

 Plus few ideas on research papers regarding Spark or Mahout.

 Thanks in advance.

 Regards




 --
 with Regards
 Shahid Ashraf









 --
 with Regards
 Shahid Ashraf



Re: Research ideas using spark

2015-07-15 Thread Michael Segel
Silly question… 

When thinking about a PhD thesis… do you want to tie it to a specific 
technology, or do you want to investigate an idea and then use a specific 
technology? 
Or is this an outdated way of thinking? 

“I am doing my PHD thesis on large scale machine learning e.g  Online learning, 
batch and mini batch learning.”

So before we look at technologies like Spark… could the OP break down a more 
specific concept or idea that he wants to pursue? 

Looking at what Jörn said… 

Using machine learning to better predict workloads in terms of managing 
clusters… This could be interesting… but is it enough for a PhD thesis, or of 
interest to the OP? 


 On Jul 15, 2015, at 9:43 AM, Jörn Franke jornfra...@gmail.com wrote:
 
 Well one of the strength of spark is standardized general distributed 
 processing allowing many different types of processing, such as graph 
 processing, stream processing etc. The limitation is that it is less 
 performant than one system focusing only on one type of processing (eg graph 
 processing). I miss - and this may not be spark specific - some artificial 
 intelligence to manage a cluster, e.g. Predicting workloads, how long a job 
 may run based on previously executed similar jobs etc. Furthermore, many 
 optimizations you have do to manually, e.g. Bloom filters, partitioning etc - 
 if you find here as well some intelligence that does this automatically based 
 on previously executed jobs taking into account that optimizations themselves 
 change over time would be great... You may also explore feature interaction
 
 On Tue, 14 Jul 2015 at 7:19, Shashidhar Rao raoshashidhar...@gmail.com wrote:
 Hi,
 
 I am doing my PHD thesis on large scale machine learning e.g  Online 
 learning, batch and mini batch learning.
 
 Could somebody help me with ideas especially in the context of Spark and to 
 the above learning methods. 
 
 Some ideas like improvement to existing algorithms, implementing new features 
 especially the above learning methods and algorithms that have not been 
 implemented etc.
 
 If somebody could help me with some ideas it would really accelerate my work.
 
 Plus few ideas on research papers regarding Spark or Mahout.
 
 Thanks in advance.
 
 Regards 




Re: Research ideas using spark

2015-07-15 Thread vaquar khan
I would suggest studying Spark, Flink, and Storm and, based on your
understanding and findings, preparing your research paper.

Maybe you will invent a new Spark ☺

Regards,
Vaquar khan
On 16 Jul 2015 00:47, Michael Segel msegel_had...@hotmail.com wrote:

 Silly question…

 When thinking about a PhD thesis… do you want to tie it to a specific
 technology or do you want to investigate an idea but then use a specific
 technology.
 Or is this an outdated way of thinking?

 I am doing my PHD thesis on large scale machine learning e.g  Online
 learning, batch and mini batch learning.”

 So before we look at technologies like Spark… could the OP break down a
 more specific concept or idea that he wants to pursue?

 Looking at what Jorn said…

 Using machine learning to better predict workloads in terms of managing
 clusters… This could be interesting… but is it enough for a PhD thesis, or
 of interest to the OP?


 On Jul 15, 2015, at 9:43 AM, Jörn Franke jornfra...@gmail.com wrote:

 Well one of the strength of spark is standardized general distributed
 processing allowing many different types of processing, such as graph
 processing, stream processing etc. The limitation is that it is less
 performant than one system focusing only on one type of processing (eg
 graph processing). I miss - and this may not be spark specific - some
 artificial intelligence to manage a cluster, e.g. Predicting workloads, how
 long a job may run based on previously executed similar jobs etc.
 Furthermore, many optimizations you have do to manually, e.g. Bloom
 filters, partitioning etc - if you find here as well some intelligence that
 does this automatically based on previously executed jobs taking into
 account that optimizations themselves change over time would be great...
 You may also explore feature interaction

 On Tue, 14 Jul 2015 at 7:19, Shashidhar Rao raoshashidhar...@gmail.com wrote:

 Hi,

 I am doing my PHD thesis on large scale machine learning e.g  Online
 learning, batch and mini batch learning.

 Could somebody help me with ideas especially in the context of Spark and
 to the above learning methods.

 Some ideas like improvement to existing algorithms, implementing new
 features especially the above learning methods and algorithms that have not
 been implemented etc.

 If somebody could help me with some ideas it would really accelerate my
 work.

 Plus few ideas on research papers regarding Spark or Mahout.

 Thanks in advance.

 Regards






Re: Research ideas using spark

2015-07-15 Thread Jörn Franke
Well, one of the strengths of Spark is standardized, general distributed
processing that allows many different types of processing, such as graph
processing, stream processing, etc. The limitation is that it is less
performant than a system focusing on only one type of processing (e.g. graph
processing). I miss - and this may not be Spark-specific - some artificial
intelligence to manage a cluster, e.g. predicting workloads, or how long a job
may run based on previously executed similar jobs. Furthermore, many
optimizations you have to do manually, e.g. Bloom filters, partitioning, etc. -
if you could find some intelligence here as well that does this automatically,
based on previously executed jobs and taking into account that the
optimizations themselves change over time, that would be great... You may also
explore feature interaction
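
To make the job-runtime part concrete, here is a deliberately minimal sketch in
plain Scala. The history below is entirely made up, and a real version would
read the Spark event logs / history server and use many more features than
input size; this only shows the basic "fit runtime against past runs" step.

object JobRuntimePredictionSketch {

  // (inputGigabytes, runtimeMinutes) pairs from past runs of a similar job.
  val history: Seq[(Double, Double)] = Seq(
    (10.0, 4.2), (25.0, 9.8), (50.0, 19.5), (80.0, 31.0), (120.0, 47.3))

  // Ordinary least squares for y = a + b * x.
  def fit(points: Seq[(Double, Double)]): (Double, Double) = {
    val n     = points.size.toDouble
    val sumX  = points.map(_._1).sum
    val sumY  = points.map(_._2).sum
    val sumXY = points.map { case (x, y) => x * y }.sum
    val sumXX = points.map { case (x, _) => x * x }.sum
    val b = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX)
    val a = (sumY - b * sumX) / n
    (a, b)
  }

  def main(args: Array[String]): Unit = {
    val (a, b) = fit(history)
    val newInputGb = 200.0
    println(f"predicted runtime for $newInputGb%.0f GB: ${a + b * newInputGb}%.1f minutes")
  }
}

The hard and interesting part is everything this toy skips: choosing features,
deciding what counts as a "similar job", and re-fitting as the optimizations
themselves change over time.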

On Tue, 14 Jul 2015 at 7:19, Shashidhar Rao raoshashidhar...@gmail.com wrote:

 Hi,

 I am doing my PHD thesis on large scale machine learning e.g  Online
 learning, batch and mini batch learning.

 Could somebody help me with ideas especially in the context of Spark and
 to the above learning methods.

 Some ideas like improvement to existing algorithms, implementing new
 features especially the above learning methods and algorithms that have not
 been implemented etc.

 If somebody could help me with some ideas it would really accelerate my
 work.

 Plus few ideas on research papers regarding Spark or Mahout.

 Thanks in advance.

 Regards



Re: Research ideas using spark

2015-07-15 Thread Robin East
Well said, Will. I would add that you might want to investigate GraphChi, which 
claims to be able to run a number of large-scale graph processing tasks on a 
single workstation much quicker than a very large Hadoop cluster. It would be 
interesting to know how widely applicable GraphChi's approach is and what 
implications it has for parallel/distributed computing approaches. A rich seam 
to mine indeed.

Robin
 On 15 Jul 2015, at 14:48, William Temperley willtemper...@gmail.com wrote:
 
 There seems to be a bit of confusion here - the OP (doing the PhD) had the 
 thread hijacked by someone with a similar name asking a mundane question.
 
 It would be a shame to send someone away so rudely, who may do valuable work 
 on Spark.
 
 Sashidar (not Sashid!) I'm personally interested in running graph algorithms 
 for image segmentation using MLib and Spark.  I've got many questions though 
 - like is it even going to give me a speed-up?  
 (http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)
 
 It's not obvious to me which classes of graph algorithms can be implemented 
 correctly and efficiently in a highly parallel manner.  There's tons of work 
 to be done here, I'm sure. Also, look at parallel geospatial algorithms - 
 there's a lot of work being done on this.
 
 Best, Will
 
 
 
 On 15 July 2015 at 09:01, Vineel Yalamarthy vineelyalamar...@gmail.com wrote:
 Hi Daniel
 
 Well said
 
 Regards 
 Vineel
 
 
 On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos daniel.dara...@lynxanalytics.com wrote:
 Hi Shahid,
 To be honest I think this question is better suited for Stack Overflow than 
 for a PhD thesis.
 
 On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf sha...@trialx.com wrote:
 hi 
 
 I have a 10 node cluster  i loaded the data onto hdfs, so the no. of 
 partitions i get is 9. I am running a spark application , it gets stuck on 
 one of tasks, looking at the UI it seems application is not using all nodes 
 to do calculations. attached is the screen shot of tasks, it seems tasks are 
 put on each node more then once. looking at tasks 8 tasks get completed under 
 7-8 minutes and one task takes around 30 minutes so causing the delay in 
 results. 
 
 
 On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao raoshashidhar...@gmail.com wrote:
 Hi,
 
 I am doing my PHD thesis on large scale machine learning e.g  Online 
 learning, batch and mini batch learning.
 
 Could somebody help me with ideas especially in the context of Spark and to 
 the above learning methods. 
 
 Some ideas like improvement to existing algorithms, implementing new features 
 especially the above learning methods and algorithms that have not been 
 implemented etc.
 
 If somebody could help me with some ideas it would really accelerate my work.
 
 Plus few ideas on research papers regarding Spark or Mahout.
 
 Thanks in advance.
 
 Regards 
 
 
 
 -- 
 with Regards
 Shahid Ashraf
 
 
 
 



Re: Research ideas using spark

2015-07-15 Thread William Temperley
There seems to be a bit of confusion here - the OP (doing the PhD) had the
thread hijacked by someone with a similar name asking a mundane question.

It would be a shame to send someone away so rudely, who may do valuable
work on Spark.

Shashidhar (not Shahid!), I'm personally interested in running graph
algorithms for image segmentation using MLlib and Spark.  I've got many
questions though - like, is it even going to give me a speed-up?  (
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)

It's not obvious to me which classes of graph algorithms can be implemented
correctly and efficiently in a highly parallel manner.  There's tons of
work to be done here, I'm sure. Also, look at parallel geospatial
algorithms - there's a lot of work being done on this.
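
To make the segmentation idea concrete, here is a hedged toy sketch using GraphX
connected components: pixels as vertices, edges between neighbouring pixels of
similar intensity. The 4x4 "image", the threshold and all names are made up, and
whether this ever beats a single-threaded implementation at realistic image
sizes is exactly the COST question linked above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object SegmentationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("segmentation-sketch").setMaster("local[2]"))

    // A tiny fake grayscale image, row-major: two bright and two dark regions.
    val width = 4
    val intensities = Array(
      10, 12, 200, 205,
      11, 13, 198, 210,
      90, 92,  95,  94,
      88, 91,  93,  96)

    // Vertex id = pixel index; vertex attribute = intensity.
    val vertices = sc.parallelize(
      intensities.zipWithIndex.map { case (v, i) => (i.toLong, v) })

    // 4-connectivity edges between pixels whose intensities differ by < threshold.
    val threshold = 20
    val edges = sc.parallelize(for {
      i <- intensities.indices
      j <- Seq(i + 1, i + width)
      if j < intensities.length
      if !(j == i + 1 && j % width == 0)   // don't connect across row boundaries
      if math.abs(intensities(i) - intensities(j)) < threshold
    } yield Edge(i.toLong, j.toLong, 1))

    // Each connected component is one segment.
    val segments = Graph(vertices, edges).connectedComponents().vertices
    segments.collect().sortBy(_._1).foreach { case (pixel, segment) =>
      println(s"pixel $pixel -> segment $segment")
    }
    sc.stop()
  }
}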

Best, Will



On 15 July 2015 at 09:01, Vineel Yalamarthy vineelyalamar...@gmail.com
wrote:

 Hi Daniel

 Well said

 Regards
 Vineel

 On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos 
 daniel.dara...@lynxanalytics.com wrote:

 Hi Shahid,
 To be honest I think this question is better suited for Stack Overflow
 than for a PhD thesis.

 On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf sha...@trialx.com wrote:

 hi

 I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
 partitions i get is 9. I am running a spark application , it gets stuck on
 one of tasks, looking at the UI it seems application is not using all nodes
 to do calculations. attached is the screen shot of tasks, it seems tasks
 are put on each node more then once. looking at tasks 8 tasks get completed
 under 7-8 minutes and one task takes around 30 minutes so causing the delay
 in results.


 On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Hi,

 I am doing my PHD thesis on large scale machine learning e.g  Online
 learning, batch and mini batch learning.

 Could somebody help me with ideas especially in the context of Spark
 and to the above learning methods.

 Some ideas like improvement to existing algorithms, implementing new
 features especially the above learning methods and algorithms that have not
 been implemented etc.

 If somebody could help me with some ideas it would really accelerate my
 work.

 Plus few ideas on research papers regarding Spark or Mahout.

 Thanks in advance.

 Regards




 --
 with Regards
 Shahid Ashraf







Re: Research ideas using spark

2015-07-14 Thread Daniel Darabos
Hi Shahid,
To be honest I think this question is better suited for Stack Overflow than
for a PhD thesis.

On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf sha...@trialx.com wrote:

 hi

 I have a 10 node cluster  i loaded the data onto hdfs, so the no. of
 partitions i get is 9. I am running a spark application , it gets stuck on
 one of tasks, looking at the UI it seems application is not using all nodes
 to do calculations. attached is the screen shot of tasks, it seems tasks
 are put on each node more then once. looking at tasks 8 tasks get completed
 under 7-8 minutes and one task takes around 30 minutes so causing the delay
 in results.


 On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Hi,

 I am doing my PHD thesis on large scale machine learning e.g  Online
 learning, batch and mini batch learning.

 Could somebody help me with ideas especially in the context of Spark and
 to the above learning methods.

 Some ideas like improvement to existing algorithms, implementing new
 features especially the above learning methods and algorithms that have not
 been implemented etc.

 If somebody could help me with some ideas it would really accelerate my
 work.

 Plus few ideas on research papers regarding Spark or Mahout.

 Thanks in advance.

 Regards




 --
 with Regards
 Shahid Ashraf





Research ideas using spark

2015-07-13 Thread Shashidhar Rao
Hi,

I am doing my PhD thesis on large-scale machine learning, e.g. online
learning, batch and mini-batch learning.

Could somebody help me with ideas, especially in the context of Spark and the
above learning methods?

Some ideas: improvements to existing algorithms, implementing new features
(especially for the above learning methods), algorithms that have not been
implemented yet, etc.

If somebody could help me with some ideas it would really accelerate my
work.

A few pointers to research papers regarding Spark or Mahout would also help.

Thanks in advance.

Regards