Re: [julia-users] Re: Julia and Spark

2015-11-14 Thread Christof Stocker
Personally, I think the most progress is made when one person has a strong 
interest in doing it. I for one have a big interest in using Julia for ML, 
but I am not particularly interested in using Spark from Julia; I just 
don't feel it would be useful to me. In the situations where I do use 
Spark, I don't feel I would gain anything from driving it from Julia. That 
of course doesn't mean it wouldn't be very useful to others, but it does 
mean I am unlikely to spend any of my time on it in the near future. Maybe 
other people are in similar situations.


What I can leave you with is this: I think open source is a place in which 
one person can make all the difference in the world, if he/she sets his/her 
mind to it. So if someone is interested in doing it, go for it. I don't 
think it's too far-fetched to assume that once the functionality is 
available (and reasonably mature), people will gravitate towards it.


[julia-users] Re: Julia and Spark

2015-11-14 Thread Frank
Hi,

I would have expected more interest in a Spark & Julia integration. Is the 
lack of interest due to
a) missing use cases, or
b) the fact that both Spark and Julia are, relatively speaking, very new?

What do you think?
Thanks
Frank


On Wednesday, April 15, 2015 at 11:37:50 AM UTC+2, Tanmay K. Mohapatra 
wrote:
>
> This thread is to discuss Julia - Spark integration further.
>
> This is a continuation of discussions from 
> https://groups.google.com/forum/#!topic/julia-users/LeCnTmOvUbw (the 
> thread topic was misleading and we could not change it).
>
> To summarize briefly, here are a few interesting packages:
> - https://github.com/d9w/Spark.jl
> - https://github.com/jey/Spock.jl
> - https://github.com/benhamner/MachineLearning.jl 
> 
> - packages at https://github.com/JuliaParallel
>
> We can discuss approaches and coordinate efforts towards whichever looks 
> promising.
>


Re: [julia-users] Re: Julia and Spark

2015-11-14 Thread Andrei
The small number of use cases is an important reason. I see many people
interested in a Julia & Spark integration, but almost nobody interested
*enough* to invest time in its development.

Another reason is that Julia's infrastructure (and especially Julia-Java
integration) is not mature enough to support an integration at this level.
Instability of JNI, inconsistencies between Java and Scala, serialization
issues in Julia - these are just a few of the difficulties I faced while
working on Sparta.jl. Many people are doing great work to fix such issues,
but at the moment Julia is far behind, say, Python.

Finally, it's just a huge amount of work. I don't mean basic functionality
like map and reduce operations over a text file, but the whole variety of
supported data formats, DataFrames, subprojects like Spark Streaming and
MLlib, etc. Without these features we are back to point 1 - nobody is
interested enough to invest time when PySpark and SparkR already exist.

All of this makes me think that a similar framework for big data analytics
written in pure Julia could bypass many of these issues and generate more
interest in the Julia community. I wonder if somebody would want to take
part in such a challenge.
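
To make "basic functionality" concrete: a word count over a local text file
needs nothing beyond Julia's built-in parallelism. The sketch below is
illustrative only (the file name and the chunking strategy are assumptions,
not a proposal for such a framework's API); Julia 0.4-era code:

    addprocs(4)                              # spawn 4 local worker processes

    @everywhere function count_words(lines)
        counts = Dict{AbstractString,Int}()
        for line in lines, w in split(line)
            counts[w] = get(counts, w, 0) + 1
        end
        counts
    end

    # Combine two partial word-count dictionaries.
    function merge_counts(a, b)
        for (k, v) in b
            a[k] = get(a, k, 0) + v
        end
        a
    end

    lines  = readlines(open("data.txt"))     # assumed local input file
    chunks = [lines[i:nworkers():end] for i in 1:nworkers()]
    total  = reduce(merge_counts, pmap(count_words, chunks))

The hard part is everything beyond this: data formats, shuffles, fault
tolerance, and the surrounding ecosystem.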




Re: [julia-users] Re: Julia and Spark

2015-11-14 Thread Christof Stocker
A pure Julia ML ecosystem is something that is actively being worked 
on/towards. Naturally, large problem sizes are one big reason people are 
interested in this work. It just takes time to flesh it out into something 
people are used to from languages like Python or R. As I see it, Julia is 
different and offers unique opportunities for connecting research, 
education, and application. It is an interesting journey to figure out how 
to get the most out of what the language has to offer.



Re: [julia-users] Re: Julia and Spark

2015-11-14 Thread Andrei
I wouldn't restrict big data analytics to machine learning - it also 
includes SQL queries and visualization, real-time data enrichment, ETL, 
etc. I don't think the Julia ML ecosystem by itself will ever expand into 
these areas.



Re: [julia-users] Re: Julia and Spark

2015-11-14 Thread Christof Stocker
You are right. My tunnel vision strikes again. That's what happens when 
you work with a hammer all day; all you see is nails :-)




Re: [julia-users] Re: Julia and Spark

2015-11-01 Thread Sisyphuss
http://dl.acm.org/citation.cfm?id=2228301

On Saturday, October 31, 2015 at 5:18:01 PM UTC+1, Jey Kottalam wrote:
>
> Could you please define "streams of RDDs"?

Re: [julia-users] Re: Julia and Spark

2015-11-01 Thread ssarkarayushnetdev
Yes.

On Sunday, November 1, 2015 at 9:34:26 AM UTC-8, Jey Kottalam wrote:
>
> Are you asking about Spark Streaming support?
Re: [julia-users] Re: Julia and Spark

2015-11-01 Thread Jey Kottalam
Are you asking about Spark Streaming support?

On Sun, Nov 1, 2015 at 4:42 AM, Sisyphuss wrote:

> http://dl.acm.org/citation.cfm?id=2228301


[julia-users] Re: Julia and Spark

2015-10-31 Thread ssarkarayushnetdev
Is there any implementation with streams of RDDs for Julia?


Re: [julia-users] Re: Julia and Spark

2015-10-31 Thread Jey Kottalam
Could you please define "streams of RDDs"?

On Sat, Oct 31, 2015 at 12:59 AM, ssarkarayushnetdev wrote:

> Is there any implementation with streams of RDDs for Julia?


[julia-users] Re: Julia and Spark

2015-04-20 Thread wildart
Unfortunately, Spark.jl is an incorrect RDD implementation. Instead of 
creating transformations as independent, lazily evaluated abstract 
operations, the package executes every transformation immediately upon its 
call. This completely undermines the whole purpose of an RDD as a 
fault-tolerant parallel data structure.
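
For contrast, here is a minimal sketch of what lazily recorded
transformations could look like in Julia (all names are illustrative, not
Spark.jl's API; Julia 0.4-era syntax):

    abstract RDDNode

    type SourceRDD <: RDDNode
        data::Vector{Int}
    end

    type MappedRDD <: RDDNode
        parent::RDDNode
        f::Function
    end

    # A transformation only extends the lineage graph; no work happens here.
    rdd_map(r::RDDNode, f::Function) = MappedRDD(r, f)

    # Only an action materializes results. Because the lineage is retained,
    # a lost partition could be recomputed from its parents.
    collect_rdd(r::SourceRDD) = r.data
    collect_rdd(r::MappedRDD) = map(r.f, collect_rdd(r.parent))

    r = rdd_map(rdd_map(SourceRDD([1, 2, 3]), x -> x + 1), x -> x * 10)
    collect_rdd(r)    # => [20, 30, 40]; nothing ran before this call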



[julia-users] Re: Julia and Spark

2015-04-18 Thread Tanmay K. Mohapatra
There was some attempt made towards a pure Julia RDD in Spark.jl 
(https://github.com/d9w/Spark.jl).
We also have DistributedArrays 
(https://github.com/JuliaParallel/DistributedArrays.jl), Blocks 
(https://github.com/JuliaParallel/Blocks.jl) and DataFrames 
(https://github.com/JuliaStats/DataFrames.jl).

I wonder if it is possible to leverage any of these for a pure Julia RDD.
And MachineLearning.jl (https://github.com/benhamner/MachineLearning.jl) or 
something similar could probably be the equivalent of MLlib.
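
DistributedArrays already covers the data-parallel map/reduce part of the
story, though not lineage or fault tolerance. A hedged sketch, assuming
Julia 0.4 with the package installed:

    addprocs(4)
    @everywhere using DistributedArrays

    a = distribute(collect(1:1_000_000))   # partition across the workers
    b = map(x -> x^2, a)                   # applied to each local chunk
    s = sum(convert(Array, b))             # gather, then reduce locally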





[julia-users] Re: Julia and Spark

2015-04-17 Thread wildart
Of course, Spark's data access infrastructure is unbeatable, due to mature 
JVM-based libraries for accessing various data sources and formats (Avro, 
Parquet, HDFS). That includes SQL support as well. But look at the Python 
and R bindings: these are just facades for JVM calls. MLlib is written in 
Scala, the Streaming API as well, and when all this is called from Python 
or R, all data transformations still happen at the JVM level. It would be 
more efficient to write code in Scala than to use any non-JVM bindings. 
Think of the overhead of RPC and data serialization over the huge volumes 
of data that need to be processed and you'll understand why DPark exists. 
As for machine learning libraries on the JVM - good luck. They only work 
because of the large computational resources used, and even that has its 
limits.




[julia-users] Re: Julia and Spark

2015-04-16 Thread Steven Sagaert
Yes, that's a solid approach. For my personal Julia-Java integrations I 
also run the JVM in a separate process.




[julia-users] Re: Julia and Spark

2015-04-16 Thread wildart
However, I wonder: how hard would it be to implement an RDD in Julia? It 
looks straightforward from the RDD paper 
(https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf) how to 
implement it. It is a robust abstraction that can be used in any parallel 
computation.
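
The paper describes each RDD through a small common interface: a list of
partitions, dependencies on parent RDDs, a function to compute one
partition, plus optional partitioning and placement metadata. A hedged
Julia sketch of that interface (all names illustrative; 0.4-era syntax):

    abstract AbstractRDD

    immutable Partition
        index::Int
    end

    # A trivial in-memory source RDD, to make the interface concrete.
    immutable VectorRDD <: AbstractRDD
        chunks::Vector{Vector{Int}}
    end

    partitions(r::VectorRDD)   = [Partition(i) for i in 1:length(r.chunks)]
    dependencies(r::VectorRDD) = AbstractRDD[]   # a source has no parents
    compute(r::VectorRDD, p::Partition) = r.chunks[p.index]
    preferred_locations(r::VectorRDD, p::Partition) = []   # no hints

Derived RDDs (map, filter, join, ...) would wrap a parent and implement
compute() in terms of the parent's compute(); recovering a lost partition
then means re-running compute() for that partition only.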




[julia-users] Re: Julia and Spark

2015-04-16 Thread Andrei Zh
Julia bindings for Spark would provide much more than just RDDs; they 
would give us access to multiple big data components for streaming, machine 
learning, SQL capabilities, and much more.




[julia-users] Re: Julia and Spark

2015-04-15 Thread Steven Sagaert
I've been contemplating writing a high-level wrapper for Spark myself, 
since I'm interested in both Julia & Spark, but I was waiting for Julia 0.4 
to finalize before even starting.
One can do the integration on several levels:
1) Simply wrap the Spark Java API via JavaCall. This is the low-level 
approach. BTW, I've experimented with JavaCall and found it was unstable & 
also lacking functionality (e.g. there's no way to shut down the JVM or 
create a pool of JVMs analogous to DB connections), so that might need some 
work before trying the Spark integration.
2) Spark 1.3 now has new, high-level interfaces: a DataFrame API for 
accessing data in the form of distributed dataframes & a pipeline API for 
composing algorithms via a pipeline framework. By wrapping the Spark 
DataFrame with a Julia DataFrame you would quickly have a high-level (data 
scientist level) interface to Spark; a sketch of what that could look like 
follows below. BTW, Spark DataFrames are actually also FASTER than the 
lower-level approaches like Java/Scala method calls or Spark SQL 
(intermediate level), because Spark itself can do more optimizations (this 
is similar to how PyData Blaze works). By wrapping the pipeline API one 
could quickly compose Spark algorithms to create new ones.
3) For an intermediate approach: wrap the Spark SQL API and use SQL to 
query the system.

Personally, I would start with the DataFrame & pipeline APIs, maybe add 
the Spark SQL API later if needed, and only do the low-level stuff last, 
if at all. But before interfacing Spark dataframes with Julia ones, the 
Julia DataFrame should become more powerful: at least && and || should be 
allowed in indexing, for richer querying like in R data frames.
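
Here is a hedged sketch of the kind of facade meant in 2). Every name in it
(SparkContext, read_parquet, the query functions, and the JVM bridge behind
them) is hypothetical; nothing like this exists yet:

    sc = SparkContext("local[4]")            # handle to a JVM-side context
    df = read_parquet(sc, "hdfs:///events.parquet")

    # Each call would only build up a JVM-side logical plan, so Spark's
    # optimizer still sees the whole query before anything runs.
    de = filter(df, :(country == "DE"))      # predicate as a quoted expression
    by = groupby(de, :user_id)
    ag = agg(by, :amount => :sum)

    jdf = collect(ag)   # ship the (small) result back as a Julia DataFrame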




[julia-users] Re: Julia and Spark

2015-04-15 Thread wildart


 1) simply wrap the Spark java API via JavaCall. This is the low level 
 approach. BTW I've experimented with JavaCall and found it was unstable & 
 also lacking functionality (e.g. there's no way to shutdown the jvm or 
 create a pool of JVMs analogous to DB connections) so that might need 
 some work before trying the Spark integration.


Using JavaCall is not an option, especially now that the JVM has become 
closed-source; see https://github.com/aviks/JavaCall.jl/issues/7.

The Python bindings are done through Py4J, which is RPC to the JVM. If you 
look at SparkR (https://github.com/apache/spark/tree/master/R), it is done 
in the same way: SparkR uses an RPC interface to communicate with a 
Netty-based Spark JVM backend that translates R calls into JVM calls, keeps 
the SparkContext on the JVM side, and ships serialized data to/from R.

So it is just a matter of writing a Julia RPC layer to the JVM and 
wrapping the necessary Spark methods in a Julia-friendly way.
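
A hedged sketch of the Julia half of such a bridge, in the spirit of
SparkR's Netty backend. The protocol (JSON lines on port 9090) and the
backend method names are assumptions; only the Base networking calls and
the JSON.jl package are real:

    using JSON

    # The JVM backend is assumed to be already running and listening.
    sock = connect("localhost", 9090)

    # Encode one remote method call as a JSON line; block for the reply.
    function jvm_call(sock, method::AbstractString, args...)
        println(sock, JSON.json(Dict("method" => method,
                                     "args"   => collect(args))))
        JSON.parse(readline(sock))
    end

    # e.g. create a context on the JVM side, then count lines of a file:
    jvm_call(sock, "createSparkContext", "local[2]")
    n = jvm_call(sock, "textFileCount", "hdfs:///input.txt")

The data plane - how RDD contents get serialized and shipped - is where
the real work would be, just as it is in SparkR.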