Re: deep learning with heterogeneous cloud computing using spark

2016-01-30 Thread Christopher Nguyen
Thanks Nick :)


Abid, you may also want to check out
http://conferences.oreilly.com/strata/big-data-conference-ny-2015/public/schedule/detail/43484,
which describes our work on a combination of Spark and Tachyon for Deep
Learning. We found significant gains in using Tachyon (with co-processing)
for the "descent" step while Spark computes the gradients. The video was
recently uploaded here http://bit.ly/1JnvQAO.


Regards,
-- 

*Algorithms of the Mind*: http://bit.ly/1ReQvEW

Christopher Nguyen
CEO & Co-Founder
www.Arimo.com (née Adatao)
linkedin.com/in/ctnguyen


Re: Support R in Spark

2014-09-06 Thread Christopher Nguyen
Hi Kui, sorry about that. That link you mentioned is probably the one for
the products. We don't have one pointing from adatao.com to ddf.io; maybe
we'll add it.

As for access to the code base itself, I think the team has already created
a GitHub repo for it, and should open it up within a few weeks. There's
some debate about whether to put out the implementation with Shark
dependencies now, or the SparkSQL-based one, which has somewhat limited
functionality and is not as well tested.

I'll check and ping when this is opened up.

The license is Apache.

Sent while mobile. Please excuse typos etc.
On Sep 6, 2014 1:39 PM, "oppokui"  wrote:

> Thanks, Christopher. I saw it before; it is amazing. Last time I tried to
> download it from adatao, but there was no response after filling in the form.
> How can I download it or its source code? What is the license?
>
> Kui
>
>
> On Sep 6, 2014, at 8:08 PM, Christopher Nguyen  wrote:
>
> Hi Kui,
>
> DDF (open sourced) also aims to do something similar, adding RDBMS idioms,
> and is already implemented on top of Spark.
>
> One philosophy is that the DDF API aggressively hides the notion of
> parallel datasets, exposing only (mutable) tables to users, on which they
> can apply R and other familiar data mining/machine learning idioms, without
> having to know about the distributed representation underneath. Now, you
> can get to the underlying RDDs if you want to, simply by asking for them.
>
> This was launched at the July Spark Summit. See
> http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
> .
>
> Sent while mobile. Please excuse typos etc.
> On Sep 4, 2014 1:59 PM, "Shivaram Venkataraman" <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Thanks Kui. SparkR is a pretty young project, but there are a bunch of
>> things we are working on. One of the main features is to expose a data
>> frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will
>> be integrating this with Spark's MLLib.  At a high-level this will
>> allow R users to use a familiar API but make use of MLLib's efficient
>> distributed implementation. This is the same strategy used in Python
>> as well.
>>
>> Also we do hope to merge SparkR with mainline Spark -- we have a few
>> features to complete before that and plan to shoot for integration by
>> Spark 1.3.
>>
>> Thanks
>> Shivaram
>>
>> On Wed, Sep 3, 2014 at 9:24 PM, oppokui  wrote:
>> > Thanks, Shivaram.
>> >
>> > No specific use case yet. We want to use R in our project, as our data
>> > scientists all know R. We had a concern about how R handles very large
>> > data. Spark does a better job in the big data area, and Spark ML is
>> > focused on predictive analytics. So we are considering whether we can
>> > merge R and Spark together. We tried SparkR and it is pretty easy to use,
>> > but we didn't see any feedback on this package from industry. It would be
>> > better if the Spark team had R support just like Scala/Java/Python.
>> >
>> > Another question: MLlib will re-implement all the well-known data mining
>> > algorithms in Spark, so then what is the purpose of using R?
>> >
>> > There is another option for us, H2O, which supports R natively. H2O is
>> > more friendly to data scientists. I saw that H2O can also work on Spark
>> > (Sparkling Water). Is it better than using SparkR?
>> >
>> > Thanks and Regards.
>> >
>> > Kui
>> >
>> >
>> > On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman
>> >  wrote:
>> >
>> > Hi
>> >
>> > Do you have a specific use-case where SparkR doesn't work well ? We'd
>> love
>> > to hear more about use-cases and features that can be improved with
>> SparkR.
>> >
>> > Thanks
>> > Shivaram
>> >
>> >
>> > On Wed, Sep 3, 2014 at 3:19 AM, oppokui  wrote:
>> >>
>> >> Does the Spark ML team have a plan to support R scripts natively? There
>> >> is a SparkR project, but it is not from the Spark team. Spark ML uses
>> >> netlib-java to talk to native Fortran routines, or uses NumPy, so why not
>> >> try to use R in some sense?
>> >>
>> >> R has a lot of useful packages. If the Spark ML team can include R
>> >> support, it will be very powerful.
>> >>
>> >> Any comment?
>> >>
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Support R in Spark

2014-09-06 Thread Christopher Nguyen
Hi Kui,

DDF (open sourced) also aims to do something similar, adding RDBMS idioms,
and is already implemented on top of Spark.

One philosophy is that the DDF API aggressively hides the notion of
parallel datasets, exposing only (mutable) tables to users, on which they
can apply R and other familiar data mining/machine learning idioms, without
having to know about the distributed representation underneath. Now, you
can get to the underlying RDDs if you want to, simply by asking for them.

This was launched at the July Spark Summit. See
http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
.

Sent while mobile. Please excuse typos etc.
On Sep 4, 2014 1:59 PM, "Shivaram Venkataraman" 
wrote:

> Thanks Kui. SparkR is a pretty young project, but there are a bunch of
> things we are working on. One of the main features is to expose a data
> frame API (https://sparkr.atlassian.net/browse/SPARKR-1) and we will
> be integrating this with Spark's MLLib.  At a high-level this will
> allow R users to use a familiar API but make use of MLLib's efficient
> distributed implementation. This is the same strategy used in Python
> as well.
>
> Also we do hope to merge SparkR with mainline Spark -- we have a few
> features to complete before that and plan to shoot for integration by
> Spark 1.3.
>
> Thanks
> Shivaram
>
> On Wed, Sep 3, 2014 at 9:24 PM, oppokui  wrote:
> > Thanks, Shivaram.
> >
> > No specific use case yet. We want to use R in our project, as our data
> > scientists all know R. We had a concern about how R handles very large
> > data. Spark does a better job in the big data area, and Spark ML is
> > focused on predictive analytics. So we are considering whether we can
> > merge R and Spark together. We tried SparkR and it is pretty easy to use,
> > but we didn't see any feedback on this package from industry. It would be
> > better if the Spark team had R support just like Scala/Java/Python.
> >
> > Another question: MLlib will re-implement all the well-known data mining
> > algorithms in Spark, so then what is the purpose of using R?
> >
> > There is another option for us, H2O, which supports R natively. H2O is
> > more friendly to data scientists. I saw that H2O can also work on Spark
> > (Sparkling Water). Is it better than using SparkR?
> >
> > Thanks and Regards.
> >
> > Kui
> >
> >
> > On Sep 4, 2014, at 1:47 AM, Shivaram Venkataraman
> >  wrote:
> >
> > Hi
> >
> > Do you have a specific use-case where SparkR doesn't work well ? We'd
> love
> > to hear more about use-cases and features that can be improved with
> SparkR.
> >
> > Thanks
> > Shivaram
> >
> >
> > On Wed, Sep 3, 2014 at 3:19 AM, oppokui  wrote:
> >>
> >> Does the Spark ML team have a plan to support R scripts natively? There
> >> is a SparkR project, but it is not from the Spark team. Spark ML uses
> >> netlib-java to talk to native Fortran routines, or uses NumPy, so why not
> >> try to use R in some sense?
> >>
> >> R has a lot of useful packages. If the Spark ML team can include R
> >> support, it will be very powerful.
> >>
> >> Any comment?
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: First Bay Area Tachyon meetup: August 25th, hosted by Yahoo! (Limited Space)

2014-08-19 Thread Christopher Nguyen
Fantastic!

Sent while mobile. Pls excuse typos etc.
On Aug 19, 2014 4:09 PM, "Haoyuan Li"  wrote:

> Hi folks,
>
> We've posted the first Tachyon meetup, which will be on August 25th and is
> hosted by Yahoo! (Limited Space):
> http://www.meetup.com/Tachyon/events/200387252/ . Hope to see you there!
>
> Best,
>
> Haoyuan
>
> --
> Haoyuan Li
> AMPLab, EECS, UC Berkeley
> http://www.cs.berkeley.edu/~haoyuan/
>


Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Christopher Nguyen
Hi Hoai-Thu, the issue of the private default constructor is unlikely to be the
cause here, since Lance was already able to load/deserialize the model object.

And on that side topic, I wish all serdes libraries would just use
constructor.setAccessible(true) by default :-) Most of the time that privacy
is not meant to restrict serdes reflection.
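
As a concrete illustration of that trick, here is a minimal Scala sketch using
plain Java reflection; the class below is just a stand-in, not an MLlib class:

import java.lang.reflect.Constructor

// hypothetical class with a private no-arg constructor, standing in for
// something like MatrixFactorizationModel
class Hidden private () { override def toString = "Hidden()" }

def newInstanceViaPrivateCtor[T](clazz: Class[T]): T = {
  val ctor: Constructor[T] = clazz.getDeclaredConstructor()
  ctor.setAccessible(true)  // what one wishes serde libraries did by default
  ctor.newInstance()
}

// usage: newInstanceViaPrivateCtor(classOf[Hidden])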

Sent while mobile. Pls excuse typos etc.
On Aug 14, 2014 1:58 AM, "Hoai-Thu Vuong"  wrote:

> Someone in this community gave me a video:
> https://www.youtube.com/watch?v=sPhyePwo7FA. I had the same question in
> this community and other people helped me solve the problem. I was trying to
> load a MatrixFactorizationModel from an object file, but the compiler said I
> could not create the object because the constructor is private. To solve
> this, I put my new object in the same package as MatrixFactorizationModel.
> Luckily it works.
>
>
> On Wed, Aug 13, 2014 at 9:20 PM, Christopher Nguyen 
> wrote:
>
>> Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to
>> isolate the cause to serialization of the loaded model. And also try to
>> serialize the deserialized (loaded) model "manually" to see if that throws
>> any visible exceptions.
>>
>> Sent while mobile. Pls excuse typos etc.
>> On Aug 13, 2014 7:03 AM, "lancezhange"  wrote:
>>
>>> my prediction code is simple enough, as follows:
>>>
>>>   val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
>>>     val prediction = model.predict(point.features)
>>>     (point.label, prediction)
>>>   }
>>>
>>> when the model is the loaded one, the above code just doesn't work. Can you
>>> catch the error?
>>> Thanks.
>>>
>>> PS. I use spark-shell in standalone mode, version 1.0.0
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953p12035.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>
>
> --
> Thu.
>


Re: How to save mllib model to hdfs and reload it

2014-08-13 Thread Christopher Nguyen
Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to
isolate the cause to serialization of the loaded model. And also try to
serialize the deserialized (loaded) model "manually" to see if that throws
any visible exceptions.
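
For the second check, a minimal sketch of serializing the reloaded model by
hand (the variable name loadedModel is illustrative):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// a NotSerializableException thrown here usually names the culprit field
def trySerialize(obj: AnyRef): Unit = {
  val out = new ObjectOutputStream(new ByteArrayOutputStream())
  try out.writeObject(obj)
  finally out.close()
}

// e.g. trySerialize(loadedModel)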

Sent while mobile. Pls excuse typos etc.
On Aug 13, 2014 7:03 AM, "lancezhange"  wrote:

> my prediction code is simple enough, as follows:
>
>   val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
>     val prediction = model.predict(point.features)
>     (point.label, prediction)
>   }
>
> when the model is the loaded one, the above code just doesn't work. Can you
> catch the error?
> Thanks.
>
> PS. I use spark-shell in standalone mode, version 1.0.0
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-mllib-model-to-hdfs-and-reload-it-tp11953p12035.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to save mllib model to hdfs and reload it

2014-08-13 Thread Christopher Nguyen
+1 what Sean said. And if there are too many state/argument parameters for
your taste, you can always create a dedicated (serializable) class to
encapsulate them.
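
A minimal sketch of that pattern, with made-up parameter names:

import org.apache.spark.rdd.RDD

// bundle all closure state into one small serializable holder instead of
// letting the closure capture a (possibly non-serializable) enclosing class
case class ScoringParams(threshold: Double, weights: Array[Double])

def score(rdd: RDD[Array[Double]], params: ScoringParams): RDD[Double] =
  rdd.map { xs =>
    val s = xs.zip(params.weights).map { case (x, w) => x * w }.sum
    if (s > params.threshold) 1.0 else 0.0
  }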

Sent while mobile. Pls excuse typos etc.
On Aug 13, 2014 6:58 AM, "Sean Owen"  wrote:

> PS I think that solving "not serializable" exceptions by adding
> 'transient' is usually a mistake. It's a band-aid on a design problem.
>
> transient causes the default serialization mechanism to not serialize
> the field when the object is serialized. When deserialized, this field
> will be null, which often compromises the class's assumptions about
> state. This keyword is only appropriate when the field can safely be
> recreated at any time -- things like cached values.
>
> In Java, this commonly comes up when declaring anonymous (therefore
> non-static) inner classes, which have an invisible reference to the
> containing instance, which can easily cause it to serialize the
> enclosing class when it's not necessary at all.
>
> Inner classes should be static in this case, if possible. Passing
> values as constructor params takes more code but let you tightly
> control what the function references.
>
> On Wed, Aug 13, 2014 at 2:47 PM, Jaideep Dhok 
> wrote:
> > Hi,
> > I have faced a similar issue when trying to run a map function with
> predict.
> > In my case I had some non-serializable fields in my calling class. After
> > making those fields transient, the error went away.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread Christopher Nguyen
Hi sparkuser2345,

I'm inferring the problem statement is something like "how do I make this
complete faster (given my compute resources)?"

Several comments.

First, Spark only allows launching parallel tasks from the driver, not from
workers, which is why you're seeing the exception when you try. Whether the
latter is a sensible/doable idea is another discussion, but I can
appreciate why many people assume this should be possible.

Second, on optimization, you may be able to apply Sean's idea about
(thread) parallelism at the driver, combined with the knowledge that often
these cluster tasks bottleneck while competing for the same resources at
the same time (cpu vs disk vs network, etc.) You may be able to achieve
some performance optimization by randomizing these timings. This is not
unlike GMail randomizing user storage locations around the world for load
balancing. Here, you would partition each of your RDDs into a different
number of partitions, making some tasks larger than others, and thus some
may be in cpu-intensive map while others are shuffling data around the
network. This is rather cluster-specific; I'd be interested in what you
learn from such an exercise.
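
As an illustrative sketch of that staggering idea, using the data_kfolded
array from your snippet (the partition counts are arbitrary and
cluster-specific):

// give each fold a different partition count so the folds' heavy phases are
// less likely to contend for the same resource at the same moment
val staggered = data_kfolded.zipWithIndex.map { case ((train, test), i) =>
  (train.repartition(64 + 8 * i), test)
}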

Third, I find it useful always to consider doing as much as possible in one
pass, subject to memory limits, e.g., mapPartitions() vs map(), thus
minimizing map/shuffle/reduce boundaries with their context switches and
data shuffling. In this case, notice how you're running the
training+prediction k times over mostly the same rows, with map/reduce
boundaries in between. While the training phase is sealed in this context,
you may be able to improve performance by collecting all k models together
and doing an [m x k] set of predictions all at once, which may end up being
faster.
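
A rough sketch of that last idea, assuming the k models have already been
trained (the names are illustrative, not from your code):

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.regression.LabeledPoint

// score all k models in a single pass over the test rows: one row in,
// k predictions out, i.e. an [m x k] result from one map stage
def scoreAllModels(models: Array[SVMModel],
                   testData: RDD[LabeledPoint]): RDD[(Double, Array[Double])] = {
  val bcModels = testData.sparkContext.broadcast(models)
  testData.map { p =>
    (p.label, bcModels.value.map(_.predict(p.features)))
  }
}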

Finally, as implied from the above, for the very common k-fold
cross-validation pattern, the algorithm itself might be written to be smart
enough to take both train and test data and "do the right thing" within
itself, thus obviating the need for the user to prepare k data sets and
running over them serially, and likely saving a lot of repeated
computations in the right internal places.

Enjoy,
--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Sat, Jul 5, 2014 at 1:50 AM, Sean Owen  wrote:

> If you call .par on data_kfolded it will become a parallel collection in
> Scala and so the maps will happen in parallel .
> On Jul 5, 2014 9:35 AM, "sparkuser2345"  wrote:
>
>> Hi,
>>
>> I am trying to fit a logistic regression model with cross validation in
>> Spark 0.9.0 using SVMWithSGD. I have created an array data_kfolded where
>> each element is a pair of RDDs containing the training and test data:
>>
>> (training_data: (RDD[org.apache.spark.mllib.regression.LabeledPoint],
>> test_data: (RDD[org.apache.spark.mllib.regression.LabeledPoint])
>>
>> scala> data_kfolded
>> res21:
>>
>> Array[(org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],
>> org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint])]
>> =
>> Array((MappedRDD[9] at map at <console>:24,MappedRDD[7] at map at <console>:23),
>> (MappedRDD[13] at map at <console>:24,MappedRDD[11] at map at <console>:23),
>> (MappedRDD[17] at map at <console>:24,MappedRDD[15] at map at <console>:23))
>>
>> Everything works fine when using data_kfolded:
>>
>> val validationErrors =
>> data_kfolded.map { datafold =>
>>   val svmAlg = new SVMWithSGD()
>>   val model_reg = svmAlg.run(datafold._1)
>>   val labelAndPreds = datafold._2.map { point =>
>> val prediction = model_reg.predict(point.features)
>> (point.label, prediction)
>>   }
>>   val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble /
>> datafold._2.count
>>   trainErr.toDouble
>> }
>>
>> scala> validationErrors
>> res1: Array[Double] = Array(0.8819836785938481, 0.07082521117608837,
>> 0.29833546734955185)
>>
>> However, I have understood that the models are not fitted in parallel as
>> data_kfolded is not an RDD (although it's an array of pairs of RDDs). When
>> running the same code where data_kfolded has been replaced with
>> sc.parallelize(data_kfolded), I get a null pointer exception from the line
>> where the run method of the SVMWithSGD object is called with the training
>> data. I guess this is somehow related to the fact that RDDs can't be
>> accessed from inside a closure. I fail to understand though why the first
>> version works and the second doesn't. Most importantly, is there a way to
>> fit the models in parallel? I would really appreciate your help.
>>
>> val validationErrors =
>> sc.parallelize(data_kfolded).map { datafold =>
>>   val svmAlg = new SVMWithSGD()
>>   val model_reg = svmAlg.run(datafold._1) // This line gives null pointer
>> exception
>>   val labelAndPreds = datafold._2.map { point =>
>> val prediction = model_reg.predict(point.features)
>> (point.label, prediction)
>>   }
>>   val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble /
>> datafold._2.count
>>

Re: initial basic question from new user

2014-06-12 Thread Christopher Nguyen
Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want
for your use case. As for Parquet support, that's newly arrived in Spark
1.0.0 together with SparkSQL so continue to watch this space.
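
For example, a minimal sketch (the paths and the local master are just for
illustration):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local", "persist-example")
val agg = sc.parallelize(1 to 1000).map(i => (i % 10, i)).reduceByKey(_ + _)

agg.saveAsObjectFile("/tmp/agg-rdd")  // binary, keeps the pair type
agg.map { case (k, v) => k + "\t" + v }.saveAsTextFile("/tmp/agg-tsv")  // text

// a later job (with its own SparkContext) can reload the object file:
val reloaded = sc.objectFile[(Int, Int)]("/tmp/agg-rdd")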

Gerard's suggestion to look at JobServer, which you can generalize as
"building a long-running application which allows multiple clients to
load/share/persist/save/collaborate-on RDDs" satisfies a larger, more
complex use case. That is indeed the job of a higher-level application,
subject to a wide variety of higher-level design choices. A number of us
have successfully built Spark-based apps around that model.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Thu, Jun 12, 2014 at 4:35 AM, Toby Douglass  wrote:

> On Thu, Jun 12, 2014 at 11:36 AM, Gerard Maas 
> wrote:
>
>> The goal of rdd.persist is to created a cached rdd that breaks the DAG
>> lineage. Therefore, computations *in the same job* that use that RDD can
>> re-use that intermediate result, but it's not meant to survive between job
>> runs.
>>
>
> As I understand it, Spark is designed for interactive querying, in the
> sense that the caching of intermediate results eliminates the need to
> recompute those results.
>
> However, if intermediate results last only for the duration of a job (e.g.
> say a python script), how exactly is interactive querying actually
> performed?   a script is not an interactive medium.  Is the shell the only
> medium for interactive querying?
>
> Consider a common usage case : a web-site, which offers reporting upon a
> large data set.  Users issue arbitrary queries.  A few queries (just with
> different arguments) dominate the query load, so we thought to create
> intermediate RDDs to service those queries, so only those order of
> magnitude or smaller RDDs would need to be processed.  Where this is not
> possible, we can only use Spark for reporting by issuing each query over
> the whole data set - e.g. Spark is just like Impala is just like Presto is
> just like [nnn].  The enormous benefit of RDDs - the entire point of Spark
> - so profoundly useful here - is not available.  What a huge and unexpected
> loss!  Spark seemingly renders itself ordinary.  It is for this reason I am
> surprised to find this functionality is not available.
>
>
>> If you need to ad-hoc persist to files, you can can save RDDs using
>> rdd.saveAsObjectFile(...) [1] and load them afterwards using
>> sparkContext.objectFile(...)
>>
>
> I've been using this site for docs;
>
> http://spark.apache.org
>
> Here we find through the top-of-the-page menus the link "API Docs" ->
> ""Python API" which brings us to;
>
> http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html
>
> Where this page does not show the function saveAsObjectFile().
>
> I find now from your link here;
>
>
> https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
>
> What appears to be a second and more complete set of the same
> documentation, using a different web-interface to boot.
>
> It appears at least that there are two sets of documentation for the same
> APIs, where one set is out of the date and the other not, and the out of
> date set is that which is linked to from the main site?
>
> Given that our agg sizes will exceed memory, we expect to cache them to
> disk, so save-as-object (assuming there are no out of the ordinary
> performance issues) may solve the problem, but I was hoping to store data
> is a column orientated format.  However I think this in general is not
> possible - Spark can *read* Parquet, but I think it cannot write Parquet as
> a disk-based RDD format.
>
> If you want to preserve the RDDs in memory between job runs, you should
>> look at the Spark-JobServer [3]
>>
>
> Thankyou.
>
> I view this with some trepidation.  It took two man-days to get Spark
> running (and I've spent another man day now trying to get a map/reduce to
> run; I'm getting there, but not there yet) - the bring-up/config experience
> for end-users is not tested or accurately documented (although to be clear,
> no better and no worse than is normal for open source; Spark is not
> exceptional).  Having to bring up another open source project is a
> significant barrier to entry; it's always such a headache.
>
> The save-to-disk function you mentioned earlier will allow intermediate
> RDDs to go to disk, but we do in fact have a use case where in-memory would
> be useful; it might allow us to ditch Cassandra, which would be wonderful,
> since it would reduce the system count by one.
>
> I have to say, having to install JobServer to achieve this one end seems
> an extraordinarily heavyweight solution - a whole new application, when all
> that is wished for is that Spark persists RDDs across jobs, where so small
> a feature seems to open the door to so much functionality.
>
>
>


Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread Christopher Nguyen
Lakshmi, this is orthogonal to your question, but in case it's useful.

It sounds like you're trying to determine the home location of a user, or
something similar.

If that's the problem statement, the data pattern may suggest a far more
computationally efficient approach. For example, first map all (lat,long)
pairs into geocells of a desired resolution (e.g., 10m or 100m), then count
occurrences of geocells instead. There are simple libraries to map any
(lat,long) pairs into a geocell (string) ID very efficiently.
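
A rough sketch of that approach; the cell function below is a simple rounding
scheme standing in for a real geocell/geohash library, and sc is an existing
SparkContext:

import org.apache.spark.SparkContext._

// ~0.001 degree is roughly 100m at the equator
def cellId(lat: Double, lon: Double, cellDeg: Double = 0.001): (Long, Long) =
  (math.floor(lat / cellDeg).toLong, math.floor(lon / cellDeg).toLong)

case class Hit(ip: String, lat: Double, lon: Double)

val hits = sc.parallelize(Seq(
  Hit("1.2.3.4", 13.00451, 100.53121),
  Hit("1.2.3.4", 13.00458, 100.53119),
  Hit("5.6.7.8", 40.71230, -74.00590)))

// count occurrences per (ip, cell), then keep the most frequent cell per ip
val topCellPerIp = hits
  .map(h => ((h.ip, cellId(h.lat, h.lon)), 1))
  .reduceByKey(_ + _)
  .map { case ((ip, cell), n) => (ip, (cell, n)) }
  .reduceByKey((a, b) => if (a._2 >= b._2) a else b)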

--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Wed, Jun 4, 2014 at 3:49 AM, lmk 
wrote:

> Hi,
> I am a new spark user. Pls let me know how to handle the following
> scenario:
>
> I have a data set with the following fields:
> 1. DeviceId
> 2. latitude
> 3. longitude
> 4. ip address
> 5. Datetime
> 6. Mobile application name
>
> With the above data, I would like to perform the following steps:
> 1. Collect all lat and lon for each ipaddress
> (ip1,(lat1,lon1),(lat2,lon2))
> (ip2,(lat3,lon3),(lat4,lat5))
> 2. For each IP,
> 1.Find the distance between each lat and lon coordinate pair and
> all
> the other pairs under the same IP
> 2.Select those coordinates whose distances fall under a specific
> threshold (say 100m)
> 3.Find the coordinate pair with the maximum occurrences
>
> In this case, how can I iterate and compare each coordinate pair with all
> the other pairs?
> Can this be done in a distributed manner, as this data set is going to have
> a few million records?
> Can we do this in map/reduce commands?
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-done-in-map-reduce-technique-in-parallel-tp6905.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Announcing Spark 1.0.0

2014-05-30 Thread Christopher Nguyen
Awesome work, Pat et al.!

--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell  wrote:

> I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
> is a milestone release as the first in the 1.0 line of releases,
> providing API stability for Spark's core interfaces.
>
> Spark 1.0.0 is Spark's largest release ever, with contributions from
> 117 developers. I'd like to thank everyone involved in this release -
> it was truly a community effort with fixes, features, and
> optimizations contributed from dozens of organizations.
>
> This release expands Spark's standard libraries, introducing a new SQL
> package (SparkSQL) which lets users integrate SQL queries into
> existing Spark workflows. MLlib, Spark's machine learning library, is
> expanded with sparse vector support and several new algorithms. The
> GraphX and Streaming libraries also introduce new features and
> optimizations. Spark's core engine adds support for secured YARN
> clusters, a unified tool for submitting Spark applications, and
> several performance and stability improvements. Finally, Spark adds
> support for Java 8 lambda syntax and improves coverage of the Java and
> Python API's.
>
> Those features only scratch the surface - check out the release notes here:
> http://spark.apache.org/releases/spark-release-1-0-0.html
>
> Note that since release artifacts were posted recently, certain
> mirrors may not have working downloads for a few hours.
>
> - Patrick
>


Re: Spark Memory Bounds

2014-05-28 Thread Christopher Nguyen
Keith, please see inline.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Tue, May 27, 2014 at 7:22 PM, Keith Simmons  wrote:

> A dash of both.  I want to know enough that I can "reason about", rather
> than "strictly control", the amount of memory Spark will use.  If I have a
> big data set, I want to understand how I can design it so that Spark's
> memory consumption falls below my available resources.  Or alternatively,
> if it's even possible for Spark to process a data set over a certain size.
>  And if I run into memory problems, I want to know which knobs to turn, and
> how turning those knobs will affect memory consumption.
>

In practice, to avoid OOME, a key dial we use is the size (or inversely,
number) of the partitions of your dataset. Clearly there is some "blow-up
factor" F such that, e.g., if you start out with 128MB on-disk data
partitions, you would consume 128F MB of memory, both by Spark and by your
closure code. Knowing this, you would want to size the partitions such that
AvailableMemoryInMBPerWorker / NumberOfCoresPerWorker > 128F. To arrive at
F, you could do some back-of-the-envelope modeling, and/or run the job and
observe empirically.
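
As a back-of-the-envelope sketch (all of these numbers are made-up
assumptions, not recommendations):

val datasetMB         = 64 * 1024  // 64 GB of input
val memoryPerWorkerMB = 16 * 1024  // heap available to tasks on one worker
val coresPerWorker    = 8
val blowUpFactorF     = 4.0        // measured or estimated empirically

// largest partition that still fits: memory per core divided by the blow-up
val maxPartitionMB = memoryPerWorkerMB / coresPerWorker / blowUpFactorF  // 512
val minPartitions  = math.ceil(datasetMB / maxPartitionMB).toInt         // 128

// then e.g. sc.textFile(path, minPartitions) or rdd.repartition(minPartitions)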


>
> It's my understanding that between certain key stages in a Spark DAG (i.e.
> group by stages), Spark will serialize all data structures necessary to
> continue the computation at the next stage, including closures.  So in
> theory, per machine, Spark only needs to hold the transient memory required
> to process the partitions assigned to the currently active tasks.  Is my
> understanding correct?  Specifically, once a key/value pair is serialized
> in the shuffle stage of a task, are the references to the raw java objects
> released before the next task is started.
>

Yes, that is correct in non-cached mode. At the same time, Spark also does
something else optionally, which is to keep the data structures (RDDs)
persistent in memory (*). As such it is possible for partitions that are not
being actively worked on to consume memory. Spark will spill all these
to local disk if they take up more memory than it is allowed to use. So
the key thing to worry about is less about what Spark does (apart from
overhead and, yes, the possibility of bugs that need to be fixed), and more
about what your closure code does with JVM memory as a whole. If in doubt,
refer back to the "blow-up factor" model described above.

(*) this is a fundamentally differentiating feature of Spark over a range
of other "in-memory" architectures, that focus on raw-data or transient
caches that serve non-equivalent purposes when viewed from the application
level. It allows for very fast access to ready-to-consume high-level data
structures, as long as available RAM permits.


>
>
> On Tue, May 27, 2014 at 6:21 PM, Christopher Nguyen wrote:
>
>> Keith, do you mean "bound" as in (a) strictly control to some
>> quantifiable limit, or (b) try to minimize the amount used by each task?
>>
>> If "a", then that is outside the scope of Spark's memory management,
>> which you should think of as an application-level (that is, above JVM)
>> mechanism. In this scope, Spark "voluntarily" tracks and limits the amount
>> of memory it uses for explicitly known data structures, such as RDDs. What
>> Spark cannot do is, e.g., control or manage the amount of JVM memory that a
>> given piece of user code might take up. For example, I might write some
>> closure code that allocates a large array of doubles unbeknownst to Spark.
>>
>> If "b", then your thinking is in the right direction, although quite
>> imperfect, because of things like the example above. We often experience
>> OOME if we're not careful with job partitioning. What I think Spark needs
>> to evolve to is at least to include a mechanism for application-level hints
>> about task memory requirements. We might work on this and submit a PR for
>> it.
>>
>> --
>> Christopher T. Nguyen
>> Co-founder & CEO, Adatao <http://adatao.com>
>> linkedin.com/in/ctnguyen
>>
>>
>>
>> On Tue, May 27, 2014 at 5:33 PM, Keith Simmons  wrote:
>>
>>> I'm trying to determine how to bound my memory use in a job working with
>>> more data than can simultaneously fit in RAM.  From reading the tuning
>>> guide, my impression is that Spark's memory usage is roughly the following:
>>>
>>> (A) In-Memory RDD use + (B) In memory Shuffle use + (C) Transient memory
>>> used by all currently running tasks
>>>
>>> I can bound A with spark.storage.memoryFraction and I can bound B w

Re: Spark Memory Bounds

2014-05-27 Thread Christopher Nguyen
Keith, do you mean "bound" as in (a) strictly control to some quantifiable
limit, or (b) try to minimize the amount used by each task?

If "a", then that is outside the scope of Spark's memory management, which
you should think of as an application-level (that is, above JVM) mechanism.
In this scope, Spark "voluntarily" tracks and limits the amount of memory
it uses for explicitly known data structures, such as RDDs. What Spark
cannot do is, e.g., control or manage the amount of JVM memory that a given
piece of user code might take up. For example, I might write some closure
code that allocates a large array of doubles unbeknownst to Spark.

If "b", then your thinking is in the right direction, although quite
imperfect, because of things like the example above. We often experience
OOME if we're not careful with job partitioning. What I think Spark needs
to evolve to is at least to include a mechanism for application-level hints
about task memory requirements. We might work on this and submit a PR for
it.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Tue, May 27, 2014 at 5:33 PM, Keith Simmons  wrote:

> I'm trying to determine how to bound my memory use in a job working with
> more data than can simultaneously fit in RAM.  From reading the tuning
> guide, my impression is that Spark's memory usage is roughly the following:
>
> (A) In-Memory RDD use + (B) In memory Shuffle use + (C) Transient memory
> used by all currently running tasks
>
> I can bound A with spark.storage.memoryFraction and I can bound B with 
> spark.shuffle.memoryFraction.
>  I'm wondering how to bound C.
>
> It's been hinted at a few times on this mailing list that you can reduce
> memory use by increasing the number of partitions.  That leads me to
> believe that the amount of transient memory is roughly follows:
>
> total_data_set_size/number_of_partitions *
> number_of_tasks_simultaneously_running_per_machine
>
> Does this sound right?  In other words, as I increase the number of
> partitions, the size of each partition will decrease, and since each task
> is processing a single partition and there are a bounded number of tasks in
> flight, my memory use has a rough upper limit.
>
> Keith
>


Re: is Mesos falling out of favor?

2014-05-16 Thread Christopher Nguyen
Paco, that's a great video reference, thanks.

To be fair to our friends at Yahoo, who have done a tremendous amount to
help advance the cause of the BDAS stack, it's not FUD coming from them,
certainly not in any organized or intentional manner.

In vacuo we prefer Mesos ourselves, but also can't ignore the fact that in
the larger market, many enterprise technology stack decisions are made
based on their existing vendor support relationships.

And in view of Mesos, super happy to see Mesosphere growing!

Sent while mobile. Pls excuse typos etc.

That's FUD. Tracking the Mesos and Spark use cases, there are very large
production deployments of these together. Some are rather private but
others are being surfaced. IMHO, one of the most amazing case studies is
from Christina Delimitrou http://youtu.be/YpmElyi94AA

For a tutorial, use the following but upgrade it to latest production for
Spark. There was a related O'Reilly webcast and Strata tutorial as well:
http://mesosphere.io/learn/run-spark-on-mesos/

FWIW, I teach "Intro to Spark" with sections on CM4, YARN, Mesos, etc.
Based on lots of student experiences, Mesos is clearly the shortest path to
deploying a Spark cluster if you want to leverage the robustness,
multi-tenancy for mixed workloads, less ops overhead, etc., that show up
repeatedly in the use case analyses.

My opinion only and not that of any of my clients: "Don't believe the FUD
from YHOO unless you really want to be stuck in 2009."


On Wed, May 7, 2014 at 8:30 AM, deric  wrote:

> I'm also using right now SPARK_EXECUTOR_URI, though I would prefer
> distributing Spark as a binary package.
>
> For running examples with `./bin/run-example ...` it works fine, however
> tasks from spark-shell are getting lost.
>
> Error: Could not find or load main class
> org.apache.spark.executor.MesosExecutorBackend
>
> which looks more like a problem with sbin/spark-executor and missing paths to
> the jar. Has anyone encountered this error before?
>
> I guess Yahoo invested quite a lot of effort into YARN and Spark integration
> (moreover, with Mahout migrating to Spark there's much more interest in
> Hadoop and Spark integration). If there were some "Mesos company"
> working on Spark-Mesos integration, it could be at least on the same
> level.
>
> I don't see any other reason why YARN would be better than Mesos;
> personally I like the latter. However, I haven't checked YARN for a while;
> maybe they've made significant progress. I think Mesos is more universal
> and flexible than YARN.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/is-Mesos-falling-out-of-favor-tp5444p5481.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Opinions stratosphere

2014-05-01 Thread Christopher Nguyen
Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually attempted
such a comparative study as a Masters thesis:

http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf

According to this snapshot (c. 2013), Stratosphere is different from Spark
in not having an explicit concept of an in-memory dataset (e.g., RDD).

In principle this could be argued to be an implementation detail; the
operators and execution plan/data flow are of primary concern in the API,
and the data representation/materializations are otherwise unspecified.

But in practice, for long-running interactive applications, I consider RDDs
to be of fundamental, first-class citizen importance, and the key
distinguishing feature of Spark's model vs other "in-memory" approaches
that treat memory merely as an implicit cache.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Tue, Nov 26, 2013 at 1:26 PM, Matei Zaharia wrote:

> I don’t know a lot about it except from the research side, where the team
> has done interesting optimization stuff for these types of applications. In
> terms of the engine, one thing I’m not sure of is whether Stratosphere
> allows explicit caching of datasets (similar to RDD.cache()) and
> interactive queries (similar to spark-shell). But it’s definitely an
> interesting project to watch.
>
> Matei
>
> On Nov 22, 2013, at 4:17 PM, Ankur Chauhan 
> wrote:
>
> > Hi,
> >
> > That's what I thought but as per the slides on
> http://www.stratosphere.eu they seem to "know" about spark and the scala
> api does look similar.
> > I found the PACT model interesting. Would like to know if matei or other
> core comitters have something to weight in on.
> >
> > -- Ankur
> > On 22 Nov 2013, at 16:05, Patrick Wendell  wrote:
> >
> >> I've never seen that project before, would be interesting to get a
> >> comparison. Seems to offer a much lower level API. For instance this
> >> is a wordcount program:
> >>
> >>
> https://github.com/stratosphere/stratosphere/blob/master/pact/pact-examples/src/main/java/eu/stratosphere/pact/example/wordcount/WordCount.java
> >>
> >> On Thu, Nov 21, 2013 at 3:15 PM, Ankur Chauhan 
> wrote:
> >>> Hi,
> >>>
> >>> I was just curious about https://github.com/stratosphere/stratosphere
> >>> and how does spark compare to it. Anyone has any experience with it to
> make
> >>> any comments?
> >>>
> >>> -- Ankur
> >
>
>


Re: Spark and HBase

2014-04-08 Thread Christopher Nguyen
Flavio, the two are best at two orthogonal use cases, HBase on the
transactional side, and Spark on the analytic side. Spark is not intended
for row-based random-access updates, while far more flexible and efficient
in dataset-scale aggregations and general computations.

So yes, you can easily see them deployed side-by-side in a given enterprise.

Sent while mobile. Pls excuse typos etc.
On Apr 8, 2014 5:58 AM, "Flavio Pompermaier"  wrote:

> Hi to everybody,
>
> in these days I looked a bit at the recent evolution of the big data
> stacks and it seems that HBase is somehow fading away in favour of
> Spark+HDFS. Am I correct?
> Do you think that Spark and HBase should work together or not?
>
> Best regards,
> Flavio
>


Re: Spark on other parallel filesystems

2014-04-06 Thread Christopher Nguyen
Venkat, correct, though to be sure, I'm referring to I/O related to
loading/saving data from/to their persistence locations, and not I/O
related to local operations like RDD caching or shuffling.

Sent while mobile. Pls excuse typos etc.
On Apr 5, 2014 11:11 AM, "Venkat Krishnamurthy"  wrote:

>  Christopher
>
>  Just to clarify - by 'load ops' do you mean RDD actions that result in
> IO?
>
>  Venkat
>  From: Christopher Nguyen 
> Reply-To: "user@spark.apache.org" 
> Date: Saturday, April 5, 2014 at 8:49 AM
> To: "user@spark.apache.org" 
> Subject: Re: Spark on other parallel filesystems
>
>   Avati, depending on your specific deployment config, there can be up to
> a 10X difference in data loading time. For example, we routinely parallel
> load 10+GB data files across small 8-node clusters in 10-20 seconds, which
> would take about 100s if bottlenecked over a 1GigE network. That's about
> the max difference for that config. If you use multiple local SSDs the
> difference can be correspondingly greater, and likewise 10x smaller for
> 10GigE networks.
>
> Lastly, an interesting dimension to consider is that the difference
> diminishes as your data size gets much larger relative to your cluster
> size, since the load ops have to be serialized in time anyway.
>
> There is no difference after loading.
>
> Sent while mobile. Pls excuse typos etc.
> On Apr 4, 2014 10:45 PM, "Anand Avati"  wrote:
>
>>
>>
>>
>> On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia wrote:
>>
>>> As long as the filesystem is mounted at the same path on every node, you
>>> should be able to just run Spark and use a file:// URL for your files.
>>>
>>>  The only downside with running it this way is that Lustre won't expose
>>> data locality info to Spark, the way HDFS does. That may not matter if it's
>>> a network-mounted file system though.
>>>
>>
>>  Is the locality querying mechanism specific to HDFS mode, or is it
>> possible to implement plugins in Spark to query location in other ways on
>> other filesystems? I ask because, glusterfs can expose data location of a
>> file through virtual extended attributes and I would be interested in
>> making Spark exploit that locality when the file location is specified as
>> glusterfs:// (or querying the xattr blindly for file://). How much of a
>> difference does data locality make for Spark use cases anyways (since most
>> of the computation happens in memory)? Any sort of numbers?
>>
>>  Thanks!
>> Avati
>>
>>
>>>
>>>
>>   Matei
>>>
>>>  On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy 
>>> wrote:
>>>
>>>  All
>>>
>>>  Are there any drawbacks or technical challenges (or any information,
>>> really) related to using Spark directly on a global parallel filesystem
>>>  like Lustre/GPFS?
>>>
>>>  Any idea of what would be involved in doing a minimal proof of
>>> concept? Is it just possible to run Spark unmodified (without the HDFS
>>> substrate) for a start, or will that not work at all? I do know that it's
>>> possible to implement Tachyon on Lustre and get the HDFS interface - just
>>> looking at other options.
>>>
>>>  Venkat
>>>
>>>
>>>
>>


Re: Spark on other parallel filesystems

2014-04-05 Thread Christopher Nguyen
Avati, depending on your specific deployment config, there can be up to a
10X difference in data loading time. For example, we routinely parallel
load 10+GB data files across small 8-node clusters in 10-20 seconds, which
would take about 100s if bottlenecked over a 1GigE network. That's about
the max difference for that config. If you use multiple local SSDs the
difference can be correspondingly greater, and likewise 10x smaller for
10GigE networks.

Lastly, an interesting dimension to consider is that the difference
diminishes as your data size gets much larger relative to your cluster
size, since the load ops have to be serialized in time anyway.

There is no difference after loading.

Sent while mobile. Pls excuse typos etc.
On Apr 4, 2014 10:45 PM, "Anand Avati"  wrote:

>
>
>
> On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia wrote:
>
>> As long as the filesystem is mounted at the same path on every node, you
>> should be able to just run Spark and use a file:// URL for your files.
>>
>> The only downside with running it this way is that Lustre won't expose
>> data locality info to Spark, the way HDFS does. That may not matter if it's
>> a network-mounted file system though.
>>
>
> Is the locality querying mechanism specific to HDFS mode, or is it
> possible to implement plugins in Spark to query location in other ways on
> other filesystems? I ask because, glusterfs can expose data location of a
> file through virtual extended attributes and I would be interested in
> making Spark exploit that locality when the file location is specified as
> glusterfs:// (or querying the xattr blindly for file://). How much of a
> difference does data locality make for Spark use cases anyways (since most
> of the computation happens in memory)? Any sort of numbers?
>
> Thanks!
> Avati
>
>
>>
>>
> Matei
>>
>> On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy 
>> wrote:
>>
>>  All
>>
>>  Are there any drawbacks or technical challenges (or any information,
>> really) related to using Spark directly on a global parallel filesystem
>>  like Lustre/GPFS?
>>
>>  Any idea of what would be involved in doing a minimal proof of concept?
>> Is it just possible to run Spark unmodified (without the HDFS substrate)
>> for a start, or will that not work at all? I do know that it's possible to
>> implement Tachyon on Lustre and get the HDFS interface - just looking at
>> other options.
>>
>>  Venkat
>>
>>
>>
>


Re: Cross validation is missing in machine learning examples

2014-03-30 Thread Christopher Nguyen
Aureliano, you're correct that this is not "validation error", which is
computed as the residuals on out-of-training-sample data, and helps
minimize overfit variance.

However, in this example, the errors are correctly referred to as "training
error", which is what you might compute on a per-iteration basis in a
gradient-descent optimizer, in order to see how you're doing with respect
to minimizing the in-sample residuals.

There's nothing special about Spark ML algorithms that claims to escape
these bias-variance considerations.
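
If you do want a validation error, here is a minimal hold-out sketch along the
lines of that guide, assuming parsedData is its RDD[LabeledPoint]; the 70/30
split and the iteration count are arbitrary:

import scala.util.Random
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// tag each point once with a random train/holdout flag; cache so the split
// stays stable across the passes below
val tagged   = parsedData.map(p => (Random.nextDouble() < 0.7, p)).cache()
val training = tagged.filter(t => t._1).map(_._2)
val holdout  = tagged.filter(t => !t._1).map(_._2)

val model = LinearRegressionWithSGD.train(training, 100)

val validationMSE = holdout.map { p =>
  val err = p.label - model.predict(p.features)
  err * err
}.mean()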

--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Sat, Mar 29, 2014 at 10:25 PM, Aureliano Buendia wrote:

> Hi,
>
> I notice the Spark machine learning examples use training data to validate
> regression models. For instance, in the linear regression example:
>
> // Evaluate model on training examples and compute training error
> val valuesAndPreds = parsedData.map { point =>
>   val prediction = model.predict(point.features)
>   (point.label, prediction)
> }
> ...
>
>
> Here training data was used to validate a model which was created from
> the very same training data. This is just a bias estimate, and cross
> validation is missing here. In order to cross-validate, we need to partition
> the data into an in-sample set for training and an out-of-sample set for
> validation.
>
> Please correct me if this does not apply to ML algorithms implemented in
> spark.
>


Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
Sung Hwan, yes, I'm saying exactly what you interpreted, including that if
you tried it, it would (mostly) work, and my uncertainty with respect to
guarantees on the semantics. Definitely there would be no fault tolerance
if the mutations depend on state that is not captured in the RDD lineage.

DDF is to RDD as RDD is to HDFS. Not a perfect analogy, but the point
is that it's an abstraction above, with all the attendant implications, pluses
and minuses. With DDFs you get to think of everything as tables with
schemas, while the underlying complexity of mutability and data
representation is hidden away. You also get rich idioms to operate on those
tables like filtering, projection, subsetting, handling of missing data
(NA's), dummy-column generation, data mining statistics and machine
learning, etc. In some aspects it replaces a lot of boiler plate analytics
that you don't want to re-invent over and over again, e.g., FiveNum or
XTabs. So instead of 100 lines of code, it's 4. In other aspects it allows
you to easily apply "arbitrary" machine learning algorithms without having
to think too hard about getting the data types just right. Etc.

But you would also find yourself wanting access to the underlying RDDs for
their full semantics & flexibility.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Fri, Mar 28, 2014 at 8:46 PM, Sung Hwan Chung
wrote:

> Thanks Chris,
>
> I'm not exactly sure what you mean with MutablePair, but are you saying
> that we could create RDD[MutablePair] and modify individual rows?
>
> If so, will that play nicely with RDD's lineage and fault tolerance?
>
> As for the alternatives, I don't think 1 is something we want to do, since
> that would require another complex system we'll have to implement. Is DDF
> going to be an alternative to RDD?
>
> Thanks again!
>
>
>
> On Fri, Mar 28, 2014 at 7:02 PM, Christopher Nguyen wrote:
>
>> Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to
>> get what you want is to transform to another RDD. But you might look at
>> MutablePair (
>> https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala)
>> to see if the semantics meet your needs.
>>
>> Alternatively you can consider:
>>
>>1. Build & provide a fast lookup service that stores and returns the
>>mutable information keyed by the RDD row IDs, or
>>2. Use DDF (Distributed DataFrame) which we'll make available in the
>>near future, which will give you fully mutable-row table semantics.
>>
>>
>> --
>> Christopher T. Nguyen
>> Co-founder & CEO, Adatao <http://adatao.com>
>> linkedin.com/in/ctnguyen
>>
>>
>>
>> On Fri, Mar 28, 2014 at 5:16 PM, Sung Hwan Chung <
>> coded...@cs.stanford.edu> wrote:
>>
>>> Hey guys,
>>>
>>> I need to tag individual RDD lines with some values. This tag value
>>> would change at every iteration. Is this possible with RDD (I suppose this
>>> is sort of like mutable RDD, but it's more) ?
>>>
>>> If not, what would be the best way to do something like this? Basically,
>>> we need to keep mutable information per data row (this would be something
>>> much smaller than actual data row, however).
>>>
>>> Thanks
>>>
>>
>>
>


Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to
get what you want is to transform to another RDD. But you might look at
MutablePair (
https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala)
to see if the semantics meet your needs.
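
For example, a minimal sketch of the transform-to-a-new-RDD approach, keeping
per-row tags in a separate keyed RDD (the row ids, tag type, and update rule
are all made up for illustration):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def iterateTags(sc: SparkContext): RDD[(Long, Double)] = {
  // rows keyed by a stable id
  val data: RDD[(Long, Array[Double])] =
    sc.parallelize(Seq(1L -> Array(0.5, 1.0), 2L -> Array(2.0, 3.0)))

  // tags live in their own RDD; each iteration derives a *new* tags RDD
  // instead of mutating rows, so lineage and fault tolerance are preserved
  var tags: RDD[(Long, Double)] = data.mapValues(_ => 0.0)
  for (_ <- 1 to 3) {
    tags = data.join(tags).mapValues { case (row, tag) => tag + row.sum }.cache()
  }
  tags
}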

Alternatively you can consider:

   1. Build & provide a fast lookup service that stores and returns the
   mutable information keyed by the RDD row IDs, or
   2. Use DDF (Distributed DataFrame) which we'll make available in the
   near future, which will give you fully mutable-row table semantics.


--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Fri, Mar 28, 2014 at 5:16 PM, Sung Hwan Chung
wrote:

> Hey guys,
>
> I need to tag individual RDD lines with some values. This tag value would
> change at every iteration. Is this possible with RDD (I suppose this is
> sort of like mutable RDD, but it's more) ?
>
> If not, what would be the best way to do something like this? Basically,
> we need to keep mutable information per data row (this would be something
> much smaller than actual data row, however).
>
> Thanks
>


Re: Running a task once on each executor

2014-03-27 Thread Christopher Nguyen
Deenar, yes, you may indeed be overthinking it a bit, about how Spark
executes maps/filters etc. I'll focus on the high-order bits so it's clear.

Let's assume you're doing this in Java. Then you'd pass some *MyMapper*
instance to *JavaRDD#map(myMapper)*.

So you'd have a class *MyMapper extends Function*. The
*call()* method of that class is effectively the function that will be
executed by the workers on your RDD's rows.

Within that *MyMapper#call()*, you can access static members and methods of
*MyMapper* itself. You could implement your *runOnce()* there.



--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Thu, Mar 27, 2014 at 4:20 PM, deenar.toraskar wrote:

> Christopher
>
> Sorry I might be missing the obvious, but how do i get my function called
> on
> all Executors used by the app? I dont want to use RDDs unless necessary.
>
> once I start my shell or app, how do I get
> TaskNonce.getSingleton().doThisOnce() executed on each executor?
>
> @dmpour
> >>rdd.mapPartitions and it would still work as code would only be executed
> once in each VM, but was wondering if there is more efficient way of doing
> this by using a generated RDD with one partition per executor.
> This remark was misleading, what I meant was that in conjunction with the
> TaskNonce pattern, my function would be called only once per executor as
> long as the RDD had atleast one partition on each executor
>
> Deenar
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3393.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Running a task once on each executor

2014-03-27 Thread Christopher Nguyen
Deenar, dmpour is correct in that there's a many-to-many mapping between
executors and partitions (an executor can be assigned multiple partitions,
and a given partition can in principle move to a different executor).

I'm not sure why you seem to require this problem statement to be solved
with RDDs. It is fairly easy to have something executed once per JVM, using
the pattern I suggested. Is there some other requirement I have missed?

Sent while mobile. Pls excuse typos etc.
On Mar 27, 2014 9:06 AM, "dmpour23"  wrote:

> How exactly does rdd.mapPartitions get executed once in each VM?
>
> I am running mapPartitions and the call function does not seem to execute
> the code.
>
> JavaPairRDD twos = input.map(new
> Split()).sortByKey().partitionBy(new HashPartitioner(k));
> twos.values().saveAsTextFile(args[2]);
>
> JavaRDD ls = twos.values().mapPartitions(new
> FlatMapFunction, String>() {
> @Override
> public Iterable call(Iterator arg0) throws Exception {
>System.out.println("Usage should call my jar once: " + arg0);
>return Lists.newArrayList();}
> });
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3353.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Announcing Spark SQL

2014-03-26 Thread Christopher Nguyen
+1 Michael, Reynold et al. This is key to some of the things we're doing.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Wed, Mar 26, 2014 at 2:58 PM, Michael Armbrust wrote:

> Hey Everyone,
>
> This already went out to the dev list, but I wanted to put a pointer here
> as well to a new feature we are pretty excited about for Spark 1.0.
>
>
> http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
>
> Michael
>


Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, the singleton pattern I'm suggesting would look something like this:

public class TaskNonce {

  private transient volatile boolean mIsAlreadyDone;

  private static final TaskNonce mSingleton = new TaskNonce();

  private final transient Object mSyncObject = new Object();

  public static TaskNonce getSingleton() { return mSingleton; }

  public void doThisOnce() {
    if (mIsAlreadyDone) return;
    synchronized (mSyncObject) {
      if (mIsAlreadyDone) return;
      mIsAlreadyDone = true;
      ...
    }
  }
}

which you would invoke as TaskNonce.getSingleton().doThisOnce() from within
the map closure. If you're using the Spark Java API, you can put all this
code in the mapper class itself.
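
For example, a minimal sketch of that invocation (assuming an existing
JavaRDD<String> rdd; purely illustrative, and as always nothing runs until
an action is called on the result):

// Sketch only: the nonce body runs at most once per executor JVM, no matter
// how many elements or partitions this executor ends up processing.
JavaRDD<String> out = rdd.map(new Function<String, String>() {
  @Override
  public String call(String s) throws Exception {
    TaskNonce.getSingleton().doThisOnce();  // cheap no-op after the first call
    return s;
  }
});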

There is no need to require one-row RDD partitions to achieve what you
want, if I understand your problem statement correctly.



--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Tue, Mar 25, 2014 at 11:07 AM, deenar.toraskar wrote:

> Christopher
>
> It is once per JVM. TaskNonce would meet my needs. I guess if I want it
> once
> per thread, then a ThreadLocal would do the same.
>
> But how do I invoke TaskNonce? What is the best way to generate an RDD to
> ensure that there is one element per executor?
>
> Deenar
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203p3208.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, when you say "just once", have you defined the scope of "once"
(e.g., is it once across multiple threads in the same JVM on the same
machine)? In principle you can have multiple executors on the same machine.

In any case, assuming it's the same JVM, have you considered using a
singleton that maintains done/not-done state, that is invoked by each of
the instances (TaskNonce.getSingleton().doThisOnce()) ? You can, e.g., mark
the state boolean "transient" to prevent it from going through serdes.



--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Tue, Mar 25, 2014 at 10:03 AM, deenar.toraskar wrote:

> Hi
>
> Is there a way in Spark to run a function on each executor just once? I
> have a couple of use cases.
>
> a) I use an external library that is a singleton. It keeps some global
> state and provides some functions to manipulate it (e.g., reclaim memory,
> etc.). I want to check the global state of this library on each executor.
>
> b) To get jvm stats or instrumentation on each executor.
>
> Currently I have a crude way of achieving something similar: I just run a
> map on a large RDD that is hash partitioned, but this does not guarantee
> that the job would run just once.
>
> Deenar
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-tp3203.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Christopher Nguyen
Chanwit, that is awesome!

Improvements in shuffle operations should help improve life even more for
you. Great to see a data point on ARM.

Sent while mobile. Pls excuse typos etc.
On Mar 18, 2014 7:36 PM, "Chanwit Kaewkasi"  wrote:

> Hi all,
>
> We are a small team doing a research on low-power (and low-cost) ARM
> clusters. We built a 20-node ARM cluster that is able to run Hadoop.
> But as you all know, Hadoop performs on-disk operations,
> so it's not well suited to resource-constrained machines powered by ARM.
>
> We then switched to Spark and had to say wow!!
>
> Spark / HDFS enables us to crunch through Wikipedia articles (from 2012) of
> size 34GB in 1h50m. We have identified the bottleneck, and it's our
> 100Mbps network.
>
> Here's the cluster:
> https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/Mk-I_SSD.png
>
> And this is what we got from Spark's shell:
> https://dl.dropboxusercontent.com/u/381580/aiyara_cluster/result_00.png
>
> I think it's the first ARM cluster that can process a non-trivial amount
> of Big Data.
> (Please correct me if I'm wrong)
> I really want to thank the Spark team that makes this possible !!
>
> Best regards,
>
> -chanwit
>
> --
> Chanwit Kaewkasi
> linkedin.com/in/chanwit
>


Re: best practices for pushing an RDD into a database

2014-03-13 Thread Christopher Nguyen
Nicholas,

> (Can we make that a thing? Let's make that a thing. :)

Yes, we're soon releasing something called Distributed DataFrame (DDF) to
the community that will make this (among other useful idioms) "a
(straightforward) thing" for Spark.
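
In the meantime, a rough sketch of one way to push each cached partition to
the database in parallel over plain JDBC (the URL, credentials, and table
below are hypothetical placeholders; an action is needed to trigger the job,
and your MPP database's native bulk-load path, e.g. COPY, may well be faster
than row inserts):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Collections;
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

// Sketch: given a cached JavaRDD<String> rows, each executor inserts the
// partitions it holds directly into the target table.
JavaRDD<Integer> written = rows.mapPartitions(
    new FlatMapFunction<Iterator<String>, Integer>() {
      @Override
      public Iterable<Integer> call(Iterator<String> it) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://mpp-host:5439/db", "user", "secret");
        PreparedStatement ps =
            conn.prepareStatement("INSERT INTO target_table (line) VALUES (?)");
        int n = 0;
        while (it.hasNext()) {
          ps.setString(1, it.next());
          ps.addBatch();
          if (++n % 10000 == 0) ps.executeBatch();  // flush periodically
        }
        ps.executeBatch();
        ps.close();
        conn.close();
        return Collections.singletonList(n);  // rows written by this partition
      }
    });
written.count();  // the action that actually runs the per-partition inserts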

Sent while mobile. Pls excuse typos etc.
On Mar 13, 2014 2:05 PM, "Nicholas Chammas" 
wrote:

> My fellow welders
> ,
>
> (Can we make that a thing? Let's make that a thing. :)
>
> I'm trying to wedge Spark into an existing model where we process and
> transform some data and then load it into an MPP database. I know that part
> of the sell of Spark and Shark is that you shouldn't have to copy data
> around like this, so please bear with me. :)
>
> Say I have an RDD of about 10GB in size that's cached in memory. What is
> the best/fastest way to push that data into an MPP database like 
> Redshift?
> Has anyone done something like this?
>
> I'm assuming that pushing the data straight from memory into the database
> is much faster than writing the RDD to HDFS and then COPY-ing it from there
> into the database.
>
> Is there, for example, a way to perform a bulk load into the database that
> runs on each partition of the in-memory RDD in parallel?
>
> Nick
>
>
> --
> View this message in context: best practices for pushing an RDD into a
> database
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: major Spark performance problem

2014-03-06 Thread Christopher Nguyen
Dana,

When you run multiple "applications" under Spark, and each application
takes up the entire cluster's resources, it is expected that one will block
the other completely, which is why you're seeing the wall times add up
sequentially. In addition, there is some overhead associated with starting
up a new application/SparkContext.

Your other mode of sharing a single SparkContext, if your use case allows
it, is more promising in that workers are available to work on tasks in
parallel (but ultimately still subject to maximum resource limits). Without
knowing what your actual workload is, it's hard to tell in absolute terms
whether 12 seconds is reasonable or not.
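
One knob worth checking when sharing a single SparkContext across request
threads is the scheduler mode: with the default FIFO scheduler, earlier jobs
get priority over later ones, whereas fair scheduling interleaves them. A
minimal, hypothetical sketch (whether it helps depends on your workload):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: enable fair scheduling of concurrent jobs within one SparkContext.
SparkConf conf = new SparkConf()
    .setAppName("shared-context-service")   // illustrative name
    .set("spark.scheduler.mode", "FAIR");   // default is FIFO
JavaSparkContext sc = new JavaSparkContext(conf);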

One reason for the jump from 12s in local mode to 40s in cluster mode would
be the HBase bottleneck: you apparently would have 2x10=20 clients going
against the HBase data source instead of 1 (or however many local threads
you have). Assuming this is an increase of useful work output by a factor
of 20x, getting roughly 20x the work done in a little over 3x the wall time
(12s to 40s) is actually quite attractive.

NB: given my assumption that the HBase data source is not parallelized
along with the Spark cluster, you would run into sublinear performance
issues (HBase-perf-limited or network-bandwidth-limited) as you scale out
your cluster size.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Thu, Mar 6, 2014 at 11:49 AM, Livni, Dana  wrote:

>  Hi all,
>
>
>
> We have a big issue and would like if someone have any insights or ideas.
>
> The problem is composed of two connected problems.
>
> 1.   Run time of a single application.
>
> 2.   The run time of multiple applications running in parallel grows
> almost linearly with the number of applications.
>
>
>
> We have written a Spark application fetching its data from HBase.
>
> We are running the application using YARN-client resource manager.
>
> The cluster has 2 nodes (both used as HBase data nodes and Spark/YARN
> processing nodes).
>
>
>
> We have a few Spark steps in our app; the heaviest and longest of them all
> is described by this flow:
>
> 1.   flatMap - converting the HBase RDD to an RDD of objects.
>
> 2.   Group by key.
>
> 3.   Map - performing the calculations we need (checking a set of basic
> mathematical conditions).
>
>
>
> When running a single instance of this step on only 2000 records, it takes
> around 13s (all the records are related to one key).
>
> The HBase table we fetch the data from has 5 regions.
>
>
>
> Our implementation uses a REST service which creates one Spark context.
>
> Each request we make to this service runs an instance of the application
> (but again, all requests use the same Spark context).
>
> Each request creates multiple threads which run all the application steps.
>
> When running one request (with 10 parallel threads), the relevant stage
> takes about 40s for all the threads - each one of them takes 40s itself,
> but they run almost completely in parallel, so the total run time of one
> request is also 40s.
>
>
>
> We have allocated 10 workers, each with 512MB of memory (no need for more;
> it looks like the entire RDD is cached).
>
>
>
> So the first question:
>
> Does this run time make sense? To us it seems too long. Do you have any
> idea what we are doing wrong?
>
>
>
> The second problem, and the more serious one:
>
> We need to run multiple parallel requests of this kind.
>
> When doing so, the run time spikes again: instead of one request that runs
> in about 1m (40s is only the main stage), we get 2 applications, both
> running almost in parallel, and both running for 2m.
>
> This also happens if we use 2 different services and send each of them
> 1 request.
>
> These running times grow as we send more requests.
>
>
>
> We have also monitored the CPU usage of the node and each request makes it
> jump to 90%.
>
>
>
> If we reduce the number of workers to 2, the CPU usage jumps to only about
> 35%, but the run time increases significantly.
>
>
>
> This seems very odd to us.
>
> Are there any Spark parameters we should consider changing?
>
> Any other ideas? We are quite stuck on this.
>
>
>
> Thanks in advance,
>
> Dana
>
>
>
>
>
>
>
>
>
>
>
> -
> Intel Electronics Ltd.
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
>