Hello List,
I was wondering about the design principle behind an RDD inheriting its
partition count from its parent. See one simple example below [*].
'ngauss_rdd2' has significantly less data; intuitively, in such cases,
shouldn't Spark invoke coalesce automatically for performance?
What would b
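For reference, a minimal sketch of the explicit alternative (local mode; the RDD names are stand-ins modeled on the question). Narrow transformations such as filter inherit the parent's partition count, and shrinking it has to be requested by the user, presumably because only the user knows whether the merge cost pays off:

```scala
import org.apache.spark.sql.SparkSession

object CoalesceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[4]").appName("coalesce-sketch").getOrCreate()
    val sc = spark.sparkContext

    val ngaussRdd = sc.parallelize(1 to 1000000, 100)
    // A highly selective filter: the child RDD still reports 100 partitions,
    // because narrow transformations inherit the parent's partitioning.
    val ngaussRdd2 = ngaussRdd.filter(_ % 1000 == 0)
    println(ngaussRdd2.getNumPartitions)  // 100

    // Shrinking must be asked for explicitly; coalesce merges partitions
    // without a full shuffle.
    val compacted = ngaussRdd2.coalesce(4)
    println(compacted.getNumPartitions)   // 4

    spark.stop()
  }
}
```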
There is a BigDL project:
https://github.com/intel-analytics/BigDL
On 20 June 2017 at 16:17, Jules Damji wrote:
> And we will be having a webinar on July 27 going into more detail. Stay
> tuned.
>
> Cheers
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Jun 20, 2017, a
I suggest the RandomRDDs API; it provides nice tools. Writing wrappers
around it might be a good approach:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$
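A small usage sketch in local mode; the `gaussianRDD` wrapper is a hypothetical example of the kind of wrapper suggested above:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs

object RandomRDDsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "random-rdds-sketch")

    // 10,000 i.i.d. draws from N(0, 1), spread over 4 partitions.
    val standardNormal = RandomRDDs.normalRDD(sc, 10000L, 4)

    // A thin (hypothetical) wrapper for N(mu, sigma^2).
    def gaussianRDD(mu: Double, sigma: Double) =
      standardNormal.map(x => mu + sigma * x)

    println(gaussianRDD(5.0, 2.0).mean())  // roughly 5.0

    sc.stop()
  }
}
```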
On 3 August 2017 at 01:05, jeff saremi wrote:
> Vadim:
>
> This is from the Mastering Spark book:
>
> "It is strongly recommended that a checkpointed RDD is persisted in memory,
> otherwise saving it on a file will require recomputation."
Is this really true? I had the impression that DAG will no
On 3 August 2017 at 03:00, Vadim Semenov wrote:
> `saveAsObjectFile` doesn't save the DAG, it acts as a typical action, so it
> just saves data to some destination.
Yes, that's what I thought, so the statement "...otherwise saving it on
a file will require recomputation." from the book is not entirely accurate.
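A sketch of the persist-before-checkpoint pattern the book is describing (local mode; the checkpoint directory is an assumed writable path). Checkpointing launches a separate job that re-evaluates the lineage, so persisting first lets that second pass read cached blocks instead of recomputing:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "checkpoint-sketch")
    sc.setCheckpointDir("/tmp/spark-checkpoints")  // assumed writable path

    val expensive = sc.parallelize(1 to 1000).map(x => x * x)

    // Persist first: the checkpoint job re-runs the lineage, so without
    // this the map above would be computed twice.
    expensive.persist(StorageLevel.MEMORY_ONLY)
    expensive.checkpoint()
    expensive.count()  // materializes the RDD and writes the checkpoint

    println(expensive.isCheckpointed)  // true

    sc.stop()
  }
}
```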
It depends on which model you would like to train, but models requiring
optimisation could use SGD with mini-batches. See:
https://spark.apache.org/docs/latest/mllib-optimization.html#stochastic-gradient-descent-sgd
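To make the mini-batch idea concrete, here is a hand-rolled sketch on a toy 1-D least-squares problem in local mode. This is not MLlib's optimizer (whose GradientDescent exposes the same knob as miniBatchFraction); it just shows that each iteration samples a fraction of the RDD, so no single step scans the full dataset:

```scala
import org.apache.spark.SparkContext

object MiniBatchSGDSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "minibatch-sgd-sketch")

    // Toy data for y = 2x; stands in for a dataset too large to scan per step.
    val data = sc.parallelize(1 to 1000)
      .map(i => (i / 1000.0, 2.0 * i / 1000.0))
      .cache()

    var w = 0.0          // single weight for 1-D least squares
    val stepSize = 0.5
    for (iter <- 1 to 50) {
      // Each step sees only ~10% of the data.
      val batch = data.sample(withReplacement = false, fraction = 0.1, seed = iter.toLong)
      // Gradient of 0.5 * (w*x - y)^2 with respect to w is (w*x - y) * x.
      val (gradSum, count) = batch
        .map { case (x, y) => ((w * x - y) * x, 1L) }
        .fold((0.0, 0L)) { case ((g1, n1), (g2, n2)) => (g1 + g2, n1 + n2) }
      if (count > 0) w -= stepSize * gradSum / count
    }
    println(w)  // converges towards 2.0

    sc.stop()
  }
}
```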
On 23 August 2017 at 14:27, Sea aj wrote:
> Hi,
>
> I am trying to feed a huge dat
I think ordering has no meaning in RDDs; see this post, especially the zip methods:
https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method
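To make the pairing explicit rather than positional, one common pattern (sketched here in local mode, assuming each RDD's own element order is the meaningful one) is to key both sides with zipWithIndex and join; the index, not the partition layout, then defines the match:

```scala
import org.apache.spark.SparkContext

object SafeZipSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "safe-zip-sketch")

    val left  = sc.parallelize(Seq("a", "b", "c"), 2)
    val right = sc.parallelize(Seq(1, 2, 3), 2)

    // rdd.zip pairs elements purely by position and requires identical
    // partitioning; keying by an explicit index makes the pairing explicit
    // and survives shuffles.
    val byIndexLeft  = left.zipWithIndex.map(_.swap)
    val byIndexRight = right.zipWithIndex.map(_.swap)
    val joined = byIndexLeft.join(byIndexRight).sortByKey().values

    joined.collect().foreach(println)  // (a,1), (b,2), (c,3)

    sc.stop()
  }
}
```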
> ...sequence across a
> partition as partition is local and computation happens one record at a
> time.
>
> On 13-Sep-2017 9:54 PM, "Suzen, Mehmet" wrote:
>
> I think ordering has no meaning in RDDs; see this post, especially the zip
> methods:
> https://stackoverflo
of partitions in mapPartition?
On 13 Sep 2017 19:54, "Ankit Maloo" wrote:
>
> RDDs are fault tolerant, as they can be recomputed using the DAG without
> storing the intermediate RDDs.
>
> On 13-Sep-2017 11:16 PM, "Suzen, Mehmet" wrote:
>>
>> But what h
On 14 September 2017 at 10:42, wrote:
> val noTs = myData.map(dropTimestamp)
>
> val scaled = scaler.transform(noTs)
>
> val projected = (new RowMatrix(scaled)).multiply(principalComponents).rows
>
> val clusters = myModel.predict(projected)
>
> val result = myData.zip(clusters)
>
>
>
> Do you th
Hi Johan,
DataFrames are built on top of RDDs; I am not sure whether the ordering
issues are different there. Maybe you could create a minimal simulated
dataset and an example series of transformations to experiment on.
Best,
-m
Mehmet Süzen, MSc, PhD
You can use Breeze, which is part of the Spark distribution:
https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra
Check out the modules under import breeze._
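As a small illustration, a Shannon-entropy helper in Breeze; the function name and the bits (log2) convention are my choice, and it assumes a strictly positive probability vector summing to 1:

```scala
import breeze.linalg.{DenseVector, sum}
import breeze.numerics.log2

object EntropySketch {
  def main(args: Array[String]): Unit = {
    // H(p) = -sum_i p_i * log2(p_i), computed elementwise via Breeze.
    def entropy(p: DenseVector[Double]): Double = -sum(p *:* log2(p))

    val uniform = DenseVector(0.25, 0.25, 0.25, 0.25)
    println(entropy(uniform))  // 2.0 bits for a uniform 4-state distribution
  }
}
```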
On 23 May 2018 at 07:04, umargeek wrote:
> Hi Folks,
>
> I am planning to rewrite one of my Python modules written for entropy
> cal