Thanks Patrick! I tried to package it according to these instructions, and it
got distributed on the cluster; however, the same Spark program that takes 5
mins without the pandas UDF has started to take 25 mins...
Have you experienced anything like this? Also, is PyArrow 0.12 supported
with Spark 2.3?
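For what it's worth, the body of a scalar pandas UDF is just a function from pandas Series to pandas Series, so it can be sanity-checked and profiled locally, without a SparkSession, before packaging it for the cluster. A minimal sketch (the plus_one function is made up for illustration, not from this thread):

```python
import pandas as pd

# The function Spark would wrap with pandas_udf: it receives each Arrow
# batch of the column as a pandas Series and must return a Series of
# the same length.
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

# Plain pandas, so it can be exercised locally first:
print(plus_one(pd.Series([1, 2, 3])).tolist())  # [2, 3, 4]
```

Testing the body this way separates "my function is slow" from "the Spark/Arrow plumbing is slow", which helps when a job suddenly takes 5x longer.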
I am at a loss as to why one needs Spark to load one row from the DB, as in
the example below:
val data = sparkSession
  .loadFromMapRDB("/user/mapr/tables/data", schema)
  .filter("uid = '101'")
  .select("_id")
Assuming that _id is the primary key, we are only going to load one row.
Spark as a
Thank you for your answers!
While it is clear that each DL framework can solve distributed model
training on its own (some better than others), I still see a lot of
value in having Spark on the ETL/pre-processing side, hence the origin of my
question.
I am trying to avoid having to manage multiple
@Riccardo
Spark does not do the DL training part of the pipeline (AFAIK), so it is
limited to data ingestion and transformation (ETL). It is therefore optional,
and other ETL options might be better for you.
Most of the technologies @Gourav mentions have their own scaling based on
their own compute
Try using MXNet and Horovod directly as well (I think MXNet is worth a
try too):
1. https://medium.com/apache-mxnet/distributed-training-using-apache-mxnet-with-horovod-44f98bf0e7b7
2. https://docs.nvidia.com/deeplearning/dgx/mxnet-release-notes/rel_19-01.html
3.
Just try using an apply on a Series for a custom function, or on any other
library. Advertisement and actual delivery are two different skills
altogether. Not everyone wants to add a one to their column using the
pandas UDF, as one of those links shows :)
Most of the actual use cases are more
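To illustrate the point about apply on a Series: a Python function applied row by row is invoked once per element, while a vectorized expression operates on the whole Series in one call, and the latter is the shape of work a scalar pandas UDF hands you. A minimal pure-pandas sketch (the data is made up):

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# Row-at-a-time: the lambda is invoked once per element.
row_wise = s.apply(lambda v: v + 1)

# Vectorized: a single call over the whole Series. A pandas UDF only
# pays off when its body is written in this vectorized style.
vectorized = s + 1

print(row_wise.equals(vectorized))  # True
```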
Hi list,
I am trying to understand whether it makes sense to leverage Spark as an
enabling platform for Deep Learning.
My open questions to you are:
- Do you use Apache Spark in your DL pipelines?
- How do you use Spark for DL? Is it just a stand-alone stage in the
workflow (i.e. data preparation
Hello Spark Users!
Just wondering whether it is possible to register a metric source without
the metrics being known in advance, and to add the metrics themselves to this
source later on?
It seems that currently MetricSystem puts all the metrics from the source's
MetricRegistry into a shared MetricRegistry of
Hi Gourav,
> And also be aware that pandas UDF does not always lead to better performance
> and sometimes even massively slow performance.
This information is not widely known; good to know. In which
circumstances is it worse than a regular UDF?
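One way to probe this locally (a rough sketch, not from the thread): benchmark the UDF body on plain pandas before wiring it into Spark. If the vectorized body is not clearly faster than the per-row version, the extra Arrow serialization a pandas UDF adds has nothing to amortize against, and a regular UDF may well win.

```python
import time
import pandas as pd

s = pd.Series(range(100_000))

def timed(f):
    # Tiny timing helper; for real work prefer the timeit module.
    start = time.perf_counter()
    f()
    return time.perf_counter() - start

# Compare the per-row and vectorized versions of the same body.
t_row = timed(lambda: s.apply(lambda v: v * 2))
t_vec = timed(lambda: s * 2)
print(f"row-wise: {t_row:.4f}s, vectorized: {t_vec:.4f}s")
```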
> With Grouped Map don't you run into the
Hi,
I integrated the Apache Spark decision tree classifier in a Java
program that reads real-time data into an array called 'vals' and then
runs the code:
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

Vector v = Vectors.dense(vals);
LabeledPoint pos = new LabeledPoint(0.0, v);
SparkConf sparkConf = new