Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-04 Thread Rishi Shah
Thanks Patrick! I packaged it according to these instructions and it was distributed across the cluster. However, the same Spark program that takes 5 minutes without a pandas UDF now takes 25 minutes. Have you experienced anything like this? Also, is PyArrow 0.12 supported with Spark 2.3

Re: This MapR-DB Spark Connector with Secondary Indexes

2019-05-04 Thread Mich Talebzadeh
I am at a loss as to why one needs Spark to load one row from the DB, as in the example below: val data = sparkSession.loadFromMapRDB("/user/mapr/tables/data", schema).filter("uid = '101'").select("_id") Assuming that _id is the primary key, we are going to load just one row. Spark as a

Re: Deep Learning with Spark, what is your experience?

2019-05-04 Thread Riccardo Ferrari
Thank you for your answers! While it is clear that each DL framework can handle distributed model training on its own (some better than others), I still see a lot of value in having Spark for the ETL/pre-processing part, hence my question. I am trying to avoid managing multiple

Re: Deep Learning with Spark, what is your experience?

2019-05-04 Thread Pat Ferrel
@Riccardo Spark does not do the learning part of the DL pipeline (afaik), so it is limited to data ingestion and transforms (ETL). It is therefore optional, and other ETL options might be better for you. Most of the technologies @Gourav mentions have their own scaling based on their own compute

Re: Deep Learning with Spark, what is your experience?

2019-05-04 Thread Gourav Sengupta
Try using MXNet and Horovod directly as well (I think MXNet is worth a try): 1. https://medium.com/apache-mxnet/distributed-training-using-apache-mxnet-with-horovod-44f98bf0e7b7 2. https://docs.nvidia.com/deeplearning/dgx/mxnet-release-notes/rel_19-01.html 3.

Re: pySpark - pandas UDF and binaryType

2019-05-04 Thread Gourav Sengupta
Just try using an apply on a series with a custom function, or with any other library. Advertisement and actual delivery are two different skills altogether. Not everyone wants to add one to their column using a pandas UDF, as one of their links shows :) Most actual use cases are more

Deep Learning with Spark, what is your experience?

2019-05-04 Thread Riccardo Ferrari
Hi list, I am trying to understand whether it makes sense to leverage Spark as an enabling platform for Deep Learning. My open questions to you are: - Do you use Apache Spark in your DL pipelines? - How do you use Spark for DL? Is it just a stand-alone stage in the workflow (i.e. data preparation

Dynamic metric names

2019-05-04 Thread Sergey Zhemzhitsky
Hello Spark Users! Just wondering whether it is possible to register a metric source whose metrics are not known in advance, and add the metrics themselves to this source later on? It seems that currently MetricsSystem puts all the metrics from the source's MetricRegistry into a shared MetricRegistry of

Re: pySpark - pandas UDF and binaryType

2019-05-04 Thread Nicolas Paris
Hi Gourav, > And also be aware that pandas UDF does not always lead to better performance > and sometimes even massively slow performance. This information is not widely known; it is good to know. In which circumstances is it worse than a regular UDF? > With Grouped Map dont you run into the

error when running decisiontree in java

2019-05-04 Thread Serena S Yuan
Hi, I integrated the Apache Spark decision tree classifier into a Java program that reads real-time data into an array called 'vals' and then runs the code: Vector v = Vectors.dense(vals); LabeledPoint pos = new LabeledPoint(0.0, v); SparkConf sparkConf = new