Re: spark stddev() giving '?' as output how to handle it ? i.e replace null/0

2019-04-23 Thread Shyam P
Sorry, yeah, I fixed this... it's a formatting issue. Please ignore. Thank you.

On Wed, Apr 24, 2019 at 11:58 AM Shyam P wrote:
> https://stackoverflow.com/questions/55823608/how-to-handle-spark-stddev-function-output-value-when-there-there-is-no-data
>
> Regards,
> Shyam

Handle empty partitions in pyspark

2019-04-23 Thread kanchan tewary
Hi All, I have a situation where the RDD has some empty partitions, which I would like to identify and handle while applying mapPartitions or similar functions. Is there a way to do this in PySpark? The method isEmpty works only on the whole RDD and cannot be applied here. Much appreciated. Code blo
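[A minimal PySpark sketch of one possible approach, not necessarily what the poster ends up using: peek at each partition's iterator inside mapPartitionsWithIndex and branch on whether it yielded anything. All names and the "skip empty partitions" handling are illustrative.]

import itertools
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("empty-partition-sketch").getOrCreate()
sc = spark.sparkContext

# More partitions than elements, so some partitions are guaranteed empty.
rdd = sc.parallelize(range(10), 20)

def process_partition(index, it):
    first = next(it, None)
    if first is None:
        # Empty partition: skip it (or log / emit a sentinel instead).
        return iter([])
    # Put the peeked element back and process the partition normally.
    return ((index, x * 2) for x in itertools.chain([first], it))

print(rdd.mapPartitionsWithIndex(process_partition).collect())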

spark stddev() giving '?' as output how to handle it ? i.e replace null/0

2019-04-23 Thread Shyam P
https://stackoverflow.com/questions/55823608/how-to-handle-spark-stddev-function-output-value-when-there-there-is-no-data Regards, Shyam
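[For reference, a minimal PySpark sketch of the usual workaround; the DataFrame and column names are made up. stddev() returns NULL when a group has fewer than two rows, and coalesce() can substitute 0.0 so downstream code never sees the NULL.]

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stddev-null-sketch").getOrCreate()

df = spark.createDataFrame([("a", 1.0), ("a", 3.0), ("b", 5.0)], ["key", "value"])

agg = df.groupBy("key").agg(
    F.coalesce(F.stddev("value"), F.lit(0.0)).alias("value_stddev")
)
agg.show()  # group "b" has a single row, so its stddev is NULL and becomes 0.0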

Fwd: autoBroadcastJoinThreshold not working as expected

2019-04-23 Thread Mike Chan
Dear all, I'm on a case where, when a certain table is exposed to a broadcast join, the query eventually fails with a remote block error. First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely 10485760. Then we proceeded to perform the query. In the SQL plan, we fo
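[For context, a minimal PySpark sketch of the setting being discussed; the tables here are synthetic stand-ins. The physical plan from explain() is where the broadcast decision shows up.]

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-threshold-sketch").getOrCreate()

# 10 MB threshold, expressed in bytes (10485760), as in the message above.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

small = spark.range(100).withColumnRenamed("id", "k")
large = spark.range(1000000).withColumnRenamed("id", "k")

# explain() shows BroadcastHashJoin when the smaller side is estimated to fit
# under the threshold, and SortMergeJoin otherwise.
large.join(small, "k").explain()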

Re: Spark LogisticRegression got stuck on dataset with millions of columns

2019-04-23 Thread Weichen Xu
Could you provide your code and your cluster info? On Tue, Apr 23, 2019 at 4:10 PM Qian He wrote: > The dataset was using a sparse representation before feeding into > LogisticRegression. > > On Tue, Apr 23, 2019 at 3:15 PM Weichen Xu > wrote: >> Hi Qian, >> >> Does your dataset use sparse

Re: Spark LogisticRegression got stuck on dataset with millions of columns

2019-04-23 Thread Qian He
The dataset was using a sparse representation before feeding into LogisticRegression. On Tue, Apr 23, 2019 at 3:15 PM Weichen Xu wrote: > Hi Qian, > > Does your dataset use a sparse vector format? > > > > On Mon, Apr 22, 2019 at 5:03 PM Qian He wrote: > >> Hi all, >> >> I'm using the Spark-provided Lo

spark 2.4.1 -> 3.0.0-SNAPSHOT mllib

2019-04-23 Thread Koert Kuipers
We recently started compiling against Spark 3.0.0-SNAPSHOT (built in-house from the master branch) to uncover any breaking changes that might be an issue for us. We ran into some of our tests breaking where we use MLlib. Most of it is immaterial: we had some magic numbers hard-coded and the results ar

Re: Spark LogisticRegression got stuck on dataset with millions of columns

2019-04-23 Thread Weichen Xu
Hi Qian, Does your dataset use a sparse vector format? On Mon, Apr 22, 2019 at 5:03 PM Qian He wrote: > Hi all, > > I'm using the Spark-provided LogisticRegression to fit a dataset. Each row of > the data has 1.7 million columns, but it is sparse with only hundreds of > 1s. The Spark UI reported hig
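[A minimal PySpark sketch of what "sparse vector format" means here; the data is synthetic and the original poster may well be on Scala. Passing features as SparseVector keeps a row with 1.7 million columns and only a handful of 1s cheap to store and ship.]

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("sparse-lr-sketch").getOrCreate()

num_features = 1700000  # ~1.7 million columns, almost all zero
rows = [
    (1.0, Vectors.sparse(num_features, [3, 17, 42], [1.0, 1.0, 1.0])),
    (0.0, Vectors.sparse(num_features, [5, 99, 1000], [1.0, 1.0, 1.0])),
]
df = spark.createDataFrame(rows, ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(df)
print(model.coefficients.numNonzeros())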

Re: toDebugString - RDD Logical Plan

2019-04-23 Thread kanchan tewary
Hello Dylan, Thank you for the help. The result does look formatted after making the change. However, from the following code, I was expecting RDD types like MappedRDD and FilteredRDD to be present in the lineage; instead, I can only see PythonRDD and ParallelCollectionRDD in the lineage [I am running i
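[For what it's worth, a small PySpark sketch with a synthetic RDD that reproduces this observation: chained Python map/filter lambdas are pipelined into a single PythonRDD on the JVM side, so per-transformation RDD names do not appear in the lineage.]

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("todebugstring-sketch").getOrCreate()
sc = spark.sparkContext

rdd = (sc.parallelize(range(100))
         .map(lambda x: x * 2)
         .filter(lambda x: x % 3 == 0))

# In PySpark, toDebugString() returns bytes; the output typically lists only
# PythonRDD and ParallelCollectionRDD for the lineage above.
print(rdd.toDebugString().decode("utf-8"))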

Re: Update / Delete records in Parquet

2019-04-23 Thread Khare, Ankit
Hi Chetan, I also agree that for this use case Parquet would not be the best option. I had a similar use case: 50 different tables to be downloaded from MSSQL. Source: MSSQL. Destination: Apache Kudu (since it supports change data capture use cases very well). We used the StreamSets CDC module to co