check if empty efficiently

2019-06-26 Thread SNEHASISH DUTTA
Hi, which is more efficient? This has been defined since 2.4.0: *def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan => plan.executeCollect().head.getLong(0) == 0 }* or *df.head(1).isEmpty*? I am checking whether a DF is empty and it is taking forever.
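[Editor's note, not part of the original message: a minimal sketch contrasting the two emptiness checks, assuming Spark 2.4+ and a tiny in-memory DataFrame built only for illustration.]

```scala
import org.apache.spark.sql.SparkSession

object EmptyCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("empty-check").getOrCreate()
    import spark.implicits._

    val df = Seq.empty[(Int, String)].toDF("id", "name")

    // Built-in since 2.4.0: limit(1).groupBy().count() collapses to one row on the driver.
    val empty1 = df.isEmpty

    // Older idiom: fetch at most one row and test the local array.
    val empty2 = df.head(1).isEmpty

    println(s"isEmpty=$empty1 head(1).isEmpty=$empty2")
    spark.stop()
  }
}
```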

Generic Dataset[T] Query

2019-05-09 Thread SNEHASISH DUTTA
Hi, I am trying to write a generic method which will return custom-type datasets as well as spark.sql.Row. def read[T](params: Map[String, Any])(implicit encoder: Encoder[T]): Dataset[T] is my method signature, which works fine for custom types, but when I try to obtain a Dataset[Row]
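[Editor's note, not part of the original message: a sketch of one way to make such a generic reader also produce Dataset[Row]. The signature is simplified to a path argument, the case class, file name and schema are illustrative, and RowEncoder is the Spark 2.x-era helper for building an explicit Encoder[Row].]

```scala
import org.apache.spark.sql.{Dataset, Encoder, Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}

case class Person(name: String)

object GenericRead {
  def read[T](path: String)(implicit spark: SparkSession, encoder: Encoder[T]): Dataset[T] =
    spark.read.json(path).as[T]

  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession = SparkSession.builder().appName("generic-read").getOrCreate()
    import spark.implicits._

    // Custom type: the implicit Encoder[Person] comes from spark.implicits._
    val people: Dataset[Person] = read[Person]("people.json")

    // Dataset[Row]: there is no implicit Encoder[Row], so supply one explicitly.
    val schema = StructType(Seq(StructField("name", StringType)))
    implicit val rowEncoder: Encoder[Row] = RowEncoder(schema)
    val rows: Dataset[Row] = read[Row]("people.json")

    spark.stop()
  }
}
```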

Re: Handle Null Columns in Spark Structured Streaming Kafka

2019-04-30 Thread SNEHASISH DUTTA
spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/DataFrameNaFunctions.html — quoting Shixiong (Ryan) Zhu (Mon, Apr 29, 2019): "Hey Snehasish, do you have a reproducer for this"

Handle Null Columns in Spark Structured Streaming Kafka

2019-04-24 Thread SNEHASISH DUTTA
Hi, while writing to Kafka using Spark Structured Streaming, if all the values in a certain column are null, the column gets dropped. Is there any way to override this, other than using the na.fill functions? Regards, Snehasish
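[Editor's note, not part of the original message: a sketch of the na.fill workaround discussed in the thread. The column names, topics and bootstrap servers are illustrative; the point is that to_json omits null fields, so an all-null column vanishes from the Kafka payload unless the nulls are filled first.]

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

object NullColumnsToKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("null-cols-kafka").getOrCreate()
    import spark.implicits._

    val source = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input")
      .load()
      .selectExpr("CAST(value AS STRING) AS raw")

    // Suppose parsing yields columns a and b, where b is entirely null for a batch.
    val parsed = source.selectExpr("raw AS a", "CAST(NULL AS STRING) AS b")

    // Fill nulls so the field still appears in the JSON written to Kafka.
    val filled = parsed.na.fill("", Seq("b"))

    val query = filled
      .select(to_json(struct($"a", $"b")).alias("value"))
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "output")
      .option("checkpointLocation", "/tmp/chk")
      .start()

    query.awaitTermination()
  }
}
```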

Shuffling Data After Union and Write

2018-04-13 Thread SNEHASISH DUTTA
Hi, I am currently facing an issue while performing a union on three data frames, say df1, df2, df3. Once the operation is performed and I try to save the data, the data is getting shuffled, so the ordering of data in df1, df2, df3 is not maintained. When I save the data as a text/csv file the
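[Editor's note, not part of the original message: a minimal sketch of one common way to keep the order deterministic, assuming df1/df2/df3 share a schema. Union itself gives no ordering guarantee, so each frame is tagged with a source index and the result is sorted before writing.]

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object OrderedUnion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ordered-union").getOrCreate()
    import spark.implicits._

    val df1 = Seq(("a", 1)).toDF("key", "value")
    val df2 = Seq(("b", 2)).toDF("key", "value")
    val df3 = Seq(("c", 3)).toDF("key", "value")

    val combined = df1.withColumn("src", lit(1))
      .union(df2.withColumn("src", lit(2)))
      .union(df3.withColumn("src", lit(3)))

    combined
      .coalesce(1)                   // single output partition, single file
      .sortWithinPartitions("src")   // restore df1, df2, df3 order
      .drop("src")
      .write.mode("overwrite").csv("output")

    spark.stop()
  }
}
```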

Access Table with Spark Dataframe

2018-03-20 Thread SNEHASISH DUTTA
Hi, I am using Spark 2.2; a table fetched from a database contains a dot (.) in one of the column names. Whenever I try to select that particular column I get a query analysis exception. I have tried creating a temporary table using createOrReplaceTempView() and fetching the column's
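[Editor's note, not part of the original message: a sketch of the usual backtick escaping for dotted column names; the column name price.usd and the view name are illustrative. Without backticks, Spark parses the dot as a struct field access, which is what triggers the analysis exception.]

```scala
import org.apache.spark.sql.SparkSession

object DottedColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dotted-column").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 9.99)).toDF("id", "price.usd")

    // DataFrame API: backticks inside the column string.
    df.select("`price.usd`").show()

    // SQL over a temp view: same backtick quoting.
    df.createOrReplaceTempView("items")
    spark.sql("SELECT `price.usd` FROM items").show()

    spark.stop()
  }
}
```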

CSV use case

2018-02-21 Thread SNEHASISH DUTTA
Hi, I am using the Spark 2.2 CSV reader. I have data in the following format: 123|123|"abc"||""|"xyz", where || is null and "" is one blank character as per the requirement. I was using option sep as pipe and option quote as "". I parsed the data and, using regex, was able to fulfill all the mentioned
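[Editor's note, not part of the original message: a sketch of one way to read such a file while telling unquoted empty fields (null) apart from quoted empty strings (blank). The file name is illustrative, and the emptyValue option only appeared in later 2.x releases, so on 2.2 a post-read regex/when() cleanup, as the thread describes, may still be needed.]

```scala
import org.apache.spark.sql.SparkSession

object PipeCsv {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pipe-csv").getOrCreate()

    val df = spark.read
      .option("sep", "|")        // pipe-separated
      .option("quote", "\"")     // quoted fields such as "abc"
      .option("nullValue", "")   // unquoted empty field (||) becomes null
      .option("emptyValue", " ") // quoted "" becomes a single blank character (2.4+)
      .csv("data.psv")

    df.show(truncate = false)
    spark.stop()
  }
}
```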

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread SNEHASISH DUTTA
Hi Mina, even text won't work; you may try this: df.coalesce(1).write.option("header","true").mode("overwrite").save("output", format="text"). Else convert to an RDD and use saveAsTextFile. Regards, Snehasish

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread SNEHASISH DUTTA
df.coalesce(1).write.option("header","true").mode("overwrite").csv("output") throws java.lang.UnsupportedOperationException: CSV data source does not support struct<...> data type. Regards, Mina

Re: Serialize a DataFrame with Vector values into text/csv file

2018-02-20 Thread SNEHASISH DUTTA
Hi Mina, this might help: df.coalesce(1).write.option("header","true").mode("overwrite").csv("output") Regards, Snehasish — quoting Mina Aslani (Wed, Feb 21, 2018): "Hi, I would like to serialize a dataframe with vector values into a text/csv in pyspark."
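[Editor's note, not part of the original thread: a sketch of one way around the "CSV data source does not support struct<...>" error quoted above, by stringifying the Vector column before writing. The column names and the formatting of the vector are illustrative.]

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object VectorToCsv {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("vector-to-csv").getOrCreate()
    import spark.implicits._

    val df = Seq((1, Vectors.dense(0.1, 0.2))).toDF("id", "features")

    // CSV cannot serialize the Vector (struct) type, so render it as a string first.
    val vecToString = udf((v: Vector) => v.toArray.mkString("[", ",", "]"))

    df.withColumn("features", vecToString(col("features")))
      .coalesce(1)
      .write.option("header", "true").mode("overwrite").csv("output")

    spark.stop()
  }
}
```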

Re: Can spark handle this scenario?

2018-02-17 Thread SNEHASISH DUTTA
Hi Lian, this could be the solution: case class Symbol(symbol: String, sector: String) case class Tick(symbol: String, sector: String, open: Double, close: Double) // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick] symbolDs.map { k =>
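[Editor's note, not part of the original message: a hedged completion of the truncated snippet above. pullSymbolFromYahoo is stubbed here and returns a plain Tick rather than a Dataset[Tick], since a Dataset cannot be constructed inside map on the executors; the values and the sample symbol are illustrative.]

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Symbol(symbol: String, sector: String)
case class Tick(symbol: String, sector: String, open: Double, close: Double)

object SymbolTicks {
  // Stub standing in for the real Yahoo lookup mentioned in the thread.
  def pullSymbolFromYahoo(symbol: String, sector: String): Tick =
    Tick(symbol, sector, open = 0.0, close = 0.0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("symbol-ticks").getOrCreate()
    import spark.implicits._

    val symbolDs: Dataset[Symbol] = Seq(Symbol("AAPL", "tech")).toDS()

    // Map each Symbol to a Tick; the Encoder[Tick] comes from spark.implicits._
    val ticks: Dataset[Tick] = symbolDs.map(k => pullSymbolFromYahoo(k.symbol, k.sector))

    ticks.show()
    spark.stop()
  }
}
```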