finding distinct count using dataframe

2016-01-05 Thread Arunkumar Pillai
Hi

Is there any function to find the distinct count of all the variables in a
dataframe?

val sc = new SparkContext(conf) // spark context
val options = Map("header" -> "true", "delimiter" -> delimiter, "inferSchema" -> "true")
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sql context
val datasetDF = sqlContext.read.format("com.databricks.spark.csv").options(options).load(inputFile)


We are able to get the schema and the variable data types. Is there any
method to get the distinct count?



-- 
Thanks and Regards
Arun


Re: finding distinct count using dataframe

2016-01-05 Thread Yanbo Liang
Hi Arunkumar,

You can use datasetDF.select(countDistinct(col1, col2, col3, ...)), or
approxCountDistinct for an approximate result.
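For reference, both functions live in org.apache.spark.sql.functions rather than on the DataFrame itself. A minimal sketch (assuming the datasetDF from the original message; the column names are placeholders):

```scala
import org.apache.spark.sql.functions.{countDistinct, approxCountDistinct}

// Exact distinct count over one or more columns; returns a one-row DataFrame.
val exact = datasetDF.select(countDistinct("col1", "col2"))

// Approximate version backed by HyperLogLog; much cheaper on large data,
// with a configurable relative standard deviation (here 5%).
val approx = datasetDF.select(approxCountDistinct("col1", 0.05))

exact.show()
approx.show()
```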

2016-01-05 17:11 GMT+08:00 Arunkumar Pillai :


Re: finding distinct count using dataframe

2016-01-05 Thread Kristina Rogale Plazonic
I think it's an expression rather than a function you'd find in the API
(as a function you could do df.select(col).distinct.count).

This will give you the number of distinct rows in both columns:
scala> df.select(countDistinct("name", "age"))
res397: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT name,age): bigint]

Whereas this will give you the number of distinct values in each column:
scala> df.select(countDistinct("name"), countDistinct("age"))
res398: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT name): bigint,
COUNT(DISTINCT age): bigint]

Of course, when you need many columns at once, this expression becomes
tedious, so I find it easiest to construct an SQL statement from the column
names, like so:

df.registerTempTable("df")
val sqlstatement = "select " + df.columns.map(col => s"count(distinct $col) as ${col}_distinct").mkString(", ") + " from df"
sqlContext.sql(sqlstatement)

But this is not efficient - see this Jira ticket
and the fix.
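A DataFrame-only alternative to the SQL string is to build the aggregation expressions programmatically and hand them to agg. A sketch, assuming the same df:

```scala
import org.apache.spark.sql.functions.countDistinct

// One countDistinct expression per column, aliased as "<col>_distinct".
val exprs = df.columns.map(c => countDistinct(c).as(s"${c}_distinct"))

// agg takes a first Column plus varargs, hence the head/tail split.
val distinctCounts = df.agg(exprs.head, exprs.tail: _*)
distinctCounts.show()
```

This still computes all the distinct counts in a single aggregation, so it has the same efficiency caveat as the SQL version.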

On Tue, Jan 5, 2016 at 5:55 AM, Arunkumar Pillai wrote:


Re: finding distinct count using dataframe

2016-01-05 Thread Arunkumar Pillai
Thanks Yanbo,

Thanks for the help, but I'm not able to find the countDistinct or
approxCountDistinct functions. Are these functions within DataFrame or in
some other package?
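For what it's worth, in Spark 1.x these are defined in the org.apache.spark.sql.functions object, not as DataFrame methods, so they need an explicit import. A sketch (the column name is a placeholder):

```scala
// countDistinct and approxCountDistinct come from
// org.apache.spark.sql.functions, not from DataFrame itself.
import org.apache.spark.sql.functions._

datasetDF.select(countDistinct("someColumn")).show()
```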

On Tue, Jan 5, 2016 at 3:24 PM, Yanbo Liang wrote:


-- 
Thanks and Regards
Arun