Hi Chirag,
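Maybe something like this? Note that createDataFrame and the
org.apache.spark.sql.types import need Spark 1.3 or later; on 1.2.0 the
closest equivalent is sqlContext.applySchema, with the type classes
imported from org.apache.spark.sql._ instead.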
import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Sample rows matching the example in the question.
val rdd = sc.parallelize(Seq(
  Row("A1", "B1", "C1"),
  Row("A2", "B2", "C2"),
  Row("A3", "B3", "C2"),
  Row("A1", "B1", "C1")
))

// All three columns are plain strings.
val schema = StructType(Seq("a", "b", "c").map(c => StructField(c, StringType)))
val df = sqlContext.createDataFrame(rdd, schema)
df.registerTempTable("rows")

// Grouping by all three columns gives one count per distinct (a, b, c).
sqlContext.sql("select a, b, c, count(0) as count from rows group by a, b, c").collect()
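
If you would rather stay on plain RDDs (for example on 1.2.0), the same
count can probably be done with reduceByKey. This is only a minimal
sketch: it assumes the rows are already available as (a, b, c) string
tuples, it does not show how you load them from Cassandra, and the
object and method names (CountDistinctRows, countRows) are made up for
illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, required before Spark 1.3

object CountDistinctRows {
  type Key = (String, String, String)

  // Hypothetical helper: counts how often each (a, b, c) combination occurs.
  def countRows(sc: SparkContext, rows: Seq[Key]): Map[Key, Int] =
    sc.parallelize(rows)
      .map(row => (row, 1))   // use the whole tuple as the key
      .reduceByKey(_ + _)     // sum the 1s per distinct key
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch runs standalone.
    val conf = new SparkConf().setAppName("count-distinct-rows").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val counts = countRows(sc, Seq(
        ("A1", "B1", "C1"),
        ("A2", "B2", "C2"),
        ("A3", "B3", "C2"),
        ("A1", "B1", "C1")
      ))
      counts.foreach { case ((a, b, c), n) => println(s"$a,$b,$c,$n") }
    } finally {
      sc.stop()
    }
  }
}
```

The order of the printed lines is not guaranteed. For a large table you
would likely keep the result as an RDD rather than calling collect().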
Eric
On Thu, Sep 10, 2015 at 2:19 AM, Chirag Dewan <[email protected]>
wrote:
> Hi,
>
>
>
> I am using Spark 1.2.0 with Cassandra 2.0.14. I have a problem where I
> need a count of rows unique to multiple columns.
>
>
>
> So I have a column family with 3 columns, i.e. a, b, c, and for each
> distinct combination of values (a1, b1, c1) I want the row count.
>
>
>
> For example:
>
> A1,B1,C1
>
> A2,B2,C2
>
> A3,B3,C2
>
> A1,B1,C1
>
>
>
> The output should be:
>
> A1,B1,C1,2
>
> A2,B2,C2,1
>
> A3,B3,C2,1
>
>
>
> What is the optimum way of achieving this?
>
>
>
> Thanks in advance.
>
>
>
> Chirag
>