Hi Shawn,

Would something like the below work?

For "any column matches" semantics, you can collect the row's values with Row.toSeq and use exists:

scala> val df = spark.range(10).selectExpr("id as a", "id / 2 as b")
df: org.apache.spark.sql.DataFrame = [a: bigint, b: double]

scala> df.filter(_.toSeq.exists(v => v == 1)).show()
+---+---+
|  a|  b|
+---+---+
|  1|0.5|
|  2|1.0|
+---+---+


or, for "all columns match" semantics, use forall:

scala> val df = spark.range(10).selectExpr("id as a", "id / 2 as b")
df: org.apache.spark.sql.DataFrame = [a: bigint, b: double]

scala> df.filter(_.toSeq.forall(v => v == 0)).show()
+---+---+
|  a|  b|
+---+---+
|  0|0.0|
+---+---+

2017-01-17 7:27 GMT+09:00 Shawn Wan <shawn...@gmail.com>:

> I need to filter out outliers from a dataframe by all columns. I can
> manually list all columns like:
>
> df.filter(x => math.abs(x.get(0).toString().toDouble - means(0)) <= 3 * stddevs(0))
>     .filter(x => math.abs(x.get(1).toString().toDouble - means(1)) <= 3 * stddevs(1))
>
>     ...
>
> But I want to turn it into a general function which can handle a variable
> number of columns. How could I do that? Thanks in advance!
>
>
> Regards,
>
> Shawn
>
