Re: [Pyspark, SQL] Very slow IN operator

2017-04-06 Thread Fred Reiss
If you just want to emulate pushing down a join, you can just wrap the IN list query in a JDBCRelation directly:

    scala> val r_df = spark.read.format("jdbc").option("url", "jdbc:h2:/tmp/testdb").option("dbtable", "R").load()
    r_df: org.apache.spark.sql.DataFrame = [A: int]

    scala> r_df.show
    +
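For readers following the thread in Python, a rough PySpark sketch of the same idea is below. The H2 URL, the table name R, and the column A come from the Scala snippet above; the larger DataFrame (big_df) and its join key are made-up placeholders, not details from the original mail.

    # Minimal sketch, assuming the lookup values live in table R(A INT) of an
    # H2 database at /tmp/testdb, and that big_df is the DataFrame we would
    # otherwise filter with a huge IN list.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Reading the lookup table through JDBC wraps it in a JDBCRelation.
    r_df = (spark.read.format("jdbc")
            .option("url", "jdbc:h2:/tmp/testdb")
            .option("dbtable", "R")
            .load())

    # Hypothetical larger DataFrame to filter.
    big_df = spark.range(0, 1000000)

    # A join against the JDBC-backed DataFrame replaces the IN list.
    big_df.join(r_df, big_df["id"] == r_df["A"]).count()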

Re: [Pyspark, SQL] Very slow IN operator

2017-04-06 Thread Maciej Bryński
2017-04-06 4:00 GMT+02:00 Michael Segel:
> Just out of curiosity, what would happen if you put your 10K values into a temp table and then did a join against it?

The answer is predicate pushdown. In my case I'm using this kind of query on a JDBC table, and the IN predicate is executed on the DB in less
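To make the pushdown point concrete, a small illustrative sketch follows; the JDBC URL, table, and column names are placeholder assumptions, not details from the original mail.

    # Minimal sketch of an IN/isin filter on a JDBC-backed DataFrame; Spark's
    # JDBC source can push such filters down so the database evaluates them.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    jdbc_df = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://dbhost:5432/testdb")  # placeholder URL
               .option("dbtable", "events")                            # placeholder table
               .load())

    values = list(range(10000))

    # With pushdown, the WHERE ... IN (...) clause runs on the database and
    # only the matching rows are shipped back to Spark.
    jdbc_df.filter(F.col("id").isin(values)).count()

    # Calling .explain() on the filtered DataFrame should list the condition
    # under PushedFilters in the physical plan.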

Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Michael Segel
Just out of curiosity, what would happen if you put your 10K values into a temp table and then did a join against it?

> On Apr 5, 2017, at 4:30 PM, Maciej Bryński wrote:
>
> Hi,
> I'm trying to run queries with many values in the IN operator.
>
> The result is that for more than 10K values the IN op
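A rough PySpark sketch of that suggestion (the value list, view names, and the id column are illustrative assumptions):

    # Minimal sketch of the "temp table + join" idea: put the 10K values in
    # their own DataFrame / temp view and join instead of building an IN list.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(0, 1000000)          # stand-in for the real data
    values_df = spark.createDataFrame([(v,) for v in range(10000)], ["id"])

    df.createOrReplaceTempView("data")
    values_df.createOrReplaceTempView("lookup_values")

    spark.sql("SELECT d.* FROM data d JOIN lookup_values v ON d.id = v.id").count()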

Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Garren Staubli

[Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Maciej Bryński
Hi,
I'm trying to run queries with many values in the IN operator. The result is that for more than 10K values the IN operator gets slower. For example, this code runs for about 20 seconds:

    df = spark.range(0, 10, 1, 1)
    df.where('id in ({})'.format(','.join(map(str, range(10))))).count()

Any
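The code in the archived preview is cut off, so a runnable reconstruction of the benchmark is sketched below; the 10,000-value count is an assumption based on the "10K values" mentioned in the text, not the figure from the original mail.

    # Minimal sketch reproducing the slow IN-list query. The value count
    # (10,000) is assumed from the surrounding text; the original number is
    # truncated in the archive.
    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(0, 10000, 1, 1)              # start, end, step, numPartitions
    in_list = ','.join(map(str, range(10000)))    # "0,1,2,...,9999"

    start = time.time()
    df.where('id in ({})'.format(in_list)).count()
    print('IN-list query took {:.1f}s'.format(time.time() - start))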