Re: filter function in drill

James Turton Wed, 06 Apr 2022 01:58:16 -0700

Noting that Drill does not currently support the EXCEPT operator, there are some different options. I'd probably use

select * from `words.csv` where word not in (select stopword from `stopwords.csv`);
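
One of the other options is an anti-join. This is only a sketch: it assumes the same column names shown in the thread (word, stopword) and the same workspace, and the aliases w and s are made up for illustration. In standard SQL, NOT IN returns no rows at all if the subquery yields a NULL, so this form may be the safer one if the stopword column can contain NULLs.

```sql
-- Anti-join sketch: keep each row of words.csv that has no match in stopwords.csv.
-- Aliases w and s are illustrative; column names are taken from the thread.
select w.word
from `words.csv` w
left join `stopwords.csv` s on w.word = s.stopword
-- Unmatched rows carry NULL on the stopwords side of the left join.
where s.stopword is null;
```

Both forms should return the same rows when the stopword column has no NULLs.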


On 2022/04/06 09:41, Wes Peng wrote:

Hi James,

I have two tables, one for words and another for stopwords.
For instance,

apache drill (dfs.pyh)> select * from `words.csv` limit 10;
+-----------+
|   WORD    |
+-----------+
| on        |
| jan       |
| 2022      |
| at        |
| wolfgang  |
| engelmann |
| via       |
| gdb       |
| now       |
| waits     |
+-----------+

apache drill (dfs.pyh)> select * from `stopwords.csv` limit 10;
+-------------+
|  STOPWORD   |
+-------------+
| able        |
| about       |
| above       |
| abroad      |
| according   |
| accordingly |
| across      |
| actually    |
| adj         |
| after       |
+-------------+

How do I select the words which are in table "words" but not in table "stopwords"?


In Spark I was using the filter function for this job. For instance,

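from pyspark.sql.functions import col

# sc and spark are the context and session provided by the pyspark shell
rdd = sc.textFile("words.txt")
df = spark.createDataFrame(rdd.map(lambda x: (x, 1)), ["word", "count"])
rdd2 = sc.textFile("stopwords.txt")
stoplist = rdd2.collect()  # pulls the whole stopword list to the driver
df2 = df.filter(~col("word").isin(stoplist))  # keep words not in the stoplist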

But I am not sure how Drill implements this.
Please help. Thanks.

regards.
