Drill does not currently support the EXCEPT operator, but there are a few other options. I'd probably use:

select * from `words.csv` where word not in (select stopword from `stopwords.csv`);
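
Another of those options is an anti-join (LEFT JOIN ... WHERE ... IS NULL). Below is a minimal sketch of both forms. It uses Python's built-in sqlite3 module as a stand-in for Drill, and the table contents are made-up sample rows, so take it as an illustration of the SQL shape rather than something verified against Drill itself.

```python
import sqlite3

# SQLite stands in for Drill here; the two tables mimic words.csv and
# stopwords.csv with a handful of illustrative rows (assumed data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.execute("CREATE TABLE stopwords (stopword TEXT)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("on",), ("jan",), ("about",), ("gdb",), ("after",)])
conn.executemany("INSERT INTO stopwords VALUES (?)",
                 [("able",), ("about",), ("after",)])

# Form 1: NOT IN with a subquery, the same shape as the query above.
not_in = [row[0] for row in conn.execute(
    "SELECT word FROM words "
    "WHERE word NOT IN (SELECT stopword FROM stopwords) "
    "ORDER BY word")]

# Form 2: anti-join. Keep only the rows of words that found no match
# in stopwords (the joined stopword column is NULL for those rows).
anti_join = [row[0] for row in conn.execute(
    "SELECT w.word FROM words w "
    "LEFT JOIN stopwords s ON w.word = s.stopword "
    "WHERE s.stopword IS NULL "
    "ORDER BY w.word")]

print(not_in)
print(anti_join)
```

Both queries return the same rows for this sample. One difference worth knowing: in standard SQL, NOT IN returns no rows at all if the subquery yields a NULL, whereas the anti-join form is not affected by that.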

On 2022/04/06 09:41, Wes Peng wrote:
Hi James,

I have two tables, one for words and another for stopwords.
For instance:

apache drill (dfs.pyh)> select * from `words.csv` limit 10;
+-----------+
|   WORD    |
+-----------+
| on        |
| jan       |
| 2022      |
| at        |
| wolfgang  |
| engelmann |
| via       |
| gdb       |
| now       |
| waits     |
+-----------+

apache drill (dfs.pyh)> select * from `stopwords.csv` limit 10;
+-------------+
|  STOPWORD   |
+-------------+
| able        |
| about       |
| above       |
| abroad      |
| according   |
| accordingly |
| across      |
| actually    |
| adj         |
| after       |
+-------------+

How do I select the words that are in table "words" but not in table "stopwords"?

In Spark I was using the filter function for this job. For instance:

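from pyspark.sql.functions import col

# sc and spark are the SparkContext and SparkSession provided by the pyspark shell
rdd = sc.textFile("words.txt")
df = spark.createDataFrame(rdd.map(lambda x: (x, 1)), ["word", "count"])
rdd2 = sc.textFile("stopwords.txt")
stoplist = rdd2.collect()  # stopwords as a Python list on the driver
df2 = df.filter(~col("word").isin(stoplist))  # keep words not in the stop list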

But I am not sure how Drill implements this.
Please help. Thanks.

Regards.

