Drill does not currently support the EXCEPT operator, but there are a
few other options. I'd probably use NOT IN with a subquery:
select * from `words.csv` where word not in (select stopword from `stopwords.csv`);
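
For comparison, here is a minimal sketch of three standard-SQL ways to write the same "in words but not in stopwords" filter: NOT IN, a LEFT JOIN anti-join, and NOT EXISTS. It is an illustration only. It runs against Python's built-in sqlite3 module as a stand-in so it can be tried without a Drill installation, the sample rows are made up, and the table names `words` and `stopwords` replace the CSV paths. Whether Drill accepts and plans each form equally well is an assumption that should be checked against your Drill version.

```python
import sqlite3

# SQLite is only a stand-in for Drill here, so the queries can run locally.
# The sample rows are hypothetical; column names mirror the thread's CSV files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.execute("CREATE TABLE stopwords (stopword TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("on",), ("jan",), ("at",), ("wolfgang",), ("via",)],
)
conn.executemany(
    "INSERT INTO stopwords VALUES (?)",
    [("on",), ("at",), ("via",), ("about",)],
)

queries = {
    # Option 1: NOT IN with a subquery, as in the reply.
    "not_in": """
        SELECT word FROM words
        WHERE word NOT IN (SELECT stopword FROM stopwords)
        ORDER BY word
    """,
    # Option 2: anti-join. Keep only the rows with no match on the right side.
    "anti_join": """
        SELECT w.word FROM words w
        LEFT JOIN stopwords s ON w.word = s.stopword
        WHERE s.stopword IS NULL
        ORDER BY w.word
    """,
    # Option 3: NOT EXISTS with a correlated subquery.
    "not_exists": """
        SELECT w.word FROM words w
        WHERE NOT EXISTS (
            SELECT 1 FROM stopwords s WHERE s.stopword = w.word
        )
        ORDER BY w.word
    """,
}

# Run each query and collect the single output column as a list.
results = {
    name: [row[0] for row in conn.execute(sql)]
    for name, sql in queries.items()
}

for name, words in results.items():
    print(name, words)
```

One design note: under standard SQL semantics, NOT IN returns no rows at all if the subquery yields a NULL, so the anti-join or NOT EXISTS form may be the safer choice if the stopword file could contain empty values.
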
On 2022/04/06 09:41, Wes Peng wrote:
Hi James,
I have two tables, one for words and another for stopwords.
For instance:
apache drill (dfs.pyh)> select * from `words.csv` limit 10;
+-----------+
| WORD |
+-----------+
| on |
| jan |
| 2022 |
| at |
| wolfgang |
| engelmann |
| via |
| gdb |
| now |
| waits |
+-----------+
apache drill (dfs.pyh)> select * from `stopwords.csv` limit 10;
+-------------+
| STOPWORD |
+-------------+
| able |
| about |
| above |
| abroad |
| according |
| accordingly |
| across |
| actually |
| adj |
| after |
+-------------+
How can I select the words that are in table "words" but not in table
"stopwords"?
In Spark I was using the filter function for this job. For instance:
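from pyspark.sql.functions import col

# assumes the pyspark shell, where sc and spark are predefined
rdd = sc.textFile("words.txt")
df = spark.createDataFrame(rdd.map(lambda x: (x, 1)), ["word", "count"])
rdd2 = sc.textFile("stopwords.txt")
stoplist = rdd2.collect()
# keep only the rows whose word is not in the stopword list
df2 = df.filter(~col("word").isin(stoplist))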
But I am not sure how Drill implements this.
Please help. Thanks.
Regards.