[GitHub] [spark] Dobiasd commented on issue #27128: [SPARK-30421][SQL] Dropped columns still available for filtering

2020-02-11 Thread GitBox
Dobiasd commented on issue #27128: [SPARK-30421][SQL] Dropped columns still 
available for filtering
URL: https://github.com/apache/spark/pull/27128#issuecomment-584630973
 
 
   For me, neither "Because the software has worked this way" nor "other 
similar software" is a valid argument. To me, it is simply wrong to be able to 
filter a dataframe on a column that does not exist in that dataframe.
   
   I think this behavior is an issue because it means one cannot simply look 
at the schema of a dataframe to determine whether an operation on it is valid. 
Instead, one has to consider the whole history of how the dataframe was 
created or derived. As a result, a refactoring that changes how a dataframe is 
created can break one's code, even though the refactoring should be totally OK 
because it results in the exact same dataframe schema.
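
   As a minimal sketch of the schema-only check described above: the helper 
names `missing_columns` and `require_columns` are hypothetical and not part of 
Spark or pandas. The sketch assumes the column names come from something like 
`df.columns` and that the caller lists the columns a filter refers to.

```python
def missing_columns(schema_columns, referenced_columns):
    """Return the referenced column names that are absent from the visible schema."""
    visible = set(schema_columns)
    return [name for name in referenced_columns if name not in visible]


def require_columns(schema_columns, referenced_columns):
    """Raise KeyError if a referenced column is not part of the visible schema."""
    missing = missing_columns(schema_columns, referenced_columns)
    if missing:
        raise KeyError(f"columns not in schema: {missing}")


# Visible schema of df2 from the example in this thread, after drop("bar").
# In real code this list would come from df2.columns (assumption).
df2_columns = ["foo"]

require_columns(df2_columns, ["foo"])         # passes silently
print(missing_columns(df2_columns, ["bar"]))  # prints ['bar']
```

   Calling `require_columns(df2.columns, ["bar"])` before a filter would then 
fail based on the schema alone, regardless of how `df2` was derived.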


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Dobiasd commented on issue #27128: [SPARK-30421][SQL] Dropped columns still available for filtering

2020-02-11 Thread GitBox
Dobiasd commented on issue #27128: [SPARK-30421][SQL] Dropped columns still 
available for filtering
URL: https://github.com/apache/spark/pull/27128#issuecomment-584593702
 
 
   OK, thanks. Makes sense.
   
   Nevertheless, it somehow reminds me of the following. :wink:
   
   ![workflow](https://imgs.xkcd.com/comics/workflow.png)
   (source: https://xkcd.com/1172/)





[GitHub] [spark] Dobiasd commented on issue #27128: [SPARK-30421][SQL] Dropped columns still available for filtering

2020-02-10 Thread GitBox
Dobiasd commented on issue #27128: [SPARK-30421][SQL] Dropped columns still 
available for filtering
URL: https://github.com/apache/spark/pull/27128#issuecomment-584495487
 
 
   Not only was this PR closed, but the [Jira 
issue](https://issues.apache.org/jira/browse/SPARK-30421) was also resolved as 
"Won't Fix"? Could somebody please explain to me why? Is the observed behavior 
intended, i.e., it's not a bug, it's a feature, or is it just not worth the 
effort to fix it?
   
   To me, the example below still looks wrong.
   
   ```scala
   scala> val df1 = Seq((0, "a"), (1, "b")).toDF("foo", "bar") 
   df1: DataFrame = [foo: int, bar: string]
   
   scala> val df2 = df1.drop("bar") 
   df2: DataFrame = [foo: int]
   
   scala> df2.printSchema 
   root
    |-- foo: integer (nullable = false)
   
   scala> df2.where($"bar" === "a").show 
   +---+
   |foo|
   +---+
   |  0|
   +---+
   ```
   
   Pandas, as a comparative example, behaves correctly:
   
   ```python
   >>> import pandas as pd
   >>> df1 = pd.DataFrame(data={'foo': [0, 1], 'bar': ["a", "b"]})
   >>> df2 = df1.drop(columns=["bar"])
   >>> df2.info()
   <class 'pandas.core.frame.DataFrame'>
   RangeIndex: 2 entries, 0 to 1
   Data columns (total 1 columns):
   foo    2 non-null int64
   dtypes: int64(1)
   memory usage: 144.0 bytes
   >>> df2[df2["bar"] == "a"]
   Traceback (most recent call last):
     File "/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py", line 2897, in get_loc
       return self._engine.get_loc(key)
     File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
     File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
     File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
     File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
   KeyError: 'bar'
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py", line 2995, in __getitem__
       indexer = self.columns.get_loc(key)
     File "/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py", line 2899, in get_loc
       return self._engine.get_loc(self._maybe_cast_indexer(key))
     File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
     File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
     File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
     File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
   KeyError: 'bar'
   ```

