Nicholas Chammas created SPARK-8670:
---------------------------------------
             Summary: Nested columns can't be referenced (but they can be selected)
                 Key: SPARK-8670
                 URL: https://issues.apache.org/jira/browse/SPARK-8670
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 1.4.0
            Reporter: Nicholas Chammas

This is strange and looks like a regression from 1.3.

{code}
import json

daterz = [
  {
    'name': 'Nick',
    'stats': {
      'age': 28
    }
  },
  {
    'name': 'George',
    'stats': {
      'age': 31
    }
  }
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
{code}

On 1.3 this works and yields:

{code}
age
28
31
Out[1]: Column<stats.age AS age#2958L>
{code}

On 1.4, however, this gives an error on the last line:

{code}
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19 
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681             jc = self._jdf.apply(item)
    682         return Column(jc)

IndexError: no such column: stats.age
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org