BTW we merged this today: https://github.com/apache/spark/pull/4640
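The name-based lookup that change enables might look roughly like the following. This is a self-contained sketch of the idea (resolve a name to an ordinal through the schema, then do positional access); `Schema` and `SimpleRow` are illustrative stand-ins invented here, not Spark classes.

```scala
// Hypothetical stand-in for a schema that can map a field name to its ordinal.
case class Schema(fieldNames: Seq[String]) {
  def fieldIndex(name: String): Int = {
    val i = fieldNames.indexOf(name)
    require(i >= 0, s"No such field: $name")
    i
  }
}

// Hypothetical stand-in for a row that knows its schema, so name-based
// access can delegate to the schema for the ordinal.
case class SimpleRow(schema: Schema, values: Seq[Any]) {
  def apply(i: Int): Any = values(i)
  def getAs[T](name: String): T =
    values(schema.fieldIndex(name)).asInstanceOf[T]
}

val schema = Schema(Seq("ran_id", "age"))
val row = SimpleRow(schema, Seq("R-001", 42))
row.getAs[String]("ran_id") // "R-001"
```

The point of routing through `fieldIndex` is that the name resolution can be done once, leaving per-access cost the same as the ordinal API.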
This should allow us in the future to address columns by name in a Row.

On Mon, Feb 16, 2015 at 11:39 AM, Michael Armbrust <mich...@databricks.com> wrote:
> I can unpack the code snippet a bit:
>
> caper.select('ran_id) is the same as saying "SELECT ran_id FROM table" in
> SQL. It's always a good idea to explicitly request the columns you need
> right before using them. That way you are tolerant of any changes to the
> schema that might happen upstream.
>
> The next part, .map { case Row(ranId: String) => ... }, is doing an
> extraction to pull the values of the row out into typed variables. This is
> the same as doing .map(row => row(0).asInstanceOf[String]) or
> .map(row => row.getString(0)), but I find this syntax easier to read,
> since it lines up nicely with the select clause that comes right before
> it. It's also less verbose, especially when pulling out a bunch of
> columns.
>
> Regarding the differences between Python and Java/Scala, part of this is
> just due to the nature of these languages. Since Java/Scala are statically
> typed, you will always have to explicitly say the type of the column you
> are extracting (the bonus here is that they are much faster than Python,
> due to optimizations this strictness allows). However, since it's already
> a little more verbose, we decided not to have the more expensive ability
> to look up columns in a row by name, and instead went with a faster
> ordinal-based API. We could revisit this, but it's not currently something
> we are planning to change.
>
> Michael
>
> On Mon, Feb 16, 2015 at 11:04 AM, Eric Bell <e...@ericjbell.com> wrote:
>
>> I am just learning Scala, so I don't actually understand what your code
>> snippet is doing, but thank you; I will learn more so I can figure it out.
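The equivalence Michael describes between `case Row(ranId: String)` and `row(0).asInstanceOf[String]` can be sketched without a Spark cluster. `MiniRow` below is a hypothetical stand-in for org.apache.spark.sql.Row, defined locally so the example runs on its own; the mechanism (an `unapplySeq` extractor) is how Scala pattern matching destructures a row positionally.

```scala
// Minimal stand-in for a Spark SQL Row: an ordered sequence of untyped values.
class MiniRow(val values: Seq[Any])

object MiniRow {
  // unapplySeq is what lets a pattern like `case MiniRow(x: String) => ...`
  // bind positional values to typed variables.
  def unapplySeq(r: MiniRow): Option[Seq[Any]] = Some(r.values)
}

val rows = Seq(new MiniRow(Seq("a1")), new MiniRow(Seq("b2")))

// Equivalent to rows.map(r => r.values(0).asInstanceOf[String]),
// but the pattern states the expected column type explicitly.
val ids = rows.map { case MiniRow(ranId: String) => ranId }
// ids == Seq("a1", "b2")
```

The pattern also fails loudly (a MatchError) if a value is not the expected type, rather than silently miscasting, which is part of why it pairs well with an explicit select.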
>>
>> I am new to all of this and still trying to make the mental shift from
>> normal programming to distributed programming, but it seems to me that
>> the row object would know the schema object it came from and be able to
>> ask its schema to transform a name into a column number. Am I missing
>> something, or is this just a matter of time constraints and this one
>> hasn't gotten into the queue yet?
>>
>> Barring that, do the schema classes provide methods for doing this? I've
>> looked and didn't see anything.
>>
>> I've just discovered that the Python implementation of SchemaRDD does in
>> fact allow referencing by name as well as by column number. Why is this
>> provided in the Python implementation but not the Scala or Java
>> implementations?
>>
>> Thanks,
>>
>> --eric
>>
>>
>> On 02/16/2015 10:46 AM, Michael Armbrust wrote:
>>
>> For efficiency, the row objects don't contain the schema, so you can't
>> get a column by name directly. I usually do a select followed by pattern
>> matching. Something like the following:
>>
>> caper.select('ran_id).map { case Row(ranId: String) => }
>>
>> On Mon, Feb 16, 2015 at 8:54 AM, Eric Bell <e...@ericjbell.com> wrote:
>>
>>> Is it possible to reference a column from a SchemaRDD using the
>>> column's name instead of its number?
>>>
>>> For example, let's say I've created a SchemaRDD from an Avro file:
>>>
>>> val sqlContext = new SQLContext(sc)
>>> import sqlContext._
>>> val caper = sqlContext.avroFile("hdfs://localhost:9000/sma/raw_avro/caper")
>>> caper.registerTempTable("caper")
>>>
>>> scala> caper
>>> res20: org.apache.spark.sql.SchemaRDD = SchemaRDD[0] at RDD at SchemaRDD.scala:108
>>> == Query Plan ==
>>> == Physical Plan ==
>>> PhysicalRDD [ADMDISP#0,age#1,AMBSURG#2,apptdt_skew#3,APPTSTAT#4,APPTTYPE#5,ASSGNDUR#6,CANCSTAT#7,CAPERSTAT#8,COMPLAINT#9,CPT_1#10,CPT_10#11,CPT_11#12,CPT_12#13,CPT_13#14,CPT_2#15,CPT_3#16,CPT_4#17,CPT_5#18,CPT_6#19,CPT_7#20,CPT_8#21,CPT_9#22,CPTDX_1#23,CPTDX_10#24,CPTDX_11#25,CPTDX_12#26,CPTDX_13#27,CPTDX_2#28,CPTDX_3#29,CPTDX_4#30,CPTDX_5#31,CPTDX_6#32,CPTDX_7#33,CPTDX_8#34,CPTDX_9#35,CPTMOD1_1#36,CPTMOD1_10#37,CPTMOD1_11#38,CPTMOD1_12#39,CPTMOD1_13#40,CPTMOD1_2#41,CPTMOD1_3#42,CPTMOD1_4#43,CPTMOD1_5#44,CPTMOD1_6#45,CPTMOD1_7#46,CPTMOD1_8#47,CPTMOD1_9#48,CPTMOD2_1#49,CPTMOD2_10#50,CPTMOD2_11#51,CPTMOD2_12#52,CPTMOD2_13#53,CPTMOD2_2#54,CPTMOD2_3#55,CPTMOD2_4#56,CPTMOD...
>>> scala>
>>>
>>> Now I want to access fields, and of course the normal thing to do is to
>>> use a field name, not a field number.
>>>
>>> scala> val kv = caper.map(r => (r.ran_id, r))
>>> <console>:23: error: value ran_id is not a member of org.apache.spark.sql.Row
>>>        val kv = caper.map(r => (r.ran_id, r))
>>>
>>> How do I do this?
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
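One common workaround for the question above is to resolve the name to an ordinal once, from the schema's field names, and then use cheap positional access inside the per-row function. Here is a self-contained sketch of that pattern; plain Seqs stand in for SchemaRDD rows so it runs without Spark, and the field names and sample values are invented for illustration.

```scala
// Stand-in for the schema's field names (in Spark this would come from
// the SchemaRDD's schema, not a hand-written list).
val fieldNames = Seq("ADMDISP", "age", "ran_id")

// Resolve the name to an ordinal ONCE, outside the hot per-row function.
val ranIdIdx = fieldNames.indexOf("ran_id")

// Plain Seqs standing in for rows; values are made up.
val rows: Seq[Seq[Any]] = Seq(Seq("home", 40, "r1"), Seq("snf", 71, "r2"))

// Key each row by its ran_id, the same shape as the kv pair in the question.
val kv = rows.map(r => (r(ranIdIdx).asInstanceOf[String], r))
```

This keeps the per-row work ordinal-based (fast) while still letting the code refer to the column by name in exactly one place.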