I can unpack the code snippet a bit:

caper.select('ran_id) is the same as saying "SELECT ran_id FROM table" in
SQL.  It's always a good idea to explicitly request the columns you need
right before using them.  That way you are tolerant of any changes to the
schema that might happen upstream.

The next part, .map { case Row(ranId: String) => ... }, uses a pattern
match to pull the values of the row out into typed variables.  This is
the same as doing .map(row => row(0).asInstanceOf[String]) or .map(row =>
row.getString(0)), but I find this syntax easier to read since it lines up
nicely with the select clause that comes right before it.  It's also less
verbose, especially when pulling out a bunch of columns.
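
To make the equivalence concrete, here is a minimal sketch of the three
extraction styles side by side.  So that it runs without a Spark
installation, it uses a small stand-in Row class rather than
org.apache.spark.sql.Row; the stand-in, the second column, its types, and
the sample values are all assumptions made for illustration.

```scala
// Stand-in for org.apache.spark.sql.Row so the sketch runs without Spark.
// Hypothetical: it models only ordinal access and pattern-match extraction.
final class Row(private val values: Seq[Any]) {
  def apply(i: Int): Any = values(i)
  def getString(i: Int): String = values(i).asInstanceOf[String]
  def getInt(i: Int): Int = values(i).asInstanceOf[Int]
  def toSeq: Seq[Any] = values
}

object Row {
  def apply(values: Any*): Row = new Row(values)
  // unapplySeq is what makes `case Row(a, b, ...)` patterns possible.
  def unapplySeq(row: Row): Option[Seq[Any]] = Some(row.toSeq)
}

object ExtractionSketch {
  // Pretend result of selecting two columns (made-up sample data).
  val rows: Seq[Row] = Seq(Row("r-001", 42), Row("r-002", 17))

  // 1. Pattern match: names and types line up with the selected columns.
  val viaPattern: Seq[(String, Int)] =
    rows.map { case Row(ranId: String, age: Int) => (ranId, age) }

  // 2. Ordinal access with explicit casts.
  val viaCast: Seq[(String, Int)] =
    rows.map(row => (row(0).asInstanceOf[String], row(1).asInstanceOf[Int]))

  // 3. Ordinal access with typed getters.
  val viaGetter: Seq[(String, Int)] =
    rows.map(row => (row.getString(0), row.getInt(1)))
}
```

All three produce the same pairs; the pattern-match form just keeps the
column names and types in one place.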

Regarding the differences between Python and Java/Scala, part of this is
just due to the nature of these languages.  Since Java/Scala are statically
typed, you will always have to explicitly say the type of the column you
are extracting (the bonus here is that they are much faster than Python due
to the optimizations this strictness allows).  However, since it's already a
little more verbose, we decided not to have the more expensive ability to
look up columns in a row by name, and instead go with a faster ordinal-based
API.  We could revisit this, but it's not currently something we are planning
to change.
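
If you do want to work from names, one way to keep the per-row cost low is
to resolve the name to an ordinal once, outside the function applied to each
row, and then read by position.  The sketch below shows that idea in plain
Scala only; the field-name list, the rows, and the ordinalOf helper are
hypothetical stand-ins, not Spark API.

```scala
object OrdinalLookupSketch {
  // Hypothetical schema and rows; a real job would take these from its data source.
  val fieldNames: Seq[String] = Seq("ADMDISP", "age", "ran_id")
  val rows: Seq[Seq[Any]] = Seq(
    Seq("A", 42, "r-001"),
    Seq("B", 17, "r-002")
  )

  // Resolve a column name to its position, failing loudly on a bad name.
  def ordinalOf(name: String): Int = {
    val i = fieldNames.indexOf(name)
    require(i >= 0, s"no such column: $name")
    i
  }

  // Done once, up front, rather than once per row.
  val ranIdOrdinal: Int = ordinalOf("ran_id")

  // Per-row work is then a plain positional read, as in the ordinal-based API.
  val keyed: Seq[(String, Seq[Any])] =
    rows.map(r => (r(ranIdOrdinal).asInstanceOf[String], r))
}
```

This gives the (key, row) pairs the original question was after while
keeping the name lookup out of the per-row path.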

Michael

On Mon, Feb 16, 2015 at 11:04 AM, Eric Bell <e...@ericjbell.com> wrote:

>  I am just learning scala so I don't actually understand what your code
> snippet is doing but thank you, I will learn more so I can figure it out.
>
> I am new to all of this and still trying to make the mental shift from
> normal programming to distributed programming, but it seems to me that the
> row object would know its own schema object that it came from and be able
> to ask its schema to transform a name to a column number. Am I missing
> something or is this just a matter of time constraints and this one just
> hasn't gotten into the queue yet?
>
> Barring that, do the schema classes provide methods for doing this? I've
> looked and didn't see anything.
>
> I've just discovered that the python implementation for SchemaRDD does in
> fact allow for referencing by name and column. Why is this provided in the
> python implementation but not scala or java implementations?
>
> Thanks,
>
> --eric
>
>
>
> On 02/16/2015 10:46 AM, Michael Armbrust wrote:
>
> For efficiency the row objects don't contain the schema so you can't get
> the column by name directly.  I usually do a select followed by pattern
> matching. Something like the following:
>
>  caper.select('ran_id).map { case Row(ranId: String) => }
>
> On Mon, Feb 16, 2015 at 8:54 AM, Eric Bell <e...@ericjbell.com> wrote:
>
>> Is it possible to reference a column from a SchemaRDD using the column's
>> name instead of its number?
>>
>> For example, let's say I've created a SchemaRDD from an avro file:
>>
>> val sqlContext = new SQLContext(sc)
>> import sqlContext._
>> val caper=sqlContext.avroFile("hdfs://localhost:9000/sma/raw_avro/caper")
>> caper.registerTempTable("caper")
>>
>> scala> caper
>> res20: org.apache.spark.sql.SchemaRDD = SchemaRDD[0] at RDD at
>> SchemaRDD.scala:108
>> == Query Plan ==
>> == Physical Plan ==
>> PhysicalRDD
>> [ADMDISP#0,age#1,AMBSURG#2,apptdt_skew#3,APPTSTAT#4,APPTTYPE#5,ASSGNDUR#6,CANCSTAT#7,CAPERSTAT#8,COMPLAINT#9,CPT_1#10,CPT_10#11,CPT_11#12,CPT_12#13,CPT_13#14,CPT_2#15,CPT_3#16,CPT_4#17,CPT_5#18,CPT_6#19,CPT_7#20,CPT_8#21,CPT_9#22,CPTDX_1#23,CPTDX_10#24,CPTDX_11#25,CPTDX_12#26,CPTDX_13#27,CPTDX_2#28,CPTDX_3#29,CPTDX_4#30,CPTDX_5#31,CPTDX_6#32,CPTDX_7#33,CPTDX_8#34,CPTDX_9#35,CPTMOD1_1#36,CPTMOD1_10#37,CPTMOD1_11#38,CPTMOD1_12#39,CPTMOD1_13#40,CPTMOD1_2#41,CPTMOD1_3#42,CPTMOD1_4#43,CPTMOD1_5#44,CPTMOD1_6#45,CPTMOD1_7#46,CPTMOD1_8#47,CPTMOD1_9#48,CPTMOD2_1#49,CPTMOD2_10#50,CPTMOD2_11#51,CPTMOD2_12#52,CPTMOD2_13#53,CPTMOD2_2#54,CPTMOD2_3#55,CPTMOD2_4#56,CPTMOD...
>> scala>
>>
>> Now I want to access fields, and of course the normal thing to do is to
>> use a field name, not a field number.
>>
>> scala> val kv = caper.map(r => (r.ran_id, r))
>> <console>:23: error: value ran_id is not a member of
>> org.apache.spark.sql.Row
>>        val kv = caper.map(r => (r.ran_id, r))
>>
>> How do I do this?
>>
>>
>>
>
>
