Re: Limit the # of columns in Spark Scala
Oh, just figured it out:

tabs.map(c => Array(c(167), c(110), c(200)))

Thanks for all of the advice, eh?!

On Sun Dec 14 2014 at 1:14:00 PM Yana Kadiyska wrote:

> Denny, I am not sure what exception you're observing, but I've had luck
> with two things:
>
> val table = sc.textFile("hdfs://")
>
> You can try calling table.first here and you'll see the first line of the
> file. You can also do val debug = table.first.split("\t"), which gives you
> an array, and you can verify that the array contains what you want in
> positions 167, 110, and 200. In the case of large files with a random bad
> line, I find wrapping the call inside the map in a try/catch very
> valuable -- you can dump out the whole line in the catch block.
>
> Lastly, I would guess that you're getting a compile error and not a
> runtime error -- c is an array of values, so I think you want
> tabs.map(c => (c(167), c(110), c(200))) instead of
> tabs.map(c => (c._(167), c._(110), c._(200)).
>
> On Sun, Dec 14, 2014 at 3:12 PM, Denny Lee wrote:
>>
>> Yes - that works great! Sorry for implying I couldn't. I was just more
>> flummoxed that I couldn't make the Scala call work on its own. Will
>> continue to debug ;-)
>>
>> On Sun, Dec 14, 2014 at 11:39 Michael Armbrust wrote:
>>
>>>> BTW, I cannot use Spark SQL / case classes right now because my table
>>>> has 200 columns (and I'm on Scala 2.10.3)
>>>
>>> You can still apply the schema programmatically:
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
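For reference, a minimal end-to-end sketch of the fix above, assuming the tab-delimited input from the original post (the HDFS path stays elided, as in the thread):

val table = sc.textFile("hdfs://")
val tabs = table.map(_.split("\t"))   // RDD[Array[String]]
// Plain apply() indexing -- keep only columns 167, 110, and 200 of each row
val cols = tabs.map(c => Array(c(167), c(110), c(200)))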
Re: Limit the # of columns in Spark Scala
Denny, I am not sure what exception you're observing, but I've had luck with
two things:

val table = sc.textFile("hdfs://")

You can try calling table.first here and you'll see the first line of the
file. You can also do val debug = table.first.split("\t"), which gives you an
array, and you can verify that the array contains what you want in positions
167, 110, and 200. In the case of large files with a random bad line, I find
wrapping the call inside the map in a try/catch very valuable -- you can dump
out the whole line in the catch block.

Lastly, I would guess that you're getting a compile error and not a runtime
error -- c is an array of values, so I think you want
tabs.map(c => (c(167), c(110), c(200))) instead of
tabs.map(c => (c._(167), c._(110), c._(200)).

On Sun, Dec 14, 2014 at 3:12 PM, Denny Lee wrote:
>
> Yes - that works great! Sorry for implying I couldn't. I was just more
> flummoxed that I couldn't make the Scala call work on its own. Will
> continue to debug ;-)
>
> On Sun, Dec 14, 2014 at 11:39 Michael Armbrust wrote:
>
>>> BTW, I cannot use Spark SQL / case classes right now because my table
>>> has 200 columns (and I'm on Scala 2.10.3)
>>
>> You can still apply the schema programmatically:
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
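A minimal sketch of the try/catch suggestion above, assuming tabs is the RDD[Array[String]] from the original post; the handling shown (log the bad line and drop it) is one illustrative choice:

val cols = tabs.flatMap { c =>
  try {
    // Project the three columns of interest
    Some((c(167), c(110), c(200)))
  } catch {
    // A short row throws ArrayIndexOutOfBoundsException;
    // dump the whole line so it can be inspected later
    case e: ArrayIndexOutOfBoundsException =>
      System.err.println("Bad line: " + c.mkString("\t"))
      None
  }
}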
Re: Limit the # of columns in Spark Scala
Yes - that works great! Sorry for implying I couldn't. I was just more
flummoxed that I couldn't make the Scala call work on its own. Will continue
to debug ;-)

On Sun, Dec 14, 2014 at 11:39 Michael Armbrust wrote:

>> BTW, I cannot use Spark SQL / case classes right now because my table has
>> 200 columns (and I'm on Scala 2.10.3)
>
> You can still apply the schema programmatically:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
Re: Limit the # of columns in Spark Scala
> BTW, I cannot use Spark SQL / case classes right now because my table has
> 200 columns (and I'm on Scala 2.10.3)

You can still apply the schema programmatically:
http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
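For concreteness, a rough sketch of the programmatic-schema approach against the Spark SQL API of that era (Spark 1.1/1.2); the all-string schema and generated field names are illustrative, not from the thread:

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)

// Case classes on Scala 2.10 are capped at 22 fields, but a StructType
// for all 200 columns can be built programmatically.
val schema = StructType(
  (0 until 200).map(i => StructField("col" + i, StringType, nullable = true)))

val rowRDD = sc.textFile("hdfs://").map(_.split("\t")).map(a => Row(a: _*))
val schemaRDD = sqlContext.applySchema(rowRDD, schema)
schemaRDD.registerTempTable("mytable")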
Re: Limit the # of columns in Spark Scala
Getting a bunch of syntax errors. Let me get back to you with the full
statement and error later today. Thanks for verifying my thinking wasn't out
in left field.

On Sun, Dec 14, 2014 at 08:56 Gerard Maas wrote:

> Hi,
>
> I don't get what the problem is. That map to selected columns looks like
> the way to go given the context. What's not working?
>
> Kr, Gerard
>
> On Dec 14, 2014 5:17 PM, "Denny Lee" wrote:
>
>> I have a large number of files within HDFS on which I would like to do a
>> group-by statement, a la:
>>
>> val table = sc.textFile("hdfs://")
>> val tabs = table.map(_.split("\t"))
>>
>> I'm trying to do something similar to
>> tabs.map(c => (c._(167), c._(110), c._(200))
>> where I create a new RDD that only has those three columns, but that
>> isn't quite right because I'm not really manipulating sequences.
>>
>> BTW, I cannot use Spark SQL / case classes right now because my table
>> has 200 columns (and I'm on Scala 2.10.3)
>>
>> Thanks!
>> Denny
Re: Limit the # of columns in Spark Scala
Hi,

I don't get what the problem is. That map to selected columns looks like the
way to go given the context. What's not working?

Kr, Gerard

On Dec 14, 2014 5:17 PM, "Denny Lee" wrote:

> I have a large number of files within HDFS on which I would like to do a
> group-by statement, a la:
>
> val table = sc.textFile("hdfs://")
> val tabs = table.map(_.split("\t"))
>
> I'm trying to do something similar to
> tabs.map(c => (c._(167), c._(110), c._(200))
> where I create a new RDD that only has those three columns, but that isn't
> quite right because I'm not really manipulating sequences.
>
> BTW, I cannot use Spark SQL / case classes right now because my table has
> 200 columns (and I'm on Scala 2.10.3)
>
> Thanks!
> Denny
Limit the # of columns in Spark Scala
I have a large number of files within HDFS on which I would like to do a
group-by statement, a la:

val table = sc.textFile("hdfs://")
val tabs = table.map(_.split("\t"))

I'm trying to do something similar to

tabs.map(c => (c._(167), c._(110), c._(200))

where I create a new RDD that only has those three columns, but that isn't
quite right because I'm not really manipulating sequences.

BTW, I cannot use Spark SQL / case classes right now because my table has 200
columns (and I'm on Scala 2.10.3)

Thanks!
Denny
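For completeness, the compile error in the snippet above comes from the ._ syntax: after split, c is a plain Array[String], so elements are selected with c(i). A minimal sketch, with an illustrative group-by on the first selected column:

val table = sc.textFile("hdfs://")
val tabs = table.map(_.split("\t"))
// c is an Array[String]: index with c(i), not c._(i)
val triples = tabs.map(c => (c(167), c(110), c(200)))
// e.g. group by the first of the three selected columns
val grouped = triples.groupBy(_._1)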