Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
I'm curious if you're seeing the same thing when using bdutil against GCS? I'm wondering if this may be an issue concerning the transfer rate of Spark -> Hadoop -> GCS Connector -> GCS. On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta wrote: > All, > > I'm using the Spark shell to interact w

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
or and not a runtime > error -- I believe c is an array of values so I think you want > tabs.map(c => (c(167), c(110), c(200))) instead of tabs.map(c => (c._(167), > c._(110), c._(200))) > > > > On Sun, Dec 14, 2014 at 3:12 PM, Denny Lee wrote: >> >> Yes - that work

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
Yes - that works great! Sorry for implying I couldn't. Was just more flummoxed that I couldn't make the Scala call work on its own. Will continue to debug ;-) On Sun, Dec 14, 2014 at 11:39 Michael Armbrust wrote: > BTW, I cannot use SparkSQL / case right now because my table has 200 >> columns (a

Re: Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
ns looks like > the way to go given the context. What's not working? > > Kr, Gerard > On Dec 14, 2014 5:17 PM, "Denny Lee" wrote: > >> I have a large number of files within HDFS that I would like to do a group by >> statement ala >> >> val table = sc

Limit the # of columns in Spark Scala

2014-12-14 Thread Denny Lee
I have a large number of files within HDFS that I would like to do a group by statement ala val table = sc.textFile("hdfs://") val tabs = table.map(_.split("\t")) I'm trying to do something similar to tabs.map(c => (c._(167), c._(110), c._(200)) where I create a new RDD that only has but that isn't
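For reference, the working form of this projection (as suggested later in the thread) indexes the split array with `c(i)` rather than `c._(i)`. A minimal sketch without a Spark cluster, using a made-up tab-separated line:

```scala
// Sketch of the fix discussed in this thread: an Array[String] is indexed
// with c(i); the c._(i) form does not compile. The sample line is made up.
val line = (0 to 200).map(i => s"v$i").mkString("\t")
val c: Array[String] = line.split("\t")
val row = (c(167), c(110), c(200)) // the three-column projection
println(row)
```

In a Spark job the same closure would simply be passed to `tabs.map(...)`.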

Re: Spark SQL Roadmap?

2014-12-13 Thread Denny Lee
Hi Xiaoyong, SparkSQL has already been released and has been part of the Spark code-base since Spark 1.0. The latest stable release is Spark 1.1 (here's the Spark SQL Programming Guide ) and we're currently voting on Spark 1.2. Hive

Re: Spark-SQL JDBC driver

2014-12-11 Thread Denny Lee
Yes, that is correct. A quick reference on this is the post https://www.linkedin.com/pulse/20141007143323-732459-an-absolutely-unofficial-way-to-connect-tableau-to-sparksql-spark-1-1?_mSplash=1 with the pertinent section being: It is important to note that when you create Spark tables (for example

Re: Spark on YARN memory utilization

2014-12-09 Thread Denny Lee
Thanks Sandy! On Mon, Dec 8, 2014 at 23:15 Sandy Ryza wrote: > Another thing to be aware of is that YARN will round up containers to the > nearest increment of yarn.scheduler.minimum-allocation-mb, which defaults > to 1024. > > -Sandy > > On Sat, Dec 6, 2014 at 3:48

Re: Spark on YARN memory utilization

2014-12-06 Thread Denny Lee
executorMemory. > > When you set executor memory, the yarn resource request is executorMemory > + yarnOverhead. > > - Arun > > On Sat, Dec 6, 2014 at 4:27 PM, Denny Lee wrote: > >> This is perhaps more of a YARN question than a Spark question but i was >> just curious

Spark on YARN memory utilization

2014-12-06 Thread Denny Lee
This is perhaps more of a YARN question than a Spark question but i was just curious to how is memory allocated in YARN via the various configurations. For example, if I spin up my cluster with 4GB with a different number of executors as noted below 4GB executor-memory x 10 executors = 46GB (4G
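Combining the replies in this thread (a per-executor overhead is added to the request, which YARN then rounds up to the nearest yarn.scheduler.minimum-allocation-mb increment), a rough model of the per-container request might look like the following. The 384 MB overhead default is an assumption for Spark 1.x and may differ per configuration:

```scala
// Rough model of YARN container sizing per this thread (not Spark source):
// request = executor memory + YARN overhead, rounded up to the scheduler's
// minimum-allocation-mb increment. The 384 MB overhead is an assumed default.
def yarnContainerMB(executorMB: Int, overheadMB: Int = 384, minAllocMB: Int = 1024): Int = {
  val requested = executorMB + overheadMB
  ((requested + minAllocMB - 1) / minAllocMB) * minAllocMB
}
println(yarnContainerMB(4096)) // a 4 GB executor becomes a 5120 MB container
```

This rounding is why the cluster-wide total ends up noticeably larger than executor-memory times the number of executors.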

Re: spark-submit on YARN is slow

2014-12-05 Thread Denny Lee
Okay, my bad for not testing out the documented arguments - once i use the correct ones, the query completes in ~55s (I can probably make it faster). Thanks for the help, eh?! On Fri Dec 05 2014 at 10:34:50 PM Denny Lee wrote: > Sorry for the delay in my response - for my spark ca

Re: spark-submit on YARN is slow

2014-12-05 Thread Denny Lee
s were only at startup, so if jobs are taking significantly >>>> longer on YARN, that should be a different problem. When you ran on YARN, >>>> did you use the --executor-cores, --executor-memory, and --num-executors >>>> arguments? When running aga
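The arguments mentioned here map to spark-submit flags along these lines (a hedged illustration; the values and the application jar name are placeholders):

```shell
# Illustrative spark-submit invocation with explicit YARN resource flags;
# without these, Spark 1.x falls back to small defaults on YARN.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 4g \
  my-app.jar
```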

Re: spark-submit on YARN is slow

2014-12-05 Thread Denny Lee
My submissions of Spark on YARN (CDH 5.2) resulted in a few thousand steps. If I was running this on standalone cluster mode the query finished in 55s but on YARN, the query was still running 30min later. Would the hard coded sleeps potentially be in play here? On Fri, Dec 5, 2014 at 11:23 Sandy Ry

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-25 Thread Denny Lee
To determine if this is a Windows vs. other configuration, can you just try to call the Spark-class.cmd SparkSubmit without actually referencing the Hadoop or Thrift server classes? On Tue Nov 25 2014 at 5:42:09 PM Judy Nash wrote: > I traced the code and used the following to call: > > Spark-

Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-23 Thread Denny Lee
It sort of depends on your environment. If you are running on your local environment, I would just download the latest Spark 1.1 binaries and you'll be good to go. If it's a production environment, it sort of depends on how you are setup (e.g. AWS, Cloudera, etc.) On Sun Nov 23 2014 at 11:27:49 A

Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-23 Thread Denny Lee
By any chance are you using Spark 1.0.2? registerTempTable was introduced from Spark 1.1+ while for Spark 1.0.2, it would be registerAsTable. On Sun Nov 23 2014 at 10:59:48 AM riginos wrote: > Hi guys , > Im trying to do the Spark SQL Programming Guide but after the: > > case class Person(name:

Re: Spark or MR, Scala or Java?

2014-11-22 Thread Denny Lee
extraction job against multiple data sources via Hadoop streaming. Another good call out about utilizing Scala within Spark is that most of the Spark code is written in Scala. On Sat, Nov 22, 2014 at 08:12 Denny Lee wrote: > There are various scenarios where traditional Hadoop makes more sense t

Re: Spark + Tableau

2014-10-30 Thread Denny Lee
When you are starting the thrift server service - are you connecting to it locally or is this on a remote server when you use beeline and/or Tableau? On Thu, Oct 30, 2014 at 8:00 AM, Bojan Kostic wrote: > I use beta driver SQL ODBC from Databricks. > > > > -- > View this message in context: > ht

Re: winutils

2014-10-29 Thread Denny Lee
QQ - did you download the Spark 1.1 binaries that included the Hadoop one? Does this happen if you're using the Spark 1.1 binaries that do not include the Hadoop jars? On Wed, Oct 29, 2014 at 11:31 AM, Ron Ayoub wrote: > Apparently Spark does require Hadoop even if you do not intend to use > Had

Re: add external jars to spark-shell

2014-10-20 Thread Denny Lee
–jar (ADD_JARS) is a special class loading for Spark while –driver-class-path (SPARK_CLASSPATH) is captured by the startup scripts and appended to classpath settings that is used to start the JVM running the driver You can reference https://www.concur.com/blog/en-us/connect-tableau-to-sparksql on
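As a hedged illustration of that distinction (paths are placeholders): the jar option ships the jar to executors and registers it with Spark's class loader, while the driver classpath option only prepends to the driver JVM's classpath:

```shell
# --jars (formerly ADD_JARS) distributes the jar and registers it with
# Spark's class loader on the executors; --driver-class-path (formerly
# SPARK_CLASSPATH) only affects the driver JVM. Paths are placeholders.
spark-shell \
  --jars /opt/libs/extra.jar \
  --driver-class-path /opt/libs/extra.jar
```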

Re: Spark Hive max key length is 767 bytes

2014-09-25 Thread Denny Lee
ql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: > Specified key was too long; max key length is 767 bytes > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > > > > Should I use HIVE 0.12.0 instead of HIVE 0.13.1? > > Regards > Arthur > > On 3

Re: What is a pre built package of Apache Spark

2014-09-24 Thread Denny Lee
This seems similar to a related Windows issue concerning python where pyspark couldn't find the python executable because the PYTHONSTARTUP environment variable wasn't set - by any chance could this be related? On Wed, Sep 24, 2014 at 7:51 PM, christy <760948...@qq.com> wrote: > Hi I have installed standalone on win7

Re: SQL shell for Spark SQL?

2014-09-18 Thread Denny Lee
The CLI is the command line connection to SparkSQL and yes, SparkSQL replaces Shark - there’s a great article by Reynold on the Databricks blog that provides the context:  http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html As for SparkSQL and

RE: SchemaRDD and RegisterAsTable

2014-09-18 Thread Denny Lee
was that the Thrift Server was just a HIVEQL frontend and the undelying  query execution would be done by SPARK .   Regards Santosh   From: Denny Lee [mailto:denny.g@gmail.com] Sent: Wednesday, September 17, 2014 10:14 PM To: user@spark.apache.org; Addanki, Santosh Kumar Subject: Re

Re: SchemaRDD and RegisterAsTable

2014-09-17 Thread Denny Lee
The registered table is stored within the spark context itself.  To have the table available for the thrift server to get access to, you can save the sc table into the Hive context so that way the Thrift server process can see the table.  If you are using derby as your metastore, then the thrift

Re: Spark SQL Thrift JDBC server deployment for production

2014-09-11 Thread Denny Lee
Could you provide some context about running this in yarn-cluster mode? The Thrift server that's included within Spark 1.1 is based on Hive 0.12. Hive has been able to work against YARN since Hive 0.10. So when you start the thrift server, provided you copied the hive-site.xml over to the Spark co

Re: Spark SQL JDBC

2014-09-11 Thread Denny Lee
When you re-ran sbt did you clear out the packages first and ensure that the datanucleus jars were generated within lib_managed? I remembered having to do that when I was working testing out different configs. On Thu, Sep 11, 2014 at 10:50 AM, alexandria1101 < alexandria.shea...@gmail.com> wrote:

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Denny Lee
Yes, at least for my query scenarios, I have been able to use Spark 1.1 with Hadoop 2.4 against Hadoop 2.5.  Note, Hadoop 2.5 is considered a relatively minor release (http://hadoop.apache.org/releases.html#11+August%2C+2014%3A+Release+2.5.0+available) where Hadoop 2.4 and 2.3 were considered mo

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Denny Lee
Please correct me if I’m wrong but I was under the impression as per the maven repositories that it was just to stay more in sync with the various version of Hadoop.  Looking at the latest documentation (https://spark.apache.org/docs/latest/building-with-maven.html), there are multiple Hadoop v

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-11 Thread Denny Lee
registerTempTable you mentioned works on SqlContext instead of HiveContext. Thanks, Du On 9/10/14, 1:21 PM, "Denny Lee" wrote: >Actually, when registering the table, it is only available within the sc >context you are running it in. For Spark 1.1, the method name

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Denny Lee
I’m not sure if I’m completely answering your question here but I’m currently working (on OSX) with Hadoop 2.5 and I used the Spark 1.1 with Hadoop 2.4 without any issues. On September 11, 2014 at 18:11:46, Haopu Wang (hw...@qilinsoft.com) wrote: I see the binary packages include hadoop 1, 2.3

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-10 Thread Denny Lee
Actually, when registering the table, it is only available within the sc context you are running it in. For Spark 1.1, the method name is changed to registerTempTable to better reflect that. The Thrift server process runs under a different process meaning that it cannot see any of the tables

Re: Starting Thriftserver via hostname on Spark 1.1 RC4?

2014-09-04 Thread Denny Lee
behavior is inherited from Hive since Spark SQL Thrift server is a variant of HiveServer2. ​ On Wed, Sep 3, 2014 at 10:47 PM, Denny Lee wrote: When I start the thrift server (on Spark 1.1 RC4) via: ./sbin/start-thriftserver.sh --master spark://hostname:7077 --driver-class-path $CLASSPATH It appears

Starting Thriftserver via hostname on Spark 1.1 RC4?

2014-09-03 Thread Denny Lee
When I start the thrift server (on Spark 1.1 RC4) via: ./sbin/start-thriftserver.sh --master spark://hostname:7077 --driver-class-path $CLASSPATH It appears that the thrift server is starting off of localhost as opposed to hostname.  I have set the spark-env.sh to use the hostname, modified the

Re: Spark Hive max key length is 767 bytes

2014-08-30 Thread Denny Lee
Oh, you may be running into an issue with your MySQL setup actually, try running alter database metastore_db character set latin1 so that way Hive (and the Spark HiveContext) can execute properly against the metastore. On August 29, 2014 at 04:39:01, arthur.hk.c...@gmail.com (arthur.hk.c...@g

Re: SparkSQL HiveContext No Suitable Driver / Cannot Find Driver

2014-08-29 Thread Denny Lee
Oh, forgot to add the managed libraries and the Hive libraries within the CLASSPATH.  As soon as I did that, we’re good to go now. On August 29, 2014 at 22:55:47, Denny Lee (denny.g@gmail.com) wrote: My issue is similar to the issue as noted  http://mail-archives.apache.org/mod_mbox

SparkSQL HiveContext No Suitable Driver / Cannot Find Driver

2014-08-29 Thread Denny Lee
My issue is similar to the issue as noted  http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccadoad2ks9_qgeign5-w7xogmrotrlbchvfukctgstj5qp9q...@mail.gmail.com%3E. Currently using Spark-1.1 (grabbed from git two days ago) and using Hive 0.12 with my metastore in MySQL.

Spark / Thrift / ODBC connectivity

2014-08-28 Thread Denny Lee
I’m currently using the Spark 1.1 branch and have been able to get the Thrift service up and running.  The quick questions were whether I should be able to use the Thrift service to connect to SparkSQL generated tables and/or Hive tables?   As well, by any chance do we have any documents that point

LDA example?

2014-08-21 Thread Denny Lee
Quick question - is there a handy sample / example of how to use the LDA algorithm within Spark MLLib?   Thanks! Denny

Re: Seattle Spark Meetup: Spark at eBay - Troubleshooting the everyday issues Slides

2014-08-15 Thread Denny Lee
Apologies but we had placed the settings for downloading the slides to Seattle Spark Meetup members only - but actually meant to share with everyone.  We have since fixed this and now you can download it.  HTH! On August 14, 2014 at 18:14:35, Denny Lee (denny.g@gmail.com) wrote: For

Seattle Spark Meetup: Spark at eBay - Troubleshooting the everyday issues Slides

2014-08-14 Thread Denny Lee
For those who were not able to attend the Seattle Spark Meetup - Spark at eBay - Troubleshooting the Everyday Issues, the slides have been now posted at:  http://files.meetup.com/12063092/SparkMeetupAugust2014Public.pdf. Enjoy! Denny

SeattleSparkMeetup: Spark at eBay - Troubleshooting the everyday issues

2014-07-18 Thread Denny Lee
We're coming off a great Seattle Spark Meetup session with Evan Chan (@evanfchan) Interactive OLAP Queries with @ApacheSpark and #Cassandra  (http://www.slideshare.net/EvanChan2/2014-07olapcassspark) at Whitepages.  Now, we're proud to announce that our next session is Spark at eBay - Troublesho

Seattle Spark Meetup: Evan Chan's Interactive OLAP Queries with Spark and Cassandra

2014-07-17 Thread Denny Lee
We had a great Seattle Spark Meetup session with Evan Chan presenting his  Interactive OLAP Queries with Spark and Cassandra.  You can find his awesome presentation at: http://www.slideshare.net/EvanChan2/2014-07olapcassspark. Enjoy!

Seattle Spark Meetup slides: xPatterns, Fun Things, and Machine Learning Streams - next is Interactive OLAP

2014-07-07 Thread Denny Lee
Apologies for the delay but we’ve had a bunch of great slides and sessions at Seattle Spark Meetup this past couple of months including Claudiu Barbura’s "xPatterns on Spark, Shark, Mesos, and Tachyon"; Paco Nathan’s "Fun Things You Can Do with Spark 1.0", and "Machine Learning Streams with Spar

Re: Run spark unit test on Windows 7

2014-07-03 Thread Denny Lee
Thanks! will take a look at this later today. HTH! > On Jul 3, 2014, at 11:09 AM, Kostiantyn Kudriavtsev > wrote: > > Hi Denny, > > just created https://issues.apache.org/jira/browse/SPARK-2356 > >> On Jul 3, 2014, at 7:06 PM, Denny Lee wrote: >> &

Re: Run spark unit test on Windows 7

2014-07-03 Thread Denny Lee
=hdinsight 2) put this file into d:\winutil\bin 3) add in my test: System.setProperty("hadoop.home.dir", "d:\\winutil\\") after that test runs Thank you, Konstantin Kudryavtsev On Wed, Jul 2, 2014 at 10:24 PM, Denny Lee wrote: You don't actually need it per se - its ju

Re: Run spark unit test on Windows 7

2014-07-02 Thread Denny Lee
cular issue. On Wed, Jul 2, 2014 at 12:04 PM, Kostiantyn Kudriavtsev < kudryavtsev.konstan...@gmail.com> wrote: > No, I don’t > > why do I need to have HDP installed? I don’t use Hadoop at all and I’d > like to read data from local filesystem > > On Jul 2, 2014, at 9:10 PM,

Re: Run spark unit test on Windows 7

2014-07-02 Thread Denny Lee
By any chance do you have HDP 2.1 installed? you may need to install the utils and update the env variables per http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows > On Jul 2, 2014, at 10:20 AM, Konstantin Kudryavtsev > wrote: > > Hi Andrew, > > it's windows 7
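The workaround that resolved this thread can be sketched in a few lines; the `d:\winutil\` path is just the example used in the thread, and `winutils.exe` must exist under its `bin` subdirectory:

```scala
// Windows workaround from this thread: tell the Hadoop libraries where
// winutils.exe lives *before* any Spark/Hadoop code runs. The path is the
// example from the thread; adjust it to wherever winutils.exe is placed.
System.setProperty("hadoop.home.dir", "d:\\winutil\\")
val hadoopHome = System.getProperty("hadoop.home.dir")
println(hadoopHome)
```

Setting the property in test setup code (as done in the thread) avoids needing a full Hadoop install on the Windows machine.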

Seattle Spark Meetup: Machine Learning Streams with Spark 1.0

2014-06-05 Thread Denny Lee
If you’re in the Seattle area on 6/24, come join us at Madrona Ventures building in downtown Seattle to join the session: Machine Learning Streams with Spark 1.0.   For more information, please check out our meetup event:  http://www.meetup.com/Seattle-Spark-Meetup/events/187375042/ Enjoy! Denn

Seattle Spark Meetup: xPatterns Slides and @pacoid session next week!

2014-05-23 Thread Denny Lee
For those who were not able to attend the last Seattle Spark Meetup, we had a great session by Claudiu Barbura on xPatterns on Spark, Shark, Tachyon, and Mesos - you can find the slides at: http://www.slideshare.net/ClaudiuBarbura/seattle-spark-meetup-may-2014. As well, check out the next Seat

Seattle Spark Meetup Slides

2014-05-02 Thread Denny Lee
We’ve had some pretty awesome presentations at the Seattle Spark Meetup - here are the links to the various slides: Seattle Spark Meetup KickOff with DataBricks | Introduction to Spark with Matei Zaharia and Pat McDonough Learnings from Running Spark at Twitter sessions Ben Hindman’s Mesos for

Re: Spark Training

2014-05-01 Thread Denny Lee
You may also want to check out Paco Nathan's Introduction to Spark courses: http://liber118.com/pxn/ > On May 1, 2014, at 8:20 AM, Mayur Rustagi wrote: > > Hi Nicholas, > We provide training on spark, hands-on also associated ecosystem. > We gave it recently at a conference in Santa Clara. P

Re: CDH5 Spark on EC2

2014-04-02 Thread Denny Lee
ct with it. > Also if you are running in distributed mode the workers should be registered. > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoidanalytics.com > @mayur_rustagi > > > >> On Wed, Apr 2, 2014 at 12:44 AM, Denny Lee wrote: >> I’ve

CDH5 Spark on EC2

2014-04-02 Thread Denny Lee
I’ve been able to get CDH5 up and running on EC2 and according to Cloudera Manager, Spark is running healthy. But when I try to run spark-shell, I eventually get the error: 14/04/02 07:18:18 INFO client.AppClient$ClientActor: Connecting to master  spark://ip-172-xxx-xxx-xxx:7077... 14/04/02 07:1

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Denny Lee
If you have any questions on helping to get a Spark Meetup off the ground, please do not hesitate to ping me (denny.g@gmail.com).  I helped jump start the one here in Seattle (and tangentially have been helping the Vancouver and Denver ones as well).  HTH! On March 31, 2014 at 12:35:38 PM,
