Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
Don't use big O estimates, always measure. It used to work back in the days when double multiplication was a bottleneck. The computation cost is effectively free on both the CPU and GPU and you're seeing pure copying costs. Also, I'm dubious that cublas is doing what you think it is. Can you link

Re: Google Summer of Code - ideas

2015-02-26 Thread Jeremy Freeman
For topic #4 (streaming ML in Python), there’s an existing JIRA, but progress seems to have stalled. I’d be happy to help if you want to pick it up! https://issues.apache.org/jira/browse/SPARK-4127 - jeremyfreeman.net @thefreemanlab On Feb 26, 2015, at 4:20 PM, Xiangrui

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
Btw, I wish people would stop cheating when comparing CPU and GPU timings for things like matrix multiply :-P Please always compare apples with apples and include the time it takes to set up the matrices, send them to the processing unit, do the calculation AND copy the result back to where you need
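A minimal sketch of the kind of end-to-end measurement Sam is asking for (plain Scala, hypothetical names; the naive multiply stands in for whatever kernel is being benchmarked): time the whole round trip, including building and staging the inputs, not just the compute step.

```scala
object FairBench {
  // Time a block, returning (result, elapsed milliseconds).
  def timed[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - start) / 1e6)
  }

  // Naive n x n row-major matrix multiply, standing in for the "compute" step.
  def multiply(a: Array[Double], b: Array[Double], n: Int): Array[Double] = {
    val c = new Array[Double](n * n)
    var i = 0
    while (i < n) {
      var k = 0
      while (k < n) {
        val aik = a(i * n + k)
        var j = 0
        while (j < n) {
          c(i * n + j) += aik * b(k * n + j)
          j += 1
        }
        k += 1
      }
      i += 1
    }
    c
  }

  def main(args: Array[String]): Unit = {
    val n = 256
    // "End to end" includes allocating and filling the inputs -- the analogue
    // of staging data on a device -- as well as the multiply itself.
    val (_, endToEndMs) = timed {
      val a = Array.fill(n * n)(1.0)
      val b = Array.fill(n * n)(1.0)
      multiply(a, b, n)
    }
    println(f"end-to-end: $endToEndMs%.1f ms")
  }
}
```

The apples-to-apples comparison is then `timed { stage; compute; copyBack }` on both devices, never kernel-only on one and end-to-end on the other.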

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Evan R. Sparks
I couldn't agree with you more, Sam. The GPU/Matrix guys typically don't count their copy times, but claim that you should be doing *as much as possible* on the GPU - so, maybe for some applications where you can generate the data on the GPU this makes sense. But, in the context of Spark we should

graph.mapVertices() function obtain edge triplets with null attribute

2015-02-26 Thread James
My code ``` // Initialize the graph, assigning to each vertex a counter that contains the vertex id only var anfGraph = graph.mapVertices { case (vid, _) => val counter = new HyperLogLog(5) counter.offer(vid) counter } val nullVertex = anfGraph.triplets.filter(edge => edge.srcAttr == null).first

Re: Need advice for Spark newbie

2015-02-26 Thread Dean Wampler
Historically, many orgs. have replaced data warehouses with Hadoop clusters and used Hive along with Impala (on Cloudera deployments) or Drill (on MapR deployments) for SQL. Hive is older and slower, while Impala and Drill are newer and faster, but you typically need both for their complementary

RE: Need advice for Spark newbie

2015-02-26 Thread Steve Nunez
Hi Vikram, There was a recent presentation at Strata that you might find useful: Hive on Spark is Blazing Fast .. Or Is It?http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final Generally those conclusions mirror my own observations: on large data sets, Hive

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-26 Thread Sandor Van Wassenhove
FWIW, I tested the first RC and saw no regressions. I ran our benchmarks built against Spark 1.3 and saw results consistent with Spark 1.2/1.2.1. On 2/25/15, 5:51 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Just a quick update on this thread. Issues have continued to trickle in. Not

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Xiangrui Meng
Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java. Best, Xiangrui On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com

Re: Need advice for Spark newbie

2015-02-26 Thread Vikram Kone
Dean, thanks for the info. Are you saying that we can create star/snowflake data models using Spark so they can be queried from Tableau? On Thursday, February 26, 2015, Dean Wampler deanwamp...@gmail.com wrote: Historically, many orgs. have replaced data warehouses with Hadoop clusters and

Re: Google Summer of Code - ideas

2015-02-26 Thread Xiangrui Meng
There are a couple of things in the Scala/Java API that are missing in the Python API: 1. model import/export 2. evaluation metrics 3. distributed linear algebra 4. streaming algorithms If you are interested, we can list/create target JIRAs and hunt them down one by one. Best, Xiangrui On Wed, Feb 25, 2015 at 7:37

Re: Need advice for Spark newbie

2015-02-26 Thread Vikram Kone
Hi Steve, thanks for the info. I will look into hivemail. Are you saying that we can create star/snowflake data models using Spark so they can be queried from Tableau? On Thursday, February 26, 2015, Steve Nunez snu...@hortonworks.com wrote: Hi Vikram, There was a recent presentation at

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Okay I confirmed my suspicions of a hang. I made a request that stopped progressing, though the already-scheduled tasks had finished. I made a separate request that was small enough not to hang, and it kicked the hung job enough to finish. I think what's happening is that the scheduler or the

Re: Need advice for Spark newbie

2015-02-26 Thread Dean Wampler
There's no support for star or snowflake models, per se. What you get with Hadoop is access to all your data and the processing power to build the ad hoc queries you want, when you need them, rather than having to figure out a schema/model in advance. I recommend that you also ask your questions

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
Hi all, I'm not surprised if the GPU is slow. It's about the bottleneck copying the memory. Watch my talk, linked from the netlib-java github page, to understand further. The only way to currently make use of a GPU is to do all the operations using the GPU's kernel. You can find some prepackaged

number of partitions for hive schemaRDD

2015-02-26 Thread masaki rikitoku
Hi all, now I'm trying SparkSQL with HiveContext. When I execute HQL like the following: --- val ctx = new org.apache.spark.sql.hive.HiveContext(sc) import ctx._ val queries = ctx.hql("select keyword from queries where dt = '2015-02-01' limit 1000") --- It seems that the number of

Re: number of partitions for hive schemaRDD

2015-02-26 Thread Cheng Lian
Hi Masaki, I guess what you saw is the partition number of the last stage, which must be 1 to perform the global phase of LIMIT. To tune partition number of normal shuffles like joins, you may resort to spark.sql.shuffle.partitions. Cheng On 2/26/15 5:31 PM, masaki rikitoku wrote: Hi all
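For reference, a sketch of the tuning Cheng mentions (the property name is real; the value 64 is only an illustration), set on the HiveContext before running the shuffle-heavy part of the query:

```scala
// Assuming sc is an existing SparkContext (Spark 1.x API, as in the thread).
val ctx = new org.apache.spark.sql.hive.HiveContext(sc)

// Controls the partition count of normal shuffle stages (joins, aggregations).
// It does not affect the single-partition final stage of a global LIMIT.
ctx.setConf("spark.sql.shuffle.partitions", "64")
```

The same setting can also be issued inline as `SET spark.sql.shuffle.partitions=64` in an HQL statement.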

RE: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Ulanov, Alexander
Typo - CPU was 2.5 cheaper (not GPU!) -Original Message- From: Ulanov, Alexander Sent: Thursday, February 26, 2015 2:01 PM To: Sam Halliday; Xiangrui Meng Cc: dev@spark.apache.org; Joseph Bradley; Evan R. Sparks Subject: RE: Using CUDA within Spark / boosting linear algebra Evan, thank

RE: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Sam Halliday
I've had some email exchanges with the author of BIDMat: it does exactly what you need to get the GPU benefit and writes higher level algorithms entirely in the GPU kernels so that the memory stays there as long as possible. The restriction with this approach is that it is only offering high-level

RE: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Ulanov, Alexander
Evan, thank you for the summary. I would like to add some more observations. The GPU that I used is 2.5 times cheaper than the CPU ($250 vs $100). They are both 3 years old. I also did a small test with modern hardware, and the new GPU nVidia Titan was slightly more than 1 order of magnitude

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Love to hear some input on this. I did get a standalone cluster up on my local machine and the problem didn't present itself. I'm pretty confident that means the problem is in the LocalBackend or something near it. On Thu, Feb 26, 2015 at 1:37 PM, Victor Tso-Guillen v...@paxata.com wrote: Okay

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Xiangrui Meng
The copying overhead should be quadratic in n, while the computation cost is cubic in n. I can understand that netlib-cublas is slower than netlib-openblas on small problems. But I'm surprised to see that it is still 20x slower on 1x1. I did the following on a g2.2xlarge instance with
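Xiangrui's scaling argument can be made concrete with a back-of-the-envelope sketch (pure Scala; the 2n³ flop count and 3·8n² bytes moved are the standard estimates for a dense n×n multiply with two inputs and one output, assumed here): the flops-per-byte ratio grows linearly with n, so transfer cost should be amortized away on large matrices.

```scala
object CopyVsCompute {
  // ~2n^3 floating-point operations for a dense n x n matrix multiply.
  def flops(n: Long): Long = 2L * n * n * n

  // Bytes transferred: two input matrices plus one output, 8 bytes per double.
  def bytesMoved(n: Long): Long = 3L * n * n * 8L

  // Flops per byte moved; equals n/12, i.e. linear in n.
  def ratio(n: Long): Double = flops(n).toDouble / bytesMoved(n)

  def main(args: Array[String]): Unit = {
    Seq(100L, 1000L, 10000L).foreach { n =>
      println(f"n=$n%6d  flops/byte = ${ratio(n)}%.1f")
    }
  }
}
```

Doubling n doubles the flops-per-byte ratio, which is why copy overhead dominating at large n (as in the reported benchmark) is the surprising part.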

Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Ryan Williams
If anyone is curious to try exporting Spark metrics to Graphite, I just published a post about my experience doing that, building dashboards in Grafana http://grafana.org/, and using them to monitor Spark jobs: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/ Code
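For anyone wanting to try this before reading the post, a sketch of the conf/metrics.properties entries that enable Spark's built-in GraphiteSink (the host, port, and prefix values are placeholders):

```properties
# Send metrics from all instances (driver, executors, ...) to a Carbon endpoint.
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark
```

Point `spark.metrics.conf` at this file (or place it in `$SPARK_HOME/conf`) and the metrics will start flowing to Graphite, ready to be graphed in Grafana.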

RE: Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Shao, Saisai
Cool, great job☺. Thanks Jerry From: Ryan Williams [mailto:ryan.blake.willi...@gmail.com] Sent: Thursday, February 26, 2015 6:11 PM To: user; dev@spark.apache.org Subject: Monitoring Spark with Graphite and Grafana If anyone is curious to try exporting Spark metrics to Graphite, I just

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Of course, breakpointing on every status update and revive offers invocation kept the problem from happening. Where could the race be? On Thu, Feb 26, 2015 at 7:55 PM, Victor Tso-Guillen v...@paxata.com wrote: Love to hear some input on this. I did get a standalone cluster up on my local