Re: Silly question about building Spark 1.4.1

2015-07-20 Thread Dean Wampler
hadoop-2.6 is supported (look for the profile XML in the pom.xml file). For Hive, add -Phive -Phive-thriftserver (see http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables for more details). dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http

Re: Spark Streaming: how to use StreamingContext.queueStream with existing RDD

2015-10-26 Thread Dean Wampler
Check out StreamingContext.queueStream (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext) dean

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Dean Wampler
Dynamic allocation doesn't work yet with Spark Streaming in any cluster scenario. There was a previous thread on this topic which discusses the issues that need to be resolved.

Re: How to create nested structure from RDD

2015-11-17 Thread Dean Wampler
val array = line.split(",")  // assume: "name,street,city"
  User(array(0), Address(array(1), array(2)))
}.toDF()
scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- address: struct (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- city:

Re: How to create nested structure from RDD

2015-11-17 Thread Dean Wampler
-separated input data:
case class Address(street: String, city: String)
case class User(name: String, address: Address)
sc.textFile("/path/to/stuff").map { line => val array = line.split(",") ...
dean
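
The two excerpts from this thread describe building a nested structure from comma-separated lines. As a rough illustration only, here is a plain-Scala sketch (no Spark required) that uses the same `Address` and `User` case classes. The `NestedParse` object and the `parseUser` helper are hypothetical names, and the input format "name,street,city" is the assumption stated in the thread.

```scala
// Plain-Scala sketch (no Spark): build the nested case classes from one CSV line.
final case class Address(street: String, city: String)
final case class User(name: String, address: Address)

object NestedParse {
  // Hypothetical helper; assumes each line is exactly "name,street,city".
  def parseUser(line: String): Option[User] =
    line.split(",", -1).map(_.trim) match {
      case Array(name, street, city) => Some(User(name, Address(street, city)))
      case _                         => None // wrong number of fields
    }

  def main(args: Array[String]): Unit = {
    val lines = Seq("Ann,1 Main St,Chicago", "malformed line")
    // Malformed lines are dropped instead of throwing an exception.
    lines.flatMap(parseUser).foreach(println)
  }
}
```

Returning an `Option` is a design choice of this sketch, not something the thread prescribes; it avoids an `ArrayIndexOutOfBoundsException` on short lines, which indexing `array(1)` and `array(2)` directly would throw.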

Re: Spark 1.5.1 ClassNotFoundException in cluster mode.

2015-10-14 Thread Dean Wampler
-jars" option. Note that the latter may not be an ideal solution if it has other dependencies that also need to be passed. dean

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Dean Wampler
Is myRDD outside a DStream? If so are you persisting on each batch iteration? It should be checkpointed frequently too.

Re: Breaking lineage and reducing stages in Spark Streaming

2015-07-09 Thread Dean Wampler
to call collect on toUpdate before using foreach(println). If the RDD is huge, you definitely don't want to do that. Hope this helps. dean

Re: Java vs. Scala for Spark

2015-09-08 Thread Dean Wampler
hop/InvertedIndex5b.scala> for a taste of how concise it makes code! 4. Type inference: Spark really shows its utility. It means a lot less code to write, but you get the hints of what you just wrote! My $0.02. dean

Re: Parquet partitioning performance issue

2015-09-13 Thread Dean Wampler
files overall.

Re: Realtime Data Visualization Tool for Spark

2015-09-11 Thread Dean Wampler
Here's a demonstration video from @noootsab himself (creator of Spark Notebook) showing live charting in Spark Notebook. It's one reason I prefer it over the other options. https://twitter.com/noootsab/status/638489244160401408

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-27 Thread Dean Wampler
of your example, (r: ResultSet) => (r.getInt("col1"),r.getInt("col2")...r.getInt("col37") ) could add nested () to group elements and keep the outer number of elements <= 22. dean
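
A minimal sketch of the nested-tuple idea from this excerpt: group the columns into inner tuples so that no single tuple has more than 22 elements. To keep the example runnable without a database, a `Map[String, Int]` stands in for `ResultSet`. That stand-in, the `Group8` alias, and the `extract` name are all assumptions made for illustration.

```scala
object NestedTuples {
  // Stand-in for java.sql.ResultSet so the sketch runs without a database (an assumption).
  type Row = Map[String, Int]
  type Group8 = (Int, Int, Int, Int, Int, Int, Int, Int)

  // 24 columns as three inner tuples of 8; the outer tuple has only 3 elements.
  def extract(r: Row): (Group8, Group8, Group8) = {
    def c(i: Int): Int = r(s"col$i")
    (
      (c(1), c(2), c(3), c(4), c(5), c(6), c(7), c(8)),
      (c(9), c(10), c(11), c(12), c(13), c(14), c(15), c(16)),
      (c(17), c(18), c(19), c(20), c(21), c(22), c(23), c(24))
    )
  }

  def main(args: Array[String]): Unit = {
    val row: Row = (1 to 24).map(i => s"col$i" -> i * 10).toMap
    val (first, _, last) = extract(row)
    println(first._1) // col1
    println(last._8)  // col24
  }
}
```

The cost of this approach is that fields are reached through two accessors (for example `t._3._8`), so grouping related columns together keeps the result easier to read.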

Re: [Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-09 Thread Dean Wampler
d routable inside the cluster. Recall that EC2 instances have both public and private host names & IP addresses. Also, is the port number correct for HDFS in the cluster? dean

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Dean Wampler
, it was compiled with Java 6 (see https://en.wikipedia.org/wiki/Java_class_file). So, it doesn't appear to be a Spark build issue. dean
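
One way to check which Java release a class was compiled for is to read the major version from the class-file header (a 4-byte magic number, then a 2-byte minor and a 2-byte major version). The sketch below is illustrative and not taken from the thread; `ClassFileVersion`, `majorVersion`, and `javaRelease` are hypothetical names, and the header bytes in `main` are hand-built rather than read from a real jar.

```scala
import java.io.{ByteArrayInputStream, DataInputStream, InputStream}

object ClassFileVersion {
  // Reads the class-file header: u4 magic, u2 minor_version, u2 major_version.
  def majorVersion(in: InputStream): Int = {
    val data = new DataInputStream(in)
    val magic = data.readInt()
    require(magic == 0xCAFEBABE, "not a Java class file")
    data.readUnsignedShort() // minor version, ignored
    data.readUnsignedShort() // major version
  }

  // Major versions for the releases discussed in the thread.
  def javaRelease(major: Int): String = major match {
    case 50    => "Java 6"
    case 51    => "Java 7"
    case 52    => "Java 8"
    case other => s"major version $other"
  }

  def main(args: Array[String]): Unit = {
    // Hand-built 8-byte header for illustration: magic, minor 0, major 50.
    val header = Array(0xCA, 0xFE, 0xBA, 0xBE, 0x00, 0x00, 0x00, 0x32).map(_.toByte)
    println(javaRelease(majorVersion(new ByteArrayInputStream(header))))
  }
}
```

For a real class, the same function could be fed a `FileInputStream` over a `.class` file extracted from the jar in question.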

Re: Networking problems in Spark 1.6.0

2016-01-05 Thread Dean Wampler
ConcurrentHashMap.keySet() returning a KeySetView is a Java 8 method. The Java 7 method returns a Set. Are you running Java 7? What happens if you run Java 8?

Re: Spark data frame

2015-12-22 Thread Dean Wampler
More specifically, you could have TBs of data across thousands of partitions for a single RDD. If you call collect(), BOOM!

Re: Spark data frame

2015-12-22 Thread Dean Wampler
You can call the collect() method to return a collection, but be careful. If your data is too big to fit in the driver's memory, it will crash.

Re: getting different results from same line of code repeated

2015-11-18 Thread Dean Wampler
. You can look at the logic in the Spark code base, RDD.scala (first method calls the take method) and SparkContext.scala (runJob method, which take calls). However, the exceptions definitely look like bugs to me. There must be some empty partitions. dean

Re:

2015-11-19 Thread Dean Wampler
If you mean retaining data from past jobs, try running the history server, documented here: http://spark.apache.org/docs/latest/monitoring.html

Re: Spark Context not getting initialized in local mode

2016-01-08 Thread Dean Wampler
ClassNotFoundException usually means one of a few problems: 1. Your app assembly is missing the jar files with those classes. 2. You mixed jar files from incompatible versions in your assembly. 3. You built with one version of Spark and deployed to another.

Re: Trying to understand dynamic resource allocation

2016-01-11 Thread Dean Wampler
It works on Mesos, too. I'm not sure about Standalone mode.

Re: [Spark Streaming] Spark Streaming dropping last lines

2016-02-10 Thread Dean Wampler
? HTH, dean

Re: Using functional programming rather than SQL

2016-02-22 Thread Dean Wampler
ust's talk at Spark Summit East nicely made this point. http://www.slideshare.net/databricks/structuring-spark-dataframes-datasets-and-streaming dean
