Re: configuring spark.yarn.driver.memoryOverhead on Spark 1.2.0

2015-01-12 Thread David McWhorter
Hi Ganelin, sorry if it wasn't clear from my previous email, but that is how I am creating a spark context. I just didn't write out the lines where I create the new SparkConf and SparkContext. I am also upping the driver memory when running. Thanks, David On 01/12/2015 11:12 A

configuring spark.yarn.driver.memoryOverhead on Spark 1.2.0

2015-01-12 Thread David McWhorter
er.memoryOverhead", "1024") on my spark configuration object but I still get "Will allocate AM container, with MB memory including 384 MB overhead" when launching. I'm running in yarn-cluster mode. Any help or tips would be appreciated. Thanks, David -- Davi

Re: SparkSQL Array type support - Unregonized Thrift TTypeId value: ARRAY_TYPE

2014-12-23 Thread David Allan
Doh...figured it out. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Array-type-support-Unregonized-Thrift-TTypeId-value-ARRAY-TYPE-tp20817p20832.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

SparkSQL Array type support - Unregonized Thrift TTypeId value: ARRAY_TYPE

2014-12-22 Thread David Allan
.(TableSchema.java:45) at org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:234) ... 51 more Cheers David -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Array-type-support-Unregonized-Thrift-TTypeId-value-ARRAY-TYPE-tp20817.html Sent from the Ap

DAGScheduler StackOverflowError

2014-12-19 Thread David McWhorter
ep getting StackOverflowError's in DAGScheduler such as the one below. I've attached a sample application that illustrates what I'm trying to do. Can anyone point out how I can keep the DAG from growing so large that spark is not able to process it? Thank you, David java.lang.Stac
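
A common way to keep a long, iteratively-built lineage from overflowing the DAGScheduler is to checkpoint the RDD every few iterations so the lineage is truncated. A minimal sketch (spark-shell style; initialRdd and step() are placeholders for the application's own RDD and transformation):

    import org.apache.spark.rdd.RDD

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // any reliable storage location

    var current: RDD[Int] = initialRdd
    for (i <- 1 to 100) {
      current = step(current).cache()
      if (i % 10 == 0) {
        current.checkpoint()   // mark for checkpointing
        current.count()        // force materialization so the lineage is cut here
      }
    }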

Spark streaming: works with collect() but not without collect()

2014-12-11 Thread david
Hi, We use the following Spark Streaming code to collect and process Kafka events: kafkaStream.foreachRDD(rdd => { rdd.collect().foreach(event => { process(event._1, event._2) }) }) This works fine. But without the collect() function, the following exception is rais
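
The usual reason a job works with collect() but not without it is that process (or something it captures) lives only on the driver and is not serializable, so it cannot be shipped to the executors; collect() hides that by dragging all the data back to the driver. A hedged sketch of processing on the executors instead, assuming process can be called (or its resources created) inside the partition closure:

    kafkaStream.foreachRDD { rdd =>
      rdd.foreachPartition { events =>
        // runs on the executors: build any non-serializable clients here
        // instead of capturing them from the driver
        events.foreach { case (key, value) => process(key, value) }
      }
    }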

spark streaming kafka best practices?

2014-12-05 Thread david
Hi, What is the best way to process a batch window in Spark Streaming: kafkaStream.foreachRDD(rdd => { rdd.collect().foreach(event => { // process the event process(event) }) }) Or kafkaStream.foreachRDD(rdd => { rdd.map(event => { // pro

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-30 Thread David Blewett
You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1]. 1. https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400 On Nov 26, 2014 12:24 PM, "Aaron Davidson" wrote: > Spark has a known problem where it will do a pass of metadata on a large > number of small files

Spark SQL Join returns fewer rows than expected

2014-11-25 Thread david
Hi, I have 2 files which come from CSV imports of 2 Oracle tables. F1 has 46730613 rows, F2 has 3386740 rows. I build 2 tables with Spark and join table F1 with table F2 on c1=d1. All keys F2.d1 exist in F1.c1, so I expect to retrieve 46730613 rows. But it returns only 3437 rows // --- b

Spark SQL (1.0)

2014-11-24 Thread david
Hi, I build 2 tables from files and join table F1 with table F2 on c5=d4. F1 has 46730613 rows, F2 has 3386740 rows. All keys d4 exist in F1.c5, so I expect to retrieve 46730613 rows. But it returns only 3437 rows // --- begin code --- val sqlContext = new org.apache.spark.sql.SQLContext(s

subscribe

2014-11-11 Thread DAVID SWEARINGEN

inconsistent edge counts in GraphX

2014-11-10 Thread Buttler, David
Hi, I am building a graph from a large CSV file. Each record contains a couple of nodes and about 10 edges. When I try to load a large portion of the graph, using multiple partitions, I get inconsistent results in the number of edges between different runs. However, if I use a single partitio

RE: Key-Value decomposition

2014-11-04 Thread david
Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Key-Value-decomposition-tp17966p18050.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe,

Re: Key-Value decomposition

2014-11-03 Thread david
Hi, But I only have one RDD. Here is a more complete example: my rdd is something like ("A", "1;2;3"), ("B", "2;5;6"), ("C", "3;2;1") And I expect to have the following result : ("A",1) , ("A",2) , ("A",3) , ("B",2) , ("B",5) , ("B",6) , ("C",3) , ("C",2) , ("C",1) Any idea about how can
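
For reference, a minimal flatMap sketch that produces exactly that decomposition (values converted to Int to match the expected output):

    val rdd = sc.parallelize(Seq(("A", "1;2;3"), ("B", "2;5;6"), ("C", "3;2;1")))
    val decomposed = rdd.flatMap { case (key, values) =>
      values.split(";").map(v => (key, v.toInt))
    }
    // ("A",1), ("A",2), ("A",3), ("B",2), ("B",5), ("B",6), ("C",3), ("C",2), ("C",1)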

Key-Value decomposition

2014-11-03 Thread david
Hi, I'm a newbie in Spark and face the following use case : val data = Array ( "A", "1;2;3") val rdd = sc.parallelize(data) // Something here to produce RDD of (Key,value) // ( "A", "1") , ("A", "2"), ("A", "3") Does anybody know how to do this? Thanks -- View this mess

Re: foreachPartition: write to multiple files

2014-10-08 Thread david
Hi, I finally found a solution after reading the post : http://apache-spark-user-list.1001560.n3.nabble.com/how-to-split-RDD-by-key-and-save-to-different-path-td11887.html#a11983 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/foreachPartition-write-to-
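
One way to write one file per key, in the spirit of that thread, is saveAsHadoopFile with a MultipleTextOutputFormat that derives the file name from the key. A hedged sketch (output path and partition count are illustrative):

    import org.apache.spark.HashPartitioner
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      // one output file per key, named after the key itself
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String]
    }

    rdd.map(f => (f.substring(0, 4), f))
      .partitionBy(new HashPartitioner(16))
      .saveAsHadoopFile("/out/by-key", classOf[String], classOf[String], classOf[KeyBasedOutput])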

foreachPartition: write to multiple files

2014-10-08 Thread david
Hi, I want to write my RDDs to multiple files based on a key value. So, I used groupByKey and iterate over partitions. Here is the code : rdd.map(f => (f.substring(0,4), f)).groupByKey().foreachPartition(iterator => iterator.map { case (key, values) => val fs: FileSystem = File

Re: pyspark cassandra examples

2014-09-30 Thread David Vincelli
Thanks, that worked! I downloaded the version pre-built against hadoop1 and the examples worked. - David On Tue, Sep 30, 2014 at 5:08 PM, Kan Zhang wrote: > > java.lang.IncompatibleClassChangeError: Found interface > org.apache.hadoop.mapreduce.JobContext, but class was expected

pyspark cassandra examples

2014-09-30 Thread David Vincelli
Am I missing a cassandra driver? I have browsed through the documentation and found nothing specifically relevant to cassandra, is there such a piece of documentation? Thank you, - David

Re: aggregateByKey vs combineByKey

2014-09-29 Thread David Rowe
u need to implement three functions: createCombiner, > mergeValue, mergeCombiners. > > Hope this helps! > Liquan > > On Sun, Sep 28, 2014 at 11:59 PM, David Rowe wrote: > >> Hi All, >> >> After some hair pulling, I've reached the realisation that an operation I

aggregateByKey vs combineByKey

2014-09-29 Thread David Rowe
Hi All, After some hair pulling, I've reached the realisation that an operation I am currently doing via: myRDD.groupByKey.mapValues(func) should be done more efficiently using aggregateByKey or combineByKey. Both of these methods would do, and they seem very similar to me in terms of their func
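
For a concrete comparison, here is a sketch computing a per-key mean both ways; aggregateByKey is essentially combineByKey with the combiner-creation step fixed to "merge the value into a zero element":

    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

    // aggregateByKey: zero value + seqOp (fold in a value) + combOp (merge two accumulators)
    val sums1 = pairs.aggregateByKey((0.0, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2))

    // combineByKey: createCombiner + mergeValue + mergeCombiners
    val sums2 = pairs.combineByKey(
      (v: Double) => (v, 1L),
      (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1),
      (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2))

    val means = sums1.mapValues { case (sum, count) => sum / count }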

Re: sortByKey trouble

2014-09-24 Thread david
Thanks, I've already tried this solution but it does not compile (in Eclipse). I'm surprised to see that in spark-shell, sortByKey works fine on both forms: (String,String,String,String) and (String,(String,String,String)) -- View this message in context: http://apache-spark-user-list.1

sortByKey trouble

2014-09-24 Thread david
Hi, Does anybody know how to use sortByKey in Scala on an RDD like : val rddToSave = file.map(l => l.split("\\|")).map(r => (r(34)+"-"+r(3), r(4), r(10), r(12))) because I received the error "sortByKey is not a member of org.apache.spark.rdd.RDD[(String,String,String,String)]". What I t
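
sortByKey is only defined for RDDs of (key, value) pairs (via OrderedRDDFunctions), which is why it is not a member of RDD[(String,String,String,String)]. A minimal sketch of the pair form, assuming file is the RDD[String] from the snippet above:

    val rddToSave = file.map(l => l.split("\\|"))
      .map(r => (r(34) + "-" + r(3), (r(4), r(10), r(12))))   // (key, value) pair
    val sorted = rddToSave.sortByKey()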

Re: Where can I find the module diagram of SPARK?

2014-09-23 Thread David Rowe
Hi Andrew, I can't speak for Theodore, but I would find that incredibly useful. Dave On Wed, Sep 24, 2014 at 11:24 AM, Andrew Ash wrote: > Hi Theodore, > > What do you mean by module diagram? A high level architecture diagram of > how the classes are organized into packages? > > Andrew > > On

Re: Issues with partitionBy: FetchFailed

2014-09-22 Thread David Rowe
y be different from the previous code, > I guess probably some potential bugs may introduced. > > > > Thanks > > Jerry > > > > *From:* David Rowe [mailto:davidr...@gmail.com] > *Sent:* Monday, September 22, 2014 7:12 PM > *To:* Andrew Ash > *Cc:* Shao, Saisai;

Re: Issues with partitionBy: FetchFailed

2014-09-22 Thread David Rowe
Yep, this is what I was seeing. I'll experiment tomorrow with a version prior to the changeset in that ticket. On Mon, Sep 22, 2014 at 8:29 PM, Andrew Ash wrote: > Hi David and Saisai, > > Are the exceptions you two are observing similar to the first one at > https://issue

Re: Issues with partitionBy: FetchFailed

2014-09-21 Thread David Rowe
Hi, I've seen this problem before, and I'm not convinced it's GC. When spark shuffles it writes a lot of small files to store the data to be sent to other executors (AFAICT). According to what I've read around the place the intention is that these files be stored in disk buffers, and since sync()

Re: SQL shell for Spark SQL?

2014-09-18 Thread David Rosenstrauch
nks, DR On 09/18/2014 02:18 AM, Michael Armbrust wrote: Check out the Spark SQL cli <https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-spark-sql-cli> . On Wed, Sep 17, 2014 at 10:50 PM, David Rosenstrauch wrote: Is there a shell available for Spark SQL, similar

SQL shell for Spark SQL?

2014-09-17 Thread David Rosenstrauch
Is there a shell available for Spark SQL, similar to the way the Shark or Hive shells work? From my reading up on Spark SQL, it seems like one can execute SQL queries in the Spark shell, but only from within code in a programming language such as Scala. There does not seem to be any way to di

Shark queries fail after 10% completion with UnknownHostException

2014-09-17 Thread David Rosenstrauch
We're stumped on something really odd. We run a simple shark job. (A simple query against an external table, with the data residing on HDFS - 256 part files, each approximately of size 3.75GB.) The job runs successfully until it gets to about 10% completion (200+ tasks out of approximately 2

Re: Computing mean and standard deviation by key

2014-09-12 Thread David Rowe
Oh I see, I think you're trying to do something like (in SQL): SELECT order, mean(price) FROM orders GROUP BY order In this case, I'm not aware of a way to use the DoubleRDDFunctions, since you have a single RDD of pairs where each pair is of type (KeyType, Iterable[Double]). It seems to me that
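
One way to get per-key statistics without DoubleRDDFunctions is to aggregate a StatCounter per key; a sketch with illustrative data:

    import org.apache.spark.util.StatCounter

    // raw (key, value) pairs: aggregate a StatCounter per key
    val prices = sc.parallelize(Seq(("order1", 10.0), ("order1", 12.0), ("order2", 7.5)))
    val statsByKey = prices.aggregateByKey(new StatCounter())(
      (stats, v) => stats.merge(v),    // fold one value into the accumulator
      (s1, s2) => s1.merge(s2))        // merge two partial accumulators

    val meanAndStdev = statsByKey.mapValues(s => (s.mean, s.stdev))

    // if the values are already grouped as (key, Iterable[Double]):
    // grouped.mapValues(vs => StatCounter(vs))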

Re: Computing mean and standard deviation by key

2014-09-11 Thread David Rowe
I generally call values.stats, e.g.: val stats = myPairRdd.values.stats On Fri, Sep 12, 2014 at 4:46 PM, rzykov wrote: > Is it possible to use DoubleRDDFunctions > < > https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html > > > for calculating mean and std d

spark-ec2 [Errno 110] Connection time out

2014-08-30 Thread David Matheson
line 717, in real_main conn = ec2.connect_to_region(opts.region) Any suggestions on how to debug the cause of the timeout? Note: I replaced the name of my keypair with Blah. Thanks, David -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-Er

Small input split sizes

2014-08-20 Thread David Rosenstrauch
I'm still bumping up against this issue: spark (and shark) are breaking my inputs into 64MB-sized splits. Anyone know where/how to configure spark so that it either doesn't split the inputs, or at least uses a much larger split size? (E.g., 512MB.) Thanks, DR On 07/15/2014 05:58
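
One way to coax Hadoop's input formats into larger splits is to raise the minimum split size on the SparkContext's Hadoop configuration before reading; a hedged sketch (property names differ between Hadoop 1.x and 2.x, and the input path is illustrative):

    val targetSplit = 512L * 1024 * 1024   // 512 MB
    sc.hadoopConfiguration.setLong("mapred.min.split.size", targetSplit)                         // Hadoop 1.x
    sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", targetSplit) // Hadoop 2.x
    val input = sc.textFile("hdfs:///data/big-input")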

saveAsTextFile hangs with hdfs

2014-08-19 Thread David
sorted.size()); }); String outputPath = "/summarized/groupedByTrackingId4"; hdfs.rm(outputPath, true); stringIntegerJavaPairRDD.saveAsTextFile(String.format("%s/%s", hdfs.getUrl(), outputPath)); Thanks in advance, David

Re: Working with many RDDs in parallel?

2014-08-18 Thread David Tinker
r. > 1000 RDD count()s at once isn't a good idea for example. > > It may be the case that you don't really need a bunch of RDDs at all, > but can operate on an RDD of pairs of Strings (roots) and > something-elses, all at once. > > > On Mon, Aug 18, 2014 at 2:31 P

Working with many RDDs in parallel?

2014-08-18 Thread David Tinker
I be doing something else entirely? Thanks David

Spark misconfigured? Small input split sizes in shark query

2014-07-15 Thread David Rosenstrauch
Got a spark/shark cluster up and running recently, and have been kicking the tires on it. However, been wrestling with an issue on it that I'm not quite sure how to solve. (Or, at least, not quite sure about the correct way to solve it.) I ran a simple Hive query (select count ...) against a

Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2014-05-30 Thread David Belling
and how much memory on each? > If you go to the RM web UI at port 8088, how much memory is used? Which > YARN scheduler are you using? > > -Sandy > > > On Fri, May 30, 2014 at 12:38 PM, David Belling > wrote: > >> Hi, >> >> I'm running CDH5 and it

Spark shell never leaves ACCEPTED state in YARN CDH5

2014-05-30 Thread David Belling
Hi, I'm running CDH5 and its bundled Spark (0.9.0). The Spark shell has been coming up fine over the last couple of weeks. However today it doesn't come up and I just see this message over and over: 14/05/30 12:06:05 INFO YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort

Re: Spark Streaming using Flume body size limitation

2014-05-23 Thread David Lemieux
Created https://issues.apache.org/jira/browse/SPARK-1916 I'll submit a pull request soon. /D On May 23, 2014, at 9:56 AM, David Lemieux wrote: > For some reason the patch did not make it. > > Trying via email: > > > > /D > > On May 23, 2014, at 9:52 AM, lem

Re: Spark Streaming using Flume body size limitation

2014-05-23 Thread David Lemieux
For some reason the patch did not make it. Trying via email: /D On May 23, 2014, at 9:52 AM, lemieud wrote: > Hi, > > I think I found the problem. > In SparkFlumeEvent the readExternal method use in.read(bodyBuff) which read > the first 1020 bytes, but no more. The code should make sure to r
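
A minimal sketch of the kind of fix described above (names are illustrative, not the actual SparkFlumeEvent fields): read(buf) may return after filling only part of the buffer, while readFully keeps reading until the whole buffer is filled:

    import java.io.ObjectInput

    def readBody(in: ObjectInput): Array[Byte] = {
      val bodyLength = in.readInt()
      val bodyBuff = new Array[Byte](bodyLength)
      in.readFully(bodyBuff)   // blocks until bodyLength bytes have been read
      bodyBuff
    }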

RE: K-means with large K

2014-04-28 Thread Buttler, David
@spark.apache.org Cc: user@spark.apache.org Subject: Re: K-means with large K David, Just curious to know what kind of use cases demand such large k clusters Chester Sent from my iPhone On Apr 28, 2014, at 9:19 AM, "Buttler, David" mailto:buttl...@llnl.gov>> wrote: Hi, I am trying to

K-means with large K

2014-04-28 Thread Buttler, David
Hi, I am trying to run the K-means code in mllib, and it works very nicely with small K (less than 1000). However, when I try for a larger K (I am looking for 2000-4000 clusters), it seems like the code gets part way through (perhaps just the initialization step) and freezes. The compute nodes

RE:

2014-04-23 Thread Buttler, David
This sounds like a configuration issue. Either you have not set the MASTER correctly, or possibly another process is using up all of the cores Dave From: ge ko [mailto:koenig@gmail.com] Sent: Sunday, April 13, 2014 12:51 PM To: user@spark.apache.org Subject: Hi, I'm still going to start w

Task splitting among workers

2014-04-19 Thread David Thomas
During a Spark stage, how are tasks split among the workers? Specifically for a HadoopRDD, who determines which worker has to get which task?

Checkpoint Vs Cache

2014-04-13 Thread David Thomas
What is the difference between checkpointing and caching an RDD?

hbase scan performance

2014-04-09 Thread David Quigley
Hi all, We are currently using hbase to store user data and periodically doing a full scan to aggregate data. The reason we use hbase is that we need a single user's data to be contiguous, so as user data comes in, we need the ability to update a random access store. The performance of a full hba

Re: Resilient nature of RDD

2014-04-03 Thread David Thomas
but the > re-computation will occur on an executor. So if several partitions are > lost, e.g. due to a few machines failing, the re-computation can be striped > across the cluster making it fast. > > > On Wed, Apr 2, 2014 at 11:27 AM, David Thomas wrote: > >> Can someone e

Resilient nature of RDD

2014-04-02 Thread David Thomas
Can someone explain how RDD is resilient? If one of the partition is lost, who is responsible to recreate that partition - is it the driver program?

Spark webUI - application details page

2014-03-30 Thread David Thomas
Is there a way to see 'Application Detail UI' page (at master:4040) for completed applications? Currently, I can see that page only for running applications, I would like to see various numbers for the application after it has completed.
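
In Spark 1.0 and later this is covered by event logging plus the history server: with the settings below (the log directory is illustrative), completed applications can be browsed through the history server, which listens on port 18080 by default.

    val conf = new org.apache.spark.SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-events")   // must be readable by the history server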

Re: Replicating RDD elements

2014-03-28 Thread David Thomas
That helps! Thank you. On Fri, Mar 28, 2014 at 12:36 AM, Sonal Goyal wrote: > Hi David, > > I am sorry but your question is not clear to me. Are you talking about > taking some value and sharing it across your cluster so that it is present > on all the nodes? You can

Replicating RDD elements

2014-03-27 Thread David Thomas
How can we replicate RDD elements? Say I have 1 element and 100 nodes in the cluster. I need to replicate this one item on all the nodes i.e. effectively create an RDD of 100 elements.
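
Two hedged options, depending on the goal: if the point is only that every node can see the value, a broadcast variable is the idiomatic tool; if an actual 100-element RDD is required, parallelize 100 copies across 100 partitions:

    val item = "the-one-element"   // placeholder value

    // option 1: ship the value to every node once
    val shared = sc.broadcast(item)

    // option 2: an RDD of 100 copies, one per partition
    val replicated = sc.parallelize(Seq.fill(100)(item), 100)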

graphx samples in Java

2014-03-20 Thread David Soroko
, "student")), (7L, ("jgonzal", "postdoc")), (5L, ("franklin", "prof")), (2L, ("istoica", "prof" thanks --david

Round Robin Partitioner

2014-03-13 Thread David Thomas
Is it possible to partition the RDD elements in a round robin fashion? Say I have 5 nodes in the cluster and 5 elements in the RDD. I need to ensure each element gets mapped to each node in the cluster.
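
There is no built-in round-robin partitioner, but one can be sketched by keying each element with its index (zipWithIndex is available from Spark 1.0) and partitioning on index modulo the partition count; note this fixes which partition each element lands in, while Spark still decides which node runs each partition:

    import org.apache.spark.Partitioner

    class RoundRobinPartitioner(parts: Int) extends Partitioner {
      override def numPartitions: Int = parts
      override def getPartition(key: Any): Int = (key.asInstanceOf[Long] % parts).toInt
    }

    val rdd = sc.parallelize(Seq("a", "b", "c", "d", "e"))
    val roundRobin = rdd.zipWithIndex()
      .map { case (elem, idx) => (idx, elem) }
      .partitionBy(new RoundRobinPartitioner(5))
      .values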

Re: Are all transformations lazy?

2014-03-11 Thread David Thomas
Spark runtime/scheduler traverses the DAG starting from > that RDD and triggers evaluation of anything parent RDDs it needs that > aren't computed and cached yet. > > Any future operations build on the same DAG as long as you use the same > RDD objects and, if you used cache

Re: Are all transformations lazy?

2014-03-11 Thread David Thomas
ld be lazy, but > apparently uses an RDD.count call in its implementation: > https://spark-project.atlassian.net/browse/SPARK-1021). > > David Thomas > March 11, 2014 at 9:49 PM > For example, is distinct() transformation lazy? > > when I see the Spark source code, distin

Are all transformations lazy?

2014-03-11 Thread David Thomas
For example, is distinct() transformation lazy? when I see the Spark source code, distinct applies a map-> reduceByKey -> map function to the RDD elements. Why is this lazy? Won't the function be applied immediately to the elements of RDD when I call someRDD.distinct? /** * Return a new RDD
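
The map -> reduceByKey -> map chain inside distinct is itself made of transformations, so calling distinct only extends the DAG; nothing runs until an action is invoked. A small illustration:

    val rdd = sc.parallelize(1 to 1000000)
    val distinctRdd = rdd.map(_ % 10).distinct()   // no job submitted yet: only the DAG grows
    val n = distinctRdd.count()                    // the action triggers the whole chain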

Block

2014-03-10 Thread David Thomas
What is the concept of Block and BlockManager in Spark? How is a Block related to a Partition of a RDD?

Custom RDD

2014-03-10 Thread David Thomas
Is there any guide available on creating a custom RDD?
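
The essentials of a custom RDD are two overrides, getPartitions and compute. A minimal, hedged skeleton that emits a range of integers:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition

    class SimpleRangeRDD(sc: SparkContext, n: Int, slices: Int) extends RDD[Int](sc, Nil) {
      override def getPartitions: Array[Partition] =
        (0 until slices).map { i =>
          new RangePartition(i, i * n / slices, (i + 1) * n / slices): Partition
        }.toArray

      override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
        val p = split.asInstanceOf[RangePartition]
        (p.start until p.end).iterator
      }
    }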

Help with groupByKey

2014-03-02 Thread David Thomas
I have an RDD of (K, Array[V]) pairs. For example: ((key1, (1,2,3)), (key2, (3,2,4)), (key1, (4,3,2))) How can I do a groupByKey such that I get back an RDD of the form (K, Array[V]) pairs. Ex: ((key1, (1,2,3,4,3,2)), (key2, (3,2,4)))
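
Since the values are already arrays, reduceByKey with array concatenation gives the merged form directly (and combines on the map side, unlike groupByKey); a sketch, with a groupByKey-based variant for comparison:

    val rdd = sc.parallelize(Seq(
      ("key1", Array(1, 2, 3)), ("key2", Array(3, 2, 4)), ("key1", Array(4, 3, 2))))

    // concatenate the arrays per key
    val merged = rdd.reduceByKey(_ ++ _)
    // ("key1", Array(1, 2, 3, 4, 3, 2)), ("key2", Array(3, 2, 4))

    // alternative: flatten, group, rebuild the arrays
    val merged2 = rdd.flatMapValues(a => a).groupByKey().mapValues(_.toArray)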

Where does println output go?

2014-03-01 Thread David Thomas
So I'm having this code: rdd.foreach(p => { print(p) }) Where can I see this output? Currently I'm running my spark program on a cluster. When I run the jar using sbt run, I see only INFO logs on the console. Where should I check to see the application sysouts?
