Repeated data item search with Spark SQL (1.0.1)

2014-07-12 Thread anyweil
Hi all, I am using Spark SQL 1.0.1 for a simple test. The loaded data (JSON format), registered as the table "people", is: {"name":"Michael", "schools":[{"name":"ABC","time":1994},{"name":"EFG","time":2000}]} {"name":"Andy", "age":30,"scores":{"eng":98,"phy":89}} {"name":"Justin", "age":19}
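A minimal sketch of how a JSON file like this is typically loaded and registered in Spark SQL 1.0.1; the file name, master, and query are illustrative, and how to search inside the repeated "schools" array is the open question of this thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Sketch only: "people.json" is assumed to contain the three records above, one per line.
    val sc = new SparkContext(new SparkConf().setAppName("RepeatedItemSearch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    val people = sqlContext.jsonFile("people.json")   // schema is inferred, including the nested "schools" array
    people.registerAsTable("people")

    // Top-level fields can be queried directly; searching inside the repeated
    // "schools" element is the question raised in this thread.
    sqlContext.sql("SELECT name FROM people WHERE age > 20").collect().foreach(println)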

Re: spark ui on yarn

2014-07-12 Thread Matei Zaharia
The UI code is the same in both, but one possibility is that your executors were given less memory on YARN. Can you check that? Or otherwise, how do you know that some RDDs were cached? Matei On Jul 12, 2014, at 4:12 PM, Koert Kuipers wrote: > hey shuo, > so far all stage links work fine for

Re: Convert from RDD[Object] to RDD[Array[Object]]

2014-07-12 Thread Mark Hamstra
And if you can relax your constraints even further to only require RDD[List[Int]], then it's even simpler: rdd.mapPartitions(_.grouped(batchedDegree)) On Sat, Jul 12, 2014 at 6:26 PM, Aaron Davidson wrote: > If you don't really care about the batchedDegree, but rather just want to > do operati
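For reference, a self-contained sketch of that approach, assuming an existing SparkContext sc (e.g. the spark-shell) and an illustrative batch size:

    import org.apache.spark.rdd.RDD

    // Sketch: group each partition's iterator into fixed-size batches;
    // the last batch in each partition may be smaller than batchedDegree.
    val batchedDegree = 100
    val data: RDD[Int] = sc.parallelize(1 to 1000, 4)
    val batched: RDD[List[Int]] = data.mapPartitions(_.grouped(batchedDegree).map(_.toList))
    batched.first()   // List(1, 2, ..., 100)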

Re: not getting output from socket connection

2014-07-12 Thread Walrus theCat
Thanks! I thought it would get "passed through" netcat, but given your email, I was able to follow this tutorial and get it to work: http://docs.oracle.com/javase/tutorial/networking/sockets/clientServer.html On Fri, Jul 11, 2014 at 1:31 PM, Sean Owen wrote: > netcat is listening for a conn
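For anyone hitting the same issue, a rough Scala sketch of the server side from that tutorial: a process that accepts one connection and writes newline-terminated lines, which is what a socket-based receiver expects (the port and messages are made up):

    import java.io.PrintWriter
    import java.net.ServerSocket

    // Sketch of a minimal line-emitting server; port 9999 is an arbitrary choice.
    object SimpleLineServer {
      def main(args: Array[String]): Unit = {
        val server = new ServerSocket(9999)
        val socket = server.accept()                        // blocks until a client (the receiver) connects
        val out = new PrintWriter(socket.getOutputStream, true)
        (1 to 100).foreach(i => out.println(s"event $i"))
        out.close()
        socket.close()
        server.close()
      }
    }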

Re: Large Task Size?

2014-07-12 Thread Aaron Davidson
I also did a quick glance through the code and couldn't find anything worrying that should be included in the task closures. The only possibly unsanitary part is the Updater you pass in -- what is your Updater and is it possible it's dragging in a significant amount of extra state? On Sat, Jul 12

Large Task Size?

2014-07-12 Thread Kyle Ellrott
I'm working on a patch to MLlib that allows for multiplexing several different model optimizations using the same RDD ( SPARK-2372: https://issues.apache.org/jira/browse/SPARK-2372 ). In testing larger datasets, I've started to see some memory errors ( java.lang.OutOfMemoryError and "exceeds max all

Supported SQL syntax in Spark SQL

2014-07-12 Thread Nick Chammas
Is there a place where we can find an up-to-date list of supported SQL syntax in Spark SQL? Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Supported-SQL-syntax-in-Spark-SQL-tp9538.html Sent from the Apache Spark User List mailing list archive at Nabb

Re: Putting block rdd failed when running example svm on large data

2014-07-12 Thread Aaron Davidson
Also check the web ui for that. Each iteration will have one or more stages associated with it in the driver web ui. On Sat, Jul 12, 2014 at 6:47 PM, crater wrote: > Hi Xiangrui, > > Thanks for the information. Also, it is possible to figure out the > execution > time per iteration for SVM? > >

Re: Akka Client disconnected

2014-07-12 Thread DB Tsai
https://issues.apache.org/jira/browse/SPARK-2156 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sat, Jul 12, 2014 at 5:23 PM, Srikrishna S wrote: > I am using the master that I compiled

Re: Putting block rdd failed when running example svm on large data

2014-07-12 Thread crater
Hi Xiangrui, Thanks for the information. Also, is it possible to figure out the execution time per iteration for SVM? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Putting-block-rdd-failed-when-running-example-svm-on-large-data-tp9515p9535.html Sent from

Re: FW: memory question

2014-07-12 Thread Aaron Davidson
Spark 1.0.0 introduced the ContextCleaner to replace the MetadataCleaner API for this exact issue. The ContextCleaner automatically cleans up your RDD metadata once the RDD gets garbage collected on the driver. On Wed, Jul 9, 2014 at 3:31 AM, wrote: > Hi, > > > > Does anyone know if it is possi
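In other words, on 1.0.0+ no metadata-cleaner configuration should be needed; a rough illustration of the two cleanup paths, assuming an existing SparkContext sc and a hypothetical input file:

    // Sketch only: explicit cleanup alongside the automatic ContextCleaner path.
    var cached = sc.textFile("data.txt").cache()
    cached.count()

    cached.unpersist()   // explicit, deterministic release of the cached blocks
    cached = null        // or simply drop the reference: once the RDD is garbage-collected
                         // on the driver, the ContextCleaner removes its metadata and blocks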

Re: pyspark sc.parallelize running OOM with smallish data

2014-07-12 Thread Aaron Davidson
I think this is probably dying on the driver itself, as you are probably materializing the whole dataset inside your python driver. How large is spark_data_array compared to your driver memory? On Fri, Jul 11, 2014 at 7:30 PM, Mohit Jaggi wrote: > I put the same dataset into scala (using spark-

Re: Convert from RDD[Object] to RDD[Array[Object]]

2014-07-12 Thread Aaron Davidson
If you don't really care about the batchedDegree, but rather just want to do operations over some set of elements rather than one at a time, then just use mapPartitions(). Otherwise, if you really do want certain-sized batches and you are able to relax the constraints slightly, one option is to construct the

Re: KMeans for large training data

2014-07-12 Thread Aaron Davidson
The "netlib.BLAS: Failed to load implementation" warning only means that the BLAS implementation may be slower than using a native one. The reason why it only shows up at the end is that the library is only used for the finalization step of the KMeans algorithm, so your job should've been wrapping

Convert from RDD[Object] to RDD[Array[Object]]

2014-07-12 Thread Parthus
Hi there, I have a bunch of data in an RDD, which I previously processed one element at a time. For example, there was an RDD denoted by "data: RDD[Object]" and I processed it using "data.map(...)". However, I got a new requirement to process the data in a batched way. This means that I need to convert

Re: Akka Client disconnected

2014-07-12 Thread Srikrishna S
I am using the master that I compiled 2 days ago. Can you point me to the JIRA? On Sat, Jul 12, 2014 at 9:13 AM, DB Tsai wrote: > Are you using 1.0 or current master? A bug related to this is fixed in > master. > > On Jul 12, 2014 8:50 AM, "Srikrishna S" wrote: >> >> I am run logistic regression

Re: Spark Questions

2014-07-12 Thread Aaron Davidson
I am not entirely certain I understand your questions, but let me assume you are mostly interested in SparkSQL and are thinking about your problem in terms of SQL-like tables. 1. Shuo Xiang mentioned Spark partitioning strategies, but in case you are talking about data partitioning or sharding as

Re: Confused by groupByKey() and the default partitioner

2014-07-12 Thread Aaron Davidson
Yes, groupByKey() does partition by the hash of the key unless you specify a custom Partitioner. (1) If you were to use groupByKey() when the data was already partitioned correctly, the data would indeed not be shuffled. Here is the associated code; you'll see that it simply checks that the Partit
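A small sketch of the "already partitioned correctly" case described here, assuming an existing SparkContext sc (data and partition count are illustrative):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._   // pair-RDD implicits in the 1.0.x API

    // Sketch: pre-partition by key, then groupByKey() picks up the existing partitioner,
    // so the grouping step itself does not shuffle.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 4), ("a", 10), ("c", 7)))
    val prePartitioned = pairs.partitionBy(new HashPartitioner(2)).cache()
    val grouped = prePartitioned.groupByKey()   // same partitioner -> no additional shuffle
    grouped.collect()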

Re: Stopping StreamingContext does not kill receiver

2014-07-12 Thread Nicholas Chammas
Okie doke. Thanks for filing the JIRA. On Sat, Jul 12, 2014 at 6:45 PM, Tathagata Das wrote: > Yes, thats a bug i just discovered. Race condition in the Twitter > Receiver, will fix asap. > Here is the JIRA https://issues.apache.org/jira/browse/SPARK-2464 > > TD > > > On Sat, Jul 12, 2014 at 3:

Re: Stopping StreamingContext does not kill receiver

2014-07-12 Thread Tathagata Das
Yes, that's a bug I just discovered: a race condition in the Twitter Receiver; will fix ASAP. Here is the JIRA https://issues.apache.org/jira/browse/SPARK-2464 TD On Sat, Jul 12, 2014 at 3:21 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > To add a potentially relevant piece of informa

Re: Putting block rdd failed when running example svm on large data

2014-07-12 Thread Xiangrui Meng
By default, Spark uses half of the memory for caching RDDs (configurable by spark.storage.memoryFraction). That is about 25 * 8 / 2 = 100G for your setup, which is smaller than the 202G data size. So you don't have enough memory to fully cache the RDD. You can confirm it in the storage tab of the W
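Both settings are plain Spark properties; a sketch of how they might be set when building the context (the memory and fraction values here are purely illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: spark.storage.memoryFraction controls the share of each executor
    // heap reserved for cached RDD blocks.
    val conf = new SparkConf()
      .setAppName("binary-classification")
      .set("spark.executor.memory", "25g")          // per-executor heap (illustrative)
      .set("spark.storage.memoryFraction", "0.5")   // share of that heap used for the RDD cache
    val sc = new SparkContext(conf)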

Re: Stopping StreamingContext does not kill receiver

2014-07-12 Thread Nicholas Chammas
To add a potentially relevant piece of information, around when I stop the StreamingContext, I get the following warning: 14/07/12 22:16:18 WARN ReceiverTracker: All of the receivers have not deregistered, Map(0 -> ReceiverInfo(0,TwitterReceiver-0,Actor[akka.tcp://spark@url-here:49776/user/Receive

Stopping StreamingContext does not kill receiver

2014-07-12 Thread Nick Chammas
From the interactive shell I’ve created a StreamingContext. I call ssc.start() and take a look at http://master_url:4040/streaming/ and see that I have an active Twitter receiver. Then I call ssc.stop(stopSparkContext = false, stopGracefully = true) and wait a bit, but the receiver seems to stay
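For context, the call sequence described here looks roughly like this from the shell (the batch interval and the receiver setup are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch, assuming an existing SparkContext sc (e.g. the interactive shell).
    val ssc = new StreamingContext(sc, Seconds(1))
    // ... attach the Twitter receiver stream and an output operation here, then:
    ssc.start()

    // Later: stop the streaming computation but keep the SparkContext alive.
    ssc.stop(stopSparkContext = false, stopGracefully = true)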

Scalability issue in Spark with SparkPageRank example

2014-07-12 Thread lokesh.gidra
Hello, I ran the SparkPageRank example (the one included in the package) to evaluate the scale-in capability of Spark. I ran experiments on an 8-node, 48-core AMD machine with a local[N] master. But for N > 10, the completion time of the experiment kept increasing rather than decreasing. When I profi

Re: Anaconda Spark AMI

2014-07-12 Thread Benjamin Zaitlen
Hi All, Thanks to Jey's help, I have a release AMI candidate for spark-1.0/anaconda-2.0 integration. It's currently limited to availability in US-EAST: ami-3ecd0c56 Give it a try if you have some time. This should *just work* with Spark 1.0: ./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa -a ami-3e

Re: spark ui on yarn

2014-07-12 Thread Koert Kuipers
Hey Shuo, so far all stage links work fine for me. I did some more testing, and it seems kind of random what shows up on the GUI and what does not: some partially cached RDDs make it to the GUI, while some fully cached ones do not. I have not been able to detect a pattern. Is the codebase for the

Confused by groupByKey() and the default partitioner

2014-07-12 Thread Guanhua Yan
Hi: I have trouble understanding the default partitioner (hash) in Spark. Suppose that an RDD with two partitions is created as follows: x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2) Does Spark partition x based on the hash of the key (e.g., "a", "b", "c") by default? (1) Assumi

Re: Spark Streaming timing considerations

2014-07-12 Thread Laeeq Ahmed
Hi, Thanks I will try to implement it. Regards, Laeeq On Saturday, July 12, 2014 4:37 AM, Tathagata Das wrote: This is not in the current streaming API. Queue stream is useful for testing with generated RDDs, but not for actual data. For actual data stream, the slack time can be implem

Putting block rdd failed when running example svm on large data

2014-07-12 Thread crater
Hi, I am trying to run the example BinaryClassification (org.apache.spark.examples.mllib.BinaryClassification) on a 202G file. I am constantly getting messages like the one below; is this normal, or am I missing something? 14/07/12 09:49:04 WARN BlockManager: Block rdd_4_196 could not be dropped f

Re: Akka Client disconnected

2014-07-12 Thread DB Tsai
Are you using 1.0 or current master? A bug related to this is fixed in master. On Jul 12, 2014 8:50 AM, "Srikrishna S" wrote: > I am run logistic regression with SGD on a problem with about 19M > parameters (the kdda dataset from the libsvm library) > > I consistently see that the nodes on my com

Akka Client disconnected

2014-07-12 Thread Srikrishna S
I am running logistic regression with SGD on a problem with about 19M parameters (the kdda dataset from the libsvm library). I consistently see that the nodes on my computer get disconnected and soon the whole job comes to a grinding halt. 14/07/12 03:05:16 ERROR cluster.YarnClientClusterScheduler: Los

Re: Announcing Spark 1.0.1

2014-07-12 Thread Patrick Wendell
Thanks for pointing this out Brad! I updated the version identified in the docs. On Sat, Jul 12, 2014 at 6:08 AM, Brad Miller wrote: > Hi All, > > Congrats to the entire Spark team on the 1.0.1 release. > > In checking out the new features, I noticed that it looks like the > python API docs have

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-12 Thread M Singh
Thanks TD. BTW, if I have an input file of ~250 GB, is there any guideline on whether to use: * a single input file (250 GB) (in this case, is there any maximum upper bound?), or * a split into 1000 files of 250 MB each (the HDFS block size is 250 MB), or * a multiple of the HDFS block size?

Re: Announcing Spark 1.0.1

2014-07-12 Thread Brad Miller
Hi All, Congrats to the entire Spark team on the 1.0.1 release. In checking out the new features, I noticed that it looks like the Python API docs have been updated, but the header at the top of the page still says "Spark 1.0.0". Clearly not a big deal... I just wouldn't want anyone to g

Re: KMeans for large training data

2014-07-12 Thread durin
Your latest response doesn't show up here yet; I only got the mail. I'll still answer here in the hope that it appears later: Which memory setting do you mean? I can go up with spark.executor.memory a bit; it's currently set to 12G. But that's already way more than the whole SchemaRDD of Vectors th

Re: KMeans for large training data

2014-07-12 Thread durin
Thanks, setting the number of partitions to the number of executors helped a lot and training with 20k entries got a lot faster. However, when I tried training with 1M entries, after about 45 minutes of calculations, I get this: It's stuck at this point. The CPU load for the master is at 100% (
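A rough sketch of the repartitioning step that helped, assuming an existing SparkContext sc; the input file, executor count, k, and iteration count are all illustrative:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Sketch: match the number of training partitions to the number of executors.
    val numExecutors = 8
    val training = sc.textFile("features.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .repartition(numExecutors)
      .cache()

    val model = KMeans.train(training, 100, 20)   // k = 100 clusters, 20 iterations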

Re: How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-12 Thread Yan Fang
Thank you, Tathagata. That explains. Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Fri, Jul 11, 2014 at 7:21 PM, Tathagata Das wrote: > Task slot is equivalent to core number. So one core can only run one task > at a time. > > TD > > > On Fri, Jul 11, 2014 at 1:57 PM, Yan Fang wrote: >

Re: Spark Questions

2014-07-12 Thread Shuo Xiang
For your first question, the partitioning strategy can be tuned by applying a different partitioner. You can use existing ones such as HashPartitioner or write your own. See this link ( http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf) for some instr

Re: spark ui on yarn

2014-07-12 Thread Shuo Xiang
Hi Koert, Just curious, did you find any information like "CANNOT FIND ADDRESS" after clicking into some stage? I've seen similar problems due to lost executors. Best, On Fri, Jul 11, 2014 at 4:42 PM, Koert Kuipers wrote: > I just tested a long lived application (that we normally run in s