Re: CPU RAM

2014-09-17 Thread Akhil Das
Ganglia does give you cluster-wide and per-machine utilization of resources, but I don't think it gives you per-Spark-job figures. If you want to build something from scratch then you can do something like: 1. Log in to the machine 2. Get the PIDs 3. For network IO per process, you can have a look at

Re: The difference between pyspark.rdd.PipelinedRDD and pyspark.rdd.RDD

2014-09-17 Thread Davies Liu
PipelinedRDD is an RDD generated by a Python mapper/reducer; for example, rdd.map(func) will be a PipelinedRDD. PipelinedRDD is a subclass of RDD, so it should have all the APIs which RDD has. sc.parallelize(range(10)).map(lambda x: (x, str(x))).sortByKey().count() 10 I'm wondering how you can

Re: collect on hadoopFile RDD returns wrong results

2014-09-17 Thread vasiliy
it also appears in streaming hdfs fileStream

Re: collect on hadoopFile RDD returns wrong results

2014-09-17 Thread Akhil Das
Can you dump out a small piece of data while doing rdd.collect and rdd.foreach(println)? Thanks Best Regards On Wed, Sep 17, 2014 at 12:26 PM, vasiliy zadonsk...@gmail.com wrote: it also appears in streaming hdfs fileStream

Re: permission denied on local dir

2014-09-17 Thread Sean Owen
Yes, that is how it is supposed to work. Apps run as yarn and do not generally expect to depend on local file state that is created externally. This directory should be owned by yarn though right? Your error does not show permission denied. It looks like you are unable to list a yarn dir as your

Re: collect on hadoopFile RDD returns wrong results

2014-09-17 Thread vasiliy
full code example: def main(args: Array[String]) { val conf = new SparkConf().setAppName("ErrorExample").setMaster("local[8]") .set("spark.serializer", classOf[KryoSerializer].getName) val sc = new SparkContext(conf) val rdd = sc.hadoopFile( "hdfs://./user.avro",

Re: Configuring Spark for heterogeneous hardware

2014-09-17 Thread Victor Tso-Guillen
I'm supposing that there's no good solution to having heterogeneous hardware in a cluster. What are the prospects of having something like this in the future? Am I missing an architectural detail that precludes this possibility? Thanks, Victor On Fri, Sep 12, 2014 at 12:10 PM, Victor Tso-Guillen

Re: Configuring Spark for heterogeneous hardware

2014-09-17 Thread Sean Owen
I thought I answered this ... you can easily accomplish this with YARN by just telling YARN how much memory / CPU each machine has. This can be configured in groups too rather than per machine. I don't think you actually want differently-sized executors, and so don't need ratios. But you can have

Re: YARN mode not available error

2014-09-17 Thread Sean Owen
Caused by: java.lang.ClassNotFoundException: org.apache.spark.scheduler.cluster.YarnClientClusterScheduler It sounds like you perhaps deployed a custom build of Spark that did not include YARN support? You need -Pyarn in your build. On Wed, Sep 17, 2014 at 4:47 AM, Barrington

Re: Configuring Spark for heterogeneous hardware

2014-09-17 Thread Victor Tso-Guillen
Hmm, interesting. I'm using standalone mode but I could consider YARN. I'll have to simmer on that one. Thanks as always, Sean! On Wed, Sep 17, 2014 at 12:40 AM, Sean Owen so...@cloudera.com wrote: I thought I answered this ... you can easily accomplish this with YARN by just telling YARN how

Re: Questions about Spark speculation

2014-09-17 Thread Andrew Ash
Hi Nicolas, I've had suspicions about speculation causing problems on my cluster but don't have any hard evidence of it yet. I'm also interested in why it's turned off by default. On Tue, Sep 16, 2014 at 3:01 PM, Nicolas Mai nicolas@gmail.com wrote: Hi, guys My current project is using

Do I Need to Set Checkpoint Interval for Every DStream?

2014-09-17 Thread Ji ZHANG
Hi, I'm using Spark Streaming 1.0. I create a DStream with KafkaUtils and apply some operations on it. There's a reduceByWindow operation at the end, so I suppose the checkpoint interval should be automatically set to more than 10 seconds. But what I see is that it still checkpoints every 2 seconds (my batch

About the Spark Scala API Excption

2014-09-17 Thread churly lin
Hi all: The Scala API here: http://spark.apache.org/docs/latest/api/scala/#package I found that it doesn't specify the exceptions of each function or class. Then how can I know whether a function throws an exception or not without reading the source code?

Change RDDs using map()

2014-09-17 Thread Deep Pradhan
Hi, I want to make the following change in an RDD (create a new RDD from the existing one to reflect some transformation): in an RDD of key-value pairs, I want to get the keys for which the values are 1. How do I do this using map()? Thank You

Re: Change RDDs using map()

2014-09-17 Thread Mark Hamstra
You don't. That's what filter or the partial-function version of collect are for: val transformedRDD = yourRDD.collect { case (k, v) if v == 1 => k } On Wed, Sep 17, 2014 at 3:24 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I want to make the following changes in the RDD (create new
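For illustration, a minimal runnable Scala sketch of both approaches (the sample data and RDD names are placeholders, not from the thread):

    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 1)))
    // partial-function version of collect: keep pairs whose value is 1 and emit the key
    val keys1 = rdd.collect { case (k, v) if v == 1 => k }
    // equivalent filter + map
    val keys2 = rdd.filter { case (_, v) => v == 1 }.map(_._1)
    keys1.collect().foreach(println)   // prints a and c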

Re: CPU RAM

2014-09-17 Thread VJ Shalish
Hi, I need the same through Java. Doesn't the Spark API support this? On Wed, Sep 17, 2014 at 2:48 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Ganglia does give you a cluster wide and per machine utilization of resources, but I don't think it gives you per Spark Job. If you want to build

pyspark on yarn - lost executor

2014-09-17 Thread Oleg Ruchovets
Hi, I am executing pyspark on YARN. I successfully processed the initial dataset, but now I have grown it to 10 times the size. During execution I keep getting this error: 14/09/17 19:28:50 ERROR cluster.YarnClientClusterScheduler: Lost executor 68 on UCS-NODE1.sms1.local: remote Akka client

Short Circuit Local Reads

2014-09-17 Thread Gary Malouf
Cloudera had a blog post about this in August 2013: http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/ Has anyone been using this in production? Curious as to whether it made a significant difference from a Spark perspective.

Number of partitions when saving (pyspark)

2014-09-17 Thread Luis Guerra
Hi everyone, Is it possible to fix the number of tasks related to a saveAsTextFile in Pyspark? I am loading several files from HDFS, fixing the number of partitions to X (let's say 40 for instance). Then some transformations, like joins and filters are carried out. The weird thing here is that

Re: pyspark on yarn - lost executor

2014-09-17 Thread Eric Friedman
How many partitions do you have in your input rdd? Are you specifying numPartitions in subsequent calls to groupByKey/reduceByKey? On Sep 17, 2014, at 4:38 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi , I am execution pyspark on yarn. I have successfully executed initial dataset

Adjacency List representation in Spark

2014-09-17 Thread Harsha HN
Hello, We are building an adjacency list to represent a graph. Vertices, edges and weights for the same have been extracted from HDFS files by a Spark job. Further, we expect the size of the adjacency list (hash map) could grow over 20 GB. How can we represent this in an RDD, so that it will be distributed in

list of documents sentiment analysis - problem with defining proper approach with Spark

2014-09-17 Thread xnts
Hi, For the last few days I have been working on an exercise where I want to understand the sentiment of a set of articles. As input I have an XML file with articles and the AFINN-111.txt file defining the sentiment of a few hundred words. What I am able to do without any problem is loading of the data,

Re: The difference between pyspark.rdd.PipelinedRDD and pyspark.rdd.RDD

2014-09-17 Thread edmond_huo
Hi Davies, Thank you for your answer. This is my code. I think it is very similar to the word count example in Spark: lines = sc.textFile(sys.argv[2]) sie = lines.map(lambda l: (l.strip().split(',')[4],1)).reduceByKey(lambda a, b: a + b) sort_sie = sie.sortByKey(False) Thanks again.

Re: The difference between pyspark.rdd.PipelinedRDD and pyspark.rdd.RDD

2014-09-17 Thread edmond_huo
Hi Davies, When I run your code in pyspark, I still get the same error: sc.parallelize(range(10)).map(lambda x: (x, str(x))).sortByKey().count() Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'PipelinedRDD' object has no attribute 'sortByKey' Is it the

Spark and disk usage.

2014-09-17 Thread Макар Красноперов
Hello everyone. The problem is that Spark writes data to disk very heavily, even though the application has a lot of free memory (about 3.8g). I've noticed that a folder with a name like spark-local-20140917165839-f58c contains a lot of other folders with files like shuffle_446_0_1. The total size of

Re: pyspark on yarn - lost executor

2014-09-17 Thread Oleg Ruchovets
Sure, I'll post to the mail list. groupByKey(self, numPartitions=None) (source code: http://spark.apache.org/docs/1.0.2/api/python/pyspark.rdd-pysrc.html#RDD.groupByKey) Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions

Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
Hi, The files you mentioned are temporary files written by Spark during shuffling. ALS will write a LOT of those files, as it is a shuffle-heavy algorithm. Those files will be deleted after your program completes, as Spark looks for those files in case a fault occurs. Having those files ready

Re: permission denied on local dir

2014-09-17 Thread style95
However, in my application there is no logic that accesses local files, so I thought that Spark is internally using the local file system to cache RDDs. As per the log, it looks like the error occurred in Spark's internal logic rather than in my business logic. It is trying to delete local directories.

Re: Stable spark streaming app

2014-09-17 Thread Soumitra Kumar
Hmm, no response to this thread! Adding to it, please share experiences of building an enterprise grade product based on Spark Streaming. I am exploring Spark Streaming for enterprise software and am cautiously optimistic about it. I see huge potential to improve debuggability of Spark. -

GroupBy Key and then sort values with the group

2014-09-17 Thread abraham.jacob
Hi Group, I am quite fresh in the spark world. There is a particular use case that I just cannot understand how to accomplish in spark. I am using Cloudera CDH5/YARN/Java 7. I have a dataset that has the following characteristics - A JavaPairRDD that represents the following - Key = {int ID}

Re: pyspark on yarn - lost executor

2014-09-17 Thread Davies Liu
Maybe the Python worker uses too much memory during groupByKey(); groupByKey() with a larger numPartitions can help. Also, can you upgrade your cluster to 1.1? It can spill the data to disk if the memory cannot hold all the data during groupByKey(). Also, if there is a hot key with dozens of

Re: Number of partitions when saving (pyspark)

2014-09-17 Thread Davies Liu
On Wed, Sep 17, 2014 at 5:21 AM, Luis Guerra luispelay...@gmail.com wrote: Hi everyone, Is it possible to fix the number of tasks related to a saveAsTextFile in Pyspark? I am loading several files from HDFS, fixing the number of partitions to X (let's say 40 for instance). Then some

Re: GroupBy Key and then sort values with the group

2014-09-17 Thread Sean Owen
You just need to call mapValues() to change your Iterable of things into a sorted Iterable of things for each key-value pair. In that function you write, it's no different from any other Java program. I imagine you'll need to copy the input Iterable into an ArrayList (unfortunately), sort it with
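For illustration, a minimal Scala sketch of the approach Sean describes (the original thread uses Java, where Collections.sort on an ArrayList plays the same role; the sample data is made up):

    val pairs = sc.parallelize(Seq((1, "b"), (1, "a"), (2, "c")))
    val sortedGroups = pairs.groupByKey().mapValues { vs =>
      vs.toList.sorted   // copy the Iterable into a local collection and sort it
    }
    sortedGroups.collect().foreach(println)   // (1,List(a, b)), (2,List(c))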

RE: GroupBy Key and then sort values with the group

2014-09-17 Thread abraham.jacob
Thanks Sean, Makes total sense. I guess I was so caught up with RDDs and all the wonderful transformations they can do that I did not think about plain old Java Collections.sort(list, comparator). Thanks, __ Abraham -Original Message- From: Sean Owen

Re: Adjacency List representation in Spark

2014-09-17 Thread Andrew Ash
Hi Harsha, You could look through the GraphX source to see the approach taken there for ideas in your own. I'd recommend starting at https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/Graph.scala#L385 to see the storage technique. Why do you want to avoid

Re: Spark and disk usage.

2014-09-17 Thread Andrew Ash
Hi Burak, Most discussions of checkpointing in the docs is related to Spark streaming. Are you talking about the sparkContext.setCheckpointDir()? What effect does that have? https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing On Wed, Sep 17, 2014 at 7:44 AM,

Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
Hi Andrew, Yes, I'm referring to sparkContext.setCheckpointDir(). It has the same effect as in Spark Streaming. For example, in an algorithm like ALS, the RDDs go through many transformations and the lineage of the RDD starts to grow drastically just like the lineage of DStreams do in Spark
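For illustration, a minimal sketch of non-streaming RDD checkpointing as described here (the checkpoint directory and data are placeholders):

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // must be fault-tolerant storage such as HDFS
    val reduced = sc.parallelize(1 to 100).map(i => (i % 10, i)).reduceByKey(_ + _)
    reduced.checkpoint()   // truncates the lineage at this RDD
    reduced.count()        // the checkpoint is written when the RDD is first materialized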

Re: Spark and disk usage.

2014-09-17 Thread Andrew Ash
Thanks for the info! Are there performance impacts with writing to HDFS instead of local disk? I'm assuming that's why ALS checkpoints every third iteration instead of every iteration. Also I can imagine that checkpointing should be done every N shuffles instead of every N operations (counting

Re: Short Circuit Local Reads

2014-09-17 Thread Matei Zaharia
I'm pretty sure it does help, though I don't have any numbers for it. In any case, Spark will automatically benefit from this if you link it to a version of HDFS that contains this. Matei On September 17, 2014 at 5:15:47 AM, Gary Malouf (malouf.g...@gmail.com) wrote: Cloudera had a blog post

How to ship cython library to workers?

2014-09-17 Thread freedafeng
I have a library written in Cython and C. Wondering if it can be shipped to workers which don't have Cython installed. Maybe create an egg package from this library? How?

how to group within the messages at a vertex?

2014-09-17 Thread spr
Sorry if this is in the docs someplace and I'm missing it. I'm trying to implement label propagation in GraphX. The core step of that algorithm is - for each vertex, find the most frequent label among its neighbors and set its label to that. (I think) I see how to get the input from all the

Spark Streaming - batchDuration for streaming

2014-09-17 Thread alJune
Hi Spark experts, I have some doubts about the StreamingContext batchDuration parameter. I've already read the excellent explanation by Tathagata Das about the difference between batch and window duration, but there is still some confusion. For example, without any window, Spark Streaming acts as an implicit

Re: how to group within the messages at a vertex?

2014-09-17 Thread Ankur Dave
At 2014-09-17 11:39:19 -0700, spr s...@yarcdata.com wrote: I'm trying to implement label propagation in GraphX. The core step of that algorithm is - for each vertex, find the most frequent label among its neighbors and set its label to that. [...] It seems on the broken line above, I
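For reference, a minimal sketch using the LabelPropagation implementation that ships in GraphX's lib package (this is not from Ankur's reply; it assumes an existing Graph[VD, ED] named graph, and the step count is arbitrary):

    import org.apache.spark.graphx.lib.LabelPropagation
    // each vertex starts with its own id as its label and adopts the most frequent neighbor label each step
    val labeled = LabelPropagation.run(graph, maxSteps = 5)
    labeled.vertices.take(10).foreach(println)   // (vertexId, communityLabel)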

Re: Spark and disk usage.

2014-09-17 Thread Burak Yavuz
Yes, writing to HDFS is more expensive, but I feel it is still a small price to pay when compared to having a Disk Space Full error three hours in and having to start from scratch. The main goal of checkpointing is to truncate the lineage. Clearing up shuffle writes comes as a bonus to

Re: OutOfMemoryError with basic kmeans

2014-09-17 Thread st553
Not sure if you resolved this but I had a similar issue and resolved it. In my case, the problem was the ids of my items were of type Long and could be very large (even though there are only a small number of distinct ids... maybe a few hundred of them). KMeans will create a dense vector for the

How to run kmeans after pca?

2014-09-17 Thread st553
I would like to reduce the dimensionality of my data before running kmeans. The problem I'm having is that both RowMatrix.computePrincipalComponents() and RowMatrix.computeSVD() return a DenseMatrix whereas KMeans.train() requires an RDD[Vector]. Does MLlib provide a way to do this conversion?
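For illustration, a minimal sketch of one way to bridge the two APIs: project the rows onto the principal components with RowMatrix.multiply, whose result exposes the RDD[Vector] that KMeans.train expects (the sample data and parameter values are made up):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0, 4.0),
      Vectors.dense(2.0, 1.0, 4.0, 3.0),
      Vectors.dense(4.0, 3.0, 2.0, 1.0)))
    val mat = new RowMatrix(rows)
    val pc = mat.computePrincipalComponents(2)   // local n x k matrix
    val projected = mat.multiply(pc).rows        // RDD[Vector] in the reduced space
    val model = KMeans.train(projected, 2, 20)   // k = 2, 20 iterations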

Re: partitioned groupBy

2014-09-17 Thread Akshat Aranya
Patrick, If I understand this correctly, I won't be able to do this in the closure provided to mapPartitions() because that's going to be stateless, in the sense that a hash map that I create within the closure would only be useful for one call of MapPartitionsRDD.compute(). I guess I would need

Re: Stable spark streaming app

2014-09-17 Thread Tim Smith
I don't have anything in production yet but I now at least have a stable (running for more than 24 hours) streaming app. Earlier, the app would crash for all sorts of reasons. Caveats/setup: - Spark 1.0.0 (I have no input flow control unlike Spark 1.1) - Yarn for RM - Input and Output to Kafka -

Re: Stable spark streaming app

2014-09-17 Thread Soumitra Kumar
Thanks Tim for the detailed email, would help me a lot. I have: . 3 nodes CDH5.1 cluster with 16G memory. . Majority of code in Scala, some part in Java (using Cascading earlier). . Inputs are bunch of textFileStream directories. . Every batch output is going to Parquet files, and to HBase

RE: Change RDDs using map()

2014-09-17 Thread qihong
If you want the result as an RDD of (key, 1): new_rdd = rdd.filter(x => x._2 == 1) If you want the result as an RDD of keys (since you know the values are 1), then: new_rdd = rdd.filter(x => x._2 == 1).map(x => x._1) x._1 and x._2 are the Scala way to access the key and value of a key/value pair.

Re: Dealing with Time Series Data

2014-09-17 Thread qihong
What are you trying to do? Generate time series from your data in HDFS, or do some transformation and/or aggregation on your time series data in HDFS?

Size exceeds Integer.MAX_VALUE in BlockFetcherIterator

2014-09-17 Thread francisco
Hi, We are running aggregation on a huge data set (few billion rows). While running the task got the following error (see below). Any ideas? Running spark 1.1.0 on cdh distribution. ... 14/09/17 13:33:30 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 2083 bytes result sent to driver

RE: Stable spark streaming app

2014-09-17 Thread abraham.jacob
Nice write-up... very helpful! -Original Message- From: Tim Smith [mailto:secs...@gmail.com] Sent: Wednesday, September 17, 2014 1:11 PM Cc: spark users Subject: Re: Stable spark streaming app I don't have anything in production yet but I now at least have a stable (running for more

Re: partitioned groupBy

2014-09-17 Thread Patrick Wendell
If you'd like to re-use the resulting inverted map, you can persist the result: x = myRdd.mapPartitions(create inverted map).persist() Your function would create the reverse map and then return an iterator over the keys in that map. On Wed, Sep 17, 2014 at 1:04 PM, Akshat Aranya
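For illustration, a minimal Scala sketch of that pseudo-code (the types and the value-to-key inversion are placeholders; adapt to your own data):

    import scala.collection.mutable
    val myRdd = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
    val inverted = myRdd.mapPartitions { iter =>
      val m = mutable.HashMap.empty[String, String]
      iter.foreach { case (k, v) => m(v) = k }   // build the per-partition inverted map
      m.iterator                                 // expose its entries as the partition's output
    }.persist()                                  // reuse later without rebuilding the map
    inverted.count()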

Re: Size exceeds Integer.MAX_VALUE in BlockFetcherIterator

2014-09-17 Thread Burak Yavuz
Hi, Could you try repartitioning the data by .repartition(# of cores on machine) or while reading the data, supply the number of minimum partitions as in sc.textFile(path, # of cores on machine). It may be that the whole data is stored in one block? If it is billions of rows, then the indexing
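For illustration, a minimal sketch of the two suggestions (the path and partition counts are placeholders):

    // supply a minimum number of partitions when reading...
    val raw = sc.textFile("hdfs:///data/huge-input", 400)
    // ...or repartition an existing RDD so no single block grows past the 2 GB limit
    val repartitioned = raw.repartition(400)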

Re: Spark Streaming - batchDuration for streaming

2014-09-17 Thread qihong
Here's the official Spark documentation about batch size/interval: http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-size Spark is batch-oriented processing. As you mentioned, streaming is a continuous flow, and core Spark cannot handle it. Spark Streaming
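For illustration, a minimal Scala sketch of setting the batch interval when creating the StreamingContext (the durations are arbitrary examples):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    // every DStream in this context is cut into 10-second batches
    val ssc = new StreamingContext(sc, Seconds(10))
    // windowed operations then use multiples of that interval, e.g.
    // stream.reduceByWindow(_ + _, Seconds(60), Seconds(20))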

SPARK BENCHMARK TEST

2014-09-17 Thread VJ Shalish
I am trying to benchmark Spark in a Hadoop cluster. I need to design a sample Spark job to test the CPU utilization, RAM usage, input throughput, output throughput and duration of execution in the cluster. I need to test the state of the cluster for: a Spark job which uses high CPU, a Spark

Re: Stable spark streaming app

2014-09-17 Thread Tim Smith
Thanks :) On Wed, Sep 17, 2014 at 2:10 PM, Paul Wais pw...@yelp.com wrote: Thanks Tim, this is super helpful! Question about jars and spark-submit: why do you provide myawesomeapp.jar as the program jar but then include other jars via the --jars argument? Have you tried building one uber

spark-1.1.0-bin-hadoop2.4 java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass

2014-09-17 Thread Andy Davidson
Hi, I am new to Spark. I am trying to write a simple Java program that processes tweets that were collected and stored in a file. I figured the simplest thing to do would be to convert the JSON string into a Java map. When I submit my jar file I keep getting the following error

problem with HiveContext inside Actor

2014-09-17 Thread Du Li
Hi, I wonder if anybody has had a similar experience or any suggestions here. I have an akka Actor that processes database requests as high-level messages. Inside this Actor, it creates a HiveContext object that does the actual DB work. The main thread creates the needed SparkContext and passes it in to the

Re: permission denied on local dir

2014-09-17 Thread style95
To be clear, that YARN cluster is managed by another team, which means I do not have permission to change the system. If required, I can make a request to them, but as of now I only have permission to manage my Spark application. So if there is any way to solve this problem by changing

Re: LZO support in Spark 1.0.0 - nothing seems to work

2014-09-17 Thread Vipul Pandey
It works for me : export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native export

Re: Huge matrix

2014-09-17 Thread Debasish Das
Hi Reza, In similarColumns, it seems with cosine similarity I also need other numbers such as intersection, jaccard and other measures... Right now I modified the code to generate jaccard but I had to run it twice due to the design of RowMatrix / CoordinateMatrix...I feel we should modify

RE: problem with HiveContext inside Actor

2014-09-17 Thread Cheng, Hao
Hi Du, I am not sure what you mean by "triggers the HiveContext to create a database"; do you create a subclass of HiveContext? Just be sure you call HiveContext.sessionState eagerly, since it will set the proper hiveconf into the SessionState; otherwise the HiveDriver will always get the

Re: Cannot run unit test.

2014-09-17 Thread Jies
When I run `sbt test-only SparkTest` or `sbt test-only SparkTest1`, the tests pass. But when I run `sbt test` to run SparkTest and SparkTest1 together, they fail. If I merge all the cases into one file, the tests pass.

MLLib: LIBSVM issue

2014-09-17 Thread Sameer Tilak
Hi All, We have a fairly large amount of sparse data. I was following these instructions in the manual: "Sparse data: It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and

Re: problem with HiveContext inside Actor

2014-09-17 Thread Michael Armbrust
- dev Is it possible that you are constructing more than one HiveContext in a single JVM? Due to global state in Hive code this is not allowed. Michael On Wed, Sep 17, 2014 at 7:21 PM, Cheng, Hao hao.ch...@intel.com wrote: Hi, Du I am not sure what you mean “triggers the HiveContext to

Move Spark configuration from SPARK_CLASSPATH to spark-default.conf , HiveContext went wrong with Class com.hadoop.compression.lzo.LzoCodec not found

2014-09-17 Thread Zhun Shen
Hi there, My production environment is AWS EMR with Hadoop 2.4.0 and Spark 1.0.2. I moved the Spark configuration from SPARK_CLASSPATH to spark-default.conf, and then the HiveContext went wrong. I also found this WARN info: "WARN DataNucleus.General: Plugin (Bundle) org.datanucleus.store.rdbms is already

Re: MLLib: LIBSVM issue

2014-09-17 Thread Burak Yavuz
Hi, The spacing between the inputs should be a single space, not a tab. I feel like your inputs have tabs between them instead of a single space. Therefore the parser cannot parse the input. Best, Burak - Original Message - From: Sameer Tilak ssti...@live.com To: user@spark.apache.org
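For reference, a minimal sketch of the expected LIBSVM layout and the loading call (the file path is a placeholder):

    // each line: <label> <index1>:<value1> <index2>:<value2> ... separated by single spaces,
    // e.g.  1 1:0.5 3:1.2 10:0.7
    import org.apache.spark.mllib.util.MLUtils
    val examples = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample_libsvm_data.txt")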

SchemaRDD and RegisterAsTable

2014-09-17 Thread Addanki, Santosh Kumar
Hi, We built Spark 1.1.0 with Maven using mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive clean package, and the Thrift server has been configured to use the Hive metastore. When a SchemaRDD is registered as a table, where does the metadata of this table get stored? Can it be

Re: SchemaRDD and RegisterAsTable

2014-09-17 Thread Denny Lee
The registered table is stored within the Spark context itself. To make the table available for the Thrift server to access, you can save the table into the Hive context so that the Thrift server process can see it. If you are using Derby as your metastore, then the
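For illustration, a minimal Scala sketch of the difference (assuming a HiveContext backed by your metastore; the data source and table names are made up, and the CREATE TABLE route is just one way to persist a table for the Thrift server):

    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)
    val schemaRdd = hiveContext.jsonFile("hdfs:///data/events.json")
    // in-memory only: the metadata lives in this SparkContext and the Thrift server cannot see it
    schemaRdd.registerTempTable("events_tmp")
    // persisted through the Hive metastore, so the Thrift server process can query it
    hiveContext.sql("CREATE TABLE events AS SELECT * FROM events_tmp")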

Re: problem with HiveContext inside Actor

2014-09-17 Thread Du Li
Thanks for your reply. Michael: No, I only create one HiveContext in the code. Hao: Yes, I subclass HiveContext and define my own function to create a database, and then subclass an akka Actor to call that function in response to an abstract message. Per your suggestion, I called

Python version of kmeans

2014-09-17 Thread MEETHU MATHEW
Hi all, I need the kmeans code written against PySpark for some testing purposes. Can somebody tell me the difference between these two files: spark-1.0.1/examples/src/main/python/kmeans.py and spark-1.0.1/python/pyspark/mllib/clustering.py Thanks Regards, Meethu M

SQL shell for Spark SQL?

2014-09-17 Thread David Rosenstrauch
Is there a shell available for Spark SQL, similar to the way the Shark or Hive shells work? From my reading up on Spark SQL, it seems like one can execute SQL queries in the Spark shell, but only from within code in a programming language such as Scala. There does not seem to be any way to