Fair scheduling across applications in standalone mode

2014-12-05 Thread Mohammed Guller
Hi - I understand that one can use "spark.deploy.defaultCores" and "spark.cores.max" to assign a fixed number of worker cores to different apps. However, instead of statically assigning the cores, I would like Spark to dynamically assign the cores to multiple apps. For example, when there is a

Re: spark-submit on YARN is slow

2014-12-05 Thread Denny Lee
Okay, my bad for not testing out the documented arguments - once I use the correct ones, the query completes in ~55s (I can probably make it faster). Thanks for the help, eh?! On Fri Dec 05 2014 at 10:34:50 PM Denny Lee wrote: > Sorry for the delay in my response - for my spark calls

Re: spark-submit on YARN is slow

2014-12-05 Thread Denny Lee
Sorry for the delay in my response - for my spark calls for stand-alone and YARN, I am using the --executor-memory and --total-executor-cores for the submission. In standalone, my baseline query completes in ~40s while in YARN, it completes in ~1800s. It does not appear from the RM web UI that it

Re: Trying to understand a basic difference between these two configurations

2014-12-05 Thread Soumya Simanta
TD, Thanks. This helps a lot. So in that case, if I increase the number of workers in case 2 to 8 or 12, the aggregate read bandwidth is going to increase. At least conceptually? -Soumya On Fri, Dec 5, 2014 at 10:51 PM, Tathagata Das wrote: > That depends! See inline. I am assuming that whe

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
Getting this on the home machine as well. Not referencing the spark cassandra connector in libraryDependencies compiles. I've recently updated IntelliJ to 14. Could that be causing an issue? From: as...@live.com To: yuzhih...@gmail.com CC: user@spark.apache.org Subject: RE: Adding Spark Cassand

Re: Trying to understand a basic difference between these two configurations

2014-12-05 Thread Tathagata Das
That depends! See inline. I am assuming that when you said replacing local disk with HDFS in case 1, you are connected to a separate HDFS cluster (like case 1) with a single 10G link. Also assuming that all nodes (1 in case 1, and 6 in case 2) are the worker nodes, and the spark application driver

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-05 Thread Imran Rashid
> It's an easy mistake to make... I wonder if an assertion could be implemented that makes sure the type parameter is present. We could use the "NotNothing" pattern http://blog.evilmonkeylabs.com/2012/05/31/Forcing_Compiler_Nothing_checks/ but I wonder if it would just make the method signature

Trying to understand a basic difference between these two configurations

2014-12-05 Thread Soumya Simanta
I'm trying to understand the conceptual difference between these two configurations in terms of performance (using a Spark standalone cluster) Case 1: 1 Node 60 cores 240G of memory 50G of data on local file system Case 2: 6 Nodes 10 cores per node 40G of memory per node 50G of data on HDFS nodes

Re: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread Shixiong Zhu
There were two exits in this code. If the args were wrong, spark-submit would get return code 101, but if the args were correct, spark-submit could not get the second return code 100. What’s the difference between these two exits? I was so confused. I’m also confused. When I tried your codes, spar

Re: rdd.saveAsTextFile problem

2014-12-05 Thread dylanhogg
Try the workaround for Windows found here: http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7. This fixed the issue when calling rdd.saveAsTextFile(..) for me with Spark v1.1.0 on Windows 8.1 in local mode. Summary of steps: 1) download compiled winutils.exe from http://social.m

Problems creating and reading a large test file

2014-12-05 Thread Steve Lewis
I am trying to look at problems reading a data file over 4G. In my testing I am trying to create such a file. My plan is to create a fasta file (a simple format used in biology) looking like >1 TCCTTACGGAGTTCGGGTGTTTATCTTACTTATCGCGGTTCGCTGCCGCTCCGGGAGCCCGGATAGGCTGCGTTAATACCTAAGGAGCGCGTATTG >2 G

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
Here's the solution I got after talking with Liancheng: 1) using backquote `..` to wrap up all illegal characters val rdd = parquetFile(file) val schema = rdd.schema.fields.map(f => s"`${f.name}` ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n") val ddl_13 = s"""

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Ted Yu
I tried the following: 511 rm -rf ~/.m2/repository/org/apache/spark/spark-core_2.10/1.3.0-SNAPSHOT/ 513 mvn -am -pl streaming package -DskipTests [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [4.976s] [INFO] Spark Project Networking ..

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Marcelo Vanzin
I've never used it, but reading the help it seems the "-am" option might help here. On Fri, Dec 5, 2014 at 4:47 PM, Sean Owen wrote: > Maven definitely compiles "what is needed", but not if you tell it to > only compile one module alone. Unless you have previously built and > installed the other

Re: Transfer from RDD to JavaRDD

2014-12-05 Thread Sean Owen
You can probably get around it with casting, but I ended up using wrapRDD -- which is not a static method -- from another JavaRDD in scope to address this more directly without casting or warnings. It's not ideal but both should work, just a matter of which you think is less hacky. On Fri, Dec 5,

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Sean Owen
Maven definitely compiles "what is needed", but not if you tell it to only compile one module alone. Unless you have previously built and installed the other local snapshot artifacts it needs, that invocation can't proceed because you have restricted it to build one module whose dependencies don't

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Koert Kuipers
I think what changed is that core now has dependencies on other sub-projects. OK... so I am forced to install stuff because Maven cannot compile "what is needed". I will install On Fri, Dec 5, 2014 at 7:12 PM, Koert Kuipers wrote: > i suddenly also run into the issue that maven is trying to down

Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
Hi, I had to use Pig for some preprocessing and to generate Parquet files for Spark to consume. However, due to Pig's limitation, the generated schema contains Pig's identifier e.g. sorted::id, sorted::cre_ts, ... I tried to put the schema inside CREATE EXTERNAL TABLE, e.g. create external t

Re: drop table if exists throws exception

2014-12-05 Thread Jianshi Huang
I see. The resulting SchemaRDD is returned so, like Michael said, the exception does not propagate to user code. However, printing out the following log is confusing :) scala> sql("drop table if exists abc") 14/12/05 16:27:02 INFO ParseDriver: Parsing command: drop table if exists abc 14/12/05 16:2

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Koert Kuipers
I suddenly also ran into the issue that Maven is trying to download snapshots that don't exist for other sub-projects. Did something change in the Maven build? Does Maven not have the capability to smartly compile the other sub-projects that a sub-project depends on? I'd rather avoid "mvn install" sin

Re: Stateful mapPartitions

2014-12-05 Thread Patrick Wendell
Yeah the main way to do this would be to have your own static cache of connections. These could be using an object in Scala or just a static variable in Java (for instance a set of connections that you can borrow from). - Patrick On Thu, Dec 4, 2014 at 5:26 PM, Tobias Pfeiffer wrote: > Hi, > > O
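A minimal sketch of the static-cache idea described above (my own illustration, not code from the thread; the JDBC URL and the lookup helper are placeholders):

import java.sql.{Connection, DriverManager}

// Hypothetical per-JVM connection holder: a Scala object is initialized once per executor,
// so every task running in that JVM reuses the same connection instead of opening a new one.
object ConnectionCache {
  private val jdbcUrl = "jdbc:postgresql://dbhost:5432/mydb" // placeholder URL
  lazy val connection: Connection = DriverManager.getConnection(jdbcUrl)
}

// Usage inside a transformation: the connection is created lazily on the executor, never shipped from the driver.
// rdd.mapPartitions { iter => val conn = ConnectionCache.connection; iter.map(row => lookup(conn, row)) }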

Re: spark streaming kafka best practices?

2014-12-05 Thread Patrick Wendell
The second choice is better. Once you call collect() you are pulling all of the data onto a single node; you want to do most of the processing in parallel on the cluster, which is what map() will do. Ideally you'd try to summarize the data or reduce it before calling collect(). On Fri, Dec 5, 201
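A rough sketch of the preferred shape (my own illustration, assuming a string-valued Kafka stream; the per-event work is a placeholder):

import org.apache.spark.streaming.dstream.DStream

// Do the per-event work with map() on the executors and reduce it there,
// so only a small summary ever reaches the driver.
def processBatches(kafkaStream: DStream[(String, String)]): Unit = {
  kafkaStream.foreachRDD { rdd =>
    val batchTotal = rdd
      .map { case (_, line) => line.length.toLong } // placeholder for real event processing
      .fold(0L)(_ + _)                              // shrink the data before it leaves the cluster
    println(s"processed batch, total payload length: $batchTotal")
  }
}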

Transfer from RDD to JavaRDD

2014-12-05 Thread Xingwei Yang
I use Spark in Java. I want to access the vectors of a RowMatrix M, so I use M.rows(), which is an RDD. I want to transform it to a JavaRDD, so I used the following command: JavaRDD data = JavaRDD.fromRDD(M.rows(), scala.reflect.ClassTag$.MODULE$.apply(Vector.class); However, it shows an error like th

Re: Market Basket Analysis

2014-12-05 Thread Debasish Das
Apriori can be thought of as post-processing on a product similarity graph... I call it product similarity, but for each product you build a node which keeps the distinct users visiting the product, and two product nodes are connected by an edge if the intersection > 0... you are assuming if no one user visit

Re: How to incrementally compile spark examples using mvn

2014-12-05 Thread Marcelo Vanzin
You can set SPARK_PREPEND_CLASSES=1 and it should pick your new mllib classes whenever you compile them. I don't see anything similar for examples/, so if you modify example code you need to re-build the examples module ("package" or "install" - just "compile" won't work, since you need to build t

Running two different Spark jobs vs multi-threading RDDs

2014-12-05 Thread Corey Nolet
I've read in the documentation that RDDs can be run concurrently when submitted in separate threads. I'm curious how the scheduler would handle propagating these down to the tasks. I have 3 RDDs: - one RDD which loads some initial data, transforms it and caches it - two RDDs which use the cached R

Cannot PredictOnValues or PredictOn based on the model built with StreamingLinearRegressionWithSGD

2014-12-05 Thread Bui, Tri
Hi, The following example code is able to build the correct model.weights, but its prediction value is zero. Am I passing the PredictOnValues incorrectly? I also coded a batch version based on LinearRegressionWithSGD() with the same train and test data, iteration, stepsize info, and it was

Re: Including data nucleus tools

2014-12-05 Thread DB Tsai
Can you try to run the same job using the assembly packaged by make-distribution as we discussed in the other thread. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Dec 5, 2014 at 12

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread DB Tsai
Also, are you using the latest master in this experiment? A PR merged into master a couple of days ago will speed up k-means three times. See https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1 Sincerely, DB Tsai

Re: Using data in RDD to specify HDFS directory to write to

2014-12-05 Thread Nathan Murthy
I'm experiencing the same problem when I try to run my app in a standalone Spark cluster. My use case, however, is closer to the problem documented in this thread: http://apache-spark-user-list.1001560.n3.nabble.com/Please-help-running-a-standalone-app-on-a-Spark-cluster-td1596.html. The solution

Re: spark-submit on YARN is slow

2014-12-05 Thread Andrew Or
Hey Arun I've seen that behavior before. It happens when the cluster doesn't have enough resources to offer and the RM hasn't given us our containers yet. Can you check the RM Web UI at port 8088 to see whether your application is requesting more resources than the cluster has to offer? 2014-12-05

Re: scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-05 Thread Michael Armbrust
All values in Hive are always nullable, though you should still not be seeing this error. It should be addressed by this patch: https://github.com/apache/spark/pull/3150 On Fri, Dec 5, 2014 at 2:36 AM, Hao Ren wrote: > Hi, > > I am using SparkSQL on 1.1.0 branch. > > The following code leads to

R: Optimized spark configuration

2014-12-05 Thread Paolo Platter
What kind of query are you performing? You should set something like 2 partitions per core, which would be 400 MB per partition. As you have a lot of RAM I suggest caching the whole table; performance will increase a lot. Paolo Sent from my Windows Phone From: vd
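A rough sketch of that suggestion (my own illustration, assuming Spark SQL with a Parquet source at a placeholder path and an 8-core node, hence ~16 partitions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CacheTableSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cache-table-sketch")
      .set("spark.default.parallelism", "16") // ~2 partitions per core on an 8-core box
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val table = sqlContext.parquetFile("hdfs:///data/mytable.parquet") // placeholder path
    table.registerTempTable("mytable")
    sqlContext.cacheTable("mytable")                                   // keep the whole table in memory
    sqlContext.sql("SELECT COUNT(*) FROM mytable").collect()           // first scan populates the cache
  }
}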

Re: Java RDD Union

2014-12-05 Thread Sean Owen
foreach also creates a new RDD, and does not modify an existing RDD. However, in practice, nothing stops you from fiddling with the Java objects inside an RDD when you get a reference to them in a method like this. This is definitely a bad idea, as there is certainly no guarantee that any other ope

Re: spark-submit on YARN is slow

2014-12-05 Thread Sandy Ryza
Hey Arun, The sleeps would only cause at most around 5 seconds of overhead. The idea was to give executors some time to register. On more recent versions, they were replaced with the spark.scheduler.minRegisteredResourcesRatio and spark.scheduler.maxRegisteredResourcesWaitingTime. As of 1.1, by defau

Re: spark-submit on YARN is slow

2014-12-05 Thread Ashish Rangole
Likely this is not the case here, but one thing to point out with YARN parameters like --num-executors is that they should be specified *before* the app jar and app args on the spark-submit command line; otherwise the app only gets the default number of containers, which is 2. On Dec 5, 2014 12:22 PM, "Sandy Ryz

Re: spark-submit on YARN is slow

2014-12-05 Thread Arun Ahuja
Hey Sandy, What are those sleeps for and do they still exist? We have seen about a 1min to 1:30 executor startup time, which is a large chunk for jobs that run in ~10min. Thanks, Arun On Fri, Dec 5, 2014 at 3:20 PM, Sandy Ryza wrote: > Hi Denny, > > Those sleeps were only at startup, so if jo

Re: Market Basket Analysis

2014-12-05 Thread Sean Owen
I doubt Amazon uses Apriori for this, but who knows. Usually you want "also bought" functionality, which is a form of similar-item computation. But you don't want to favor items that are simply frequently purchased in general. You probably want to look at pairs of items that co-occur in purchase
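A small sketch of the co-occurrence idea (my own illustration, not from the thread), counting how many baskets contain each pair of items; the popularity damping mentioned above can be applied afterwards by dividing by the individual item counts:

import org.apache.spark.rdd.RDD

// Input: (basketId, item) records. Output: ((itemA, itemB), number of baskets containing both).
def coOccurrences(purchases: RDD[(String, String)]): RDD[((String, String), Int)] = {
  purchases
    .groupByKey()                                // basketId -> all items bought in that basket
    .flatMap { case (_, items) =>
      val distinct = items.toSet.toSeq.sorted    // emit each unordered pair once per basket
      for (a <- distinct; b <- distinct if a < b) yield ((a, b), 1)
    }
    .reduceByKey(_ + _)
}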

Re: Java RDD Union

2014-12-05 Thread Sameer Farooqui
Hi Ron, Out of curiosity, why do you think that union is modifying an existing RDD in place? In general all transformations, including union, will create new RDDs, not modify old RDDs in place. Here's a quick test: scala> val firstRDD = sc.parallelize(1 to 5) firstRDD: org.apache.spark.rdd.RDD[I

Including data nucleus tools

2014-12-05 Thread spark.dubovsky.jakub
Hi all,   I have created assembly jar from 1.2 snapshot source by running [1] which sets correct version of hadoop for our cluster and uses hive profile. I also have written relatively simple test program which starts by reading data from parquet using hive context. I compile the code against as

Re: Java RDD Union

2014-12-05 Thread Sean Owen
No, RDDs are immutable. union() creates a new RDD, and does not modify an existing RDD. Maybe this obviates the question. I'm not sure what you mean about releasing from memory. If you want to repartition the unioned RDD, you repartition the result of union(), not anything else. On Fri, Dec 5, 201
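A quick spark-shell illustration of the point (assuming `sc` from the shell):

val first = sc.parallelize(1 to 5)
val second = sc.parallelize(6 to 10)
val combined = first.union(second)          // a brand-new RDD; first and second are unchanged
val repartitioned = combined.repartition(8) // repartition the result of union(), not the inputs
repartitioned.count()                       // 10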

Re: spark-submit on YARN is slow

2014-12-05 Thread Sandy Ryza
Hi Denny, Those sleeps were only at startup, so if jobs are taking significantly longer on YARN, that should be a different problem. When you ran on YARN, did you use the --executor-cores, --executor-memory, and --num-executors arguments? When running against a standalone cluster, by default Spa

Re: spark-submit on YARN is slow

2014-12-05 Thread Sameer Farooqui
Just an FYI - I can submit the SparkPi app to YARN in cluster mode on a 1-node m3.xlarge EC2 instance and the app finishes running successfully in about 40 seconds. I just figured the 30 - 40 sec run time was normal b/c of the submitting overhead that Andrew mentioned. Denny, you can mayb

Re: SchemaRDD partition on specific column values?

2014-12-05 Thread Michael Armbrust
It does not appear that the in-memory caching currently preserves the information about the partitioning of the data so this optimization will probably not work. On Thu, Dec 4, 2014 at 8:42 PM, nitin wrote: > With some quick googling, I learnt that I can we can provide "distribute by > " in hive

Re: drop table if exists throws exception

2014-12-05 Thread Mark Hamstra
And that is no different from how Hive has worked for a long time. On Fri, Dec 5, 2014 at 11:42 AM, Michael Armbrust wrote: > The command runs fine for me on master. Note that Hive does print an > exception in the logs, but that exception does not propagate to user code. > > On Thu, Dec 4, 2014

Re: drop table if exists throws exception

2014-12-05 Thread Michael Armbrust
The command runs fine for me on master. Note that Hive does print an exception in the logs, but that exception does not propagate to user code. On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang wrote: > Hi, > > I got an exception saying Hive: NoSuchObjectException(message: table > not found) > > when

Re: Increasing the number of retry in case of job failure

2014-12-05 Thread Andrew Or
Increasing max failures is a way to do it, but it's probably a better idea to keep your tasks from failing in the first place. Are your tasks failing with exceptions from Spark or your application code? If from Spark, what is the stack trace? There might be a legitimate Spark bug such that even inc

Re: spark-submit on YARN is slow

2014-12-05 Thread Denny Lee
My submissions of Spark on YARN (CDH 5.2) resulted in a few thousand steps. If I was running this on standalone cluster mode the query finished in 55s but on YARN, the query was still running 30min later. Would the hard coded sleeps potentially be in play here? On Fri, Dec 5, 2014 at 11:23 Sandy Ry

Re: Any ideas why a few tasks would stall

2014-12-05 Thread Andrew Or
Hi Steve et al., It is possible that there's just a lot of skew in your data, in which case repartitioning is a good idea. Depending on how large your input data is and how much skew you have, you may want to repartition to a larger number of partitions. By the way you can just call rdd.repartitio
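A small diagnostic sketch (my own illustration, assuming some RDD `rdd` already in scope): count records per partition to spot skew, then repartition to spread the data more evenly:

val perPartition = rdd
  .mapPartitions(it => Iterator(it.size), preservesPartitioning = true)
  .collect()
perPartition.zipWithIndex.foreach { case (n, i) => println(s"partition $i: $n records") }

val rebalanced = rdd.repartition(rdd.partitions.length * 4) // more, smaller partitions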

Java RDD Union

2014-12-05 Thread Ron Ayoub
I'm a bit confused regarding expected behavior of unions. I'm running on 8 cores. I have an RDD that is used to collect cluster associations (cluster id, content id, distance) for internal clusters as well as leaf clusters since I'm doing hierarchical k-means and need all distances for sorting d

Re: Issue in executing Spark Application from Eclipse

2014-12-05 Thread Andrew Or
Hey Stuti, Did you start your standalone Master and Workers? You can do this through sbin/start-all.sh (see http://spark.apache.org/docs/latest/spark-standalone.html). Otherwise, I would recommend launching your application from the command line through bin/spark-submit. I am not sure if we offici

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
Sorry... I really don't have enough Maven know-how to do this quickly. I tried the pom below, and IntelliJ could find org.apache.spark.streaming.StreamingContext and org.apache.spark.streaming.Seconds, but not org.apache.spark.streaming.receiver.Receiver. Is there something specific I can try? I'l

Re: Monitoring Spark

2014-12-05 Thread Andrew Or
If you're only interested in a particular instant, a simpler way is to check the executors page on the Spark UI: http://spark.apache.org/docs/latest/monitoring.html. By default each executor runs one task per core, so you can see how many tasks are being run at a given time and this translates dire

Re: spark-submit on YARN is slow

2014-12-05 Thread Sandy Ryza
Hi Tobias, What version are you using? In some recent versions, we had a couple of large hardcoded sleeps on the Spark side. -Sandy On Fri, Dec 5, 2014 at 11:15 AM, Andrew Or wrote: > Hey Tobias, > > As you suspect, the reason why it's slow is because the resource manager > in YARN takes a wh

Re: [Graphx] which way is better to access faraway neighbors?

2014-12-05 Thread Ankur Dave
At 2014-12-05 02:26:52 -0800, Yifan LI wrote: > I have a graph where each vertex keeps several messages to some faraway > neighbours (I mean, not only immediate neighbours, at most k hops away, e.g. > k = 5). > > Now, I propose to distribute these messages to their corresponding > destinatio

Re: spark-submit on YARN is slow

2014-12-05 Thread Andrew Or
Hey Tobias, As you suspect, the reason why it's slow is because the resource manager in YARN takes a while to grant resources. This is because YARN needs to first set up the application master container, and then this AM needs to request more containers for Spark executors. I think this accounts f

Re: Unable to run applications on clusters on EC2

2014-12-05 Thread Andrew Or
Hey, the default port is 7077. Not sure if you actually meant to put 7070. As a rule of thumb, you can go to the Master web UI and copy and paste the URL at the top left corner. That almost always works unless your cluster has a weird proxy set up. 2014-12-04 14:26 GMT-08:00 Xingwei Yang : > I th

Re: Spark streaming for v1.1.1 - unable to start application

2014-12-05 Thread Andrew Or
Hey Sourav, are you able to run a simple shuffle in a spark-shell? 2014-12-05 1:20 GMT-08:00 Shao, Saisai : > Hi, > > > > I don’t think it’s a problem of Spark Streaming, seeing for call stack, > it’s the problem when BlockManager starting to initializing itself. Would > you mind checking your c

Re: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ted Yu
Can you try with maven ? diff --git a/streaming/pom.xml b/streaming/pom.xml index b8b8f2e..6cc8102 100644 --- a/streaming/pom.xml +++ b/streaming/pom.xml @@ -68,6 +68,11 @@ junit-interface test + + com.datastax.spark + spark-cassandra-connector_2.10 + 1.1.0 +

I am having problems reading files in the 4GB range

2014-12-05 Thread Steve Lewis
I am using a custom hadoop input format which works well on smaller files but fails with a file at about 4GB size - the format is generating about 800 splits and all variables in my code are longs - Any suggestions? Is anyone reading files of this size? Exception in thread "main" org.apache.spark

Optimized spark configuration

2014-12-05 Thread vdiwakar.malladi
Hi, Could anyone help with what would be a better / optimized configuration for driver memory, worker memory, degree of parallelism, etc., when we are running 1 master node (itself also acting as a slave node) and 1 slave node? Both are of 32 GB RAM with 4 cores. On this, I

RE: Market Basket Analysis

2014-12-05 Thread Ashic Mahtab
This can definitely be useful. "Frequently bought together" is something Amazon does, though surprisingly, you don't get a discount. Perhaps it can lead to offering (or avoiding!) deals on frequent itemsets. This is a good resource for frequent itemsets implementations: http://infolab.stanford.

RE: Spark Streaming Reusing JDBC Connections

2014-12-05 Thread Ashic Mahtab
I've done this: 1. foreachPartition 2. Open connection. 3. foreach inside the partition. 4. close the connection. Slightly crufty, but works. Would love to see a better approach. Regards, Ashic. Date: Fri, 5 Dec 2014 12:32:24 -0500 Subject: Spark Streaming Reusing JDBC Connections From: asimja.
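A sketch of those four steps (my own illustration with a placeholder JDBC URL and insert statement, not code from the thread):

import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

def writeToDb(stream: DStream[String], jdbcUrl: String): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>                          // 1. foreachPartition
      val conn = DriverManager.getConnection(jdbcUrl)          // 2. open once per partition
      try {
        val stmt = conn.prepareStatement("INSERT INTO events (payload) VALUES (?)") // placeholder SQL
        records.foreach { r =>                                 // 3. foreach inside the partition
          stmt.setString(1, r)
          stmt.executeUpdate()
        }
      } finally {
        conn.close()                                           // 4. close the connection
      }
    }
  }
}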

Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
Hi, Seems adding the Cassandra connector and Spark Streaming causes "issues". I've added my build and code file. Running "sbt compile" gives weird errors like Seconds is not part of org.apache.spark.streaming and object Receiver is not a member of package org.apache.spark.streaming.receiver. If

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
The code is really simple: object TestKMeans { def main(args: Array[String]) { val conf = new SparkConf() .setAppName("Test KMeans") .setMaster("local[8]") .set("spark.executor.memory", "8g") val sc = new SparkContext(conf) val numClusters = 500;

Spark Streaming Reusing JDBC Connections

2014-12-05 Thread Asim Jalis
Is there a way I can have a JDBC connection open through a streaming job. I have a foreach which is running once per batch. However, I don’t want to open the connection for each batch but would rather have a persistent connection that I can reuse. How can I do this? Thanks. Asim

Re: How can I compile only the core and streaming (so that I can get test utilities of streaming)?

2014-12-05 Thread Ted Yu
I don't think 'sbt assembly' would touch local maven repo for Spark. Looking at dependency:tree output: [INFO] org.apache.spark:spark-streaming_2.10:jar:1.1.0-SNAPSHOT [INFO] +- org.apache.spark:spark-core_2.10:jar:1.1.0-SNAPSHOT:compile spark-streaming only depends on spark-core other than thir

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Davies Liu
Could you post your script to reproduce the results (also how to generate the dataset)? That will help us to investigate it. On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa wrote: > Hmm, here I use spark on local mode on my laptop with 8 cores. The data is > on my local filesystem. Event thought

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
Hmm, here I use Spark in local mode on my laptop with 8 cores. The data is on my local filesystem. Even though there is an overhead due to the distributed computation, I found the difference between the runtimes of the two implementations really, really huge. Is there a benchmark on how well the alg

Re: Market Basket Analysis

2014-12-05 Thread Rohit Pujari
This is a typical use case: "people who buy electric razors also tend to buy batteries and shaving gel along with them". The goal is to build a model which will look through POS records and find which product categories have a higher likelihood of appearing together in a given transaction. What would

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread m.sarosh
Thank you Helena, But I would like to explain my problem space: the output is supposed to be Cassandra. To achieve that, I have to use the spark-cassandra-connector APIs. So going with a bottom-up approach, to write to Cassandra, I have to use: javaFunctions( rdd, TestTable.class).saveToCassandra

pyspark exception catch

2014-12-05 Thread Igor Mazor
Hi, Is it possible to catch exceptions using pyspark so that in case of error the program will not fail and exit? For example, if I am using (key, value) RDD functionality but the data doesn't actually have a (key, value) format, pyspark will throw an exception (like ValueError) that I am unable to catch.

Re: Why KMeans with mllib is so slow ?

2014-12-05 Thread Sean Owen
Spark has much more overhead, since it's set up to distribute the computation. Julia isn't distributed, and so has no such overhead in a completely in-core implementation. You generally use Spark when you have a problem large enough to warrant distributing, or, your data already lives in a distribu

Re: How can I compile only the core and streaming (so that I can get test utilities of streaming)?

2014-12-05 Thread Emre Sevinc
Hello, Specifying '-DskipTests' on commandline worked, though I can't be sure whether first running 'sbt assembly' also contributed to the solution. (I've tried 'sbt assembly' because branch-1.1's README says to use sbt). Thanks for the answer. Kind regards, Emre Sevinç

Why KMeans with mllib is so slow ?

2014-12-05 Thread Jaonary Rabarisoa
Hi all, I'm trying to run clustering with the kmeans algorithm. The size of my data set is about 240k vectors of dimension 384. Solving the problem with the kmeans available in Julia (kmeans++) http://clusteringjl.readthedocs.org/en/latest/kmeans.html takes about 8 minutes on a single core. Solvin

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread Helena Edelson
I think what you are looking for is something like: JavaRDD pricesRDD = javaFunctions(sc).cassandraTable("ks", "tab", mapColumnTo(Double.class)).select("price"); JavaRDD rdd = javaFunctions(sc).cassandraTable("ks", "people", mapRowTo(Person.class)); noted here: https://github.com/datastax/spa

Re: Why my default partition size is set to 52 ?

2014-12-05 Thread Jaonary Rabarisoa
OK, I misunderstood the meaning of the partitions. In fact, my file is 1.7G big, and with a smaller file I get a different number of partitions. Thanks for this clarification. On Fri, Dec 5, 2014 at 4:15 PM, Sean Owen wrote: > How big is your file? it's probably of a size that the Hadoop > InputForm

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread m.sarosh
Hi Akhil, Vyas, Helena, Thank you for your suggestions. As Akhil suggested earlier, I have implemented the batch Duration in the JavaStreamingContext and waitForTermination(Duration). The approach Helena suggested is Scala-oriented. But the issue now is that I want to set Cassandra as my output.

Re: Why my default partition size is set to 52 ?

2014-12-05 Thread Sean Owen
How big is your file? it's probably of a size that the Hadoop InputFormat would make 52 splits for it. Data drives partitions, not processing resource. Really, 8 splits is the minimum parallelism you want. Several times your # of cores is better. On Fri, Dec 5, 2014 at 8:51 AM, Jaonary Rabarisoa
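A small sketch of asking for more parallelism up front (assuming `sc` from spark-shell and a placeholder path):

val cores = 8                                                        // assumption: an 8-core machine
val lines = sc.textFile("hdfs:///path/to/file", minPartitions = cores * 3)
println(lines.partitions.length)                                     // at least cores * 3, often a bit more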

Why my default partition size is set to 52 ?

2014-12-05 Thread Jaonary Rabarisoa
Hi all, I'm trying to run a simple Spark job with spark-shell. What I want to do is just count the number of lines in a file. I start spark-shell with the default arguments, i.e. just ./bin/spark-shell, load the text file with sc.textFile("path"), and then call count on my data. When I do th

Re: subscribe me to the list

2014-12-05 Thread 张鹏
Hi Ningjun Please send email to this address to get subscribed: user-subscr...@spark.apache.org On Dec 5, 2014, at 10:36 PM, Wang, Ningjun (LNG-NPV) wrote: > I would like to subscribe to the user@spark.apache.org > > > Regards, > > Ningjun Wang > Consulting Software Engineer > LexisNexi

subscribe me to the list

2014-12-05 Thread Wang, Ningjun (LNG-NPV)
I would like to subscribe to the user@spark.apache.org Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541

Re: How can I compile only the core and streaming (so that I can get test utilities of streaming)?

2014-12-05 Thread Ted Yu
Please specify '-DskipTests' on commandline. Cheers On Dec 5, 2014, at 3:52 AM, Emre Sevinc wrote: > Hello, > > I'm currently developing a Spark Streaming application and trying to write my > first unit test. I've used Java for this application, and I also need use > Java (and JUnit) for wr

Re: SPARK LIMITATION - more than one case class is not allowed !!

2014-12-05 Thread Daniel Darabos
On Fri, Dec 5, 2014 at 7:12 AM, Tobias Pfeiffer wrote: > Rahul, > > On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish < > rahul.bindl...@nectechnologies.in> wrote: >> >> I have done so thats why spark is able to load objectfile [e.g. >> person_obj] >> and spark has maintained serialVersionUID [perso

spark streaming kafka best practices?

2014-12-05 Thread david
Hi, what is the best way to process a batch window in Spark Streaming: kafkaStream.foreachRDD(rdd => { rdd.collect().foreach(event => { // process the event process(event) }) }) Or kafkaStream.foreachRDD(rdd => { rdd.map(event => { // pro

Re: Market Basket Analysis

2014-12-05 Thread Sean Owen
Generally I don't think frequent-item-set algorithms are that useful. They're simple and not probabilistic; they don't tell you what sets occurred unusually frequently. Usually people ask for frequent item set algos when they really mean they want to compute item similarity or make recommendations.

Re: Increasing the number of retry in case of job failure

2014-12-05 Thread Daniel Darabos
It is controlled by "spark.task.maxFailures". See http://spark.apache.org/docs/latest/configuration.html#scheduling. On Fri, Dec 5, 2014 at 11:02 AM, shahab wrote: > Hello, > > By some (unknown) reasons some of my tasks, that fetch data from > Cassandra, are failing so often, and apparently the
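A minimal sketch of raising that limit (it must be set before the SparkContext is created):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("retry-tuning-sketch")
  .set("spark.task.maxFailures", "8") // allow up to 8 attempts per task instead of the default 4
val sc = new SparkContext(conf)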

cartesian on pyspark not parallelised

2014-12-05 Thread Antony Mayi
Hi, using pyspark 1.1.0 on YARN 2.5.0. All operations run nicely in parallel - I can see multiple python processes spawned on each nodemanager, but for some reason when running cartesian there is only a single python process running on each node. The task is indicating thousands of partitions so

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread Helena Edelson
You can just do something like this; the Spark Cassandra Connector handles the rest: KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, Map(KafkaTopicRaw -> 10), StorageLevel.DISK_ONLY_2) .map { case (_, line) => line.split(",")} .map(Ra

Re: Does filter on an RDD scan every data item ?

2014-12-05 Thread nsareen
Any thoughts, how could Spark SQL help in our scenario?

Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-05 Thread Daniel Darabos
Hi, Alexey, I'm getting the same error on startup with Spark 1.1.0. Everything works fine fortunately. The error is mentioned in the logs in https://issues.apache.org/jira/browse/SPARK-4498, so maybe it will also be fixed in Spark 1.2.0 and 1.1.2. I have no insight into it unfortunately. On Tue,

How can I compile only the core and streaming (so that I can get test utilities of streaming)?

2014-12-05 Thread Emre Sevinc
Hello, I'm currently developing a Spark Streaming application and trying to write my first unit test. I've used Java for this application, and I also need to use Java (and JUnit) for writing unit tests. I could not find any documentation that focuses on Spark Streaming unit testing; all I could find

Re: Using sparkSQL to convert a collection of python dictionary of dictionaries to schma RDD

2014-12-05 Thread sahanbull
It worked man.. Thanks a lot :)

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread Jay Vyas
Here's an example of a Cassandra ETL that you can follow which should exit on its own. I'm using it as a blueprint for revolving Spark Streaming apps on top of. For me, I kill the streaming app with System.exit after a sufficient amount of data is collected. That seems to work for most any scena

scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-05 Thread Hao Ren
Hi, I am using SparkSQL on 1.1.0 branch. The following code leads to a scala.MatchError at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247) val scm = StructType(inputRDD.schema.fields.init :+ StructField("list", ArrayType( StructType(

[Graphx] which way is better to access faraway neighbors?

2014-12-05 Thread Yifan LI
Hi, I have a graph where each vertex keeps several messages to some faraway neighbours (I mean, not only immediate neighbours, at most k hops away, e.g. k = 5). Now, I propose to distribute these messages to their corresponding destinations (say, "faraway neighbours"): - by using the pregel api

Increasing the number of retry in case of job failure

2014-12-05 Thread shahab
Hello, For some (unknown) reason some of my tasks that fetch data from Cassandra are failing quite often, and apparently the master removes a task which fails more than 4 times (in my case). Is there any way to increase the number of retries? best, /Shahab

Re: NullPointerException When Reading Avro Sequence Files

2014-12-05 Thread cjdc
Hi all, I've tried the above example on Gist, but it doesn't work (at least for me). Did anyone get this: 14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class w

Re: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread Shixiong Zhu
What's the status of this application in the yarn web UI? Best Regards, Shixiong Zhu 2014-12-05 17:22 GMT+08:00 LinQili : > I tried anather test code: > def main(args: Array[String]) { > if (args.length != 1) { > Util.printLog("ERROR", "Args error - arg1: BASE_DIR") > exit(101)

RE: Issue on [SPARK-3877][YARN]: Return code of the spark-submit in yarn-cluster mode

2014-12-05 Thread LinQili
I tried another test code: def main(args: Array[String]) { if (args.length != 1) { Util.printLog("ERROR", "Args error - arg1: BASE_DIR"); exit(101) }; val currentFile = args(0).toString; val DB = "test_spark"; val tableName = "src"; val sparkConf = new SparkConf().setApp
