RE: Spark job for Reading time series data from Cassandra

2016-03-10 Thread Prateek .
Hi, thanks guys. I was using spark-cassandra-connector 1.4.0-M1. There is an issue in this version of spark-cassandra-connector: the parameter spark.cassandra.input.split.size_in_mb, which is supposed to have a default value of 64 MB, is being interpreted as 64 bytes. This causes too many
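
A minimal sketch of pinning the split size explicitly while on the affected connector version (app name and host are hypothetical; upgrading the connector is the real fix):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-read") // hypothetical
      .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical
      // On 1.4.0-M1 this value is read as bytes rather than MB, producing
      // far too many tiny partitions; later connector versions fix this.
      .set("spark.cassandra.input.split.size_in_mb", "64")
    val sc = new SparkContext(conf)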

Spark REST APIs not working when application deployed in CLUSTER mode

2016-03-10 Thread rakesh rakshit
Hi all, I am able to make the REST API calls in client mode but when I run the application in Cluster mode, I am getting the below exception in driver logs: java.lang.IncompatibleClassChangeError: com.sun.jersey.json.impl.provider.entity.JSONRootElementProvider and

Re: Running ALS on comparatively large RDD

2016-03-10 Thread Deepak Gopalakrishnan
1. I'm using about 1 million users against a few thousand products. I basically have around a million ratings. 2. Spark 1.6 on Amazon EMR On Fri, Mar 11, 2016 at 12:46 PM, Nick Pentreath wrote: > Could you provide more details about: > 1. Data set size (# ratings, # users

Python unit tests - Unable to run it with Python 2.6 or 2.7

2016-03-10 Thread Gayathri Murali
Hi all, I am trying to run the Python unit tests. I currently have Python 2.6 and 2.7 installed. I installed unittest2 against both of them. When I try to run /python/run-tests with Python 2.7 I get the following error: Please install unittest2 to test with Python 2.6 or earlier Had test

Re: Running ALS on comparatively large RDD

2016-03-10 Thread Nick Pentreath
Could you provide more details about: 1. Data set size (# ratings, # users and # products) 2. Spark cluster set up and version Thanks On Fri, 11 Mar 2016 at 05:53 Deepak Gopalakrishnan wrote: > Hello All, > > I've been running Spark's ALS on a dataset of users and rated

Re: [MLlib - ALS] Merging two Models?

2016-03-10 Thread Nick Pentreath
Sean's old Myrrix slides contain an overview of the fold-in math: http://www.slideshare.net/srowen/big-practical-recommendations-with-alternating-least-squares/14?src=clipshare I never quite got around to actually incorporating it into my own ALS-based systems, because in the end I just

Re: Can we use spark inside a web service?

2016-03-10 Thread Nick Pentreath
Yes, really interesting discussion. It would be really interesting to compare the performance of alternative architectures. Specifically, I've found that Elasticsearch is a great option for analytic workloads - it doesn't support SQL (joins in particular), but its aggregation and arbitrary

RE: lint-r checks failing

2016-03-10 Thread Sun, Rui
This is probably because the installed lintr package got updated. After the update, lintr can detect errors that were skipped before. I will submit a PR for this issue -Original Message- From: Gayathri Murali [mailto:gayathri.m.sof...@gmail.com] Sent: Friday, March 11, 2016 12:48 PM To:

How can I join two DataSet of same case class?

2016-03-10 Thread 박주형
Hi. I want to join two DataSets, but the stderr below is shown: 16/03/11 13:55:51 WARN ColumnName: Constructing trivially true equals predicate, ''edid = 'edid'. Perhaps you need to use aliases. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'edid' given input
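
A minimal sketch of the usual fix on Spark 1.6, assuming a hypothetical case class Event(edid: String, value: Long): alias both sides so that 'edid' resolves unambiguously instead of yielding the trivially true predicate:

    import org.apache.spark.sql.functions.col

    case class Event(edid: String, value: Long)
    // ds1, ds2: Dataset[Event] with identical schemas
    val joined = ds1.toDF().as("a")
      .join(ds2.toDF().as("b"), col("a.edid") === col("b.edid"))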

lint-r checks failing

2016-03-10 Thread Gayathri Murali
Hi All, I recently tried to run ./dev/run-tests on a freshly cloned Spark repository and I get a lint-r check failed error. I have run these tests multiple times before and never had this issue. I have copied part of the issue here. Please note that I haven't modified any of these files. Am I

Re: Can we use spark inside a web service?

2016-03-10 Thread Tristan Nixon
Hear, hear. That’s why I’m here :) > On Mar 10, 2016, at 7:32 PM, Chris Fregly wrote: > > Anyway, thanks for the good discussion, everyone! This is why we have these > lists, right! :)

Get output of the ALS algorithm.

2016-03-10 Thread Shishir Anshuman
Hello, I am new to Apache Spark and would like to get the recommendation output of the ALS algorithm in a file. Please suggest a solution. Thank you
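
A minimal sketch, assuming an MLlib MatrixFactorizationModel named model and a hypothetical output path: recommendProductsForUsers gives the top-N recommendations per user, which can then be written out as text:

    import org.apache.spark.mllib.recommendation.Rating

    val recs = model.recommendProductsForUsers(10) // RDD[(Int, Array[Rating])]
    recs.flatMap { case (user, ratings) =>
      ratings.map((r: Rating) => s"$user,${r.product},${r.rating}")
    }.saveAsTextFile("hdfs:///tmp/als-recommendations") // hypothetical path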

Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-10 Thread Mukul Gupta
Hi All, I was running the following test: *Setup* 9 VMs running Spark workers with 1 Spark executor each, 1 VM running Kafka and the Spark master. Spark version is 1.6.0. Kafka version is 0.9.0.1. Spark is using its own resource manager and is not running over YARN. *Test* I created a kafka topic with 3 partitions.

Why the shuffle write is not exactly the same as the shuffle read of the next stage

2016-03-10 Thread canan chen
Here's my screenshot; stages 19 and 20 have a one-to-one relationship, being the only child/parent pair. From my understanding, the shuffle write of stage 19 should be the same as the shuffle read of stage 20, but here they differ slightly. Is there any reason for it? Thanks. [image: Inline

Does anyone have experience processing large volume images on Spark cluster

2016-03-10 Thread greg huang
Hi All, Does anyone have experience processing large volumes of images on a Spark cluster? For example, using Spark to run MapReduce-style tasks to identify common features, such as counting the number of cars in satellite pictures. Regards, Greg

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
great discussion, indeed. Mark Hamstra and i spoke offline just now. Below is a quick recap of our discussion on how they've achieved acceptable performance from Spark on the user request/response path (@mark- feel free to correct/comment). 1) there is a big difference in request/response

Re: Can we use spark inside a web service?

2016-03-10 Thread Evan Chan
One of the premises here is that if you can restrict your workload to fewer cores - which is easier with FiloDB and careful data modeling - you can make this work for much higher concurrency and lower latency than most typical Spark use cases. The reason why it typically does not work in

Re: Spark configuration with 5 nodes

2016-03-10 Thread Mich Talebzadeh
Hi, Bear in mind that you typically need 1 GB of NameNode memory for 1 million blocks. So if you have a 128 MB block size, you can store 128 * 1E6 / (3 * 1024) = 41,666 GB of data for every 1 GB. The number 3 comes from the fact that each block is replicated three times. In other words, just under 42 TB of

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ashish Soni
When you say the driver is running on Mesos, can you explain how you are doing that? > On Mar 10, 2016, at 4:44 PM, Eran Chinthaka Withana > wrote: > > Yanling I'm already running the driver on mesos (through docker). FYI, I'm > running this on cluster mode with

Flashback: RDD.aggregate versus accumulables...

2016-03-10 Thread jiml
And Lord Joe, you were right: future versions did protect accumulators in actions. I wonder if anyone has a "modern" take on the accumulator vs. aggregate question. Seems like if I need to do it by key or control partitioning I would use aggregate. Bottom line question / reason for post: I wonder

Re: Spark configuration with 5 nodes

2016-03-10 Thread Prabhu Joseph
Ashok, The cluster nodes have enough memory but relatively few CPU cores. 512 GB / 16 = 32 GB, so the cluster has 32 GB of memory per core. Either there should be more cores available to use the memory efficiently, or don't configure a high executor memory, which will cause a lot of GC. Thanks,

Re: Can we use spark inside a web service?

2016-03-10 Thread Teng Qiu
This really depends on how you define "hot" :) and on the use cases; Spark is definitely not one-size-fits-all, at least not yet, especially for heavy joins and full scans. Maybe spark alone fits your production workload and analytical requirements, but in general, I agree with Chris, for high

Re: AVRO vs Parquet

2016-03-10 Thread Guru Medasani
Thanks Michael for clarifying this. My response is inline. Guru Medasani gdm...@gmail.com > On Mar 10, 2016, at 12:38 PM, Michael Armbrust wrote: > > A few clarifications: > > 1) High memory and cpu usage. This is because Parquet files can't be streamed > into as

Re: [MLlib - ALS] Merging two Models?

2016-03-10 Thread Chris Fregly
@Colin- you're asking the million-dollar question that a lot of people are trying to answer. This was literally the #1 most-asked question in every city on my recent world-wide meetup tour. I've been pointing people to my old Databricks co-worker's streaming-matrix-factorization project:

Re: trouble with NUMPY constructor in UDF

2016-03-10 Thread Andy Davidson
In [7]: pdf = gdf1.toPandas(); pdf['date'] = epoch2num(pdf['ms']); print(pdf.dtypes); pdf -- count int64, row_key object, created datetime64[ns], ms int64, date float64, dtype: object Out[7]: count row_key created ms date 0 2 realDonaldTrump 2016-03-09

Re: Can we use spark inside a web service?

2016-03-10 Thread Mark Hamstra
The fact that a typical Job requires multiple Tasks is not a problem, but rather an opportunity for the Scheduler to interleave the workloads of multiple concurrent Jobs across the available cores. I work every day with such a production architecture with Spark on the user request/response hot

Re: trouble with NUMPY constructor in UDF

2016-03-10 Thread Andy Davidson
Hi Ted, In Python the data type is 'float64'. I have tried using both sql FloatType and DoubleType, however I get the same error. Strange. Andy From: Ted Yu Date: Wednesday, March 9, 2016 at 3:28 PM To: Andrew Davidson Cc: "user @spark"

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
you are correct, mark. i misspoke. apologies for the confusion. so the problem is even worse given that a typical job requires multiple tasks/cores. i have yet to see this particular architecture work in production. i would love for someone to prove otherwise. On Thu, Mar 10, 2016 at 5:44

Re: Can we use spark inside a web service?

2016-03-10 Thread Mark Hamstra
> > For example, if you're looking to scale out to 1000 concurrent requests, > this is 1000 concurrent Spark jobs. This would require a cluster with 1000 > cores. This doesn't make sense. A Spark Job is a driver/DAGScheduler concept without any 1:1 correspondence between Worker cores and Jobs.

Re: Can we use spark inside a web service?

2016-03-10 Thread Tristan Nixon
Very interested, Evan, thanks for the link. It has given me some food for thought. I'm also in the process of building a web application which leverages Spark on the back-end for some heavy lifting. I would be curious about your thoughts on my proposed architecture: I was planning on running a

Re: Spark Streaming, very slow processing and increasing scheduling delay of kafka input stream

2016-03-10 Thread Andy Davidson
In my experience I would try the following. I use the standalone cluster manager. Each app gets its own performance web page; the streaming tab is really helpful. If processing time is greater than your mini-batch length you are going to run into performance problems. Use the "stages" tab to

Spark configuration with 5 nodes

2016-03-10 Thread Ashok Kumar
Hi, We intend to use 5 servers which will be utilized for building a Big Data Hadoop data warehouse system (not using any proprietary distribution like Hortonworks or Cloudera or others). All servers' configurations are 512 GB RAM, 30 TB storage and 16 cores, on Ubuntu Linux servers. Hadoop will be

Running MesosClusterDispatcher through marathon with bridged mode

2016-03-10 Thread Eran Chinthaka Withana
Hi I was able to run MesosClusterDispatcher, through marathon, using docker host networking. It registers with mesos and can submit jobs without any issues. But when I tried to make MesosClusterDispatcher run with bridged networking mode, it never registers with mesos. I exposed ports 8081 and

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
Good stuff, Evan. Looks like this is utilizing the in-memory capabilities of FiloDB, which is pretty cool. Looking forward to the webcast as I don't know much about FiloDB. My personal thoughts here are to remove Spark from the user request/response hot path. I can't tell you how many times

Re: Spark ML - Scaling logistic regression for many features

2016-03-10 Thread Daniel Siegmann
Hi Nick, Thanks for the feedback and the pointers. I tried coalescing to fewer partitions and improved the situation dramatically. As you suggested, it is communication overhead dominating the overall runtime. The training run I mentioned originally had 900 partitions. Each tree aggregation has

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Eran Chinthaka Withana
Yanling, I'm already running the driver on Mesos (through Docker). FYI, I'm running this in cluster mode with MesosClusterDispatcher. Mac (client) --> MesosClusterDispatcher --> Driver running on Mesos --> Workers running on Mesos. My next step is to run MesosClusterDispatcher in mesos through

Graphx

2016-03-10 Thread Andrew A
Hi, is there anyone who uses GraphX in production? What is the maximum size of graph you have processed with Spark, and what cluster do you use for it? I tried to calculate PageRank on 1 GB of edges (the LJ dataset for LiveJournalPageRank from the Spark examples) and I ran into large-volume shuffles produced by Spark

[ANNOUNCE] Announcing Spark 1.6.1

2016-03-10 Thread Michael Armbrust
Spark 1.6.1 is a maintenance release containing stability fixes. This release is based on the branch-1.6 maintenance branch of Spark. We *strongly recommend* all 1.6.0 users to upgrade to this release. Notable fixes include: - Workaround for OOM when writing large partitioned tables SPARK-12546

Job submission failure exception - unserializable TaskEndReason

2016-03-10 Thread Chris Westin
I'm getting an exception when I try to submit a job (through prediction.io, if you know it): [INFO] [Runner$] Submission command: /home/pio/PredictionIO/vendors/spark-1.5.1/bin/spark-submit --class io.prediction.tools.imprt.FileToEvents --files

Re: [MLlib - ALS] Merging two Models?

2016-03-10 Thread Sean Owen
While it isn't crazy, I am not sure how valid it is to build a model off of only a chunk of recent data and then merge it into another model in any direct way. They're not really sharing a basis, so you can't just average them. My experience with this aspect suggests you should try to update the

GraphX Pagerank Subgraphs

2016-03-10 Thread adamsmith
Dear community, I have a flat, really huge RDD describing a link graph, something like a.com/page1 -> b.com/page2 a.com/page1 -> a.com/page5 b.com/page5 -> a.com/page3 I want to calculate pagerank (with GraphX) for in-domain links, i.e. pagerank for all pages of domain a.com, b.com, whatever.com.
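
One possible approach (a sketch, not a confirmed solution from the thread) is to keep only in-domain edges with subgraph and then run PageRank over the resulting per-domain components; domainOf is a hypothetical helper extracting the domain from a vertex attribute:

    // graph: Graph[String, Int], vertex attribute holding the page URL
    def domainOf(url: String): String = url.takeWhile(_ != '/') // hypothetical

    val inDomain = graph.subgraph(
      epred = t => domainOf(t.srcAttr) == domainOf(t.dstAttr))
    val ranks = inDomain.pageRank(0.0001).vertices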

Re: Installing Spark on Mac

2016-03-10 Thread Tristan Nixon
If you type ‘whoami’ in the terminal, and it responds with ‘root’ then you’re the superuser. However, as mentioned below, I don’t think its a relevant factor. > On Mar 10, 2016, at 12:02 PM, Aida Tefera wrote: > > Hi Tristan, > > I'm afraid I wouldn't know whether I'm

RE: Partitioning to speed up processing?

2016-03-10 Thread Gerhard Fiedler
Grouping is applied in the aggregation. From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of Holden Karau Sent: Thu, Mar 10, 2016 13:56 To: Gerhard Fiedler Cc: user@spark.apache.org Subject: Re: Partitioning to speed up processing? Are they entire data set aggregates or is

[MLlib - ALS] Merging two Models?

2016-03-10 Thread Colin Woodbury
Hi there, I'm wondering if it's possible (or feasible) to combine the feature matrices of two MatrixFactorizationModels that share a user and product set. Specifically, one model would be the "on-going" model, and the other is one trained only on the most recent aggregation of some event data. My

Re: Partitioning to speed up processing?

2016-03-10 Thread Holden Karau
Are they entire data set aggregates or is there some grouping applied? On Thursday, March 10, 2016, Gerhard Fiedler wrote: > I have a number of queries that result in a sequence Filter > Project > > Aggregate. I wonder whether partitioning the input table makes

DataFrames - Kryo registration issue

2016-03-10 Thread Raghava Mutharaju
Hello All, If Kryo serialization is enabled, doesn't Spark take care of registering the built-in classes, i.e., aren't we supposed to register only our custom classes? When using DataFrames, this does not seem to be the case. I had to register the following classes
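
If spark.kryo.registrationRequired is set, Kryo does demand explicit registration even of classes Spark uses internally; a minimal sketch (the class list here is illustrative, not the poster's actual list):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(
        classOf[org.apache.spark.sql.types.StructType], // illustrative
        classOf[Array[org.apache.spark.sql.Row]]        // illustrative
      ))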

Partitioning to speed up processing?

2016-03-10 Thread Gerhard Fiedler
I have a number of queries that result in a sequence Filter > Project > Aggregate. I wonder whether partitioning the input table makes sense. Does Aggregate benefit from a partitioned input? If so, what partitions would be most useful (related to the aggregations)? Do Filter and Project

Dropping nested dataframe column

2016-03-10 Thread Ross.Cramblit
Is there any support for dropping a nested column in a dataframe? I have tried dropping with the Column reference as well as a string of the column name, but the returned dataframe is unchanged. >>> df = sqlContext.jsonRDD(sc.parallelize(['{"properties": {"col1": "a", >>> "col2": "b"}}'])) >>>
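
In 1.x, drop only removes top-level columns; a common workaround (sketched here in Scala, mirroring the thread's PySpark schema) is to overwrite the struct with a rebuilt one that omits the unwanted field:

    import org.apache.spark.sql.functions.{col, struct}

    // keeps properties.col1 and discards properties.col2
    val pruned = df.withColumn("properties", struct(col("properties.col1")))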

Re: RDD recomputation

2016-03-10 Thread Kevin Mellott
I've had very good success troubleshooting this type of thing by using the Spark Web UI, which will depict a breakdown of all tasks. This also includes the RDDs being used, as well as any cached data. Additional information about this tool can be found at

RDD recomputation

2016-03-10 Thread souri datta
Hi, Currently I am trying to optimize my spark application and in that process, I am trying to figure out if at any stage in the code, I am recomputing a large RDD (so that I can optimize it by persisting/checkpointing it). Is there any indication in the event logs that tells us about an RDD
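
Besides the event logs, the lineage itself can be inspected; a small sketch (the parse function and path are hypothetical) using toDebugString to spot long chains and persist to cut them:

    import org.apache.spark.storage.StorageLevel

    val big = sc.textFile("hdfs:///data/events").map(parse) // hypothetical
    println(big.toDebugString) // a long lineage reused by several actions will be recomputed
    big.persist(StorageLevel.MEMORY_AND_DISK) // later actions reuse the stored blocks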

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ajay Chander
Hi Everyone, a quick question within this context: what is the underlying persistent storage that you guys are using with regard to this containerized environment? Thanks On Thursday, March 10, 2016, yanlin wang wrote: > How you guys make driver docker within container to be

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread yanlin wang
How do you guys make the driver running within a Docker container reachable from the Spark workers? Would you share your driver Docker image? I am trying to put only the driver in Docker, with Spark running on YARN outside of the container, and I don't want to use --net=host. Thx Yanlin > On Mar 10, 2016, at 11:06 AM,

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Guillaume Eynard Bontemps
Glad to hear it. Thanks all for sharing your solutions. On Thu, Mar 10, 2016 at 19:19, Eran Chinthaka Withana wrote: > Phew, it worked. All I had to do was to add *export > SPARK_JAVA_OPTS="-Dspark.mesos.executor.docker.image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6"

RE: Achieving 700 Spark SQL Queries Per Second

2016-03-10 Thread Silvio Fiorito
Very cool stuff Evan. Thanks for your work on this and sharing! From: Evan Chan Sent: Thursday, March 10, 2016 1:38 PM To: user@spark.apache.org Subject: Achieving 700 Spark SQL Queries Per Second Hey folks, I just saw a

Re: Can we use spark inside a web service?

2016-03-10 Thread velvia.github
Hi, I just wrote a blog post which might be really useful to you -- I have just benchmarked being able to achieve 700 queries per second in Spark. So, yes, web speed SQL queries are definitely possible. Read my new blog post: http://velvia.github.io/Spark-Concurrent-Fast-Queries/ and feel

Achieving 700 Spark SQL Queries Per Second

2016-03-10 Thread Evan Chan
Hey folks, I just saw a recent thread on here (but can't find it anymore) on using Spark as a web-speed query engine. I want to let you guys know that this is definitely possible! Most folks don't realize how low-latency Spark can actually be. Please check out my blog post below on achieving

Re: AVRO vs Parquet

2016-03-10 Thread Michael Armbrust
A few clarifications: > 1) High memory and cpu usage. This is because Parquet files can't be > streamed into as records arrive. I have seen a lot of OOMs in reasonably > sized MR/Spark containers that write out Parquet. When doing dynamic > partitioning, where many writers are open at once,

Re: Installing Spark on Mac

2016-03-10 Thread Aida
Hi Gaini, thanks for your response Please see the below contents of the files in the conf. directory: 1. docker.properties.template Licensed to the Apache Software Foundation (ASF) under one or more # contributor license agreements. See the NOTICE file distributed with # this work for

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Eran Chinthaka Withana
Phew, it worked. All I had to do was to add *export SPARK_JAVA_OPTS="-Dspark.mesos.executor.docker.image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6" *before calling spark-submit. Guillaume, thanks for the pointer. Timothy, thanks for looking into this. Looking forward to see a fix soon. Thanks,

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Eran Chinthaka Withana
Hi Timothy What version of spark are you guys running? > I'm using Spark 1.6.0. You can see the Dockerfile I used here: https://github.com/echinthaka/spark-mesos-docker/blob/master/docker/mesos-spark/Dockerfile > And also did you set the working dir in your image to be spark home? > Yes I

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Tim Chen
Hi Eran, I need to investigate but perhaps that's true, we're using SPARK_JAVA_OPTS to pass all the options and not --conf. I'll take a look at the bug, but if you can try the workaround and see if that fixes your problem. Tim On Thu, Mar 10, 2016 at 10:08 AM, Eran Chinthaka Withana <

Re: Installing Spark on Mac

2016-03-10 Thread Aida Tefera
Hi Tristan, I'm afraid I wouldn't know whether I'm running it as superuser. I have Java version 1.8.0_73 and Scala version 2.11.7 Sent from my iPhone > On 9 Mar 2016, at 21:58, Tristan Nixon wrote: > > That's very strange. I just un-set my SPARK_HOME env param,

Re: log4j pains

2016-03-10 Thread Tristan Nixon
Hmmm… that should be right. > On Mar 10, 2016, at 11:26 AM, Ashic Mahtab wrote: > > src/main/resources/log4j.properties > > Subject: Re: log4j pains > From: st...@memeticlabs.org > Date: Thu, 10 Mar 2016 11:08:46 -0600 > CC: user@spark.apache.org > To: as...@live.com > > Where

RE: log4j pains

2016-03-10 Thread Ashic Mahtab
src/main/resources/log4j.properties Subject: Re: log4j pains From: st...@memeticlabs.org Date: Thu, 10 Mar 2016 11:08:46 -0600 CC: user@spark.apache.org To: as...@live.com Where in the jar is the log4j.properties file? On Mar 10, 2016, at 9:40 AM, Ashic Mahtab wrote:1. Fat jar

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Tim Chen
Here is an example Dockerfile; although it's a bit dated now, if you build it today it should still work: https://github.com/tnachen/spark/tree/dockerfile/mesos_docker Tim On Thu, Mar 10, 2016 at 8:06 AM, Ashish Soni wrote: > Hi Tim , > > Can you please share your

Re: Problem with union of DirectStream

2016-03-10 Thread Cody Koeninger
If you do any RDD transformation, it's going to return a different RDD than the original. The implication for casting to HasOffsetRanges is specifically called out in the docs at http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers On Thu,
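
The pattern from those docs, sketched: capture the offsets in the very first transformation, while the RDD is still the one the direct stream produced, and transform or union freely afterwards:

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    var offsetRanges = Array.empty[OffsetRange]
    directKafkaStream.transform { rdd =>
      // must run before any other transformation changes the RDD type
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.foreachRDD { rdd =>
      offsetRanges.foreach(o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
    }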

Re: log4j pains

2016-03-10 Thread Tristan Nixon
Where in the jar is the log4j.properties file? > On Mar 10, 2016, at 9:40 AM, Ashic Mahtab wrote: > > 1. Fat jar with logging dependencies included. log4j.properties in fat jar. > Spark doesn't pick up the properties file, so uses its defaults.

Problem with union of DirectStream

2016-03-10 Thread Guillermo Ortiz
I have a DirectStream and process data from Kafka, val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams1, topics1.toSet) directKafkaStream.foreachRDD { rdd => val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges When

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ashish Soni
Hi Tim, Can you please share your Dockerfiles and configuration, as it will help a lot; I am planning to publish a blog post on the same. Ashish On Thu, Mar 10, 2016 at 10:34 AM, Timothy Chen wrote: > No you don't need to install spark on each slave, we have been running

Re: Spark job for Reading time series data from Cassandra

2016-03-10 Thread Matthias Niehoff
Hi, the spark connector docs say: ( https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md ) "The number of Spark partitions(tasks) created is directly controlled by the setting spark.cassandra.input.split.size_in_mb. This number reflects the approximate amount of Cassandra

Re: Spark job for Reading time series data from Cassandra

2016-03-10 Thread Bryan Jeffrey
Prateek, I believe that one task is created per Cassandra partition. How is your data partitioned? Regards, Bryan Jeffrey On Thu, Mar 10, 2016 at 10:36 AM, Prateek . wrote: > Hi, > > > > I have a Spark Batch job for reading timeseries data from Cassandra which > has

log4j pains

2016-03-10 Thread Ashic Mahtab
Hello, I'm trying to use a custom log4j appender, with things specified in a log4j.properties file. Very little seems to work in this regard. Here's what I've tried: 1. Fat jar with logging dependencies included. log4j.properties in fat jar. Spark doesn't pick up the properties file, so uses its

Spark job for Reading time series data from Cassandra

2016-03-10 Thread Prateek .
Hi, I have a Spark Batch job for reading timeseries data from Cassandra which has 50,000 rows. JavaRDD cassandraRowsRDD = javaFunctions.cassandraTable("iotdata", "coordinate") .map(new Function() { @Override public
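
The archive truncates the Java snippet; for reference, a minimal Scala sketch of the same read with the connector (keyspace and table from the thread), whose partition count is what split.size_in_mb governs:

    import com.datastax.spark.connector._

    val rows = sc.cassandraTable("iotdata", "coordinate")
    println(s"rows: ${rows.count()}, partitions: ${rows.partitions.length}")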

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Timothy Chen
No, you don't need to install Spark on each slave; we have been running this setup in Mesosphere without any problem. At this point I think it's most likely a configuration problem, and perhaps there's a chance something is missing in the code to handle some cases. What version of spark are you guys running?

how does dimsum work for categorical data

2016-03-10 Thread naveen.marri
Hi, I'm exploring DIMSUM (Dimension Independent Matrix Square using MapReduce) for finding similarities between users in terms of the products they have purchased. I've modeled the matrix as User1, product1,product2,0,0 user2, product2,0,0,0 user3, product1,product3,product4,product2 . ...
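
In MLlib, DIMSUM is exposed as RowMatrix.columnSimilarities with a sampling threshold; note that it compares columns, so with users as rows (one-hot over products, made-up data below) it yields product-product similarities, and comparing users means transposing that layout:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.sparse(4, Seq((0, 1.0), (1, 1.0))),                    // user1: p1, p2
      Vectors.sparse(4, Seq((1, 1.0))),                              // user2: p2
      Vectors.sparse(4, Seq((0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0))) // user3: all
    ))
    val mat = new RowMatrix(rows)
    val sims = mat.columnSimilarities(0.1) // threshold > 0 turns on DIMSUM sampling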

Re: does spark need dedicated machines to run on

2016-03-10 Thread Ted Yu
bq. Started SparkUI at http://192.168.2.103:4040 bq. Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources Can you check UI ? Thanks On Thu, Mar 10, 2016 at 6:57 AM, Shams ul Haque wrote: >

Re: does spark need dedicated machines to run on

2016-03-10 Thread Shams ul Haque
Hi, *Release of Spark:* 1.6.0, I downloaded it and made a build using 'sbt/sbt assembly'. *Command for submitting your app:* bin/spark-submit --master spark://shams-machine:7077 --executor-cores 2 --class in.myapp.email.combiner.CombinerRealtime

Re: "bootstrapping" DStream state

2016-03-10 Thread Zalzberg, Idan (Agoda)
Ha! So easy, how could I miss it?! Thanks! Sent with AquaMail for Android http://www.aqua-mail.com On March 10, 2016 9:32:38 PM Todd Nist wrote: The updateStateByKey can be supplied an initialRDD to populate it with. Per code

Re: does spark need dedicated machines to run on

2016-03-10 Thread Ted Yu
Can you provide a bit more information? Release of Spark; command for submitting your app; code snippet of your app; pastebin of log. Thanks On Thu, Mar 10, 2016 at 6:32 AM, Shams ul Haque wrote: > Hi, > > I have developed a spark realtime app and started spark-standalone on

does spark need dedicated machines to run on

2016-03-10 Thread Shams ul Haque
Hi, I have developed a Spark realtime app and started spark-standalone on my laptop. But when I try to submit that app to Spark it is always in WAITING state and its cores are always zero. I have set: export SPARK_WORKER_CORES="2" export SPARK_EXECUTOR_CORES="1" in spark-env.sh, but still nothing

Re: "bootstrapping" DStream state

2016-03-10 Thread Todd Nist
The updateStateByKey can be supplied an initialRDD to populate it with. Per code ( https://github.com/apache/spark/blob/v1.4.0/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala#L435-L445 ). Provided here for your convenience. /** * Return a new "state"
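
A minimal sketch of that overload, with a made-up (String, Int) running-count state:

    import org.apache.spark.HashPartitioner

    val initialRDD = ssc.sparkContext.parallelize(Seq(("a", 1), ("b", 0)))
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))
    // pairStream: DStream[(String, Int)], hypothetical
    val stateDStream = pairStream.updateStateByKey[Int](
      updateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      initialRDD)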

Re: Spark Streaming, very slow processing and increasing scheduling delay of kafka input stream

2016-03-10 Thread Todd Nist
Hi Vinti, All of your tasks are failing based on the screen shots provided. I think a few more details would be helpful. Is this YARN or a Standalone cluster? How much overall memory is on your cluster? On each machine where workers and executors are running? Are you using the Direct

Correlation Matrix Limits

2016-03-10 Thread Sebastian Kuepers
Hello, I am planning to use from the pyspark.mllib.stat package the corr() function to compute a correlation matrix. Will this happen in a distributed fashion and does it scale up well, if you have Vectors with a length of over a million columns? Thanks, Sebastian
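
For reference, the Scala equivalent of that call: the computation is distributed over an RDD[Vector], but the result comes back as a local Matrix on the driver, so a million-column input implies on the order of 10^12 entries in driver memory and will not scale to that width:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)))
    val corrMatrix = Statistics.corr(data, "pearson") // local mllib.linalg.Matrix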

Re: Zeppelin Integration

2016-03-10 Thread ayan guha
Thanks guys for the reply. Yes, Zeppelin with Spark is a pretty compelling choice for a single user. Any pointers for using Zeppelin in a multi-user scenario? In essence, can we either (a) use Zeppelin to connect to a long-running Spark application which has some pre-cached DataFrames, or (b) can Zeppelin

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ashish Soni
You need to install Spark on each Mesos slave, and then when starting the container, set the workdir to your Spark home so that it can find the Spark classes. Ashish > On Mar 10, 2016, at 5:22 AM, Guillaume Eynard Bontemps > wrote: > > For an answer to my question see

Re: Zeppelin Integration

2016-03-10 Thread Sabarish Sasidharan
I believe you need to co-locate your Zeppelin on the same node where Spark is installed. You need to specify the SPARK HOME. The master I used was YARN. Zeppelin exposes a notebook interface. A notebook can have many paragraphs. You run the paragraphs. You can mix multiple contexts in the same

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Guillaume Eynard Bontemps
For an answer to my question see this: http://stackoverflow.com/a/35660466?noredirect=1. But for your problem, did you define the spark.mesos.docker.home property or something like that? On Thu, Mar 10, 2016 at 04:26, Eran Chinthaka Withana wrote: > Hi > > I'm also

sql functions: row_number, percent_rank, rank, rowNumber

2016-03-10 Thread AlexModestov
Hello all, I am trying to use some SQL functions. My task is to renumber the rows in a DataFrame. I use the SQL functions but they don't work and I don't understand why. I would appreciate your help fixing this issue. Thank you! The piece of my code: "from pyspark.sql.functions import row_number, percent_rank, rank,
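
row_number, rank and percent_rank are window functions and must be applied .over() a window spec (on Spark 1.6 they also require a HiveContext); the thread is PySpark, but here is a sketch of the idea in Scala with a hypothetical ordering column:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    val w = Window.orderBy("id") // hypothetical ordering column
    val renumbered = df.withColumn("rn", row_number().over(w))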

PySpark API signature mismatch compared to Scala

2016-03-10 Thread Li Ming Tsai
Hi, Looking at 1.6.0, I see at least the following mismatches: 1. DStream mapWithState, though declared experimental, is not found in Python 2. DStream updateStateByKey is missing initialRDD and partitioner. Is the Python API falling behind? Thanks!

Zeppelin Integration

2016-03-10 Thread ayan guha
Hi All, I am writing this in order to get a fair understanding of how Zeppelin can be integrated with Spark. Our use case is to load a few tables from a DB into Spark and run some transformations. Once done, we want to expose the data through Zeppelin for analytics. I have a few questions around that to sound

Re: Streaming job delays

2016-03-10 Thread Matthias Niehoff
Hi, dynamic allocation is AFAIK not supported for streaming applications; that's maybe the reason. See also: https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCA+AHuKkxg44WvXZGr4MVNUxioWH3o8pZZQRTaXR=m5cb-op...@mail.gmail.com%3E If you are using Spark 1.6 there should also be a

Re: Dynamic allocation doesn't work on YARN

2016-03-10 Thread Jy Chen
Thank you for your help and advice. After I added the log4j conf to expose more details, I found that Spark had sent the removing request while some containers did not receive the SIGTERM signal. Thanks. 2016-03-10 10:52 GMT+08:00 Saisai Shao : > Still I think this