RE: Spark job for Reading time series data from Cassandra

2016-03-10 Thread Prateek .
Hi, thanks guys. I was using spark-cassandra-connector 1.4.0-M1. There is an issue in this version of spark-cassandra-connector: the parameter spark.cassandra.input.split.size_in_mb, which is supposed to have a default value of 64 MB, is being interpreted as 64 bytes. This causes too many
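
A minimal sketch of pinning the split size explicitly while on the affected connector version (app name and host are hypothetical; upgrading the connector is the real fix):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-read") // hypothetical
      .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical
      // On 1.4.0-M1 this value is read as bytes rather than MB, producing
      // far too many tiny partitions; later connector versions fix this.
      .set("spark.cassandra.input.split.size_in_mb", "64")
    val sc = new SparkContext(conf)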

Spark REST APIs not working when application deployed in CLUSTER mode

2016-03-10 Thread rakesh rakshit
Hi all, I am able to make the REST API calls in client mode but when I run the application in Cluster mode, I am getting the below exception in driver logs: java.lang.IncompatibleClassChangeError: com.sun.jersey.json.impl.provider.entity.JSONRootElementProvider and

Re: Running ALS on comparatively large RDD

2016-03-10 Thread Deepak Gopalakrishnan
1. I'm using about 1 million users against a few thousand products. I basically have around a million ratings. 2. Spark 1.6 on Amazon EMR On Fri, Mar 11, 2016 at 12:46 PM, Nick Pentreath wrote: > Could you provide more details about: > 1. Data set size (# ratings, # users

Python unit tests - Unable to run it with Python 2.6 or 2.7

2016-03-10 Thread Gayathri Murali
Hi all, I am trying to run the Python unit tests. I currently have Python 2.6 and 2.7 installed. I installed unittest2 against both of them. When I try to run /python/run-tests with Python 2.7 I get the following error: Please install unittest2 to test with Python 2.6 or earlier Had test

Re: Running ALS on comparatively large RDD

2016-03-10 Thread Nick Pentreath
Could you provide more details about: 1. Data set size (# ratings, # users and # products) 2. Spark cluster set up and version Thanks On Fri, 11 Mar 2016 at 05:53 Deepak Gopalakrishnan wrote: > Hello All, > > I've been running Spark's ALS on a dataset of users and rated

Re: [MLlib - ALS] Merging two Models?

2016-03-10 Thread Nick Pentreath
Sean's old Myrrix slides contain an overview of the fold-in math: http://www.slideshare.net/srowen/big-practical-recommendations-with-alternating-least-squares/14?src=clipshare I never quite got around to actually incorporating it into my own ALS-based systems, because in the end I just

Re: Can we use spark inside a web service?

2016-03-10 Thread Nick Pentreath
Yes, really interesting discussion. It would be really interesting to compare the performance of alternative architectures. Specifically, I've found that Elasticsearch is a great option for analytic workloads - it doesn't support SQL (joins in particular), but its aggregation and arbitrary

RE: lint-r checks failing

2016-03-10 Thread Sun, Rui
This is probably because the installed lintr package got updated. After the update, lintr can detect errors that were skipped before. I will submit a PR for this issue -Original Message- From: Gayathri Murali [mailto:gayathri.m.sof...@gmail.com] Sent: Friday, March 11, 2016 12:48 PM To:

How can I join two DataSet of same case class?

2016-03-10 Thread 박주형
Hi. I want to join two DataSets, but the stderr below is shown: 16/03/11 13:55:51 WARN ColumnName: Constructing trivially true equals predicate, ''edid = 'edid'. Perhaps you need to use aliases. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'edid' given input
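
A minimal sketch of the usual fix on Spark 1.6, assuming a hypothetical case class Event(edid: String, value: Long): alias both sides so that 'edid' resolves unambiguously instead of yielding the trivially true predicate:

    import org.apache.spark.sql.functions.col

    case class Event(edid: String, value: Long)
    // ds1, ds2: Dataset[Event] with identical schemas
    val joined = ds1.toDF().as("a")
      .join(ds2.toDF().as("b"), col("a.edid") === col("b.edid"))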

lint-r checks failing

2016-03-10 Thread Gayathri Murali
Hi All, I recently tried to run ./dev/run-tests on a freshly cloned Spark repository and I get a lint-r check failed error. I have run these tests multiple times before and never had this issue. I have copied part of the issue here. Please note that I haven't modified any of these files. Am I

Re: Can we use spark inside a web service?

2016-03-10 Thread Tristan Nixon
Hear, hear. That’s why I’m here :) > On Mar 10, 2016, at 7:32 PM, Chris Fregly wrote: > > Anyway, thanks for the good discussion, everyone! This is why we have these > lists, right! :)

Get output of the ALS algorithm.

2016-03-10 Thread Shishir Anshuman
Hello, I am new to Apache Spark and would like to get the recommendation output of the ALS algorithm in a file. Please suggest a solution. Thank you
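
A minimal sketch, assuming an MLlib MatrixFactorizationModel named model and a hypothetical output path: recommendProductsForUsers gives the top-N recommendations per user, which can then be written out as text:

    import org.apache.spark.mllib.recommendation.Rating

    val recs = model.recommendProductsForUsers(10) // RDD[(Int, Array[Rating])]
    recs.flatMap { case (user, ratings) =>
      ratings.map((r: Rating) => s"$user,${r.product},${r.rating}")
    }.saveAsTextFile("hdfs:///tmp/als-recommendations") // hypothetical path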

Kafka + Spark streaming, RDD partitions not processed in parallel

2016-03-10 Thread Mukul Gupta
Hi All, I was running the following test: *Setup* 9 VMs running Spark workers with 1 Spark executor each, 1 VM running Kafka and the Spark master. Spark version is 1.6.0. Kafka version is 0.9.0.1. Spark is using its own resource manager and is not running over YARN. *Test* I created a kafka topic with 3 partitions.

Why the shuffle write is not exactly the same as the shuffle read of the next stage

2016-03-10 Thread canan chen
Here's my screenshot; stages 19 and 20 have a one-to-one relationship, being the only child/parent pair. From my understanding, the shuffle write of stage 19 should be the same as the shuffle read of stage 20, but here they differ slightly. Is there any reason for it? Thanks. [image: Inline

Does anyone have experience processing large volume images on Spark cluster

2016-03-10 Thread greg huang
Hi All, Does anyone have experience processing large volumes of images on a Spark cluster? For example, using Spark to run MapReduce-style tasks to identify common features, such as counting the number of cars in satellite pictures. Regards, Greg

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
great discussion, indeed. Mark Hamstra and i spoke offline just now. Below is a quick recap of our discussion on how they've achieved acceptable performance from Spark on the user request/response path (@mark- feel free to correct/comment). 1) there is a big difference in request/response

Re: Can we use spark inside a web service?

2016-03-10 Thread Evan Chan
One of the premises here is that if you can restrict your workload to fewer cores - which is easier with FiloDB and careful data modeling - you can make this work for much higher concurrency and lower latency than most typical Spark use cases. The reason why it typically does not work in

Re: Spark configuration with 5 nodes

2016-03-10 Thread Mich Talebzadeh
Hi, Bear in mind that you typically need 1 GB of NameNode memory for 1 million blocks. So if you have a 128 MB block size, you can store 128 * 1E6 / (3 * 1024) = 41,666 GB of data for every 1 GB. The number 3 comes from the fact that each block is replicated three times. In other words, just under 42 TB of

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ashish Soni
When you say the driver is running on Mesos, can you explain how you are doing that? > On Mar 10, 2016, at 4:44 PM, Eran Chinthaka Withana > wrote: > > Yanling I'm already running the driver on mesos (through docker). FYI, I'm > running this on cluster mode with

Flashback: RDD.aggregate versus accumulables...

2016-03-10 Thread jiml
And Lord Joe, you were right: future versions did protect accumulators in actions. I wonder if anyone has a "modern" take on the accumulator vs. aggregate question. Seems like if I need to do it by key or control partitioning I would use aggregate. Bottom line question / reason for post: I wonder

Re: Spark configuration with 5 nodes

2016-03-10 Thread Prabhu Joseph
Ashok, The cluster nodes have enough memory but relatively few CPU cores. 512 GB / 16 = 32 GB, so the cluster has 32 GB of memory per core. Either there should be more cores available to use the memory efficiently, or don't configure a high executor memory, which will cause a lot of GC. Thanks,

Re: Can we use spark inside a web service?

2016-03-10 Thread Teng Qiu
This really depends on how you define "hot" :) and on the use cases; Spark is definitely not one-size-fits-all, at least not yet, especially for heavy joins and full scans. Maybe spark alone fits your production workload and analytical requirements, but in general, I agree with Chris, for high

Re: AVRO vs Parquet

2016-03-10 Thread Guru Medasani
Thanks Michael for clarifying this. My response is inline. Guru Medasani gdm...@gmail.com > On Mar 10, 2016, at 12:38 PM, Michael Armbrust wrote: > > A few clarifications: > > 1) High memory and cpu usage. This is because Parquet files can't be streamed > into as

Re: [MLlib - ALS] Merging two Models?

2016-03-10 Thread Chris Fregly
@Colin- you're asking the million-dollar question that a lot of people are trying to answer. This was literally the #1 most-asked question in every city on my recent world-wide meetup tour. I've been pointing people to my old Databricks co-worker's streaming-matrix-factorization project:

Re: trouble with NUMPY constructor in UDF

2016-03-10 Thread Andy Davidson
In [7]: pdf = gdf1.toPandas(); pdf['date'] = epoch2num(pdf['ms']); print(pdf.dtypes); pdf -- count int64, row_key object, created datetime64[ns], ms int64, date float64, dtype: object Out[7]: count row_key created ms date 0 2 realDonaldTrump 2016-03-09

Re: Can we use spark inside a web service?

2016-03-10 Thread Mark Hamstra
The fact that a typical Job requires multiple Tasks is not a problem, but rather an opportunity for the Scheduler to interleave the workloads of multiple concurrent Jobs across the available cores. I work every day with such a production architecture with Spark on the user request/response hot

Re: trouble with NUMPY constructor in UDF

2016-03-10 Thread Andy Davidson
Hi Ted, In Python the data type is 'float64'. I have tried using both sql FloatType and DoubleType, however I get the same error. Strange. Andy From: Ted Yu Date: Wednesday, March 9, 2016 at 3:28 PM To: Andrew Davidson Cc: "user @spark"

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
you are correct, mark. i misspoke. apologies for the confusion. so the problem is even worse given that a typical job requires multiple tasks/cores. i have yet to see this particular architecture work in production. i would love for someone to prove otherwise. On Thu, Mar 10, 2016 at 5:44

Re: Can we use spark inside a web service?

2016-03-10 Thread Mark Hamstra
> > For example, if you're looking to scale out to 1000 concurrent requests, > this is 1000 concurrent Spark jobs. This would require a cluster with 1000 > cores. This doesn't make sense. A Spark Job is a driver/DAGScheduler concept without any 1:1 correspondence between Worker cores and Jobs.

Re: Can we use spark inside a web service?

2016-03-10 Thread Tristan Nixon
Very interested, Evan, thanks for the link. It has given me some food for thought. I'm also in the process of building a web application which leverages Spark on the back-end for some heavy lifting. I would be curious about your thoughts on my proposed architecture: I was planning on running a

Re: Spark Streaming, very slow processing and increasing scheduling delay of kafka input stream

2016-03-10 Thread Andy Davidson
In my experience I would try the following. I use the standalone cluster manager. Each app gets its own performance web page; the streaming tab is really helpful. If processing time is greater than your mini-batch length you are going to run into performance problems. Use the "stages" tab to

Spark configuration with 5 nodes

2016-03-10 Thread Ashok Kumar
Hi, We intend to use 5 servers which will be utilized for building a Big Data Hadoop data warehouse system (not using any proprietary distribution like Hortonworks or Cloudera or others). All servers' configurations are 512 GB RAM, 30 TB storage and 16 cores, on Ubuntu Linux servers. Hadoop will be

Running MesosClusterDispatcher through marathon with bridged mode

2016-03-10 Thread Eran Chinthaka Withana
Hi I was able to run MesosClusterDispatcher, through marathon, using docker host networking. It registers with mesos and can submit jobs without any issues. But when I tried to make MesosClusterDispatcher run with bridged networking mode, it never registers with mesos. I exposed ports 8081 and

Re: Can we use spark inside a web service?

2016-03-10 Thread Chris Fregly
Good stuff, Evan. Looks like this is utilizing the in-memory capabilities of FiloDB, which is pretty cool. Looking forward to the webcast as I don't know much about FiloDB. My personal thoughts here are to remove Spark from the user request/response hot path. I can't tell you how many times

Re: Spark ML - Scaling logistic regression for many features

2016-03-10 Thread Daniel Siegmann
Hi Nick, Thanks for the feedback and the pointers. I tried coalescing to fewer partitions and improved the situation dramatically. As you suggested, it is communication overhead dominating the overall runtime. The training run I mentioned originally had 900 partitions. Each tree aggregation has

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Eran Chinthaka Withana
Yanling, I'm already running the driver on Mesos (through Docker). FYI, I'm running this in cluster mode with MesosClusterDispatcher. Mac (client) --> MesosClusterDispatcher --> Driver running on Mesos --> Workers running on Mesos. My next step is to run MesosClusterDispatcher in mesos through

Graphx

2016-03-10 Thread Andrew A
Hi, is there anyone who uses GraphX in production? What is the maximum size of graph you have processed with Spark, and what cluster do you use for it? I tried to calculate PageRank on 1 GB of edges (the LJ dataset for LiveJournalPageRank from the Spark examples) and I ran into large-volume shuffles produced by Spark

[ANNOUNCE] Announcing Spark 1.6.1

2016-03-10 Thread Michael Armbrust
Spark 1.6.1 is a maintenance release containing stability fixes. This release is based on the branch-1.6 maintenance branch of Spark. We *strongly recommend* all 1.6.0 users to upgrade to this release. Notable fixes include: - Workaround for OOM when writing large partitioned tables SPARK-12546

Job submission failure exception - unserializable TaskEndReason

2016-03-10 Thread Chris Westin
I'm getting an exception when I try to submit a job (through prediction.io, if you know it): [INFO] [Runner$] Submission command: /home/pio/PredictionIO/vendors/spark-1.5.1/bin/spark-submit --class io.prediction.tools.imprt.FileToEvents --files

Re: [MLlib - ALS] Merging two Models?

2016-03-10 Thread Sean Owen
While it isn't crazy, I am not sure how valid it is to build a model off of only a chunk of recent data and then merge it into another model in any direct way. They're not really sharing a basis, so you can't just average them. My experience with this aspect suggests you should try to update the

GraphX Pagerank Subgraphs

2016-03-10 Thread adamsmith
Dear community, I have a flat, really huge RDD describing a link graph, something like a.com/page1 -> b.com/page2 a.com/page1 -> a.com/page5 b.com/page5 -> a.com/page3 I want to calculate pagerank (with GraphX) for in-domain links, i.e. pagerank for all pages of domain a.com, b.com, whatever.com.
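
One possible approach (a sketch, not a confirmed solution from the thread) is to keep only in-domain edges with subgraph and then run PageRank over the resulting per-domain components; domainOf is a hypothetical helper extracting the domain from a vertex attribute:

    // graph: Graph[String, Int], vertex attribute holding the page URL
    def domainOf(url: String): String = url.takeWhile(_ != '/') // hypothetical

    val inDomain = graph.subgraph(
      epred = t => domainOf(t.srcAttr) == domainOf(t.dstAttr))
    val ranks = inDomain.pageRank(0.0001).vertices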

Re: Installing Spark on Mac

2016-03-10 Thread Tristan Nixon
If you type ‘whoami’ in the terminal, and it responds with ‘root’ then you’re the superuser. However, as mentioned below, I don’t think its a relevant factor. > On Mar 10, 2016, at 12:02 PM, Aida Tefera wrote: > > Hi Tristan, > > I'm afraid I wouldn't know whether I'm

RE: Partitioning to speed up processing?

2016-03-10 Thread Gerhard Fiedler
Grouping is applied in the aggregation. From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of Holden Karau Sent: Thu, Mar 10, 2016 13:56 To: Gerhard Fiedler Cc: user@spark.apache.org Subject: Re: Partitioning to speed up processing? Are they entire data set aggregates or is

[MLlib - ALS] Merging two Models?

2016-03-10 Thread Colin Woodbury
Hi there, I'm wondering if it's possible (or feasible) to combine the feature matrices of two MatrixFactorizationModels that share a user and product set. Specifically, one model would be the "on-going" model, and the other is one trained only on the most recent aggregation of some event data. My

Re: Partitioning to speed up processing?

2016-03-10 Thread Holden Karau
Are they entire data set aggregates or is there some grouping applied? On Thursday, March 10, 2016, Gerhard Fiedler wrote: > I have a number of queries that result in a sequence Filter > Project > > Aggregate. I wonder whether partitioning the input table makes

DataFrames - Kryo registration issue

2016-03-10 Thread Raghava Mutharaju
Hello All, If Kryo serialization is enabled, doesn't Spark take care of registering the built-in classes, i.e., aren't we supposed to register only our custom classes? When using DataFrames, this does not seem to be the case. I had to register the following classes
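
If spark.kryo.registrationRequired is set, Kryo does demand explicit registration even of classes Spark uses internally; a minimal sketch (the class list here is illustrative, not the poster's actual list):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(
        classOf[org.apache.spark.sql.types.StructType], // illustrative
        classOf[Array[org.apache.spark.sql.Row]]        // illustrative
      ))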

Partitioning to speed up processing?

2016-03-10 Thread Gerhard Fiedler
I have a number of queries that result in a sequence Filter > Project > Aggregate. I wonder whether partitioning the input table makes sense. Does Aggregate benefit from a partitioned input? If so, what partitions would be most useful (related to the aggregations)? Do Filter and Project

Dropping nested dataframe column

2016-03-10 Thread Ross.Cramblit
Is there any support for dropping a nested column in a dataframe? I have tried dropping with the Column reference as well as a string of the column name, but the returned dataframe is unchanged. >>> df = sqlContext.jsonRDD(sc.parallelize(['{"properties": {"col1": "a", >>> "col2": "b"}}'])) >>>
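
In 1.x, drop only removes top-level columns; a common workaround (sketched here in Scala, mirroring the thread's PySpark schema) is to overwrite the struct with a rebuilt one that omits the unwanted field:

    import org.apache.spark.sql.functions.{col, struct}

    // keeps properties.col1 and discards properties.col2
    val pruned = df.withColumn("properties", struct(col("properties.col1")))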

Re: RDD recomputation

2016-03-10 Thread Kevin Mellott
I've had very good success troubleshooting this type of thing by using the Spark Web UI, which will depict a breakdown of all tasks. This also includes the RDDs being used, as well as any cached data. Additional information about this tool can be found at

RDD recomputation

2016-03-10 Thread souri datta
Hi, Currently I am trying to optimize my spark application and in that process, I am trying to figure out if at any stage in the code, I am recomputing a large RDD (so that I can optimize it by persisting/checkpointing it). Is there any indication in the event logs that tells us about an RDD
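
Besides the event logs, the lineage itself can be inspected; a small sketch (the parse function and path are hypothetical) using toDebugString to spot long chains and persist to cut them:

    import org.apache.spark.storage.StorageLevel

    val big = sc.textFile("hdfs:///data/events").map(parse) // hypothetical
    println(big.toDebugString) // a long lineage reused by several actions will be recomputed
    big.persist(StorageLevel.MEMORY_AND_DISK) // later actions reuse the stored blocks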

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ajay Chander
Hi Everyone, a quick question within this context: what is the underlying persistent storage that you guys are using with regard to this containerized environment? Thanks On Thursday, March 10, 2016, yanlin wang wrote: > How you guys make driver docker within container to be

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread yanlin wang
How do you guys make the driver running within a Docker container reachable from the Spark workers? Would you share your driver Docker image? I am trying to put only the driver in Docker, with Spark running on YARN outside of the container, and I don't want to use --net=host. Thx Yanlin > On Mar 10, 2016, at 11:06 AM,

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Guillaume Eynard Bontemps
Glad to hear it. Thanks all for sharing your solutions. On Thu, Mar 10, 2016 at 19:19, Eran Chinthaka Withana wrote: > Phew, it worked. All I had to do was to add *export > SPARK_JAVA_OPTS="-Dspark.mesos.executor.docker.image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6"

RE: Achieving 700 Spark SQL Queries Per Second

2016-03-10 Thread Silvio Fiorito
Very cool stuff Evan. Thanks for your work on this and sharing! From: Evan Chan Sent: Thursday, March 10, 2016 1:38 PM To: user@spark.apache.org Subject: Achieving 700 Spark SQL Queries Per Second Hey folks, I just saw a

Re: Can we use spark inside a web service?

2016-03-10 Thread velvia.github
Hi, I just wrote a blog post which might be really useful to you -- I have just benchmarked being able to achieve 700 queries per second in Spark. So, yes, web speed SQL queries are definitely possible. Read my new blog post: http://velvia.github.io/Spark-Concurrent-Fast-Queries/ and feel

Achieving 700 Spark SQL Queries Per Second

2016-03-10 Thread Evan Chan
Hey folks, I just saw a recent thread on here (but can't find it anymore) on using Spark as a web-speed query engine. I want to let you guys know that this is definitely possible! Most folks don't realize how low-latency Spark can actually be. Please check out my blog post below on achieving

Re: AVRO vs Parquet

2016-03-10 Thread Michael Armbrust
A few clarifications: > 1) High memory and cpu usage. This is because Parquet files can't be > streamed into as records arrive. I have seen a lot of OOMs in reasonably > sized MR/Spark containers that write out Parquet. When doing dynamic > partitioning, where many writers are open at once,

Re: Installing Spark on Mac

2016-03-10 Thread Aida
Hi Gaini, thanks for your response Please see the below contents of the files in the conf. directory: 1. docker.properties.template Licensed to the Apache Software Foundation (ASF) under one or more # contributor license agreements. See the NOTICE file distributed with # this work for

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Eran Chinthaka Withana
Phew, it worked. All I had to do was to add *export SPARK_JAVA_OPTS="-Dspark.mesos.executor.docker.image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6" *before calling spark-submit. Guillaume, thanks for the pointer. Timothy, thanks for looking into this. Looking forward to see a fix soon. Thanks,

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Eran Chinthaka Withana
Hi Timothy What version of spark are you guys running? > I'm using Spark 1.6.0. You can see the Dockerfile I used here: https://github.com/echinthaka/spark-mesos-docker/blob/master/docker/mesos-spark/Dockerfile > And also did you set the working dir in your image to be spark home? > Yes I

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Tim Chen
Hi Eran, I need to investigate but perhaps that's true, we're using SPARK_JAVA_OPTS to pass all the options and not --conf. I'll take a look at the bug, but if you can try the workaround and see if that fixes your problem. Tim On Thu, Mar 10, 2016 at 10:08 AM, Eran Chinthaka Withana <

Re: Installing Spark on Mac

2016-03-10 Thread Aida Tefera
Hi Tristan, I'm afraid I wouldn't know whether I'm running it as superuser. I have Java version 1.8.0_73 and Scala version 2.11.7 Sent from my iPhone > On 9 Mar 2016, at 21:58, Tristan Nixon wrote: > > That's very strange. I just un-set my SPARK_HOME env param,

Re: log4j pains

2016-03-10 Thread Tristan Nixon
Hmmm… that should be right. > On Mar 10, 2016, at 11:26 AM, Ashic Mahtab wrote: > > src/main/resources/log4j.properties > > Subject: Re: log4j pains > From: st...@memeticlabs.org > Date: Thu, 10 Mar 2016 11:08:46 -0600 > CC: user@spark.apache.org > To: as...@live.com > > Where

RE: log4j pains

2016-03-10 Thread Ashic Mahtab
src/main/resources/log4j.properties Subject: Re: log4j pains From: st...@memeticlabs.org Date: Thu, 10 Mar 2016 11:08:46 -0600 CC: user@spark.apache.org To: as...@live.com Where in the jar is the log4j.properties file? On Mar 10, 2016, at 9:40 AM, Ashic Mahtab wrote:1. Fat jar

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Tim Chen
Here is an example Dockerfile; although it's a bit dated now, if you build it today it should still work: https://github.com/tnachen/spark/tree/dockerfile/mesos_docker Tim On Thu, Mar 10, 2016 at 8:06 AM, Ashish Soni wrote: > Hi Tim , > > Can you please share your

Re: Problem with union of DirectStream

2016-03-10 Thread Cody Koeninger
If you do any RDD transformation, it's going to return a different RDD than the original. The implication for casting to HasOffsetRanges is specifically called out in the docs at http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers On Thu,
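
The pattern from those docs, sketched: capture the offsets in the very first transformation, while the RDD is still the one the direct stream produced, and transform or union freely afterwards:

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    var offsetRanges = Array.empty[OffsetRange]
    directKafkaStream.transform { rdd =>
      // must run before any other transformation changes the RDD type
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.foreachRDD { rdd =>
      offsetRanges.foreach(o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
    }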

Re: log4j pains

2016-03-10 Thread Tristan Nixon
Where in the jar is the log4j.properties file? > On Mar 10, 2016, at 9:40 AM, Ashic Mahtab wrote: > > 1. Fat jar with logging dependencies included. log4j.properties in fat jar. > Spark doesn't pick up the properties file, so uses its defaults.

Problem with union of DirectStream

2016-03-10 Thread Guillermo Ortiz
I have a DirectStream and process data from Kafka, val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams1, topics1.toSet) directKafkaStream.foreachRDD { rdd => val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges When

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ashish Soni
Hi Tim, Can you please share your Dockerfiles and configuration, as it will help a lot; I am planning to publish a blog post on the same. Ashish On Thu, Mar 10, 2016 at 10:34 AM, Timothy Chen wrote: > No you don't need to install spark on each slave, we have been running

Re: Spark job for Reading time series data from Cassandra

2016-03-10 Thread Matthias Niehoff
Hi, the spark connector docs say: ( https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md ) "The number of Spark partitions(tasks) created is directly controlled by the setting spark.cassandra.input.split.size_in_mb. This number reflects the approximate amount of Cassandra

Re: Spark job for Reading time series data from Cassandra

2016-03-10 Thread Bryan Jeffrey
Prateek, I believe that one task is created per Cassandra partition. How is your data partitioned? Regards, Bryan Jeffrey On Thu, Mar 10, 2016 at 10:36 AM, Prateek . wrote: > Hi, > > > > I have a Spark Batch job for reading timeseries data from Cassandra which > has

log4j pains

2016-03-10 Thread Ashic Mahtab
Hello, I'm trying to use a custom log4j appender, with things specified in a log4j.properties file. Very little seems to work in this regard. Here's what I've tried: 1. Fat jar with logging dependencies included. log4j.properties in fat jar. Spark doesn't pick up the properties file, so uses its

Spark job for Reading time series data from Cassandra

2016-03-10 Thread Prateek .
Hi, I have a Spark Batch job for reading timeseries data from Cassandra which has 50,000 rows. JavaRDD cassandraRowsRDD = javaFunctions.cassandraTable("iotdata", "coordinate") .map(new Function() { @Override public
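
The archive truncates the Java snippet; for reference, a minimal Scala sketch of the same read with the connector (keyspace and table from the thread), whose partition count is what split.size_in_mb governs:

    import com.datastax.spark.connector._

    val rows = sc.cassandraTable("iotdata", "coordinate")
    println(s"rows: ${rows.count()}, partitions: ${rows.partitions.length}")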

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Timothy Chen
No, you don't need to install Spark on each slave; we have been running this setup in Mesosphere without any problem. At this point I think it's most likely a configuration problem, and perhaps there's a chance something is missing in the code to handle some cases. What version of spark are you guys running?

how does dimsum work for categorical data

2016-03-10 Thread naveen.marri
Hi, I'm exploring DIMSUM (Dimension Independent Matrix Square using MapReduce) for finding similarities between users in terms of the products they have purchased. I've modeled the matrix as User1, product1,product2,0,0 user2, product2,0,0,0 user3, product1,product3,product4,product2 . ...
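
In MLlib, DIMSUM is exposed as RowMatrix.columnSimilarities with a sampling threshold; note that it compares columns, so with users as rows (one-hot over products, made-up data below) it yields product-product similarities, and comparing users means transposing that layout:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.sparse(4, Seq((0, 1.0), (1, 1.0))),                    // user1: p1, p2
      Vectors.sparse(4, Seq((1, 1.0))),                              // user2: p2
      Vectors.sparse(4, Seq((0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0))) // user3: all
    ))
    val mat = new RowMatrix(rows)
    val sims = mat.columnSimilarities(0.1) // threshold > 0 turns on DIMSUM sampling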

Re: does spark need dedicated machines to run on

2016-03-10 Thread Ted Yu
bq. Started SparkUI at http://192.168.2.103:4040 bq. Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources Can you check UI ? Thanks On Thu, Mar 10, 2016 at 6:57 AM, Shams ul Haque wrote: >

Re: does spark need dedicated machines to run on

2016-03-10 Thread Shams ul Haque
Hi, *Release of Spark:* 1.6.0, I downloaded it and made a build using 'sbt/sbt assembly'. *Command for submitting your app:* bin/spark-submit --master spark://shams-machine:7077 --executor-cores 2 --class in.myapp.email.combiner.CombinerRealtime

Re: "bootstrapping" DStream state

2016-03-10 Thread Zalzberg, Idan (Agoda)
Ha! So easy, how could I miss it?! Thanks! Sent with AquaMail for Android http://www.aqua-mail.com On March 10, 2016 9:32:38 PM Todd Nist wrote: The updateStateByKey can be supplied an initialRDD to populate it with. Per code

Re: does spark need dedicated machines to run on

2016-03-10 Thread Ted Yu
Can you provide a bit more information? Release of Spark; command for submitting your app; code snippet of your app; pastebin of log. Thanks On Thu, Mar 10, 2016 at 6:32 AM, Shams ul Haque wrote: > Hi, > > I have developed a spark realtime app and started spark-standalone on

does spark need dedicated machines to run on

2016-03-10 Thread Shams ul Haque
Hi, I have developed a Spark realtime app and started spark-standalone on my laptop. But when I try to submit that app to Spark it is always in WAITING state and its cores are always zero. I have set: export SPARK_WORKER_CORES="2" export SPARK_EXECUTOR_CORES="1" in spark-env.sh, but still nothing

Re: "bootstrapping" DStream state

2016-03-10 Thread Todd Nist
The updateStateByKey can be supplied an initialRDD to populate it with. Per code ( https://github.com/apache/spark/blob/v1.4.0/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala#L435-L445 ). Provided here for your convenience. /** * Return a new "state"
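
A minimal sketch of that overload, with a made-up (String, Int) running-count state:

    import org.apache.spark.HashPartitioner

    val initialRDD = ssc.sparkContext.parallelize(Seq(("a", 1), ("b", 0)))
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))
    // pairStream: DStream[(String, Int)], hypothetical
    val stateDStream = pairStream.updateStateByKey[Int](
      updateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      initialRDD)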

Re: Spark Streaming, very slow processing and increasing scheduling delay of kafka input stream

2016-03-10 Thread Todd Nist
Hi Vinti, All of your tasks are failing based on the screen shots provided. I think a few more details would be helpful. Is this YARN or a Standalone cluster? How much overall memory is on your cluster? On each machine where workers and executors are running? Are you using the Direct

Correlation Matrix Limits

2016-03-10 Thread Sebastian Kuepers
Hello, I am planning to use from the pyspark.mllib.stat package the corr() function to compute a correlation matrix. Will this happen in a distributed fashion and does it scale up well, if you have Vectors with a length of over a million columns? Thanks, Sebastian
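
For reference, the Scala equivalent of that call: the computation is distributed over an RDD[Vector], but the result comes back as a local Matrix on the driver, so a million-column input implies on the order of 10^12 entries in driver memory and will not scale to that width:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)))
    val corrMatrix = Statistics.corr(data, "pearson") // local mllib.linalg.Matrix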

Re: Zeppelin Integration

2016-03-10 Thread ayan guha
Thanks guys for the reply. Yes, Zeppelin with Spark is a pretty compelling choice for a single user. Any pointers for using Zeppelin in a multi-user scenario? In essence, can we either (a) use Zeppelin to connect to a long-running Spark application which has some pre-cached DataFrames, or (b) can Zeppelin

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ashish Soni
You need to install Spark on each Mesos slave, and then when starting the container, set the workdir to your Spark home so that it can find the Spark classes. Ashish > On Mar 10, 2016, at 5:22 AM, Guillaume Eynard Bontemps > wrote: > > For an answer to my question see

Re: Zeppelin Integration

2016-03-10 Thread Sabarish Sasidharan
I believe you need to co-locate your Zeppelin on the same node where Spark is installed. You need to specify the SPARK HOME. The master I used was YARN. Zeppelin exposes a notebook interface. A notebook can have many paragraphs. You run the paragraphs. You can mix multiple contexts in the same

Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Guillaume Eynard Bontemps
For an answer to my question see this: http://stackoverflow.com/a/35660466?noredirect=1. But for your problem, did you define the spark.mesos.docker.home property or something like that? On Thu, Mar 10, 2016 at 04:26, Eran Chinthaka Withana wrote: > Hi > > I'm also

sql functions: row_number, percent_rank, rank, rowNumber

2016-03-10 Thread AlexModestov
Hello all, I am trying to use some SQL functions. My task is to renumber the rows in a DataFrame. I use the SQL functions but they don't work and I don't understand why. I would appreciate your help fixing this issue. Thank you! The piece of my code: "from pyspark.sql.functions import row_number, percent_rank, rank,
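
row_number, rank and percent_rank are window functions and must be applied .over() a window spec (on Spark 1.6 they also require a HiveContext); the thread is PySpark, but here is a sketch of the idea in Scala with a hypothetical ordering column:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    val w = Window.orderBy("id") // hypothetical ordering column
    val renumbered = df.withColumn("rn", row_number().over(w))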

PySpark API signature mismatch compared to Scala

2016-03-10 Thread Li Ming Tsai
Hi, Looking at 1.6.0, I see at least the following mismatches: 1. DStream mapWithState, though declared experimental, is not found in Python 2. DStream updateStateByKey is missing initialRDD and partitioner. Is the Python API falling behind? Thanks!

Zeppelin Integration

2016-03-10 Thread ayan guha
Hi All, I am writing this in order to get a fair understanding of how Zeppelin can be integrated with Spark. Our use case is to load a few tables from a DB into Spark and run some transformations. Once done, we want to expose the data through Zeppelin for analytics. I have a few questions around that to sound

Re: Streaming job delays

2016-03-10 Thread Matthias Niehoff
Hi, dynamic allocation is AFAIK not supported for streaming applications; that's maybe the reason. See also: https://mail-archives.apache.org/mod_mbox/spark-user/201510.mbox/%3CCA+AHuKkxg44WvXZGr4MVNUxioWH3o8pZZQRTaXR=m5cb-op...@mail.gmail.com%3E If you are using Spark 1.6 there should also be a

Re: Dynamic allocation doesn't work on YARN

2016-03-10 Thread Jy Chen
Thank you for your help and advice. After I added the log4j conf to expose more details, I found that Spark had sent the removing request while some containers did not receive the SIGTERM signal. Thanks. 2016-03-10 10:52 GMT+08:00 Saisai Shao : > Still I think this