Re: Creating BlockMatrix with java API

2015-09-23 Thread Sabarish Sasidharan
What I meant is that something like this would work. Yes, it's less than elegant but it works. List<Tuple2<Tuple2<Object, Object>, Matrix>> blocks = new ArrayList<Tuple2<Tuple2<Object, Object>, Matrix>>(); blocks.add( new Tuple2<Tuple2<Object, Object>, Matrix>( new Tuple2<Object, Object>(0, 0), Matrices.dense(2, 2, new double[] {0.0D, 1.1D, 2.0D, 3.1D}))); blocks.add( new Tuple2<Tuple2<Object, Object>, Matrix>( new Tu

Re: No space left on device when running graphx job

2015-09-23 Thread Andy Huang
Hi Jack, Are you writing out to disk? Or it sounds like Spark is spilling to disk (RAM filled up) and it's running out of disk space. Cheers Andy On Thu, Sep 24, 2015 at 4:29 PM, Jack Yang wrote: > Hi folk, > > > > I have an issue of graphx. (spark: 1.4.0 + 4 machines + 4G memory + 4 CPU > cor

No space left on device when running graphx job

2015-09-23 Thread Jack Yang
Hi folk, I have an issue with graphx. (spark: 1.4.0 + 4 machines + 4G memory + 4 CPU cores) Basically, I load data using the GraphLoader.edgeListFile method and then count the number of nodes using the graph.vertices.count() method. The problem is: Lost task 11972.0 in stage 6.0 (TID 54585, 192.168.70.129):

Re: Querying on multiple Hive stores using Apache Spark

2015-09-23 Thread Karthik
Any ideas or suggestions? Thanks, Karthik. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Querying-on-multiple-Hive-stores-using-Apache-Spark-tp24765p24797.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: caching DataFrames

2015-09-23 Thread Zhang, Jingyu
Thanks Hemant, I will generate a total report (dfA) with many columns from log data. After report A is done, I will generate many detail reports (dfA1-dfAi) based on subsets of the total report (dfA); those detail reports use aggregate and window functions according to different rules. Ho

Re: caching DataFrames

2015-09-23 Thread Hemant Bhanawat
Hit the send button too early... However, why would you want to cache a dataFrame that is a subset of an already cached dataFrame? If dfA is cached, and dfA1 is created by applying some transformation on dfA, actions on dfA1 will use the cache of dfA. val dfA1 = dfA.filter($"_1" > 50) // this will run
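A minimal Scala sketch of the pattern described above, assuming a spark-shell style SparkContext `sc` (the column names and data are made up):

```scala
// Cache the parent DataFrame once; derived subsets read from its cached blocks.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val dfA = sc.parallelize(1 to 100).map(i => (i, s"row$i")).toDF("_1", "_2")
dfA.cache()
dfA.count()                         // materializes the cache

val dfA1 = dfA.filter($"_1" > 50)   // subset, not cached separately
dfA1.count()                        // scans the cached data of dfA
```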

Re: caching DataFrames

2015-09-23 Thread Hemant Bhanawat
Two dataframes do not share cache storage in Spark. Hence it's immaterial how the two dataFrames are related to each other. Both of them are going to consume memory based on the data that they have. So for your A1 and B1 you would need extra memory that would be equivalent to half the memory of A

Re: Dose spark auto invoke StreamingContext.stop while receive kill signal?

2015-09-23 Thread Bin Wang
Thanks for the explanation. I've opened a PR at https://github.com/apache/spark/pull/8898 Tathagata Das wrote on Thu, Sep 24, 2015 at 2:44 AM: > Yes, since 1.4.0, it shuts down the streamingContext non-gracefully from the > shutdown hook. > You can make it shutdown gracefully in that hook by setting the SparkConf > "sp

Re: SparkR for accumulo

2015-09-23 Thread madhvi.gupta
Hi, Is there any other way to create an RRDD from a source RDD other than a text RDD? Or to use any other format of data stored in HDFS in SparkR? Also, please elaborate on the kind of step that is missing in SparkR for this. Thanks and Regards Madhvi Gupta On Thursday 24 September 2015

Re: SparkR for accumulo

2015-09-23 Thread madhvi.gupta
Ok, thanks. Thanks and Regards Madhvi Gupta On Thursday 24 September 2015 08:12 AM, Sun, Rui wrote: No. It is possible to create a helper function which can create accumulo data RDDs in Scala or Java (maybe put such code in a JAR, and add it using --jars on the command line to start SparkR to use i

Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Jonathan Kelly
AHA! I figured it out, but it required some tedious remote debugging of the Spark ApplicationMaster. (But now I understand the Spark codebase a little better than before, so I guess I'm not too put out. =P) Here's what's happening... I am setting spark.dynamicAllocation.minExecutors=1 but am not

KMeans Model fails to run

2015-09-23 Thread Soong, Eddie
Hi, Why am I getting this error, which prevents my KMeans clustering algorithm from working inside of Spark? I'm trying to run a sample Scala model found on the Databricks website on my Cloudera-Spark 1-node local VM. For completeness, the Scala program is as follows: Thx import org.apache.spark.mllib.c

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
You want to reduce the # of partitions to around the # of executors * cores. You have so many tasks/partitions that they put a lot of pressure on treeReduce in LoR. Let me know if this helps. Sincerely, DB Tsai -- Blog: https://www.
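A hedged sketch of this advice, using the executor/core counts Eugene mentions elsewhere in the thread (the helper name and defaults are made up):

```scala
import org.apache.spark.sql.DataFrame

// Shrink the training data to roughly (executors x cores) partitions before
// fitting, so treeReduce in LogisticRegression aggregates far fewer tasks.
def shrinkPartitions(training: DataFrame,
                     numExecutors: Int = 100,    // assumption: --num-executors
                     coresPerExecutor: Int = 8   // assumption: --executor-cores
                    ): DataFrame = {
  val prepared = training.repartition(numExecutors * coresPerExecutor).cache()
  prepared.count()   // materialize the cache once before training
  prepared
}
```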

RE: SparkR for accumulo

2015-09-23 Thread Sun, Rui
No. It is possible to create a helper function which can create accumulo data RDDs in Scala or Java (maybe put such code in a JAR, and add it using --jars on the command line to start SparkR to use it?) and in SparkR you can use the private functions like callJMethod to call it and the created RDD o

How to fix some WARN when submit job on spark 1.5 YARN

2015-09-23 Thread r7raul1...@163.com
1 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 2 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS 3 WARN Unable to load native-hadoop library for your platform r7raul1...@163.com

Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Jonathan Kelly
Another update that doesn't make much sense: The SparkPi example does work on yarn-cluster mode with dynamicAllocation. That is, the following command works (as well as with yarn-client mode): spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi spark-examples.jar 100 Bu

Spark ClosureCleaner or java serializer OOM when trying to grow

2015-09-23 Thread jluan
I have been stuck on this problem for the last few days: I am attempting to run random forest from MLLIB, it gets through most of it, but breaks when doing a mapPartition operation. The following stack trace is shown: : An error occurred while calling o94.trainRandomForestModel. : java.lang.OutOf

Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Jonathan Kelly
Thanks for the quick response! spark-shell is indeed using yarn-client. I forgot to mention that I also have "spark.master yarn-client" in my spark-defaults.conf file too. The working spark-shell and my non-working example application both display spark.scheduler.mode=FIFO on the Spark UI. Is tha

Fwd: Executor lost

2015-09-23 Thread Angel Angel
-- Forwarded message -- From: Angel Angel Date: Wed, Sep 23, 2015 at 12:24 PM Subject: Executor lost To: user@spark.apache.org Hello Sir/Madam, I am running the deeplearning example on spark. i have the following configuration 1 Master and 3 slaves My driver program setting is

Re: How to subtract two RDDs with same size

2015-09-23 Thread Zhiliang Zhu
Hi Sujit, It is wonderful of you! I must show my sincere appreciation for your kind help. Thank you very much! Best Regards, Zhiliang On Wednesday, September 23, 2015 10:15 PM, Sujit Pal wrote:

Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Andrew Duffy
What pool is the spark shell being put into? (You can see this through the YARN UI under scheduler) Are you certain you're starting spark-shell up on YARN? By default it uses a local spark executor, so if it "just works" then it's because it's not using dynamic allocation. On Wed, Sep 23, 2015 at

Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Jonathan Kelly
I'm running into a problem with YARN dynamicAllocation on Spark 1.5.0 after using it successfully on an identically configured cluster with Spark 1.4.1. I'm getting the dreaded warning "YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers a

Odd behavior when re-defining a val in spark-shell

2015-09-23 Thread Boris Alexeev
Hi, I am seeing odd behavior when I redefine a val in the REPL of a recent release of the spark-shell. Here is my minimal test case: val a = 1 def id(a:Int) = {a} val a = 2 a id(a) Specifically, if I run "~/spark/bin/spark-shell --master local" and enter each of these five lines o

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread Eugene Zhulenev
~3000 features, pretty sparse, I think about 200-300 non zero features in each row. We have 100 executors x 8 cores. Number of tasks is pretty big, 30k-70k, can't remember exact number. Training set is a result of pretty big join from multiple data frames, but it's cached. However as I understand S

Re: Java Heap Space Error

2015-09-23 Thread Zhang, Jingyu
Does your SQL work if it does not run a regex on strings and concatenate them? I mean, just select the data without string operations. On 24 September 2015 at 10:11, java8964 wrote: > Try to increase the partition count, that will make each partition hold less > data. > > Yong > > ---

Re: Debugging too many files open exception issue in Spark shuffle

2015-09-23 Thread DB Tsai
in ./apps/mesos-0.22.1/sbin/mesos-daemon.sh #!/usr/bin/env bash prefix=/apps/mesos-0.22.1 exec_prefix=/apps/mesos-0.22.1 deploy_dir=${prefix}/etc/mesos # Increase the default number of open file descriptors. ulimit -n 8192 Sincerely, DB Tsai -

RE: Debugging too many files open exception issue in Spark shuffle

2015-09-23 Thread java8964
That is interesting. I don't have any Mesos experience, but just want to know the reason why it does so. Yong > Date: Wed, 23 Sep 2015 15:53:54 -0700 > Subject: Debugging too many files open exception issue in Spark shuffle > From: dbt...@dbtsai.com > To: user@spark.apache.org > > Hi, > > Recen

RE: Java Heap Space Error

2015-09-23 Thread java8964
Try to increase the partition count; that will make each partition hold less data. Yong Subject: Re: Java Heap Space Error From: yu...@useinsider.com Date: Thu, 24 Sep 2015 00:32:47 +0300 CC: user@spark.apache.org To: java8...@hotmail.com Yes, it’s possible. I use S3 as data source. My external table
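For the SQL query in this thread, the relevant knob is the shuffle partition count; a hedged sketch of raising it (the value 800 is only an example):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // or reuse the existing HiveContext
sqlContext.setConf("spark.sql.shuffle.partitions", "800")
// the next aggregation (e.g. the collect_list query quoted below) now
// shuffles into 800 smaller partitions instead of the default 200
```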

caching DataFrames

2015-09-23 Thread Zhang, Jingyu
I have A and B DataFrames. A has columns a11,a12, a21,a22. B has columns b11,b12, b21,b22. I persist them in cache: 1. A.Cache(), 2. B.Cache() Then, I persist the subsets in cache later: 3. DataFrame A1 (a11,a12).cache() 4. DataFrame B1 (b11,b12).cache() 5. DataFrame AB1 (a11,a12,b11,b12).cah

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
Your code looks correct for me. How many # of features do you have in this training? How many tasks are running in the job? Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D

Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-23 Thread Sandy Ryza
Hi Anfernee, That's correct that each InputSplit will map to exactly a Spark partition. On YARN, each Spark executor maps to a single YARN container. Each executor can run multiple tasks over its lifetime, both parallel and sequentially. If you enable dynamic allocation, after the stage includi

CrossValidator speed - for loop on each parameter map?

2015-09-23 Thread julia
I’m using CrossValidator in pyspark (spark 1.4.1). I’ve seen in the class Estimator that all 'fit' are done sequentially. You can check the method _fit in CrossValidator class for the current implementation: https://spark.apache.org/docs/1.4.1/api/python/_modules/pyspark/ml/tuning.html In the sc

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread Eugene Zhulenev
It's really simple: https://gist.github.com/ezhulenev/886517723ca4a353 The same strange heap behavior we've seen even for single model, it takes ~20 gigs heap on a driver to build single model with less than 1 million rows in input data frame. On Wed, Sep 23, 2015 at 6:31 PM, DB Tsai wrote:

[POWERED BY] Please add our organization

2015-09-23 Thread barmaley
Name: Frontline Systems Inc. URL: www.solver.com Description: • We built an interface between Microsoft Excel and Apache Spark - bringing Big Data from the clusters to Excel enabling tools ranging from simple charts and Power View dashboards to add-ins for machine learning and predictive an

Debugging too many files open exception issue in Spark shuffle

2015-09-23 Thread DB Tsai
Hi, Recently, we ran into this notorious exception while doing large shuffle in mesos at Netflix. We ensure that `ulimit -n` is a very large number, but still have the issue. It turns out that mesos overrides the `ulimit -n` to a small number causing the problem. It's very non-trivial to debug (a

[POWERED BY] Please add our organization

2015-09-23 Thread Oleg Shirokikh
Name: Frontline Systems Inc. URL: www.solver.com Description: * We built an interface between Microsoft Excel and Apache Spark - bringing Big Data from the clusters to Excel enabling tools ranging from simple charts and Power View dashboards to add-ins for machine learni

Re: How to obtain the key in updateStateByKey

2015-09-23 Thread Ted Yu
def updateStateByKey[S: ClassTag]( updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)], updateFunc is given an iterator. You can access the key with _1 on the iterator. On Wed, Sep 23, 2015 at 3:01 PM, swetha wrote: > Hi, > > How to obtain the current key in updateStateBy
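A hedged sketch of that overload in use; the running-count state and partitioner choice are illustrative, not taken from the thread:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

// The update function receives (key, new values, previous state) tuples,
// so the key is directly available inside the function.
def runningCounts(pairs: DStream[(String, Int)]): DStream[(String, Long)] = {
  val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Long])]) =>
    iter.map { case (key, newValues, state) =>
      (key, state.getOrElse(0L) + newValues.sum)   // key usable here
    }
  pairs.updateStateByKey(
    updateFunc,
    new HashPartitioner(pairs.context.sparkContext.defaultParallelism),
    rememberPartitioner = true)
}
```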

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
Could you paste some of your code for diagnosis? Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Wed, Sep 23, 2015 at 3:19 PM, Eugene Zhulenev wrote:

Re: spark.mesos.coarse impacts memory performance on mesos

2015-09-23 Thread Utkarsh Sengar
I missed doing a reply-all. Tim, spark.mesos.coarse = true doesn't work and spark.mesos.coarse = false works (sorry, there was a typo in my last email, I meant "when I do "spark.mesos.coarse=false", the job works like a charm."). I get this exception with spark.mesos.coarse = true: 15/09/22 20:18

Re: Using Spark for portfolio manager app

2015-09-23 Thread ALEX K
Thuy, if you decide to go with Hbase for external storage consider using a light-weight SQL layer such as Apache Phoenix, it has a spark plugin & JDBC driver, and throughput is pretty good even for heavy market data feed (make sure to use batched com

Re: Yarn Shutting Down Spark Processing

2015-09-23 Thread Marcelo Vanzin
But that's not the complete application log. You say the streaming context is initialized, but can you show that in the logs? There's something happening that is causing the SparkContext to not be registered with the YARN backend, and that's why your application is being killed. If you can share t

RE: Yarn Shutting Down Spark Processing

2015-09-23 Thread Bryan
Marcelo, The error below is from the application logs. The spark streaming context is initialized and actively processing data when yarn claims that the context is not initialized. There are a number of errors, but they're all associated with the ssc shutting down. Regards, Bryan Jeffrey --

LogisticRegression models consumes all driver memory

2015-09-23 Thread Eugene Zhulenev
We are running Apache Spark 1.5.0 (latest code from 1.5 branch) We are running 2-3 LogisticRegression models in parallel (we'd love to run 10-20 actually), they are not really big at all, maybe 1-2 million rows in each model. Cluster itself, and all executors look good. Enough free memory and no

Re: Join over many small files

2015-09-23 Thread ayan guha
I think this can be a good case for using the sequence file format to pack many files into a few sequence files, with file name as key and content as value. Then read it as an RDD and produce tuples like you mentioned (key=fileno+id, value=value). After that, it is a simple map operation to generate the diff
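A rough Scala sketch of this approach, assuming the tiny files sit under a single HDFS directory (the paths are placeholders):

```scala
import org.apache.hadoop.io.Text

// One-off packing step: (fileName, fileContent) pairs saved as a sequence file.
sc.wholeTextFiles("hdfs:///data/tiny-files")
  .saveAsSequenceFile("hdfs:///data/packed")

// Later jobs read one packed file instead of tens of thousands of tiny ones,
// then map to the (key, value) tuples used for the join.
val packed = sc.sequenceFile("hdfs:///data/packed", classOf[Text], classOf[Text])
  .map { case (name, content) => (name.toString, content.toString) }
```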

How to obtain the key in updateStateByKey

2015-09-23 Thread swetha
Hi, How to obtain the current key in updateStateBykey ? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-obtain-the-key-in-updateStateByKey-tp24792.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Yarn Shutting Down Spark Processing

2015-09-23 Thread Marcelo Vanzin
Did you look at your application's logs (using the "yarn logs" command?). That error means your application is failing to create a SparkContext. So either you have a bug in your code, or there will be some error in the log pointing at the actual reason for the failure. On Tue, Sep 22, 2015 at 5:4

RE: Yarn Shutting Down Spark Processing

2015-09-23 Thread Bryan
Also - I double checked - we're setting the master to "yarn-cluster" -Original Message- From: "Tathagata Das" Sent: 9/23/2015 2:38 PM To: "Bryan" Cc: "user" ; "Hari Shreedharan" Subject: Re: Yarn Shutting Down Spark Processing CC;ing Hari who may have a better sense of whats going

Re: Spark as standalone or with Hadoop stack.

2015-09-23 Thread Ted Yu
HDFS on Mesos framework is still being developed. What I said previously reflected current deployment practice. Things may change in the future. On Tue, Sep 22, 2015 at 4:02 PM, Jacek Laskowski wrote: > On Tue, Sep 22, 2015 at 10:03 PM, Ted Yu wrote: > > > To my knowledge, no one runs HBase on

Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-23 Thread Anfernee Xu
Hi Spark experts, I'm coming across these terminologies and having some confusions, could you please help me understand them better? For instance I have implemented a Hadoop InputFormat to load my external data in Spark, in turn my custom InputFormat will create a bunch of InputSplit's, my questi

Re: Java Heap Space Error

2015-09-23 Thread Yusuf Can Gürkan
Yes, it’s possible. I use S3 as the data source. My external tables are partitioned. The task below is 193/200. The job has 2 stages, and it is task 193 of 200 in the 2nd stage because of sql.shuffle.partitions. How can I avoid this situation? This is my query: select userid,concat_ws(' ',collect_list(concat_ws

Join over many small files

2015-09-23 Thread Tracewski, Lukasz
Hi all, I would like you to ask for an advise on how to efficiently make a join operation in Spark with tens of thousands of tiny files. A single file has a few KB and ~50 rows. In another scenario they might have 200 KB and 2000 rows. To give you impression how they look like: File 01 ID | VA

Re: Creating BlockMatrix with java API

2015-09-23 Thread Pulasthi Supun Wickramasinghe
Hi YiZhi, Actually i was not able to try it out to see if it was working. I sent the previous reply assuming that Sabarish's solution would work :). Sorry if there was any confusion. Best Regards, Pulasthi On Wed, Sep 23, 2015 at 6:47 AM, YiZhi Liu wrote: > Hi Pulasthi, > > Are you sure this w

reduceByKeyAndWindow confusion

2015-09-23 Thread srungarapu vamsi
I create a stream from Kafka as below: val kafkaDStream = KafkaUtils.createDirectStream[String,KafkaGenericEvent,StringDecoder,KafkaGenericEventsDecoder](ssc, kafkaConf, Set(topics)) .window(Minutes(WINDOW_DURATION),Minutes(SLIDER_DURATION)) I have a map ("intToStringList") which is a Map[Int

Re: How to turn on basic authentication for the Spark Web

2015-09-23 Thread Deenar Toraskar
Check this out http://lambda.fortytools.com/post/26977061125/servlet-filter-for-http-basic-auth or https://gist.github.com/neolitec/8953607 for examples of filters implementing basic authentication. Implement one of these and set them in the spark.ui.filters property. Deenar On 23 September 2015
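A bare-bones sketch of such a filter in Scala (credentials, package name, and realm are placeholders; commons-codec is assumed to be on the classpath, as it normally is with Spark):

```scala
package com.example

import javax.servlet._
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}
import org.apache.commons.codec.binary.Base64

class BasicAuthFilter extends Filter {
  private val expected = "admin:secret"   // placeholder credentials

  override def init(config: FilterConfig): Unit = {}
  override def destroy(): Unit = {}

  override def doFilter(req: ServletRequest, res: ServletResponse,
                        chain: FilterChain): Unit = {
    val request  = req.asInstanceOf[HttpServletRequest]
    val response = res.asInstanceOf[HttpServletResponse]
    val ok = Option(request.getHeader("Authorization")).exists { h =>
      h.startsWith("Basic ") &&
        new String(Base64.decodeBase64(h.substring(6))) == expected
    }
    if (ok) chain.doFilter(req, res)
    else {
      response.setHeader("WWW-Authenticate", "Basic realm=\"Spark UI\"")
      response.sendError(HttpServletResponse.SC_UNAUTHORIZED)
    }
  }
}
// spark-defaults.conf, with the jar on the driver classpath:
//   spark.ui.filters  com.example.BasicAuthFilter
```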

Re: How to turn on basic authentication for the Spark Web

2015-09-23 Thread Rafal Grzymkowski
I know this Spark Security page, but the information there is not sufficient. Has anyone made it work? Those basic servlet filters for ui.filters

Re: JdbcRDD Constructor

2015-09-23 Thread Deenar Toraskar
Satish Can you post the SQL query you are using? The SQL query must have 2 placeholders and both of them should be an inclusive range (<= and >=).. e.g. select title, author from books where ? <= id and id <= ? Are you doing this? Deenar On 23 September 2015 at 20:18, Deenar Toraskar < deenar
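A hedged Scala sketch of that pattern with JdbcRDD (the connection details and books table are placeholders, not from this thread):

```scala
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val books = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "pass"),
  "select title, author from books where ? <= id and id <= ?",  // two placeholders
  lowerBound = 1,
  upperBound = 100,
  numPartitions = 4,
  mapRow = rs => (rs.getString("title"), rs.getString("author")))

println(books.count())   // counts rows across the full inclusive id range
```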

Re: How to turn on basic authentication for the Spark Web

2015-09-23 Thread Deenar Toraskar
Rafal Check this out https://spark.apache.org/docs/latest/security.html Regards Deenar On 23 September 2015 at 19:13, Rafal Grzymkowski wrote: > Hi, > > I want to enable basic Http authentication for the spark web UI (without > recompilation need for Spark). > I see there is 'spark.ui.filters'

Re: Cosine LSH Join

2015-09-23 Thread Nick Pentreath
Not sure of performance but DIMSUM only handles "column similarity" and scales to maybe 100k columns. Item-item similarity e.g. in MF models often requires millions of items (would be millions of columns in DIMSUM). So one needs an LSH type approach or brute force via Cartesian product (but

Provide sampling ratio while loading json in spark version > 1.4.0

2015-09-23 Thread Udit Mehta
Hi, In earlier versions of spark(< 1.4.0), we were able to specify the sampling ratio while using *sqlContext.JsonFile* or *sqlContext.JsonRDD* so that we dont inspect each and every element while inferring the schema. I see that the use of these methods is deprecated in the newer spark version an
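In 1.4+ the sampling ratio can, I believe, still be passed as a data source option on the DataFrameReader; a hedged sketch (the path is a placeholder):

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Inspect only ~10% of the records while inferring the JSON schema.
val df = sqlContext.read
  .option("samplingRatio", "0.1")
  .json("hdfs:///data/events.json")
```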

Re: Dose spark auto invoke StreamingContext.stop while receive kill signal?

2015-09-23 Thread Tathagata Das
Yes, since 1.4.0, it shuts down the StreamingContext non-gracefully from a shutdown hook. You can make it shut down gracefully in that hook by setting the SparkConf "spark.streaming.stopGracefullyOnShutdown" to "true". Note to self, document this in the programming guide. On Wed, Sep 23, 2015 at 3:33
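A minimal sketch of wiring that setting into a streaming application (the app name and batch interval are examples):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("graceful-shutdown-example")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
val ssc = new StreamingContext(conf, Seconds(10))
// on kill, the built-in shutdown hook now stops the context gracefully
```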

Re: KafkaProducer using Cassandra as source

2015-09-23 Thread Todd Nist
Hi Kali, If you do not mind sending JSON, you could do something like this, using json4s: val rows = p.collect() map ( row => TestTable(row.getString(0), row.getString(1)) ) val json = parse(write(rows)) producer.send(new KeyedMessage[String, String]("trade", writePretty(json))) // or for eac

Re: Yarn Shutting Down Spark Processing

2015-09-23 Thread Tathagata Das
CC;ing Hari who may have a better sense of whats going on. -- Forwarded message -- From: Bryan Date: Wed, Sep 23, 2015 at 3:43 AM Subject: RE: Yarn Shutting Down Spark Processing To: Tathagata Das Cc: user Tathagata, Simple batch jobs do work. The cluster has a good set of re

Re: Why RDDs are being dropped by Executors?

2015-09-23 Thread Tathagata Das
There could be multiple reasons for caching stopping at 90%: 1. Not enough aggregate space in the cluster - increase cluster memory. 2. Data is skewed among executors, so one executor tries to cache too much while others are idle - repartition the data using RDD.repartition to force an even distribution. The Storage

Re: Cosine LSH Join

2015-09-23 Thread Charlie Hack
This is great! Pretty sure I have a use for it involving entity resolution of text records. How does this compare to the DIMSUM similarity join implementation in MLlib performance wise, out of curiosity? Thanks, Charlie On Wednesday, Sep 23, 2015 at 09:25, Nick

How to turn on basic authentication for the Spark Web

2015-09-23 Thread Rafal Grzymkowski
Hi, I want to enable basic Http authentication for the spark web UI (without recompilation need for Spark). I see there is 'spark.ui.filters' option but don't know how to use it. I found possibility to use kerberos param but it's not an option for me. What should I set there to use secret token b

Re: WAL on S3

2015-09-23 Thread Michal Čizmazia
Thanks Steve! FYI: S3 now supports GET-after-PUT consistency for new objects in all regions, including US Standard https://aws.amazon.com/about-aws/whats-new/2015/08/amazon-s3-introduces-new-usability-enhancements/

create table in hive from spark-sql

2015-09-23 Thread Mohit Singh
Probably a noob question. But I am trying to create a hive table using spark-sql. Here is what I am trying to do: hc = HiveContext(sc) hdf = hc.parquetFile(output_path) data_types = hdf.dtypes schema = "(" + " ,".join(map(lambda x: x[0] + " " + x[1], data_types)) +")" hc.sql(" CREATE TABLE IF

Re: WAL on S3

2015-09-23 Thread Steve Loughran
On 23 Sep 2015, at 14:56, Michal Čizmazia wrote: To get around the fact that flush does not work in S3, my custom WAL implementation stores a separate S3 object per each WriteAheadLog.write call. Do you see any gotchas with this approach? nothing obvious. the blo

Re: Calling a method parallel

2015-09-23 Thread Robineast
The following should give you what you need: val results = sc.makeRDD(1 to n).map(X(_)).collect This should return the results as an array. _ Robin East Spark GraphX in Action - Michael Malak and Robin East Manning Publications http://manning.com/books/spark-grap

Re: KafkaProducer using Cassandra as source

2015-09-23 Thread kali.tumm...@gmail.com
Guys sorry I figured it out. val x=p.collect().mkString("\n").replace("[","").replace("]","").replace(",","~") Full Code:- package com.examples /** * Created by kalit_000 on 22/09/2015. */ import kafka.producer.KeyedMessage import kafka.producer.Producer import kafka.producer.ProducerConfig

Re: How to turn off Jetty Http stack errors on Spark web

2015-09-23 Thread Rafal Grzymkowski
Yes, I've seen it, but there are no web.xml and error.jsp files in the binary installation of Spark. To apply this solution I should probably take the Spark sources, then create the missing files, and then recompile Spark. Right? I am looking for a solution to turn off error details without recompilation. /MyC

Re: How to get RDD from PairRDD in Java

2015-09-23 Thread Ankur Srivastava
PairRdd.values is what you need. Ankur On Tue, Sep 22, 2015, 11:25 PM Zhang, Jingyu wrote: > Hi All, > > I want to extract the "value" RDD from PairRDD in Java > > Please let me know how can I get it easily. > > Thanks > > Jingyu > > > This message and its attachments may contain legally privi

Re: How to control spark.sql.shuffle.partitions per query

2015-09-23 Thread Ted Yu
Please take a look at the following for example: sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2Suites.scala Search for spark.sql.shuffle.partitions and SQLConf.SHUFFLE_PARTITIONS.key FYI On Wed, Sep 23, 2015 at 12:42 AM, tridib wrote: > I am havin
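A small sketch of the per-query pattern that test exercises, assuming the spark-shell's sqlContext (table names and counts are examples):

```scala
// Each SET takes effect for the queries issued afterwards on this context.
sqlContext.sql("SET spark.sql.shuffle.partitions=10")
val smallAgg = sqlContext.sql("select key, count(*) from small_table group by key")

sqlContext.sql("SET spark.sql.shuffle.partitions=400")
val bigAgg = sqlContext.sql("select key, count(*) from big_table group by key")
```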

Re: unsubscribe

2015-09-23 Thread Richard Hillegas
Hi Ntale, To unsubscribe from the user list, please send a message to user-unsubscr...@spark.apache.org as described here: http://spark.apache.org/community.html#mailing-lists. Thanks, -Rick Ntale Lukama wrote on 09/23/2015 04:34:48 AM: > From: Ntale Lukama > To: user > Date: 09/23/2015 04:

Re: Checkpoint files are saved before stream is saved to file (rdd.toDF().write ...)?

2015-09-23 Thread Cody Koeninger
TD can correct me on this, but I believe checkpointing is done after a set of jobs is submitted, not after they are completed. If you fail while processing the jobs, starting over from that checkpoint should put you in the correct state. In any case, are you actually observing a loss of messages

Re: How to subtract two RDDs with same size

2015-09-23 Thread Sujit Pal
Hi Zhiliang, How about doing something like this? val rdd3 = rdd1.zip(rdd2).map(p => p._1.zip(p._2).map(z => z._1 - z._2)) The first zip will join the two RDDs and produce an RDD of (Array[Float], Array[Float]) pairs. On each pair, we zip the two Array[Float] components together to form an A

Re: How to turn off Jetty Http stack errors on Spark web

2015-09-23 Thread Ted Yu
Have you read this ? http://stackoverflow.com/questions/2246074/how-do-i-hide-stack-traces-in-the-browser-using-jetty On Wed, Sep 23, 2015 at 6:56 AM, Rafal Grzymkowski wrote: > Hi, > > Is it possible to disable Jetty stack trace with errors on Spark > master:8080 ? > When I trigger Http server

Re: K Means Explanation

2015-09-23 Thread Sabarish Sasidharan
You can't obtain that from the model. But you can always ask the model to predict the cluster center for your vectors by calling predict(). Regards Sab On Wed, Sep 23, 2015 at 7:24 PM, Tapan Sharma wrote: > Hi All, > > In the KMeans example provided under mllib, it traverse the outcome of > KMe
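A short sketch of calling predict() on a trained model (the toy data and k are made up); predict() returns the index of the nearest cluster center, and the centers themselves are in clusterCenters:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

val model = KMeans.train(points, 2, 20)      // k = 2, 20 iterations
model.clusterCenters.foreach(println)        // the cluster centers
val cluster = model.predict(Vectors.dense(0.05, 0.05))   // nearest center's index
```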

Re: Calling a method parallel

2015-09-23 Thread Sujit Pal
Hi Tapan, Perhaps this may work? It takes a range of 0..100 and creates an RDD out of them, then calls X(i) on each. The X(i) should be executed on the workers in parallel. Scala: val results = sc.parallelize(0 until 100).map(idx => X(idx)) Python: results = sc.parallelize(range(100)).map(lambda

Re: unsubscribe

2015-09-23 Thread Akhil Das
To unsubscribe, you need to send an email to user-unsubscr...@spark.apache.org as described here http://spark.apache.org/community.html Thanks Best Regards On Wed, Sep 23, 2015 at 1:23 AM, Stuart Layton wrote: > > > -- > Stuart Layton >

How to turn off Jetty Http stack errors on Spark web

2015-09-23 Thread Rafal Grzymkowski
Hi, Is it possible to disable the Jetty stack trace with errors on Spark master:8080? When I trigger an HTTP server error 500, anyone can read the details. I tried the options available in log4j.properties but it doesn't help. Any hint? Thank you for the answer MyCo ---

Re: WAL on S3

2015-09-23 Thread Michal Čizmazia
To get around the fact that flush does not work in S3, my custom WAL implementation stores a separate S3 object per each WriteAheadLog.write call. Do you see any gotchas with this approach? On 23 September 2015 at 02:10, Tathagata Das wrote: > Responses inline. > > > On Tue, Sep 22, 2015 at 8

Re: Has anyone used the Twitter API for location filtering?

2015-09-23 Thread Akhil Das
I just tried it and very few tweets have the .getPlace and .getGeoLocation data available. [image: Inline image 1] I guess this is more of an issue with the twitter api. Thanks Best Regards On Tue, Sep 22, 2015 at 11:35 PM, Jo Sunad wrote: > Thanks Akhil, but I can't seem to get any tw

K Means Explanation

2015-09-23 Thread Tapan Sharma
Hi All, In the KMeans example provided under mllib, it traverse the outcome of KMeansModel to know the cluster centers like this: KMeansModel model = KMeans.train(points.rdd(), k, iterations, runs, KMeans.K_MEANS_PARALLEL()); System.out.println("Cluster centers:"); for (Vector center : m

Calling a method parallel

2015-09-23 Thread Tapan Sharma
Hi All, I want to call a method X(int i) from my Spark program for different values of i, i.e. X(1), X(2), ..., X(n). Each call returns one object. Currently I am doing this sequentially. Is there any way to run these in parallel and get back the list of objects? Sorry for this basic

Re: SparkContext declared as object variable

2015-09-23 Thread Akhil Das
Yes of course it works. [image: Inline image 1] Thanks Best Regards On Tue, Sep 22, 2015 at 4:53 PM, Priya Ch wrote: > Parallelzing some collection (array of strings). Infact in our product we > are reading data from kafka using KafkaUtils.createStream and applying some > transformations. > >

Re: Why RDDs are being dropped by Executors?

2015-09-23 Thread Uthayan Suthakar
Thank you Tathagata for your response. It makes sense to use MEMORY_AND_DISK. But sometimes when I start the job it does not cache everything at the start; it only caches 90%. The LRU scheme should only take effect after a while when the data is not in use, so why is it failing to cache the data at the

Re: Cosine LSH Join

2015-09-23 Thread Nick Pentreath
Looks interesting - I've been trying out a few of the ANN / LSH packages on spark-packages.org and elsewhere e.g. http://spark-packages.org/package/tdebatty/spark-knn-graphs and https://github.com/marufaytekin/lsh-spark How does this compare? Perhaps you could put it up on spark-packages to get v

Cosine LSH Join

2015-09-23 Thread Demir
We've just open sourced a LSH implementation on Spark. We're using this internally in order to find topK neighbors after a matrix factorization. We hope that this might be of use for others: https://github.com/soundcloud/cosine-lsh-join-spark For those wondering: lsh is a technique to quickly fi

Re: JdbcRDD Constructor

2015-09-23 Thread satish chandra j
HI, Could anybody provide input if they have come across a similar issue? @Rishitesh Could you provide any sample code for using JdbcRDDSuite? Regards, Satish Chandra On Wed, Sep 23, 2015 at 5:14 PM, Rishitesh Mishra wrote: > I am using Spark 1.5. I always get count = 100, irrespective of num >

Re: spark on mesos gets killed by cgroups for too much memory

2015-09-23 Thread Dick Davies
I haven't seen that much memory overhead, I think my default is 512Mb (just a small test stack) on spark 1.4.x and i can run simple monte carlo simulations without the 'spike' of RAM usage when they deploy. I'd assume something you're using is grabbing a lot of VM up front - one option you might w

Re: Py4j issue with Python Kafka Module

2015-09-23 Thread ayan guha
Thanks guys. On Wed, Sep 23, 2015 at 3:54 PM, Tathagata Das wrote: > SPARK_CLASSPATH is I believe deprecated right now. So I am not surprised > that there is some difference in the code paths. > > On Tue, Sep 22, 2015 at 9:45 PM, Saisai Shao > wrote: > >> I think it is something related to clas

RE: Why is 1 executor overworked and other sit idle?

2015-09-23 Thread Richard Eggert
Reading from Cassandra and mapping to CSV are likely getting divided among executors, but I think reading from Cassandra is relatively cheap, and mapping to CSV is trivial, but coalescing to a single partition is fairly expensive and funnels the processing to a single executor, and writing out t
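A hedged sketch of the trade-off (paths and toy data are made up): plain coalesce(1) pulls the whole upstream computation into one task, whereas writing one part file per partition, or coalescing with shuffle = true, keeps the read and the CSV mapping parallel:

```scala
val rows = sc.parallelize(Seq(Seq("a", "1"), Seq("b", "2")))  // stand-in for the Cassandra read
val csvLines = rows.map(_.mkString(","))

// keeps every executor busy; one part file per partition
csvLines.saveAsTextFile("hdfs:///out/csv-parts")

// if a single file is required, only the final write is funneled to one task
csvLines.coalesce(1, shuffle = true).saveAsTextFile("hdfs:///out/csv-single")
```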

Re: JdbcRDD Constructor

2015-09-23 Thread Rishitesh Mishra
I am using Spark 1.5. I always get count = 100, irrespective of num partitions. On Wed, Sep 23, 2015 at 5:00 PM, satish chandra j wrote: > HI, > Currently using Spark 1.2.2, could you please let me know correct results > output count which you got it by using JdbcRDDSuite > > Regards, > Satish C

unsubscribe

2015-09-23 Thread Ntale Lukama

Re: JdbcRDD Constructor

2015-09-23 Thread satish chandra j
HI, Currently using Spark 1.2.2, could you please let me know the correct output count which you got by using JdbcRDDSuite Regards, Satish Chandra On Wed, Sep 23, 2015 at 4:02 PM, Rishitesh Mishra wrote: > Which version of Spark you are using ?? I can get correct results using > JdbcRDD

Can DataFrames with different schema be joined efficiently

2015-09-23 Thread MrJew
Hello, I'm using spark streaming to handle quite big data flow. I'm solving a problem where we are inferring the type from the data ( we need more specific data types than what JSON provides ). And quite often there is a small difference between the schemas that we get. Saving to parquet files an

RE: Yarn Shutting Down Spark Processing

2015-09-23 Thread Bryan
Tathagata, Simple batch jobs do work. The cluster has a good set of resources and a limited input volume on the given Kafka topic. The job works on the small 3-node standalone-configured cluster I have setup for test. Regards, Bryan Jeffrey -Original Message- From: "Tathagata Das" S

Dose spark auto invoke StreamingContext.stop while receive kill signal?

2015-09-23 Thread Bin Wang
I'd like the spark application to be stopped gracefully while received kill signal, so I add these code: sys.ShutdownHookThread { println("Gracefully stopping Spark Streaming Application") ssc.stop(stopSparkContext = true, stopGracefully = true) println("Application stopped")

Updation of a graph based on changed input

2015-09-23 Thread aparasur
Hi, I am fairly new to Spark GraphX. I am using graphx to create a graph derived from data size in the range of 500GB. The inputs used to create this large graph comes from a set of files with predefined space separated constructs. I also understand that each time, the graph will be constructed and
