RE: [SparkStreaming] NPE in DStreamCheckPointData.scala:125

2015-06-17 Thread Haopu Wang
Can someone help? Thank you! From: Haopu Wang Sent: Monday, June 15, 2015 3:36 PM To: user; d...@spark.apache.org Subject: [SparkStreaming] NPE in DStreamCheckPointData.scala:125 I use the attached program to test checkpoint. It's quite simple. When I run

Re: Matrix Multiplication and mllib.recommendation

2015-06-17 Thread Sabarish Sasidharan
Nick is right. I too have implemented this way and it works just fine. In my case, there can be even more products. You simply broadcast blocks of products to userFeatures.mapPartitions() and BLAS multiply in there to get recommendations. In my case 10K products form one block. Note that you would
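
A minimal sketch of the broadcast-block pattern described above; the names (userFeatures, productBlock) and the plain dot product are illustrative stand-ins, and stacking the block into a matrix for a BLAS multiply would be the faster variant mentioned in the thread:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Score one broadcast block of product factors against every user's factors
// and keep the top-k products per user.
def scoreBlock(
    sc: SparkContext,
    userFeatures: RDD[(Int, Array[Double])],
    productBlock: Array[(Int, Array[Double])],
    topK: Int): RDD[(Int, Seq[(Int, Double)])] = {

  val bcBlock = sc.broadcast(productBlock)

  userFeatures.mapPartitions { users =>
    val products = bcBlock.value
    users.map { case (userId, uVec) =>
      val scored = products.map { case (productId, pVec) =>
        // plain dot product; a BLAS multiply over a stacked matrix is faster
        var dot = 0.0
        var i = 0
        while (i < uVec.length) { dot += uVec(i) * pVec(i); i += 1 }
        (productId, dot)
      }
      (userId, scored.sortBy(-_._2).take(topK).toSeq)
    }
  }
}
```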

Re: Does MLLib has attribute importance?

2015-06-17 Thread Ruslan Dautkhanov
Thank you Xiangrui. Oracle's attribute importance mining function has a target variable. "Attribute importance is a supervised function that ranks attributes according to their significance in predicting a target." MLlib's ChiSqSelector does not have a target variable. -- Ruslan Dautkhanov

RE: Matrix Multiplication and mllib.recommendation

2015-06-17 Thread Nick Pentreath
One issue is that you broadcast the product vectors and then do a dot product one-by-one with the user vector. You should try forming a matrix of the item vectors and doing the dot product as a matrix-vector multiply which will make things a lot faster. Another optimisation that is avalai
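
A rough sketch of the matrix-vector idea, assuming Breeze for the local linear algebra: item factors stacked row-wise into a matrix, so one matrix-vector multiply scores every item for a user instead of one dot product per item. All names here are illustrative.

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

// itemFactors(i) is the factor vector of itemIds(i), each of length userVec.length
def topKItems(itemIds: Array[Int],
              itemFactors: Array[Array[Double]],
              userVec: Array[Double],
              k: Int): Array[(Int, Double)] = {
  val rank = userVec.length
  // stack item factors row-wise: one row per item
  val itemMatrix = DenseMatrix.tabulate(itemFactors.length, rank)((i, j) => itemFactors(i)(j))
  // one matrix-vector multiply scores all items at once
  val scores = itemMatrix * DenseVector(userVec)
  itemIds.zip(scores.toArray).sortBy(-_._2).take(k)
}
```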

Pyspark RDD search

2015-06-17 Thread bhavyateja
I have an RDD which is a list of lists and another RDD which is a list of pairs. There are no duplicates in the inner lists of the first RDD and no duplicates in the pairs of the second RDD. I am trying to check if any pair of the second RDD is present in any list of -- View this message in context: http://apache-spark

Pyspark combination

2015-06-17 Thread bhavyateja
I have an RDD with more than 1000 elements. I have to form the combinations of elements. I tried to use the Cartesian transformation and then filtering them, but it fails with an EOF error. Is there any other way to do the same using partitions? I am using PySpark. -- View this message in context:

RE: Matrix Multiplication and mllib.recommendation

2015-06-17 Thread Ganelin, Ilya
I actually talk about this exact thing in a blog post here: http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/. Keep in mind, you're actually doing a ton of math. Even with proper caching and use of broadcast variables this will t

RE: Where are my log4j or exception output on EMR?

2015-06-17 Thread Bozeman, Christopher
Sean, Spark on YARN (https://spark.apache.org/docs/latest/running-on-yarn.html) follows the logging construct of YARN. If you are using cluster deployment mode on yarn (master=yarn-cluster) then the logging performed in the driver (your code) would be picked up by YARN’s logs in the Applicatio

Re: Why does driver transfer application jar to executors?

2015-06-17 Thread Jeff Zhang
TaskDescription only serializes the jar path, not the jar content. Multiple tasks can run on the same executor. The executor checks whether the jar has already been fetched each time a task is launched; if so, it won't fetch it again. Serializing only the jar path prevents serializing the jar multiple times wh

Matrix Multiplication and mllib.recommendation

2015-06-17 Thread afarahat
Hello; I am trying to get predictions after running the ALS model. The model works fine. For the prediction/recommendation, I have about 30,000 products and 90 million users. When I try the predict-all it fails. I have been trying to formulate the problem as a matrix multiplication where I fi

Re: Iterative Programming by keeping data across micro-batches in spark-streaming?

2015-06-17 Thread Nipun Arora
Hi Silvio, Thanks for your response. I should clarify. I would like to do updates on a structure iteratively. I am not sure if updateStateByKey meets my criteria. In the current situation, I can run some map reduce tasks and generate a JavaPairDStream, after this my algorithm is necessarily seque

Re:

2015-06-17 Thread Silvio Fiorito
Depending on your requirements, you can look at the updateStateByKey API From: Nipun Arora Date: Wednesday, June 17, 2015 at 10:48 PM To: "user@spark.apache.org" Subject: Hi, Is there anyway in spark streaming to keep data across multiple micro-batches? Like in a

Re: Iterative Programming by keeping data across micro-batches in spark-streaming?

2015-06-17 Thread Silvio Fiorito
Hi, just answered in your other thread as well... Depending on your requirements, you can look at the updateStateByKey API From: Nipun Arora Date: Wednesday, June 17, 2015 at 10:51 PM To: "user@spark.apache.org" Subject: Iterative Programming by keeping data across m
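
A minimal sketch of the updateStateByKey API mentioned above, keeping a running count per key across micro-batches; the socket source, batch interval, and checkpoint path are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stateful-counts")
val ssc = new StreamingContext(conf, Seconds(3))
ssc.checkpoint("/tmp/streaming-checkpoints")   // required for stateful operations

val pairs = ssc.socketTextStream("localhost", 9999).map(word => (word, 1))

// State carried across batches: each batch's new values are folded into it.
val runningCounts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + newValues.sum)
}

runningCounts.print()
ssc.start()
ssc.awaitTermination()
```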

Iterative Programming by keeping data across micro-batches in spark-streaming?

2015-06-17 Thread Nipun Arora
Hi, Is there anyway in spark streaming to keep data across multiple micro-batches? Like in a HashMap or something? Can anyone make suggestions on how to keep data across iterations where each iteration is an RDD being processed in JavaDStream? This is especially the case when I am trying to updat

[no subject]

2015-06-17 Thread Nipun Arora
Hi, Is there anyway in spark streaming to keep data across multiple micro-batches? Like in a HashMap or something? Can anyone make suggestions on how to keep data across iterations where each iteration is an RDD being processed in JavaDStream? This is especially the case when I am trying to updat

Why does driver transfer application jar to executors?

2015-06-17 Thread Shiyao Ma
Hi, Looking at my executor logs, it appears the submitted application jar is transmitted to each executor. Why does Spark do this? To my understanding, the tasks to be run are already serialized with TaskDescription. Regards. - T

Issue with PySpark UDF on a column of Vectors

2015-06-17 Thread Colin Alstad
I am having trouble using a UDF on a column of Vectors in PySpark which can be illustrated here: from pyspark import SparkContext from pyspark.sql import Row from pyspark.sql.types import DoubleType from pyspark.sql.functions import udf from pyspark.mllib.linalg import Vectors FeatureRow = Row('i

Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-17 Thread Guru Medasani
Hi Elkhan, There are a couple of ways to do this. 1) Spark-jobserver is a popular web server that is used to submit Spark jobs. https://github.com/spark-jobserver/spark-jobserver 2) The spark-submit script sets the classpath for the job. Bypassing

Re: spark-sql from CLI --->EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-17 Thread Sanjay Subramanian
ok solved. Looks like breathing the spark-summit SFO air for 3 days helped a lot! Piping the 7 million records to local disk still runs out of memory, so I piped the results into another Hive table. I can live with that :-)  /opt/cloudera/parcels/CDH/lib/spark/bin/spark-sql -e "use aers; create

understanding on the "waiting batches" and "scheduling delay" in Streaming UI

2015-06-17 Thread Mike Fang
Hi, I have a spark streaming program running for ~ 25hrs. When I check the Streaming UI tab, I find the “Waiting batches” count is 144, but the “scheduling delay” is 0. I am a bit confused. If the “waiting batches” count is 144, that means many batches are waiting in the queue to be processed? If this is

Spark SQL DATE_ADD function - Spark 1.3.1 & 1.4.0

2015-06-17 Thread Nathan McCarthy
Hi guys, Running with a parquet backed table in hive ‘dim_promo_date_curr_p' which has the following data; scala> sqlContext.sql("select * from pz.dim_promo_date_curr_p").show(3) 15/06/18 00:53:21 INFO ParseDriver: Parsing command: select * from pz.dim_promo_date_curr_p 15/06/18 00:53:21 INFO P

Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-17 Thread Elkhan Dadashov
This is not an independent programmatic way of running a Spark job on a Yarn cluster. That example demonstrates running in *Yarn-client* mode and is also dependent on Jetty. Users writing Spark programs do not want to depend on that. I found this SparkLauncher class introduced in Spark 1.4 version
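
A minimal sketch of the SparkLauncher API (org.apache.spark.launcher, new in Spark 1.4) referenced above; the paths and class name are placeholders:

```scala
import org.apache.spark.launcher.SparkLauncher

// Submit an application to YARN programmatically, without shelling out to spark-submit.
val app = new SparkLauncher()
  .setSparkHome("/opt/spark")                 // local Spark distribution
  .setAppResource("/path/to/my-app.jar")      // application jar
  .setMainClass("com.example.MyApp")
  .setMaster("yarn-cluster")
  .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
  .launch()                                   // returns a java.lang.Process

app.waitFor()
```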

Re: Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0

2015-06-17 Thread Justin Yip
Done. https://issues.apache.org/jira/browse/SPARK-8420 Justin On Wed, Jun 17, 2015 at 4:06 PM, Xiangrui Meng wrote: > That sounds like a bug. Could you create a JIRA and ping Yin Huai > (cc'ed). -Xiangrui > > On Wed, May 27, 2015 at 12:57 AM, Justin Yip > wrote: > > Hello, > > > > I am trying

Union of many RDDs taking a long time

2015-06-17 Thread Matt Forbes
I have multiple input paths which each contain data that need to be mapped in a slightly different way into a common data structure. My approach boils down to: RDD rdd = null; for (Configuration conf : configurations) { RDD nextRdd = loadFromConfiguration(conf); rdd = (rdd == null) ? nextRdd :
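
One common fix, sketched below: pass the whole list to SparkContext.union once instead of folding pairwise unions, which builds a deep chain of UnionRDDs and a long lineage. loadFromConfiguration stands in for the poster's own loader.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def loadAll[T: ClassTag](
    sc: SparkContext,
    configurations: Seq[Configuration],
    loadFromConfiguration: Configuration => RDD[T]): RDD[T] = {
  val parts: Seq[RDD[T]] = configurations.map(loadFromConfiguration)
  sc.union(parts)   // one flat UnionRDD instead of a nested chain
}
```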

Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-17 Thread Corey Nolet
An example of being able to do this is provided in the Spark Jetty Server project [1] [1] https://github.com/calrissian/spark-jetty-server On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov wrote: > Hi all, > > Is there any way running Spark job in programmatic way on Yarn cluster > without using

Re: [SparkScore] Performance portal for Apache Spark

2015-06-17 Thread Sandy Ryza
This looks really awesome. On Tue, Jun 16, 2015 at 10:27 AM, Huang, Jie wrote: > Hi All > > We are happy to announce the Performance Portal for Apache Spark > http://01org.github.io/sparkscore/ ! > > The Performance Portal for Apache Spark provides performance data on the > Spark upstream to the com

Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-17 Thread Elkhan Dadashov
Hi all, Is there any way of running a Spark job programmatically on a Yarn cluster without using the spark-submit script? I cannot include Spark jars in my Java application (due to dependency conflicts and other reasons), so I'll be shipping the Spark assembly uber jar (spark-assembly-1.3.1-hadoop2.3.0.jar) t

Where are my log4j or exception output on EMR?

2015-06-17 Thread Sean Bollin
I'm running Spark on Amazon EMR using their install bootstrap. My Scala code had println("message here") statements - where can I find the output of these statements? I changed my code to use log4j - my log.info and log.error output is nowhere to be found. I've checked /mnt/var/log/hadoop/step

Re: Implementing top() using treeReduce()

2015-06-17 Thread DB Tsai
all of them. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Wed, Jun 17, 2015 at 5:15 PM, Raghav Shankar wrote: > So, I would add the assembly jar to the just the master or would I have to > add it to all th

Re: Implementing top() using treeReduce()

2015-06-17 Thread Raghav Shankar
So, I would add the assembly jar to just the master, or would I have to add it to all the slaves/workers too? Thanks, Raghav > On Jun 17, 2015, at 5:13 PM, DB Tsai wrote: > > You need to build the spark assembly with your modification and deploy > into cluster. > > Sincerely, > > DB Tsai

Re: Implementing top() using treeReduce()

2015-06-17 Thread DB Tsai
You need to build the spark assembly with your modification and deploy into cluster. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Wed, Jun 17, 2015 at 5:11 PM, Raghav Shankar wrote: > I’ve implemented this

Re: Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-17 Thread Nitin kak
With Sentry, only the hive user has permission for read/write/execute on the subdirectories of the warehouse. All users get translated to "hive" when interacting with hiveserver2. But I think HiveContext is bypassing hiveserver2. On Wednesday, June 17, 2015, ayan guha wrote: > Try to grant read

Re: Implementing top() using treeReduce()

2015-06-17 Thread Raghav Shankar
I’ve implemented this in the suggested manner. When I build Spark and attach the new spark-core jar to my eclipse project, I am able to use the new method. In order to conduct the experiments I need to launch my app on a cluster. I am using EC2. When I setup my master and slaves using the EC2 se

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-17 Thread Xiangrui Meng
You can try hashing to control the feature dimension. MLlib's k-means implementation can handle sparse data efficiently if the number of features is not huge. -Xiangrui On Tue, Jun 16, 2015 at 2:44 PM, Rex X wrote: > Hi Sujit, > > That's a good point. But 1-hot encoding will make our data changin
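
A small sketch of the suggestion above, assuming the records have already been tokenized into feature strings: hash them into a fixed dimension with HashingTF, then run MLlib's k-means on the resulting sparse vectors.

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.rdd.RDD

def clusterRecords(records: RDD[Seq[String]], k: Int): KMeansModel = {
  val hashingTF = new HashingTF(1 << 18)              // fixed feature dimension
  val vectors = hashingTF.transform(records).cache()  // sparse vectors
  KMeans.train(vectors, k, 20)                        // 20 iterations
}
```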

Re: *Metrics API is odd in MLLib

2015-06-17 Thread Xiangrui Meng
LabeledPoint was used for both classification and regression, where label type is Double for simplicity. So in BinaryClassificationMetrics, we still use Double for labels. We compute the confusion matrix at each threshold internally, but this is not exposed to users ( https://github.com/apache/spar

Re: deployment options for Spark and YARN w/ many app jar library dependencies

2015-06-17 Thread Sandy Ryza
Hi Matt, If you place your jars on HDFS in a public location, YARN will cache them on each node after the first download. You can also use the spark.executor.extraClassPath config to point to them. -Sandy On Wed, Jun 17, 2015 at 4:47 PM, Sweeney, Matt wrote: > Hi folks, > > I’m looking to d

Re: Not able to run FP-growth Example

2015-06-17 Thread Xiangrui Meng
You should add spark-mllib_2.10 as a dependency instead of declaring it as the artifactId. And always use the same version for spark-core and spark-mllib. I saw you used 1.3.0 for spark-core but 1.4.0 for spark-mllib, which is not guaranteed to work. If you set the scope to "provided", mllib jar wo

Reading maprfs from Spark

2015-06-17 Thread B Neupane
Hi, I'm running Spark on Mesos and trying to read a file from a MapR cluster, but have not had much success with that. I tried 2 versions of Apache Spark (with and without Hadoop). I can get to the spark-shell in the with-Hadoop version, but still can't access maprfs. The without-Hadoop version bails out with o

Re: Does MLLib has attribute importance?

2015-06-17 Thread Xiangrui Meng
We don't have it in MLlib. The closest would be the ChiSqSelector, which works for categorical data. -Xiangrui On Thu, Jun 11, 2015 at 4:33 PM, Ruslan Dautkhanov wrote: > What would be closest equivalent in MLLib to Oracle Data Miner's Attribute > Importance mining function? > > http://docs.oracl

Re: Efficient way to get top K values per key in (key, value) RDD?

2015-06-17 Thread Xiangrui Meng
This is implemented in MLlib: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala#L41. -Xiangrui On Wed, Jun 10, 2015 at 1:53 PM, erisa wrote: > Hi, > > I am a Spark newbie, and trying to solve the same problem, and have > implement
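
A usage sketch of that MLlib helper (a DeveloperApi, so this assumes it is accessible from your build): topByKey keeps the top K values per key with a bounded priority queue instead of grouping all values.

```scala
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
import org.apache.spark.rdd.RDD

def topScoresPerKey(pairs: RDD[(String, Double)], k: Int): RDD[(String, Array[Double])] =
  pairs.topByKey(k)   // largest k values per key, by the implicit Ordering
```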

deployment options for Spark and YARN w/ many app jar library dependencies

2015-06-17 Thread Sweeney, Matt
Hi folks, I’m looking to deploy spark on YARN and I have read through the docs (https://spark.apache.org/docs/latest/running-on-yarn.html). One question that I still have is if there is an alternate means of including your own app jars as opposed to the process in the “Adding Other Jars” sectio

Re: How to use Apache spark mllib Model output in C++ component

2015-06-17 Thread Xiangrui Meng
In 1.3, we added some model save/load support in Parquet format. You can use Parquet's C++ library (https://github.com/Parquet/parquet-cpp) to load the data back. -Xiangrui On Wed, Jun 10, 2015 at 12:15 AM, Akhil Das wrote: > Hope Swig and JNA might help for accessing c++ libraries from Java. > >

Re: k-means for text mining in a streaming context

2015-06-17 Thread Xiangrui Meng
Yes. You can apply HashingTF on your input stream and then use StreamingKMeans for training and prediction. -Xiangrui On Mon, Jun 8, 2015 at 11:05 AM, Ruslan Dautkhanov wrote: > Hello, > > https://spark.apache.org/docs/latest/mllib-feature-extraction.html > would Feature Extraction and Transforma
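
A minimal sketch of that pipeline, assuming whitespace tokenization and an illustrative feature dimension:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

def streamingTextClusters(lines: DStream[String], k: Int): Unit = {
  val numFeatures = 1000
  val hashingTF = new HashingTF(numFeatures)
  val vectors: DStream[Vector] = lines.map(line => hashingTF.transform(line.split("\\s+").toSeq))

  val model = new StreamingKMeans()
    .setK(k)
    .setDecayFactor(1.0)
    .setRandomCenters(numFeatures, 0.0)

  model.trainOn(vectors)             // update centers on each micro-batch
  model.predictOn(vectors).print()   // cluster assignment per record
}
```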

Re: Official Mllib API does not correspond to auto completion

2015-06-17 Thread Xiangrui Meng
I don't fully understand your question. Could you explain it in more details? Thanks! -Xiangrui On Mon, Jun 8, 2015 at 2:26 AM, Jean-Charles RISCH < risch.jeanchar...@gmail.com> wrote: > Hi, > > I am playing with Mllib (Spark 1.3.1) and my auto completion propositions > don't correspond to the of

Re: Optimization module in Python mllib

2015-06-17 Thread Xiangrui Meng
There is no plan at this time. We haven't reached 100% coverage on user-facing API in PySpark yet, which would have higher priority. -Xiangrui On Sun, Jun 7, 2015 at 1:42 AM, martingoodson wrote: > Am I right in thinking that Python mllib does not contain the optimization > module? Are there plan

Re: redshift spark

2015-06-17 Thread Xiangrui Meng
Hi Hafiz, As Ewan mentioned, the path is the path to the S3 files unloaded from Redshift. This is a more scalable way to get a large amount of data from Redshift than via JDBC. I'd recommend using the SQL API instead of the Hadoop API (https://github.com/databricks/spark-redshift). Best, Xiangrui

Re: Why the default Params.copy doesn't work for Model.copy?

2015-06-17 Thread Xiangrui Meng
That is a bug, which will be fixed in https://github.com/apache/spark/pull/6622. I disabled Model.copy because models usually don't have a default constructor and hence the default Params.copy implementation won't work. Unfortunately, due to insufficient test coverage, StringIndexModel.copy is

Reading maprfs from Spark

2015-06-17 Thread Bikrant Neupane
Hi, I'm running Spark-1.4.0 on Mesos. I have been trying to read a file from a MapR cluster but have not had much success with it. I tried 2 versions of Apache Spark (with and without Hadoop). I can get to the spark-shell in the with-Hadoop version, but still can't access maprfs[2]. The without-Hadoop versio

Re: RandomForest - subsamplingRate parameter

2015-06-17 Thread Xiangrui Meng
Because we don't have random access to the records, sampling still needs to go through the records sequentially. It does save some computation, which is perhaps noticeable only if you have data cached in memory. Different random seeds are used for the trees. -Xiangrui On Wed, Jun 3, 2015 at 4:40 PM, And

Re: Parallel parameter tuning: distributed execution of MLlib algorithms

2015-06-17 Thread Peter Rudenko
Hi, here's how to get a parallel grid-search pipeline: package org.apache.spark.ml.pipeline import org.apache.spark.ml.param.ParamMap import org.apache.spark.ml.{Pipeline, PipelineModel} import org.apache.spark.sql._ class ParralelGridSearchPipeline extends Pipeline { override def fit(dataset: Data

Re: Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-17 Thread ayan guha
Try to grant read execute access through sentry. On 18 Jun 2015 05:47, "Nitin kak" wrote: > I am trying to run a hive query from Spark code using HiveContext object. > It was running fine earlier but since the Apache Sentry has been set > installed the process is failing with this exception : > >

Re: spark mlib variance analysis

2015-06-17 Thread Xiangrui Meng
We don't have R-like model summary in MLlib, but we plan to add some in 1.5. Please watch https://issues.apache.org/jira/browse/SPARK-7674. -Xiangrui On Thu, May 28, 2015 at 3:47 PM, rafac wrote: > I have a simple problem: > i got mean number of people on one place by hour(time-series like), and

Re: Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0

2015-06-17 Thread Xiangrui Meng
That sounds like a bug. Could you create a JIRA and ping Yin Huai (cc'ed). -Xiangrui On Wed, May 27, 2015 at 12:57 AM, Justin Yip wrote: > Hello, > > I am trying out 1.4.0 and notice there are some differences in behavior with > Timestamp between 1.3.1 and 1.4.0. > > In 1.3.1, I can compare a Tim

Re: Collaborative Filtering

2015-06-17 Thread Xiangrui Meng
Please follow the code examples from the user guide: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark. -Xiangrui On Tue, May 26, 2015 at 12:34 AM, Yasemin Kaya wrote: > Hi, > > In CF > > String path = "data/mllib/als/test.data"; > JavaRDD data = sc.textFile

Re: Parallel parameter tuning: distributed execution of MLlib algorithms

2015-06-17 Thread Xiangrui Meng
On Fri, May 22, 2015 at 6:15 AM, Hugo Ferreira wrote: > Hi, > > I am currently experimenting with linear regression (SGD) (Spark + MLlib, > ver. 1.2). At this point in time I need to fine-tune the hyper-parameters. I > do this (for now) by an exhaustive grid search of the step size and the > numbe

Re: BigDecimal problem in parquet file

2015-06-17 Thread Cheng Lian
Does increasing executor memory fix the memory problem? How many columns does the schema contain? Parquet can be super memory consuming when writing wide tables. Cheng On 6/15/15 5:48 AM, Bipin Nag wrote: HI Davies, I have tried recent 1.4 and 1.5-snapshot to 1) open the parquet and save i

Re: Spark or Storm

2015-06-17 Thread Jordan Pilat
>not being able to read from Kafka using multiple nodes Kafka is plenty capable of doing this, by clustering together multiple consumer instances into a consumer group. If your topic is sufficiently partitioned, the consumer group can consume the topic in a parallelized fashion. If it isn't, you

Re: HiveContext saveAsTable create wrong partition

2015-06-17 Thread Cheng Lian
Thanks for reporting this. Would you mind helping create a JIRA for this? On 6/16/15 2:25 AM, patcharee wrote: I found if I move the partitioned columns in schemaString and in Row to the end of the sequence, then it works correctly... On 16. juni 2015 11:14, patcharee wrote: Hi, I am using

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-17 Thread Cheng Lian
What's the size of this table? Is the data skewed (so that speculation is probably triggered)? Cheng On 6/15/15 10:37 PM, Night Wolf wrote: Hey Yin, Thanks for the link to the JIRA. I'll add details to it. But I'm able to reproduce it, at least in the same shell session, every time I do a w

Re: Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-17 Thread Ajay
Hi there! It seems like you have Read/Execute access permission (and no update/insert/delete access). What operation are you performing? Ajay > On Jun 17, 2015, at 5:24 PM, nitinkak001 wrote: > > I am trying to run a hive query from Spark code using HiveContext object. It > was running fine

Executor memory allocations

2015-06-17 Thread Corey Nolet
So I've seen in the documentation that (after the overhead memory is subtracted), the memory allocations of each executor are as follows (assume default settings): 60% for cache 40% for tasks to process data Reading about how Spark implements shuffling, I've also seen it say "20% of executor mem
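
For reference, a sketch of the Spark 1.x settings behind those fractions (values shown are the documented defaults; the executor size is illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")            // per-executor heap
  .set("spark.storage.memoryFraction", "0.6")    // fraction of heap for cached RDDs
  .set("spark.shuffle.memoryFraction", "0.2")    // fraction for shuffle aggregation buffers
```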

Re: MLLib: instance weight

2015-06-17 Thread Xiangrui Meng
Hi Gajan, Please subscribe our user mailing list, which is the best place to get your questions answered. We don't have weighted instance support, but it should be easy to add and we plan to do it in the next release (1.5). Thanks for asking! Best, Xiangrui On Wed, Jun 17, 2015 at 2:33 PM, Gajan

Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-17 Thread nitinkak001
I am trying to run a hive query from Spark code using HiveContext object. It was running fine earlier but since the Apache Sentry has been set installed the process is failing with this exception : /org.apache.hadoop.security.AccessControlException: Permission denied: user=kakn, access=READ_EXECUT

Re: Serial batching with Spark Streaming

2015-06-17 Thread Tathagata Das
The default behavior should be that batch X + 1 starts processing only after batch X completes. If you are using Spark 1.4.0, could you show us a screenshot of the streaming tab, especially the list of batches? And could you also tell us if you are setting any SparkConf configurations? On Wed, Jun

Re: Spark or Storm

2015-06-17 Thread Tathagata Das
To add more information beyond what Matei said and answer the original question, here are other things to consider when comparing between Spark Streaming and Storm. * Unified programming model and semantics - Most occasions you have to process the same data again in batch jobs. If you have two sep

Shuffled vs non-shuffled coalesce in Apache Spark

2015-06-17 Thread pawel.jurczenko
Hi! I would like to know what is the difference between the following transformations when they are executed right before writing RDD to a file? 1. coalesce(1, shuffle = true) 2. coalesce(1, shuffle = false) Code example: val input = sc.textFile(inputFile) val filtered = input.fi
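
A small sketch framing the difference (paths are placeholders): with shuffle = false the upstream stages collapse into a single task, while shuffle = true keeps the upstream parallelism and adds a shuffle before the single-partition write.

```scala
import org.apache.spark.SparkContext

def writeAsOneFile(sc: SparkContext, inputFile: String, outputDir: String, shuffled: Boolean): Unit = {
  val filtered = sc.textFile(inputFile).filter(_.nonEmpty)
  // shuffle = false: narrow dependency, the whole pipeline runs as one task
  // shuffle = true : extra shuffle stage, but the filter stays parallel
  filtered.coalesce(1, shuffle = shuffled).saveAsTextFile(outputDir)
}
```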

Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-17 Thread Nitin kak
I am trying to run a hive query from Spark code using HiveContext object. It was running fine earlier but since the Apache Sentry has been set installed the process is failing with this exception : *org.apache.hadoop.security.AccessControlException: Permission denied: user=kakn, access=READ_EXECUT

Serial batching with Spark Streaming

2015-06-17 Thread Michal Čizmazia
Is it possible to achieve serial batching with Spark Streaming? Example: I configure the Streaming Context for creating a batch every 3 seconds. Processing of the batch #2 takes longer than 3 seconds and creates a backlog of batches: batch #1 takes 2s batch #2 takes 10s batch #3 takes 2s batch

Re: Spark Shell Hive Context and Kerberos ticket

2015-06-17 Thread Olivier Girardot
Ok what was wrong was that the spark-env did not contain the HADOOP_CONF_DIR properly set to /etc/hadoop/conf/ With that fixed, this issue is gone, but I can't seem to get Spark SQL 1.4.0 with Hive working on CDH 5.3 or 5.4 : Using this command line : IPYTHON=1 /.../spark-1.4.0-bin-hadoop2.4/bin/py

Re: Spark or Storm

2015-06-17 Thread Matei Zaharia
The major difference is that in Spark Streaming, there's no *need* for a TridentState for state inside your computation. All the stateful operations (reduceByWindow, updateStateByKey, etc) automatically handle exactly-once processing, keeping updates in order, etc. Also, you don't need to run a

RE: Spark or Storm

2015-06-17 Thread Evo Eftimov
The only thing which doesn't make much sense in Spark Streaming (and I am not saying it is done better in Storm) is the iterative and "redundant" shipping of essentially the same tasks (closures/lambdas/functions) to the cluster nodes AND re-launching them there again and again. This is a l

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
Hi Matei, Ah, can't get more accurate than from the horse's mouth... If you don't mind helping me understand it correctly.. >From what I understand, Storm Trident does the following (when used with Kafka): 1) Sit on Kafka Spout and create batches 2) Assign global sequential ID to the batches 3)

Re: --packages & Failed to load class for data source v1.4

2015-06-17 Thread Don Drake
I don't think this is the same issue as it works just fine in pyspark v1.3.1. Are you aware of any workaround? I was hoping to start testing one of my apps in Spark 1.4 and I use the CSV exports as a safety valve to easily debug my data flow. -Don On Sun, Jun 14, 2015 at 7:18 PM, Burak Yavuz w

Re: Can we increase the space of spark standalone cluster

2015-06-17 Thread maxdml
Also, still for 1), in conf/spark-defaults.conf, you can give the following arguments to tune the Driver's resources: spark.driver.cores spark.driver.memory Not sure if you can pass them at submit time, but it should be possible. -- View this message in context: http://apache-spark-user-list.10

Re: Can we increase the space of spark standalone cluster

2015-06-17 Thread maxdml
For 1) In standalone mode, you can increase the worker's resource allocation in their local conf/spark-env.sh with the following variables: SPARK_WORKER_CORES, SPARK_WORKER_MEMORY At application submit time, you can tune the number of resource allocated to executors with /--executor-cores/ and /

Re: Spark or Storm

2015-06-17 Thread Matei Zaharia
This documentation is only for writes to an external system, but all the counting you do within your streaming app (e.g. if you use reduceByKeyAndWindow to keep track of a running count) is exactly-once. When you write to a storage system, no matter which streaming framework you use, you'll have

Web UI vs History Server Bugs

2015-06-17 Thread jcai
Hi, I am running this on Spark stand-alone mode. I find that when I examine the web UI, a couple bugs arise: 1. There is a discrepancy between the number denoting the duration of the application when I run the history server and the number given by the web UI (default address is master:8080). I c

Re: How does one decide no of executors/cores/memory allocation?

2015-06-17 Thread nsalian
Hello shreesh, That would be quite a challenge to understand. A few things that I think should help estimate those numbers: 1) Understanding the cost of the individual transformations in the application. E.g. a flatMap can be more expensive in memory than a map 2) The communication patter

Re: Spark DataFrame 1.4 write to parquet/saveAsTable tasks fail

2015-06-17 Thread Yin Huai
So, the second attempt of those tasks that failed with an NPE can complete, and the job eventually finished? On Mon, Jun 15, 2015 at 10:37 PM, Night Wolf wrote: > Hey Yin, > > Thanks for the link to the JIRA. I'll add details to it. But I'm able to > reproduce it, at least in the same shell session, every

Re: Spark SQL and Skewed Joins

2015-06-17 Thread Yin Huai
Hi John, Did you also set spark.sql.planner.externalSort to true? Probably you will not see executor lost with this conf. For now, maybe you can manually split the query into two parts, one for skewed keys and one for other records. Then, you union the results of these two parts together. Thanks,
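
A rough sketch of that manual split, with placeholder table/column names and a single known hot key:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

def skewAwareJoin(sqlContext: SQLContext, hotKey: String): DataFrame = {
  sqlContext.setConf("spark.sql.planner.externalSort", "true")  // the conf mentioned above

  val facts = sqlContext.table("facts")   // placeholder tables
  val dims  = sqlContext.table("dims")

  // join the skewed key and the remaining keys separately, then recombine
  val skewedPart = facts.filter(facts("k") === hotKey).join(dims, facts("k") === dims("k"))
  val restPart   = facts.filter(facts("k") !== hotKey).join(dims, facts("k") === dims("k"))

  skewedPart.unionAll(restPart)
}
```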

Re: Spark or Storm

2015-06-17 Thread Spark Enthusiast
Again, by Storm, you mean Storm Trident, correct? On Wednesday, 17 June 2015 10:09 PM, Michael Segel wrote: Actually the reverse. Spark Streaming is really a micro batch system where the smallest window is 1/2 a second (500ms). So for CEP, its not really a good idea.  So in terms o

Re: Spark SQL and Skewed Joins

2015-06-17 Thread Koert Kuipers
could it be composed maybe? a general version and then a sql version that exploits the additional info/abilities available there and uses the general version internally... i assume the sql version can benefit from the logical phase optimization to pick join details. or is there more? On Tue, Jun

Re: Unable to use more than 1 executor for spark streaming application with YARN

2015-06-17 Thread Saiph Kappa
How can I get more information regarding this exception? On Wed, Jun 17, 2015 at 1:17 AM, Saiph Kappa wrote: > Hi, > > I am running a simple spark streaming application on hadoop 2.7.0/YARN > (master: yarn-client) with 2 executors in different machines. However, > while the app is running, I can

Re: Spark or Storm

2015-06-17 Thread Michael Segel
Actually the reverse. Spark Streaming is really a micro batch system where the smallest window is 1/2 a second (500ms). So for CEP, its not really a good idea. So in terms of options…. spark streaming, storm, samza, akka and others… Storm is probably the easiest to pick up, spark streaming

Re: generateTreeString causes huge performance problems on dataframe persistence

2015-06-17 Thread Jan-Paul Bultmann
> Seems you're hitting the self-join, currently Spark SQL won't cache any > result/logical tree for further analyzing or computing for self-join. Other joins don’t suffer from this problem? > Since the logical tree is huge, it's reasonable to take long time in > generating its tree string recu

Re: SparkR 1.4.0: read.df() function fails

2015-06-17 Thread Stensrud, Erik
Thanks to both of you! You solved the problem. Thanks Erik Stensrud Sent from my iPhone On 16 Jun 2015, at 20:23, Guru Medasani mailto:gdm...@gmail.com>> wrote: Hi Esten, Looks like your sqlContext is connected to a Hadoop/Spark cluster, but the file path you specified is local?. mydf<-r

Re: ALS predictALL not completing

2015-06-17 Thread Nick Pentreath
So to be clear, you're trying to use the recommendProducts method of MatrixFactorizationModel? I don't see predictAll in 1.3.1 1.4.0 has a more efficient method to recommend products for all users (or vice versa): https://github.com/apache/spark/blob/v1.4.0/mllib/src/main/scala/org/apache/spark/ml
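
A minimal usage sketch of the 1.4 batch-recommendation method linked above (the rank and iteration counts are illustrative):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def topKProductsPerUser(ratings: RDD[Rating], k: Int): RDD[(Int, Array[Rating])] = {
  val model = ALS.train(ratings, 10, 10)   // rank = 10, iterations = 10
  model.recommendProductsForUsers(k)       // Spark 1.4+: top-k products for every user
}
```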

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
The thing is, even with that improvement, you still have to make updates idempotent or transactional yourself. If you read http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics that refers to the latest version, it says: Semantics of output operations Out

Re: Loading lots of parquet files into dataframe from s3

2015-06-17 Thread arnonrgo
What happens is that Spark opens the files in order to merge the schema. Unfortunately Spark assumes the files are local and that access will be fast, which makes this step extremely slow on S3. If you know all the files use the same schema (e.g. it is a result of a previous job)

Re: Spark or Storm

2015-06-17 Thread Ashish Soni
@Enno As per the latest version and documentation, Spark Streaming does offer exactly-once semantics using the improved Kafka integration, though I have not tested it yet. Any feedback will be helpful if anyone has tried the same. http://koeninger.github.io/kafka-exactly-once/#7 https://databricks.com/blog

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
AFAIK KCL is *supposed* to provide fault tolerance and load balancing (plus additionally, elastic scaling unlike Storm), Kinesis providing the coordination. My understanding is that it's like a naked Storm worker process that can consequently only do map. I haven't really used it tho, so can't rea

RE: generateTreeString causes huge performance problems on dataframe persistence

2015-06-17 Thread Cheng, Hao
Seems you're hitting the self-join, currently Spark SQL won't cache any result/logical tree for further analyzing or computing for self-join. Since the logical tree is huge, it's reasonable to take long time in generating its tree string recursively. And I also doubt the computing can finish wit

Re: Spark or Storm

2015-06-17 Thread ayan guha
Thanks for this. It's a KCL-based Kinesis application. But because it's just a Java application, we are thinking of using Spark on EMR or Storm for fault tolerance and load balancing. Is it a correct approach? On 17 Jun 2015 23:07, "Enno Shioji" wrote: > Hi Ayan, > > Admittedly I haven't done much with

Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Serega Sheypak
So, here is some more input: the problem could be in the spark-sql-thriftserver. When I use the Spark console to submit the SQL query, it takes 10 seconds and a reasonable number of tasks. import com.datastax.spark.connector._; val cc = new CassandraSQLContext(sc); cc.sql("select su.user_id from appdata.site_u

Using spark.hadoop.* to set Hadoop properties

2015-06-17 Thread Corey Nolet
I've become accustomed to being able to use system properties to override properties in the Hadoop Configuration objects. I just recently noticed that when Spark creates the Hadoop Configuration in the SparkContext, it cycles through any properties prefixed with spark.hadoop. and adds those properti
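
A sketch of the behaviour described above: a SparkConf entry prefixed with spark.hadoop. ends up, minus the prefix, in the Hadoop Configuration that the SparkContext builds (the particular property chosen here is just an example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("hadoop-props")
  .set("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize", "134217728")

val sc = new SparkContext(conf)
// the prefix is stripped when the value lands in the Hadoop Configuration
println(sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.split.maxsize"))
```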

Re: Spark on EMR

2015-06-17 Thread Kelly, Jonathan
Yes, for now it is a wrapper around the old install-spark BA, but that will change soon. The currently supported version in AMI 3.8.0 is 1.3.1, as 1.4.0 was released too late to include it in AMI 3.8.0. Spark 1.4.0 support is coming soon though, of course. Unfortunately, though install-spark is

RE: Is HiveContext Thread Safe?

2015-06-17 Thread Cheng, Hao
Yes, it is thread safe. That’s how Spark SQL JDBC Server works. Cheng Hao From: V Dineshkumar [mailto:developer.dines...@gmail.com] Sent: Wednesday, June 17, 2015 9:44 PM To: user@spark.apache.org Subject: Is HiveContext Thread Safe? Hi, I have a HiveContext which I am using in multiple thread

Re: Spark on EMR

2015-06-17 Thread Eugen Cepoi
It looks like it is a wrapper around https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark So basically adding an option -v,1.4.0.a should work. https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html 2015-06-17 15:32 GMT+02:00 Hideyoshi Maeda : >

Is HiveContext Thread Safe?

2015-06-17 Thread V Dineshkumar
Hi, I have a HiveContext which I am using in multiple threads to submit Spark SQL queries using the *sql* method. I just wanted to know whether this method is thread-safe or not. Will all my queries be submitted at the same time independently of each other, or will they be submitted sequentially one after the o
