Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-25 Thread Abhishek Anand
After changing the default compression codec from snappy to lzf, the errors are gone!! How can I fix this while keeping snappy as the codec? Is there any downside to using lzf, given that snappy is the default codec that ships with Spark? Thanks!!! Abhi On Mon, Feb 22, 2016 at 7:42 PM, Abhishek Anand
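A minimal sketch of the codec switch being discussed, assuming it is applied through SparkConf (the same key can equally go into spark-defaults.conf):

  import org.apache.spark.{SparkConf, SparkContext}

  // Switch the block/shuffle compression codec away from the default (snappy).
  // "lzf" is the value reported as a workaround in this thread; "lz4" is another option.
  val conf = new SparkConf()
    .setAppName("codec-workaround")
    .set("spark.io.compression.codec", "lzf")
  val sc = new SparkContext(conf)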

Re: [Help]: DataframeNAfunction fill method throwing exception

2016-02-25 Thread Divya Gehlot
Hi Jan, thanks for the help. Alas, your suggestion also didn't work. scala> import org.apache.spark.sql.types.{StructType, StructField, > StringType, IntegerType, LongType, DoubleType, FloatType}; > import org.apache.spark.sql.types.{StructType, StructField, StringType, > IntegerType, LongType,

RE: Spark Streaming - graceful shutdown when stream has no more data

2016-02-25 Thread Mao, Wei
I would argue against making it configurable unless there is real production use case. If it’s just for test, there are bunch of ways to achieve it. For example, you can mark if test streaming is finished globally, and stop ssc on another thread when status of that mark changed. Back to
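For the test-only case described above, a rough sketch (the flag and method names here are hypothetical, not from the thread) of stopping the StreamingContext from another thread once a global marker is flipped:

  import java.util.concurrent.atomic.AtomicBoolean
  import org.apache.spark.streaming.StreamingContext

  // Set to true by the test once the stream has produced all of its data.
  val finished = new AtomicBoolean(false)

  def stopWhenFinished(ssc: StreamingContext): Unit = {
    new Thread(new Runnable {
      override def run(): Unit = {
        while (!finished.get()) Thread.sleep(1000)
        // Stop streaming gracefully, keep the underlying SparkContext alive.
        ssc.stop(stopSparkContext = false, stopGracefully = true)
      }
    }).start()
  }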

When I merge some data, can't go on...

2016-02-25 Thread Bonsen
I have a file, like 1.txt: 1 2 1 3 1 4 1 5 1 6 1 7 2 4 2 5 2 7 2 9 I want to merge them; the result should look like map(1->List(2,3,4,5,6,7), 2->List(4,5,7,9)). What should I do? val file1=sc.textFile("1.txt") val q1=file1.flatMap(_.split(' ')) ... maybe I should change RDD[Int] to RDD[(Int, Int)]? --
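One possible sketch (assuming each line of 1.txt holds exactly two space-separated integers, and sc is the usual shell SparkContext): parse each line into a key/value pair rather than flatMapping into single tokens, then group by key.

  val file1 = sc.textFile("1.txt")
  val pairs = file1.map { line =>
    val Array(k, v) = line.trim.split("\\s+") // "1 2" -> (1, 2)
    (k.toInt, v.toInt)
  }
  val merged = pairs.groupByKey().mapValues(_.toList)
  // merged: RDD[(Int, List[Int])], e.g. (1, List(2,3,4,5,6,7)), (2, List(4,5,7,9))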

Re: [Help]: DataframeNAfunction fill method throwing exception

2016-02-25 Thread Jan Štěrba
just use coalesce function df.selectExpr("name", "coalesce(age, 0) as age") -- Jan Sterba https://twitter.com/honzasterba | http://flickr.com/honzasterba | http://500px.com/honzasterba On Fri, Feb 26, 2016 at 5:27 AM, Divya Gehlot wrote: > Hi, > I have dataset which
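For reference, a short sketch of both approaches from this thread, under the assumption that df has a nullable numeric age column:

  // DataFrameNaFunctions.fill replaces nulls in the named columns
  val filled = df.na.fill(0.0, Seq("age"))

  // coalesce picks the first non-null argument, as suggested above
  val coalesced = df.selectExpr("name", "coalesce(age, 0) as age")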

Re: merge join already sorted data?

2016-02-25 Thread Takeshi Yamamuro
Hi, SparkSQL inside can put order assumptions on columns (OrderedDistribution) though, JDBC datasources does not support this; spark is not sure how columns loaded from databases are ordered. Also, there is no way to let spark know this order. thanks, On Fri, Feb 26, 2016 at 2:22 PM, Ken Geis

Re: Use maxmind geoip lib to process ip on Spark/Spark Streaming

2016-02-25 Thread Zhun Shen
Hi, thanks for your advice. I tried your method; I use Gradle to manage my Scala code. 'com.snowplowanalytics:scala-maxmind-iplookups:0.2.0’ was imported in Gradle. spark version: 1.6.0 scala: 2.10.4 scala-maxmind-iplookups: 0.2.0 I ran my test and got the error below:

Re: Bug in DiskBlockManager subDirs logic?

2016-02-25 Thread Takeshi Yamamuro
Hi, could you put together a simple piece of code to reproduce the issue? I'm not exactly sure why shuffle data in the temp dir is wrongly deleted. thanks, On Fri, Feb 26, 2016 at 6:00 AM, Zee Chen wrote: > Hi, > > I am debugging a situation where SortShuffleWriter sometimes fail to > create

Survival Curves using AFT implementation in Spark

2016-02-25 Thread Stuti Awasthi
Hi All, I wanted to apply survival analysis using Spark's AFT algorithm implementation. Currently I do the same in R using the coxph model and passing the model to the Survfit() function to generate survival curves. Then I can visualize the survival curve on validation data to understand how good my model
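For anyone looking for a starting point, a minimal sketch (toy data, not the poster's) of fitting Spark 1.6's AFT model and requesting survival-time quantiles, which can be used to plot an approximate survival curve per observation:

  import org.apache.spark.ml.regression.AFTSurvivalRegression
  import org.apache.spark.mllib.linalg.Vectors

  // label = observed time, censor = 1.0 if the event occurred, 0.0 if censored
  val training = sqlContext.createDataFrame(Seq(
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346,  2.158)),
    (3.627, 0.0, Vectors.dense(1.380,  0.231)),
    (0.273, 1.0, Vectors.dense(0.520,  1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226))
  )).toDF("label", "censor", "features")

  val aft = new AFTSurvivalRegression()
    .setQuantileProbabilities(Array(0.25, 0.5, 0.75))
    .setQuantilesCol("quantiles")

  val model = aft.fit(training)
  model.transform(training).select("features", "prediction", "quantiles").show(false)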

Re: DirectFileOutputCommiter

2016-02-25 Thread Takeshi Yamamuro
Hi, Great work! What is the concrete performance gain of the committer on S3? I'd like to know. I think there is no direct committer for files because these kinds of committers carry a risk of data loss (see SPARK-10063). Until this is resolved, ISTM files cannot support direct commits. thanks, On

Re: Calculation of histogram bins and frequency in Apache spark 1.6

2016-02-25 Thread Yanbo Liang
Actually Spark SQL `groupBy` with `count` can get frequency in each bin. You can also try with DataFrameStatFunctions.freqItems() to get the frequent items for columns. Thanks Yanbo 2016-02-24 1:21 GMT+08:00 Burak Yavuz : > You could use the Bucketizer transformer in Spark ML.
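A small sketch combining the two suggestions (Bucketizer to assign bins, then groupBy/count for the frequencies); the values and splits below are made up for illustration:

  import org.apache.spark.ml.feature.Bucketizer

  val data = sqlContext.createDataFrame(Seq(
    Tuple1(-0.5), Tuple1(0.3), Tuple1(1.7), Tuple1(8.2), Tuple1(12.0)
  )).toDF("value")

  val bucketizer = new Bucketizer()
    .setInputCol("value")
    .setOutputCol("bin")
    .setSplits(Array(Double.NegativeInfinity, 0.0, 5.0, 10.0, Double.PositiveInfinity))

  // frequency per histogram bin
  bucketizer.transform(data).groupBy("bin").count().show()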

Re: Saving and Loading Dataframes

2016-02-25 Thread Yanbo Liang
Hi Raj, Could you share your code which can help others to diagnose this issue? Which version did you use? I can not reproduce this problem in my environment. Thanks Yanbo 2016-02-26 10:49 GMT+08:00 raj.kumar : > Hi, > > I am using mllib. I use the ml vectorization

Re: select * from mytable where column1 in (select max(column1) from mytable)

2016-02-25 Thread Mohammad Tariq
Spark doesn't support subqueries in WHERE clause, IIRC. It supports subqueries only in the FROM clause as of now. See this ticket for more on this. Tariq, Mohammad about.me/mti On Fri,

merge join already sorted data?

2016-02-25 Thread Ken Geis
I am loading data from two different databases and joining it in Spark. The data is indexed in the database, so it is efficient to retrieve the data ordered by a key. Can I tell Spark that my data is coming in ordered on that key so that when I join the data sets, they will be joined with little

Re: ALS trainImplicit performance

2016-02-25 Thread Sabarish Sasidharan
I have tested upto 3 billion. ALS scales, you just need to scale your cluster accordingly. More than building the model, it's getting the final recommendations that won't scale as nicely, especially when number of products is huge. This is the case when you are generating recommendations in a

Re: spark-xml data source (com.databricks.spark.xml) not working with spark 1.6

2016-02-25 Thread Hyukjin Kwon
Hi, it looks like you forgot to specify the "rowTag" option, which is "book" for the case of the sample data. Thanks 2016-01-29 8:16 GMT+09:00 Andrés Ivaldi : > Hi, could you get it to work? Tomorrow I'll be using the xml parser also, on > Windows 7, I'll let you know the results.
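For completeness, a hedged sketch of the suggested fix (the file name is a placeholder):

  val df = sqlContext.read
    .format("com.databricks.spark.xml")
    .option("rowTag", "book")   // the XML element that maps to one row
    .load("books.xml")
  df.printSchema()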

[Help]: DataframeNAfunction fill method throwing exception

2016-02-25 Thread Divya Gehlot
Hi, I have dataset which looks like below name age alice 35 bob null peter 24 I need to replace null values of columns with 0 so I referred Spark API DataframeNAfunctions.scala

Saving and Loading Dataframes

2016-02-25 Thread raj.kumar
Hi, I am using mllib. I use the ml vectorization tools to create the vectorized input dataframe for the ml/mllib machine-learning models with schema: root |-- label: double (nullable = true) |-- features: vector (nullable = true) To avoid repeated vectorization, I am trying to save and load
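A minimal sketch of one way to persist and reload such a DataFrame (Parquet round-trips the vector column type); the `vectorized` name and the path are placeholders, not from the original post:

  // vectorized: DataFrame with schema (label: double, features: vector)
  vectorized.write.mode("overwrite").parquet("/tmp/vectorized.parquet")
  val reloaded = sqlContext.read.parquet("/tmp/vectorized.parquet")
  reloaded.printSchema()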

Re: Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Harsh J
You should be able to cast the object type to the real underlying type (GenericRecord (if generic, which is so by default), or the actual type class (if specific)). The underlying implementation of KafkaAvroDecoder seems to use either one of those depending on a config switch:
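A hedged sketch of that cast for the generic case, reusing the `messages` DStream from the other reply in this thread; the field names are hypothetical:

  import org.apache.avro.generic.GenericRecord

  val fields = messages.map { case (_, value) =>
    // KafkaAvroDecoder hands back Object; for generic decoding it is a GenericRecord
    val record = value.asInstanceOf[GenericRecord]
    (record.get("userId"), record.get("eventType")) // hypothetical field names
  }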

Re: select * from mytable where column1 in (select max(column1) from mytable)

2016-02-25 Thread ayan guha
Why is this not working for you? Are you trying on dataframe? What error are you getting? On Thu, Feb 25, 2016 at 10:23 PM, Ashok Kumar wrote: > Hi, > > What is the equivalent of this in Spark please > > select * from mytable where column1 in (select max(column1)

ALS trainImplicit performance

2016-02-25 Thread Roberto Pagliari
Does anyone know about the maximum number of ratings ALS was tested successfully? For example, is 1 billion ratings (nonzero entries) too much for it to work properly? Thank you,

How to overwrite data dynamically to specific partitions in Spark SQL

2016-02-25 Thread SRK
Hi, I need to overwrite data dynamically to specific partitions depending on filters. How can that be done in sqlContext? Thanks, Swetha -- View this message in context:

Re: Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Mohammad Tariq
I got it working by using jsonRDD. This is what I had to do in order to make it work : val messages = KafkaUtils.createDirectStream[Object, Object, KafkaAvroDecoder, KafkaAvroDecoder](ssc, kafkaParams, topicsSet) val lines = messages.map(_._2.toString) lines.foreachRDD(jsonRDD
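The snippet is cut off above; a hedged completion of the same idea (the field name is a placeholder):

  lines.foreachRDD { jsonRDD =>
    if (!jsonRDD.isEmpty()) {
      // turn the micro-batch of JSON strings into a DataFrame
      val df = sqlContext.read.json(jsonRDD)
      df.select("someField").show() // access fields by name
    }
  }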

Re: Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Shixiong(Ryan) Zhu
You can use `DStream.map` to transform objects to anything you want. On Thu, Feb 25, 2016 at 11:06 AM, Mohammad Tariq wrote: > Hi group, > > I have just started working with confluent platform and spark streaming, > and was wondering if it is possible to access individual

Re: DirectFileOutputCommiter

2016-02-25 Thread Teng Qiu
yes, should be this one https://gist.github.com/aarondav/c513916e72101bbe14ec then need to set it in spark-defaults.conf : https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13 Am Freitag, 26. Februar 2016 schrieb Yin Yang : >
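One way to wire such a committer in (an assumption on my part, not copied from the linked commit) is via the spark.hadoop.* pass-through, pointing at wherever the class from the gist was compiled, e.g. the package used in the fork mentioned below:

  val conf = new org.apache.spark.SparkConf()
    .set("spark.hadoop.mapred.output.committer.class",
         "org.apache.spark.mapred.DirectOutputCommitter") // assumed class location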

Re: DirectFileOutputCommiter

2016-02-25 Thread Yin Yang
The header of DirectOutputCommitter.scala says Databricks. Did you get it from Databricks ? On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu wrote: > interesting in this topic as well, why the DirectFileOutputCommitter not > included? > > we added it in our fork, under >

Re: DirectFileOutputCommiter

2016-02-25 Thread Teng Qiu
Interested in this topic as well: why is the DirectFileOutputCommitter not included? We added it in our fork, under core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala. Moreover, this DirectFileOutputCommitter does not work for insert operations in HiveContext, since the

Re: Spark SQL support for sub-queries

2016-02-25 Thread Mohammad Tariq
AFAIK, this isn't supported yet. A ticket is in progress though. Tariq, Mohammad about.me/mti On Fri, Feb 26, 2016 at 4:16 AM, Mich Talebzadeh <

Re: LDA topic Modeling spark + python

2016-02-25 Thread Bryan Cutler
I'm not exactly sure how you would like to setup your LDA model, but I noticed there was no Python example for LDA in Spark. I created this issue to add it https://issues.apache.org/jira/browse/SPARK-13500. Keep an eye on this if it could be of help. bryan On Wed, Feb 24, 2016 at 8:34 PM,

Spark SQL support for sub-queries

2016-02-25 Thread Mich Talebzadeh
Hi, I guess the following confirms that Spark does not support sub-queries: val d = HiveContext.table("test.dummy") d.registerTempTable("tmp") HiveContext.sql("select * from tmp where id IN (select max(id) from tmp)") It crashes. The SQL works OK in Hive itself on the underlying
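A common workaround sketch until subquery support lands: run the aggregate as its own query and substitute the value (assuming id is numeric, and reusing the HiveContext handle from the snippet above):

  val maxId = HiveContext.sql("select max(id) as m from tmp").first().get(0)
  val result = HiveContext.sql(s"select * from tmp where id = $maxId")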

Re: Please help: installation of spark 1.6.0 on ubuntu fails

2016-02-25 Thread Shixiong(Ryan) Zhu
Please use Java 7 instead. On Thu, Feb 25, 2016 at 1:54 PM, Marco Mistroni wrote: > Hello all > could anyone help? > i have tried to install spark 1.6.0 on ubuntu, but the installation failed > Here are my steps > > 1. download spark (successful) > > 31 wget

Please help: installation of spark 1.6.0 on ubuntu fails

2016-02-25 Thread Marco Mistroni
Hello all, could anyone help? I have tried to install Spark 1.6.0 on Ubuntu, but the installation failed. Here are my steps: 1. download spark (successful) 31 wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0.tgz 33 tar -zxf spark-1.6.0.tgz 2. cd spark-1.6.0 2.1 sbt assembly error]

Bug in DiskBlockManager subDirs logic?

2016-02-25 Thread Zee Chen
Hi, I am debugging a situation where SortShuffleWriter sometimes fail to create a file, with the following stack trace: 16/02/23 11:48:46 ERROR Executor: Exception in task 13.0 in stage 47827.0 (TID 1367089) java.io.FileNotFoundException:

Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Mohammad Tariq
Hi group, I have just started working with confluent platform and spark streaming, and was wondering if it is possible to access individual fields from an Avro object read from a kafka topic through spark streaming. As per its default behaviour *KafkaUtils.createDirectStream[Object, Object,

Re: Spark 1.6.0 running jobs in yarn shows negative no of tasks in executor

2016-02-25 Thread Umesh Kacha
Hi, I am using Hadoop 2.4.0. It is not frequent; it only happens sometimes. I don't think my Spark logic has any problem; if the logic were wrong it would be failing every day. I see mostly YARN-killed executors, so I see executor lost in my driver logs. On Thu, Feb 25, 2016 at 10:30 PM, Yin Yang

Re: Spark SQL partitioned tables - check for partition

2016-02-25 Thread Kevin Mellott
If you want to see which partitions exist on disk (without manually checking), you could write code against the Hadoop FileSystem library to check. Is that what you are asking? https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/fs/package-summary.html On Thu, Feb 25, 2016 at 10:54 AM,
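A hedged sketch of that check; the table path and partition value are placeholders, assuming the data was written with partitionBy so each partition lives under a partCol=<value> directory:

  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(sc.hadoopConfiguration)
  val partitionExists = fs.exists(new Path("/data/mytable/partCol=2016-02-25"))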

Re: Spark 1.6.0 running jobs in yarn shows negative no of tasks in executor

2016-02-25 Thread Yin Yang
Which release of hadoop are you using ? Can you share a bit about the logic of your job ? Pastebinning portion of relevant logs would give us more clue. Thanks On Thu, Feb 25, 2016 at 8:54 AM, unk1102 wrote: > Hi I have spark job which I run on yarn and sometimes it

Re: Spark SQL partitioned tables - check for partition

2016-02-25 Thread Deenar Toraskar
Kevin, I meant the partitions on disk/HDFS, not the in-memory RDD/DataFrame partitions. If I am right, mapPartitions or foreachPartition would identify and operate on the in-memory partitions. Deenar On 25 February 2016 at 15:28, Kevin Mellott wrote: > Once you have

Spark 1.6.0 running jobs in yarn shows negative no of tasks in executor

2016-02-25 Thread unk1102
Hi, I have a Spark job which I run on YARN and sometimes it behaves in a weird manner: it shows a negative number of tasks in a few executors, and I keep losing executors. I also see that the number of executors is more than I requested. My job is highly tuned and does not hit OOM or any other problem. It is just that YARN behaves in a

Re: Running executors missing in sparkUI

2016-02-25 Thread Jan Štěrba
I am running spark1.3 on Cloudera hadoop 5.4 -- Jan Sterba https://twitter.com/honzasterba | http://flickr.com/honzasterba | http://500px.com/honzasterba On Thu, Feb 25, 2016 at 4:22 PM, Yin Yang wrote: > Which Spark / hadoop release are you running ? > > Thanks > > On Thu,

Re: Spark Streaming - processing/transforming DStreams using a custom Receiver

2016-02-25 Thread Bryan Cutler
Using flatMap on a string will treat it as a sequence, which is why you are getting an RDD of Char. I think you want to just do a map instead, like this: val timestamps = stream.map(event => event.getCreatedAt.toString) On Feb 25, 2016 8:27 AM, "Dominik Safaric" wrote:

Spark Streaming - processing/transforming DStreams using a custom Receiver

2016-02-25 Thread Dominik Safaric
Recently, I've implemented the following Receiver and custom Spark Streaming InputDStream using Scala: /** * The GitHubUtils object declares an interface consisting of overloaded createStream * functions. The createStream function takes as arguments the ctx : StreamingContext * passed by the

d.filter("id in max(id)")

2016-02-25 Thread Ashok Kumar
Hi, How can I make that work? val d = HiveContext.table("table") select * from table where ID = MAX(ID) from table Thanks

Re: Multiple user operations in spark.

2016-02-25 Thread Sabarish Sasidharan
I don't have a proper answer to this. But to circumvent if you have 2 independent Spark jobs, you could update one when the other is serving reads. But it's still not scalable for incessant updates. Regards Sab On 25-Feb-2016 7:19 pm, "Udbhav Agarwal" wrote: > Hi, >

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
I'm going to try to do it with Pregel, but if there are other ideas... great! What do you call P time? I think that it's O(Number of Vertices * N) 2016-02-25 16:17 GMT+01:00 Darren Govoni : > This might be hard to do. One generalization of this problem is >

Re: Spark SQL partitioned tables - check for partition

2016-02-25 Thread Kevin Mellott
Once you have loaded information into a DataFrame, you can use the mapPartitions or foreachPartition operations to both identify the partitions and operate against them. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame On Thu, Feb 25, 2016 at 9:24 AM,

Spark SQL partitioned tables - check for partition

2016-02-25 Thread Deenar Toraskar
Hi How does one check for the presence of a partition in a Spark SQL partitioned table (save using dataframe.write.partitionedBy("partCol") not hive compatible tables), other than physically checking the directory on HDFS or doing a count(*) with the partition cols in the where clause ?

Re: Reindexing in graphx

2016-02-25 Thread Karl Higley
For real time graph mutations and queries, you might consider a graph database like Neo4j or TitanDB. Titan can be backed by HBase, which you're already using, so that's probably worth a look. On Thu, Feb 25, 2016, 9:55 AM Udbhav Agarwal wrote: > That’s a good thing

Re: Running executors missing in sparkUI

2016-02-25 Thread Yin Yang
Which Spark / hadoop release are you running ? Thanks On Thu, Feb 25, 2016 at 4:28 AM, Jan Štěrba wrote: > Hello, > > I have quite a weird behaviour that I can't quite wrap my head around. > I am running Spark on a Hadoop YARN cluster. I have Spark configured > in such a

RE: How could I do this algorithm in Spark?

2016-02-25 Thread Darren Govoni
This might be hard to do. One generalization of this problem is  https://en.m.wikipedia.org/wiki/Longest_path_problem Given a node (e.g. A), find longest path. All interior relations are transitive and can be inferred. But finding a distributed spark way of doing it in P time would be

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
Thank you!, I'm trying to do it with Pregel,, it's being hard because I have never used GraphX and Pregel before. 2016-02-25 14:00 GMT+01:00 Sabarish Sasidharan : > Like Robin said, pls explore Pregel. You could do it without Pregel but it > might be laborious. I have a

RE: Reindexing in graphx

2016-02-25 Thread Udbhav Agarwal
That’s a good thing you pointed out. Let me check that. Thanks. Another thing I was struggling with: while this process of adding vertices is happening with the graph (named inputGraph), I am not able to access it or run queries over it. Currently when I am querying the graph during the

Re: Number partitions after a join

2016-02-25 Thread Guillermo Ortiz
Good to know, thanks everybody! 2016-02-25 15:29 GMT+01:00 JOAQUIN GUANTER GONZALBEZ < joaquin.guantergonzal...@telefonica.com>: > Actually that only applies to Spark SQL. I believe that in plain RDD, the > resulting join will have as many partitions as the RDD with the most > partition. > > > >

RE: Number partitions after a join

2016-02-25 Thread JOAQUIN GUANTER GONZALBEZ
Actually that only applies to Spark SQL. I believe that in plain RDDs, the resulting join will have as many partitions as the RDD with the most partitions. Cheers, Ximo From: Guillermo Ortiz [mailto:konstt2...@gmail.com] Sent: Thursday, 25 February 2016 15:19 To: Takeshi Yamamuro

Re: Number partitions after a join

2016-02-25 Thread Guillermo Ortiz
thank you, I didn't see that option. 2016-02-25 14:51 GMT+01:00 Takeshi Yamamuro : > Hi, > > The number depends on `spark.sql.shuffle.partitions`. > See: > http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options > > On Thu, Feb 25, 2016

Re: Number partitions after a join

2016-02-25 Thread Takeshi Yamamuro
Hi, The number depends on `spark.sql.shuffle.partitions`. See: http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options On Thu, Feb 25, 2016 at 7:42 PM, Guillermo Ortiz wrote: > When you do a join in Spark, how many partitions are as
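For reference, the setting can be changed on the SQLContext before running the join; 200 is the default and is shown here only as an illustration:

  sqlContext.setConf("spark.sql.shuffle.partitions", "200")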

Multiple user operations in spark.

2016-02-25 Thread Udbhav Agarwal
Hi, I am using GraphX. I am adding a batch of vertices to a graph with around 100,000 vertices and few edges. Adding around 400 vertices takes 7 seconds with one machine of 8 cores and 8g RAM. My trouble is that while this process of addition is happening with the graph (named inputGraph), I am not

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Sabarish Sasidharan
Like Robin said, pls explore Pregel. You could do it without Pregel but it might be laborious. I have a simple outline below. You will need more iterations if the number of levels is higher. Edges: a-b, b-c, c-d, b-e, e-f, f-c. flatMapToPair: a -> (a-b), b -> (a-b), b -> (b-c), c -> (b-c), c -> (c-d), d -> (c-d)

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
I'm taking a look at Pregel. It seems a good way to do it. The only negative thing I see is that it's not a really complex graph with a lot of edges between the vertices... they are more like a lot of isolated small graphs. 2016-02-25 12:32 GMT+01:00 Robin East : > The

Re: which is a more appropriate form of ratings ?

2016-02-25 Thread Hiroyuki Yamada
Thanks very much, Nick and Sabarish. That helps me a lot. Regards, *Hiro* On Thu, Feb 25, 2016 at 8:52 PM, Nick Pentreath wrote: > Yes, ALS requires the aggregated version (A). You can use decimal or whole > numbers for the rating, depending on your application, as

Running executors missing in sparkUI

2016-02-25 Thread Jan Štěrba
Hello, I have quite a weird behaviour that I can't quite wrap my head around. I am running Spark on a Hadoop YARN cluster. I have Spark configured in such a way that it utilizes all free vcores in the cluster (setting max vcores per executor and number of executors to use all vcores in cluster).

Re: which is a more appropriate form of ratings ?

2016-02-25 Thread Nick Pentreath
Yes, ALS requires the aggregated version (A). You can use decimal or whole numbers for the rating, depending on your application, as for implicit data they are not "ratings" but rather "weights". A common approach is to apply different weightings to different user events (such as 1.0 for a page
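A minimal sketch of form (A): one Rating per (user, item) with the event weights summed before calling trainImplicit. The counts and hyperparameters below are illustrative only, not recommendations:

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  // raw events: (userId, pageId, weight), one row per view
  val events = sc.parallelize(Seq(
    (1, 1, 1.0), (1, 1, 1.0), (1, 1, 1.0), // user 1 viewed page 1 three times
    (1, 2, 1.0), (1, 2, 1.0),              // user 1 viewed page 2 twice
    (2, 1, 1.0)
  ))
  val aggregated = events
    .map { case (user, item, w) => ((user, item), w) }
    .reduceByKey(_ + _)
    .map { case ((user, item), w) => Rating(user, item, w) }

  val model = ALS.trainImplicit(aggregated, 10, 10, 0.01, 1.0) // rank, iterations, lambda, alpha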

Re: which is a more appropriate form of ratings ?

2016-02-25 Thread Sabarish Sasidharan
I believe the ALS algo expects the ratings to be aggregated (A). I don't see why you have to use decimals for rating. Regards Sab On Thu, Feb 25, 2016 at 4:50 PM, Hiroyuki Yamada wrote: > Hello. > > I just started working on CF in MLlib. > I am using trainImplicit because I

select * from mytable where column1 in (select max(column1) from mytable)

2016-02-25 Thread Ashok Kumar
Hi, What is the equivalent of this in Spark please select * from mytable where column1 in (select max(column1) from mytable) Thanks

which is a more appropriate form of ratings ?

2016-02-25 Thread Hiroyuki Yamada
Hello. I just started working on CF in MLlib. I am using trainImplicit because I only have implicit ratings like page views. I am wondering which is a more appropriate form of ratings. Let's assume that view count is regarded as a rating and user 1 sees page 1 3 times and sees page 2 twice and

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
Oh, the letters were just an example, it could be: a , t b, o t, k k, c So.. a -> t -> k -> c and the result is: a,c; t,c; k,c and b,o. I don't know if you were thinking about sortBy because of the other example, where the letters were consecutive. 2016-02-25 9:42 GMT+01:00 Guillermo Ortiz
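An alternative to Pregel, sketched here as my own assumption rather than anything proposed in the thread: plain RDD "pointer jumping", where every node keeps a pointer to the furthest node reached so far and each round follows that pointer one more hop until nothing changes. Using the example chain above:

  // (node, successor) pairs: a -> t -> k -> c, and b -> o
  var next = sc.parallelize(Seq(("a", "t"), ("b", "o"), ("t", "k"), ("k", "c")))

  var converged = false
  while (!converged) {
    val jumped = next
      .map { case (node, succ) => (succ, node) }
      .leftOuterJoin(next)                       // (succ, (node, successorOfSucc?))
      .map { case (succ, (node, further)) => (node, further.getOrElse(succ)) }
      .cache()
    // stop when no pointer moved in this round
    converged = jumped.join(next).filter { case (_, (a, b)) => a != b }.isEmpty()
    next = jumped
  }
  // next now holds (a,c), (t,c), (k,c), (b,o)

Because each round follows the pointer of a pointer, the covered distance doubles per iteration, so the number of rounds grows only logarithmically with the longest chain.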

Number partitions after a join

2016-02-25 Thread Guillermo Ortiz
When you do a join in Spark, how many partitions result? Is there a default number if you don't specify the number of partitions?

Re: What is the point of alpha value in Collaborative Filtering in MLlib ?

2016-02-25 Thread Hiroyuki Yamada
Hello Sean, Thank you very much for the quick response. That helps me a lot to understand it better ! Best regards, Hiro On Thu, Feb 25, 2016 at 6:59 PM, Sean Owen wrote: > This isn't specific to Spark; it's from the original paper. > > alpha doesn't do a whole lot, and it

Re: What is the point of alpha value in Collaborative Filtering in MLlib ?

2016-02-25 Thread Sean Owen
This isn't specific to Spark; it's from the original paper. alpha doesn't do a whole lot, and it is a global hyperparam. It controls the relative weight of observed versus unobserved user-product interactions in the factorization. Higher alpha means it's much more important to faithfully

Re: How to Exploding a Map[String,Int] column in a DataFrame (Scala)

2016-02-25 Thread Anthony Brew
Thanks guys both those answers are really helpful. Really appreciate this. On 25 Feb 2016 12:05 a.m., "Michał Zieliński" wrote: > Hi Anthony, > > Hopefully that's what you wanted :) > > case class Example(id:String, myMap: Map[String,Int]) >> val myDF =
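For the archive, a hedged sketch of one way to do this with explode on the MapType column (my own illustration, not necessarily the approach from the quoted answer):

  import org.apache.spark.sql.functions.explode

  case class Example(id: String, myMap: Map[String, Int])
  val myDF = sqlContext.createDataFrame(Seq(
    Example("a", Map("x" -> 1, "y" -> 2)),
    Example("b", Map("z" -> 3))
  ))

  // exploding a map column yields one row per entry, with "key" and "value" columns
  val exploded = myDF.select(myDF("id"), explode(myDF("myMap")))
  exploded.show()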

Re: Restricting number of cores not resulting in reduction in parallelism

2016-02-25 Thread ankursaxena86
Here's the log messages I'm getting upon calling of this command : /usr/local/spark/bin/spark-submit --verbose --driver-memory 14G --num-executors 1 --driver-cores 1 --executor-memory 10G --class org.apache.spark.mllib.feature.myclass --driver-java-options -Djava.library.path=/usr/local/lib

Re: Restricting number of cores not resulting in reduction in parallelism

2016-02-25 Thread ankursaxena86
Following are the configuration changes I made to spark-env.sh: export SPARK_EXECUTOR_INSTANCES=1 export SPARK_EXECUTOR_CORES=1 export SPARK_WORKER_CORES=1 export SPARK_WORKER_INSTANCES=1 And yes, I mean to execute only one task per node at a time. -- View this message in context:

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
I don't see that sorting the data helps. The answer has to be all the associations. In this case the answer has to be: a , b --> it was an error in the question, sorry. b , d c , d x , y y , y I feel like all the data which is associated should be in the same executor. In this case, if I order the