kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread Ankur Chauhan
Hi, I have a simple application which fails with the following exception only when the application is restarted (i.e. the checkpointDir has entries from a previous execution): Exception in thread main org.apache.spark.SparkException:

kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread ankurcha
Hi, I have a simple application which fails with the following exception only when the application is restarted (i.e. the checkpointDir has entries from a previous execution): Exception in thread main org.apache.spark.SparkException: org.apache.spark.streaming.dstream.ShuffledDStream@2264e43c

Backward compatibility with org.apache.spark.sql.api.java.Row class

2015-05-13 Thread Emerson Castañeda
Hello everyone, I'm adopting the latest version of Apache Spark in my project, moving from *1.2.x* to *1.3.x*, and the only significant incompatibility so far is related to the *Row* class. Any idea what happened to the *org.apache.spark.sql.api.java.Row* class in Apache Spark 1.3?

Increase maximum amount of columns for covariance matrix for principal components

2015-05-13 Thread Sebastian Alfers
Hello, in order to process a huge dataset, the number of columns used to calculate the covariance matrix is limited: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L129 What is the reason behind this limitation and can it

Re: value toDF is not a member of RDD object

2015-05-13 Thread SLiZn Liu
No, creating a DF using createDataFrame won’t work: val peopleDF = sqlContext.createDataFrame(people) The code compiles but raises the same error as toDF at the line above. On Wed, May 13, 2015 at 6:22 PM Sebastian Alfers

Fwd: value toDF is not a member of RDD object

2015-05-13 Thread SLiZn Liu
Are you sure that you are submitting it correctly? Can you post the entire command you are using to run the .jar file via spark-submit? Ok, here it is: /opt/spark-1.3.1-bin-hadoop2.6/bin/spark-submit target/scala-2.11/webcat_2.11-1.0.jar However, on the server somehow I have to specify main

Re: value toDF is not a member of RDD object

2015-05-13 Thread SLiZn Liu
Additionally, after I successfully packaged the code and submitted it via spark-submit webcat_2.11-1.0.jar, the following error was thrown at the line where toDF() was called: Exception in thread main java.lang.NoSuchMethodError:

Kafka Direct Approach + Zookeeper

2015-05-13 Thread James King
From: http://spark.apache.org/docs/latest/streaming-kafka-integration.html I'm trying to use the direct approach to read messages from Kafka. Kafka is running as a cluster and configured with Zookeeper. On the above page it mentions: In the Kafka parameters, you must specify either

[Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-13 Thread Ishwardeep Singh
Hi, I am using Spark SQL 1.3.1. I have created a dataFrame using the jdbc data source and am using the saveAsTable() method, but got the following 2 exceptions: java.lang.RuntimeException: Unsupported datatype DecimalType() at scala.sys.package$.error(package.scala:27) at

Re: how to monitor multi directories in spark streaming task

2015-05-13 Thread Ankur Chauhan
Hi, You could retroactively union an existing DStream with one from a newly created file. Then when another file is detected, you would need to re-union the stream and create another DStream. It seems like the implementation of FileInputDStream only

Re: how to monitor multi directories in spark streaming task

2015-05-13 Thread Ankur Chauhan
I would suggest creating one DStream per directory and then using StreamingContext#union(...) to get a union DStream. -- Ankur On 13/05/2015 00:53, hotdog wrote: I want to use fileStream in spark streaming to monitor multi hdfs directories,
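
A minimal Scala sketch of that suggestion (one fileStream per directory, then StreamingContext#union), assuming Spark 1.3-era streaming APIs and placeholder HDFS paths:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("multi-dir-monitor"), Seconds(60))

    // Placeholder directories -- one DStream per monitored directory.
    val dirs = Seq("hdfs:///logs/dir1", "hdfs:///logs/dir2", "hdfs:///logs/dir3")
    val perDirStreams = dirs.map { dir =>
      ssc.fileStream[LongWritable, Text, TextInputFormat](dir).map(_._2.toString)
    }

    // Union them into a single DStream and process as one stream.
    val allLines = ssc.union(perDirStreams)
    allLines.count().print()

    ssc.start()
    ssc.awaitTermination()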

Re: How to get applicationId for yarn mode(both yarn-client and yarn-cluster mode)

2015-05-13 Thread thanhtien522
Earthson wrote: Finally, I've found two ways: 1. search the output for something like Submitted application application_1416319392519_0115 2. use a specific AppName. We could query the ApplicationID (yarn) Hi Earthson, Can you explain more about case 2? How can we query the ApplicationID by the
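
As a hedged aside on the two options above: recent 1.x releases also expose the ID directly on the context, which can be simpler than parsing the submission output. A minimal sketch, assuming sc is the live SparkContext of a YARN-submitted application:

    // On YARN this returns the ID assigned by the ResourceManager,
    // e.g. "application_1416319392519_0115"; on other masters it is Spark's own ID.
    val appId = sc.applicationId
    println(s"running as $appId")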

com.esotericsoftware.kryo.KryoException: java.io.IOException: Stream is corrupted

2015-05-13 Thread Yifan LI
Hi, I was running our graphx application (which worked fine on Spark 1.2.0) but it failed on Spark 1.3.1 with the exception below. Does anyone have an idea about this issue? I guess it was caused by using the LZ4 codec? Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 54

Re: Building Spark

2015-05-13 Thread Stephen Boesch
Hi Akhil, Building with sbt tends to need around 3.5GB whereas maven requirements are much lower, around 1.7GB. So try using maven. For reference I have the following settings and both do compile. sbt would not work with lower values. $echo $SBT_OPTS -Xmx3012m -XX:MaxPermSize=512m

applications are still in progress?

2015-05-13 Thread Yifan LI
Hi, I have some applications that have finished (but actually failed earlier), and the WebUI shows Application myApp is still in progress. Also, in the eventlog folder, there are several log files like this: app-20150512***.inprogress So, I am wondering what the “inprogress” means… Thanks! :) Best,

Removing FINISHED applications and shuffle data

2015-05-13 Thread sayantini
Hi, Please help me with the two issues below: *Environment:* I am running my spark cluster in standalone mode. I am initializing the spark context from inside my tomcat server. I am setting the below properties in environment.sh in the $SPARK_HOME/conf directory

Spark and Flink

2015-05-13 Thread Pa Rö
hi, i use spark and flink in the same maven project, and now i get an exception when working with spark, while flink works well. the problem is transitive dependencies. maybe somebody knows a solution, or versions which work together. best regards paul ps: a cloudera maven repo for flink would be desirable my

Re: Kafka Direct Approach + Zookeeper

2015-05-13 Thread James King
Many thanks Cody and contributors for the help. On Wed, May 13, 2015 at 3:44 PM, Cody Koeninger c...@koeninger.org wrote: Either one will work, there is no semantic difference. The reason I designed the direct api to accept both of those keys is because they were used to define lists of

JavaPairRDD

2015-05-13 Thread Yasemin Kaya
Hi, I want to get a *JavaPairRDD<String, String>* from the tuple part of a *JavaPairRDD<String, Tuple2<String, String>>*. As an example: ( http://www.koctas.com.tr/reyon/el-aletleri/7,(0,1,0,0,0,0,0,0,46551)) in my *JavaPairRDD<String, Tuple2<String, String>>* and I want to get *( (46551),

Building Spark

2015-05-13 Thread Heisenberg Bb
I tried to build Spark on my local machine, Ubuntu 14.04 (4 GB RAM), and my system hangs (freezes). When I monitored system processes, the build process was found to consume 85% of my memory. Why does it need so many resources? Is there any efficient method to build Spark? Thanks Akhil

Re: Spark and Flink

2015-05-13 Thread Ted Yu
You can run the following command: mvn dependency:tree And see what jetty versions are brought in. Cheers On May 13, 2015, at 6:07 AM, Pa Rö paul.roewer1...@googlemail.com wrote: hi, i use spark and flink in the same maven project, now i get a exception on working with spark, flink

Kafka + Direct + Zookeeper

2015-05-13 Thread James King
I'm trying the Kafka Direct approach (for consuming) but when I use only this config: kafkaParams.put("group.id", groupdid); kafkaParams.put("zookeeper.connect", zookeeperHostAndPort + "/cb_kafka"); I get this Exception in thread main org.apache.spark.SparkException: Must specify metadata.broker.list or

Re: Building Spark

2015-05-13 Thread Emre Sevinc
My 2 cents: If you have Java 8, you don't need any extra settings for Maven. -- Emre Sevinç On Wed, May 13, 2015 at 3:02 PM, Stephen Boesch java...@gmail.com wrote: Hi Akhil, Building with sbt tends to need around 3.5GB whereas maven requirements are much lower , around 1.7GB. So try using

Re: Kafka Direct Approach + Zookeeper

2015-05-13 Thread Cody Koeninger
Either one will work, there is no semantic difference. The reason I designed the direct api to accept both of those keys is because they were used to define lists of brokers in pre-existing Kafka project apis. I don't know why the Kafka project chose to use 2 different configuration keys. On
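
A minimal sketch of the direct stream with the broker list supplied via metadata.broker.list, assuming Spark 1.3's streaming-kafka artifact; broker addresses and the topic name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(10))

    // The direct stream talks to brokers, not ZooKeeper, so list the brokers here.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")  // placeholders
    val topics = Set("my_topic")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()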

how to read lz4 compressed data using fileStream of spark streaming?

2015-05-13 Thread hotdog
in spark streaming, I want to use fileStream to monitor a directory. But the files in that directory are compressed using lz4. So the new lz4 files are not detected by the following code. How to detect these new files? val list_join_action_stream = ssc.fileStream[LongWritable, Text,
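
A hedged sketch of one thing worth checking: fileStream has an overload that takes an explicit file filter, so accepting *.lz4 names rules out the default filtering as the cause (whether the contents then decode correctly still depends on the Hadoop io.compression.codecs configuration). The directory below is a placeholder:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Explicit filter: only pick up files whose names end in .lz4.
    val list_join_action_stream = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs:///data/incoming",                        // placeholder directory
      (path: Path) => path.getName.endsWith(".lz4"),  // file filter
      newFilesOnly = true
    ).map(_._2.toString)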

Re: Reading Real Time Data only from Kafka

2015-05-13 Thread Cody Koeninger
As far as I can tell, Dibyendu's cons boil down to: 1. Spark checkpoints can't be recovered if you upgrade code 2. Some Spark transformations involve a shuffle, which can repartition data It's not accurate to imply that either one of those things is inherently a con of the direct stream api.

Worker Spark Port

2015-05-13 Thread James King
I understand that this port value is randomly selected. Is there a way to enforce which spark port a Worker should use?

Spark Sorted DataFrame Repartitioning

2015-05-13 Thread Night Wolf
Hi guys, If I load a dataframe via a sql context that has a SORT BY in the query and I want to repartition the data frame will it keep the sort order in each partition? I want to repartition because I'm going to run a Map that generates lots of data internally so to avoid Out Of Memory errors I

Re: Kafka Direct Approach + Zookeeper

2015-05-13 Thread Cody Koeninger
In my mind, this isn't really a producer vs consumer distinction, this is a broker vs zookeeper distinction. The producer apis talk to brokers. The low level consumer api (what direct stream uses) also talks to brokers. The high level consumer api talks to zookeeper, at least initially. TLDR;

Re: value toDF is not a member of RDD object

2015-05-13 Thread Todd Nist
I believe what Dean Wampler was suggesting is to use the sqlContext not the sparkContext (sc), which is where the createDataFrame function resides: https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.sql.SQLContext HTH. -Todd On Wed, May 13, 2015 at 6:00 AM, SLiZn Liu
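
A minimal sketch of the distinction, assuming a Spark 1.3 spark-shell-style session where sc is already available; the import of sqlContext.implicits._ is what brings toDF into scope:

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._   // required for rdd.toDF()

    val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))  // placeholder data

    val df1 = people.toDF()                       // via the implicit conversion
    val df2 = sqlContext.createDataFrame(people)  // explicit, no implicits needed
    df1.show()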

Re: Kafka Direct Approach + Zookeeper

2015-05-13 Thread James King
Many thanks Cody! On Wed, May 13, 2015 at 4:22 PM, Cody Koeninger c...@koeninger.org wrote: In my mind, this isn't really a producer vs consumer distinction, this is a broker vs zookeeper distinction. The producer apis talk to brokers. The low level consumer api (what direct stream uses)

Re: kafka + Spark Streaming with checkPointing fails to restart

2015-05-13 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing shows setting up your stream and calling .checkpoint(checkpointDir) inside the functionToCreateContext. It looks to me like you're setting up your stream and calling checkpoint outside, after getOrCreate. I'm not
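
A minimal sketch of the pattern the guide describes (all setup, including checkpoint(), inside the factory passed to getOrCreate); the checkpoint path and batch interval below are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/checkpoints/my-app"   // placeholder

    def functionToCreateContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("checkpointed-app"), Seconds(10))
      // ... build the Kafka DStream and all transformations here ...
      ssc.checkpoint(checkpointDir)   // checkpoint set up inside the factory
      ssc
    }

    // Only calls the factory when no usable checkpoint exists in checkpointDir.
    val ssc = StreamingContext.getOrCreate(checkpointDir, functionToCreateContext _)
    ssc.start()
    ssc.awaitTermination()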

Re: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-13 Thread Ishwardeep Singh
Hi, I am using spark-shell and the steps with which I can reproduce the issue are as follows: scala> val dateDimDF = sqlContext.load("jdbc", Map("url" -> "jdbc:teradata://192.168.145.58/DBS_PORT=1025,DATABASE=BENCHQADS,LOB_SUPPORT=OFF,USER=BENCHQADS,PASSWORD=abc", "dbtable" -> "date_dim")) scala>

Re: Reading Real Time Data only from Kafka

2015-05-13 Thread Dibyendu Bhattacharya
Thanks Cody for your email. I think my concern was not about getting the ordering of messages within a partition, which as you said is possible if one knows how Spark works. The issue is that Spark schedules jobs on every batch not in the same order they were generated. So if that is not guaranteed it

Re: Kafka Direct Approach + Zookeeper

2015-05-13 Thread James King
Looking at Consumer Configs in http://kafka.apache.org/documentation.html#consumerconfigs The properties *metadata.broker.list* or *bootstrap.servers* are not mentioned. Do I need these on the consumer side? On Wed, May 13, 2015 at 3:52 PM, James King jakwebin...@gmail.com wrote: Many thanks

Re: Reading Real Time Data only from Kafka

2015-05-13 Thread Cody Koeninger
You linked to a google mail tab, not a public archive, so I don't know exactly which conversation you're referring to. As far as I know, streaming only runs a single job at a time in the order they were defined, unless you turn on an experimental option for more parallelism (TD or someone more

Re: Trouble trying to run ./spark-ec2 script

2015-05-13 Thread Ted Yu
Is python installed on the machine where you ran ./spark-ec2 ? Cheers On Wed, May 13, 2015 at 1:33 PM, Su She suhsheka...@gmail.com wrote: I'm trying to set up my own cluster and am having trouble running this script: ./spark-ec2 --key-pair=xx --identity-file=xx.pem --region=us-west-2

Re: how to use rdd.countApprox

2015-05-13 Thread Du Li
I do rdd.countApprox() and rdd.sparkContext.setJobGroup() inside dstream.foreachRDD{...}. After calling cancelJobGroup(), the spark context seems no longer valid, which crashes subsequent jobs. My spark version is 1.1.1. I will do more investigation into this issue, perhaps after upgrading to

Spark performance in cluster mode using yarn

2015-05-13 Thread sachin Singh
Hi Friends, could someone please give me an idea of what the time (complete job execution) should ideally be for a spark job. I have data in a hive table; the amount of data is about 1GB, 2 lakh (200,000) rows for the whole month. I want to do monthly aggregation using SQL queries and groupBy. I have only one node,1

PostgreSQL JDBC Classpath Issue

2015-05-13 Thread George Adams
Hey all, I seem to be having an issue with PostgreSQL JDBC jar on my classpath. I’ve outlined the issue on Stack Overflow (http://stackoverflow.com/questions/30221677/spark-sql-postgresql-data-source-issues). I’m not sure how to fix this since I built the uber jar using sbt-assembly and the

Problem with current spark

2015-05-13 Thread Giovanni Paolo Gibilisco
Hi, I'm trying to run an application that uses a Hive context to perform some queries over JSON files. The code of the application is here: https://github.com/GiovanniPaoloGibilisco/spark-log-processor/tree/fca93d95a227172baca58d51a4d799594a0429a1 I can run it on Spark 1.3.1 after rebuilding it

Re: how to use rdd.countApprox

2015-05-13 Thread Tathagata Das
That is not supposed to happen :/ That is probably a bug. If you have the log4j logs, would be good to file a JIRA. This may be worth debugging. On Wed, May 13, 2015 at 12:13 PM, Du Li l...@yahoo-inc.com wrote: Actually I tried that before asking. However, it killed the spark context. :-) Du

Trouble trying to run ./spark-ec2 script

2015-05-13 Thread Su She
I'm trying to set up my own cluster and am having trouble running this script: ./spark-ec2 --key-pair=xx --identity-file=xx.pem --region=us-west-2 --zone=us-west-2c --num-slaves=1 launch my-spark-cluster based off: https://spark.apache.org/docs/latest/ec2-scripts.html It just tries to open the

Re: Trouble trying to run ./spark-ec2 script

2015-05-13 Thread Su She
Hmm, just tried to run it again, but opened the script with python; the cmd line seemed to pop up really quickly and then exited. On Wed, May 13, 2015 at 2:06 PM, Su She suhsheka...@gmail.com wrote: Hi Ted, Yes I do have Python 3.5 installed. I just ran py from the ec2 directory and it started up the

Re: how to use rdd.countApprox

2015-05-13 Thread Tathagata Das
You might get stage information through SparkListener. But I am not sure whether you can use that information to easily kill stages. Though i highly recommend using Spark 1.3.1 (or even Spark master). Things move really fast between releases. 1.1.1 feels really old to me :P TD On Wed, May 13,

force the kafka consumer process to different machines

2015-05-13 Thread hotdog
I'm using streaming integrated with streaming-kafka. My kafka topic has 80 partitions, while my machines have 40 cores. I found that when the job is running, the kafka consumer processes are only deployed to 2 machines, and the bandwidth of those 2 machines becomes very, very high. I wonder is there

Re: force the kafka consumer process to different machines

2015-05-13 Thread Akhil Das
With this lowlevel Kafka API https://github.com/dibbhatt/kafka-spark-consumer/, you can actually specify how many receivers you want to spawn, and most of the time it spawns them evenly. Usually you can put a sleep just after creating the context so the executors can connect to the driver, and then

Re: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-13 Thread ayan guha
Your stack trace says it can't convert date to integer. You sure about column positions? On 13 May 2015 21:32, Ishwardeep Singh ishwardeep.si...@impetus.co.in wrote: Hi , I am using Spark SQL 1.3.1. I have created a dataFrame using jdbc data source and am using saveAsTable() method but got

Re: Worker Spark Port

2015-05-13 Thread Cody Koeninger
I believe most ports are configurable at this point, look at http://spark.apache.org/docs/latest/configuration.html search for .port On Wed, May 13, 2015 at 9:38 AM, James King jakwebin...@gmail.com wrote: I understated that this port value is randomly selected. Is there a way to enforce
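
As a hedged illustration of the .port settings on that page: ports owned by the driver and executors can be pinned through SparkConf, while the standalone worker daemon's own listening port is typically set with SPARK_WORKER_PORT in spark-env.sh. The port numbers below are arbitrary examples:

    import org.apache.spark.SparkConf

    // Pin a couple of the otherwise randomly chosen ports.
    val conf = new SparkConf()
      .setAppName("fixed-ports")
      .set("spark.driver.port", "7078")        // driver RPC port
      .set("spark.blockManager.port", "7079")  // block manager port on driver and executors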

Re: force the kafka consumer process to different machines

2015-05-13 Thread Dibyendu Bhattacharya
or you can use this Receiver as well: http://spark-packages.org/package/dibbhatt/kafka-spark-consumer With it you can specify how many Receivers you need for your topic, and it will divide the partitions among the Receivers and return the joined stream for you. Say you specified 20 receivers, in

Re: Spark SQL: preferred syntax for column reference?

2015-05-13 Thread Dean Wampler
Is the $"foo" or mydf("foo") or both checked at compile time to verify that the column reference is valid? Thx. Dean On Wednesday, May 13, 2015, Michael Armbrust mich...@databricks.com wrote: I would not say that either method is preferred (neither is old/deprecated). One advantage to the second

Spark recovery takes long

2015-05-13 Thread NB
Hello Spark gurus, We have a spark streaming application that is consuming from a Flume stream and has some window operations. The input batch sizes are 1 minute and intermediate Window operations have window sizes of 1 minute, 1 hour and 6 hours. I enabled checkpointing and Write ahead log so

spark-streaming with flume error

2015-05-13 Thread ??
Hi all, I want to use spark-streaming with flume. Now I am in trouble: I don't know how to configure flume. I configure flume like this: a1.sources = r1 a1.channels = c1 c2 a1.sources.r1.type = avro a1.sources.r1.channels = c1 c2 a1.sources.r1.bind = 0.0.0.0 a1.sources.r1.port =

Re: JavaPairRDD

2015-05-13 Thread Tristan Blakers
You could use a map() operation, but the easiest way is probably to just call the values() method on the JavaPairRDD<A, B> to get a JavaRDD<B>. See this link: https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html Tristan On 13 May 2015 at 23:12, Yasemin Kaya
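
A small Scala sketch of the same idea (the thread is about the Java API, but the shape is identical): values() drops the key, while mapValues reshapes the value and keeps the key. The sample pair is a hypothetical stand-in for the (url, (bits, id)) records from the question:

    val pairs = sc.parallelize(Seq(
      ("http://www.koctas.com.tr/reyon/el-aletleri/7", ("0,1,0,0,0,0,0,0", "46551"))
    ))

    val tupleValues = pairs.values                           // RDD[(String, String)] -- keys dropped
    val urlToId     = pairs.mapValues { case (_, id) => id } // RDD[(String, String)] -- keep url, keep only the id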

Re: --jars works in yarn-client but not yarn-cluster mode, why?

2015-05-13 Thread Fengyun RAO
I look into the Environment in both modes. yarn-client: spark.jars local:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar,file:/home/xxx/my-app.jar yarn-cluster: spark.yarn.secondary.jars local:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar I

--jars works in yarn-client but not yarn-cluster mode, why?

2015-05-13 Thread Fengyun RAO
Hadoop version: CDH 5.4. We need to connect to HBase, thus need extra /opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar dependency. It works in yarn-client mode: spark-submit --class xxx.xxx.MyApp --master yarn-client --num-executors 10 --executor-memory 10g --jars

Re: JavaPairRDD

2015-05-13 Thread Yasemin Kaya
Thank you Tristan. It is totally what I am looking for :) 2015-05-14 5:05 GMT+03:00 Tristan Blakers tris...@blackfrog.org: You could use a map() operation, but the easiest way is probably to just call values() method on the JavaPairRDDA,B to get a JavaRDDB. See this link:

spark sql hive-shims

2015-05-13 Thread Lior Chaga
Hi, Using spark sql with HiveContext. Spark version is 1.3.1 When running local spark everything works fine. When running on spark cluster I get ClassNotFoundError org.apache.hadoop.hive.shims.Hadoop23Shims. This class belongs to hive-shims-0.23, and is a runtime dependency for spark-hive:

How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-13 Thread MEETHU MATHEW
Hi all, Quote: Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. How to run multiple jobs in one SparkContext using separate threads in pyspark? I found some examples in scala and java, but
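
A minimal Scala sketch of the pattern the quote describes (the asker mentions having seen Scala/Java examples); in pyspark the same idea applies with threading.Thread around separate actions on one SparkContext:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val rddA = sc.parallelize(1 to 1000000)
    val rddB = sc.parallelize(1 to 1000000)

    // Two independent actions submitted from separate threads share the one SparkContext
    // and can be scheduled concurrently.
    val fA = Future { rddA.map(_ * 2).count() }
    val fB = Future { rddB.filter(_ % 3 == 0).count() }

    println(Await.result(fA, Duration.Inf))
    println(Await.result(fB, Duration.Inf))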

Spark SQL: preferred syntax for column reference?

2015-05-13 Thread Diana Carroll
I'm just getting started with Spark SQL and DataFrames in 1.3.0. I notice that the Spark API shows a different syntax for referencing columns in a dataframe than the Spark SQL Programming Guide. For instance, the API docs for the select method show this: df.select($"colA", $"colB") Whereas the
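
A minimal sketch showing both spellings side by side, assuming a 1.3 shell-style session with sc in scope; the $-interpolator comes from the sqlContext implicits:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._   // enables the $"..." column syntax

    val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("colA", "colB")

    df.select($"colA", $"colB").show()        // API-doc style
    df.select(df("colA"), df("colB")).show()  // programming-guide style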

Word2Vec with billion-word corpora

2015-05-13 Thread Shilad Sen
Hi all, I'm experimenting with Spark's Word2Vec implementation for a relatively large (5B word, vocabulary size 4M) corpus. Has anybody had success running it at this scale? -Shilad -- Shilad W. Sen Associate Professor Mathematics, Statistics, and Computer Science Dept. Macalester College

Re: Kryo serialization of classes in additional jars

2015-05-13 Thread Akshat Aranya
I cherry-picked this commit into my local 1.2 branch. It fixed the problem with setting spark.serializer, but I get a similar problem with spark.closure.serializer org.apache.spark.SparkException: Failed to register classes with Kryo at
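
For reference, a hedged sketch of how classes are normally registered with Kryo via SparkConf (MyClass is a hypothetical application type; whether this helps with the closure-serializer error above depends on how the extra jars reach the executors):

    import org.apache.spark.SparkConf

    case class MyClass(id: Long, name: String)   // hypothetical application type

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyClass]))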

Re: Backward compatibility with org.apache.spark.sql.api.java.Row class

2015-05-13 Thread Michael Armbrust
Sorry for missing that in the upgrade guide. As part of unifying the Java and Scala interfaces we got rid of the Java-specific Row. You are correct in assuming that you want to use the Row in org.apache.spark.sql from both Scala and Java now. On Wed, May 13, 2015 at 2:48 AM, Emerson Castañeda
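
A minimal sketch of the unified class, usable from Scala (and, with the same accessor methods, from Java); the values are placeholders:

    import org.apache.spark.sql.Row

    val r: Row = Row("widget", 42)
    val name   = r.getString(0)
    val count  = r.getInt(1)
    println(s"$name -> $count")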

Re: how to use rdd.countApprox

2015-05-13 Thread Du Li
Hi TD, Thanks a lot. rdd.countApprox(5000).initialValue worked! Now my streaming app is standing a much better chance to complete processing each batch within the batch interval. Du On Tuesday, May 12, 2015 10:31 PM, Tathagata Das t...@databricks.com wrote: From the code it seems
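
A minimal sketch of the call that worked, spelling out the types involved (timeout in milliseconds; the bounds tighten as more partitions finish):

    val rdd = sc.parallelize(1 to 10000000)   // placeholder data

    // Returns a PartialResult[BoundedDouble]; initialValue is whatever estimate is
    // available once the 5000 ms timeout expires, instead of blocking for the exact count.
    val bounded = rdd.countApprox(timeout = 5000, confidence = 0.95).initialValue
    println(s"mean=${bounded.mean} low=${bounded.low} high=${bounded.high}")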

data schema and serialization format suggestions

2015-05-13 Thread Ankur Chauhan
Hi all, I want to get a general idea about best practices around data format serialization. Currently, I am using avro as the data serialization format but the emitted types aren't very scala friendly. So, I was wondering how others deal with this

Re: Spark on Mesos

2015-05-13 Thread Stephen Carman
Sander, I eventually solved this problem via the --[no-]switch_user flag, which is set to true by default. I set this to false, which would have the user that owns the process run the job, otherwise it was my username (scarman) running the job, which would fail because obviously my username

Re: Spark on Mesos

2015-05-13 Thread Tim Chen
Hi Stephen, You probably didn't run the Spark driver/shell as root, as Mesos scheduler will pick up your local user and tries to impersonate as the same user and chown the directory before executing any task. If you try to run Spark driver as root it should resolve the problem. No switch user

Re: Spark on Mesos

2015-05-13 Thread Stephen Carman
Yup, exactly as Tim mentioned too. I went back and tried what you just suggested and that was also perfectly fine. Steve On May 13, 2015, at 1:58 PM, Tim Chen t...@mesosphere.io wrote: Hi Stephen, You probably didn't run the Spark driver/shell as root, as Mesos

Word2Vec with billion-word corpora

2015-05-13 Thread Shilad Sen
Hi all, I'm experimenting with Spark's Word2Vec implementation for a relatively large (5B word, vocabulary size 4M, 400-dimensional vectors) corpus. Has anybody had success running it at this scale? Thanks in advance for your guidance! -Shilad -- Shilad W. Sen Associate Professor

Re: Worker Spark Port

2015-05-13 Thread James King
Indeed, many thanks. On Wednesday, 13 May 2015, Cody Koeninger c...@koeninger.org wrote: I believe most ports are configurable at this point, look at http://spark.apache.org/docs/latest/configuration.html search for .port On Wed, May 13, 2015 at 9:38 AM, James King jakwebin...@gmail.com

Re: how to use rdd.countApprox

2015-05-13 Thread Du Li
Hi TD, Do you know how to cancel the rdd.countApprox(5000) tasks after the timeout? Otherwise it keeps running until completion, producing results not used but consuming resources. Thanks,Du On Wednesday, May 13, 2015 10:33 AM, Du Li l...@yahoo-inc.com.INVALID wrote: Hi TD,

Re: how to set random seed

2015-05-13 Thread ayan guha
Easiest way is to broadcast it. On 13 May 2015 10:40, Charles Hayden charles.hay...@atigeo.com wrote: In pySpark, I am writing a map with a lambda that calls random.shuffle. For testing, I want to be able to give it a seed, so that successive runs will produce the same shuffle. I am looking
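
A hedged sketch of the broadcast-the-seed idea in Scala (the thread is about pySpark, where the same pattern uses sc.broadcast plus a seeded random.Random inside the function); deriving a per-partition seed keeps successive runs deterministic:

    import scala.util.Random

    val seed = sc.broadcast(42L)   // the seed every run should share

    val shuffled = sc.parallelize(1 to 100, numSlices = 4).mapPartitionsWithIndex {
      (partitionIndex, iter) =>
        // Same seed + partition index => identical shuffle order on every run.
        val rng = new Random(seed.value + partitionIndex)
        rng.shuffle(iter.toVector).iterator
    }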

Re: sparksql running slow while joining_2_tables.

2015-05-13 Thread luohui20001
Hi Hao: I tried broadcast join with the following steps, and found that my query is still running slow; I am not very sure if I'm doing the broadcast join right: 1. add spark.sql.autoBroadcastJoinThreshold 104857600 (100MB) in conf/spark-default.conf. 100MB is larger than either of my 2 tables. 2. start
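
A hedged sketch of setting the same threshold from inside the shell rather than the conf file, and of checking the physical plan before timing the query; the table names are placeholders for the two registered tables:

    // Raise the broadcast-join threshold to ~100MB at runtime.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)

    val joined = sqlContext.sql(
      """SELECT a.*, b.*
        |FROM big_table a JOIN small_table b ON a.id = b.id""".stripMargin)

    // If the small side is picked up for broadcast, the plan should show a broadcast join.
    joined.explain()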