Re: How to handle under-performing nodes in the cluster

2015-03-23 Thread Akhil Das
It seems that node is not being allocated enough tasks; try increasing your level of parallelism or do a manual repartition so that every node gets an even share of tasks to operate on. Thanks Best Regards On Fri, Mar 20, 2015 at 8:05 PM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi all, I have 6
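
For illustration, a minimal sketch of those two suggestions (the partition count and property value below are assumptions, not from the thread):

    // a) raise the default parallelism before creating the context
    val conf = new org.apache.spark.SparkConf().set("spark.default.parallelism", "48")

    // b) or explicitly repartition the skewed RDD so tasks spread evenly across the nodes
    val balanced = rdd.repartition(48)   // `rdd` stands for whatever RDD feeds the slow stage
    balanced.count()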

Re: Spark UI tunneling

2015-03-23 Thread Akhil Das
Did you try ssh -L 4040:127.0.0.1:4040 user@host Thanks Best Regards On Mon, Mar 23, 2015 at 1:12 PM, sergunok ser...@gmail.com wrote: Is it a way to tunnel Spark UI? I tried to tunnel client-node:4040 but my browser was redirected from localhost to some cluster locally visible domain

Re: can distinct transform applied on DStream?

2015-03-22 Thread Akhil Das
What do you mean by not distinct? It does work for me: [image: Inline image 1] Code: import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.{SparkContext, SparkConf} val ssc = new StreamingContext(sc, Seconds(1)) val data =
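
A minimal sketch of applying distinct through transform, assuming a socket source and a 1-second batch interval (both placeholders); note that distinct() runs on each batch RDD, not across the whole stream:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)           // hypothetical source
    val distinctPerBatch = lines.transform(rdd => rdd.distinct()) // de-duplicates within each batch
    distinctPerBatch.print()
    ssc.start()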

Re: Launching Spark Cluster Application through IDE

2015-03-20 Thread Akhil Das
From IntelliJ, you can use the remote debugging feature. http://stackoverflow.com/questions/19128264/how-to-remote-debug-in-intellij-12-1-4 For remote debugging, you need to pass the following: -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4000,suspend=n jvm options and configure your
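
As a rough sketch, the usual way to pass those options through spark-submit (the class and jar names are placeholders):

    # attach a debugger to the driver JVM, then connect IntelliJ's remote debugger to port 4000
    ./bin/spark-submit \
      --driver-java-options "-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4000,suspend=n" \
      --class com.example.MyApp myapp.jar

    # to debug the executors instead, pass the same agent string through
    # --conf "spark.executor.extraJavaOptions=-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4000,suspend=n"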

Re: Upgrade from Spark 1.1.0 to 1.1.1+ Issues

2015-03-20 Thread Akhil Das
clues why it happens only on v1.2.0 and above? Nothing else changes. Thanks, Eason On Tue, Mar 17, 2015 at 8:39 PM, Akhil Das ak...@sigmoidanalytics.com wrote: It's clearly saying: java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId; local class incompatible: stream

Re: Measure Bytes Read and Peak Memory Usage for Query

2015-03-20 Thread Akhil Das
You could do a cache and see the memory usage under Storage tab in the driver UI (runs on port 4040) Thanks Best Regards On Fri, Mar 20, 2015 at 12:02 PM, anu anamika.guo...@gmail.com wrote: Hi All I would like to measure Bytes Read and Peak Memory Usage for a Spark SQL Query. Please

Re: Load balancing

2015-03-20 Thread Akhil Das
1. If you are consuming data from Kafka or any other receiver based sources, then you can start 1-2 receivers per worker (assuming you'll have min 4 core per worker) 2. If you are having single receiver or is a fileStream then what you can do to distribute the data across machines is to do a
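
A hedged sketch of the multiple-receiver pattern for a Kafka source (the ZooKeeper address, group, topic and counts are placeholders; `ssc` is your StreamingContext):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 2   // roughly one or two receivers per worker with cores to spare
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
    }
    // union the receiver streams and repartition so processing spreads across the cluster
    val unified = ssc.union(streams).repartition(16)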

Re: Spark 1.2. loses often all executors

2015-03-20 Thread Akhil Das
Isn't that a feature? Rather than keep running a buggy pipeline, it just kills all the executors. You can always handle exceptions with a proper try/catch in your code though. Thanks Best Regards On Fri, Mar 20, 2015 at 3:51 PM, mrm ma...@skimlinks.com wrote: Hi, I recently changed from Spark 1.1. to Spark

Re: Database operations on executor nodes

2015-03-19 Thread Akhil Das
Totally depends on your database. If it's a NoSQL database like MongoDB/HBase etc. then you can use the native .saveAsNewAPIHadoopFile or .saveAsHadoopDataset etc. For SQL databases, I think people usually put that work on the driver, like you did. Thanks Best Regards On Wed, Mar 18, 2015 at
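
For the NoSQL case, a rough sketch of writing to HBase from the executors via the new-API Hadoop output format (the table/column names are assumptions, `pairs` stands for an RDD[(String, String)], and the exact Put API varies with the HBase version):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job

    val hconf = HBaseConfiguration.create()
    hconf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    val job = Job.getInstance(hconf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    pairs.map { case (key, value) =>
      val put = new Put(Bytes.toBytes(key))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))  // addColumn on newer HBase
      (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
    }.saveAsNewAPIHadoopDataset(job.getConfiguration)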

Re: MLlib Spam example gets stuck in Stage X

2015-03-19 Thread Akhil Das
Can you see where exactly it is spending time? Since, as you said, it gets to Stage 2, you will be able to see how much time it spent on Stage 1. If it's GC time, then try increasing the level of parallelism or repartitioning to something like sc.defaultParallelism * 3. Thanks Best Regards On Thu, Mar

Re: MLlib Spam example gets stuck in Stage X

2015-03-19 Thread Akhil Das
for the model? I have a Spark Master and 2 Workers running on CDH 5.3...what would the default spark-shell level of parallelism be...I thought it would be 3? Thank you for the help! -Su On Thu, Mar 19, 2015 at 12:32 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you see where

Re: Null pointer exception reading Parquet

2015-03-19 Thread Akhil Das
How are you running the application? Can you try running the same inside spark-shell? Thanks Best Regards On Wed, Mar 18, 2015 at 10:51 PM, sprookie cug12...@gmail.com wrote: Hi All, I am using Spark version 1.2 running locally. When I try to read a parquet file I get the below exception, what

Re: updateStateByKey performance API

2015-03-18 Thread Akhil Das
You can always throw more machines at this and see if the performance increases, since you haven't mentioned anything regarding your number of cores etc. Thanks Best Regards On Wed, Mar 18, 2015 at 11:42 AM, nvrs nvior...@gmail.com wrote: Hi all, We are having a few issues with the performance

Re: Spark Job History Server

2015-03-18 Thread Akhil Das
You can simply turn it on using: ./sbin/start-history-server.sh ​Read more here http://spark.apache.org/docs/1.3.0/monitoring.html.​ Thanks Best Regards On Wed, Mar 18, 2015 at 4:00 PM, patcharee patcharee.thong...@uni.no wrote: Hi, I am using spark 1.3. I would like to use Spark Job
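
The server only shows applications that wrote event logs, so something along these lines needs to be in place first (the log directory is a placeholder):

    # conf/spark-defaults.conf
    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs:///spark-event-logs
    spark.history.fs.logDirectory    hdfs:///spark-event-logs

    # then start the server and browse to port 18080
    ./sbin/start-history-server.sh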

Re: Using Spark with a SOCKS proxy

2015-03-18 Thread Akhil Das
Did you try ssh tunneling instead of SOCKS? Thanks Best Regards On Wed, Mar 18, 2015 at 5:45 AM, Kelly, Jonathan jonat...@amazon.com wrote: I'm trying to figure out how I might be able to use Spark with a SOCKS proxy. That is, my dream is to be able to write code in my IDE then run it

Re: Using Spark with a SOCKS proxy

2015-03-18 Thread Akhil Das
Did you try ssh tunneling instead of SOCKS? Thanks Best Regards On Wed, Mar 18, 2015 at 5:45 AM, Kelly, Jonathan jonat...@amazon.com wrote: I'm trying to figure out how I might be able to use Spark with a SOCKS proxy. That is, my dream is to be able to write code in my IDE then run it

Re: [spark-streaming] can shuffle write to disk be disabled?

2015-03-18 Thread Akhil Das
I think you can disable it with spark.shuffle.spill=false Thanks Best Regards On Wed, Mar 18, 2015 at 3:39 PM, Darren Hoo darren@gmail.com wrote: Thanks, Shao On Wed, Mar 18, 2015 at 3:34 PM, Shao, Saisai saisai.s...@intel.com wrote: Yeah, as I said your job processing time is much

Re: Spark Job History Server

2015-03-18 Thread Akhil Das
) at java.lang.Class.forName(Class.java:191) at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:183) at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala) Patcharee On 18. mars 2015 11:35, Akhil Das wrote: You can simply turn it on using

Re: What is best way to run spark job in yarn-cluster mode from java program(servlet container) and NOT using spark-submit command.

2015-03-17 Thread Akhil Das
Create a SparkContext, set the master as yarn-cluster, and then run it as a standalone program? Thanks Best Regards On Tue, Mar 17, 2015 at 1:27 AM, rrussell25 rrussel...@gmail.com wrote: Hi, were you ever able to determine a satisfactory approach for this problem? I have a similar situation and would

Re: Spark @ EC2: Futures timed out Ask timed out

2015-03-17 Thread Akhil Das
Did you launch the cluster using spark-ec2 script? Just make sure all ports are open for master, slave instances security group. From the error, it seems its not able to connect to the driver program (port 58360) Thanks Best Regards On Tue, Mar 17, 2015 at 3:26 AM, Otis Gospodnetic

Re: Upgrade from Spark 1.1.0 to 1.1.1+ Issues

2015-03-17 Thread Akhil Das
both versions on the project and the cluster. Any clues? Even the sample code from Spark website failed to work. Thanks, Eason On Sun, Mar 15, 2015 at 11:56 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you change both the versions? The one in your build file of your project

Re: Any IRC channel on Spark?

2015-03-17 Thread Akhil Das
There's one on Freenode; you can join #Apache-Spark. There are around 60 people idling. :) Thanks Best Regards On Mon, Mar 16, 2015 at 10:46 PM, Feng Lin lfliu.x...@gmail.com wrote: Hi, everyone, I'm wondering whether there is a possibility to set up an official IRC channel on freenode. I

Re: Spark Streaming with compressed xml files

2015-03-16 Thread Akhil Das
One approach would be: if you are using fileStream, you can access the individual filenames from the partitions, and with that filename you can apply your decompression/parsing logic and get it done. Like: UnionPartition upp = (UnionPartition) ds.values().getPartitions()[i];

Re: Upgrade from Spark 1.1.0 to 1.1.1+ Issues

2015-03-16 Thread Akhil Das
Did you change both the versions? The one in your build file of your project and the spark version of your cluster? Thanks Best Regards On Sat, Mar 14, 2015 at 6:47 AM, EH eas...@gmail.com wrote: Hi all, I've been using Spark 1.1.0 for a while, and now would like to upgrade to Spark 1.1.1

Re: org.apache.spark.SparkException Error sending message

2015-03-16 Thread Akhil Das
Not sure if this will help, but can you try setting the following: .set("spark.core.connection.ack.wait.timeout", "6000") Thanks Best Regards On Sat, Mar 14, 2015 at 4:08 AM, Chen Song chen.song...@gmail.com wrote: When I ran a Spark SQL query (a simple group-by query) via Hive support, I have seen

Re: how to print RDD by key into file with grouByKey

2015-03-16 Thread Akhil Das
If you want more partitions then you have to specify it as: rdd.groupByKey(10).mapValues... I think if you don't specify anything, the number of partitions will be the number of cores that you have for processing. Thanks Best Regards On Sat, Mar 14, 2015 at 12:28 AM, Adrian Mocanu amoc...@verticalscope.com
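
A small sketch of the idea (the partition count and output path are assumptions; `rdd` stands for a pair RDD):

    // 10 partitions for the shuffle; without the argument a default parallelism is used
    val grouped = rdd.groupByKey(10)
    grouped
      .mapValues(values => values.mkString(","))   // one output line per key
      .saveAsTextFile("hdfs:///output/by-key")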

Re: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-16 Thread Akhil Das
If you use fileStream, there's an option to filter out files. In your case you can easily create a filter to remove _temporary files. In that case, you will have to move your codes inside foreachRDD of the dstream since the application will become a streaming app. Thanks Best Regards On Sat, Mar
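
A hedged sketch of a fileStream whose path filter skips _temporary (and other underscore-prefixed) entries; the input directory is a placeholder and `ssc` is your StreamingContext:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
        "s3n://my-bucket/incoming/",
        (path: Path) => !path.getName.startsWith("_"),   // ignore _temporary etc.
        newFilesOnly = true
      ).map(_._2.toString)

    lines.foreachRDD { rdd =>
      // the logic that previously ran as a batch job goes here
    }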

Re: How to set Spark executor memory?

2015-03-16 Thread Akhil Das
How are you setting it? and how are you submitting the job? Thanks Best Regards On Mon, Mar 16, 2015 at 12:52 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I have set spark.executor.memory to 2048m, and in the UI Environment page, I can see this value has been set correctly. But in the
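
For reference, the two usual ways (values are examples): pass --executor-memory 2g to spark-submit, or set it in code before the context is created, as in the sketch below. Note that in local mode the executor lives inside the driver JVM, so --driver-memory is what actually matters there.

    // has no effect once the SparkContext already exists
    val conf = new org.apache.spark.SparkConf()
      .setAppName("my-app")
      .set("spark.executor.memory", "2g")
    val sc = new org.apache.spark.SparkContext(conf)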

Re: k-means hang without error/warning

2015-03-16 Thread Akhil Das
How many threads are you allocating while creating the SparkContext? e.g. local[4] will allocate 4 threads. You can try increasing it to a higher number, and also try setting the level of parallelism to a higher number. Thanks Best Regards On Mon, Mar 16, 2015 at 9:55 AM, Xi Shen davidshe...@gmail.com

Re: Question about Spark Streaming Receiver Failure

2015-03-16 Thread Akhil Das
You need to figure out why the receivers failed in the first place. Look in your worker logs and see what really happened. When you run a streaming job continuously for longer period mostly there'll be a lot of logs (you can enable log rotation etc.) and if you are doing a groupBy, join, etc type

Re: unable to access spark @ spark://debian:7077

2015-03-16 Thread Akhil Das
Try setting SPARK_MASTER_IP and you need to use the Spark URI (spark://yourlinuxhost:7077) as displayed in the top left corner of Spark UI (running on port 8080). Also when you are connecting from your mac, make sure your network/firewall isn't blocking any port between the two machines. Thanks

Re: start-slave.sh failed with ssh port other than 22

2015-03-16 Thread Akhil Das
Open sbin/slaves.sh and sbin/spark-daemon.sh and then look for the ssh command; pass the port argument to that command (in your case -p 58518), save those files, and do a start-all.sh :) Thanks Best Regards On Mon, Mar 16, 2015 at 1:37 PM, ZhuGe t...@outlook.com wrote: Hi all: I am new to spark
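
Depending on the Spark version, sbin/slaves.sh may also honor the SPARK_SSH_OPTS environment variable, which avoids editing the scripts (the port number is taken from the thread):

    export SPARK_SSH_OPTS="-p 58518"
    ./sbin/start-all.sh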

Re: How to set Spark executor memory?

2015-03-16 Thread Akhil Das
, 2015 at 1:52 PM, Xi Shen davidshe...@gmail.com wrote: I set it in code, not by configuration. I submit my jar file to local. I am working in my developer environment. On Mon, 16 Mar 2015 18:28 Akhil Das ak...@sigmoidanalytics.com wrote: How are you setting it? and how are you submitting the job

Re: Processing of text file in large gzip archive

2015-03-16 Thread Akhil Das
1. I don't think textFile is capable of unpacking a .gz file. You need to use hadoopFile or newAPIHadoop file for this. 2. Instead of map, do a mapPartitions 3. You need to open the driver UI and see what's really taking time. If that is running on a remote machine and you are not able to access
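
A hedged sketch of points 1 and 2 (the path and the parsing are placeholders):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // a single .gz file is not splittable, so it arrives as one partition; repartition after reading
    val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/archive.gz")
      .map(_._2.toString)
      .repartition(64)

    // mapPartitions pays any per-record setup cost (parsers, connections) once per partition
    val parsed = lines.mapPartitions { iter =>
      val parse = (line: String) => line.split('\t')   // placeholder parsing logic
      iter.map(parse)
    }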

Re: How to set Spark executor memory?

2015-03-16 Thread Akhil Das
: Hi Akhil, Yes, you are right. If I ran the program from IDE as a normal java program, the executor's memory is increased...but not to 2048m, it is set to 6.7GB...Looks like there's some formula to calculate this value. Thanks, David On Mon, Mar 16, 2015 at 7:36 PM Akhil Das ak

Re: How to set Spark executor memory?

2015-03-16 Thread Akhil Das
of executor memory, it should be 2g * 0.6 = 1.2g. My machine has 56GB memory, and 0.6 of that should be 33.6G...I hate math xD On Mon, Mar 16, 2015 at 7:59 PM Akhil Das ak...@sigmoidanalytics.com wrote: How much memory are you having on your machine? I think default value is 0.6

Re: spark sql performance

2015-03-13 Thread Akhil Das
That totally depends on your data size and your cluster setup. Thanks Best Regards On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal udbhav.agar...@syncoms.com wrote: Hi, What is query time for join query on hbase with spark sql. Say tables in hbase have 0.5 million records each. I am

Re: set up spark cluster with heterogeneous hardware

2015-03-13 Thread Akhil Das
You could also add SPARK_MASTER_IP to bind to a specific host/IP so that it won't get confused with those hosts in your /etc/hosts file. Thanks Best Regards On Fri, Mar 13, 2015 at 12:00 PM, Du Li l...@yahoo-inc.com.invalid wrote: Hi Spark community, I searched for a way to configure a

Re: spark sql performance

2015-03-13 Thread Akhil Das
,* *Udbhav Agarwal* *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] *Sent:* 13 March, 2015 12:01 PM *To:* Udbhav Agarwal *Cc:* user@spark.apache.org *Subject:* Re: spark sql performance That totally depends on your data size and your cluster setup. Thanks Best Regards

Re: spark sql performance

2015-03-13 Thread Akhil Das
,* *Udbhav Agarwal* *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] *Sent:* 13 March, 2015 12:27 PM *To:* Udbhav Agarwal *Cc:* user@spark.apache.org *Subject:* Re: spark sql performance So you can cache upto 8GB of data in memory (hope your data size of one table is 2GB

Re: spark sql performance

2015-03-13 Thread Akhil Das
was running the query on one machine with 3gm ram and the join query was taking around 6 seconds. *Thanks,* *Udbhav Agarwal* *From:* Udbhav Agarwal *Sent:* 13 March, 2015 12:45 PM *To:* 'Akhil Das' *Cc:* user@spark.apache.org *Subject:* RE: spark sql performance Okay Akhil! Thanks

Re: KafkaUtils and specifying a specific partition

2015-03-13 Thread Akhil Das
Here's a simple consumer which does that https://github.com/dibbhatt/kafka-spark-consumer/ Thanks Best Regards On Thu, Mar 12, 2015 at 10:28 PM, ColinMc colin.mcqu...@shiftenergy.com wrote: Hi, How do you use KafkaUtils to specify a specific partition? I'm writing customer Marathon jobs

Re: Error running rdd.first on hadoop

2015-03-13 Thread Akhil Das
Make sure your hadoop is running on port 8020, you can check it in your core-site.xml file and use that URI like: sc.textFile(hdfs://myhost:myport/data) Thanks Best Regards On Fri, Mar 13, 2015 at 5:15 AM, Lau, Kawing (GE Global Research) kawing@ge.com wrote: Hi I was running with

Re: spark sql performance

2015-03-13 Thread Akhil Das
: Lets say am using 4 machines with 3gb ram. My data is customers records with 5 columns each in two tables with 0.5 million records. I want to perform join query on these two tables. *Thanks,* *Udbhav Agarwal* *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] *Sent:* 13 March, 2015 12

Re: Using rdd methods with Dstream

2015-03-13 Thread Akhil Das
Like this? dstream.repartition(1).mapPartitions(it => it.take(5)) Thanks Best Regards On Fri, Mar 13, 2015 at 4:11 PM, Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote: Hi, I normally use dstream.transform whenever I need to use methods which are available in the RDD API but not in streaming

Re: connecting spark application with SAP hana

2015-03-12 Thread Akhil Das
SAP hana can be integrated with hadoop http://saphanatutorial.com/sap-hana-and-hadoop/, so you will be able to read/write to it using newAPIHadoopFile api of spark by passing the correct Configurations etc. Thanks Best Regards On Thu, Mar 12, 2015 at 1:15 PM, Hafiz Mujadid

Re: Read parquet folders recursively

2015-03-12 Thread Akhil Das
)) } val baseStatus = fs.getFileStatus(basePath) if (baseStatus.isDir) recurse(basePath) else Array(baseStatus) } — Best Regards! Yijie Shen On March 12, 2015 at 2:35:49 PM, Akhil Das (ak...@sigmoidanalytics.com) wrote: Hi We have a custom build to read directories

Re: sc.textFile() on windows cannot access UNC path

2015-03-12 Thread Akhil Das
3.Call sc.newAPIHadoopFile(…) with sc.newAPIHadoopFile[LongWritable, Text, UncTextInputFormat](“file: 10.196.119.230/folder1/abc.txt”, classOf[UncTextInputFormat], classOf[LongWritable], classOf[Text], conf) Ningjun *From:* Akhil Das

Re: hbase sql query

2015-03-12 Thread Akhil Das
Like this? val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result]).cache() Here's a complete example

Re: bad symbolic reference. A signature in SparkContext.class refers to term conf in value org.apache.hadoop which is not available

2015-03-12 Thread Akhil Das
Spark 1.3.0 is not officially out yet, so I don't think sbt will download the Hadoop dependencies for your Spark build by itself. You could try manually adding the Hadoop dependencies yourself (hadoop-core, hadoop-common, hadoop-client). Thanks Best Regards On Wed, Mar 11, 2015 at 9:07 PM, Patcharee
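
A minimal build.sbt fragment along those lines (the versions here are illustrative, not from the thread):

    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % "1.2.1" % "provided",
      "org.apache.hadoop" %  "hadoop-client" % "2.4.0"
    )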

Re: StreamingListener

2015-03-12 Thread Akhil Das
At the end of foreachRDD, I believe. Thanks Best Regards On Thu, Mar 12, 2015 at 6:48 AM, Corey Nolet cjno...@gmail.com wrote: Given the following scenario: dstream.map(...).filter(...).window(...).foreachRDD() When would the onBatchCompleted fire?

Re: Read parquet folders recursively

2015-03-12 Thread Akhil Das
Hi, we have a custom build to read directories recursively. Currently we use it with fileStream like: val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/datadumps/", (t: Path) => true, true, true) Making the 4th argument true reads recursively. You could give it a try

Re: Temp directory used by spark-submit

2015-03-11 Thread Akhil Das
After setting SPARK_LOCAL_DIRS/SPARK_WORKER_DIR you need to restart your spark instances (stop-all.sh and start-all.sh), You can also try setting java.io.tmpdir while creating the SparkContext. Thanks Best Regards On Wed, Mar 11, 2015 at 1:47 AM, Justin Yip yipjus...@prediction.io wrote:

Re: Pyspark not using all cores

2015-03-11 Thread Akhil Das
Can you paste your complete spark-submit command? Also did you try specifying *--worker-cores*? Thanks Best Regards On Tue, Mar 10, 2015 at 9:00 PM, htailor hemant.tai...@live.co.uk wrote: Hi All, I need some help with a problem in pyspark which is causing a major issue. Recently I've

Re: Spark fpg large basket

2015-03-11 Thread Akhil Das
wrote: I am running on a 4 workers cluster each having between 16 to 30 cores and 50 GB of ram On Wed, 11 Mar 2015 8:55 am Akhil Das ak...@sigmoidanalytics.com wrote: Depending on your cluster setup (cores, memory), you need to specify the parallelism/repartition the data. Thanks Best

Re: sc.textFile() on windows cannot access UNC path

2015-03-11 Thread Akhil Das
...@lexisnexis.com wrote: This sounds like the right approach. Is there any sample code showing how to use sc.newAPIHadoopFile ? I am new to Spark and don’t know much about Hadoop. I just want to read a text file from UNC path into an RDD. Thanks *From:* Akhil Das [mailto:ak

Re: Spark fpg large basket

2015-03-11 Thread Akhil Das
Depending on your cluster setup (cores, memory), you need to specify the parallelism/repartition the data. Thanks Best Regards On Wed, Mar 11, 2015 at 12:18 PM, Sean Barzilay sesnbarzi...@gmail.com wrote: Hi I am currently using spark 1.3.0-snapshot to run the fpg algorithm from the mllib

Re: SocketTextStream not working from messages sent from other host

2015-03-11 Thread Akhil Das
Maybe you can use this code for your purpose: https://gist.github.com/akhld/4286df9ab0677a555087 It basically sends the content of the given file through a socket (both IO/NIO); I used it for a benchmark between IO and NIO. Thanks Best Regards On Wed, Mar 11, 2015 at 11:36 AM, Cui Lin

Re: S3 SubFolder Write Issues

2015-03-11 Thread Akhil Das
Does it write anything in BUCKET/SUB_FOLDER/output? Thanks Best Regards On Wed, Mar 11, 2015 at 10:15 AM, cpalm3 cpa...@gmail.com wrote: Hi All, I am hoping someone has seen this issue before with S3, as I haven't been able to find a solution for this problem. When I try to save as Text

Re: Read Parquet file from scala directly

2015-03-10 Thread Akhil Das
Here's a Java version https://github.com/cloudera/parquet-examples/tree/master/MapReduce It won't be that hard to make that in Scala. Thanks Best Regards On Mon, Mar 9, 2015 at 9:55 PM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, I have a lot of parquet files, and I try to open them

Re: saveAsTextFile extremely slow near finish

2015-03-10 Thread Akhil Das
Don't you think 1000 partitions is too few for 160GB of data? You could also try using the KryoSerializer and enabling RDD compression. Thanks Best Regards On Mon, Mar 9, 2015 at 11:01 PM, mingweili0x m...@spokeo.com wrote: I'm basically running a sort using Spark. The Spark program will read from HDFS,
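
The settings mentioned, as they might look in code (a sketch; the paths and partition counts are assumptions):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.rdd.compress", "true")
    val sc = new org.apache.spark.SparkContext(conf)

    // read and shuffle with considerably more partitions than 1000 for ~160 GB of input
    val input  = sc.textFile("hdfs:///input", minPartitions = 4000)
    val sorted = input.sortBy(identity, ascending = true, numPartitions = 4000)
    sorted.saveAsTextFile("hdfs:///output")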

Re: Spark with Spring

2015-03-10 Thread Akhil Das
It would be good if you could explain the entire use case, like what kind of requests, what sort of processing, etc. Thanks Best Regards On Mon, Mar 9, 2015 at 11:18 PM, Tarun Garg bigdat...@live.com wrote: Hi, I have an existing web-based system which receives requests and processes them. This

Re: Joining data using Latitude, Longitude

2015-03-10 Thread Akhil Das
Are you using Spark SQL for the join? In that case I'm not quite sure you have a lot of options to join on the nearest coordinate. If you are using normal Spark code (by creating a key pair on lat,lon) you can apply certain logic like trimming the lat,lon etc. If you want more specific computing

Re: A strange problem in spark sql join

2015-03-09 Thread Akhil Das
Make sure you don't have two master instances running on the same machine. It could happen that you were running the job and in the middle you tried to stop the cluster, which didn't completely stop it, and then you did a start-all again, which would end up with 2 master instances running,

Re: A way to share RDD directly using Tachyon?

2015-03-09 Thread Akhil Das
Did you try something like: myRDD.saveAsObjectFile(tachyon://localhost:19998/Y) val newRDD = sc.objectFile[MyObject](tachyon://localhost:19998/Y) Thanks Best Regards On Sun, Mar 8, 2015 at 3:59 PM, Yijie Shen henry.yijies...@gmail.com wrote: Hi, I would like to share a RDD in several Spark

Re: Using 1.3.0 client jars with 1.2.1 assembly in yarn-cluster mode

2015-03-08 Thread Akhil Das
Mostly, when you use different versions of jars, it will throw up incompatible version errors. Thanks Best Regards On Fri, Mar 6, 2015 at 7:38 PM, Zsolt Tóth toth.zsolt@gmail.com wrote: Hi, I submit spark jobs in yarn-cluster mode remotely from java code by calling

Re: Help with transformWith in SparkStreaming

2015-03-08 Thread Akhil Das
You could do it like this: val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1: RDD[(String,String)], rdd2: RDD[Int]) => { var first = ""; var second = ""; var third = 0

Re: distcp on ec2 standalone spark cluster

2015-03-08 Thread Akhil Das
Did you follow these steps? https://wiki.apache.org/hadoop/AmazonS3 Also make sure your jobtracker/mapreduce processes are running fine. Thanks Best Regards On Sun, Mar 8, 2015 at 7:32 AM, roni roni.epi...@gmail.com wrote: Did you get this to work? I got pass the issues with the cluster not

Re: Loading previously serialized object to Spark

2015-03-08 Thread Akhil Das
Can you paste the complete code? Thanks Best Regards On Sat, Mar 7, 2015 at 2:25 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, I've implemented class MyClass in MLlib that does some operation on LabeledPoint. MyClass extends serializable, so I can map this operation on data of

Re: spark-stream programme failed on yarn-client

2015-03-06 Thread Akhil Das
Looks like an issue with your YARN setup; could you try doing a simple example with spark-shell? Start the spark shell as: $ MASTER=yarn-client bin/spark-shell and then, at the spark-shell prompt, run: sc.parallelize(1 to 1000).collect If that doesn't work, then make sure your YARN services are up and running and in

Re: spark-ec2 script problems

2015-03-05 Thread Akhil Das
It works pretty fine for me with the script that comes with the 1.2.0 release. Here are a few things which you can try: - Add your S3 credentials to core-site.xml: <property><name>fs.s3.awsAccessKeyId</name><value>ID</value></property> <property><name>fs.s3.awsSecretAccessKey</name><value>SECRET</value></property> - Do a

Re: Managing permissions when saving as text file

2015-03-05 Thread Akhil Das
Why not setup HDFS? Thanks Best Regards On Thu, Mar 5, 2015 at 4:03 PM, didmar marin.did...@gmail.com wrote: Hi, I'm having a problem involving file permissions on the local filesystem. On a first machine, I have two different users : - launcher, which launches my job from an uber jar

Re: using log4j2 with spark

2015-03-05 Thread Akhil Das
You may exclude the log4j dependency while building. You can have a look at this build file to see how to exclude libraries http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/missing_dependencies_in_jar_files.html Thanks Best Regards On Thu, Mar 5, 2015 at 1:20

Re: How to parse Json formatted Kafka message in spark streaming

2015-03-05 Thread Akhil Das
When you use KafkaUtils.createStream with StringDecoders, it will return String objects inside your messages stream. To access the elements from the JSON, you could do something like the following: val mapStream = messages.map(x => { val mapper = new ObjectMapper() with ScalaObjectMapper
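
A slightly fuller sketch of that approach using jackson-module-scala (the field types and the per-record mapper are simplifications; `messages` is the (key, value) stream from KafkaUtils.createStream):

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule
    import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

    val mapStream = messages.map { case (_, json) =>
      // creating a mapper per record keeps the example short; in practice build it once per partition
      val mapper = new ObjectMapper() with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)
      mapper.readValue[Map[String, Any]](json)
    }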

Re: spark.local.dir leads to Job cancelled because SparkContext was shut down

2015-03-04 Thread Akhil Das
When you say multiple directories, make sure those directories are available and spark have permission to write to those directories. You can look at the worker logs to see the exact reason of failure. Thanks Best Regards On Tue, Mar 3, 2015 at 6:45 PM, lisendong lisend...@163.com wrote: As

Re: spark master shut down suddenly

2015-03-04 Thread Akhil Das
You can check in the mesos logs and see whats really happening. Thanks Best Regards On Wed, Mar 4, 2015 at 3:10 PM, lisendong lisend...@163.com wrote: 15/03/04 09:26:36 INFO ClientCnxn: Client session timed out, have not heard from server in 26679ms for sessionid 0x34bbf3313a8001b, closing

Re: delay between removing the block manager of an executor, and marking that as lost

2015-03-04 Thread Akhil Das
You can look at the following - spark.akka.timeout - spark.akka.heartbeat.pauses from http://spark.apache.org/docs/1.2.0/configuration.html Thanks Best Regards On Tue, Mar 3, 2015 at 4:46 PM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, Is there any relation between removing

Re:

2015-03-04 Thread Akhil Das
You may look at https://issues.apache.org/jira/browse/SPARK-4516 Thanks Best Regards On Wed, Mar 4, 2015 at 12:25 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I got this error message: 15/03/03 10:22:41 ERROR OneForOneBlockFetcher: Failed while starting block fetches

Re: Unable to submit spark job to mesos cluster

2015-03-04 Thread Akhil Das
Looks like you are having 2 netty jars in the classpath. Thanks Best Regards On Wed, Mar 4, 2015 at 5:14 PM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: From the lines pointed in the exception log, I figured out that my code is unable to get the spark context. To isolate

Re: gc time too long when using mllib als

2015-03-03 Thread Akhil Das
You need to increase the parallelism/repartition the data to a higher number to get rid of those. Thanks Best Regards On Tue, Mar 3, 2015 at 2:26 PM, lisendong lisend...@163.com wrote: why is the gc time so long? I'm using ALS in MLlib, and the garbage collection time is too long

Re: One of the executor not getting StopExecutor message

2015-03-03 Thread Akhil Das
communication issue. If I try to take a thread dump of the executor once it appears to be in trouble, then a timeout happens. Can it be something related to spark.akka.threads? On Fri, Feb 27, 2015 at 3:55 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Mostly, that particular executor

Re: LBGFS optimizer performace

2015-03-02 Thread Akhil Das
Can you try increasing your driver memory, reducing the executors and increasing the executor memory? Thanks Best Regards On Tue, Mar 3, 2015 at 10:09 AM, Gustavo Enrique Salazar Torres gsala...@ime.usp.br wrote: Hi there: I'm using LBFGS optimizer to train a logistic regression model. The

Re: unsafe memory access in spark 1.2.1

2015-03-02 Thread Akhil Das
Not sure, but it could be related to the Netty off-heap access as described here https://issues.apache.org/jira/browse/SPARK-4516, though the message was different. Thanks Best Regards On Mon, Mar 2, 2015 at 12:51 AM, Zalzberg, Idan (Agoda) idan.zalzb...@agoda.com wrote: Thanks, We

Re: Architecture of Apache Spark SQL

2015-03-02 Thread Akhil Das
Here's the whole tech stack around it: [image: Inline image 1] For a bit more detail you can refer to this slide http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014?related=1 The previous project was Shark (SQL over Spark); you can read about it here

Re: External Data Source in Spark

2015-03-02 Thread Akhil Das
Wouldn't it be possible with .saveAsNewAPIHadoopFile? How are you pushing the filters and projections currently? Thanks Best Regards On Tue, Mar 3, 2015 at 1:11 AM, Addanki, Santosh Kumar santosh.kumar.adda...@sap.com wrote: Hi Colleagues, Currently we have implemented External Data

Re: Running in-memory SQL on streamed relational data

2015-02-28 Thread Akhil Das
I think you can do simple operations like foreachRDD or transform to get access to the RDDs in the stream and then you can do SparkSQL over it. Thanks Best Regards On Sat, Feb 28, 2015 at 3:27 PM, Ashish Mukherjee ashish.mukher...@gmail.com wrote: Hi, I have been looking at Spark Streaming
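
A rough sketch of the foreachRDD + Spark SQL pattern on Spark 1.3 (the record shape, delimiter and query are assumptions; `stream` stands for a DStream[String]):

    case class Event(user: String, amount: Double)

    stream.foreachRDD { rdd =>
      // in practice reuse a single SQLContext instead of creating one per batch
      val sqlContext = new org.apache.spark.sql.SQLContext(rdd.sparkContext)
      import sqlContext.implicits._

      val df = rdd.map { line => val p = line.split(","); Event(p(0), p(1).toDouble) }.toDF()
      df.registerTempTable("events")
      sqlContext.sql("SELECT user, SUM(amount) AS total FROM events GROUP BY user").show()
    }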

Re: Spark partial data in memory/and partial in disk

2015-02-27 Thread Akhil Das
You can use persist(StorageLevel.MEMORY_AND_DISK) if you are not having sufficient memory to cache everything. Thanks Best Regards On Fri, Feb 27, 2015 at 7:20 PM, Siddharth Ubale siddharth.ub...@syncoms.com wrote: Hi, How do we manage putting partial data in to memory and partial into

Re: SparkStreaming failing with exception Could not compute split, block input

2015-02-27 Thread Akhil Das
You could be hitting this issue https://issues.apache.org/jira/browse/SPARK-4516 Apart from that little more information about your job would be helpful. Thanks Best Regards On Wed, Feb 25, 2015 at 11:34 AM, Mukesh Jha me.mukesh@gmail.com wrote: Hi Experts, My Spark Job is failing with

Re: One of the executor not getting StopExecutor message

2015-02-27 Thread Akhil Das
Mostly, that particular executor is stuck on GC Pause, what operation are you performing? You can try increasing the parallelism if you see only 1 executor is doing the task. Thanks Best Regards On Fri, Feb 27, 2015 at 11:39 AM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, I am

Re: Scheduler hang?

2015-02-26 Thread Akhil Das
at which the system appears to hang. I'm worried about some sort of message loss or inconsistency. * Yes, we are using Kryo. * I'll try that, but I'm again a little confused why you're recommending this. I'm stumped so might as well? On Wed, Feb 25, 2015 at 11:13 PM, Akhil Das ak

Re: throughput in the web console?

2015-02-25 Thread Akhil Das
? Would I need to create a new tab and add the metrics? Any good or simple examples showing how this can be done? On Wed, Feb 25, 2015 at 12:07 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you have a look at https://spark.apache.org/docs/1.0.2/api/scala/index.html

Re: throughput in the web console?

2015-02-25 Thread Akhil Das
By throughput do you mean the number of events processed etc.? [image: Inline image 1] The Streaming tab already has these statistics. Thanks Best Regards On Wed, Feb 25, 2015 at 9:59 PM, Josh J joshjd...@gmail.com wrote: On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com wrote

Re: Spark cluster set up on EC2 customization

2015-02-25 Thread Akhil Das
You can easily add a function (say setup_pig) inside the function setup_cluster in this script https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L649 Thanks Best Regards On Thu, Feb 26, 2015 at 7:08 AM, Sameer Tilak ssti...@live.com wrote: Hi, I was looking at the documentation

Re: group by order by fails

2015-02-25 Thread Akhil Das
Which version of Spark are you running? It seems there was a similar JIRA: https://issues.apache.org/jira/browse/SPARK-2474 Thanks Best Regards On Thu, Feb 26, 2015 at 12:03 PM, tridib tridib.sama...@live.com wrote: Hi, I need to find the top 10 most-selling samples. So the query looks like: select

Re: Scheduler hang?

2015-02-25 Thread Akhil Das
What operation are you trying to do and how big is the data that you are operating on? Here's a few things which you can try: - Repartition the RDD to a higher number than 222 - Specify the master as local[*] or local[10] - Use Kryo Serializer (.set(spark.serializer,

Re: Number of parallel tasks

2015-02-25 Thread Akhil Das
Did you try setting .set("spark.cores.max", "20")? Thanks Best Regards On Wed, Feb 25, 2015 at 10:21 PM, Akshat Aranya aara...@gmail.com wrote: I have Spark running in standalone mode with 4 executors, each executor with 5 cores (spark.executor.cores=5). However, when I'm processing an

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-25 Thread Akhil Das
anamika.guo...@gmail.com wrote: Hi Akhil I guess it skipped my attention. I would definitely give it a try. While I would still like to know what is the issue with the way I have created schema? On Tue, Feb 24, 2015 at 4:35 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you happen

Re: throughput in the web console?

2015-02-25 Thread Akhil Das
Did you have a look at https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.scheduler.SparkListener And for Streaming: https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener Thanks Best Regards On Tue, Feb 24,
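
A small hedged example of a StreamingListener that logs per-batch throughput (the field names follow the 1.x listener API as I recall; check them against your Spark version):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class ThroughputListener extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        println(s"Batch ${info.batchTime}: ${info.numRecords} records, " +
          s"processing ${info.processingDelay.getOrElse(-1L)} ms, " +
          s"scheduling ${info.schedulingDelay.getOrElse(-1L)} ms")
      }
    }

    // register it on your StreamingContext
    ssc.addStreamingListener(new ThroughputListener)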

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-24 Thread Akhil Das
Did you happen to have a look at https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema Thanks Best Regards On Tue, Feb 24, 2015 at 3:39 PM, anu anamika.guo...@gmail.com wrote: My issue is posted here on stack-overflow. What am I doing wrong
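
For reference, the programmatic-schema route looks roughly like this on Spark 1.3 (column names, types and the input path are placeholders; on 1.2 the equivalent call is sqlContext.applySchema and the types live under org.apache.spark.sql). It sidesteps the 22-field case class limit on Scala 2.10:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age",  IntegerType, nullable = true)
      // ... add as many fields as needed, well past 22
    ))

    val rowRDD = sc.textFile("hdfs:///people.txt")
      .map(_.split(","))
      .map(p => Row(p(0), p(1).trim.toInt))

    val df = sqlContext.createDataFrame(rowRDD, schema)
    df.registerTempTable("people")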

Re: Spark on EC2

2015-02-24 Thread Akhil Das
If you signup for Google Compute Cloud, you will get free $300 credits for 3 months and you can start a pretty good cluster for your testing purposes. :) Thanks Best Regards On Tue, Feb 24, 2015 at 8:25 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have just signed up for Amazon AWS
