Re: Research ideas using spark

2015-07-15 Thread Akhil Das
Try to repartition it to a higher number (at least 3-4 times the total # of cpu cores). What operation are you doing? It may happen that if you are doing a join/groupBy sort of operation, the task which is taking time is the one holding all the values (data skew); in that case you need to use a Partitioner which will
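A minimal sketch of the skew-handling idea (hedged example: the key "hotKey", the path and the partition count 48 are placeholders, and sc is an existing SparkContext):

    import org.apache.spark.{HashPartitioner, Partitioner}

    val pairs = sc.textFile("hdfs:///input").map(line => (line.split(",")(0), line))

    // Plain repartitioning: 3-4x the total number of cores.
    val repartitioned = pairs.repartition(48)

    // A custom Partitioner that isolates the hot key in its own partition,
    // so the slow task no longer holds all the values.
    class SkewAwarePartitioner(partitions: Int, hotKey: String) extends Partitioner {
      private val fallback = new HashPartitioner(partitions - 1)
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int =
        if (key == hotKey) partitions - 1 else fallback.getPartition(key)
    }

    val balanced = pairs.partitionBy(new SkewAwarePartitioner(48, "hotKey"))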

Re: Spark Intro

2015-07-14 Thread Akhil Das
This is where you can get started https://spark.apache.org/docs/latest/sql-programming-guide.html Thanks Best Regards On Mon, Jul 13, 2015 at 3:54 PM, vinod kumar vinodsachin...@gmail.com wrote: Hi Everyone, I am developing an application which handles bulk data, around millions (This may

Re: Spark executor memory information

2015-07-14 Thread Akhil Das
1. Yes open up the webui running on 8080 to see the memory/cores allocated to your workers, and open up the ui running on 4040 and click on the Executor tab to see the memory allocated for the executor. 2. mllib codes can be found over here https://github.com/apache/spark/tree/master/mllib and

Re: Does Spark Streaming support streaming from a database table?

2015-07-14 Thread Akhil Das
Why not add a trigger to your database table and, whenever it's updated, push the changes to kafka etc and use normal Spark Streaming? You can also write a receiver based architecture https://spark.apache.org/docs/latest/streaming-custom-receivers.html for this, but that will be a bit time consuming.
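For the receiver route, a bare-bones sketch of a polling receiver (the JDBC polling itself is left as a placeholder named fetchNewRows):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class TableReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      override def onStart(): Unit = {
        new Thread("table-poller") {
          override def run(): Unit = {
            while (!isStopped()) {
              fetchNewRows().foreach(store)   // store() hands records to Spark
              Thread.sleep(5000)
            }
          }
        }.start()
      }
      override def onStop(): Unit = {}        // polling thread exits via isStopped()
      private def fetchNewRows(): Seq[String] = Seq.empty  // placeholder: your JDBC query
    }

    // val tableStream = ssc.receiverStream(new TableReceiver)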

Re: Contribution and choice of language

2015-07-14 Thread Akhil Das
This will get you started https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Thanks Best Regards On Mon, Jul 13, 2015 at 5:29 PM, srinivasraghavansr71 sreenivas.raghav...@gmail.com wrote: Hello everyone, I am interested to contribute to apache spark. I

Re: java.lang.IllegalStateException: unread block data

2015-07-14 Thread Akhil Das
Look in the worker logs and see whats going on. Thanks Best Regards On Tue, Jul 14, 2015 at 4:02 PM, Arthur Chan arthur.hk.c...@gmail.com wrote: Hi, I use Spark 1.4. When saving the model to HDFS, I got error? Please help! Regards my scala command:

Re: Contribution and choice of language

2015-07-14 Thread Akhil Das
You can try to resolve some Jira issues, to start with try out some newbie JIRAs. Thanks Best Regards On Tue, Jul 14, 2015 at 4:10 PM, srinivasraghavansr71 sreenivas.raghav...@gmail.com wrote: I saw the contribution sections. As a new contributor, should I try to build patches or can I add

Re: Spark Intro

2015-07-14 Thread Akhil Das
environment of spark. I tried spark SQL but it seems it returns data slower compared to MsSQL. (I have tested with data which has 4 records) On Tue, Jul 14, 2015 at 3:50 AM, Akhil Das ak...@sigmoidanalytics.com wrote: This is where you can get started https://spark.apache.org/docs

Re: java.lang.IllegalStateException: unread block data

2015-07-14 Thread Akhil Das
Someone else also reported this error with spark 1.4.0 Thanks Best Regards On Tue, Jul 14, 2015 at 6:57 PM, Arthur Chan arthur.hk.c...@gmail.com wrote: Hi, Below is the log form the worker. 15/07/14 17:18:56 ERROR FileAppender: Error writing stream to file

Re: hive-site.xml spark1.3

2015-07-14 Thread Akhil Das
Try adding it in your SPARK_CLASSPATH inside conf/spark-env.sh file. Thanks Best Regards On Tue, Jul 14, 2015 at 7:05 AM, Jerrick Hoang jerrickho...@gmail.com wrote: Hi all, I'm having conf/hive-site.xml pointing to my Hive metastore but sparksql CLI doesn't pick it up. (copying the same
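For reference, that would look something like the following in conf/spark-env.sh (the path is an example; the directory should contain hive-site.xml):

    export SPARK_CLASSPATH=/path/to/hive/conf:$SPARK_CLASSPATH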

Re: Standalone mode connection failure from worker node to master

2015-07-14 Thread Akhil Das
Can you paste your conf/spark-env.sh file? Put SPARK_MASTER_IP as the master machine's host name in spark-env.sh file. Also add your slaves hostnames into conf/slaves file and do a sbin/start-all.sh Thanks Best Regards On Tue, Jul 14, 2015 at 1:26 PM, sivarani whitefeathers...@gmail.com wrote:

Re: Caching in spark

2015-07-13 Thread Akhil Das
wrote: Hi Akhil, It's interesting if RDDs are stored internally in a columnar format as well? Or it is only when an RDD is cached in SQL context, it is converted to columnar format. What about data frames? Thanks! -- Ruslan Dautkhanov On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das ak

Re: Master vs. Slave Nodes Clarification

2015-07-13 Thread Akhil Das
You are a bit confused about master node, slave node and the driver machine. 1. Master node can be kept as a smaller machine in your dev environment, mostly in production you will be using Mesos or Yarn cluster manager. 2. Now, if you are running your driver program (the streaming job) on the

Re: Spark Standalone Mode not working in a cluster

2015-07-13 Thread Akhil Das
Just make sure you are having the same installation of spark-1.4.0-bin-hadoop2.6 everywhere. (including the slaves, master, and from where you start the spark-shell). Thanks Best Regards On Mon, Jul 13, 2015 at 4:34 AM, Eduardo erocha@gmail.com wrote: My installation of spark is not

Re: Starting Spark-Application without explicit submission to cluster?

2015-07-12 Thread Akhil Das
Yes, that is correct. You can use this boiler plate to avoid spark-submit. //The configurations val sconf = new SparkConf() .setMaster("spark://spark-ak-master:7077") .setAppName("SigmoidApp") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
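The preview cuts off; a rough sketch of how such a self-submitting program might continue (master URL, app name and jar path are examples):

    import org.apache.spark.{SparkConf, SparkContext}

    val sconf = new SparkConf()
      .setMaster("spark://spark-ak-master:7077")
      .setAppName("SigmoidApp")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(sconf)
    // Ship the application jar yourself, since spark-submit isn't doing it:
    sc.addJar("/path/to/your-app-assembly.jar")

    println(sc.parallelize(1 to 100).count())
    sc.stop()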

Re: Issues when combining Spark and a third party java library

2015-07-12 Thread Akhil Das
Did you try setting the HADOOP_CONF_DIR? Thanks Best Regards On Sat, Jul 11, 2015 at 3:17 AM, maxdml maxdemou...@gmail.com wrote: Also, it's worth noting that I'm using the prebuilt version for hadoop 2.4 and higher from the official website. -- View this message in context:

Re: Linear search between particular log4j log lines

2015-07-12 Thread Akhil Das
Can you not use sc.wholeTextFiles() and use a custom parser or a regex to extract out the TransactionIDs? Thanks Best Regards On Sat, Jul 11, 2015 at 8:18 AM, ssbiox sergey.korytni...@gmail.com wrote: Hello, I have a very specific question on how to do a search between particular lines of
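A sketch of that approach, assuming log lines carry IDs of the form TransactionID=12345 (the pattern and path are illustrative):

    val txPattern = """TransactionID=(\d+)""".r

    val ids = sc.wholeTextFiles("hdfs:///logs")   // (fileName, fullContent) pairs
      .flatMap { case (_, content) =>
        txPattern.findAllMatchIn(content).map(_.group(1))
      }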

Re: Worker dies with java.io.IOException: Stream closed

2015-07-12 Thread Akhil Das
Can you dig a bit more in the worker logs? Also make sure that spark has permission to write to /opt/ on that machine as its one machine always throwing up. Thanks Best Regards On Sat, Jul 11, 2015 at 11:18 PM, gaurav sharma sharmagaura...@gmail.com wrote: Hi All, I am facing this issue in

Re: query on Spark + Flume integration using push model

2015-07-10 Thread Akhil Das
Here's an example https://github.com/przemek1990/spark-streaming Thanks Best Regards On Thu, Jul 9, 2015 at 4:35 PM, diplomatic Guru diplomaticg...@gmail.com wrote: Hello all, I'm trying to configure the flume to push data into a sink so that my stream job could pick up the data. My events

Re: Accessing Spark Web UI from another place than where the job actually ran

2015-07-10 Thread Akhil Das
When you connect to the machines you can create an ssh tunnel to access the UI : ssh -L 8080:127.0.0.1:8080 MasterMachinesIP And then you can simply open localhost:8080 in your browser and it should show up the UI. Thanks Best Regards On Thu, Jul 9, 2015 at 7:44 PM, rroxanaioana

Re: DataFrame insertInto fails, saveAsTable works (Azure HDInsight)

2015-07-10 Thread Akhil Das
It seems to be an issue with Azure, there was a discussion over here https://azure.microsoft.com/en-in/documentation/articles/hdinsight-hadoop-spark-install/ Thanks Best Regards On Thu, Jul 9, 2015 at 9:42 PM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I'm running Spark 1.4 on

Re: Caching in spark

2015-07-10 Thread Akhil Das
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory Thanks Best Regards On Fri, Jul 10, 2015 at 10:05 AM, vinod kumar vinodsachin...@gmail.com wrote: Hi Guys, Can any one please share me how to use caching feature of spark via spark sql queries? -Vinod
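For reference, the two usual entry points look like this (assuming "people" is an already-registered table and sqlContext exists):

    // Through the API:
    sqlContext.cacheTable("people")
    // ...or directly in SQL:
    sqlContext.sql("CACHE TABLE people")
    // and to release it later:
    sqlContext.uncacheTable("people")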

Re: SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use when running spark-shell

2015-07-10 Thread Akhil Das
that's because sc is already initialized. You can do sc.stop() before you initialize another one. Thanks Best Regards On Fri, Jul 10, 2015 at 3:54 PM, Prateek . prat...@aricent.com wrote: Hi, I am running single spark-shell but observing this error when I give val sc = new

Re: Job completed successfully without processing anything

2015-07-09 Thread Akhil Das
Looks like a configuration problem with your spark setup, are you running the driver on a different network? Can you try a simple program from spark-shell and make sure your setup is proper? (like sc.parallelize(1 to 1000).collect()) Thanks Best Regards On Thu, Jul 9, 2015 at 1:02 AM, ÐΞ€ρ@Ҝ

Re: Connecting to nodes on cluster

2015-07-09 Thread Akhil Das
On Wed, Jul 8, 2015 at 7:31 PM, Ashish Dutt ashish.du...@gmail.com wrote: Hi, We have a cluster with 4 nodes. The cluster uses CDH 5.4 for the past two days I have been trying to connect my laptop to the server using spark master ip:port but it's been unsuccessful. The server contains data

Re: Is there a way to shutdown the derby in hive context in spark shell?

2015-07-09 Thread Akhil Das
Did you try sc.stop() and creating a new one? Thanks Best Regards On Wed, Jul 8, 2015 at 8:12 PM, Terry Hole hujie.ea...@gmail.com wrote: I am using spark 1.4.1rc1 with default hive settings Thanks - Terry Hi All, I'd like to use the hive context in spark shell, i need to recreate the

Re: What does RDD lineage refer to ?

2015-07-09 Thread Akhil Das
Yes, just to add see the following scenario of rdd lineage: RDD1 -> RDD2 -> RDD3 -> RDD4 here RDD2 depends on the RDD1's output and the lineage goes till RDD4. Now, for some reason RDD3 is lost, and spark will recompute it from RDD2. Thanks Best Regards On Thu, Jul 9, 2015 at 5:51 AM, canan

Re: Spark job hangs when History server events are written to hdfs

2015-07-08 Thread Akhil Das
Can you look in the datanode logs and see whats going on? Most likely, you are hitting the ulimit on open file handles. Thanks Best Regards On Wed, Jul 8, 2015 at 10:55 AM, Pankaj Arora pankaj.ar...@guavus.com wrote: Hi, I am running long running application over yarn using spark and I am

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Akhil Das
multithread it? Sincerely, Ashish Dutt On Wed, Jul 8, 2015 at 3:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Whats the point of creating them in parallel? You can multi-thread it run it in parallel though. Thanks Best Regards On Wed, Jul 8, 2015 at 5:34 AM, Brandon White bwwintheho

Re: unable to bring up cluster with ec2 script

2015-07-08 Thread Akhil Das
Its showing connection refused, for some reason it was not able to connect to the machine; either it's the machine's start-up time or it's with the security group. Thanks Best Regards On Wed, Jul 8, 2015 at 2:04 AM, Pagliari, Roberto rpagli...@appcomsci.com wrote: I'm following the tutorial

Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Akhil Das
Whats the point of creating them in parallel? You can multi-thread it and run it in parallel though. Thanks Best Regards On Wed, Jul 8, 2015 at 5:34 AM, Brandon White bwwintheho...@gmail.com wrote: Say I have a spark job that looks like following: def loadTable1() { val table1 =
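For anyone who does want the loads kicked off concurrently, a sketch using Scala futures (paths and table names are examples; sqlContext is assumed to exist):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    // Each future triggers its own Spark job; the scheduler interleaves them.
    val f1 = Future { sqlContext.parquetFile("/data/table1").registerTempTable("table1") }
    val f2 = Future { sqlContext.parquetFile("/data/table2").registerTempTable("table2") }

    Await.result(Future.sequence(Seq(f1, f2)), 30.minutes)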

Re: Master doesn't start, no logs

2015-07-07 Thread Akhil Das
Strange. What are you having in $SPARK_MASTER_IP? It may happen that it is not able to bind to the given ip but again it should be in the logs. Thanks Best Regards On Tue, Jul 7, 2015 at 12:54 AM, maxdml maxdemou...@gmail.com wrote: Hi, I've been compiling spark 1.4.0 with SBT, from the

Re: How to debug java.io.OptionalDataException issues

2015-07-07 Thread Akhil Das
Did you try kryo? Wrap everything with kryo and see if you are still hitting the exception. (At least you could see a different exception stack). Thanks Best Regards On Tue, Jul 7, 2015 at 6:05 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, suffering from a pretty strange issue:
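A minimal sketch of switching to Kryo (MyRecord stands in for whatever classes you actually serialize):

    import org.apache.spark.SparkConf

    case class MyRecord(id: Long, payload: String)  // placeholder type

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes avoids writing full class names into the stream:
      .registerKryoClasses(Array(classOf[MyRecord]))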

Re: Data interaction between various RDDs in Spark Streaming

2015-07-07 Thread Akhil Das
updateStateByKey? Thanks Best Regards On Wed, Jul 8, 2015 at 1:05 AM, swetha swethakasire...@gmail.com wrote: Hi, Suppose I want the data to be grouped by and Id named 12345 and I have certain amount of data coming out from one batch for 12345 and I have data related to 12345 coming after
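A minimal updateStateByKey sketch for carrying per-id state across batches (ssc and the pairs stream are assumed; the checkpoint path is an example):

    // State updates require checkpointing.
    ssc.checkpoint("hdfs:///checkpoints")

    // pairs: DStream[(String, Int)], e.g. ("12345", 1) per incoming event
    val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))

    val runningCounts = pairs.updateStateByKey[Int](updateFunc)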

Re: How to solve ThreadException in Apache Spark standalone Java Application

2015-07-07 Thread Akhil Das
Can you try adding sc.stop() at the end of your program? Looks like it's having a hard time closing off the SparkContext. Thanks Best Regards On Tue, Jul 7, 2015 at 4:08 PM, Hafsa Asif hafsa.a...@matchinguu.com wrote: Hi, I run the following simple Java spark standalone app with maven command

Re: How to implement top() and filter() on object List for JavaRDD

2015-07-07 Thread Akhil Das
Here's a simplified example: SparkConf conf = new SparkConf().setAppName("Sigmoid").setMaster("local"); JavaSparkContext sc = new JavaSparkContext(conf); List<String> user = new ArrayList<String>(); user.add("Jack"); user.add("Jill");

Re: Master doesn't start, no logs

2015-07-07 Thread Akhil Das
instances having successively run on the same machine? -- Henri Maxime Demoulin 2015-07-07 4:10 GMT-04:00 Akhil Das ak...@sigmoidanalytics.com: Strange. What are you having in $SPARK_MASTER_IP? It may happen that it is not able to bind to the given ip but again it should be in the logs. Thanks

Re: Spark got stuck with BlockManager after computing connected components using GraphX

2015-07-06 Thread Akhil Das
If you don't want those logs flooding your screen, you can disable it simply with: import org.apache.log4j.{Level, Logger} Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF) Thanks Best Regards On Sun, Jul 5, 2015 at 7:27 PM, Hellen

Re: cores and resource management

2015-07-06 Thread Akhil Das
Try with *spark.cores.max*; executor cores is usually used when you run it in yarn mode. Thanks Best Regards On Mon, Jul 6, 2015 at 1:22 AM, nizang ni...@windward.eu wrote: hi, We're running spark 1.4.0 on ec2, with 6 machines, 4 cores each. We're trying to run an application on a number of

Re: java.io.IOException: No space left on device--regd.

2015-07-06 Thread Akhil Das
While the job is running, just look in the directory and see whats the root cause of it (is it the logs? is it the shuffle? etc). Here are a few configuration options which you can try: - Disable shuffle spilling: spark.shuffle.spill=false (It might end up in OOM) - Enable log rotation:
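The preview cuts off; the rotation settings being referred to are presumably along these lines (values are examples, set in spark-defaults.conf or via --conf):

    spark.shuffle.spill                           false    # keeps spills off disk, may OOM
    spark.executor.logs.rolling.strategy          time
    spark.executor.logs.rolling.time.interval     daily
    spark.executor.logs.rolling.maxRetainedFiles  3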

Re: java.io.IOException: No space left on device--regd.

2015-07-06 Thread Akhil Das
You can also set these in the spark-env.sh file : export SPARK_WORKER_DIR=/mnt/spark/ export SPARK_LOCAL_DIRS=/mnt/spark/ Thanks Best Regards On Mon, Jul 6, 2015 at 12:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: While the job is running, just look in the directory and see whats

Re: Unable to start spark-sql

2015-07-06 Thread Akhil Das
Its complaining for a jdbc driver. Add it in your driver classpath like: ./bin/spark-sql --driver-class-path /home/akhld/sigmoid/spark/lib/mysql-connector-java-5.1.32-bin.jar Thanks Best Regards On Mon, Jul 6, 2015 at 11:42 AM, sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, I am

Re: JDBC Streams

2015-07-05 Thread Akhil Das
If you want a long running application, then go with spark streaming (which kind of blocks your resources). On the other hand, if you use job server then you can actually use the resources (CPUs) for other jobs also when your dbjob is not using them. Thanks Best Regards On Sun, Jul 5, 2015 at

Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-05 Thread Akhil Das
Looks like it spent more time writing/transferring the 40GB of shuffle when you used kryo. And surprisingly, JavaSerializer has 700MB of shuffle? Thanks Best Regards On Sun, Jul 5, 2015 at 12:01 PM, Gavin Liu ilovesonsofanar...@gmail.com wrote: Hi, I am using TeraSort benchmark from

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Akhil Das
With binary i think it might not be possible, although if you can download the sources and then build it then you can remove this function https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023 which initializes the SQLContext.

Re: duplicate names in sql allowed?

2015-07-03 Thread Akhil Das
I think you can open up a jira, not sure if this PR https://github.com/apache/spark/pull/2209/files (SPARK-2890 https://issues.apache.org/jira/browse/SPARK-2890) broke the validation piece. Thanks Best Regards On Fri, Jul 3, 2015 at 4:29 AM, Koert Kuipers ko...@tresata.com wrote: i am

Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Can you paste the code? Something is missing Thanks Best Regards On Fri, Jul 3, 2015 at 3:14 PM, Jem Tucker jem.tuc...@gmail.com wrote: In the driver when running spark-submit with --master yarn-client On Fri, Jul 3, 2015 at 10:23 AM Akhil Das ak...@sigmoidanalytics.com wrote: Where does

Re: build spark 1.4 source code for sparkR with maven

2015-07-03 Thread Akhil Das
Did you try: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package Thanks Best Regards On Fri, Jul 3, 2015 at 2:27 PM, 1106944...@qq.com 1106944...@qq.com wrote: Hi all, Anyone build spark 1.4 source code for sparkR with maven/sbt, what's comand ? using

Re: Accessing the console from spark

2015-07-03 Thread Akhil Das
Where does it returns null? Within the driver or in the executor? I just tried System.console.readPassword in spark-shell and it worked. Thanks Best Regards On Fri, Jul 3, 2015 at 2:32 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, We have an application that requires a username/password to

Re: Making Unpersist Lazy

2015-07-02 Thread Akhil Das
rdd's which are no longer required will be removed from memory by spark itself (which you can consider as lazy?). Thanks Best Regards On Wed, Jul 1, 2015 at 7:48 PM, Jem Tucker jem.tuc...@gmail.com wrote: Hi, The current behavior of rdd.unpersist() appears to not be lazily executed and

Re: Convert CSV lines to List of Objects

2015-07-02 Thread Akhil Das
Have a look at the sc.wholeTextFiles, you can use it to read the whole csv contents into the value and then split it on \n and add them up to a list and return it. *sc.wholeTextFiles:* Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported
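A sketch of that approach (path, delimiter and the Person type are examples):

    case class Person(name: String, age: Int)

    val people = sc.wholeTextFiles("hdfs:///csv-dir")
      .flatMap { case (_, content) => content.split("\n") }
      .filter(_.nonEmpty)
      .map { line =>
        val cols = line.split(",")
        Person(cols(0), cols(1).trim.toInt)
      }

    // people.collect().toList gives the List of objects on the driver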

Re: output folder structure not getting commited and remains as _temporary

2015-07-01 Thread Akhil Das
Looks like a jar conflict to me. java.lang.NoSuchMethodException: org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData.getBytesWritten() You are having multiple versions of the same jars in the classpath. Thanks Best Regards On Wed, Jul 1, 2015 at 6:58 AM, nkd kalidas.nimmaga...@gmail.com

Re: Issue with parquet write after join (Spark 1.4.0)

2015-07-01 Thread Akhil Das
It says: Caused by: java.net.ConnectException: Connection refused: slave2/...:54845 Could you look in the executor logs (stderr on slave2) and see what made it shut down? Since you are doing a join there's a high possibility of OOM etc. Thanks Best Regards On Wed, Jul 1, 2015 at 10:20 AM,

Re: Spark run errors on Raspberry Pi

2015-07-01 Thread Akhil Das
Now i'm having a strange feeling to try this on KBOX http://kevinboone.net/kbox.html :/ Thanks Best Regards On Wed, Jul 1, 2015 at 9:10 AM, Exie tfind...@prodevelop.com.au wrote: FWIW, I had some trouble getting Spark running on a Pi. My core problem was using snappy for compression as it

Re: Run multiple Spark jobs concurrently

2015-07-01 Thread Akhil Das
Have a look at https://spark.apache.org/docs/latest/job-scheduling.html Thanks Best Regards On Wed, Jul 1, 2015 at 12:01 PM, Nirmal Fernando nir...@wso2.com wrote: Hi All, Is there any additional configs that we have to do to perform $subject? -- Thanks regards, Nirmal Associate

Re: Can I do Joins across Event Streams ?

2015-07-01 Thread Akhil Das
Have a look at the window, updateStateByKey operations, if you are looking for something more sophisticated then you can actually persists these streams in an intermediate storage (say for x duration) like HBase or Cassandra or any other DB and you can do global aggregations with these. Thanks

Re: Difference between spark-defaults.conf and SparkConf.set

2015-07-01 Thread Akhil Das
.addJar works for me when i run it as a stand-alone application (without using spark-submit) Thanks Best Regards On Tue, Jun 30, 2015 at 7:47 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, running into a pretty strange issue: I'm setting spark.executor.extraClassPath

Re: Issues in reading a CSV file from local file system using spark-shell

2015-07-01 Thread Akhil Das
Since its a windows machine, you are very likely to be hitting this one https://issues.apache.org/jira/browse/SPARK-2356 Thanks Best Regards On Wed, Jul 1, 2015 at 12:36 AM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi, I'm running Spark 1.4.0 without Hadoop. I'm using the binary

Re: Checkpoint support?

2015-06-30 Thread Akhil Das
Have a look at the StageInfo https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.StageInfo class, it has method stageFailed. You could make use of it. I don't understand the point of restarting the entire application. Thanks Best Regards On Tue, Jun 30, 2015 at
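A sketch of reacting to failed stages from a listener instead of restarting the application (the reaction itself is a placeholder):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    sc.addSparkListener(new SparkListener {
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
        stage.stageInfo.failureReason.foreach { reason =>
          println(s"Stage ${stage.stageInfo.stageId} failed: $reason")
          // react here: alert, re-run the affected job, etc.
        }
      }
    })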

Re: Error while installing spark

2015-06-30 Thread Akhil Das
How much memory you have on that machine? You can increase the heap-space by *export _JAVA_OPTIONS=-Xmx2g* Thanks Best Regards On Tue, Jun 30, 2015 at 11:00 AM, Chintan Bhatt chintanbhatt...@charusat.ac.in wrote: Facing following error message while performing sbt/sbt assembly Error

Re: got java.lang.reflect.UndeclaredThrowableException when running multiply APPs in spark

2015-06-30 Thread Akhil Das
This: Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Could happen for many reasons, one of them could be because of insufficient memory. Are you running all 20 apps on the same node? How are you submitting the apps? (with spark-submit?). I see you have

Re: s3 bucket access/read file

2015-06-30 Thread Akhil Das
Try this way: val data = sc.textFile("s3n://ACCESS_KEY:SECRET_KEY@mybucket/temp/") Thanks Best Regards On Mon, Jun 29, 2015 at 11:59 PM, didi did...@gmail.com wrote: Hi *Cant read text file from s3 to create RDD * after setting the configuration val

Re: problem for submitting job

2015-06-29 Thread Akhil Das
Cool. On 29 Jun 2015 21:10, 郭谦 buptguoq...@gmail.com wrote: Akhil Das, You give me a new idea to solve the problem. Vova provides me a way to solve the problem just before Vova Shelgunovvvs...@gmail.com Sample code for submitting job from any other java app, e.g. servlet: http

Re: spilling in-memory map of 5.1 MB to disk (272 times so far)

2015-06-29 Thread Akhil Das
Here's a bunch of configuration for that https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior Thanks Best Regards On Fri, Jun 26, 2015 at 10:37 PM, igor.berman igor.ber...@gmail.com wrote: Hi, wanted to get some advice regarding tunning spark application I see for some of

Re: Master dies after program finishes normally

2015-06-29 Thread Akhil Das
Which version of spark are you using? You can try changing the heap size manually by *export _JAVA_OPTIONS=-Xmx5g * Thanks Best Regards On Fri, Jun 26, 2015 at 7:52 PM, Yifan LI iamyifa...@gmail.com wrote: Hi, I just encountered the same problem, when I run a PageRank program which has lots

Re: problem for submitting job

2015-06-29 Thread Akhil Das
You can create a SparkContext in your program and run it as a standalone application without using spark-submit. Here's something that will get you started: //Create SparkContext val sconf = new SparkConf() .setMaster("spark://spark-ak-master:7077") .setAppName("Test")

Re:

2015-06-26 Thread Akhil Das
. The input size is 512.0 MB (hadoop) / 4159106. Can this be reduced to 64 MB so as to increase the number of tasks. Similar to split size that increases the number of mappers in Hadoop M/R. On Thu, Jun 25, 2015 at 12:06 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Look in the tuning

Re: Spark for distributed dbms cluster

2015-06-26 Thread Akhil Das
Which distributed database are you referring here? Spark can connect with almost all those databases out there (You just need to pass the Input/Output Format classes or there are a bunch of connectors also available). Thanks Best Regards On Fri, Jun 26, 2015 at 12:07 PM, louis.hust

Re: Problem Run Spark Example HBase Code Using Spark-Submit

2015-06-26 Thread Akhil Das
Try to add them in the SPARK_CLASSPATH in your conf/spark-env.sh file Thanks Best Regards On Thu, Jun 25, 2015 at 9:31 PM, Bin Wang binwang...@gmail.com wrote: I am trying to run the Spark example code HBaseTest from command line using spark-submit instead run-example, in that case, I can

Re: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Akhil Das
You just need to set your HADOOP_HOME which appears to be null in the stackstrace. If you are not having the winutils.exe, then you can download https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip and put it there. Thanks Best Regards On Thu, Jun 25, 2015 at 11:30 PM, Ashic

Re: Performing sc.parallelize (..) in workers not in the driver program

2015-06-26 Thread Akhil Das
Why do you want to do that? Thanks Best Regards On Thu, Jun 25, 2015 at 10:16 PM, shahab shahab.mok...@gmail.com wrote: Hi, Apparently, sc.parallelize (..) operation is performed in the driver program not in the workers ! Is it possible to do this in worker process for the sake of

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-26 Thread Akhil Das
Its a scala version conflict, can you paste your build.sbt file? Thanks Best Regards On Fri, Jun 26, 2015 at 7:05 AM, stati srikanth...@gmail.com wrote: Hello, When I run a spark job with spark-submit it fails with below exception for code line /*val webLogDF =
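For reference, the usual fix is pinning build.sbt to the Scala version your Spark build uses; the versions below are what a Spark 1.4 project would typically declare:

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"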

Re: Kafka Direct Stream - Custom Serialization and Deserialization

2015-06-26 Thread Akhil Das
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream( jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet ); Here: jssc = JavaStreamingContext String.class = Key ,
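For custom types, a Scala sketch of the same call with a custom decoder (Event and its wire format are made-up placeholders; ssc, kafkaParams and topicsSet are as in the snippet above):

    import kafka.serializer.{Decoder, StringDecoder}
    import kafka.utils.VerifiableProperties
    import org.apache.spark.streaming.kafka.KafkaUtils

    case class Event(payload: String)

    // Kafka 0.8 decoders take a VerifiableProperties constructor argument.
    class EventDecoder(props: VerifiableProperties = null) extends Decoder[Event] {
      override def fromBytes(bytes: Array[Byte]): Event =
        Event(new String(bytes, "UTF-8"))
    }

    val events = KafkaUtils.createDirectStream[String, Event, StringDecoder, EventDecoder](
      ssc, kafkaParams, topicsSet)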

Re: Spark 1.4 RDD to DF fails with toDF()

2015-06-26 Thread Akhil Das
/releases/; ) On Fri, Jun 26, 2015 at 4:13 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Its a scala version conflict, can you paste your build.sbt file? Thanks Best Regards On Fri, Jun 26, 2015 at 7:05 AM, stati srikanth...@gmail.com wrote: Hello, When I run a spark job with spark-submit

Re:

2015-06-25 Thread Akhil Das
(๏̯͡๏) deepuj...@gmail.com wrote: Its taking an hour and on Hadoop it takes 1h 30m, is there a way to make it run faster ? On Wed, Jun 24, 2015 at 11:39 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Cool. :) On 24 Jun 2015 23:44, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Its running now

Re: Can Spark1.4 work with CDH4.6

2015-06-25 Thread Akhil Das
a different guava dependency but the error does go away this way On Wed, Jun 24, 2015 at 10:04 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try to add those jars in the SPARK_CLASSPATH and give it a try? Thanks Best Regards On Wed, Jun 24, 2015 at 12:07 AM, Yana Kadiyska yana.kadiy

Re: spark1.4 sparkR usage

2015-06-25 Thread Akhil Das
Here you go https://amplab-extras.github.io/SparkR-pkg/ Thanks Best Regards On Thu, Jun 25, 2015 at 12:39 PM, 1106944...@qq.com 1106944...@qq.com wrote: Hi all I have installed spark1.4, then want to use sparkR . assume spark master ip= node1, how to start sparkR ? and submit job to

Re: Akka failures: Driver Disassociated

2015-06-25 Thread Akhil Das
Can you look in the worker logs and see whats going on? It may happen that you ran out of diskspace etc. Thanks Best Regards On Thu, Jun 25, 2015 at 12:08 PM, barmaley o...@solver.com wrote: I'm running Spark 1.3.1 on AWS... Having long-running application (spark context) which accepts and

Re: Killing Long running tasks (stragglers)

2015-06-25 Thread Akhil Das
That totally depends on the way you extract the data. It will be helpful if you can paste your code so that we will understand it better. Thanks Best Regards On Wed, Jun 24, 2015 at 2:32 PM, William Ferrell wferr...@gmail.com wrote: Hello - I am using Apache Spark 1.2.1 via pyspark. Thanks

Re: spark1.4 sparkR usage

2015-06-25 Thread Akhil Das
, Is this the official R Package? It is written : *NOTE: The API from the upcoming Spark release (1.4) will not have the same API as described here. * Thanks, JC ᐧ 2015-06-25 10:55 GMT+02:00 Akhil Das ak...@sigmoidanalytics.com: Here you go https://amplab-extras.github.io/SparkR-pkg/ Thanks Best

Re: Should I keep memory dedicated for HDFS and Spark on cluster nodes?

2015-06-24 Thread Akhil Das
Depending on the size of the memory you are having, you could allocate 60-80% of the memory for the spark worker process. Datanode doesn't require too much memory. On 23 Jun 2015 21:26, maxdml max...@cs.duke.edu wrote: I'm wondering if there is a real benefit for splitting my memory in two for

Re:

2015-06-24 Thread Akhil Das
Can you look a bit more in the error logs? It could be getting killed because of OOM etc. One thing you can try is to set the spark.shuffle.blockTransferService to nio from netty. Thanks Best Regards On Wed, Jun 24, 2015 at 5:46 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have a Spark job

Re: Can Spark1.4 work with CDH4.6

2015-06-24 Thread Akhil Das
Can you try to add those jars in the SPARK_CLASSPATH and give it a try? Thanks Best Regards On Wed, Jun 24, 2015 at 12:07 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, I have been using Spark against an external Metastore service which runs Hive with Cdh 4.6 In Spark 1.2, I was

Re: kafka spark streaming with mesos

2015-06-24 Thread Akhil Das
A screenshot of your framework running would also be helpful. How many cores does it have? Did you try running it in coarse grained mode? Try to add these to the conf: sparkConf.set("spark.mesos.coarse", "true") sparkConf.set("spark.cores.max", "2") Thanks Best Regards On Wed, Jun 24, 2015 at 1:35 AM,

Re:

2015-06-24 Thread Akhil Das
) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) On Wed, Jun 24, 2015 at 7:16 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you look a bit more in the error logs? It could be getting

Re: Multiple executors writing file using java filewriter

2015-06-23 Thread Akhil Das
Why don't you do a normal .saveAsTextFiles? Thanks Best Regards On Mon, Jun 22, 2015 at 11:55 PM, anshu shukla anshushuk...@gmail.com wrote: Thanx for reply !! YES , Either it should write on any machine of cluster or Can you please help me ... that how to do this . Previously i was

Re: Spark job fails silently

2015-06-23 Thread Akhil Das
Looks like a hostname conflict to me. 15/06/22 17:04:45 WARN Utils: Your hostname, datasci01.dev.abc.com resolves to a loopback address: 127.0.0.1; using 10.0.3.197 instead (on interface eth0) 15/06/22 17:04:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Can you paste

Re: Any way to retrieve time of message arrival to Kafka topic, in Spark Streaming?

2015-06-23 Thread Akhil Das
May be while producing the messages, you can make it as a keyedMessage with the timestamp as key and on the consumer end you can easily identify the key (which will be the timestamp) from the message. If the network is fast enough, then i think there could be a small millisecond lag. Thanks Best

Re: What does [Stage 0: (0 + 2) / 2] mean on the console

2015-06-23 Thread Akhil Das
Well, you could say that (Stage information) is an ASCII representation of the WebUI (running on port 4040). Since you set local[4] you will have 4 threads for your computation, and since you are having 2 receivers, you are left with 2 threads to process ((0 + 2) -- This 2 is your 2 threads.) And the

Re: Programming with java on spark

2015-06-23 Thread Akhil Das
Did you happen to try this? JavaPairRDD<LongWritable, Text> hadoopFile = sc.hadoopFile( "/sigmoid", DataInputFormat.class, LongWritable.class, Text.class) Thanks Best Regards On Tue, Jun 23, 2015 at 6:58 AM, 付雅丹 yadanfu1...@gmail.com wrote: Hello, everyone! I'm new to spark.

Re: Spark Streaming: limit number of nodes

2015-06-23 Thread Akhil Das
Use *spark.cores.max* to limit the CPU per job, then you can easily accommodate your third job also. Thanks Best Regards On Tue, Jun 23, 2015 at 5:07 PM, Wojciech Pituła w.pit...@gmail.com wrote: I have set up small standalone cluster: 5 nodes, every node has 5GB of memory and 8 cores. As you

Re: jars are not loading from 1.3. those set via setJars to the SparkContext

2015-06-22 Thread Akhil Das
Yes. Thanks Best Regards On Mon, Jun 22, 2015 at 8:33 PM, Murthy Chelankuri kmurt...@gmail.com wrote: I have more than one jar. can we set sc.addJar multiple times with each dependent jar ? On Mon, Jun 22, 2015 at 8:30 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Try sc.addJar instead

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Akhil Das
You can use fileStream for that, look at the XMLInputFormat https://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java of mahout. It should give you full XML object as one record, (as opposed to an XML
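A sketch of wiring that up, assuming mahout's XmlInputFormat with its xmlinput.start/xmlinput.end settings (tag names and path are examples):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.mahout.classifier.bayes.XmlInputFormat

    // Tell the input format which tags delimit one record:
    ssc.sparkContext.hadoopConfiguration.set("xmlinput.start", "<page>")
    ssc.sparkContext.hadoopConfiguration.set("xmlinput.end", "</page>")

    val xmlRecords = ssc
      .fileStream[LongWritable, Text, XmlInputFormat]("hdfs:///xml-dir")
      .map { case (_, xml) => xml.toString }   // one full XML object per record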

Re: s3 - Can't make directory for path

2015-06-22 Thread Akhil Das
Could you elaborate a bit more? What do you meant by set up a standalone server? and what is leading you to that exceptions? Thanks Best Regards On Mon, Jun 22, 2015 at 2:22 AM, nizang ni...@windward.eu wrote: hi, I'm trying to setup a standalone server, and in one of my tests, I got the

Re: memory needed for each executor

2015-06-22 Thread Akhil Das
Totally depends on the use-case that you are solving with Spark, for instance there was some discussion around the same which you could read over here http://apache-spark-user-list.1001560.n3.nabble.com/How-does-one-decide-no-of-executors-cores-memory-allocation-td23326.html Thanks Best Regards

Re: JavaDStreamString read and write rdbms

2015-06-22 Thread Akhil Das
Its pretty straight forward, this would get you started http://stackoverflow.com/questions/24896233/how-to-save-apache-spark-schema-output-in-mysql-database Thanks Best Regards On Mon, Jun 22, 2015 at 12:39 PM, Manohar753 manohar.re...@happiestminds.com wrote: Hi Team, How to split and
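Along the lines of that answer, a sketch for writing a DStream of (word, count) pairs to MySQL (URL, credentials and table are placeholders; wordCounts is assumed, and the MySQL connector jar must be on the classpath):

    import java.sql.DriverManager

    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // One connection per partition, not per record.
        val conn = DriverManager.getConnection(
          "jdbc:mysql://dbhost:3306/mydb", "user", "password")
        val stmt = conn.prepareStatement("INSERT INTO events (word, count) VALUES (?, ?)")
        partition.foreach { case (word, count) =>
          stmt.setString(1, word)
          stmt.setInt(2, count)
          stmt.executeUpdate()
        }
        stmt.close()
        conn.close()
      }
    }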

Re: Serializer not switching

2015-06-22 Thread Akhil Das
How are you submitting the application? Could you paste the code that you are running? Thanks Best Regards On Mon, Jun 22, 2015 at 5:37 PM, Sean Barzilay sesnbarzi...@gmail.com wrote: I am trying to run a function on every line of a parquet file. The function is in an object. When I run the

Re: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-22 Thread Akhil Das
to use XmlInputFormat of mahout in Spark Streaming (I am not Spark Streaming Expert yet ;-)). Can you show me some sample code for explanation. Thanks in advance, Yong On Mon, Jun 22, 2015 at 6:44 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can use fileStream for that, look
