Re: IP error on starting spark-shell on windows 7

2015-12-13 Thread Akhil Das
It's a warning, not an error. What happens when you don't specify SPARK_LOCAL_IP at all? If it is able to bring up the Spark shell, then try *netstat -np* and see which address the driver is binding to. Thanks Best Regards On Thu, Dec 10, 2015 at 9:49 AM, Stefan Karos

Re: Not able to receive data in spark from rsyslog

2015-12-07 Thread Akhil Das
Just make sure you are binding on the correct interface. java.net.ConnectException: Connection refused means Spark was not able to connect to that host/port. You can validate it by telnetting to that host/port. Thanks Best Regards On Fri, Dec 4, 2015 at 1:00 PM, masoom alam

Re: How the cores are used in Directstream approach

2015-12-07 Thread Akhil Das
You will have to do a repartition after creating the DStream to utilize all cores; directStream creates exactly one Spark partition per Kafka partition. Thanks Best Regards On Thu, Dec 3, 2015 at 9:42 AM, Charan Ganga Phani Adabala < char...@eiqnetworks.com> wrote: > Hi, > > We have* 1 kafka
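
A minimal sketch of that pattern, assuming the Spark 1.x direct API for Kafka 0.8 (broker and topic names are placeholders):

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka"), Seconds(10))
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  // directStream gives one Spark partition per Kafka partition...
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))
  // ...so repartition explicitly to spread the work over all available cores.
  val spread = stream.map(_._2).repartition(ssc.sparkContext.defaultParallelism)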

Re: Spark Streaming Shuffle to Disk

2015-12-07 Thread Akhil Das
UpdateStateByKey state and your batch data could be filling up your executor memory, and hence it might be hitting the disk. You can verify it by looking at the memory footprint while your job is running. Looking at the executor logs will also give you a better understanding of what's going on. Thanks

Re: Predictive Modeling

2015-12-07 Thread Akhil Das
You can write a simple Python script to process the 1.5 GB dataset and use the pandas library for building your predictive model. Thanks Best Regards On Fri, Dec 4, 2015 at 3:02 PM, Chintan Bhatt < chintanbhatt...@charusat.ac.in> wrote: > Hi, > I'm very much interested to make a predictive model

Re: Spark applications metrics

2015-12-07 Thread Akhil Das
Usually your application is composed of jobs and jobs are composed of tasks; at the task level you can see how much read/write happened from the Stages tab of your driver UI. Thanks Best Regards On Fri, Dec 4, 2015 at 6:20 PM, patcharee wrote: > Hi > > How can I

Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-07 Thread Akhil Das
What's in your SparkIsAwesome class? Just make sure that you are giving enough partitions to Spark to evenly distribute the job throughout the cluster. Try submitting the job this way: ~/spark/bin/spark-submit --executor-cores 10 --executor-memory 5G --driver-memory 5G --class

Re: How to access a RDD (that has been broadcasted) inside the filter method of another RDD?

2015-12-07 Thread Akhil Das
Something like this? val broadcasted = sc.broadcast(...) RDD2.filter(value => broadcasted.value.contains(value)) // access the broadcast through .value Thanks Best Regards On Fri, Dec 4, 2015 at 10:43 PM, Abhishek Shivkumar < abhishek.shivku...@bigdatapartnership.com> wrote: > Hi, > > I have
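
A fuller sketch, assuming the broadcast value is a small lookup set (names and values are illustrative):

  val lookup = sc.broadcast(Set("a", "b", "c"))            // small reference data
  val rdd2 = sc.parallelize(Seq("a", "x", "c", "y"))
  // Access the broadcast through .value inside the closure; only matching elements survive.
  val filtered = rdd2.filter(value => lookup.value.contains(value))
  filtered.collect().foreach(println)                      // prints: a, c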

Re: AWS CLI --jars comma problem

2015-12-07 Thread Akhil Das
Not a direct answer but you can create a big fat jar combining all the classes in the three jars and pass it. Thanks Best Regards On Thu, Dec 3, 2015 at 10:21 PM, Yusuf Can Gürkan wrote: > Hello > > I have a question about AWS CLI for people who use it. > > I create a

Re: Getting error when trying to start master node after building spark 1.3

2015-12-07 Thread Akhil Das
Did you read http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support Thanks Best Regards On Fri, Dec 4, 2015 at 4:12 PM, Mich Talebzadeh wrote: > Hi, > > > > > > I am trying to make Hive work with Spark. > > > > I have been told that I

Re: 8080 not working

2015-12-05 Thread Akhil Das
Can you provide more information? Do you have a Spark standalone cluster running? If not, then you won't be able to access 8080 (the standalone master web UI). If you want to load file data, you can open up the spark-shell and type val rawData = sc.textFile("/path/to/your/file") You can also use the

Re: Improve saveAsTextFile performance

2015-12-05 Thread Akhil Das
Which version of Spark are you using? Can you look at the event timeline and the DAG of the job and see where it's spending more time? .save simply triggers your entire pipeline. If you are doing a join/groupBy kind of operation then you need to make sure the keys are evenly distributed throughout

Re: spark sql cli query results written to file ?

2015-12-02 Thread Akhil Das
Something like this? val df = sqlContext.read.load("examples/src/main/resources/users.parquet"); df.select("name", "favorite_color").write.save("namesAndFavColors.parquet") It will save the name, favorite_color columns to a parquet file. You can read more information over here

Re: send this email to unsubscribe

2015-12-02 Thread Akhil Das
If you haven't unsubscribed already, shoot an email to user-unsubscr...@spark.apache.org Also read more here http://spark.apache.org/community.html Thanks Best Regards On Thu, Nov 26, 2015 at 7:51 AM, ngocan211 . wrote: > >

Re: Error in block pushing thread puts the KinesisReceiver in a stuck state

2015-12-02 Thread Akhil Das
Did you go through the executor logs completely? A "Futures timed out" exception mostly occurs when one of the tasks/jobs spends way too much time and fails to respond; this happens when there's a GC pause or memory overhead. Thanks Best Regards On Tue, Dec 1, 2015 at 12:09 AM, Spark Newbie

Re: spark sql cli query results written to file ?

2015-12-02 Thread Akhil Das
Oops 3 mins late. :) Thanks Best Regards On Thu, Dec 3, 2015 at 11:49 AM, Sahil Sareen <sareen...@gmail.com> wrote: > Yeah, Thats the example from the link I just posted. > > -Sahil > > On Thu, Dec 3, 2015 at 11:41 AM, Akhil Das <ak...@sigmoidanalytics.com> &

Re: Multiplication on decimals in a dataframe query

2015-12-02 Thread Akhil Das
Not quite sure what's happening, but it's not an issue with multiplication I guess, as the following query worked for me: trades.select(trades("price")*9.5).show +-------------+ |(price * 9.5)| +-------------+ |199.5| |228.0| |190.0| |199.5| |190.0| |

Re: Debug Spark

2015-12-02 Thread Akhil Das
This doc will get you started https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ Thanks Best Regards On Sun, Nov 29, 2015 at 9:48 PM, Masf wrote: > Hi > > Is it possible to debug spark locally with IntelliJ or another

Re: Spark on Mesos with Centos 6.6 NFS

2015-12-01 Thread Akhil Das
Can you try mounting the NFS directory on all machines on the same location? (say /mnt/nfs) and try it again? Thanks Best Regards On Thu, Nov 26, 2015 at 1:22 PM, leonidas wrote: > Hello, > I have a setup with spark 1.5.1 on top of Mesos with one master and 4 > slaves. I

Re: [SPARK STREAMING ] Sending data to ElasticSearch

2015-11-09 Thread Akhil Das
Have a look at https://github.com/elastic/elasticsearch-hadoop#apache-spark You can simply call the .saveToEs function to store your RDD data into ES. Thanks Best Regards On Thu, Oct 29, 2015 at 8:19 PM, Nipun Arora wrote: > Hi, > > I am sending data to an

Re: heap memory

2015-11-09 Thread Akhil Das
It's coming from Parquet; you can try increasing your driver memory and see if it still occurs. Thanks Best Regards On Fri, Oct 30, 2015 at 7:16 PM, Younes Naguib <

Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
Is that all you have in the executor logs? I suspect some of those jobs are having a hard time managing the memory. Thanks Best Regards On Sun, Nov 1, 2015 at 9:38 PM, Romi Kuntsman wrote: > [adding dev list since it's probably a bug, but i'm not sure how to > reproduce so I

Re: Guidance to get started

2015-11-09 Thread Akhil Das
You can read the installation details from here http://spark.apache.org/docs/latest/ You can read about contributing to spark from here https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Thanks Best Regards On Thu, Oct 29, 2015 at 3:53 PM, Aaska Shah

Re: Issue on spark.driver.maxResultSize

2015-11-09 Thread Akhil Das
You can set it in your conf/spark-defaults.conf file, or you will have to set it before you create the SparkContext. Thanks Best Regards On Fri, Oct 30, 2015 at 4:31 AM, karthik kadiyam < karthik.kadiyam...@gmail.com> wrote: > Hi, > > In spark streaming job i had the following setting > >

Re: Spark 1.5.1 Dynamic Resource Allocation

2015-11-09 Thread Akhil Das
Did you go through http://spark.apache.org/docs/latest/job-scheduling.html#configuration-and-setup for YARN? I guess you will have to copy the spark-1.5.1-yarn-shuffle.jar to the classpath of all nodemanagers in your cluster. Thanks Best Regards On Fri, Oct 30, 2015 at 7:41 PM, Tom Stewart <

Re: How to properly read the first number lines of file into a RDD

2015-11-09 Thread Akhil Das
There are multiple ways to achieve this: 1. Read the N lines from the driver and then do a sc.parallelize(nlines) to create an RDD out of it. 2. Create an RDD with N+M, do a take on N and then broadcast or parallelize the returned list. 3. Something like this if the file is in hdfs: val n_f =
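
A sketch of options 1 and 2 (paths and N are placeholders):

  import scala.io.Source

  val n = 100
  // Option 1: read the first N lines of a local file on the driver, then parallelize them.
  val firstN = Source.fromFile("/path/to/local/file.txt").getLines().take(n).toSeq
  val firstNRdd = sc.parallelize(firstN)

  // Option 2: build the full RDD, take N (an action), then parallelize the returned array.
  val full = sc.textFile("hdfs:///path/to/file.txt")
  val firstNFromHdfs = sc.parallelize(full.take(n))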

Re: sample or takeSample or ??

2015-11-09 Thread Akhil Das
You can't create a new RDD by selecting a few elements; rdd.take(n), takeSample, etc. are actions and will trigger your entire pipeline to be executed. You can although do something like this I guess: val sample_data = rdd.take(10) val sample_rdd = sc.parallelize(sample_data) Thanks Best

Re: Some spark apps fail with "All masters are unresponsive", while others pass normally

2015-11-09 Thread Akhil Das
ntsman*, *Big Data Engineer* > http://www.totango.com > > On Mon, Nov 9, 2015 at 4:59 PM, Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> Is that all you have in the executor logs? I suspect some of those jobs >> are having a hard time managing the memory. >

Re: Prevent partitions from moving

2015-11-03 Thread Akhil Das
Most likely in your case the partition keys are not evenly distributed, and hence you notice some of your tasks taking way too long to process. You will have to use a custom partitioner

Re: Apache Spark on Raspberry Pi Cluster with Docker

2015-11-03 Thread Akhil Das
Can you try it with just: spark-submit --master spark://master:6066 --class SimpleApp target/simple-project-1.0.jar And see if it works? Even better idea would be to spawn a spark-shell (*MASTER=spark://master:6066 bin/spark-shell*) and try out a simple *sc.parallelize(1 to 1000).collect*

Re: Submitting Spark Applications - Do I need to leave ports open?

2015-11-02 Thread Akhil Das
Yes you need to open up a few ports for that to happen, have a look at http://spark.apache.org/docs/latest/configuration.html#networking you can see the *.port configurations, which bind to random ports by default; fix those ports to specific numbers and open those ports in your firewall and it should

Re: --jars option using hdfs jars cannot effect when spark standlone deploymode with cluster

2015-11-02 Thread Akhil Das
Can you give a try putting the jar locally without hdfs? Thanks Best Regards On Wed, Oct 28, 2015 at 8:40 AM, our...@cnsuning.com wrote: > hi all, >when using command: > spark-submit *--deploy-mode cluster --jars > hdfs:///user/spark/cypher.jar* --class >

Re: How to catch error during Spark job?

2015-11-02 Thread Akhil Das
Usually you add exception handling within the transformations; in your case you have added it in the driver code, and that approach won't be able to catch those exceptions happening inside the executors. eg: try { val rdd = sc.parallelize(1 to 100) rdd.foreach(x => throw new
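
A minimal sketch of moving the handling inside the transformation; riskyOperation is a placeholder for your own logic:

  def riskyOperation(x: Int): Int =
    if (x % 10 == 0) throw new RuntimeException("boom") else x * 2

  val rdd = sc.parallelize(1 to 100)
  // The try/catch now runs on the executors, where the exception is actually thrown.
  val results = rdd.map { x =>
    try Right(riskyOperation(x))
    catch { case e: Exception => Left(s"failed on $x: ${e.getMessage}") }
  }
  val failures = results.collect { case Left(msg) => msg }   // RDD of error messages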

Re: streaming.twitter.TwitterUtils what is the best way to save twitter status to HDFS?

2015-11-01 Thread Akhil Das
You can use the .saveAsObjectFiles("hdfs://sigmoid/twitter/status/") since you want to store the Status objects, and for every batch it will create a directory under /status (the name will mostly be the timestamp). Since the data is small (hardly a couple of MBs for a 1 sec interval) it will not overwhelm

Re: Unable to run applications on spark in standalone cluster mode

2015-11-01 Thread Akhil Das
Can you paste the contents of your spark-env.sh file? It would also be good to have a look at the /etc/hosts file. "Cannot bind to the given IP address" can be resolved if you put the hostname instead of the IP address. Also make sure the configuration (conf directory) across your cluster has the same

Re: Unable to use saveAsSequenceFile

2015-11-01 Thread Akhil Das
Make sure your firewall isn't blocking the requests. Thanks Best Regards On Sat, Oct 24, 2015 at 5:04 PM, Amit Singh Hora wrote: > Hi All, > > I am trying to wrote an RDD as Sequence file into my Hadoop cluster but > getting connection time out again and again ,I can ping

Re: java how to configure streaming.dstream.DStream<> saveAsTextFiles() to work with hdfs?

2015-11-01 Thread Akhil Das
How are you submitting your job? You need to make sure HADOOP_CONF_DIR is pointing to your hadoop configuration directory (with core-site.xml, hdfs-site.xml files), If you have them set properly then make sure you are giving the full hdfs url like:

Re: Spark Streaming: how to StreamingContext.queueStream

2015-11-01 Thread Akhil Das
You can do something like this: val rddQueue = scala.collection.mutable.Queue(rdd1,rdd2,rdd3) val qDstream = ssc.queueStream(rddQueue) Thanks Best Regards On Sat, Oct 24, 2015 at 4:43 AM, Anfernee Xu wrote: > Hi, > > Here's my situation, I have some kind of offline

Re: Running 2 spark application in parallel

2015-11-01 Thread Akhil Das
Have a look at the dynamic resource allocation listed here https://spark.apache.org/docs/latest/job-scheduling.html Thanks Best Regards On Thu, Oct 22, 2015 at 11:50 PM, Suman Somasundar < suman.somasun...@oracle.com> wrote: > Hi all, > > > > Is there a way to run 2 spark applications in

Re: Saving RDDs in Tachyon

2015-10-30 Thread Akhil Das
I guess you can do a .saveAsObjectFile and read it back with sc.objectFile Thanks Best Regards On Fri, Oct 23, 2015 at 7:57 AM, mark wrote: > I have Avro records stored in Parquet files in HDFS. I want to read these > out as an RDD and save that RDD in Tachyon for

Re: Whether Spark will use disk when the memory is not enough on MEMORY_ONLY Storage Level

2015-10-30 Thread Akhil Das
You can set it to MEMORY_AND_DISK, in this case data will fall back to disk when it exceeds the memory. Thanks Best Regards On Fri, Oct 23, 2015 at 9:52 AM, JoneZhang wrote: > 1.Whether Spark will use disk when the memory is not enough on MEMORY_ONLY > Storage Level? >
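
For example, a sketch of persisting with the fallback level (the RDD here is just a stand-in):

  import org.apache.spark.storage.StorageLevel

  val rdd = sc.parallelize(1 to 1000000)
  val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
  cached.count()   // first action materializes the cache; partitions that don't fit in memory spill to disk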

Re: spark streaming failing to replicate blocks

2015-10-23 Thread Akhil Das
eems correct. > > Thanks, > Eugen > > 2015-10-23 8:30 GMT+02:00 Akhil Das <ak...@sigmoidanalytics.com>: > >> Mostly a network issue, you need to check your network configuration from >> the aws console and make sure the ports are accessible within the cluster. >>

Re: spark streaming failing to replicate blocks

2015-10-23 Thread Akhil Das
know why this happens, is that some > known issue? > > Thanks, > Eugen > > 2015-10-22 19:08 GMT+07:00 Akhil Das <ak...@sigmoidanalytics.com>: > >> Can you try fixing spark.blockManager.port to specific port and see if >> the issue exists? >> >>

Re: Get the previous state string

2015-10-22 Thread Akhil Das
That way, you will eventually end up bloating up that list. Instead, you could push the stream to a noSQL database (like hbase or cassandra etc) and then read it back and join it with your current stream if that's what you are looking for. Thanks Best Regards On Thu, Oct 15, 2015 at 6:11 PM,

Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-10-22 Thread Akhil Das
Did you try passing the mysql connector jar through --driver-class-path Thanks Best Regards On Sat, Oct 17, 2015 at 6:33 AM, Hurshal Patel wrote: > Hi all, > > I've been struggling with a particularly puzzling issue after upgrading to > Spark 1.5.1 from Spark 1.4.1. > >

Re: Output println info in LogMessage Info ?

2015-10-22 Thread Akhil Das
Yes, using log4j you can log everything. Here's a thread with example http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs Thanks Best Regards On Sun, Oct 18, 2015 at 12:10 AM, kali.tumm...@gmail.com <
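
A small sketch of logging from your own code with log4j (the object name is just an example; where the output lands depends on your log4j.properties):

  import org.apache.log4j.Logger

  object MyJob {
    @transient lazy val log = Logger.getLogger(getClass.getName)

    def main(args: Array[String]): Unit = {
      log.info("job started")            // goes to whatever appender log4j is configured with
      log.warn("something to look at")
    }
  }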

Re: Can I convert RDD[My_OWN_JAVA_CLASS] to DataFrame in Spark 1.3.x?

2015-10-22 Thread Akhil Das
Have a look at http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection if you haven't seen that already. Thanks Best Regards On Thu, Oct 15, 2015 at 10:56 PM, java8964 wrote: > Hi, Sparkers: > > I wonder if I can convert a RDD
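
For a Scala case class the reflection route looks roughly like this (names are illustrative; for a plain Java class, sqlContext.createDataFrame(javaRDD, MyClass.class) is the bean-based equivalent):

  case class Person(name: String, age: Int)

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._

  val df = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25))).toDF()
  df.printSchema()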

Re: spark streaming failing to replicate blocks

2015-10-22 Thread Akhil Das
Can you try fixing spark.blockManager.port to specific port and see if the issue exists? Thanks Best Regards On Mon, Oct 19, 2015 at 6:21 PM, Eugen Cepoi wrote: > Hi, > > I am running spark streaming 1.4.1 on EMR (AMI 3.9) over YARN. > The job is reading data from
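
Fixing the port can be done on the SparkConf; the port number below is arbitrary:

  val conf = new org.apache.spark.SparkConf()
    .setAppName("streaming-app")
    .set("spark.blockManager.port", "40101")   // pick any port that is open between the nodes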

Re: multiple pyspark instances simultaneously (same time)

2015-10-22 Thread Akhil Das
Did you read https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application Thanks Best Regards On Thu, Oct 15, 2015 at 11:31 PM, jeff.sadow...@gmail.com < jeff.sadow...@gmail.com> wrote: > I am having issues trying to setup spark to run jobs simultaneously. > > I

Re: Guaranteed processing orders of each batch in Spark Streaming

2015-10-22 Thread Akhil Das
I guess the order is guaranteed unless you set the spark.streaming.concurrentJobs to a higher number than 1. Thanks Best Regards On Mon, Oct 19, 2015 at 12:28 PM, Renjie Liu wrote: > Hi, all: > I've read source code and it seems that there is no guarantee that the >

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Akhil Das
Convert your data to parquet, it saves space and time. Thanks Best Regards On Mon, Oct 19, 2015 at 11:43 PM, ahaider3 wrote: > Hi, > A lot of the data I have in HDFS is compressed. I noticed when I load this > data into spark and cache it, Spark unrolls the data like
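
A sketch of the conversion, assuming the compressed input is JSON text and the paths are placeholders:

  // Gzipped text/JSON is decompressed transparently on read.
  val df = sqlContext.read.json("hdfs:///data/raw/events.json.gz")
  df.write.parquet("hdfs:///data/parquet/events")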

Re: java TwitterUtils.createStream() how create "user stream" ???

2015-10-22 Thread Akhil Das
I don't think the one that comes with Spark would listen to specific user feeds, but yes you can filter the public tweets by passing the filters argument. Here's an example for you to start
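
A sketch of passing keyword filters (credentials are assumed to be set as twitter4j.oauth.* system properties; keywords are placeholders):

  import org.apache.spark.streaming.twitter.TwitterUtils

  // None = build OAuth from system properties; the filters narrow the public stream by keyword.
  val tweets = TwitterUtils.createStream(ssc, None, Seq("spark", "bigdata"))
  tweets.map(_.getText).print()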

Re: Can we split partition

2015-10-22 Thread Akhil Das
Did you try coalesce? It doesn't shuffle the data around. Thanks Best Regards On Wed, Oct 21, 2015 at 10:27 AM, shahid wrote: > Hi > > I have a large partition(data skewed) i need to split it to no. of > partitions, repartitioning causes lot of shuffle. Can we do that..? > >

Re: Spark handling parallel requests

2015-10-19 Thread Akhil Das
the number of > concurrent requests? ... > > As Akhil said, Kafka might help in your case. Otherwise, you need to read > the designs or even source codes of Kafka and Spark Streaming. > > Best wishes, > > Xiao Li > > > 2015-10-11 23:19 GMT-07:00 Akhil Das <ak...@sigm

Re: spark streaming filestream API

2015-10-14 Thread Akhil Das
Key and Value are the ones that you are using with your InputFormat. Eg: JavaPairInputDStream lines = jssc.fileStream("/sigmoid", LongWritable.class, Text.class, TextInputFormat.class); TextInputFormat uses LongWritable as the Key and Text as the Value class. If your data is plain CSV or text
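
The Scala equivalent, with the key/value/InputFormat types spelled out (the directory is a placeholder):

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  val lines = ssc
    .fileStream[LongWritable, Text, TextInputFormat]("hdfs:///sigmoid")
    .map { case (_, text) => text.toString }   // drop the LongWritable offset, keep the line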

Re: Changing application log level in standalone cluster

2015-10-14 Thread Akhil Das
You should be able to do that from your application. In the beginning of the application, just add: import org.apache.log4j.Logger; import org.apache.log4j.Level; Logger.getLogger("org").setLevel(Level.OFF); Logger.getLogger("akka").setLevel(Level.OFF) That will switch off the logs. Thanks Best

Re: Cannot connect to standalone spark cluster

2015-10-14 Thread Akhil Das
Open a spark-shell by: MASTER=spark://Ellens-MacBook-Pro.local:7077 bin/spark-shell And if it's able to connect, then check your Java project's build file and make sure you are having the proper spark version. Thanks Best Regards On Sat, Oct 10, 2015 at 3:07 AM, ekraffmiller

Re: how to use SharedSparkContext

2015-10-14 Thread Akhil Das
Did a quick search and found the following, I haven't tested it myself. Add the following to your build.sbt libraryDependencies += "com.holdenkarau" % "spark-testing-base_2.10" % "1.5.0_1.4.0_1.4.1_0.1.2" Create a class extending com.holdenkarau.spark.testing.SharedSparkContext And you
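
An untested sketch of what a test class using that trait could look like (ScalaTest assumed):

  import com.holdenkarau.spark.testing.SharedSparkContext
  import org.scalatest.FunSuite

  class WordCountSpec extends FunSuite with SharedSparkContext {
    test("counts elements") {
      val rdd = sc.parallelize(Seq("a", "b", "a"))   // sc is provided by SharedSparkContext
      assert(rdd.count() === 3)
    }
  }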

Re: spark streaming filestream API

2015-10-14 Thread Akhil Das
les/src/main/java/mr/wholeFile/WholeFileInputFormat.java?r=3 > In this case, key would be Nullwritable and value would be BytesWritable > right? > > > > Unfortunately my files are binary and not text files. > > > > Regards, > > Anand.C > > > > *From:*

Re: unresolved dependency: org.apache.spark#spark-streaming_2.10;1.5.0: not found

2015-10-13 Thread Akhil Das
You need to add "org.apache.spark" % "spark-streaming_2.10" % "1.5.0" to the dependencies list. Thanks Best Regards On Tue, Oct 6, 2015 at 3:20 PM, shahab wrote: > Hi, > > I am trying to use Spark 1.5, Mlib, but I keep getting > "sbt.ResolveException: unresolved
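
A build.sbt fragment along those lines (versions must match the Spark you run against; "provided" assumes you submit with spark-submit):

  scalaVersion := "2.10.5"

  libraryDependencies ++= Seq(
    "org.apache.spark" % "spark-core_2.10"      % "1.5.0" % "provided",
    "org.apache.spark" % "spark-streaming_2.10" % "1.5.0" % "provided",
    "org.apache.spark" % "spark-mllib_2.10"     % "1.5.0" % "provided"
  )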

Re: Spark handling parallel requests

2015-10-12 Thread Akhil Das
Instead of pushing your requests to the socket, why don't you push them to a Kafka or any other message queue and use spark streaming to process them? Thanks Best Regards On Mon, Oct 5, 2015 at 6:46 PM, wrote: > Hi , > i am using Scala , doing a socket

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-12 Thread Akhil Das
Can you look a bit deeper in the executor logs? It could be filling up the memory and getting killed. Thanks Best Regards On Mon, Oct 5, 2015 at 8:55 PM, Lan Jiang wrote: > I am still facing this issue. Executor dies due to > > org.apache.avro.AvroRuntimeException:

Re: How to change verbosity level and redirect verbosity to file?

2015-10-12 Thread Akhil Das
Have a look http://stackoverflow.com/questions/28454080/how-to-log-using-log4j-to-local-file-system-inside-a-spark-application-that-runs Thanks Best Regards On Mon, Oct 5, 2015 at 9:42 PM, wrote: > Hi, > > I would like to read the full spark-submit log once a job

Re: Too many executors are created

2015-10-11 Thread Akhil Das
For some reason the executors are getting killed, 15/09/29 12:21:02 INFO AppClient$ClientEndpoint: Executor updated: app-20150929120924-/24463 is now EXITED (Command exited with code 1) Can you paste your spark-submit command? You can also look in the executor logs and see what's going on.

Re: Spark cluster - use machine name in WorkerID, not IP address

2015-10-11 Thread Akhil Das
Did you try setting the SPARK_LOCAL_IP in the conf/spark-env.sh file on each node? Thanks Best Regards On Fri, Oct 2, 2015 at 4:18 AM, markluk wrote: > I'm running a standalone Spark cluster of 1 master and 2 slaves. > > My slaves file under /conf list the fully qualified

Re: Compute Real-time Visualizations using spark streaming

2015-10-11 Thread Akhil Das
The simplest approach would be to push the streaming data (after the computations) to a SQL-like DB and then let your visualization piece pull it from the DB. Another approach would be to make your visualization piece a web-socket (if you are using D3.js etc.) and then from your streaming application

Re: UnknownHostException with Mesos and custom Jar

2015-09-30 Thread Akhil Das
Can you try replacing your code with the hdfs uri? like: sc.textFile("hdfs://...").collect().foreach(println) Thanks Best Regards On Tue, Sep 29, 2015 at 1:45 AM, Stephen Hankinson wrote: > Hi, > > Wondering if anyone can help me with the issue I am having. > > I am

Re: using JavaRDD in spark-redis connector

2015-09-30 Thread Akhil Das
You can create a JavaRDD as normal and then call the .rdd() to get the RDD. Thanks Best Regards On Mon, Sep 28, 2015 at 9:01 PM, Rohith P wrote: > Hi all, > I am trying to work with spark-redis connector (redislabs) which > requires all transactions between

Re: Reading kafka stream and writing to hdfs

2015-09-30 Thread Akhil Das
Like: counts.saveAsTextFiles("hdfs://host:port/some/location") Thanks Best Regards On Tue, Sep 29, 2015 at 2:15 AM, Chengi Liu wrote: > Hi, > I am going thru this example here: > >

Re: log4j Spark-worker performance problem

2015-09-30 Thread Akhil Das
It depends on how big the lines are; on a typical HDD you can write at max 10-15 MB/s, and on SSDs it can be up to 30-40 MB/s. Thanks Best Regards On Mon, Sep 28, 2015 at 3:57 PM, vaibhavrtk wrote: > Hello > > We need a lot of logging for our application about 1000 lines needed to

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-09-30 Thread Akhil Das
Each JSON doc should be on a single line, I guess. http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets Note that the file that is offered as *a json file* is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a
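
For example, a file shaped like this loads cleanly (the path is a placeholder):

  // people.json -- one self-contained object per line:
  //   {"name":"Alice","age":30}
  //   {"name":"Bob","age":25}
  val people = sqlContext.read.json("hdfs:///data/people.json")
  people.printSchema()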

Re: UnknownHostException with Mesos and custom Jar

2015-09-30 Thread Akhil Das
> Stephen Hankinson, P. Eng. > CTO > Affinio Inc. > 301 - 211 Horseshoe Lake Dr. > Halifax, Nova Scotia, Canada > B3S 0B9 > > http://www.affinio.com > > On Wed, Sep 30, 2015 at 4:21 AM, Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> Can you try rep

Re: HDFS is undefined

2015-09-28 Thread Akhil Das
For some reason Spark isn't picking up your Hadoop confs. Did you download Spark compiled with the Hadoop version that you have in the cluster? Thanks Best Regards On Fri, Sep 25, 2015 at 7:43 PM, Angel Angel wrote: > hello, > I am running the spark application. >

Re: how to handle OOMError from groupByKey

2015-09-28 Thread Akhil Das
You can try to increase the number of partitions to get rid of the OOM errors. Also try to use reduceByKey instead of groupByKey. Thanks Best Regards On Sat, Sep 26, 2015 at 1:05 AM, Elango Cheran wrote: > Hi everyone, > I have an RDD of the format (user: String,
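
A word-count-style sketch of the swap (input path and partition count are placeholders):

  val pairs = sc.textFile("hdfs:///data/words.txt")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))

  // reduceByKey combines values map-side before the shuffle, unlike groupByKey,
  // and the explicit partition count spreads the per-task memory load.
  val counts = pairs.reduceByKey(_ + _, 400)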

Re: Master getting down with Memory issue.

2015-09-28 Thread Akhil Das
explaine to me how increasing number of partition (which >> is thing is worker nodes) will help. >> >> As issue is that my master is getting OOM. >> >> Thanks, >> Saurav Sinha >> >> On Mon, Sep 28, 2015 at 2:32 PM, Akhil Das <ak...@sigmoidanalytics.com&g

Re: Spark 1.5.0 Not able to submit jobs using cluster URL

2015-09-28 Thread Akhil Das
Update the dependency version in your jobs build file, Also make sure you have updated the spark version to 1.5.0 everywhere. (in the cluster, code) Thanks Best Regards On Mon, Sep 28, 2015 at 11:29 AM, lokeshkumar wrote: > Hi forum > > I have just upgraded spark from 1.4.0

Re: Master getting down with Memory issue.

2015-09-28 Thread Akhil Das
This behavior totally depends on the job that you are doing. Usually increasing the # of partitions will sort out this issue. It would be good if you can paste the code snippet or explain what type of operations that you are doing. Thanks Best Regards On Mon, Sep 28, 2015 at 11:37 AM, Saurav

Re: Spark 1.5.0 Not able to submit jobs using cluster URL

2015-09-28 Thread Akhil Das
d I did that. > > On Mon, Sep 28, 2015 at 2:30 PM Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> Update the dependency version in your jobs build file, Also make sure you >> have updated the spark version to 1.5.0 everywhere. (in the cluster, code) >> >

Re: Weird worker usage

2015-09-26 Thread Akhil Das
that will solve your problem. Thanks Best Regards On Sun, Sep 27, 2015 at 12:50 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote: > That means only > > Thanks > Best Regards > > On Sun, Sep 27, 2015 at 12:07 AM, N B <nb.nos...@gmail.com> wrote: > >> Hello, >

Re: Weird worker usage

2015-09-26 Thread Akhil Das
y >> disregarding N2 until its the final stage where data is being written out >> to Elasticsearch. I am not sure I understand the reason behind it not >> distributing more partitions to N2 to begin with and use it effectively. >> Since there are only 12 cores on N1 and 25 total partiti

Re: executor-cores setting does not work under Yarn

2015-09-25 Thread Akhil Das
Which version of Spark are you using? Can you also check what's set in your conf/spark-defaults.conf file? Thanks Best Regards On Fri, Sep 25, 2015 at 1:58 AM, Gavin Yue wrote: > Running Spark app over Yarn 2.7 > > Here is my sparksubmit setting: > --master yarn-cluster

Re: Error: Asked to remove non-existent executor

2015-09-25 Thread Akhil Das
What do you mean by being behind a NAT? Does it mean you are submitting your jobs to a remote Spark cluster from your local machine? If that's the case then you need to take care of a few ports (in the NAT) http://spark.apache.org/docs/latest/configuration.html#networking which assume random as

Re: Weird worker usage

2015-09-25 Thread Akhil Das
The number of parallel tasks totally depends on the # of partitions that you have; if you are not getting sufficient partitions (partitions > total # cores) then try to do a .repartition. Thanks Best Regards On Fri, Sep 25, 2015 at 1:44 PM, N B wrote: > Hello all, > > I have a

Re: unsubscribe

2015-09-25 Thread Akhil Das
Send an email to dev-unsubscr...@spark.apache.org instead of dev@spark.apache.org Thanks Best Regards On Fri, Sep 25, 2015 at 4:00 PM, Nirmal R Kumar wrote: >

Re: Setting Spark TMP Directory in Cluster Mode

2015-09-25 Thread Akhil Das
Try with spark.local.dir in the spark-defaults.conf or SPARK_LOCAL_DIR in the spark-env.sh file. Thanks Best Regards On Fri, Sep 25, 2015 at 2:14 PM, mufy wrote: > Faced with an issue where Spark temp files get filled under > /opt/spark-1.2.1/tmp on the local filesystem

Why does my Thrift (c_glib) client fails to receive binary (GByteArray) data?

2015-09-23 Thread Akhil Das
Hi I have posted the original question here http://stackoverflow.com/questions/32740773/why-does-my-thrift-c-glib-client-fails-to-receive-binary-gbytearray-data I'm sending a GByteArray from the server to the client and for some reason the client crashes with the following message: [image:

Re: unsubscribe

2015-09-23 Thread Akhil Das
To unsubscribe, you need to send an email to user-unsubscr...@spark.apache.org as described here http://spark.apache.org/community.html Thanks Best Regards On Wed, Sep 23, 2015 at 1:23 AM, Stuart Layton wrote: > > > -- > Stuart Layton >

Re: Has anyone used the Twitter API for location filtering?

2015-09-23 Thread Akhil Das
> > > On Tue, Sep 22, 2015 at 2:20 AM, Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> ​That's because sometime getPlace returns null and calling getLang over >> null throws up either null pointer exception or noSuchMethodError. You need >> to filter out

Re: SparkContext declared as object variable

2015-09-23 Thread Akhil Das
createStream and applying some > transformations. > > Is creating sparContext at object level instead of creating in main > doesn't work > > On Tue, Sep 22, 2015 at 2:59 PM, Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> Its a "value" not a variable,

Re: Spark Web UI + NGINX

2015-09-22 Thread Akhil Das
Can you not just tunnel it? Like on Machine A: ssh -L 8080:127.0.0.1:8080 machineB And on your local machine: ssh -L 80:127.0.0.1:8080 machineA And then simply open http://localhost/ and that will show the Spark UI running on machineB. The people at DigitalOcean have made a wonderful article on how

Re: Spark Lost executor && shuffle.FetchFailedException

2015-09-22 Thread Akhil Das
If you can look a bit deeper in the executor logs, then you might find the root cause for this issue. Also make sure the ports (seems 34869 here) are accessible between all the machines. Thanks Best Regards On Mon, Sep 21, 2015 at 12:40 PM, wrote: > Hi All: > > > When I

Re: Cache after filter Vs Writing back to HDFS

2015-09-22 Thread Akhil Das
Instead of .map you can try doing a .mapPartitions and see the performance. Thanks Best Regards On Fri, Sep 18, 2015 at 2:47 AM, Gavin Yue wrote: > For a large dataset, I want to filter out something and then do the > computing intensive work. > > What I am doing now: >
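
A sketch of the idea; expensiveSetup is a placeholder for whatever per-record cost you want to pay only once per partition:

  def expensiveSetup(): String => String = {
    // e.g. build a parser, open a connection, compile a regex...
    line => line.toUpperCase
  }

  val rdd = sc.textFile("hdfs:///data/input")
  val out = rdd.mapPartitions { iter =>
    val transform = expensiveSetup()   // runs once per partition instead of once per record
    iter.map(transform)
  }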

Re: Lost tasks in Spark SQL join jobs

2015-09-22 Thread Akhil Das
If you look a bit in the error logs, you can possibly see other issues like GC over head etc, which causes the next set of tasks to fail. Thanks Best Regards On Thu, Sep 17, 2015 at 9:26 AM, Gang Bai wrote: > Hi all, > > I’m joining two tables on a specific

Re: Has anyone used the Twitter API for location filtering?

2015-09-22 Thread Akhil Das
That's because sometimes getPlace returns null, and calling getLang on null throws either a NullPointerException or a NoSuchMethodError. You need to filter out those statuses which don't include location data. Thanks Best Regards On Fri, Sep 18, 2015 at 12:46 AM, Jo Sunad
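
Something along these lines, where tweets is assumed to be the DStream of twitter4j Status objects from TwitterUtils.createStream:

  // Drop statuses without place information before reading anything off it.
  val withPlace = tweets.filter(status => status.getPlace != null)
  val places = withPlace.map(status => status.getPlace.getName)
  places.print()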

Re: SparkContext declared as object variable

2015-09-22 Thread Akhil Das
It's a "value", not a variable; and what are you parallelizing here? Thanks Best Regards On Fri, Sep 18, 2015 at 11:21 PM, Priya Ch wrote: > Hello All, > > Instead of declaring sparkContext in main, declared as object variable > as - > > object sparkDemo > { > >

c_glib tutorial example crashes

2015-09-22 Thread Akhil Das
Hi I have created a JIRA for this https://issues.apache.org/jira/browse/THRIFT-3346 The issue is that, the tutorial server program throws up the following error whenever the tutorial client connects to it. From server: [image: Inline image 1] And this is from the tutorial client: [image:

Re: AWS_CREDENTIAL_FILE

2015-09-22 Thread Akhil Das
No, you can either set the configurations within your SparkConf's hadoop configuration: val hadoopConf = sparkContext.hadoopConfiguration hadoopConf.set("fs.s3n.awsAccessKeyId", s3Key) hadoopConf.set("fs.s3n.awsSecretAccessKey", s3Secret) or you can set it in the environment as: export
