Re: Where does the Driver run?

2019-03-24 Thread Akhil Das
uld > support your claim and contradict Spark docs for deployMode = cluster. > > The evidence seems to contradict the docs. I am now beginning to wonder if > the Driver only runs in the cluster if we use spark-submit > > > > From: Akhil Das > Reply: Akhil Das > Dat

Re: Where does the Driver run?

2019-03-23 Thread Akhil Das
If you are starting your "my-app" on your local machine, that's where the driver is running. [image: image.png] Hope this helps. On Sun, Mar 24, 2019 at 4:13 AM Pat Ferrel wrote: > I have researched this for a significant amount of

Re: Configuration for unit testing and sql.shuffle.partitions

2017-09-16 Thread Akhil Das
spark.sql.shuffle.partitions is still used I believe. I can see it in the code and in the documentation page

Re: PLs assist: trying to FlatMap a DataSet / partially OT

2017-09-16 Thread Akhil Das
scala> case class Fruit(price: Double, name: String) defined class Fruit scala> val ds = Seq(Fruit(10.0,"Apple")).toDS() ds: org.apache.spark.sql.Dataset[Fruit] = [price: double, name: string] scala> ds.rdd.flatMap(f => f.name.toList).collect res8: Array[Char] = Array(A, p, p, l, e) This is

Re: Size exceeds Integer.MAX_VALUE issue with RandomForest

2017-09-16 Thread Akhil Das
What are the parameters you passed to the classifier and what is the size of your train data? You are hitting that issue because one of the block sizes is over 2G; repartitioning the data will help. On Fri, Sep 15, 2017 at 7:55 PM, rpulluru wrote: > Hi, > > I am using
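As a rough illustration of the repartition-before-training idea (trainData, the partition count, and the tree parameters below are all placeholders, not values from this thread):

import org.apache.spark.mllib.tree.RandomForest

// trainData is assumed to be an RDD[LabeledPoint]; pick a partition count that keeps
// every shuffle block well under the 2 GB limit
val repartitioned = trainData.repartition(200)
val model = RandomForest.trainClassifier(repartitioned, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), numTrees = 50,
  featureSubsetStrategy = "auto", impurity = "gini", maxDepth = 5, maxBins = 32)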

Re: [SPARK-SQL] Does spark-sql have Authorization built in?

2017-09-16 Thread Akhil Das
I guess no. I came across a test case where they are marked as Unsupported, you can see it here. However, the one running inside Databricks has support for this.

Re: spark.streaming.receiver.maxRate

2017-09-16 Thread Akhil Das
I believe that's a question for the NiFi list; as you can see, the code base is quite old https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver/src/main/java/org/apache/nifi/spark and it doesn't make use of the

Re: How is data desensitization (example: select bank_no from users)?

2017-08-24 Thread Akhil Das
Usually analysts will not have access to data stored in the PCI Zone, you could write the data out to a table for the analysts by masking the sensitive information. Eg: > val mask_udf = udf((info: String) => info.patch(0, "*" * 12, 7)) > val df = sc.parallelize(Seq(("user1",
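A fuller sketch of that masking approach (assuming df is the DataFrame holding the users table; the column and table names are made up for illustration):

import org.apache.spark.sql.functions.udf

// Replace the first 7 characters of the account number with asterisks, as in mask_udf above
val mask_udf = udf((info: String) => info.patch(0, "*" * 12, 7))
val masked = df.withColumn("bank_no", mask_udf(df("bank_no")))
masked.write.saveAsTable("users_masked")   // analysts query this table instead of the PCI data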

Re: UI for spark machine learning.

2017-08-24 Thread Akhil Das
How many iterations are you doing on the data? Like Jörn said, you don't necessarily need a billion samples for linear regression. On Tue, Aug 22, 2017 at 6:28 PM, Sea aj wrote: > Jorn, > > My question is not about the model type but instead, the spark capability > on reusing

Re: ORC Transaction Table - Spark

2017-08-24 Thread Akhil Das
How are you reading the data? It's clearly saying *java.lang.NumberFormatException: For input string: "0645253_0001" * On Tue, Aug 22, 2017 at 7:40 PM, Aviral Agarwal wrote: > Hi, > > I am trying to read hive orc transaction table through Spark but I am > getting the

Re: [Spark Streaming] Streaming Dynamic Allocation is broken (at least on YARN)

2017-08-24 Thread Akhil Das
Have you tried changing spark.executor.instances=0 to a positive, non-zero value? Also, since it's a streaming application, set executor cores > 1. On Wed, Aug 23, 2017 at 3:38 AM, Karthik Palaniappan wrote: > I ran the HdfsWordCount example using this command: > >

Re: Remote RPC client disassociated

2016-07-01 Thread Akhil Das
-connector-user On Fri, Jul 1, 2016 at 5:45 PM, Joaquin Alzola <joaquin.alz...@lebara.com> wrote: > HI Akhil > > > > I am using: > > Cassandra: 3.0.5 > > Spark: 1.6.1 > > Scala 2.10 > > Spark-cassandra connector: 1.6.0 > > > > *From:* Akhil Da

Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

2016-07-01 Thread Akhil Das
You can use this https://github.com/wurstmeister/kafka-docker to spin up a Kafka cluster and then point your Spark Streaming application at it to consume from it. On Fri, Jul 1, 2016 at 1:19 AM, SRK wrote: > Hi, > > I need to do integration tests using Spark Streaming. My idea is
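A minimal receiver-based consumer against such a dockerized Kafka might look roughly like this (the ZooKeeper address, group id and topic are assumptions, and sc is an existing SparkContext):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
val stream = KafkaUtils.createStream(ssc, "localhost:2181", "test-group", Map("test-topic" -> 1))
stream.map(_._2).print()   // print just the message payloads
ssc.start()
ssc.awaitTermination()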

Re: RDD to DataFrame question with JsValue in the mix

2016-07-01 Thread Akhil Das
Something like this? import sqlContext.implicits._ case class Holder(str: String, js:JsValue) yourRDD.map(x => Holder(x._1, x._2)).toDF() On Fri, Jul 1, 2016 at 3:36 AM, Dood@ODDO wrote: > Hello, > > I have an RDD[(String,JsValue)] that I want to convert into a

Re: Remote RPC client disassociated

2016-07-01 Thread Akhil Das
This looks like a version conflict; which version of Spark are you using? The Cassandra connector you are using is for Scala 2.10.x and Spark 1.6. On Thu, Jun 30, 2016 at 6:34 PM, Joaquin Alzola wrote: > HI List, > > > > I am launching this spark-submit job: >

Re: Spark Task is not created

2016-06-25 Thread Akhil Das
Would be good if you can paste the piece of code that you are executing. On Sun, Jun 26, 2016 at 11:21 AM, Ravindra wrote: > Hi All, > > May be I need to just set some property or its a known issue. My spark > application hangs in test environment whenever I see

Re: Unable to acquire bytes of memory

2016-06-21 Thread Akhil Das
Looks like this issue https://issues.apache.org/jira/browse/SPARK-10309 On Mon, Jun 20, 2016 at 4:27 PM, pseudo oduesp wrote: > Hi , > i don t have no idea why i get this error > > > > Py4JJavaError: An error occurred while calling o69143.parquet. > :

Re: Unsubscribe

2016-06-21 Thread Akhil Das
You need to send an email to user-unsubscr...@spark.apache.org for unsubscribing. Read more over here http://spark.apache.org/community.html On Mon, Jun 20, 2016 at 1:10 PM, Ram Krishna wrote: > Hi Sir, > > Please unsubscribe me > > -- > Regards, > Ram Krishna KT > >

Re: Spark not using all the cluster instances in AWS EMR

2016-06-18 Thread Akhil Das
spark.executor.instances is the parameter that you are looking for. Read more here http://spark.apache.org/docs/latest/running-on-yarn.html On Sun, Jun 19, 2016 at 2:17 AM, Natu Lauchande wrote: > Hi, > > I am running some spark loads . I notice that in it only uses one
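For example (the value 10 is arbitrary), it can be set programmatically, or equivalently passed as --num-executors to spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

// Same effect as spark-submit --num-executors 10
val conf = new SparkConf().set("spark.executor.instances", "10")
val sc = new SparkContext(conf)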

Re: spark streaming - how to purge old data files in data directory

2016-06-18 Thread Akhil Das
Currently, there is no out-of-the-box solution for this, although you can use HDFS utilities to remove older files (say, more than 24 hours old) from the directory. Another approach is discussed here
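One hedged way to do such a cleanup with the Hadoop FileSystem API (the directory and the 24-hour cutoff are only examples):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000   // 24 hours ago
fs.listStatus(new Path("/streaming/input"))                      // placeholder directory
  .filter(_.getModificationTime < cutoff)
  .foreach(f => fs.delete(f.getPath, false))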

Re: Running JavaBased Implementationof StreamingKmeans

2016-06-18 Thread Akhil Das
properly? > > Thanks & Regards > Biplob Biswas > > On Sat, Jun 18, 2016 at 5:59 PM, Akhil Das <ak...@hacked.work> wrote: > >> Looks like you need to set your master to local[2] or local[*] >> >> On Sat, Jun 18, 2016 at 4:54 PM, Biplob Biswas &l

Re: Many executors with the same ID in web UI (under Executors)?

2016-06-18 Thread Akhil Das
A screenshot of the executor tab will explain it better. Usually executors are allocated when the job is started, if you have a multi-node cluster then you'll see executors launched on different nodes. On Sat, Jun 18, 2016 at 9:04 PM, Jacek Laskowski wrote: > Hi, > > This is

Re: Running JavaBased Implementationof StreamingKmeans

2016-06-18 Thread Akhil Das
Looks like you need to set your master to local[2] or local[*] On Sat, Jun 18, 2016 at 4:54 PM, Biplob Biswas wrote: > Hi, > > I implemented the streamingKmeans example provided in the spark website but > in Java. > The full implementation is here, > >

Re: Getting NPE when trying to do spark streaming with Twitter

2016-04-11 Thread Akhil Das
Looks like a YARN issue to me. Can you try checking out this code? https://github.com/akhld/sparkstreaming-twitter Just git clone and do an sbt run after configuring your credentials in the main file

Re: Spark not handling Null

2016-04-11 Thread Akhil Das
Surround the part where it's complaining about the null pointer with a try..catch to avoid failing the whole job. What is happening here is that you are returning null and the following operation then works on that null, which causes the job to fail. Thanks Best Regards On Mon, Apr 11, 2016 at 12:51 PM,
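A small sketch of that idea (rdd and process are placeholder names): wrap the risky call and drop the records that throw, instead of failing the whole job.

// Option + flatMap silently drops the offending records
val safe = rdd.flatMap { x =>
  try Option(process(x))
  catch { case _: NullPointerException => None }
}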

Re:

2016-04-04 Thread Akhil Das
1 core with 4 partitions means it executes them one by one, not in parallel. For the Kafka question, if you don't have higher data volume then you may not need 40 partitions. Thanks Best Regards On Sat, Apr 2, 2016 at 7:35 PM, Hemalatha A < hemalatha.amru...@googlemail.com> wrote: > Hello, > > As

Re: Read Parquet in Java Spark

2016-04-04 Thread Akhil Das
contains only json in each line. > > *Thanks*, > <https://in.linkedin.com/in/ramkumarcs31> > > > On Mon, Apr 4, 2016 at 2:34 PM, Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> Something like this (in scala): >> >> val rdd = parquetFile.j

Re: Read Parquet in Java Spark

2016-04-04 Thread Akhil Das
Something like this (in Scala): val rdd = parquetFile.javaRDD().map(row => row.mkString(",")) You can create a map operation over your javaRDD to convert the org.apache.spark.sql.Row to String (the Row.mkString()

Re: Spark streaming spilling all the data to disk even if memory available

2016-03-31 Thread Akhil Das
gt; wrote: > We are using KafkaUtils.createStream API to read data from kafka topics > and we are using StorageLevel.MEMORY_AND_DISK_SER option while configuring > kafka streams. > > On Wed, Mar 30, 2016 at 12:58 PM, Akhil Das <ak...@sigmoidanalytics.com> > wrote: >

Re: Spark streaming spilling all the data to disk even if memory available

2016-03-30 Thread Akhil Das
Can you elaborate more on where you are streaming the data from, what type of consumer you are using, etc.? Thanks Best Regards On Tue, Mar 29, 2016 at 6:10 PM, Mayur Mohite wrote: > Hi, > > We are running spark streaming app on a single machine and we have >

Re: Unable to Limit UI to localhost interface

2016-03-30 Thread Akhil Das
In your case, you will be able to see the webui (unless restricted with iptables) but you won't be able to submit jobs to that machine from a remote machine since the spark master is spark://127.0.0.1:7077 Thanks Best Regards On Tue, Mar 29, 2016 at 8:12 PM, David O'Gwynn

Re: Master options Cluster/Client descrepencies.

2016-03-30 Thread Akhil Das
Have a look at http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 Thanks Best Regards On Wed, Mar 30, 2016 at 12:09 AM, satyajit vegesna < satyajit.apas...@gmail.com> wrote: > > Hi All, > > I have written a spark program on my dev box , >IDE:Intellij >

Re: aggregateByKey on PairRDD

2016-03-30 Thread Akhil Das
Isn't that what tempRDD.groupByKey does? Thanks Best Regards On Wed, Mar 30, 2016 at 7:36 AM, Suniti Singh wrote: > Hi All, > > I have an RDD having the data in the following form : > > tempRDD: RDD[(String, (String, String))] > > (brand , (product, key)) > >

Re: Null pointer exception when using com.databricks.spark.csv

2016-03-30 Thread Akhil Das
Looks like the winutils.exe is missing from the environment, See https://issues.apache.org/jira/browse/SPARK-2356 Thanks Best Regards On Wed, Mar 30, 2016 at 10:44 AM, Selvam Raman wrote: > Hi, > > i am using spark 1.6.0 prebuilt hadoop 2.6.0 version in my windows machine. >

Re: Issue with wholeTextFiles

2016-03-22 Thread Akhil Das
Can you paste the exception stack here? Thanks Best Regards On Mon, Mar 21, 2016 at 1:42 PM, Sarath Chandra < sarathchandra.jos...@algofusiontech.com> wrote: > I'm using Hadoop 1.0.4 and Spark 1.2.0. > > I'm facing a strange issue. I have a requirement to read a small file from > HDFS and all

Re: pyspark sql convert long to timestamp?

2016-03-22 Thread Akhil Das
Have a look at the from_unixtime() functions. https://spark.apache.org/docs/1.5.0/api/python/_modules/pyspark/sql/functions.html#from_unixtime Thanks Best Regards On Tue, Mar 22, 2016 at 4:49 AM, Andy Davidson < a...@santacruzintegration.com> wrote: > Any idea how I have a col in a data frame

Re: Setting up spark to run on two nodes

2016-03-21 Thread Akhil Das
You can simply execute the sbin/start-slaves.sh file to start up all slave processes. Just make sure you have spark installed on the same path on all the machines. Thanks Best Regards On Sat, Mar 19, 2016 at 4:01 AM, Ashok Kumar wrote: > Experts. > > Please your

Re: Potential conflict with org.iq80.snappy in Spark 1.6.0 environment?

2016-03-21 Thread Akhil Das
Looks like a jar conflict, could you paste the piece of code? and how your dependency file looks like? Thanks Best Regards On Sat, Mar 19, 2016 at 7:49 AM, vasu20 wrote: > Hi, > > I have some code that parses a snappy thrift file for objects. This code > works fine when run

Re: Building spark submodule source code

2016-03-21 Thread Akhil Das
Have a look at the intellij setup https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ Once you have the setup ready, you don't have to recompile the whole stuff every time. Thanks Best Regards On Mon, Mar 21, 2016 at 8:14 AM, Tenghuan He

Re: Error using collectAsMap() in scala

2016-03-21 Thread Akhil Das
What you should be doing is a join, something like this: //Create a key, value pair, key being the column1 val rdd1 = sc.textFile(file1).map(x => (x.split(",")(0),x.split(","))) //Create a key, value pair, key being the column2 val rdd2 = sc.textFile(file2).map(x => (x.split(",")(1),x.split(",")))
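A hedged completion of that snippet: once both sides are keyed, the join itself is simply:

// Pairs up the split arrays for every key present on both sides
val joined = rdd1.join(rdd2)   // RDD[(String, (Array[String], Array[String]))]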

Re: unsubscribe

2016-03-15 Thread Akhil Das
Send an email to user-unsubscr...@spark.apache.org for unsubscribing. Read more over here http://spark.apache.org/community.html Thanks Best Regards On Tue, Mar 15, 2016 at 1:28 PM, Netwaver wrote: > unsubscribe > > > >

Re: unsubscribe

2016-03-15 Thread Akhil Das
Send an email to user-unsubscr...@spark.apache.org for unsubscribing. Read more over here http://spark.apache.org/community.html Thanks Best Regards On Tue, Mar 15, 2016 at 12:56 PM, satish chandra j wrote: > unsubscribe >

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Akhil Das
You can achieve this with the normal RDD way. Have one extra stage in the pipeline where you will properly standardize all the values (like replacing doc with doctor) for all the columns before the join. Thanks Best Regards On Tue, Mar 15, 2016 at 9:16 AM, Suniti Singh

Re: create hive context in spark application

2016-03-15 Thread Akhil Das
Did you try submitting your application with spark-submit? You can also try opening a spark-shell and see if it picks up your hive-site.xml. Thanks Best Regards On Tue, Mar 15, 2016 at 11:58 AM, antoniosi

Re: Streaming app consume multiple kafka topics

2016-03-15 Thread Akhil Das
One way would be to keep it this way: val stream1 = KafkaUtils.createStream(..) // for topic 1 val stream2 = KafkaUtils.createStream(..) // for topic 2 And you will know which stream belongs to which topic. Another approach which you can put in your code itself would be to tag the topic name

Can someone fix this download URL?

2016-03-13 Thread Akhil Das
http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz [image: Inline image 1] There's a broken link for Spark 1.6.1 prebuilt hadoop 2.6 direct download. Thanks Best Regards

Re: Sample project on Image Processing

2016-02-22 Thread Akhil Das
What type of Image processing are you doing? Here's a simple example with Tensorflow https://databricks.com/blog/2016/01/25/deep-learning-with-spark-and-tensorflow.html Thanks Best Regards On Mon, Feb 22, 2016 at 1:53 PM, Mishra, Abhishek wrote: > Hello, > > I am

Re: How do we run that PR auto-close script again?

2016-02-22 Thread Akhil Das
This? http://apache-spark-developers-list.1001551.n3.nabble.com/Automated-close-of-PR-s-td15862.html Thanks Best Regards On Mon, Feb 22, 2016 at 2:47 PM, Sean Owen wrote: > I know Patrick told us at some point, but I can't find the email or > wiki that describes how to run

Re: [Example] : read custom schema from file

2016-02-22 Thread Akhil Das
If you are talking about a CSV kind of file, then here's an example http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection Thanks Best Regards On Mon, Feb 22, 2016 at 1:10 PM, Divya Gehlot wrote: > Hi, > Can anybody help me

Re: How to start spark streaming application with recent past timestamp for replay of old batches?

2016-02-21 Thread Akhil Das
On Mon, Feb 22, 2016 at 12:18 PM, ashokkumar rajendran < ashokkumar.rajend...@gmail.com> wrote: > Hi Folks, > > > > I am exploring spark for streaming from two sources (a) Kinesis and (b) > HDFS for some of our use-cases. Since we maintain state gathered over last > x hours in spark streaming, we

Re: Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Akhil Das
. > > Thanks, > Divya > > > > On 15 February 2016 at 16:37, Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> You can set *yarn.nodemanager.webapp.address* in the >> yarn-site.xml/yarn-default.xml file to change it I guess. >> >> Thanks &

Re: Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Akhil Das
You can set *yarn.nodemanager.webapp.address* in the yarn-site.xml/yarn-default.xml file to change it I guess. Thanks Best Regards On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot wrote: > Hi, > I have hadoop cluster set up in EC2. > I am unable to view application logs

Re: spark-streaming with checkpointing: error with sparkOnHBase lib

2016-01-27 Thread Akhil Das
Were you able to resolve this? It'd be good if you can paste the code snippet to reproduce this. Thanks Best Regards On Fri, Jan 22, 2016 at 2:06 PM, vinay gupta wrote: > Hi, > I have a spark-streaming application which uses sparkOnHBase lib to do >

Re: MemoryStore: Not enough space to cache broadcast_N in memory

2016-01-27 Thread Akhil Das
Did you try enabling spark.memory.useLegacyMode and upping spark.storage.memoryFraction? Thanks Best Regards On Fri, Jan 22, 2016 at 3:40 AM, Arun Luthra wrote: > WARN MemoryStore: Not enough space to cache broadcast_4 in memory! > (computed 60.2 MB so far) > WARN
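For instance, in Spark 1.6 the legacy memory manager can be re-enabled like this (0.6 is only an example fraction):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.useLegacyMode", "true")
  .set("spark.storage.memoryFraction", "0.6")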

Re: Using distinct count in over clause

2016-01-27 Thread Akhil Das
Does it support over? I couldn't find it in the documentation http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features Thanks Best Regards On Fri, Jan 22, 2016 at 2:31 PM, 汪洋 wrote: > I think it cannot be right. > > 在 2016年1月22日,下午4:53,汪洋

Re: Generate Amplab queries set

2016-01-27 Thread Akhil Das
Have a look at the TPC-H queries, I found this repository with the queries. https://github.com/ssavvides/tpch-spark Thanks Best Regards On Fri, Jan 22, 2016 at 1:35 AM, sara mustafa wrote: > Hi, > I have downloaded the Amplab benchmark dataset from >

Re: Spark SQL . How to enlarge output rows ?

2016-01-27 Thread Akhil Das
Why would you want to print all rows? You can try the following: sqlContext.sql("select day_time from my_table limit 10").collect().foreach(println) Thanks Best Regards On Sun, Jan 24, 2016 at 5:58 PM, Eli Super wrote: > Unfortunately still getting error when use

Re: spark streaming input rate strange

2016-01-27 Thread Akhil Das
How are you verifying that data is being dropped? Can you send 10k, 20k events and write the same to an output location from Spark Streaming and verify it? If you are finding a data mismatch then it's a problem with your MulticastSocket implementation. Thanks Best Regards On Fri, Jan 22, 2016 at 5:44 PM,

Re: How to send a file to database using spark streaming

2016-01-27 Thread Akhil Das
This is a good start https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md Thanks Best Regards On Sat, Jan 23, 2016 at 12:19 PM, Sree Eedupuganti wrote: > New to Spark Streaming. My question is i want to load the XML files to > database

Re: Debug what is replication Level of which RDD

2016-01-27 Thread Akhil Das
How many RDDs are you persisting? If it's 2, then you can verify it by disabling persist for one of them; from the UI you can then see which one is the mappedRDD/shuffledRDD. Thanks Best Regards On Sun, Jan 24, 2016 at 3:25 AM, gaurav sharma wrote: > Hi All, > > I have

Re: spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Akhil Das
Can you look in the executor logs and see why the sparkcontext is being shutdown? Similar discussion happened here previously. http://apache-spark-user-list.1001560.n3.nabble.com/RECEIVED-SIGNAL-15-SIGTERM-td23668.html Thanks Best Regards On Thu, Jan 21, 2016 at 5:11 PM, Soni spark

Re: Parquet write optimization by row group size config

2016-01-20 Thread Akhil Das
st enough, maybe i > missed something > > Regards, > Pavel > > On Wed, Jan 20, 2016 at 9:51 AM Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> Did you try re-partitioning the data before doing the write? >> >> Thanks >> Best Regards >>

Re: Appending filename information to RDD initialized by sc.textFile

2016-01-19 Thread Akhil Das
You can use sc.newAPIHadoopFile and pass your own InputFormat and RecordReader, which will read the compressed .gz files the way your use case needs. For a start, you can look at the: - wholeTextFiles implementation
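If the files are small enough, the built-in wholeTextFiles already pairs each file's path with its content, roughly like this (the directory is a placeholder):

// Each element is (fullFilePath, fileContent)
val withNames = sc.wholeTextFiles("hdfs:///data/input/")
  .map { case (path, content) => (path, content.split("\n").length) }   // e.g. line count per file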

Re: process of executing a program in a distributed environment without hadoop

2016-01-19 Thread Akhil Das
If you are processing a file, then you can keep the same file in all machines in the same location and everything should work. Thanks Best Regards On Wed, Jan 20, 2016 at 11:15 AM, Kamaruddin wrote: > I want to execute a program in a distributed environment without

Re: Parquet write optimization by row group size config

2016-01-19 Thread Akhil Das
Did you try re-partitioning the data before doing the write? Thanks Best Regards On Tue, Jan 19, 2016 at 6:13 PM, Pavel Plotnikov < pavel.plotni...@team.wrike.com> wrote: > Hello, > I'm using spark on some machines in standalone mode, data storage is > mounted on this machines via nfs. A have

Re: Spark Streaming 1.5.2+Kafka+Python. Strange reading

2015-12-24 Thread Akhil Das
Would you mind posting the relevant code snippet? Thanks Best Regards On Wed, Dec 23, 2015 at 7:33 PM, Vyacheslav Yanuk wrote: > Hi. > I have very strange situation with direct reading from Kafka. > For example. > I have 1000 messages in Kafka. > After submitting my

Re: Using inteliJ for spark development

2015-12-23 Thread Akhil Das
Both are similar, give both a go and choose the one you like. On Dec 23, 2015 7:55 PM, "Eran Witkon" <eranwit...@gmail.com> wrote: > Thanks, so based on that article, should I use sbt or maven? Or either? > Eran > On Wed, 23 Dec 2015 at 13:05 Akhil Das <ak...@sigmoidan

Re: Using inteliJ for spark development

2015-12-23 Thread Akhil Das
, Eran Witkon <eranwit...@gmail.com> wrote: > Thanks, all of these examples shows how to link to spark source and build > it as part of my project. why should I do that? why not point directly to > my spark.jar? > Am I missing something? > Eran > > On Wed, Dec 23, 20

Re: Using inteliJ for spark development

2015-12-23 Thread Akhil Das
1. Install sbt plugin on IntelliJ 2. Create a new project/Import an sbt project like Dean suggested 3. Happy Debugging. You can also refer to this article for more information https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ Thanks Best

Re: Problem of submitting Spark task to cluster from eclipse IDE on Windows

2015-12-23 Thread Akhil Das
You need to: 1. Make sure your local router has NAT enabled and has port-forwarded the networking ports listed here. 2. Make sure port 7077 on your cluster is accessible from your local (public) IP address. You can try telnet

Re: Numpy and dynamic loading

2015-12-22 Thread Akhil Das
I guess you will have to install numpy on all the machines for this to work. Try reinstalling on all the machines: sudo apt-get purge python-numpy sudo pip uninstall numpy sudo pip install numpy Thanks Best Regards On Sun, Dec 20, 2015 at 11:19 PM, Abhinav M Kulkarni <

Re: I coded an example to use Twitter stream as a data source for Spark

2015-12-22 Thread Akhil Das
Why not create a custom dstream and generate the data from there itself instead of spark connecting to a socket server which will be fed by another twitter client? Thanks Best Regards On Sat, Dec 19, 2015 at 5:47 PM, Amir

Re: Numpy and dynamic loading

2015-12-22 Thread Akhil Das
I guess you will have to install numpy on all the machines for this to work. Try reinstalling on all the machines: sudo apt-get purge python-numpy sudo pip uninstall numpy sudo pip install numpy Thanks Best Regards On Sun, Dec 20, 2015 at 8:33 AM, Abhinav M Kulkarni <

Re: Memory allocation for Broadcast values

2015-12-22 Thread Akhil Das
If you are creating a huge map on the driver, then spark.driver.memory should be set to a higher value to hold your map. Since you are going to broadcast this map, your spark executors must have enough memory to hold this map as well which can be set using the spark.executor.memory, and

Re: configure spark for hive context

2015-12-22 Thread Akhil Das
Looks like you put in a wrong configuration file, which crashed Spark when it tried to parse the configuration values from it. Thanks Best Regards On Mon, Dec 21, 2015 at 3:35 PM, Divya Gehlot wrote: > Hi, > I am trying to configure spark for hive context (Please dont get mistaken >

Re: hive on spark

2015-12-21 Thread Akhil Das
Looks like a version mismatch; you need to investigate more and make sure the versions match. Thanks Best Regards On Sat, Dec 19, 2015 at 2:15 AM, Ophir Etzion wrote: > During spark-submit when running hive on spark I get: > > Exception in thread "main"

Re: ​Spark 1.6 - YARN Cluster Mode

2015-12-21 Thread Akhil Das
Try adding these properties: spark.driver.extraJavaOptions -Dhdp.version=2.3.2.0-2950 spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.2.0-2950 ​There was a similar discussion with Spark 1.3.0 over here http://stackoverflow.com/questions/29470542/spark-1-3-0-running-pi-example-on-yarn-fails ​

Re: Error on using updateStateByKey

2015-12-21 Thread Akhil Das
You can do it like this: private static Function2<List<Long>, Optional<Long>, Optional<Long>> UPDATEFUNCTION = new Function2<List<Long>, Optional<Long>, Optional<Long>>() { @Override public Optional<Long> call(List<Long> nums, Optional<Long> current) throws Exception { long sum = current.or(0L);
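For reference, a hedged Scala equivalent of the same update function (pairs is assumed to be a DStream[(String, Long)]):

// Sum this batch's values into the running total; stateful ops need ssc.checkpoint(...) set
val updateFunc = (nums: Seq[Long], current: Option[Long]) =>
  Some(current.getOrElse(0L) + nums.sum)
val state = pairs.updateStateByKey(updateFunc)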

Re: One task hangs and never finishes

2015-12-21 Thread Akhil Das
Pasting the relevant code might help to understand better what exactly you are doing. Thanks Best Regards On Thu, Dec 17, 2015 at 9:25 PM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > > I have an application running a set of transformations and finishes with > saveAsTextFile.

Re: Using Spark to process JSON with gzip filed

2015-12-20 Thread Akhil Das
On Fri, Dec 18, 2015 at 11:37 AM Akhil Das <ak...@sigmoidanalytics.com> > wrote: > >> Something like this? This one uses the ZLIB compression, you can replace >> the decompression logic with GZip one in your case. >> >> compressedStream.map(x => { >&

Re: Saving to JDBC

2015-12-18 Thread Akhil Das
You will have to properly order the columns before writing or you can change the column order in the actual table according to your job. Thanks Best Regards On Tue, Dec 15, 2015 at 1:47 AM, Bob Corsaro wrote: > Is there anyway to map pyspark.sql.Row columns to JDBC table

Re: UNSUBSCRIBE

2015-12-18 Thread Akhil Das
Send the mail to user-unsubscr...@spark.apache.org read more over here http://spark.apache.org/community.html Thanks Best Regards On Tue, Dec 15, 2015 at 3:39 AM, Mithila Joshi wrote: > unsubscribe > > On Mon, Dec 14, 2015 at 4:49 PM, Tim Barthram

Re: How to do map join in Spark SQL

2015-12-18 Thread Akhil Das
You can broadcast your json data and then do a map side join. This article is a good start http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/ Thanks Best Regards On Wed, Dec 16, 2015 at 2:51 AM, Alexander Pivovarov wrote: > I have big folder having ORC files. Files
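The gist of the technique as a rough sketch (smallRdd and bigRdd are placeholder names): collect the small side, broadcast it, and join inside a map so no shuffle is needed.

// smallRdd: RDD[(String, String)] small enough to fit on the driver and executors
val lookup = sc.broadcast(smallRdd.collect().toMap)
// The "join" happens inside the map on the big, still-distributed side
val joined = bigRdd.map { case (k, v) => (k, v, lookup.value.getOrElse(k, "")) }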

Re: Using Spark to process JSON with gzip filed

2015-12-18 Thread Akhil Das
Something like this? This one uses the ZLIB compression, you can replace the decompression logic with GZip one in your case. compressedStream.map(x => { val inflater = new Inflater() inflater.setInput(x.getPayload) val decompressedData = new Array[Byte](x.getPayload.size * 2)
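A fuller version of that decompression idea, as a sketch (x.getPayload is assumed to be an Array[Byte]; for GZip you would swap in java.util.zip.GZIPInputStream):

import java.util.zip.Inflater

compressedStream.map { x =>
  val inflater = new Inflater()
  inflater.setInput(x.getPayload)
  val buffer = new Array[Byte](x.getPayload.size * 2)   // rough output-size estimate
  val count = inflater.inflate(buffer)
  inflater.end()
  new String(buffer, 0, count, "UTF-8")
}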

Re: Unable to get json for application jobs in spark 1.5.0

2015-12-18 Thread Akhil Das
Which version of spark are you using? You can test this by opening up a spark-shell, firing a simple job (sc.parallelize(1 to 100).collect()) and then accessing the http://sigmoid-driver:4040/api/v1/applications/Spark%20shell/jobs [image: Inline image 1] Thanks Best Regards On Tue, Dec 15, 2015

Re: about spark on hbase

2015-12-18 Thread Akhil Das
*First you create the HBase configuration:* val hbaseTableName = "paid_daylevel" val hbaseColumnName = "paid_impression" val hconf = HBaseConfiguration.create() hconf.set("hbase.zookeeper.quorum", "sigmoid-dev-master") hconf.set("hbase.zookeeper.property.clientPort",

Re: security testing on spark ?

2015-12-18 Thread Akhil Das
If the port 7077 is open for public on your cluster, that's all you need to take over the cluster. You can read a bit about it here https://www.sigmoid.com/securing-apache-spark-cluster/ You can also look at this small exploit I wrote https://www.exploit-db.com/exploits/36562/ Thanks Best

Re: Spark basicOperators

2015-12-18 Thread Akhil Das
You can pretty much measure it from the Event timeline listed in the driver UI. You can click on jobs/tasks and get the time that it took for each of them from there. Thanks Best Regards On Thu, Dec 17, 2015 at 7:27 AM, sara mustafa wrote: > Hi, > > The class

Re: spark master process shutdown for timeout

2015-12-18 Thread Akhil Das
Did you happen to have a look at this? https://issues.apache.org/jira/browse/SPARK-9629 Thanks Best Regards On Thu, Dec 17, 2015 at 12:02 PM, yaoxiaohua wrote: > Hi guys, > > I have two nodes used as spark master, spark1,spark2 > > Spark1.4.0 > > Jdk

Re: Unsubsribe

2015-12-14 Thread Akhil Das
Send an email to user-unsubscr...@spark.apache.org to unsubscribe from the list. See more over http://spark.apache.org/community.html Thanks Best Regards 2015-12-09 22:18 GMT+05:30 Michael Nolting : > cancel > > -- >

Re: Can't filter

2015-12-14 Thread Akhil Das
If you are not using spark-submit to run the job, then you need to add the following line after creating the SparkContext: sc.addJar("target/scala_2.11/spark.jar") where spark.jar is your project jar. Thanks Best Regards On Thu, Dec 10, 2015 at 5:29 PM, Бобров Виктор wrote:

Re: How to change StreamingContext batch duration after loading from checkpoint

2015-12-14 Thread Akhil Das
Taking the values from a configuration file rather than hard-coding them in the code might help; haven't tried it though. Thanks Best Regards On Mon, Dec 7, 2015 at 9:53 PM, yam wrote: > Is there a way to change the streaming context batch interval after > reloading > from

Re: HDFS

2015-12-14 Thread Akhil Das
Try to set the spark.locality.wait to a higher number and see if things change. You can read more about the configuration properties from here http://spark.apache.org/docs/latest/configuration.html#scheduling Thanks Best Regards On Sat, Dec 12, 2015 at 12:16 AM, shahid ashraf
