Re: Hive on Spark orc file empty

2015-11-16 Thread Deepak Sharma
Sai, I am a bit confused here. How are you using write with results? I am using Spark 1.4.1 and when I use write, it complains about write not being a member of DataFrame. error: value write is not a member of org.apache.spark.sql.DataFrame Thanks Deepak On Mon, Nov 16, 2015 at 4:10 PM, 张炜

Best practices

2015-10-30 Thread Deepak Sharma
Hi, I am looking for any blog / doc on the developer's best practices for using Spark. I have already looked at the tuning guide on spark.apache.org. Please let me know if anyone is aware of any such resource. Thanks Deepak

Spark RDD cache persistence

2015-11-05 Thread Deepak Sharma
Hi All, I am confused about RDD persistence in cache. If I cache an RDD, will it stay in memory even after the Spark program that created it completes execution? If not, how can I guarantee that the RDD stays cached even after the program finishes execution? Thanks Deepak

Re: Spark RDD cache persistence

2015-11-05 Thread Deepak Sharma
<engr...@gmail.com> wrote: > The cache gets cleared out when the job finishes. I am not aware of a way > to keep the cache around between jobs. You could save it as an object file > to disk and load it as an object file on your next job for speed. > On Thu, Nov 5, 2015 at 6:1
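A minimal sketch of the save-and-reload approach suggested above, assuming an existing SparkContext sc in the shell (the HDFS path is an assumption):

    // First job: compute once and persist to HDFS
    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.saveAsObjectFile("hdfs:///tmp/cached-rdd")

    // Next job: load the saved RDD back instead of recomputing it
    val reloaded = sc.objectFile[Int]("hdfs:///tmp/cached-rdd")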

Any role for volunteering

2015-12-04 Thread Deepak Sharma
Hi All, Sorry for spamming your inbox. I am really keen to work on a big data project full time (preferably remote, from India); if not, I am open to volunteering as well. Please let me know if any such opportunity is available -- Thanks Deepak

Re: Autoscaling of Spark YARN cluster

2015-12-14 Thread Deepak Sharma
An approach I can think of is using the Ambari Metrics Service (AMS). Using these metrics, you can decide whether the cluster is low on resources. If so, call the Ambari management API to add a node to the cluster. Thanks Deepak On Mon, Dec 14, 2015 at 2:48 PM, cs user

Re: Yarn application ID for Spark job on Yarn

2015-12-18 Thread Deepak Sharma
I have never tried this, but there are YARN client APIs that you can use in your Spark program to get the application ID. Here is the link to the YarnClient Javadoc: http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/yarn/client/api/YarnClient.html getApplications() is the method for your
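A hedged sketch of using that YarnClient API to list applications, assuming the Hadoop/YARN client jars are on the classpath and yarn-site.xml is resolvable:

    import org.apache.hadoop.yarn.conf.YarnConfiguration
    import org.apache.hadoop.yarn.client.api.YarnClient
    import scala.collection.JavaConverters._

    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // List applications and print their IDs and names
    yarnClient.getApplications.asScala.foreach { report =>
      println(s"${report.getApplicationId} -> ${report.getName}")
    }
    yarnClient.stop()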

Re: Newbie question

2016-01-07 Thread Deepak Sharma
Yes, you can do it unless the method is marked static/final. Most of the methods in SparkContext are marked static, so you definitely can't override them; otherwise overriding would usually work. Thanks Deepak On Fri, Jan 8, 2016 at 12:06 PM, yuliya Feldman wrote: >

Re: sparkR ORC support.

2016-01-05 Thread Deepak Sharma
Invalid jobj 2. If SparkR was restarted, Spark operations need to be > re-executed. > > > Not sure what is causing this? Any leads or ideas? I am using rstudio. > > > > On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma <deepakmc...@gmail.com> > wrote: > >> Hi Sandee

Re: sparkR ORC support.

2016-01-05 Thread Deepak Sharma
Hi Sandeep, I am not sure if ORC can be read directly in R. But there can be a workaround: first create a Hive table on top of the ORC files and then access the Hive table in R. Thanks Deepak On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana wrote: > Hello > > I need to read an ORC

Re: Spark_Usecase

2016-06-07 Thread Deepak Sharma
I am not sure if Spark inherently provides any support for incremental extracts. But you can maintain a file, e.g. extractRange.conf in HDFS, read the end range from it, and have the Spark job update it with the new end range before it finishes, so the relevant ranges are used next time. On
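A rough sketch of that range-file idea, assuming an extractRange.conf in HDFS holding the last extracted end value as a single number (file name, format and path are assumptions):

    import org.apache.hadoop.fs.{FileSystem, Path}
    import java.io.{BufferedReader, InputStreamReader, PrintWriter}

    val path = new Path("hdfs:///etl/extractRange.conf")
    val fs   = FileSystem.get(sc.hadoopConfiguration)

    // Read the end range written by the previous run
    val reader  = new BufferedReader(new InputStreamReader(fs.open(path)))
    val lastEnd = reader.readLine().trim.toLong
    reader.close()

    // ... extract records with key > lastEnd, compute the new end range ...
    val newEnd = lastEnd + 1000L   // placeholder for the real max key seen in this run

    // Overwrite the file with the new end range for the next run
    val writer = new PrintWriter(fs.create(path, true))
    writer.println(newEnd)
    writer.close()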

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-13 Thread Deepak Sharma
Hi Ajay, Looking at the Spark code, I can see you used a Hive context. Can you try using a SQL context instead of the Hive context there? Thanks Deepak On Mon, Jun 13, 2016 at 10:15 PM, Ajay Chander wrote: > Hi Mohit, > > Thanks for your time. Please find my response below. > > Did

Re: Query related to spark cluster

2016-05-30 Thread Deepak Sharma
Hi Saurabh, You can have a Hadoop cluster running YARN as the scheduler. Configure Spark to run against the same YARN setup. Then you need R on only one node, and you can connect to the cluster using SparkR. Thanks Deepak On Mon, May 30, 2016 at 12:12 PM, Jörn Franke wrote: > > Well if

Re: Accessing s3a files from Spark

2016-05-31 Thread Deepak Sharma
Hi Mayuresh, Instead of s3a, have you tried the https:// URI for the same S3 bucket? HTH Deepak On Tue, May 31, 2016 at 4:41 PM, Mayuresh Kunjir wrote: > > > On Tue, May 31, 2016 at 5:29 AM, Steve Loughran > wrote: > >> which s3 endpoint? >> >> >

LinkedIn streams in Spark

2016-04-10 Thread Deepak Sharma
Hello All, I am looking for a use case where anyone has used Spark Streaming integration with LinkedIn. -- Thanks Deepak

Re: Steps to Run Spark Scala job from Oozie on EC2 Hadoop clsuter

2016-03-07 Thread Deepak Sharma
There is a Spark action defined for Oozie workflows, though I am not sure whether it supports only Java Spark jobs or Scala jobs as well. https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html Thanks Deepak On Mon, Mar 7, 2016 at 2:44 PM, Divya Gehlot wrote: > Hi, >

Re: Detecting application restart when running in supervised cluster mode

2016-04-05 Thread Deepak Sharma
Hi Rafael, If you are using YARN as the engine, you can always use the RM UI to see the application progress. Thanks Deepak On Tue, Apr 5, 2016 at 12:18 PM, Rafael Barreto wrote: > Hello, > > I have a driver deployed using `spark-submit` in supervised cluster mode. >

How to map values read from text file to 2 different RDDs

2016-05-23 Thread Deepak Sharma
Hi, I am reading a text file with 16 fields. All the placeholders for the values of this text file have been defined in, say, 2 different case classes: Case1 and Case2. How do I map the values read from the text file so that my function in Scala can return 2 different RDDs, with each RDD of
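A minimal sketch of one way to do this, under assumed field positions and placeholder case classes (the split logic and path are assumptions):

    case class Case1(id: String, name: String)        // placeholder fields
    case class Case2(city: String, country: String)    // placeholder fields

    val lines = sc.textFile("hdfs:///data/input.txt")

    // Parse each line once into a pair of the two case classes
    val parsed = lines.map { line =>
      val f = line.split(",")
      (Case1(f(0), f(1)), Case2(f(2), f(3)))
    }
    parsed.cache()   // avoid re-reading the file for each projection

    // Project out the two RDDs
    val rdd1 = parsed.map(_._1)
    val rdd2 = parsed.map(_._2)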

How to map values read from text file to 2 different set of RDDs

2016-05-22 Thread Deepak Sharma
Hi, I am reading a text file with 16 fields. All the placeholders for the values of this text file have been defined in, say, 2 different case classes: Case1 and Case2. How do I map the values read from the text file so that my function in Scala can return 2 different RDDs, with each RDD of

Re: Setting Spark Worker Memory

2016-05-11 Thread Deepak Sharma
Since you are registering workers from the same node, do you have enough cores and RAM (in this case >= 9 cores and >= 24 GB) on this node (11.14.224.24)? Thanks Deepak On Wed, May 11, 2016 at 9:08 PM, شجاع الرحمن بیگ wrote: > Hi All, > > I need to set same memory and

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Deepak Sharma
(Marketing Platform-BLR) < rakes...@flipkart.com> wrote: > Yes, it seems to be the case. > In this case executors should have continued logging values till 300, but > they are shutdown as soon as i do "yarn kill .." > > On Thu, May 12, 2016 at 12:11 PM Deepak Sharma

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Deepak Sharma
Hi Rakesh, Did you try setting spark.streaming.stopGracefullyOnShutdown to true in your Spark configuration instance? If not, try this and let us know if it helps. Thanks Deepak On Thu, May 12, 2016 at 11:42 AM, Rakesh H (Marketing Platform-BLR) < rakes...@flipkart.com> wrote: > Issue i
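A minimal sketch of setting that flag on the streaming application's configuration (app name and batch interval are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("graceful-shutdown-demo")
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // ... define the DStream pipeline, then:
    ssc.start()
    ssc.awaitTermination()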

Debug spark core and streaming programs in scala

2016-05-15 Thread Deepak Sharma
Hi, I have a Scala program consisting of Spark core and Spark Streaming APIs. Is there any open source tool that I can use to debug the program for performance reasons? My primary interest is to find the blocks of code that would be executed on the driver and what would go to the executors. Is there JMX

Re: Graceful shutdown of spark streaming on yarn

2016-05-13 Thread Deepak Sharma
dead and it shuts down abruptly. >> Could this issue be related to yarn? I see correct behavior locally. I >> did "yarn kill " to kill the job. >> >> >> On Thu, May 12, 2016 at 12:28 PM Deepak Sharma <deepakmc...@gmail.com> >> wrote: >>

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Deepak Sharma
er$: VALUE -> 205 > 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 206 > > > > > > > On Thu, May 12, 2016 at 11:45 AM Deepak Sharma <deepakmc...@gmail.com> > wrote: > >> Hi Rakesh >> Did you tried se

Re: Adhoc queries on Spark 2.0 with Structured Streaming

2016-05-06 Thread Deepak Sharma
Spark 2.0 is yet to come out as a public release. I am waiting to get my hands on it as well. Please let me know if I can download the source and build Spark 2.0 from GitHub. Thanks Deepak On Fri, May 6, 2016 at 9:51 PM, Sunita Arvind wrote: > Hi All, > > We are evaluating a

Re: Spark structured streaming is Micro batch?

2016-05-06 Thread Deepak Sharma
With Structured Streaming, Spark would provide APIs over the Spark SQL engine. Once you have the structured stream and a DataFrame created out of it, you can do ad-hoc querying on the DF, which means you are actually querying the stream without having to store or transform it. I have not used
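A hedged sketch of that ad-hoc-query idea using the Spark 2.0 memory sink (the socket source, host/port and table name are assumptions; the memory sink is meant for experimentation, not production):

    val stream = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Materialize the stream into an in-memory table that can be queried ad hoc
    val query = stream.writeStream
      .format("memory")
      .queryName("events")
      .outputMode("append")
      .start()

    // Ad-hoc SQL over the continuously updated table
    spark.sql("SELECT count(*) FROM events").show()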

Re: Cluster Migration

2016-05-10 Thread Deepak Sharma
Hi Ajay, You can look at the wholeTextFiles method, which gives an RDD[(String, String)], and then map the RDD and saveAsTextFile. This will serve the purpose. I don't think anything like a default distcp exists in Spark. Thanks Deepak On 10 May 2016 11:27 pm, "Ajay Chander" wrote: > Hi
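A rough sketch of that approach, assuming both clusters' namenodes are reachable from the job (hostnames and paths are placeholders):

    // wholeTextFiles yields (filePath, fileContent) pairs
    val files = sc.wholeTextFiles("hdfs://clusterA-nn:8020/source/dir")

    // Write each file's content out on the target cluster
    files.map { case (_, content) => content }
      .saveAsTextFile("hdfs://clusterB-nn:8020/target/dir")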

Re: Cluster Migration

2016-05-10 Thread Deepak Sharma
then apply > compression codec on it, save the rdd to another Hadoop cluster? > > Thank you, > Ajay > > On Tuesday, May 10, 2016, Deepak Sharma <deepakmc...@gmail.com> wrote: > >> Hi Ajay >> You can look at wholeTextFiles method of rdd[string,string] and

Re: migration from Teradata to Spark SQL

2016-05-03 Thread Deepak Sharma
Hi Tapan, I would suggest an architecture with separate storage and data serving layers. Spark is still best for batch processing of data. So what I am suggesting here is that you can store your data as-is in some HDFS raw layer, run your ELT in Spark on this raw data and

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
Once you download Hadoop and format the namenode, you can use start-dfs.sh to start HDFS. Then use 'jps' to see if the datanode/namenode services are up and running. Thanks Deepak On Mon, Apr 18, 2016 at 5:18 PM, My List wrote: > Hi , > > I am a newbie on Spark.I wanted to

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
binary format or will have to build it? > 3) Is there a basic tutorial for Hadoop on windows for the basic needs of > Spark. > > Thanks in Advance ! > > On Mon, Apr 18, 2016 at 5:35 PM, Deepak Sharma <deepakmc...@gmail.com> > wrote: > >> Once you download hadoop

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
re Galore on Spark. > Since I am starting afresh, what would you advice? > > On Mon, Apr 18, 2016 at 5:45 PM, Deepak Sharma <deepakmc...@gmail.com> > wrote: > >> Binary for Spark means ts spark built against hadoop 2.6 >> It will not have any hadoop executables. &

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Deepak Sharma
as trying to > run big data stuff on windows. Have run in so much of issues that I could > just throw the laptop with windows out. > > Your view - Redhat, Ubuntu or Centos. > Does Redhat give a one year licence on purchase etc? > > Thanks > > On Mon, Apr 18, 2016 at

Processing millions of messages in milliseconds -- Architecture guide required

2016-04-18 Thread Deepak Sharma
Hi all, I am looking for an architecture to ingest 10 million messages in micro batches of seconds. If anyone has worked on a similar kind of architecture, can you please point me to any documentation around it, e.g. what the architecture should be, and which big data components

Use cases around image/video processing in spark

2016-08-10 Thread Deepak Sharma
Hi, If anyone is using or knows about a GitHub repo that can help me get started with image and video processing using Spark, please share. The images/videos will be stored in S3 and I am planning to use S3 with Spark. In this case, how will Spark achieve distributed processing? Any code base or references is

Re: Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread Deepak Sharma
ll config to overcome this. > Tried almost everything i could after searching online. > > Any help from the mailing list would be appreciated. > > On Thu, Aug 4, 2016 at 7:43 AM, Deepak Sharma <deepakmc...@gmail.com> > wrote: > >> I am facing the same issue with spark 1.5

Re: Spark jobs failing due to java.lang.OutOfMemoryError: PermGen space

2016-08-04 Thread Deepak Sharma
I am facing the same issue with Spark 1.5.2. If the file being processed by Spark is 10-12 MB in size, it throws out of memory. But if the same file is within a 5 MB limit, it runs fine. I am using a Spark configuration with 7 GB of memory and 3 cores for executors in a cluster of 8

Re: Storm HDFS bolt equivalent in Spark Streaming.

2016-07-19 Thread Deepak Sharma
In Spark Streaming, you have to decide the duration of the micro batches to run. Once you get the micro batch, transform it as per your logic and then you can use saveAsTextFiles on your final DStream (or saveAsTextFile on each RDD) to write it to HDFS. Thanks Deepak On 20 Jul 2016 9:49 am, wrote:
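A minimal sketch of that micro-batch-to-HDFS flow (the socket source, batch interval and output prefix are assumptions):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc   = new StreamingContext(sc, Seconds(30))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transform each micro batch, then write it out; one output directory is created per batch
    lines.map(_.toUpperCase)
      .saveAsTextFiles("hdfs:///streaming/out/batch")

    ssc.start()
    ssc.awaitTermination()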

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-19 Thread Deepak Sharma
I am using a DAO in a Spark application to write the final computation to Cassandra and it performs well. What kinds of issues do you foresee in using a DAO for HBase? Thanks Deepak On 19 Jul 2016 10:04 pm, "Yu Wei" wrote: > Hi guys, > > > I write spark application and want to store

Re: RDD for loop vs foreach

2016-07-12 Thread Deepak Sharma
Hi Phil, I guess for() is executed on the driver while foreach() will execute in parallel on the executors. You can try both without collecting the RDD: foreach in this case would print on the executors and you would not see anything on the driver console. Thanks Deepak On Tue, Jul 12, 2016 at 9:28 PM,
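A small example of the difference, assuming a spark-shell SparkContext sc:

    val rdd = sc.parallelize(1 to 5)

    // Runs on the executors: output lands in the executor logs, not the driver console
    rdd.foreach(x => println(s"executor side: $x"))

    // Bring the data to the driver first if you want to see it locally
    for (x <- rdd.collect()) println(s"driver side: $x")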

Re: One map per folder in spark or Hadoop

2016-07-07 Thread Deepak Sharma
You have to distribute the files in some distributed file system like HDFS. Otherwise, copy the files to every executor's local file system and make sure to mention the file scheme in the URI explicitly. Thanks Deepak On Thu, Jul 7, 2016 at 7:13 PM, Balachandar R.A. wrote:

Re: Is there a way to dynamic load files [ parquet or csv ] in the map function?

2016-07-08 Thread Deepak Sharma
Yes. You can do something like this: .map(x => mapfunction(x)) Thanks Deepak On 9 Jul 2016 9:22 am, "charles li" wrote: > > hi, guys, is there a way to dynamically load files within the map function. > > i.e. > > Can I code as below: > > thanks a lot. > > > -- >

Re: Best practices around spark-scala

2016-08-08 Thread Deepak Sharma
rote: > I found following links are good as I am using same. > > http://spark.apache.org/docs/latest/tuning.html > > https://spark-summit.org/2014/testing-spark-best-practices/ > > Regards, > Vaquar khan > > On 8 Aug 2016 10:11, "Deepak Sharma" <deepakmc.

Best practices around spark-scala

2016-08-08 Thread Deepak Sharma
Hi All, Can anyone please point me to any documents that may exist around Spark-Scala best practices? -- Thanks Deepak www.bigdatabig.com www.keosha.net

Re: SPARK SQL READING FROM HIVE

2016-08-08 Thread Deepak Sharma
Can you please post the code snippet and the error you are getting ? -Deepak On 9 Aug 2016 12:18 am, "manish jaiswal" wrote: > Hi, > > I am not able to read data from hive transactional table using sparksql. > (i don't want read via hive jdbc) > > > > Please help. >

Re: Spark join and large temp files

2016-08-08 Thread Deepak Sharma
Register your dataframes as temp tables and then try the join on the temp tables. This should resolve your issue. Thanks Deepak On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab wrote: > Hello, > We have two parquet inputs of the following form: > > a: id:String, Name:String (1.5TB)
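A minimal sketch of that suggestion, assuming dfA and dfB are the two parquet DataFrames from the thread and an existing sqlContext (table and column names are assumptions):

    // Register both sides as temporary tables
    dfA.registerTempTable("a")
    dfB.registerTempTable("b")

    // Express the join in SQL instead of the DataFrame API
    val joined = sqlContext.sql(
      "SELECT a.id, a.name, b.* FROM a JOIN b ON a.id = b.id")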

Long running tasks in stages

2016-08-06 Thread Deepak Sharma
I am doing a join over one dataframe and an empty dataframe. The first dataframe has almost 50k records. This operation never returns and runs indefinitely. Is there any solution to get around this? -- Thanks Deepak www.bigdatabig.com www.keosha.net

Re: Is Spark right for my use case?

2016-08-08 Thread Deepak Sharma
Hi Danellis, For point 1, Spark Streaming is something to look at. For point 2, you can create a DAO for Cassandra on each stream-processing pass. This may be a costly operation, but to do real-time processing of data you have to live with it. Point 3 is covered in point 2 above. Since you are

Re: What are the configurations needs to connect spark and ms-sql server?

2016-08-08 Thread Deepak Sharma
Hi Devi, Please make sure the JDBC jar is on the Spark classpath. With spark-submit, you can use the --jars option to specify the SQL Server JDBC jar. Thanks Deepak On Mon, Aug 8, 2016 at 1:14 PM, Devi P.V wrote: > Hi all, > > I am trying to write a spark dataframe into MS-Sql

Re: What are using Spark for

2016-08-02 Thread Deepak Sharma
Yes, I am using Spark for ETL and I am sure there are a lot of other companies who are using Spark for ETL. Thanks Deepak On 2 Aug 2016 11:40 pm, "Rohit L" wrote: > Does anyone use Spark for ETL? > > On Tue, Aug 2, 2016 at 1:24 PM, Sonal Goyal wrote:

Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Deepak Sharma
atic > write a size properly for what I already set in Alluxio 512MB per block. > > > On Jul 1, 2016, at 11:01 AM, Deepak Sharma <deepakmc...@gmail.com> wrote: > > Before writing, coalesce your rdd to 1. > It will create only 1 output file. > Multiple part files happen
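A minimal sketch of the coalesce-to-one approach quoted above (the output path is an assumption, and note this funnels all data through a single task):

    // Collapse to a single partition before writing so only one part file is produced
    df.coalesce(1)
      .write
      .mode("overwrite")
      .parquet("alluxio://alluxio-master:19998/out/single-file")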

Re: how to compare two avro format hive tables

2017-01-30 Thread Deepak Sharma
You can use spark-testing-base's RDD comparators. Create 2 different dataframes from these 2 Hive tables, convert them to RDDs and use spark-testing-base's compareRDD. Here is an example of RDD comparison: https://github.com/holdenk/spark-testing-base/wiki/RDDComparisons On Mon, Jan 30, 2017 at
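The spark-testing-base helpers linked above do the comparison for you; as a rough alternative sketch without the library, the same check can be approximated with except (table names are assumptions, and a HiveContext/HiveSupport-enabled session is assumed):

    val df1 = sqlContext.table("db.avro_table_1")
    val df2 = sqlContext.table("db.avro_table_2")

    // Rows present in one table but not the other; both empty means the tables match
    val onlyIn1 = df1.except(df2)
    val onlyIn2 = df2.except(df1)

    println(s"rows only in table 1: ${onlyIn1.count()}, rows only in table 2: ${onlyIn2.count()}")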

Examples in graphx

2017-01-29 Thread Deepak Sharma
Hi There, Are there any examples of using GraphX along with any graph DB? I am looking to persist the graph in a graph-based DB and then read it back in Spark and process it using GraphX. -- Thanks Deepak www.bigdatabig.com www.keosha.net

Re: spark architecture question -- Pleas Read

2017-01-29 Thread Deepak Sharma
The better way is to read the data directly into Spark using Spark SQL's read JDBC, apply the UDFs locally, and then save the dataframe back to Oracle using the dataframe's write JDBC. Thanks Deepak On Jan 29, 2017 7:15 PM, "Jörn Franke" wrote: > One alternative could be the
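A rough sketch of that flow, assuming a Spark 2.x SparkSession named spark (the Oracle URL, credentials, table names and the UDF itself are assumptions; the Oracle JDBC driver must be on the classpath):

    import java.util.Properties
    import org.apache.spark.sql.functions.udf

    val url   = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    val props = new Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")

    // Read the source table directly into a DataFrame
    val src = spark.read.jdbc(url, "SOURCE_TABLE", props)

    // Apply the transformation in Spark instead of in the database
    val clean = udf((s: String) => if (s == null) "" else s.trim.toUpperCase)
    val out   = src.withColumn("NAME_CLEAN", clean(src("NAME")))

    // Write the result back to Oracle
    out.write.mode("append").jdbc(url, "TARGET_TABLE", props)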

Re: help!!!----issue with spark-sql type cast from long to longwritable

2017-01-24 Thread Deepak Sharma
Can you try writing the UDF directly in Spark and registering it with the Spark SQL or Hive context? Or do you want to reuse the existing Hive UDF jar in Spark? Thanks Deepak On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu" wrote: > Hi Team, > > I am trying to keep below code in

Re: anyone from bangalore wants to work on spark projects along with me

2017-01-19 Thread Deepak Sharma
Yes. I will be there before 4 PM . Whats your contact number ? Thanks Deepak On Thu, Jan 19, 2017 at 2:38 PM, Sirisha Cheruvu wrote: > Are we meeting today?! > > On Jan 18, 2017 8:32 AM, "Sirisha Cheruvu" wrote: > >> Hi , >> >> Just thought of keeping my

Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Deepak Sharma
On the SQLContext or HiveContext, you can register the function as a UDF as below: hiveSqlContext.udf.register("func_name", func(_: String)) Thanks Deepak On Wed, Jan 18, 2017 at 8:45 AM, Sirisha Cheruvu wrote: > Hey > > Can yu send me the source code of hive java udf which
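A slightly fuller sketch of registering and using such a UDF (the function name, table and column are assumptions; hiveSqlContext is the context from the snippet above):

    // Define a plain Scala function and register it for use in SQL
    val toUpper = (s: String) => if (s == null) null else s.toUpperCase
    hiveSqlContext.udf.register("to_upper", toUpper)

    // Now it can be called from Spark SQL / HiveQL
    hiveSqlContext.sql("SELECT to_upper(name) FROM employees").show()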

Re: need a hive generic udf which also works on spark sql

2017-01-17 Thread Deepak Sharma
Did you try this with spark-shell? Please try this: $ spark-shell --jars /home/cloudera/Downloads/genudnvl2.jar On the Spark shell: val hc = new org.apache.spark.sql.hive.HiveContext(sc); hc.sql("create temporary function nexr_nvl2 as 'com.nexr.platform.hive.udf.GenericUDFNVL2'");

Re: Spark ANSI SQL Support

2017-01-17 Thread Deepak Sharma
From the Spark documentation page: Spark SQL can now run all 99 TPC-DS queries. On Jan 18, 2017 9:39 AM, "Rishabh Bhardwaj" wrote: > Hi All, > > Does Spark 2.0 Sql support full ANSI SQL query standards? > > Thanks, > Rishabh. >

Re: Apache Spark toDebugString producing different output for python and scala repl

2016-08-15 Thread DEEPAK SHARMA
a slides say that the default partitions is 2 however its 1 (looking at output of toDebugString). Appreciate any help. Thanks Deepak Sharma

Re: Calling udf in Spark

2016-09-08 Thread Deepak Sharma
No, it's not required for a UDF. It's required when you convert from an RDD to a DF. Thanks Deepak On 8 Sep 2016 2:25 pm, "Divya Gehlot" wrote: > Hi, > > Is it necessary to import sqlContext.implicits._ whenever we define and > call a UDF in Spark. > > > Thanks, > Divya > > >
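A small example of where the import actually matters, assuming a spark-shell sqlContext and SparkContext sc:

    // Needed for the rdd.toDF() conversion, not for defining or registering UDFs
    import sqlContext.implicits._

    case class Person(name: String, age: Int)
    val df = sc.parallelize(Seq(Person("a", 30), Person("b", 40))).toDF()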

Re: Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread Deepak Sharma
Is it possible to execute any query using SQLContext even if the DB is secured using roles or tools such as Sentry? Thanks Deepak On Tue, Aug 30, 2016 at 7:52 PM, Rajani, Arpan wrote: > Hi All, > > In our YARN cluster, we have setup spark 1.6.1 , we plan to give

Re: Assign values to existing column in SparkR

2016-09-09 Thread Deepak Sharma
Data frames are immutable in nature, so I don't think you can directly assign or change values in a column. Thanks Deepak On Fri, Sep 9, 2016 at 10:59 PM, xingye wrote: > I have some questions about assigning values to a spark dataframe. I want to > assign values to an
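Since the DataFrame is immutable, the usual pattern is to derive a new one with withColumn; a minimal Scala sketch (the column names and condition are assumptions):

    import org.apache.spark.sql.functions.{when, lit}

    // "Assigning" a value really means producing a new DataFrame with the column replaced
    val updated = df.withColumn(
      "status",
      when(df("score") > 50, lit("pass")).otherwise(lit("fail")))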

Re: how to specify cores and executor to run spark jobs simultaneously

2016-09-14 Thread Deepak Sharma
I am not sure about EMR, but it seems multi-tenancy is not enabled in your case. Multi-tenancy means all the applications have to be submitted to different queues. Thanks Deepak On Wed, Sep 14, 2016 at 11:37 AM, Divya Gehlot wrote: > Hi, > > I am on EMR cluster and My

Re: Ways to check Spark submit running

2016-09-13 Thread Deepak Sharma
Use yarn-client mode and you can see the logs on the console after you submit. On Tue, Sep 13, 2016 at 11:47 AM, Divya Gehlot wrote: > Hi, > > Somehow, for the time being, I am unable to view the Spark Web UI and Hadoop Web > UI. > Looking for other ways I can check my job is

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
What is the message inflow? If it's really high, Spark will definitely be of great use. Thanks Deepak On Sep 29, 2016 19:24, "Ali Akhtar" wrote: > I have a somewhat tricky use case, and I'm looking for ideas. > > I have 5-6 Kafka producers, reading various APIs, and

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
One of the current best from what I've worked with is >> Citus. >> >> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <deepakmc...@gmail.com> >> wrote: >> > Hi Cody >> > Spark direct stream is just fine for this use case. >> > But why post

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
i Akhtar <ali.rac...@gmail.com> wrote: > > Is there an advantage to that vs directly consuming from Kafka? Nothing > is > > being done to the data except some light ETL and then storing it in > > Cassandra > > > > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma &l

Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread Deepak Sharma
Hi Anupama, To me it looks like an issue with the SPN you are using to connect to HiveServer2, i.e. hive@hostname. Are you able to connect to Hive from spark-shell? Try getting the ticket using any other user keytab (not the Hadoop services keytab) and then try running the spark-submit. Thanks

Re: Convert RDD to JSON Rdd and append more information

2016-09-20 Thread Deepak Sharma
Enrich the RDDs first with more information and then map them to some case class, if you are using Scala. You can then use the Play API's classes (play.api.libs.json.Writes / play.api.libs.json.Json) to convert the mapped case class to JSON. Thanks Deepak On Tue, Sep 20, 2016 at 6:42 PM, sujeet jog
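A hedged sketch of that approach with Play JSON (the case class, its fields, the enrichment value and the input RDD[(String, Double)] are all assumptions; requires the play-json dependency):

    import play.api.libs.json.{Json, Writes}

    case class Event(id: String, value: Double, source: String)

    // Enrich each record and map it to the case class
    val enriched = rdd.map { case (id, value) => Event(id, value, "sensor-feed") }

    // Build the Writes inside mapPartitions so it doesn't need to be serialized from the driver
    val jsonRdd = enriched.mapPartitions { it =>
      implicit val eventWrites: Writes[Event] = Json.writes[Event]
      it.map(e => Json.toJson(e).toString)
    }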

Re: Spark 2.0 - Join statement compile error

2016-08-22 Thread Deepak Sharma
Hi Subhajit, Try this in your join: val df = sales_demand.join(product_master, sales_demand.$"INVENTORY_ITEM_ID" === product_master.$"INVENTORY_ITEM_ID", "inner") On Tue, Aug 23, 2016 at 2:30 AM, Subhajit Purkayastha wrote: > All, > > > > I

Re: Spark 2.0 - Join statement compile error

2016-08-23 Thread Deepak Sharma
On Tue, Aug 23, 2016 at 10:32 AM, Deepak Sharma <deepakmc...@gmail.com> wrote: > val df = sales_demand.join(product_master, sales_demand.$"INVENTORY_ITEM_ID" === product_master.$"INVENTORY_ITEM_ID", "inner") Ignore
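The $-selector in the earlier suggestion is not valid on a DataFrame reference, which is presumably why it was withdrawn; a hedged corrected sketch of the same join:

    // Reference the column through each DataFrame rather than with $ on the DataFrame
    val df = sales_demand.join(
      product_master,
      sales_demand("INVENTORY_ITEM_ID") === product_master("INVENTORY_ITEM_ID"),
      "inner")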

Re: Optimized way to use spark as db to hdfs etl

2016-11-05 Thread Deepak Sharma
Hi Rohit, You can use accumulators and increment one on every record processed. At the end you can get the value of the accumulator on the driver, which will give you the count. HTH Deepak On Nov 5, 2016 20:09, "Rohit Verma" wrote: > I am using spark to read from database and
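A minimal sketch of that accumulator pattern, assuming the Spark 2.x API and a SparkSession named spark (the per-row work is a placeholder):

    // Driver-side counter that executors can only add to
    val processed = spark.sparkContext.longAccumulator("records processed")

    df.foreach { row =>
      // ... write the row to HDFS / do the real per-record work here ...
      processed.add(1)
    }

    println(s"total records processed: ${processed.value}")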

Re: what is the optimized way to combine multiple dataframes into one dataframe ?

2016-11-16 Thread Deepak Sharma
Can you try caching the individual dataframes and then unioning them? It may save you time. Thanks Deepak On Wed, Nov 16, 2016 at 12:35 PM, Devi P.V wrote: > Hi all, > > I have 4 data frames with three columns, > > client_id, product_id, interest > > I want to combine these 4
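A small sketch of that suggestion for the four DataFrames (the names df1..df4 are assumptions):

    val frames = Seq(df1, df2, df3, df4)

    // Cache each input, then fold them into a single DataFrame
    // (union is the Spark 2.x method; on 1.6 use unionAll instead)
    val combined = frames.map(_.cache()).reduce(_ union _)
    combined.count()   // materializes the union (and the caches)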

Re: Possible DR solution

2016-11-11 Thread Deepak Sharma

Re: Possible DR solution

2016-11-11 Thread Deepak Sharma
This is a waste of money, I guess. On Nov 11, 2016 22:41, "Mich Talebzadeh" wrote: > starts at $4,000 per node per year all inclusive. > > With discount it can be halved but we are talking a node itself so if you > have 5 nodes in primary and 5 nodes in DR we are talking

Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
There are 8 worker nodes in the cluster . Thanks Deepak On Dec 18, 2016 2:15 AM, "Holden Karau" <hol...@pigscanfly.ca> wrote: > How many workers are in the cluster? > > On Sat, Dec 17, 2016 at 12:23 PM Deepak Sharma <deepakmc...@gmail.com> > wrote: > &

foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
Hi All, I am iterating over a dataframe's partitions using df.foreachPartition. On each iteration over a row, I am initializing a DAO to insert the row into Cassandra. Each of these iterations takes almost one and a half minutes to finish. In my workflow, this is part of an action and 100 partitions are
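The per-row DAO initialization is usually what dominates here; a hedged sketch that creates the DAO once per partition instead (CassandraDao is a hypothetical class standing in for the real DAO used in the job):

    df.foreachPartition { rows =>
      // One DAO/connection per partition, not per row
      val dao = new CassandraDao()   // hypothetical DAO wrapping the Cassandra session
      try {
        rows.foreach(row => dao.insert(row))
      } finally {
        dao.close()
      }
    }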

Re: foreachPartition's operation is taking long to finish

2016-12-17 Thread Deepak Sharma
On Sun, Dec 18, 2016 at 2:26 AM, vaquar khan wrote: > select * from indexInfo; > Hi Vaquar, I do not see a CF with the name indexInfo in any of the Cassandra databases. Thanks Deepak -- Thanks Deepak www.bigdatabig.com www.keosha.net

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
2016 at 1:49 PM, Deepak Sharma <deepakmc...@gmail.com> wrote: > This is the correct way to do it. The timestamp that you mentioned was not > correct: > > scala> val ts1 = from_unixtime($"ts"/1000, "yyyy-MM-dd") > ts1: org.apache.spark.sql.Column =

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
01| |3bc61951-0f49-43b...|1477983725292|2016-11-01| |688acc61-753f-4a3...|1479899459947|2016-11-23| |5ff1eb6c-14ec-471...|1479901374026|2016-11-23| ++-+--+ Thanks Deepak On Mon, Dec 5, 2016 at 1:46 PM, Deepak Sharma <deepakmc...@gmail.com> wrote: >

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Deepak Sharma
This is how you can do it in Scala: scala> val ts1 = from_unixtime($"ts", "yyyy-MM-dd") ts1: org.apache.spark.sql.Column = fromunixtime(ts,yyyy-MM-dd) scala> val finaldf = df.withColumn("ts1", ts1) finaldf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string, ts1: string] scala>

Re: Spark 2.0.2 , using DStreams in Spark Streaming . How do I create SQLContext? Please help

2016-11-30 Thread Deepak Sharma
In Spark 2.0 and later, SparkSession was introduced, and you can use it to query Hive as well. Just make sure you create the Spark session with the enableHiveSupport() option. Thanks Deepak On Thu, Dec 1, 2016 at 12:27 PM, shyla deshpande wrote: > I am on Spark 2.0.2, using DStreams
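A minimal sketch of creating such a session and querying Hive with it (the app, database and table names are assumptions):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-from-streaming")
      .enableHiveSupport()
      .getOrCreate()

    // Hive tables are now visible through this session
    spark.sql("SELECT count(*) FROM my_db.my_table").show()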

Re: How to deal with string column data for spark mlib?

2016-12-20 Thread Deepak Sharma
You can read the source into a data frame. Then iterate over all rows with map and use something like the below: df.map(x => x(0).toString().toDouble) Thanks Deepak On Tue, Dec 20, 2016 at 3:05 PM, big data wrote: > our source data are string-based data, like this: > col1

Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma

Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma
On 27 December 2016 at 10:30, Deepak Sharma <deepakmc...@gmail.com> wrote: > >> It works for me with spark 1.6 (--jars) >> Please tr

Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma

Re: Location for the additional jar files in Spark

2016-12-27 Thread Deepak Sharma
Hi Mich, You can copy the jar to a shared location and use the --jars command line argument of spark-submit. Whoever needs access to this jar can refer to the shared path and access it using the --jars argument. Thanks Deepak On Tue, Dec 27, 2016 at 3:03 PM, Mich Talebzadeh

Hive Context and SQL Context interoperability

2017-04-13 Thread Deepak Sharma
Hi All, I have registered temp tables using both a Hive context and a SQL context. Now when I try to join these 2 temp tables, one of the tables complains about not being found. Is there any setting or option so that the tables in these 2 different contexts are visible to each other? -- Thanks Deepak
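Temp tables are scoped to the context that registered them, so they generally aren't visible across separate contexts; the usual workaround is to register both through one HiveContext (a rough Spark 1.x style sketch; table names and paths are assumptions):

    // Use a single HiveContext (it is also a SQLContext) for both registrations
    val hc = new org.apache.spark.sql.hive.HiveContext(sc)

    hc.sql("SELECT * FROM hive_source").registerTempTable("t1")
    hc.read.parquet("hdfs:///data/other").registerTempTable("t2")

    // Both temp tables now live in the same catalog and can be joined
    val joined = hc.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id")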

Re: Check if dataframe is empty

2017-03-06 Thread Deepak Sharma
If the df is empty, .take would throw java.util.NoSuchElementException. This can be done as below: df.rdd.isEmpty On Tue, Mar 7, 2017 at 9:33 AM, wrote: > Dataframe.take(1) is faster. > > > > *From:* ashaita...@nz.imshealth.com

Re: Check if dataframe is empty

2017-03-07 Thread Deepak Sharma
On Tue, Mar 7, 2017 at 2:37 PM, Nick Pentreath wrote: > df.take(1).isEmpty should work My bad. It will return an empty array: emptydf.take(1) res0: Array[org.apache.spark.sql.Row] = Array() and applying isEmpty would return a boolean: emptydf.take(1).isEmpty res2:

Spark ES Connector -- AWS Managed ElasticSearch Services

2017-08-01 Thread Deepak Sharma
I am trying to connect to the AWS managed ES service using the Spark ES Connector, but am not able to. I am passing es.nodes and es.port along with es.nodes.wan.only set to true. But it fails with the below error: 34 ERROR NetworkClient: Node [x.x.x.x:443] failed (The server x.x.x.x failed to respond); no

Re: How can i split dataset to multi dataset

2017-08-06 Thread Deepak Sharma
This can be mapped as below: dataset.map(x => ((x(0), x(1), x(2)), x)) This works with a DataFrame of Rows but I haven't tried it with a Dataset. Thanks Deepak On Mon, Aug 7, 2017 at 8:21 AM, Jone Zhang wrote: > val schema = StructType( > Seq( > StructField("app",

Re: Help Required on Spark - Convert DataFrame to List with out using collect

2017-12-18 Thread Deepak Sharma
I am not sure about Java, but in Scala it would be something like df.rdd.map{ x => MyClass(x.getString(0), .) } HTH --Deepak On Dec 19, 2017 09:25, "Sunitha Chennareddy" wrote: Hi All, I am new to Spark, I want to convert a DataFrame to a List without using collect

Re: Spark based Data Warehouse

2017-11-11 Thread Deepak Sharma
I am looking for a similar solution more aligned to the data scientist group. The concern I have is about supporting complex aggregations at runtime. Thanks Deepak On Nov 12, 2017 12:51, "ashish rawat" wrote: > Hello Everyone, > > I was trying to understand if anyone here has

Re: Spark based Data Warehouse

2017-11-13 Thread Deepak Sharma
os for your >> end users; but it sounds like you’ll be using it for exploratory analysis. >> Spark is great for this ☺ >> >> >> >> -Pat >> >> >> >> >> >> *From: *Vadim Semenov <vadim.seme...@datadoghq.com> >> *Date: *Su
