Write data to HBase using Spark failing with NPE

2018-05-23 Thread Alchemist
I am using Spark to write data to HBase. I can read data just fine, but the write is failing with the following exception. I found a similar issue that got resolved by adding *site.xml and the HBase JARs, but it is not working for me.       JavaPairRDD  tablePuts =
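For reference, a minimal Scala sketch of writing to HBase through saveAsNewAPIHadoopDataset; the table name "my_table", column family "cf", and the input someRdd are placeholders, and hbase-site.xml plus the HBase client JARs still need to be visible to the executors:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job

    // HBase configuration; picks up hbase-site.xml when it is on the classpath
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")   // placeholder table name

    // The Job object carries the OutputFormat settings consumed by saveAsNewAPIHadoopDataset
    val job = Job.getInstance(hbaseConf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // Build (rowkey, Put) pairs from an existing RDD of (key, value) string pairs
    val tablePuts = someRdd.map { case (key, value) =>
      val put = new Put(Bytes.toBytes(key))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
    }

    tablePuts.saveAsNewAPIHadoopDataset(job.getConfiguration)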

Re: help in copying data from one azure subscription to another azure subscription

2018-05-23 Thread Pushkar.Gujar
What are you using for storing data in those subscriptions? Data Lake or Blobs? There is Azure Data Factory already available that can copy between these cloud storage services without having to go through Spark. Thank you, *Pushkar Gujar* On Mon, May 21, 2018 at 8:59 AM, amit kumar singh

PySpark API on top of Apache Arrow

2018-05-23 Thread Corey Nolet
Please forgive me if this question has been asked already. I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'm curious if anyone knows of any efforts to implement the PySpark API on top of Apache Arrow directly. In my case, I'm doing data science on a machine with 288 cores and 1TB of

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread ayan guha
Curious question: what is the reason for using Spark here? Why not simple SQL-based ETL? On Thu, May 24, 2018 at 5:09 AM, Ajay wrote: > Do you worry about Spark overloading the SQL Server? We have had this > issue in the past where all Spark slaves tend to send lots of

Re: Alternative for numpy in Spark Mlib

2018-05-23 Thread Suzen, Mehmet
You can use Breeze, which is part of the Spark distribution: https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra Check out the modules under import breeze._ On 23 May 2018 at 07:04, umargeek wrote: > Hi Folks, > > I am planning to rewrite one of my python
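As a rough illustration of the NumPy-style operations available, a small Breeze sketch (the values are arbitrary):

    import breeze.linalg.{DenseMatrix, DenseVector, sum}

    val v = DenseVector(1.0, 2.0, 3.0)              // like np.array([1.0, 2.0, 3.0])
    val m = DenseMatrix((1.0, 2.0, 3.0),
                        (4.0, 5.0, 6.0))            // 2x3 matrix
    val mv = m * v                                  // matrix-vector product
    val scaled = v * 2.0                            // element-wise scaling
    val total = sum(v)                              // sum of elements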

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Ajay
Do you worry about Spark overloading the SQL Server? We have had this issue in the past where all Spark slaves tend to send lots of data at once to SQL Server, which increases the latency of the rest of the system. We overcame this by using Sqoop and running it in a controlled environment. On Wed,

Re: Submit many spark applications

2018-05-23 Thread Marcelo Vanzin
On Wed, May 23, 2018 at 12:04 PM, raksja wrote: > So InProcessLauncher wouldn't use the native memory, so will it overload the > memory of the parent process? It will still use "native memory" (since the parent process will still use memory), just less of it. But yes, it will use
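For anyone trying this route, a rough sketch of using InProcessLauncher (available from Spark 2.3); the app resource, main class, and master below are placeholders, not taken from the thread:

    import org.apache.spark.launcher.{InProcessLauncher, SparkAppHandle}

    // Runs the application inside the current JVM instead of forking a new
    // spark-submit process, so it avoids the per-submit process overhead.
    val handle = new InProcessLauncher()
      .setAppResource("/path/to/app.jar")      // placeholder
      .setMainClass("com.example.MyApp")       // placeholder
      .setMaster("yarn")
      .setConf("spark.executor.memory", "4g")
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit = println(s"state: ${h.getState}")
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })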

CMR: An open-source Data acquisition API for Spark is available

2018-05-23 Thread Thomas Fuller
Hi Folks, Today I've released my open-source CMR API, which is used to acquire data from several data providers directly in Spark. Currently the CMR API offers integration with the following: - Federal Reserve Bank of St. Louis - World Bank - TreasuryDirect.gov - OpenFIGI.com *Of note*: - The

Re: Submit many spark applications

2018-05-23 Thread raksja
Hi Marcelo, I'm facing the same issue when making spark-submits from an EC2 instance and reaching the native memory limit sooner. We have #1, but we are still on Spark 2.1.0, so we couldn't try #2. So InProcessLauncher wouldn't use the native memory, so will it overload the memory of the parent process? Is

Re: Spark driver pod garbage collection

2018-05-23 Thread Anirudh Ramanathan
There's a flag to the controller manager that is in charge of retention policy for terminated or completed pods. https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/#options --terminated-pod-gc-threshold int32 Default: 12500 Number of terminated pods that
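For example, to lower the threshold (100 here is just an illustrative value, not a recommendation):

    kube-controller-manager --terminated-pod-gc-threshold=100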

Cannot make Spark to honour the spark.jars.ivySettings config

2018-05-23 Thread Bruno Aranda
Hi, I am trying to use my own ivy settings file. For that, I am submitting to Spark using a command such as the following to test: spark-shell --packages some-org:some-artifact:102 --conf spark.jars.ivySettings=/home/hadoop/ivysettings.xml The idea is to be able to get the artifact from a

Spark driver pod garbage collection

2018-05-23 Thread purna pradeep
Hello, Currently I observe dead pods are not getting garbage collected (aka Spark driver pods which have completed execution). So pods could sit in the namespace for weeks potentially. This makes listing, parsing, and reading pods slower as well as leaving junk sitting on the cluster. I believe

Re: [structured-streaming]How to reset Kafka offset in readStream and read from beginning

2018-05-23 Thread Sushil Kotnala
You can use .option("auto.offset.reset", "earliest") while reading from Kafka. With this, the new stream will read from the first offset present for the topic. On Wed, May 23, 2018 at 11:32 AM, karthikjay wrote: > Chris, > > Thank you for responding. I get it. > > But, if I am
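For the Structured Streaming Kafka source specifically, the option documented for starting from the beginning is startingOffsets; a minimal sketch with placeholder broker and topic names (it only takes effect for a fresh query, i.e. when no checkpoint exists yet):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")   // placeholder
      .option("subscribe", "my_topic")                   // placeholder
      .option("startingOffsets", "earliest")             // start from the earliest available offsets
      .load()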

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
Super, just giving a high-level idea of what I want to do. I have one source schema, which is MS SQL Server 2008, and the target is also MS SQL Server 2008. Currently there is a C#-based ETL application which does extract, transform, and load into a customer-specific schema, including indexing etc. Thanks On Wed,

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread kedarsdixit
Yes. Regards, Kedar Dixit

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
Thank you Kedar Dixit, Silvio Fiorito. Just one question: even if it's not an Azure cloud MS SQL Server, it should support MS SQL Server installed on a local machine, right? Thank you. On Wed, May 23, 2018 at 6:18 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > Try this

Re: Adding jars

2018-05-23 Thread kedarsdixit
This can help us solve the immediate issue; however, ideally one should submit the jars at the beginning of the job.

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Silvio Fiorito
Try this https://docs.microsoft.com/en-us/azure/sql-database/sql-database-spark-connector From: Chetan Khatri Date: Wednesday, May 23, 2018 at 7:47 AM To: user Subject: Bulk / Fast Read and Write with MSSQL Server and Spark All, I am

Re: Adding jars

2018-05-23 Thread Sushil Kotnala
The purpose of a broadcast variable is different. @Malveeka, could you please explain your use case and issue? If the fat/uber jar does not include the required dependent jars, then the Spark job will fail at the start itself. What is your scenario in which you want to add new jars? Also, what do you mean by

Re: Adding jars

2018-05-23 Thread kedarsdixit
In the case of already running jobs, you can make use of broadcast variables, which will broadcast the jars to workers; if you want to change them on the fly, you can rebroadcast. You can explore broadcast variables a bit more to make use of them. Regards, Kedar Dixit Data Science at Persistent Systems Ltd.

Re: Adding jars

2018-05-23 Thread Sushil Kotnala
Hi, with spark-submit we can start a new Spark job, but it will not add new jar files to an already running job. ~Sushil On Wed, May 23, 2018, 17:28 kedarsdixit wrote: > Hi, > > You can add dependencies in spark-submit as below: > > ./bin/spark-submit \ >

Re: Spark is not evenly distributing data

2018-05-23 Thread kedarsdixit
Hi, Can you elaborate more here? We don't understand the issue in detail. Regards, Kedar Dixit Data Science at Persistent Systems Ltd.

Re: Adding jars

2018-05-23 Thread kedarsdixit
Hi, You can add dependencies in spark-submit as below: ./bin/spark-submit \ --class <main-class> \ --master <master-url> \ --deploy-mode <deploy-mode> \ --conf <key>=<value> \ *--jars <comma-separated list of jars>* \ ... # other options \ <application-jar> [application-arguments] Hope this helps. Regards, Kedar Dixit Data Science at Persistent Systems Ltd
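A filled-in example with placeholder paths and class names, for concreteness:

    ./bin/spark-submit \
      --class com.example.MyApp \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.executor.memory=4g \
      --jars /path/to/dep1.jar,/path/to/dep2.jar \
      /path/to/my-app.jar arg1 arg2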

Re: Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread kedarsdixit
Hi, I came across this a while ago; check if this is helpful. Regards, ~Kedar Dixit Data Science @ Persistent Systems Ltd.

Bulk / Fast Read and Write with MSSQL Server and Spark

2018-05-23 Thread Chetan Khatri
All, I am looking for an approach to do bulk read / write with MS SQL Server and Apache Spark 2.2. Please let me know if there is any library / driver for the same. Thank you. Chetan
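As a baseline that needs no vendor-specific connector, Spark's built-in JDBC source can do partitioned reads and batched writes against SQL Server; the URL, credentials, tables, bounds, and partition counts below are all illustrative:

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("mssql-jdbc-example").getOrCreate()

    val url = "jdbc:sqlserver://myhost:1433;databaseName=mydb"   // placeholder
    val props = new Properties()
    props.setProperty("user", "myuser")                          // placeholder
    props.setProperty("password", "mypassword")                  // placeholder
    props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

    // Partitioned read: splits the source table on a numeric column into parallel JDBC queries
    val df = spark.read.jdbc(url, "dbo.source_table", "id", 1L, 1000000L, 8, props)

    // Batched write; batchsize and numPartitions bound the concurrent load on SQL Server
    df.write
      .mode("append")
      .option("batchsize", "10000")
      .option("numPartitions", "8")
      .jdbc(url, "dbo.target_table", props)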

Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-23 Thread Jungtaek Lim
The issue looks to be fixed in https://issues.apache.org/jira/browse/SPARK-23670, and 2.3.1 will likely include the fix. -Jungtaek Lim (HeartSaVioR) On Wed, May 23, 2018 at 7:12 PM, weand wrote: > Thanks for the clarification. So it really seems to be a Spark UI OOM issue. > > After

Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-23 Thread weand
Thanks for the clarification. So it really seems to be a Spark UI OOM issue. After setting: --conf spark.sql.ui.retainedExecutions=10 --conf spark.worker.ui.retainedExecutors=10 --conf spark.worker.ui.retainedDrivers=10 --conf spark.ui.retainedJobs=10 --conf spark.ui.retainedStages=10

Recall: spark sql in-clause problem

2018-05-23 Thread Shiva Prashanth Vallabhaneni
Shiva Prashanth Vallabhaneni would like to recall the message, "spark sql in-clause problem".

[Beginner][StructuredStreaming] Using Spark aggregation - WithWatermark on old data

2018-05-23 Thread karthikjay
I am doing the following aggregation on the data val channelChangesAgg = tunerDataJsonDF .withWatermark("ts2", "10 seconds") .groupBy(window(col("ts2"),"10 seconds"), col("env"), col("servicegroupid")) .agg(count("linetransactionid") as
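Completed a little (the count alias and the console sink are assumptions, since the posted snippet is truncated), the query might look like:

    import org.apache.spark.sql.functions.{col, count, window}

    val channelChangesAgg = tunerDataJsonDF
      .withWatermark("ts2", "10 seconds")
      .groupBy(window(col("ts2"), "10 seconds"), col("env"), col("servicegroupid"))
      .agg(count("linetransactionid") as "transaction_count")   // alias is an assumption

    // "update" mode emits only windows that changed in each trigger; windows older
    // than the watermark are finalized and their state is dropped.
    val query = channelChangesAgg.writeStream
      .outputMode("update")
      .format("console")
      .start()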

Re: [structured-streaming]How to reset Kafka offset in readStream and read from beginning

2018-05-23 Thread karthikjay
Chris, Thank you for responding. I get it. But, if I am using a console sink without a checkpoint location, I do not see any messages in the console in the IntelliJ IDEA IDE. I do not explicitly specify checkpointLocation in this case. How do I clear the working directory data and force Spark to read
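For reference, a minimal console sink (df stands in for the streaming DataFrame; the console sink does not require an explicit checkpointLocation):

    val query = df.writeStream
      .format("console")
      .outputMode("append")
      .option("truncate", "false")   // print full column values
      .start()

    query.awaitTermination()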

Fwd: XGBoost on PySpark

2018-05-23 Thread Aakash Basu
Guys, any insight on the below? -- Forwarded message -- From: Aakash Basu Date: Sat, May 19, 2018 at 12:21 PM Subject: XGBoost on PySpark To: user Hi guys, I need help implementing XGBoost in PySpark. As per the