Re: How to authenticate to ADLS from within spark job on the fly

2017-08-19 Thread Patrick Alwell
This might help; I’ve built a REST API with livyServer: https://livy.incubator.apache.org/ From: Steve Loughran Date: Saturday, August 19, 2017 at 7:05 AM To: Imtiaz Ahmed Cc: "user@spark.apache.org" Subject: Re: How to
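
For anyone who wants to try the Livy route, here is a minimal sketch of the two REST calls involved: creating a session and posting a statement to it. The helper names and the example host are made up for illustration; the endpoint paths and JSON fields are the ones Livy documents for interactive sessions. The functions only build the requests, so nothing is sent until you pass one to urlopen.

```python
import json
import urllib.request


def livy_request(base_url, path, payload):
    """Build (but do not send) a JSON POST request for a Livy endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(base_url):
    """Request for POST /sessions, asking for a PySpark interactive session."""
    return livy_request(base_url, "/sessions", {"kind": "pyspark"})


def run_statement(base_url, session_id, code):
    """Request for POST /sessions/{id}/statements with the code to execute."""
    return livy_request(
        base_url, f"/sessions/{session_id}/statements", {"code": code}
    )
```

To actually submit, something like urllib.request.urlopen(create_session("http://livy.example.com:8998")) would do, where the host is hypothetical and 8998 is Livy's default port.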

Re: GC overhead exceeded

2017-08-18 Thread Patrick Alwell
+1. What is the executor memory? You may need to adjust executor memory and cores. As a simple rule of thumb, each executor can handle 5 concurrent tasks and should have 5 cores. So if your cluster has 100 cores, you’d have 20 executors. And if your cluster memory is 500 GB, each executor would
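
The arithmetic above can be sketched in a few lines of Python. The function name is made up for illustration, and because the message is cut off before the memory figure, the even memory split below is an assumption rather than the author's number. A real deployment would also leave headroom for YARN overhead, the OS, and the driver.

```python
def size_executors(total_cores, total_memory_gb, cores_per_executor=5):
    """Split cluster cores and memory evenly across executors.

    Naive even split (an assumption): ignores memory overhead,
    OS headroom, and the driver.
    """
    num_executors = total_cores // cores_per_executor
    memory_per_executor_gb = total_memory_gb / num_executors
    return num_executors, memory_per_executor_gb


# The cluster from the message: 100 cores, 500 GB of memory.
executors, mem_gb = size_executors(total_cores=100, total_memory_gb=500)
print(
    f"--num-executors {executors} --executor-cores 5 "
    f"--executor-memory {int(mem_gb)}g"
)
```

The printed flags are the usual spark-submit options for these settings; treat the values as a starting point for tuning.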

Re: CSV write to S3 failing silently with partial completion

2017-09-07 Thread Patrick Alwell
Sounds like an S3 bug. Can you replicate locally with HDFS? Try using S3a protocol too; there is a jar you can leverage like so: spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 my_spark_program.py EMR can sometimes be buggy. :/ You could also try
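
A rough sketch of the same idea from inside a PySpark script, assuming the job is started with plain python rather than spark-submit, so that spark.jars.packages can still be set on the builder. The helper names and the bucket are hypothetical; the package coordinates are the ones from the command above.

```python
# Package coordinates copied from the spark-submit line in the message.
S3A_PACKAGES = ",".join([
    "com.amazonaws:aws-java-sdk:1.7.4",
    "org.apache.hadoop:hadoop-aws:2.7.3",
])


def to_s3a(uri):
    """Rewrite an s3:// or s3n:// URI to the s3a:// scheme."""
    scheme, sep, rest = uri.partition("://")
    if sep and scheme in ("s3", "s3n"):
        return "s3a://" + rest
    return uri


def write_csv(df, uri):
    """Write a DataFrame as CSV, with a header row, through s3a."""
    df.write.mode("overwrite").option("header", "true").csv(to_s3a(uri))


def run(uri):
    """Start a session with the S3A packages and write a small test frame."""
    from pyspark.sql import SparkSession  # lazy import; needs Spark installed

    spark = (
        SparkSession.builder
        .config("spark.jars.packages", S3A_PACKAGES)
        .getOrCreate()
    )
    write_csv(spark.range(10), uri)
```

Calling run("s3://my-bucket/out") (bucket name made up) would write through s3a://my-bucket/out. AWS credentials still have to be supplied separately.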

Re: Spark based Data Warehouse

2017-11-12 Thread Patrick Alwell
Alcon, You can most certainly do this. I’ve done benchmarking with Spark SQL and the TPC-DS queries using S3 as the filesystem. Zeppelin and Livy server work well for the dashboarding and concurrent query issues: https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/ Livy

Re: Spark Dataframe and HIVE

2018-02-09 Thread Patrick Alwell
Might sound silly, but are you using a Hive context? What errors do the Hive query results return? spark = SparkSession.builder.enableHiveSupport().getOrCreate() As for the second part of your question, you are creating a temp table and then subsequently creating another table from that temp view.
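
For reference, a minimal sketch of the pattern being discussed: a Hive-enabled session, a temp view, and a table created from that view. The function names, the default view name, and the app name are hypothetical, and since the message is cut off here, this only illustrates the setup and not whatever fix the thread settled on.

```python
def hive_session(app_name="hive-example"):
    """Return a SparkSession with Hive support enabled."""
    from pyspark.sql import SparkSession  # lazy import; needs Spark installed

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()
        .getOrCreate()
    )


def ctas_from_view(table, view):
    """Build a CREATE TABLE ... AS SELECT statement over a temp view."""
    return f"CREATE TABLE {table} AS SELECT * FROM {view}"


def save_view_as_table(spark, df, table, view="tmp_view"):
    """Register df as a temp view, then create a table from it."""
    df.createOrReplaceTempView(view)
    spark.sql(ctas_from_view(table, view))
```

Without enableHiveSupport(), the session uses Spark's own catalog instead of the Hive metastore, so tables created this way may not show up in Hive.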

Re: Spark on EMR suddenly stalling

2017-12-28 Thread Patrick Alwell
Joren, Any time there is a shuffle across the network, Spark moves to a new stage. It seems like you are having issues either pre- or post-shuffle. Have you looked at a resource monitoring tool like Ganglia to determine if this is a memory or thread related issue? Or the Spark UI? You are using

Re: Spark EMR executor-core vs Vcores

2018-02-26 Thread Patrick Alwell
+1. AFAIK, vCores are not the same as cores in AWS. https://samrueby.com/2015/01/12/what-are-amazon-aws-vcpus/ I’ve always understood it as cores = number of concurrent threads. These posts might help you with your research and explain why exceeding 5 cores per executor doesn’t make sense.

Re: Two different Hive instances running

2018-08-17 Thread Patrick Alwell
You probably need to take a look at your hive-site.xml and see what the location is for the Hive Metastore. As for beeline, you can explicitly use an instance of HiveServer2 by passing in its JDBC URL when you launch the client; e.g. beeline -u "jdbc:hive2://example.com:10000" Try
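
One quick way to see which metastore a given hive-site.xml points at is to read the property out directly. This is a small sketch using only the standard library; the function name and the sample host are made up, while hive.metastore.uris is the standard property for a remote metastore.

```python
import xml.etree.ElementTree as ET


def hive_property(hive_site_xml, name):
    """Return the value of a property from hive-site.xml text, or None."""
    root = ET.fromstring(hive_site_xml.strip())
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Hypothetical hive-site.xml content for illustration.
sample = """
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore.example.com:9083</value>
  </property>
</configuration>
"""
print(hive_property(sample, "hive.metastore.uris"))
```

Running this against the hive-site.xml of each client should show quickly whether they point at different metastores.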

Re: I can't save DataFrame from running Spark locally

2018-01-23 Thread Patrick Alwell
Spark cannot read from S3 locally without the S3a protocol; you’ll more than likely need a local copy of the data, or you’ll need the proper jars to enable S3 communication from the edge to the datacenter.

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-02-28 Thread Patrick Alwell
I don’t think SQLContext is “deprecated” in this sense. It’s still accessible in earlier versions of Spark. But yes, at first glance it looks like you are correct. I don’t see a recordWriter method for parquet outside of the SQL package.