Re: Spark DataSets and multiple write(.) calls

2018-11-20 Thread Michael Shtelma
You can also cache the data frame on disk if it does not fit into memory. An alternative would be to write out the data frame as Parquet and then read it back; you can check whether the whole pipeline works faster this way than with the standard cache. Best, Michael On Tue, Nov 20, 2018 at 9:14 AM Dipl.-In
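A minimal sketch of both suggestions, assuming an existing SparkSession `spark` and DataFrame `df` (the checkpoint path is a placeholder, not from the thread):

    import org.apache.spark.storage.StorageLevel

    // Variant 1: cache on local disk instead of in memory
    val onDisk = df.persist(StorageLevel.DISK_ONLY)
    onDisk.count()  // force materialization

    // Variant 2: write out as Parquet and read it back
    df.write.mode("overwrite").parquet("/data/checkpoints/df")  // hypothetical path
    val reloaded = spark.read.parquet("/data/checkpoints/df")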

Re: Read Avro Data using Spark Streaming

2018-11-14 Thread Michael Shtelma
Hi, you can use this project in order to read Avro with Spark Structured Streaming: https://github.com/AbsaOSS/ABRiS Spark 2.4 also has built-in support for Avro, so you can use the from_avro function there. Best, Michael On Sat, Nov 3, 2018 at 4:34 AM Divya Narayan wrote: > Hi, > > I pr
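A minimal sketch of the built-in route, assuming Spark 2.4, a Kafka source, and the Avro schema kept in a local file (broker, topic, and schema path are placeholders):

    import org.apache.spark.sql.avro.from_avro  // built into Spark 2.4+
    import org.apache.spark.sql.functions.col

    val jsonFormatSchema = new String(java.nio.file.Files.readAllBytes(
      java.nio.file.Paths.get("user.avsc")), "UTF-8")  // hypothetical schema file

    val records = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  // placeholder broker
      .option("subscribe", "avro-topic")                 // placeholder topic
      .load()
      .select(from_avro(col("value"), jsonFormatSchema).as("record"))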

Re: How to increase the parallelism of Spark Streaming application?

2018-11-07 Thread Michael Shtelma
If you configure too many Kafka partitions, you can run into memory issues: this will increase the memory requirements of the Spark job a lot. Best, Michael On Wed, Nov 7, 2018 at 8:28 AM JF Chen wrote: > I have a Spark Streaming application which reads data from kafka and save > the transformatio
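One way to raise downstream parallelism without adding Kafka partitions is to repartition after the source; a minimal Structured Streaming sketch (the original thread may well use DStreams, and broker, topic, and the factor of 32 are placeholders; the repartition introduces a shuffle with its own cost):

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .repartition(32)  // more downstream tasks than Kafka partitions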

Re: [Arrow][Dremio]

2018-05-14 Thread Michael Shtelma
Hi Xavier, Dremio looks really interesting and has a nice UI. I think the idea of replacing SSIS or similar tools with Dremio is not a bad one, but what about complex scenarios with a lot of code and transformations? Is it possible to use Dremio via an API and define your own transformations and transforma

INSERT INTO TABLE_PARAMS fails during ANALYZE TABLE

2018-04-19 Thread Michael Shtelma
Hi everybody, I wanted to test CBO with histograms enabled. In order to do this, I have enabled the property spark.sql.statistics.histogram.enabled. In this test, Derby was used as the database for the Hive metastore. The problem is that in some cases the values that are inserted into table TABLE_PARAMS e
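For reference, the setup that triggers this is roughly the following (table and column names are placeholders); the histogram statistics computed by the ANALYZE command are then stored as values in the metastore's TABLE_PARAMS table:

    // enable equi-height histograms before collecting column statistics
    spark.sql("SET spark.sql.statistics.histogram.enabled=true")
    spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS col1, col2")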

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-28 Thread Michael Shtelma
ltiple directories on different disks. NOTE: In Spark 1.0 and later > this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or > LOCAL_DIRS (YARN) environment variables set by the cluster manager. > > Regards, > Gourav Sengupta > > > > > > On Mon, Mar 26, 201

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-27 Thread Michael Shtelma
you file a jira if this is a bug? > Thanks! > > On Sat, Mar 24, 2018 at 1:23 AM, Michael Shtelma wrote: >> >> Hi Maropu, >> >> the problem seems to be in FilterEstimation.scala on lines 50 and 52: >> >> https://github.com/apache/spark/blob/master/sql/catal

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-26 Thread Michael Shtelma
guess you may have to set it through the hdfs > core-site.xml file. The property you need to set is "hadoop.tmp.dir" which > defaults to "/tmp/hadoop-${user.name}" > > Regards, > Keith. > > http://keith-chapman.com > > On Mon, Mar 19, 2018 at 1:05 PM,

Re: Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Michael Shtelma
maropu > > On Fri, Mar 23, 2018 at 6:20 PM, Michael Shtelma wrote: >> >> Hi all, >> >> I am using Spark 2.3 with activated cost-based optimizer and a couple >> of hive tables, that were analyzed previously. >> >> I am getting the following exceptio

Using CBO on Spark 2.3 with analyzed hive tables

2018-03-23 Thread Michael Shtelma
Hi all, I am using Spark 2.3 with the cost-based optimizer activated and a couple of Hive tables that were analyzed previously. I am getting the following exception for different queries: java.lang.NumberFormatException at java.math.BigDecimal.<init>(BigDecimal.java:494) at java.math.BigDecimal.<init>(BigDec
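The setup described here amounts to roughly the following sketch (table and column names are placeholders, not from the thread):

    val spark = org.apache.spark.sql.SparkSession.builder()
      .config("spark.sql.cbo.enabled", "true")  // activate the cost-based optimizer
      .enableHiveSupport()
      .getOrCreate()

    // tables analyzed beforehand, as described above
    spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS key_col")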

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
h/appcache/application_1521110306769_0041/container_1521110306769_0041_01_04/tmp The JVM is using the second -Djava.io.tmpdir parameter and writing everything to the same directory as before. Best, Michael Sincerely, Michael Shtelma On Mon, Mar 19, 2018 at 6:38 PM, Keith Chapman wrote: > Can yo

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
Chapman wrote: > Hi Michael, > > You could either set spark.local.dir through spark conf or java.io.tmpdir > system property. > > Regards, > Keith. > > http://keith-chapman.com > > On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma wrote: >> >> Hi ev

Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Michael Shtelma
Hi everybody, I am running a Spark job on YARN, and my problem is that the blockmgr-* folders are being created under /tmp/hadoop-msh/nm-local-dir/usercache/msh/appcache/application_id/* This folder can grow to a significant size and does not really fit into the /tmp file system for one job,
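As the later replies in this thread note, on YARN the block manager location comes from the NodeManager's local directories (the LOCAL_DIRS environment variable, i.e. yarn.nodemanager.local-dirs), which override spark.local.dir. Outside of YARN, a minimal sketch of setting it explicitly would look like this (the path is a placeholder):

    val spark = org.apache.spark.sql.SparkSession.builder()
      .config("spark.local.dir", "/data/spark-tmp")  // hypothetical scratch dir off /tmp
      .getOrCreate()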

spark.sql call takes far too long

2018-01-24 Thread Michael Shtelma
Hi all, I have a problem with the performance of the sparkSession.sql call. It takes up to a couple of seconds for me right now. I have a lot of generated temporary tables, which are registered within the session, and also a lot of temporary data frames. Is it possible that the analysis/resolve/an

Re: Inner join with the table itself

2018-01-16 Thread Michael Shtelma
> Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > Mastering Spark SQL https://bit.ly/mastering-spark-sql > Spark Structured Streaming https://bit.ly/spark-structured-streaming > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams > Follow me

Re: Inner join with the table itself

2018-01-15 Thread Michael Shtelma
me/JacekLaskowski > Mastering Spark SQL https://bit.ly/mastering-spark-sql > Spark Structured Streaming https://bit.ly/spark-structured-streaming > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams > Follow me at https://twitter.com/jaceklaskowski > > On Mon, Jan 15, 2018 a

Using UDF compiled with Janino in Spark

2017-12-15 Thread Michael Shtelma
Hi all, I am trying to compile my UDF with the Janino compiler and then register it in Spark and use it afterwards. Here is the code: String s = " \n" + "public class MyUDF implements org.apache.spark.sql.api.java.UDF1 {\n" + "@Override\n" + "public St
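A minimal self-contained sketch of the approach, assuming Janino's SimpleCompiler and a trivial upper-casing UDF body (the class name and logic are placeholders, not the thread's code); note that a class compiled only on the driver this way is not automatically available on the executors:

    import org.apache.spark.sql.api.java.UDF1
    import org.apache.spark.sql.types.DataTypes
    import org.codehaus.janino.SimpleCompiler

    val source =
      """public class MyUDF implements org.apache.spark.sql.api.java.UDF1<String, String> {
        |  @Override
        |  public String call(String s) { return s.toUpperCase(); }
        |}""".stripMargin

    val compiler = new SimpleCompiler()
    compiler.cook(source)  // compile the Java source in memory

    val instance = compiler.getClassLoader
      .loadClass("MyUDF").getDeclaredConstructor().newInstance()
      .asInstanceOf[UDF1[String, String]]

    spark.udf.register("myudf", instance, DataTypes.StringType)
    spark.sql("SELECT myudf('hello')").show()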