Inconsistent dataset behavior between file and in-memory versions

2019-09-12 Thread Dean Arnold
I have some code to recover a complex structured row from a dataset. The row contains several ARRAY fields (mostly Array(IntegerType)), which are populated with Array[java.lang.Integer], as that seems to be the only way the Spark row serializer will accept them. If the dataset is written out to a

Exception when reading multiline JSON file

2019-09-12 Thread Kumaresh AK
Hello Spark Community! I am new to Spark. I tried to read a multiline JSON file (around 2M records, about 2 GB gzipped) and encountered an exception. It works if I convert the same file to JSONL before reading it with Spark. Unfortunately the file is private and I cannot share it. Is
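
For readers who want to try the workaround mentioned above (converting the file to JSONL before reading it), here is a minimal sketch using only the Python standard library. The original message does not say how the conversion was done or how the file is structured, so everything here is an assumption: the function name multiline_json_to_jsonl is hypothetical, and the input is assumed to be a single gzipped JSON document whose top level is an array of records. Note that json.load reads the whole document into memory, so a file of the size described would likely need a streaming parser instead.

```python
import gzip
import json


def multiline_json_to_jsonl(src_path, dst_path):
    """Rewrite a gzipped JSON array as gzipped JSON Lines; return the record count.

    Assumption (not from the original post): the source holds exactly one
    top-level JSON array.
    """
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        records = json.load(src)  # loads the entire document into memory
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")

    count = 0
    with gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for record in records:
            # json.dumps without indent never emits raw newlines,
            # so each record lands on exactly one line.
            dst.write(json.dumps(record, ensure_ascii=False))
            dst.write("\n")
            count += 1
    return count
```

The output can then be read with Spark's default JSON reader, which expects one record per line.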

Re: Exception when reading multiline JSON file

2019-09-12 Thread Kevin Mellott
Hi Kumaresh, This is most likely caused by your Spark cluster not being large enough for the task. Signs of this kind of problem are a stack trace that mentions an exceeded size limit or a lost node. However, this is also a great

Re: Monitor Spark Applications

2019-09-12 Thread raman gugnani
Hi Alex, Thanks, I will check this out. Can it be done directly, as Spark also exposes the JVM metrics? My one doubt here is how to assign fixed JMX ports to the driver and executors. @Alex, is there any difference between fetching data via JMX and using the banzaicloud jar? On Fri, 13 Sep 2019 at

Monitor Spark Applications

2019-09-12 Thread raman gugnani
Hi Team, I am new to Spark. I am using Spark on Hortonworks Data Platform with Amazon EC2 machines, running in cluster mode on YARN. I need to monitor individual JVMs and other Spark metrics with *Prometheus*. Can anyone suggest a way to do this? -- Raman Gugnani

Re: Monitor Spark Applications

2019-09-12 Thread Alex Landa
Hi, We are starting to use https://github.com/banzaicloud/spark-metrics . Keep in mind that their solution is for Spark on K8s; to make it work for Spark on YARN you have to copy the dependencies of spark-metrics into the Spark jars folder on all the Spark machines (took me a while to figure out).