Re: Spark parse fixed length file [Java]

2018-04-19 Thread lsn24
Thanks for the response, JayeshLalwani. Clearly in my case the issue was with my approach, not with memory. The job was taking much longer even for a smaller dataset. Thanks again!

Re: Spark parse fixed length file [Java]

2018-04-19 Thread lsn24
I was able to solve it by writing a Java method (to slice and dice the data) and invoking that method from spark.map. This transformed the data much faster than my previous approach. Thanks geoHeil for the pointer.
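
For anyone hitting the same wall, here is a minimal sketch of what that looks like. The column widths, HDFS path, and class name are made up for illustration; the thread does not give the real record layout.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class FixedWidthParser {

    // Hypothetical column widths; substitute the real record layout.
    private static final int[] WIDTHS = {10, 5, 20};

    // Plain Java method that slices one fixed-length line into its fields.
    public static String slice(String line) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (int i = 0; i < WIDTHS.length; i++) {
            if (i > 0) out.append('|');
            out.append(line.substring(pos, pos + WIDTHS[i]).trim());
            pos += WIDTHS[i];
        }
        return out.toString();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("fixed-width").getOrCreate();
        Dataset<String> raw = spark.read().textFile("hdfs:///data/input.bz2"); // illustrative path
        // One call into plain Java per line, instead of one expression per column in SQL.
        Dataset<String> parsed = raw.map(
                (MapFunction<String, String>) FixedWidthParser::slice,
                Encoders.STRING());
        parsed.show(5, false);
    }
}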

Re: Spark parse fixed length file [Java]

2018-04-15 Thread Lalwani, Jayesh
Is your input data partitioned? How much memory have you assigned to your executor? Have you looked at how much time is being spent in GC in the executor? Is Spark spilling the data to disk? It is likely that the partition is too big: Spark tries to read the whole partition into memory.
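
For reference, a minimal sketch of two of the knobs raised here: checking the partition count and splitting the input into more, smaller partitions. The memory size and partition count below are illustrative, not tuned values; per-task GC time is visible in the Spark UI's stage pages.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class PartitionCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("fixed-width")
                .config("spark.executor.memory", "4g") // illustrative, not a tuned value
                .getOrCreate();

        Dataset<String> raw = spark.read().textFile("hdfs:///data/input.bz2"); // illustrative path
        System.out.println("input partitions: " + raw.rdd().getNumPartitions());

        // More, smaller partitions cap how much data a single task holds in memory.
        Dataset<String> repartitioned = raw.repartition(200); // illustrative count
        System.out.println("after repartition: " + repartitioned.rdd().getNumPartitions());
    }
}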

Re: Spark parse fixed length file [Java]

2018-04-13 Thread Georg Heiler
I am not 100% sure whether Spark is smart enough to achieve this in a single pass over the data. If not, you could create a Java UDF that correctly parses all the columns at once. Otherwise, you could enable Tungsten off-heap memory, which might speed things up.
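
A rough sketch of both suggestions combined: a Java UDF that emits all fields from one pass over the line, plus the standard Tungsten off-heap settings (spark.memory.offHeap.enabled / spark.memory.offHeap.size). The column offsets, path, off-heap size, and class name are hypothetical.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class SinglePassUdf {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("fixed-width-udf")
                // Tungsten off-heap memory, as suggested; the 2g size is illustrative.
                .config("spark.memory.offHeap.enabled", "true")
                .config("spark.memory.offHeap.size", "2g")
                .getOrCreate();

        // One UDF invocation parses every column in a single pass over the line.
        // The offsets below are hypothetical; substitute the real record layout.
        spark.udf().register("parseFixed",
                (UDF1<String, String[]>) line -> new String[] {
                        line.substring(0, 10).trim(),
                        line.substring(10, 15).trim(),
                        line.substring(15, 35).trim()
                },
                DataTypes.createArrayType(DataTypes.StringType));

        Dataset<Row> parsed = spark.read().textFile("hdfs:///data/input.bz2").toDF("value")
                .withColumn("fields", callUDF("parseFixed", col("value")))
                .select(col("fields").getItem(0).alias("col1"),
                        col("fields").getItem(1).alias("col2"),
                        col("fields").getItem(2).alias("col3"));
        parsed.show(5, false);
    }
}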

Spark parse fixed length file [Java]

2018-04-13 Thread lsn24
Hello, We are running into issues while trying to process fixed-length files using Spark. The approach we took is as follows: 1. Read the .bz2 file into a dataset from HDFS using the spark.read().textFile() API. Create a temporary view. Dataset rawDataset =
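
A minimal sketch of step 1 as described (reading the .bz2 file from HDFS with spark.read().textFile() and registering a temporary view), assuming a hypothetical path and view name:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class FixedLengthRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("fixed-length").getOrCreate();

        // Read the .bz2 file from HDFS as a Dataset of lines
        // (Spark decompresses bzip2 transparently) and register a temp view.
        Dataset<String> rawDataset =
                spark.read().textFile("hdfs:///data/fixed_length.bz2"); // illustrative path
        rawDataset.createOrReplaceTempView("raw_lines");

        // The view exposes one string column named "value", which SQL can slice:
        spark.sql("SELECT trim(substring(value, 1, 10)) AS col1 FROM raw_lines").show(5, false);
    }
}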