I am not 100% sure whether Spark is smart enough to do this in a single
pass over the data. If it is not, you could write a Java UDF that parses
all the columns at once.
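
A rough sketch of what that UDF could look like (untested; only the first few
of your 17 columns are shown, "parseLine" is just a placeholder name, and
filePath/outputDirectory are your existing variables). Java's substring() is
0-based and takes an end index, so SQL's SUBSTRING(value, 46, 8) becomes
substring(45, 53):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    // Struct with one field per output column; extend this to all 17 columns.
    StructType schema = new StructType()
            .add("record1", DataTypes.StringType)
            .add("record2", DataTypes.StringType)
            .add("record5", DataTypes.LongType);

    // A single UDF call parses the whole line instead of one SUBSTRING per column.
    sparkSession.udf().register("parseLine",
            (UDF1<String, Row>) line -> RowFactory.create(
                    line.substring(0, 16).trim(),                    // record1
                    line.substring(16, 24).trim(),                   // record2
                    Long.parseLong(line.substring(45, 53).trim())),  // record5
            schema);

    Dataset<String> rawDataset = sparkSession.read().textFile(filePath);
    rawDataset.createOrReplaceTempView("tempView");

    // Expand the struct returned by the UDF back into individual columns.
    Dataset<Row> parsed = sparkSession.sql(
            "SELECT parsed.* FROM (SELECT parseLine(value) AS parsed FROM tempView) t");
    parsed.write().mode(SaveMode.Append).parquet(outputDirectory);

You could probably get the same single pass with rawDataset.map(...) and a bean
encoder and skip the SQL layer entirely, but I have not compared the two.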


Otherwise, you could enable Tungsten off-heap memory, which might speed
things up.
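
The relevant settings would be something like the following when building the
session (the 2g size is only an example, size it to whatever your executors can
spare):

    import org.apache.spark.sql.SparkSession;

    SparkSession sparkSession = SparkSession.builder()
            .appName("fixed-width-parse")
            .config("spark.memory.offHeap.enabled", "true")
            .config("spark.memory.offHeap.size", "2g")   // example value only
            .getOrCreate();
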
lsn24 <lekshmi.s...@gmail.com> wrote on Fri, Apr 13, 2018 at 19:02:

> Hello,
>
>  We are running into issues while trying to process fixed-length files
> using Spark.
>
> The approach we took is as follows:
>
> 1. Read the .bz2 file from HDFS into a dataset using the
> spark.read().textFile() API, and create a temporary view.
>
>      Dataset<String> rawDataset = sparkSession.read().textFile(filePath);
>      rawDataset.createOrReplaceTempView(tempView);
>
> 2. Run a SQL query on the view to slice and dice the data the way we need
> it (using SUBSTRING).
>
>  (SELECT
>      TRIM(SUBSTRING(value, 1, 16)) AS record1,
>      TRIM(SUBSTRING(value, 17, 8)) AS record2,
>      TRIM(SUBSTRING(value, 25, 5)) AS record3,
>      TRIM(SUBSTRING(value, 30, 16)) AS record4,
>      CAST(SUBSTRING(value, 46, 8) AS BIGINT) AS record5,
>      CAST(SUBSTRING(value, 54, 6) AS BIGINT) AS record6,
>      CAST(SUBSTRING(value, 60, 3) AS BIGINT) AS record7,
>      CAST(SUBSTRING(value, 63, 6) AS BIGINT) AS record8,
>      TRIM(SUBSTRING(value, 69, 20)) AS record9,
>      TRIM(SUBSTRING(value, 89, 40)) AS record10,
>      TRIM(SUBSTRING(value, 129, 32)) AS record11,
>      TRIM(SUBSTRING(value, 161, 19)) AS record12,
>      TRIM(SUBSTRING(value, 180, 1)) AS record13,
>      TRIM(SUBSTRING(value, 181, 9)) AS record14,
>      TRIM(SUBSTRING(value, 190, 3)) AS record15,
>      CAST(SUBSTRING(value, 193, 8) AS BIGINT) AS record16,
>      CAST(SUBSTRING(value, 201, 8) AS BIGINT) AS record17
>  FROM tempView)
>
> 3. Write the output of the SQL query to a Parquet file.
>      loadDataset.write().mode(SaveMode.Append).parquet(outputDirectory);
>
> Problem :
>
>   Step #2 takes much longer when the lines are ~2000 characters long. If
> each line in the file is only 1000 characters, it takes about 4 minutes to
> process 20 million lines; if we increase the line length to 2000 characters,
> it takes 20 minutes for the same 20 million lines.
>
>
> Is there a better way in Spark to parse fixed-length lines?
>
>
> *Note:* We are using Spark 2.2.0 with Java.
>
>
>
>
