Re: Help With unstructured text file with spark scala

Bitfox Wed, 09 Feb 2022 04:29:33 -0800

Hi

I don't have the full picture of your situation, but if you want a Scala
solution, I think you could use regexes to match and capture the fields.
Here is one I wrote that you can adapt to your needs.


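import scala.io.Source
import scala.collection.mutable.ArrayBuffer

// list1 collects lines with at least two '#', list2 lines with exactly one '#'.
val list1 = ArrayBuffer[(String, String, String)]()
val list2 = ArrayBuffer[(String, String)]()

// patt1 is tried first, so patt2 only sees lines that contain a single '#'.
val patt1 = """^(.*)#(.*)#([^#]*)$""".r
val patt2 = """^(.*)#([^#]*)$""".r

val file = "1.txt"
val source = Source.fromFile(file)
val lines = source.getLines()

for (x <- lines) {
  x match {
    case patt1(k, v, z) => list1 += ((k, v, z))
    case patt2(k, v)    => list2 += ((k, v))
    case _              => println("no match")  // line has no '#'
  }
}

source.close()  // release the file handle once all lines are read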
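
To see what the two patterns actually capture, here is a small sketch that
runs them on sample lines taken from this thread. The object name
PatternDemo and the helper fields are made up for illustration; only the
two regexes come from the snippet above.

```scala
object PatternDemo {
  // Same two patterns as in the snippet above.
  val patt1 = """^(.*)#(.*)#([^#]*)$""".r
  val patt2 = """^(.*)#([^#]*)$""".r

  // Hypothetical helper: returns the captured groups of a line,
  // or Nil when the line contains no '#'.
  def fields(line: String): List[String] = line match {
    case patt1(k, v, z) => List(k, v, z)
    case patt2(k, v)    => List(k, v)
    case _              => Nil
  }

  def main(args: Array[String]): Unit = {
    val samples = List(
      "Operadora#AMIL",
      "58693 - NACIONAL R COPART PJCE#065751353#Jose Silva",
      "Filial#SÃO PAULO#Unidade#Guarulhos",
      "Relação de Beneficiários Ativos e Excluídos"
    )
    samples.foreach(s => println(s"$s -> ${fields(s)}"))
  }
}
```

Note that the four-field line "Filial#SÃO PAULO#Unidade#Guarulhos" also
matches patt1, and because the first group is greedy it comes back as
("Filial#SÃO PAULO", "Unidade", "Guarulhos"). If that is not what you want,
splitting on '#' may be simpler than adding a third pattern.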



list1 and list2 now hold the captured fields, and from there you can
convert them to a DataFrame.
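
Since the requested DataFrame repeats the header values (operadora, filial,
unidade, contrato, empresa) on every beneficiary row, one possible approach
is to walk the file once and remember the last header values seen. The
sketch below assumes the layout is exactly as in the sample from this
thread; the names Beneficiario and LayoutParser are hypothetical, and it
splits on '#' instead of using the regexes above.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical record type; fields follow the columns requested in the thread.
// The beneficiary code stays a String so a leading zero is not lost.
case class Beneficiario(
  operadora: String, filial: String, unidade: String,
  contrato: String, empresa: String,
  plano: String, codigoBeneficiario: String, nomeBeneficiario: String
)

object LayoutParser {
  // Walks the lines once, carrying the most recent header values forward
  // and emitting one record per three-field data line.
  def parse(lines: Iterator[String]): List[Beneficiario] = {
    var operadora = ""
    var filial = ""
    var unidade = ""
    var contrato = ""
    var empresa = ""
    val out = ListBuffer[Beneficiario]()

    for (line <- lines) {
      // limit -1 keeps trailing and empty fields
      line.split("#", -1) match {
        case Array("Operadora", v) => operadora = v
        case Array("Filial", f, "Unidade", u) =>
          filial = f
          unidade = u
        case Array("Contrato", v) => contrato = v
        case Array("Empresa", v)  => empresa = v
        case Array("Plano", _, _) => // column header of a data block, skip
        case Array(plano, codigo, nome) =>
          out += Beneficiario(operadora, filial, unidade, contrato, empresa,
                              plano, codigo, nome)
        case _ => // report title, "Carteira em" line, blank lines
      }
    }
    out.toList
  }
}
```

With a SparkSession named spark in scope, something like
"import spark.implicits._" followed by "LayoutParser.parse(lines).toDF()"
should give a DataFrame with one column per field. I have not run that part
against your file, so treat it as a starting point.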


Thanks.

On Wed, Feb 9, 2022 at 7:20 PM Danilo Sousa <danilosousa...@gmail.com>
wrote:

> Hello
>
>
> Yes, I can open this block as CSV with a # delimiter, but the file also
> has a block that is not in CSV format.
>
> That block looks like key-value pairs.
>
> We have two different layouts in the same file. This is the “problem”.
>
> Thanks for your time.
>
>
>
> Relação de Beneficiários Ativos e Excluídos
>> Carteira em#27/12/2019##Todos os Beneficiários
>> Operadora#AMIL
>> Filial#SÃO PAULO#Unidade#Guarulhos
>>
>> Contrato#123456 - Test
>> Empresa#Test
>
>
> On 9 Feb 2022, at 00:58, Bitfox <bit...@bitfox.top> wrote:
>
> Hello
>
> You can treat it as a CSV file and load it with Spark:
>
> >>> df = spark.read.format("csv").option("inferSchema",
> "true").option("header", "true").option("sep","#").load(csv_file)
> >>> df.show()
> +--------------------+-------------------+-----------------+
> |               Plano|Código Beneficiário|Nome Beneficiário|
> +--------------------+-------------------+-----------------+
> |58693 - NACIONAL ...|           65751353|       Jose Silva|
> |58693 - NACIONAL ...|           65751388|      Joana Silva|
> |58693 - NACIONAL ...|           65751353|     Felipe Silva|
> |58693 - NACIONAL ...|           65751388|      Julia Silva|
> +--------------------+-------------------+-----------------+
>
>
> cat csv_file:
>
> Plano#Código Beneficiário#Nome Beneficiário
> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>
> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>
>
> Regards
>
>
> On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa <danilosousa...@gmail.com>
> wrote:
>
>> Hi
>> I need to transform unstructured text into a DataFrame.
>> Could anyone please help with the Scala code?
>>
>> The DataFrame needs these columns:
>>
>> operadora filial unidade contrato empresa plano codigo_beneficiario
>> nome_beneficiario
>>
>> Relação de Beneficiários Ativos e Excluídos
>> Carteira em#27/12/2019##Todos os Beneficiários
>> Operadora#AMIL
>> Filial#SÃO PAULO#Unidade#Guarulhos
>>
>> Contrato#123456 - Test
>> Empresa#Test
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>>
>> Contrato#898011000 - FUNDACAO GERDAU
>> Empresa#FUNDACAO GERDAU
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>>
>>
>
