Hi Danilo,

Do you have only a single large file? If so, I guess you could use tools like sed/awk to split it into separate files by layout, and then read those files into Spark.
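As a rough sketch of that idea, and assuming the file always follows the layout of the sample quoted further down: instead of splitting, a single awk pass can carry the key-value header fields (Operadora, Filial, Unidade, Contrato, Empresa) forward and emit one #-delimited row per beneficiary, so the output has a single layout. The file names `beneficiarios.txt` and `flat.txt` are placeholders I made up, and the rule "a data row has exactly three fields" is an assumption based on that sample only.

```shell
# Sample input copied from the thread; beneficiarios.txt is a placeholder name.
cat > beneficiarios.txt <<'EOF'
Relação de Beneficiários Ativos e Excluídos
Carteira em#27/12/2019##Todos os Beneficiários
Operadora#AMIL
Filial#SÃO PAULO#Unidade#Guarulhos

Contrato#123456 - Test
Empresa#Test
Plano#Código Beneficiário#Nome Beneficiário
58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
58693 - NACIONAL R COPART PJCE#073930313#Maria Silva

Contrato#898011000 - FUNDACAO GERDAU
Empresa#FUNDACAO GERDAU
Plano#Código Beneficiário#Nome Beneficiário
58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
EOF

awk -F'#' -v OFS='#' '
  # Emit a header with the column names requested in the original question.
  BEGIN { print "operadora", "filial", "unidade", "contrato", "empresa", "plano", "codigo_beneficiario", "nome_beneficiario" }

  # Key-value lines: remember the latest value and move on.
  $1 == "Operadora" { operadora = $2; next }
  $1 == "Filial"    { filial = $2; unidade = $4; next }
  $1 == "Contrato"  { contrato = $2; next }
  $1 == "Empresa"   { empresa = $2; next }

  # Column header of each tabular block: skip it.
  $1 == "Plano"     { next }

  # Assumption: any remaining line with exactly three fields is a beneficiary row.
  NF == 3           { print operadora, filial, unidade, contrato, empresa, $1, $2, $3 }
' beneficiarios.txt > flat.txt

cat flat.txt
```

The resulting `flat.txt` can then be loaded into Spark as a CSV with `header` enabled and `sep` set to `#`, the same way as in the CSV example quoted below. Note that with `inferSchema` the beneficiary code is read as a number and loses its leading zero, as the quoted `df.show()` output shows; leaving schema inference off keeps it as a string.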

On Wed, 9 Feb 2022 at 09:30, Bitfox <[email protected]> wrote:

> Hi
>
> I am not sure about the whole situation, but if you want a Scala
> solution, I think you could use regexes to match and capture the
> keywords. Here is one I wrote that you can modify on your end.
>
>   import scala.io.Source
>   import scala.collection.mutable.ArrayBuffer
>
>   // Three-field lines go to list1, two-field lines to list2.
>   val list1 = ArrayBuffer[(String, String, String)]()
>   val list2 = ArrayBuffer[(String, String)]()
>
>   // Only the last group excludes '#', so on a line with more than
>   // three fields patt1 still matches and the extra fields stay in k.
>   val patt1 = """^(.*)#(.*)#([^#]*)$""".r
>   val patt2 = """^(.*)#([^#]*)$""".r
>
>   val file = "1.txt"
>   val source = Source.fromFile(file)
>
>   try {
>     for (x <- source.getLines()) {
>       x match {
>         case patt1(k, v, z) => list1 += ((k, v, z))
>         case patt2(k, v)    => list2 += ((k, v))
>         case _              => println("no match")
>       }
>     }
>   } finally {
>     source.close() // release the file handle
>   }
>
> Now list1 and list2 hold the elements you wanted, and you can
> convert them to a dataframe easily.
>
> Thanks.
>
> On Wed, Feb 9, 2022 at 7:20 PM Danilo Sousa <[email protected]> wrote:
>
>> Hello
>>
>> Yes, I can open this block as CSV with # as the delimiter, but the
>> file also has a block that is not in CSV format. That block looks
>> more like key-value pairs.
>>
>> We have two different layouts in the same file. This is the "problem".
>>
>> Thanks for your time.
>>
>> Relação de Beneficiários Ativos e Excluídos
>> Carteira em#27/12/2019##Todos os Beneficiários
>> Operadora#AMIL
>> Filial#SÃO PAULO#Unidade#Guarulhos
>>
>> Contrato#123456 - Test
>> Empresa#Test
>>
>> On 9 Feb 2022, at 00:58, Bitfox <[email protected]> wrote:
>>
>> Hello
>>
>> You can treat it as a CSV file and load it from Spark:
>>
>> >>> df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").option("sep", "#").load(csv_file)
>> >>> df.show()
>> +--------------------+-------------------+-----------------+
>> |               Plano|Código Beneficiário|Nome Beneficiário|
>> +--------------------+-------------------+-----------------+
>> |58693 - NACIONAL ...|           65751353|       Jose Silva|
>> |58693 - NACIONAL ...|           65751388|      Joana Silva|
>> |58693 - NACIONAL ...|           65751353|     Felipe Silva|
>> |58693 - NACIONAL ...|           65751388|      Julia Silva|
>> +--------------------+-------------------+-----------------+
>>
>> cat csv_file:
>>
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>>
>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>>
>> Regards
>>
>> On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa <[email protected]> wrote:
>>
>>> Hi
>>>
>>> I have to transform unstructured text into a dataframe.
>>> Could anyone please help with Scala code?
>>>
>>> The dataframe needs these columns:
>>>
>>> operadora filial unidade contrato empresa plano codigo_beneficiario nome_beneficiario
>>>
>>> Relação de Beneficiários Ativos e Excluídos
>>> Carteira em#27/12/2019##Todos os Beneficiários
>>> Operadora#AMIL
>>> Filial#SÃO PAULO#Unidade#Guarulhos
>>>
>>> Contrato#123456 - Test
>>> Empresa#Test
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>>> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>>>
>>> Contrato#898011000 - FUNDACAO GERDAU
>>> Empresa#FUNDACAO GERDAU
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
