Re: Help With unstructured text file with spark scala

Danilo Sousa Mon, 21 Feb 2022 10:34:16 -0800

Yes, this is only a single file.

Thanks, Rafael Mendes.


> On 13 Feb 2022, at 07:13, Rafael Mendes <rafaelpir...@gmail.com> wrote:
> 
> Hi, Danilo.
> Do you have only a single large file?
> If so, you could use tools like sed/awk to split it into separate files 
> by layout, and then read those files into Spark.
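
The same layout-based split could also be sketched in Scala rather than sed/awk. This is only an illustration: it assumes a record line always has exactly three `#`-separated fields with an all-digit middle field (as in the sample data), and that every other non-blank line belongs to the header/key-value layout. `LayoutSplit` and `splitByLayout` are hypothetical names, not part of any library.

```scala
object LayoutSplit {
  // Assumed record layout: plano#codigo#nome, where codigo is all digits.
  private val Record = """^([^#]+)#(\d+)#([^#]+)$""".r

  /** Returns (record lines, all other non-blank lines), preserving order. */
  def splitByLayout(lines: Seq[String]): (Seq[String], Seq[String]) =
    lines
      .filter(_.trim.nonEmpty)                               // drop blank separators
      .partition(l => Record.pattern.matcher(l).matches())   // records vs. the rest
}
```

Each half could then be written to its own file, or handled separately in memory.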
> 
> 
> On Wed, 9 Feb 2022 at 09:30, Bitfox <bit...@bitfox.top> wrote:
> Hi
> 
> I am not sure I understand the whole situation.
> But if you want a Scala solution, I think you could use regexes to match and 
> capture the fields.
> Here is one I wrote that you can adapt on your end.
> 
> import scala.io.Source
> import scala.collection.mutable.ArrayBuffer
> 
> val list1 = ArrayBuffer[(String,String,String)]()
> val list2 = ArrayBuffer[(String,String)]()
> 
> 
> val patt1 = """^(.*)#(.*)#([^#]*)$""".r
> val patt2 = """^(.*)#([^#]*)$""".r
> 
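> val file = "1.txt"
> val source = Source.fromFile(file)
> 
> // Try the three-field pattern first, then fall back to the two-field one.
> try {
>   for ( x <- source.getLines() ) {
>     x match {
>       case patt1(k,v,z) => list1 += ((k,v,z))
>       case patt2(k,v) => list2 += ((k,v))
>       case _ => println("no match")
>     }
>   }
> } finally {
>   source.close() // release the file handle
> }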
> 
> 
> Now list1 and list2 hold the elements you wanted, and you can convert them to 
> a DataFrame easily.
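
Two separate lists lose the link between a beneficiary line and the Contrato/Empresa header it sits under, which the requested DataFrame needs. A minimal sketch of one way to keep that link is shown here as plain Scala without Spark. It assumes the header keys are exactly `Operadora`, `Filial`/`Unidade`, `Contrato` and `Empresa` as in the sample, and that a beneficiary line has three fields with an all-digit code. `BeneficiaryParser`, `Row` and `Ctx` are hypothetical names introduced for illustration.

```scala
object BeneficiaryParser {
  // One output row per beneficiary, mirroring the columns requested in the thread.
  final case class Row(
    operadora: String, filial: String, unidade: String,
    contrato: String, empresa: String,
    plano: String, codigoBeneficiario: String, nomeBeneficiario: String)

  // Header values seen so far; they apply to every record line that follows.
  private final case class Ctx(
    operadora: String = "", filial: String = "", unidade: String = "",
    contrato: String = "", empresa: String = "")

  def parse(lines: Iterator[String]): Vector[Row] = {
    val (_, rows) = lines.foldLeft((Ctx(), Vector.empty[Row])) {
      case ((ctx, acc), line) =>
        // limit -1 keeps empty fields, e.g. in "Carteira em#27/12/2019##..."
        line.split("#", -1).toList match {
          case "Operadora" :: v :: Nil => (ctx.copy(operadora = v), acc)
          case "Filial" :: f :: "Unidade" :: u :: Nil =>
            (ctx.copy(filial = f, unidade = u), acc)
          case "Contrato" :: v :: Nil => (ctx.copy(contrato = v), acc)
          case "Empresa" :: v :: Nil => (ctx.copy(empresa = v), acc)
          // Assumption: a record is plano#codigo#nome with an all-digit codigo.
          // The code stays a String so leading zeros are preserved.
          case plano :: codigo :: nome :: Nil
              if codigo.nonEmpty && codigo.forall(_.isDigit) =>
            (ctx, acc :+ Row(ctx.operadora, ctx.filial, ctx.unidade,
                             ctx.contrato, ctx.empresa, plano, codigo, nome))
          case _ => (ctx, acc) // title, column header or blank line: skip
        }
    }
    rows
  }
}
```

With a SparkSession named `spark` in scope, `import spark.implicits._` followed by `BeneficiaryParser.parse(lines).toDF()` should give a DataFrame whose columns are named after the case class fields. Since this runs on the driver, it suits a single file that fits in memory.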
> 
> Thanks.
> 
> On Wed, Feb 9, 2022 at 7:20 PM Danilo Sousa <danilosousa...@gmail.com> wrote:
> Hello
> 
> 
> Yes, I can open that block as CSV with a # delimiter, but the file also has a 
> block that is not in CSV format. 
> 
> That block looks more like key/value pairs. 
> 
> We have two different layouts in the same file. This is the “problem”.
> 
> Thanks for your time.
> 
> 
> 
>> Relação de Beneficiários Ativos e Excluídos
>> Carteira em#27/12/2019##Todos os Beneficiários
>> Operadora#AMIL
>> Filial#SÃO PAULO#Unidade#Guarulhos
>> 
>> Contrato#123456 - Test
>> Empresa#Test
> 
>> On 9 Feb 2022, at 00:58, Bitfox <bit...@bitfox.top> wrote:
>> 
>> Hello
>> 
>> You can treat it as a CSV file and load it with Spark:
>> 
>> >>> df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").option("sep", "#").load(csv_file)
>> >>> df.show()
>> +--------------------+-------------------+-----------------+
>> |               Plano|Código Beneficiário|Nome Beneficiário|
>> +--------------------+-------------------+-----------------+
>> |58693 - NACIONAL ...|           65751353|       Jose Silva|
>> |58693 - NACIONAL ...|           65751388|      Joana Silva|
>> |58693 - NACIONAL ...|           65751353|     Felipe Silva|
>> |58693 - NACIONAL ...|           65751388|      Julia Silva|
>> +--------------------+-------------------+-----------------+
>> 
>> 
>> cat csv_file:
>> 
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>> 
>> 
>> Regards
>> 
>> 
>> On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa <danilosousa...@gmail.com> wrote:
>> Hi
>> I have to transform unstructured text into a DataFrame.
>> Could anyone please help with the Scala code?
>> 
>> The DataFrame needs these columns:
>> 
>> operadora filial unidade contrato empresa plano codigo_beneficiario 
>> nome_beneficiario
>> 
>> Relação de Beneficiários Ativos e Excluídos
>> Carteira em#27/12/2019##Todos os Beneficiários
>> Operadora#AMIL
>> Filial#SÃO PAULO#Unidade#Guarulhos
>> 
>> Contrato#123456 - Test
>> Empresa#Test
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>> 
>> Contrato#898011000 - FUNDACAO GERDAU
>> Empresa#FUNDACAO GERDAU
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>> 
>
