Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
Hi All,

It is simple in Java as well. You can get the Dataset&lt;String&gt; directly:

Dataset&lt;String&gt; encodedString = df.select("Column")
    .where("")  // filter as needed
    .as(Encoders.STRING());

On Mon, 6 Jun 2022 at 15:26, Christophe Préaud <christophe.pre...@kelkoogroup.com> wrote:
Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
Hi Marc,

I'm not very familiar with Spark in Java, but according to the doc
<https://spark.apache.org/docs/latest/sql-getting-started.html#creating-datasets>,
it should be:

Encoder&lt;String&gt; stringEncoder = Encoders.STRING();
dataset.as(stringEncoder);

For the record, it is much simpler in Scala:

dataset.as[String]

Of course, this will only work if your DataFrame contains a single column of type String, e.g.:

val df = spark.read.parquet("Cyrano_de_Bergerac_Acte_V.parquet")
df.printSchema

root
 |-- line: string (nullable = true)

df.as[String]

Otherwise, you will have to convert the Row to a String somehow, e.g. in Scala:

case class Data(f1: String, f2: Int, f3: Long)
val df = Seq(Data("a", 1, 1L), Data("b", 2, 2L), Data("c", 3, 3L), Data("d", 4, 4L), Data("e", 5, 5L)).toDF
val ds = df.map(_.mkString(",")).as[String]
ds.show

+-----+
|value|
+-----+
|a,1,1|
|b,2,2|
|c,3,3|
|d,4,4|
|e,5,5|
+-----+

Regards,
Christophe.
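The per-row conversion that df.map(_.mkString(",")) performs amounts to joining a row's field values with a delimiter. Here is a minimal, Spark-free Java sketch of that logic (the class and method names are invented for illustration; Scala's real Row.mkString is the authoritative behavior):

```java
import java.util.StringJoiner;

public class RowToString {
    // Join the field values of one row with a separator, the way
    // Scala's Row.mkString(",") turns Row("a", 1, 1L) into "a,1,1".
    public static String mkString(Object[] fields, String sep) {
        StringJoiner joiner = new StringJoiner(sep);
        for (Object field : fields) {
            joiner.add(String.valueOf(field)); // nulls render as "null"
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Mirrors the Data("a", 1, 1L) example above.
        System.out.println(mkString(new Object[]{"a", 1, 1L}, ",")); // a,1,1
    }
}
```

In Java, the Spark-side equivalent would be a map with Encoders.STRING() rather than collecting to a List and re-creating the Dataset.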
Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
Yes, thanks Enrico, that was greatly helpful! I was looking for a similar option in the docs but couldn't find one.

Thanks.

On Sat, 4 Jun 2022 at 19:29, Enrico Minack wrote:
Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
You could use .option("nullValue", "+") to tell the parser that '+' means "no value":

spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "+")
  .csv("path")

Enrico
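The effect of the nullValue option can be illustrated without Spark: before type inference, every cell equal to the configured sentinel is replaced with null. A plain-Java sketch of that per-cell substitution (class and method names are invented for illustration):

```java
public class NullValueOption {
    // Replace every cell equal to the configured sentinel with null,
    // which is what the CSV reader's nullValue option does per cell.
    public static String[] applyNullValue(String[] cells, String nullValue) {
        String[] out = new String[cells.length];
        for (int i = 0; i < cells.length; i++) {
            out[i] = nullValue.equals(cells[i]) ? null : cells[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // The "+"-containing row from the example table: "+" becomes null,
        // so that column no longer has to be inferred as StringType.
        String[] cleaned = applyNullValue(
            new String[]{"+", "true", "C", "Y", "200", "G"}, "+");
        System.out.println(cleaned[0]); // null
    }
}
```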
Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
c1    c2     c3   c4   c5    c6
1.2   true   A    Z    120   +
1.3   false  B    X    130   F
+     true   C    Y    200   G

In the above table, c1 has double values except in the last row, so:

Dataset&lt;Row&gt; dataset = spark.read().format("csv").option("inferSchema", "true").option("header", "true").load("path");

will yield StringType as the type of column c1, and similarly for c6. I want to recover the true type of each column by first discarding the "+". I use Dataset&lt;String&gt; after filtering the rows (removing "+") because I can then re-read the new dataset using the .csv() method.

Any better idea to do that?
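The infer-after-discarding idea can be sketched without writing out and re-reading the data: treat "+" as absent, and let only the remaining values decide a column's type. A minimal plain-Java illustration of that inference rule (names are hypothetical; Spark's real schema inference covers many more types):

```java
public class InferColumnType {
    // A column is inferred as "double" if every value that is not the
    // sentinel (and not null) parses as a double; otherwise "string".
    // This mirrors why discarding "+" lets c1 come out as DoubleType.
    public static String inferType(String[] column, String sentinel) {
        for (String cell : column) {
            if (cell == null || cell.equals(sentinel)) {
                continue; // discarded values don't influence the type
            }
            try {
                Double.parseDouble(cell);
            } catch (NumberFormatException e) {
                return "string"; // one non-numeric value forces StringType
            }
        }
        return "double";
    }

    public static void main(String[] args) {
        // Column c1 from the table: doubles everywhere except the "+" row.
        System.out.println(inferType(new String[]{"1.2", "1.3", "+"}, "+")); // double
        // Column c6 stays string because of "F" and "G".
        System.out.println(inferType(new String[]{"+", "F", "G"}, "+")); // string
    }
}
```

With nullValue set on the CSV reader as suggested in the thread, inferSchema applies this kind of rule in a single read, with no filter-and-re-read round trip.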
Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
Can you provide an example string (row) and the expected inferred schema?

Enrico

On 04.06.22 at 18:36, marc nicole wrote:
Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
How to do just that? I thought we can only inferSchema when we first read the dataset, or am I wrong?

On Sat, 4 Jun 2022 at 18:10, Sean Owen wrote:
Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
It sounds like you want to interpret the input as strings, do some processing, then infer the schema. That has nothing to do with construing the entire row as a string like "Row[foo=bar, baz=1]".

On Sat, Jun 4, 2022 at 10:32 AM marc nicole wrote:
Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
Hi Sean,

Thanks. Actually, I have a dataset where I want to inferSchema after discarding the specific String value "+". I do this because the column is currently considered StringType, while if I remove that "+" value it will be considered DoubleType, for example, or something else. Basically, I want to remove "+" from all dataset rows and then inferSchema.

My idea here is to keep only the rows not equal to "+" for the target columns (potentially all of them) and then use spark.read().csv() to read the new filtered dataset with the inferSchema option, which would then yield the correct column types.

What do you think?

On Sat, 4 Jun 2022 at 15:56, Sean Owen wrote:
Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
I don't think you want to do that. You get a string representation of structured data without the structure, at best. This is part of the reason it doesn't work directly this way.

You can use a UDF to call .toString on the Row of course, but again: what are you really trying to do?

On Sat, Jun 4, 2022 at 7:35 AM marc nicole wrote:
How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
Hi,

How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;? What I have tried is:

List&lt;String&gt; list = dataset.as(Encoders.STRING()).collectAsList();
Dataset&lt;String&gt; datasetSt = spark.createDataset(list, Encoders.STRING());
// But this line raises an org.apache.spark.sql.AnalysisException: Try to map struct... to Tuple1, but failed as the number of fields does not line up

The type of the columns is String. How to solve this?