Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-06 Thread Stelios Philippou
Hi All,

Simple in Java as well.
You can get the Dataset&lt;String&gt; directly:

Dataset&lt;String&gt; encodedString = df.select("Column")
    .where("")
    .as(Encoders.STRING());

(Note that appending .toDF() here would convert it back to a Dataset&lt;Row&gt;.)

On Mon, 6 Jun 2022 at 15:26, Christophe Préaud <
christophe.pre...@kelkoogroup.com> wrote:

> Hi Marc,
>
> I'm not very familiar with Spark in Java, but according to the doc
> <https://spark.apache.org/docs/latest/sql-getting-started.html#creating-datasets>,
> it should be:
> Encoder&lt;String&gt; stringEncoder = Encoders.STRING();
> dataset.as(stringEncoder);
>
> For the record, it is much simpler in Scala:
> dataset.as[String]
>
> Of course, this will work if your DataFrame only contains one column of
> type String, e.g.:
> val df = spark.read.parquet("Cyrano_de_Bergerac_Acte_V.parquet")
> df.printSchema
>
> root
>  |-- line: string (nullable = true)
>
> df.as[String]
>
> Otherwise, you will have to somehow convert the Row to a String, e.g. in
> Scala:
> case class Data(f1: String, f2: Int, f3: Long)
> val df = Seq(Data("a", 1, 1L), Data("b", 2, 2L), Data("c", 3, 3L),
> Data("d", 4, 4L), Data("e", 5, 5L)).toDF
> val ds = df.map(_.mkString(",")).as[String]
> ds.show
>
> +-----+
> |value|
> +-----+
> |a,1,1|
> |b,2,2|
> |c,3,3|
> |d,4,4|
> |e,5,5|
> +-----+
>
> Regards,
> Christophe.
>
> On 6/4/22 14:38, marc nicole wrote:
>
> Hi,
> How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
> What I have tried is:
>
> List&lt;String&gt; list = dataset.as(Encoders.STRING()).collectAsList();
> Dataset&lt;String&gt; datasetSt = spark.createDataset(list, Encoders.STRING());
> // But this line raises an org.apache.spark.sql.AnalysisException: Try to
> map struct... to Tuple1, but failed as the number of fields does not line
> up
>
> The columns are of type String.
> How to solve this?
>
>
>


Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-06 Thread Christophe Préaud
Hi Marc,

I'm not very familiar with Spark in Java, but according to the doc
<https://spark.apache.org/docs/latest/sql-getting-started.html#creating-datasets>,
it should be:
Encoder&lt;String&gt; stringEncoder = Encoders.STRING();
dataset.as(stringEncoder);


For the record, it is much simpler in Scala:
dataset.as[String]


Of course, this will work if your DataFrame only contains one column of type 
String, e.g.:
val df = spark.read.parquet("Cyrano_de_Bergerac_Acte_V.parquet")
df.printSchema

root
 |-- line: string (nullable = true)

df.as[String]


Otherwise, you will have to somehow convert the Row to a String, e.g. in Scala:
case class Data(f1: String, f2: Int, f3: Long)
val df = Seq(Data("a", 1, 1L), Data("b", 2, 2L), Data("c", 3, 3L), Data("d", 4, 4L), Data("e", 5, 5L)).toDF
val ds = df.map(_.mkString(",")).as[String]
ds.show

+-----+
|value|
+-----+
|a,1,1|
|b,2,2|
|c,3,3|
|d,4,4|
|e,5,5|
+-----+
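The per-row mkString(",") above is just a field join; the same row-to-string step can be sketched outside Spark in plain Python (no Spark involved, and the sample rows are hypothetical, only to illustrate the conversion):

```python
# Join each row's fields into one comma-separated string,
# mirroring what df.map(_.mkString(",")) does per Row.
rows = [("a", 1, 1), ("b", 2, 2), ("c", 3, 3)]  # hypothetical sample rows

def row_to_string(row):
    """Concatenate all fields of a row with commas."""
    return ",".join(str(field) for field in row)

values = [row_to_string(r) for r in rows]
print(values)  # ['a,1,1', 'b,2,2', 'c,3,3']
```

In Spark Java the equivalent would be a map over the Dataset with Encoders.STRING(); the sketch only shows the string-building logic itself.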


Regards,
Christophe.

On 6/4/22 14:38, marc nicole wrote:
> Hi,
> How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
> What I have tried is:
>
> List&lt;String&gt; list = dataset.as(Encoders.STRING()).collectAsList();
> Dataset&lt;String&gt; datasetSt = spark.createDataset(list, Encoders.STRING());
> // But this line raises an org.apache.spark.sql.AnalysisException: Try to
> map struct... to Tuple1, but failed as the number of fields does not line
> up
>
> The columns are of type String.
> How to solve this?



Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread marc nicole
Yes, thanks Enrico, that was greatly helpful!
I was looking for a similar option in the docs but couldn't find one.
Thanks.

On Sat, Jun 4, 2022 at 19:29, Enrico Minack wrote:

> You could use .option("nullValue", "+") to tell the parser that '+' refers
> to "no value":
>
> spark.read
>  .option("inferSchema", "true")
>  .option("header", "true")
>  .option("nullValue", "+")
>  .csv("path")
>
> Enrico
>
>
> On 04.06.22 at 18:54, marc nicole wrote:
>
> c1  | c2    | c3 | c4 | c5  | c6
> 1.2 | true  | A  | Z  | 120 | +
> 1.3 | false | B  | X  | 130 | F
> +   | true  | C  | Y  | 200 | G
> In the above table, c1 has double values except in the last row, so:
>
> Dataset&lt;Row&gt; dataset =
> spark.read().format("csv").option("inferSchema","true").option("header","true").load("path");
> will yield StringType for column c1 (and similarly for c6).
> I want to recover the true type of each column by first discarding the "+".
> I use Dataset&lt;String&gt; after filtering the rows (removing "+") because I
> can re-read the new dataset using the .csv() method.
> Any better idea to do that?
>
> On Sat, Jun 4, 2022 at 18:40, Enrico Minack wrote:
>
>> Can you provide an example string (row) and the expected inferred schema?
>>
>> Enrico
>>
>>
>> On 04.06.22 at 18:36, marc nicole wrote:
>>
>> How to do just that? I thought we can only infer the schema when we first
>> read the dataset, or am I wrong?
>>
>> On Sat, Jun 4, 2022 at 18:10, Sean Owen wrote:
>>
>>> It sounds like you want to interpret the input as strings, do some
>>> processing, then infer the schema. That has nothing to do with construing
>>> the entire row as a string like "Row[foo=bar, baz=1]"
>>>
>>> On Sat, Jun 4, 2022 at 10:32 AM marc nicole  wrote:
>>>
>>>> Hi Sean,
>>>>
>>>> Thanks, actually I have a dataset where I want to infer the schema after
>>>> discarding the specific String value "+". I do this because the column
>>>> would be considered StringType, while if I remove that "+" value it will
>>>> be considered DoubleType, for example, or something else. Basically I
>>>> want to remove "+" from all dataset rows and then infer the schema.
>>>> Here my idea is to filter out the rows equal to "+" for the target
>>>> columns (potentially all of them) and then use spark.read().csv() to read
>>>> the new filtered dataset with the inferSchema option, which would then
>>>> yield correct column types.
>>>> What do you think?
>>>>
>>>> On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:
>>>>
>>>>> I don't think you want to do that. You get a string representation of
>>>>> structured data without the structure, at best. This is part of the reason
>>>>> it doesn't work directly this way.
>>>>> You can use a UDF to call .toString on the Row of course, but, again
>>>>> what are you really trying to do?
>>>>>
>>>>> On Sat, Jun 4, 2022 at 7:35 AM marc nicole 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>> How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
>>>>>> What I have tried is:
>>>>>>
>>>>>> List&lt;String&gt; list = dataset.as(Encoders.STRING()).collectAsList();
>>>>>> Dataset&lt;String&gt; datasetSt = spark.createDataset(list, Encoders.STRING());
>>>>>> // But this line raises an org.apache.spark.sql.AnalysisException: Try to
>>>>>> map struct... to Tuple1, but failed as the number of fields does not line
>>>>>> up
>>>>>>
>>>>>> The columns are of type String.
>>>>>> How to solve this?
>>>>>>
>>>>>
>>
>


Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread Enrico Minack
You could use .option("nullValue", "+") to tell the parser that '+' 
refers to "no value":


spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "+")
  .csv("path")

Enrico
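To see why nullValue helps, the parser's behavior can be mimicked outside Spark. Below is a rough Python sketch; the sample text and the infer_type helper are made up for illustration, and Spark's real inference is more involved:

```python
import csv
import io

NULL_TOKEN = "+"  # the token we want the parser to treat as "no value"
text = "c1,c6\n1.2,+\n1.3,F\n+,G\n"  # hypothetical data shaped like the table above

def parse_with_null(token, raw):
    """Parse CSV text, turning cells equal to `token` into None."""
    reader = csv.DictReader(io.StringIO(raw))
    return [{k: (None if v == token else v) for k, v in row.items()}
            for row in reader]

def infer_type(values):
    """Infer a column type from the non-null values only (rough sketch)."""
    non_null = [v for v in values if v is not None]
    try:
        for v in non_null:
            float(v)
        return "double"
    except ValueError:
        return "string"

rows = parse_with_null(NULL_TOKEN, text)
c1 = [r["c1"] for r in rows]
print(infer_type(c1))  # double: the "+" cells no longer force StringType
```

Once "+" cells are treated as missing, only the numeric values remain for inference, which is exactly what the nullValue option achieves in one pass.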


On 04.06.22 at 18:54, marc nicole wrote:


c1  | c2    | c3 | c4 | c5  | c6
1.2 | true  | A  | Z  | 120 | +
1.3 | false | B  | X  | 130 | F
+   | true  | C  | Y  | 200 | G

In the above table, c1 has double values except in the last row, so:

Dataset&lt;Row&gt; dataset =
spark.read().format("csv").option("inferSchema","true").option("header","true").load("path");

will yield StringType for column c1 (and similarly for c6).
I want to recover the true type of each column by first discarding the "+".
I use Dataset&lt;String&gt; after filtering the rows (removing "+") because
I can re-read the new dataset using the .csv() method.

Any better idea to do that?

On Sat, Jun 4, 2022 at 18:40, Enrico Minack wrote:


Can you provide an example string (row) and the expected inferred
schema?

Enrico


On 04.06.22 at 18:36, marc nicole wrote:

How to do just that? I thought we can only infer the schema when we
first read the dataset, or am I wrong?

On Sat, Jun 4, 2022 at 18:10, Sean Owen wrote:

It sounds like you want to interpret the input as strings, do
some processing, then infer the schema. That has nothing to
do with construing the entire row as a string like
"Row[foo=bar, baz=1]"

On Sat, Jun 4, 2022 at 10:32 AM marc nicole
 wrote:

Hi Sean,

Thanks, actually I have a dataset where I want to
infer the schema after discarding the specific String value
"+". I do this because the column would be considered
StringType, while if I remove that "+" value it will be
considered DoubleType, for example, or something else.
Basically I want to remove "+" from all dataset rows and
then infer the schema.
Here my idea is to filter out the rows equal to "+" for
the target columns (potentially all of them) and then use
spark.read().csv() to read the new filtered dataset with
the option inferSchema, which would then yield correct
column types.
What do you think?

On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:

I don't think you want to do that. You get a string
representation of structured data without the
structure, at best. This is part of the reason it
doesn't work directly this way.
You can use a UDF to call .toString on the Row of
course, but, again what are you really trying to do?

On Sat, Jun 4, 2022 at 7:35 AM marc nicole
 wrote:

Hi,
How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
What I have tried is:

List&lt;String&gt; list =
dataset.as(Encoders.STRING()).collectAsList();
Dataset&lt;String&gt; datasetSt =
spark.createDataset(list, Encoders.STRING()); //
But this line raises
an org.apache.spark.sql.AnalysisException: Try to
map struct... to Tuple1, but failed as the number
of fields does not line up

The columns are of type String.
How to solve this?





Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread marc nicole
c1  | c2    | c3 | c4 | c5  | c6
1.2 | true  | A  | Z  | 120 | +
1.3 | false | B  | X  | 130 | F
+   | true  | C  | Y  | 200 | G
In the above table, c1 has double values except in the last row, so:

Dataset&lt;Row&gt; dataset =
spark.read().format("csv").option("inferSchema","true").option("header","true").load("path");
will yield StringType for column c1 (and similarly for c6).
I want to recover the true type of each column by first discarding the "+".
I use Dataset&lt;String&gt; after filtering the rows (removing "+") because I can
re-read the new dataset using the .csv() method.
Any better idea to do that?

On Sat, Jun 4, 2022 at 18:40, Enrico Minack wrote:

> Can you provide an example string (row) and the expected inferred schema?
>
> Enrico
>
>
> On 04.06.22 at 18:36, marc nicole wrote:
>
> How to do just that? I thought we can only infer the schema when we first
> read the dataset, or am I wrong?
>
> On Sat, Jun 4, 2022 at 18:10, Sean Owen wrote:
>
>> It sounds like you want to interpret the input as strings, do some
>> processing, then infer the schema. That has nothing to do with construing
>> the entire row as a string like "Row[foo=bar, baz=1]"
>>
>> On Sat, Jun 4, 2022 at 10:32 AM marc nicole  wrote:
>>
>>> Hi Sean,
>>>
>>> Thanks, actually I have a dataset where I want to infer the schema after
>>> discarding the specific String value "+". I do this because the column
>>> would be considered StringType, while if I remove that "+" value it will
>>> be considered DoubleType, for example, or something else. Basically I want
>>> to remove "+" from all dataset rows and then infer the schema.
>>> Here my idea is to filter out the rows equal to "+" for the target
>>> columns (potentially all of them) and then use spark.read().csv() to read
>>> the new filtered dataset with the inferSchema option, which would then
>>> yield correct column types.
>>> What do you think?
>>>
>>> On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:
>>>
>>>> I don't think you want to do that. You get a string representation of
>>>> structured data without the structure, at best. This is part of the reason
>>>> it doesn't work directly this way.
>>>> You can use a UDF to call .toString on the Row of course, but, again
>>>> what are you really trying to do?
>>>>
>>>> On Sat, Jun 4, 2022 at 7:35 AM marc nicole  wrote:
>>>>
>>>>> Hi,
>>>>> How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
>>>>> What I have tried is:
>>>>>
>>>>> List&lt;String&gt; list = dataset.as(Encoders.STRING()).collectAsList();
>>>>> Dataset&lt;String&gt; datasetSt = spark.createDataset(list, Encoders.STRING());
>>>>> // But this line raises an org.apache.spark.sql.AnalysisException: Try to
>>>>> map struct... to Tuple1, but failed as the number of fields does not line
>>>>> up
>>>>>
>>>>> The columns are of type String.
>>>>> How to solve this?
>>>>>
>>>>
>


Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread Enrico Minack

Can you provide an example string (row) and the expected inferred schema?

Enrico


On 04.06.22 at 18:36, marc nicole wrote:
How to do just that? I thought we can only infer the schema when we
first read the dataset, or am I wrong?


On Sat, Jun 4, 2022 at 18:10, Sean Owen wrote:

It sounds like you want to interpret the input as strings, do some
processing, then infer the schema. That has nothing to do with
construing the entire row as a string like "Row[foo=bar, baz=1]"

On Sat, Jun 4, 2022 at 10:32 AM marc nicole 
wrote:

Hi Sean,

Thanks, actually I have a dataset where I want to infer the schema
after discarding the specific String value "+". I do this
because the column would be considered StringType, while if I
remove that "+" value it will be considered DoubleType, for
example, or something else. Basically I want to remove "+" from
all dataset rows and then infer the schema.
Here my idea is to filter out the rows equal to "+" for the
target columns (potentially all of them) and then use
spark.read().csv() to read the new filtered dataset with the
option inferSchema, which would then yield correct column types.
What do you think?

On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:

I don't think you want to do that. You get a string
representation of structured data without the structure,
at best. This is part of the reason it doesn't work
directly this way.
You can use a UDF to call .toString on the Row of course,
but, again what are you really trying to do?

On Sat, Jun 4, 2022 at 7:35 AM marc nicole wrote:

Hi,
How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
What I have tried is:

List&lt;String&gt; list =
dataset.as(Encoders.STRING()).collectAsList();
Dataset&lt;String&gt; datasetSt = spark.createDataset(list,
Encoders.STRING()); // But this line raises
an org.apache.spark.sql.AnalysisException: Try to map
struct... to Tuple1, but failed as the number of
fields does not line up

The columns are of type String.
How to solve this?



Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread marc nicole
How to do just that? I thought we can only infer the schema when we first
read the dataset, or am I wrong?

On Sat, Jun 4, 2022 at 18:10, Sean Owen wrote:

> It sounds like you want to interpret the input as strings, do some
> processing, then infer the schema. That has nothing to do with construing
> the entire row as a string like "Row[foo=bar, baz=1]"
>
> On Sat, Jun 4, 2022 at 10:32 AM marc nicole  wrote:
>
>> Hi Sean,
>>
>> Thanks, actually I have a dataset where I want to infer the schema after
>> discarding the specific String value "+". I do this because the column
>> would be considered StringType, while if I remove that "+" value it will
>> be considered DoubleType, for example, or something else. Basically I want
>> to remove "+" from all dataset rows and then infer the schema.
>> Here my idea is to filter out the rows equal to "+" for the target
>> columns (potentially all of them) and then use spark.read().csv() to read
>> the new filtered dataset with the inferSchema option, which would then
>> yield correct column types.
>> What do you think?
>>
>> On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:
>>
>>> I don't think you want to do that. You get a string representation of
>>> structured data without the structure, at best. This is part of the reason
>>> it doesn't work directly this way.
>>> You can use a UDF to call .toString on the Row of course, but, again
>>> what are you really trying to do?
>>>
>>> On Sat, Jun 4, 2022 at 7:35 AM marc nicole  wrote:
>>>
>>>> Hi,
>>>> How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
>>>> What I have tried is:
>>>>
>>>> List&lt;String&gt; list = dataset.as(Encoders.STRING()).collectAsList();
>>>> Dataset&lt;String&gt; datasetSt = spark.createDataset(list, Encoders.STRING());
>>>> // But this line raises an org.apache.spark.sql.AnalysisException: Try to
>>>> map struct... to Tuple1, but failed as the number of fields does not line
>>>> up
>>>>
>>>> The columns are of type String.
>>>> How to solve this?
>>>>
>>>


Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread Sean Owen
It sounds like you want to interpret the input as strings, do some
processing, then infer the schema. That has nothing to do with construing
the entire row as a string like "Row[foo=bar, baz=1]"

On Sat, Jun 4, 2022 at 10:32 AM marc nicole  wrote:

> Hi Sean,
>
> Thanks, actually I have a dataset where I want to infer the schema after
> discarding the specific String value "+". I do this because the column
> would be considered StringType, while if I remove that "+" value it will be
> considered DoubleType, for example, or something else. Basically I want to
> remove "+" from all dataset rows and then infer the schema.
> Here my idea is to filter out the rows equal to "+" for the target columns
> (potentially all of them) and then use spark.read().csv() to read the new
> filtered dataset with the inferSchema option, which would then yield
> correct column types.
> What do you think?
>
> On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:
>
>> I don't think you want to do that. You get a string representation of
>> structured data without the structure, at best. This is part of the reason
>> it doesn't work directly this way.
>> You can use a UDF to call .toString on the Row of course, but, again
>> what are you really trying to do?
>>
>> On Sat, Jun 4, 2022 at 7:35 AM marc nicole  wrote:
>>
>>> Hi,
>>> How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
>>> What I have tried is:
>>>
>>> List&lt;String&gt; list = dataset.as(Encoders.STRING()).collectAsList();
>>> Dataset&lt;String&gt; datasetSt = spark.createDataset(list, Encoders.STRING());
>>> // But this line raises an org.apache.spark.sql.AnalysisException: Try to
>>> map struct... to Tuple1, but failed as the number of fields does not line
>>> up
>>>
>>> The columns are of type String.
>>> How to solve this?
>>>
>>


Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread marc nicole
Hi Sean,

Thanks, actually I have a dataset where I want to infer the schema after
discarding the specific String value "+". I do this because the column
would be considered StringType, while if I remove that "+" value it will be
considered DoubleType, for example, or something else. Basically I want to
remove "+" from all dataset rows and then infer the schema.
Here my idea is to filter out the rows equal to "+" for the target columns
(potentially all of them) and then use spark.read().csv() to read the new
filtered dataset with the inferSchema option, which would then yield
correct column types.
What do you think?

On Sat, Jun 4, 2022 at 15:56, Sean Owen wrote:

> I don't think you want to do that. You get a string representation of
> structured data without the structure, at best. This is part of the reason
> it doesn't work directly this way.
> You can use a UDF to call .toString on the Row of course, but, again
> what are you really trying to do?
>
> On Sat, Jun 4, 2022 at 7:35 AM marc nicole  wrote:
>
>> Hi,
>> How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
>> What I have tried is:
>>
>> List&lt;String&gt; list = dataset.as(Encoders.STRING()).collectAsList();
>> Dataset&lt;String&gt; datasetSt = spark.createDataset(list, Encoders.STRING());
>> // But this line raises an org.apache.spark.sql.AnalysisException: Try to
>> map struct... to Tuple1, but failed as the number of fields does not line
>> up
>>
>> The columns are of type String.
>> How to solve this?
>>
>


Re: How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread Sean Owen
I don't think you want to do that. You get a string representation of
structured data without the structure, at best. This is part of the reason
it doesn't work directly this way.
You can use a UDF to call .toString on the Row of course, but, again
what are you really trying to do?
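If a whole-row string really is the goal, the mapping described here boils down to formatting a row's fields; a minimal sketch of that formatting step in plain Python (the field names and the "Row[...]" format are hypothetical, mirroring the example string above):

```python
# Format a row's names and values as "Row[foo=bar, baz=1]",
# the kind of whole-row string a per-row toString/UDF would yield.
def row_to_display_string(field_names, values):
    body = ", ".join(f"{name}={value}" for name, value in zip(field_names, values))
    return f"Row[{body}]"

print(row_to_display_string(["foo", "baz"], ["bar", 1]))  # Row[foo=bar, baz=1]
```

As noted, this keeps the field values but throws away the structure, which is usually not what you want.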

On Sat, Jun 4, 2022 at 7:35 AM marc nicole  wrote:

> Hi,
> How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
> What I have tried is:
>
> List&lt;String&gt; list = dataset.as(Encoders.STRING()).collectAsList();
> Dataset&lt;String&gt; datasetSt = spark.createDataset(list, Encoders.STRING());
> // But this line raises an org.apache.spark.sql.AnalysisException: Try to
> map struct... to Tuple1, but failed as the number of fields does not line
> up
>
> The columns are of type String.
> How to solve this?
>


How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?

2022-06-04 Thread marc nicole
Hi,
How to convert a Dataset&lt;Row&gt; to a Dataset&lt;String&gt;?
What I have tried is:

List&lt;String&gt; list = dataset.as(Encoders.STRING()).collectAsList();
Dataset&lt;String&gt; datasetSt = spark.createDataset(list, Encoders.STRING());
// But this line raises an org.apache.spark.sql.AnalysisException: Try to
map struct... to Tuple1, but failed as the number of fields does not line
up

The columns are of type String.
How to solve this?