Re: How reading works?

2022-07-13 Thread Sid
Yeah, I understand it now.

Thanks for the explanation, Bjorn.

Sid



Re: How reading works?

2022-07-05 Thread Bjørn Jørgensen
Ehh.. what is a "*duplicate column*"? I don't think Spark supports that.

Or do you mean duplicate column = duplicate rows?


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: How reading works?

2022-07-05 Thread Bjørn Jørgensen
"*but I am getting the issue of the duplicate column which was present in
the old dataset.*"

So you have answered your question!

spark.read.option("multiline", "true").json("path").filter(col("edl_timestamp") > last_saved_timestamp)

As you have figured out, Spark reads all the JSON files in "path" first and
then filters.
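
Here is a minimal sketch of how to see that for yourself (the path and the
cutoff value below are made up). Note that without an explicit schema,
spark.read.json() has to scan the files just to infer a schema, before the
filter is ever applied:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("json-read-check").getOrCreate()

    last_saved_timestamp = "2022090800"  # made-up cutoff

    df = (spark.read
          .option("multiline", "true")
          .json("s3a://my-bucket/data")  # made-up path
          .filter(col("edl_timestamp") > last_saved_timestamp))

    # Print the parsed, analyzed, optimized and physical plans to see
    # where the filter actually sits relative to the file scan.
    df.explain(True)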

Some file formats allow filters to be applied before the files are read.
The one I know about is Parquet, as this link explains: Spark: Understand
the Basic of Pushed Filter and Partition Filter Using Parquet File
<https://medium.com/@songkunjump/spark-understand-the-basic-of-pushed-filter-and-partition-filter-using-parquet-file-3e5789e260bd>
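
As a rough illustration of what that buys you (made-up paths again), the
same filter over a Parquet dataset can be pushed down into the scan itself:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("parquet-pushdown").getOrCreate()

    df = (spark.read
          .parquet("s3a://my-bucket/parquet-data")  # made-up path
          .filter(col("edl_timestamp") > "2022090800"))

    # The scan node in the physical plan should now show the condition
    # under PartitionFilters (partition columns) or PushedFilters (data
    # columns), so non-matching files are skipped instead of being read
    # and thrown away.
    df.explain(True)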






-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: How reading works?

2022-07-05 Thread Sid
Hi Team,

I still need help understanding how reading works, exactly.

Thanks,
Sid



Re: How reading works?

2022-06-20 Thread Sid
Hi Team,

Can somebody help?

Thanks,
Sid



How reading works?

2022-06-19 Thread Sid
Hi,

I have a partitioned JSON dataset in S3, laid out like the below:

edl_timestamp=2022090800

Now, the problem is that the earlier 10 days of collected data have a
duplicate-columns issue, due to which we couldn't read the data.
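
To illustrate what I mean by duplicate columns, here is a made-up record
with two keys that differ only in case; since Spark resolves column names
case-insensitively by default, reading such data fails with a
duplicate-column error:

    {"edl_timestamp": "2022090800", "userName": "abc", "username": "xyz"}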

The latest 10 days of data are proper, so I am trying to do something like
the below:

spark.read.option("multiline", "true").json("path").filter(col("edl_timestamp") > last_saved_timestamp)

but I am getting the duplicate-column issue that was present in the old
dataset. So I am trying to understand how Spark reads the data. Does it
read the full dataset and then filter on the basis of the last saved
timestamp, or does it read only what is required? If the second case were
true, the read should have succeeded, since the latest data is correct.
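
As an aside, the workaround I have in mind (just a sketch; the bucket and
partition names are made up) is to point the reader only at the partitions
that are known to be good:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-good-partitions").getOrCreate()

    # Made-up list of the partitions that are known to parse cleanly.
    good_partitions = [
        "s3a://my-bucket/data/edl_timestamp=2022090800",
        "s3a://my-bucket/data/edl_timestamp=2022090900",
    ]

    # basePath keeps edl_timestamp available as a partition column even
    # though the partition directories are passed explicitly.
    df = (spark.read
          .option("multiline", "true")
          .option("basePath", "s3a://my-bucket/data")
          .json(good_partitions))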

So just trying to understand. Could anyone help here?

Thanks,
Sid