Re: How reading works?
Yeah, I understood that now. Thanks for the explanation, Bjørn.

Sid
Re: How reading works?
Ehh... What is a "*duplicate column*"? I don't think Spark supports that.

duplicate column = duplicate rows
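For context, a minimal sketch of one way a "duplicate column" error can arise when Spark reads JSON. The record contents and temp-file setup are made up for illustration, and the exact behavior and error text may vary by Spark version:

    # Illustrative repro only: a JSON record with a repeated key.
    import os
    import tempfile
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    d = tempfile.mkdtemp()
    with open(os.path.join(d, "part-0000.json"), "w") as f:
        f.write('{"edl_id": 1, "name": "a", "name": "b"}')  # "name" appears twice

    # Schema inference sees the key twice; recent Spark versions typically
    # fail here with an AnalysisException about duplicate column(s).
    spark.read.json(d).show()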
Re: How reading works?
"*but I am getting the issue of the duplicate column which was present in the old dataset.*" So you have answered your question! spark.read.option("multiline","true").json("path").filter( col("edl_timestamp")>last_saved_timestamp) As you have figured out, spark read all the json files in "path" then filter. There are some file formats that can have filters before reading files. The one that I know about is Parquet. Like this link explains Spark: Understand the Basic of Pushed Filter and Partition Filter Using Parquet File <https://medium.com/@songkunjump/spark-understand-the-basic-of-pushed-filter-and-partition-filter-using-parquet-file-3e5789e260bd> tir. 5. jul. 2022 kl. 21:21 skrev Sid : > Hi Team, > > I still need help in understanding how reading works exactly? > > Thanks, > Sid > > On Mon, Jun 20, 2022 at 2:23 PM Sid wrote: > >> Hi Team, >> >> Can somebody help? >> >> Thanks, >> Sid >> >> On Sun, Jun 19, 2022 at 3:51 PM Sid wrote: >> >>> Hi, >>> >>> I already have a partitioned JSON dataset in s3 like the below: >>> >>> edl_timestamp=2022090800 >>> >>> Now, the problem is, in the earlier 10 days of data collection there was >>> a duplicate columns issue due to which we couldn't read the data. >>> >>> Now the latest 10 days of data are proper. So, I am trying to do >>> something like the below: >>> >>> >>> spark.read.option("multiline","true").json("path").filter(col("edl_timestamp")>last_saved_timestamp) >>> >>> but I am getting the issue of the duplicate column which was present in >>> the old dataset. So, I am trying to understand how the spark reads the >>> data. Does it full dataset and filter on the basis of the last saved >>> timestamp or does it filter only what is required? If the second case is >>> true, then it should have read the data since the latest data is correct. >>> >>> So just trying to understand. Could anyone help here? >>> >>> Thanks, >>> Sid >>> >>> >>> -- Bjørn Jørgensen Vestre Aspehaug 4, 6010 Ålesund Norge +47 480 94 297
Re: How reading works?
Hi Team,

I still need help understanding how reading works exactly.

Thanks,
Sid
Re: How reading works?
Hi Team,

Can somebody help?

Thanks,
Sid
How reading works?
Hi,

I already have a partitioned JSON dataset in S3, like the below:

edl_timestamp=2022090800

Now, the problem is that during the earlier 10 days of data collection there was a duplicate-columns issue, due to which we couldn't read the data.

The latest 10 days of data are proper. So, I am trying to do something like the below:

    spark.read.option("multiline","true").json("path").filter(col("edl_timestamp")>last_saved_timestamp)

but I am getting the duplicate-column issue that was present in the old dataset. So, I am trying to understand how Spark reads the data. Does it read the full dataset and then filter on the basis of the last saved timestamp, or does it read only what is required? If the second case were true, it should have been able to read the data, since the latest data is correct.

So I am just trying to understand. Could anyone help here?

Thanks,
Sid
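For illustration, a minimal sketch of one way to avoid reading the broken partitions at all: pass only the newer partition directories to the reader instead of the dataset root. The bucket, paths, and the hard-coded partition list are hypothetical:

    # Hedged sketch: read only known-good partition directories so the
    # older, unreadable files are never touched. Names are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    base = "s3://my-bucket/dataset"                  # hypothetical root
    good_partitions = ["2022090800", "2022090900"]   # e.g. listed via boto3
    paths = [f"{base}/edl_timestamp={ts}" for ts in good_partitions]

    df = (spark.read
          .option("multiline", "true")
          .option("basePath", base)  # keeps edl_timestamp as a column
          .json(paths))              # only these directories are scanned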