Re: input file size
Wow! That was exactly what I was looking for. I hadn't even thought of
File.length, and thanks to your tips the solution is now on a silver
platter in front of me. Thank you very much!

On Sun, 19 Jun 2022 at 12:03, marc nicole wrote:
> [...]
Re: input file size
Reasoning in terms of files (rather than datasets, as I first read this
question), I think this is more adequate in Spark:

    org.apache.spark.util.Utils.getFileLength(new File("filePath"), null);

It will yield the same result as:

    new File("filePath").length();

On Sun, 19 Jun 2022 at 11:11, Enrico Minack wrote:
> [...]
Re: input file size
Maybe a

    .as[String].mapPartitions(it =>
      if (it.hasNext) Iterator(it.next) else Iterator.empty)

might be faster than the

    .distinct.as[String]


Enrico

On 19.06.22 at 08:59, Enrico Minack wrote:
> [...]
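Outside Spark, the effect of this hint can be sketched in plain Java: when every row of a partition carries the same file name, keeping only each partition's first element shrinks the data long before any deduplication. The partition layout below is a made-up assumption purely for illustration.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class PartitionFirstSketch {
    // Keep only the first element of each partition, then deduplicate the
    // much smaller remainder -- the idea behind the mapPartitions hint.
    static Set<String> distinctFiles(List<Iterator<String>> partitions) {
        Set<String> files = new LinkedHashSet<>();
        for (Iterator<String> it : partitions) {
            if (it.hasNext()) {
                files.add(it.next()); // one row per partition instead of thousands
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // Simulated partitions: one file may span two partitions, so the
        // first-element trick still yields duplicates -- the set absorbs them.
        List<Iterator<String>> partitions = List.of(
            Collections.nCopies(1000, "file:/data/part-0001").iterator(),
            Collections.nCopies(1000, "file:/data/part-0001").iterator(),
            Collections.nCopies(1000, "file:/data/part-0002").iterator(),
            Collections.<String>emptyIterator());
        System.out.println(distinctFiles(partitions));
    }
}
```

Only one candidate per partition survives, so the expensive shuffle-based distinct runs over a handful of rows instead of the whole dataset.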
Re: input file size
Hi,

Just so that we understand the intention: why do you need to know the file
size? Are you not using a splittable file format? If you use Spark
streaming to read the files, processing them just once, then you will be
able to get the metadata of the files, I believe.

Regards,
Gourav Sengupta

On Sun, Jun 19, 2022 at 8:00 AM Enrico Minack wrote:
> [...]
Re: input file size
Given you already know your input files (input_file_name), why not get
their sizes and sum them up?

    import java.io.File
    import java.net.URI
    import org.apache.spark.sql.functions.input_file_name

    ds.select(input_file_name.as("filename"))
      .distinct.as[String]
      .map(filename => new File(new URI(filename).getPath).length)
      .select(sum($"value"))
      .show()


Enrico

On 19.06.22 at 03:16, Yong Walt wrote:
> [...]
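For reference, the per-file step of this pipeline can be exercised outside Spark: given the distinct file: URIs, the computation is just summing new File(new URI(...).getPath()).length() over them. The sketch below creates two throwaway temp files purely for illustration; note this approach only works while the URIs resolve to paths the JVM can reach as local files (not e.g. hdfs: or s3a:).

```java
import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.util.List;

public class SumFileSizes {
    // What the .map(...).select(sum(...)) steps compute, in plain Java:
    // resolve each file: URI to a local path and add up the byte lengths.
    static long totalSize(List<String> fileUris) {
        long total = 0L;
        for (String uri : fileUris) {
            total += new File(URI.create(uri).getPath()).length();
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Two temp files with known sizes, standing in for input files.
        File a = File.createTempFile("part-", ".txt");
        File b = File.createTempFile("part-", ".txt");
        a.deleteOnExit();
        b.deleteOnExit();
        Files.write(a.toPath(), new byte[100]);
        Files.write(b.toPath(), new byte[28]);

        long total = totalSize(List.of(a.toURI().toString(), b.toURI().toString()));
        System.out.println(total); // 128 for these two files
    }
}
```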
Re: input file size
    import java.io.File

    val someFile = new File("somefile.txt")
    val fileSize = someFile.length

This one?

On Sun, Jun 19, 2022 at 4:33 AM mbreuer wrote:
> [...]
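One caveat worth knowing about this approach (not raised in the thread): java.io.File.length() is specified to return 0L when the file does not exist, rather than throwing, so a mistyped path silently contributes 0 to any sum. A small sketch:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class FileLengthCaveat {
    public static void main(String[] args) throws IOException {
        // length() returns 0L for a nonexistent file -- easy to misread
        // as "empty file" when a path is simply wrong.
        File missing = new File("no-such-file-8f3a.txt");
        System.out.println(missing.exists());  // false
        System.out.println(missing.length());  // 0

        // An existing file reports its real size in bytes, like "ls -l".
        File real = File.createTempFile("demo-", ".txt");
        real.deleteOnExit();
        Files.write(real.toPath(), "hello".getBytes());
        System.out.println(real.length());     // 5
    }
}
```

Checking exists() first (or using java.nio.file.Files.size, which throws for missing files) avoids the silent zero.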
Re: input file size
Hi,

I found this (
https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
), which may be helpful; I use Java:

    org.apache.spark.util.SizeEstimator.estimate(dataset);

On Sat, 18 Jun 2022 at 22:33, mbreuer wrote:

> Hello Community,
>
> I am working on optimizations for file sizes and number of files. In the
> data frame there is a function input_file_name which returns the file
> name. I miss a counterpart to get the size of the file. Just the size,
> like "ls -l" returns. Is there something like that?
>
> Kind regards,
> Markus
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org