Re: partitionBy creating lot of small files

2022-06-04 Thread Enrico Minack
You refer to df.write.partitionBy, which creates a directory for each value of "col" and, in the worst case, writes one file per DataFrame partition into each of them. So the number of output files is controlled by the cardinality of "col", which comes from your data and is therefore out of your control, and by the number of partitions of
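
A minimal sketch (paths and the threshold are placeholders, not from the thread) of the two usual knobs for this: repartition by the same column so each value collapses into one file per directory, or cap the rows written per file with the maxRecordsPerFile write option (Spark 2.2+) so oversized partitions are split without a full repartition:

    import org.apache.spark.sql.functions.col

    // 1) One file per value of "col": shuffle so each value lands in a single
    //    DataFrame partition before partitionBy.
    df.repartition(col("col"))
      .write
      .partitionBy("col")
      .parquet("/tmp/out_one_file_per_value")

    // 2) Keep the existing partitioning but limit rows per output file.
    df.write
      .option("maxRecordsPerFile", 1000000)
      .partitionBy("col")
      .parquet("/tmp/out_capped")

Option 1 trades small files for a shuffle and can skew if one value dominates, which is exactly the trade-off discussed in the thread.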

partitionBy creating lot of small files

2022-06-04 Thread Nikhil Goyal
Hi all, Is there a way to use dataframe.partitionBy("col") and control the number of output files without doing a full repartition? The thing is some partitions have more data while some have less. Doing a .repartition is a costly operation. We want to control the size of the output files. Is it

Re: Dealing with large number of small files

2022-04-27 Thread Sid
Yes, it created a list of records separated by commas, and it was created faster as well. On Wed, 27 Apr 2022, 13:42 Gourav Sengupta, wrote: > Hi, > did that result in valid JSON in the output file? > > Regards, > Gourav Sengupta > > On Tue, Apr 26, 2022 at 8:18 PM Sid wrote: > >> I have .txt

Re: Dealing with large number of small files

2022-04-27 Thread Gourav Sengupta
Hi, did that result in valid JSON in the output file? Regards, Gourav Sengupta On Tue, Apr 26, 2022 at 8:18 PM Sid wrote: > I have .txt files with JSON inside it. It is generated by some API calls > by the Client. > > On Wed, Apr 27, 2022 at 12:39 AM Bjørn Jørgensen > wrote: > >> What is that

Re: Dealing with large number of small files

2022-04-26 Thread Sid
I have .txt files with JSON inside them. They are generated by some API calls by the client. On Wed, Apr 27, 2022 at 12:39 AM Bjørn Jørgensen wrote: > What is it that you have? Is it txt files or json files? > Or do you have txt files with JSON inside? > > > > On Tue, 26 Apr 2022 at 20:41, Sid wrote: >

Re: Dealing with large number of small files

2022-04-26 Thread Bjørn Jørgensen
What is it that you have? Is it txt files or json files? Or do you have txt files with JSON inside? On Tue, 26 Apr 2022 at 20:41, Sid wrote: > Thanks for your time, everyone :) > > Much appreciated. > > I solved it using the jq utility since I was dealing with JSON. I solved > it using the below

Re: Dealing with large number of small files

2022-04-26 Thread Sid
Thanks for your time, everyone :) Much appreciated. I solved it using the jq utility since I was dealing with JSON, with the below script: find . -name '*.txt' -exec cat '{}' + | jq -s '.' > output.txt Thanks, Sid On Tue, Apr 26, 2022 at 9:37 PM Bjørn Jørgensen wrote: > and the
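
A hedged sketch of reading the merged output back in Spark: jq -s '.' wraps all records in a single JSON array, which the default line-delimited JSON reader cannot parse, so the multiLine option is needed (the path is a placeholder):

    val merged = spark.read
      .option("multiLine", true)   // parse a whole multi-line JSON array per file
      .json("/path/to/output.txt")
    merged.printSchema()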

Re: Dealing with large number of small files

2022-04-26 Thread Bjørn Jørgensen
And the bash script seems to read txt files, not json: for f in Agent/*.txt; do cat ${f} >> merged.json; done; On Tue, 26 Apr 2022 at 18:03, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: > Hi, > > which version of Spark are you using? And where is the data stored? > > I am not quite

Re: Dealing with large number of small files

2022-04-26 Thread Gourav Sengupta
Hi, which version of Spark are you using? And where is the data stored? I am not quite sure that just using a bash script will help, because I am not sure that concatenating all the files into a single file creates valid JSON. Regards, Gourav On Tue, Apr 26, 2022 at 3:44 PM Sid wrote: > Hello, > > Can

Re: Dealing with large number of small files

2022-04-26 Thread Artemis User
Most likely your JSON files are not formatted correctly. Please see the Spark doc on the specific formatting requirements for JSON data: https://spark.apache.org/docs/latest/sql-data-sources-json.html. On 4/26/22 10:43 AM, Sid wrote: Hello, Can somebody help me with the below problem?

Re: Dealing with large number of small files

2022-04-26 Thread Bjørn Jørgensen
Use the *.json glob: df = spark.read.json("/*.json"). On Tue, 26 Apr 2022 at 16:44, Sid wrote: > Hello, > > Can somebody help me with the below problem? > > > https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark > > > Thanks, > Sid > -- Bjørn
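
A Scala sketch along the same lines, assuming the small files sit under one directory (the glob and output path are placeholders); compacting once after the read keeps downstream jobs from touching thousands of files:

    val df = spark.read.json("/data/json_dump/*.json")
    df.coalesce(8)                       // pick a count that yields reasonably sized files
      .write
      .mode("overwrite")
      .parquet("/data/json_compacted")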

Dealing with large number of small files

2022-04-26 Thread Sid
Hello, Can somebody help me with the below problem? https://stackoverflow.com/questions/72015557/dealing-with-large-number-of-small-json-files-using-pyspark Thanks, Sid

RE: [Spark SQL] Does Spark group small files

2018-11-14 Thread Lienhart, Pierre (DI IZ) - AF (ext)
Hello Yann, From my understanding, when reading small files Spark will group them and load the content of each batch into the same partition so you won’t end up with 1 partition per file resulting in a huge number of very small partitions. This behavior is controlled

Re: [Spark SQL] Does Spark group small files

2018-11-13 Thread Silvio Fiorito
Yes, it does bin-packing for small files which is a good thing so you avoid having many small partitions especially if you’re writing this data back out (e.g. it’s compacting as you read). The default partition size is 128MB with a 4MB “cost” for opening files. You can configure this using
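
Presumably the settings referred to here are the standard Spark SQL file-source options (defaults: 128 MB and 4 MB); a sketch of adjusting them per session, values in bytes:

    spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")  // target bytes packed into one input partition
    spark.conf.set("spark.sql.files.openCostInBytes", "4194304")      // estimated cost of opening an extra file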

[Spark SQL] Does Spark group small files

2018-11-13 Thread Yann Moisan
Hello, I'm using Spark 2.3.1. I have a job that reads 5,000 small parquet files in S3. When I do a mapPartitions followed by a collect, only *278* tasks are used (I would have expected 5,000). Does Spark group small files? If yes, what is the threshold for grouping? Is it configurable? Any

Re: Apache Spark orc read performance when reading large number of small files

2018-11-01 Thread gpatcham
When I run spark.read.orc("hdfs://test").filter("conv_date = 20181025").count with "spark.sql.orc.filterPushdown=true" I see below in executors logs. Predicate push down is happening 18/11/01 17:31:17 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (IS_NULL conv_date) leaf-1 = (EQUALS

Re: Apache Spark orc read performance when reading large number of small files

2018-11-01 Thread Jörn Franke
A lot of small files is very inefficient in itself, and predicate pushdown will not help you much there unless you merge them into one large file (one large file can be processed much more efficiently). How did you validate that predicate pushdown did not work in Hive? Your Hive version is also

Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham
spark version 2.2.0 Hive version 1.1.0 There are a lot of small files Spark code : "spark.sql.orc.enabled": "true", "spark.sql.orc.filterPushdown": "true" val logs = spark.read.schema(schema).orc("hdfs://test/date=201810").filter("date
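
A hedged reconstruction of what the truncated snippet appears to be doing, with the two options from the post set at runtime; schema is assumed to be defined earlier, and the filter value is taken from the follow-up message:

    spark.conf.set("spark.sql.orc.enabled", "true")          // copied from the post; not a stock Spark option in all versions
    spark.conf.set("spark.sql.orc.filterPushdown", "true")
    val logs = spark.read
      .schema(schema)
      .orc("hdfs://test/date=201810")
      .filter("conv_date = 20181025")
    println(logs.count())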

Re: Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread Jörn Franke
How large are they? A lot of (small) files will cause significant delay in processing - try to merge as much as possible into one file. Can you please share the full source code in Hive and Spark as well as the versions you are using? > On 31.10.2018 at 18:23, gpatcham wrote: > >

Apache Spark orc read performance when reading large number of small files

2018-10-31 Thread gpatcham
When reading a large number of orc files from HDFS under a directory, Spark doesn't launch any tasks for some amount of time, and I don't see any tasks running during that time. I'm using the below command to read orc and the spark.sql configs. What is Spark doing under the hood when spark.read.orc is

Re: Spark Streaming Small files in Hive

2017-10-29 Thread Siva Gudavalli
For instance, if your Spark application is working with 5 partitions, you can repartition to 1; this will again reduce the number of files by 5x. You can create a staging area to hold small files, and once a decent amount of data is accumulated you can prepare large files and load them into your final hive table
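
A minimal sketch of that advice for the streaming case; the DataFrame, the staging table name and the partition count are placeholders, and the write is assumed to happen inside foreachRDD as in the original post:

    eventsDF
      .coalesce(1)                  // or a small number sized to the batch volume
      .write
      .insertInto("staging_events") // hypothetical staging table, compacted later into the final table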

Spark Streaming Small files in Hive

2017-10-29 Thread KhajaAsmath Mohammed
Hi, I am using spark streaming to write data back into hive with the below code snippet eventHubsWindowedStream.map(x => EventContent(new String(x))) .foreachRDD(rdd => { val sparkSession = SparkSession .builder.enableHiveSupport.getOrCreate import

Re: Small files

2016-09-12 Thread Alonso Isidoro Roman
mega file :) but I did not have to do it in my life yet, so maybe I am wrong. Please take a look at these posts and let us know how you deal with it. https://stuartsierra.com/2008/04/24/a-million-little-files http://blog.cloudera.com/blog/2009/02/the-small-files-problem/ "One

Re: Small files

2016-09-12 Thread ayan guha
Hi Thanks for your mail. I have read a few of those posts. But the solutions I see always assume the data is on HDFS already. My problem is to get the data onto HDFS for the first time. One way I can think of is to load the small files onto each cluster machine into the same folder. For example, load files 1-0.3 mil

Re: Small files

2016-09-12 Thread Alonso Isidoro Roman
That is a good question, Ayan. A few searches on SO return: http://stackoverflow.com/questions/31009834/merge-multiple-small-files-in-to-few-larger-files-in-spark http://stackoverflow.com/questions/29025147/how-can-i-merge-spark-results-files-without-repartition-and-copymerge good luck

Small files

2016-09-12 Thread ayan guha
Hi I have a general question: I have 1.6 mil small files, about 200G all put together. I want to put them on HDFS for Spark processing. I know sequence files are the way to go because putting small files on HDFS is not correct practice. Also, I can write code to consolidate small files into seq
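
One hedged way to do that consolidation: run a Spark job on the machine that holds the files (for example in local mode), pack them as (filename, content) pairs, and write a single sequence-file dataset to HDFS. Paths and the partition count are placeholders:

    // wholeTextFiles yields an RDD[(path, fileContent)]
    val pairs = sc.wholeTextFiles("file:///staging/small_files", minPartitions = 32)
    // an RDD[(String, String)] can be written directly as a SequenceFile
    pairs.saveAsSequenceFile("hdfs://namenode:8020/data/small_files_packed")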

Re: use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-30 Thread Andy Davidson
vidson <a...@santacruzintegration.com>, Pedro Rodriguez <ski.rodrig...@gmail.com> Cc: "user @spark" <user@spark.apache.org> Subject: use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming. > Hi Pedro > > I did some experi

use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-29 Thread Andy Davidson
Rodriguez <ski.rodrig...@gmail.com> Cc: "user @spark" <user@spark.apache.org> Subject: Re: performance problem when reading lots of small files created by spark streaming. > Hi Pedro > > Thanks for the explanation. I started watching your repo. In the short term I > thi

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-28 Thread Gourav Sengupta
There is an option to join small files up. If you are unable to find it just let me know. Regards, Gourav On Thu, Jul 28, 2016 at 4:58 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > Hi Pedro > > Thanks for the explanation. I started watching your repo. In the short

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-28 Thread Andy Davidson
Hi Pedro Thanks for the explanation. I started watching your repo. In the short term I think I am going to try concatenating my small files into 64MB files and using HDFS. My spark streaming app is implemented in Java and uses data frames. It writes to S3. My batch processing is written in Python. It reads

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-27 Thread Pedro Rodriguez
).textFileByPrefix("bucket", "file1", "folder2").regularRDDOperationsHere or import implicits and do sc.s3.textFileByPrefix At present, I am battle testing and benchmarking it at my current job and results are promising with significant improvements to jobs dealing with many fil

performance problem when reading lots of small files created by spark streaming.

2016-07-27 Thread Andy Davidson
the files to a normal file system and then using 'hadoop fs put' to copy the files to HDFS, however this takes several hours and is nowhere near completion. It appears HDFS does not deal with small files well. I am considering copying the files from S3 to a normal file system on one of my workers

SPARK-8813 - combining small files in spark sql

2016-07-07 Thread Ajay Srivastava
Hi, This jira https://issues.apache.org/jira/browse/SPARK-8813 is fixed in Spark 2.0, but the resolution is not mentioned there. In our use case, there are big as well as many small parquet files which are being queried using Spark SQL. Can someone please explain what the fix is and how I can use it

Re: spark parquet too many small files ?

2016-07-02 Thread sri hari kali charan Tummala
ropu > > On Fri, Jul 1, 2016 at 7:39 PM, kali.tumm...@gmail.com < > kali.tumm...@gmail.com> wrote: > >> I found the jira for the issue; will there be a fix in the future, or no fix? >> >> https://issues.apache.org/jira/browse/SPARK-6221

Re: spark parquet too many small files ?

2016-07-02 Thread Takeshi Yamamuro

Re: spark parquet too many small files ?

2016-07-01 Thread kali.tumm...@gmail.com
I found the jira for the issue; will there be a fix in the future, or no fix? https://issues.apache.org/jira/browse/SPARK-6221

Re: spark parquet too many small files ?

2016-07-01 Thread kali.tumm...@gmail.com

Re: spark parquet too many small files ?

2016-07-01 Thread nsalian

spark parquet too many small files ?

2016-07-01 Thread kali.tumm...@gmail.com
Thanks, Sri

Re: Is there a way to merge parquet small files?

2016-05-20 Thread Takeshi Yamamuro
Many small files can cause technical issues in both HDFS and Spark; however, they do not generate many stages and tasks in recent versions of Spark. // maropu On Fri, May 20, 2016 at 2:41 PM, Gavin Yue <yue.yuany...@gmail.com> wrote: > For logs file I would suggest save as gziped

Re: Is there a way to merge parquet small files?

2016-05-19 Thread Gavin Yue
t files instead of > keeping lots of small files in the HDFS. Please refer to [1] for more info. > > We also encountered the same issue with the slow query, and it was indeed > caused by the many small parquet files. In our case, we were processing large > data sets with batch jobs instea

Re: Is there a way to merge parquet small files?

2016-05-19 Thread Deng Ching-Mallete
IMO, it might be better to merge or compact the parquet files instead of keeping lots of small files in the HDFS. Please refer to [1] for more info. We also encountered the same issue with the slow query, and it was indeed caused by the many small parquet files. In our case, we were processing
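
A sketch of the periodic compaction described here, using the current SparkSession API (with 1.x, substitute sqlContext.read); paths and the target partition count are placeholders, ideally chosen so each output file lands near the HDFS block size:

    val small = spark.read.parquet("hdfs:///logs/parquet/dt=2016-05-19")
    small.repartition(16)
      .write
      .mode("overwrite")
      .parquet("hdfs:///logs/parquet_compacted/dt=2016-05-19")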

Re: Is there a way to merge parquet small files?

2016-05-19 Thread Alexander Pivovarov
g message into parquet file > every 10 mins. > Now, when I query the parquet, it usually takes hundreds of thousands of > stages to compute a single count. > I looked into the parquet file’s path and find a great amount of small > files. > > Do the small files caused the problem? Can I merg

Is there a way to merge parquet small files?

2016-05-19 Thread 王晓龙/01111515
I’m using a spark streaming program to store log messages into parquet files every 10 mins. Now, when I query the parquet, it usually takes hundreds of thousands of stages to compute a single count. I looked into the parquet files' path and found a great number of small files. Do the small files

RE: Join over many small files

2015-09-24 Thread Tracewski, Lukasz
will be much welcomed. Thanks! Lucas From: ayan guha [mailto:guha.a...@gmail.com] Sent: 24 September 2015 00:19 To: Tracewski, Lukasz (KFDB 3) Cc: user@spark.apache.org Subject: Re: Join over many small files I think this can be a good case for using sequence file format to pack many files to few

Join over many small files

2015-09-23 Thread Tracewski, Lukasz
Hi all, I would like to ask you for advice on how to efficiently perform a join operation in Spark with tens of thousands of tiny files. A single file has a few KB and ~50 rows. In another scenario they might have 200 KB and 2000 rows. To give you an impression of what they look like: File 01 ID |

Re: Join over many small files

2015-09-23 Thread ayan guha
I think this can be a good case for using the sequence file format to pack many files into a few sequence files, with the file name as key and content as value. Then read it as an RDD and produce tuples like you mentioned (key=fileno+id, value=value). After that, it is a simple map operation to generate the diff
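
A rough sketch of that idea, assuming the tiny files hold "ID | value"-style rows as in the original post; the parsing, header handling and the second RDD are guesses/placeholders:

    // wholeTextFiles gives (fileName, fileContent) pairs
    val perFile = sc.wholeTextFiles("hdfs:///input/tiny_files")
    val rows = perFile.flatMap { case (file, content) =>
      content.split("\n").filter(_.contains("|")).map { line =>   // header filtering omitted
        val cols = line.split("\\|").map(_.trim)
        ((file, cols(0)), cols(1))                                 // key = (file, ID), value = value
      }
    }
    // rows can now be joined against another RDD keyed the same way
    val joined = rows.join(otherKeyedRdd)                          // otherKeyedRdd is hypothetical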

Fwd: [Spark + Hive + EMR + S3] Issue when reading from Hive external table backed on S3 with large amount of small files

2015-08-07 Thread Roberto Coluccio
backed on S3 with large amount of small files To: user@spark.apache.org Hello Spark community, I currently have a Spark 1.3.1 batch driver, deployed in YARN-cluster mode on an EMR cluster (AMI 3.7.0) that reads input data through an HiveContext, in particular SELECTing data from an EXTERNAL TABLE

Spark SQL Hive - merge small files

2015-08-05 Thread Brandon White
Hello, I would love to have Hive merge the small files in my managed Hive context after every query. Right now, I am setting the Hive configuration in my Spark job configuration, but Hive is not managing the files. Do I need to set the Hive fields in another place? How do you set Hive

Re: Spark SQL Hive - merge small files

2015-08-05 Thread Michael Armbrust
This feature isn't currently supported. On Wed, Aug 5, 2015 at 8:43 AM, Brandon White bwwintheho...@gmail.com wrote: Hello, I would love to have hive merge the small files in my managed hive context after every query. Right now, I am setting the hive configuration in my Spark Job
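
Since the automatic merge is not available, a common workaround (a sketch, not the requested Hive feature; table name and file count are placeholders) is to cap the number of output files explicitly before inserting:

    df.coalesce(8).write.insertInto("my_managed_table")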

Re: Spark SQL Hive - merge small files

2015-08-05 Thread Brandon White
would love to have Hive merge the small files in my managed Hive context after every query. Right now, I am setting the Hive configuration in my Spark job configuration, but Hive is not managing the files. Do I need to set the Hive fields in another place? How do you set Hive configurations in Spark

[Spark + Hive + EMR + S3] Issue when reading from Hive external table backed on S3 with large amount of small files

2015-07-25 Thread Roberto Coluccio
Hello Spark community, I currently have a Spark 1.3.1 batch driver, deployed in YARN-cluster mode on an EMR cluster (AMI 3.7.0) that reads input data through an HiveContext, in particular SELECTing data from an EXTERNAL TABLE backed on S3. Such table has dynamic partitions and contains *hundreds

Re: Spark on very small files, appropriate use case?

2015-02-10 Thread Davies Liu
(list_of_filenames) appears to not perform well on small files, why? *sc.wholeTextFiles(path_to_files) performs better than sc.textfile, but does not support bzipped files. However, wholeTextFiles also comes nowhere near the speed of the Python script. *The initialization of a Spark Context takes

Re: Spark on very small files, appropriate use case?

2015-02-10 Thread Kelvin Chu
of the memory cache, which could be much faster. And, in general, small files hurt I/O performance. On Tue, Feb 10, 2015 at 12:52 PM, Davies Liu dav...@databricks.com wrote: Spark is a framework for doing things in parallel very easily; it will definitely help your case. def read_file(path
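
A Scala sketch of the approach Davies outlines (parallelize the list of paths and read each file inside a task), assuming every path is readable from the executors, e.g. a shared mount or HDFS; paths and the slice count are placeholders:

    val paths = Seq("/shared/logs/a.log", "/shared/logs/b.log")
    val lines = sc.parallelize(paths, numSlices = 64).flatMap { p =>
      val src = scala.io.Source.fromFile(p)
      try src.getLines().toList finally src.close()   // materialize before closing the source
    }
    println(lines.count())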

Spark on very small files, appropriate use case?

2015-02-10 Thread soupacabana
My preliminary findings and my questions: *Even only counting the number of log lines with Spark is about 10 times slower than the entire transformation done by the Python script. *sc.textfile(list_of_filenames) appears to not perform well on small files, why? *sc.wholeTextFiles(path_to_files

Re: too many small files and task

2014-12-19 Thread bethesda
started. D

Re: Optimizing text file parsing, many small files versus few big files

2014-11-20 Thread rzykov

Re: Optimizing text file parsing, many small files versus few big files

2014-11-20 Thread tvas
$CombineTextFileRecordReader.init(Loaders.scala:31) [info] ... 27 more I saw that you tested with Spark 1.1.0, but I am forced to use 1.0.2 currently. Perhaps that is the source of the error.

Solution for small files in HDFS

2014-10-01 Thread rzykov
We encountered a problem loading a huge number of small files (hundreds of thousands of files) from HDFS in Spark. Our jobs were failing over time. This forced us to write our own loader that combines files by means of Hadoop's CombineFileInputFormat. It significantly reduced the number of mappers from 10
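
For plain-text input, a hedged sketch of the same idea using Hadoop's stock CombineTextInputFormat instead of a hand-written loader; the path and the split size are placeholders:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

    // cap how many bytes get packed into a single combined split
    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize",
      (128 * 1024 * 1024).toString)
    val lines = sc.newAPIHadoopFile(
        "hdfs:///data/many_small_text_files",
        classOf[CombineTextInputFormat],
        classOf[LongWritable],
        classOf[Text])
      .map(_._2.toString)   // copy out of the reused Writable immediately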

Spark processing small files.

2014-09-16 Thread cem
Hi all, Spark is taking too much time to start the first stage with many small files in HDFS. I am reading a folder that contains RC files: sc.hadoopFile("hdfs://hostname:8020/test_data2gb/", classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]], classOf[LongWritable], classOf