Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK, good news. You have made some progress here :) bzip (bzip2) works (it is splittable) because it is block-oriented, whereas gzip is stream-oriented. I also noticed that you are creating a managed ORC file. You can bucket and partition an ORC (Optimized Row Columnar) file. An example below:
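The example itself is cut off in the archive snippet; a minimal sketch of a bucketed and partitioned ORC table created through Spark SQL might look like the following (all table and column names below are hypothetical, not from the thread; test.stg_t2 is the staging table mentioned later):

    // Sketch only: a managed ORC table, partitioned and bucketed.
    spark.sql("""
      CREATE TABLE test.t2_orc (
        id         BIGINT,
        amount     DOUBLE,
        event_date DATE
      )
      USING ORC
      PARTITIONED BY (event_date)
      CLUSTERED BY (id) INTO 64 BUCKETS
    """)

    // Populate from the staging table (column names assumed).
    spark.sql("INSERT INTO test.t2_orc SELECT id, amount, event_date FROM test.stg_t2")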

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hi Mich, Thanks for the reply. I started running ANALYZE TABLE on the external table, but the progress was very slow. The stage had only read about 275MB in 10 minutes. That equates to about 5.5 hours just to analyze the table. This might just be the reality of trying to process a 240m record
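(That extrapolation is consistent with the 9.2 GB gzipped size quoted below: 9200 MB / 275 MB per 10 minutes ≈ 33 ten-minute intervals, i.e. roughly 5.5 hours for one full scan of the file.)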

Re: Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Mich Talebzadeh
OK, for now: have you analyzed statistics on the Hive external table? spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL COLUMNS; spark-sql (default)> DESC EXTENDED test.stg_t2; Hive external tables have little optimization. HTH, Mich Talebzadeh, Solutions Architect/Engineering

Spark-Sql - Slow Performance With CTAS and Large Gzipped File

2023-06-26 Thread Patrick Tucci
Hello, I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and 64GB of RAM. I'm trying to process a large pipe delimited file that has been compressed with gzip (9.2GB zipped, ~58GB unzipped, ~241m
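Because gzip is not splittable, the initial scan of a file like this runs as a single task no matter how many cores the worker has. A sketch of the common workaround, with hypothetical paths and table names (the "|" separator matches the pipe-delimited file described above):

    // Read the pipe-delimited gzipped file; this stage is one task
    // because gzip cannot be split.
    val df = spark.read
      .option("sep", "|")
      .option("inferSchema", "false")
      .csv("hdfs:///staging/large_file.dat.gz")

    // Redistribute after the single-threaded decompression so the ORC
    // write (and all later reads) can use every core on the worker.
    df.repartition(64)
      .write
      .format("orc")
      .saveAsTable("test.t2")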

spark perform slow in eclipse rich client platform

2021-07-09 Thread jianxu
Hi there, I wonder if anyone might have experience with running a Spark app from the Eclipse Rich Client Platform in Java. The same Spark app code runs much slower from the Eclipse Rich Client Platform than from normal Java in Eclipse without the Rich Client Platform. Appreciate any

Re: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-10 Thread 钟雨
ng csv files to parquet, but from my hands-on so far, it seems that parquet's read time is slower than csv? This seems contradictory to popular opinion that parquet performs better in terms of both computation and storage?

Re: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-09 Thread Pankaj Bhootra
--- Forwarded message --- From: Takeshi Yamamuro (Jira) Date: Sat, 6 Mar 2021, 20:02 Subject: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files? To: [ https://issues.apache.org/jira/br

Fwd: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-06 Thread Pankaj Bhootra
- From: Takeshi Yamamuro (Jira) Date: Sat, 6 Mar 2021, 20:02 Subject: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files? To: [ https://issues.apache.org/jira/browse/SPARK-34648?page

RE: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-26 Thread van den Heever, Christian CC
Hi, How do I get the filename from textFileStream using streaming? Thanks a mill.

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread Juho Autio
> Not sure if the dynamic overwrite logic is implemented in Spark or in Hive. AFAIK I'm using the Spark implementation(s). Does the thread dump that I posted show that? I'd like to remain within the Spark impl. What I'm trying to ask is: do you Spark developers see some ways to optimize this? Otherwise,

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread vincent gromakowski
There is probably a limit on the number of elements you can pass in the list of partitions for the listPartitionsWithAuthInfo API call. Not sure if the dynamic overwrite logic is implemented in Spark or in Hive, in which case using Hive 1.2.1 is probably the reason for the un-optimized logic, but also

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread Juho Autio
Ok, I've verified that hive> SHOW PARTITIONS is using get_partition_names, which is always quite fast. Spark's insertInto uses get_partitions_with_auth, which is much slower (it also gets the location etc. of each partition). I created a test in Java with a local metastore client to measure the
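The measurement itself is cut off, but the two metastore calls being compared can be exercised directly against a metastore. A rough sketch of such a test (database/table names hypothetical; exact client signatures vary across Hive versions):

    import java.util.Collections
    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

    val client = new HiveMetaStoreClient(new HiveConf())

    def time[T](label: String)(body: => T): T = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms")
      result
    }

    // Fast call: partition names only (what SHOW PARTITIONS uses).
    time("get_partition_names") {
      client.listPartitionNames("mydb", "mytable", -1.toShort)
    }

    // Slow call: full Partition objects, including location and auth
    // info for each partition (what insertInto goes through).
    time("get_partitions_with_auth") {
      client.listPartitionsWithAuthInfo("mydb", "mytable", -1.toShort,
        "hive", Collections.emptyList[String]())
    }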

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread Khare, Ankit
Why do you need 1 partition when 10 partitions are doing the job?? Thanks, Ankit From: vincent gromakowski Date: Thursday, 25 April 2019 at 09:12 To: Juho Autio Cc: user Subject: Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions Which metastore are you

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread vincent gromakowski
Which metastore are you using? On Thu, Apr 25, 2019 at 09:02, Juho Autio wrote: Would anyone be able to answer this question about the non-optimal implementation of insertInto? On Thu, Apr 18, 2019 at 4:45 PM Juho Autio wrote: Hi, My job is writing ~10 partitions with

Re: [Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-25 Thread Juho Autio
Would anyone be able to answer this question about the non-optimal implementation of insertInto? On Thu, Apr 18, 2019 at 4:45 PM Juho Autio wrote: Hi, My job is writing ~10 partitions with insertInto. With the same input / output data the total duration of the job is very different

[Spark SQL]: Slow insertInto overwrite if target table has many partitions

2019-04-18 Thread Juho Autio
Hi, My job is writing ~10 partitions with insertInto. With the same input / output data, the total duration of the job is very different depending on how many partitions the target table has. Target table with 10 of partitions: 1 min 30 s. Target table with ~1 partitions: 13 min 0 s. It seems
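For reference, the write pattern under discussion looks roughly like this (names hypothetical); on Spark 2.3+ the partitionOverwriteMode setting controls whether all partitions or only the incoming ones are replaced:

    // Only overwrite partitions present in the incoming data (Spark 2.3+).
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    // Even in dynamic mode, Spark still asks the metastore about the
    // table's existing partitions, which is where the extra minutes
    // appear to go when the target table has many of them.
    df.write
      .mode("overwrite")
      .insertInto("mydb.target_table")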

spark streaming slow checkpointing when calling Rserve

2016-09-19 Thread Piubelli, Manuel
checkpointing time, or why calling checkpoint(Durations.minutes(1440)) on the JavaMapWithStateDStream would cause Spark to not pass most of the tuples in the JavaPairDStream&lt;String, Iterable&gt; to the mapWithState callback function? The question is also posted on http://stackoverflow.com/questions/395358
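A sketch of the state-stream setup being described, with a hypothetical mapping function standing in for the Rserve call (pairStream and the state type are assumptions, not from the thread):

    import org.apache.spark.streaming.{Minutes, State, StateSpec}

    // Hypothetical mapping function in place of the Rserve call.
    val mappingFunc = (key: String, values: Option[Iterable[String]], state: State[Long]) => {
      val seen = state.getOption.getOrElse(0L) + values.map(_.size).getOrElse(0)
      state.update(seen)
      (key, seen)
    }

    val stateStream = pairStream.mapWithState(StateSpec.function(mappingFunc))

    // The call from the thread: checkpoint the state stream every 1440
    // minutes instead of the default (typically 10x the batch interval).
    stateStream.checkpoint(Minutes(1440))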

mongo-hadoop with Spark is slow for me, and adding nodes doesn't seem to make any noticeable difference

2015-09-21 Thread cscarioni
r I don't think the splits are big enough to actually fill the 6GB of memory of each node, as when they are stored on HDFS they take up a lot less than that. Is there anything obvious (or not :)) that I am not doing correctly? Is this the correct way to transform a collection from Mongo to Mongo? Is there ano

Spark dramatically slow when I add saveAsTextFile

2015-05-24 Thread allanjie
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-dramatically-slow-when-I-add-saveAsTextFile-tp23003.html

Re: Spark dramatically slow when I add saveAsTextFile

2015-05-24 Thread Joe Wass
PS: if I reduce the size of the input to just 10 records, it performs very fast. But it doesn't make any sense for just 10 records.
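One detail worth keeping in mind with this thread (not stated explicitly in the snippets): Spark transformations are lazy, so "adding saveAsTextFile" really adds the entire upstream computation, which until then never ran. A minimal illustration (input and expensiveTransform are hypothetical):

    // Transformations are lazy: this line returns immediately, no work done.
    val result = input.map(expensiveTransform)

    // The action triggers the whole pipeline; the "slow" part is the
    // computation itself, not the file writing per se.
    result.saveAsTextFile("hdfs:///tmp/out")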

Re: Join on Spark too slow.

2015-04-09 Thread Guillaume Pitel
Maybe I'm wrong, but what you are doing here is basically a bunch of cartesian products, one for each key. So if "hello" appears 100 times in your corpus, it will produce 100*100 elements in the join output. I don't understand what you're doing here, but it's normal that your join takes forever; it makes
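A tiny sketch of the blow-up being described, with a single repeated key:

    // 100 copies of the key "hello" on each side of the join.
    val left  = sc.parallelize(Seq.fill(100)(("hello", 1)))
    val right = sc.parallelize(Seq.fill(100)(("hello", 2)))

    // Every left row matches every right row for the same key, so the
    // join emits 100 * 100 = 10,000 elements for this one key.
    println(left.join(right).count())  // 10000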

Join on Spark too slow.

2015-04-09 Thread Kostas Kloudas
Hello guys, I am trying to run the following dummy example for Spark, on a dataset of 250MB, using 5 machines with 10GB RAM each, but the join seems to be taking too long (> 2hrs). I am using Spark 0.8.0 but I have also tried the same example on more recent versions, with the same results. Do

Re: Join on Spark too slow.

2015-04-09 Thread ๏̯͡๏
If your data has special characteristics, like one side small and the other large, then you can think of doing a map-side join in Spark using broadcast variables; this will speed things up. Otherwise, as Pitel mentioned, if there is nothing special and it's just a cartesian product, it might take forever, or you might
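A sketch of the map-side (broadcast) join being suggested, assuming the small side fits in memory (all names hypothetical):

    // Collect the small RDD to the driver and broadcast it once.
    val smallMap = smallRdd.collect().toMap   // small side must fit in memory
    val smallBc  = sc.broadcast(smallMap)

    // Join without a shuffle: each partition of the large RDD does
    // local hash lookups against the broadcast map.
    val joined = largeRdd.flatMap { case (k, v) =>
      smallBc.value.get(k).map(sv => (k, (v, sv)))
    }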

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-22 Thread TJ Klein
Seems like it is a bug rather than a feature. I filed a bug report: https://issues.apache.org/jira/browse/SPARK-5363

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Davies Liu

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Tassilo Klein
for the next version? Best, Tassilo

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Davies Liu
it called many times successfully before in a loop). Any clue? Or do I have to wait for the next version? Best, Tassilo

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread Tassilo Klein
successfully before in a loop). Any clue? Or do I have to wait for the next version? Best, Tassilo

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-21 Thread critikaled
I'm also facing the same issue. Is this a bug?

Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread TJ Klein
? Best, Tassilo

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread Davies Liu
() statement (after having it called many times successfully before in a loop). Any clue? Or do I have to wait for the next version? Best, Tassilo

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread Tassilo Klein
() and collect() statement (after having it called many times successfully before in a loop). Any clue? Or do I have to wait for the next version? Best, Tassilo

Re: Spark 1.1 (slow, working), Spark 1.2 (fast, freezing)

2015-01-20 Thread Davies Liu
? Or do I have to wait for the next version? Best, Tassilo

Re: Spark run slow after unexpected repartition

2014-09-30 Thread matthes
use the caching option! By the way, I have the same behavior with different jobs!

Re: Spark run slow after unexpected repartition

2014-09-18 Thread Tan Tim
I also encountered a similar problem: after some stages, all the tasks are assigned to one machine, and the stage execution gets slower and slower. *[the spark conf setting]* val conf = new SparkConf().setMaster(sparkMaster).setAppName(ModelTraining

Re: Spark running slow for small hadoop files of 10 mb size

2014-04-24 Thread neeravsalaria
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-running-slow-for-small-hadoop-files-of-10-mb-size-tp4526p4811.html

Re: Spark running slow for small hadoop files of 10 mb size

2014-04-22 Thread Andre Bois-Crettez

Re: Spark is slow

2014-04-22 Thread Nicholas Chammas
partition, they will reside in the same node. So isn't it supposed to be fast when we partition by keys? Thank you.
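The intuition in the snippet holds when both sides share a partitioner; a sketch of how co-partitioning avoids the join shuffle (RDD names and partition count hypothetical):

    import org.apache.spark.HashPartitioner

    val p = new HashPartitioner(32)

    // Pre-partition both RDDs with the SAME partitioner and persist
    // them, so matching keys already live in the same partition.
    val a = rddA.partitionBy(p).persist()
    val b = rddB.partitionBy(p).persist()

    // Spark sees the shared partitioner and joins without re-shuffling.
    val joined = a.join(b)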

Re: Spark is slow

2014-04-21 Thread Marcelo Vanzin
Hi Joe, On Mon, Apr 21, 2014 at 11:23 AM, Joe L selme...@yahoo.com wrote: And, I haven't gotten any answers to my questions. One thing that might explain that is that, at least for me, all (and I mean *all*) of your messages are ending up in my GMail spam folder, complaining that GMail can't

Re: Spark is slow

2014-04-21 Thread John Meagher
Yahoo made some changes that drive mailing list posts into spam folders: http://www.virusbtn.com/blog/2014/04_15.xml On Mon, Apr 21, 2014 at 2:50 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Joe, On Mon, Apr 21, 2014 at 11:23 AM, Joe L selme...@yahoo.com wrote: And, I haven't gotten any

Re: Spark is slow

2014-04-21 Thread Nicholas Chammas
I'm seeing the same thing as Marcelo, Joe. All your mail is going to my Spam folder. :( With regards to your questions, I would suggest in general adding some more technical detail to them. It will be difficult for people to give you suggestions if all they are told is "Spark is slow". How does

Re: Spark is slow

2014-04-21 Thread Sam Bessalah
And, I haven't gotten any answers to my questions. I don't understand the purpose of this group and there is not enough documentation of Spark and its usage.

Re: Spark is slow

2014-04-21 Thread Joe L
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-is-slow-tp4539p4577.html