OK, good news. You have made some progress here :)
bzip2 works (it is splittable) because it is block-oriented, whereas gzip is
stream-oriented. I also noticed that you are creating a managed ORC table. You
can bucket and partition an ORC (Optimized Row Columnar) table. An example below:
Hi Mich,
Thanks for the reply. I started running ANALYZE TABLE on the external
table, but the progress was very slow. The stage had only read about 275MB
in 10 minutes. That equates to about 5.5 hours just to analyze the table.
This might just be the reality of trying to process a 240m record
OK, for now: have you analyzed statistics on the Hive external table?
spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL
COLUMNS;
spark-sql (default)> DESC EXTENDED test.stg_t2;
Hive external tables have little optimization
HTH
Mich Talebzadeh,
Solutions Architect/Engineering
Hello,
I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node
has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and
64GB of RAM.
I'm trying to process a large pipe-delimited file that has been compressed
with gzip (9.2GB zipped, ~58GB unzipped, ~241m
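For reference, the read looks roughly like this (a sketch; the path and options are placeholders rather than the exact job):

// Sketch only: the path is a placeholder. A .gz file is not splittable, so the
// initial read runs as a single task; repartitioning right after the read
// spreads the remaining work across the worker's cores.
val raw = spark.read
  .option("sep", "|")
  .option("header", "false")
  .csv("hdfs:///data/big_file.csv.gz")
  .repartition(64)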
Hi There;
Wonder if anyone has experience with running a Spark app from Eclipse Rich
Client Platform in Java. The same Spark app code runs much slower from Eclipse
Rich Client Platform than from normal Java in Eclipse without Rich Client
Platform.
Appreciate any
>> ng csv files to parquet, but from my hands-on so far, it seems that
>> parquet's read time is slower than csv? This seems contradictory to
>> popular opinion that parquet performs better in terms of both computation
>> and storage?
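One thing worth checking (a sketch only; paths and the column name are made up): Parquet's advantage usually shows up when only a subset of columns is read, since it is a columnar format.

// Sketch only: paths and the column name are invented.
val csvDf     = spark.read.option("header", "true").csv("/data/events_csv")
val parquetDf = spark.read.parquet("/data/events_parquet")

csvDf.select("user_id").count()      // still parses every field of every row
parquetDf.select("user_id").count()  // reads only the user_id column chunks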
--- Forwarded message ---
> From: Takeshi Yamamuro (Jira)
> Date: Sat, 6 Mar 2021, 20:02
> Subject: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark
> Extremely Slow for Large Number of Files?
> To:
>
> [
> https://issues.apache.org/jira/browse/SPARK-34648?page
Hi,
How do I get the filename from textFileStream when using Spark Streaming?
Thanks a mill
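One possible approach, assuming a move to Structured Streaming is acceptable (a sketch only; the directory is a placeholder): the built-in input_file_name() function tags each row with the file it came from.

import org.apache.spark.sql.functions.input_file_name

// Sketch only: the monitored directory is a placeholder.
val lines = spark.readStream
  .textFile("hdfs:///incoming/")
  .withColumn("source_file", input_file_name())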
> Not sure if the dynamic overwrite logic is implemented in Spark or in Hive
AFAIK I'm using the Spark implementation(s). Does the thread dump that I posted
show that? I'd like to remain within the Spark implementation.
What I'm trying to ask is: do you Spark developers see some ways to
optimize this?
Otherwise,
There is probably a limit on the number of elements you can pass in the
list of partitions for the listPartitionsWithAuthInfo API call. Not sure if
the dynamic overwrite logic is implemented in Spark or in Hive, in which
case using Hive 1.2.1 is probably the reason for the un-optimized logic, but
also
OK, I've verified that hive> SHOW PARTITIONS uses get_partition_names,
which is always quite fast. Spark's insertInto uses get_partitions_with_auth,
which is much slower (it also gets the location etc. of each partition).
I created a test in Java with a local metastore client to measure the
Why do you need 1 partition when 10 partitions are doing the job?
Thanks
Ankit
From: vincent gromakowski
Date: Thursday, 25. April 2019 at 09:12
To: Juho Autio
Cc: user
Subject: Re: [Spark SQL]: Slow insertInto overwrite if target table has many
partitions
Which metastore are you using?
On Thu, Apr 25, 2019 at 09:02, Juho Autio wrote:
Would anyone be able to answer this question about the non-optimal
implementation of insertInto?
On Thu, Apr 18, 2019 at 4:45 PM Juho Autio wrote:
> Hi,
>
> My job is writing ~10 partitions with insertInto. With the same input /
> output data the total duration of the job is very different
Hi,
My job is writing ~10 partitions with insertInto. With the same input /
output data the total duration of the job is very different depending on
how many partitions the target table has.
Target table with 10 partitions:
1 min 30 s
Target table with ~1 partitions:
13 min 0 s
It seems
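For reference, a sketch of the write pattern being discussed (DataFrame and table names are placeholders; spark.sql.sources.partitionOverwriteMode exists since Spark 2.3):

// Sketch only: names are invented. With "dynamic" mode, an overwrite insert
// replaces only the partitions present in df, not the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val df = spark.table("mydb.staging_table")   // placeholder source

df.write
  .mode("overwrite")
  .insertInto("mydb.target_table")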
checkpointing time, or why calling checkpoint(Durations.minutes(1440)) on the
JavaMapWithStateDStream would cause Spark to not pass most of the tuples in the
JavaPairDStream<String, Iterable> to the mapWithState callback function?
The question is also posted on
http://stackoverflow.com/questions/395358
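For context, a minimal Scala sketch of that kind of setup (the source, state logic, and intervals are placeholders, not the original Java code):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec, StreamingContext}

// Sketch only: a word-count-style state update stands in for the real logic.
val conf = new SparkConf().setMaster("local[2]").setAppName("mapWithStateSketch")
val ssc = new StreamingContext(conf, Seconds(30))
ssc.checkpoint("hdfs:///tmp/checkpoints")

val pairs = ssc.socketTextStream("localhost", 9999).map(word => (word, 1))

def mappingFunc(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (key, sum)
}

val stateStream = pairs.mapWithState(StateSpec.function(mappingFunc _))
stateStream.checkpoint(Minutes(1440))   // the interval the question is about
stateStream.print()

ssc.start()
ssc.awaitTermination()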
I don't think the splits are big enough to actually fill the 6GB of memory
of each node, as what it stores on HDFS is a lot less than that.
Is there anything obvious (or not :)) that I am not doing correctly? Is
this the correct way to transform a collection from Mongo to Mongo? Is
there ano
PS: if I reduce the size of the input to just 10 records, it performs very
fast. But it doesn't make any sense for just 10 records.
Maybe I'm wrong, but what you are doing here is basically a bunch of
cartesian products, one per key. So if "hello" appears 100 times in your
corpus, it will produce 100*100 elements in the join output.
I don't understand what you're doing here, but it's normal that your join
takes forever; it makes
Hello guys,
I am trying to run the following dummy example for Spark,
on a dataset of 250MB, using 5 machines with 10GB RAM
each, but the join seems to be taking too long (2 hrs).
I am using Spark 0.8.0 but I have also tried the same example
on more recent versions, with the same results.
Do
If your data has special characteristics, like one side small and the other
large, then you can think of doing a map-side join in Spark using broadcast
values; this will speed things up.
Otherwise, as Pitel mentioned, if there is nothing special and it's just a
cartesian product, it might take forever, or you might
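A rough sketch of that broadcast idea (the RDDs here are toy placeholders):

// Sketch only: `small` must comfortably fit in memory; the data is made up.
val small = sc.parallelize(Seq(("hello", 1), ("world", 2)))
val large = sc.parallelize(Seq(("hello", "a"), ("hello", "b"), ("world", "c")))

val smallMap = sc.broadcast(small.collectAsMap())

// Map-side join: no shuffle of `large`; each task looks keys up in the broadcast map.
val joined = large.flatMap { case (k, v) =>
  smallMap.value.get(k).map(sv => (k, (v, sv)))
}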
Seems like it is a bug rather than a feature.
I filed a bug report: https://issues.apache.org/jira/browse/SPARK-5363
I'm also facing the same issue.
is this a bug?
() and collect() statement (after having it called many times successfully
before in a loop).
Any clue? Or do I have to wait for the next version?
Best,
Tassilo
use the
caching option!
By the way, I have the same behavior with different jobs!
I also encountered a similar problem: after some stages, all the tasks
are assigned to one machine, and the stage execution gets slower and slower.
*[the spark conf setting]*
val conf = new SparkConf().setMaster(sparkMaster).setAppName(ModelTraining
partition, they will reside in the same node. So isn't it supposed to be
fast when we partition by keys? Thank you.
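For what it's worth, a small sketch of the co-partitioning idea (toy data): when both pair RDDs use the same partitioner, the join can reuse that layout instead of shuffling again.

import org.apache.spark.HashPartitioner

// Sketch only: toy data. Both RDDs share the same HashPartitioner, so matching
// keys sit in the same partition, and the join reuses that layout without
// re-shuffling either side.
val p = new HashPartitioner(10)
val a = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(p).cache()
val b = sc.parallelize(Seq((1, 10), (2, 20))).partitionBy(p).cache()
val joined = a.join(b)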
Hi Joe,
On Mon, Apr 21, 2014 at 11:23 AM, Joe L selme...@yahoo.com wrote:
And, I haven't gotten any answers to my questions.
One thing that might explain that is that, at least for me, all (and I
mean *all*) of your messages are ending up in my GMail spam folder,
complaining that GMail can't
Yahoo made some changes that drive mailing list posts into spam
folders: http://www.virusbtn.com/blog/2014/04_15.xml
On Mon, Apr 21, 2014 at 2:50 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Joe,
On Mon, Apr 21, 2014 at 11:23 AM, Joe L selme...@yahoo.com wrote:
And, I haven't gotten any
I'm seeing the same thing as Marcelo, Joe. All your mail is going to my
Spam folder. :(
With regards to your questions, I would suggest in general adding some more
technical detail to them. It will be difficult for people to give you
suggestions if all they are told is "Spark is slow". How does
And, I haven't gotten any answers to my questions. I don't understand the
purpose of this group, and there is not enough documentation of Spark and
its usage.