Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Sorry, I thought I gave an explanation. The issue you are encountering with incorrect record numbers in the "Shuffle Write Size/Records" column in the Spark DAG UI when data is read from cache/persist is a known limitation. This discrepancy arises from the way Spark handles and reports shuffle

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
of the UI's display, not necessarily a bug in the Spark framework itself. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: The information provided

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Yep, the Spark UI's "Shuffle Write Size/Records" column can sometimes show incorrect record counts *when data is retrieved from cache or persisted data*. This happens because the record count reflects the number of records written to disk for shuffling, not the actual number of records in the
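
For readers who want to see the behaviour described above, here is a minimal PySpark sketch (the app name and synthetic data are illustrative assumptions, not from the thread): run the same aggregation once cold and once from cache, then compare the "Shuffle Write Size / Records" column for the groupBy stage in the Stages tab.

    # Hedged sketch: synthetic data, purely to make the UI column observable.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shuffle-write-ui-check").getOrCreate()
    df = spark.range(1_000_000).withColumnRenamed("id", "key")

    # Cold run: the stage's shuffle-write records reflect what was written for the shuffle.
    df.groupBy((df.key % 100).alias("bucket")).count().collect()

    # Cached run: the same aggregation, now reading from cache/persist;
    # compare the same column for this stage in the Spark UI.
    df.cache()
    df.count()  # materialize the cache
    df.groupBy((df.key % 100).alias("bucket")).count().collect()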

Re: BUG :: UI Spark

2024-05-26 Thread Sathi Chowdhury
Can you please explain how you realized it’s wrong? Did you check CloudWatch for the same metrics and compare? Also, are you using df.cache() and expecting the shuffle read/write to go away? Sent from Yahoo Mail for iPhone. On Sunday, May 26, 2024, 7:53 AM, Prem Sahoo wrote: Can anyone

Re: BUG :: UI Spark

2024-05-26 Thread Prem Sahoo
Can anyone please assist me? On Fri, May 24, 2024 at 12:29 AM Prem Sahoo wrote: > Does anyone have a clue? > > On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: > >> Hello Team, >> in the Spark DAG UI, we have a Stages tab. Once you click on each stage you >> can view the tasks. >> >> In each

Re: BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Does anyone have a clue? On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: > Hello Team, > in the Spark DAG UI, we have a Stages tab. Once you click on each stage you can > view the tasks. > > Each task has a column "Shuffle Write Size/Records"; that column > prints wrong data when it gets

BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Hello Team, in the Spark DAG UI we have a Stages tab. Once you click on each stage you can view the tasks. Each task has a column "Shuffle Write Size/Records"; that column prints wrong data when it gets the data from cache/persist. It typically will show the wrong record number though the data

Bug in org.apache.spark.util.sketch.BloomFilter

2024-03-21 Thread Nathan Conroy
Hi All, I believe that there is a bug that affects the Spark BloomFilter implementation when creating a bloom filter with large n. Since this implementation uses integer hash functions, it doesn’t work properly when the number of bits exceeds MAX_INT. I asked a question about
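
The arithmetic behind this report can be checked with the standard Bloom filter sizing formula (a back-of-the-envelope sketch, not Spark's internal code): the optimal bit count m = -n*ln(p)/(ln 2)^2 crosses Java's Integer.MAX_VALUE well below a billion items at typical false-positive rates.

    # Hedged sketch: textbook Bloom filter sizing, to see where bits > 2^31 - 1.
    import math

    def optimal_num_bits(n: int, fpp: float) -> int:
        # m = -n * ln(p) / (ln 2)^2, the classic optimal bit count
        return int(-n * math.log(fpp) / (math.log(2) ** 2))

    MAX_INT = 2**31 - 1
    for n in (100_000_000, 500_000_000, 1_000_000_000):
        m = optimal_num_bits(n, 0.01)
        print(f"n={n:>13,}  bits={m:>14,}  exceeds MAX_INT: {m > MAX_INT}")
    # At fpp=0.01, roughly n > 224 million already needs more than 2^31 - 1 bits.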

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-12 Thread Mich Talebzadeh
wiki.com/Mich_Talebzadeh >> >> >> >> *Disclaimer:* The information provided is correct to the best of my >> knowledge but of course cannot be guaranteed. It is essential to note that, >> as with any advice, quote "one test result is worth one-thousand

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread 刘唯
/Wernher_von_Braun>)". > > > On Mon, 11 Mar 2024 at 05:07, 刘唯 wrote: > >> *now -> not >> >> 刘唯 wrote on Sunday, March 10, 2024 at 22:04: > >>> Have you tried using microbatch_data.get("processedRowsPerSecond")? >>> Camel case now snake ca

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread Mich Talebzadeh
wrote: >> >>> >>> There is a paper from Databricks on this subject >>> >>> >>> https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html >>> >>> But having tested it, there seems to be a bug there that I reported to

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
> >> https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html >> >> But having tested it, there seems to be a bug there that I reported to >> Databricks forum as well (in answer to a user question) >> >> I have come to a conclus

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
g-queries-in-pyspark.html > > But having tested it, there seems to be a bug there that I reported to > Databricks forum as well (in answer to a user question) > > I have come to a conclusion that this is a bug. In general there is a bug > in obtaining individual values from the dictio

Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread Mich Talebzadeh
There is a paper from Databricks on this subject https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html But having tested it, there seems to be a bug there that I reported to Databricks forum as well (in answer to a user question) I have come to a conclusion
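
For context, the kind of metric access this thread is about looks roughly like this in PySpark (a hedged sketch; the built-in rate source and noop sink stand in for a real pipeline). query.lastProgress is a plain Python dict, so .get() is the safe way to pull out individual values.

    # Hedged sketch: rate source and noop sink are illustrative stand-ins.
    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("progress-check").getOrCreate()
    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    query = stream.writeStream.format("noop").start()

    time.sleep(10)  # let a few micro-batches complete
    progress = query.lastProgress  # dict in PySpark; None before the first batch
    if progress is not None:
        print(progress.get("processedRowsPerSecond"))
        print(progress.get("numInputRows"))
    query.stop()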

Re: [Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Mich Talebzadeh
cause partial aggregation if a single executor processes most items of a particular type. - Partial Aggregations, Spark might be combining partial counts from executors incorrectly, leading to inaccuracies. - Finally a bug in 3.5 is possible. HTH Mich Talebzadeh, Dad | Technologist | Solutions

[Spark Core] Potential bug in JavaRDD#countByValue

2024-02-27 Thread Stuart Fehr
Hello, I recently encountered a bug with the results from JavaRDD#countByValue that does not reproduce when running locally. For background, we are running a Spark 3.5.0 job on AWS EMR 7.0.0. The code in question is something like this: JavaRDD rdd = // ... > rdd.count(); // 75187 //
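
A simple way to cross-check a suspected countByValue discrepancy (a sketch with toy data, not the poster's job; shown in PySpark, but the same check applies to JavaRDD) is to compare it against an explicit reduceByKey over the same RDD; on a healthy cluster the two maps must agree.

    # Hedged sketch: toy data; countByValue vs. an explicit aggregation.
    from operator import add
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("countByValue-check").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(["a", "b", "a", "c", "b", "a"], 3)
    by_value = rdd.countByValue()                                   # driver-side map
    by_reduce = rdd.map(lambda x: (x, 1)).reduceByKey(add).collectAsMap()

    assert dict(by_value) == by_reduce, (by_value, by_reduce)
    print(dict(by_value))  # {'a': 3, 'b': 2, 'c': 1}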

Re: Spark 4.0 Query Analyzer Bug Report

2024-02-21 Thread Mich Talebzadeh
aise Py4JJavaError(\npy4j.protocol.Py4JJavaError: An error occurred while >> calling z:org.apache.spark.sql.api.python.PythonSQLUtils.explainString.\n: >> java.lang.IllegalStateException: You hit a query analyzer bug. Please report >> your query to Spark user mailing list.\n\

Re: Spark 3.3 Query Analyzer Bug Report

2024-02-20 Thread Sharma, Anup
me.py\" raise Py4JJavaError(\npy4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.api.python.PythonSQLUtils.explainString.\n: java.lang.IllegalStateException: You hit a query analyzer bug. Please report your query to Spark user mailing l

Re: Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Holden Karau
.spark.sql.api.python.PythonSQLUtils.explainString.\n: > java.lang.IllegalStateException: You hit a query analyzer bug. Please report > your query to Spark user mailing list.\n\tat > org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:516)\n\tat > > org.apache.spark.sql.cat

Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Sharma, Anup
e calling z:org.apache.spark.sql.api.python.PythonSQLUtils.explainString.\n: java.lang.IllegalStateException: You hit a query analyzer bug. Please report your query to Spark user mailing list.\n\tat org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:

[Spark SQL] [Bug] Adding `checkpoint()` causes "column [...] cannot be resolved" error

2023-11-05 Thread Robin Zimmerman
Hi all, Wondering if anyone has run into this as I can't find any similar issues in JIRA, mailing list archives, Stack Overflow, etc. I had a query that was running successfully, but the query planning time was extremely long (4+ hours). To fix this I added `checkpoint()` calls earlier in the
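
For reference, the checkpoint() pattern mentioned above looks like this (a minimal sketch; the checkpoint path and data are placeholders): checkpointing materializes the intermediate result and truncates the logical plan, which is what shrinks planning time on very long lineages.

    # Hedged sketch: placeholder checkpoint dir and synthetic data.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder path

    df = spark.range(1_000_000)
    intermediate = df.withColumn("doubled", df.id * 2)

    # checkpoint() returns a DataFrame whose plan starts from the materialized
    # data, cutting off the upstream lineage the analyzer would otherwise replan.
    checkpointed = intermediate.checkpoint(eager=True)
    checkpointed.groupBy((checkpointed.id % 10).alias("bucket")).count().show()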

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-18 Thread Jerry Peng
en to disk you may see duplicates when there are failures. However, if you read the output location with Spark you should get exactly-once results (unless there is a bug), since Spark will know how to use the commit log to see which data files are committed and which are not. Best, Jerry On Mon, Sep 18, 20

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Craig Alfieri
Peng wrote: Craig, Thanks! Please let us know the result! Best, Jerry. On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote: Hi Craig, Can you please clarify what this bug is and provide sample code causing this issue? HTH Mich Talebzadeh, Distinguished Technologist, Sol

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread russell . spitzer
please clarify what this bug is and provide sample code causing this issue? HTH Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer, London, United Kingdom. View my Linkedin profile: https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: Use it at your own risk. A

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Jerry Peng
Craig, Thanks! Please let us know the result! Best, Jerry On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote: > > Hi Craig, > > Can you please clarify what this bug is and provide sample code causing > this issue? > > HTH > > Mich Talebzadeh, > Disting

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Craig Alfieri
out to share more artifacts with this thread. Thank you Jerry. From: Jerry Peng Date: Thursday, September 14, 2023 at 1:10 PM To: Craig Alfieri Cc: user@spark.apache.org Subject: Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2 Hi Craig, Thank you

Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-20 Thread Dipayan Dev
Hi Mich, It's not specific to ORC; it looks like a bug in the Hadoop Common project. I have raised a ticket and am happy to contribute a fix to Hadoop 3.3.0. Do you know if anyone could help me set the Assignee? https://issues.apache.org/jira/browse/HADOOP-18856 With Best Regards, Dipayan Dev

Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-19 Thread Mich Talebzadeh
Under the gs directory "gs://test_dd1/abc/", what do you see? gsutil ls gs://test_dd1/abc; and the same for gs://test_dd1/: gsutil ls gs://test_dd1. I suspect you need a folder for multiple ORC slices! Mich Talebzadeh, Solutions Architect/Engineering Lead, London, United Kingdom view my Linkedin

Probable Spark Bug while inserting into flat GCS bucket?

2023-08-19 Thread Dipayan Dev
Hi Everyone, I'm stuck with one problem, where I need to provide a custom GCS location for the Hive table from Spark. The code fails while doing an *'insert into'* whenever my Hive table has a flat GCS location like gs://, but works for nested locations like gs://bucket_name/blob_name. Is anyone

Spark UI - Bug Executors tab when using proxy port

2023-07-06 Thread Bruno Pistone
Hello everyone, I’m really sorry to use this mailing list, but it seems impossible to report a strange behaviour that is happening with the Spark UI. I’m also sending the link to the Stack Overflow question here: https://stackoverflow.com/questions/76632692/spark-ui-executors-tab-its-empty I’m

[JDBC] [PySpark] Possible bug when comparing incoming data frame from mssql and empty delta table

2023-02-26 Thread lennart
Hello, I have been working on a small ETL framework for pyspark/delta/databricks in my spare time. It looks like I might have encountered a bug; however, I'm not totally sure it's actually caused by Spark itself and not one of the other technologies. The error shows up when using Spark SQL

[BUG?] How to handle with special characters or scape them on spark version 3.3.0?

2023-01-04 Thread Vieira, Thiago
Hello everyone, I’ve already raised this question on Stack Overflow, but to be honest I truly believe this is a bug in the new Spark version, so I am also sending this email. Previously I was using Spark version 3.2.1 to read data from an SAP database via the JDBC connector, and I had no issues to perform

[PySpark] [applyInPandas] Regression Bug: Cogroup in pandas drops columns from the first dataframe

2022-11-25 Thread Michael Bílý
Hello there, I ran into this problem on pyspark: when using the groupby.cogroup functionality on the same dataframe, it silently drops columns from the first instance, minimal example: spark = ( SparkSession.builder .getOrCreate() ) df = spark.createDataFrame([["2017-08-17", 1,]],
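
The snippet above is cut off; a self-contained sketch of the same cogroup/applyInPandas pattern (the schema string and merge function are assumptions for illustration) looks like this. Per the report, on affected versions the first side arrived with columns silently dropped.

    # Hedged sketch: reconstructed from the truncated example above.
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([["2017-08-17", 1]], ["date", "value"])

    def merge(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
        # The report says `left` showed up with columns missing on affected versions.
        return left

    out = (
        df.groupby("date")
          .cogroup(df.groupby("date"))
          .applyInPandas(merge, schema="date string, value long")
    )
    out.show()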

[PySpark, Spark Streaming] Bug in timestamp handling in Structured Streaming?

2022-10-21 Thread kai-michael.roes...@sap.com.INVALID
Hi, I suspect I may have come across a bug in the handling of data containing timestamps in PySpark "Structured Streaming" using the foreach option. I'm "just" a user of PySpark, no Spark community member, so I don't know how to properly address the issue. I have post

Unusual bug, please help me, I can do nothing!!!

2022-03-30 Thread spark User
Hello, I am a Spark user. I use the "spark-shell.cmd" startup command in Windows cmd. The first startup is normal; when I use the Ctrl+C command to force-close the Spark window, it can't start normally again. The error message is as follows: "Failed to initialize Spark

Error bug, please help me!!!

2022-03-20 Thread spark User
Hello, I am a Spark user. I use the "spark-shell.cmd" startup command in Windows cmd. The first startup is normal; when I use the Ctrl+C command to force-close the Spark window, it can't start normally again. The error message is as follows: "Failed to initialize Spark

Fwd: metastore bug when hive update spark table ?

2022-01-06 Thread Mich Talebzadeh
From my experience this is a Spark issue (more code-base divergence of spark-sql from Hive), but of course there is the workaround below. -- Forwarded message - From: Mich Talebzadeh Date: Thu, 6 Jan 2022 at 17:29 Subject: Re: metastore bug when hive update spark ta

spark metadata metastore bug ?

2022-01-06 Thread Nicolas Paris
Spark can't see Hive schema updates, partly because it stores the schema in a weird way in the Hive metastore. 1. FROM SPARK: create a table >>> spark.sql("select 1 col1, 2 >>> col2").write.format("parquet").saveAsTable("my_table") >>> spark.table("my_table").printSchema() root |--

Re: possible bug

2021-04-09 Thread Mich Talebzadeh
Spark 3.1.1 view my Linkedin profile *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical

Re: possible bug

2021-04-09 Thread Mich Talebzadeh
I ran this one on RHES 7.6 with 64GB of memory and it hit OOM >>> data=list(range(rows)) >>> rdd=sc.parallelize(data,rows) >>> assert rdd.getNumPartitions()==rows >>> rdd0=rdd.filter(lambda x:False) >>> assert rdd0.getNumPartitions()==rows >>> rdd00=rdd0.coalesce(1) >>> data=rdd00.collect()

Re: possible bug

2021-04-09 Thread Sean Owen
OK so it's '7 threads overwhelming off heap mem in the JVM' kind of thing. Or running afoul of ulimits in the OS. On Fri, Apr 9, 2021 at 11:19 AM Attila Zsolt Piros < piros.attila.zs...@gmail.com> wrote: > Hi Sean! > > So the "coalesce" without shuffle will create a CoalescedRDD which during

Re: possible bug

2021-04-09 Thread Attila Zsolt Piros
Hi Sean! So the "coalesce" without shuffle will create a CoalescedRDD which during its computation delegates to the parent RDD partitions. As the CoalescedRDD contains only 1 partition so we talk about 1 task and 1 task context. The next stop is PythonRunner. Here the python workers at least

Re: possible bug

2021-04-09 Thread Mich Talebzadeh
: 0k, detached. >> >> # >> >> # There is insufficient memory for the Java Runtime Environment to >> continue. >> >> # Native memory allocation (mmap) failed to map 16384 bytes for >> committing reserved memory. >> >> >> >> A funct

Re: possible bug

2021-04-09 Thread Sean Owen
Yeah I figured it's not something fundamental to the task or Spark. The error is very odd, never seen that. Do you have a theory on what's going on there? I don't! On Fri, Apr 9, 2021 at 10:43 AM Attila Zsolt Piros < piros.attila.zs...@gmail.com> wrote: > Hi! > > I looked into the code and find

Re: possible bug

2021-04-09 Thread Attila Zsolt Piros
afely and with performance, I will now always use coalesce > with shuffling, even though in theory this will come with quite a > performance decrease. > > > > Markus > > > > *From:* Russell Spitzer > *Sent:* Thursday, April 8, 2021 15:24 > *To:* Weiand, Markus,

AW: possible bug

2021-04-08 Thread Weiand, Markus, NMA-CFD
Sent: Thursday, April 8, 2021 15:24 To: Weiand, Markus, NMA-CFD Cc: user@spark.apache.org Subject: Re: possible bug. Could be that the driver JVM cannot handle the metadata required to store the partition information of a 70k-partition RDD. I see you say you have a 100GB driver but I'm not sure

Re: possible bug

2021-04-08 Thread Russell Spitzer
> *To:* Weiand, Markus, NMA-CFD > *Cc:* user@spark.apache.org > *Subject:* Re: possible bug > > > > That's a very low-level error from the JVM. Any chance you are > misconfiguring the executor size? Like 10MB instead of 10GB, that kind > of thing. Trying to think of why th

AW: possible bug

2021-04-08 Thread Weiand, Markus, NMA-CFD
trying to coalesce an empty RDD with 7 partitions into an empty RDD with 1 partition; why is this a problem without shuffling? From: Sean Owen Sent: Thursday, April 8, 2021 15:00 To: Weiand, Markus, NMA-CFD Cc: user@spark.apache.org Subject: Re: possible bug That's a very low level error from

Re: possible bug

2021-04-08 Thread Sean Owen
:53 AM Weiand, Markus, NMA-CFD < markus.wei...@bertelsmann.de> wrote: > Hi all, > > > > I'm using Spark on a c5a.16xlarge machine in the Amazon cloud (so having 64 > cores and 128 GB RAM). I'm using Spark 3.0.1. > > > > The following Python code leads to

possible bug

2021-04-08 Thread Weiand, Markus, NMA-CFD
Hi all, I'm using Spark on a c5a.16xlarge machine in the Amazon cloud (so having 64 cores and 128 GB RAM). I'm using Spark 3.0.1. The following Python code leads to an exception; is this a bug or is my understanding of the API incorrect? import pyspark conf=pyspark.SparkConf().setMaster
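
The code above is truncated; a reconstruction based on the repro posted in the 2021-04-09 replies earlier in this thread (master URL assumed, partition count per the thread's ~70k figure) looks like this:

    # Hedged reconstruction of the thread's repro; heavy to run as-is.
    import pyspark

    conf = pyspark.SparkConf().setMaster("local[*]")  # master URL assumed
    sc = pyspark.SparkContext(conf=conf)

    rows = 70000
    rdd = sc.parallelize(range(rows), rows)   # one element per partition
    rdd0 = rdd.filter(lambda x: False)        # empty, but still 70k partitions
    # coalesce(1) without shuffle: a single task computes all 70k parents,
    # which is where the thread's native-memory errors appeared.
    data = rdd0.coalesce(1).collect()
    assert data == []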

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-04 Thread Oldrich Vlasic
ching". From: Jeff Evans Sent: Thursday, March 4, 2021 2:55 PM To: Oldrich Vlasic Cc: Russell Spitzer ; Sean Owen ; user ; Ondřej Havlíček Subject: Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto Why not perform a df.select(...) before the final write

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-04 Thread Jeff Evans
) from > falling victim to this. > -- > *From:* Russell Spitzer > *Sent:* Wednesday, March 3, 2021 3:31 PM > *To:* Sean Owen > *Cc:* Oldrich Vlasic ; user < > user@spark.apache.org>; Ondřej Havlíček > *Subject:* Re: [Spark SQL, intermedi

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-04 Thread Oldrich Vlasic
; user ; Ondřej Havlíček Subject: Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto Yep this is the behavior for Insert Into, using the other write apis does schema matching I believe. On Mar 3, 2021, at 8:29 AM, Sean Owen mailto:sro...@gmail.com>> wrote: I

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-03 Thread Russell Spitzer
erning > partial overwrites of partitioned data. Not sure if this is a bug or just > abstraction > leak. I have checked Spark section of Stack Overflow and haven't found any > relevant > questions or answers. > > Full minimal working example provided as attachment. Tested on

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-03 Thread Sean Owen
> Hi, > > I have encountered a weird and potentially dangerous behaviour of Spark > concerning > partial overwrites of partitioned data. Not sure if this is a bug or just > abstraction > leak. I have checked Spark section of Stack Overflow and haven't found any > relevant

[Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-02 Thread Oldrich Vlasic
Hi, I have encountered a weird and potentially dangerous behaviour of Spark concerning partial overwrites of partitioned data. Not sure if this is a bug or just an abstraction leak. I have checked the Spark section of Stack Overflow and haven't found any relevant questions or answers. Full minimal
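
For readers skimming the thread: the gotcha under discussion is that insertInto matches columns by position, not by name. A hedged sketch (table and column names are placeholders) of the footgun and the fix suggested in the replies:

    # Hedged sketch: placeholder table; column order deliberately scrambled.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sql("CREATE TABLE IF NOT EXISTS tgt (a BIGINT, b STRING) USING parquet")

    df = spark.createDataFrame([("x", 1)], ["b", "a"])  # columns out of order

    # insertInto is positional: writing df directly would put 'x' into column a.
    # Selecting in the table's column order first avoids the silent mismatch.
    df.select("a", "b").write.insertInto("tgt")
    spark.table("tgt").show()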

Re: A serious bug in the fitting of a binary logistic regression.

2021-02-22 Thread Sean Owen
I'll take a look. At a glance - is it converging? might turn down the tolerance to check. Also what does scikit learn say on the same data? we can continue on the JIRA. On Mon, Feb 22, 2021 at 5:42 PM Yakov Kerzhner wrote: > I have written up a JIRA, and there is a gist attached that has code

A serious bug in the fitting of a binary logistic regression.

2021-02-22 Thread Yakov Kerzhner
I have written up a JIRA, and there is a gist attached that has code that reproduces the issue. This is a fairly serious issue as it probably affects everyone who uses spark to fit binary logistic regressions. https://issues.apache.org/jira/browse/SPARK-34448 Would be great if someone who

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-18 Thread 王长春
s with multi-threads, > each snapshot represent an "hour" of data, and we do the "read-reduce-write" > operations > on multiple snapshots(hours) simultaneously. We pretty sure the same > snapshot(hour) never process parallelly and the output path always > generated with a t

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Shiao-An Yuan
> generated at the 1st time, and other files are generated at the >>> 2nd (retry) time. >>> Moreover, those duplicated logs will be duplicated exactly twice and >>> located in >>> both batches (one in the first batch; and one in the second batch). >>> >

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Mich Talebzadeh
esent an "hour" of data, and we do the > "read-reduce-write" operations > on multiple snapshots(hours) simultaneously. We pretty sure the same > snapshot(hour) never process parallelly and the output path always > generated with a timestamp, so those jobs should

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Gourav Sengupta
/output files are parquet on GCS. The Spark version is 2.4.4 with >> standalone deployment. Workers running on GCP preemptible instances and >> they >> being preempted very frequently. >> >> The pipeline is running in a single long-running process with >> multi-thre

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Sean Owen
h batches (one in the first batch; and one in the second batch). >> >> The input/output files are parquet on GCS. The Spark version is 2.4.4 with >> standalone deployment. Workers running on GCP preemptible instances and >> they >> being preempted very frequently. >>

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Shiao-An Yuan
hour) never process parallelly and the output path always > generated with a timestamp, so those jobs shouldn't affect each other. > > After changing the line (1) to `coalesce` or `repartition(100, $"pkey")` > the issue > was gone, but I believe there is still a correctness bug t

Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Sean Owen
y key") and some "pkey" missing. > Since it only happens when executors being preempted, I believe this is a > bug (nondeterministic shuffle) that SPARK-23207 trying to solve. > > Thanks, > > Shiao-An Yuan > > On Tue, Dec 29, 2020 at 10:53 PM Sean Owen wrote:

Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Shiao-An Yuan
st, I mean duplicated "pkey" exists in the output file (after "reduce by key") and some "pkey" missing. Since it only happens when executors being preempted, I believe this is a bug (nondeterministic shuffle) that SPARK-23207 trying to solve. Thanks, Shiao-An Yuan

Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Sean Owen
" of data, and we do the > "read-reduce-write" operations > on multiple snapshots(hours) simultaneously. We pretty sure the same > snapshot(hour) never process parallelly and the output path always > generated with a timestamp, so those jobs shouldn't affect each other. > > Af

Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Shiao-An Yuan
the issue was gone, but I believe there is still a correctness bug that hasn't been reported yet. We have tried to reproduce this bug on a smaller scale but haven't succeeded yet. I have read SPARK-23207 and SPARK-28699, but couldn't find the bug. Since this case is DataSet, I believe it is unre
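
The mitigation the poster mentions, switching repartition(100) to repartition(100, $"pkey") or coalesce, can be sketched as follows (synthetic data and a placeholder path; the point is that round-robin repartition assigns rows in an order that can differ across task retries, while hash-partitioning on the key is deterministic per record):

    # Hedged sketch: synthetic data and a placeholder output path.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).withColumn("pkey", F.col("id") % 1000)

    risky = df.repartition(100)                  # round-robin: retry-sensitive
    safer = df.repartition(100, F.col("pkey"))   # hash on pkey: deterministic
    safer.write.mode("overwrite").parquet("/tmp/out")  # placeholder path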

[bug] Scala reflection "assertion failed: class Byte" in Dataset.toJSON

2020-05-30 Thread Brandon Vincent
Hi all, I have a job that executes a query and collects the results as JSON using Dataset.toJSON. For the most part it is stable, but sometimes it fails randomly with a scala assertion error. Here is the stack trace: org.apache.spark.sql.Dataset.toJSON

Have you paid your bug bounty or did you log him off without paying

2020-05-01 Thread Nelson Mandela

Re: BUG: spark.readStream .schema(staticSchema) not receiving schema information

2020-03-28 Thread Zahid Rahman
rganizations) is > ludicrous. Learn a little bit of humility > > If you're new to something, assume you have made a mistake rather than > that there is a bug. Lurk a bit more, or even do a simple Google search, > and you will realize Sean is a very senior committer (i.e. expert) in >

Re: BUG: take with SparkSession.master[url]

2020-03-27 Thread Zahid Rahman
~/spark-3.0.0-preview2-bin-hadoop2.7$ sbin/start-slave.sh spark://192.168.0.38:7077 ~/spark-3.0.0-preview2-bin-hadoop2.7$ sbin/start-master.sh Backbutton.co.uk ¯\_(ツ)_/¯ ♡۶Java♡۶RMI ♡۶ Make Use Method {MUM} makeuse.org On Fri, 27 Mar 2020 at 06:12, Zahid Rahman

Re: BUG: take with SparkSession.master[url]

2020-03-27 Thread Zahid Rahman
sbin/start-master.sh sbin/start-slave.sh spark://192.168.0.38:7077 Backbutton.co.uk ¯\_(ツ)_/¯ ♡۶Java♡۶RMI ♡۶ Make Use Method {MUM} makeuse.org On Fri, 27 Mar 2020 at 05:59, Wenchen Fan wrote: > Your Spark cluster, spark://192.168.0.38:7077, how is it deployed if

Re: BUG: take with SparkSession.master[url]

2020-03-27 Thread Wenchen Fan
Your Spark cluster, spark://192.168.0.38:7077, how is it deployed if you just include the Spark dependency in IntelliJ? On Fri, Mar 27, 2020 at 1:54 PM Zahid Rahman wrote: > I have configured in IntelliJ as external jars > spark-3.0.0-preview2-bin-hadoop2.7/jar > > not pulling anything from maven.

Re: BUG: take with SparkSession.master[url]

2020-03-26 Thread Zahid Rahman
I have configured in IntelliJ as external jars spark-3.0.0-preview2-bin-hadoop2.7/jar not pulling anything from maven. Backbutton.co.uk ¯\_(ツ)_/¯ ♡۶Java♡۶RMI ♡۶ Make Use Method {MUM} makeuse.org On Fri, 27 Mar 2020 at 05:45, Wenchen Fan wrote: > Which

Re: BUG: take with SparkSession.master[url]

2020-03-26 Thread Wenchen Fan
Which Spark/Scala version do you use? On Fri, Mar 27, 2020 at 1:24 PM Zahid Rahman wrote: > > with the following sparksession configuration > > val spark = SparkSession.builder().master("local[*]").appName("Spark Session > take").getOrCreate(); > > this line works > > flights.filter(flight_row

BUG: take with SparkSession.master[url]

2020-03-26 Thread Zahid Rahman
with the following SparkSession configuration: val spark = SparkSession.builder().master("local[*]").appName("Spark Session take").getOrCreate(); this line works: flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada").map(flight_row => flight_row).take(5) however if I change the

Spark 2.2.1 Dataframes multiple joins bug?

2020-03-23 Thread Dipl.-Inf. Rico Bergmann
Hi all! Is it possible that Spark, under certain circumstances, creates duplicate rows when doing multiple joins? What I did: buse.count res0: Long = 20554365 buse.alias("buse").join(bdef.alias("bdef"), $"buse._c4"===$"bdef._c4").count res1: Long = 20554365

Re: Hostname :BUG

2020-03-12 Thread Zahid Rahman
hey Dodgy Bob, Linux & C programmers, conscientious non-objector, I have a great idea I want to share with you. In Linux I am familiar with wc {wc = word count} (Linux users don't like long-winded typing). wc flags are: -c, --bytes print the byte counts -m, --chars print

Re: Hostname :BUG

2020-03-09 Thread Zahid Rahman
Hey Floyd, I just realised something: you need to practice using the adduser command to create users, or in your case useradd, because that's less painful for you when creating a user, instead of working as root. Trust me, it is good for you. Then you will realise this bit of code new SparkConf() is

Re: Hostname :BUG

2020-03-05 Thread Zahid Rahman
Talking about copy and paste: Larry Tesler, the *inventor* of *cut*/*copy* & *paste* and find & replace, passed away last week aged 74. Backbutton.co.uk ¯\_(ツ)_/¯ ♡۶Java♡۶RMI ♡۶ Make Use Method {MUM} makeuse.org On Thu, 5 Mar 2020 at 07:01, Zahid Rahman wrote: > Please

Re: Hostname :BUG

2020-03-04 Thread Zahid Rahman
Please explain why you think that, if there is a different reason from this: - If you think so because the header of /etc/hostname says hosts, that is because I copied the file header from /etc/hosts to /etc/hostname. On Wed, 4 Mar 2020, 21:14 Andrew Melo wrote: > Hello Zabid, > >

Hostname :BUG

2020-03-04 Thread Zahid Rahman
Hi, I found the problem was that on my Linux operating system /etc/hostname was blank. *STEP 1* I googled the error message and there was an answer suggesting I should add to /etc/hostname: 127.0.0.1 [hostname] localhost. I did that but there was still an error, this

[Spark SQL]: Dataframe group by potential bug (Scala)

2019-10-31 Thread ludwiggj
is that I've found a bug in Spark, though I'm happy to be wrong. I can't find any reference to this issue online. *Given this schema:* val salesSchema = StructType(Seq( StructField("shopId", LongType, nullable = false), StructField("game", StringTy

Dataset schema incompatibility bug when reading column partitioned data

2019-03-29 Thread Dávid Szakállas
We observed the following bug on Spark 2.4.0: scala> spark.createDataset(Seq((1,2))).write.partitionBy("_1").parquet("foo.parquet") scala> val schema = StructType(Seq(StructField("_1", IntegerType),StructField("_2", IntegerType))) scala>
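
A hedged completion of the truncated repro, in PySpark (placeholder path): after a partitionBy write, the partition column is appended after the data columns on read, so a schema that lists it first no longer lines up.

    # Hedged sketch: same shape as the Scala repro above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.createDataFrame([(1, 2)], ["_1", "_2"]) \
        .write.mode("overwrite").partitionBy("_1").parquet("/tmp/foo.parquet")

    print(spark.read.parquet("/tmp/foo.parquet").schema)
    # The partition column "_1" comes back appended after the data column, i.e.
    # the observed order is (_2, _1), not the (_1, _2) the writer started with.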

Re: Bug in Window Function

2018-07-25 Thread Jacek Laskowski
ecifications (ArrayBuffer(windowspecdefinition(campaign_id#104, > app_id#93, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), > windowspecdefinition(campaign_id#104, app_id#93, country#123, ROWS > BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)) ).Please file a > bug report with this error message, stack trace, and the query.; >

Bug in Window Function

2018-07-25 Thread Elior Malul
app_id#93, country#123, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)) ).Please file a bug report with this error message, stack trace, and the query.;

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Shiyuan
: >> >>> Hi Shiyuan, >>> >>> I do not know whether I am right, but I would prefer to avoid >>> expressions in Spark as: >>> >>> df = <> >>> >>> >>> Regards, >>> Gourav Sengupta >>> >>

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Alessandro Solimando
>> >> I do not know whether I am right, but I would prefer to avoid expressions >> in Spark as: >> >> df = <> >> >> >> Regards, >> Gourav Sengupta >> >> On Tue, Apr 10, 2018 at 10:42 PM, Shiyuan <gshy2...@gmail.com> wrote:

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Shiyuan
0, 2018 at 10:42 PM, Shiyuan <gshy2...@gmail.com> wrote: > >> Here is the pretty print of the physical plan which reveals some details >> about what causes the bug (see the lines highlighted in bold): >> WithColumnRenamed() fails to update the dependency graph correctly:

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Gourav Sengupta
ls > about what causes the bug (see the lines highlighted in bold): > WithColumnRenamed() fails to update the dependency graph correctly: > > > 'Resolved attribute(s) kk#144L missing from ID#118,LABEL#119,kk#96L,score#121 > in operator !Project [ID#118, score#121, LABEL#119, kk#14

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Shiyuan
Here is the pretty print of the physical plan which reveals some details about what causes the bug (see the lines highlighted in bold): WithColumnRenamed() fails to update the dependency graph correctly: 'Resolved attribute(s) kk#144L missing from ID#118,LABEL#119,kk#96L,score#121 in operator
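
The exact repro is truncated across these excerpts; a sketch in its spirit, reusing the column names visible in the snippets (score, ID, LABEL, kk), is below. Per the thread, a sequence like this hit the "resolved attribute missing" analyzer error on Spark 2.3.0; it runs cleanly on later versions.

    # Hedged reconstruction: column names from the thread, logic assumed.
    from pyspark.sql import Row, SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([
        Row(score=1.0, ID="abc", LABEL=True, k=2),
        Row(score=1.0, ID="xyz", LABEL=False, k=3),
    ])

    dfa = df.select("ID", "score", "LABEL", "k").withColumnRenamed("k", "kk")
    dfg = dfa.groupBy("ID").agg(F.max("kk").alias("kk"))
    dfa.join(dfg, on=["ID", "kk"], how="inner").show()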

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Shiyuan
you could kindly try using the below statement and > go through your use case once again (I am yet to go through all the lines): > > > > from pyspark.sql import Row > > df = spark.createDataFrame([Row(score = 1.0,ID="abc",LABEL=True,k=2), > Row(score = 1.0,ID="

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-09 Thread Gourav Sengupta
ark Users, > The following code snippet has an "attribute missing" error while the > attribute exists. This bug is triggered by a particular sequence of of > "select", "groupby" and "join". Note that if I take away the "select" in >

A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-09 Thread Shiyuan
Hi Spark Users, The following code snippet has an "attribute missing" error while the attribute exists. This bug is triggered by a particular sequence of of "select", "groupby" and "join". Note that if I take away the "select" i

spark 2.3 dataframe join bug

2018-03-26 Thread 李斌松
Hi, Spark folks: I'm using Spark 2.3 and have found a bug in Spark DataFrames; here is my code: sc = sparkSession.sparkContext tmp = sparkSession.createDataFrame(sc.parallelize([[1, 2, 3, 4], [1, 2, 5, 6], [2, 3, 4, 5], [2, 3, 5, 6]])).toDF('a', 'b', 'c', 'd

A possible bug? Must call persist to make code run

2017-12-06 Thread kwunlyou
a bug in Spark. Does anyone know which solved Spark issues are related? == CODE == from __future__ import absolute_import, division, print_function import pyspark.sql.types as T import pyspark.sql.functions as F # 2.1.1, 2.1.2 doesn't work # 2.2.0

Bug Report: Spark Config Defaults not Loading with python code/spark-submit

2017-10-13 Thread Nathan McLean
= SparkConf() print conf.toDebugString() # this prints configuration options print 'SPARK DEFAULTS' spark_context = SparkContext() conf = spark_context.getConf() print conf.toDebugString() This bug does not seem to exist in Spark 1.6.x I have reproduced it in Spark 2.1.1 and Spark 2.2.0

How do I create a JIRA issue and associate it with a PR that I created for a bug in master?

2017-09-12 Thread Mikhailau, Alex
How do I create a JIRA issue and associate it with a PR that I created for a bug in master? https://github.com/apache/spark/pull/19210

RE: A bug in spark or hadoop RPC with kerberos authentication?

2017-08-23 Thread Sun, Keith
.getOrCreate(); Thanks very much. Keith From: 周康 [mailto:zhoukang199...@gmail.com] Sent: Aug 22, 2017 20:22 To: Sun, Keith <ai...@ebay.com> Cc: user@spark.apache.org Subject: Re: A bug in spark or hadoop RPC with kerberos authentication? You can check out the Hadoop credential classes in Spark YARN. During spark-submit, it will use the config on the classpath. I wonder how you reference your own config?
