Re: Different execution results with wholestage codegen on and off

2020-05-29 Thread Pasha Finkelshteyn
Here[1] it is, please review. [1] https://issues.apache.org/jira/browse/SPARK-31854 On 20/05/27 10:21PM, Xiao Li wrote: > Thanks for reporting it. Please open a JIRA with a test case. > > Cheers, > > Xiao > > On Wed, May 27, 2020 at 1:42 PM Pasha Finkelshteyn < > pavel.finkelsht...@gmail.com>

Re: CSV parsing issue

2020-05-29 Thread elango vaidyanathan
Thanks Sean, got it. Thanks, Elango On Thu, May 28, 2020, 9:04 PM Sean Owen wrote: > I don't think so, that data is inherently ambiguous and incorrectly > formatted. If you know something about the structure, maybe you can rewrite > the middle column manually to escape the inner quotes and
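[A minimal sketch of the approach Sean suggests, assuming the file has been manually rewritten so that inner quotes are backslash-escaped; the path, header layout, and escape character are illustrative, not from the original thread:]

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-inner-quotes").getOrCreate()

// Read the rewritten file, telling the CSV parser which characters the
// rewrite used for quoting and escaping (both are assumptions here).
val df = spark.read
  .option("header", "true")
  .option("quote", "\"")   // outer quote character
  .option("escape", "\\")  // inner quotes assumed rewritten as \"
  .csv("/tmp/rewritten.csv")

df.show(truncate = false)
```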

Re: Using Spark Accumulators with Structured Streaming

2020-05-29 Thread Srinivas V
Yes, it is an application-specific class. This is how Java Spark Functions work. You can refer to this code in the documentation: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredSessionization.java public class StateUpdateTask

Spark Security

2020-05-29 Thread wilbertseoane
Hello, I plan to load in a local .tsv file from my hard drive using sparklyr (an R package). I have figured out how to do this already on small files. When I decide to receive my client’s large .tsv file, can I be confident that loading in data this way will be secure? I know that this

Re: Spark Security

2020-05-29 Thread Sean Owen
What do you mean by secure here? On Fri, May 29, 2020 at 10:21 AM wrote: > Hello, > > I plan to load in a local .tsv file from my hard drive using sparklyr (an > R package). I have figured out how to do this already on small files. > > When I decide to receive my client’s large .tsv file, can I

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread Bin Fan
Try deploying Alluxio as a caching layer on top of S3, providing Spark with an HDFS-compatible interface? As in this article: https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/ On Wed, May 27, 2020 at 6:52 PM Dark Crusader wrote: > Hi Randy, > >
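[For concreteness, a hedged sketch of what the linked article describes: once an Alluxio master is running (the hostname below is hypothetical; 19998 is Alluxio's default master port) and the S3 bucket is mounted into the Alluxio namespace, Spark reads through the alluxio:// scheme instead of s3a://:]

```scala
// Assumes the Alluxio client jar is on Spark's classpath and the S3
// bucket has been mounted at /mnt/s3 in the Alluxio namespace.
val df = spark.read.parquet("alluxio://alluxio-master:19998/mnt/s3/my-dataset")

// The first read populates Alluxio's cache tiers; repeated reads are
// then served from storage close to the compute nodes.
df.count()
```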

Re: Using Spark Accumulators with Structured Streaming

2020-05-29 Thread Something Something
Did you try this on the Cluster? Note: This works just fine under 'Local' mode. On Thu, May 28, 2020 at 9:12 PM ZHANG Wei wrote: > I can't reproduce the issue with my simple code: > ```scala > spark.streams.addListener(new StreamingQueryListener { > override def onQueryProgress(event:
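[The quoted snippet is cut off by the archive; a self-contained version of the same test pattern might look like the following sketch (the accumulator name is illustrative):]

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// A named accumulator registered on the driver.
val myAcc = spark.sparkContext.longAccumulator("myAcc")

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    // Accumulator updates are merged back to the driver, so the value
    // can be inspected here after each micro-batch.
    println(s"accumulator value: ${myAcc.value}")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})
```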

Re: Using Spark Accumulators with Structured Streaming

2020-05-29 Thread Something Something
Thanks! I will take a look at the link. Just one question: you seem to be passing 'accumulators' in the constructor, but where do you use it in the StateUpdateTask class? I am still missing that connection. Sorry if my question is dumb. I must be missing something. Thanks for your help so far.

Re: Spark Security

2020-05-29 Thread Sean Owen
If you load a file on your computer, that is unrelated to Spark. Whatever you load via Spark APIs will at some point live in memory on the Spark cluster, or the storage you back it with if you store it. Whether the cluster and storage are secure (like, ACLs / auth enabled) is up to whoever runs

Re: Using Spark Accumulators with Structured Streaming

2020-05-29 Thread Srinivas V
Yes, accumulators are updated in the call method of StateUpdateTask, for example when state times out or when the data is pushed to the next Kafka topic. On Fri, May 29, 2020 at 11:55 PM Something Something < mailinglist...@gmail.com> wrote: > Thanks! I will take a look at the link. Just one question,
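[In Scala terms (the thread's StateUpdateTask is Java, so this is only a hedged translation of the pattern; all type and field names are illustrative), the accumulator is captured by the state-update function and incremented inside it, the analogue of the Java call method:]

```scala
import org.apache.spark.sql.streaming.GroupState
import org.apache.spark.util.LongAccumulator

case class Event(id: String, payload: String)
case class StateInfo(count: Long)
case class ModelUpdate(id: String, count: Long)

def updateState(acc: LongAccumulator)
               (id: String, events: Iterator[Event],
                state: GroupState[StateInfo]): ModelUpdate = {
  val prev = state.getOption.getOrElse(StateInfo(0L))
  val next = StateInfo(prev.count + events.size)
  state.update(next)
  acc.add(1) // the update the reply describes, inside the state function
  ModelUpdate(id, next.count)
}

// Wiring it up (encoders come from spark.implicits._):
// events.groupByKey(_.id).mapGroupsWithState(updateState(acc) _)
```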

Re: Spark Security

2020-05-29 Thread Anwar AliKhan
What is the size of your .tsv file, sir? What is the size of your local hard drive, sir? Regards, Wali Ahaad On Fri, 29 May 2020, 16:21 , wrote: > Hello, > > I plan to load in a local .tsv file from my hard drive using sparklyr (an > R package). I have figured out how to do this

[pyspark 2.3+] Dedupe records

2020-05-29 Thread Rishi Shah
Hi All, I have around 100B records, with new, update & delete records coming in; update/delete records are not that frequent. I would like to get some advice on the below: 1) should I use rdd + reduceByKey or a DataFrame window operation for data of this size? Which one would outperform the other? Which is
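[For reference, the DataFrame-window flavour of dedupe mentioned in (1) usually looks like the sketch below, keeping only the most recent version of each record; the key and timestamp column names are made up:]

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank versions of each record, newest first, and keep rank 1.
val w = Window.partitionBy("record_id").orderBy(col("updated_at").desc)

val deduped = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
```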

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread randy clinton
HDFS is simply a better place to make performant reads, and on top of that the data is closer to your Spark job. The Databricks link from above shows where they find a 6x read throughput difference between the two. If your HDFS is part of the same Spark cluster, then it should be an

Re: Using Spark Accumulators with Structured Streaming

2020-05-29 Thread Something Something
I mean... I don't see any reference to 'accumulator' in your class *definition*. How can you access it in the class if it's not in your definition of the class: public class StateUpdateTask implements MapGroupsWithStateFunction<*String, InputEventModel, ModelStateInfo, ModelUpdate*> {. *--> I was

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread Jörn Franke
Maybe some AWS network-optimized instances with higher bandwidth will improve the situation. > On 27.05.2020 at 19:51, Dark Crusader wrote: > > Hi Jörn, > > Thanks for the reply. I will try to create an easier example to reproduce the > issue. > > I will also try your suggestion to look

Re: [pyspark 2.3+] Dedupe records

2020-05-29 Thread Sonal Goyal
Hi Rishi, 1. DataFrames are RDDs under the cover. If you have unstructured data, or if you know something about the data through which you can optimize the computation, you can go with RDDs. Else DataFrames, which are optimized by Spark SQL, should be fine. 2. For incremental deduplication, I
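[To make point 1 concrete, a hedged sketch of the same key-based dedupe in both APIs; the case class, field names, and the recordsRdd/recordsDf inputs are all hypothetical:]

```scala
case class Rec(id: String, updatedAt: Long)

// RDD flavour: explicit merge logic, here keeping the latest record.
val dedupedRdd = recordsRdd
  .keyBy(_.id)
  .reduceByKey((a, b) => if (a.updatedAt >= b.updatedAt) a else b)
  .values

// DataFrame flavour: Spark SQL plans the shuffle; note dropDuplicates
// keeps an arbitrary row per id, not necessarily the latest one.
val dedupedDf = recordsDf.dropDuplicates("id")
```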