Questions about caching

2018-12-11 Thread Andrew Melo
Greetings, Spark Aficionados- I'm working on a project to (ab-)use PySpark to do particle physics analysis, which involves iterating with a lot of transformations (to apply weights and select candidate events) and reductions (to produce histograms of relevant physics objects). We have a basic

Re: What are the alternatives to nested DataFrames?

2018-12-28 Thread Andrew Melo
Could you join() the DFs on a common key? On Fri, Dec 28, 2018 at 18:35 wrote: > Shabad , I am not sure what you are trying to say. Could you please give > me an example? The result of the Query is a Dataframe that is created after > iterating, so I am not sure how could I map that to a column
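The join-on-a-common-key suggestion, sketched here with pandas for brevity (in Spark the equivalent call is df1.join(df2, on='key'); the column names are hypothetical, since the original DataFrames aren't shown):

```python
import pandas as pd

# Two frames sharing a key column can be combined with a join rather
# than by iterating; 'key', 'total', and 'count' are illustrative names.
df1 = pd.DataFrame({'key': [1, 2], 'total': [10.0, 20.0]})
df2 = pd.DataFrame({'key': [1, 2], 'count': [3, 4]})
joined = df1.merge(df2, on='key')  # Spark: df1.join(df2, on='key')
```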

Re: Where does the Driver run?

2019-03-24 Thread Andrew Melo
Hi Pat, On Sun, Mar 24, 2019 at 1:03 PM Pat Ferrel wrote: > Thanks, I have seen this many times in my research. Paraphrasing docs: “in > deployMode ‘cluster' the Driver runs on a Worker in the cluster” > > When I look at logs I see 2 executors on the 2 slaves (executor 0 and 1 > with addresses

Re: Where does the Driver run?

2019-03-25 Thread Andrew Melo
Hi Pat, Indeed, I don't think that it's possible to use cluster mode w/o spark-submit. All the docs I see appear to always describe needing to use spark-submit for cluster mode -- it's not even compatible with spark-shell. But it makes sense to me -- if you want Spark to run your application's
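For reference, a typical cluster-mode submission goes through spark-submit like this (master host, class name, and jar path are placeholders):

```
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/my-app.jar
```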

Please stop asking to unsubscribe

2019-01-31 Thread Andrew Melo
The correct way to unsubscribe is to mail user-unsubscr...@spark.apache.org Just mailing the list with "unsubscribe" doesn't actually do anything... Thanks Andrew - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Andrew Melo
Hi, On Mon, May 6, 2019 at 11:41 AM Patrick McCarthy wrote: > > Thanks Gourav. > > Incidentally, since the regular UDF is row-wise, we could optimize that a bit > by taking the convert() closure and simply making that the UDF. > > Since there's that MGRS object that we have to create too, we

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-05-06 Thread Andrew Melo
Hi, On Mon, May 6, 2019 at 11:59 AM Gourav Sengupta wrote: > > Hence, what I mentioned initially does sound correct ? I don't agree at all - we've had a significant boost from moving to regular UDFs to pandas UDFs. YMMV, of course. > > On Mon, May 6, 2019 at 5:43 PM Andrew

Re: Connecting to Spark cluster remotely

2019-04-22 Thread Andrew Melo
Hi Rishikesh On Mon, Apr 22, 2019 at 4:26 PM Rishikesh Gawade wrote: > > To put it simply, what are the configurations that need to be done on the > client machine so that it can run driver on itself and executors on > spark-yarn cluster nodes? TBH, if it were me, I would simply SSH to the

Re: can't download 2.4.1 sourcecode

2019-04-22 Thread Andrew Melo
On Mon, Apr 22, 2019 at 10:54 PM yutaochina wrote: > > >

Re: Driver vs master

2019-10-07 Thread Andrew Melo
"local" master and "client mode" then yes tasks execute in the same JVM as the driver". The answer depends on the exact setup Amit has and how the application is configured > HTH... > > Ayan > > > > On Tue, Oct 8, 2019 at 12:11 PM Andrew Melo wrote: > &

Re: Driver vs master

2019-10-07 Thread Andrew Melo
Hi Amit On Mon, Oct 7, 2019 at 18:33 Amit Sharma wrote: > Can you please help me understand this. I believe driver programs runs on > master node If we are running 4 spark jobs and driver memory config is 4g then total 16 > gb would be used of master node. This depends on what master/deploy

Re: Driver vs master

2019-10-07 Thread Andrew Melo
a time. I understand that. I think there's a misunderstanding with the terminology, though. Are you running multiple separate spark instances on a single machine or one instance with multiple jobs inside. > > On Monday, October 7, 2019, Andrew Melo wrote: > >> Hi Amit >>

Re: Reading 7z file in spark

2020-01-14 Thread Andrew Melo
It only makes sense if the underlying file is also splittable, and even then, it doesn't really do anything for you if you don't explicitly tell spark about the split boundaries On Tue, Jan 14, 2020 at 7:36 PM Someshwar Kale wrote: > I would suggest to use other compression technique which is

Re: Scala version compatibility

2020-04-06 Thread Andrew Melo
) Thanks for your help, Andrew > On Mon, Apr 6, 2020 at 3:50 PM Andrew Melo wrote: > >> Hello all, >> >> I'm aware that Scala is not binary compatible between revisions. I have >> some Java code whose only Scala dependency is the transitive dependency >> thr

Re: Scala version compatibility

2020-04-06 Thread Andrew Melo
not included in my artifact except for this single callsite. Thanks Andrew > On Mon, Apr 6, 2020 at 4:16 PM Andrew Melo wrote: > >> >> >> On Mon, Apr 6, 2020 at 3:08 PM Koert Kuipers wrote: >> >>> yes it will >>> >>> >> Ooof, I

Scala version compatibility

2020-04-06 Thread Andrew Melo
Hello all, I'm aware that Scala is not binary compatible between revisions. I have some Java code whose only Scala dependency is the transitive dependency through Spark. This code calls a Spark API which returns a Seq, which I then convert into a List with JavaConverters.seqAsJavaListConverter.

Supporting Kryo registration in DSv2

2020-03-26 Thread Andrew Melo
Hello all, Is there a way to register classes within a datasourcev2 implementation in the Kryo serializer? I've attempted the following in both the constructor and static block of my toplevel class: SparkContext context = SparkContext.getOrCreate(); SparkConf conf =
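For context, Kryo registration is normally driven by configuration that Spark reads when the context starts, so mutating the conf from inside a data source after SparkContext.getOrCreate() is generally too late. A sketch of the configuration route, spark-defaults.conf style (the class name is a placeholder):

```
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.classesToRegister     com.example.MyArrowBackedClass
spark.kryo.registrationRequired  true
```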

Optimizing LIMIT in DSv2

2020-03-30 Thread Andrew Melo
Hello, Executing "SELECT Muon_Pt FROM rootDF LIMIT 10", where "rootDF" is a temp view backed by a DSv2 reader yields the attached plan [1]. It appears that the initial stage is run over every partition in rootDF, even though each partition has 200k rows (modulo the last partition which holds the

Re: Can I collect Dataset[Row] to driver without converting it to Array [Row]?

2020-04-22 Thread Andrew Melo
Hi Maqy On Wed, Apr 22, 2020 at 3:24 AM maqy <454618...@qq.com> wrote: > > I will traverse this Dataset to convert it to Arrow and send it to Tensorflow > through Socket. (I presume you're using the python tensorflow API, if you're not, please ignore) There is a JIRA/PR ([1] [2]) which

Re: REST Structured Steaming Sink

2020-07-01 Thread Andrew Melo
On Wed, Jul 1, 2020 at 8:13 PM Burak Yavuz wrote: > > I'm not sure having a built-in sink that allows you to DDOS servers is the > best idea either. foreachWriter is typically used for such use cases, not > foreachBatch. It's also pretty hard to guarantee exactly-once, rate limiting, > etc.

PySpark aggregation w/pandas_udf

2020-07-15 Thread Andrew Melo
Hi all, For our use case, we would like to perform an aggregation using a pandas_udf with dataframes that have O(100m) rows and a few 10s of columns. Conceptually, this looks a bit like pyspark.RDD.aggregate, where the user provides: * A "seqOp" which accepts pandas series(*) and outputs an
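The seqOp/combOp shape described above, sketched with plain Python lists standing in for pandas Series (the data and the summing reduction are illustrative; the real use case would fill histograms):

```python
# Each partition is folded into an accumulator by seq_op, then the
# per-partition accumulators are merged pairwise by comb_op --
# the same contract as pyspark.RDD.aggregate.
def seq_op(acc, values):
    # fold one partition's rows into the running accumulator
    return acc + sum(values)

def comb_op(acc1, acc2):
    # merge two partial accumulators
    return acc1 + acc2

partitions = [[1.0, 2.0], [3.0], [4.0, 5.0]]
partials = [seq_op(0.0, part) for part in partitions]
total = partials[0]
for p in partials[1:]:
    total = comb_op(total, p)
```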

Re: Spark DataFrame Creation

2020-07-22 Thread Andrew Melo
Hi Mark, On Wed, Jul 22, 2020 at 4:49 PM Mark Bidewell wrote: > > Sorry if this is the wrong place for this. I am trying to debug an issue > with this library: > https://github.com/springml/spark-sftp > > When I attempt to create a dataframe: > > spark.read. >

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
t merge join Cheers Andrew On Wed, May 12, 2021 at 11:32 AM Andrew Melo wrote: > > Hi, > > In the case where the left and right hand side share a common parent like: > > df = spark.read.someDataframe().withColumn('rownum', row_number()) > df1 = df.withColumn('c1', expensive_

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
ch. A big goal of mine is to make it so that what was changed is recomputed, and no more, which will speed up the rate at which we can find new physics. Cheers Andrew > > On 5/17/21, 2:56 PM, "Andrew Melo" wrote: > > CAUTION: This email originated from outside of t

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
explicitly compute them themselves. Cheers Andrew On Mon, May 17, 2021 at 1:10 PM Sean Owen wrote: > > Why join here - just add two columns to the DataFrame directly? > > On Mon, May 17, 2021 at 1:04 PM Andrew Melo wrote: >> >> Anyone have ideas about the below Q? >>

Re: Merge two dataframes

2021-05-12 Thread Andrew Melo
Hi, In the case where the left and right hand side share a common parent like: df = spark.read.someDataframe().withColumn('rownum', row_number()) df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum') df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum')
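One way to sidestep the join entirely, since both branches share the parent df, is to derive both columns on the parent directly; sketched in pandas with hypothetical stand-ins for the expensive UDFs:

```python
import pandas as pd

# expensive_udf1/expensive_udf2 are placeholders for the costly
# per-row computations from the thread.
def expensive_udf1(s):
    return s * 2

def expensive_udf2(s):
    return s + 10

df = pd.DataFrame({'foo': [1, 2], 'bar': [3, 4]})
# Add both derived columns on the shared parent; no join (and no
# recomputation of the parent) is needed.
df = df.assign(c1=expensive_udf1(df['foo']), c2=expensive_udf2(df['bar']))
```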

Grabbing the current MemoryManager in a plugin

2022-04-08 Thread Andrew Melo
Hello, I've implemented support for my DSv2 plugin to back its storage with ArrowColumnVectors, which necessarily means using off-heap memory. Is it possible to somehow grab either a reference to the current MemoryManager so that the off-heap memory usage is properly accounted for and to prevent

Re: Grabbing the current MemoryManager in a plugin

2022-04-13 Thread Andrew Melo
Hello, Any wisdom on the question below? Thanks Andrew On Fri, Apr 8, 2022 at 16:04 Andrew Melo wrote: > Hello, > > I've implemented support for my DSv2 plugin to back its storage with > ArrowColumnVectors, which necessarily means using off-heap memory. Is > it possible to some

Re: cannot access class sun.nio.ch.DirectBuffer

2022-04-13 Thread Andrew Melo
Hi Sean, Out of curiosity, will Java 11+ always require special flags to access the unsafe direct memory interfaces, or is this something that will either be addressed by the spec (by making an "approved" interface) or by libraries (with some other workaround)? Thanks Andrew On Tue, Apr 12,
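For reference, on modules-enforcing JVMs Spark's own launcher passes a set of --add-opens flags; a JVM embedding Spark directly would need the same treatment (illustrative subset, not the full list):

```
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
```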

Re: cannot access class sun.nio.ch.DirectBuffer

2022-04-13 Thread Andrew Melo
. > > On Wed, Apr 13, 2022 at 9:05 AM Andrew Melo wrote: > >> Hi Sean, >> >> Out of curiosity, will Java 11+ always require special flags to access >> the unsafe direct memory interfaces, or is this something that will either >> be addressed by the spec (by

Re: How is union() implemented? Need to implement column bind

2022-04-20 Thread Andrew Melo
It would certainly be useful for our domain to have some sort of native cbind(). Is there a fundamental disapproval of adding that functionality, or is it just a matter of nobody implementing it? On Wed, Apr 20, 2022 at 16:28 Sean Owen wrote: > Good lead, pandas on Spark concat() is worth
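For reference, the concat() under discussion is the pandas column-bind, which pandas-on-Spark exposes with the same axis=1 API; sketched here with plain pandas:

```python
import pandas as pd

# cbind: stack two frames side by side, aligned by position/index.
left = pd.DataFrame({'a': [1, 2, 3]})
right = pd.DataFrame({'b': [4, 5, 6]})
bound = pd.concat([left, right], axis=1)
```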

Re: [EXTERNAL] RDD.pipe() for binary data

2022-07-16 Thread Andrew Melo
I'm curious about using shared memory to speed up the JVM->Python round trip. Is there any sane way to do anonymous shared memory in Java/Scala? On Sat, Jul 16, 2022 at 16:10 Sebastian Piu wrote: > Other alternatives are to look at how PythonRDD does it in spark, you > could also try to go for


PySpark cores

2022-07-28 Thread Andrew Melo
Hello, Is there a way to tell Spark that PySpark (arrow) functions use multiple cores? If we have an executor with 8 cores, we would like to have a single PySpark function use all 8 cores instead of having 8 single core python functions run. Thanks! Andrew
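One knob that approximates this is spark.task.cpus, which makes each task reserve several cores, so a single Python function gets the whole executor (spark-defaults.conf style; values are illustrative):

```
spark.executor.cores 8
spark.task.cpus      8
```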

Re: Profiling PySpark Pandas UDF

2022-08-25 Thread Andrew Melo
Hi Gourav, Since Koalas needs the same round-trip to/from JVM and Python, I expect that the performance should be nearly the same for UDFs in either API Cheers Andrew On Thu, Aug 25, 2022 at 11:22 AM Gourav Sengupta wrote: > > Hi, > > May be I am jumping to conclusions and making stupid

Re: [Spark Sql] Global Setting for Case-Insensitive String Compare

2022-11-21 Thread Andrew Melo
I think this is the right place, just a hard question :) As far as I know, there's no "case insensitive flag", so YMMV On Mon, Nov 21, 2022 at 5:40 PM Patrick Tucci wrote: > > Is this the wrong list for this type of question? > > On 2022/11/12 16:34:48 Patrick Tucci wrote: > > Hello, > > > >
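Absent a global flag, the usual workaround is to normalize case on both sides of the comparison, e.g. WHERE lower(col) = lower('SomeValue') in Spark SQL; the idea in plain Python:

```python
# Case-insensitive string equality by normalizing both operands,
# mirroring the lower()-on-both-sides trick in SQL.
def ci_equal(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()
```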