Re: Spark writing API

2023-08-16 Thread Andrew Melo
like with arrow's off-heap storage), it's crazy inefficient to try and do the equivalent of realloc() to grow the buffer size. Thanks Andrew > On Mon, Aug 7, 2023 at 8:27 PM Steve Loughran > wrote: > >> >> >> On Thu, 1 Jun 2023 at 00:58, Andrew Melo wrote: >>
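A toy sketch of the copy-on-grow cost being described, using plain NIO direct buffers rather than anything Arrow-specific (the class and helper names are made up): off-heap allocations cannot be resized in place, so every realloc-style grow is a fresh allocation plus a full copy of everything already written.

    import java.nio.ByteBuffer;

    public class GrowDemo {
        // Replaces the buffer when it cannot hold `needed` more bytes.
        static ByteBuffer ensureCapacity(ByteBuffer buf, int needed) {
            if (buf.remaining() >= needed) {
                return buf;
            }
            ByteBuffer bigger = ByteBuffer.allocateDirect(
                Math.max(buf.capacity() * 2, buf.capacity() + needed));
            buf.flip();       // switch the old buffer to read mode
            bigger.put(buf);  // copies every byte written so far
            return bigger;
        }
    }

Growing in small fixed increments makes total copying quadratic in the data size, which is why columnar writers generally preallocate or chunk instead.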

Re: Spark writing API

2023-08-02 Thread Andrew Melo
Hello Spark Devs Could anyone help me with this? Thanks, Andrew On Wed, May 31, 2023 at 20:57 Andrew Melo wrote: > Hi all > > I've been developing for some time a Spark DSv2 plugin "Laurelin" ( > https://github.com/spark-root/laurelin > ) to read the ROOT (https

Spark writing API

2023-05-31 Thread Andrew Melo
Hi all I've been developing for some time a Spark DSv2 plugin "Laurelin" ( https://github.com/spark-root/laurelin ) to read the ROOT (https://root.cern) file format (which is used in high energy physics). I've recently presented my work at a conference (

Re: Spark on Kube (virtual) coffee/tea/pop times

2023-02-07 Thread Andrew Melo
I'm Central US time (AKA UTC -6:00) On Tue, Feb 7, 2023 at 5:32 PM Holden Karau wrote: > > Awesome, I guess I should have asked folks for timezones that they’re in. > > On Tue, Feb 7, 2023 at 3:30 PM Andrew Melo wrote: >> >> Hello Holden, >> >> We are inter

Re: Spark on Kube (virtual) coffee/tea/pop times

2023-02-07 Thread Andrew Melo
Hello Holden, We are interested in Spark on k8s and would like the opportunity to speak with devs about what we're looking for slash better ways to use spark. Thanks! Andrew On Tue, Feb 7, 2023 at 5:24 PM Holden Karau wrote: > > Hi Folks, > > It seems like we could maybe use some additional

Re: Apache Spark 3.3 Release

2022-03-16 Thread Andrew Melo
Hello, I've been trying for a bit to get the following two PRs merged and into a release, and I'm having some difficulty moving them forward: https://github.com/apache/spark/pull/34903 - This passes the current python interpreter to spark-env.sh to allow some currently-unavailable customization

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Andrew Melo
HTH Andrew On Tue, Aug 17, 2021 at 2:29 PM Mich Talebzadeh wrote: > Hi Andrew, > > Can you please elaborate on blowing pip cache before committing the layer? > > Thanks, > > Mich > > On Tue, 17 Aug 2021 at 16:57, Andrew Melo wrote: > >> Silly Q, did

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Andrew Melo
Silly Q, did you blow away the pip cache before committing the layer? That always trips me up. Cheers Andrew On Tue, Aug 17, 2021 at 10:56 Mich Talebzadeh wrote: > With no additional python packages etc we get 1.4GB compared to 2.19GB > before > > REPOSITORY TAG

Re: WholeStageCodeGen + DSv2

2021-05-19 Thread Andrew Melo
reproduce the issue you described? >> >> Bests, >> Takeshi >> >> On Wed, May 19, 2021 at 11:38 AM Andrew Melo wrote: >>> >>> Hello, >>> >>> When reading a very wide (> 1000 cols) input, WholeStageCodeGen blows >>> past

WholeStageCodeGen + DSv2

2021-05-18 Thread Andrew Melo
Hello, When reading a very wide (> 1000 cols) input, WholeStageCodeGen blows past the 64kB source limit and fails. Looking at the generated code, a big part of the code is simply the DSv2 convention that the codegen'd variable names are the same as the columns instead of something more compact
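The usual mitigations for this are configuration-level. A minimal sketch, assuming a plain SparkSession and standard Spark SQL config keys (nothing specific to this data source):

    import org.apache.spark.sql.SparkSession;

    public class CodegenWorkaround {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .master("local[*]").getOrCreate();
            // Skip whole-stage codegen and use the interpreted path:
            spark.conf().set("spark.sql.codegen.wholeStage", "false");
            // Or lower the bytecode threshold at which Spark abandons a
            // single huge generated method (default 65535):
            spark.conf().set("spark.sql.codegen.hugeMethodLimit", "8000");
        }
    }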

Secrets store for DSv2

2021-05-18 Thread Andrew Melo
Hello, When implementing a DSv2 datasource, where is an appropriate place to store/transmit secrets from the driver to the executors? Is there built-in spark functionality for that, or is my best bet to stash it as a member variable in one of the classes that gets sent to the executors? Thanks!
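A minimal sketch of the member-variable route, assuming the 2.4-era DSv2 API: InputPartition is java.io.Serializable, so a field set on the driver is deserialized on the executor. TokenCarryingReader is hypothetical, and note the secret travels inside serialized task data rather than any hardened store.

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.sources.v2.reader.InputPartition;
    import org.apache.spark.sql.sources.v2.reader.InputPartitionReader;

    public class TokenCarryingPartition implements InputPartition<InternalRow> {
        private final String authToken;  // captured on the driver

        public TokenCarryingPartition(String authToken) {
            this.authToken = authToken;
        }

        @Override
        public InputPartitionReader<InternalRow> createPartitionReader() {
            // Runs on the executor; the deserialized field is visible here.
            return new TokenCarryingReader(authToken);  // hypothetical reader
        }
    }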

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Andrew Melo
Hi, Integrating Koalas with pyspark might help enable a richer integration between the two. Something that would be useful with a tighter integration is support for custom column array types. Currently, Spark takes dataframes, converts them to arrow buffers then transmits them over the socket to

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-16 Thread Andrew Melo
Hello Ryan, This proposal looks very interesting. Would future goals for this functionality include both support for aggregation functions, as well as support for processing ColumnBatch-es (instead of Row/InternalRow)? Thanks Andrew On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue wrote: > > Thanks

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-24 Thread Andrew Melo
Hello, On Wed, Jun 24, 2020 at 2:13 PM Holden Karau wrote: > > So I thought our theory for the pypi packages was it was for local > developers, they really shouldn't care about the Hadoop version. If you're > running on a production cluster you ideally pip install from the same release >

Re: DSv2 & DataSourceRegister

2020-04-16 Thread Andrew Melo
Hi again, Does anyone have thoughts on either the idea or the implementation? Thanks, Andrew On Thu, Apr 9, 2020 at 11:32 PM Andrew Melo wrote: > > Hi all, > > I've opened a WIP PR here https://github.com/apache/spark/pull/28159 > I'm a novice at Scala, so I'm sure the code

Re: DSv2 & DataSourceRegister

2020-04-09 Thread Andrew Melo
Thanks again, Andrew On Wed, Apr 8, 2020 at 10:27 AM Andrew Melo wrote: > > On Wed, Apr 8, 2020 at 8:35 AM Wenchen Fan wrote: > > > > It would be good to support your use case, but I'm not sure how to > > accomplish that. Can you open a PR so that we can di

Re: DSv2 & DataSourceRegister

2020-04-08 Thread Andrew Melo
> On Wed, Apr 8, 2020 at 1:12 PM Andrew Melo wrote: >> >> Hello >> >> On Tue, Apr 7, 2020 at 23:16 Wenchen Fan wrote: >>> >>> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not >>> sure this is possible as the DS V2 AP

Re: DSv2 & DataSourceRegister

2020-04-07 Thread Andrew Melo
n from META-INF and pass in the full class name to the DataFrameReader. Thanks Andrew > On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo wrote: > >> Hi Ryan, >> >> On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue wrote: >> > >> > Hi Andrew, >> > >>

Re: DSv2 & DataSourceRegister

2020-04-07 Thread Andrew Melo
both interfaces. Thanks again, Andrew > > On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo wrote: >> >> Hi all, >> >> I posted an improvement ticket in JIRA and Hyukjin Kwon requested I >> send an email to the dev list for discussion. >> >> As the DSv2

DSv2 & DataSourceRegister

2020-04-07 Thread Andrew Melo
Hi all, I posted an improvement ticket in JIRA and Hyukjin Kwon requested I send an email to the dev list for discussion. As the DSv2 API evolves, some breaking changes are occasionally made to the API. It's possible to split a plugin into a "common" part and multiple version-specific parts and
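A minimal sketch of the registration mechanism in question, with a hypothetical class org.example.RootDataSource: Spark discovers short names through a standard ServiceLoader file, and that file is exactly what forces every listed class to be linkable.

    // src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
    // contains one line: org.example.RootDataSource

    package org.example;

    import org.apache.spark.sql.sources.DataSourceRegister;
    import org.apache.spark.sql.sources.v2.DataSourceV2;

    public class RootDataSource implements DataSourceV2, DataSourceRegister {
        @Override
        public String shortName() {
            return "root";  // enables spark.read().format("root")
        }
    }

Without the services entry, callers can still load the source by its full class name, e.g. spark.read().format("org.example.RootDataSource"), which is the workaround mentioned earlier in the thread.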

Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-03-14 Thread Andrew Melo
idle and the desire to increase utilization. Thanks Andrew Sean > > On Fri, Mar 13, 2020 at 6:33 PM Andrew Melo wrote: > > > > Hi Xingbo, Sean, > > > > On Fri, Mar 13, 2020 at 12:31 PM Xingbo Jiang > wrote: > >> > >> Andrew, could you provide mor

Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-03-13 Thread Andrew Melo
dedicated k8s/mesos/yarn clusters we use for prototyping > Thanks, > > Xingbo > > On Fri, Mar 13, 2020 at 10:23 AM Sean Owen wrote: > >> You have multiple workers in one Spark (standalone) app? this wouldn't >> prevent N apps from each having a worker on a machine. >>

Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-03-13 Thread Andrew Melo
Hello, On Fri, Feb 28, 2020 at 13:21 Xingbo Jiang wrote: > Hi all, > > Based on my experience, there is no scenario that necessarily requires > deploying multiple Workers on the same node with Standalone backend. A > worker should book all the resources reserved to Spark on the host it is >

Re: how to get partition column info in Data Source V2 writer

2019-12-17 Thread Andrew Melo
Hi Aakash On Tue, Dec 17, 2019 at 12:42 PM aakash aakash wrote: > Hi Spark dev folks, > > First of all kudos on this new Data Source v2, API looks simple and it > makes it easy to develop a new data source and use it. > > With my current work, I am trying to implement a new data source V2 writer >

Re: DSv2 reader lifecycle

2019-11-06 Thread Andrew Melo
they are created. > That's good to know, I'll search around JIRA for docs describing that functionality. Thanks again, Andrew > > rb > > On Tue, Nov 5, 2019 at 4:58 PM Andrew Melo wrote: > >> Hello, >> >> During testing of our DSv2 implementation (on 2.4.3 FW

DSv2 reader lifecycle

2019-11-05 Thread Andrew Melo
Hello, During testing of our DSv2 implementation (on 2.4.3 FWIW), it appears that our DataSourceReader is being instantiated multiple times for the same dataframe. For example, the following snippet Dataset df = spark .read()
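As a hedged illustration (the original snippet above is truncated in the archive), one way to make the behavior visible is to count constructor calls, assuming the 2.4-era API and made-up class names:

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
    import org.apache.spark.sql.sources.v2.reader.InputPartition;
    import org.apache.spark.sql.types.StructType;

    public class CountingReader implements DataSourceReader {
        private static final AtomicInteger INSTANCES = new AtomicInteger();

        public CountingReader() {
            // Per the report, this can print several times for one dataframe.
            System.err.println("reader instance #" + INSTANCES.incrementAndGet());
        }

        @Override
        public StructType readSchema() {
            return new StructType();  // stub for illustration
        }

        @Override
        public List<InputPartition<InternalRow>> planInputPartitions() {
            return Collections.emptyList();  // stub for illustration
        }
    }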

Re: Exposing functions to pyspark

2019-10-08 Thread Andrew Melo
:48 PM Andrew Melo wrote: > > Hello, > > I'm working on a DSv2 implementation with a userbase that is 100% pyspark > based. > > There's some interesting additional DS-level functionality I'd like to > expose from the Java side to pyspark -- e.g. I/O metrics, which source

Exposing functions to pyspark

2019-09-30 Thread Andrew Melo
Hello, I'm working on a DSv2 implementation with a userbase that is 100% pyspark based. There's some interesting additional DS-level functionality I'd like to expose from the Java side to pyspark -- e.g. I/O metrics, which source site provided the data, etc... Does someone have an example of
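One pattern that comes up for this, sketched with hypothetical names: keep the extra functionality as ordinary public methods on the Java side and call them from Python over the py4j gateway that pyspark already carries (spark._jvm is technically a private attribute, so treat it accordingly).

    package org.example;

    // Hypothetical metrics holder; any public method is reachable over py4j.
    public class RootIOMetrics {
        private long bytesRead = 0;

        public void record(long n) { bytesRead += n; }
        public long bytesRead()   { return bytesRead; }
    }

    // From the pyspark side (shown as comments to keep one language here):
    //     metrics = spark._jvm.org.example.RootIOMetrics()
    //     metrics.record(1024)
    //     print(metrics.bytesRead())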

Re: Thoughts on Spark 3 release, or a preview release

2019-09-13 Thread Andrew Melo
Hi Spark Aficionados- On Fri, Sep 13, 2019 at 15:08 Ryan Blue wrote: > +1 for a preview release. > > DSv2 is quite close to being ready. I can only think of a couple issues > that we need to merge, like getting a fix for stats estimation done. I'll > have a better idea once I've caught up from

DSV2 API Question

2019-06-25 Thread Andrew Melo
Hello, I've (nearly) implemented a DSV2-reader interface to read particle physics data stored in the ROOT (https://root.cern.ch/) file format. You can think of these ROOT files as roughly parquet-like: column-wise and nested (i.e. a column can be of type "float[]", meaning each row in the column
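For the array-valued columns described here, the natural Spark schema mapping is ArrayType; a minimal sketch with made-up column names:

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class RootSchemaSketch {
        // One row per event; "muon_pt" holds a variable-length float[] per row.
        static StructType eventSchema() {
            return new StructType()
                .add("run", DataTypes.IntegerType)
                .add("muon_pt",
                     DataTypes.createArrayType(DataTypes.FloatType, false));
        }
    }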

Re: Detect executor core count

2019-06-18 Thread Andrew Melo
> >> case _: NoSuchElementException => >> >> // If spark.executor.cores is not defined, get the cores per JVM >> >> import spark.implicits._ >> >> val numMachineCores = spark.range(0, 1) >> >>

Detect executor core count

2019-06-18 Thread Andrew Melo
Hello, Is there a way to detect the number of cores allocated for an executor within a java-based InputPartitionReader? Thanks! Andrew
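A hedged sketch of one approach: since the partition reader executes inside the executor JVM, it can consult the executor's SparkConf through SparkEnv (an internal API with no stability guarantee) and fall back to the processor count the JVM sees.

    import org.apache.spark.SparkEnv;

    public class ExecutorCores {
        static int executorCores() {
            // spark.executor.cores is unset in some deployments (e.g. local
            // mode), hence the fallback.
            return SparkEnv.get().conf().getInt(
                "spark.executor.cores",
                Runtime.getRuntime().availableProcessors());
        }
    }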

Re: DataSourceV2Reader Q

2019-05-21 Thread Andrew Melo
which I was improperly passing in instead of Metadata.empty() Thanks again, Andrew > > On Tue, May 21, 2019 at 11:39 AM Andrew Melo wrote: >> >> Hello, >> >> I'm developing a DataSourceV2 reader for the ROOT (https://root.cern/) >> file format to replace a previous DSV
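For anyone landing on this thread with the same symptom, the fix referenced above amounts to never passing null where StructField expects metadata; a minimal sketch:

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;

    public class FieldSketch {
        // Wrong: new StructField("pt", DataTypes.FloatType, false, null)
        static StructField pt =
            new StructField("pt", DataTypes.FloatType, false, Metadata.empty());
    }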

DataSourceV2Reader Q

2019-05-21 Thread Andrew Melo
Hello, I'm developing a DataSourceV2 reader for the ROOT (https://root.cern/) file format to replace a previous DSV1 source that was in use before. I have a bare skeleton of the reader, which can properly load the files and pass their schema into Spark 2.4.3, but any operation on the resulting

DataSourceV2 exceptions

2019-04-08 Thread Andrew Melo
Hello, I'm developing a (java) DataSourceV2 to read a columnar file format popular in a number of physical sciences (https://root.cern.ch/). (I also understand that the API isn't fixed and subject to change). My question is -- what is the expected way to transmit exceptions from the DataSource up
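A hedged sketch of the conventional route, assuming the 2.4-era reader API: next() is declared to throw IOException, so throwing there fails the task, and once retries are exhausted the error surfaces on the driver wrapped in a SparkException. The class below is purely illustrative.

    import java.io.IOException;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.sources.v2.reader.InputPartitionReader;

    public class FailingReader implements InputPartitionReader<InternalRow> {
        @Override
        public boolean next() throws IOException {
            // Fails this task; Spark retries it and then aborts the job,
            // propagating the cause back to the driver.
            throw new IOException("cannot decode file header");
        }

        @Override
        public InternalRow get() { throw new IllegalStateException("no rows"); }

        @Override
        public void close() {}
    }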

Re: [ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-05 Thread Andrew Melo
On Fri, Apr 5, 2019 at 9:41 AM Jungtaek Lim wrote: > > Thanks Andrew for reporting this. I just submitted the fix. > https://github.com/apache/spark/pull/24304 Thanks! > > On Fri, Apr 5, 2019 at 3:21 PM Andrew Melo wrote: >> >> Hello, >> >> I'm not sur

Re: [ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-05 Thread Andrew Melo
Hello, I'm not sure if this is the proper place to report it, but the 2.4.1 version of the config docs apparently didn't render right into HTML (scroll down to "Compression and Serialization") https://spark.apache.org/docs/2.4.1/configuration.html#available-properties By comparison, the 2.4.0

Re: SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Andrew Melo
Hi, On Fri, Mar 1, 2019 at 9:48 AM Xingbo Jiang wrote: > > Hi Sean, > > To support GPU scheduling with YARN cluster, we have to update the hadoop > version to 3.1.2+. However, if we decide to not upgrade hadoop to beyond that > version for Spark 3.0, then we just have to disable/fallback the

Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
we'll need to calculate the sum of their 4-d momenta, while samples with <2 electrons will need to subtract two different physical quantities -- several more steps before we get to the point where we'll histogram the different subsamples for the outputs. Cheers Andrew > > On Mon, Feb 4, 2019 at

Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
so it's possible we're not using it correctly. Cheers Andrew > rb > > On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo wrote: >> >> Hello >> >> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini wrote: >> > >> > I've seen many application need to split data

Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
Hello On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini wrote: > > I've seen many applications that need to split a dataset into multiple datasets based > on some conditions. As there is no method to do it in one place, developers > use the filter method multiple times. I think it can be useful to have a method
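The multiple-filter pattern being described, sketched with hypothetical column names; each filter yields its own dataset and (absent caching) its own scan of the input, which is the duplicated work a split-style method would avoid.

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class SplitByFilter {
        // Given some Dataset<Row> of events, the two-filter idiom:
        static void split(Dataset<Row> events) {
            Dataset<Row> twoPlusElectrons = events.filter(col("n_electrons").geq(2));
            Dataset<Row> fewerElectrons   = events.filter(col("n_electrons").lt(2));
            twoPlusElectrons.count();  // each action re-reads the source
            fewerElectrons.count();
        }
    }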

Re: SparkContext singleton get w/o create?

2018-08-27 Thread Andrew Melo
just getting started). > > On Mon, Aug 27, 2018 at 12:18 PM Andrew Melo wrote: >> >> Hi Holden, >> >> I'm agnostic to the approach (though it seems cleaner to have an >> explicit API for it). If you would like, I can take that JIRA and >> implement it

Re: SparkContext singleton get w/o create?

2018-08-27 Thread Andrew Melo
probably add `getActiveSession` to the PySpark > API (filed a starter JIRA https://issues.apache.org/jira/browse/SPARK-25255 > ) > > On Mon, Aug 27, 2018 at 12:09 PM Andrew Melo wrote: >> >> Hello Sean, others - >> >> Just to confirm, is it OK for client

Re: SparkContext singleton get w/o create?

2018-08-27 Thread Andrew Melo
, 2018 at 5:52 PM, Andrew Melo wrote: > Hi Sean, > > On Tue, Aug 7, 2018 at 5:44 PM, Sean Owen wrote: >> Ah, python. How about SparkContext._active_spark_context then? > > Ah yes, that looks like the right member, but I'm a bit wary about > depending on functionality

Re: SparkContext singleton get w/o create?

2018-08-07 Thread Andrew Melo
; and subject to change. Is that something I should be unconcerned about? The other thought is that the accesses within SparkContext are protected by "SparkContext._lock" -- should I also use that lock? Thanks for your help! Andrew > > On Tue, Aug 7, 2018 at 5:34 PM Andr

Re: SparkContext singleton get w/o create?

2018-08-07 Thread Andrew Melo
ion and causing a JVM to start. Is there an easy way to call getActiveSession that doesn't start a JVM? Cheers Andrew > > On Tue, Aug 7, 2018 at 5:11 PM Andrew Melo wrote: >> >> Hello, >> >> One pain point with various Jupyter extensions [1][2] that provide >&

SparkContext singleton get w/o create?

2018-08-07 Thread Andrew Melo
Hello, One pain point with various Jupyter extensions [1][2] that provide visual feedback about running spark processes is the lack of a public API to introspect the web URL. The notebook server needs to know the URL to find information about the current SparkContext. Simply looking for