like with arrow's off-heap storage), it's crazy inefficient to try to do the
equivalent of realloc() to grow the buffer size.
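A minimal illustrative sketch (not from the original mail; the helper and capacity
handling are made up) of what the realloc()-equivalent on an off-heap-style buffer
amounts to -- there is no in-place grow, so every growth is an allocate-and-copy:

    import java.nio.ByteBuffer;

    class GrowBuffer {
        // "Growing" a direct buffer means allocating a larger one and copying every
        // byte already written; the cost is O(n) each time the capacity is exceeded.
        static ByteBuffer grow(ByteBuffer old, int newCapacity) {
            ByteBuffer bigger = ByteBuffer.allocateDirect(newCapacity);
            old.flip();          // switch the old buffer from writing to reading
            bigger.put(old);     // copy the existing contents into the new buffer
            return bigger;       // the old allocation is simply discarded
        }
    }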
Thanks
Andrew
> On Mon, Aug 7, 2023 at 8:27 PM Steve Loughran
> wrote:
>
>>
>>
>> On Thu, 1 Jun 2023 at 00:58, Andrew Melo wrote:
>>
Hello Spark Devs
Could anyone help me with this?
Thanks,
Andrew
On Wed, May 31, 2023 at 20:57 Andrew Melo wrote:
> Hi all
>
> I've been developing for some time a Spark DSv2 plugin "Laurelin" (
> https://github.com/spark-root/laurelin
> ) to read the ROOT (https
Hi all
I've been developing for some time a Spark DSv2 plugin "Laurelin" (
https://github.com/spark-root/laurelin
) to read the ROOT (https://root.cern) file format (which is used in high
energy physics). I've recently presented my work in a conference (
I'm Central US time (AKA UTC -6:00)
On Tue, Feb 7, 2023 at 5:32 PM Holden Karau wrote:
>
> Awesome, I guess I should have asked folks for timezones that they’re in.
>
> On Tue, Feb 7, 2023 at 3:30 PM Andrew Melo wrote:
>>
>> Hello Holden,
>>
>> We are inter
Hello Holden,
We are interested in Spark on k8s and would like the opportunity to
speak with devs about what we're looking for and better ways to use
Spark.
Thanks!
Andrew
On Tue, Feb 7, 2023 at 5:24 PM Holden Karau wrote:
>
> Hi Folks,
>
> It seems like we could maybe use some additional
Hello,
I've been trying for a while to get the following two PRs merged and
included in a release, and I'm having some difficulty moving them forward:
https://github.com/apache/spark/pull/34903 - This passes the current
python interpreter to spark-env.sh to allow some currently-unavailable
customization
HTH
Andrew
On Tue, Aug 17, 2021 at 2:29 PM Mich Talebzadeh
wrote:
> Hi Andrew,
>
> Can you please elaborate on blowing pip cache before committing the layer?
>
> Thanks,
>
> Much
>
> On Tue, 17 Aug 2021 at 16:57, Andrew Melo wrote:
>
>> Silly Q, did
Silly Q, did you blow away the pip cache before committing the layer? That
always trips me up.
Cheers
Andrew
On Tue, Aug 17, 2021 at 10:56 Mich Talebzadeh
wrote:
> With no additional python packages etc we get 1.4GB compared to 2.19GB
> before
>
> REPOSITORY TAG
eproduce the issue you described?
>>
>> Bests,
>> Takeshi
>>
>> On Wed, May 19, 2021 at 11:38 AM Andrew Melo wrote:
>>>
>>> Hello,
>>>
>>> When reading a very wide (> 1000 cols) input, WholeStageCodeGen blows
>>> past
Hello,
When reading a very wide (> 1000 cols) input, WholeStageCodeGen blows
past the 64kB source limit and fails. Looking at the generated code, a
large part of the code comes simply from the DSv2 convention that the
codegen'd variable names are the same as the column names instead of
something more compact.
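A hypothetical reproduction sketch (the column names, types, and count are invented)
of the kind of width involved; a codegen'd stage over a source this wide can push a
single generated Java method toward the 64kB limit:

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    import java.util.stream.IntStream;

    class WideSchema {
        // Build a schema with a couple of thousand columns, similar in spirit to the
        // wide physics inputs described above.
        static StructType wideSchema(int numCols) {
            StructField[] fields = IntStream.range(0, numCols)
                .mapToObj(i -> DataTypes.createStructField("col_" + i, DataTypes.DoubleType, true))
                .toArray(StructField[]::new);
            return new StructType(fields);
        }
    }

(One workaround sometimes suggested for this failure mode is disabling whole-stage
codegen via spark.sql.codegen.wholeStage=false, at the cost of the fused codegen path.)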
Hello,
When implementing a DSv2 datasource, where is an appropriate place to
store/transmit secrets from the driver to the executors? Is there
built-in spark functionality for that, or is my best bet to stash it
as a member variable in one of the classes that gets sent to the
executors?
Thanks!
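One pattern that is sometimes used for this (an assumption on my part, not an answer
confirmed in this thread) is to capture the secret as an ordinary field of one of the
serializable objects the driver hands to Spark, e.g. the InputPartition / reader
factory, so it ships to the executors with the task. A minimal sketch with made-up
names:

    import java.io.Serializable;

    // Hypothetical carrier class: anything Spark serializes from the driver to the
    // executors can hold the secret as a plain field.
    class SecretCarryingPartition implements Serializable {
        private final String authToken;               // set on the driver

        SecretCarryingPartition(String authToken) {
            this.authToken = authToken;
        }

        String authToken() {                          // read back inside the executor-side reader
            return authToken;
        }
    }

Note that the secret then travels with task serialization, so it is only as protected
as the rest of the task data.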
Hi,
Integrating Koalas with pyspark might enable a richer interaction
between the two. Something that would be useful with a tighter
integration is support for custom column array types. Currently, Spark
takes dataframes, converts them to arrow buffers, and then transmits them
over the socket to
Hello Ryan,
This proposal looks very interesting. Would future goals for this
functionality include both support for aggregation functions and support
for processing ColumnarBatches (instead of Row/InternalRow)?
Thanks
Andrew
On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue wrote:
>
> Thanks
Hello,
On Wed, Jun 24, 2020 at 2:13 PM Holden Karau wrote:
>
> So I thought our theory for the pypi packages was it was for local
> developers, they really shouldn't care about the Hadoop version. If you're
> running on a production cluster you ideally pip install from the same release
>
Hi again,
Does anyone have thoughts on either the idea or the implementation?
Thanks,
Andrew
On Thu, Apr 9, 2020 at 11:32 PM Andrew Melo wrote:
>
> Hi all,
>
> I've opened a WIP PR here https://github.com/apache/spark/pull/28159
> I'm a novice at Scala, so I'm sure the code
Thanks again,
Andrew
On Wed, Apr 8, 2020 at 10:27 AM Andrew Melo wrote:
>
> On Wed, Apr 8, 2020 at 8:35 AM Wenchen Fan wrote:
> >
> > It would be good to support your use case, but I'm not sure how to
> > accomplish that. Can you open a PR so that we can di
> On Wed, Apr 8, 2020 at 1:12 PM Andrew Melo wrote:
>>
>> Hello
>>
>> On Tue, Apr 7, 2020 at 23:16 Wenchen Fan wrote:
>>>
>>> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not
>>> sure this is possible as the DS V2 AP
n from META-INF and pass in the full class
name to the DataFrameReader.
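A hedged sketch (the interface and class names are invented) of the arrangement this
refers to: a common layer defines an interface, each Spark-version-specific artifact
registers an implementation under META-INF/services, and the right one is resolved at
runtime before its data source class name is handed to the DataFrameReader:

    import java.util.ServiceLoader;

    // Hypothetical common-layer interface, implemented once per supported Spark version.
    interface SparkVersionShim {
        String dataSourceClassName();   // full class name later passed to DataFrameReader.format(...)
    }

    class ShimResolver {
        // Picks up whichever version-specific implementation is registered on the classpath.
        static SparkVersionShim resolve() {
            return ServiceLoader.load(SparkVersionShim.class).iterator().next();
        }
    }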
Thanks
Andrew
> On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo wrote:
>
>> Hi Ryan,
>>
>> On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue wrote:
>> >
>> > Hi Andrew,
>> >
>>
both interfaces.
Thanks again,
Andrew
>
> On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo wrote:
>>
>> Hi all,
>>
>> I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
>> send an email to the dev list for discussion.
>>
>> As the DSv2
Hi all,
I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
send an email to the dev list for discussion.
As the DSv2 API evolves, some breaking changes are occasionally made
to the API. It's possible to split a plugin into a "common" part and
multiple version-specific parts and
idle and the desire to increase utilization.
Thanks
Andrew
Sean
>
> On Fri, Mar 13, 2020 at 6:33 PM Andrew Melo wrote:
> >
> > Hi Xingbo, Sean,
> >
> > On Fri, Mar 13, 2020 at 12:31 PM Xingbo Jiang
> wrote:
> >>
> >> Andrew, could you provide mor
dicated k8s/mesos/yarn clusters we use
for prototyping
> Thanks,
>
> Xingbo
>
> On Fri, Mar 13, 2020 at 10:23 AM Sean Owen wrote:
>
>> You have multiple workers in one Spark (standalone) app? this wouldn't
>> prevent N apps from each having a worker on a machine.
>>
Hello,
On Fri, Feb 28, 2020 at 13:21 Xingbo Jiang wrote:
> Hi all,
>
> Based on my experience, there is no scenario that necessarily requires
> deploying multiple Workers on the same node with Standalone backend. A
> worker should book all the resources reserved to Spark on the host it is
>
Hi Aakash
On Tue, Dec 17, 2019 at 12:42 PM aakash aakash
wrote:
> Hi Spark dev folks,
>
> First of all, kudos on this new Data Source v2 API; it looks simple and it
> makes it easy to develop a new data source and use it.
>
> With my current work, I am trying to implement a new data source V2 writer
>
hey are created.
>
That's good to know; I'll search around JIRA for docs describing that
functionality.
Thanks again,
Andrew
>
> rb
>
> On Tue, Nov 5, 2019 at 4:58 PM Andrew Melo wrote:
>
>> Hello,
>>
>> During testing of our DSv2 implementation (on 2.4.3 FW
Hello,
During testing of our DSv2 implementation (on 2.4.3 FWIW), it appears that
our DataSourceReader is being instantiated multiple times for the same
dataframe. For example, the following snippet
Dataset df = spark
.read()
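The snippet is cut off in the archive; a hypothetical completion (the format class and
path are made up) of the kind of single read being described, for which the data
source's DataSourceReader was reportedly constructed more than once:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    class ReadOnce {
        // One logical dataframe built from one read; the surprise reported above is
        // that the underlying DataSourceReader gets instantiated several times for it.
        static Dataset<Row> load(SparkSession spark) {
            return spark
                .read()
                .format("example.hypothetical.RootSource")   // made-up source name
                .load("/path/to/input.root");                // made-up path
        }
    }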
:48 PM Andrew Melo wrote:
>
> Hello,
>
> I'm working on a DSv2 implementation with a userbase that is 100% pyspark
> based.
>
> There's some interesting additional DS-level functionality I'd like to
> expose from the Java side to pyspark -- e.g. I/O metrics, which source
Hello,
I'm working on a DSv2 implementation with a userbase that is 100% pyspark based.
There's some interesting additional DS-level functionality I'd like to
expose from the Java side to pyspark -- e.g. I/O metrics, which source
site provided the data, etc...
Does someone have an example of
Hi Spark Aficionados-
On Fri, Sep 13, 2019 at 15:08 Ryan Blue wrote:
> +1 for a preview release.
>
> DSv2 is quite close to being ready. I can only think of a couple issues
> that we need to merge, like getting a fix for stats estimation done. I'll
> have a better idea once I've caught up from
Hello,
I've (nearly) implemented a DSV2-reader interface to read particle physics
data stored in the ROOT (https://root.cern.ch/) file format. You can think
of these ROOT files as roughly parquet-like: column-wise and nested (i.e. a
column can be of type "float[]", meaning each row in the column
>
>> case _: NoSuchElementException =>
>>
>> // If spark.executor.cores is not defined, get the cores per JVM
>>
>> import spark.implicits._
>>
>> val numMachineCores = spark.range(0, 1)
>>
>>
Hello,
Is there a way to detect the number of cores allocated for an executor
within a java-based InputPartitionReader?
Thanks!
Andrew
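A minimal sketch of one way this is sometimes approximated (an assumption, not an
answer from this thread): read spark.executor.cores from the task-side SparkConf via
SparkEnv, falling back to the JVM's visible processors. This may not reflect the real
allocation under every cluster manager.

    import org.apache.spark.SparkEnv;

    class ExecutorCores {
        static int executorCores() {
            int fallback = Runtime.getRuntime().availableProcessors();
            SparkEnv env = SparkEnv.get();       // set inside executor (and driver) JVMs
            if (env == null) {
                return fallback;                 // e.g. called outside a Spark JVM
            }
            return env.conf().getInt("spark.executor.cores", fallback);
        }
    }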
which I was improperly
passing in instead of Metadata.empty().
Thanks again,
Andrew
>
> On Tue, May 21, 2019 at 11:39 AM Andrew Melo wrote:
>>
>> Hello,
>>
>> I'm developing a DataSourceV2 reader for the ROOT (https://root.cern/)
>> file format to replace a previous DSV
Hello,
I'm developing a DataSourceV2 reader for the ROOT (https://root.cern/)
file format to replace a previous DSV1 source that was in use before.
I have a bare skeleton of the reader, which can properly load the
files and pass their schema into Spark 2.4.3, but any operation on the
resulting
Hello,
I'm developing a (java) DataSourceV2 to read a columnar file format
popular in a number of physical sciences (https://root.cern.ch/). (I
also understand that the API isn't fixed and is subject to change.)
My question is -- what is the expected way to transmit exceptions from
the DataSource up
On Fri, Apr 5, 2019 at 9:41 AM Jungtaek Lim wrote:
>
> Thanks Andrew for reporting this. I just submitted the fix.
> https://github.com/apache/spark/pull/24304
Thanks!
>
> On Fri, Apr 5, 2019 at 3:21 PM Andrew Melo wrote:
>>
>> Hello,
>>
>> I'm not sur
Hello,
I'm not sure if this is the proper place to report it, but the 2.4.1
version of the config docs apparently didn't render correctly to HTML
(scroll down to "Compression and Serialization")
https://spark.apache.org/docs/2.4.1/configuration.html#available-properties
By comparison, the 2.4.0
Hi,
On Fri, Mar 1, 2019 at 9:48 AM Xingbo Jiang wrote:
>
> Hi Sean,
>
> To support GPU scheduling with YARN cluster, we have to update the hadoop
> version to 3.1.2+. However, if we decide to not upgrade hadoop to beyond that
> version for Spark 3.0, then we just have to disable/fallback the
e'll need to calculate the sum of their 4-d
momenta, while samples with <2 electrons will need to subtract two
different physical quantities -- several more steps before we get to
the point where we'll histogram the different subsamples for the
outputs.
Cheers
Andrew
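For context, a hedged sketch (the column name is invented) of the split-by-filter
pattern being discussed, where each subsample comes from a separate filter() over the
same input:

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    class SplitByFilter {
        static void split(Dataset<Row> events) {
            Dataset<Row> twoOrMoreElectrons = events.filter(col("nElectrons").geq(2));
            Dataset<Row> fewerElectrons     = events.filter(col("nElectrons").lt(2));
            // ...each branch then gets its own downstream steps and histograms
        }
    }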
>
> On Mon, Feb 4, 2019 at
so it's possible we're not using it correctly.
Cheers
Andrew
> rb
>
> On Mon, Feb 4, 2019 at 8:33 AM Andrew Melo wrote:
>>
>> Hello
>>
>> On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini wrote:
>> >
>> > I've seen many application need to split data
Hello
On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini wrote:
>
> I've seen many applications that need to split a dataset into multiple datasets
> based on some conditions. As there is no method to do this in one place, developers
> use the filter method multiple times. I think it can be useful to have a method
just getting started).
>
> On Mon, Aug 27, 2018 at 12:18 PM Andrew Melo wrote:
>>
>> Hi Holden,
>>
>> I'm agnostic to the approach (though it seems cleaner to have an
>> explicit API for it). If you would like, I can take that JIRA and
>> implement it
bly add `getActiveSession` to the PySpark
> API (filed a starter JIRA https://issues.apache.org/jira/browse/SPARK-25255
> )
>
> On Mon, Aug 27, 2018 at 12:09 PM Andrew Melo wrote:
>>
>> Hello Sean, others -
>>
>> Just to confirm, is it OK for client
, 2018 at 5:52 PM, Andrew Melo wrote:
> Hi Sean,
>
> On Tue, Aug 7, 2018 at 5:44 PM, Sean Owen wrote:
>> Ah, python. How about SparkContext._active_spark_context then?
>
> Ah yes, that looks like the right member, but I'm a bit wary about
> depending on functionality
; and subject to change. Is that something I
should be unconcerned about?
The other thought is that accesses within SparkContext are protected
by "SparkContext._lock" -- should I also use that lock?
Thanks for your help!
Andrew
>
> On Tue, Aug 7, 2018 at 5:34 PM Andr
ion and causing a JVM to start.
Is there an easy way to call getActiveSession that doesn't start a JVM?
Cheers
Andrew
>
> On Tue, Aug 7, 2018 at 5:11 PM Andrew Melo wrote:
>>
>> Hello,
>>
>> One pain point with various Jupyter extensions [1][2] that provide
>>
Hello,
One pain point with various Jupyter extensions [1][2] that provide
visual feedback about running spark processes is the lack of a public
API to introspect the web URL. The notebook server needs to know the
URL to find information about the current SparkContext.
Simply looking for