Re: Efficiently allocating an empty vector (python)

2019-12-11 Thread Weston Pace
's a C++ facility to do this, but it's not exposed in Python yet. > I opened ARROW-7375 for it. > > Regards > > Antoine. > > > Le 11/12/2019 à 19:36, Weston Pace a écrit : > > I'm trying to combine multiple parquet files. They were produced at > > different

Efficiently allocating an empty vector (python)

2019-12-11 Thread Weston Pace
I'm trying to combine multiple parquet files. They were produced at different points in time and have different columns. For example, one has columns A, B, C. Two has columns B, C, D. Three has columns C, D, E. I want to concatenate all three into one table with columns A, B, C, D, E. To do

Re: Efficiently allocating an empty vector (python)

2019-12-11 Thread Weston Pace
> null, > null, > null, > null, > null, > null, > null > ] > > > Regards > > Antoine. > > > Le 11/12/2019 à 21:08, Weston Pace a écrit : > > Thanks. Ted, I tried using numpy similar to your approach and had the > same > >

Timestamp coerced by default writing to parquet when resolution is ns (python)

2019-12-06 Thread Weston Pace
If my table has timestamp fields with ns resolution and I save the table to parquet format without specifying any timestamp args (default coerce and legacy settings) then it automatically converts my timestamp to us resolution. As best I can tell Parquet supports ns resolution so I would prefer

Re: Timestamp coerced by default writing to parquet when resolution is ns (python)

2019-12-06 Thread Weston Pace
actually sure what's required to write these. Using version='2.0' is > not safe because our implementation of Parquet V2 data pages is > incorrect (see PARQUET-458) > > So I'd recommend using the deprecated int96 flag if you need > nanoseconds right now > > On Fri, Dec 6, 2019 at

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-15 Thread Weston Pace
It sounds like you are describing two problems. 1) Idleness - Tasks are holding threads in the thread pool while they wait for IO or some long running non-CPU task to complete. These threads are often in a "wait" state or something similar. 2) Fairness - The ordering of tasks is causing short

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-15 Thread Weston Pace
My C++ is pretty rusty but I'll see if I can come up with a concrete CSV example / experiment / proof of concept on Friday when I have a break from work. On Tue, Sep 15, 2020 at 3:47 PM Wes McKinney wrote: > > On Tue, Sep 15, 2020 at 7:54 PM Weston Pace wrote: > > > > Yes.

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-15 Thread Weston Pace
an tasks do IO, resulting in suboptimal performance (the > problems caused by this will be especially exacerbated when running > against slower filesystems like Amazon S3) > > Hopefully the issues are more clear. > > Thanks > Wes > > On Tue, Sep 15, 2020 at 2:57 PM Weston Pa

Re: Multifile parquet support

2020-09-04 Thread Weston Pace
Hello Radu, If your goal is strictly "append" with common schema then maybe the terminology you are looking for is "append a parquet file to a parquet dataset" and not "append a row group to a multi-file parquet file". Parquet datasets (and arrow datasets) support having a common schema which is

Creating filesystems that read local files

2020-08-25 Thread Weston Pace
I created a RelativeFileSystem that extended FileSystem and proxied calls to a LocalFileSystem instance. This filesystem allowed me to specify a base directory and then all paths were resolved relative to that base directory (so fs.open("foo.parquet") became

Re: Creating filesystems that read local files

2020-08-25 Thread Weston Pace
Actually my workaround (extending LocalFileSystem) does not work since `open` is never called in this case and the path is not normalized to the base directory. On Tue, Aug 25, 2020 at 11:38 AM Weston Pace wrote: > > I created a RelativeFileSystem that extended FileSystem and proxied

Questions about S3 options

2020-08-19 Thread Weston Pace
instance for S3 access elsewhere and so I'd rather reuse this if possible. -Weston Pace

Writing parquet to new filesystem API

2020-08-26 Thread Weston Pace
Forgive me if I am missing something obvious but I am unable to write parquet files using the new filesystem API. Here is what I am trying: https://gist.github.com/westonpace/0c5ef01e21a40de5d16608b7f12de80d I receive an error: OSError: Unrecognized filesystem:

Re: Creating filesystems that read local files

2020-08-26 Thread Weston Pace
: num_rows += row_group.num_rows On Wed, Aug 26, 2020 at 10:06 AM Weston Pace wrote: > > Thanks Joris / Antoine, > > It appears I will have to learn the new datasets API. I can confirm > that SubTreeFileSystem is working for me. In case there is still > interest here is the code

Re: Creating filesystems that read local files

2020-08-26 Thread Weston Pace
Based on your description, I assume you are using the "legacy" > LocalFileSystem. > In the new filesystems, however, I think there is already the feature you > are looking for, called "SubTreeFileSystem", created from a base directory > and other filesystem instance.

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-29 Thread Weston Pace
alancing I/O vs. CPU workload, balancing for fairness. It may not be obvious what exactly to aim for. On Mon, Sep 28, 2020 at 2:32 AM Antoine Pitrou wrote: > > Le 28/09/2020 à 11:38, Antoine Pitrou a écrit : > > > > Hi Weston, > > > > Le 25/09/2020 à 23:21, Westo

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-25 Thread Weston Pace
> I don't have an intuition whether depth-first scheduling (what Julia > > > > > is doing) or breadth-first scheduling (aka "work stealing" -- which is > > > > > what Intel's TBB library does [1]) will work better for our use cases. > >

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-19 Thread Weston Pace
is that there is no way to have a Julia task that is performing blocking I/O (in the sense that a "thread pool thread" is blocked on I/O). You can have blocking I/O in the async/await sense where you are awaiting on I/O to maintain sequential semantics. On Wed, Sep 16, 2020 at 8:10 AM Weston P

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-16 Thread Weston Pace
ussion points was Julia's > > > > task-based > > > > multithreading model that has been part of the language for over a year > > > > now. An announcement blogpost for Julia 1.3 laid out some of the details > > > > and high-level app

Re: [c++] Futures API review & help understanding benchmark result

2020-10-27 Thread Weston Pace
> Antoine. > > > > > > Regards > > > > Antoine. > > > > > > Le 26/10/2020 à 16:48, Weston Pace a écrit : > >> Hi all, > >> > >> I've completed the initial composable futures API and iterator work. > >> The CSV read

[c++] Futures API review & help understanding benchmark result

2020-10-26 Thread Weston Pace
Hi all, I've completed the initial composable futures API and iterator work. The CSV reader portion is still WIP. First, I'm interested in getting any feedback on the futures API. In particular Future.Then in future.h (and the type erased Composable.Compose). The actual implementation can

Re: Upcoming JS fixes and release timeline

2020-07-10 Thread Weston Pace
Just to be more specific. Since most JavaScript packages follow semantic versioning that means that a change from 1.0.0 to 2.0.0 would imply that there were breaking changes in the API (i.e. not backwards compatible). By default, when declaring a dependency on a package that has a 1.X release,
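
To make the versioning point concrete, here is a simplified sketch of how a caret range such as npm's default `^1.0.0` treats a major-version bump. The functions `parse_version` and `satisfies_caret` are hypothetical illustrations, not npm's implementation, and they deliberately ignore pre-release tags and the special rules for 0.x versions.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def satisfies_caret(version, base):
    """Return True if `version` falls inside the caret range ^base.

    Only handles base versions with a major component of 1 or higher.
    """
    candidate, minimum = parse_version(version), parse_version(base)
    if minimum[0] == 0:
        raise ValueError("0.x caret ranges follow different rules")
    # Same major version, and not older than the declared minimum.
    return candidate[0] == minimum[0] and candidate >= minimum
```

Under this rule a dependency declared as `^1.0.0` picks up 1.4.2 but never 2.0.0, which is why a major bump signals a breaking change to consumers.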

Snappy illegal instructions errors & pypi build

2020-07-01 Thread Weston Pace
I have a customer that has encountered what I believe to be https://issues.apache.org/jira/browse/ARROW-9114 They are running Windows. They receive an illegal instruction exception on pyarrow.parquet.read_table. Their processor (i5-3470) does not support BMI2. The customer is using the pypi

Re: Snappy illegal instructions errors & pypi build

2020-07-01 Thread Weston Pace
s/win-build.bat#L25 > > On Wed, Jul 1, 2020 at 4:47 PM Weston Pace wrote: > > > > I have a customer that has encountered what I believe to be > > https://issues.apache.org/jira/browse/ARROW-9114 > > > > They are running Windows. They receive an illegal instru

Re: Pandas Block Manager

2020-11-11 Thread Weston Pace
Nick, it appears converting the ndarray to a dataframe clears the contiguous flag even though it doesn't actually change the underlying array. At least, this is what I'm seeing with my testing. My guess is this is what is causing arrow to do a copy (arrow is indeed doing a new allocation here,

Please Review: Application for a Media Type

2021-01-20 Thread Weston Pace
ree to suggest changes. https://docs.google.com/document/d/1PmZFoSifV_TX4vXnv775WiOtqCgz5zLF5ryFRWio3HQ/edit?usp=sharing Once we align on the content we should probably have a PMC member actually make the submission and be listed as contact person. Thanks, Weston Pace Ursa Computing

Threading Improvements Proposal

2021-02-02 Thread Weston Pace
e recently joined Ursa Computing which will allow me more time to work on Arrow. Thanks, Weston Pace [1] https://docs.google.com/document/d/1tO2WwYL-G2cB_MCPqYguKjKkRT7mZ8C2Gc9ONvspfgo/edit?usp=sharing [2] https://github.com/apache/arrow/pull/9095 [3] https://mail-archives.apache.org/mod_mbox

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-14 Thread Weston Pace
> So it's wrong to put "timezone=UTC", because in Arrow, the "timezone" field > means, "how the data is *displayed*." The data isn't displayed as UTC. I don't think users will generally be using Arrow to format timestamps for display to the user. However, if it is, the correct thing to do here

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-14 Thread Weston Pace
., 11 Jun. 2021, 00:45 Wes McKinney, wrote: > > > From this, it seems like seeding the RecordBatchStreamWriter's output > > stream with a much larger preallocated buffer would improve > > performance (depends on the allocator used of course). > > > > On Thu, Jun 10, 2021

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-14 Thread Weston Pace
1 at 5:34 PM Weston Pace wrote: > > I'm in no rush, so feel free to respond when you have time. > > > If the timezone field doesn't say how to display data to the user, and we > > agree it doesn't describe how data is stored (since its very presence means > > data

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-14 Thread Weston Pace
sounds to me like it is being proposed to eliminate the first of > > these two data types. I understand the principles that might motivate > > that, but I don't think that is something we can do at this time lest > > we lose the ability to have high-fidelity interoperability with

Re: [C++] Adopting a library for (distributed) tracing

2021-06-08 Thread Weston Pace
FWIW, I tried this out yesterday since I was profiling the execution of the async API reader. It worked great so +1 from me on that basis. I did struggle finding a good simple visualization tool. Do you have any good recommendations on that front? On Mon, Jun 7, 2021 at 10:50 AM David Li

Re: Complex Number support in Arrow

2021-06-10 Thread Weston Pace
> While dedicated types are not strictly required, compute functions would > be much easier to add for a first-class dedicated complex datatype > rather than for an extension type. @pitrou This is perhaps a naive question (and admittedly, I'm not up to speed on my compute kernels) but why is this

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Weston Pace
Just for some reference times from my system I created a quick test to dump a ~1.7GB table to buffer(s). Going to many buffers (just collecting the buffers): ~11,000ns Going to one preallocated buffer: ~160,000,000ns Going to one dynamically allocated buffer (using a grow factor of 2x):

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-17 Thread Weston Pace
use cases would be disenfranchised by requiring UTC normalization > always. > > On Tue, Jun 15, 2021 at 3:16 PM Adam Hooper wrote: > > > > On Tue, Jun 15, 2021 at 1:19 PM Weston Pace wrote: > > > > > Arrow's "Timestamp with Timezone" can have fields ext

Re: Question on releasing record batch

2021-06-17 Thread Weston Pace
The only owner of input_batch that I can see here is the shared_ptr that you are resetting so I would expect the memory to be freed. How are you measuring memory usage? The dynamic allocators (mimalloc / jemalloc) don't always release memory as soon as they possibly can. Even malloc will

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-22 Thread Weston Pace
at Arrow *doesn't* do something extremely useful. It's voting for a > >> negative. That sounds painful! What if there were positives to vote for? An > >> "INSTANT" type? A new TIMESTAMP metadata field, "instant" (on by default)? > >> A fiat that timezone

Re: [ANNOUNCE] New Arrow PMC member: David M Li

2021-06-21 Thread Weston Pace
Congratulations David! On Mon, Jun 21, 2021 at 2:24 PM Niranda Perera wrote: > > Congrats David! :-) > > On Mon, Jun 21, 2021 at 6:32 PM Nate Bauernfeind > wrote: > > > Congratulations! Well earned! > > > > On Mon, Jun 21, 2021 at 4:20 PM Ian Cook wrote: > > > > > Congratulations, David! > > >

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-21 Thread Weston Pace
I agree that a vote would be a good idea. Do you want to start a dedicated vote thread? I can write one up too if you'd rather. -Weston On Mon, Jun 21, 2021 at 4:54 PM Micah Kornfield wrote: > > I think comments on the doc are tailing off. Jorge's test cases I think > still need some more

[STRAW POLL] (How) should Arrow define storage for "Instant"s

2021-06-24 Thread Weston Pace
The discussion in [1] led to the following question. Before we proceed on a vote it was decided we should do a straw poll to settle on an approach (which can then be voted on in a +1/-1 fashion). --- Some date & time libraries have three temporal concepts. For the sake of this document we will

[VOTE] Clarify meaning of timestamp without time zone to equal the concept of "LocalDateTime"

2021-06-24 Thread Weston Pace
The discussion in [1] led to the following proposal which I would like to submit for a vote. --- Arrow allows a timestamp column to omit the time zone property. This has caused confusion because some people have interpreted a timestamp without a time zone to be an Instant while others have

Re: [STRAW POLL] (How) should Arrow define storage for "Instant"s

2021-06-24 Thread Weston Pace
/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit?usp=sharing On Thu, Jun 24, 2021 at 9:24 AM Weston Pace wrote: > > The discussion in [1] led to the following question. Before we > proceed on a vote it was decided we should do a straw poll to settle > on an approach (which can then be voted on in a +

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-15 Thread Weston Pace
Thanks for the excellent summary everyone. I agree with these summaries that have been pointed out. It seems like things are moving towards consensus. > I think Instant is what is represented as Arrow's Timestamp with Timezone. > I don't think Arrow has a type for DateTime because we don't have

Re: [VOTE] Register media types (MIME types) for Apache Arrow formats to IANA

2021-05-10 Thread Weston Pace
> > > > On Wed, May 5, 2021 at 9:25 AM Kazuaki Ishizaki > > wrote: > > > > > > +1, great > > > > > > Weston Pace wrote on 2021/05/04 20:41:34: > > > > > > > From: Weston Pace > > > > To: dev@arrow.a

Re: [DISCUSS/QUESTION][C++] Persisting "field id" (or other metadata) through transformation?

2021-05-18 Thread Weston Pace
:21 PM Antoine Pitrou wrote: > > > > > > Le 12/05/2021 à 21:19, Weston Pace a écrit : > > > The parquet format has a "field id" concept (unique integer identifier > > > for a column) that gets promoted in the C++ implementation to a > > > key

Re: Language silos and transpilers

2021-05-18 Thread Weston Pace
So, checking my understanding, let's imagine a hypothetical scenario. * There is a data scientist that is well versed in pandas * There is a project team working in kotlin * The project team wants to use the data scientists' code in their project. # Transpilation The transpilation approach

Re: [ANNOUNCE] New Arrow PMC member: Benjamin Kietzman

2021-05-05 Thread Weston Pace
Congratulations Ben! On Wed, May 5, 2021 at 6:48 PM Micah Kornfield wrote: > Congrats! > > On Wed, May 5, 2021 at 4:33 PM David Li wrote: > > > Congrats Ben! Well deserved. > > > > Best, > > David > > > > On Wed, May 5, 2021, at 19:22, Neal Richardson wrote: > > > Congrats Ben! > > > > > >

Re: String reverse kernel

2021-05-17 Thread Weston Pace
FWIW, combining marks were not actually added to support emojis. Emojis are just one of the more popular uses of the feature. Combining marks is a standard Unicode feature necessary to represent single “characters” in some complex situations (e.g. when it is necessary to distinguish between
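
A small illustration of why combining marks matter for a string reverse kernel, using only Python's standard `unicodedata` module. The function `reverse_keeping_marks` is a hypothetical sketch: it attaches each combining mark to the preceding base character, which is a simplification and not full Unicode grapheme-cluster segmentation.

```python
import unicodedata


def reverse_keeping_marks(text):
    """Reverse `text` while keeping combining marks on their base character."""
    clusters = []
    for char in text:
        # A non-zero combining class marks a combining character.
        if clusters and unicodedata.combining(char):
            clusters[-1] += char
        else:
            clusters.append(char)
    return "".join(reversed(clusters))
```

Reversing `"ae\u0301"` code point by code point moves the accent onto the wrong letter, while the cluster-aware version keeps it attached to the "e".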

Re: Long title on github page

2021-05-17 Thread Weston Pace
> “Apache Arrow is a format and compute kernel for in-memory data” I like this but no one ever knows what "in-memory" means (or they just think 'data is always in memory'). How about... "Apache Arrow is a format and compute kernel for zero-copy processing and sharing of data." or... "Apache

Re: Long title on github page

2021-05-17 Thread Weston Pace
> > > On Mon, May 17, 2021 at 3:06 PM Wes McKinney wrote: > > > > > On Mon, May 17, 2021 at 4:58 PM Weston Pace > > wrote: > > > > > > > > > “Apache Arrow is a format and compute kernel for in-memory data” > > > > > > >

[DISCUSS/QUESTION][C++] Persisting "field id" (or other metadata) through transformation?

2021-05-12 Thread Weston Pace
The parquet format has a "field id" concept (unique integer identifier for a column) that gets promoted in the C++ implementation to a key/value pair in the field's metadata. This has led me to a few questions around how this field (or metadata in general) interacts with higher level APIs. 1)

Re: C++ RecordBatch Debugging Segmentation Fault

2021-05-20 Thread Weston Pace
I like Yibo's stack overflow theory given the "error reading variable" but I did confirm that I can cause a segmentation fault if std::atomic_store / std::atomic_load are unavailable. I simulated this by simply commenting out the specializations rather than actually run against GCC 4.9.2 so it

[Discuss] [Proposal] [C++] Arrow multithreaded stress test suite

2021-05-18 Thread Weston Pace
I spoke a while ago about working on a multithreaded stress test suite. I have put together some very early details[1]. I would appreciate any feedback. The goal would be to stress test the C++ dataset API (and soon C++ execution plans and perhaps someday a language independent logical plan /

Re: [Discuss] [Proposal] [C++] Arrow multithreaded stress test suite

2021-05-19 Thread Weston Pace
thub.com/Crunch-io/diagnose#breakpoints ) > > On Wed, May 19, 2021 at 9:01 AM Antoine Pitrou wrote: > > > > > Le 19/05/2021 à 07:37, Weston Pace a écrit : > > > I spoke a while ago about working on a multithreaded stress test > > > suite. I have put together some very

Re: C++ RecordBatch Debugging Segmentation Fault

2021-05-19 Thread Weston Pace
What compiler / glibc version are you using? arrow::SimpleRecordBatch::column does some non-trivial caching which uses std::atomic_load[1] which is not implemented properly on gcc < 5 so our behavior is different depending on the compiler version. [1]

Re: [C++][DISCUSS] Implementing interpreted (non-compiled) tests for compute functions

2021-05-14 Thread Weston Pace
With that in mind it seems the somewhat recurring discussion on coming up with a language independent standard for logical query plans (https://lists.apache.org/thread.html/rfab15e09c97a8fb961d6c5db8b2093824c58d11a51981a40f40cc2c0%40%3Cdev.arrow.apache.org%3E) would be relevant. Each test case

[C++] Deciding between "compute function" and "utility function"

2021-05-11 Thread Weston Pace
How does one decide between "utility function" and "compute function"? For example, https://issues.apache.org/jira/browse/ARROW-12739 is very similar to StructArray::Make which is implemented as a static function. However, 12739 would require pool allocation (to concatenate the list items into

Re: [Format] Timestamp timezone semantics?

2021-06-04 Thread Weston Pace
> We are recommending that the behavior of > these functions should consistently have the UTC interpretation of the > value rather than using the system locale. This is what Python does > with "tz-naive" datetime.datetime objects This is not quite true, although perhaps my reading is incorrect.

[C++] [DISCUSS] Moving towards a consistent enum naming scheme

2021-06-04 Thread Weston Pace
The C++ code base currently has a mix of ALL_CAPS (e.g. arrow::ValueDescr::Shape, seems to be favored in arrow::compute::), CapWords (e.g. arrow::StatusCode), and kCapWords (e.g. arrow::DecimalStatus, not common in arrow:: but used in gandiva:: and technically what the Google style guide

Re: Improving PR workload management for Arrow maintainers

2021-07-07 Thread Weston Pace
I investigated the cpython approach and the PR labelling is a part of the existing bedevere bot which does a number of things (not all relevant to Arrow). Yesterday I created a standalone Github action[1] dedicated to this task roughly based on my previous email. It will apply "awaiting-review"

Re: [Rust] Eliminate Timezone field from Timestamp types?

2021-07-07 Thread Weston Pace
I don't know about removal but you could probably ignore the timezone string and it's not clear the issues would be that significant. If Rust never produces a non-null non-UTC timestamp then I don't see that as an issue. If you are consuming data with a timestamp string other than UTC it isn't

Re: [Rust] Eliminate Timezone field from Timestamp types?

2021-07-07 Thread Weston Pace
ity would be the biggest issue; how much does C++ do with the > timezone string? > > -Evan > > > On Jul 7, 2021, at 1:33 PM, Weston Pace wrote: > > > > I don't know about removal but you could probably ignore the timezone > > string and it's not clear the issues

Re: Moving "Improvements" and "New Features" to 6.0.0 release

2021-07-02 Thread Weston Pace
Can you leave the ones marked “in progress” or that have the pull-request-available label? On Thu, Jul 1, 2021 at 11:06 PM Alessandro Molina < alessan...@ursacomputing.com> wrote: > Hi everybody, > > Given that the expected time for release 5.0.0 is approaching and there are > 160+ Jira issues

Re: Improving PR workload management for Arrow maintainers

2021-06-29 Thread Weston Pace
I apologize. I did plan on working on this but it's taken a back seat for a while. I would still recommend shying away from a standalone UI. You will end up making a lot of requests (and possibly running into Github throttles) if you want detailed PR information for all of the PRs. To work

Re: [STRAW POLL] (How) should Arrow define storage for "Instant"s

2021-06-30 Thread Weston Pace
Bryan Cutler wrote: > > C first choice, E second > > On Mon, Jun 28, 2021, 8:40 AM Julian Hyde wrote: > > > D > > > > (2nd choice E if we’re doing ranked-choice voting) > > > > Julian > > > > > On Jun 24, 2021, at 12:24 PM, Weston Pace wr

[VOTE] Arrow should state a convention for encoding instants as Timestamp with "UTC" as the time zone

2021-06-30 Thread Weston Pace
This vote is a result of previous discussion[1][2]. This vote is also a prerequisite for the PR in [5]. --- Some date & time libraries have three temporal concepts. For the sake of this document we will call them LocalDateTime, ZonedDateTime, and Instant. An Instant is a timestamp that has no

Re: [VOTE] Clarify meaning of timestamp without time zone to equal the concept of "LocalDateTime"

2021-06-30 Thread Weston Pace
] https://github.com/apache/arrow/pull/10629 On Fri, Jun 25, 2021 at 8:25 AM Jorge Cardoso Leitão wrote: > > +1 > > On Fri, Jun 25, 2021 at 7:47 PM Julian Hyde wrote: > > > +1 > > > > > On Jun 25, 2021, at 10:36 AM, Antoine Pitrou wrote: > > > > >

Re: [ANNOUNCE] New Arrow committer: Weston Pace

2021-07-12 Thread Weston Pace
Thank you everyone. I'm really enjoying working on such a great project. On Fri, Jul 9, 2021 at 4:01 PM Neal Richardson wrote: > > Congrats Weston! > > On Fri, Jul 9, 2021 at 11:53 AM Micah Kornfield > wrote: > > > Congrats! > > > > On Fri, Jul 9, 2021 at 7:56 AM Benjamin Kietzman > > wrote:

[DISCUSS] Should we start marking "feather" as deprecated?

2021-07-12 Thread Weston Pace
Feather V2 is currently synonymous with the IPC format. My impression is that the feather terminology is now being deprecated in favor of IPC. Do we want to start marking feather modules as deprecated (both in code and the documentation) and more explicitly point users to the newer

Re: [VOTE] Arrow should state a convention for encoding instants as Timestamp with "UTC" as the time zone

2021-07-05 Thread Weston Pace
> Kazuaki Ishizaki > > "Weston Pace" wrote on 2021/06/30 18:52:46: > > > From: "Weston Pace" > > To: dev@arrow.apache.org > > Date: 2021/06/30 18:53 > > Subject: [EXTERNAL] [VOTE] Arrow should state a convention for > > encoding instants a

Re: [Python] Custom Metadata in PyArrow

2021-04-23 Thread Weston Pace
I have used the custom metadata feature in the past. I used it to track (for example) which variables were independent variables and which were dependent variables. This was used as input for later tools to help present the data. > Is that how most people handle metadata they create the schema

Independent releases and format version

2021-04-28 Thread Weston Pace
We now have independent releases. There has been some discussion (not sure if it was formalized) around aligning major release versions across the languages. There is also a potential format change coming up (new interval type). I think this brings up a few questions... Can an arrow library

Re: Please Review: Application for a Media Type

2021-04-28 Thread Weston Pace
>> Thanks for updating the draft. > >> > >> I want to wait for at least a week before we start a vote. > >> Does anyone have an opinion about file extension of Apache > >> Arrow format data? What do you think about ".arrow"? > >> > >>

Re: Independent releases and format version

2021-04-29 Thread Weston Pace
o align well with the > > idea that Rust X.Y.Z is bundled as part of the arrow release. > > > > This does not address backward incompatible changes of the format, which is > > a whole different beast (e.g. do we require all implementations to change > > prior to releasing the

Re: [VOTE] Register media types (MIME types) for Apache Arrow formats to IANA

2021-05-04 Thread Weston Pace
May 4, 2021 at 9:56 AM Jorge Cardoso Leitão < > jorgecarlei...@gmail.com> wrote: > > > +1 > > > > Also, great process, Weston. > > > > Best, > > Jorge > > > > > > > > On Tue, May 4, 2021 at 6:48 PM Antoine Pitrou wrote: > >

[VOTE] Register media types (MIME types) for Apache Arrow formats to IANA

2021-05-04 Thread Weston Pace
Per ARROW-7396 I would like to propose an application to the IANA to register media types for the Arrow IPC formats (both file and streaming). The proposed application is available as [1]. It is based on previous discussion in a draft [2] as well as two ML threads [3][4]. For reference, the

Re: [C++] Dataset API simplification

2021-03-26 Thread Weston Pace
I'll also note that there could be other Fragments which may naturally have > > intra-fragment parallelism, if the concern is mostly that ParquetScanTask > > is a bit of an outlier. For instance, a hypothetical FlightFragment > > wrapping a FlightInfo struct could generate multiple

[C++] (Eventually) merging asynchronous datasets feature

2021-04-07 Thread Weston Pace
I have been working the last few months on ARROW-7001 [0] which enables nested parallelism by converting the dataset scanning to asynchronous (previously announced here[1] and discussed here[2]). In addition to enabling nested parallelism this also allows for parallel readahead which gives

[C++] Dataset API simplification

2021-03-25 Thread Weston Pace
This is a bit of a follow-up on https://issues.apache.org/jira/browse/ARROW-11782 and also a bit of a consequence of my work on https://issues.apache.org/jira/browse/ARROW-7001 (nested scan parallelism). I think the current dataset interface should be simplified. Currently, we have Dataset ->*

Re: [Proposal] Improving PR tracking in Github with PR labels

2021-03-12 Thread Weston Pace
point > > > that scrolling through > 100 PRs in GitHub is not a good use of > > > reviewer time, so creating some kind of "semantic layer" on top of the > > > PR review queue (like what the Spark folks did) would help a great > > > deal. So rathe

[C++] struct/class conventions

2021-03-13 Thread Weston Pace
Hi, this might be a bit of a pedantic email but I'm going through and cleaning up my code on some of my threading work and wondered about the style guidelines around struct/class. Technically, the Google style guide states... --- structs should be used for passive objects that carry data, and

Re: [JS] Exploring usage of apache arrow at my company for complex table rendering

2021-02-26 Thread Weston Pace
I used Arrow for this purpose in the past. I don't have much to add but just a few thoughts off the top of my head... * The line between data and metadata can be blurry - For most measurements we were able to store the "expected distribution" as metadata (e.g. this measurement should have an

Re: Requirements on JIRA usage in Apache Arrow

2021-03-02 Thread Weston Pace
It also seems like we're describing two different issues. The first, a barrier to entry for new development. The second, overhead imposed on an active developer. I'm personally not so worried about the overhead imposed, perhaps because I can't write code that fast anyways, so I'll stay out of

[Proposal] Improving PR tracking in Github with PR labels

2021-03-06 Thread Weston Pace
some kind of dashboard showing which PRs need review and which have been waiting for review the longest. Even without that, it would serve to make it clear to the submitter and the reviewers where the action is. -Weston Pace

Re: [C++] (Eventually) merging asynchronous datasets feature

2021-04-07 Thread Weston Pace
> concepts we can refer back to, especially if we further expand the > > > utilities. While not many of us may be familiar with rxcpp already, at > > > least we'd have a reference for how our utilities are supposed to work. > > > > > > Using the framework fo

Re: CI feedback time

2021-04-14 Thread Weston Pace
It may be worth reaching out to the Airflow project. Based on https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status it seems they have been investing time into figuring out how to make self-hosted runners work (it seems Github's patching model makes this somewhat difficult). On

Re: [DISCUSS] [Rust] Move Rust components to new repos and process

2021-04-09 Thread Weston Pace
> I'm assuming the idea is that the existing integration tests will remain in > apache/arrow. Will you also run the integration test suites on your rust > repository CI checks? Furthermore, against what version will these tests run? * If Arrow runs against the latest release of Rust then it

Re: 4.0 release preparation

2021-04-10 Thread Weston Pace
Nightly build triage (based on nightly builds from 4/9): Failed Tasks: - conda-linux-gcc-py36-aarch64: ARROW-12324 (conda builds timing out, conda slow) - conda-linux-gcc-py37-aarch64: ARROW-12324 (conda builds timing out, conda slow) - conda-osx-clang-py37-r40: Appears to have been an

Re: [JS] Exploring usage of apache arrow at my company for complex table rendering

2021-04-18 Thread Weston Pace
t>>>>>>> >> >> but basically the idea would be if you were to retrieve the data for a given >> index of let’s say a state it would return all the cities and vectors of >> data related to that given state. >> >> I also don’t know also if thi

Re: [VOTE] Release Apache Arrow 4.0.0 - RC1

2021-04-20 Thread Weston Pace
I'm not sure if it is blocking (and it might even be expected given the current status of jfrog) but I attempted to install the CentOS 7 RPM and got the following error when I ran `sudo yum update` after installing the arrow repo rpm.

Re: [VOTE] Release Apache Arrow 4.0.0 - RC1

2021-04-20 Thread Weston Pace
> > It seems that you use old verification script. Could you > confirm that you use the verification script on master? > > Thanks, > -- > kou > > In > "Re: [VOTE] Release Apache Arrow 4.0.0 - RC1" on Tue, 20 Apr 2021 11:23:25 > -1000, > Weston Pace wrot

Re: nullptr for mutable data in pyarrow table from pandas

2021-04-20 Thread Weston Pace
If it comes from pandas (and is eligible for zero-copy) then the buffer implementation will be `NumPyBuffer`. Printing one in GDB yields... ``` $12 = {_vptr.Buffer = 0x7f0b66e147f8 , is_mutable_ = true, is_cpu_ = true, data_ = 0x55b71f901a70 "\001", mutable_data_ = 0x0, size_ = 16, capacity_ =

Re: [C++] (Eventually) merging asynchronous datasets feature

2021-04-08 Thread Weston Pace
t great hardship) to have a > fallback to the non-async version (so there's a workaround if there > end up being show-stopping bugs) then that's even better. > > On Wed, Apr 7, 2021 at 1:24 PM Weston Pace wrote: > > > > 1) Most of the committed changes have been of

Re: Threading Improvements Proposal

2021-02-16 Thread Weston Pace
much code as possible to use the > > asynchronous model — per above, if there is a mechanism for async task > > producers to coexist alongside with code that manually manages the > > execution order of tasks generated by its task graph (thinking of > > query engine code

Re: Please Review: Application for a Media Type

2021-04-21 Thread Weston Pace
fer is used) > > Our Julia tests use the following extensions: > > * vnd.apache.arrow.file: Not used (in-memory buffer is used) > * vnd.apache.arrow.stream: Not used (in-memory buffer is used) > > Our Rust tests use the following extensions: > > * vnd.apache

Re: [VOTE] Release Apache Arrow 4.0.0 - RC3

2021-04-21 Thread Weston Pace
I'm getting a failure during the download files check... Traceback (most recent call last): File "/home/centos/arrow/dev/release/download_rc_binaries.py", line 172, in download_rc_binaries(args.version, args.rc_number, dest=args.dest, File

Re: [VOTE][Format] Clarify allowed value range for the Time types

2021-08-19 Thread Weston Pace
+1 On Thu, Aug 19, 2021 at 9:18 AM Wes McKinney wrote: > > +1 > > On Thu, Aug 19, 2021 at 6:20 PM Antoine Pitrou wrote: > > > > > > Hello, > > > > I would like to propose clarifying the allowed value range for the Time > > types. Specifically, I would propose that: > > > > 1) allowed values

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

2021-08-30 Thread Weston Pace
My (incredibly naive) interpretation is that there are three problems to tackle. 1) How do you represent a graph and relational operators (join, union, groupby, etc.) - The PR appears to be addressing this question fairly well 2) How does a frontend query a backend to know what UDFs are

Re: [ANNOUNCE] New Arrow committer: Matt Topol

2021-08-30 Thread Weston Pace
Congratulations Matt! On Mon, Aug 30, 2021 at 5:36 PM Micah Kornfield wrote: > > On behalf of the Apache Arrow PMC, I'm happy to announce that Matt Topol > has accepted an invitation to become a committer on Apache Arrow. > > Welcome and thank you for your contributions.

Re: [DISCUSS] Splitting out the Arrow format directory

2021-08-13 Thread Weston Pace
I believe you would need a JSON compatible version of the type system (including binary values) because you'd need to at least encode literals. However, I don't think that creating a human readable encoding of the Arrow type system is a bad thing in and of itself. We have tickets and get
