[jira] [Created] (ARROW-12681) [Python] Expose IpcReadOptions to ipc facility

2021-05-07 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-12681:
--

 Summary: [Python] Expose IpcReadOptions to ipc facility
 Key: ARROW-12681
 URL: https://issues.apache.org/jira/browse/ARROW-12681
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Francois Saint-Jacques


I would like to be able to read only a subset of columns from a given IPC file. 
To do this, we need to expose the EXPERIMENTAL (is it still?) 
IpcReadOptions.included_fields option.

I do not know the best way to "pythonize" IpcReadOptions and would need help 
with this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11794) [Go] Add concurrent-safe ipc.FileReader.RecordAt(i)

2021-02-26 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-11794:
--

 Summary: [Go] Add concurrent-safe ipc.FileReader.RecordAt(i)
 Key: ARROW-11794
 URL: https://issues.apache.org/jira/browse/ARROW-11794
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Francois Saint-Jacques
Assignee: Francois Saint-Jacques


Arrow IPC files are safe to load concurrently. The implementation of 
`ipc.FileReader.Record(i)` is not safe due to stashing the current record 
internally. This adds a backward-compatible function `RecordAt` that behaves 
like ReadAt.





[jira] [Updated] (ARROW-8981) [C++][Dataset] Add support for compressed FileSources

2020-06-12 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8981:
--
Fix Version/s: (was: 1.0.0)

> [C++][Dataset] Add support for compressed FileSources
> -
>
> Key: ARROW-8981
> URL: https://issues.apache.org/jira/browse/ARROW-8981
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.17.1
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: dataset
>
> FileSource::compression_ is currently ignored. Ideally files/buffers which 
> are compressed could be decompressed on read. See ARROW-8942





[jira] [Updated] (ARROW-8163) [C++][Dataset] Allow FileSystemDataset's file list to be lazy

2020-06-12 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8163:
--
Fix Version/s: (was: 1.0.0)

> [C++][Dataset] Allow FileSystemDataset's file list to be lazy
> -
>
> Key: ARROW-8163
> URL: https://issues.apache.org/jira/browse/ARROW-8163
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset
>
> A FileSystemDataset currently requires a full listing of files it contains on 
> construction, so a scan cannot start until all files in the dataset are 
> discovered. Instead it would be ideal if a large dataset could be constructed 
> with a lazy file listing so that scans can start immediately.
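The idea can be sketched in plain Python with a generator: yield paths as the walk discovers them so a consumer can start before the listing completes (illustrative only, not the C++ API):

```python
import os
from typing import Iterator

def lazy_listing(base_dir: str) -> Iterator[str]:
    # Yield files as the directory walk finds them, instead of
    # materializing the complete listing before any scan can begin.
    for root, _dirs, names in os.walk(base_dir):
        for name in sorted(names):
            yield os.path.join(root, name)
```

A scanner consuming this iterator can open the first file while discovery of the rest is still in progress.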





[jira] [Updated] (ARROW-7945) [C++][Dataset] Implement InMemoryDatasetFactory

2020-06-12 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-7945:
--
Fix Version/s: (was: 1.0.0)

> [C++][Dataset] Implement InMemoryDatasetFactory
> ---
>
> Key: ARROW-7945
> URL: https://issues.apache.org/jira/browse/ARROW-7945
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset
>
> This will allow in-memory datasets (such as tables) to participate in 
> discovery through {{UnionDatasetFactory}}. This class will be trivial, since 
> Inspect will do nothing but return the table's schema, but it is necessary to 
> ensure that the resulting {{UnionDataset}}'s unified schema accommodates the 
> table's schema (for example, including fields present only in the table's 
> schema, or emitting an error when unification is not possible).





[jira] [Updated] (ARROW-8658) [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments

2020-06-12 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8658:
--
Fix Version/s: (was: 1.0.0)

> [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments
> 
>
> Key: ARROW-8658
> URL: https://issues.apache.org/jira/browse/ARROW-8658
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.17.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset
>
> This is a very handy optimization for large datasets with multiple partition 
> fields. For example, given a hive-style directory {{$base_dir/a=3/}} and a 
> filter {{"a"_ == 2}} none of its files or subdirectories need be examined.
> After ARROW-8318 FileSystemDataset stores only files so subtree pruning 
> (whose implementation depended on the presence of directories to represent 
> subtrees) was disabled. It should be possible to reintroduce this without 
> reference to directories by examining partition expressions directly and 
> extracting a tree structure from their subexpressions.
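The pruning rule can be illustrated in plain Python over hive-style paths (a hypothetical helper, not the C++ implementation):

```python
import re

def prune_paths(paths, field, wanted):
    # Skip any path whose "field=value" segment contradicts the filter;
    # the subtree behind it never needs to be examined.
    pattern = re.compile(rf"{field}=([^/]+)")
    kept = []
    for path in paths:
        m = pattern.search(path)
        if m and m.group(1) != str(wanted):
            continue  # pruned by the partition expression
        kept.append(path)
    return kept
```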





[jira] [Updated] (ARROW-8201) [Python][Dataset] Improve ergonomics of FileFragment

2020-06-12 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8201:
--
Fix Version/s: (was: 1.0.0)

> [Python][Dataset] Improve ergonomics of FileFragment
> 
>
> Key: ARROW-8201
> URL: https://issues.apache.org/jira/browse/ARROW-8201
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: dataset
>
> FileFragment can be made more directly useful by adding convenience methods.
> For example, a FileFragment could allow underlying file/buffer to be opened 
> directly:
> {code}
> def open(self):
>     """
>     Open a NativeFile of the buffer or file viewed by this fragment.
>     """
>     cdef:
>         CFileSystem* c_filesystem
>         shared_ptr[CRandomAccessFile] opened
>         NativeFile out = NativeFile()
>     buf = self.buffer
>     if buf is not None:
>         return pa.BufferReader(buf)
>     with nogil:
>         c_filesystem = self.file_fragment.source().filesystem()
>         opened = GetResultValue(c_filesystem.OpenInputFile(
>             self.file_fragment.source().path()))
>     out.set_random_access_file(opened)
>     out.is_readable = True
>     return out
> {code}
> Additionally, a ParquetFileFragment's metadata could be introspectable:
> {code}
> @property
> def metadata(self):
>     from pyarrow._parquet import ParquetReader
>     reader = ParquetReader()
>     reader.open(self.open())
>     return reader.metadata
> {code}





[jira] [Updated] (ARROW-8137) [C++][Dataset] Investigate multithreaded discovery

2020-06-12 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8137:
--
Fix Version/s: (was: 1.0.0)

> [C++][Dataset] Investigate multithreaded discovery
> --
>
> Key: ARROW-8137
> URL: https://issues.apache.org/jira/browse/ARROW-8137
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: dataset
>
> Currently FileSystemDatasetFactory inspects all files serially. For slow file 
> systems, or systems which support batched reads, this could be accelerated by 
> inspecting files in parallel.
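The parallel variant is a straightforward map over a thread pool; a plain-Python sketch with a stand-in inspect step:

```python
from concurrent.futures import ThreadPoolExecutor

def inspect(path):
    # Stand-in for reading a file's footer/schema, which is the
    # per-file latency the serial loop pays in sequence.
    return {"path": path, "ok": True}

def inspect_all(paths, max_workers=8):
    # Issue the inspections concurrently; on high-latency filesystems
    # the wall time approaches that of the slowest single inspection.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(inspect, paths))
```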





[jira] [Updated] (ARROW-7617) [Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories)

2020-06-12 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-7617:
--
Fix Version/s: (was: 1.0.0)

> [Python] parquet.write_to_dataset creates empty partitions for non-observed 
> dictionary items (categories)
> -
>
> Key: ARROW-7617
> URL: https://issues.apache.org/jira/browse/ARROW-7617
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: Vladimir
>Priority: Major
>  Labels: dataset, parquet
>
> Hello,
> it looks like views with a selection along a categorical column are not 
> properly respected.
> For the following dummy dataframe:
>  
> {code:java}
> d = pd.date_range('1990-01-01', freq='D', periods=1)
> vals = pd.np.random.randn(len(d), 4)
> x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
> x['Year'] = x.index.year
> {code}
> The slice by Year is saved to partitioned parquet properly:
> {code:java}
> table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
> pq.write_to_dataset(table, root_path='test_a.parquet', 
> partition_cols=['Year']){code}
> However, if we convert Year to pandas.Categorical - it will save the whole 
> original dataframe, not only slice of Year=1990:
> {code:java}
> x['Year'] = x['Year'].astype('category')
> table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
> pq.write_to_dataset(table, root_path='test_b.parquet', 
> partition_cols=['Year'])
> {code}
>  
>  
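The root cause is that a pandas categorical keeps its full category set after slicing; dropping unused categories before writing is a workaround (a sketch, not the eventual fix in pyarrow):

```python
import pandas as pd

df = pd.DataFrame({"Year": [1990, 1991], "v": [1.0, 2.0]})
df["Year"] = df["Year"].astype("category")

# The slice still carries both categories, which is why a partitioned
# write can emit an empty directory for Year=1991.
sliced = df[df.Year == 1990].copy()
sliced["Year"] = sliced["Year"].cat.remove_unused_categories()
```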





[jira] [Updated] (ARROW-2628) [Python] parquet.write_to_dataset is memory-hungry on large DataFrames

2020-06-12 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-2628:
--
Fix Version/s: (was: 1.0.0)

> [Python] parquet.write_to_dataset is memory-hungry on large DataFrames
> --
>
> Key: ARROW-2628
> URL: https://issues.apache.org/jira/browse/ARROW-2628
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
>
> See discussion in https://github.com/apache/arrow/issues/1749. We should 
> consider strategies for writing very large tables to a partitioned directory 
> scheme. 





[jira] [Assigned] (ARROW-7798) [R] Refactor R <-> Array conversion

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-7798:
-

Assignee: (was: Francois Saint-Jacques)

> [R] Refactor R <-> Array conversion
> ---
>
> Key: ARROW-7798
> URL: https://issues.apache.org/jira/browse/ARROW-7798
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> There's a bit of technical debt accumulated in array_to_vector and 
> vector_to_array:
> * A mix of conversion *and* casting; ideally we'd move casting out of there 
> (at the cost of more memory copies). The rationale is that the conversion 
> logic will differ from the CastKernels, e.g. when to raise errors, benefiting 
> from complex conversions like timezone handling. The current implementation is 
> fast, e.g. it fuses the conversion and casting in a single loop, at the cost 
> of code clarity and divergence.
> * There should be 2 paths: zero-copy and non-zero-copy. The non-zero-copy path 
> should use the newly introduced VectorToArrayConverter, which works with 
> complex nested types.
> * In array_to_vector, the Converter should work primarily with Array and not 
> ArrayVector.





[jira] [Assigned] (ARROW-5761) [R] Improve autosplice cpp code

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-5761:
-

Assignee: (was: Francois Saint-Jacques)

> [R] Improve autosplice cpp code
> ---
>
> Key: ARROW-5761
> URL: https://issues.apache.org/jira/browse/ARROW-5761
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Minor
>
> Followup to ARROW-5178. See discussion on 
> [https://github.com/apache/arrow/pull/4704]. 





[jira] [Updated] (ARROW-9065) [C++] Support parsing date32 in dataset partition folders

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-9065:
--
Summary: [C++] Support parsing date32 in dataset partition folders  (was: 
[Python] Support parsing date32 in dataset partition folders)

> [C++] Support parsing date32 in dataset partition folders
> -
>
> Key: ARROW-9065
> URL: https://issues.apache.org/jira/browse/ARROW-9065
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Dave Hirschfeld
>Assignee: Francois Saint-Jacques
>Priority: Minor
>  Labels: dataset
>
> I have some data which is partitioned by year/month/date. It would be useful 
> if the date could be automatically parsed:
> {code:python}
> In [17]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), 
> ("day", pa.date32())])
> In [18]: partition = DirectoryPartitioning(schema)
> In [19]: partition.parse("/2020/06/2020-06-08")
> ---
> ArrowNotImplementedError Traceback (most recent call last)
>  in 
> > 1 partition.parse("/2020/06/2020-06-08")
> ~\envs\dev\lib\site-packages\pyarrow\_dataset.pyx in 
> pyarrow._dataset.Partitioning.parse()
> ~\envs\dev\lib\site-packages\pyarrow\error.pxi in 
> pyarrow.lib.pyarrow_internal_check_status()
> ~\envs\dev\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: parsing scalars of type date32[day]
> {code}
> Not a big issue since you can just use string and convert, but nevertheless 
> it would be nice if it Just Worked
> {code}
> In [22]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), 
> ("day", pa.string())])
> In [23]: partition = DirectoryPartitioning(schema)
> In [24]: partition.parse("/2020/06/2020-06-08")
> Out[24]:  6:int8)) and (day == 2020-06-08:string))>
> {code}





[jira] [Updated] (ARROW-9108) [C++][Dataset] Add Parquet Statistics conversion for timestamp columns

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-9108:
--
Fix Version/s: 1.0.0

> [C++][Dataset] Add Parquet Statistics conversion for timestamp columns
> --
>
> Key: ARROW-9108
> URL: https://issues.apache.org/jira/browse/ARROW-9108
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Assigned] (ARROW-9108) [C++][Dataset] Add Parquet Statistics conversion for timestamp columns

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-9108:
-

Assignee: Francois Saint-Jacques

> [C++][Dataset] Add Parquet Statistics conversion for timestamp columns
> --
>
> Key: ARROW-9108
> URL: https://issues.apache.org/jira/browse/ARROW-9108
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>






[jira] [Assigned] (ARROW-9107) [C++][Dataset] Time-based types support

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-9107:
-

Assignee: Francois Saint-Jacques

> [C++][Dataset] Time-based types support
> ---
>
> Key: ARROW-9107
> URL: https://issues.apache.org/jira/browse/ARROW-9107
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Blocker
>  Labels: dataset
> Fix For: 1.0.0
>
>
> We lack support for date/timestamp partitions and predicate pushdown rules. 
> Timestamp columns are usually the most important predicate in OLAP-style 
> queries; we need to support this transparently.





[jira] [Updated] (ARROW-9108) [C++][Dataset] Add Parquet Statistics conversion for timestamp columns

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-9108:
--
Component/s: C++

> [C++][Dataset] Add Parquet Statistics conversion for timestamp columns
> --
>
> Key: ARROW-9108
> URL: https://issues.apache.org/jira/browse/ARROW-9108
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>






[jira] [Updated] (ARROW-9108) [C++][Dataset] Add Parquet Statistics conversion for timestamp columns

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-9108:
--
Priority: Blocker  (was: Major)

> [C++][Dataset] Add Parquet Statistics conversion for timestamp columns
> --
>
> Key: ARROW-9108
> URL: https://issues.apache.org/jira/browse/ARROW-9108
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Blocker
>  Labels: dataset
> Fix For: 1.0.0
>
>






[jira] [Updated] (ARROW-9108) [C++][Dataset] Add Parquet Statistics conversion for timestamp columns

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-9108:
--
Labels: dataset  (was: )

> [C++][Dataset] Add Parquet Statistics conversion for timestamp columns
> --
>
> Key: ARROW-9108
> URL: https://issues.apache.org/jira/browse/ARROW-9108
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
> Fix For: 1.0.0
>
>






[jira] [Created] (ARROW-9108) [C++][Dataset] Add Parquet Statistics conversion for timestamp columns

2020-06-11 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-9108:
-

 Summary: [C++][Dataset] Add Parquet Statistics conversion for 
timestamp columns
 Key: ARROW-9108
 URL: https://issues.apache.org/jira/browse/ARROW-9108
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Francois Saint-Jacques








[jira] [Updated] (ARROW-9107) [C++][Dataset] Time-based types support

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-9107:
--
Fix Version/s: 1.0.0

> [C++][Dataset] Time-based types support
> ---
>
> Key: ARROW-9107
> URL: https://issues.apache.org/jira/browse/ARROW-9107
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
> Fix For: 1.0.0
>
>
> We lack support for date/timestamp partitions and predicate pushdown rules. 
> Timestamp columns are usually the most important predicate in OLAP-style 
> queries; we need to support this transparently.





[jira] [Updated] (ARROW-9107) [C++][Dataset] Time-based types support

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-9107:
--
Priority: Blocker  (was: Major)

> [C++][Dataset] Time-based types support
> ---
>
> Key: ARROW-9107
> URL: https://issues.apache.org/jira/browse/ARROW-9107
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Blocker
>  Labels: dataset
> Fix For: 1.0.0
>
>
> We lack support for date/timestamp partitions and predicate pushdown rules. 
> Timestamp columns are usually the most important predicate in OLAP-style 
> queries; we need to support this transparently.





[jira] [Updated] (ARROW-9065) [Python] Support parsing date32 in dataset partition folders

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-9065:
--
Parent: ARROW-9107
Issue Type: Sub-task  (was: Improvement)

> [Python] Support parsing date32 in dataset partition folders
> 
>
> Key: ARROW-9065
> URL: https://issues.apache.org/jira/browse/ARROW-9065
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Dave Hirschfeld
>Assignee: Francois Saint-Jacques
>Priority: Minor
>  Labels: dataset
>
> I have some data which is partitioned by year/month/date. It would be useful 
> if the date could be automatically parsed:
> {code:python}
> In [17]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), 
> ("day", pa.date32())])
> In [18]: partition = DirectoryPartitioning(schema)
> In [19]: partition.parse("/2020/06/2020-06-08")
> ---
> ArrowNotImplementedError Traceback (most recent call last)
>  in 
> > 1 partition.parse("/2020/06/2020-06-08")
> ~\envs\dev\lib\site-packages\pyarrow\_dataset.pyx in 
> pyarrow._dataset.Partitioning.parse()
> ~\envs\dev\lib\site-packages\pyarrow\error.pxi in 
> pyarrow.lib.pyarrow_internal_check_status()
> ~\envs\dev\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: parsing scalars of type date32[day]
> {code}
> Not a big issue since you can just use string and convert, but nevertheless 
> it would be nice if it Just Worked
> {code}
> In [22]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), 
> ("day", pa.string())])
> In [23]: partition = DirectoryPartitioning(schema)
> In [24]: partition.parse("/2020/06/2020-06-08")
> Out[24]:  6:int8)) and (day == 2020-06-08:string))>
> {code}





[jira] [Created] (ARROW-9107) [C++][Dataset] Time-based types support

2020-06-11 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-9107:
-

 Summary: [C++][Dataset] Time-based types support
 Key: ARROW-9107
 URL: https://issues.apache.org/jira/browse/ARROW-9107
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Francois Saint-Jacques


We lack support for date/timestamp partitions and predicate pushdown rules. 
Timestamp columns are usually the most important predicate in OLAP-style 
queries; we need to support this transparently.





[jira] [Commented] (ARROW-9065) [Python] Support parsing date32 in dataset partition folders

2020-06-11 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133353#comment-17133353
 ] 

Francois Saint-Jacques commented on ARROW-9065:
---

There's a general void of time-based type support in datasets; we need to clean 
this up before 1.0.0.

> [Python] Support parsing date32 in dataset partition folders
> 
>
> Key: ARROW-9065
> URL: https://issues.apache.org/jira/browse/ARROW-9065
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Dave Hirschfeld
>Assignee: Francois Saint-Jacques
>Priority: Minor
>  Labels: dataset
>
> I have some data which is partitioned by year/month/date. It would be useful 
> if the date could be automatically parsed:
> {code:python}
> In [17]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), 
> ("day", pa.date32())])
> In [18]: partition = DirectoryPartitioning(schema)
> In [19]: partition.parse("/2020/06/2020-06-08")
> ---
> ArrowNotImplementedError Traceback (most recent call last)
>  in 
> > 1 partition.parse("/2020/06/2020-06-08")
> ~\envs\dev\lib\site-packages\pyarrow\_dataset.pyx in 
> pyarrow._dataset.Partitioning.parse()
> ~\envs\dev\lib\site-packages\pyarrow\error.pxi in 
> pyarrow.lib.pyarrow_internal_check_status()
> ~\envs\dev\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: parsing scalars of type date32[day]
> {code}
> Not a big issue since you can just use string and convert, but nevertheless 
> it would be nice if it Just Worked
> {code}
> In [22]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), 
> ("day", pa.string())])
> In [23]: partition = DirectoryPartitioning(schema)
> In [24]: partition.parse("/2020/06/2020-06-08")
> Out[24]:  6:int8)) and (day == 2020-06-08:string))>
> {code}





[jira] [Assigned] (ARROW-9065) [Python] Support parsing date32 in dataset partition folders

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-9065:
-

Assignee: Francois Saint-Jacques

> [Python] Support parsing date32 in dataset partition folders
> 
>
> Key: ARROW-9065
> URL: https://issues.apache.org/jira/browse/ARROW-9065
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Dave Hirschfeld
>Assignee: Francois Saint-Jacques
>Priority: Minor
>  Labels: dataset
>
> I have some data which is partitioned by year/month/date. It would be useful 
> if the date could be automatically parsed:
> {code:python}
> In [17]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), 
> ("day", pa.date32())])
> In [18]: partition = DirectoryPartitioning(schema)
> In [19]: partition.parse("/2020/06/2020-06-08")
> ---
> ArrowNotImplementedError Traceback (most recent call last)
>  in 
> > 1 partition.parse("/2020/06/2020-06-08")
> ~\envs\dev\lib\site-packages\pyarrow\_dataset.pyx in 
> pyarrow._dataset.Partitioning.parse()
> ~\envs\dev\lib\site-packages\pyarrow\error.pxi in 
> pyarrow.lib.pyarrow_internal_check_status()
> ~\envs\dev\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: parsing scalars of type date32[day]
> {code}
> Not a big issue since you can just use string and convert, but nevertheless 
> it would be nice if it Just Worked
> {code}
> In [22]: schema = pa.schema([("year", pa.int16()), ("month", pa.int8()), 
> ("day", pa.string())])
> In [23]: partition = DirectoryPartitioning(schema)
> In [24]: partition.parse("/2020/06/2020-06-08")
> Out[24]:  6:int8)) and (day == 2020-06-08:string))>
> {code}





[jira] [Assigned] (ARROW-8283) [Python][Dataset] Non-existent files are silently dropped in pa.dataset.FileSystemDataset

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8283:
-

Assignee: Joris Van den Bossche  (was: Francois Saint-Jacques)

> [Python][Dataset] Non-existent files are silently dropped in 
> pa.dataset.FileSystemDataset
> -
>
> Key: ARROW-8283
> URL: https://issues.apache.org/jira/browse/ARROW-8283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe Korn
>Assignee: Joris Van den Bossche
>Priority: Critical
>  Labels: dataset
> Fix For: 1.0.0
>
>
> When passing a list of files to the constructor of 
> {{pyarrow.dataset.FileSystemDataset}}, all files that don't exist are 
> silently dropped immediately (i.e. no fragments are created for them).
> Instead, I would expect that fragments will be created for them but an error 
> is thrown when one tries to read the fragment with the non-existent file.
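Until that behavior lands, a caller can enforce the expected error up front with a hypothetical helper:

```python
import os

def require_files(paths):
    # Fail loudly instead of letting fragments for non-existent files
    # be silently dropped by the dataset constructor.
    missing = [p for p in paths if not os.path.exists(p)]
    if missing:
        raise FileNotFoundError(f"missing dataset files: {missing}")
    return paths
```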





[jira] [Assigned] (ARROW-8283) [Python][Dataset] Non-existent files are silently dropped in pa.dataset.FileSystemDataset

2020-06-11 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8283:
-

Assignee: Francois Saint-Jacques  (was: Joris Van den Bossche)

> [Python][Dataset] Non-existent files are silently dropped in 
> pa.dataset.FileSystemDataset
> -
>
> Key: ARROW-8283
> URL: https://issues.apache.org/jira/browse/ARROW-8283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe Korn
>Assignee: Francois Saint-Jacques
>Priority: Critical
>  Labels: dataset
> Fix For: 1.0.0
>
>
> When passing a list of files to the constructor of 
> {{pyarrow.dataset.FileSystemDataset}}, all files that don't exist are 
> silently dropped immediately (i.e. no fragments are created for them).
> Instead, I would expect that fragments will be created for them but an error 
> is thrown when one tries to read the fragment with the non-existent file.





[jira] [Commented] (ARROW-8283) [Python][Dataset] Non-existent files are silently dropped in pa.dataset.FileSystemDataset

2020-06-11 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133307#comment-17133307
 ] 

Francois Saint-Jacques commented on ARROW-8283:
---

Correct, we should not touch `get_file_info` when we have a list of paths.

> [Python][Dataset] Non-existent files are silently dropped in 
> pa.dataset.FileSystemDataset
> -
>
> Key: ARROW-8283
> URL: https://issues.apache.org/jira/browse/ARROW-8283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe Korn
>Assignee: Joris Van den Bossche
>Priority: Critical
>  Labels: dataset
> Fix For: 1.0.0
>
>
> When passing a list of files to the constructor of 
> {{pyarrow.dataset.FileSystemDataset}}, all files that don't exist are 
> silently dropped immediately (i.e. no fragments are created for them).
> Instead, I would expect that fragments will be created for them but an error 
> is thrown when one tries to read the fragment with the non-existent file.





[jira] [Assigned] (ARROW-8802) [C++][Dataset] Schema metadata are lost when reading a subset of columns

2020-06-10 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8802:
-

Assignee: Francois Saint-Jacques

> [C++][Dataset] Schema metadata are lost when reading a subset of columns
> 
>
> Key: ARROW-8802
> URL: https://issues.apache.org/jira/browse/ARROW-8802
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, dataset-dask-integration
>
> Python example:
> {code}
> import pandas as pd 
> import pyarrow.dataset as ds  
>   
>   
> df = pd.DataFrame({'a': [1, 2, 3]})  
> df.to_parquet("test_metadata.parquet")  
> dataset = ds.dataset("test_metadata.parquet") 
>   
>   
> {code}
> gives:
> {code}
> >>> dataset.to_table().schema 
> a: int64
>   -- field metadata --
>   PARQUET:field_id: '1'
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 
> 397
> ARROW:schema: '/4ACAAAQAAAKAA4ABgAFAAgACgABAwAQAAAKAAwAAA' + 
> 806
> >>> dataset.to_table(columns=['a']).schema 
> a: int64
>   -- field metadata --
>   PARQUET:field_id: '1'
> {code}
> So when specifying a subset of the columns, the additional metadata entries 
> are lost (while those can still be informative, eg for conversion to pandas)





[jira] [Resolved] (ARROW-8726) [C++][Dataset] Mis-specified DirectoryPartitioning incorrectly uses the file name as value

2020-06-10 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8726.
---
Resolution: Fixed

Issue resolved by pull request 7377
[https://github.com/apache/arrow/pull/7377]

> [C++][Dataset] Mis-specified DirectoryPartitioning incorrectly uses the file 
> name as value
> --
>
> Key: ARROW-8726
> URL: https://issues.apache.org/jira/browse/ARROW-8726
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Calling filter + collect on a dataset with a mis-specified partitioning 
> causes a segfault. Though this is clearly an input error, it would be nice if
> there was some guidance that something was wrong with the partitioning.
> {code:r}
> library(arrow)
> library(dplyr)
> dir.create("multi_mtcars/one", recursive = TRUE)
> dir.create("multi_mtcars/two", recursive = TRUE)
> write_parquet(mtcars, "multi_mtcars/one/mtcars.parquet")
> write_parquet(mtcars, "multi_mtcars/two/mtcars.parquet")
> ds <- open_dataset("multi_mtcars", partitioning = c("level", "nothing"))
> # the following will segfault
> ds %>%
>   filter(cyl > 8) %>% 
>   collect()
> {code}
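The failure mode can be illustrated with a small sketch of directory-style partition parsing (a hypothetical helper, not the Arrow implementation): a scheme with more keys than directory levels should raise, rather than fall through and consume the file name as a partition value.

```python
import posixpath


def parse_directory_partition(relative_path, keys):
    """Map partition keys to the directory segments of a file path.

    Only directory components are partition values; the file name is
    never used.  A scheme with more keys than directories is an error
    instead of silently binding a key to the file name.
    """
    directories = [d for d in posixpath.dirname(relative_path).split("/") if d]
    if len(keys) > len(directories):
        raise ValueError(
            f"Partitioning expects {len(keys)} directory levels, "
            f"but path {relative_path!r} only has {len(directories)}")
    return dict(zip(keys, directories))
```

Under this rule, `parse_directory_partition("one/mtcars.parquet", ["level", "nothing"])` raises rather than treating `mtcars.parquet` as the value of `nothing`.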





[jira] [Updated] (ARROW-8726) [C++][Dataset] Mis-specified DirectoryPartitioning incorrectly uses the file name as value

2020-06-10 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8726:
--
Component/s: (was: R)
 C++

> [C++][Dataset] Mis-specified DirectoryPartitioning incorrectly uses the file 
> name as value
> --
>
> Key: ARROW-8726
> URL: https://issues.apache.org/jira/browse/ARROW-8726
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Calling filter + collect on a dataset with a mis-specified partitioning 
> causes a segfault. Though this is clearly an input error, it would be nice if
> there was some guidance that something was wrong with the partitioning.
> {code:r}
> library(arrow)
> library(dplyr)
> dir.create("multi_mtcars/one", recursive = TRUE)
> dir.create("multi_mtcars/two", recursive = TRUE)
> write_parquet(mtcars, "multi_mtcars/one/mtcars.parquet")
> write_parquet(mtcars, "multi_mtcars/two/mtcars.parquet")
> ds <- open_dataset("multi_mtcars", partitioning = c("level", "nothing"))
> # the following will segfault
> ds %>%
>   filter(cyl > 8) %>% 
>   collect()
> {code}





[jira] [Assigned] (ARROW-8374) [R] Table to vector of DictonaryType will error when Arrays don't have the same Dictionary per array

2020-06-08 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8374:
-

Assignee: Francois Saint-Jacques

> [R] Table to vector of DictonaryType will error when Arrays don't have the 
> same Dictionary per array
> 
>
> Key: ARROW-8374
> URL: https://issues.apache.org/jira/browse/ARROW-8374
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Critical
> Fix For: 1.0.0
>
>
> The conversion should unify the dictionaries before converting; otherwise the
> indices are simply broken
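The required transformation can be sketched in plain Python (illustrative only; Arrow performs this in C++): build one unified dictionary across chunks, then remap each chunk's indices into it.

```python
def unify_chunks(chunks):
    """Unify per-chunk dictionaries and remap the indices.

    Each chunk is a (dictionary, indices) pair.  Without this step,
    concatenating the raw indices mixes positions from different
    dictionaries and the decoded values are silently wrong.
    """
    unified = []    # unified dictionary, in first-seen order
    position = {}   # value -> index in the unified dictionary
    out_indices = []
    for dictionary, indices in chunks:
        # transpose[i] = where value i of this chunk's dictionary
        # lives in the unified dictionary.
        transpose = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            transpose.append(position[value])
        out_indices.extend(transpose[i] for i in indices)
    return unified, out_indices
```

For example, chunks `[(['a', 'b'], [0, 1]), (['b', 'c'], [0, 1])]` unify to dictionary `['a', 'b', 'c']` with indices `[0, 1, 1, 2]`.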





[jira] [Assigned] (ARROW-6235) [R] Conversion from arrow::BinaryArray to R character vector not implemented

2020-06-08 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-6235:
-

Assignee: Francois Saint-Jacques

> [R] Conversion from arrow::BinaryArray to R character vector not implemented
> 
>
> Key: ARROW-6235
> URL: https://issues.apache.org/jira/browse/ARROW-6235
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Critical
> Fix For: 1.0.0
>
>
> See unhandled case at 
> https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L644





[jira] [Assigned] (ARROW-8943) [C++] Add support for Partitioning to ParquetDatasetFactory

2020-06-08 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8943:
-

Assignee: Francois Saint-Jacques

> [C++] Add support for Partitioning to ParquetDatasetFactory
> ---
>
> Key: ARROW-8943
> URL: https://issues.apache.org/jira/browse/ARROW-8943
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, dataset-dask-integration
> Fix For: 1.0.0
>
>
> Follow-up on ARROW-8062: the ParquetDatasetFactory currently does not yet 
> support specifying a Partitioning / inferring with a PartitioningFactory.





[jira] [Comment Edited] (ARROW-7673) [C++][Dataset] Revisit File discovery failure mode

2020-06-08 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128405#comment-17128405
 ] 

Francois Saint-Jacques edited comment on ARROW-7673 at 6/8/20, 3:56 PM:


This has been refactored/fixed in ARROW-8058:


{code:python}
In [40]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/csv/2016", 
format="csv")   
Out[40]: 

In [41]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/csv/2016", 
format="parquet")   
...
OSError: Could not open parquet input source 
'/home/fsaintjacques/datasets/nyc-tlc/csv/2016/01/data.csv': Invalid: Parquet 
magic bytes not found in footer. Either the file is corrupted or this is not a 
parquet file.

In [42]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/parquet/2016", 
format="parquet")   
Out[42]: 

{code}



was (Author: fsaintjacques):
This has been refactored in ARROW-8058:


{code:python}
In [40]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/csv/2016", 
format="csv")   
Out[40]: 

In [41]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/csv/2016", 
format="parquet")   
...
OSError: Could not open parquet input source 
'/home/fsaintjacques/datasets/nyc-tlc/csv/2016/01/data.csv': Invalid: Parquet 
magic bytes not found in footer. Either the file is corrupted or this is not a 
parquet file.

In [42]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/parquet/2016", 
format="parquet")   
Out[42]: 

{code}


> [C++][Dataset] Revisit File discovery failure mode
> --
>
> Key: ARROW-7673
> URL: https://issues.apache.org/jira/browse/ARROW-7673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
> Fix For: 1.0.0
>
>
> Currently, the default `FileSystemFactoryOptions::exclude_invalid_files` will 
> silently ignore unsupported files (either IO error, not of the valid format, 
> corruption, missing compression codecs, etc...) when creating a 
> `FileSystemSource`.
> We should change this behavior to propagate an error in the Inspect/Finish 
> calls by default and allow the user to toggle `exclude_invalid_files`. The 
> error should contain at least the file path and a decipherable error (if 
> possible).
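A minimal sketch of the proposed default (hypothetical names, assuming a per-file `inspect` callable): discovery inspects every candidate file, and any failure propagates with the file path attached unless the caller explicitly opts into skipping.

```python
def discover(paths, inspect, exclude_invalid_files=False):
    """Inspect each candidate file, failing loudly by default.

    `inspect` is a callable that raises on unreadable or invalid
    files.  With exclude_invalid_files=True the old skip-silently
    behavior is restored, but only as an explicit opt-in.
    """
    valid = []
    for path in paths:
        try:
            inspect(path)
        except Exception as exc:
            if exclude_invalid_files:
                continue
            # Attach the offending path so the error is decipherable.
            raise RuntimeError(
                f"Could not open dataset file {path!r}: {exc}") from exc
        valid.append(path)
    return valid
```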





[jira] [Resolved] (ARROW-7673) [C++][Dataset] Revisit File discovery failure mode

2020-06-08 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-7673.
---
Resolution: Fixed

> [C++][Dataset] Revisit File discovery failure mode
> --
>
> Key: ARROW-7673
> URL: https://issues.apache.org/jira/browse/ARROW-7673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
> Fix For: 1.0.0
>
>
> Currently, the default `FileSystemFactoryOptions::exclude_invalid_files` will 
> silently ignore unsupported files (either IO error, not of the valid format, 
> corruption, missing compression codecs, etc...) when creating a 
> `FileSystemSource`.
> We should change this behavior to propagate an error in the Inspect/Finish 
> calls by default and allow the user to toggle `exclude_invalid_files`. The 
> error should contain at least the file path and a decipherable error (if 
> possible).





[jira] [Commented] (ARROW-7673) [C++][Dataset] Revisit File discovery failure mode

2020-06-08 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128405#comment-17128405
 ] 

Francois Saint-Jacques commented on ARROW-7673:
---

This has been refactored in ARROW-8058:


{code:python}
In [40]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/csv/2016", 
format="csv")   
Out[40]: 

In [41]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/csv/2016", 
format="parquet")   
...
OSError: Could not open parquet input source 
'/home/fsaintjacques/datasets/nyc-tlc/csv/2016/01/data.csv': Invalid: Parquet 
magic bytes not found in footer. Either the file is corrupted or this is not a 
parquet file.

In [42]: da.dataset("/home/fsaintjacques/datasets/nyc-tlc/parquet/2016", 
format="parquet")   
Out[42]: 

{code}


> [C++][Dataset] Revisit File discovery failure mode
> --
>
> Key: ARROW-7673
> URL: https://issues.apache.org/jira/browse/ARROW-7673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
> Fix For: 1.0.0
>
>
> Currently, the default `FileSystemFactoryOptions::exclude_invalid_files` will 
> silently ignore unsupported files (either IO error, not of the valid format, 
> corruption, missing compression codecs, etc...) when creating a 
> `FileSystemSource`.
> We should change this behavior to propagate an error in the Inspect/Finish 
> calls by default and allow the user to toggle `exclude_invalid_files`. The 
> error should contain at least the file path and a decipherable error (if 
> possible).





[jira] [Comment Edited] (ARROW-8283) [Python][Dataset] Non-existent files are silently dropped in pa.dataset.FileSystemDataset

2020-06-08 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128383#comment-17128383
 ] 

Francois Saint-Jacques edited comment on ARROW-8283 at 6/8/20, 3:24 PM:


[~jorisvandenbossche] This filtering is done on the [python 
side.|https://github.com/apache/arrow/blob/master/python/pyarrow/_dataset.pyx#L458-L478]
 This is due to the fact that the FileSystemDataset constructor accepts a 
path_or_selector. The C++ variant only accepts a vector of fragments, maybe we 
should align both?


was (Author: fsaintjacques):
[~jorisvandenbossche] This filtering is done on the [python 
side.|https://github.com/apache/arrow/blob/master/python/pyarrow/_dataset.pyx#L458-L478]
 This is due to the fact that the FileSystemDataset constructor accepts a 
path_or_selector.

> [Python][Dataset] Non-existent files are silently dropped in 
> pa.dataset.FileSystemDataset
> -
>
> Key: ARROW-8283
> URL: https://issues.apache.org/jira/browse/ARROW-8283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe Korn
>Priority: Critical
>  Labels: dataset
> Fix For: 1.0.0
>
>
> When passing a list of files to the constructor of 
> {{pyarrow.dataset.FileSystemDataset}}, all files that don't exist are 
> silently dropped immediately (i.e. no fragments are created for them).
> Instead, I would expect that fragments will be created for them but an error 
> is thrown when one tries to read the fragment with the non-existent file.





[jira] [Assigned] (ARROW-8283) [Python][Dataset] Non-existent files are silently dropped in pa.dataset.FileSystemDataset

2020-06-08 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8283:
-

Assignee: (was: Francois Saint-Jacques)

> [Python][Dataset] Non-existent files are silently dropped in 
> pa.dataset.FileSystemDataset
> -
>
> Key: ARROW-8283
> URL: https://issues.apache.org/jira/browse/ARROW-8283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe Korn
>Priority: Critical
>  Labels: dataset
> Fix For: 1.0.0
>
>
> When passing a list of files to the constructor of 
> {{pyarrow.dataset.FileSystemDataset}}, all files that don't exist are 
> silently dropped immediately (i.e. no fragments are created for them).
> Instead, I would expect that fragments will be created for them but an error 
> is thrown when one tries to read the fragment with the non-existent file.





[jira] [Commented] (ARROW-8283) [Python][Dataset] Non-existent files are silently dropped in pa.dataset.FileSystemDataset

2020-06-08 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128383#comment-17128383
 ] 

Francois Saint-Jacques commented on ARROW-8283:
---

[~jorisvandenbossche] This filtering is done on the [python 
side.|https://github.com/apache/arrow/blob/master/python/pyarrow/_dataset.pyx#L458-L478]
 This is due to the fact that the FileSystemDataset constructor accepts a 
path_or_selector.

> [Python][Dataset] Non-existent files are silently dropped in 
> pa.dataset.FileSystemDataset
> -
>
> Key: ARROW-8283
> URL: https://issues.apache.org/jira/browse/ARROW-8283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe Korn
>Assignee: Francois Saint-Jacques
>Priority: Critical
>  Labels: dataset
> Fix For: 1.0.0
>
>
> When passing a list of files to the constructor of 
> {{pyarrow.dataset.FileSystemDataset}}, all files that don't exist are 
> silently dropped immediately (i.e. no fragments are created for them).
> Instead, I would expect that fragments will be created for them but an error 
> is thrown when one tries to read the fragment with the non-existent file.





[jira] [Updated] (ARROW-8283) [Python][Dataset] Non-existent files are silently dropped in pa.dataset.FileSystemDataset

2020-06-08 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8283:
--
Component/s: (was: C++)

> [Python][Dataset] Non-existent files are silently dropped in 
> pa.dataset.FileSystemDataset
> -
>
> Key: ARROW-8283
> URL: https://issues.apache.org/jira/browse/ARROW-8283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe Korn
>Assignee: Francois Saint-Jacques
>Priority: Critical
>  Labels: dataset
> Fix For: 1.0.0
>
>
> When passing a list of files to the constructor of 
> {{pyarrow.dataset.FileSystemDataset}}, all files that don't exist are 
> silently dropped immediately (i.e. no fragments are created for them).
> Instead, I would expect that fragments will be created for them but an error 
> is thrown when one tries to read the fragment with the non-existent file.





[jira] [Updated] (ARROW-8283) [Python][Dataset] Non-existent files are silently dropped in pa.dataset.FileSystemDataset

2020-06-08 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8283:
--
Summary: [Python][Dataset] Non-existent files are silently dropped in 
pa.dataset.FileSystemDataset  (was: [C++/Python][Dataset] Non-existent files 
are silently dropped in pa.dataset.FileSystemDataset)

> [Python][Dataset] Non-existent files are silently dropped in 
> pa.dataset.FileSystemDataset
> -
>
> Key: ARROW-8283
> URL: https://issues.apache.org/jira/browse/ARROW-8283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Uwe Korn
>Assignee: Francois Saint-Jacques
>Priority: Critical
>  Labels: dataset
> Fix For: 1.0.0
>
>
> When passing a list of files to the constructor of 
> {{pyarrow.dataset.FileSystemDataset}}, all files that don't exist are 
> silently dropped immediately (i.e. no fragments are created for them).
> Instead, I would expect that fragments will be created for them but an error 
> is thrown when one tries to read the fragment with the non-existent file.





[jira] [Updated] (ARROW-9068) [C++][Dataset] Simplify Partitioning interface

2020-06-08 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-9068:
--
Issue Type: Improvement  (was: Bug)

> [C++][Dataset] Simplify Partitioning interface
> --
>
> Key: ARROW-9068
> URL: https://issues.apache.org/jira/browse/ARROW-9068
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Minor
>  Labels: dataset
>
> The `int segment` of `Partitioning::Parse` should not be exposed to the user. 
> KeyValuePartitioning should be a private Impl interface, not in public
> headers.
> The same applies to `Partitioning::Format`.





[jira] [Created] (ARROW-9068) [C++][Dataset] Simplify Partitioning interface

2020-06-08 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-9068:
-

 Summary: [C++][Dataset] Simplify Partitioning interface
 Key: ARROW-9068
 URL: https://issues.apache.org/jira/browse/ARROW-9068
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Francois Saint-Jacques


The `int segment` of `Partitioning::Parse` should not be exposed to the user. 
KeyValuePartitioning should be a private Impl interface, not in public headers.

The same applies to `Partitioning::Format`.





[jira] [Resolved] (ARROW-9037) [C++][C] unable to import array with null count == -1 (which could be exported)

2020-06-07 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-9037.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7353
[https://github.com/apache/arrow/pull/7353]

> [C++][C] unable to import array with null count == -1 (which could be 
> exported)
> ---
>
> Key: ARROW-9037
> URL: https://issues.apache.org/jira/browse/ARROW-9037
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.17.1
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> If an Array is created with null_count == -1 but without any nulls (and thus
> no null bitmap buffer), then ArrayData.null_count will remain -1 when 
> exporting if null_count is never computed. The exported C struct also has 
> null_count == -1 [1]. But when importing, if null_count != 0, an error [2] 
> will be raised.
> [1] 
> https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L560
> [2] 
> https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L1404
>  
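In the C data interface, `null_count == -1` means "unknown"; an importer can resolve it from the validity bitmap instead of rejecting the array. A plain-Python sketch of that resolution (illustrative only, not the bridge.cc code):

```python
def resolve_null_count(null_count, validity_bitmap, length):
    """Resolve a possibly-unknown (-1) null count on import.

    `validity_bitmap` is a bytes object with one LSB-first bit per
    slot (1 = valid), or None when the array has no validity buffer.
    """
    if null_count >= 0:
        return null_count
    if validity_bitmap is None:
        # No bitmap buffer was exported: the array has no nulls.
        return 0
    # Count valid slots bit by bit, ignoring padding bits past `length`.
    valid = sum((validity_bitmap[i // 8] >> (i % 8)) & 1 for i in range(length))
    return length - valid
```

An array exported with `null_count == -1` and no bitmap thus resolves to zero nulls, which is exactly the case the importer currently rejects.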





[jira] [Assigned] (ARROW-9037) [C++][C] unable to import array with null count == -1 (which could be exported)

2020-06-07 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-9037:
-

Assignee: Zhuo Peng

> [C++][C] unable to import array with null count == -1 (which could be 
> exported)
> ---
>
> Key: ARROW-9037
> URL: https://issues.apache.org/jira/browse/ARROW-9037
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.17.1
>Reporter: Zhuo Peng
>Assignee: Zhuo Peng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If an Array is created with null_count == -1 but without any nulls (and thus
> no null bitmap buffer), then ArrayData.null_count will remain -1 when 
> exporting if null_count is never computed. The exported C struct also has 
> null_count == -1 [1]. But when importing, if null_count != 0, an error [2] 
> will be raised.
> [1] 
> https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L560
> [2] 
> https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L1404
>  





[jira] [Updated] (ARROW-9037) [C++][C] unable to import array with null count == -1 (which could be exported)

2020-06-07 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-9037:
--
Summary: [C++][C] unable to import array with null count == -1 (which could 
be exported)  (was: [C++/C-ABI] unable to import array with null count == -1 
(which could be exported))

> [C++][C] unable to import array with null count == -1 (which could be 
> exported)
> ---
>
> Key: ARROW-9037
> URL: https://issues.apache.org/jira/browse/ARROW-9037
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.17.1
>Reporter: Zhuo Peng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If an Array is created with null_count == -1 but without any nulls (and thus
> no null bitmap buffer), then ArrayData.null_count will remain -1 when 
> exporting if null_count is never computed. The exported C struct also has 
> null_count == -1 [1]. But when importing, if null_count != 0, an error [2] 
> will be raised.
> [1] 
> https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L560
> [2] 
> https://github.com/apache/arrow/blob/5389008df0267ba8d57edb7d6bb6ec0bfa10ff9a/cpp/src/arrow/c/bridge.cc#L1404
>  





[jira] [Resolved] (ARROW-8471) [C++][Integration] Regression to /u?int64/ as JSON::number

2020-06-04 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8471.
---
Resolution: Fixed

Issue resolved by pull request 7292
[https://github.com/apache/arrow/pull/7292]

> [C++][Integration] Regression to /u?int64/ as JSON::number
> --
>
> Key: ARROW-8471
> URL: https://issues.apache.org/jira/browse/ARROW-8471
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Integration
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> In moving datagen.py under archery, the fix for ARROW-6310 was clobbered,
> resulting in 64-bit integers being represented as bare numbers in the
> integration JSON.





[jira] [Assigned] (ARROW-8283) [C++/Python][Dataset] Non-existent files are silently dropped in pa.dataset.FileSystemDataset

2020-06-03 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8283:
-

Assignee: Francois Saint-Jacques

> [C++/Python][Dataset] Non-existent files are silently dropped in 
> pa.dataset.FileSystemDataset
> -
>
> Key: ARROW-8283
> URL: https://issues.apache.org/jira/browse/ARROW-8283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Uwe Korn
>Assignee: Francois Saint-Jacques
>Priority: Critical
>  Labels: dataset
> Fix For: 1.0.0
>
>
> When passing a list of files to the constructor of 
> {{pyarrow.dataset.FileSystemDataset}}, all files that don't exist are 
> silently dropped immediately (i.e. no fragments are created for them).
> Instead, I would expect that fragments will be created for them but an error 
> is thrown when one tries to read the fragment with the non-existent file.





[jira] [Updated] (ARROW-8726) [C++][Dataset] Mis-specified DirectoryPartitioning incorrectly uses the file name as value

2020-06-03 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8726:
--
Labels: dataset  (was: )

> [C++][Dataset] Mis-specified DirectoryPartitioning incorrectly uses the file 
> name as value
> --
>
> Key: ARROW-8726
> URL: https://issues.apache.org/jira/browse/ARROW-8726
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
> Fix For: 1.0.0
>
>
> Calling filter + collect on a dataset with a mis-specified partitioning 
> causes a segfault. Though this is clearly an input error, it would be nice if
> there was some guidance that something was wrong with the partitioning.
> {code:r}
> library(arrow)
> library(dplyr)
> dir.create("multi_mtcars/one", recursive = TRUE)
> dir.create("multi_mtcars/two", recursive = TRUE)
> write_parquet(mtcars, "multi_mtcars/one/mtcars.parquet")
> write_parquet(mtcars, "multi_mtcars/two/mtcars.parquet")
> ds <- open_dataset("multi_mtcars", partitioning = c("level", "nothing"))
> # the following will segfault
> ds %>%
>   filter(cyl > 8) %>% 
>   collect()
> {code}





[jira] [Resolved] (ARROW-8946) [Python] Add tests for parquet.write_metadata metadata_collector

2020-06-03 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8946.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7345
[https://github.com/apache/arrow/pull/7345]

> [Python] Add tests for parquet.write_metadata metadata_collector
> 
>
> Key: ARROW-8946
> URL: https://issues.apache.org/jira/browse/ARROW-8946
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Follow-up on ARROW-8062: the PR added functionality to 
> {{parquet.write_metadata}} to pass a collection of metadata objects to be
> concatenated. We should add some specific tests for this.





[jira] [Assigned] (ARROW-8946) [Python] Add tests for parquet.write_metadata metadata_collector

2020-06-03 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8946:
-

Assignee: Joris Van den Bossche

> [Python] Add tests for parquet.write_metadata metadata_collector
> 
>
> Key: ARROW-8946
> URL: https://issues.apache.org/jira/browse/ARROW-8946
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Follow-up on ARROW-8062: the PR added functionality to 
> {{parquet.write_metadata}} to pass a collection of metadata objects to be
> concatenated. We should add some specific tests for this.





[jira] [Commented] (ARROW-8726) [C++][Dataset] Mis-specified DirectoryPartitioning incorrectly uses the file name as value

2020-06-03 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124996#comment-17124996
 ] 

Francois Saint-Jacques commented on ARROW-8726:
---

I forgot that it was aptly named `DirectoryPartitioning`, I'll check that ;)

> [C++][Dataset] Mis-specified DirectoryPartitioning incorrectly uses the file 
> name as value
> --
>
> Key: ARROW-8726
> URL: https://issues.apache.org/jira/browse/ARROW-8726
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> Calling filter + collect on a dataset with a mis-specified partitioning 
> causes a segfault. Though this is clearly input error, it would be nice if 
> there was some guidance that something was wrong with the partitioning.
> {code:r}
> library(arrow)
> library(dplyr)
> dir.create("multi_mtcars/one", recursive = TRUE)
> dir.create("multi_mtcars/two", recursive = TRUE)
> write_parquet(mtcars, "multi_mtcars/one/mtcars.parquet")
> write_parquet(mtcars, "multi_mtcars/two/mtcars.parquet")
> ds <- open_dataset("multi_mtcars", partitioning = c("level", "nothing"))
> # the following will segfault
> ds %>%
>   filter(cyl > 8) %>% 
>   collect()
> {code}





[jira] [Commented] (ARROW-8726) [C++][Dataset] Mis-specified DirectoryPartitioning incorrectly uses the file name as value

2020-06-03 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124984#comment-17124984
 ] 

Francois Saint-Jacques commented on ARROW-8726:
---

[~jorisvandenbossche]

What would you like to see solved:

1. The fact that the file name is used as a partition. Should we only consider 
the directory of the base path? This ambiguity goes away with HivePartitioning 
since it won't be parsed.
2. The fact that passing an "extra" key without value generates an error. The 
other option would be to default to NullType.

> [C++][Dataset] Mis-specified DirectoryPartitioning incorrectly uses the file 
> name as value
> --
>
> Key: ARROW-8726
> URL: https://issues.apache.org/jira/browse/ARROW-8726
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> Calling filter + collect on a dataset with a mis-specified partitioning 
> causes a segfault. Though this is clearly input error, it would be nice if 
> there was some guidance that something was wrong with the partitioning.
> {code:r}
> library(arrow)
> library(dplyr)
> dir.create("multi_mtcars/one", recursive = TRUE)
> dir.create("multi_mtcars/two", recursive = TRUE)
> write_parquet(mtcars, "multi_mtcars/one/mtcars.parquet")
> write_parquet(mtcars, "multi_mtcars/two/mtcars.parquet")
> ds <- open_dataset("multi_mtcars", partitioning = c("level", "nothing"))
> # the following will segfault
> ds %>%
>   filter(cyl > 8) %>% 
>   collect()
> {code}





[jira] [Created] (ARROW-9028) [R] Should be able to convert an empty table

2020-06-03 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-9028:
-

 Summary: [R] Should be able to convert an empty table
 Key: ARROW-9028
 URL: https://issues.apache.org/jira/browse/ARROW-9028
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Francois Saint-Jacques








[jira] [Commented] (ARROW-8726) [C++][Dataset] Mis-specified DirectoryPartitioning incorrectly uses the file name as value

2020-06-03 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124974#comment-17124974
 ] 

Francois Saint-Jacques commented on ARROW-8726:
---

The empty table conversion will be tracked in ARROW-9028.

> [C++][Dataset] Mis-specified DirectoryPartitioning incorrectly uses the file 
> name as value
> --
>
> Key: ARROW-8726
> URL: https://issues.apache.org/jira/browse/ARROW-8726
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> Calling filter + collect on a dataset with a mis-specified partitioning 
> causes a segfault. Though this is clearly input error, it would be nice if 
> there was some guidance that something was wrong with the partitioning.
> {code:r}
> library(arrow)
> library(dplyr)
> dir.create("multi_mtcars/one", recursive = TRUE)
> dir.create("multi_mtcars/two", recursive = TRUE)
> write_parquet(mtcars, "multi_mtcars/one/mtcars.parquet")
> write_parquet(mtcars, "multi_mtcars/two/mtcars.parquet")
> ds <- open_dataset("multi_mtcars", partitioning = c("level", "nothing"))
> # the following will segfault
> ds %>%
>   filter(cyl > 8) %>% 
>   collect()
> {code}





[jira] [Resolved] (ARROW-8986) [Archery][ursabot] Fix benchmark diff checkout of origin/master

2020-06-03 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8986.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Fixed upstream in ursabot finally.

> [Archery][ursabot] Fix benchmark diff checkout of origin/master
> ---
>
> Key: ARROW-8986
> URL: https://issues.apache.org/jira/browse/ARROW-8986
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Minor
> Fix For: 1.0.0
>
>
> https://github.com/apache/arrow/pull/7300#issuecomment-635967095





[jira] [Resolved] (ARROW-8711) [Python] Expose strptime timestamp parsing in read_csv conversion options

2020-06-03 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8711.
---
Resolution: Fixed

Issue resolved by pull request 7223
[https://github.com/apache/arrow/pull/7223]

> [Python] Expose strptime timestamp parsing in read_csv conversion options
> -
>
> Key: ARROW-8711
> URL: https://issues.apache.org/jira/browse/ARROW-8711
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Follow up to ARROW-8111





[jira] [Resolved] (ARROW-8843) [C++] Optimize BitmapEquals unaligned case

2020-06-01 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8843.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7285
[https://github.com/apache/arrow/pull/7285]

> [C++] Optimize BitmapEquals unaligned case
> --
>
> Key: ARROW-8843
> URL: https://issues.apache.org/jira/browse/ARROW-8843
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> BitmapEquals unaligned case compares two bitmap bit-by-bit[1]. Similar tricks 
> in this PR[2] may also be helpful here to improve performance by processing 
> in words.
> [1] 
> https://github.com/apache/arrow/blob/e5a33f1220705aec6a224b55d2a6f47fbd957603/cpp/src/arrow/util/bit_util.cc#L248-L254
> [2] https://github.com/apache/arrow/pull/7135
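The bit-by-bit slow path referenced in [1] can be sketched in Python; the optimization replaces this per-bit loop with word-at-a-time comparisons after shifting the unaligned operand. This is a sketch only, not Arrow's implementation:

```python
def bitmap_get(buf: bytes, i: int) -> int:
    # Arrow bitmaps are LSB-first within each byte: bit i of the bitmap
    # is bit (i % 8) of byte (i // 8).
    return (buf[i // 8] >> (i % 8)) & 1

def bitmaps_equal_naive(a: bytes, a_off: int, b: bytes, b_off: int, n: int) -> bool:
    # O(n) per-bit comparison: the unaligned slow path before this change.
    return all(bitmap_get(a, a_off + i) == bitmap_get(b, b_off + i)
               for i in range(n))

# 0b10101010 read from bit 0 equals 0b01010101 read from bit 1 (7 bits).
print(bitmaps_equal_naive(bytes([0b10101010]), 0, bytes([0b01010101]), 1, 7))
```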





[jira] [Resolved] (ARROW-8997) [Archery] Benchmark formatter should have friendly units

2020-06-01 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8997.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7316
[https://github.com/apache/arrow/pull/7316]

> [Archery] Benchmark formatter should have friendly units
> 
>
> Key: ARROW-8997
> URL: https://issues.apache.org/jira/browse/ARROW-8997
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Archery
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The current output is not friendly to glance at. Usage of humanfriendly can 
> help here.





[jira] [Assigned] (ARROW-8997) [Archery] Benchmark formatter should have friendly units

2020-06-01 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8997:
-

Assignee: Francois Saint-Jacques

> [Archery] Benchmark formatter should have friendly units
> 
>
> Key: ARROW-8997
> URL: https://issues.apache.org/jira/browse/ARROW-8997
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Archery
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Minor
>
> The current output is not friendly to glance at. Usage of humanfriendly can 
> help here.





[jira] [Updated] (ARROW-8997) [Archery] Benchmark formatter should have friendly units

2020-06-01 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8997:
--
Issue Type: Improvement  (was: Bug)

> [Archery] Benchmark formatter should have friendly units
> 
>
> Key: ARROW-8997
> URL: https://issues.apache.org/jira/browse/ARROW-8997
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Archery
>Reporter: Francois Saint-Jacques
>Priority: Minor
>
> The current output is not friendly to glance at. Usage of humanfriendly can 
> help here.





[jira] [Updated] (ARROW-8997) [Archery] Benchmark formatter should have friendly units

2020-06-01 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8997:
--
Priority: Minor  (was: Major)

> [Archery] Benchmark formatter should have friendly units
> 
>
> Key: ARROW-8997
> URL: https://issues.apache.org/jira/browse/ARROW-8997
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery
>Reporter: Francois Saint-Jacques
>Priority: Minor
>
> The current output is not friendly to glance at. Usage of humanfriendly can 
> help here.





[jira] [Updated] (ARROW-8997) [Archery] Benchmark formatter should have friendly units

2020-06-01 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8997:
--
Component/s: Archery

> [Archery] Benchmark formatter should have friendly units
> 
>
> Key: ARROW-8997
> URL: https://issues.apache.org/jira/browse/ARROW-8997
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> The current output is not friendly to glance at. Usage of humanfriendly can 
> help here.





[jira] [Created] (ARROW-8997) [Archery] Benchmark formatter should have friendly units

2020-06-01 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8997:
-

 Summary: [Archery] Benchmark formatter should have friendly units
 Key: ARROW-8997
 URL: https://issues.apache.org/jira/browse/ARROW-8997
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Francois Saint-Jacques


The current output is not friendly to glance at. Usage of humanfriendly can 
help here.
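The kind of unit scaling humanfriendly provides can be sketched with a dependency-free helper (illustrative only; the function name and unit cutoffs are assumptions, not from the patch):

```python
def friendly_time(ns: float) -> str:
    # Scale a nanosecond benchmark reading to the largest friendly unit.
    for unit, scale in (("s", 1e9), ("ms", 1e6), ("us", 1e3)):
        if ns >= scale:
            return f"{ns / scale:.3g} {unit}"
    return f"{ns:.3g} ns"

print(friendly_time(2.5e9))    # 2.5 s
print(friendly_time(340_000))  # 340 us
```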





[jira] [Assigned] (ARROW-8986) [Archery][ursabot] Fix benchmark diff checkout of origin/master

2020-06-01 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8986:
-

Assignee: Francois Saint-Jacques

> [Archery][ursabot] Fix benchmark diff checkout of origin/master
> ---
>
> Key: ARROW-8986
> URL: https://issues.apache.org/jira/browse/ARROW-8986
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Minor
>
> https://github.com/apache/arrow/pull/7300#issuecomment-635967095





[jira] [Created] (ARROW-8986) [Archery][ursabot] Fix benchmark diff checkout of origin/master

2020-05-30 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8986:
-

 Summary: [Archery][ursabot] Fix benchmark diff checkout of 
origin/master
 Key: ARROW-8986
 URL: https://issues.apache.org/jira/browse/ARROW-8986
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Francois Saint-Jacques


https://github.com/apache/arrow/pull/7300#issuecomment-635967095





[jira] [Resolved] (ARROW-8975) [FlightRPC][C++] Fix flaky MacOS tests

2020-05-29 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8975.
---
Resolution: Fixed

Issue resolved by pull request 7298
[https://github.com/apache/arrow/pull/7298]

> [FlightRPC][C++] Fix flaky MacOS tests
> --
>
> Key: ARROW-8975
> URL: https://issues.apache.org/jira/browse/ARROW-8975
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Affects Versions: 0.17.1
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> The gRPC MacOS tests have been flaking again.
> Looking at [https://github.com/grpc/grpc/issues/20311] they may possibly have 
> been fixed except [https://github.com/grpc/grpc/issues/13856] reports they 
> haven't (in some configurations?) so I will try a few things in CI, or just 
> disable the tests on MacOS.





[jira] [Resolved] (ARROW-8914) [C++][Gandiva] Decimal128 related test failed on big-endian platforms

2020-05-29 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8914.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

> [C++][Gandiva] Decimal128 related test failed on big-endian platforms
> -
>
> Key: ARROW-8914
> URL: https://issues.apache.org/jira/browse/ARROW-8914
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> These test failures in gandiva tests occur on big-endian platforms. An 
> example from https://travis-ci.org/github/apache/arrow/jobs/690006107#L2306
> {code}
> ...
> [==] 17 tests from 1 test case ran. (2334 ms total)
> [  PASSED  ] 7 tests.
> [  FAILED  ] 10 tests, listed below:
> [  FAILED  ] TestDecimal.TestSimple
> [  FAILED  ] TestDecimal.TestLiteral
> [  FAILED  ] TestDecimal.TestCompare
> [  FAILED  ] TestDecimal.TestRoundFunctions
> [  FAILED  ] TestDecimal.TestCastFunctions
> [  FAILED  ] TestDecimal.TestIsDistinct
> [  FAILED  ] TestDecimal.TestCastVarCharDecimal
> [  FAILED  ] TestDecimal.TestCastDecimalVarChar
> [  FAILED  ] TestDecimal.TestVarCharDecimalNestedCast
> [  FAILED  ] TestDecimal.TestCastDecimalOverflow
> ...
> {code}





[jira] [Updated] (ARROW-8914) [C++][Gandiva] Decimal128 related test failed on big-endian platforms

2020-05-29 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8914:
--
Component/s: (was: C++ - Gandiva)
 C++

> [C++][Gandiva] Decimal128 related test failed on big-endian platforms
> -
>
> Key: ARROW-8914
> URL: https://issues.apache.org/jira/browse/ARROW-8914
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> These test failures in gandiva tests occur on big-endian platforms. An 
> example from https://travis-ci.org/github/apache/arrow/jobs/690006107#L2306
> {code}
> ...
> [==] 17 tests from 1 test case ran. (2334 ms total)
> [  PASSED  ] 7 tests.
> [  FAILED  ] 10 tests, listed below:
> [  FAILED  ] TestDecimal.TestSimple
> [  FAILED  ] TestDecimal.TestLiteral
> [  FAILED  ] TestDecimal.TestCompare
> [  FAILED  ] TestDecimal.TestRoundFunctions
> [  FAILED  ] TestDecimal.TestCastFunctions
> [  FAILED  ] TestDecimal.TestIsDistinct
> [  FAILED  ] TestDecimal.TestCastVarCharDecimal
> [  FAILED  ] TestDecimal.TestCastDecimalVarChar
> [  FAILED  ] TestDecimal.TestVarCharDecimalNestedCast
> [  FAILED  ] TestDecimal.TestCastDecimalOverflow
> ...
> {code}





[jira] [Resolved] (ARROW-8962) [C++] Linking failure with clang-4.0

2020-05-27 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8962.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7286
[https://github.com/apache/arrow/pull/7286]

> [C++] Linking failure with clang-4.0
> 
>
> Key: ARROW-8962
> URL: https://issues.apache.org/jira/browse/ARROW-8962
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code:java}
> FAILED: release/arrow-file-to-stream
> : && /Users/uwe/miniconda3/envs/pyarrow-dev/bin/ccache 
> /Users/uwe/miniconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang++  
> -march=core2 -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE 
> -fstack-protector-strong -O2 -pipe -stdlib=libc++ -fvisibility-inlines-hidden 
> -std=c++14 -fmessage-length=0 -Qunused-arguments -fcolor-diagnostics -O3 
> -DNDEBUG  -Wall -Wno-unknown-warning-option -Wno-pass-failed -msse4.2  -O3 
> -DNDEBUG -isysroot 
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk
>  -Wl,-search_paths_first -Wl,-headerpad_max_install_names -Wl,-pie 
> -Wl,-headerpad_max_install_names -Wl,-dead_strip_dylibs 
> src/arrow/ipc/CMakeFiles/arrow-file-to-stream.dir/file_to_stream.cc.o  -o 
> release/arrow-file-to-stream  release/libarrow.a 
> /usr/local/opt/openssl@1.1/lib/libssl.dylib 
> /usr/local/opt/openssl@1.1/lib/libcrypto.dylib 
> /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libbrotlienc-static.a 
> /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libbrotlidec-static.a 
> /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libbrotlicommon-static.a 
> /Users/uwe/miniconda3/envs/pyarrow-dev/lib/liblz4.dylib 
> /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libsnappy.1.1.7.dylib 
> /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libz.dylib 
> /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libzstd.dylib 
> /Users/uwe/miniconda3/envs/pyarrow-dev/lib/liborc.a 
> /Users/uwe/miniconda3/envs/pyarrow-dev/lib/libprotobuf.dylib 
> jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a && :
> Undefined symbols for architecture x86_64:
>   "arrow::internal::(anonymous 
> namespace)::StringToFloatConverterImpl::main_junk_value_", referenced from:
>   arrow::internal::StringToFloat(char const*, unsigned long, float*) in 
> libarrow.a(value_parsing.cc.o)
>   arrow::internal::StringToFloat(char const*, unsigned long, double*) in 
> libarrow.a(value_parsing.cc.o)
>   "arrow::internal::(anonymous 
> namespace)::StringToFloatConverterImpl::fallback_junk_value_", referenced 
> from:
>   arrow::internal::StringToFloat(char const*, unsigned long, float*) in 
> libarrow.a(value_parsing.cc.o)
>   arrow::internal::StringToFloat(char const*, unsigned long, double*) in 
> libarrow.a(value_parsing.cc.o)
> ld: symbol(s) not found for architecture x86_64
> clang-4.0: error: linker command failed with exit code 1 (use -v to see 
> invocation) {code}





[jira] [Commented] (ARROW-2079) [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't available

2020-05-26 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116725#comment-17116725
 ] 

Francois Saint-Jacques commented on ARROW-2079:
---

[~mdurant], my question is more about "why" than "how".

> [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't 
> available
> ---
>
> Key: ARROW-2079
> URL: https://issues.apache.org/jira/browse/ARROW-2079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Priority: Minor
>  Labels: dataset, dataset-parquet-read, parquet
>
> Currently pyarrow's parquet writer only writes `_common_metadata` and not 
> `_metadata`. From what I understand these are intended to contain the dataset 
> schema but not any row group information.
>  
> A few (possibly naive) questions:
>  
> 1. In the `__init__` for `ParquetDataset`, the following lines exist:
> {code:java}
> if self.metadata_path is not None:
> with self.fs.open(self.metadata_path) as f:
> self.common_metadata = ParquetFile(f).metadata
> else:
> self.common_metadata = None
> {code}
> I believe this should use `common_metadata_path` instead of `metadata_path`, 
> as the latter is never written by `pyarrow`, and is given by the `_metadata` 
> file instead of `_common_metadata` (as seemingly intended?).
>  
> 2. In `validate_schemas` I believe an option should exist for using the 
> schema from `_common_metadata` instead of `_metadata`, as pyarrow currently 
> only writes the former, and as far as I can tell `_common_metadata` does 
> include all the schema information needed.
>  
> Perhaps the logic in `validate_schemas` could be ported over to:
>  
> {code:java}
> if self.schema is not None:
> pass  # schema explicitly provided
> elif self.metadata is not None:
> self.schema = self.metadata.schema
> elif self.common_metadata is not None:
> self.schema = self.common_metadata.schema
> else:
> self.schema = self.pieces[0].get_metadata(open_file).schema{code}
> If these changes are valid, I'd be happy to submit a PR. It's not 100% clear 
> to me the difference between `_common_metadata` and `_metadata`, but I 
> believe the schema in both should be the same. Figured I'd open this for 
> discussion.





[jira] [Commented] (ARROW-2079) [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't available

2020-05-25 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17116205#comment-17116205
 ] 

Francois Saint-Jacques commented on ARROW-2079:
---

Question to users/developers: why do we need 2 files? Is it because 
`_metadata` can be too big?

> [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't 
> available
> ---
>
> Key: ARROW-2079
> URL: https://issues.apache.org/jira/browse/ARROW-2079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Priority: Minor
>  Labels: dataset, dataset-parquet-read, parquet
>
> Currently pyarrow's parquet writer only writes `_common_metadata` and not 
> `_metadata`. From what I understand these are intended to contain the dataset 
> schema but not any row group information.
>  
> A few (possibly naive) questions:
>  
> 1. In the `__init__` for `ParquetDataset`, the following lines exist:
> {code:java}
> if self.metadata_path is not None:
> with self.fs.open(self.metadata_path) as f:
> self.common_metadata = ParquetFile(f).metadata
> else:
> self.common_metadata = None
> {code}
> I believe this should use `common_metadata_path` instead of `metadata_path`, 
> as the latter is never written by `pyarrow`, and is given by the `_metadata` 
> file instead of `_common_metadata` (as seemingly intended?).
>  
> 2. In `validate_schemas` I believe an option should exist for using the 
> schema from `_common_metadata` instead of `_metadata`, as pyarrow currently 
> only writes the former, and as far as I can tell `_common_metadata` does 
> include all the schema information needed.
>  
> Perhaps the logic in `validate_schemas` could be ported over to:
>  
> {code:java}
> if self.schema is not None:
> pass  # schema explicitly provided
> elif self.metadata is not None:
> self.schema = self.metadata.schema
> elif self.common_metadata is not None:
> self.schema = self.common_metadata.schema
> else:
> self.schema = self.pieces[0].get_metadata(open_file).schema{code}
> If these changes are valid, I'd be happy to submit a PR. It's not 100% clear 
> to me the difference between `_common_metadata` and `_metadata`, but I 
> believe the schema in both should be the same. Figured I'd open this for 
> discussion.





[jira] [Resolved] (ARROW-3244) [Python] Multi-file parquet loading without scan

2020-05-25 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-3244.
---
Fix Version/s: 1.0.0
   Resolution: Implemented

> [Python] Multi-file parquet loading without scan
> 
>
> Key: ARROW-3244
> URL: https://issues.apache.org/jira/browse/ARROW-3244
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Martin Durant
>Priority: Major
>  Labels: dataset, dataset-parquet-read, parquet
> Fix For: 1.0.0
>
>
> A number of mechanism are possible to avoid having to access and read the 
> parquet footers in a data set consisting of a number of files. In the case of 
> a large number of data files (perhaps split with directory partitioning) and 
> remote storage, this can be a significant overhead. This is significant from 
> the point of view of Dask, which must have the metadata available in the 
> client before setting up computational graphs.
>  
> Here are some suggestions of what could be done.
>  
>  * some parquet writing frameworks include a `_metadata` file, which contains 
> all the information from the footers of the various files. If this file is 
> present, then this data can be read from one place, with a single file 
> access. For a large number of files, parsing the thrift information may, by 
> itself, be a non-negligible overhead.
>  * the schema (dtypes) can be found in a `_common_metadata`, or from any one 
> of the data-files, then the schema could be assumed (perhaps at the user's 
> option) to be the same for all of the files. However, the information about 
> the directory partitioning would not be available. Although Dask may infer 
> the information from the filenames, it would be preferable to go through the 
> machinery with parquet-cpp, and view the whole data-set as a single object. 
> Note that the files will still need to have the footer read to access the 
> data, for the bytes offsets, but from Dask's point of view, this would be 
> deferred to tasks running in parallel.
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)
>  





[jira] [Resolved] (ARROW-8062) [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file

2020-05-25 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8062.
---
Resolution: Fixed

Issue resolved by pull request 7180
[https://github.com/apache/arrow/pull/7180]

> [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file
> -
>
> Key: ARROW-8062
> URL: https://issues.apache.org/jira/browse/ARROW-8062
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, dataset-dask-integration, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Partitioned parquet datasets sometimes come with {{_metadata}} / 
> {{_common_metadata}} files. Those files include information about the schema 
> of the full dataset and potentially all RowGroup metadata as well (for 
> {{_metadata}}).
> Using those files during the creation of a parquet {{Dataset}} can give a 
> more efficient factory (using the stored schema instead of inferring the 
> schema from unioning the schemas of all files + using the paths to individual 
> parquet files instead of crawling the directory).
> Basically, based those files, the schema, list of paths and partition 
> expressions (the information that is needed to create a Dataset) could be 
> constructed.   
> Such logic could be put in a different factory class, eg 
> {{ParquetManifestFactory}} (as suggested by [~fsaintjacques]).





[jira] [Resolved] (ARROW-8932) [C++] symbol resolution failures with liborc.a

2020-05-25 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8932.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7266
[https://github.com/apache/arrow/pull/7266]

> [C++] symbol resolution failures with liborc.a
> --
>
> Key: ARROW-8932
> URL: https://issues.apache.org/jira/browse/ARROW-8932
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This is failing in the Travis CI s390x build. I am not sure this is related 
> to ARROW-8930.
> [https://travis-ci.org/github/apache/arrow/jobs/690006107] was successful.
> [https://travis-ci.org/github/apache/arrow/jobs/690634108#L1023|https://travis-ci.org/github/apache/arrow/jobs/690634108]
>  causes failures.
> {code:java}
> [435/548] Linking CXX executable debug/arrow-orc-adapter-test
> 1024 FAILED: debug/arrow-orc-adapter-test
> 1025 : && /usr/bin/ccache /usr/bin/c++  -Wno-noexcept-type  
> -fdiagnostics-color=always -ggdb -O0  -Wall -Wno-conversion 
> -Wno-sign-conversion -Wno-unused-variable -Werror  -g  -rdynamic 
> src/arrow/adapters/orc/CMakeFiles/arrow-orc-adapter-test.dir/adapter_test.cc.o
>   -o debug/arrow-orc-adapter-test  -Wl,-rpath,/build/cpp/debug  
> debug/libarrow_testing.a  debug/libarrow.a  debug//libgtest_maind.so  
> debug//libgtestd.so  /usr/lib/s390x-linux-gnu/libsnappy.so.1.1.8  
> /usr/lib/s390x-linux-gnu/liblz4.so  /usr/lib/s390x-linux-gnu/libz.so  
> -lpthread  -ldl  orc_ep-install/lib/liborc.a  
> /usr/lib/s390x-linux-gnu/libssl.so  /usr/lib/s390x-linux-gnu/libcrypto.so  
> /usr/lib/s390x-linux-gnu/libbrotlienc.so  
> /usr/lib/s390x-linux-gnu/libbrotlidec.so  
> /usr/lib/s390x-linux-gnu/libbrotlicommon.so  
> /usr/lib/s390x-linux-gnu/libbz2.so  /usr/lib/s390x-linux-gnu/libzstd.so  
> protobuf_ep-install/lib/libprotobuf.a  /usr/lib/s390x-linux-gnu/libglog.so  
> jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a  -pthread  
> -lrt && :
> 1026 /usr/bin/ld: orc_ep-install/lib/liborc.a(Compression.cc.o): in function 
> `orc::ZlibCompressionStream::doStreamingCompression()':
> 1027 /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:244: 
> undefined reference to `deflateReset'
> 1028 /usr/bin/ld: 
> /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:266: undefined 
> reference to `deflate'
> 1029 /usr/bin/ld: orc_ep-install/lib/liborc.a(Compression.cc.o): in function 
> `orc::ZlibCompressionStream::init()':
> 1030 /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:296: 
> undefined reference to `deflateInit2_'
> 1031 /usr/bin/ld: orc_ep-install/lib/liborc.a(Compression.cc.o): in function 
> `orc::ZlibCompressionStream::end()':
> 1032 /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:303: 
> undefined reference to `deflateEnd'
> 1033 /usr/bin/ld: orc_ep-install/lib/liborc.a(Compression.cc.o): in function 
> `orc::ZlibDecompressionStream::ZlibDecompressionStream(std::unique_ptr  std::default_delete >, unsigned long, 
> orc::MemoryPool&)':
> 1034 /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:417: 
> undefined reference to `inflateInit2_'
> 1035 /usr/bin/ld: orc_ep-install/lib/liborc.a(Compression.cc.o): in function 
> `orc::ZlibDecompressionStream::~ZlibDecompressionStream()':
> 1036 /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:442: 
> undefined reference to `inflateEnd'
> 1037 /usr/bin/ld: orc_ep-install/lib/liborc.a(Compression.cc.o): in function 
> `orc::ZlibDecompressionStream::Next(void const**, int*)':
> 1038 /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:483: 
> undefined reference to `inflateReset'
> 1039 /usr/bin/ld: 
> /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:489: undefined 
> reference to `inflate'
> 1040 /usr/bin/ld: orc_ep-install/lib/liborc.a(Compression.cc.o): in function 
> `orc::SnappyDecompressionStream::decompress(char const*, unsigned long, 
> char*, unsigned long)':
> 1041 /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:848: 
> undefined reference to `snappy::GetUncompressedLength(char const*, unsigned 
> long, unsigned long*)'
> 1042 /usr/bin/ld: 
> /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:856: undefined 
> reference to `snappy::RawUncompress(char const*, unsigned long, char*)'
> 1043 /usr/bin/ld: orc_ep-install/lib/liborc.a(Compression.cc.o): in function 
> `orc::Lz4DecompressionStream::decompress(char const*, unsigned long, char*, 
> unsigned long)':
> 1044 /build/cpp/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:922: 
> undefined 

[jira] [Resolved] (ARROW-8911) [C++] Slicing a ChunkedArray with zero chunks segfaults

2020-05-25 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8911.
---
Resolution: Fixed

Issue resolved by pull request 7262
[https://github.com/apache/arrow/pull/7262]

> [C++] Slicing a ChunkedArray with zero chunks segfaults
> ---
>
> Key: ARROW-8911
> URL: https://issues.apache.org/jira/browse/ARROW-8911
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.17.1
> Environment: macOS, ubuntu
>Reporter: A. Coady
>Assignee: Wes McKinney
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code:python}
> import pyarrow as pa
> arr = pa.chunked_array([[1]])
> empty = arr.filter(pa.array([False]))
> print(empty)
> print(empty[:]) # <- crash
> {code}
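A pure-Python sketch of the failure mode (pyarrow's internals differ): slicing code that assumes at least one chunk dereferences the first chunk and crashes on a zero-chunk array; the fix amounts to an explicit empty-case guard.

```python
class ChunkedLike:
    """Toy stand-in for a chunked array; only the guard matters here."""
    def __init__(self, chunks):
        self.chunks = chunks

    def slice_all(self):
        # Without this guard, code written against "at least one chunk"
        # would index chunks[0] and crash when there are zero chunks.
        if not self.chunks:
            return ChunkedLike([])
        return ChunkedLike([list(c) for c in self.chunks])
```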





[jira] [Resolved] (ARROW-8890) [R] Fix C++ lint issue

2020-05-22 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8890.
---
Resolution: Fixed

Issue resolved by pull request 7251
[https://github.com/apache/arrow/pull/7251]

> [R] Fix C++ lint issue 
> ---
>
> Key: ARROW-8890
> URL: https://issues.apache.org/jira/browse/ARROW-8890
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Assigned] (ARROW-8510) [C++] arrow/dataset/file_base.cc fails to compile with internal compiler error with "Visual Studio 15 2017 Win64" generator

2020-05-22 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8510:
-

Assignee: Francois Saint-Jacques

> [C++] arrow/dataset/file_base.cc fails to compile with internal compiler 
> error with "Visual Studio 15 2017 Win64" generator
> ---
>
> Key: ARROW-8510
> URL: https://issues.apache.org/jira/browse/ARROW-8510
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Developer Tools
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Blocker
> Fix For: 1.0.0
>
>
> I discovered this while running the release verification on Windows. There 
> was an obscuring issue which is that if the build fails, the verification 
> script continues. I will fix that





[jira] [Assigned] (ARROW-8889) [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None

2020-05-22 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8889:
-

Assignee: David Li

> [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None
> --
>
> Key: ARROW-8889
> URL: https://issues.apache.org/jira/browse/ARROW-8889
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1, 0.17.1
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This seems to only happen for Python 3.6 and 3.7. It doesn't happen with 3.8. 
> It seems to happen even when built from source, but I used the wheels for 
> this reproduction.
> {noformat}
> > uname -a
> Linux chaconne 5.6.13-arch1-1 #1 SMP PREEMPT Thu, 14 May 2020 06:52:53 + 
> x86_64 GNU/Linux
> > python --version
> Python 3.7.7
> > pip freeze
> numpy==1.18.4
> pyarrow==0.17.1{noformat}
> Reproduction:
> {code:python}
> import pyarrow as pa
> table = pa.Table.from_arrays([pa.array([1,2,3])], names=["a"])
> batches = table.to_batches()
> batches[0].equals(None)
> {code}
> {noformat}
> #0  0x7fffdf9d34f0 in arrow::RecordBatch::num_columns() const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #1  0x7fffdf9d69e9 in arrow::RecordBatch::Equals(arrow::RecordBatch 
> const&, bool) const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #2  0x7fffe084a6e0 in 
> __pyx_pw_7pyarrow_3lib_11RecordBatch_31equals(_object*, _object*, _object*) 
> () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so
> #3  0x556b97e4 in _PyMethodDef_RawFastCallKeywords 
> (method=0x7fffe0c1b760 <__pyx_methods_7pyarrow_3lib_RecordBatch+288>, 
> self=0x7fffdefd7110, args=0x7786f5c8, nargs=, 
> kwnames=)
> at /tmp/build/80754af9/python_1585000375785/work/Objects/call.c:694
> #4  0x556c06af in _PyMethodDescr_FastCallKeywords 
> (descrobj=0x7fffdefa4050, args=0x7786f5c0, nargs=2, kwnames=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Objects/descrobject.c:288
> #5  0x55724add in call_function (kwnames=0x0, oparg=2, 
> pp_stack=) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:4593
> #6  _PyEval_EvalFrameDefault (f=, throwflag=) 
> at /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3110
> #7  0x55669289 in _PyEval_EvalCodeWithName (_co=0x778a68a0, 
> globals=, locals=, args=, 
> argcount=, kwnames=0x0, kwargs=0x0, kwcount=, 
> kwstep=2, 
> defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3930
> #8  0x5566a1c4 in PyEval_EvalCodeEx (_co=, 
> globals=, locals=, args=, 
> argcount=, kws=, kwcount=0, defs=0x0, 
> defcount=0, kwdefs=0x0, 
> closure=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3959
> #9  0x5566a1ec in PyEval_EvalCode (co=, 
> globals=, locals=) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:524
> #10 0x55780cb4 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x778d7c30, locals=0x778d7c30, flags=<optimized out>, 
> arena=<optimized out>)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:1035
> #11 0x5578b0d1 in PyRun_FileExFlags (fp=0x558c24d0, 
> filename_str=, start=, globals=0x778d7c30, 
> locals=0x778d7c30, closeit=1, flags=0x7fffe1b0)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:988
> #12 0x5578b2c3 in PyRun_SimpleFileExFlags (fp=0x558c24d0, 
> filename=, closeit=1, flags=0x7fffe1b0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:429
> #13 0x5578c3f5 in pymain_run_file (p_cf=0x7fffe1b0, 
> filename=0x558e51f0 L"repro.py", fp=0x558c24d0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:462
> #14 pymain_run_filename (cf=0x7fffe1b0, pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:1641
> #15 pymain_run_python (pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:2902
> #16 pymain_main (pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3442
> #17 0x5578c51c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3477
> #18 0x77dcd002 in __libc_start_main () from /usr/lib/libc.so.6
> #19 0x5572fac0 in _start () at ../sysdeps/x86_64/elf/start.S:103
> {noformat}
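The crash happens because {{None}} reaches the native {{Equals}} call. A pure-Python sketch of the guard the fix needs (illustrative only, not the actual Cython code):

```python
class RecordBatchLike:
    """Toy stand-in for RecordBatch; only the None guard matters."""
    def __init__(self, columns):
        self.columns = columns

    def equals(self, other):
        # Guard: never pass None through to native code, where it would
        # be dereferenced (num_columns() on a null batch) and segfault.
        if other is None:
            return False
        return self.columns == other.columns
```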




[jira] [Resolved] (ARROW-8889) [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None

2020-05-22 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8889.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7249
[https://github.com/apache/arrow/pull/7249]

> [Python] Python 3.7 SIGSEGV when comparing RecordBatch to None
> --
>
> Key: ARROW-8889
> URL: https://issues.apache.org/jira/browse/ARROW-8889
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1, 0.17.1
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This seems to only happen for Python 3.6 and 3.7. It doesn't happen with 3.8. 
> It seems to happen even when built from source, but I used the wheels for 
> this reproduction.
> {noformat}
> > uname -a
> Linux chaconne 5.6.13-arch1-1 #1 SMP PREEMPT Thu, 14 May 2020 06:52:53 + 
> x86_64 GNU/Linux
> > python --version
> Python 3.7.7
> > pip freeze
> numpy==1.18.4
> pyarrow==0.17.1{noformat}
> Reproduction:
> {code:python}
> import pyarrow as pa
> table = pa.Table.from_arrays([pa.array([1,2,3])], names=["a"])
> batches = table.to_batches()
> batches[0].equals(None)
> {code}
> {noformat}
> #0  0x7fffdf9d34f0 in arrow::RecordBatch::num_columns() const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #1  0x7fffdf9d69e9 in arrow::RecordBatch::Equals(arrow::RecordBatch 
> const&, bool) const () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/libarrow.so.17
> #2  0x7fffe084a6e0 in 
> __pyx_pw_7pyarrow_3lib_11RecordBatch_31equals(_object*, _object*, _object*) 
> () from 
> /home/lidavidm/Code/twosigma/arrow/venv2/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so
> #3  0x556b97e4 in _PyMethodDef_RawFastCallKeywords 
> (method=0x7fffe0c1b760 <__pyx_methods_7pyarrow_3lib_RecordBatch+288>, 
> self=0x7fffdefd7110, args=0x7786f5c8, nargs=, 
> kwnames=)
> at /tmp/build/80754af9/python_1585000375785/work/Objects/call.c:694
> #4  0x556c06af in _PyMethodDescr_FastCallKeywords 
> (descrobj=0x7fffdefa4050, args=0x7786f5c0, nargs=2, kwnames=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Objects/descrobject.c:288
> #5  0x55724add in call_function (kwnames=0x0, oparg=2, 
> pp_stack=) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:4593
> #6  _PyEval_EvalFrameDefault (f=, throwflag=) 
> at /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3110
> #7  0x55669289 in _PyEval_EvalCodeWithName (_co=0x778a68a0, 
> globals=, locals=, args=, 
> argcount=, kwnames=0x0, kwargs=0x0, kwcount=, 
> kwstep=2, 
> defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3930
> #8  0x5566a1c4 in PyEval_EvalCodeEx (_co=, 
> globals=, locals=, args=, 
> argcount=, kws=, kwcount=0, defs=0x0, 
> defcount=0, kwdefs=0x0, 
> closure=0x0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:3959
> #9  0x5566a1ec in PyEval_EvalCode (co=, 
> globals=, locals=) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/ceval.c:524
> #10 0x55780cb4 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x778d7c30, locals=0x778d7c30, flags=<optimized out>, 
> arena=<optimized out>)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:1035
> #11 0x5578b0d1 in PyRun_FileExFlags (fp=0x558c24d0, 
> filename_str=, start=, globals=0x778d7c30, 
> locals=0x778d7c30, closeit=1, flags=0x7fffe1b0)
> at /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:988
> #12 0x5578b2c3 in PyRun_SimpleFileExFlags (fp=0x558c24d0, 
> filename=, closeit=1, flags=0x7fffe1b0) at 
> /tmp/build/80754af9/python_1585000375785/work/Python/pythonrun.c:429
> #13 0x5578c3f5 in pymain_run_file (p_cf=0x7fffe1b0, 
> filename=0x558e51f0 L"repro.py", fp=0x558c24d0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:462
> #14 pymain_run_filename (cf=0x7fffe1b0, pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:1641
> #15 pymain_run_python (pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:2902
> #16 pymain_main (pymain=0x7fffe2c0) at 
> /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3442
> #17 0x5578c51c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1585000375785/work/Modules/main.c:3477
> #18 0x77dcd002 in 

[jira] [Created] (ARROW-8890) [R] Fix C++ lint issue

2020-05-22 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8890:
-

 Summary: [R] Fix C++ lint issue 
 Key: ARROW-8890
 URL: https://issues.apache.org/jira/browse/ARROW-8890
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Francois Saint-Jacques
Assignee: Francois Saint-Jacques
 Fix For: 1.0.0








[jira] [Resolved] (ARROW-8885) [R] Don't include everything everywhere

2020-05-22 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8885.
---
Resolution: Fixed

Issue resolved by pull request 7245
[https://github.com/apache/arrow/pull/7245]

> [R] Don't include everything everywhere
> ---
>
> Key: ARROW-8885
> URL: https://issues.apache.org/jira/browse/ARROW-8885
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I noticed that we were jamming all of our arrow #includes in one header file 
> in the R bindings and then including that everywhere. Seemed like that was 
> wasteful and probably made compilation slower.





[jira] [Updated] (ARROW-8884) [C++] Listing files with S3FileSystem is slow

2020-05-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8884:
--
Description: 
Listing files on S3 is slow due to the recursive nature of the algorithm.

The following change modifies the behavior of the S3Result to include all 
objects but no "grouping" (directories). This dramatically lowers the number of 
HTTP calls.
{code:c++}
diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 70c87f46ec..98a40b17a2 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -986,7 +986,7 @@ class S3FileSystem::Impl {
 if (!prefix.empty()) {
   req.SetPrefix(ToAwsString(prefix) + kSep);
 }
-req.SetDelimiter(Aws::String() + kSep);
+// req.SetDelimiter(Aws::String() + kSep);
 req.SetMaxKeys(kListObjectsMaxKeys);
 
 while (true) {

{code}

The suggested change is to add an option to Selector, e.g. 
`no_directory_result` or something like this.


  was:
Listing files on S3 is slow due to the recursive nature of the algorithm.

The following change modifies the behavior of the S3Result to include all 
objects but no "grouping" (directories). This dramatically lowers the number of 
HTTP calls.
{code:c++}
diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 70c87f46ec..98a40b17a2 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -986,7 +986,7 @@ class S3FileSystem::Impl {
 if (!prefix.empty()) {
   req.SetPrefix(ToAwsString(prefix) + kSep);
 }
-req.SetDelimiter(Aws::String() + kSep);
+// req.SetDelimiter(Aws::String() + kSep);
 req.SetMaxKeys(kListObjectsMaxKeys);
 
 while (true) {

{code}



> [C++] Listing files with S3FileSystem is slow
> -
>
> Key: ARROW-8884
> URL: https://issues.apache.org/jira/browse/ARROW-8884
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: filesystem
>
> Listing files on S3 is slow due to the recursive nature of the algorithm.
> The following change modifies the behavior of the S3Result to include all 
> objects but no "grouping" (directories). This dramatically lowers the number 
> of HTTP calls. 
> {code:c++}
> diff --git a/cpp/src/arrow/filesystem/s3fs.cc 
> b/cpp/src/arrow/filesystem/s3fs.cc
> index 70c87f46ec..98a40b17a2 100644
> --- a/cpp/src/arrow/filesystem/s3fs.cc
> +++ b/cpp/src/arrow/filesystem/s3fs.cc
> @@ -986,7 +986,7 @@ class S3FileSystem::Impl {
>  if (!prefix.empty()) {
>req.SetPrefix(ToAwsString(prefix) + kSep);
>  }
> -req.SetDelimiter(Aws::String() + kSep);
> +// req.SetDelimiter(Aws::String() + kSep);
>  req.SetMaxKeys(kListObjectsMaxKeys);
>  
>  while (true) {
> {code}
> The suggested change is to add an option to Selector, e.g. 
> `no_directory_result` or something like this.
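A pure-Python model of why dropping the delimiter helps (helper names are invented; the real calls go through the AWS SDK): with a delimiter, each listing returns only one "directory" level, so crawling a tree costs one request per prefix; without a delimiter, a single request returns every object under the prefix.

```python
KEYS = ["a/b/1.parquet", "a/b/2.parquet", "a/c/3.parquet"]

def list_with_delimiter(keys, prefix, delim="/"):
    # Emulates ListObjects with a delimiter: objects directly under the
    # prefix plus grouped sub-prefixes; recursing into each sub-prefix
    # requires an additional request.
    contents, prefixes = [], set()
    for k in keys:
        if not k.startswith(prefix):
            continue
        rest = k[len(prefix):]
        if delim in rest:
            prefixes.add(prefix + rest.split(delim, 1)[0] + delim)
        else:
            contents.append(k)
    return contents, sorted(prefixes)

def list_flat(keys, prefix):
    # No delimiter: one request lists all objects under the prefix.
    return [k for k in keys if k.startswith(prefix)]
```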





[jira] [Created] (ARROW-8884) [C++] Listing files with S3FileSystem is slow

2020-05-21 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8884:
-

 Summary: [C++] Listing files with S3FileSystem is slow
 Key: ARROW-8884
 URL: https://issues.apache.org/jira/browse/ARROW-8884
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Francois Saint-Jacques


Listing files on S3 is slow due to the recursive nature of the algorithm.

The following change modifies the behavior of the S3Result to include all 
objects but no "grouping" (directories). This dramatically lowers the number of 
HTTP calls.
{code:c++}
diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 70c87f46ec..98a40b17a2 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -986,7 +986,7 @@ class S3FileSystem::Impl {
 if (!prefix.empty()) {
   req.SetPrefix(ToAwsString(prefix) + kSep);
 }
-req.SetDelimiter(Aws::String() + kSep);
+// req.SetDelimiter(Aws::String() + kSep);
 req.SetMaxKeys(kListObjectsMaxKeys);
 
 while (true) {

{code}






[jira] [Closed] (ARROW-8874) [C++][Dataset] Scanner::ToTable race when ScanTask exit early with an error

2020-05-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques closed ARROW-8874.
-
Fix Version/s: 1.0.0
   Resolution: Duplicate

> [C++][Dataset] Scanner::ToTable race when ScanTask exit early with an error
> ---
>
> Key: ARROW-8874
> URL: https://issues.apache.org/jira/browse/ARROW-8874
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Francois Saint-Jacques
>Priority: Major
> Fix For: 1.0.0
>
>
> https://github.com/apache/arrow/pull/7180#issuecomment-631059751
> The issue is when 
> [Finish|https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/scanner.cc#L184-L208]
>  exits early due to a ScanTask error, in-flight tasks may try to lock the 
> out-of-scope mutex.
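A Python sketch of the lifetime fix (the C++ code would keep the shared state in a {{std::shared_ptr}}; names here are illustrative): the mutex and result list must be owned by a reference-counted object that each in-flight task keeps alive, not by the caller's stack frame.

```python
import threading

class TaskState:
    """Shared state kept alive by a reference from each task, so a late
    task can still take the lock after the launcher has returned."""
    def __init__(self):
        self.lock = threading.Lock()
        self.results = []

def launch(state, fn):
    def worker():
        out = fn()
        with state.lock:  # safe: 'state' is kept alive by this closure
            state.results.append(out)
    t = threading.Thread(target=worker)
    t.start()
    return t

state = TaskState()
threads = [launch(state, lambda i=i: i * i) for i in range(4)]
for t in threads:
    t.join()
```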





[jira] [Resolved] (ARROW-8763) [C++] Create RandomAccessFile::WillNeed-like API

2020-05-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8763.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7172
[https://github.com/apache/arrow/pull/7172]

> [C++] Create RandomAccessFile::WillNeed-like API
> 
>
> Key: ARROW-8763
> URL: https://issues.apache.org/jira/browse/ARROW-8763
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We need to inform RandomAccessFile that we will need a given range or number 
> of ranges.
> Also call that method from MemoryMappedFile::Read and friends.
> Also perhaps write specialized ReadAsync implementations?
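A minimal sketch of what a WillNeed-style hint could look like (hypothetical names, not the actual Arrow C++ API): the caller announces the byte ranges it will read, and the file prefetches them so later reads hit a cache.

```python
import io

class PrefetchingFile:
    """Toy random-access file with a WillNeed-style prefetch hint."""
    def __init__(self, f):
        self._f = f
        self._cache = {}

    def will_need(self, ranges):
        # Prefetch the hinted (offset, length) ranges; a real
        # implementation might issue posix_fadvise(POSIX_FADV_WILLNEED)
        # or asynchronous reads instead of blocking reads.
        for offset, length in ranges:
            self._f.seek(offset)
            self._cache[(offset, length)] = self._f.read(length)

    def read_at(self, offset, length):
        hit = self._cache.pop((offset, length), None)
        if hit is not None:
            return hit  # served from the prefetched cache
        self._f.seek(offset)
        return self._f.read(length)
```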





[jira] [Resolved] (ARROW-8847) [C++] Pass task size / metrics in Executor API

2020-05-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8847.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7225
[https://github.com/apache/arrow/pull/7225]

> [C++] Pass task size / metrics in Executor API
> --
>
> Key: ARROW-8847
> URL: https://issues.apache.org/jira/browse/ARROW-8847
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> For now, our ThreadPool implementation would ignore those metrics, but other 
> implementations may use it for custom ordering.
> Example metrics:
> * IO size (number of bytes)
> * CPU cost (~ number of instructions)
> * Priority (opaque integer? lower is more urgent)





[jira] [Resolved] (ARROW-8703) [R] schema$metadata should be properly typed

2020-05-21 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8703.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7236
[https://github.com/apache/arrow/pull/7236]

> [R] schema$metadata should be properly typed
> 
>
> Key: ARROW-8703
> URL: https://issues.apache.org/jira/browse/ARROW-8703
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 0.17.0
>Reporter: René Rex
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, I try to export numeric data plus some metadata in Python into 
> a parquet file and read it in R. However, the metadata seems to be a dict in 
> Python but a string in R. I would have expected a list (which is roughly a 
> dict in Python). Am I missing something? Here is the code to demonstrate the 
> issue:
> {{import sys}}
> {{import numpy as np}}
> {{import pyarrow as pa}}
> {{import pyarrow.parquet as pq}}
> {{print(sys.version)}}
> {{print(pa.__version__)}}
> {{x = np.random.randint(0, 10, (10, 3))}}
> {{arrays = [pa.array(x[:, i]) for i in range(x.shape[1])]}}
> {{table = pa.Table.from_arrays(arrays=arrays, names=['A', 'B', 'C'],}}
> {{ metadata=\{'foo': '42'})}}
> {{pq.write_table(table, 'array.parquet', compression='snappy')}}
> {{table = pq.read_table('array.parquet')}}
> {{metadata = table.schema.metadata}}
> {{print(metadata)}}
> {{print(type(metadata))}}
>  
> And in R:
>  
> {{library(arrow)}}
> {{print(R.version)}}
> {{print(packageVersion("arrow"))}}
> {{table <- read_parquet("array.parquet", as_data_frame = FALSE)}}
> {{metadata <- table$schema$metadata}}
> {{print(metadata)}}
> {{print(is(metadata))}}
> {{print(metadata["foo"])}}{{ }}
>  
> Output Python:
> {{3.6.8 (default, Aug 7 2019, 17:28:10) }}
> {{[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]}}
> {{0.13.0}}
> {{OrderedDict([(b'foo', b'42')])}}
> {{}}
>  
> Output R:
> {{[1] ‘0.17.0’}}
> {{[1] "\n-- metadata --\nfoo: 42"}}
> {{[1] "character" "vector" "data.frameRowLabels"}}
> {{[4] "SuperClassMethod" }}
> {{[1] NA}}
>  





[jira] [Created] (ARROW-8874) [C++][Dataset] Scanner::ToTable race when ScanTask exit early with an error

2020-05-20 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8874:
-

 Summary: [C++][Dataset] Scanner::ToTable race when ScanTask exit 
early with an error
 Key: ARROW-8874
 URL: https://issues.apache.org/jira/browse/ARROW-8874
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Francois Saint-Jacques


https://github.com/apache/arrow/pull/7180#issuecomment-631059751

The issue is when 
[Finish|https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/scanner.cc#L184-L208]
 exits early due to a ScanTask error, in-flight tasks may try to lock the 
out-of-scope mutex.





[jira] [Resolved] (ARROW-8851) [Python][Documentation] Fix FutureWarnings in Python Plasma docs

2020-05-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8851.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7217
[https://github.com/apache/arrow/pull/7217]

> [Python][Documentation] Fix FutureWarnings in Python Plasma docs
> 
>
> Key: ARROW-8851
> URL: https://issues.apache.org/jira/browse/ARROW-8851
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Affects Versions: 0.17.0
>Reporter: Weston Steimel
>Assignee: Weston Steimel
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The documentation for Plasma in Python at 
> [https://arrow.apache.org/docs/python/plasma.html] produces several 
> FutureWarning messages for pyarrow.get_tensor_size, pyarrow.read_tensor and 
> pyarrow.write_tensor
> In [9]: import numpy as np 
>  : import pyarrow as pa 
>  : 
>  : # Create a pyarrow.Tensor object from a numpy random 2-dimensional array 
>  : data = np.random.randn(10, 4) 
>  : tensor = pa.Tensor.from_numpy(data) 
>  : 
>  : # Create the object in Plasma 
>  : object_id = plasma.ObjectID(np.random.bytes(20)) 
>  : data_size = pa.get_tensor_size(tensor) 
>  : buf = client.create(object_id, data_size) 
>  /usr/local/lib/python3.8/site-packages/pyarrow/util.py:39: FutureWarning: 
> pyarrow.get_tensor_size is deprecated as of 0.17.0, please use 
> pyarrow.ipc.get_tensor_size instead
>  warnings.warn(msg, FutureWarning)
> In [10]: # Write the tensor into the Plasma-allocated buffer 
>  : stream = pa.FixedSizeBufferWriter(buf) 
>  : pa.write_tensor(tensor, stream) # Writes tensor's 552 bytes to Plasma 
> stream 
>  /usr/local/lib/python3.8/site-packages/pyarrow/util.py:39: FutureWarning: 
> pyarrow.write_tensor is deprecated as of 0.17.0, please use 
> pyarrow.ipc.write_tensor instead
>  warnings.warn(msg, FutureWarning)
> In [13]: # Reconstruct the Arrow tensor object. 
>  : reader = pa.BufferReader(buf2) 
>  : tensor2 = pa.read_tensor(reader) 
>  /usr/local/lib/python3.8/site-packages/pyarrow/util.py:39: FutureWarning: 
> pyarrow.read_tensor is deprecated as of 0.17.0, please use 
> pyarrow.ipc.read_tensor instead
>  warnings.warn(msg, FutureWarning)
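The documentation fix is to use the {{pyarrow.ipc}} names; the warnings themselves come from a deprecation-alias pattern along these lines (an illustrative sketch, not pyarrow's actual shim):

```python
import functools
import warnings

def deprecated_alias(new_fn, old_name, new_name):
    # The old top-level name keeps working but steers callers to the
    # new location via a FutureWarning.
    @functools.wraps(new_fn)
    def shim(*args, **kwargs):
        warnings.warn(
            "{} is deprecated, please use {} instead".format(old_name, new_name),
            FutureWarning, stacklevel=2)
        return new_fn(*args, **kwargs)
    return shim

def _ipc_get_tensor_size(tensor):
    # Stand-in for the relocated function.
    return len(tensor)

get_tensor_size = deprecated_alias(
    _ipc_get_tensor_size, "pyarrow.get_tensor_size",
    "pyarrow.ipc.get_tensor_size")
```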





[jira] [Assigned] (ARROW-8851) [Python][Documentation] Fix FutureWarnings in Python Plasma docs

2020-05-18 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8851:
-

Assignee: Weston Steimel

> [Python][Documentation] Fix FutureWarnings in Python Plasma docs
> 
>
> Key: ARROW-8851
> URL: https://issues.apache.org/jira/browse/ARROW-8851
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Affects Versions: 0.17.0
>Reporter: Weston Steimel
>Assignee: Weston Steimel
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The documentation for Plasma in Python at 
> [https://arrow.apache.org/docs/python/plasma.html] produces several 
> FutureWarning messages for pyarrow.get_tensor_size, pyarrow.read_tensor and 
> pyarrow.write_tensor
> In [9]: import numpy as np
>    ...: import pyarrow as pa
>    ...:
>    ...: # Create a pyarrow.Tensor object from a numpy random 2-dimensional array
>    ...: data = np.random.randn(10, 4)
>    ...: tensor = pa.Tensor.from_numpy(data)
>    ...:
>    ...: # Create the object in Plasma
>    ...: object_id = plasma.ObjectID(np.random.bytes(20))
>    ...: data_size = pa.get_tensor_size(tensor)
>    ...: buf = client.create(object_id, data_size)
> /usr/local/lib/python3.8/site-packages/pyarrow/util.py:39: FutureWarning:
> pyarrow.get_tensor_size is deprecated as of 0.17.0, please use
> pyarrow.ipc.get_tensor_size instead
>   warnings.warn(msg, FutureWarning)
> In [10]: # Write the tensor into the Plasma-allocated buffer
>    ...: stream = pa.FixedSizeBufferWriter(buf)
>    ...: pa.write_tensor(tensor, stream)  # Writes tensor's 552 bytes to Plasma stream
> /usr/local/lib/python3.8/site-packages/pyarrow/util.py:39: FutureWarning:
> pyarrow.write_tensor is deprecated as of 0.17.0, please use
> pyarrow.ipc.write_tensor instead
>   warnings.warn(msg, FutureWarning)
> In [13]: # Reconstruct the Arrow tensor object.
>    ...: reader = pa.BufferReader(buf2)
>    ...: tensor2 = pa.read_tensor(reader)
> /usr/local/lib/python3.8/site-packages/pyarrow/util.py:39: FutureWarning:
> pyarrow.read_tensor is deprecated as of 0.17.0, please use
> pyarrow.ipc.read_tensor instead
>   warnings.warn(msg, FutureWarning)
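The warnings above come from a deprecation-shim pattern: the old top-level name still works but forwards to the new `pyarrow.ipc` location while emitting a `FutureWarning`. A minimal stdlib sketch of that pattern (the names below are illustrative stand-ins, not pyarrow's actual internals):

```python
import warnings

def deprecate_alias(new_func, old_name, new_name, version):
    """Wrap new_func so the old name keeps working but emits a FutureWarning."""
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated as of {version}, "
            f"please use {new_name} instead",
            FutureWarning, stacklevel=2)
        return new_func(*args, **kwargs)
    return wrapper

# Hypothetical stand-in for the relocated pyarrow.ipc.read_tensor:
def ipc_read_tensor(source):
    return source

# The legacy top-level name forwards to the new location:
read_tensor = deprecate_alias(
    ipc_read_tensor, "pyarrow.read_tensor", "pyarrow.ipc.read_tensor", "0.17.0")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = read_tensor(b"payload")

print(type(caught[0].message).__name__)  # FutureWarning
```

Fixing the docs, as this ticket does, simply means calling the new `pyarrow.ipc.*` names so the shim is never hit.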



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8553) [C++] Optimize unaligned bitmap operations

2020-05-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8553:
--
Summary: [C++] Optimize unaligned bitmap operations  (was: [C++] 
Reimplement BitmapAnd using Bitmap::VisitWords)

> [C++] Optimize unaligned bitmap operations
> --
>
> Key: ARROW-8553
> URL: https://issues.apache.org/jira/browse/ARROW-8553
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.17.0
>Reporter: Antoine Pitrou
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Currently, {{BitmapAnd}} uses a bit-by-bit loop for unaligned inputs. Using 
> {{Bitmap::VisitWords}} instead would probably yield a manyfold performance 
> increase.
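To illustrate the difference the ticket describes, here is a rough Python model (not the Arrow C++ implementation) of a bit-by-bit AND versus a word-at-a-time AND over byte-aligned bitmaps; the latter is the idea behind `Bitmap::VisitWords`:

```python
def bitmap_and_bitwise(a: bytes, b: bytes) -> bytes:
    """Bit-by-bit AND, analogous to the slow path used for unaligned inputs."""
    out = bytearray(len(a))
    for i in range(len(a) * 8):
        byte, bit = divmod(i, 8)
        if (a[byte] >> bit) & (b[byte] >> bit) & 1:
            out[byte] |= 1 << bit
    return bytes(out)

def bitmap_and_wordwise(a: bytes, b: bytes, word: int = 8) -> bytes:
    """AND one 64-bit word at a time, falling back to bytes for the tail."""
    out = bytearray(len(a))
    i = 0
    while i + word <= len(a):            # full words
        wa = int.from_bytes(a[i:i + word], "little")
        wb = int.from_bytes(b[i:i + word], "little")
        out[i:i + word] = (wa & wb).to_bytes(word, "little")
        i += word
    for j in range(i, len(a)):           # trailing bytes
        out[j] = a[j] & b[j]
    return bytes(out)

print(bitmap_and_wordwise(b"\xf0\x0f", b"\xff\x55").hex())  # f005
```

The real C++ code additionally has to handle bit offsets that are not multiples of 8, which is exactly the unaligned case this ticket optimizes.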



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-8553) [C++] Reimplement BitmapAnd using Bitmap::VisitWords

2020-05-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-8553.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 7135
[https://github.com/apache/arrow/pull/7135]

> [C++] Reimplement BitmapAnd using Bitmap::VisitWords
> 
>
> Key: ARROW-8553
> URL: https://issues.apache.org/jira/browse/ARROW-8553
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.17.0
>Reporter: Antoine Pitrou
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Currently, {{BitmapAnd}} uses a bit-by-bit loop for unaligned inputs. Using 
> {{Bitmap::VisitWords}} instead would probably yield a manyfold performance 
> increase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8799) [C++][Dataset] Reading list column as nested dictionary segfaults

2020-05-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8799:
-

Assignee: (was: Francois Saint-Jacques)

> [C++][Dataset] Reading list column as nested dictionary segfaults
> -
>
> Key: ARROW-8799
> URL: https://issues.apache.org/jira/browse/ARROW-8799
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
>
> Python example:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pyarrow.tests import util
>
> repeats = 10
> nunique = 5
> data = [
>     [[util.rands(10)] for i in range(nunique)] * repeats,
> ]
> table = pa.table(data, names=['f0'])
>
> pq.write_table(table, "test_dictionary.parquet")
> {code}
> Reading with the parquet code works:
> {code}
> >>> pq.read_table("test_dictionary.parquet", read_dictionary=['f0.list.item'])
> pyarrow.Table
> f0: list<item: dictionary<values=string, indices=int32, ordered=0>>
>   child 0, item: dictionary<values=string, indices=int32, ordered=0>
> {code}
> but doing the same with the datasets API segfaults:
> {code}
> >>> fmt = ds.ParquetFileFormat(read_options=dict(dictionary_columns=["f0.list.item"]))
> >>> dataset = ds.dataset("test_dictionary.parquet", format=fmt)
> >>> dataset.to_table()
> Segmentation fault (core dumped)
> {code}
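One way to get more out of a repro that dies with nothing but "Segmentation fault (core dumped)" is the stdlib `faulthandler` module, which dumps the Python-level stack when the process receives a fatal signal such as SIGSEGV from native code (e.g. inside libarrow). A minimal sketch of the technique:

```python
import faulthandler
import sys

# Enable before running the crashing repro: on SIGSEGV/SIGFPE/SIGABRT/SIGBUS
# the interpreter will print the Python tracebacks of all threads to stderr,
# showing which Python call triggered the native crash.
faulthandler.enable(file=sys.stderr, all_threads=True)

enabled = faulthandler.is_enabled()
print(enabled)  # True
```

Combined with a C++-level backtrace from gdb or a core dump, this usually pinpoints which dataset call crosses into the buggy native path.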



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-8799) [C++][Dataset] Reading list column as nested dictionary segfaults

2020-05-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-8799:
-

Assignee: Francois Saint-Jacques

> [C++][Dataset] Reading list column as nested dictionary segfaults
> -
>
> Key: ARROW-8799
> URL: https://issues.apache.org/jira/browse/ARROW-8799
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
>
> Python example:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> from pyarrow.tests import util
>
> repeats = 10
> nunique = 5
> data = [
>     [[util.rands(10)] for i in range(nunique)] * repeats,
> ]
> table = pa.table(data, names=['f0'])
>
> pq.write_table(table, "test_dictionary.parquet")
> {code}
> Reading with the parquet code works:
> {code}
> >>> pq.read_table("test_dictionary.parquet", read_dictionary=['f0.list.item'])
> pyarrow.Table
> f0: list<item: dictionary<values=string, indices=int32, ordered=0>>
>   child 0, item: dictionary<values=string, indices=int32, ordered=0>
> {code}
> but doing the same with the datasets API segfaults:
> {code}
> >>> fmt = ds.ParquetFileFormat(read_options=dict(dictionary_columns=["f0.list.item"]))
> >>> dataset = ds.dataset("test_dictionary.parquet", format=fmt)
> >>> dataset.to_table()
> Segmentation fault (core dumped)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-8782) [Rust] [DataFusion] Add benchmarks based on NYC Taxi data set

2020-05-13 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106440#comment-17106440
 ] 

Francois Saint-Jacques edited comment on ARROW-8782 at 5/13/20, 4:28 PM:
-

https://github.com/fsaintjacques/parquet-testing/tree/nyc-dataset/data/nyc-taxi

Note that this script is not perfect and will not generate uniform Parquet files;
it also has a conversion issue where `rate_code_id` will be int32 or string
depending on whether any values were found.


was (Author: fsaintjacques):
https://github.com/fsaintjacques/parquet-testing/tree/nyc-dataset/data/nyc-taxi

> [Rust] [DataFusion] Add benchmarks based on NYC Taxi data set
> -
>
> Key: ARROW-8782
> URL: https://issues.apache.org/jira/browse/ARROW-8782
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> I plan on adding a new benchmarks folder beneath the datafusion crate, 
> containing benchmarks based on the NYC Taxi data set. The benchmark will be a 
> CLI and will support running a number of different queries against CSV and 
> Parquet.
> The README will contain instructions for downloading the data set.
> The benchmark will produce CSV files containing results.
> These benchmarks will allow us to manually verify performance before major 
> releases and on an ongoing basis as we make changes to 
> Arrow/Parquet/DataFusion.
> I will be basing this on existing benchmarks I recently built in Ballista [1] 
> (I am the only contributor to these benchmarks so far).
> A dockerfile will be provided, making it easy to restrict CPU and RAM when 
> running these benchmarks.
> [1] https://github.com/ballista-compute/ballista/tree/master/rust/benchmarks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8782) [Rust] [DataFusion] Add benchmarks based on NYC Taxi data set

2020-05-13 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106440#comment-17106440
 ] 

Francois Saint-Jacques commented on ARROW-8782:
---

https://github.com/fsaintjacques/parquet-testing/tree/nyc-dataset/data/nyc-taxi

> [Rust] [DataFusion] Add benchmarks based on NYC Taxi data set
> -
>
> Key: ARROW-8782
> URL: https://issues.apache.org/jira/browse/ARROW-8782
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
> Fix For: 1.0.0
>
>
> I plan on adding a new benchmarks folder beneath the datafusion crate, 
> containing benchmarks based on the NYC Taxi data set. The benchmark will be a 
> CLI and will support running a number of different queries against CSV and 
> Parquet.
> The README will contain instructions for downloading the data set.
> The benchmark will produce CSV files containing results.
> These benchmarks will allow us to manually verify performance before major 
> releases and on an ongoing basis as we make changes to 
> Arrow/Parquet/DataFusion.
> I will be basing this on existing benchmarks I recently built in Ballista [1] 
> (I am the only contributor to these benchmarks so far).
> A dockerfile will be provided, making it easy to restrict CPU and RAM when 
> running these benchmarks.
> [1] https://github.com/ballista-compute/ballista/tree/master/rust/benchmarks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8763) [C++] Create RandomAccessFile::WillNeed-like API

2020-05-13 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106256#comment-17106256
 ] 

Francois Saint-Jacques commented on ARROW-8763:
---

[~lidavidm]

> [C++] Create RandomAccessFile::WillNeed-like API
> 
>
> Key: ARROW-8763
> URL: https://issues.apache.org/jira/browse/ARROW-8763
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> We need to inform RandomAccessFile that we will need a given range or number 
> of ranges.
> Also call that method from MemoryMappedFile::Read and friends.
> Also perhaps write specialized ReadAsync implementations?
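A `WillNeed`-style hint maps naturally onto `madvise(MADV_WILLNEED)` for memory-mapped files. The sketch below models the idea in Python with `mmap` (the `will_need` name and signature are illustrative, not Arrow's actual C++ API):

```python
import mmap
import os
import tempfile

def will_need(mm: mmap.mmap, offset: int, length: int) -> None:
    """Hint the OS to prefetch [offset, offset + length) of a mapped file,
    roughly what a RandomAccessFile::WillNeed backed by madvise would do."""
    if not (hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED")):
        return  # madvise is POSIX-only; treat the hint as a no-op elsewhere
    page = mmap.PAGESIZE
    aligned = (offset // page) * page    # madvise wants page-aligned starts
    mm.madvise(mmap.MADV_WILLNEED, aligned, (offset - aligned) + length)

# Demo: map a scratch file, hint a range, then read it.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * (1 << 20))
os.close(fd)
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    will_need(mm, 4096, 65536)           # prefetch before the actual read
    data = mm[4096:4096 + 16]
    mm.close()
os.unlink(path)
print(data)  # b'xxxxxxxxxxxxxxxx'
```

Because the hint is advisory, callers never have to wait on it, which is what makes it a good fit for `MemoryMappedFile::Read` and for prefetching the coalesced ranges an async reader plans to touch.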



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

