[jira] [Resolved] (ARROW-10292) [Rust] [DataFusion] Simplify merge

2020-10-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão resolved ARROW-10292.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 8453
[https://github.com/apache/arrow/pull/8453]

> [Rust] [DataFusion] Simplify merge
> --
>
> Key: ARROW-10292
> URL: https://issues.apache.org/jira/browse/ARROW-10292
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-10289) [Rust] Support reading dictionary streams

2020-10-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão reassigned ARROW-10289:


Assignee: Neville Dipale

> [Rust] Support reading dictionary streams
> -
>
> Key: ARROW-10289
> URL: https://issues.apache.org/jira/browse/ARROW-10289
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 2.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We support reading dictionaries in the IPC file reader.
> We should do the same with the stream reader.
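As a conceptual illustration only (a plain-Python sketch; the message model and names here are simplified assumptions, not the actual Arrow Rust API): an IPC stream interleaves dictionary batches with the record batches that reference them, so a stream reader must collect dictionaries as it goes and resolve indices when decoding each record batch.

```python
def read_stream(messages):
    """Decode a toy IPC-like stream of (kind, payload) messages."""
    dictionaries = {}   # dictionary id -> list of values
    batches = []
    for kind, payload in messages:
        if kind == "dictionary":
            # A dictionary batch registers (or replaces) a dictionary.
            dict_id, values = payload
            dictionaries[dict_id] = values
        elif kind == "record":
            # A record batch carries indices into a previously seen dictionary.
            dict_id, indices = payload
            values = dictionaries[dict_id]
            batches.append([values[i] for i in indices])
    return batches

stream = [
    ("dictionary", (0, ["a", "b", "c"])),
    ("record", (0, [0, 2, 1, 0])),
]
decoded = read_stream(stream)
```

The file reader can read all dictionaries up front from the footer; the point of the issue is that a stream reader has to handle them incrementally, in arrival order, as sketched above.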





[jira] [Resolved] (ARROW-10289) [Rust] Support reading dictionary streams

2020-10-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão resolved ARROW-10289.
--
Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8450
[https://github.com/apache/arrow/pull/8450]

> [Rust] Support reading dictionary streams
> -
>
> Key: ARROW-10289
> URL: https://issues.apache.org/jira/browse/ARROW-10289
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 2.0.0
>Reporter: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We support reading dictionaries in the IPC file reader.
> We should do the same with the stream reader.





[jira] [Commented] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark

2020-10-13 Thread utsav (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213567#comment-17213567
 ] 

utsav commented on ARROW-10276:
---

An update: I upgraded to Spark 3.0.1 and received the same error.

> Armv7 orc and flight not supported for build. Compat error on using with spark
> --
>
> Key: ARROW-10276
> URL: https://issues.apache.org/jira/browse/ARROW-10276
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: utsav
>Priority: Major
> Attachments: arrow_compat_error, build_pip_wheel.sh, 
> dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh
>
>
> I'm using an Arm Cortex-A9 processor on the Xilinx Pynq Z2 board. People have 
> tried to use it for the Raspberry Pi 3 without luck in previous posts.
> I figured out how to successfully build it for armv7 using the script below, 
> but cannot use the orc and flight flags. People looked into this in ARROW-8420, 
> but I don't know if they faced these issues.
> I tried converting a Spark dataframe to pandas using pyarrow, but now it 
> complains about a compat feature. I have attached images below.
> Any help would be appreciated. Thanks.
> Spark Version: 2.4.5.
>  The code is as follows:
> ```
> import pandas as pd
> df_pd = df.toPandas()
> npArr = df_pd.to_numpy()
> ```
> The error is as follows:-
> ```
> /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
> attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is 
> set to true; however, failed by the reason below:
>  module 'pyarrow' has no attribute 'compat'
>  Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' 
> is set to true.
>  warnings.warn(msg)
> ``` 
>  





[jira] [Created] (ARROW-10304) [C++][Compute] Optimize variance kernel for integers

2020-10-13 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-10304:


 Summary: [C++][Compute] Optimize variance kernel for integers
 Key: ARROW-10304
 URL: https://issues.apache.org/jira/browse/ARROW-10304
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


The current variance kernel converts all data types to `double` before 
calculation. This is sub-optimal for integers: integer arithmetic is much faster 
than floating point; summation, for example, is 4x faster [1].

A quick test calculating int32 variance shows up to a 3x performance gain. 
Another benefit is that integer arithmetic is exact.

[1] https://quick-bench.com/q/_Sz-Peq1MNWYwZYrTtQDx3GI7lQ
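To make the idea concrete, here is a hedged plain-Python sketch (function names are illustrative, not the actual Arrow C++ kernel API): for integer input, the sum and sum of squares can be accumulated exactly in integer arithmetic, with a single conversion to floating point at the end.

```python
def variance_via_double(values):
    # Baseline approach: convert every element to float before accumulating.
    n = len(values)
    mean = sum(float(v) for v in values) / n
    return sum((float(v) - mean) ** 2 for v in values) / n

def variance_integer_fast_path(values):
    # Proposed fast path for integer input: both accumulators stay exact
    # integers; floating point enters only in the final division.
    n = len(values)
    s = sum(values)                    # exact integer sum
    sq = sum(v * v for v in values)    # exact integer sum of squares
    return (sq - s * s / n) / n        # population variance

data = list(range(1, 11))
```

Both functions return 8.25 for `data`; in C++ the integer path additionally avoids a per-element int-to-double conversion, which is where the reported gain would come from.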





[jira] [Issue Comment Deleted] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark

2020-10-13 Thread utsav (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

utsav updated ARROW-10276:
--
Comment: was deleted

(was: An update.

 

I tried running the code in a script

 

20/10/13 23:29:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
 20/10/13 23:29:13 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
 20/10/13 23:29:31 WARN SizeEstimator: Failed to check whether 
UseCompressedOops is set; assuming yes
 +---+--+
|_c0|_c1|

+---+--+
|1582999200|1|
|1582999260|1|
|1582999320|1|
|1582999380|1|
|1582999440|1|
|1582999500|1|
|1582999560|1|
|1582999620|1|
|1582999680|1|
|1582999740|1|
|1582999800|1|
|1582999860|1|
|158220|1|
|158280|1|
|158340|1|
|1583000100|1|
|1583000160|1|
|1583000220|1|
|1583000280|1|
|1583000340|1|

+---+--+
 only showing top 20 rows

/opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set 
to true; however, failed by the reason below:
 PyArrow >= 0.8.0 must be installed; however, it was not found.
 Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is 
set to true.
 warnings.warn(msg)

 

 

I then did:-

`pip3 show pyarrow `
 Name: pyarrow
 Version: 0.17.0
 Summary: Python library for Apache Arrow
 Home-page: [https://arrow.apache.org/]
 Author: Apache Arrow Developers
 Author-email: d...@arrow.apache.org
 License: Apache License, Version 2.0
 Location: /home/xilinx/.local/lib/python3.6/site-packages
 Requires: numpy
 Required-by:

 

It definitely exist in my PYTHONPATH as I added the following in bashrc and 
sourced it to activate

`export PYTHONPATH=/home/xilinx/.local/lib/python3.6/site-packages:$PYTHONPATH`
 )

> Armv7 orc and flight not supported for build. Compat error on using with spark
> --
>
> Key: ARROW-10276
> URL: https://issues.apache.org/jira/browse/ARROW-10276
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: utsav
>Priority: Major
> Attachments: arrow_compat_error, build_pip_wheel.sh, 
> dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh
>
>
> I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have 
> tried to use it for the raspberry pi 3 without luck in previous posts.
> I figured out how to successfully build it for armv7 using the script below 
> but cannot use orc and flight flags. People had looked into it in ARROW-8420 
> but I don't know if they faced these issues.
> I tried converting a spark dataframe to pandas using pyarrow but now it 
> complains about a compat feature. I have attached images below
> Any help would be appreciated. Thanks
> Spark Version: 2.4.5.
>  The code is as follows:
> ```
> import pandas as pd
> df_pd = df.toPandas()
> npArr = df_pd.to_numpy()
> ```
> The error is as follows:-
> ```
> /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
> attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is 
> set to true; however, failed by the reason below:
>  module 'pyarrow' has no attribute 'compat'
>  Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' 
> is set to true.
>  warnings.warn(msg)
> ``` 
>  





[jira] [Comment Edited] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark

2020-10-13 Thread utsav (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213484#comment-17213484
 ] 

utsav edited comment on ARROW-10276 at 10/13/20, 11:34 PM:
---

An update.

 

I tried running the code in a script

 

20/10/13 23:29:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
 20/10/13 23:29:13 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
 20/10/13 23:29:31 WARN SizeEstimator: Failed to check whether 
UseCompressedOops is set; assuming yes
 +---+--+
|_c0|_c1|

+---+--+
|1582999200|1|
|1582999260|1|
|1582999320|1|
|1582999380|1|
|1582999440|1|
|1582999500|1|
|1582999560|1|
|1582999620|1|
|1582999680|1|
|1582999740|1|
|1582999800|1|
|1582999860|1|
|158220|1|
|158280|1|
|158340|1|
|1583000100|1|
|1583000160|1|
|1583000220|1|
|1583000280|1|
|1583000340|1|

+---+--+
 only showing top 20 rows

/opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set 
to true; however, failed by the reason below:
 PyArrow >= 0.8.0 must be installed; however, it was not found.
 Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is 
set to true.
 warnings.warn(msg)

 

 

I then did:-

`pip3 show pyarrow `
 Name: pyarrow
 Version: 0.17.0
 Summary: Python library for Apache Arrow
 Home-page: [https://arrow.apache.org/]
 Author: Apache Arrow Developers
 Author-email: d...@arrow.apache.org
 License: Apache License, Version 2.0
 Location: /home/xilinx/.local/lib/python3.6/site-packages
 Requires: numpy
 Required-by:

 

It definitely exists in my PYTHONPATH, as I added the following to my bashrc and 
sourced it to activate:

`export PYTHONPATH=/home/xilinx/.local/lib/python3.6/site-packages:$PYTHONPATH`

was (Author: utri092):
An update.

 

I tried running the code in a script

 

20/10/13 23:29:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
20/10/13 23:29:13 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
20/10/13 23:29:31 WARN SizeEstimator: Failed to check whether UseCompressedOops 
is set; assuming yes
+--+---+ 
| _c0|_c1|
+--+---+
|1582999200| 1|
|1582999260| 1|
|1582999320| 1|
|1582999380| 1|
|1582999440| 1|
|1582999500| 1|
|1582999560| 1|
|1582999620| 1|
|1582999680| 1|
|1582999740| 1|
|1582999800| 1|
|1582999860| 1|
|158220| 1|
|158280| 1|
|158340| 1|
|1583000100| 1|
|1583000160| 1|
|1583000220| 1|
|1583000280| 1|
|1583000340| 1|
+--+---+
only showing top 20 rows

/opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set 
to true; however, failed by the reason below:
 PyArrow >= 0.8.0 must be installed; however, it was not found.
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is 
set to true.
 warnings.warn(msg)

 

I then did

 

pip3 show pyarrow 
Name: pyarrow
Version: 0.17.0
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author: Apache Arrow Developers
Author-email: d...@arrow.apache.org
License: Apache License, Version 2.0
Location: /home/xilinx/.local/lib/python3.6/site-packages
Requires: numpy
Required-by:

 

It definitely exist in my PYTHONPATH as I added the following in bashrc and 
sourced it to activate

export PYTHONPATH=/home/xilinx/.local/lib/python3.6/site-packages:$PYTHONPATH

> Armv7 orc and flight not supported for build. Compat error on using with spark
> --
>
> Key: ARROW-10276
> URL: https://issues.apache.org/jira/browse/ARROW-10276
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: utsav
>Priority: Major
> Attachments: arrow_compat_error, build_pip_wheel.sh, 
> dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh
>
>
> I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have 
> tried to use it for the raspberry pi 3 without luck in previous posts.
> I figured out how to successfully build it for armv7 using the script below 
> but cannot use orc and flight flags. People had looked into it in ARROW-8420 
> but I don't know if they faced these issues.
> I tried converting a spark dataframe to pandas using 

[jira] [Commented] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark

2020-10-13 Thread utsav (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213484#comment-17213484
 ] 

utsav commented on ARROW-10276:
---

An update.

 

I tried running the code in a script

 

20/10/13 23:29:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
20/10/13 23:29:13 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
20/10/13 23:29:31 WARN SizeEstimator: Failed to check whether UseCompressedOops 
is set; assuming yes
+--+---+ 
| _c0|_c1|
+--+---+
|1582999200| 1|
|1582999260| 1|
|1582999320| 1|
|1582999380| 1|
|1582999440| 1|
|1582999500| 1|
|1582999560| 1|
|1582999620| 1|
|1582999680| 1|
|1582999740| 1|
|1582999800| 1|
|1582999860| 1|
|158220| 1|
|158280| 1|
|158340| 1|
|1583000100| 1|
|1583000160| 1|
|1583000220| 1|
|1583000280| 1|
|1583000340| 1|
+--+---+
only showing top 20 rows

/opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set 
to true; however, failed by the reason below:
 PyArrow >= 0.8.0 must be installed; however, it was not found.
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is 
set to true.
 warnings.warn(msg)

 

I then did

 

pip3 show pyarrow 
Name: pyarrow
Version: 0.17.0
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author: Apache Arrow Developers
Author-email: d...@arrow.apache.org
License: Apache License, Version 2.0
Location: /home/xilinx/.local/lib/python3.6/site-packages
Requires: numpy
Required-by:

 

It definitely exists in my PYTHONPATH, as I added the following to my bashrc and 
sourced it to activate:

export PYTHONPATH=/home/xilinx/.local/lib/python3.6/site-packages:$PYTHONPATH

> Armv7 orc and flight not supported for build. Compat error on using with spark
> --
>
> Key: ARROW-10276
> URL: https://issues.apache.org/jira/browse/ARROW-10276
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: utsav
>Priority: Major
> Attachments: arrow_compat_error, build_pip_wheel.sh, 
> dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh
>
>
> I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have 
> tried to use it for the raspberry pi 3 without luck in previous posts.
> I figured out how to successfully build it for armv7 using the script below 
> but cannot use orc and flight flags. People had looked into it in ARROW-8420 
> but I don't know if they faced these issues.
> I tried converting a spark dataframe to pandas using pyarrow but now it 
> complains about a compat feature. I have attached images below
> Any help would be appreciated. Thanks
> Spark Version: 2.4.5.
>  The code is as follows:
> ```
> import pandas as pd
> df_pd = df.toPandas()
> npArr = df_pd.to_numpy()
> ```
> The error is as follows:-
> ```
> /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
> attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is 
> set to true; however, failed by the reason below:
>  module 'pyarrow' has no attribute 'compat'
>  Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' 
> is set to true.
>  warnings.warn(msg)
> ``` 
>  





[jira] [Commented] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark

2020-10-13 Thread utsav (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213463#comment-17213463
 ] 

utsav commented on ARROW-10276:
---

[~uwe] According to ARROW-8420, which I posted earlier in my issue, support for 
armv7 was added only in 0.17.0, so I cannot use 0.8.0. I tried to build it and 
it failed. I even set {{export ARROW_PRE_0_15_IPC_FORMAT=1}} in 
conf/spark-env.sh according to the link you sent me, but no luck.

> Armv7 orc and flight not supported for build. Compat error on using with spark
> --
>
> Key: ARROW-10276
> URL: https://issues.apache.org/jira/browse/ARROW-10276
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: utsav
>Priority: Major
> Attachments: arrow_compat_error, build_pip_wheel.sh, 
> dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh
>
>
> I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have 
> tried to use it for the raspberry pi 3 without luck in previous posts.
> I figured out how to successfully build it for armv7 using the script below 
> but cannot use orc and flight flags. People had looked into it in ARROW-8420 
> but I don't know if they faced these issues.
> I tried converting a spark dataframe to pandas using pyarrow but now it 
> complains about a compat feature. I have attached images below
> Any help would be appreciated. Thanks
> Spark Version: 2.4.5.
>  The code is as follows:
> ```
> import pandas as pd
> df_pd = df.toPandas()
> npArr = df_pd.to_numpy()
> ```
> The error is as follows:-
> ```
> /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
> attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is 
> set to true; however, failed by the reason below:
>  module 'pyarrow' has no attribute 'compat'
>  Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' 
> is set to true.
>  warnings.warn(msg)
> ``` 
>  





[jira] [Comment Edited] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark

2020-10-13 Thread utsav (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213449#comment-17213449
 ] 

utsav edited comment on ARROW-10276 at 10/13/20, 10:07 PM:
---

[~uwe] Will try and let you know. I guess the orc and flight flags are a 
separate issue in themselves. At the moment it cannot build with them set to ON.


was (Author: utri092):
[~uwe] will try and let you know. I guess the orc and flight flags are separate 
issue in themselves. At the moment it cannot build with them

> Armv7 orc and flight not supported for build. Compat error on using with spark
> --
>
> Key: ARROW-10276
> URL: https://issues.apache.org/jira/browse/ARROW-10276
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: utsav
>Priority: Major
> Attachments: arrow_compat_error, build_pip_wheel.sh, 
> dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh
>
>
> I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have 
> tried to use it for the raspberry pi 3 without luck in previous posts.
> I figured out how to successfully build it for armv7 using the script below 
> but cannot use orc and flight flags. People had looked into it in ARROW-8420 
> but I don't know if they faced these issues.
> I tried converting a spark dataframe to pandas using pyarrow but now it 
> complains about a compat feature. I have attached images below
> Any help would be appreciated. Thanks
> Spark Version: 2.4.5.
>  The code is as follows:
> ```
> import pandas as pd
> df_pd = df.toPandas()
> npArr = df_pd.to_numpy()
> ```
> The error is as follows:-
> ```
> /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
> attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is 
> set to true; however, failed by the reason below:
>  module 'pyarrow' has no attribute 'compat'
>  Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' 
> is set to true.
>  warnings.warn(msg)
> ``` 
>  





[jira] [Commented] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark

2020-10-13 Thread utsav (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213449#comment-17213449
 ] 

utsav commented on ARROW-10276:
---

[~uwe] will try and let you know. I guess the orc and flight flags are separate 
issue in themselves. At the moment it cannot build with them

> Armv7 orc and flight not supported for build. Compat error on using with spark
> --
>
> Key: ARROW-10276
> URL: https://issues.apache.org/jira/browse/ARROW-10276
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: utsav
>Priority: Major
> Attachments: arrow_compat_error, build_pip_wheel.sh, 
> dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh
>
>
> I'm using a Arm Cortex A9 processor on the Xilinx Pynq Z2 board. People have 
> tried to use it for the raspberry pi 3 without luck in previous posts.
> I figured out how to successfully build it for armv7 using the script below 
> but cannot use orc and flight flags. People had looked into it in ARROW-8420 
> but I don't know if they faced these issues.
> I tried converting a spark dataframe to pandas using pyarrow but now it 
> complains about a compat feature. I have attached images below
> Any help would be appreciated. Thanks
> Spark Version: 2.4.5.
>  The code is as follows:
> ```
> import pandas as pd
> df_pd = df.toPandas()
> npArr = df_pd.to_numpy()
> ```
> The error is as follows:-
> ```
> /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
> attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is 
> set to true; however, failed by the reason below:
>  module 'pyarrow' has no attribute 'compat'
>  Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' 
> is set to true.
>  warnings.warn(msg)
> ``` 
>  





[jira] [Created] (ARROW-10303) Parallel type transformation in CSV reader

2020-10-13 Thread Sergej Fries (Jira)
Sergej Fries created ARROW-10303:


 Summary: Parallel type transformation in CSV reader
 Key: ARROW-10303
 URL: https://issues.apache.org/jira/browse/ARROW-10303
 Project: Apache Arrow
  Issue Type: Wish
  Components: Rust
Reporter: Sergej Fries
 Attachments: tracing.png

Currently, when a CSV file is read, a single thread is responsible both for 
reading the file and for transforming the returned string values into the 
correct data types.

In my case, reading a 2 GB CSV file with a dozen float columns takes ~40 
seconds. Of this time, only ~10% is spent reading the file, while ~68% goes to 
transforming the string values into the correct data types.

My proposal is to parallelize the part responsible for the data type 
transformation.

It seems quite simple to achieve: after the CSV reader reads a batch, all 
projected columns are transformed one by one using an iterator over a vector 
followed by a map. I believe that with the rayon crate, the only changes would 
be adjusting "iter()" to "par_iter()" and changing

{{impl<R: Read> Reader<R>}}

into:

{{impl<R: Read + std::marker::Sync> Reader<R>}}

 

But maybe I am overlooking something crucial (being quite new to Rust and 
Arrow). Any advice from someone more experienced is therefore very welcome!
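For illustration only, a stdlib-Python sketch of the idea (the actual change would use rayon's par_iter() in the Rust reader; the function names here are made up): converting each projected column of a batch from strings to typed values is independent per column, so the columns can be dispatched in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def convert_column(strings, cast):
    # One column: parse every string value into the target type.
    return [cast(s) for s in strings]

def convert_batch_parallel(columns, casts):
    # Convert all projected columns of a batch in parallel, one task per
    # column -- the moral equivalent of swapping iter() for par_iter().
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(convert_column, col, cast)
                   for col, cast in zip(columns, casts)]
        return [f.result() for f in futures]

batch = [["1", "2", "3"], ["3.5", "4.5", "5.5"]]
typed = convert_batch_parallel(batch, [int, float])
```

(In CPython the GIL limits the real speedup of pure-Python parsing threads; in Rust, rayon's work-stealing thread pool would parallelize the CPU-bound conversion for real, hence the `Sync` bound on `R` above.)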





[jira] [Resolved] (ARROW-10253) [Python] Don't bundle plasma-store-server in pyarrow conda package

2020-10-13 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn resolved ARROW-10253.
--
Resolution: Duplicate

> [Python] Don't bundle plasma-store-server in pyarrow conda package
> --
>
> Key: ARROW-10253
> URL: https://issues.apache.org/jira/browse/ARROW-10253
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>
> We currently have it in the {{arrow-cpp}} and the {{pyarrow}} conda package, 
> we should only have it in {{arrow-cpp}} as this is always there and also the 
> source of the binary.





[jira] [Updated] (ARROW-10302) [Python] Don't double-package plasma-store-server

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10302:
---
Labels: pull-request-available  (was: )

> [Python] Don't double-package plasma-store-server
> -
>
> Key: ARROW-10302
> URL: https://issues.apache.org/jira/browse/ARROW-10302
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Uwe Korn
>Assignee: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is part of the {{arrow-cpp}} and {{pyarrow}} conda packages. We 
> shouldn't ship the version in {{pyarrow}} as this is just a copy to a 
> different location.





[jira] [Created] (ARROW-10302) [Python] Don't double-package plasma-store-server

2020-10-13 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-10302:


 Summary: [Python] Don't double-package plasma-store-server
 Key: ARROW-10302
 URL: https://issues.apache.org/jira/browse/ARROW-10302
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, Python
Reporter: Uwe Korn
Assignee: Uwe Korn
 Fix For: 3.0.0


This is part of the {{arrow-cpp}} and {{pyarrow}} conda packages. We shouldn't 
ship the version in {{pyarrow}} as this is just a copy to a different location.





[jira] [Created] (ARROW-10301) Add "all" boolean reducing kernel

2020-10-13 Thread Andrew Wieteska (Jira)
Andrew Wieteska created ARROW-10301:
---

 Summary: Add "all" boolean reducing kernel
 Key: ARROW-10301
 URL: https://issues.apache.org/jira/browse/ARROW-10301
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Reporter: Andrew Wieteska
Assignee: Andrew Wieteska
 Fix For: 3.0.0


As discussed on GitHub: 
[https://github.com/apache/arrow/pull/8294#discussion_r504034461]





[jira] [Updated] (ARROW-9164) [C++] Provide APIs for adding "docstrings" to arrow::compute::Function classes that can be accessed by bindings

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9164:
--
Labels: pull-request-available  (was: )

> [C++] Provide APIs for adding "docstrings" to arrow::compute::Function 
> classes that can be accessed by bindings
> ---
>
> Key: ARROW-9164
> URL: https://issues.apache.org/jira/browse/ARROW-9164
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-10300) [Rust] Parquet/CSV TPC-H data

2020-10-13 Thread Remi Dettai (Jira)
Remi Dettai created ARROW-10300:
---

 Summary: [Rust] Parquet/CSV TPC-H data
 Key: ARROW-10300
 URL: https://issues.apache.org/jira/browse/ARROW-10300
 Project: Apache Arrow
  Issue Type: Wish
  Components: Rust
Reporter: Remi Dettai


The TPC-H benchmark for datafusion works with Parquet/CSV data but the data 
generation routine described in the README generates `.tbl` data.

Could we describe how the TPC-H Parquet/CSV data can be generated, to make the 
benchmark easier to set up and more reproducible?





[jira] [Created] (ARROW-10299) [Rust] Support reading and writing V5 of IPC metadata

2020-10-13 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-10299:
--

 Summary: [Rust] Support reading and writing V5 of IPC metadata
 Key: ARROW-10299
 URL: https://issues.apache.org/jira/browse/ARROW-10299
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Affects Versions: 2.0.0
Reporter: Neville Dipale


This mostly involves alignment issues and tracking when we encounter the v4 
legacy padding.

I had done this work in another branch, but discarded it without noticing.





[jira] [Updated] (ARROW-10295) [Rust] [DataFusion] Simplify accumulators

2020-10-13 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-10295:

Summary: [Rust] [DataFusion] Simplify accumulators  (was: [Rist] 
[DataFusion] Simplify accumulators)

> [Rust] [DataFusion] Simplify accumulators
> -
>
> Key: ARROW-10295
> URL: https://issues.apache.org/jira/browse/ARROW-10295
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Replace Rc> by Box<>.





[jira] [Resolved] (ARROW-10296) [R] Data saved as integer64 loaded as integer

2020-10-13 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-10296.
-
Fix Version/s: 2.0.0
 Assignee: Neal Richardson
   Resolution: Duplicate

This is a deliberate 
[feature|https://arrow.apache.org/docs/r/news/index.html#arrow-format-conversion],
 but in the upcoming release you'll be able to [disable 
it|https://github.com/apache/arrow/blob/master/r/NEWS.md#bug-fixes-and-other-enhancements].
 

> [R] Data saved as integer64 loaded as integer
> -
>
> Key: ARROW-10296
> URL: https://issues.apache.org/jira/browse/ARROW-10296
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R3.6.1, arrow 1.0.1, bit64 4.0.5
> full sessionInfo():
> R version 3.6.1 (2019-07-05)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 19041)
> Matrix products: default
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United 
> States.1252LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C   LC_TIME=English_United States.1252 
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base 
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.5   fansi_0.4.1  arrow_1.0.1  dplyr_1.0.2  
> crayon_1.3.4 assertthat_0.2.1 R6_2.4.1 lifecycle_0.2.0 
>  [9] magrittr_1.5 pillar_1.4.6 cli_2.0.2rlang_0.4.7  
> rstudioapi_0.11  generics_0.0.2   vctrs_0.3.4  ellipsis_0.3.1  
> [17] tools_3.6.1  bit64_4.0.5  feather_0.3.5glue_1.4.2   
> purrr_0.3.4  bit_4.0.4hms_0.5.3compiler_3.6.1  
> [25] pkgconfig_2.0.3  tidyselect_1.1.0 tibble_3.0.3
>Reporter: Ofek Shilon
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 2.0.0
>
>
> {{> v <- bit64::as.integer64(1:10)}}
> {{> df <- data.frame(v=v)}}
> {{> class(df$v)}}
> {{[1] "*integer64*"}}
> {{> arrow::write_feather(df, "./tmp")}}
> {{> df2 <- arrow::read_feather("./tmp")}}
> {{> class(df2$v)}}
> {{[1] "*integer*"}}





[jira] [Resolved] (ARROW-10295) [Rist] [DataFusion] Simplify accumulators

2020-10-13 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10295.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8456
[https://github.com/apache/arrow/pull/8456]

> [Rist] [DataFusion] Simplify accumulators
> -
>
> Key: ARROW-10295
> URL: https://issues.apache.org/jira/browse/ARROW-10295
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Replace Rc<RefCell<>> by Box<>.





[jira] [Resolved] (ARROW-10293) [Rust] [DataFusion] Fix benchmarks

2020-10-13 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10293.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8452
[https://github.com/apache/arrow/pull/8452]

> [Rust] [DataFusion] Fix benchmarks
> --
>
> Key: ARROW-10293
> URL: https://issues.apache.org/jira/browse/ARROW-10293
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> They are only benchmarking planning, not execution.





[jira] [Assigned] (ARROW-10145) [C++][Dataset] Integer-like partition field values outside int32 range error on reading

2020-10-13 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-10145:


Assignee: Ben Kietzman

> [C++][Dataset] Integer-like partition field values outside int32 range error 
> on reading
> ---
>
> Key: ARROW-10145
> URL: https://issues.apache.org/jira/browse/ARROW-10145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset
> Fix For: 2.0.1
>
>
> From 
> https://stackoverflow.com/questions/64137664/how-to-override-type-inference-for-partition-columns-in-hive-partitioned-dataset
> Small reproducer:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'part': [3760212050]*10, 'col': range(10)})
> pq.write_to_dataset(table, "test_int64_partition", partition_cols=['part'])
> In [35]: pq.read_table("test_int64_partition/")
> ...
> ArrowInvalid: error parsing '3760212050' as scalar of type int32
> In ../src/arrow/scalar.cc, line 333, code: VisitTypeInline(*type_, this)
> In ../src/arrow/dataset/partition.cc, line 218, code: 
> (_error_or_value26).status()
> In ../src/arrow/dataset/partition.cc, line 229, code: 
> (_error_or_value27).status()
> In ../src/arrow/dataset/discovery.cc, line 256, code: 
> (_error_or_value17).status()
> In [36]: pq.read_table("test_int64_partition/", use_legacy_dataset=True)
> Out[36]: 
> pyarrow.Table
> col: int64
> part: dictionary
> {code}





[jira] [Created] (ARROW-10298) [Rust] Incorrect offset handling in iterator over dictionary keys

2020-10-13 Thread Jira
Jörn Horstmann created ARROW-10298:
--

 Summary: [Rust] Incorrect offset handling in iterator over 
dictionary keys
 Key: ARROW-10298
 URL: https://issues.apache.org/jira/browse/ARROW-10298
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Jörn Horstmann


The NullableIterator used by DictionaryArray.keys calls ArrayData.is_null 
without taking the offset of that ArrayData into account. It would probably be 
better if ArrayData itself handled the offset in that method.

The iterator implementation could now also be replaced with the recently added 
PrimitiveIter.
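The bug pattern can be sketched with simplified types (not the actual arrow crate structs): a sliced array's logical index must be shifted by the slice's offset before consulting the shared null bitmap.

```rust
/// Simplified stand-in for ArrayData: a null bitmap that may be shared
/// by slices, plus this slice's offset into it.
struct ArrayData {
    null_bitmap: Vec<bool>, // true = valid, false = null
    offset: usize,
}

impl ArrayData {
    /// Offset-aware null check: index `i` is relative to the slice,
    /// so it must be shifted by `self.offset` into the bitmap.
    fn is_null(&self, i: usize) -> bool {
        !self.null_bitmap[self.offset + i]
    }
}

fn main() {
    let data = ArrayData { null_bitmap: vec![true, false, true], offset: 1 };
    // Logical index 0 of the slice is physical index 1, which is null.
    assert!(data.is_null(0));
    // Ignoring the offset would wrongly consult physical index 0 (valid).
    assert!(data.null_bitmap[0]);
    println!("ok");
}
```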





[jira] [Commented] (ARROW-10276) Armv7 orc and flight not supported for build. Compat error on using with spark

2020-10-13 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213024#comment-17213024
 ] 

Uwe Korn commented on ARROW-10276:
--

According to the Spark documentation, you need {{pyarrow==0.8.0}}: 
http://spark.apache.org/docs/2.4.5/sql-pyspark-pandas-with-arrow.html#ensure-pyarrow-installed
So this seems to be a mismatch in installed {{pyarrow}} versions rather than 
something actually related to Armv7. 

> Armv7 orc and flight not supported for build. Compat error on using with spark
> --
>
> Key: ARROW-10276
> URL: https://issues.apache.org/jira/browse/ARROW-10276
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.17.0
>Reporter: utsav
>Priority: Major
> Attachments: arrow_compat_error, build_pip_wheel.sh, 
> dpu_stream_spark.ipynb, get_arrow_and_create_venv.sh, run_build.sh
>
>
> I'm using an Arm Cortex-A9 processor on the Xilinx Pynq Z2 board. People have 
> tried to build it for the Raspberry Pi 3 without luck in previous posts.
> I figured out how to build it successfully for Armv7 using the script below, 
> but cannot use the orc and flight flags. People had looked into this in ARROW-8420, 
> but I don't know if they faced these issues.
> I tried converting a Spark dataframe to pandas using pyarrow, but now it 
> complains about a compat attribute. I have attached images below.
> Any help would be appreciated. Thanks.
> Spark Version: 2.4.5.
>  The code is as follows:
> ```
> import pandas as pd
> df_pd = df.toPandas()
> npArr = df_pd.to_numpy()
> ```
> The error is as follows:-
> ```
> /opt/spark/python/pyspark/sql/dataframe.py:2110: UserWarning: toPandas 
> attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is 
> set to true; however, failed by the reason below:
>  module 'pyarrow' has no attribute 'compat'
>  Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' 
> is set to true.
>  warnings.warn(msg)
> ``` 
>  





[jira] [Assigned] (ARROW-10297) [Rust] Parameter for parquet-read to output data in json format

2020-10-13 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jörn Horstmann reassigned ARROW-10297:
--

Assignee: Jörn Horstmann

> [Rust] Parameter for parquet-read to output data in json format
> ---
>
> Key: ARROW-10297
> URL: https://issues.apache.org/jira/browse/ARROW-10297
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Jörn Horstmann
>Assignee: Jörn Horstmann
>Priority: Minor
>
> When analyzing data-related issues I found it really helpful to filter or 
> postprocess the contents of parquet files on the command line using jq 
> (https://stedolan.github.io/jq/manual/).
> Currently the output of parquet-read is in a custom JSON-like format; I 
> propose to add an optional flag that outputs the contents as JSON using the 
> serde_json library. This should probably be behind a feature gate to avoid 
> adding the dependency for everyone.





[jira] [Resolved] (ARROW-10263) [C++][Compute] Improve numerical stability of variances merging

2020-10-13 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-10263.

Fix Version/s: 2.0.0
   Resolution: Fixed

Issue resolved by pull request 8437
[https://github.com/apache/arrow/pull/8437]

> [C++][Compute] Improve numerical stability of variances merging
> ---
>
> Key: ARROW-10263
> URL: https://issues.apache.org/jira/browse/ARROW-10263
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> For chunked arrays, the variance kernel needs to merge variances.
> Tested with two single-value chunks, [400800490] and [400800400]. 
> The merged variance is 3872. If treated as a single array with two values, the 
> variance is 3904, the same as numpy's output.
> So the current merging method is not numerically stable in extreme cases, when 
> chunks are very short and have approximately equal means.
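A numerically stable way to merge per-chunk statistics is the parallel formula of Chan et al., which combines each chunk's count, mean, and sum of squared deviations (M2). The sketch below is illustrative and not necessarily the exact fix in the pull request:

```rust
/// Per-chunk statistics: count, mean, and sum of squared deviations (M2).
#[derive(Clone, Copy)]
struct Stats { n: f64, mean: f64, m2: f64 }

/// Merge two chunks' statistics (Chan et al. parallel algorithm).
fn merge(a: Stats, b: Stats) -> Stats {
    let n = a.n + b.n;
    let delta = b.mean - a.mean;
    Stats {
        n,
        mean: a.mean + delta * b.n / n,
        m2: a.m2 + b.m2 + delta * delta * a.n * b.n / n,
    }
}

fn main() {
    // Two single-value chunks with large, nearly equal means.
    let a = Stats { n: 1.0, mean: 400800490.0, m2: 0.0 };
    let b = Stats { n: 1.0, mean: 400800400.0, m2: 0.0 };
    let merged = merge(a, b);
    // Population variance of two values 90 apart is (45^2 + 45^2) / 2 = 2025.
    assert_eq!(merged.m2 / merged.n, 2025.0);
    println!("ok");
}
```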





[jira] [Created] (ARROW-10297) [Rust] Parameter for parquet-read to output data in json format

2020-10-13 Thread Jira
Jörn Horstmann created ARROW-10297:
--

 Summary: [Rust] Parameter for parquet-read to output data in json 
format
 Key: ARROW-10297
 URL: https://issues.apache.org/jira/browse/ARROW-10297
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Jörn Horstmann


When analyzing data-related issues I found it really helpful to filter or 
postprocess the contents of parquet files on the command line using jq 
(https://stedolan.github.io/jq/manual/).

Currently the output of parquet-read is in a custom JSON-like format; I propose 
to add an optional flag that outputs the contents as JSON using the serde_json 
library. This should probably be behind a feature gate to avoid adding the 
dependency for everyone.
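To illustrate why proper JSON output matters for jq, here is a std-only sketch of emitting one record per line with correct string escaping (illustrative only; the actual proposal would use serde_json rather than hand-rolled encoding):

```rust
/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Emit one record as a JSON object per line (jq-friendly NDJSON).
fn to_json_line(fields: &[(&str, &str)]) -> String {
    let body: Vec<String> = fields
        .iter()
        .map(|(k, v)| format!("\"{}\":\"{}\"", json_escape(k), json_escape(v)))
        .collect();
    format!("{{{}}}", body.join(","))
}

fn main() {
    let line = to_json_line(&[("id", "1"), ("name", "a \"quoted\" value")]);
    assert_eq!(line, r#"{"id":"1","name":"a \"quoted\" value"}"#);
    println!("{}", line);
}
```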





[jira] [Updated] (ARROW-10296) [R] Data saved as integer64 loaded as integer

2020-10-13 Thread Ofek Shilon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ofek Shilon updated ARROW-10296:

Description: 
{{> v <- bit64::as.integer64(1:10)}}
{{> df <- data.frame(v=v)}}
{{> class(df$v)}}
{{[1] "*integer64*"}}
{{> arrow::write_feather(df, "./tmp")}}
{{> df2 <- arrow::read_feather("./tmp")}}
{{> class(df2$v)}}
{{[1] "*integer*"}}

  was:
{{> v <- bit64::as.integer64(1:10)}}
{{> v <- as.integer64(1:10)}}
{{> df <- data.frame(v=v)}}
{{> class(df$v)}}
{{[1] "*integer64*"}}
{{> arrow::write_feather(df, "./tmp")}}
{{> df2 <- arrow::read_feather("./tmp")}}
{{> class(df2$v)}}
{{[1] "*integer*"}}


> [R] Data saved as integer64 loaded as integer
> -
>
> Key: ARROW-10296
> URL: https://issues.apache.org/jira/browse/ARROW-10296
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R3.6.1, arrow 1.0.1, bit64 4.0.5
> full sessionInfo():
> R version 3.6.1 (2019-07-05)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 19041)
> Matrix products: default
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United 
> States.1252LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C   LC_TIME=English_United States.1252 
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base 
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.5   fansi_0.4.1  arrow_1.0.1  dplyr_1.0.2  
> crayon_1.3.4 assertthat_0.2.1 R6_2.4.1 lifecycle_0.2.0 
>  [9] magrittr_1.5 pillar_1.4.6 cli_2.0.2rlang_0.4.7  
> rstudioapi_0.11  generics_0.0.2   vctrs_0.3.4  ellipsis_0.3.1  
> [17] tools_3.6.1  bit64_4.0.5  feather_0.3.5glue_1.4.2   
> purrr_0.3.4  bit_4.0.4hms_0.5.3compiler_3.6.1  
> [25] pkgconfig_2.0.3  tidyselect_1.1.0 tibble_3.0.3
>Reporter: Ofek Shilon
>Priority: Major
>
> {{> v <- bit64::as.integer64(1:10)}}
> {{> df <- data.frame(v=v)}}
> {{> class(df$v)}}
> {{[1] "*integer64*"}}
> {{> arrow::write_feather(df, "./tmp")}}
> {{> df2 <- arrow::read_feather("./tmp")}}
> {{> class(df2$v)}}
> {{[1] "*integer*"}}





[jira] [Updated] (ARROW-10296) [R] Data saved as integer64 loaded as integer

2020-10-13 Thread Ofek Shilon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ofek Shilon updated ARROW-10296:

Description: 
{{> v <- bit64::as.integer64(1:10)}}
{{> v <- as.integer64(1:10)}}
{{> df <- data.frame(v=v)}}
{{> class(df$v)}}
{{[1] "*integer64*"}}
{{> arrow::write_feather(df, "./tmp")}}
{{> df2 <- arrow::read_feather("./tmp")}}
{{> class(df2$v)}}
{{[1] "*integer*"}}

  was:
{{> v <- bit64::as.integer64(1:10)}}
{{ > v <- as.integer64(1:10)}}
{{ > df <- data.frame(v=v)}}
{{ > class(df$v)}}
{{ [1] "*integer64*"}}
{{ > arrow::write_feather(df, "./tmp")}}
{{ > df2 <- arrow::read_feather("./tmp")}}
{{ > class(df2$v)}}
{{ [1] "*integer*"}}


> [R] Data saved as integer64 loaded as integer
> -
>
> Key: ARROW-10296
> URL: https://issues.apache.org/jira/browse/ARROW-10296
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R3.6.1, arrow 1.0.1, bit64 4.0.5
> full sessionInfo():
> R version 3.6.1 (2019-07-05)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 19041)
> Matrix products: default
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United 
> States.1252LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C   LC_TIME=English_United States.1252 
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base 
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.5   fansi_0.4.1  arrow_1.0.1  dplyr_1.0.2  
> crayon_1.3.4 assertthat_0.2.1 R6_2.4.1 lifecycle_0.2.0 
>  [9] magrittr_1.5 pillar_1.4.6 cli_2.0.2rlang_0.4.7  
> rstudioapi_0.11  generics_0.0.2   vctrs_0.3.4  ellipsis_0.3.1  
> [17] tools_3.6.1  bit64_4.0.5  feather_0.3.5glue_1.4.2   
> purrr_0.3.4  bit_4.0.4hms_0.5.3compiler_3.6.1  
> [25] pkgconfig_2.0.3  tidyselect_1.1.0 tibble_3.0.3
>Reporter: Ofek Shilon
>Priority: Major
>
> {{> v <- bit64::as.integer64(1:10)}}
> {{> v <- as.integer64(1:10)}}
> {{> df <- data.frame(v=v)}}
> {{> class(df$v)}}
> {{[1] "*integer64*"}}
> {{> arrow::write_feather(df, "./tmp")}}
> {{> df2 <- arrow::read_feather("./tmp")}}
> {{> class(df2$v)}}
> {{[1] "*integer*"}}





[jira] [Updated] (ARROW-10296) [R] Data saved as integer64 loaded as integer

2020-10-13 Thread Ofek Shilon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ofek Shilon updated ARROW-10296:

Description: 
{{> v <- bit64::as.integer64(1:10)}}
{{ > v <- as.integer64(1:10)}}
{{ > df <- data.frame(v=v)}}
{{ > class(df$v)}}
{{ [1] "*integer64*"}}
{{ > arrow::write_feather(df, "./tmp")}}
{{ > df2 <- arrow::read_feather("./tmp")}}
{{ > class(df2$v)}}
{{ [1] "*integer*"}}

  was:
> v <- bit64::as.integer64(1:10)
{{ {{> v <- as.integer64(1:10)
{{ {{> df <- data.frame(v=v)
{{> class(df$v)}}
{{[1] "*integer64*"}}
{{ > arrow::write_feather(df, "./tmp")}}
{{ {{> df2 <- arrow::read_feather("./tmp")
{{ {{> class(df2$v)
{{ {{[1] "*integer*"


> [R] Data saved as integer64 loaded as integer
> -
>
> Key: ARROW-10296
> URL: https://issues.apache.org/jira/browse/ARROW-10296
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R3.6.1, arrow 1.0.1, bit64 4.0.5
> full sessionInfo():
> R version 3.6.1 (2019-07-05)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 19041)
> Matrix products: default
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United 
> States.1252LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C   LC_TIME=English_United States.1252 
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base 
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.5   fansi_0.4.1  arrow_1.0.1  dplyr_1.0.2  
> crayon_1.3.4 assertthat_0.2.1 R6_2.4.1 lifecycle_0.2.0 
>  [9] magrittr_1.5 pillar_1.4.6 cli_2.0.2rlang_0.4.7  
> rstudioapi_0.11  generics_0.0.2   vctrs_0.3.4  ellipsis_0.3.1  
> [17] tools_3.6.1  bit64_4.0.5  feather_0.3.5glue_1.4.2   
> purrr_0.3.4  bit_4.0.4hms_0.5.3compiler_3.6.1  
> [25] pkgconfig_2.0.3  tidyselect_1.1.0 tibble_3.0.3
>Reporter: Ofek Shilon
>Priority: Major
>
> {{> v <- bit64::as.integer64(1:10)}}
> {{ > v <- as.integer64(1:10)}}
> {{ > df <- data.frame(v=v)}}
> {{ > class(df$v)}}
> {{ [1] "*integer64*"}}
> {{ > arrow::write_feather(df, "./tmp")}}
> {{ > df2 <- arrow::read_feather("./tmp")}}
> {{ > class(df2$v)}}
> {{ [1] "*integer*"}}





[jira] [Updated] (ARROW-10296) [R] Data saved as integer64 loaded as integer

2020-10-13 Thread Ofek Shilon (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ofek Shilon updated ARROW-10296:

Description: 
> v <- bit64::as.integer64(1:10)
{{ {{> v <- as.integer64(1:10)
{{ {{> df <- data.frame(v=v)
{{> class(df$v)}}
{{[1] "*integer64*"}}
{{ > arrow::write_feather(df, "./tmp")}}
{{ {{> df2 <- arrow::read_feather("./tmp")
{{ {{> class(df2$v)
{{ {{[1] "*integer*"

  was:
{{> v <- bit64::as.integer64(1:10)}}
{{> v <- as.integer64(1:10)}}
{{> df <- data.frame(v=v)}}
{{> class(df$v)
}}{{[1] "*integer64*"
> arrow::write_feather(df, "./tmp")}}
{{> df2 <- arrow::read_feather("./tmp")}}
{{> class(df2$v)}}
{{[1] "*integer*"}}


> [R] Data saved as integer64 loaded as integer
> -
>
> Key: ARROW-10296
> URL: https://issues.apache.org/jira/browse/ARROW-10296
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.0.1
> Environment: R3.6.1, arrow 1.0.1, bit64 4.0.5
> full sessionIfno():
> R version 3.6.1 (2019-07-05)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 19041)
> Matrix products: default
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United 
> States.1252LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C   LC_TIME=English_United States.1252 
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base 
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.5   fansi_0.4.1  arrow_1.0.1  dplyr_1.0.2  
> crayon_1.3.4 assertthat_0.2.1 R6_2.4.1 lifecycle_0.2.0 
>  [9] magrittr_1.5 pillar_1.4.6 cli_2.0.2rlang_0.4.7  
> rstudioapi_0.11  generics_0.0.2   vctrs_0.3.4  ellipsis_0.3.1  
> [17] tools_3.6.1  bit64_4.0.5  feather_0.3.5glue_1.4.2   
> purrr_0.3.4  bit_4.0.4hms_0.5.3compiler_3.6.1  
> [25] pkgconfig_2.0.3  tidyselect_1.1.0 tibble_3.0.3
>Reporter: Ofek Shilon
>Priority: Major
>
> > v <- bit64::as.integer64(1:10)
> {{ {{> v <- as.integer64(1:10)
> {{ {{> df <- data.frame(v=v)
> {{> class(df$v)}}
> {{[1] "*integer64*"}}
> {{ > arrow::write_feather(df, "./tmp")}}
> {{ {{> df2 <- arrow::read_feather("./tmp")
> {{ {{> class(df2$v)
> {{ {{[1] "*integer*"





[jira] [Created] (ARROW-10296) [R] Data saved as integer64 loaded as integer

2020-10-13 Thread Ofek Shilon (Jira)
Ofek Shilon created ARROW-10296:
---

 Summary: [R] Data saved as integer64 loaded as integer
 Key: ARROW-10296
 URL: https://issues.apache.org/jira/browse/ARROW-10296
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 1.0.1
 Environment: R3.6.1, arrow 1.0.1, bit64 4.0.5

full sessionInfo():

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252  
  LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C   LC_TIME=English_United States.1252   
 

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5   fansi_0.4.1  arrow_1.0.1  dplyr_1.0.2  
crayon_1.3.4 assertthat_0.2.1 R6_2.4.1 lifecycle_0.2.0 
 [9] magrittr_1.5 pillar_1.4.6 cli_2.0.2rlang_0.4.7  
rstudioapi_0.11  generics_0.0.2   vctrs_0.3.4  ellipsis_0.3.1  
[17] tools_3.6.1  bit64_4.0.5  feather_0.3.5glue_1.4.2   
purrr_0.3.4  bit_4.0.4hms_0.5.3compiler_3.6.1  
[25] pkgconfig_2.0.3  tidyselect_1.1.0 tibble_3.0.3
Reporter: Ofek Shilon


{{> v <- bit64::as.integer64(1:10)}}
{{> v <- as.integer64(1:10)}}
{{> df <- data.frame(v=v)}}
{{> class(df$v)
}}{{[1] "*integer64*"
> arrow::write_feather(df, "./tmp")}}
{{> df2 <- arrow::read_feather("./tmp")}}
{{> class(df2$v)}}
{{[1] "*integer*"}}





[jira] [Updated] (ARROW-10295) [Rist] [DataFusion] Simplify accumulators

2020-10-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10295:
---
Labels: pull-request-available  (was: )

> [Rist] [DataFusion] Simplify accumulators
> -
>
> Key: ARROW-10295
> URL: https://issues.apache.org/jira/browse/ARROW-10295
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust, Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Replace Rc<RefCell<>> by Box<>.


