[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2019-02-06 Thread Li Jin (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762024#comment-16762024
 ] 

Li Jin commented on ARROW-1425:
---

[~emkornfi...@gmail.com] Feel free to finish it up.

> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]
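
The mismatch described in the issue can be sketched with the Python standard library alone. This is an illustration, not Spark or pyarrow code: `SESSION_TZ` is a hypothetical session time zone (a fixed UTC-8 offset chosen for the example), and `epoch_seconds` is a helper introduced here. The same naive wall-clock value is read once as session-local time and once as UTC, and the two readings name different instants.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical session time zone: a fixed UTC-8 offset (assumption for illustration).
SESSION_TZ = timezone(timedelta(hours=-8))

# A naive wall-clock value, i.e. one that carries no time zone information.
naive = datetime(2017, 9, 12, 12, 0, 0)

# Reading 1: the fields are session-local wall-clock time.
as_session_local = naive.replace(tzinfo=SESSION_TZ)
# Reading 2: the fields are taken as though they were UTC.
as_utc = naive.replace(tzinfo=timezone.utc)


def epoch_seconds(dt):
    """Return whole seconds since 1970-01-01T00:00:00Z for an aware datetime."""
    return int(dt.timestamp())


# The two readings differ by exactly the session offset.
skew = epoch_seconds(as_session_local) - epoch_seconds(as_utc)
print(skew)  # 28800 seconds, i.e. 8 hours
```

Attaching an explicit time zone before handing the data to another system removes the ambiguity, because both sides then agree on the instant rather than on the displayed fields.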



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2019-01-30 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756928#comment-16756928
 ] 

Micah Kornfield commented on ARROW-1425:


[~icexelloss] I pushed a new PR for this so if you don't mind, I will try to 
finish it up.

> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-03-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413108#comment-16413108
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

icexelloss commented on issue #1575: ARROW-1425: [Python] Document Arrow 
timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-375987823
 
 
   I see. Let's resolve the 0.9.0 packaging issue first. If you have suggestions 
about what to remove from the doc, please let me know as well. Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Li Jin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-03-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412778#comment-16412778
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

icexelloss commented on issue #1575: ARROW-1425: [Python] Document Arrow 
timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-375920940
 
 
   Hey @wesm, I wonder if we should pick this up? (since 0.9 is out)




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Li Jin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359476#comment-16359476
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

wesm closed pull request #1095: ARROW-1425 [Python] Document semantic 
differences between Spark and Arrow timestamps
URL: https://github.com/apache/arrow/pull/1095
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/doc/source/index.rst b/python/doc/source/index.rst
index c2ae769b2..452309054 100644
--- a/python/doc/source/index.rst
+++ b/python/doc/source/index.rst
@@ -43,5 +43,6 @@ structures.
    plasma
    pandas
    parquet
+   other_systems
    api
    getting_involved
diff --git a/python/doc/source/other_systems.rst b/python/doc/source/other_systems.rst
new file mode 100644
index 0..76d3afab4
--- /dev/null
+++ b/python/doc/source/other_systems.rst
@@ -0,0 +1,182 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow
+.. _other_systems:
+
+Using Arrow with other systems
+==============================
+
+Timestamps
+----------
+
+Timestamps are data structures that mark a particular point in time
+and we want to be able to order them, regardless of where
+they originated.
+For human consumption timestamps are usually specified by the date
+together with the time of day, often using the local time zone.
+The problem with this scheme is that if things need to be ordered
+by time across multiple time zones, using local time can be ambiguous.
+Therefore timestamps from multiple time zones should always be collected,
+stored and communicated in UTC to avoid this ambiguity.
+
+Most computer systems do not store timestamps as two-part values
+with a date part and a time within that date, as most of us humans
+think about them. Instead the timestamp is stored as a single value
+offset from a given point in time in some time unit, e.g. seconds
+or milliseconds. An example of this is the Unix timestamp, which is
+the number of seconds since midnight January 1st, 1970 in the UTC
+time zone. When the timestamp is then presented to an end user
+the scalar value is converted to the familiar date time format.
+
+Note the importance of the time zone in the conversion from scalar
+timestamp value to date and time. The Unix timestamp value 0 is
+translated to '1969-12-31 19:00:00' in the 'America/New_York' time
+zone, because it is defined in UTC, and New York was five hours
+behind UTC at that point in time. Systems that use the
+local time zone of the server as reference for calculating the
+timestamp offset value can cause problems when those values need to
+be communicated to other systems.
+
+Timestamps from the systems described above are called `non-UTC-normalized`.
+Arrow, on the other hand, always uses UTC as the base for calculating
+timestamp offsets, as further described below. Timestamps in Arrow are called
+`UTC-normalized`. Special care must always be taken when data from a
+system that is `non-UTC-normalized` is read by Arrow.
+
+Timestamp types in Arrow are specified with a resolution and an optional
+time zone. Several utility functions exist to convert data from
+Pandas to Arrow, including functions that convert timestamp values
+to milliseconds, because Pandas uses nanoseconds and other systems,
+e.g. Parquet, use timestamps in milliseconds.
+
+In Arrow, timestamps have two forms, depending on whether the time zone is
+specified or not:
+
+*   **Time zone naive** (where ``tz=None`` in Python); there is no
+notion of UTC or local time zone. Python will interpret a
+timestamp like ``2017-09-12 12:00:00`` to be in the local time
+zone, not UTC. In Arrow, on the other hand, the timestamp shall
+be displayed as is to the user and not localized to their
+time zone. The value of the timestamp will be treated as if
+it were specified in UTC.
+
+*   **Time zone aware** where the integer values are internally 
+normalized 
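
The epoch conversion discussed in the documentation diff above can be checked with a small standard-library sketch. It assumes a fixed UTC-5 offset as a stand-in for 'America/New_York' (New York observed Eastern Standard Time on 1 January 1970), so that no time zone database is needed; `EST` and `render_epoch` are names introduced here for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset standing in for 'America/New_York' in winter
# (assumption: a fixed offset avoids any dependency on a time zone database).
EST = timezone(timedelta(hours=-5), "EST")


def render_epoch(seconds, tz):
    """Format a Unix timestamp (seconds since 1970-01-01T00:00:00Z) in tz."""
    return datetime.fromtimestamp(seconds, tz).strftime("%Y-%m-%d %H:%M:%S")


# The same scalar value 0 renders differently depending on the time zone.
utc_text = render_epoch(0, timezone.utc)
est_text = render_epoch(0, EST)
print(utc_text)  # 1970-01-01 00:00:00
print(est_text)  # 1969-12-31 19:00:00
```

The stored scalar is identical in both calls; only the rendering differs, which is why the time zone used for display has to travel with the data or be agreed on out of band.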

[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359477#comment-16359477
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

wesm commented on issue #1095: ARROW-1425 [Python] Document semantic 
differences between Spark and Arrow timestamps
URL: https://github.com/apache/arrow/pull/1095#issuecomment-364665730
 
 
   closing in favor of #1575 




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357119#comment-16357119
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

wesm commented on issue #1575: ARROW-1425: [Python] Document Arrow timestamps, 
and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-364159244
 
 
   We don't yet have a place (outside `format/`) for language-independent or 
cross-language documentation. This would be very helpful to get set up if we 
can agree as a community what tool to use to build this documentation




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356767#comment-16356767
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

ts-dpb commented on issue #1575: ARROW-1425: [Python] Document Arrow 
timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-364069670
 
 
   It was puzzling to the author and me where to place the new piece of
   documentation – we looked for a top-level doc directory but there was none.
   




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356752#comment-16356752
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

xhochy commented on issue #1575: ARROW-1425: [Python] Document Arrow 
timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-364062240
 
 
   @icexelloss @wesm Keep it in Python for now. In the future, we should merge 
all documentation into a single Sphinx setup. Until we have done this, Python 
is a good default place, as it already uses Sphinx and is currently the most 
detailed documentation.




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356217#comment-16356217
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

wesm commented on issue #1575: ARROW-1425: [Python] Document Arrow timestamps, 
and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-363943606
 
 
   Well, the scope of ARROW-1425 is to explain to Python users what they need 
to know to make correct joint use of pandas, Arrow, and Spark. I have push 
rights on this branch so I can edit directly, maybe tonight or sometime tomorrow




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356216#comment-16356216
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

wesm commented on issue #1575: ARROW-1425: [Python] Document Arrow timestamps, 
and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-363943606
 
 
   Well, the scope of ARROW-1425 is to explain to Python users what they need 
to know to make correct use of pandas, Arrow, and Spark. I have push rights on 
this branch so I can edit directly, maybe tonight or sometime tomorrow




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356196#comment-16356196
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

icexelloss commented on issue #1575: ARROW-1425: [Python] Document Arrow 
timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-363939714
 
 
   @wesm This is not a Python specific document. Is there a better place for 
this other than under python?




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356195#comment-16356195
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

icexelloss commented on a change in pull request #1575: ARROW-1425: [Python] 
Document Arrow timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#discussion_r166783544
 
 

 ##
 File path: python/doc/source/timestamps.rst
 ##
 @@ -0,0 +1,433 @@
+All About Timestamps (work in progress)
 
 Review comment:
   It is a big document. It's pretty long right now because there are quite a 
few concepts to clarify: about 50% of the doc is about concepts and the other 
half is about Arrow <-> Spark.




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356182#comment-16356182
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

wesm commented on a change in pull request #1575: ARROW-1425 [Doc] Document 
Arrow timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#discussion_r166779301
 
 

 ##
 File path: python/doc/source/timestamps.rst
 ##
 @@ -0,0 +1,433 @@
+All About Timestamps (work in progress)
 
 Review comment:
   This is a big document. I'd like to see if we can make this about 50% as 
long or less. I will review in more detail as soon as I can and make some 
comments to help




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356158#comment-16356158
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

ts-dpb commented on issue #1575: ARROW-1425 [Doc] Document Arrow timestamps, 
and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-363929187
 
 
   cc: @icexelloss 




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356154#comment-16356154
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

ts-dpb opened a new pull request #1575: ARROW-1425 [Doc] Document Arrow 
timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575
 
 
   




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-01-18 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331258#comment-16331258
 ] 

Li Jin commented on ARROW-1425:
---

Here is my attempt to explain this issue (wip):

https://docs.google.com/document/d/1vfL8gLWKCgf7ZVLglnNffdvwjJjC4MqwnnoCuEJaRrU/edit#heading=h.132ni22bywvl

> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-01-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329915#comment-16329915
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

cloud-fan commented on a change in pull request #1095: ARROW-1425 [Python] 
Document semantic differences between Spark and Arrow timestamps
URL: https://github.com/apache/arrow/pull/1095#discussion_r162237780
 
 

 ##
 File path: python/doc/source/other_systems.rst
 ##
 @@ -0,0 +1,182 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow
+.. _other_systems:
+
+Using Arrow with other systems
+==============================
+
+Timestamps
+----------
+
+Timestamps are data structures that mark a particular point in time
 
 Review comment:
   It seems like Parquet has a `FloatingTimestamp` type now.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]





[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-01-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329511#comment-16329511
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

icexelloss commented on issue #1095: ARROW-1425 [Python] Document semantic 
differences between Spark and Arrow timestamps
URL: https://github.com/apache/arrow/pull/1095#issuecomment-358455821
 
 
   I can try to take a look at it this week. 




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]





[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-01-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329330#comment-16329330
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

wesm commented on issue #1095: ARROW-1425 [Python] Document semantic 
differences between Spark and Arrow timestamps
URL: https://github.com/apache/arrow/pull/1095#issuecomment-358423404
 
 
   @BryanCutler @icexelloss we should pick up this patch and get this properly 
documented now that Arrow 0.8.0 is out.




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]





[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242678#comment-16242678
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

cloud-fan commented on a change in pull request #1095: ARROW-1425 [Python] 
Document semantic differences between Spark and Arrow timestamps
URL: https://github.com/apache/arrow/pull/1095#discussion_r149474373
 
 

 ##
 File path: python/doc/source/other_systems.rst
 ##
 @@ -0,0 +1,182 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow
+.. _other_systems:
+
+Using Arrow with other systems
+==============================
+
+Timestamps
+----------
+
+Timestamps are data structures that mark a particular point in time
 
 Review comment:
   I don't think this is true. In the SQL standard, a timestamp (by default 
TIMESTAMP WITHOUT TIME ZONE) is a "floating" time, roughly the number of 
seconds from a local epoch; for example, 0 represents "1970-01-01 00:00:00" no 
matter which time zone you are in. In Spark SQL and Parquet, the timestamp is 
the number of seconds from the Unix epoch, which is a particular point in time.
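
   The two readings of a stored value described above can be sketched in 
Python. This is only an illustration using the standard library: the helper 
names `render_instant` and `render_floating` and the fixed UTC offsets are 
assumptions made for the example and are not part of Spark, Parquet, or Arrow.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed-offset zones, chosen only for illustration.
ZONES = {
    "UTC": timezone.utc,
    "+08:00": timezone(timedelta(hours=8)),
    "-05:00": timezone(timedelta(hours=-5)),
}

FMT = "%Y-%m-%d %H:%M:%S"


def render_instant(seconds, tz):
    """Treat `seconds` as an offset from the Unix epoch (a point in time)
    and render the wall-clock fields an observer in `tz` would see."""
    unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (unix_epoch + timedelta(seconds=seconds)).astimezone(tz).strftime(FMT)


def render_floating(seconds):
    """Treat `seconds` as an offset from a local epoch ("floating" time):
    the wall-clock fields are the same in every zone."""
    local_epoch = datetime(1970, 1, 1)  # naive, no zone attached
    return (local_epoch + timedelta(seconds=seconds)).strftime(FMT)


# The same stored value, 0, under both interpretations.
instant_views = {name: render_instant(0, tz) for name, tz in ZONES.items()}
floating_view = render_floating(0)

for name, text in instant_views.items():
    print(f"instant  0 seen from {name}: {text}")
print(f"floating 0 seen from anywhere: {floating_view}")
```

   Under instant semantics the stored value 0 produces a different wall-clock 
string in each zone, while under floating semantics it produces one string 
everywhere.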




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2017-10-25 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16219917#comment-16219917
 ] 

Wes McKinney commented on ARROW-1425:
-

It seems there is still too much in flux on the Spark side. Moving this to the 
next milestone.

> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]





[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2017-10-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16219844#comment-16219844
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

wesm commented on issue #1095: ARROW-1425 [Python] Document semantic 
differences between Spark and Arrow timestamps
URL: https://github.com/apache/arrow/pull/1095#issuecomment-339527887
 
 
   @icexelloss @heimir-sverrisson it may make sense to engage in 
https://github.com/apache/spark/pull/18664 and at least try to follow the 
discussion going on there around time zones. This is very thorny stuff, and I 
don't have the bandwidth right now to engage with it properly.




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]


