Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-20 Thread via GitHub


itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1955590870

   Thank you all so much for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-20 Thread via GitHub


bjornjorgensen commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1955073306

   Great work @itholic Thank you :) 





Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-20 Thread via GitHub


dongjoon-hyun commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1954505946

   Merged to master.
   
   Thank you again, @itholic and @HyukjinKwon .





Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-20 Thread via GitHub


dongjoon-hyun closed pull request #44881: [SPARK-46858][PYTHON][PS][BUILD] 
Upgrade Pandas to 2.2.0
URL: https://github.com/apache/spark/pull/44881





Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495345854


##
python/pyspark/pandas/series.py:
##
@@ -7092,15 +7092,15 @@ def resample(
 --
 rule : str
 The offset string or object representing target conversion.
-Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-'T', 'MIN', 'S'}.
+Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+'min', 'MIN', 's'}.

Review Comment:
   Just updated `resample` to work with old Pandas as well. Now it's safe.






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1953618046

   Just updated `resample` to work with old Pandas as well.
   
   I think we can just mark the old aliases as deprecated for now to avoid 
breaking existing pipelines.
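
A minimal sketch of the deprecate-instead-of-break idea in the comment above. The helper name `normalize_resample_unit` and the warning text are hypothetical and are not taken from the PySpark source; the alias mapping comes from the docstring diff in this PR. Deprecated aliases are still accepted, translated to their Pandas 2.2.0 spellings, and flagged with a warning.

```python
import warnings

# Deprecated alias -> replacement, as listed in the docstring diff of this PR.
DEPRECATED_UNITS = {"Y": "YE", "M": "ME", "H": "h", "T": "min", "S": "s"}


def normalize_resample_unit(unit: str) -> str:
    """Return the Pandas 2.2.0 spelling of a resample unit (hypothetical helper)."""
    replacement = DEPRECATED_UNITS.get(unit)
    if replacement is None:
        # Not a deprecated alias; pass it through unchanged.
        return unit
    warnings.warn(
        f"'{unit}' is deprecated; use '{replacement}' instead.",
        FutureWarning,
        stacklevel=2,
    )
    return replacement
```

With this shape, an existing pipeline that passes `'M'` keeps running and only sees a `FutureWarning`, while new code can use `'ME'` directly.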





Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1953603875

   Oh, wait.
   
   I just remembered that we just follow the Pandas behavior and separately 
mention the breaking changes in the [release 
notes](https://github.com/apache/spark/blob/master/python/docs/source/migration_guide/pyspark_upgrade.rst).
   
   So maybe we should add a release note instead of reverting the breaking 
changes here? @dongjoon-hyun @HyukjinKwon 





Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1953588339

   We should not introduce any breaking changes. Let me address them.
   
   Thanks, @dongjoon-hyun, for double-checking.





Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495320737


##
python/pyspark/pandas/series.py:
##
@@ -7092,15 +7092,15 @@ def resample(
 --
 rule : str
 The offset string or object representing target conversion.
-Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-'T', 'MIN', 'S'}.
+Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+'min', 'MIN', 's'}.

Review Comment:
   Oh, sorry, that was my mistake. This should work even with old Pandas before 
the Spark 4.0.0 release.
   
   Let me fix them to work with both Pandas 2.2.0 and old Pandas.
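
One way to support both Pandas 2.2.0 and older versions can be sketched as follows. This is an illustration under stated assumptions, not the change made in this PR: `parse_version` and `unit_for_installed_pandas` are hypothetical helpers, version strings are assumed to be plain `X.Y.Z`, and the alias pairs are the ones shown in the docstring diff above.

```python
# Old alias -> alias introduced in Pandas 2.2.0 (from the docstring diff above).
OLD_TO_NEW = {"Y": "YE", "M": "ME", "H": "h", "T": "min", "S": "s"}
NEW_TO_OLD = {new: old for old, new in OLD_TO_NEW.items()}


def parse_version(version: str) -> tuple:
    """Parse a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in version.split(".")[:3])


def unit_for_installed_pandas(unit: str, pandas_version: str) -> str:
    """Translate a resample unit to the spelling used by the given Pandas version."""
    if parse_version(pandas_version) >= (2, 2, 0):
        # Pandas 2.2.0 and later: prefer the new aliases.
        return OLD_TO_NEW.get(unit, unit)
    # Earlier Pandas: fall back to the old aliases.
    return NEW_TO_OLD.get(unit, unit)
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as `"2.10.0" < "2.2.0"`.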



##
python/pyspark/pandas/series.py:
##
@@ -7092,15 +7092,15 @@ def resample(
 --
 rule : str
 The offset string or object representing target conversion.
-Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-'T', 'MIN', 'S'}.
+Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+'min', 'MIN', 's'}.

Review Comment:
   ~~However, the current rule is that it should not be accompanied by such a 
breaking change unless the major version changes.~~
   
   ~~This means that users should be able to use their pipeline as is, as long 
as they are using at least version 3.x of Spark.~~






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495316022


##
python/pyspark/pandas/series.py:
##
@@ -7092,15 +7092,15 @@ def resample(
 --
 rule : str
 The offset string or object representing target conversion.
-Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-'T', 'MIN', 'S'}.
+Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+'min', 'MIN', 's'}.

Review Comment:
   However, the current rule is that it should not be accompanied by such a 
breaking change unless the major version changes.
   
   This means that users should be able to use their pipeline as is, as long as 
they are using at least version 3.x of Spark.






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495314548


##
python/pyspark/pandas/series.py:
##
@@ -7092,15 +7092,15 @@ def resample(
 --
 rule : str
 The offset string or object representing target conversion.
-Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-'T', 'MIN', 'S'}.
+Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+'min', 'MIN', 's'}.

Review Comment:
   > Even for the users who choose old Pandas libraries, Apache Spark enforces 
this breaking change
   
   Yes. Under the current policy, if we want to use the latest Apache Spark, 
then we cannot avoid having to follow the behavior of the latest Pandas as well, IIRC.






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1953573086

   - Is the change of python/pyspark/pandas/resample.py safe?
   
   It breaks the previous behavior, so if we plan to release another minor 
release (Spark 3.5.0), this should not be included.
   
   - What happens when the users decide to use old Pandas (<= 2.2.0)?
   
   Using deprecated aliases (`Y`, `M`, `H`, `T`, `S`) wouldn't work.





Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


dongjoon-hyun commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495305345


##
python/pyspark/pandas/series.py:
##
@@ -7092,15 +7092,15 @@ def resample(
 --
 rule : str
 The offset string or object representing target conversion.
-Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-'T', 'MIN', 'S'}.
+Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+'min', 'MIN', 's'}.

Review Comment:
   The background of my question is that `Data Science` team has been 
struggling when they validate their pipelines on new Spark versions.






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


dongjoon-hyun commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495304624


##
python/pyspark/pandas/series.py:
##
@@ -7092,15 +7092,15 @@ def resample(
 --
 rule : str
 The offset string or object representing target conversion.
-Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-'T', 'MIN', 'S'}.
+Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+'min', 'MIN', 's'}.

Review Comment:
   Ya, that comes to my second question. 
(https://github.com/apache/spark/pull/44881#pullrequestreview-1889581226).
   
   Even for the users who choose old Pandas libraries, Apache Spark enforces 
this breaking change, @itholic ?






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495302902


##
python/pyspark/pandas/series.py:
##
@@ -7092,15 +7092,15 @@ def resample(
 --
 rule : str
 The offset string or object representing target conversion.
-Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-'T', 'MIN', 'S'}.
+Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+'min', 'MIN', 's'}.

Review Comment:
   Yeah, Pandas 2.2.0 brings a couple of breaking changes, so we should make sure 
we ship this support after Spark 4.0.0.
   
   See the [related update in the Pandas release 
notes](https://pandas.pydata.org/docs/whatsnew/v2.2.0.html#deprecate-aliases-m-q-y-etc-in-favour-of-me-qe-ye-etc-for-offsets)
 for more detail.






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


dongjoon-hyun commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495297012


##
python/pyspark/pandas/series.py:
##
@@ -7092,15 +7092,15 @@ def resample(
 --
 rule : str
 The offset string or object representing target conversion.
-Currently, supported units are {'Y', 'A', 'M', 'D', 'H',
-'T', 'MIN', 'S'}.
+Currently, supported units are {'YE', 'A', 'ME', 'D', 'h',
+'min', 'MIN', 's'}.

Review Comment:
   Is this a breaking change?






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1953550280

   I believe this PR now addresses all of the Pandas 2.2.0 behavior changes. cc 
@HyukjinKwon @dongjoon-hyun FYI





Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-19 Thread via GitHub


itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1495189446


##
python/pyspark/pandas/namespace.py:
##
@@ -2554,7 +2554,10 @@ def resolve_func(psdf, this_column_labels, 
that_column_labels):
 if isinstance(obj, Series):
 num_series += 1
 series_names.add(obj.name)
-new_objs.append(obj.to_frame(DEFAULT_SERIES_NAME))
+if not ignore_index and not should_return_series:
+new_objs.append(obj.to_frame())
+else:
+new_objs.append(obj.to_frame(DEFAULT_SERIES_NAME))

Review Comment:
   Related to https://github.com/pandas-dev/pandas/issues/15047






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-16 Thread via GitHub


itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1492108932


##
python/pyspark/pandas/plot/matplotlib.py:
##
@@ -363,10 +364,23 @@ def _args_adjust(self):
 if is_list_like(self.bottom):
 self.bottom = np.array(self.bottom)
 
+def _ensure_frame(self, data):
+return data
+
+def _calculate_bins(self, data, bins):
+return bins

Review Comment:
   Pandas recently pushed a couple of commits refactoring the internal 
plotting structure, such as https://github.com/pandas-dev/pandas/pull/55850 and 
https://github.com/pandas-dev/pandas/pull/55872, so we should also override a 
couple of internal methods to follow the latest Pandas behavior.
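
The pattern in the diff above (a subclass redefining internal hooks so they return their inputs unchanged) can be illustrated with a self-contained sketch. `BaseHistPlot` and `DistributedHistPlot` are hypothetical stand-ins, not the actual Pandas or PySpark classes; only the hook names `_ensure_frame` and `_calculate_bins` come from the diff, and what the base hooks do here is an assumption made for illustration.

```python
class BaseHistPlot:
    """Hypothetical stand-in for an upstream plot class with internal hooks."""

    def __init__(self, data, bins=10):
        # The base constructor routes its inputs through the internal hooks.
        self.data = self._ensure_frame(data)
        self.bins = self._calculate_bins(self.data, bins)

    def _ensure_frame(self, data):
        # Stand-in for eagerly converting the input to a local container.
        return list(data)

    def _calculate_bins(self, data, bins):
        # Stand-in for computing evenly spaced bin edges from local data.
        low, high = min(data), max(data)
        step = (high - low) / bins
        return [low + i * step for i in range(bins + 1)]


class DistributedHistPlot(BaseHistPlot):
    """Overrides both hooks so the base class leaves the inputs untouched."""

    def _ensure_frame(self, data):
        return data

    def _calculate_bins(self, data, bins):
        return bins
```

Because the base class calls the hooks through `self`, the subclass versions take effect without changing the base constructor, which is presumably why adding the two methods is enough to track the upstream refactoring.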






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-02-13 Thread via GitHub


itholic commented on PR #44881:
URL: https://github.com/apache/spark/pull/44881#issuecomment-1942942082

   Yeah, Pandas fixed many bugs in 2.2.0, which brings a couple of 
behavior changes.
   
   Let me fix them. Thanks for confirming!





Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-01-25 Thread via GitHub


itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1467133120


##
dev/infra/Dockerfile:
##
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage 
matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage 
matplotlib lxml

Review Comment:
   Got it. BTW, Pandas 2.2.0 again introduces some breaking changes. Let me 
address them.






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-01-25 Thread via GitHub


HyukjinKwon commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1467123646


##
dev/infra/Dockerfile:
##
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage 
matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage 
matplotlib lxml

Review Comment:
   Let's pin this to 2.2.0 for now. I think we have seen some issues when 
automatically using the latest pandas version.






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-01-25 Thread via GitHub


itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1466155995


##
dev/infra/Dockerfile:
##
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage 
matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage 
matplotlib lxml

Review Comment:
   AFAIK, pip automatically finds the most recent version that meets the 
conditions as below (Has this not worked well so far btw??):
   
   ```
   (pyspark-dev-env) spark % pip install "pandas<=2.2.0"
   Collecting pandas<=2.2.0
   ...
   Installing collected packages: pandas
   Successfully installed pandas-2.2.0
   ```
   
   But I'm okay with using `==`. WDYT, @HyukjinKwon ?
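
The difference between `<=`, `>=`, and `==` discussed in this thread can be sketched with a simplified resolver. This is not pip's actual resolution logic: `pick_version` is a hypothetical helper that assumes plain `X.Y.Z` version strings and a fixed list of available releases.

```python
def parse_version(version: str) -> tuple:
    """Parse a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def pick_version(available, op, target):
    """Return the highest available version satisfying `op target`, or None."""
    wanted = parse_version(target)
    checks = {
        "<=": lambda v: v <= wanted,
        ">=": lambda v: v >= wanted,
        "==": lambda v: v == wanted,
    }
    matching = [v for v in available if checks[op](parse_version(v))]
    # Highest match wins; None signals that nothing satisfies the constraint.
    return max(matching, key=parse_version, default=None)
```

In this model, `pick_version(["2.1.4"], "<=", "2.2.0")` quietly falls back to `"2.1.4"`, whereas `"=="` yields `None`, which mirrors the concern that `<=` cannot confirm 2.2.0 is actually the version in use.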
   






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-01-25 Thread via GitHub


zhengruifeng commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1466096686


##
dev/infra/Dockerfile:
##
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage 
matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage 
matplotlib lxml

Review Comment:
   or using `==`? just because `<=` cannot confirm 2.2.0 is used






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-01-25 Thread via GitHub


itholic commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1466077514


##
dev/infra/Dockerfile:
##
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage 
matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage 
matplotlib lxml

Review Comment:
   I think the CI might break in the future when a higher version is 
released?






Re: [PR] [SPARK-46858][PYTHON][PS][BUILD] Upgrade Pandas to 2.2.0 [spark]

2024-01-25 Thread via GitHub


zhengruifeng commented on code in PR #44881:
URL: https://github.com/apache/spark/pull/44881#discussion_r1466068442


##
dev/infra/Dockerfile:
##
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage 
matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage 
matplotlib lxml

Review Comment:
   shall we use `>=` to confirm 2.2.0+ is used?



##
dev/infra/Dockerfile:
##
@@ -91,10 +91,10 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
 ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3
 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3
-RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.1.4' scipy coverage 
matplotlib lxml
+RUN pypy3 -m pip install numpy 'six==1.16.0' 'pandas<=2.2.0' scipy coverage 
matplotlib lxml
 
 
-ARG BASIC_PIP_PKGS="numpy pyarrow>=14.0.0 six==1.16.0 pandas<=2.1.4 scipy 
plotly>=4.8 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 
scikit-learn>=1.3.2"
+ARG BASIC_PIP_PKGS="numpy pyarrow>=14.0.0 six==1.16.0 pandas<=2.2.0 scipy 
plotly>=4.8 mlflow>=2.8.1 coverage matplotlib openpyxl memory-profiler>=0.61.0 
scikit-learn>=1.3.2"

Review Comment:
   ditto


