Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
One very good example is SparkR releases in Conda channel ( https://github.com/conda-forge/r-sparkr-feedstock). This is fully run by the community unofficially. On Tue, 19 Mar 2024 at 09:54, Mich Talebzadeh wrote: > +1 for me > > Mich Talebzadeh, > Dad | Technologist | Solutions Architect | Engi

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428? cc @Yang,Jie(INF) On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim wrote: > Shall we revisit this functionality? The API doc is built with individual > versions, and for each individual version we depend on other released > versions. This

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Hyukjin Kwon
Just FYI, the streaming Python data source is in progress at https://github.com/apache/spark/pull/44416; we will likely release this in Spark 4.0. On Thu, Dec 28, 2023 at 4:53 PM Поротиков Станислав Вячеславович wrote: > Yes, it's actual data. > > > > Best regards, > > Stanislav Porotikov > > > > *From:*

Re: Architecture of Spark Connect

2023-12-14 Thread Hyukjin Kwon
By default for now, yes. One Spark Connect server handles multiple Spark Sessions. To multiplex or run multiple drivers, you need some extra work, such as a gateway. On Thu, 14 Dec 2023 at 12:03, Kezhi Xiong wrote: > Hi, > > My understanding is there is only one driver/spark context for all user > sessio

Re: [FYI] SPARK-45981: Improve Python language test coverage

2023-12-02 Thread Hyukjin Kwon
Awesome! On Sat, Dec 2, 2023 at 2:33 PM Dongjoon Hyun wrote: > Hi, All. > > As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community > starts to have test coverage for all supported Python versions from Today. > > - https://github.com/apache/spark/actions/runs/7061665420 > > Her

Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Hyukjin Kwon
The demo was really amazing. On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri wrote: > This is wonderful news! > > On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote: > >> Dear Apache Spark community, >> >> We are delighted to announce the launch of a groundbreaking tool that >> aims to make Apache

Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Hyukjin Kwon
Thanks! On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan wrote: > > Thanks Dongjoon ! > > Regards, > Mridul > > On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote: > >> We are happy to announce the availability of Apache Spark 3.4.1! >> >> Spark 3.4.1 is a maintenance release containing st

Re: Slack for PySpark users

2023-03-27 Thread Hyukjin Kwon
Yeah, actually I think we'd better have a Slack channel so we can easily discuss with users and developers. On Tue, 28 Mar 2023 at 03:08, keen wrote: > Hi all, > I really like *Slack *as communication channel for a tech community. > There is a Slack workspace for *delta lake users* ( > http

Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Hyukjin Kwon
Thanks, Yuming. On Wed, 26 Oct 2022 at 16:01, L. C. Hsieh wrote: > Thank you for driving the release of Apache Spark 3.3.1, Yuming! > > On Tue, Oct 25, 2022 at 11:38 PM Dongjoon Hyun > wrote: > > > > It's great. Thank you so much, Yuming! > > > > Dongjoon > > > > On Tue, Oct 25, 2022 at 11:23 P

Re: [Feature Request] make unix_micros() and unix_millis() available in PySpark (pyspark.sql.functions)

2022-10-16 Thread Hyukjin Kwon
You can work around it by leveraging expr, e.g., expr("unix_micros(col)"), for now. We should have the Scala binding first before we add the Python one, FWIW. On Sat, 15 Oct 2022 at 06:19, Martin wrote: > Hi everyone, > > In *Spark SQL* there are several timestamp related functions > >- unix_micro
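
A minimal PySpark sketch of the expr() workaround described above; the session, DataFrame and column names are illustrative, and it assumes a Spark version where the unix_micros/unix_millis SQL functions already exist:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import current_timestamp, expr

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1).select(current_timestamp().alias("ts"))

    # unix_micros()/unix_millis() are SQL functions, so call them through expr()
    # until dedicated Scala/Python bindings are available.
    df.select(expr("unix_micros(ts)").alias("micros"),
              expr("unix_millis(ts)").alias("millis")).show()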

Re: Stickers and Swag

2022-06-14 Thread Hyukjin Kwon
Woohoo On Tue, 14 Jun 2022 at 15:04, Xiao Li wrote: > Hi, all, > > The ASF has an official store at RedBubble > that Apache Community > Development (ComDev) runs. If you are interested in buying Spark Swag, 70 > products featuring the Spark logo are

Re: Conda Python Env in K8S

2021-12-24 Thread Hyukjin Kwon
Can you share the logs, settings, environment, etc. and file a JIRA? There are integration test cases for K8S support, and I myself also tested it before. It would be helpful if you try what I did at https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html and see if

Re: [R] SparkR on conda-forge

2021-12-19 Thread Hyukjin Kwon
Awesome! On Mon, 20 Dec 2021 at 09:43, yonghua wrote: > Nice release. thanks for sharing. > > On 2021/12/20 3:55, Maciej wrote: > > FYI ‒ thanks to good folks from conda-forge we have now these: > > - > To unsubscribe e-mail: us

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Hyukjin Kwon
here. We could have a configuration to enable and disable it, but the implementation of this in DataFrame.toPandas would be complicated due to existing optimizations such as Arrow. I haven't taken a deeper look, but my gut says it's not worthwhile. On Sat, Nov 13, 2021 at 12:05 PM Hyuk

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Hyukjin Kwon
Thanks for pinging me, Sean. Yes, there's an optimization in DataFrame.collect which tries to collect the first few partitions and checks whether the requested number of rows has been found (and repeats). DataFrame.toPandas does not have such an optimization. I suspect that the shuffle isn't an actual shuffle but just collects lo

Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Hyukjin Kwon
awesome! 2021년 6월 2일 (수) 오전 9:59, Dongjoon Hyun 님이 작성: > We are happy to announce the availability of Spark 3.1.2! > > Spark 3.1.2 is a maintenance release containing stability fixes. This > release is based on the branch-3.1 maintenance branch of Spark. We strongly > recommend all 3.1 users to u

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Hyukjin Kwon
es / >>>>>> Greenplum >>>>>> with Spark SQL and DataFrames, 10~100x faster.* >>>>>> *spark-func-extras <https://github.com/yaooqinn/spark-func-extras>A >>>>>> library that brings excellent and useful functions fro

[ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-02 Thread Hyukjin Kwon
We are excited to announce Spark 3.1.1 today. Apache Spark 3.1.1 is the second release of the 3.x line. This release adds Python type annotations and Python dependency management support as part of Project Zen. Other major updates include improved ANSI SQL compliance support, history server suppor

Re: [SparkR] gapply with strings with arrow

2020-10-10 Thread Hyukjin Kwon
If it works without Arrow optimization, it's likely a bug. Please feel free to file a JIRA for that. On Wed, 7 Oct 2020, 22:44 Jacek Pliszka, wrote: > Hi! > > Is there any place I can find information how to use gapply with arrow? > > I've tried something very simple > > collect(gapply( > df,

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-03 Thread Hyukjin Kwon
Nice summary. Thanks Dongjoon. One minor correction -> I believe we dropped R 3.5 and below at branch 2.4 as well. On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun, wrote: > Hi, All. > > As of today, master branch (Apache Spark 3.1.0) resolved > 852+ JIRA issues and 606+ issues are 3.1.0-only patches. >

PySpark documentation main page

2020-08-01 Thread Hyukjin Kwon
Hi all, I am trying to write up the main page of PySpark documentation at https://github.com/apache/spark/pull/29320. While I think the current proposal might be good enough, I would like to collect more feedback about the contents, structure and image since this is the entrance page of PySpark d

Re: [PSA] Python 2, 3.4 and 3.5 are now dropped

2020-07-13 Thread Hyukjin Kwon
cc user mailing list too. 2020년 7월 14일 (화) 오전 11:27, Hyukjin Kwon 님이 작성: > I am sending another email to make sure dev people know. Python 2, 3.4 and > 3.5 are now dropped at https://github.com/apache/spark/pull/28957. > > >

Re: Error: Vignette re-building failed. Execution halted

2020-06-24 Thread Hyukjin Kwon
Looks like you haven't installed the 'e1071' package. 2020년 6월 24일 (수) 오후 6:49, Anwar AliKhan 님이 작성: > ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr > -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes > > > > minor error Spark r test f

Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Hyukjin Kwon
Yay! 2020년 6월 19일 (금) 오전 4:46, Mridul Muralidharan 님이 작성: > Great job everyone ! Congratulations :-) > > Regards, > Mridul > > On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin wrote: > >> Hi all, >> >> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on >> many of the innovations f

Re: [ANNOUNCE] Apache Spark 2.4.6 released

2020-06-10 Thread Hyukjin Kwon
Yay! 2020년 6월 11일 (목) 오전 10:38, Holden Karau 님이 작성: > We are happy to announce the availability of Spark 2.4.6! > > Spark 2.4.6 is a maintenance release containing stability, correctness, > and security fixes. > This release is based on the branch-2.4 maintenance branch of Spark. We > strongly re

Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-10 Thread Hyukjin Kwon
Thanks Dongjoon! 2020년 2월 9일 (일) 오전 10:49, Takeshi Yamamuro 님이 작성: > Happy to hear the release news! > > Bests, > Takeshi > > On Sun, Feb 9, 2020 at 10:28 AM Dongjoon Hyun > wrote: > >> There was a typo in one URL. The correct release note URL is here. >> >> https://spark.apache.org/releases/spa

Re: Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Hyukjin Kwon
I was randomly googling out of curiosity, and it seems that is indeed the problem ( https://r.789695.n4.nabble.com/Error-in-rbind-info-getNamespaceInfo-env-quot-S3methods-quot-td4755490.html ). Yes, it seems we should make sure we build SparkR with an older R version. Since that support for R prior to version 3

Re: [VOTE] Shall we release ORC 1.4.5rc1?

2019-12-06 Thread Hyukjin Kwon
+1 (as a Spark user) 2019년 12월 7일 (토) 오전 11:06, Dongjoon Hyun 님이 작성: > +1 for Apache ORC 1.4.5 release. > > Thank you for making the release. > > I'd like to mention some notable changes here. > Apache ORC 1.4.5 is not a drop-in replacement for 1.4.4 because of the > following. > > ORC-498:

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-07 Thread Hyukjin Kwon
+1 2019년 11월 6일 (수) 오후 11:38, Wenchen Fan 님이 작성: > Sounds reasonable to me. We should make the behavior consistent within > Spark. > > On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler wrote: > >> Currently, when a PySpark Row is created with keyword arguments, the >> fields are sorted alphabetically.

Re: DataSourceV2: pushFilters() is not invoked for each read call - spark 2.3.2

2019-09-06 Thread Hyukjin Kwon
I believe this issue was fixed in Spark 2.4. Spark DataSource V2 is still being radically developed - it is not complete yet. So, I think the feasible options to get through this at the moment are: 1. upgrade to a higher Spark version 2. disable filter push down at your DataSou

Re: [ANNOUNCE] Announcing Apache Spark 2.4.4

2019-09-01 Thread Hyukjin Kwon
YaY! 2019년 9월 2일 (월) 오후 1:27, Wenchen Fan 님이 작성: > Great! Thanks! > > On Mon, Sep 2, 2019 at 5:55 AM Dongjoon Hyun > wrote: > >> We are happy to announce the availability of Spark 2.4.4! >> >> Spark 2.4.4 is a maintenance release containing stability fixes. This >> release is based on the branch

Re: Release Apache Spark 2.4.4

2019-08-13 Thread Hyukjin Kwon
Adding Shixiong WDYT? 2019년 8월 14일 (수) 오후 2:30, Terry Kim 님이 작성: > Can the following be included? > > [SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in > EpochTracker (to support Python UDFs) > > > Thanks, > Terry > > On Tue, A

Re: Continuous processing mode and python udf

2019-08-13 Thread Hyukjin Kwon
that's fixed in https://github.com/apache/spark/commit/b83b7927b3a85c1a4945e2224ed811b5bb804477 2019년 8월 13일 (화) 오후 12:37, zenglong chen 님이 작성: > Does Spark 2.4.0 support Python UDFs with Continuous Processing mode? > I try it and occur error like below: > WARN scheduler.TaskSetManager: Lost

Re: Release Apache Spark 2.4.4

2019-08-13 Thread Hyukjin Kwon
+1 2019년 8월 14일 (수) 오전 9:13, Takeshi Yamamuro 님이 작성: > Hi, > > Thanks for your notification, Dongjoon! > I put some links for the other committers/PMCs to access the info easily: > > A commit list in github from the last release: > https://github.com/apache/spark/compare/5ac2014e6c118fbeb1fe8e5c8

Re: Usage of PyArrow in Spark

2019-07-17 Thread Hyukjin Kwon
Regular Python UDFs don't use PyArrow under the hood. Yes, they can potentially benefit but they can be easily worked around via Pandas UDFs. For instance, both below are virtually identical. @udf(...) def func(col): return col @pandas_udf(...) def pandas_func(col): return a.apply(lambda
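
A self-contained sketch of the "virtually identical" pair of UDFs mentioned above; the names, types and data are illustrative, and the Pandas UDF requires PyArrow to be installed:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, pandas_udf, udf

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(5).select(col("id").alias("value"))

    @udf("long")
    def plain_identity(v):
        # regular Python UDF: evaluated row by row, no Arrow involved
        return v

    @pandas_udf("long")
    def pandas_identity(v: pd.Series) -> pd.Series:
        # Pandas UDF: receives a pandas Series per batch via Arrow
        return v

    df.select(plain_identity("value"), pandas_identity("value")).show()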

Re: Exposing JIRA issue types at GitHub PRs

2019-06-16 Thread Hyukjin Kwon
Labels look good and useful. On Sat, 15 Jun 2019, 02:36 Dongjoon Hyun, wrote: > Now, you can see the exposed component labels (ordered by the number of > PRs) here and click the component to search. > > https://github.com/apache/spark/labels?sort=count-desc > > Dongjoon. > > > On Fri, Jun 14

Re: Exposing JIRA issue types at GitHub PRs

2019-06-12 Thread Hyukjin Kwon
Yea, I think we can automate this process via, for instance, https://github.com/apache/spark/blob/master/dev/github_jira_sync.py +1 for this sort of automatic categorizing and matching of metadata between JIRA and GitHub. Adding Josh and Sean as well. On Thu, 13 Jun 2019, 13:17 Dongjoon Hyun, wrote

Re: [ANNOUNCE] Announcing Apache Spark 2.3.3

2019-02-18 Thread Hyukjin Kwon
Yay! Good job Takeshi! On Mon, 18 Feb 2019, 14:47 Takeshi Yamamuro We are happy to announce the availability of Spark 2.3.3! > > Apache Spark 2.3.3 is a maintenance release, based on the branch-2.3 > maintenance branch of Spark. We strongly recommend all 2.3.x users to > upgrade to this stable re

Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-16 Thread Hyukjin Kwon
Nice! 2019년 1월 16일 (수) 오전 11:55, Jiaan Geng 님이 작성: > Glad to hear this. > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >

Re: Python

2019-01-02 Thread Hyukjin Kwon
Yup, it's supported. On Wed, 2 Jan 2019, 3:35 pm Gourav Sengupta Hi, > Can I please confirm which version of Python 3.x is supported by Spark 2.4? > > Regards, > Gourav >

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-09 Thread Hyukjin Kwon
I took a look at the code. val source = classOf[MyDataSource].getCanonicalName spark.read.format(source).load().collect() It looks like it is indeed called twice. First call: it looks like it creates it first to read the schema for a logical plan test.org.apache.spark.sql.sources.v2.MyDataSourceReader.(MyDataSour

Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Hyukjin Kwon
I think we can deprecate it in 3.x.0 and remove it in Spark 4.0.0. Many people still use Python 2. Also, technically 2.7 support is not officially dropped yet - https://pythonclock.org/ 2018년 9월 17일 (월) 오전 9:31, Aakash Basu 님이 작성: > Removing support for an API in a major release makes poor sense

Re: How to make pyspark use custom python?

2018-09-05 Thread Hyukjin Kwon
Are you absolutely sure it is an issue in Spark? I have used a custom Python several times by setting it in PYSPARK_PYTHON before, and it was no problem. 2018년 9월 6일 (목) 오후 2:21, mithril 님이 작성: > For better looking , please see > > https://stackoverflow.com/questions/52178406/howto-make-pyspark-use-cust
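
A hedged sketch of pointing PySpark at a custom interpreter through PYSPARK_PYTHON, as mentioned above; the interpreter path is a placeholder, and the variable has to be set before the SparkSession (and its SparkContext) is created:

    import os

    # must happen before the SparkContext starts; in cluster mode, set it in
    # spark-env.sh or via the spark.pyspark.python configuration instead
    os.environ["PYSPARK_PYTHON"] = "/opt/conda/envs/myenv/bin/python"

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Python UDFs and RDD operations now run on the interpreter set above.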

Re: Issue upgrading to Spark 2.3.1 (Maintenance Release)

2018-06-15 Thread Hyukjin Kwon
I use PyCharm. Mind if I ask you to elaborate on what you did, step by step? 2018년 6월 16일 (토) 오전 12:11, Marcelo Vanzin 님이 작성: > I'm not familiar with PyCharm. But if you can run "pyspark" from the > command line and not hit this, then this might be an issue with > PyCharm or your environment - e.g. havin

Re: how to set up pyspark eclipse, pyDev, virtualenv? syntaxError: yield from walk(

2018-04-05 Thread Hyukjin Kwon
FYI, there is a PR and JIRA for virtualEnv support in PySpark https://issues.apache.org/jira/browse/SPARK-13587 https://github.com/apache/spark/pull/13599 2018-04-06 7:48 GMT+08:00 Andy Davidson : > FYI > > http://www.learn4master.com/algorithms/pyspark-unit-test- > set-up-sparkcontext > > From

Re: [PySpark SQL] sql function to_date and to_timestamp return the same data type

2018-03-18 Thread Hyukjin Kwon
Mind if I ask for a reproducer? It seems to return timestamps fine: >>> from pyspark.sql.functions import * >>> spark.range(1).select(to_timestamp(current_timestamp())).printSchema() root |-- to_timestamp(current_timestamp()): timestamp (nullable = false) >>> spark.range(1).select(to_timestamp(current_

Re: SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread Hyukjin Kwon
From a very quick look, I think it is a testthat version issue with SparkR. I had to fix that version to 1.x before in AppVeyor. There are a few details in https://github.com/apache/spark/pull/20003 Can you check and lower the testthat version? On 14 Feb 2018 6:09 pm, "chandan prakash" wrote: > Hi All, >

Re: Custom line/record delimiter

2018-01-01 Thread Hyukjin Kwon
Hi, There's a PR - https://github.com/apache/spark/pull/18581 and JIRA - SPARK-21289 Alternatively, you could check out the multiLine option for CSV and see if it is applicable. Thanks. 2017-12-30 2:19 GMT+09:00 sk skk : > Hi, > > Do we have an option to write a csv or text file with a custom record/
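
A small sketch of the read-side multiLine CSV option pointed to above, which lets quoted fields span line breaks; the path and header setting are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read
          .option("header", True)
          .option("multiLine", True)   # quoted fields may contain newlines
          .csv("/tmp/records.csv"))
    df.show()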

Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-17 Thread Hyukjin Kwon
When multiLine is not set, we currently only support ASCII-compatible encodings, to my knowledge, mainly due to the line separator, as I investigated in the comment. When multiLine is set, it appears encoding is not considered. I actually meant encoding does not work at all in this case i

Re: how to set the assignee in JIRA please?

2017-07-25 Thread Hyukjin Kwon
ll, > > I find some PR were created one year ago, the last comment is several > monthes before. > No one to close or reject it. > Such as 6880, just put it like this? > > > ---Original--- > *From:* "Hyukjin Kwon" > *Date:* 2017/7/25 09:25:28 > *To:* "萝卜

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
please? ---Original--- *From:* "Hyukjin Kwon" *Date:* 2017/7/25 09:15:49 *To:* "Marcelo Vanzin"; *Cc:* "user";"萝卜丝炒饭"<1427357...@qq.com>; *Subject:* Re: how to set the assignee in JIRA please? I see. In any event, it sounds not required to work on an is

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
It should not be a big deal anyway. Thanks for the details. 2017-07-25 10:09 GMT+09:00 Marcelo Vanzin : > On Mon, Jul 24, 2017 at 6:04 PM, Hyukjin Kwon wrote: > > However, I see some JIRAs are assigned to someone time to time. Were > those > > mistakes or would you m

Re: how to set the assignee in JIRA please?

2017-07-24 Thread Hyukjin Kwon
However, I see some JIRAs are assigned to someone from time to time. Were those mistakes, or would you mind if I ask when someone gets assigned? When I started to contribute to Spark a few years ago, I was confused by this and I am pretty sure some guys are still confused. I do usually say something like "

Re: to_json not working with selectExpr

2017-07-19 Thread Hyukjin Kwon
Yes, I guess it is. 2017-07-20 11:31 GMT+09:00 Matthew cao : > AH, I get it. So that’s why I get the not register error? Cuz it not added > into SQL in 2.1.0? > > On 2017年7月19日, at 22:35, Hyukjin Kwon wrote: > > Yea, but it was added into SQL from Spark 2.2.0 > > 20

Re: to_json not working with selectExpr

2017-07-19 Thread Hyukjin Kwon
Yea, but it was added into SQL from Spark 2.2.0 2017-07-19 23:02 GMT+09:00 Matthew cao : > I am using version 2.1.1 As I could remember, this function was added > since 2.1.0. > > On 2017年7月17日, at 12:05, Burak Yavuz wrote: > > Hi Matthew, > > Which Spark version are you using? The expression `t

Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Hyukjin Kwon
Cool! 2017-07-13 9:43 GMT+09:00 Denny Lee : > This is amazingly awesome! :) > > On Wed, Jul 12, 2017 at 13:23 lucas.g...@gmail.com > wrote: > >> That's great! >> >> >> >> On 12 July 2017 at 12:41, Felix Cheung wrote: >> >>> Awesome! Congrats!! >>> >>> -- >>> *From:*

Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread Hyukjin Kwon
Sounds like it is related to https://github.com/apache/spark/pull/17916 We will allow picking up the internal one if that one gets merged. On 10 May 2017 7:09 am, "Mark Hamstra" wrote: > Looks to me like it is a conflict between a Databricks library and Spark > 2.1. That's an issue for Databrick

Re: Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Hyukjin Kwon
Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Mon, Mar 27, 2017 at 2:43 PM, Hyukjin Kwon wrote: > I just tried to build against the current master to help ch

Re: Why selectExpr changes schema (to include id column)?

2017-03-27 Thread Hyukjin Kwon
I just tried to build against the current master to help check - https://github.com/apache/spark/commit/3fbf0a5f9297f438bc92db11f106d4a0ae568613 It seems I can't reproduce this as below: scala> spark.range(1).printSchema root |-- id: long (nullable = false) scala> spark.range(1).selectExpr("*

Re: CSV empty columns handling in Spark 2.0.2

2017-03-16 Thread Hyukjin Kwon
I think this is fixed in https://github.com/apache/spark/pull/15767 This should be fixed in 2.1.0. 2017-03-17 3:28 GMT+09:00 George Obama : > Hello, > > > > I am using spark 2.0.2 to read the CSV file with empty columns and is > hitting the issue: > > scala>val df = sqlContext.read.option("head

Re: [Spark CSV]: Use Custom TextInputFormat to Prevent Exceptions

2017-03-15 Thread Hyukjin Kwon
Other options may be: - the "spark.sql.files.ignoreCorruptFiles" option - DataFrameReader.csv(csvDataset: Dataset[String]) with a custom input format (this is available from Spark 2.2.0). For example, val rdd = spark.sparkContext.newAPIHadoopFile("/tmp/abcd", classOf[org.apache.hadoop.mapreduce.
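
A minimal sketch of the first option above, the spark.sql.files.ignoreCorruptFiles setting; the path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # continue the job when a file cannot be read, instead of failing the query
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
    df = spark.read.option("header", True).csv("/tmp/abcd")
    df.show()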

Re: DataFrameWriter - Where to find list of Options applicable to particular format(datasource)

2017-03-13 Thread Hyukjin Kwon
Hi, all the options are documented in https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter It seems we don't have both options for writing. If the goal is trimming the whitespaces, I think we could do this within dataframe operations (as we talked in the J

Re: using spark-xml_2.10 to extract data from XML file

2017-02-14 Thread Hyukjin Kwon
Hi Carlo, There was a bug in lower versions when accessing to nested values in the library. Otherwise, I suspect another issue about parsing malformed XML. Could you maybe open an issue in https://github.com/databricks/spark-xml/issues with your sample data? I will stick with it until it is so

Re: JavaRDD text matadata(file name) findings

2017-01-31 Thread Hyukjin Kwon
Hi, Would it maybe be possible to switch it to the text datasource with the input_file_name function? Thanks. On 1 Feb 2017 3:58 a.m., "Manohar753" wrote: Hi All, myspark job is reading data from a folder having different files with same structured data. the red JavaRdd processed line by line but is there
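
A minimal sketch of the suggestion above: read the folder with the text datasource and attach the originating file name with input_file_name(); the input path is a placeholder:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read.text("/data/input_folder")
          .withColumn("source_file", input_file_name()))
    df.show(truncate=False)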

Re: Scala Developers

2017-01-25 Thread Hyukjin Kwon
Just as a subscriber to this mailing list, I don't want to receive job recruiting emails, or even have to make the effort to set up a filter for them. I don't know the policy in detail, but it feels inappropriate to send them here where, in my experience, Spark users usually ask some questions and discuss about Spa

Re: filter rows by all columns

2017-01-16 Thread Hyukjin Kwon
Hi Shawn, Could we do this as below, for the "any column is true" case? scala> val df = spark.range(10).selectExpr("id as a", "id / 2 as b") df: org.apache.spark.sql.DataFrame = [a: bigint, b: double] scala> df.filter(_.toSeq.exists(v => v == 1)).show() +---+---+ | a| b| +---+---+ | 1|0.5| | 2|1.0| +---+---+
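
For illustration only, a PySpark rendering of the same idea (the Scala snippet above is the original): build an OR of per-column comparisons instead of a row-level lambda:

    from functools import reduce
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, lit

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10).selectExpr("id as a", "id / 2 as b")

    # keep a row if any of its columns equals 1
    any_is_one = reduce(lambda acc, c: acc | (col(c) == lit(1)),
                        df.columns, lit(False))
    df.filter(any_is_one).show()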

Re: Unable to explain the job kicked off for spark.read.csv

2017-01-08 Thread Hyukjin Kwon
Oh, I mean another job would *not* happen if the schema is explicitly given. 2017-01-09 16:37 GMT+09:00 Hyukjin Kwon : > Hi Appu, > > > I believe that textFile and filter came from... > > https://github.com/apache/spark/blob/branch-2.1/sql/ > core/src/main/scala/org/apach

Re: Unable to explain the job kicked off for spark.read.csv

2017-01-08 Thread Hyukjin Kwon
Hi Appu, I believe that textFile and filter came from... https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L59-L61 It needs to read a first line even if using the header is disabled and schema inference is

Re: Merging Parquet Files

2016-12-22 Thread Hyukjin Kwon
Hi Benjamin, As you might already know, I believe the Hadoop command does not properly merge column-based formats such as ORC or Parquet, but just simply concatenates them. I haven't tried this by myself but I remember I saw a JIRA in Parquet - https://issues.apache.org/jira/browse/PARQUE

Re: get corrupted rows using columnNameOfCorruptRecord

2016-12-07 Thread Hyukjin Kwon
Let me please just extend the suggestion a bit more verbosely. I think you could try something like this maybe. val jsonDF = spark.read .option("columnNameOfCorruptRecord", "xxx") .option("mode","PERMISSIVE") .schema(StructType(schema.fields :+ StructField("xxx", StringType, true))) .json
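
A hedged PySpark sketch of the pattern described above: append the corrupt record column ("xxx", as in the thread) to the schema and keep the rows where it is populated; the other field name and the input path are made up:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()
    schema = StructType([
        StructField("a", LongType(), True),
        StructField("xxx", StringType(), True),  # holds the raw corrupted record
    ])
    corrupted = (spark.read
                 .option("mode", "PERMISSIVE")
                 .option("columnNameOfCorruptRecord", "xxx")
                 .schema(schema)
                 .json("/tmp/input.json")
                 .filter("xxx IS NOT NULL"))
    # note: cache the parsed DataFrame first if a later query touches only the
    # corrupt record column, which newer Spark versions otherwise disallow
    corrupted.show(truncate=False)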

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread Hyukjin Kwon
t this what Michael suggested? > > Thanks, > kant > > On Mon, Dec 5, 2016 at 4:45 AM, Hyukjin Kwon wrote: > >> Hi Kant, >> >> How about doing something like this? >> >> import org.apache.spark.sql.functions._ >> >> // val df2 = df.s

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread Hyukjin Kwon
Hi Kant, How about doing something like this? import org.apache.spark.sql.functions._ // val df2 = df.select(df("body").cast(StringType).as("body")) val df2 = Seq("""{"a": 1}""").toDF("body") val schema = spark.read.json(df2.as[String].rdd).schema df2.select(from_json(col("body"), schema)).show(
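
For illustration, a PySpark rendering of the approach above (the original snippet is Scala): infer a schema from the JSON strings once, then parse the column with from_json(); the sample data is made up:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json

    spark = SparkSession.builder.getOrCreate()
    df2 = spark.createDataFrame([('{"a": 1}',)], ["body"])

    # infer the schema from the string column itself
    schema = spark.read.json(df2.rdd.map(lambda r: r.body)).schema
    df2.select(from_json(col("body"), schema).alias("parsed")).show()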

Re: Handling windows characters with Spark CSV on Linux

2016-11-17 Thread Hyukjin Kwon
Actually, the CSV datasource supports the encoding option[1] (although it does not support non-ASCII-compatible encoding types). [1] https://github.com/apache/spark/blob/44c8bfda793b7655e2bd1da5e9915a09ed9d42ce/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L364 On 17 Nov 2016 10:59 p
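
A small sketch of the encoding option referenced above; the path and the Windows code page are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read
          .option("header", True)
          .option("encoding", "windows-1252")   # also accepted as "charset"
          .csv("/tmp/windows_export.csv"))
    df.show()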

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Hyukjin Kwon
>> increase executor-memory tomorrow. I will open a new issue as well. >> >> >> >> On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon >> wrote: >> >>> Hi Arun, >>> >>> >>> I have few questions. >>> >>>

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread Hyukjin Kwon
It sounds like maybe you are looking for the from_json/to_json functions, after encoding/decoding properly. On 16 Nov 2016 6:45 p.m., "kant kodali" wrote: > > > https://spark.apache.org/docs/2.0.2/sql-programming-guide. > html#json-datasets > > "Spark SQL can automatically infer the schema of a JSON datase

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Hyukjin Kwon
Hi Arun, I have a few questions. Does your XML file have, say, a few huge documents? In the case of a row having a huge size (like 500MB), it would consume a lot of memory because at least it should hold a row to iterate over, if I remember correctly. I remember this happened to me before while proc

Re: How to read a Multi Line json object via Spark

2016-11-15 Thread Hyukjin Kwon
Hi Sree, There is a blog about that, http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/ It is pretty old but I am sure that it is helpful. Currently, the JSON datasource only supports reading JSON documents formatted according to http://jsonlines.org/ There is an iss

Re: Spark SQL shell hangs

2016-11-13 Thread Hyukjin Kwon
Hi Rakesh, Could you please open an issue in https://github.com/databricks/spark-xml with some codes so that reviewers can reproduce the issue you met? Thanks! 2016-11-14 0:20 GMT+09:00 rakesh sharma : > Hi > > I'm trying to convert an XML file to data frame using data bricks spark > XML. But

Re: pyspark: accept unicode column names in DataFrame.corr and cov

2016-11-12 Thread Hyukjin Kwon
Hi Sam, I think I have some answers for two questions. > Humble request: could we replace the "isinstance(col1, str)" tests with "isinstance(col1, basestring)"? IMHO, yes, I believe this should be basestring. Otherwise, some functions would not accept unicode as arguments for columns in Python 2

Re: Reading csv files with quoted fields containing embedded commas

2016-11-06 Thread Hyukjin Kwon
Hi Femi, Have you maybe tried the quote related options specified in the documentation? http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv Thanks. 2016-11-06 6:58 GMT+09:00 Femi Anthony : > Hi, I am trying to process a very large comma delimited csv
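
A hedged sketch of the quote-related options pointed to above, for fields such as "Acme, Inc." that contain embedded commas; the path is a placeholder and the escape setting is one common choice, not the only one:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read
          .option("header", True)
          .option("quote", '"')    # the default quote character, shown explicitly
          .option("escape", '"')   # treat doubled quotes inside fields literally
          .csv("/tmp/quoted.csv"))
    df.show(truncate=False)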

Re: Error creating SparkSession, in IntelliJ

2016-11-03 Thread Hyukjin Kwon
Hi Shyla, there is the documentation for setting up IDE - https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup I hope this is helpful. 2016-11-04 9:10 GMT+09:00 shyla deshpande : > Hello Everyone, > > I just installed Spark 2.0.1, spark shell w

Re: Spark XML ignore namespaces

2016-11-03 Thread Hyukjin Kwon
Oh, that PR was actually about not concerning the namespaces (meaning leaving the data as it is, including prefixes). The problem was, each partition needs to produce each record knowing the namespaces. It is fine to deal with them if they are within each XML document (represented as a

Re: csv date/timestamp type inference in spark 2.0.1

2016-10-26 Thread Hyukjin Kwon
Hi Koert, I am curious about your case. I guess the purpose of timestampFormat and dateFormat is to infer timestamps/dates when parsing/inferring but not to exclude the type inference/parsing. Actually, it does try to infer/parse in 2.0.0 as well (but it fails) so actually I guess there wouldn't

Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
in spark 2.0.1: > spark.read > .format("csv") > .option("header", true) > .option("inferSchema", true) > .load("test.csv") > .printSchema > > the result is: > root > |-- date: timestamp (nullable = true) > > >

Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
There are now timestampFormat for TimestampType and dateFormat for DateType. Do you mind if I ask you to share your code? On 27 Oct 2016 2:16 a.m., "Koert Kuipers" wrote: > is there a reason a column with dates in format -mm-dd in a csv file > is inferred to be TimestampType and not DateType?
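
A minimal sketch of the timestampFormat and dateFormat options named above; the path and patterns are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .option("dateFormat", "yyyy-MM-dd")
          .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
          .csv("/tmp/test.csv"))
    df.printSchema()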

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Hyukjin Kwon
I am also interested in this issue. I will try to look into this too within coming few days.. 2016-10-24 21:32 GMT+09:00 Sean Owen : > I actually think this is a general problem with usage of DateFormat and > SimpleDateFormat across the code, in that it relies on the default locale > of the JVM.

Re: how to extract arraytype data to file

2016-10-18 Thread Hyukjin Kwon
This reminds me of https://github.com/databricks/spark-xml/issues/141#issuecomment-234835577 Maybe using explode() would be helpful. Thanks! 2016-10-19 14:05 GMT+09:00 Divya Gehlot : > http://stackoverflow.com/questions/33864389/how-can-i- > create-a-spark-dataframe-from-a-nested-array-of-struc
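
A minimal sketch of the explode() suggestion above; the sample data and column names are made up:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, ["a", "b", "c"])], ["id", "items"])

    # one output row per array element
    df.select("id", explode(col("items")).alias("item")).show()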

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-18 Thread Hyukjin Kwon
Regarding his recent PR[1], I guess he meant multi-line JSON. As far as I know, single-line JSON also complies with the standard. I left a comment with the RFC in the PR but please let me know if I am wrong at any point. Thanks! [1]https://github.com/apache/spark/pull/15511 On 19 Oct 2016 7:00 a.m.,

Re: JSON Arrays and Spark

2016-10-11 Thread Hyukjin Kwon
No, I meant it should be in a single line, but it supports the array type too as a root wrapper of JSON objects. If you need to parse multiple lines, I have a reference here. http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/ 2016-10-12 15:04 GMT+09:00 Kappaganthu, Siva

Re: JSON Arrays and Spark

2016-10-10 Thread Hyukjin Kwon
FYI, it supports the [{...}, {...} ...] or {...} format as input. On 11 Oct 2016 3:19 a.m., "Jean Georges Perrin" wrote: > Thanks Luciano - I think this is my issue :( > > On Oct 10, 2016, at 2:08 PM, Luciano Resende wrote: > > Please take a look at > http://spark.apache.org/docs/latest/sql-pro
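
A tiny PySpark sketch of the two accepted shapes mentioned above (a JSON object per line, or a JSON array of objects per line); the sample strings are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    lines = spark.sparkContext.parallelize([
        '{"a": 1}',
        '[{"a": 2}, {"a": 3}]',   # an array root yields one row per element
    ])
    spark.read.json(lines).show()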

Re: Support for uniVocity in Spark 2.x

2016-10-06 Thread Hyukjin Kwon
Yeap, there is an option to switch Apache Commons CSV to Univocity in the external CSV library, but it became Univocity by default and Apache Commons CSV was removed while porting it into Spark 2.0. On 7 Oct 2016 2:53 a.m., "Sean Owen" wrote: > It still uses univocity, but this is an implementation de

Re: pyspark: sqlContext.read.text() does not work with a list of paths

2016-10-06 Thread Hyukjin Kwon
It seems obviously a bug. It was introduced from my PR, https://github.com/apache/spark/commit/d37c7f7f042f7943b5b684e53cf4284c601fb347 +1 for creating a JIRA and PR. If you have any problem with this, I would like to do this quickly. On 5 Oct 2016 9:12 p.m., "Laurent Legrand" wrote: > Hello,

Re: spark sql on json

2016-09-29 Thread Hyukjin Kwon
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java#L104-L181 2016-09-29 18:58 GMT+09:00 Hitesh Goyal : > Hi team, > > > > I have a json document. I want to put spark SQL to it. > > Can you please send me an example app built i

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Hyukjin Kwon
Hi Mich, I guess you could use the nullValue option by setting it to null. If you are reading them into strings in the first place, then you would hit https://github.com/apache/spark/pull/14118 first, which is resolved from 2.0.1. Unfortunately, this bug also exists in the external csv library for stri

Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It seems this is not an issue in Spark. Does "CSVParser" work fine without Spark with the data? BTW, it seems there is something wrong with your email address. I am sending this again. On 20 Sep 2016 8:32 a.m., "Hyukjin Kwon" wrote: > It seems not an issue in Spark. Does

Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It seems this is not an issue in Spark. Does "CSVParser" work fine without Spark with the data? On 20 Sep 2016 2:15 a.m., "Mohamed ismail" wrote: > Hi all > > I am trying to read: > > sc.textFile(DataFile).mapPartitions(lines => { > val parser = new CSVParser(",") >

How many are there PySpark Windows users?

2016-09-17 Thread Hyukjin Kwon
Hi all, We are currently testing SparkR on Windows[1] and it seems several problems are being identified from time to time. Although it seems it is not easy to automate Spark's tests in Scala on Windows, because I think we should introduce a proper change detection to run only related tests rather than

Re: take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Hyukjin Kwon
Hi Kevin, I have a few questions on this. Does it fail only with write.json()? I just wonder whether write.text, csv or another API fails as well, or whether it is a JSON-specific issue. Also, does it work with small data? I want to make sure this happens only on large data. Thanks! 2016

Re: Spark CSV skip lines

2016-09-10 Thread Hyukjin Kwon
> | reader.readAll().map(data => Row(data(3),data(4),data(7), > data(9),data(14)))} > > The above code throws arrayoutofbounce exception for empty line and report > line. > > > On Sat, Sep 10, 2016 at 3:02 PM, Hyukjin Kwon wrote: > >> Hi Selvam, >>
