[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/11947 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-216006137 OK I'm going to merge this in master and manually update the commit message.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-216003501 LGTM. (For the documentation, we should not forget to mention that `nullValue` takes priority over other options such as `nanValue` when the same value is given for both.)
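A minimal sketch of the precedence noted above, assuming a simplified stand-in for the reader options. `ParseOpts` and `NullPriority` are hypothetical names, not Spark classes, and `Option[Double]` stands in for a nullable cell:

```scala
// Hypothetical, simplified stand-in for the CSV options discussed in this PR.
final case class ParseOpts(nullValue: String = "", nanValue: String = "NaN")

object NullPriority {
  // None models a null cell. The null check runs first, so nullValue wins
  // when nullValue and nanValue are configured with the same string.
  def castDouble(datum: String, nullable: Boolean, o: ParseOpts): Option[Double] =
    if (nullable && datum == o.nullValue) None
    else if (datum == o.nanValue) Some(Double.NaN)
    else Some(datum.toDouble)
}
```

With both options set to "NA", a nullable column reads "NA" as null; only a non-nullable column falls through to the NaN branch.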
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215984253 @HyukjinKwon would be great if you can review this. Thanks.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215984080 @falaki can you update the pr description?
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215979899 please also provide a way for strings to be converted to null upon reading
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215947150 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57423/ Test PASSed.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215947147 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215947097 **[Test build #57423 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57423/consoleFull)** for PR 11947 at commit [`6facd26`](https://github.com/apache/spark/commit/6facd262f897883499e0fb46a4304e4b7c5c0c05). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215943795 LGTM pending tests.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215943744 **[Test build #57423 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57423/consoleFull)** for PR 11947 at commit [`6facd26`](https://github.com/apache/spark/commit/6facd262f897883499e0fb46a4304e4b7c5c0c05).
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user falaki commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215943595 @rxin done.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215940817 @falaki sorry this no longer merges cleanly. Do you mind bringing it up to date?
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215930852 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215930854 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57394/ Test PASSed.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215930751 **[Test build #57394 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57394/consoleFull)** for PR 11947 at commit [`698b4b4`](https://github.com/apache/spark/commit/698b4b41baa1ebd5d66ea6242bcb39bcd0887f8b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215926089 **[Test build #57394 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57394/consoleFull)** for PR 11947 at commit [`698b4b4`](https://github.com/apache/spark/commit/698b4b41baa1ebd5d66ea6242bcb39bcd0887f8b).
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215924124 As discussed offline, we should just have a single option for setting null, another for NaN, another for positive infinity, and another for negative infinity. Basically just 4.
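The four-option design proposed above could look roughly like the sketch below. It is an illustration only: `SpecialValues` and `SpecialValueCast` are hypothetical names, and the field names and default tokens are assumptions rather than the merged API.

```scala
// Hypothetical options holder: one token each for null, NaN, +inf and -inf,
// shared by every column type instead of one set per type.
final case class SpecialValues(
    nullValue: String = "",
    nanValue: String = "NaN",
    positiveInf: String = "Inf",
    negativeInf: String = "-Inf")

object SpecialValueCast {
  // Shared decision chain; None models a null cell.
  private def cast[A](datum: String, nullable: Boolean, o: SpecialValues)(
      nan: A, posInf: A, negInf: A, parse: String => A): Option[A] =
    if (nullable && datum == o.nullValue) None
    else if (datum == o.nanValue) Some(nan)
    else if (datum == o.negativeInf) Some(negInf)
    else if (datum == o.positiveInf) Some(posInf)
    else Some(parse(datum))

  def toFloat(datum: String, nullable: Boolean = true,
      o: SpecialValues = SpecialValues()): Option[Float] =
    cast[Float](datum, nullable, o)(
      Float.NaN, Float.PositiveInfinity, Float.NegativeInfinity, _.toFloat)

  def toDouble(datum: String, nullable: Boolean = true,
      o: SpecialValues = SpecialValues()): Option[Double] =
    cast[Double](datum, nullable, o)(
      Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity, _.toDouble)
}
```

Compared with per-type options such as `floatNaNValue` and `doubleNaNValue`, a single set keeps the option surface small, at the cost of not being able to use different tokens for different column types.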
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215196735 I personally would have been happy with a single null value shared by all datatypes, and the usage of that single value should be consistent across reading and writing. So when that value is encountered during reading it becomes null (except that for double/float columns it perhaps becomes NaN), and when writing, a null value gets written out as this value. For example, when dealing with text files dumped from Hive this value is typically "\N" across all columns and datatypes. When I read this sort of data I simply want every "\N" to become null, and when writing out data that needs to be compatible with Hive I would like to write out nulls across all columns as "\N". For Cascading/Scalding this value is typically "" (the empty value). So again I would want all empty values to be converted to nulls when reading, and when writing I would want every null to be written out as the empty value. Thanks.
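The symmetric read/write behaviour requested above can be sketched in a few lines. This is a standalone illustration, not Spark code: `NullToken` is a hypothetical helper and `Option[String]` stands in for a nullable string cell.

```scala
object NullToken {
  // Reading: the configured token becomes null (None); anything else is kept.
  def read(raw: String, nullToken: String): Option[String] =
    if (raw == nullToken) None else Some(raw)

  // Writing: null (None) is written out as the same token.
  def write(value: Option[String], nullToken: String): String =
    value.getOrElse(nullToken)
}
```

With a Hive-style token, `NullToken.read("\\N", "\\N")` yields `None` and `NullToken.write(None, "\\N")` yields `\N`, so nulls roundtrip. Note the trade-off when the token is the empty string: a genuine empty string then also reads back as null.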
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215194241 Do these settings roundtrip correctly? Say I set doubleNaNValue to "XY" and create a dataframe with a Double.NaN in it: does it get written out correctly as XY, and does XY then get read back in correctly as Double.NaN?
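The roundtrip property being asked about can be stated as a small check. The sketch below only illustrates the property; it does not show whether the patch's writer honours the option. `NaNRoundTrip` is a hypothetical helper.

```scala
object NaNRoundTrip {
  // Writer side: NaN is emitted as the configured token.
  def write(v: Double, nanToken: String): String =
    if (v.isNaN) nanToken else v.toString

  // Reader side: the same token is parsed back to NaN.
  def read(s: String, nanToken: String): Double =
    if (s == nanToken) Double.NaN else s.toDouble
}
```

The roundtrip holds only if reader and writer agree on the token: `NaNRoundTrip.read(NaNRoundTrip.write(Double.NaN, "XY"), "XY")` is NaN, whereas a writer that ignores the option would emit the default text and break the symmetry.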
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-215192562 Hello! Why is there no stringNullValue? Basically, for a column of type string I want to read in all empty strings as nulls. This is what the old option "treatEmptyStringsAsNulls" used to do. It's the natural complement of writing out nulls as empty strings (without this, data does not roundtrip). Thanks.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-208662536 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/9/ Test PASSed.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-208662534 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-208662375 **[Test build #9 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/9/consoleFull)** for PR 11947 at commit [`161a3eb`](https://github.com/apache/spark/commit/161a3ebeb9201d68c97e771def2a77b994e3b217). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-208638423 **[Test build #9 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/9/consoleFull)** for PR 11947 at commit [`161a3eb`](https://github.com/apache/spark/commit/161a3ebeb9201d68c97e771def2a77b994e3b217).
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-206023267 **[Test build #55033 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55033/consoleFull)** for PR 11947 at commit [`124873b`](https://github.com/apache/spark/commit/124873bd469b827ef8de11931001ba1186157dbb). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-206023276 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55033/ Test FAILed.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-206023274 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-206020403 **[Test build #55033 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55033/consoleFull)** for PR 11947 at commit [`124873b`](https://github.com/apache/spark/commit/124873bd469b827ef8de11931001ba1186157dbb).
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-205955742 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55010/ Test FAILed.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-205955671 **[Test build #55010 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55010/consoleFull)** for PR 11947 at commit [`180a900`](https://github.com/apache/spark/commit/180a9000af49f46ad4d6e0e4b424309c46f3bfa6). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-205955740 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-205951008 **[Test build #55010 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55010/consoleFull)** for PR 11947 at commit [`180a900`](https://github.com/apache/spark/commit/180a9000af49f46ad4d6e0e4b424309c46f3bfa6).
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user falaki commented on a diff in the pull request: https://github.com/apache/spark/pull/11947#discussion_r58596297
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
@@ -177,35 +177,57 @@ private[csv] object CSVTypeCast {
       datum: String,
       castType: DataType,
       nullable: Boolean = true,
-      nullValue: String = ""): Any = {
+      params: CSVOptions = CSVOptions()): Any = {
 
-    if (datum == nullValue && nullable && (!castType.isInstanceOf[StringType])) {
-      null
-    } else {
-      castType match {
-        case _: ByteType => datum.toByte
-        case _: ShortType => datum.toShort
-        case _: IntegerType => datum.toInt
-        case _: LongType => datum.toLong
-        case _: FloatType => Try(datum.toFloat)
-          .getOrElse(NumberFormat.getInstance(Locale.getDefault).parse(datum).floatValue())
-        case _: DoubleType => Try(datum.toDouble)
-          .getOrElse(NumberFormat.getInstance(Locale.getDefault).parse(datum).doubleValue())
-        case _: BooleanType => datum.toBoolean
-        case dt: DecimalType =>
+    castType match {
+      case _: ByteType => if (datum == params.byteNullValue && nullable) null else datum.toByte
+      case _: ShortType => if (datum == params.shortNullValue && nullable) null else datum.toShort
+      case _: IntegerType => if (datum == params.integerNullValue && nullable) null else datum.toInt
+      case _: LongType => if (datum == params.longNullValue && nullable) null else datum.toLong
+      case _: FloatType =>
+        if (datum == params.floatNullValue && nullable) {
+          null
+        } else if (datum == params.floatNaNValue) {
+          Float.NaN
+        } else if (datum == params.floatNegativeInf) {
+          Float.NegativeInfinity
+        } else if (datum == params.floatPositiveInf) {
+          Float.PositiveInfinity
+        } else {
+          Try(datum.toFloat)
+            .getOrElse(NumberFormat.getInstance(Locale.getDefault).parse(datum).floatValue())
+        }
+      case _: DoubleType =>
+        if (datum == params.doubleNullValue && nullable) {
+          null
+        } else if (datum == params.doubleNaNValue) {
+          Double.NaN
+        } else if (datum == params.doubleNegativeInf) {
+          Double.NegativeInfinity
+        } else if (datum == params.doublePositiveInf) {
+          Double.PositiveInfinity
+        } else {
+          Try(datum.toDouble)
--- End diff --
I think in this case, in a private and unexposed method, this seems OK. There are many other instances of it in `CSVInferSchema`.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/11947#discussion_r57813530

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala ---
@@ -177,35 +177,57 @@ private[csv] object CSVTypeCast {
       datum: String,
       castType: DataType,
       nullable: Boolean = true,
-      nullValue: String = ""): Any = {
+      params: CSVOptions = CSVOptions()): Any = {
-    if (datum == nullValue && nullable && (!castType.isInstanceOf[StringType])) {
-      null
-    } else {
-      castType match {
-        case _: ByteType => datum.toByte
-        case _: ShortType => datum.toShort
-        case _: IntegerType => datum.toInt
-        case _: LongType => datum.toLong
-        case _: FloatType => Try(datum.toFloat)
-          .getOrElse(NumberFormat.getInstance(Locale.getDefault).parse(datum).floatValue())
-        case _: DoubleType => Try(datum.toDouble)
-          .getOrElse(NumberFormat.getInstance(Locale.getDefault).parse(datum).doubleValue())
-        case _: BooleanType => datum.toBoolean
-        case dt: DecimalType =>
+    castType match {
+      case _: ByteType => if (datum == params.byteNullValue && nullable) null else datum.toByte
+      case _: ShortType => if (datum == params.shortNullValue && nullable) null else datum.toShort
+      case _: IntegerType => if (datum == params.integerNullValue && nullable) null else datum.toInt
+      case _: LongType => if (datum == params.longNullValue && nullable) null else datum.toLong
+      case _: FloatType =>
+        if (datum == params.floatNullValue && nullable) {
+          null
+        } else if (datum == params.floatNaNValue) {
+          Float.NaN
+        } else if (datum == params.floatNegativeInf) {
+          Float.NegativeInfinity
+        } else if (datum == params.floatPositiveInf) {
+          Float.PositiveInfinity
+        } else {
+          Try(datum.toFloat)
+            .getOrElse(NumberFormat.getInstance(Locale.getDefault).parse(datum).floatValue())
+        }
+      case _: DoubleType =>
+        if (datum == params.doubleNullValue && nullable) {
+          null
+        } else if (datum == params.doubleNaNValue) {
+          Double.NaN
+        } else if (datum == params.doubleNegativeInf) {
+          Double.NegativeInfinity
+        } else if (datum == params.doublePositiveInf) {
+          Double.PositiveInfinity
+        } else {
+          Try(datum.toDouble)
--- End diff --

(Also, it looks like use of the `Try` API is discouraged: [scala-style-guide#exception](https://github.com/databricks/scala-style-guide#exception).)
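The style guide linked above prefers Java-style `try`/`catch` over `Try` for exception handling. A minimal sketch of the `DoubleType` branch written that way is shown below. `DoubleTokens` and `castToDouble` are hypothetical stand-ins for the relevant `CSVOptions` fields and for `CSVTypeCast.castTo`, and the default token strings are assumptions for illustration, not the defaults used in the PR.

```scala
import java.text.NumberFormat
import java.util.Locale

// Hypothetical stand-in for the double-related fields of CSVOptions.
// The default token strings are assumptions for illustration only.
final case class DoubleTokens(
    nullValue: String = "",
    nanValue: String = "NaN",
    negativeInf: String = "-Inf",
    positiveInf: String = "Inf")

object DoubleCast {
  // Mirrors the order of checks in the diff: null token first, then NaN,
  // then the infinities, and only then a numeric parse.
  def castToDouble(datum: String, nullable: Boolean, tokens: DoubleTokens): Any = {
    if (datum == tokens.nullValue && nullable) {
      null
    } else if (datum == tokens.nanValue) {
      Double.NaN
    } else if (datum == tokens.negativeInf) {
      Double.NegativeInfinity
    } else if (datum == tokens.positiveInf) {
      Double.PositiveInfinity
    } else {
      try {
        datum.toDouble
      } catch {
        // Fall back to locale-aware parsing, as the original code does via Try.
        case _: NumberFormatException =>
          NumberFormat.getInstance(Locale.getDefault).parse(datum).doubleValue()
      }
    }
  }
}
```

Catching only `NumberFormatException` keeps the fallback narrow, whereas `Try(...).getOrElse(...)` would also swallow any other non-fatal exception.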
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-202843052 I'm not sure how complicated the use case will be, but having so many options really scares me... If we decide to do it, I think we should also add these options to JSON, to keep the two data sources consistent.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/11947#discussion_r57708347

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVTypeCastSuite.scala ---
@@ -27,6 +27,8 @@ import org.apache.spark.unsafe.types.UTF8String
 class CSVTypeCastSuite extends SparkFunSuite {
+  private def isNull(v: Any) = assert(v == null)
--- End diff --

nit: `isNull` looks like something that returns a boolean, how about `assertNull`?
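A small sketch of the suggested rename, assuming nothing beyond the standard library. The enclosing `NullAssertions` object and the failure message are illustrative additions, not part of the PR.

```scala
object NullAssertions {
  // The name and the explicit Unit return type signal that this helper
  // asserts (and throws on failure) rather than returning a Boolean.
  def assertNull(v: Any): Unit =
    assert(v == null, s"expected null but got: $v")
}
```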
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-202711498 I found that both `NaN` and `Infinity` are handled in the JSON data source; this was fixed in https://github.com/apache/spark/commit/7a9dcbc91d55dbc0cbf4812319bde65f4509b467. cc @yhuai for reviewing.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user HyukjinKwon commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-202682552 For the code, overall, it looks good to me. However, I am not used to, and do not have much experience with, `NaN`, `Inf` or `-Inf`. If the values can differ in many cases, I think the options are reasonable. Nevertheless, I am a bit doubtful about the separate `null` options for each type.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/11947#discussion_r57657765

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVTypeCastSuite.scala ---
@@ -64,17 +66,21 @@ class CSVTypeCastSuite extends SparkFunSuite {
   }
   test("Nullable types are handled") {
-    assert(CSVTypeCast.castTo("", IntegerType, nullable = true) == null)
+    assert(CSVTypeCast.castTo("", IntegerType, nullable = true, CSVOptions()) == null)
--- End diff --

I just noticed that the last argument (`params`) has a default value of `CSVOptions()` in `CSVTypeCast.castTo()`.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/11947#discussion_r57656879

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -478,4 +479,34 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     verifyCars(cars, withHeader = false, checkTypes = false)
   }
+
+  test("nulls, NaNs and Infinity values can be parsed") {
+    val numbers = sqlContext
+      .read
+      .format("csv")
+      .schema(StructType(List(
+        StructField("int", IntegerType, true),
+        StructField("long", LongType, true),
+        StructField("float", FloatType, true),
+        StructField("double", DoubleType, true)
+      )))
+      .options(Map(
+        "header" -> "true",
+        "mode" -> "DROPMALFORMED",
+        "integerNullValue" -> "--",
+        "longNullValue" -> "++",
+        "floatNullValue" -> "null",
+        "doubleNullValue" -> "NULL",
+        "floatNaNValue" -> "FNAN",
+        "doubleNaNValue" -> "DNAN",
+        "floatNegativeInf" -> "-FINF",
+        "floatPositiveInf" -> "FINF",
+        "doublePositiveInf" -> "DINF",
+        "doubleNegativeInf" -> "-DINF"))
+      .load(testFile(numbersFile))
+
+    assert(numbers.count() == 8)
+
+
--- End diff --

Maybe remove those double spaces?
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/11947#discussion_r57656806

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
@@ -101,3 +125,14 @@ private[sql] class CSVOptions(
   val rowSeparator = "\n"
 }
+
+object CSVOptions {
+
+  /** Used for convenient construction in unit tests */
+  def apply(): CSVOptions = new CSVOptions(Map.empty)
--- End diff --

I am a bit hesitant about this `CSVOptions` companion object if it is only used in unit tests. I'd just use `new CSVOptions(Map("key" -> "value"))` or `new CSVOptions(Map.empty)` in tests. Otherwise, if the object is required for some reason, I'd define it in the tests, or just add a function in the tests for convenient construction.
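The alternative described in this comment could look roughly like the sketch below, which keeps the convenience constructors in test sources instead of adding a companion object to main code. `SimpleCsvOptions` is a hypothetical, much reduced stand-in for `CSVOptions`, its `"NaN"` defaults are assumptions, and `TestOptions` is an illustrative name.

```scala
// Hypothetical, simplified stand-in for CSVOptions: it resolves option values
// from a parameter map. The "NaN" defaults are assumptions for illustration.
class SimpleCsvOptions(parameters: Map[String, String]) {
  val floatNaNValue: String = parameters.getOrElse("floatNaNValue", "NaN")
  val doubleNaNValue: String = parameters.getOrElse("doubleNaNValue", "NaN")
}

// Would live in test sources only, so main code carries no test-only API.
object TestOptions {
  // All options at their defaults.
  def default: SimpleCsvOptions = new SimpleCsvOptions(Map.empty)

  // Options with selected keys overridden.
  def of(overrides: (String, String)*): SimpleCsvOptions =
    new SimpleCsvOptions(overrides.toMap)
}
```

A test would then call `TestOptions.default` or `TestOptions.of("floatNaNValue" -> "FNAN")` where it needs an options instance.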
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user falaki commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-202570253 @cloud-fan would you take a look at this if you have time?
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user falaki commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-202502231 ping @HyukjinKwon and @rxin
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-201080503

**[Test build #54113 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54113/consoleFull)** for PR 11947 at commit [`93ac6bb`](https://github.com/apache/spark/commit/93ac6bb3eb63efb775b48af090a37a6cbe4f30c4).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-201080508 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54113/ Test FAILed.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-201080505 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11947#issuecomment-201080156 **[Test build #54113 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54113/consoleFull)** for PR 11947 at commit [`93ac6bb`](https://github.com/apache/spark/commit/93ac6bb3eb63efb775b48af090a37a6cbe4f30c4).
[GitHub] spark pull request: [SPARK-14143] Options for parsing NaNs, Infini...
GitHub user falaki opened a pull request: https://github.com/apache/spark/pull/11947

[SPARK-14143] Options for parsing NaNs, Infinity and nulls for numeric types

## What changes were proposed in this pull request?

1. Adds the following options for parsing type-specific nulls to the CSV data source:
  * byteNullValue
  * integerNullValue
  * shortNullValue
  * longNullValue
  * floatNullValue
  * doubleNullValue
  * decimalNullValue
2. Adds the following options for parsing NaNs:
  * floatNaNValue
  * doubleNaNValue
3. Adds the following options for parsing infinity:
  * floatNegativeInf
  * floatPositiveInf
  * doubleNegativeInf
  * doublePositiveInf

## How was this patch tested?

`CSVTypeCast.castTo` is unit tested and an end-to-end test is added to `CSVSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/falaki/spark SPARK-14143

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11947.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #11947

commit 93ac6bb3eb63efb775b48af090a37a6cbe4f30c4
Author: Hossein
Date: 2016-03-24T23:31:38Z

    Added support for null, NaN and Inf options for numeric types
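To illustrate how the type-specific null options listed in the description could be consulted, here is a minimal, self-contained sketch. The `NullTokens` object, the string type names, and the lookup function are hypothetical simplifications (the PR itself matches on Spark `DataType`s), and the empty-string fallback is an assumption that mirrors the `nullValue: String = ""` default in the pre-PR `castTo` signature.

```scala
object NullTokens {
  // Option names as listed in the PR description, keyed by a simplified type name.
  val optionNameByType: Map[String, String] = Map(
    "byte" -> "byteNullValue",
    "short" -> "shortNullValue",
    "integer" -> "integerNullValue",
    "long" -> "longNullValue",
    "float" -> "floatNullValue",
    "double" -> "doubleNullValue",
    "decimal" -> "decimalNullValue")

  // Assumption: an unset option falls back to the empty string.
  // Types without a null option (e.g. strings) never match.
  def isNullToken(datum: String, typeName: String, options: Map[String, String]): Boolean =
    optionNameByType.get(typeName).exists(name => datum == options.getOrElse(name, ""))
}
```

With the option values used in the `CSVSuite` test, `"--"` would be treated as null only in integer columns and `"++"` only in long columns.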