[jira] [Commented] (SPARK-37185) DataFrame.take() only uses one worker
[ https://issues.apache.org/jira/browse/SPARK-37185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437585#comment-17437585 ]

mathieu longtin commented on SPARK-37185:
------------------------------------------

It seems to optimize for a simple query, but not for more complex ones. That kind of makes sense for "select * from t", but any where clause can make the optimization quite restrictive. It looks like it scans the first partition, doesn't find enough data, then scans four partitions, then decides to scan everything. This is nice, but meanwhile I have 20 workers already reserved; it wouldn't cost anything more to just scan everything right away.

Timing below. The table is not cached and consists of 69 csv.gz files ranging from 1 MB to 2.2 GB:

{code:java}
In [1]: %time spark.sql("select * from t where x = 99").take(10)
CPU times: user 83.9 ms, sys: 112 ms, total: 196 ms
Wall time: 6min 44s
...

In [2]: %time spark.sql("select * from t where x = 99").limit(10).rdd.collect()
CPU times: user 45.7 ms, sys: 73.9 ms, total: 120 ms
Wall time: 3min 59s
...
{code}

I ran the two tests a few times to make sure there was no OS-level caching effect; the timings didn't change much.

If I cache the table first, then {{take(10)}} is faster than {{limit(10).rdd.collect()}}.
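A hedged workaround sketch, assuming {{spark.sql.limit.scaleUpFactor}} is the knob behind the 1-then-4-then-all scanning described above (verify the name and default against your Spark version):

{code:python}
# Sketch: make take() ramp up its partition scan faster, or skip the
# incremental scan entirely. Assumes spark.sql.limit.scaleUpFactor
# (default 4) is the config behind the 1 -> 4 -> all behavior above.
spark.conf.set("spark.sql.limit.scaleUpFactor", "100")
rows = spark.sql("select * from t where x = 99").take(10)

# Alternative from the timings above: drop to the RDD API, which runs
# the filter on all partitions at once instead of scanning incrementally.
rows = spark.sql("select * from t where x = 99").limit(10).rdd.collect()
{code}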
[jira] [Commented] (SPARK-37185) DataFrame.take() only uses one worker
[ https://issues.apache.org/jira/browse/SPARK-37185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437009#comment-17437009 ]

mathieu longtin commented on SPARK-37185:
------------------------------------------

Additional note: if there's a "group by" in the query, this is not an issue.
[jira] [Created] (SPARK-37185) DataFrame.take() only uses one worker
mathieu longtin created SPARK-37185:
---------------------------------------

             Summary: DataFrame.take() only uses one worker
                 Key: SPARK-37185
                 URL: https://issues.apache.org/jira/browse/SPARK-37185
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.0, 3.1.1
         Environment: CentOS 7
            Reporter: mathieu longtin


Say you have this query:

{code:java}
>>> df = spark.sql("select * from mytable where x = 99")
{code}

Now, out of billions of rows, there are only ten rows where x is 99.

If I do:

{code:java}
>>> df.limit(10).collect()
[Stage 1:>                                                          (0 + 1) / 1]
{code}

It only uses one worker. This takes a really long time since one CPU ends up reading the billions of rows.

However, if I do this:

{code:java}
>>> df.limit(10).rdd.collect()
[Stage 1:>                                                        (0 + 10) / 22]
{code}

All the workers are running.

I think there's an optimization issue with DataFrame.take(...).

This was not an issue before, but I'm not sure whether it was still working in 3.0 or 2.4.
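To see the plan difference behind this, comparing the physical plans is instructive; a sketch (the explain output shown in comments is illustrative, not taken from the report):

{code:python}
# The DataFrame path goes through CollectLimit, whose driver-side
# executeTake() scans partitions incrementally (1, then a few, then all);
# the .rdd path materializes every partition in parallel instead.
spark.sql("select * from mytable where x = 99").limit(10).explain()
# == Physical Plan ==
# CollectLimit 10
# +- *(1) Filter (isnotnull(x) AND (x = 99))
#    +- Scan ...
{code}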
[jira] [Comment Edited] (SPARK-24753) bad backslash parsing in SQL statements
[ https://issues.apache.org/jira/browse/SPARK-24753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539023#comment-16539023 ]

mathieu longtin edited comment on SPARK-24753 at 7/10/18 5:59 PM:
------------------------------------------------------------------

Thanks for the response. Yes, it does work with escapedStringLiterals. However, the behavior is inconsistent. In the doc example ([https://spark.apache.org/docs/2.3.0/api/sql/index.html#rlike]):

{code:java}
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*'
{code}

The examples are totally wrong. In fact, they produce an error. To reproduce the example using the *spark-sql* command with _escapedStringLiterals=False_, I need this:

{code:java}
-- When spark.sql.parser.escapedStringLiterals is disabled (default).
> SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.*'
true
{code}

Notice the double and quadruple backslashes. Somehow, the right side of rlike gets decoded, and then passed to the rlike function, which decodes it again.

BTW, from spark-sql, single backslashes are no good:

{code:java}
> SELECT '%SystemDrive%\Users\John' ;
%SystemDrive%UsersJohn
{code}

Oops, the backslashes get swallowed.
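A sketch of the escapedStringLiterals workaround mentioned above, from PySpark (the config name is the one discussed in this thread; whether it is settable per-session in your version is an assumption to check):

{code:python}
# With escaped string literals enabled, the SQL parser leaves backslashes
# alone, so one level of escaping (the Python level) is enough.
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")

df = spark.createDataFrame([("abc def ghi",)], schema=["s"])
# The Python string below reaches the SQL parser as  s rlike '\bdef\b'
# and the regex engine sees the intended word-boundary pattern:
df.filter("s rlike '\\bdef\\b'").show()
{code}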
[jira] [Commented] (SPARK-24753) bad backslash parsing in SQL statements
[ https://issues.apache.org/jira/browse/SPARK-24753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539023#comment-16539023 ]

mathieu longtin commented on SPARK-24753:
------------------------------------------

Thanks for the response. Yes, it does work with escapedStringLiterals. However, the behavior is inconsistent. In the doc example (https://spark.apache.org/docs/2.3.0/api/sql/index.html#rlike):

{code:java}
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*'
{code}

The examples are totally wrong. In fact, they produce an error. To reproduce the example using the *spark-sql* command with _escapedStringLiterals=False_, I need this:

{code:java}
-- When spark.sql.parser.escapedStringLiterals is disabled (default).
> SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.*'
true
{code}

Notice the double and quadruple backslashes. Somehow, the right side of rlike gets decoded, and then passed to the rlike function, which decodes it again.

BTW, from spark-sql:

{code:java}
> SELECT '%SystemDrive%\Users\John' ;
%SystemDrive%UsersJohn
{code}

Oops, the backslashes get swallowed.
[jira] [Created] (SPARK-24753) bad backslash parsing in SQL statements
mathieu longtin created SPARK-24753:
---------------------------------------

             Summary: bad backslash parsing in SQL statements
                 Key: SPARK-24753
                 URL: https://issues.apache.org/jira/browse/SPARK-24753
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
         Environment:
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 2.7.12 (default, Jul 15 2016 11:23:12)
            Reporter: mathieu longtin


When putting backslashes in SQL code, you need to double them (or rather, double-double them).

The code is in Python, but I verified that the problem is the same in Scala.

Line [3] should return the row, and line [4] shouldn't.

{code:java}
In [1]: df = spark.createDataFrame([("abc def ghi",)], schema=["s"])

In [2]: df.filter(df.s.rlike('\\bdef\\b')).show()
+-----------+
|          s|
+-----------+
|abc def ghi|
+-----------+

In [3]: df.filter("s rlike '\\bdef\\b'").show()
+---+
|  s|
+---+
+---+

In [4]: df.filter("s rlike '\\\\bdef\\\\b'").show()
+-----------+
|          s|
+-----------+
|abc def ghi|
+-----------+
{code}
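A sketch of the escaping layers at work (an annotation, not from the report; it assumes Spark's default SQL string unescaping treats {{\b}} as a backspace character, MySQL-style):

{code:python}
# Layer 1 (Python source): "\\bdef\\b" is the 7-character string \bdef\b
sql = "s rlike '\\bdef\\b'"

# Layer 2 (Spark SQL string literal, default mode): \b is unescaped to a
# backspace control character, so the regex engine receives <BS>def<BS>,
# which matches nothing -- hence the empty result in line [3].

# With four backslashes, each layer strips one level and the regex engine
# finally sees the intended word-boundary pattern \bdef\b (line [4]):
sql_ok = "s rlike '\\\\bdef\\\\b'"
{code}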
[jira] [Commented] (SPARK-15531) spark-class tries to use too much memory when running Launcher
[ https://issues.apache.org/jira/browse/SPARK-15531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300986#comment-15300986 ]

mathieu longtin commented on SPARK-15531:
------------------------------------------

Correct: on a 128 GB server, just running {{java}} with no arguments will try to allocate 32 GB, regardless of ulimit. It's "expected behavior" according to Oracle.
[jira] [Commented] (SPARK-15531) spark-class tries to use too much memory when running Launcher
[ https://issues.apache.org/jira/browse/SPARK-15531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300666#comment-15300666 ]

mathieu longtin commented on SPARK-15531:
------------------------------------------

The VM that spark-class launches afterwards has an -Xmx argument; it comes from --executor-memory, --driver-memory, or somewhere else. This is only a problem when letting Java decide what -Xmx should be: by default it's a quarter of the physical memory, and Java tries to reserve it right away, regardless of actual need.
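The arithmetic from the report, made explicit (a sketch; the min(RAM/4, ~32 GB) heuristic is the JVM ergonomics default described in the comments above, treated here as an assumption):

{code:python}
# Why the launcher JVM dies under a small ulimit -v: with no -Xmx,
# HotSpot reserves roughly min(physical RAM / 4, ~32 GB) up front.
physical_ram_gb = 128                            # server RAM from the report
default_heap_gb = min(physical_ram_gb / 4, 32)   # -> 32 GB reserved
ulimit_v_gb = 7812500 / (1024 * 1024)            # ulimit -v is in KB -> ~7.45 GB
print(default_heap_gb > ulimit_v_gb)             # True: the reservation fails
{code}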
[jira] [Created] (SPARK-15531) spark-class tries to use too much memory when running Launcher
mathieu longtin created SPARK-15531:
---------------------------------------

             Summary: spark-class tries to use too much memory when running Launcher
                 Key: SPARK-15531
                 URL: https://issues.apache.org/jira/browse/SPARK-15531
             Project: Spark
          Issue Type: Bug
          Components: Deploy
    Affects Versions: 1.6.1, 2.0.0
         Environment: Linux running in Univa or Sun Grid Engine
            Reporter: mathieu longtin
            Priority: Minor


When running Java on a server with a lot of memory but a rather small virtual memory ulimit, Java will try to allocate a large memory pool and fail:

{code}
# System has 128GB of RAM but ulimit set to 7.5G
$ ulimit -v
7812500
$ java -client
Error occurred during initialization of VM
Could not reserve enough space for object heap
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
{code}

This is a known issue with Java, but unlikely to get fixed.

As a result, when starting various Spark processes (spark-submit, master, or workers), they fail when {{spark-class}} tries to run {{org.apache.spark.launcher.Main}}.

To fix this, add {{-Xmx128m}} to this line:

{code}
"$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
{code}

(https://github.com/apache/spark/blob/master/bin/spark-class#L71)

We've been using 128m and that works in our setup. Considering all the launcher does is analyze the arguments and environment variables and spit out a command, it should be plenty. All other calls to Java seem to include some value for -Xmx, so it is not an issue elsewhere.

I don't mind submitting a PR, but I'm sure somebody has opinions on the 128m (bigger, smaller, configurable, ...), so I'd rather it be discussed first.
[jira] [Issue Comment Deleted] (SPARK-13266) Python DataFrameReader converts None to "None" instead of null
[ https://issues.apache.org/jira/browse/SPARK-13266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

mathieu longtin updated SPARK-13266:
------------------------------------

    Comment: was deleted

(was: https://github.com/apache/spark/pull/11305)
[jira] [Commented] (SPARK-13266) Python DataFrameReader converts None to "None" instead of null
[ https://issues.apache.org/jira/browse/SPARK-13266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157260#comment-15157260 ]

mathieu longtin commented on SPARK-13266:
------------------------------------------

https://github.com/apache/spark/pull/11305
[jira] [Reopened] (SPARK-13290) wholeTextFile and binaryFiles are really slow
[ https://issues.apache.org/jira/browse/SPARK-13290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

mathieu longtin reopened SPARK-13290:
-------------------------------------

Slow relative to reading the exact same file from a local disk on the same machine: Python will read the same file *70 times* faster. I ran these tests a few times to make sure it's not a cache issue.

Point me to the code that reads files and maybe I can help. Just closing the bug doesn't mean the problem isn't there.
[jira] [Created] (SPARK-13290) wholeTextFile and binaryFiles are really slow
mathieu longtin created SPARK-13290:
---------------------------------------

             Summary: wholeTextFile and binaryFiles are really slow
                 Key: SPARK-13290
                 URL: https://issues.apache.org/jira/browse/SPARK-13290
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core
    Affects Versions: 1.6.0
         Environment: Linux stand-alone
            Reporter: mathieu longtin


Reading biggish files (175MB) with wholeTextFile or binaryFiles is extremely slow: it takes over 3 minutes in Java versus 2.5 seconds in Python.

The java process balloons to 4.3GB of memory and uses 100% CPU the whole time. I suspect Spark reads the file in small chunks and assembles them at the end, hence the large amount of CPU.

{code}
In [49]: rdd = sc.binaryFiles(pathToOneFile)

In [50]: %time path, text = rdd.first()
CPU times: user 1.91 s, sys: 1.13 s, total: 3.04 s
Wall time: 3min 32s

In [51]: len(text)
Out[51]: 191376122

In [52]: %time text = open(pathToOneFile).read()
CPU times: user 8 ms, sys: 691 ms, total: 699 ms
Wall time: 2.43 s

In [53]: len(text)
Out[53]: 191376122
{code}
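A hedged workaround sketch for reading big files without the binaryFiles overhead: distribute the paths and read them with plain Python on the executors. {{path_to_one_file}} is a hypothetical stand-in for the report's {{pathToOneFile}}, and the files must live on a filesystem visible to the executors (both assumptions):

{code:python}
# Read whole files with plain Python inside the executors, avoiding the
# chunked read-and-reassemble path that binaryFiles appears to take.
def read_whole(path):
    with open(path, "rb") as f:
        return path, f.read()

# path_to_one_file is hypothetical -- substitute a real, executor-visible path.
rdd = sc.parallelize([path_to_one_file]).map(read_whole)
path, text = rdd.first()
{code}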
[jira] [Created] (SPARK-13266) Python DataFrameReader converts None to "None" instead of null
mathieu longtin created SPARK-13266:
---------------------------------------

             Summary: Python DataFrameReader converts None to "None" instead of null
                 Key: SPARK-13266
                 URL: https://issues.apache.org/jira/browse/SPARK-13266
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 1.6.0
         Environment: Linux standalone but probably applies to all
            Reporter: mathieu longtin


If you do something like this:

{code:none}
tsv_loader = sqlContext.read.format('com.databricks.spark.csv')
tsv_loader.options(quote=None, escape=None)
{code}

The loader sees the string "None" as the _quote_ and _escape_ options. The loader should get a _null_.

An easy fix is to modify *python/pyspark/sql/readwriter.py* near the top, correcting the _to_str_ function. Here's the patch:

{code:none}
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index a3d7eca..ba18d13 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -33,10 +33,12 @@ __all__ = ["DataFrameReader", "DataFrameWriter"]
 
 def to_str(value):
     """
-    A wrapper over str(), but convert bool values to lower case string
+    A wrapper over str(), but convert bool values to lower case string, and keep None
     """
     if isinstance(value, bool):
         return str(value).lower()
+    elif value is None:
+        return value
     else:
         return str(value)
{code}

This has been tested and works great.
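A minimal standalone illustration of the bug and the fix (a sketch reproducing the function from the patch above):

{code:python}
def to_str_buggy(value):
    """Pre-patch behavior: str(None) yields the literal string 'None'."""
    if isinstance(value, bool):
        return str(value).lower()
    return str(value)

def to_str_fixed(value):
    """Post-patch behavior: None passes through so the JVM side sees null."""
    if isinstance(value, bool):
        return str(value).lower()
    elif value is None:
        return value
    return str(value)

print(to_str_buggy(None))  # 'None' -- the CSV reader treats it as a value
print(to_str_fixed(None))  # None   -- the reader gets null, as intended
{code}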