[jira] [Updated] (SPARK-11176) Umbrella ticket for wholeTextFiles bugs
[ https://issues.apache.org/jira/browse/SPARK-11176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11176: --- Summary: Umbrella ticket for wholeTextFiles bugs (was: Umbrella ticket for wholeTextFiles + S3 bugs) > Umbrella ticket for wholeTextFiles bugs > --- > > Key: SPARK-11176 > URL: https://issues.apache.org/jira/browse/SPARK-11176 > Project: Spark > Issue Type: Umbrella > Components: Input/Output, Spark Core >Reporter: Josh Rosen > > This umbrella ticket gathers together several distinct bug reports related to > problems using the wholeTextFiles method to read files from S3. > These issues may have a common underlying cause and should be investigated > together. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10994) Clustering coefficient computation in GraphX
[ https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Yang updated SPARK-10994: -- Description: The Clustering Coefficient (CC) is a fundamental measure in social (or other types of) network analysis assessing the degree to which nodes tend to cluster together [1][2]. The clustering coefficient, along with density, node degree, path length, diameter, connectedness, and node centrality, is one of the seven most important properties used to characterise a network [3]. We found that GraphX has already implemented connectedness, node centrality, and path length, but does not have a component for computing the clustering coefficient. This was our original motivation for implementing an algorithm to compute the clustering coefficient for each vertex of a given graph. The clustering coefficient is very helpful in many real applications, such as user behaviour prediction and structure prediction (like link prediction). We have used it in several of our own papers (e.g., [4-5]), and have also found many other published papers using this metric in their work [6-8]. We are very confident that this feature will benefit GraphX and attract a large number of users. References [1] https://en.wikipedia.org/wiki/Clustering_coefficient [2] Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of ‘small-world’ networks." Nature 393.6684 (1998): 440-442. (with 27266 citations). [3] https://en.wikipedia.org/wiki/Network_science [4] Jing Zhang, Zhanpeng Fang, Wei Chen, and Jie Tang. Diffusion of "Following" Links in Microblogging Networks. IEEE Transactions on Knowledge and Data Engineering (TKDE), Volume 27, Issue 8, 2015, Pages 2093-2106. [5] Yang Yang, Jie Tang, Jacklyne Keomany, Yanting Zhao, Ying Ding, Juanzi Li, and Liangwei Wang. Mining Competitive Relationships by Learning across Heterogeneous Networks. In Proceedings of the Twenty-First Conference on Information and Knowledge Management (CIKM'12). pp. 1432-1441. 
[6] Clauset, Aaron, Cristopher Moore, and Mark EJ Newman. Hierarchical structure and the prediction of missing links in networks. Nature 453.7191 (2008): 98-101. (with 973 citations) [7] Adamic, Lada A., and Eytan Adar. Friends and neighbors on the web. Social networks 25.3 (2003): 211-230. (1238 citations) [8] Lichtenwalter, Ryan N., Jake T. Lussier, and Nitesh V. Chawla. New perspectives and methods in link prediction. In KDD'10. was: The Clustering Coefficient (CC) is a fundamental measure in social (or other types of) network analysis assessing the degree to which nodes tend to cluster together. We propose to implement an algorithm to compute the clustering coefficient for each vertex of a given graph in GraphX. Specifically, the clustering coefficient of a vertex (node) in a graph quantifies how close its neighbours are to being a clique (complete graph). More formally, the clustering coefficient C_i for a vertex v_i is given by the proportion of links between the vertices within its neighbourhood divided by the number of links that could possibly exist between them. The clustering coefficient is well known and has wide applications. Duncan J. Watts and Steven Strogatz introduced the measure in 1998 to determine whether a graph is a small-world network (1). Their paper has attracted 27266 citations to date. Similar features are included in NetworkX (2), SNAP (3), etc. (1) Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of ‘small-world’ networks." Nature 393.6684 (1998): 440-442. 
(2) http://networkx.github.io/ (3) http://snap.stanford.edu/ > Clustering coefficient computation in GraphX > > > Key: SPARK-10994 > URL: https://issues.apache.org/jira/browse/SPARK-10994 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Yang Yang > Original Estimate: 336h > Remaining Estimate: 336h > > The Clustering Coefficient (CC) is a fundamental measure in social (or other > types of) network analysis assessing the degree to which nodes tend to cluster > together [1][2]. The clustering coefficient, along with density, node degree, > path length, diameter, connectedness, and node centrality, is one of the seven most > important properties used to characterise a network [3]. > We found that GraphX has already implemented connectedness, node centrality, > and path length, but does not have a component for computing the clustering > coefficient. This was our original motivation for implementing an > algorithm to compute the clustering coefficient for each vertex of a given graph. > The clustering coefficient is very helpful in many real applications, such as > user behaviour prediction and structure prediction (like link prediction). We > have used it in several of our own papers (e.g., [4-5]), and have also found many other > published papers using this metric in their
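As background for the proposal above, the local clustering coefficient C_i can be sketched independently of GraphX. The following plain-Scala fragment is a minimal illustration of the definition given in the ticket (an undirected adjacency map stands in for the graph; this is not the proposed GraphX implementation):

```scala
// Local clustering coefficient of vertex v:
//   C_v = (edges among neighbours of v) / (k * (k - 1) / 2), where k = degree(v).
object ClusteringCoefficient {
  // adj maps each vertex to its neighbour set (undirected, no self-loops).
  def localCC(adj: Map[Int, Set[Int]], v: Int): Double = {
    val nbrs = adj.getOrElse(v, Set.empty[Int])
    val k = nbrs.size
    if (k < 2) 0.0
    else {
      // Each undirected edge between two neighbours is seen twice, hence / 2.
      val links = nbrs.toSeq.map(n => (adj.getOrElse(n, Set.empty[Int]) & nbrs).size).sum / 2
      links.toDouble / (k * (k - 1) / 2)
    }
  }
}
```

For a triangle every vertex has C = 1.0; a vertex whose neighbours share no edges has C = 0.0.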
[jira] [Commented] (SPARK-10994) Clustering coefficient computation in GraphX
[ https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962929#comment-14962929 ] Yang Yang commented on SPARK-10994: --- Updated the description to explain our motivation in more detail. > Clustering coefficient computation in GraphX > > > Key: SPARK-10994 > URL: https://issues.apache.org/jira/browse/SPARK-10994 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Yang Yang > Original Estimate: 336h > Remaining Estimate: 336h > > The Clustering Coefficient (CC) is a fundamental measure in social (or other > types of) network analysis assessing the degree to which nodes tend to cluster > together [1][2]. The clustering coefficient, along with density, node degree, > path length, diameter, connectedness, and node centrality, is one of the seven most > important properties used to characterise a network [3]. > We found that GraphX has already implemented connectedness, node centrality, > and path length, but does not have a component for computing the clustering > coefficient. This was our original motivation for implementing an > algorithm to compute the clustering coefficient for each vertex of a given graph. > The clustering coefficient is very helpful in many real applications, such as > user behaviour prediction and structure prediction (like link prediction). We > have used it in several of our own papers (e.g., [4-5]), and have also found many other > published papers using this metric in their work [6-8]. We are very > confident that this feature will benefit GraphX and attract a large number of > users. > References > [1] https://en.wikipedia.org/wiki/Clustering_coefficient > [2] Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of > ‘small-world’ networks." Nature 393.6684 (1998): 440-442. (with 27266 > citations). > [3] https://en.wikipedia.org/wiki/Network_science > [4] Jing Zhang, Zhanpeng Fang, Wei Chen, and Jie Tang. Diffusion of > "Following" Links in Microblogging Networks. 
IEEE Transactions on Knowledge > and Data Engineering (TKDE), Volume 27, Issue 8, 2015, Pages 2093-2106. > [5] Yang Yang, Jie Tang, Jacklyne Keomany, Yanting Zhao, Ying Ding, Juanzi > Li, and Liangwei Wang. Mining Competitive Relationships by Learning across > Heterogeneous Networks. In Proceedings of the Twenty-First Conference on > Information and Knowledge Management (CIKM'12). pp. 1432-1441. > [6] Clauset, Aaron, Cristopher Moore, and Mark EJ Newman. Hierarchical > structure and the prediction of missing links in networks. Nature 453.7191 > (2008): 98-101. (with 973 citations) > [7] Adamic, Lada A., and Eytan Adar. Friends and neighbors on the web. Social > networks 25.3 (2003): 211-230. (1238 citations) > [8] Lichtenwalter, Ryan N., Jake T. Lussier, and Nitesh V. Chawla. New > perspectives and methods in link prediction. In KDD'10. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.
prakhar jauhari created SPARK-11181: --- Summary: Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled. Key: SPARK-11181 URL: https://issues.apache.org/jira/browse/SPARK-11181 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core, YARN Affects Versions: 1.3.1 Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. All servers in cluster running Linux version 2.6.32. Job in yarn-client mode. Reporter: prakhar jauhari Fix For: 1.3.2, 1.5.2 Spark driver reduces the total executor count even when Dynamic Allocation is not enabled. To reproduce this: 1. A 2 node yarn setup: each DN has ~ 20GB mem and 4 cores. 2. When the application launches and gets its required executors, one of the DNs loses connectivity and is timed out. 3. Spark issues a killExecutor for the executor on the DN which was timed out. 4. Even with dynamic allocation off, Spark's scheduler reduces "targetNumExecutors". 5. Thus the job runs with a reduced executor count. Note: the severity of the issue increases if some of the DNs that were running my job's executors lose connectivity intermittently: the Spark scheduler reduces "targetNumExecutors" and thus does not ask for new executors on any other nodes, causing the job to hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions
Nitin Goyal created SPARK-11179: --- Summary: Push filters through aggregate if filters are subset of 'group by' expressions Key: SPARK-11179 URL: https://issues.apache.org/jira/browse/SPARK-11179 Project: Spark Issue Type: Improvement Components: SQL Reporter: Nitin Goyal Priority: Minor Fix For: 1.6.0 Push filters through aggregate if filters are subset of 'group by' expressions. This optimisation can be added to Spark SQL's Optimizer class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
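The proposed optimisation rests on an algebraic equivalence: when a predicate refers only to grouping expressions, filtering before the aggregate yields the same result as filtering after it, because each group either survives intact or disappears. A minimal plain-Scala model of that equivalence (illustrative only, not the actual Catalyst rule):

```scala
object FilterThroughAggregate {
  // Aggregate: group rows by key and sum the values per group.
  def aggregate(rows: Seq[(String, Int)]): Map[String, Int] =
    rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // A predicate that depends only on the grouping key.
  val keep: String => Boolean = _ != "b"

  // Filter applied after the aggregate...
  def filterAfter(rows: Seq[(String, Int)]): Map[String, Int] =
    aggregate(rows).filter { case (k, _) => keep(k) }

  // ...and pushed below it; both must agree for key-only predicates.
  def filterBefore(rows: Seq[(String, Int)]): Map[String, Int] =
    aggregate(rows.filter { case (k, _) => keep(k) })
}
```

Pushing the filter down is profitable because the aggregate then processes fewer rows.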
[jira] [Commented] (SPARK-11144) Add SparkLauncher for Spark Streaming, Spark SQL, etc
[ https://issues.apache.org/jira/browse/SPARK-11144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962903#comment-14962903 ] Jean-Baptiste Onofré commented on SPARK-11144: -- Hi Yuhang, just to confirm: a utility like spark-submit, but programmatic (like SparkLauncher), right? > Add SparkLauncher for Spark Streaming, Spark SQL, etc > - > > Key: SPARK-11144 > URL: https://issues.apache.org/jira/browse/SPARK-11144 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL, Streaming >Affects Versions: 1.5.1 > Environment: Linux x64 >Reporter: Yuhang Chen >Priority: Minor > Labels: launcher > > Now we have org.apache.spark.launcher.SparkLauncher to launch Spark as a child > process. However, it does not support other libs, such as Spark Streaming and > Spark SQL. > What I'm looking for is a utility like spark-submit, with which you can > submit any Spark lib's jobs to all supported resource managers (Standalone, YARN, > Mesos, etc.) in Java/Scala code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
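For reference, the existing launcher API already covers the spark-submit-as-a-library case for core jobs; the request here is to make the same programmatic submission work smoothly for Streaming/SQL jobs. A minimal usage sketch (assumes Spark 1.4+ on the classpath; the jar path and main class below are hypothetical):

```scala
import org.apache.spark.launcher.SparkLauncher

object LaunchExample {
  def main(args: Array[String]): Unit = {
    // Builds a spark-submit invocation programmatically and launches it
    // as a child process, as spark-submit itself would.
    val process = new SparkLauncher()
      .setAppResource("/path/to/my-job.jar")       // hypothetical application jar
      .setMainClass("com.example.MyStreamingJob")  // hypothetical main class
      .setMaster("yarn-client")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .launch()                                    // returns a java.lang.Process
    process.waitFor()
  }
}
```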
[jira] [Commented] (SPARK-11157) Allow Spark to be built without assemblies
[ https://issues.apache.org/jira/browse/SPARK-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962897#comment-14962897 ] Jean-Baptiste Onofré commented on SPARK-11157: -- Agree with Marcelo. It's something I had planned as well: creating more fine-grained jar files instead of one big Spark jar. > Allow Spark to be built without assemblies > -- > > Key: SPARK-11157 > URL: https://issues.apache.org/jira/browse/SPARK-11157 > Project: Spark > Issue Type: Umbrella > Components: Build, Spark Core, YARN >Reporter: Marcelo Vanzin > Attachments: no-assemblies.pdf > > > For reasoning, discussion of pros and cons, and other more detailed > information, please see attached doc. > The idea is to be able to build a Spark distribution that has just a > directory full of jars instead of the huge assembly files we currently have. > Getting there requires changes in a bunch of places, I'll try to list the > ones I identified in the document, in the order that I think would be needed > to not break things: > * make streaming backends not be assemblies > Since people may depend on the current assembly artifacts in their > deployments, we can't really remove them; but we can make them be dummy jars > and rely on dependency resolution to download all the jars. > PySpark tests would also need some tweaking here. > * make examples jar not be an assembly > Probably requires tweaks to the {{run-example}} script. The location of the > examples jar would have to change (it won't be able to live in the same place > as the main Spark jars anymore). > * update YARN backend to handle a directory full of jars when launching apps > Currently YARN localizes the Spark assembly (depending on the user > configuration); it needs to be modified so that it can localize all needed > libraries instead of a single jar. 
> * Modify launcher library to handle the jars directory > This should be trivial > * Modify {{assembly/pom.xml}} to generate assembly or a {{libs}} directory > depending on which profile is enabled. > We should keep the option to build with the assembly on by default, for > backwards compatibility, to give people time to prepare. > Filing this bug as an umbrella; please file sub-tasks if you plan to work on > a specific part of the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11176) Umbrella ticket for wholeTextFiles bugs
[ https://issues.apache.org/jira/browse/SPARK-11176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11176: --- Description: This umbrella ticket gathers together several distinct bug reports related to problems using the wholeTextFiles method to read files. Most of these bugs deal with reading files from S3, but it's not clear whether S3 is necessary to hit these bugs. These issues may have a common underlying cause and should be investigated together. was: This umbrella ticket gathers together several distinct bug reports related to problems using the wholeTextFiles method to read files from S3. These issues may have a common underlying cause and should be investigated together. > Umbrella ticket for wholeTextFiles bugs > --- > > Key: SPARK-11176 > URL: https://issues.apache.org/jira/browse/SPARK-11176 > Project: Spark > Issue Type: Umbrella > Components: Input/Output, Spark Core >Reporter: Josh Rosen > > This umbrella ticket gathers together several distinct bug reports related to > problems using the wholeTextFiles method to read files. Most of these bugs > deal with reading files from S3, but it's not clear whether S3 is necessary > to hit these bugs. > These issues may have a common underlying cause and should be investigated > together. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11180: Assignee: Apache Spark > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Assignee: Apache Spark >Priority: Minor > > Currently DataFrame.na.fill does not support the Boolean primitive type. We > have use cases where, during data massaging/preparation, we want to fill boolean > columns with a false/true value. > Ex: > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > We want to set "isOfficer" to false whenever it is null. > scala> empDf.na.fill(Map("isOfficer"->false)) > throws the exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > Can you add support for Boolean to the na.fill function? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962923#comment-14962923 ] Apache Spark commented on SPARK-11180: -- User 'rishabhbhardwaj' has created a pull request for this issue: https://github.com/apache/spark/pull/9166 > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > > Currently DataFrame.na.fill does not support the Boolean primitive type. We > have use cases where, during data massaging/preparation, we want to fill boolean > columns with a false/true value. > Ex: > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > We want to set "isOfficer" to false whenever it is null. > scala> empDf.na.fill(Map("isOfficer"->false)) > throws the exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > Can you add support for Boolean to the na.fill function? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11180: Assignee: (was: Apache Spark) > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > > Currently DataFrame.na.fill does not support the Boolean primitive type. We > have use cases where, during data massaging/preparation, we want to fill boolean > columns with a false/true value. > Ex: > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > We want to set "isOfficer" to false whenever it is null. > scala> empDf.na.fill(Map("isOfficer"->false)) > throws the exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > Can you add support for Boolean to the na.fill function? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions
[ https://issues.apache.org/jira/browse/SPARK-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962931#comment-14962931 ] Apache Spark commented on SPARK-11179: -- User 'nitin2goyal' has created a pull request for this issue: https://github.com/apache/spark/pull/9167 > Push filters through aggregate if filters are subset of 'group by' expressions > -- > > Key: SPARK-11179 > URL: https://issues.apache.org/jira/browse/SPARK-11179 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nitin Goyal >Priority: Minor > Fix For: 1.6.0 > > > Push filters through aggregate if filters are subset of 'group by' > expressions. This optimisation can be added in Spark SQL's Optimizer class -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions
[ https://issues.apache.org/jira/browse/SPARK-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11179: Assignee: Apache Spark > Push filters through aggregate if filters are subset of 'group by' expressions > -- > > Key: SPARK-11179 > URL: https://issues.apache.org/jira/browse/SPARK-11179 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nitin Goyal >Assignee: Apache Spark >Priority: Minor > Fix For: 1.6.0 > > > Push filters through aggregate if filters are subset of 'group by' > expressions. This optimisation can be added in Spark SQL's Optimizer class -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11132) Mean Shift algorithm integration
[ https://issues.apache.org/jira/browse/SPARK-11132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962960#comment-14962960 ] Beck Gaël commented on SPARK-11132: --- Thank you. It's not yet the case for Mean Shift, but I hope it will be. I've published the algorithm at http://spark-packages.org/package/Kybe67/Mean-Shift-LSH. I will prepare it as a Spark package as soon as I can, because I have some sbt issues with spark-package. If something is missing, it will be a pleasure to remedy it. Thank you again for your support. > Mean Shift algorithm integration > > > Key: SPARK-11132 > URL: https://issues.apache.org/jira/browse/SPARK-11132 > Project: Spark > Issue Type: Brainstorming > Components: MLlib >Reporter: Beck Gaël >Priority: Minor > > I made a version of the clustering algorithm Mean Shift in Scala/Spark and > would like to contribute it if you think that it is a good idea. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.
[ https://issues.apache.org/jira/browse/SPARK-11181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962961#comment-14962961 ] prakhar jauhari commented on SPARK-11181: - On analysing the code (Spark 1.3.1): when a DN goes unreachable, Spark core's HeartbeatReceiver invokes _expireDeadHosts()_, which checks whether Dynamic Allocation is supported and then invokes _"sc.killExecutor()"_ {quote} if (sc.supportDynamicAllocation) \{ sc.killExecutor(executorId) } {quote} Surprisingly, _supportDynamicAllocation_ in _sparkContext.scala_ is defined to return true if the _dynamicAllocationTesting_ flag is enabled or Spark is running on _yarn_ {quote} private\[spark\] def supportDynamicAllocation = master.contains("yarn") || dynamicAllocationTesting {quote} _"sc.killExecutor()"_ dispatches to the configured _"schedulerBackend"_ (CoarseGrainedSchedulerBackend in this case) and invokes _"killExecutors(executorIds)"_. CoarseGrainedSchedulerBackend calculates a _"newTotal"_ for the total number of executors required and sends an update to the application master by invoking _"doRequestTotalExecutors(newTotal)"_. CoarseGrainedSchedulerBackend then invokes _"doKillExecutors(filteredExecutorIds)"_ for the lost executors, thus reducing the total number of executors when a host is intermittently unreachable. > Spark Yarn : Spark reducing total executors count even when Dynamic > Allocation is disabled. > --- > > Key: SPARK-11181 > URL: https://issues.apache.org/jira/browse/SPARK-11181 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core, YARN >Affects Versions: 1.3.1 > Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. > All servers in cluster running Linux version 2.6.32. > Job in yarn-client mode. >Reporter: prakhar jauhari > Fix For: 1.3.2 > > > Spark driver reduces the total executor count even when Dynamic Allocation is > not enabled. > To reproduce this: > 1. A 2 node yarn setup: each DN has ~ 20GB mem and 4 cores. > 2. 
When the application launches and gets its required executors, one of the > DNs loses connectivity and is timed out. > 3. Spark issues a killExecutor for the executor on the DN which was timed > out. > 4. Even with dynamic allocation off, Spark's scheduler reduces > "targetNumExecutors". > 5. Thus the job runs with a reduced executor count. > Note: the severity of the issue increases if some of the DNs that were > running my job's executors lose connectivity intermittently: the Spark scheduler > reduces "targetNumExecutors" and thus does not ask for new executors on any other > nodes, causing the job to hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
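The analysis above comes down to a guard that conflates "running on YARN" with "dynamic allocation enabled". A minimal plain-Scala model of the two conditions (illustrative names only, not Spark's actual code):

```scala
// A stripped-down stand-in for the relevant configuration.
case class Conf(master: String, dynamicAllocationEnabled: Boolean)

object ExpireDeadHosts {
  // Spark 1.3.1 behaviour as analysed above: on YARN the target executor
  // count is shrunk on host timeout even when dynamic allocation is off.
  def buggyShouldShrinkTarget(c: Conf): Boolean =
    c.master.contains("yarn") // (|| dynamicAllocationTesting in the real code)

  // Intended behaviour: only shrink the target when the user actually
  // enabled dynamic allocation.
  def fixedShouldShrinkTarget(c: Conf): Boolean =
    c.dynamicAllocationEnabled
}
```

With dynamic allocation off on a YARN master, the first predicate still shrinks the target, which is exactly the reported hang.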
[jira] [Updated] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.
[ https://issues.apache.org/jira/browse/SPARK-11181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] prakhar jauhari updated SPARK-11181: Fix Version/s: (was: 1.5.2) > Spark Yarn : Spark reducing total executors count even when Dynamic > Allocation is disabled. > --- > > Key: SPARK-11181 > URL: https://issues.apache.org/jira/browse/SPARK-11181 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core, YARN >Affects Versions: 1.3.1 > Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. > All servers in cluster running Linux version 2.6.32. > Job in yarn-client mode. >Reporter: prakhar jauhari > Fix For: 1.3.2 > > > Spark driver reduces the total executor count even when Dynamic Allocation is > not enabled. > To reproduce this: > 1. A 2 node yarn setup: each DN has ~ 20GB mem and 4 cores. > 2. When the application launches and gets its required executors, one of the > DNs loses connectivity and is timed out. > 3. Spark issues a killExecutor for the executor on the DN which was timed > out. > 4. Even with dynamic allocation off, Spark's scheduler reduces > "targetNumExecutors". > 5. Thus the job runs with a reduced executor count. > Note: the severity of the issue increases if some of the DNs that were > running my job's executors lose connectivity intermittently: the Spark scheduler > reduces "targetNumExecutors" and thus does not ask for new executors on any other > nodes, causing the job to hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11180) DataFrameNaFunctions fills does not support Boolean Type:
Satya Narayan created SPARK-11180: - Summary: DataFrameNaFunctions fills does not support Boolean Type: Key: SPARK-11180 URL: https://issues.apache.org/jira/browse/SPARK-11180 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.1, 1.5.0 Reporter: Satya Narayan Priority: Minor Currently DataFrame.na.fill does not support the Boolean primitive type. We have use cases where, during data massaging/preparation, we want to fill boolean columns with a false/true value. Ex:
val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)]((1,null,null),(2,"SVP",true),(3,"Dir",false))).toDF("EmpId","Designation","isOfficer")
empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, isOfficer: boolean]
scala> empDf.show
+-----+-----------+---------+
|EmpId|Designation|isOfficer|
+-----+-----------+---------+
|    1|       null|     null|
|    2|        SVP|     true|
|    3|        Dir|    false|
+-----+-----------+---------+
We want to set "isOfficer" to false whenever it is null.
scala> empDf.na.fill(Map("isOfficer"->false))
throws the exception
java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean (false). at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) ...
Can you add support for Boolean to the na.fill function? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
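Until na.fill supports booleans, one possible workaround (a sketch assuming the Spark 1.5 DataFrame API and the empDf frame from the example above) is to coalesce the nullable column with a literal default:

```scala
// Replaces nulls in the boolean column with false, without using na.fill.
import org.apache.spark.sql.functions.{coalesce, col, lit}

val filled = empDf.withColumn("isOfficer", coalesce(col("isOfficer"), lit(false)))
```

coalesce returns the first non-null argument, so non-null values in "isOfficer" are left untouched.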
[jira] [Commented] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes
[ https://issues.apache.org/jira/browse/SPARK-11177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962913#comment-14962913 ] Josh Rosen commented on SPARK-11177: It looks like this is caused by MAPREDUCE-4470, which is not patched in Apache Hadoop 1.x releases. If Spark users cannot upgrade to Hadoop 2.x and absolutely need a fix for this, then one somewhat hacky solution is to use a modified copy of CombineFileInputFormat which lives in the Spark source tree and includes the three-line fix for MAPREDUCE-4470. While this works (I have tests!), it's not an approach which is suitable for inclusion in a Spark release: it's going to be borderline impossible to maintain source- and binary-compatibility with all of our supported Hadoop versions while using this approach. > sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero > bytes > --- > > Key: SPARK-11177 > URL: https://issues.apache.org/jira/browse/SPARK-11177 > Project: Spark > Issue Type: Sub-task >Reporter: Josh Rosen >Assignee: Josh Rosen > > From a user report: > {quote} > When I upload a series of text files to an S3 directory and one of the files > is empty (0 bytes). The sc.wholeTextFiles method stack traces. 
> java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:506) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:245) > at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:303) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > {quote} > It looks like this has been a longstanding issue: > * > http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-wholeTextFiles-error-td8872.html > * > https://stackoverflow.com/questions/31051107/read-multiple-files-from-a-directory-using-spark > * > https://forums.databricks.com/questions/1799/arrayindexoutofboundsexception-with-wholetextfiles.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
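The failure mode can be sketched in miniature: the split computation builds a per-file block array and then unconditionally indexes its first element, so a zero-byte file (zero blocks) triggers the out-of-bounds access. The following is an illustrative Python sketch, not the actual Hadoop code; all function names are made up, and the MAPREDUCE-4470-style fix is reduced to a simple emptiness check.

```python
# Illustrative Python sketch (not the Hadoop source; names are made up) of why
# a zero-byte file breaks split computation, and the MAPREDUCE-4470-style guard.

def blocks_for_file(length, block_size=64):
    """Split a file of `length` bytes into (offset, size) block pairs."""
    if length == 0:
        return []  # a zero-byte file contributes no blocks
    return [(off, min(block_size, length - off))
            for off in range(0, length, block_size)]

def first_block_unguarded(length):
    # Mirrors the buggy pattern: unconditionally index blocks[0], which
    # raises IndexError (cf. ArrayIndexOutOfBoundsException: 0) for an
    # empty file.
    return blocks_for_file(length)[0]

def first_block_guarded(length):
    # The guard: treat an empty file as contributing no splits at all.
    blocks = blocks_for_file(length)
    return blocks[0] if blocks else None
```

Pending a patched Hadoop version, a pragmatic user-side workaround is to filter zero-byte objects out of the input path before calling wholeTextFiles.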
[jira] [Updated] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.
[ https://issues.apache.org/jira/browse/SPARK-11181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] prakhar jauhari updated SPARK-11181: Target Version/s: 1.3.2 (was: 1.3.2, 1.5.2) > Spark Yarn : Spark reducing total executors count even when Dynamic > Allocation is disabled. > --- > > Key: SPARK-11181 > URL: https://issues.apache.org/jira/browse/SPARK-11181 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core, YARN >Affects Versions: 1.3.1 > Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. > All servers in cluster running Linux version 2.6.32. > Job in yarn-client mode. >Reporter: prakhar jauhari > Fix For: 1.3.2 > > > The Spark driver reduces the total executor count even when Dynamic Allocation is > not enabled. > To reproduce this: > 1. A 2-node YARN setup: each DN has ~20GB mem and 4 cores. > 2. When the application launches and gets its required executors, one of the > DNs loses connectivity and is timed out. > 3. Spark issues a killExecutor for the executor on the DN which was timed > out. > 4. Even with dynamic allocation off, Spark's scheduler reduces the > "targetNumExecutors". > 5. Thus the job runs with a reduced executor count. > Note: The severity of the issue increases: if some of the DNs that were > running my job's executors lose connectivity intermittently, the Spark scheduler > reduces "targetNumExecutors", thus not asking for new executors on any other > nodes, causing the job to hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
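The reported behavior can be modeled with a toy allocator. This is an assumption-laden sketch, not Spark's actual scheduler code: the point is only that if the target count shrinks unconditionally on executor loss, no replacement is ever requested when dynamic allocation is off.

```python
# Toy model (not Spark's scheduler code) of the reported behavior: on
# executor loss the target executor count shrinks even though dynamic
# allocation is off, so a replacement is never requested.

class ToyAllocator:
    def __init__(self, target, dynamic_allocation):
        self.target = target                          # desired executor count
        self.dynamic_allocation = dynamic_allocation  # feature flag

    def on_executor_lost_buggy(self):
        # Reported behavior: target drops unconditionally.
        self.target -= 1

    def on_executor_lost_fixed(self):
        # Expected behavior: with dynamic allocation off, the target is a
        # fixed request, so losing an executor should not lower it and a
        # replacement should be re-requested up to the original target.
        if self.dynamic_allocation:
            self.target -= 1
```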
[jira] [Assigned] (SPARK-6541) Executor table on Stage page should sort by Executor ID numerically, not lexically
[ https://issues.apache.org/jira/browse/SPARK-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6541: --- Assignee: (was: Apache Spark) > Executor table on Stage page should sort by Executor ID numerically, not > lexically > -- > > Key: SPARK-6541 > URL: https://issues.apache.org/jira/browse/SPARK-6541 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Ryan Williams >Priority: Minor > > Page loads with a table like this: > !http://f.cl.ly/items/0M273s053F2T2K1o441L/Screen%20Shot%202015-03-25%20at%207.07.08%20PM.png! > After clicking "Executor ID" to sort by that column, it sorts numerically: > !http://f.cl.ly/items/01161p3s2H070h1K1a0c/Screen%20Shot%202015-03-25%20at%207.08.26%20PM.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
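The underlying bug is plain string sorting: executor IDs are rendered as strings, so "10" sorts before "2". A minimal sketch follows; the `executor_key` helper that also handles a non-numeric "driver" row is a hypothetical fix, not the actual Web UI code.

```python
# Executor IDs are strings in the UI table, so the default sort is
# lexicographic: "10" < "2". Sorting numerically fixes the order; the
# executor_key helper (hypothetical, not the actual Web UI code) also
# handles non-numeric IDs such as "driver" by sorting them after the numbers.
ids = ["1", "10", "2", "21", "3"]

lexical = sorted(ids)           # lexicographic order
numeric = sorted(ids, key=int)  # numeric order

def executor_key(s):
    # Numeric IDs first (by value), then non-numeric IDs alphabetically.
    return (0, int(s)) if s.isdigit() else (1, s)

mixed = sorted(["driver", "1", "10", "2"], key=executor_key)
```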
[jira] [Updated] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Satya Narayan updated SPARK-11180: -- Summary: DataFrame.na.fill does not support Boolean Type: (was: DataFrameNaFunctions fills does not support Boolean Type:) > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > > Currently DataFrame.na.fill does not support Boolean primitive type. We > have use cases where while data massaging/preparation we want to fill boolean > columns with false/true value. > Ex: > val empDf = > sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)]((1,null,null),(2,"SVP",true),(3,"Dir",false))).toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > +-+---+-+ > |EmpId|Designation|isOfficer| > +-+---+-+ > |1| null| null| > |2|SVP| true| > |3|Dir|false| > +-+---+-+ > We want to set "isOfficer" false whenever there is null. > scala> empDf.na.fill(Map("isOfficer"->false)) > throws exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > Can you add support for Boolean into na.fill function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Satya Narayan updated SPARK-11180: -- Description: Currently DataFrame.na.fill does not support Boolean primitive type. We have use cases where while data massaging/preparation we want to fill boolean columns with false/true value. Ex: val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] ((1,null,null),(2,"SVP",true),(3,"Dir",false))) .toDF("EmpId","Designation","isOfficer") empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, isOfficer: boolean] scala> empDf.show |EmpId|Designation|isOfficer| |1| null| null| |2|SVP| true| |3|Dir|false| We want to set "isOfficer" false whenever there is null. scala> empDf.na.fill(Map("isOfficer"->false)) throws exception java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean (false). at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) ... Can you add support for Boolean into na.fill function. was: Currently DataFrame.na.fill does not support Boolean primitive type. We have use cases where while data massaging/preparation we want to fill boolean columns with false/true value. Ex: val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)]((1,null,null),(2,"SVP",true),(3,"Dir",false))).toDF("EmpId","Designation","isOfficer") empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, isOfficer: boolean] scala> empDf.show +-+---+-+ |EmpId|Designation|isOfficer| +-+---+-+ |1| null| null| |2|SVP| true| |3|Dir|false| +-+---+-+ We want to set "isOfficer" false whenever there is null. scala> empDf.na.fill(Map("isOfficer"->false)) throws exception java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean (false). at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) ... 
Can you add support for Boolean into na.fill function. > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > > Currently DataFrame.na.fill does not support Boolean primitive type. We > have use cases where while data massaging/preparation we want to fill boolean > columns with false/true value. > Ex: > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > We want to set "isOfficer" false whenever there is null. > scala> empDf.na.fill(Map("isOfficer"->false)) > throws exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > Can you add support for Boolean into na.fill function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
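The requested fill semantics are simple to state. Here is a sketch over plain Python rows (this is not the Spark API; the column names are taken from the ticket's example):

```python
# Sketch of the requested na.fill(Map("isOfficer" -> false)) semantics over
# plain Python rows (not the Spark API; column names come from the example).
rows = [
    {"EmpId": 1, "Designation": None,  "isOfficer": None},
    {"EmpId": 2, "Designation": "SVP", "isOfficer": True},
    {"EmpId": 3, "Designation": "Dir", "isOfficer": False},
]

def na_fill(rows, replacements):
    """Replace None with the per-column default; leave other columns alone."""
    return [
        {col: (replacements[col] if val is None and col in replacements else val)
         for col, val in row.items()}
        for row in rows
    ]

filled = na_fill(rows, {"isOfficer": False})
```

Note that columns without a replacement (here, "Designation") keep their nulls, matching how fill with a column map is expected to behave.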
[jira] [Commented] (SPARK-6541) Executor table on Stage page should sort by Executor ID numerically, not lexically
[ https://issues.apache.org/jira/browse/SPARK-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962895#comment-14962895 ] Apache Spark commented on SPARK-6541: - User 'jbonofre' has created a pull request for this issue: https://github.com/apache/spark/pull/9165 > Executor table on Stage page should sort by Executor ID numerically, not > lexically > -- > > Key: SPARK-6541 > URL: https://issues.apache.org/jira/browse/SPARK-6541 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Ryan Williams >Priority: Minor > > Page loads with a table like this: > !http://f.cl.ly/items/0M273s053F2T2K1o441L/Screen%20Shot%202015-03-25%20at%207.07.08%20PM.png! > After clicking "Executor ID" to sort by that column, it sorts numerically: > !http://f.cl.ly/items/01161p3s2H070h1K1a0c/Screen%20Shot%202015-03-25%20at%207.08.26%20PM.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6541) Executor table on Stage page should sort by Executor ID numerically, not lexically
[ https://issues.apache.org/jira/browse/SPARK-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6541: --- Assignee: Apache Spark > Executor table on Stage page should sort by Executor ID numerically, not > lexically > -- > > Key: SPARK-6541 > URL: https://issues.apache.org/jira/browse/SPARK-6541 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Ryan Williams >Assignee: Apache Spark >Priority: Minor > > Page loads with a table like this: > !http://f.cl.ly/items/0M273s053F2T2K1o441L/Screen%20Shot%202015-03-25%20at%207.07.08%20PM.png! > After clicking "Executor ID" to sort by that column, it sorts numerically: > !http://f.cl.ly/items/01161p3s2H070h1K1a0c/Screen%20Shot%202015-03-25%20at%207.08.26%20PM.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11167) Incorrect type resolution on heterogeneous data structures
[ https://issues.apache.org/jira/browse/SPARK-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962900#comment-14962900 ] Sun Rui commented on SPARK-11167: - For a DataFrame, each column is a collection of values of same type. No heterogeneous values are expected for a specific column. We can enhance the robustness of inferring type by adding check for such case and report error. > Incorrect type resolution on heterogeneous data structures > -- > > Key: SPARK-11167 > URL: https://issues.apache.org/jira/browse/SPARK-11167 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Maciej Szymkiewicz > > If structure contains heterogeneous incorrectly assigns type of the > encountered element as type of a whole structure. This problem affects both > lists: > {code} > SparkR:::infer_type(list(a=1, b="a") > ## [1] "array" > SparkR:::infer_type(list(a="a", b=1)) > ## [1] "array" > {code} > and environments: > {code} > SparkR:::infer_type(as.environment(list(a=1, b="a"))) > ## [1] "map" > SparkR:::infer_type(as.environment(list(a="a", b=1))) > ## [1] "map " > {code} > This results in errors during data collection and other operations on > DataFrames: > {code} > ldf <- data.frame(row.names=1:2) > ldf$foo <- list(list("1", 2), list(3, 4)) > sdf <- createDataFrame(sqlContext, ldf) > collect(sdf) > ## 15/10/17 17:58:57 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID > 9) > ## scala.MatchError: 2.0 (of class java.lang.Double) > ## ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
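Sun Rui's suggestion (check for heterogeneity instead of trusting the first element) can be sketched as follows. This is illustrative Python, not SparkR's infer_type:

```python
# Illustrative Python (not SparkR's infer_type): inferring a column type from
# only the first element silently mis-types heterogeneous data; checking all
# elements lets us report an error instead, as suggested in the comment.

def infer_type_first(values):
    # Current behavior in miniature: trust the first element.
    return type(values[0]).__name__

def infer_type_checked(values):
    # Suggested hardening: verify the column is homogeneous before inferring.
    types = {type(v).__name__ for v in values}
    if len(types) > 1:
        raise TypeError("heterogeneous column: " + ", ".join(sorted(types)))
    return types.pop()
```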
[jira] [Assigned] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions
[ https://issues.apache.org/jira/browse/SPARK-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11179: Assignee: (was: Apache Spark) > Push filters through aggregate if filters are subset of 'group by' expressions > -- > > Key: SPARK-11179 > URL: https://issues.apache.org/jira/browse/SPARK-11179 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nitin Goyal >Priority: Minor > Fix For: 1.6.0 > > > Push filters through aggregate if filters are subset of 'group by' > expressions. This optimisation can be added in Spark SQL's Optimizer class -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
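The equivalence behind this optimisation: when a filter predicate refers only to grouping expressions, a group either passes the filter entirely or not at all, so the filter can run before the aggregate. A plain-Python sketch (not Catalyst) demonstrating the equivalence:

```python
# Plain-Python sketch (not Catalyst) of the equivalence that justifies the
# optimisation: a predicate over a grouping key selects whole groups, so
# filtering before the aggregate gives the same result as filtering after.
from collections import defaultdict

rows = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("c", 5)]

def group_sum(rows):
    """GROUP BY key, SUM(value)."""
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += value
    return dict(acc)

# Filter applied after the aggregate (the unoptimised plan)...
filtered_after = {k: v for k, v in group_sum(rows).items() if k != "b"}
# ...equals the aggregate over pre-filtered rows (filter pushed down).
filtered_before = group_sum([(k, v) for k, v in rows if k != "b"])
```

Pushing the filter down is profitable because fewer rows reach the (typically more expensive) aggregation step.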
[jira] [Resolved] (SPARK-11128) strange NPE when writing in non-existing S3 bucket
[ https://issues.apache.org/jira/browse/SPARK-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11128. --- Resolution: Not A Problem Not a problem with Spark, that is. > strange NPE when writing in non-existing S3 bucket > -- > > Key: SPARK-11128 > URL: https://issues.apache.org/jira/browse/SPARK-11128 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.5.1 >Reporter: mathieu despriee >Priority: Minor > > For the record, as it's relatively minor, and related to s3n (not tested with > s3a). > By mistake, we tried writing a parquet dataframe to a non-existing s3 bucket, > with a simple df.write.parquet(s3path). > We got a NPE (see stack trace below), which is very misleading. > java.lang.NullPointerException > at > org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:73) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at > 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10352) Replace SQLTestData internal usages of String with UTF8String
[ https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963194#comment-14963194 ] Harsh Rathi commented on SPARK-10352: - Why is this not a problem? I am writing a custom explode function. If I try to use CatalystTypeConverters for type conversions, it gives an error in StructConverter since InternalRow is not added as a case there. If I don't use CatalystTypeConverters, it gives a casting error saying java.lang.String cannot be cast to UTF8String. > Replace SQLTestData internal usages of String with UTF8String > - > > Key: SPARK-10352 > URL: https://issues.apache.org/jira/browse/SPARK-10352 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Feynman Liang > > Running the code: > {code} > val inputString = "abc" > val row = InternalRow.apply(inputString) > val unsafeRow = > UnsafeProjection.create(Array[DataType](StringType)).apply(row) > {code} > generates the error: > {code} > [info] java.lang.ClassCastException: java.lang.String cannot be cast to > org.apache.spark.unsafe.types.UTF8String > [info] at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46) > ***snip*** > {code} > Although {{StringType}} should in theory only have internal type > {{UTF8String}}, we [are inconsistent with this > constraint|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L131] > and being more strict would [break existing > code|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestData.scala#L41] > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11184) Declare most of .mllib code not-Experimental
Sean Owen created SPARK-11184: - Summary: Declare most of .mllib code not-Experimental Key: SPARK-11184 URL: https://issues.apache.org/jira/browse/SPARK-11184 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.1 Reporter: Sean Owen Priority: Minor Comments please [~mengxr] and [~josephkb]: my proposal is to remove most {{@Experimental}} annotations from the {{.mllib}} code, on the theory that it's not intended to change much more. I can easily take a shot at this, but wanted to collect thoughts before I started. Does the theory sound reasonable? Part of it is a desire to keep this annotation meaningful, and also encourage people to at least view MLlib as stable, because it is. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11182) HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode
Liangliang Gu created SPARK-11182: - Summary: HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode Key: SPARK-11182 URL: https://issues.apache.org/jira/browse/SPARK-11182 Project: Spark Issue Type: Bug Components: YARN Reporter: Liangliang Gu In HA mode, DFSClient will generate an HDFS Delegation Token for each NameNode automatically; these tokens will not be updated when Spark updates credentials for the current user. Spark should update these tokens in order to avoid token-expired errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
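A toy model of the report (the aliases and values below are invented; this is not the Hadoop Credentials API): a refresh that merges tokens by alias never touches the per-NameNode entries the HA DFSClient created automatically, so those keep their old, soon-to-expire values.

```python
# Toy model of the stale-token problem (aliases and values invented; this is
# not the Hadoop Credentials API). A refresh merges tokens by alias, so the
# per-NameNode entries created automatically by the HA DFSClient are never
# overwritten and keep their old, soon-to-expire values.
current = {
    "ha-hdfs:cluster": "token-old",       # logical HA alias, refreshed below
    "nn1.example.com:8020": "token-old",  # per-NameNode aliases, never refreshed
    "nn2.example.com:8020": "token-old",
}
fresh = {"ha-hdfs:cluster": "token-new"}

current.update(fresh)  # merge-by-alias, leaving the per-NameNode entries stale
```

The ticket asks Spark to refresh the per-NameNode tokens as well, so the stale entries above would also be replaced.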
[jira] [Updated] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.
[ https://issues.apache.org/jira/browse/SPARK-11181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11181: -- Flags: (was: Patch,Important) Target Version/s: (was: 1.3.2) Fix Version/s: (was: 1.3.2) [~prakhar088] Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a JIRA, as there are a number of issues here: it can't have a Fix/Target verison; the flags aren't valid. Please try reproducing vs master as 1.3.1 is relatively old, and many things have been fixed since. I suspect this is a duplicate. > Spark Yarn : Spark reducing total executors count even when Dynamic > Allocation is disabled. > --- > > Key: SPARK-11181 > URL: https://issues.apache.org/jira/browse/SPARK-11181 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core, YARN >Affects Versions: 1.3.1 > Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. > All servers in cluster running Linux version 2.6.32. > Job in yarn-client mode. >Reporter: prakhar jauhari > > Spark driver reduces total executors count even when Dynamic Allocation is > not enabled. > To reproduce this: > 1. A 2 node yarn setup : each DN has ~ 20GB mem and 4 cores. > 2. When the application launches and gets it required executors, One of the > DN's losses connectivity and is timed out. > 3. Spark issues a killExecutor for the executor on the DN which was timed > out. > 4. Even with dynamic allocation off, spark's scheduler reduces the > "targetNumExecutors". > 5. Thus the job runs with reduced executor count. > Note : The severity of the issue increases : If some of the DN that were > running my job's executors lose connectivity intermittently, spark scheduler > reduces "targetNumExecutors", thus not asking for new executors on any other > nodes, causing the job to hang. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10921) Completely remove the use of SparkContext.preferredNodeLocationData
[ https://issues.apache.org/jira/browse/SPARK-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10921: -- Assignee: Jacek Laskowski > Completely remove the use of SparkContext.preferredNodeLocationData > --- > > Key: SPARK-10921 > URL: https://issues.apache.org/jira/browse/SPARK-10921 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.5.1 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Minor > Fix For: 1.6.0 > > > SPARK-8949 obsoleted the use of {{SparkContext.preferredNodeLocationData}} > yet the code makes it less obvious as it says (see > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L93-L96): > {code} > // This is used only by YARN for now, but should be relevant to other > cluster types (Mesos, > // etc) too. This is typically generated from > InputFormatInfo.computePreferredLocations. It > // contains a map from hostname to a list of input format splits on the > host. > private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = > Map() > {code} > It turns out that there are places where the initialization does take place > that only adds up to the confusion. > When you search for the use of {{SparkContext.preferredNodeLocationData}}, > you'll find 3 places - one constructor marked {{@deprecated}}, the other with > {{logWarning}} telling us that _"Passing in preferred locations has no > effect at all, see SPARK-8949"_, and in > {{org.apache.spark.deploy.yarn.ApplicationMaster.registerAM}} method. > There is no consistent approach to deal with it given it's no longer used in > theory. 
> [org.apache.spark.deploy.yarn.ApplicationMaster.registerAM|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L234-L265] > method > caught my eye and I found that it does the following in > client.register: > {code} > if (sc != null) sc.preferredNodeLocationData else Map() > {code} > However, {{client.register}} [ignores the input parameter > completely|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L47-L78], > but the scaladoc says (note {{preferredNodeLocations}} param): > {code} > /** >* Registers the application master with the RM. >* >* @param conf The Yarn configuration. >* @param sparkConf The Spark configuration. >* @param preferredNodeLocations Map with hints about where to allocate > containers. >* @param uiAddress Address of the SparkUI. >* @param uiHistoryAddress Address of the application on the History Server. >*/ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10633) Persisting Spark stream to MySQL - Spark tries to create the table for every stream even if it exist already.
[ https://issues.apache.org/jira/browse/SPARK-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10633. --- Resolution: Not A Problem > Persisting Spark stream to MySQL - Spark tries to create the table for every > stream even if it exist already. > - > > Key: SPARK-10633 > URL: https://issues.apache.org/jira/browse/SPARK-10633 > Project: Spark > Issue Type: Bug > Components: SQL, Streaming >Affects Versions: 1.4.0, 1.5.0 > Environment: Ubuntu 14.04 > IntelliJ IDEA 14.1.4 > sbt > mysql-connector-java 5.1.35 (Tested and working with Spark 1.3.1) >Reporter: Lunen > > Persisting Spark Kafka stream to MySQL > Spark 1.4 + tries to create a table automatically every time the stream gets > sent to a specified table. > Please note, Spark 1.3.1 works. > Code sample: > val url = "jdbc:mysql://host:port/db?user=user=password > val crp = RowSetProvider.newFactory() > val crsSql: CachedRowSet = crp.createCachedRowSet() > val crsTrg: CachedRowSet = crp.createCachedRowSet() > crsSql.beforeFirst() > crsTrg.beforeFirst() > //Read Stream from Kafka > //Produce SQL INSERT STRING > > streamT.foreachRDD { rdd => > if (rdd.toLocalIterator.nonEmpty) { > sqlContext.read.json(rdd).registerTempTable(serverEvents + "_events") > while (crsSql.next) { > sqlContext.sql("SQL INSERT STRING").write.jdbc(url, "SCHEMA_NAME", > new Properties) > println("Persisted Data: " + 'SQL INSERT STRING') > } > crsSql.beforeFirst() > } > stmt.close() > conn.close() > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11182) HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode
[ https://issues.apache.org/jira/browse/SPARK-11182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11182: Assignee: (was: Apache Spark) > HDFS Delegation Token will be expired when calling > "UserGroupInformation.getCurrentUser.addCredentials" in HA mode > -- > > Key: SPARK-11182 > URL: https://issues.apache.org/jira/browse/SPARK-11182 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Liangliang Gu > > In HA mode, DFSClient will generate HDFS Delegation Token for each Name Node > automatically, which will not be updated when Spark update Credentials for > the current user. > Spark should update these tokens in order to avoid Token Expired Error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11182) HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode
[ https://issues.apache.org/jira/browse/SPARK-11182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963092#comment-14963092 ] Apache Spark commented on SPARK-11182: -- User 'marsishandsome' has created a pull request for this issue: https://github.com/apache/spark/pull/9168 > HDFS Delegation Token will be expired when calling > "UserGroupInformation.getCurrentUser.addCredentials" in HA mode > -- > > Key: SPARK-11182 > URL: https://issues.apache.org/jira/browse/SPARK-11182 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Liangliang Gu > > In HA mode, DFSClient will generate HDFS Delegation Token for each Name Node > automatically, which will not be updated when Spark update Credentials for > the current user. > Spark should update these tokens in order to avoid Token Expired Error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11182) HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode
[ https://issues.apache.org/jira/browse/SPARK-11182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963091#comment-14963091 ] Liangliang Gu commented on SPARK-11182: --- https://github.com/apache/spark/pull/9168 > HDFS Delegation Token will be expired when calling > "UserGroupInformation.getCurrentUser.addCredentials" in HA mode > -- > > Key: SPARK-11182 > URL: https://issues.apache.org/jira/browse/SPARK-11182 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Liangliang Gu > > In HA mode, DFSClient will generate HDFS Delegation Token for each Name Node > automatically, which will not be updated when Spark update Credentials for > the current user. > Spark should update these tokens in order to avoid Token Expired Error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11182) HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode
[ https://issues.apache.org/jira/browse/SPARK-11182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11182: Assignee: Apache Spark > HDFS Delegation Token will be expired when calling > "UserGroupInformation.getCurrentUser.addCredentials" in HA mode > -- > > Key: SPARK-11182 > URL: https://issues.apache.org/jira/browse/SPARK-11182 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Liangliang Gu >Assignee: Apache Spark > > In HA mode, DFSClient will generate HDFS Delegation Token for each Name Node > automatically, which will not be updated when Spark update Credentials for > the current user. > Spark should update these tokens in order to avoid Token Expired Error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11183) enable support for mesos 0.24+
Ioannis Polyzos created SPARK-11183: --- Summary: enable support for mesos 0.24+ Key: SPARK-11183 URL: https://issues.apache.org/jira/browse/SPARK-11183 Project: Spark Issue Type: Bug Components: Deploy, Mesos Reporter: Ioannis Polyzos In Mesos 0.24, the Mesos leader info in ZK changed to JSON; this results in Spark failing to run on 0.24+. References: https://issues.apache.org/jira/browse/MESOS-2340 http://mail-archives.apache.org/mod_mbox/mesos-commits/201506.mbox/%3ced4698dc56444bcdac3bdf19134db...@git.apache.org%3E https://github.com/mesos/elasticsearch/issues/338 https://github.com/spark-jobserver/spark-jobserver/issues/267 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5250) EOFException when reading gzipped files from S3 with wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963110#comment-14963110 ] Mojmir Vinkler commented on SPARK-5250: --- Yes, it's caused by reading a corrupt file (we only experienced this for compressed (gzipped) files). I think the file got corrupted when it was saved to S3, but we used boto for that, not Spark. What's weird is that I'm able to read the file with pandas without any problems. > EOFException in when reading gzipped files from S3 with wholeTextFiles > -- > > Key: SPARK-5250 > URL: https://issues.apache.org/jira/browse/SPARK-5250 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Mojmir Vinkler >Priority: Critical > > I get an `EOFException` error when reading *some* gzipped files using > `sc.wholeTextFiles`. It happens to just a few files, I thought that the file > is corrupted, but I was able to read it without problems using `sc.textFile` > (and pandas). 
> Traceback for command > `sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()` > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > > 1 sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect() > /home/ubuntu/databricks/spark/python/pyspark/rdd.py in collect(self) > 674 """ > 675 with SCCallSiteSync(self.context) as css: > --> 676 bytesInJava = self._jrdd.collect().iterator() > 677 return list(self._collect_iterator_through_file(bytesInJava)) > 678 > /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py > in __call__(self, *args) > 536 answer = self.gateway_client.send_command(command) > 537 return_value = get_return_value(answer, self.gateway_client, > --> 538 self.target_id, self.name) > 539 > 540 for temp_arg in temp_args: > /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling o1576.collect. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 41.0 (TID 4720, ip-10-0-241-126.ec2.internal): java.io.EOFException: > Unexpected end of input stream > at > org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77) > at java.io.InputStream.read(InputStream.java:101) > at com.google.common.io.ByteStreams.copy(ByteStreams.java:207) > at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:252) > at > org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:73) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at 
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) >
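The stack trace above bottoms out in {{java.io.EOFException: Unexpected end of input stream}}, which is the characteristic failure for a gzip stream that ends before its end-of-stream marker. As an illustrative sketch (plain Python standard library, no Spark or S3 involved), truncating an otherwise valid gzip payload reproduces the same failure mode:

```python
import gzip
import io

# Build a small but valid gzip payload in memory.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"a" * 10000)
data = buf.getvalue()

# Simulate a corrupt upload by cutting the stream off mid-way.
truncated = data[: len(data) // 2]

try:
    gzip.decompress(truncated)
    hit_eof = False
except EOFError:
    # "Compressed file ended before the end-of-stream marker was reached"
    hit_eof = True

print(hit_eof)  # True
```

Why {{sc.textFile}} tolerated the same file is not clear from the thread; the sketch only shows that truncation alone is enough to produce this class of exception.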
[jira] [Resolved] (SPARK-10921) Completely remove the use of SparkContext.preferredNodeLocationData
[ https://issues.apache.org/jira/browse/SPARK-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10921. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8976 [https://github.com/apache/spark/pull/8976] > Completely remove the use of SparkContext.preferredNodeLocationData > --- > > Key: SPARK-10921 > URL: https://issues.apache.org/jira/browse/SPARK-10921 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.5.1 >Reporter: Jacek Laskowski >Priority: Minor > Fix For: 1.6.0 > > > SPARK-8949 obsoleted the use of {{SparkContext.preferredNodeLocationData}} > yet the code makes it less obvious as it says (see > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L93-L96): > {code} > // This is used only by YARN for now, but should be relevant to other > cluster types (Mesos, > // etc) too. This is typically generated from > InputFormatInfo.computePreferredLocations. It > // contains a map from hostname to a list of input format splits on the > host. > private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = > Map() > {code} > It turns out that there are places where the initialization does take place > that only adds up to the confusion. > When you search for the use of {{SparkContext.preferredNodeLocationData}}, > you'll find 3 places - one constructor marked {{@deprecated}}, the other with > {{logWarning}} telling us that _"Passing in preferred locations has no > effect at all, see SPARK-8949"_, and in > {{org.apache.spark.deploy.yarn.ApplicationMaster.registerAM}} method. > There is no consistent approach to deal with it given it's no longer used in > theory. 
> [org.apache.spark.deploy.yarn.ApplicationMaster.registerAM|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L234-L265] > method > caught my eye and I found that it does the following in > client.register: > {code} > if (sc != null) sc.preferredNodeLocationData else Map() > {code} > However, {{client.register}} [ignores the input parameter > completely|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L47-L78], > but the scaladoc says (note {{preferredNodeLocations}} param): > {code} > /** >* Registers the application master with the RM. >* >* @param conf The Yarn configuration. >* @param sparkConf The Spark configuration. >* @param preferredNodeLocations Map with hints about where to allocate > containers. >* @param uiAddress Address of the SparkUI. >* @param uiHistoryAddress Address of the application on the History Server. >*/ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10861) Univariate Statistics: Adding range support as UDAF
[ https://issues.apache.org/jira/browse/SPARK-10861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963029#comment-14963029 ] Jeff Zhang commented on SPARK-10861: [~JihongMA] what's your progress on this ? > Univariate Statistics: Adding range support as UDAF > --- > > Key: SPARK-10861 > URL: https://issues.apache.org/jira/browse/SPARK-10861 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Range support for continuous -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6645) StructField/StructType and related classes are not in the Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-6645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963345#comment-14963345 ] Rishabh Bhardwaj commented on SPARK-6645: - I can see StructField/StructType classes in ScalaDoc in org.apache.sql.types package https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructField https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType Can you please elaborate? Correct me If I have misunderstood something here. > StructField/StructType and related classes are not in the Scaladoc > -- > > Key: SPARK-6645 > URL: https://issues.apache.org/jira/browse/SPARK-6645 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.3.0 >Reporter: Aaron Defazio >Priority: Minor > > The current programming guide uses StructField in the Scala examples, yet it > doesn't appear to exist in the Scaladoc. This is related to SPARK-6592, in > that several classes that a user might use do not appear in the Scaladoc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11185) Add more task metrics to the "all Stages Page"
Thomas Graves created SPARK-11185: - Summary: Add more task metrics to the "all Stages Page" Key: SPARK-11185 URL: https://issues.apache.org/jira/browse/SPARK-11185 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.5.1 Reporter: Thomas Graves The "All Stages Page" on the History page could have more information about the stage to allow users to quickly see which stage potentially has long tasks, indicators of skewed data or bad nodes, etc. Currently, to get this information you have to click on every stage. If you have hundreds of stages this can be very cumbersome. For instance, pulling out the max task time and the median to the all stages page would allow me to see the difference, and if the max task time is much greater than the median this stage may have had tasks with problems. We already had some discussion about this under https://github.com/apache/spark/pull/9051 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
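The max-vs-median comparison suggested in the ticket can be sketched outside Spark. This is a hypothetical indicator, not Spark's API; the function name and the 2x threshold are illustrative assumptions:

```python
import statistics

def looks_skewed(task_times_ms, ratio=2.0):
    # Flag a stage whose slowest task took more than `ratio` times the
    # median task time -- a cheap hint of skewed data or a bad node.
    return max(task_times_ms) > ratio * statistics.median(task_times_ms)

print(looks_skewed([100, 110, 120, 900]))  # True: one straggler task
print(looks_skewed([100, 110, 120, 130]))  # False: evenly sized tasks
```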
[jira] [Updated] (SPARK-11186) Caseness inconsistency between SQLContext and HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola updated SPARK-11186: - Description: Default catalog behaviour for caseness is different in {{SQLContext}} and {{HiveContext}}. {code} test("Catalog caseness (SQL)") { val sqlc = new SQLContext(sc) val relationName = "MyTable" sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation { override def sqlContext: SQLContext = sqlc override def schema: StructType = StructType(Nil) })) val tables = sqlc.tableNames() assert(tables.contains(relationName)) } test("Catalog caseness (Hive)") { val sqlc = new HiveContext(sc) val relationName = "MyTable" sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation { override def sqlContext: SQLContext = sqlc override def schema: StructType = StructType(Nil) })) val tables = sqlc.tableNames() assert(tables.contains(relationName)) } {code} Looking at {{HiveContext#SQLSession}}, I see this is the intended behaviour. But the reason that this is needed seems undocumented (both in the manual or in the source code comments). was: Default catalog behaviour for caseness is different in {{SQLContext}} and {{HiveContext}}. 
{code} test("Catalog caseness (SQL)") { val sqlc = new SQLContext(sc) val relationName = "MyTable" sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation { override def sqlContext: SQLContext = sqlc override def schema: StructType = StructType(Nil) })) val tables = sqlc.tableNames() assert(tables.contains(relationName)) } test("Catalog caseness (Hive)") { val sqlc = new HiveContext(sc) val relationName = "MyTable" sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation { override def sqlContext: SQLContext = sqlc override def schema: StructType = StructType(Nil) })) val tables = sqlc.tableNames() assert(tables.contains(relationName)) } {/code} Looking at {{HiveContext#SQLSession}}, I see this is the intended behaviour. But the reason that this is needed seems undocumented (both in the manual or in the source code comments). > Caseness inconsistency between SQLContext and HiveContext > - > > Key: SPARK-11186 > URL: https://issues.apache.org/jira/browse/SPARK-11186 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Santiago M. Mola >Priority: Minor > > Default catalog behaviour for caseness is different in {{SQLContext}} and > {{HiveContext}}. 
> {code} > test("Catalog caseness (SQL)") { > val sqlc = new SQLContext(sc) > val relationName = "MyTable" > sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new > BaseRelation { > override def sqlContext: SQLContext = sqlc > override def schema: StructType = StructType(Nil) > })) > val tables = sqlc.tableNames() > assert(tables.contains(relationName)) > } > test("Catalog caseness (Hive)") { > val sqlc = new HiveContext(sc) > val relationName = "MyTable" > sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new > BaseRelation { > override def sqlContext: SQLContext = sqlc > override def schema: StructType = StructType(Nil) > })) > val tables = sqlc.tableNames() > assert(tables.contains(relationName)) > } > {code} > Looking at {{HiveContext#SQLSession}}, I see this is the intended behaviour. > But the reason that this is needed seems undocumented (both in the manual or > in the source code comments). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
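For readers unfamiliar with the behaviour being contrasted: Hive-style catalogs lowercase identifiers on registration, so the exact-case {{contains(relationName)}} assertion above can pass against one context and fail against the other. A toy sketch of that effect (illustrative Python, not Spark's catalog API):

```python
class CaseInsensitiveCatalog:
    """Toy model of a Hive-style catalog that lowercases identifiers."""

    def __init__(self):
        self._tables = {}

    def register_table(self, name):
        # Normalize on registration, as Hive-style catalogs do.
        self._tables[name.lower()] = name

    def table_names(self):
        return list(self._tables)

cat = CaseInsensitiveCatalog()
cat.register_table("MyTable")

# The exact-case lookup that passes against a case-sensitive catalog fails here:
print("MyTable" in cat.table_names())  # False
print("mytable" in cat.table_names())  # True
```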
[jira] [Comment Edited] (SPARK-11162) Allow enabling debug logging from the command line
[ https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963509#comment-14963509 ] Ryan Williams edited comment on SPARK-11162 at 10/19/15 4:01 PM: - In the second message (of 2 total, afaict) on the thread, "eric wong" lays out two steps for enabling DEBUG logging; the first step involves changing a local copy of {{log4j.properties}}, whereas the second involves passing certain parameters to {{spark-submit}}. I was hoping for a way to not have to modify a local {{log4j.properties}} file, but to get debug logging by only passing parameters to {{spark-submit}}. I suppose this issue is confusingly named since technically all of this can be accomplished "from the command line", so I'll rename it to reflect that I'd like a config flag to {{spark-submit}} to enable different logging levels. Also, even modifying {{log4j.properties}} in various places and passing it to the {{--files}} flag, I am unable to get DEBUG logging on the client in {{yarn-client}} mode, i.e. {{--files log4j.properties}} makes all of my YARN containers have debug logging, but I still only get INFO logging in e.g. my {{spark-shell}} session that is running locally. was (Author: rdub): In the second message (of 2 total, afaict) on the thread, "eric wong" lays out two steps for enabling DEBUG logging; the first step involves changing a local copy of {{log4j.properties}}, whereas the second involves passing certain parameters to {{spark-submit}}. I was hoping for a way to not have to modify a local {{log4j.properties}} file, but to get debug logging by only passing parameters to {{spark-submit}}. 
> Allow enabling debug logging from the command line > -- > > Key: SPARK-11162 > URL: https://issues.apache.org/jira/browse/SPARK-11162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > Per [~vanzin] on [the user > list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html], > it would be nice if debug-logging could be enabled from the command line. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11162) Allow enabling debug logging from the command line
[ https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963509#comment-14963509 ] Ryan Williams commented on SPARK-11162: --- In the second message (of 2 total, afaict) on the thread, "eric wong" lays out two steps for enabling DEBUG logging; the first step involves changing a local copy of {{log4j.properties}}, whereas the second involves passing certain parameters to {{spark-submit}}. I was hoping for a way to not have to modify a local {{log4j.properties}} file, but to get debug logging by only passing parameters to {{spark-{submit,shell}}}. > Allow enabling debug logging from the command line > -- > > Key: SPARK-11162 > URL: https://issues.apache.org/jira/browse/SPARK-11162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > Per [~vanzin] on [the user > list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html], > it would be nice if debug-logging could be enabled from the command line. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11162) Allow enabling debug logging from the command line
[ https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963509#comment-14963509 ] Ryan Williams edited comment on SPARK-11162 at 10/19/15 3:59 PM: - In the second message (of 2 total, afaict) on the thread, "eric wong" lays out two steps for enabling DEBUG logging; the first step involves changing a local copy of {{log4j.properties}}, whereas the second involves passing certain parameters to {{spark-submit}}. I was hoping for a way to not have to modify a local {{log4j.properties}} file, but to get debug logging by only passing parameters to {{spark-submit}}. was (Author: rdub): In the second message (of 2 total, afaict) on the thread, "eric wong" lays out two steps for enabling DEBUG logging; the first step involves changing a local copy of {{log4j.properties}}, whereas the second involves passing certain parameters to {{spark-submit}}. I was hoping for a way to not have to modify a local {{log4j.properties}} file, but to get debug logging by only passing parameters to {{spark-{submit,shell}}}. > Allow enabling debug logging from the command line > -- > > Key: SPARK-11162 > URL: https://issues.apache.org/jira/browse/SPARK-11162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > Per [~vanzin] on [the user > list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html], > it would be nice if debug-logging could be enabled from the command line. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11162) Allow enabling debug logging from the command line
[ https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963511#comment-14963511 ] Sean Owen commented on SPARK-11162: --- Related: https://issues.apache.org/jira/browse/SPARK-11105 In general configuring log4j does mean configuring a log4j.properties. You should be able to achieve something similar with -D flags but I find it ugly. > Allow enabling debug logging from the command line > -- > > Key: SPARK-11162 > URL: https://issues.apache.org/jira/browse/SPARK-11162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > Per [~vanzin] on [the user > list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html], > it would be nice if debug-logging could be enabled from the command line. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
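To make the {{log4j.properties}} option in this thread concrete: below is a minimal properties file that turns on DEBUG for the root logger, patterned on Spark's default console template. How it gets wired up varies by Spark version and deploy mode, so treat the flags in the note afterwards as the commonly documented approach rather than a guaranteed recipe.

```
# log4j-debug.properties -- root logger at DEBUG, printed to stderr
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

It can be shipped to executors with {{--files log4j-debug.properties}} and pointed at on the driver with something like {{--driver-java-options "-Dlog4j.configuration=file:log4j-debug.properties"}}, which is consistent with the observation above that {{--files}} alone only affects the YARN containers, not the local {{spark-shell}}.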
[jira] [Commented] (SPARK-11161) Viewing the web UI for the first time unpersists a cached RDD
[ https://issues.apache.org/jira/browse/SPARK-11161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963522#comment-14963522 ] Sean Owen commented on SPARK-11161: --- Why would it be useful to continue to cache an RDD that can't be used any more? there is no more reference to it in the controlling driver program in this case, and it's the only thing that can use it. There isn't an RDD registry, but if there were, then it would prevent this situation from occurring, which seems like what you'd expect at least. I expect RDDs to behave like JVM objects in this regard. I would not expect something to be hanging on to references to all my objects since I have the references I need, and indeed, doing so prevents the GC that I want. You can't unpersist RDDs from the web UI, though that would make sense as a feature. That's something different. > Viewing the web UI for the first time unpersists a cached RDD > - > > Key: SPARK-11161 > URL: https://issues.apache.org/jira/browse/SPARK-11161 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > This one is a real head-scratcher. [Here's a > screencast|http://f.cl.ly/items/0P0N413t1V3j2B0A3V1a/Screen%20Recording%202015-10-16%20at%2005.43%20PM.gif]: > !http://f.cl.ly/items/0P0N413t1V3j2B0A3V1a/Screen%20Recording%202015-10-16%20at%2005.43%20PM.gif! > The three windows, left-to-right, are: > * a {{spark-shell}} on YARN with dynamic allocation enabled, at rest with one > executor. [Here's an example app's > environment|https://gist.github.com/ryan-williams/6dd3502d5d0de2f030ac]. > * [Spree|https://github.com/hammerlab/spree], opened to the above app's > "Storage" tab. > * my YARN resource manager, showing a link to the web UI running on the > driver. > At the start, nothing has been run in the shell, and I've not visited the web > UI. 
> I run a simple job in the shell and cache a small RDD that it computes: > {code} > sc.parallelize(1 to 1, 100).map(_ % 100 -> 1).reduceByKey(_+_, > 100).setName("foo").cache.count > {code} > As the second stage runs, you can see the partitions show up as cached in > Spree. > After the job finishes, a few requested executors continue to fill in, which > you can see in the console at left or the nav bar of Spree in the middle. > Once that has finished, everything is at rest with the RDD "foo" 100% cached. > Then, I click the YARN RM's "ApplicationMaster" link which loads the web UI > on the driver for the first time. > Immediately, the console prints some activity, including that RDD 2 has been > removed: > {code} > 15/10/16 21:43:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 > on 172.29.46.15:33156 in memory (size: 1517.0 B, free: 7.2 GB) > 15/10/16 21:43:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 > on demeter-csmaz10-17.demeter.hpc.mssm.edu:56997 in memory (size: 1517.0 B, > free: 12.2 GB) > 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned accumulator 2 > 15/10/16 21:43:13 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 > on 172.29.46.15:33156 in memory (size: 1666.0 B, free: 7.2 GB) > 15/10/16 21:43:13 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 > on demeter-csmaz10-17.demeter.hpc.mssm.edu:56997 in memory (size: 1666.0 B, > free: 12.2 GB) > 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned accumulator 1 > 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned shuffle 0 > 15/10/16 21:43:13 INFO storage.BlockManager: Removing RDD 2 > 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned RDD 2 > {code} > Accordingly, Spree shows that the RDD has been unpersisted, and I can see in > the event log (not pictured in the screencast) that an Unpersist event has > made its way through the various SparkListeners: > {code} > {"Event":"SparkListenerUnpersistRDD","RDD ID":2} > {code} > Simply loading the web UI 
causes an RDD unpersist event to fire! > I can't nail down exactly what's causing this, and I've seen evidence that > there are other sequences of events that can also cause it: > * I've repro'd the above steps ~20 times. The RDD always gets unpersisted > when I've not visited the web UI until the RDD is cached, and when the app is > dynamically allocating executors. > * One time, I observed the unpersist to fire without my even visiting the web > UI at all. Other times I wait a long time before visiting the web UI, so that > it is clear that the loading of the web UI is causal, and it always is, but > apparently there's another way for the unpersist to happen, seemingly rarely, > without visiting the web UI. > * I tried a couple of times without dynamic allocation and could not > reproduce it. > * I've tried a couple of times with dynamic
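The GC analogy in the comment above can be sketched in miniature: the ContextCleaner tracks RDDs through weak references, so once the driver drops its last strong reference, the cleaner is free to unpersist the cached blocks. A stand-in using Python's {{weakref}} (illustrative only, not Spark code):

```python
import gc
import weakref

class CachedRDD:
    """Stand-in for a cached RDD; only its reachability matters here."""
    pass

rdd = CachedRDD()
tracker = weakref.ref(rdd)  # like the cleaner's weak reference

print(tracker() is not None)  # True: still strongly referenced by the "driver"

del rdd       # driver lets go of its last reference
gc.collect()  # cleaner can now observe the collection and unpersist

print(tracker() is None)  # True: nothing left to keep cached
```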
[jira] [Assigned] (SPARK-11176) Umbrella ticket for wholeTextFiles bugs
[ https://issues.apache.org/jira/browse/SPARK-11176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-11176: -- Assignee: Josh Rosen > Umbrella ticket for wholeTextFiles bugs > --- > > Key: SPARK-11176 > URL: https://issues.apache.org/jira/browse/SPARK-11176 > Project: Spark > Issue Type: Umbrella > Components: Input/Output, Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > This umbrella ticket gathers together several distinct bug reports related to > problems using the wholeTextFiles method to read files. Most of these bugs > deal with reading files from S3, but it's not clear whether S3 is necessary > to hit these bugs. > These issues may have a common underlying cause and should be investigated > together. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10780) Set initialModel in KMeans in Pipelines API
[ https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963827#comment-14963827 ] Jayant Shekhar commented on SPARK-10780: Sounds good [~xusen] and [~josephkb] In the process of updating the PR. Thanks! > Set initialModel in KMeans in Pipelines API > --- > > Key: SPARK-10780 > URL: https://issues.apache.org/jira/browse/SPARK-10780 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > This is for the Scala version. After this is merged, create a JIRA for > Python version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11176) Umbrella ticket for wholeTextFiles bugs
[ https://issues.apache.org/jira/browse/SPARK-11176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963825#comment-14963825 ] Josh Rosen commented on SPARK-11176: Going to close this for now, since all child tickets have been resolved as either "Won't Fix" or "Cannot Reproduce." Will re-open if new issues are discovered. > Umbrella ticket for wholeTextFiles bugs > --- > > Key: SPARK-11176 > URL: https://issues.apache.org/jira/browse/SPARK-11176 > Project: Spark > Issue Type: Umbrella > Components: Input/Output, Spark Core >Reporter: Josh Rosen > > This umbrella ticket gathers together several distinct bug reports related to > problems using the wholeTextFiles method to read files. Most of these bugs > deal with reading files from S3, but it's not clear whether S3 is necessary > to hit these bugs. > These issues may have a common underlying cause and should be investigated > together. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11176) Umbrella ticket for wholeTextFiles bugs
[ https://issues.apache.org/jira/browse/SPARK-11176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-11176. Resolution: Incomplete > Umbrella ticket for wholeTextFiles bugs > --- > > Key: SPARK-11176 > URL: https://issues.apache.org/jira/browse/SPARK-11176 > Project: Spark > Issue Type: Umbrella > Components: Input/Output, Spark Core >Reporter: Josh Rosen > > This umbrella ticket gathers together several distinct bug reports related to > problems using the wholeTextFiles method to read files. Most of these bugs > deal with reading files from S3, but it's not clear whether S3 is necessary > to hit these bugs. > These issues may have a common underlying cause and should be investigated > together. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11027) Better group distinct columns in query compilation
[ https://issues.apache.org/jira/browse/SPARK-11027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-11027. -- Resolution: Won't Fix > Better group distinct columns in query compilation > -- > > Key: SPARK-11027 > URL: https://issues.apache.org/jira/browse/SPARK-11027 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > In AggregationQuerySuite, we have a test > {code} > checkAnswer( > sqlContext.sql( > """ > |SELECT sum(distinct value1), kEY - 100, count(distinct value1) > |FROM agg2 > |GROUP BY Key - 100 > """.stripMargin), > Row(40, -99, 2) :: Row(0, -98, 2) :: Row(null, -97, 0) :: Row(30, null, > 3) :: Nil) > {code} > We will treat it as having two distinct columns because sum causes a cast on > value1. Maybe we can ignore the cast when we group distinct columns. So, it > will not be treated as having two distinct columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver
Phil Kallos created SPARK-11193: --- Summary: Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver Key: SPARK-11193 URL: https://issues.apache.org/jira/browse/SPARK-11193 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.5.1, 1.5.0 Reporter: Phil Kallos After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis Spark Streaming application, and am being consistently greeted with this exception: java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast to scala.collection.mutable.SynchronizedMap at org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532) at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Worth noting that I am able to reproduce this issue locally, and also on Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0). Also, I am not able to run the included kinesis-asl example. 
Built locally using:
git checkout v1.5.1
mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
Example run command:
bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector https://kinesis.us-east-1.amazonaws.com
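The failure mode here is a downcast to a type the runtime object does not actually carry: KinesisReceiver.onStart casts the map it holds to scala.collection.mutable.SynchronizedMap, but on this classpath the object is a plain HashMap. A language-neutral sketch of the same failing pattern (plain Python; SynchronizedDict is a made-up stand-in for the Scala mixin, not anything in Spark):

```python
class SynchronizedDict(dict):
    """Hypothetical stand-in for Scala's SynchronizedMap mixin."""
    pass

def on_start(metrics_map):
    # Mirrors the failing cast in KinesisReceiver.onStart: the code
    # assumes the map carries the synchronized mixin at runtime.
    if not isinstance(metrics_map, SynchronizedDict):
        raise TypeError("plain dict cannot be treated as SynchronizedDict")
    return "receiver started"

plain = {}  # what actually arrives on the affected classpath
try:
    on_start(plain)
except TypeError as e:
    print("ClassCastException analogue:", e)
```

The fix direction is the usual one for this pattern: construct an object that is statically known to have the needed behavior (e.g. an explicitly synchronized map) instead of casting and hoping.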
[jira] [Created] (SPARK-11192) When graphite metric sink is enabled, spark sql leaks org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time
Blake Livingston created SPARK-11192: Summary: When graphite metric sink is enabled, spark sql leaks org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time Key: SPARK-11192 URL: https://issues.apache.org/jira/browse/SPARK-11192 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Environment: java version "1.8.0_60" Java(TM) SE Runtime Environment (build 1.8.0_60-b27) Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode) org.apache.spark/spark-sql_2.10 "1.5.1" Embedded, in-process spark. Have not tested on standalone or yarn clusters. Reporter: Blake Livingston Priority: Minor Noticed that slowly, over the course of a day or two, heap memory usage on a long running spark process increased monotonically. After doing a heap dump and examining in jvisualvm, saw there were over 15M org.apache.spark.sql.execution.ui.SQLTaskMetrics objects allocated, taking over 500MB. Accumulation does not occur when I removed metrics.properties. metrics.properties content: # Enable Graphite *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=x *.sink.graphite.port=2003 *.sink.graphite.period=10 # Enable jvm source for instance master, worker, driver and executor master.source.jvm.class=org.apache.spark.metrics.source.JvmSource worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11194) Use a single URLClassLoader for jars added through SQL's "ADD JAR" command
[ https://issues.apache.org/jira/browse/SPARK-11194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11194: Assignee: Yin Huai (was: Apache Spark) > Use a single URLClassLoader for jars added through SQL's "ADD JAR" command > -- > > Key: SPARK-11194 > URL: https://issues.apache.org/jira/browse/SPARK-11194 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > Right now, we stack a new URLClassLoader when a user add a jar through SQL's > add jar command. This approach can introduce issues caused by the ordering of > added jars when a class of a jar depends on another class of another jar. > For example, > {code} > ClassLoader1 for Jar1.jar (A.class) >| >|- ClassLoader2 for Jar2.jar (B.class depending on A.class) > {code} > In this case, when we lookup class B, we will not be able to find class A > because Jar2 is the parent of Jar1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11194) Use a single URLClassLoader for jars added through SQL's "ADD JAR" command
[ https://issues.apache.org/jira/browse/SPARK-11194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11194: Assignee: Apache Spark (was: Yin Huai) > Use a single URLClassLoader for jars added through SQL's "ADD JAR" command > -- > > Key: SPARK-11194 > URL: https://issues.apache.org/jira/browse/SPARK-11194 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > > Right now, we stack a new URLClassLoader when a user add a jar through SQL's > add jar command. This approach can introduce issues caused by the ordering of > added jars when a class of a jar depends on another class of another jar. > For example, > {code} > ClassLoader1 for Jar1.jar (A.class) >| >|- ClassLoader2 for Jar2.jar (B.class depending on A.class) > {code} > In this case, when we lookup class B, we will not be able to find class A > because Jar2 is the parent of Jar1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11194) Use a single URLClassLoader for jars added through SQL's "ADD JAR" command
[ https://issues.apache.org/jira/browse/SPARK-11194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-11194: - Description: Right now, we stack a new URLClassLoader when a user add a jar through SQL's add jar command. This approach can introduce issues caused by the ordering of added jars when a class of a jar depends on another class of another jar. For example, {code} ClassLoader1 for Jar1.jar (A.class) | |- ClassLoader2 for Jar2.jar (B.class depending on A.class) {code} In this case, when we lookup class B, we will not be able to find class A because Jar2 is the parent of Jar1. > Use a single URLClassLoader for jars added through SQL's "ADD JAR" command > -- > > Key: SPARK-11194 > URL: https://issues.apache.org/jira/browse/SPARK-11194 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > Right now, we stack a new URLClassLoader when a user add a jar through SQL's > add jar command. This approach can introduce issues caused by the ordering of > added jars when a class of a jar depends on another class of another jar. > For example, > {code} > ClassLoader1 for Jar1.jar (A.class) >| >|- ClassLoader2 for Jar2.jar (B.class depending on A.class) > {code} > In this case, when we lookup class B, we will not be able to find class A > because Jar2 is the parent of Jar1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
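The ordering hazard can be made concrete with a toy model of parent-first delegation (plain Python, not Spark's actual classloader code; the names are illustrative). A loader resolves through its parent chain and never consults its children, so when the jar defining B ends up above the jar defining A in the stack, B's defining loader cannot see A:

```python
class StackedLoader:
    """Toy URLClassLoader: parent-first delegation; a loader sees its
    own classes plus its ancestors' classes -- never its descendants'."""
    def __init__(self, name, classes, parent=None):
        self.name, self.classes, self.parent = name, set(classes), parent

    def load(self, cls):
        if self.parent is not None:
            try:
                return self.parent.load(cls)
            except KeyError:
                pass  # not in any ancestor; fall through to our own jar
        if cls in self.classes:
            return cls + " via " + self.name
        raise KeyError(cls)

# Jar2 (defines B, which needs A) was added first, so its loader sits at
# the top of the stack; Jar1 (defines A) was stacked on afterwards.
jar2 = StackedLoader("loader(Jar2)", {"B"})
jar1 = StackedLoader("loader(Jar1)", {"A"}, parent=jar2)

assert jar1.load("B") == "B via loader(Jar2)"   # looking B up succeeds...
try:
    jar2.load("A")  # ...but B's own defining loader cannot resolve A
except KeyError:
    print("NoClassDefFoundError analogue: A")
```

With the single flat loader this ticket proposes (one loader holding every added jar's URLs), resolution no longer depends on the order in which ADD JAR was issued.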
[jira] [Created] (SPARK-11190) SparkR support for cassandra collection types.
Bilind Hajer created SPARK-11190: Summary: SparkR support for cassandra collection types. Key: SPARK-11190 URL: https://issues.apache.org/jira/browse/SPARK-11190 Project: Spark Issue Type: Bug Affects Versions: 1.5.1 Environment: SparkR Version: 1.5.1 Cassandra Version: 2.1.6 R Version: 3.2.2 Cassandra Connector version: 1.5.0-M2 Reporter: Bilind Hajer Fix For: 1.5.2 I want to create a data frame from a Cassandra keyspace and column family in sparkR. I am able to create data frames from tables which do not include any Cassandra collection datatypes, such as Map, Set and List. But, many of the schemas that I need data from, do include these collection data types. Here is my local environment. SparkR Version: 1.5.1 Cassandra Version: 2.1.6 R Version: 3.2.2 Cassandra Connector version: 1.5.0-M2 To test this issue, I did the following iterative process. sudo ./sparkR --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --conf spark.cassandra.connection.host=127.0.0.1 Running this command, with sparkR gives me access to the spark cassandra connector package I need, and connects me to my local cqlsh server ( which is up and running while running this code in sparkR shell ). CREATE TABLE test_table ( column_1 int, column_2 text, column_3 float, column_4 uuid, column_5 timestamp, column_6 boolean, column_7 timeuuid, column_8 bigint, column_9 blob, column_10 ascii, column_11 decimal, column_12 double, column_13 inet, column_14 varchar, column_15 varint, PRIMARY KEY( ( column_1, column_2 ) ) ); All of the above data types are supported. I insert dummy data after creating this test schema. For example, now in my sparkR shell, I run the following code. 
df.test <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", keyspace = "datahub", table = "test_table") assigns with no errors, then, > schema(df.test) StructType |-name = "column_1", type = "IntegerType", nullable = TRUE |-name = "column_2", type = "StringType", nullable = TRUE |-name = "column_10", type = "StringType", nullable = TRUE |-name = "column_11", type = "DecimalType(38,18)", nullable = TRUE |-name = "column_12", type = "DoubleType", nullable = TRUE |-name = "column_13", type = "InetAddressType", nullable = TRUE |-name = "column_14", type = "StringType", nullable = TRUE |-name = "column_15", type = "DecimalType(38,0)", nullable = TRUE |-name = "column_3", type = "FloatType", nullable = TRUE |-name = "column_4", type = "UUIDType", nullable = TRUE |-name = "column_5", type = "TimestampType", nullable = TRUE |-name = "column_6", type = "BooleanType", nullable = TRUE |-name = "column_7", type = "UUIDType", nullable = TRUE |-name = "column_8", type = "LongType", nullable = TRUE |-name = "column_9", type = "BinaryType", nullable = TRUE Schema is correct. > class(df.test) [1] "DataFrame" attr(,"package") [1] "SparkR" df.test is clearly defined to be a DataFrame object.
> head(df.test)
  column_1 column_2 column_10 column_11 column_12 column_13 column_14 column_15
1        1    hello        NA        NA        NA        NA        NA        NA
  column_3 column_4 column_5 column_6 column_7 column_8 column_9
1      3.4       NA       NA       NA       NA       NA       NA
sparkR is reading from the column_family correctly, but now let's add a collection data type to the schema. 
Now I will drop that test_table, and recreate the table with an extra column of data type map:
CREATE TABLE test_table ( column_1 int, column_2 text, column_3 float, column_4 uuid, column_5 timestamp, column_6 boolean, column_7 timeuuid, column_8 bigint, column_9 blob, column_10 ascii, column_11 decimal, column_12 double, column_13 inet, column_14 varchar, column_15 varint, column_16 map , PRIMARY KEY( ( column_1, column_2 ) ) );
After inserting dummy data into the new test schema, > df.test <- read.df(sqlContext, source =
[jira] [Updated] (SPARK-10955) Warn if dynamic allocation is enabled for Streaming jobs
[ https://issues.apache.org/jira/browse/SPARK-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10955: -- Summary: Warn if dynamic allocation is enabled for Streaming jobs (was: Disable dynamic allocation for Streaming jobs) > Warn if dynamic allocation is enabled for Streaming jobs > > > Key: SPARK-10955 > URL: https://issues.apache.org/jira/browse/SPARK-10955 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Fix For: 1.5.2, 1.6.0 > > > Spark streaming can be tricky with dynamic allocation and can lose data. We > should disable dynamic allocation or at least log that it is dangerous.
[jira] [Commented] (SPARK-11027) Better group distinct columns in query compilation
[ https://issues.apache.org/jira/browse/SPARK-11027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963990#comment-14963990 ] Yin Huai commented on SPARK-11027: -- As pointed out by [~joshrosen] (see https://github.com/apache/spark/pull/9115), it is not always safe to evaluate cast after we do distinct because cast operation can affect the result of distinct. So, I am closing this JIRA for now. > Better group distinct columns in query compilation > -- > > Key: SPARK-11027 > URL: https://issues.apache.org/jira/browse/SPARK-11027 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > In AggregationQuerySuite, we have a test > {code} > checkAnswer( > sqlContext.sql( > """ > |SELECT sum(distinct value1), kEY - 100, count(distinct value1) > |FROM agg2 > |GROUP BY Key - 100 > """.stripMargin), > Row(40, -99, 2) :: Row(0, -98, 2) :: Row(null, -97, 0) :: Row(30, null, > 3) :: Nil) > {code} > We will treat it as having two distinct columns because sum causes a cast on > value1. Maybe we can ignore the cast when we group distinct columns. So, it > will not be treated as having two distinct columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
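Why ignoring the cast is unsafe can be shown with two lines of arithmetic: distinct-then-cast and cast-then-distinct disagree whenever the cast is lossy (values chosen purely for illustration):

```python
values = [1.2, 1.7, 1.2]

# count(distinct value1): the raw column has two distinct values.
assert len(set(values)) == 2

# count(distinct cast(value1 as int)): the lossy cast collapses them
# into a single distinct value, so the two aggregates must be grouped
# as different distinct columns.
assert len({int(v) for v in values}) == 1
```

So treating sum(distinct cast(value1)) and count(distinct value1) as sharing one distinct column would silently change results, which is why this JIRA was closed.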
[jira] [Created] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service
David Ross created SPARK-11191: -- Summary: [1.5] Can't create UDF's using hive thrift service Key: SPARK-11191 URL: https://issues.apache.org/jira/browse/SPARK-11191 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1, 1.5.0 Reporter: David Ross Since upgrading to spark 1.5 we've been unable to create and use UDF's when we run in thrift server mode. Our setup: We start the thrift-server running against yarn in client mode, (we've also built our own spark from github branch-1.5 with the following args: {{-Pyarn -Phive -Phive-thrifeserver}} If i run the following after connecting via JDBC (in this case via beeline): {{add jar 'hdfs://path/to/jar"}} (this command succeeds with no errors) {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}} (this command succeeds with no errors) {{select testUDF(col1) from table1;}} I get the following error in the logs: {code} org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 pos 8 at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57) at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53) at scala.util.Try.getOrElse(Try.scala:77) at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) {code} (cutting the bulk for ease of report, more than happy to send the full output) {code} 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive query: org.apache.hive.service.cli.HiveSQLException: org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 pos 100 at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} When I ran the same against 1.4 it worked. I've also changed {{spark.sql.hive.metastore.version}} to 0.13 (similar to what it was in 1.4) and 0.14, but I still get the same errors. Also, in 1.5, when you run it against the {{spark-sql}} shell, it works.
[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service
[ https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964048#comment-14964048 ] David Ross commented on SPARK-11191: I will add that the exact same thing happens when you don't use {{TEMPORARY}} i.e.: {code} CREATE FUNCTION testUDF AS 'com.foo.class.UDF'; {code} > [1.5] Can't create UDF's using hive thrift service > -- > > Key: SPARK-11191 > URL: https://issues.apache.org/jira/browse/SPARK-11191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: David Ross > > Since upgrading to spark 1.5 we've been unable to create and use UDF's when > we run in thrift server mode. > Our setup: > We start the thrift-server running against yarn in client mode, (we've also > built our own spark from github branch-1.5 with the following args: {{-Pyarn > -Phive -Phive-thrifeserver}} > If i run the following after connecting via JDBC (in this case via beeline): > {{add jar 'hdfs://path/to/jar"}} > (this command succeeds with no errors) > {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}} > (this command succeeds with no errors) > {{select testUDF(col1) from table1;}} > I get the following error in the logs: > {code} > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 8 > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53) > at scala.util.Try.getOrElse(Try.scala:77) > at > org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > {code} > (cutting the bulk for ease of report, more than happy to send the full output) > {code} > 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive > query: > org.apache.hive.service.cli.HiveSQLException: > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 100 > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171) > at java.security.AccessController.doPrivileged(Native Method) > 
at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at
[jira] [Updated] (SPARK-11180) Support BooleanType in DataFrame.na.fill
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-11180: Summary: Support BooleanType in DataFrame.na.fill (was: DataFrame.na.fill does not support Boolean Type:) > Support BooleanType in DataFrame.na.fill > > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > Fix For: 1.6.0 > > > Currently DataFrame.na.fill does not support Boolean primitive type. We > have use cases where while data massaging/preparation we want to fill boolean > columns with false/true value. > Ex: > {code} > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > {code} > We want to set "isOfficer" false whenever there is null. > {code} > scala> empDf.na.fill(Map("isOfficer"->false)) > throws exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > {code} > Can you add support for Boolean into na.fill function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
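The requested semantics are simple to state: a fill on a boolean column replaces only nulls and leaves real true/false values alone. A plain-Python sketch of that behavior (rows modeled as dicts; this is not Spark's implementation, which gained the support per the Fix For field above):

```python
def na_fill_bool(rows, column, default):
    """Sketch of df.na.fill(Map(column -> default)) for booleans:
    nulls (None) become the default; existing values are untouched."""
    return [dict(row, **{column: default if row[column] is None else row[column]})
            for row in rows]

emp = [
    {"EmpId": 1, "Designation": None,  "isOfficer": None},
    {"EmpId": 2, "Designation": "SVP", "isOfficer": True},
    {"EmpId": 3, "Designation": "Dir", "isOfficer": False},
]
filled = na_fill_bool(emp, "isOfficer", False)
assert [r["isOfficer"] for r in filled] == [False, True, False]
```

Note that only the named column is filled: the null Designation in row 1 stays null, just as na.fill with a per-column Map would leave it.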
[jira] [Commented] (SPARK-10754) table and column name are case sensitive when json Dataframe was registered as tempTable using JavaSparkContext.
[ https://issues.apache.org/jira/browse/SPARK-10754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964041#comment-14964041 ] Yin Huai commented on SPARK-10754: -- Can you use {{HiveContext}}, which set {{spark.sql.caseSensitive}} to false by default. > table and column name are case sensitive when json Dataframe was registered > as tempTable using JavaSparkContext. > - > > Key: SPARK-10754 > URL: https://issues.apache.org/jira/browse/SPARK-10754 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1, 1.4.1 > Environment: Linux ,Hadoop Version 1.3 >Reporter: Babulal > > Create a dataframe using json data source > SparkConf conf=new > SparkConf().setMaster("spark://xyz:7077")).setAppName("Spark Tabble"); > JavaSparkContext javacontext=new JavaSparkContext(conf); > SQLContext sqlContext=new SQLContext(javacontext); > > DataFrame df = > sqlContext.jsonFile("/user/root/examples/src/main/resources/people.json"); > > df.registerTempTable("sparktable"); > > Run the Query > > sqlContext.sql("select * from sparktable").show()// this will PASs > > > sqlContext.sql("select * from sparkTable").show()/// This will FAIL > > java.lang.RuntimeException: Table Not Found: sparkTable > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115) > at > org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:58) > at > org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:115) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:233) > > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
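The behavior difference comes down to whether the catalog normalizes identifiers before lookup. A toy sketch of what the spark.sql.caseSensitive switch controls (plain Python; not Spark's Catalog code, and the class name is made up):

```python
class ToyCatalog:
    """Case-insensitive mode lower-cases names on both registration
    and lookup, like spark.sql.caseSensitive=false."""
    def __init__(self, case_sensitive):
        self.case_sensitive = case_sensitive
        self.tables = {}

    def _key(self, name):
        return name if self.case_sensitive else name.lower()

    def register_temp_table(self, name, df):
        self.tables[self._key(name)] = df

    def lookup(self, name):
        try:
            return self.tables[self._key(name)]
        except KeyError:
            raise RuntimeError("Table Not Found: " + name)

sensitive = ToyCatalog(case_sensitive=True)  # behaves like the report above
sensitive.register_temp_table("sparktable", "df")
try:
    sensitive.lookup("sparkTable")
except RuntimeError as e:
    print(e)  # Table Not Found: sparkTable

insensitive = ToyCatalog(case_sensitive=False)  # the suggested workaround
insensitive.register_temp_table("sparktable", "df")
assert insensitive.lookup("sparkTable") == "df"
```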
[jira] [Updated] (SPARK-11192) When graphite metric sink is enabled, spark sql leaks org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time
[ https://issues.apache.org/jira/browse/SPARK-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Blake Livingston updated SPARK-11192: - Description: Noticed that slowly, over the course of a day or two, heap memory usage on a long running spark process increased monotonically. After doing a heap dump and examining in jvisualvm, saw there were over 15M org.apache.spark.sql.execution.ui.SQLTaskMetrics objects allocated, taking over 500MB. Accumulation does not occur when I removed metrics.properties. metrics.properties content: *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=x *.sink.graphite.port=2003 *.sink.graphite.period=10 master.source.jvm.class=org.apache.spark.metrics.source.JvmSource worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource was: Noticed that slowly, over the course of a day or two, heap memory usage on a long running spark process increased monotonically. After doing a heap dump and examining in jvisualvm, saw there were over 15M org.apache.spark.sql.execution.ui.SQLTaskMetrics objects allocated, taking over 500MB. Accumulation does not occur when I removed metrics.properties. 
metrics.properties content: # Enable Graphite *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=x *.sink.graphite.port=2003 *.sink.graphite.period=10 # Enable jvm source for instance master, worker, driver and executor master.source.jvm.class=org.apache.spark.metrics.source.JvmSource worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource > When graphite metric sink is enabled, spark sql leaks > org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time > > > Key: SPARK-11192 > URL: https://issues.apache.org/jira/browse/SPARK-11192 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode) > org.apache.spark/spark-sql_2.10 "1.5.1" > Embedded, in-process spark. Have not tested on standalone or yarn clusters. >Reporter: Blake Livingston >Priority: Minor > > Noticed that slowly, over the course of a day or two, heap memory usage on a > long running spark process increased monotonically. > After doing a heap dump and examining in jvisualvm, saw there were over 15M > org.apache.spark.sql.execution.ui.SQLTaskMetrics objects allocated, taking > over 500MB. > Accumulation does not occur when I removed metrics.properties. 
> metrics.properties content: > *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink > *.sink.graphite.host=x > *.sink.graphite.port=2003 > *.sink.graphite.period=10 > master.source.jvm.class=org.apache.spark.metrics.source.JvmSource > worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource > driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource > executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11184) Declare most of .mllib code not-Experimental
[ https://issues.apache.org/jira/browse/SPARK-11184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964077#comment-14964077 ] Apache Spark commented on SPARK-11184: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/9169 > Declare most of .mllib code not-Experimental > > > Key: SPARK-11184 > URL: https://issues.apache.org/jira/browse/SPARK-11184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.1 >Reporter: Sean Owen >Priority: Minor > > Comments please [~mengxr] and [~josephkb]: my proposal is to remove most > {{@Experimental}} annotations from the {{.mllib}} code, on the theory that > it's not intended to change much more. > I can easily take a shot at this, but wanted to collect thoughts before I > started. Does the theory sound reasonable? Part of it is a desire to keep > this annotation meaningful, and also encourage people to at least view MLlib > as stable, because it is. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11184) Declare most of .mllib code not-Experimental
[ https://issues.apache.org/jira/browse/SPARK-11184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11184: Assignee: Apache Spark > Declare most of .mllib code not-Experimental > > > Key: SPARK-11184 > URL: https://issues.apache.org/jira/browse/SPARK-11184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.1 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Minor > > Comments please [~mengxr] and [~josephkb]: my proposal is to remove most > {{@Experimental}} annotations from the {{.mllib}} code, on the theory that > it's not intended to change much more. > I can easily take a shot at this, but wanted to collect thoughts before I > started. Does the theory sound reasonable? Part of it is a desire to keep > this annotation meaningful, and also encourage people to at least view MLlib > as stable, because it is. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11184) Declare most of .mllib code not-Experimental
[ https://issues.apache.org/jira/browse/SPARK-11184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11184: Assignee: (was: Apache Spark) > Declare most of .mllib code not-Experimental > > > Key: SPARK-11184 > URL: https://issues.apache.org/jira/browse/SPARK-11184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.1 >Reporter: Sean Owen >Priority: Minor > > Comments please [~mengxr] and [~josephkb]: my proposal is to remove most > {{@Experimental}} annotations from the {{.mllib}} code, on the theory that > it's not intended to change much more. > I can easily take a shot at this, but wanted to collect thoughts before I > started. Does the theory sound reasonable? Part of it is a desire to keep > this annotation meaningful, and also encourage people to at least view MLlib > as stable, because it is. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11194) Use a single URLClassLoader for jars added through SQL's "ADD JAR" command
[ https://issues.apache.org/jira/browse/SPARK-11194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964113#comment-14964113 ] Apache Spark commented on SPARK-11194: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/9170 > Use a single URLClassLoader for jars added through SQL's "ADD JAR" command > -- > > Key: SPARK-11194 > URL: https://issues.apache.org/jira/browse/SPARK-11194 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > Right now, we stack a new URLClassLoader when a user add a jar through SQL's > add jar command. This approach can introduce issues caused by the ordering of > added jars when a class of a jar depends on another class of another jar. > For example, > {code} > ClassLoader1 for Jar1.jar (A.class) >| >|- ClassLoader2 for Jar2.jar (B.class depending on A.class) > {code} > In this case, when we lookup class B, we will not be able to find class A > because Jar2 is the parent of Jar1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
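The ordering problem in this ticket follows from parent-first delegation: a child loader can see its parent's classes, but anything resolved by a loader lower in the chain cannot see classes held above it. A hedged Python analogue of that lookup rule (toy Loader class, not Spark's or the JVM's actual classloaders):

```python
# Toy model of parent-first classloader delegation. Each "ADD JAR"
# stacks a new child loader on top of the previous one.
class Loader:
    def __init__(self, name, classes, parent=None):
        self.name = name
        self.classes = set(classes)
        self.parent = parent

    def load(self, cls):
        # Delegate to the parent first, as java.lang.ClassLoader does.
        if self.parent is not None:
            try:
                return self.parent.load(cls)
            except KeyError:
                pass
        if cls in self.classes:
            return (self.name, cls)
        raise KeyError(cls)

jar1 = Loader("Jar1", {"A"})                  # ADD JAR Jar1.jar
jar2 = Loader("Jar2", {"B"}, parent=jar1)     # ADD JAR Jar2.jar

# The top of the stack resolves both classes...
assert jar2.load("A") == ("Jar1", "A")
assert jar2.load("B") == ("Jar2", "B")

# ...but a lookup that starts at jar1 (e.g. from a class inside Jar1.jar
# that references B) cannot see Jar2's classes: the order-sensitive bug.
try:
    jar1.load("B")
except KeyError:
    pass

# A single flat loader holding every added jar removes the asymmetry,
# which is the shape this ticket proposes.
flat = Loader("all-jars", {"A", "B"})
assert flat.load("A") and flat.load("B")
```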
[jira] [Updated] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-11180: Description: Currently DataFrame.na.fill does not support Boolean primitive type. We have use cases where while data massaging/preparation we want to fill boolean columns with false/true value. Ex: {code} val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] ((1,null,null),(2,"SVP",true),(3,"Dir",false))) .toDF("EmpId","Designation","isOfficer") empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, isOfficer: boolean] scala> empDf.show |EmpId|Designation|isOfficer| |1| null| null| |2|SVP| true| |3|Dir|false| {code} We want to set "isOfficer" false whenever there is null. {code} scala> empDf.na.fill(Map("isOfficer"->false)) throws exception java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean (false). at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) ... {code} Can you add support for Boolean into na.fill function. was: Currently DataFrame.na.fill does not support Boolean primitive type. We have use cases where while data massaging/preparation we want to fill boolean columns with false/true value. Ex: val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] ((1,null,null),(2,"SVP",true),(3,"Dir",false))) .toDF("EmpId","Designation","isOfficer") empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, isOfficer: boolean] scala> empDf.show |EmpId|Designation|isOfficer| |1| null| null| |2|SVP| true| |3|Dir|false| We want to set "isOfficer" false whenever there is null. scala> empDf.na.fill(Map("isOfficer"->false)) throws exception java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean (false). at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) ... 
Can you add support for Boolean into na.fill function. > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > > Currently DataFrame.na.fill does not support Boolean primitive type. We > have use cases where while data massaging/preparation we want to fill boolean > columns with false/true value. > Ex: > {code} > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > {code} > We want to set "isOfficer" false whenever there is null. > {code} > scala> empDf.na.fill(Map("isOfficer"->false)) > throws exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > {code} > Can you add support for Boolean into na.fill function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
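On versions where na.fill rejects Booleans, the requested behaviour is easy to state precisely; a minimal pure-Python sketch of the semantics (plain dicts, not the Spark API) of filling nulls in one boolean column:

```python
# Sketch of the semantics requested above: replace nulls (None) in one
# column with a default value, leaving the input rows untouched.
def fill_bool(rows, column, default):
    filled = []
    for row in rows:
        row = dict(row)  # copy, so the original data is not mutated
        if row.get(column) is None:
            row[column] = default
        filled.append(row)
    return filled

emp = [
    {"EmpId": 1, "Designation": None, "isOfficer": None},
    {"EmpId": 2, "Designation": "SVP", "isOfficer": True},
    {"EmpId": 3, "Designation": "Dir", "isOfficer": False},
]
emp_filled = fill_bool(emp, "isOfficer", False)
```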
[jira] [Commented] (SPARK-10645) Bivariate Statistics: Spearman's Correlation support as UDAF
[ https://issues.apache.org/jira/browse/SPARK-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964068#comment-14964068 ] Arvind Surve commented on SPARK-10645: -- Spearman's correlation coefficient (SpCoeff) does not fit into the UDAF model, as the rank needs to be calculated for every column independently. I have created a stand-alone method that takes a holistic approach to evaluating SpCoeff, outlined below. This method takes two arrays -- representing two columns -- (this can be converted to take two RDDs as input parameters) and returns SpCoeff. It can be added in org.apache.spark.sql.execution.stat.StatFunctions.scala, with the corr() method invoking it for the "spearman" method. Please provide feedback on this approach and then we can go from there.

// This function calculates Spearman's rank correlation coefficient
// Reference: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
def computeSpearmanCorrCoeff(sc: SparkContext, data1: Array[Int], data2: Array[Int]): Double = {
  val rddData1 = sc.parallelize(data1)
  val rddData2 = sc.parallelize(data2)

  // Calculate the rank for the first vector (tied values receive their average rank).
  val rddData1Rank = rddData1
    .zipWithIndex()
    .sortByKey()
    .zipWithIndex()
    .map { case ((a, b), c) => (a, (c + 1.0, 1.0)) }
    .reduceByKey { case (a, b) => ((a._1 * a._2 + b._1 * b._2) / (a._2 + b._2), a._2 + b._2) }
    .map { case (a, (b, c)) => (a, b) }

  // Calculate the rank for the second vector.
  val rddData2Rank = rddData2
    .zipWithIndex()
    .sortByKey()
    .zipWithIndex()
    .map { case ((a, b), c) => (a, (c + 1.0, 1.0)) }
    .reduceByKey { case (a, b) => ((a._1 * a._2 + b._1 * b._2) / (a._2 + b._2), a._2 + b._2) }
    .map { case (a, (b, c)) => (a, b) }

  // Sum of squared differences of ranks between the two vectors' corresponding
  // elements, in original order.
  val sumSqRankDiff = rddData1.zip(rddData2)
    .join(rddData1Rank).map { case (a, (b, c)) => (b, (a, c)) }
    .join(rddData2Rank).map { case (a, ((b, c), d)) => (d - c) * (d - c) }
    .sum()

  // Length of vector.
  val dataLen = rddData1Rank.count()

  // Return Spearman's rank correlation coefficient.
  1 - (6 * sumSqRankDiff) / (dataLen * (dataLen * dataLen - 1))
}

-Arvind Surve > Bivariate Statistics: Spearman's Correlation support as UDAF > > > Key: SPARK-10645 > URL: https://issues.apache.org/jira/browse/SPARK-10645 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Spearman's rank correlation coefficient : > https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
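The RDD pipeline above can be sanity-checked against a small local implementation of the same steps -- average ranks per value, then 1 - 6*sum(d^2)/(n(n^2-1)). A hedged single-machine Python version (note the closed-form expression is exact only when there are no ties):

```python
def spearman(data1, data2):
    """Spearman's rank correlation via the sum-of-squared-rank-differences formula."""
    def ranks(data):
        # 1-based positions in sorted order; tied values share their average rank.
        order = sorted(range(len(data)), key=lambda i: data[i])
        positions = {}
        for rank, i in enumerate(order, start=1):
            positions.setdefault(data[i], []).append(rank)
        avg = {v: sum(p) / len(p) for v, p in positions.items()}
        return [avg[v] for v in data]

    r1, r2 = ranks(data1), ranks(data2)
    n = len(data1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    # Same final step as the RDD version; exact only without ties.
    return 1 - (6 * d2) / (n * (n * n - 1))
```

For example, identical vectors give a coefficient of 1.0 and exactly reversed vectors give -1.0, which is a quick way to validate the distributed version against this one.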
[jira] [Resolved] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-11180. - Resolution: Fixed Fix Version/s: 1.6.0 > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > Fix For: 1.6.0 > > > Currently DataFrame.na.fill does not support Boolean primitive type. We > have use cases where while data massaging/preparation we want to fill boolean > columns with false/true value. > Ex: > {code} > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > {code} > We want to set "isOfficer" false whenever there is null. > {code} > scala> empDf.na.fill(Map("isOfficer"->false)) > throws exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > {code} > Can you add support for Boolean into na.fill function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10955) Warn if dynamic allocation is enabled for Streaming jobs
[ https://issues.apache.org/jira/browse/SPARK-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10955: -- Description: Spark streaming can be tricky with dynamic allocation and can lose data if not used properly (with WAL, or with WAL-free solutions like Direct Kafka and Kinesis since 1.5). If dynamic allocation is enabled, we should issue a log4j warning. (was: Spark streaming can be tricky with dynamic allocation and can lose data. We should disable dynamic allocation or at least log that it is dangerous.) > Warn if dynamic allocation is enabled for Streaming jobs > > > Key: SPARK-10955 > URL: https://issues.apache.org/jira/browse/SPARK-10955 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Fix For: 1.5.2, 1.6.0 > > > Spark streaming can be tricky with dynamic allocation and can lose data if > not used properly (with WAL, or with WAL-free solutions like Direct Kafka and > Kinesis since 1.5). If dynamic allocation is enabled, we should issue a log4j > warning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
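The change amounts to a configuration check when a streaming job starts; a hedged sketch of the shape of such a check (hypothetical function name, Python logging standing in for log4j):

```python
import logging

def warn_if_dynamic_allocation(conf):
    # Sketch of the proposed behaviour: warn (do not fail) when a
    # streaming job runs with dynamic allocation enabled.
    if conf.get("spark.dynamicAllocation.enabled", "false") == "true":
        logging.warning(
            "Dynamic allocation is enabled for a streaming job; "
            "executors holding received data may be removed. "
            "Use a WAL, or a WAL-free source such as Direct Kafka or Kinesis.")
        return True
    return False

warned = warn_if_dynamic_allocation({"spark.dynamicAllocation.enabled": "true"})
```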
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964032#comment-14964032 ] Shivaram Venkataraman commented on SPARK-11190: --- cc [~sunrui] Could you try this on the master branch ? We recently added support for Lists, Maps etc. in the master branch > SparkR support for cassandra collection types. > --- > > Key: SPARK-11190 > URL: https://issues.apache.org/jira/browse/SPARK-11190 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 > Environment: SparkR Version: 1.5.1 > Cassandra Version: 2.1.6 > R Version: 3.2.2 > Cassandra Connector version: 1.5.0-M2 >Reporter: Bilind Hajer > Labels: cassandra, dataframe, sparkR > Fix For: 1.5.2 > > > I want to create a data frame from a Cassandra keyspace and column family in > sparkR. > I am able to create data frames from tables which do not include any > Cassandra collection datatypes, > such as Map, Set and List. But, many of the schemas that I need data from, > do include these collection data types. > Here is my local environment. > SparkR Version: 1.5.1 > Cassandra Version: 2.1.6 > R Version: 3.2.2 > Cassandra Connector version: 1.5.0-M2 > To test this issue, I did the following iterative process. > sudo ./sparkR --packages > com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --conf > spark.cassandra.connection.host=127.0.0.1 > Running this command, with sparkR gives me access to the spark cassandra > connector package I need, > and connects me to my local cqlsh server ( which is up and running while > running this code in sparkR shell ). 
> CREATE TABLE test_table ( > column_1 int, > column_2 text, > column_3 float, > column_4 uuid, > column_5 timestamp, > column_6 boolean, > column_7 timeuuid, > column_8 bigint, > column_9 blob, > column_10 ascii, > column_11 decimal, > column_12 double, > column_13 inet, > column_14 varchar, > column_15 varint, > PRIMARY KEY( ( column_1, column_2 ) ) > ); > All of the above data types are supported. I insert dummy data after creating > this test schema. > For example, now in my sparkR shell, I run the following code. > df.test <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", > keyspace = "datahub", table = "test_table") > assigns with no errors, then, > > schema(df.test) > StructType > |-name = "column_1", type = "IntegerType", nullable = TRUE > |-name = "column_2", type = "StringType", nullable = TRUE > |-name = "column_10", type = "StringType", nullable = TRUE > |-name = "column_11", type = "DecimalType(38,18)", nullable = TRUE > |-name = "column_12", type = "DoubleType", nullable = TRUE > |-name = "column_13", type = "InetAddressType", nullable = TRUE > |-name = "column_14", type = "StringType", nullable = TRUE > |-name = "column_15", type = "DecimalType(38,0)", nullable = TRUE > |-name = "column_3", type = "FloatType", nullable = TRUE > |-name = "column_4", type = "UUIDType", nullable = TRUE > |-name = "column_5", type = "TimestampType", nullable = TRUE > |-name = "column_6", type = "BooleanType", nullable = TRUE > |-name = "column_7", type = "UUIDType", nullable = TRUE > |-name = "column_8", type = "LongType", nullable = TRUE > |-name = "column_9", type = "BinaryType", nullable = TRUE > Schema is correct. > > class(df.test) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > df.test is clearly defined to be a DataFrame Object. 
> > head(df.test) > column_1 column_2 column_10 column_11 column_12 column_13 column_14 > column_15 > 11helloNANANANANA > NA > column_3 column_4 column_5 column_6 column_7 column_8 column_9 > 1 3.4 NA NA NA NA NA NA > sparkR is reading from the column_family correctly, but now let's add a > collection data type to the schema. > Now I will drop that test_table, and recreate the table, with an extra > column of data type map> CREATE TABLE test_table ( > column_1 int, > column_2 text, > column_3 float, > column_4 uuid, > column_5 timestamp, > column_6 boolean, > column_7 timeuuid, > column_8
[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps
[ https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964043#comment-14964043 ] Charles Allen commented on SPARK-11016: --- [~srowen] I confirmed locally that https://github.com/metamx/spark/pull/1 prevents this error, but as per your prior comment a "more correct" implementation would probably provide a Kryo Externalizable bridge of some kind. > Spark fails when running with a task that requires a more recent version of > RoaringBitmaps > -- > > Key: SPARK-11016 > URL: https://issues.apache.org/jira/browse/SPARK-11016 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Charles Allen > > The following error appears during Kryo init whenever a more recent version > (>0.5.0) of Roaring bitmaps is required by a job. > org/roaringbitmap/RoaringArray$Element was removed in 0.5.0 > {code} > A needed class was not found. This could be due to an error in your runpath. > Missing class: org/roaringbitmap/RoaringArray$Element > java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338) > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala) > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237) > at > org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222) > at > org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102) > at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > at > 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.textFile(SparkContext.scala:816) > {code} > See https://issues.apache.org/jira/browse/SPARK-5949 for related info -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11194) Use a single URLClassLoader for jars added through SQL's "ADD JAR" command
Yin Huai created SPARK-11194: Summary: Use a single URLClassLoader for jars added through SQL's "ADD JAR" command Key: SPARK-11194 URL: https://issues.apache.org/jira/browse/SPARK-11194 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5929) Pyspark: Register a pip requirements file with spark_context
[ https://issues.apache.org/jira/browse/SPARK-5929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963655#comment-14963655 ] buckhx commented on SPARK-5929: --- I also included an add module that will bundle and ship a module that has already been imported by the driver > Pyspark: Register a pip requirements file with spark_context > > > Key: SPARK-5929 > URL: https://issues.apache.org/jira/browse/SPARK-5929 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: buckhx >Priority: Minor > > I've been doing a lot of dependency work with shipping dependencies to > workers as it is non-trivial for me to have my workers include the proper > dependencies in their own environments. > To circumvent this, I added a addRequirementsFile() method that takes a pip > requirements file, downloads the packages, repackages them to be registered > with addPyFiles and ship them to workers. > Here is a comparison of what I've done on the Palantir fork > https://github.com/buckheroux/spark/compare/palantir:master...master -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
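The first step of an addRequirementsFile() helper like the one proposed is simply parsing the pip specifiers before downloading and repackaging them; a hedged sketch of just that step (illustrative helper, not the fork's actual code, and ignoring pip's less common line forms such as -r includes):

```python
def parse_requirements(text):
    # Extract package specifiers from requirements-file text,
    # dropping comments and blank lines.
    reqs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            reqs.append(line)
    return reqs

reqs = parse_requirements("requests==2.4\n# pinned for CI\n\nnumpy\n")
```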
[jira] [Updated] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions
[ https://issues.apache.org/jira/browse/SPARK-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11179: -- Fix Version/s: (was: 1.6.0) [~nitin2goyal] this can't have a Fix version. > Push filters through aggregate if filters are subset of 'group by' expressions > -- > > Key: SPARK-11179 > URL: https://issues.apache.org/jira/browse/SPARK-11179 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nitin Goyal >Priority: Minor > > Push filters through aggregate if filters are subset of 'group by' > expressions. This optimisation can be added in Spark SQL's Optimizer class -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
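The optimisation proposed in this ticket is sound because a filter on grouping keys commutes with the aggregation; a hedged Python sketch of the equivalence the rewrite relies on (toy group-sum, not Catalyst):

```python
from collections import defaultdict

def group_sum(rows, key, value):
    # Toy GROUP BY key, SUM(value).
    out = defaultdict(int)
    for row in rows:
        out[row[key]] += row[value]
    return dict(out)

rows = [
    {"dept": "a", "x": 1},
    {"dept": "b", "x": 2},
    {"dept": "a", "x": 3},
]
pred = lambda dept: dept == "a"

# Filter applied after the aggregate (the unoptimised plan) ...
after = {k: v for k, v in group_sum(rows, "dept", "x").items() if pred(k)}
# ... equals the filter pushed below the aggregate (the proposed rewrite),
# because the predicate only references the grouping expression.
before = group_sum([r for r in rows if pred(r["dept"])], "dept", "x")
assert after == before
```

Pushing the filter down is profitable because fewer rows reach the (typically expensive) aggregation step.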
[jira] [Resolved] (SPARK-11119) cleanup unsafe array and map
[ https://issues.apache.org/jira/browse/SPARK-11119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-11119. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9131 [https://github.com/apache/spark/pull/9131] > cleanup unsafe array and map > > > Key: SPARK-11119 > URL: https://issues.apache.org/jira/browse/SPARK-11119 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5250) EOFException in when reading gzipped files from S3 with wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5250. --- Resolution: Cannot Reproduce > EOFException in when reading gzipped files from S3 with wholeTextFiles > -- > > Key: SPARK-5250 > URL: https://issues.apache.org/jira/browse/SPARK-5250 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Mojmir Vinkler >Priority: Critical > > I get an `EOFException` error when reading *some* gzipped files using > `sc.wholeTextFiles`. It happens to just a few files, I thought that the file > is corrupted, but I was able to read it without problems using `sc.textFile` > (and pandas). > Traceback for command > `sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()` > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > > 1 sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect() > /home/ubuntu/databricks/spark/python/pyspark/rdd.py in collect(self) > 674 """ > 675 with SCCallSiteSync(self.context) as css: > --> 676 bytesInJava = self._jrdd.collect().iterator() > 677 return list(self._collect_iterator_through_file(bytesInJava)) > 678 > /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py > in __call__(self, *args) > 536 answer = self.gateway_client.send_command(command) > 537 return_value = get_return_value(answer, self.gateway_client, > --> 538 self.target_id, self.name) > 539 > 540 for temp_arg in temp_args: > /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling o1576.collect. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 41.0 (TID 4720, ip-10-0-241-126.ec2.internal): java.io.EOFException: > Unexpected end of input stream > at > org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77) > at java.io.InputStream.read(InputStream.java:101) > at com.google.common.io.ByteStreams.copy(ByteStreams.java:207) > at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:252) > at > org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:73) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at 
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at
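The "Unexpected end of input stream" in the trace above is what a decompressor raises when a gzip member is cut short, e.g. by a short read from object storage. The failure mode is easy to reproduce locally; a hedged Python illustration (standalone, nothing S3-specific):

```python
import gzip

payload = b"col1,col2\n1,2\n" * 100
blob = gzip.compress(payload)

# An intact stream decompresses fully...
assert gzip.decompress(blob) == payload

# ...but chopping bytes off the end mimics a truncated download, and
# decompression fails with EOFError ("Compressed file ended before the
# end-of-stream marker was reached") -- analogous to the Hadoop
# DecompressorStream's EOFException above.
truncated = blob[:-10]
try:
    gzip.decompress(truncated)
    raise AssertionError("expected EOFError")
except EOFError:
    pass
```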
[jira] [Created] (SPARK-11188) Elide stacktraces in bin/spark-sql for AnalysisExceptions
Michael Armbrust created SPARK-11188: Summary: Elide stacktraces in bin/spark-sql for AnalysisExceptions Key: SPARK-11188 URL: https://issues.apache.org/jira/browse/SPARK-11188 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust For analysis exceptions in the sql-shell, we should only print the error message to the screen. The stacktrace will never have useful information since this error is used to signify an error with the query. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
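The behaviour proposed here -- print only the message for expected query-level errors, keep the full trace for everything else -- can be sketched generically (hypothetical AnalysisError class and report helper, not Spark's actual code):

```python
import traceback

class AnalysisError(Exception):
    """Stand-in for an expected, user-facing query error (cf. AnalysisException)."""

def report(fn):
    # Run a query function and format its failure for a shell user.
    try:
        fn()
        return "ok"
    except AnalysisError as e:
        # Expected error: the message alone tells the user what to fix.
        return "Error in query: %s" % e
    except Exception:
        # Unexpected error: keep the stack trace, which helps debugging.
        return traceback.format_exc()

def bad_query():
    raise AnalysisError("Table not found: t")

elided = report(bad_query)            # message only, no stack trace
unexpected = report(lambda: 1 // 0)   # full traceback retained
```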
[jira] [Commented] (SPARK-11184) Declare most of .mllib code not-Experimental
[ https://issues.apache.org/jira/browse/SPARK-11184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963692#comment-14963692 ] Joseph K. Bradley commented on SPARK-11184: --- I agree we need to remove more of those tags; thanks for working on this! I'll be happy to help review. > Declare most of .mllib code not-Experimental > > > Key: SPARK-11184 > URL: https://issues.apache.org/jira/browse/SPARK-11184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.1 >Reporter: Sean Owen >Priority: Minor > > Comments please [~mengxr] and [~josephkb]: my proposal is to remove most > {{@Experimental}} annotations from the {{.mllib}} code, on the theory that > it's not intended to change much more. > I can easily take a shot at this, but wanted to collect thoughts before I > started. Does the theory sound reasonable? Part of it is a desire to keep > this annotation meaningful, and also encourage people to at least view MLlib > as stable, because it is. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes
[ https://issues.apache.org/jira/browse/SPARK-11177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-11177. Resolution: Won't Fix I'm going to resolve this as "Won't Fix", since I think that the difficulty / risk of fixing this in Spark is too high right now. Affected users should upgrade to Hadoop 2.x. > sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero > bytes > --- > > Key: SPARK-11177 > URL: https://issues.apache.org/jira/browse/SPARK-11177 > Project: Spark > Issue Type: Sub-task > Components: Input/Output >Reporter: Josh Rosen >Assignee: Josh Rosen > > From a user report: > {quote} > When I upload a series of text files to an S3 directory and one of the files > is empty (0 bytes). The sc.wholeTextFiles method stack traces. > java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:506) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:245) > at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:303) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > {quote} > It looks like this has been a longstanding 
issue: > * > http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-wholeTextFiles-error-td8872.html > * > https://stackoverflow.com/questions/31051107/read-multiple-files-from-a-directory-using-spark > * > https://forums.databricks.com/questions/1799/arrayindexoutofboundsexception-with-wholetextfiles.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
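Until an affected deployment can upgrade, a common workaround (an editorial sketch, not from the ticket) is to filter out zero-byte inputs before handing the paths to wholeTextFiles, since the crash originates in CombineFileInputFormat when it sees an empty file. The sketch below illustrates only the filtering step, using local files and made-up names; against S3 the same size check would be applied to the object listing:

```python
import os
import tempfile

def non_empty_paths(directory):
    """Return paths of files in `directory` containing at least one byte.

    Dropping zero-byte files up front sidesteps inputs that
    CombineFileInputFormat cannot split."""
    return [
        os.path.join(directory, name)
        for name in sorted(os.listdir(directory))
        if os.path.getsize(os.path.join(directory, name)) > 0
    ]

# Demo: one empty and one non-empty file in a scratch directory.
d = tempfile.mkdtemp()
open(os.path.join(d, "empty.txt"), "w").close()
with open(os.path.join(d, "data.txt"), "w") as f:
    f.write("hello")

print(non_empty_paths(d))  # only data.txt survives the filter
```

The surviving paths could then be joined with commas and passed to sc.wholeTextFiles, which accepts a comma-separated list of paths.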
[jira] [Updated] (SPARK-11187) Add Newton-Raphson Step per Tree to GBDT Implementation
[ https://issues.apache.org/jira/browse/SPARK-11187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-11187: Shepherd: DB Tsai Affects Version/s: (was: 1.5.1) 1.6.0 Component/s: ML > Add Newton-Raphson Step per Tree to GBDT Implementation > --- > > Key: SPARK-11187 > URL: https://issues.apache.org/jira/browse/SPARK-11187 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 1.6.0 >Reporter: Joseph Babcock > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5250) EOFException when reading gzipped files from S3 with wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963817#comment-14963817 ] Josh Rosen commented on SPARK-5250: --- Ah, gotcha. I'm going to resolve this as "Cannot Reproduce" for the time being, since I don't really have any means to debug this right now. > EOFException when reading gzipped files from S3 with wholeTextFiles > -- > > Key: SPARK-5250 > URL: https://issues.apache.org/jira/browse/SPARK-5250 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Mojmir Vinkler >Priority: Critical > > I get an `EOFException` error when reading *some* gzipped files using > `sc.wholeTextFiles`. It happens to just a few files; I thought the file > was corrupted, but I was able to read it without problems using `sc.textFile` > (and pandas). > Traceback for command > `sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()` > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > > 1 sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect() > /home/ubuntu/databricks/spark/python/pyspark/rdd.py in collect(self) > 674 """ > 675 with SCCallSiteSync(self.context) as css: > --> 676 bytesInJava = self._jrdd.collect().iterator() > 677 return list(self._collect_iterator_through_file(bytesInJava)) > 678 > /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py > in __call__(self, *args) > 536 answer = self.gateway_client.send_command(command) > 537 return_value = get_return_value(answer, self.gateway_client, > --> 538 self.target_id, self.name) > 539 > 540 for temp_arg in temp_args: > /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. 
> --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling o1576.collect. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 41.0 (TID 4720, ip-10-0-241-126.ec2.internal): java.io.EOFException: > Unexpected end of input stream > at > org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77) > at java.io.InputStream.read(InputStream.java:101) > at com.google.common.io.ByteStreams.copy(ByteStreams.java:207) > at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:252) > at > org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:73) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at > 
org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at
[jira] [Commented] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963849#comment-14963849 ] Ruslan Dautkhanov commented on SPARK-11150: --- Will partition-wise joins also be handled by this JIRA? https://blogs.oracle.com/datawarehousing/entry/partition_wise_joins E.g. in a two-table join by a common key, if both tables are hash-partitioned the same way, there is no need for shuffling. > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 >Reporter: Younes > > Partitions are not pruned when joined on the partition columns. > This is the same issue as HIVE-9152. > Ex: > Select * from tab where partcol=1 will prune on value 1 > Select * from tab join dim on (dim.partcol=tab.partcol) where > dim.partcol=1 will scan all partitions. > The tables are stored as Parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
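The pruning the ticket asks for can be pictured outside of Spark entirely. In this hedged plain-Python sketch (table contents and names are made up for illustration, not taken from the ticket), the filter on the dimension side is evaluated first, and only the fact-table partitions whose keys survive that filter are ever scanned:

```python
# Hypothetical in-memory model of a fact table partitioned by partcol:
# partition key -> list of rows stored in that partition.
fact_partitions = {
    1: [("a", 1)],
    2: [("b", 2)],
    3: [("c", 3)],
}
# Dimension table rows: (partcol, label).
dim = [(1, "dim-1"), (2, "dim-2")]

def join_with_pruning(fact_partitions, dim, dim_filter):
    # Step 1: evaluate the dimension-side predicate first.
    dim_rows = [r for r in dim if dim_filter(r)]
    # Step 2: the surviving join keys determine which fact partitions to scan.
    wanted = {partcol for (partcol, _) in dim_rows}
    scanned = [p for p in fact_partitions if p in wanted]
    # Step 3: join only the pruned partitions.
    out = []
    for p in scanned:
        for row in fact_partitions[p]:
            for (partcol, label) in dim_rows:
                if partcol == p:
                    out.append(row + (label,))
    return scanned, out

# With dim.partcol = 1, partitions 2 and 3 are never scanned.
scanned, rows = join_with_pruning(fact_partitions, dim, lambda r: r[0] == 1)
print(scanned)  # [1]
```

Without step 2 the join must scan every fact partition, which is exactly the behavior the ticket reports for `dim.partcol=1` pushed through a join.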
[jira] [Created] (SPARK-11189) History server is not able to parse some application report
Jean-Baptiste Onofré created SPARK-11189: Summary: History server is not able to parse some application report Key: SPARK-11189 URL: https://issues.apache.org/jira/browse/SPARK-11189 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.6.0 Reporter: Jean-Baptiste Onofré In some cases, the history server is not able to parse an application report. For instance, with the JavaTC example: {code} com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: was expecting closing '"' for name at [Source: {"Event":"SparkListenerTaskEnd","Stage ID":245,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Rea; line: 1, column: 241] at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1419) at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:508) at com.fasterxml.jackson.core.base.ParserMinimalBase._reportInvalidEOF(ParserMinimalBase.java:445) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._parseName2(ReaderBasedJsonParser.java:1284) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._parseName(ReaderBasedJsonParser.java:1268) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:618) at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:34) at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42) at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35) at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3066) at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161) at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19) at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) at org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:950) at 
org.apache.spark.deploy.master.Master.removeApplication(Master.scala:812) at org.apache.spark.deploy.master.Master.org$apache$spark$deploy$master$Master$$finishApplication(Master.scala:790) at org.apache.spark.deploy.master.Master$$anonfun$receive$1$$anonfun$applyOrElse$21.apply(Master.scala:382) at org.apache.spark.deploy.master.Master$$anonfun$receive$1$$anonfun$applyOrElse$21.apply(Master.scala:382) at scala.Option.foreach(Option.scala:236) at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:382) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:206) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:99) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:224) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11189) History server is not able to parse some application report
[ https://issues.apache.org/jira/browse/SPARK-11189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963936#comment-14963936 ] Sean Owen commented on SPARK-11189: --- It looks like you have a truncated input file. Are there any other problems leading up to this? > History server is not able to parse some application report > --- > > Key: SPARK-11189 > URL: https://issues.apache.org/jira/browse/SPARK-11189 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jean-Baptiste Onofré > > In some case, history server is not able to parse an application report. > For instance, with JavaTC example: > {code} > com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: was > expecting closing '"' for name > at [Source: {"Event":"SparkListenerTaskEnd","Stage ID":245,"Stage Attempt > ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Rea; line: 1, column: > 241] > at > com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1419) > at > com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:508) > at > com.fasterxml.jackson.core.base.ParserMinimalBase._reportInvalidEOF(ParserMinimalBase.java:445) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._parseName2(ReaderBasedJsonParser.java:1284) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._parseName(ReaderBasedJsonParser.java:1268) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:618) > at > org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:34) > at > org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42) > at > org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35) > at > com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3066) > at > com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161) > at 
org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19) > at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) > at > org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:950) > at > org.apache.spark.deploy.master.Master.removeApplication(Master.scala:812) > at > org.apache.spark.deploy.master.Master.org$apache$spark$deploy$master$Master$$finishApplication(Master.scala:790) > at > org.apache.spark.deploy.master.Master$$anonfun$receive$1$$anonfun$applyOrElse$21.apply(Master.scala:382) > at > org.apache.spark.deploy.master.Master$$anonfun$receive$1$$anonfun$applyOrElse$21.apply(Master.scala:382) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:382) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:206) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:99) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:224) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963648#comment-14963648 ] Joseph K. Bradley commented on SPARK-4240: -- This conversation slipped under my radar somehow; my apologies! I think it'd be fine to copy the implementation of GBTs to spark.ml, especially if we want to restructure it to support TreeBoost. As far as updating or replacing the spark.mllib implementation, I'd say: Ideally it would eventually be a wrapper for the spark.ml implementation, but we should focus on the spark.ml API and implementation for now, even if it means temporarily having a copy of the code. I think it'd be hard to combine this work with generic boosting because TreeBoost relies on the fact that trees are a space-partitioning algorithm, but we could discuss feasibility if there is a way to leverage the same implementation. [~dbtsai] expressed interest in this work, so I'll ping him here. > Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy. > > > Key: SPARK-4240 > URL: https://issues.apache.org/jira/browse/SPARK-4240 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Sung Chung > > The gradient boosting as currently implemented estimates the loss-gradient in > each iteration using regression trees. At every iteration, the regression > trees are trained/split to minimize predicted gradient variance. > Additionally, the terminal node predictions are computed to minimize the > prediction variance. > However, such predictions won't be optimal for loss functions other than the > mean-squared error. The TreeBoosting refinement can help mitigate this issue > by modifying terminal node prediction values so that those predictions would > directly minimize the actual loss function. 
Although this still doesn't > change the fact that the tree splits were done through variance reduction, it > should still lead to improvement in gradient estimations, and thus better > performance. > The details of this can be found in the R vignette. This paper also shows how > to refine the terminal node predictions. > http://www.saedsayad.com/docs/gbm2.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
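The refinement described above boils down to replacing each terminal node's variance-minimizing constant with the value that minimizes the actual loss over the examples falling in that node, typically via one Newton-Raphson step. A minimal sketch of that single step, assuming log-loss and hypothetical leaf contents (this is not Spark's tree API, just the arithmetic):

```python
import math

def newton_leaf_value(ys, preds, grad, hess):
    """One Newton-Raphson step for a tree leaf: the constant delta that
    approximately minimizes sum_i loss(y_i, pred_i + delta).
    delta = -sum(g_i) / sum(h_i), the standard second-order update."""
    g = sum(grad(y, f) for y, f in zip(ys, preds))
    h = sum(hess(y, f) for y, f in zip(ys, preds))
    return -g / h

# Log-loss on raw scores f with labels y in {0, 1}:
#   gradient = sigmoid(f) - y,  hessian = sigmoid(f) * (1 - sigmoid(f))
sig = lambda f: 1.0 / (1.0 + math.exp(-f))
grad = lambda y, f: sig(f) - y
hess = lambda y, f: sig(f) * (1.0 - sig(f))

# A leaf holding three positive examples currently scored at f = 0:
delta = newton_leaf_value([1, 1, 1], [0.0, 0.0, 0.0], grad, hess)
print(delta)  # 2.0: each gradient is -0.5, each hessian 0.25
```

For squared error this step reproduces the leaf mean, which is why the refinement only matters for other losses, as the ticket notes.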
[jira] [Updated] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes
[ https://issues.apache.org/jira/browse/SPARK-11177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11177: -- Component/s: Input/Output > sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero > bytes > --- > > Key: SPARK-11177 > URL: https://issues.apache.org/jira/browse/SPARK-11177 > Project: Spark > Issue Type: Sub-task > Components: Input/Output >Reporter: Josh Rosen >Assignee: Josh Rosen > > From a user report: > {quote} > When I upload a series of text files to an S3 directory and one of the files > is empty (0 bytes), the sc.wholeTextFiles method fails with the stack trace below. > java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:506) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:245) > at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:303) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > {quote} > It looks like this has been a longstanding issue: > * > http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-wholeTextFiles-error-td8872.html > * > 
https://stackoverflow.com/questions/31051107/read-multiple-files-from-a-directory-using-spark > * > https://forums.databricks.com/questions/1799/arrayindexoutofboundsexception-with-wholetextfiles.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small
[ https://issues.apache.org/jira/browse/SPARK-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-10668. - Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8884 [https://github.com/apache/spark/pull/8884] > Use WeightedLeastSquares in LinearRegression with L2 regularization if the > number of features is small > -- > > Key: SPARK-10668 > URL: https://issues.apache.org/jira/browse/SPARK-10668 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Kai Sasaki >Priority: Critical > Fix For: 1.6.0 > > > If the number of features is small (<=4096) and the regularization is L2, we > should use WeightedLeastSquares to solve the problem rather than L-BFGS. The > former requires only one pass over the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
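The one-pass advantage is easiest to see in the single-feature case, where the weighted normal equations collapse to a closed form computed from a handful of weighted sums. A plain-Python sketch (made-up data, and not Spark's actual WeightedLeastSquares implementation, which solves the multi-feature normal equations):

```python
def weighted_least_squares(xs, ys, ws):
    """Closed-form weighted OLS for y ~ a + b*x. All the sufficient
    statistics are weighted sums, hence a single pass over the data --
    the reason WLS beats iterative L-BFGS when the feature count is small."""
    sw = sum(ws)
    xb = sum(w * x for w, x in zip(ws, xs)) / sw  # weighted mean of x
    yb = sum(w * y for w, y in zip(ws, ys)) / sw  # weighted mean of y
    b = (sum(w * (x - xb) * (y - yb) for w, x, y in zip(ws, xs, ys))
         / sum(w * (x - xb) ** 2 for w, x in zip(ws, xs)))
    a = yb - b * xb
    return a, b

# Points lying exactly on y = 1 + 2x are recovered regardless of the weights:
a, b = weighted_least_squares([0, 1, 2], [1, 3, 5], [1.0, 2.0, 0.5])
print(round(a, 6), round(b, 6))
```

An iterative solver such as L-BFGS would instead touch the data once per iteration, which is what makes the normal-equation path attractive below the 4096-feature threshold mentioned in the issue.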
[jira] [Resolved] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets
[ https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4414. --- Resolution: Won't Fix I'm going to resolve this as "Won't Fix", since I think that the difficulty / risk of fixing this in Spark is too high right now. While in principle we could fix this by inlining the affected Hadoop classes in Spark, it's going to be extremely difficult to do this in a way that is source- and binary-compatible with all of the Hadoop versions that we need to support. Affected users should upgrade to Hadoop 1.2.1 or higher, which do not seem to be affected by this bug. > SparkContext.wholeTextFiles Doesn't work with S3 Buckets > > > Key: SPARK-4414 > URL: https://issues.apache.org/jira/browse/SPARK-4414 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Pedro Rodriguez >Assignee: Josh Rosen >Priority: Critical > > SparkContext.wholeTextFiles does not read files which SparkContext.textFile > can read. Below are general steps to reproduce; my specific case, on a git repo, > follows them. > Steps to reproduce. > 1. Create Amazon S3 bucket, make public with multiple files > 2. Attempt to read bucket with > sc.wholeTextFiles("s3n://mybucket/myfile.txt") > 3. Spark returns the following error, even if the file exists. > Exception in thread "main" java.io.FileNotFoundException: File does not > exist: /myfile.txt > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:489) > 4. Change the call to > sc.textFile("s3n://mybucket/myfile.txt") > and there is no error message; the application runs fine. 
> There is a question on StackOverflow as well on this: > http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist > This is link to repo/lines of code. The uncommented call doesn't work, the > commented call works as expected: > https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19 > It would be easy to use textFile with a multifile argument, but this should > work correctly for s3 bucket files as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11187) Add Newton-Raphson Step per Tree to GBDT Implementation
Joseph Babcock created SPARK-11187: -- Summary: Add Newton-Raphson Step per Tree to GBDT Implementation Key: SPARK-11187 URL: https://issues.apache.org/jira/browse/SPARK-11187 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.5.1 Reporter: Joseph Babcock -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9643) Error serializing datetimes with timezones using Dataframes and Parquet
[ https://issues.apache.org/jira/browse/SPARK-9643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9643: - Assignee: Alex Angelini > Error serializing datetimes with timezones using Dataframes and Parquet > --- > > Key: SPARK-9643 > URL: https://issues.apache.org/jira/browse/SPARK-9643 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1 >Reporter: Alex Angelini >Assignee: Alex Angelini > Labels: upgrade > Fix For: 1.6.0 > > > Trying to serialize a DataFrame with a datetime column that includes a > timezone fails with the following error. > {code} > net.razorvine.pickle.PickleException: invalid pickle data for datetime; > expected 1 or 7 args, got 2 > at > net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69) > at > net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:85) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:150) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.org$apache$spark$sql$execution$datasources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:185) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:163) > at > 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:163) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:64) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > According to [~davies] timezone serialization is done directly in Spark and > not dependent on Pyrolite, but I was not able to prove that. > Upgrading to Pyrolite 4.9 fixed this issue > https://github.com/apache/spark/pull/7950 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10994) Clustering coefficient computation in GraphX
[ https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963806#comment-14963806 ] Reynold Xin commented on SPARK-10994: - [~sherlockbourne] I am sure this is a pretty good algorithm, but lately we have been pushing to have implementations like this maintained outside of Spark, as packages on http://spark-packages.org/ In many ways it is better for this to be maintained outside: 1. You can iterate on it really quickly without the overhead of the Apache Software Foundation processes. 2. You can promote this more easily, since with so many changes in each Spark release, it is getting harder and harder for users to discover new features. If this is a 3rd-party package, you can write dedicated blog posts and have good entry-point READMEs on GitHub. 3. It is just as easy to use this. As soon as you publish the package to Maven, users can use the package directly in the repl by adding a command line flag. > Clustering coefficient computation in GraphX > > > Key: SPARK-10994 > URL: https://issues.apache.org/jira/browse/SPARK-10994 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Yang Yang > Original Estimate: 336h > Remaining Estimate: 336h > > The Clustering Coefficient (CC) is a fundamental measure in social (or other > types of) network analysis assessing the degree to which nodes tend to cluster > together [1][2]. Clustering coefficient, along with density, node degree, > path length, diameter, connectedness, and node centrality, are the seven most > important properties to characterise a network [3]. > We found that GraphX has already implemented connectedness, node centrality, > and path length, but does not have a component for computing the clustering > coefficient. This was in fact our original motivation for implementing an > algorithm to compute the clustering coefficient for each vertex of a given graph. 
> Clustering coefficient is very helpful in many real-world applications, such as > user behaviour prediction and structure prediction (like link prediction). We > have used it in a number of papers (e.g., [4-5]), and have also found many other > publications using this metric in their work [6-8]. We are very > confident that this feature will benefit GraphX and attract a large number of > users. > References > [1] https://en.wikipedia.org/wiki/Clustering_coefficient > [2] Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of > ‘small-world’ networks." Nature 393.6684 (1998): 440-442. (with 27266 > citations). > [3] https://en.wikipedia.org/wiki/Network_science > [4] Jing Zhang, Zhanpeng Fang, Wei Chen, and Jie Tang. Diffusion of > "Following" Links in Microblogging Networks. IEEE Transactions on Knowledge > and Data Engineering (TKDE), Volume 27, Issue 8, 2015, Pages 2093-2106. > [5] Yang Yang, Jie Tang, Jacklyne Keomany, Yanting Zhao, Ying Ding, Juanzi > Li, and Liangwei Wang. Mining Competitive Relationships by Learning across > Heterogeneous Networks. In Proceedings of the Twenty-First Conference on > Information and Knowledge Management (CIKM'12). pp. 1432-1441. > [6] Clauset, Aaron, Cristopher Moore, and Mark EJ Newman. Hierarchical > structure and the prediction of missing links in networks. Nature 453.7191 > (2008): 98-101. (with 973 citations) > [7] Adamic, Lada A., and Eytan Adar. Friends and neighbors on the web. Social > Networks 25.3 (2003): 211-230. (1238 citations) > [8] Lichtenwalter, Ryan N., Jake T. Lussier, and Nitesh V. Chawla. New > perspectives and methods in link prediction. In KDD'10. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
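For readers unfamiliar with the metric, the local clustering coefficient of a vertex is the fraction of its neighbor pairs that are themselves connected. A minimal pure-Python sketch on a hypothetical four-vertex graph (a distributed GraphX version would instead build on triangle counting, e.g. the existing TriangleCount routine, to get the numerator):

```python
def local_clustering(adj):
    """Local clustering coefficient for each vertex of an undirected graph,
    given as an adjacency dict: vertex -> set of neighbors."""
    cc = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            cc[v] = 0.0  # fewer than two neighbors: coefficient undefined, use 0
            continue
        # Count edges among v's neighbors, each unordered pair once.
        links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
        cc[v] = 2.0 * links / (k * (k - 1))
    return cc

# Triangle 0-1-2 plus a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(local_clustering(adj))
# Vertices 0 and 1 get 1.0, vertex 2 gets 1/3, the pendant vertex 3 gets 0.0.
```

The degree, triangle count, and neighbor sets used here map naturally onto vertex-centric graph computation, which is presumably why the reporters see it as a good fit for GraphX.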
[jira] [Commented] (SPARK-11186) Caseness inconsistency between SQLContext and HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963835#comment-14963835 ] kevin yu commented on SPARK-11186: -- Hello Santiago: How did you run the above code? Did you get any stack trace? I tried it in the spark-shell and got the error below; it seems that SQLContext.catalog is a protected lazy value and can't be accessed from outside the class. scala> sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation { | override def sqlContext: SQLContext = sqlc | override def schema: StructType = StructType(Nil) | })) :26: error: lazy value catalog in class SQLContext cannot be accessed in org.apache.spark.sql.SQLContext Access to protected value catalog not permitted because enclosing class $iwC is not a subclass of class SQLContext in package sql where target is defined sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation { > Caseness inconsistency between SQLContext and HiveContext > - > > Key: SPARK-11186 > URL: https://issues.apache.org/jira/browse/SPARK-11186 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Santiago M. Mola >Priority: Minor > > Default catalog behaviour for caseness is different in {{SQLContext}} and > {{HiveContext}}. 
> {code} > test("Catalog caseness (SQL)") { > val sqlc = new SQLContext(sc) > val relationName = "MyTable" > sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new > BaseRelation { > override def sqlContext: SQLContext = sqlc > override def schema: StructType = StructType(Nil) > })) > val tables = sqlc.tableNames() > assert(tables.contains(relationName)) > } > test("Catalog caseness (Hive)") { > val sqlc = new HiveContext(sc) > val relationName = "MyTable" > sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new > BaseRelation { > override def sqlContext: SQLContext = sqlc > override def schema: StructType = StructType(Nil) > })) > val tables = sqlc.tableNames() > assert(tables.contains(relationName)) > } > {code} > Looking at {{HiveContext#SQLSession}}, I see this is the intended behaviour. > But the reason that this is needed seems undocumented (both in the manual or > in the source code comments). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11188) Elide stacktraces in bin/spark-sql for AnalysisExceptions
[ https://issues.apache.org/jira/browse/SPARK-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11188: - Target Version/s: 1.4.2, 1.5.2, 1.6.0 (was: 1.6.0) > Elide stacktraces in bin/spark-sql for AnalysisExceptions > - > > Key: SPARK-11188 > URL: https://issues.apache.org/jira/browse/SPARK-11188 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust > > For analysis exceptions in the sql-shell, we should only print the error > message to the screen. The stacktrace will never have useful information > since this error is used to signify an error with the query. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11196) Support for equality and pushdown of filters on some UDTs
Michael Armbrust created SPARK-11196: Summary: Support for equality and pushdown of filters on some UDTs Key: SPARK-11196 URL: https://issues.apache.org/jira/browse/SPARK-11196 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Today, if you try to do any comparisons with UDTs, it fails due to bad casting. However, in some cases the UDT is just a thin wrapper around a SQL type (StringType, for example). In these cases we could just convert the UDT to its SQL type. Rough prototype: https://github.com/apache/spark/compare/apache:master...marmbrus:uuid-udt -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org