[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433406#comment-16433406 ] Liang-Chi Hsieh commented on SPARK-23928: - If there is no assignee and no one has announced they are working on it, there is no problem with you taking a JIRA. > High-order function: shuffle(x) → array > --- > > Key: SPARK-23928 > URL: https://issues.apache.org/jira/browse/SPARK-23928 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Generate a random permutation of the given array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
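The requested shuffle(x) behavior (a uniformly random permutation of the input array, per the Presto reference above) can be sketched in plain Python. This is only an illustration of the requested semantics, not the eventual Spark implementation:

```python
import random

def shuffle_array(x):
    """Return a uniformly random permutation of x.

    random.shuffle performs a Fisher-Yates shuffle; we copy first so the
    input array is left untouched, matching expression semantics.
    """
    out = list(x)
    random.shuffle(out)
    return out

result = shuffle_array([1, 2, 3, 4, 5])
# result is a permutation: same elements, possibly in a different order.
```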
[jira] [Assigned] (SPARK-23958) HadoopRdd filters empty files to avoid generating empty tasks that degrade Spark computing performance.
[ https://issues.apache.org/jira/browse/SPARK-23958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23958: Assignee: Apache Spark > HadoopRdd filters empty files to avoid generating empty tasks that degrade > Spark computing performance. > - > > Key: SPARK-23958 > URL: https://issues.apache.org/jira/browse/SPARK-23958 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Assignee: Apache Spark >Priority: Minor > > HadoopRdd filters empty files to avoid generating empty tasks that degrade > Spark computing performance. > An empty file's length is zero.
[jira] [Assigned] (SPARK-23958) HadoopRdd filters empty files to avoid generating empty tasks that degrade Spark computing performance.
[ https://issues.apache.org/jira/browse/SPARK-23958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23958: Assignee: (was: Apache Spark) > HadoopRdd filters empty files to avoid generating empty tasks that degrade > Spark computing performance. > - > > Key: SPARK-23958 > URL: https://issues.apache.org/jira/browse/SPARK-23958 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Minor > > HadoopRdd filters empty files to avoid generating empty tasks that degrade > Spark computing performance. > An empty file's length is zero.
[jira] [Commented] (SPARK-23958) HadoopRdd filters empty files to avoid generating empty tasks that degrade Spark computing performance.
[ https://issues.apache.org/jira/browse/SPARK-23958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1640#comment-1640 ] Apache Spark commented on SPARK-23958: -- User 'guoxiaolongzte' has created a pull request for this issue: https://github.com/apache/spark/pull/21036 > HadoopRdd filters empty files to avoid generating empty tasks that degrade > Spark computing performance. > - > > Key: SPARK-23958 > URL: https://issues.apache.org/jira/browse/SPARK-23958 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Minor > > HadoopRdd filters empty files to avoid generating empty tasks that degrade > Spark computing performance. > An empty file's length is zero.
[jira] [Created] (SPARK-23958) HadoopRdd filters empty files to avoid generating empty tasks that degrade Spark computing performance.
guoxiaolongzte created SPARK-23958: -- Summary: HadoopRdd filters empty files to avoid generating empty tasks that degrade Spark computing performance. Key: SPARK-23958 URL: https://issues.apache.org/jira/browse/SPARK-23958 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: guoxiaolongzte HadoopRdd filters empty files to avoid generating empty tasks that degrade Spark computing performance. An empty file's length is zero.
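The proposed behavior (skip zero-length input files when building splits, so no empty tasks get scheduled) can be illustrated with a plain-Python sketch. The file names and helper below are hypothetical; this is not the actual HadoopRDD code:

```python
import os
import tempfile

def non_empty_files(paths):
    """Keep only input files whose length is greater than zero,
    mirroring the proposed HadoopRdd filter for empty files."""
    return [p for p in paths if os.path.getsize(p) > 0]

# Hypothetical demo: one empty file and one non-empty file.
tmpdir = tempfile.mkdtemp()
empty = os.path.join(tmpdir, "part-00000")
full = os.path.join(tmpdir, "part-00001")
open(empty, "w").close()              # zero-length file
with open(full, "w") as f:
    f.write("some records\n")

kept = non_empty_files([empty, full])  # only the non-empty file survives
```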
[jira] [Updated] (SPARK-23955) typo in parameter name 'rawPredicition'
[ https://issues.apache.org/jira/browse/SPARK-23955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23955: - Priority: Trivial (was: Minor) > typo in parameter name 'rawPredicition' > --- > > Key: SPARK-23955 > URL: https://issues.apache.org/jira/browse/SPARK-23955 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: John Bauer >Priority: Trivial > > classifier.py MultilayerPerceptronClassifier.__init__ API call had typo > rawPredicition instead of rawPrediction > also present in doc -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input
[ https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433316#comment-16433316 ] Nicholas Chammas commented on SPARK-23945: -- I always looked at DataFrames and SQL as two different interfaces to the same underlying logical model, so I just assumed that the vision was for them to be equally powerful. Is that not the case? So in the grand scheme of things I'd expect DataFrames to be able to do everything that SQL can and vice versa, but for the narrow purposes of this ticket I'm just interested in {{IN }}and {{NOT IN.}} > Column.isin() should accept a single-column DataFrame as input > -- > > Key: SPARK-23945 > URL: https://issues.apache.org/jira/browse/SPARK-23945 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Nicholas Chammas >Priority: Minor > > In SQL you can filter rows based on the result of a subquery: > {code:java} > SELECT * > FROM table1 > WHERE name NOT IN ( > SELECT name > FROM table2 > );{code} > In the Spark DataFrame API, the equivalent would probably look like this: > {code:java} > (table1 > .where( > ~col('name').isin( > table2.select('name') > ) > ) > ){code} > However, .isin() currently [only accepts a local list of > values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin]. > I imagine making this enhancement would happen as part of a larger effort to > support correlated subqueries in the DataFrame API. > Or perhaps there is no plan to support this style of query in the DataFrame > API, and queries like this should instead be written in a different way? How > would we write a query like the one I have above in the DataFrame API, > without needing to collect values locally for the NOT IN filter? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
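The SQL NOT IN subquery above keeps rows of table1 whose name has no match in table2. Until .isin() accepts a column, the usual DataFrame workaround is an anti join (in PySpark, something like table1.join(table2, 'name', 'left_anti')). The set-based sketch below, in plain Python with hypothetical rows, shows the filter's semantics:

```python
# Hypothetical data standing in for table1 and table2.
table1 = [{"name": "alice"}, {"name": "bob"}, {"name": "carol"}]
table2 = [{"name": "bob"}]

# The subquery: SELECT name FROM table2
names_in_table2 = {row["name"] for row in table2}

# Rows of table1 whose name does not appear in table2 (the NOT IN filter).
# Note: this ignores SQL's NULL-handling subtleties for NOT IN.
not_in_result = [row for row in table1 if row["name"] not in names_in_table2]
```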
[jira] [Commented] (SPARK-23847) Add asc_nulls_first, asc_nulls_last to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433314#comment-16433314 ] Apache Spark commented on SPARK-23847: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21035 > Add asc_nulls_first, asc_nulls_last to PySpark > -- > > Key: SPARK-23847 > URL: https://issues.apache.org/jira/browse/SPARK-23847 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 2.4.0 > > > Column.scala and Functions.scala have asc_nulls_first, asc_nulls_last, > desc_nulls_first and desc_nulls_last. Add the corresponding python APIs in > PySpark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23955) typo in parameter name 'rawPredicition'
[ https://issues.apache.org/jira/browse/SPARK-23955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433302#comment-16433302 ] Hyukjin Kwon commented on SPARK-23955: -- Fixing a typo doesn't need a JIRA. Let's avoid this next time. > typo in parameter name 'rawPredicition' > --- > > Key: SPARK-23955 > URL: https://issues.apache.org/jira/browse/SPARK-23955 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: John Bauer >Priority: Minor > > classifier.py MultilayerPerceptronClassifier.__init__ API call had typo > rawPredicition instead of rawPrediction > also present in doc -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23954) Converting Spark dataframes containing int64 fields to R dataframes leads to unpredictable errors.
[ https://issues.apache.org/jira/browse/SPARK-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433298#comment-16433298 ] Hyukjin Kwon commented on SPARK-23954: -- Can you check other JIRAs and see if there are duplicates? I feel sure there are duplicates about this, for example, SPARK-14326 or SPARK-12360. > Converting Spark dataframes containing int64 fields to R dataframes leads to > unpredictable errors. > - > > Key: SPARK-23954 > URL: https://issues.apache.org/jira/browse/SPARK-23954 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: nicolas paris >Priority: Minor > > Converting Spark dataframes containing int64 fields to R dataframes leads to > unpredictable errors. > The problem comes from R, which does not handle int64 natively. As a result, a > good workaround would be to convert bigint fields to strings when transforming Spark > dataframes into R dataframes.
[jira] [Commented] (SPARK-23950) Coalescing an empty dataframe to 1 partition
[ https://issues.apache.org/jira/browse/SPARK-23950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433296#comment-16433296 ] Hyukjin Kwon commented on SPARK-23950: -- Seems fixed in the current master. Let me leave this resolved but it would be great if we can find which one fixes it and backports if applicable. > Coalescing an empty dataframe to 1 partition > > > Key: SPARK-23950 > URL: https://issues.apache.org/jira/browse/SPARK-23950 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 > Environment: Operating System: Windows 7 > Tested in Jupyter notebooks using Python 2.7.14 and Python 3.6.3. > Hardware specs not relevant to the issue. >Reporter: João Neves >Priority: Major > > Coalescing an empty dataframe to 1 partition returns an error. > The funny thing is that coalescing an empty dataframe to 2 or more partitions > seem to work. > The test case is the following: > {code} > from pyspark.sql.types import StructType > df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType([])) > print(df.coalesce(2).count()) > print(df.coalesce(3).count()) > print(df.coalesce(4).count()) > df.coalesce(1).count(){code} > Output: > {code:java} > 0 > 0 > 0 > --- > Py4JJavaError Traceback (most recent call last) > in () > 7 print(df.coalesce(4).count()) > 8 > > 9 print(df.coalesce(1).count()) > C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py in count(self) > 425 2 > 426 """ > --> 427 return int(self._jdf.count()) > 428 > 429 @ignore_unicode_prefix > c:\python36\lib\site-packages\py4j\java_gateway.py in __call__(self, *args) > 1131 answer = self.gateway_client.send_command(command) > 1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) > 1134 > 1135 for temp_arg in temp_args: > C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except 
py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > c:\python36\lib\site-packages\py4j\protocol.py in get_return_value(answer, > gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o176.count. > : java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at scala.collection.IterableLike$class.head(IterableLike.scala:107) > at > scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) > at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2435) > at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2434) > at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841) > at org.apache.spark.sql.Dataset.count(Dataset.scala:2434) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at 
py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Unknown Source){code} > Shouldn't this be consistent? > Thank you very much. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
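The inconsistency reported here is that counting an empty dataframe should give 0 for any target partition count, including 1. A toy model of the expected behavior in plain Python, with partitions modeled as lists (this is not Spark's implementation, just the invariant the bug violates):

```python
def coalesce(partitions, n):
    """Merge a list of partitions into at most n partitions (round-robin),
    a toy model of DataFrame.coalesce."""
    n = max(1, n)
    out = [[] for _ in range(min(n, max(1, len(partitions))))]
    for i, part in enumerate(partitions):
        out[i % len(out)].extend(part)
    return out

def count(partitions):
    """Total rows across all partitions, a toy model of Dataset.count."""
    return sum(len(p) for p in partitions)

empty_df = []  # no partitions, no rows
# Expected: count is 0 whatever the target partition number, including 1.
```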
[jira] [Resolved] (SPARK-19947) RFormulaModel always throws Exception on transforming data with NULL or Unseen labels
[ https://issues.apache.org/jira/browse/SPARK-19947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19947. --- Resolution: Fixed Fix Version/s: 2.4.0 I'll mark this as complete. Those earlier PRs fixed some issues, and [SPARK-23562] should fix the rest. > RFormulaModel always throws Exception on transforming data with NULL or > Unseen labels > - > > Key: SPARK-19947 > URL: https://issues.apache.org/jira/browse/SPARK-19947 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Andrey Yatsuk >Priority: Major > Fix For: 2.4.0 > > > I have a trained ML model and a big data table in Parquet. I want to add a new column > to this table with predicted values. I can't lose any data, but it can have null values. > RFormulaModel.fit() creates a new StringIndexer with the default > (handleInvalid="error") parameter. VectorAssembler also throws an exception on NULL > values, so I must call df.na.drop() to transform this DataFrame, and I don't want to do this. > RFormula needs a new parameter like handleInvalid in StringIndexer, > or a transform(Seq): Vector method that users can call as a UDF > in df.withColumn("predicted", functions.callUDF(rFormulaModel::transform, > Seq))
[jira] [Resolved] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.
[ https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23562. --- Resolution: Fixed Fix Version/s: 2.4.0 I think everything has been fixed, so I'll close this. Thanks [~yogeshgarg] and [~huaxingao]! > RFormula handleInvalid should handle invalid values in non-string columns. > -- > > Key: SPARK-23562 > URL: https://issues.apache.org/jira/browse/SPARK-23562 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > Fix For: 2.4.0 > > > Currently when handleInvalid is set to 'keep' or 'skip' this only applies to > String fields. Numeric fields that are null will either cause the transformer > to fail or might be null in the resulting label column. > I'm not sure what the semantics of keep might be for numeric columns with > null values, but we should be able to at least support skip for these types. > --> Discussed offline: null values can be converted to NaN values for "keep" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.
[ https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23562: -- Shepherd: Joseph K. Bradley > RFormula handleInvalid should handle invalid values in non-string columns. > -- > > Key: SPARK-23562 > URL: https://issues.apache.org/jira/browse/SPARK-23562 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > Fix For: 2.4.0 > > > Currently when handleInvalid is set to 'keep' or 'skip' this only applies to > String fields. Numeric fields that are null will either cause the transformer > to fail or might be null in the resulting label column. > I'm not sure what the semantics of keep might be for numeric columns with > null values, but we should be able to at least support skip for these types. > --> Discussed offline: null values can be converted to NaN values for "keep" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23337) withWatermark raises an exception on struct objects
[ https://issues.apache.org/jira/browse/SPARK-23337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433222#comment-16433222 ] Michael Armbrust commented on SPARK-23337: -- The checkpoint will only grow if you are doing an aggregation, otherwise the watermark will not affect computation. You can set a watermark on the nested column, you just need to project it to a top level column using {{withColumn}} > withWatermark raises an exception on struct objects > --- > > Key: SPARK-23337 > URL: https://issues.apache.org/jira/browse/SPARK-23337 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 > Environment: Linux Ubuntu, Spark on standalone mode >Reporter: Aydin Kocas >Priority: Major > > Hi, > > when using a nested object (I mean an object within a struct, here concrete: > _source.createTime) from a json file as the parameter for the > withWatermark-method, I get an exception (see below). > Anything else works flawlessly with the nested object. > > +*{color:#14892c}works:{color}*+ > {code:java} > Dataset jsonRow = > spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("myTime", > "10 seconds").toDF();{code} > > json structure: > {code:java} > root > |-- _id: string (nullable = true) > |-- _index: string (nullable = true) > |-- _score: long (nullable = true) > |-- myTime: timestamp (nullable = true) > ..{code} > +*{color:#d04437}does not work - nested json{color}:*+ > {code:java} > Dataset jsonRow = > spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("_source.createTime", > "10 seconds").toDF();{code} > > json structure: > > {code:java} > root > |-- _id: string (nullable = true) > |-- _index: string (nullable = true) > |-- _score: long (nullable = true) > |-- _source: struct (nullable = true) > | |-- createTime: timestamp (nullable = true) > .. 
> > Exception in thread "main" > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: > 'EventTimeWatermark '_source.createTime, interval 10 seconds > +- Deduplicate [_id#0], true > +- StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@5dbbb292,json,List(),Some(StructType(StructField(_id,StringType,true), > StructField(_index,StringType,true), StructField(_score,LongType,true), > StructField(_source,StructType(StructField(additionalData,StringType,true), > StructField(client,StringType,true), > StructField(clientDomain,BooleanType,true), > StructField(clientVersion,StringType,true), > StructField(country,StringType,true), > StructField(countryName,StringType,true), > StructField(createTime,TimestampType,true), > StructField(externalIP,StringType,true), > StructField(hostname,StringType,true), > StructField(internalIP,StringType,true), > StructField(location,StringType,true), > StructField(locationDestination,StringType,true), > StructField(login,StringType,true), > StructField(originalRequestString,StringType,true), > StructField(password,StringType,true), > StructField(peerIdent,StringType,true), > StructField(peerType,StringType,true), > StructField(recievedTime,TimestampType,true), > StructField(sessionEnd,StringType,true), > StructField(sessionStart,StringType,true), > StructField(sourceEntryAS,StringType,true), > StructField(sourceEntryIp,StringType,true), > StructField(sourceEntryPort,StringType,true), > StructField(targetCountry,StringType,true), > StructField(targetCountryName,StringType,true), > StructField(targetEntryAS,StringType,true), > StructField(targetEntryIp,StringType,true), > StructField(targetEntryPort,StringType,true), > StructField(targetport,StringType,true), > StructField(username,StringType,true), > StructField(vulnid,StringType,true)),true), > StructField(_type,StringType,true))),List(),None,Map(path -> ./input/),None), > FileSource[./input/], [_id#0, _index#1, _score#2L, _source#3, _type#4] > at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:854) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:796) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) > at >
[jira] [Resolved] (SPARK-23944) Add Param set functions to LSHModel types
[ https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23944. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21015 [https://github.com/apache/spark/pull/21015] > Add Param set functions to LSHModel types > - > > Key: SPARK-23944 > URL: https://issues.apache.org/jira/browse/SPARK-23944 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.0 >Reporter: Lu Wang >Assignee: Lu Wang >Priority: Major > Fix For: 2.4.0 > > > 2 param set methods ( setInputCol, setOutputCol) are added to the two > LSHModel types for min hash and random projections. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23944) Add Param set functions to LSHModel types
[ https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-23944: - Assignee: Lu Wang > Add Param set functions to LSHModel types > - > > Key: SPARK-23944 > URL: https://issues.apache.org/jira/browse/SPARK-23944 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.3.0 >Reporter: Lu Wang >Assignee: Lu Wang >Priority: Major > > 2 param set methods ( setInputCol, setOutputCol) are added to the two > LSHModel types for min hash and random projections. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23957) Sorts in subqueries are redundant and can be removed
Henry Robinson created SPARK-23957: -- Summary: Sorts in subqueries are redundant and can be removed Key: SPARK-23957 URL: https://issues.apache.org/jira/browse/SPARK-23957 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Henry Robinson Unless combined with a {{LIMIT}}, there's no correctness reason that planned and optimized subqueries should have any sort operators (since the result of the subquery is an unordered collection of tuples). For example: {{SELECT count(1) FROM (select id FROM dft ORDER by id)}} has the following plan: {code:java} == Physical Plan == *(3) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition +- *(2) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(2) Project +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200) +- *(1) Project [id#0L] +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct {code} ... but the sort operator is redundant. Less intuitively, the sort is also redundant in selections from an ordered subquery: {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}} has plan: {code:java} == Physical Plan == *(2) Sort [id#0L ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200) +- *(1) Project [id#0L] +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: [], PushedFilters: [], ReadSchema: struct {code} ... but again, since the subquery returns a bag of tuples, the sort is unnecessary. We should consider adding an optimizer rule that removes a sort inside a subquery. SPARK-23375 is related, but removes sorts that are functionally redundant because they perform the same ordering. 
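The claim that the sort is redundant rests on subqueries returning an unordered bag of tuples: any aggregate computed over the subquery is order-independent, so an optimizer rule may drop the inner sort (absent a LIMIT). A tiny plain-Python check of that reasoning, with hypothetical data standing in for dft:

```python
rows = [3, 1, 2, 5, 4]  # hypothetical contents of dft.id

# SELECT count(1) FROM (SELECT id FROM dft ORDER BY id)
count_sorted = len(sorted(rows))

# The same query with the inner ORDER BY removed.
count_unsorted = len(rows)

# The inner sort cannot change the aggregate result, which is why the
# Sort/Exchange pair in the physical plan above is pure overhead.
```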
[jira] [Resolved] (SPARK-23871) add python api for VectorAssembler handleInvalid
[ https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23871. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21003 [https://github.com/apache/spark/pull/21003] > add python api for VectorAssembler handleInvalid > > > Key: SPARK-23871 > URL: https://issues.apache.org/jira/browse/SPARK-23871 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: yogesh garg >Assignee: Huaxin Gao >Priority: Minor > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19680) Offsets out of range with no configured reset policy for partitions
[ https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433100#comment-16433100 ] Nicholas Verbeck commented on SPARK-19680: -- KAFKA-3370 is a good solution to the badly performing jobs problem from a central point. Regardless of that, however, Spark shouldn't just dictate functionality to users like that. It should instead leave it up to the user to assume responsibility if they wish to enable that setting, with notes in Spark's docs about the potential issues until Kafka and/or Spark come up with a solution that isn't the removal of functionality. A side solution, until Kafka is fixed, would be for Spark to evaluate the setting itself: if it is set, look up the current offsets at start, move them to the latest/earliest as requested, then switch to NONE for the rest of the run. This would prevent the issues you appear to want to prevent while letting users keep the key part of the functionality they are looking for. > Offsets out of range with no configured reset policy for partitions > --- > > Key: SPARK-19680 > URL: https://issues.apache.org/jira/browse/SPARK-19680 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Schakmann Rene >Priority: Major > > I'm using Spark Streaming with Kafka to actually create a toplist. I want to > read all the messages in Kafka. 
So I set >"auto.offset.reset" -> "earliest" > Nevertheless when I start the job on our spark cluster it is not working I > get: > Error: > {code:title=error.log|borderStyle=solid} > Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, > most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, > executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: > Offsets out of range with no configured reset policy for partitions: > {SearchEvents-2=161803385} > {code} > This is somehow wrong because I did set the auto.offset.reset property > Setup: > Kafka Parameter: > {code:title=Config.Scala|borderStyle=solid} > def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, > Object] = { > Map( > "bootstrap.servers" -> > properties.getProperty("kafka.bootstrap.servers"), > "group.id" -> properties.getProperty("kafka.consumer.group"), > "auto.offset.reset" -> "earliest", > "spark.streaming.kafka.consumer.cache.enabled" -> "false", > "enable.auto.commit" -> "false", > "key.deserializer" -> classOf[StringDeserializer], > "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer") > } > {code} > Job: > {code:title=Job.Scala|borderStyle=solid} > def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, > Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: > Broadcast[KafkaSink[TopList]]): Unit = { > getFilteredStream(stream.map(_.value()), windowDuration, > slideDuration).foreachRDD(rdd => { > val topList = new TopList > topList.setCreated(new Date()) > topList.setTopListEntryList(rdd.take(TopListLength).toList) > CurrentLogger.info("TopList length: " + > topList.getTopListEntryList.size().toString) > kafkaSink.value.send(SendToTopicName, topList) > CurrentLogger.info("Last Run: " + System.currentTimeMillis()) > }) > } > def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, > slideDuration: Int): DStream[TopListEntry] = { > val Mapper = 
MapperObject.readerFor[SearchEventDTO] > result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s)) > .filter(s => s != null && s.getSearchRequest != null && > s.getSearchRequest.getSearchParameters != null && s.getVertical == > Vertical.BAP && > s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName)) > .map(row => { > val name = > row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase() > (name, new TopListEntry(name, 1, row.getResultCount)) > }) > .reduceByKeyAndWindow( > (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, > a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + > b.getMeanSearchHits), > (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, > a.getSearchCount - b.getSearchCount, a.getMeanSearchHits - > b.getMeanSearchHits), > Minutes(windowDuration), > Seconds(slideDuration)) > .filter((x: (String, TopListEntry)) => x._2.getSearchCount > 200L) > .map(row => (row._2.getSearchCount, row._2)) > .transform(rdd =>
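The side solution sketched in the comment above (resolve auto.offset.reset once at startup, then continue the run with policy NONE so that out-of-range offsets mid-run still fail loudly) boils down to a small piece of decision logic. The function below is a hypothetical, Kafka-free illustration of that logic, not Spark's or Kafka's actual API:

```python
# Hypothetical illustration of the "evaluate the reset policy once at startup" idea.
# Offsets and the reset_policy string model Kafka concepts; nothing here calls Kafka.
def resolve_start_offset(committed, earliest, latest, reset_policy):
    """Pick a starting offset for one partition and the policy for the continued run.

    committed: last committed offset for the group, or None for a fresh group.
    earliest/latest: the range of offsets the broker currently retains.
    Returns (start_offset, runtime_policy); the runtime policy is always "none"
    so that offsets going out of range mid-run raise instead of silently skipping data.
    """
    if committed is not None and earliest <= committed <= latest:
        return committed, "none"  # committed offset still retained: resume normally
    if reset_policy == "earliest":
        return earliest, "none"   # fresh group or expired data: oldest retained offset
    if reset_policy == "latest":
        return latest, "none"
    raise ValueError("offset out of range and no reset policy configured")
```

On startup the driver would run this once per partition against offsets fetched from the broker, hand the resolved offsets to the stream, and leave the consumer's own reset policy at none.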
[jira] [Created] (SPARK-23956) Use effective RPC port in AM registration
Gera Shegalov created SPARK-23956: - Summary: Use effective RPC port in AM registration Key: SPARK-23956 URL: https://issues.apache.org/jira/browse/SPARK-23956 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.3.0 Reporter: Gera Shegalov AMs should use their real RPC port in the AM registration for better diagnostics in the Application Report. {code} 18/04/10 14:56:21 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: localhost ApplicationMaster RPC port: 58338 queue: default start time: 1523397373659 final status: UNDEFINED tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23871) add python api for VectorAssembler handleInvalid
[ https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23871: -- Shepherd: Joseph K. Bradley > add python api for VectorAssembler handleInvalid > > > Key: SPARK-23871 > URL: https://issues.apache.org/jira/browse/SPARK-23871 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: yogesh garg >Assignee: Huaxin Gao >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23871) add python api for VectorAssembler handleInvalid
[ https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-23871: - Assignee: Huaxin Gao > add python api for VectorAssembler handleInvalid > > > Key: SPARK-23871 > URL: https://issues.apache.org/jira/browse/SPARK-23871 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: yogesh garg >Assignee: Huaxin Gao >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-23869) Spark 2.3.0 left outer join not emitting null values instead waiting for the record in other stream
[ https://issues.apache.org/jira/browse/SPARK-23869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bharath kumar avusherla closed SPARK-23869. --- > Spark 2.3.0 left outer join not emitting null values instead waiting for the > record in other stream > --- > > Key: SPARK-23869 > URL: https://issues.apache.org/jira/browse/SPARK-23869 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: bharath kumar avusherla >Priority: Major > > Left outer join on two streams is not emitting the null outputs; it just > waits for the record to be added to the other stream. Used a socket stream to test > this. In our case we want to emit the records with null values that don't > match on id and/or don't fall within the time-range condition. > Details of the watermarks and intervals are: > {code} > val ds1Map = ds1 > .selectExpr("Id AS ds1_Id", "ds1_timestamp") > .withWatermark("ds1_timestamp","10 seconds") > val ds2Map = ds2 > .selectExpr("Id AS ds2_Id", "ds2_timestamp") > .withWatermark("ds2_timestamp", "20 seconds") > val output = ds1Map.join( ds2Map, > expr( > """ ds1_Id = ds2_Id AND ds2_timestamp >= ds1_timestamp AND ds2_timestamp <= > ds1_timestamp + interval 1 minutes """), > "leftOuter") > val query = output.select("*") > .writeStream > .outputMode(OutputMode.Append) > .format("console") > .option("checkpointLocation", "./ewe-spark-checkpoints/") > .start() > query.awaitTermination() > {code} > Thank you. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19680) Offsets out of range with no configured reset policy for partitions
[ https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433039#comment-16433039 ] Cody Koeninger commented on SPARK-19680: [~nerdynick] If you submit a PR to add documentation I'd be happy to review it. IMHO something like KAFKA-3370 is really where this issue "should" be fixed. In the absence of that, I think using time-based indexes would be the best workaround for making jobs easier to start. If you have a constructive alternative suggestion that doesn't conflate offset reset at startup with silently losing data in the middle of a running app, I'd be happy to discuss it. > Offsets out of range with no configured reset policy for partitions > --- > > Key: SPARK-19680 > URL: https://issues.apache.org/jira/browse/SPARK-19680 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Schakmann Rene >Priority: Major >
[jira] [Commented] (SPARK-19680) Offsets out of range with no configured reset policy for partitions
[ https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433001#comment-16433001 ] Nicholas Verbeck commented on SPARK-19680: -- I just spent way too long on this; I thought I was doing something wrong or had a bug somewhere. I understand not throwing away data without notice. However, something does need to happen here: at minimum the documentation needs to note this, or at the very least a comment within KafkaUtils. I also don't agree with the throwing-away-of-data argument 100%. You are overriding expected functionality provided by the KafkaConsumer, which is already highly documented and covered within Kafka's documentation. I as the user already accept that this can happen by setting "auto.offset.reset" in the first place; that choice should not be forced upon me. I would say a better option is not to override the setting, and instead just default to NONE instead of LATEST, allowing users to make the choice themselves. The workaround of rewriting a bunch of code that already exists within the KafkaConsumer to provide the offsets at the start of the pipeline is not an acceptable answer to this. > Offsets out of range with no configured reset policy for partitions > --- > > Key: SPARK-19680 > URL: https://issues.apache.org/jira/browse/SPARK-19680 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Schakmann Rene >Priority: Major >
[jira] [Commented] (SPARK-23926) High-order function: reverse(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432982#comment-16432982 ] Apache Spark commented on SPARK-23926: -- User 'mn-mikke' has created a pull request for this issue: https://github.com/apache/spark/pull/21034 > High-order function: reverse(x) → array > --- > > Key: SPARK-23926 > URL: https://issues.apache.org/jira/browse/SPARK-23926 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array which has the reversed order of array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23926) High-order function: reverse(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23926: Assignee: (was: Apache Spark) > High-order function: reverse(x) → array > --- > > Key: SPARK-23926 > URL: https://issues.apache.org/jira/browse/SPARK-23926 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array which has the reversed order of array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23926) High-order function: reverse(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23926: Assignee: Apache Spark > High-order function: reverse(x) → array > --- > > Key: SPARK-23926 > URL: https://issues.apache.org/jira/browse/SPARK-23926 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array which has the reversed order of array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20865) caching dataset throws "Queries with streaming sources must be executed with writeStream.start()"
[ https://issues.apache.org/jira/browse/SPARK-20865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432969#comment-16432969 ] hamroune zahir commented on SPARK-20865: This is a huge regression; in that sense we cannot use head, first, take, limit, cache, or count on a streaming Dataset. > caching dataset throws "Queries with streaming sources must be executed with > writeStream.start()" > - > > Key: SPARK-20865 > URL: https://issues.apache.org/jira/browse/SPARK-20865 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.0.2, 2.1.0, 2.1.1 >Reporter: Martin Brišiak >Priority: Major > Fix For: 2.2.0, 2.3.0 > > > {code} > SparkSession > .builder > .master("local[*]") > .config("spark.sql.warehouse.dir", "C:/tmp/spark") > .config("spark.sql.streaming.checkpointLocation", > "C:/tmp/spark/spark-checkpoint") > .appName("my-test") > .getOrCreate > .readStream > .schema(schema) > .json("src/test/data") > .cache > .writeStream > .start > .awaitTermination > {code} > While executing this sample in Spark I got an error. 
Without the .cache option it > worked as intended but with .cache option i got: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries > with streaming sources must be executed with writeStream.start();; > FileSource[src/test/data] at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:196) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35) > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:33) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:33) > at > org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:58) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:69) > at > org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:67) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:73) > at > org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:73) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:102) > at > org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:65) 
> at > org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:89) > at org.apache.spark.sql.Dataset.persist(Dataset.scala:2479) at > org.apache.spark.sql.Dataset.cache(Dataset.scala:2489) at > org.me.App$.main(App.scala:23) at org.me.App.main(App.scala) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array
[ https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432850#comment-16432850 ] Dylan Guedes commented on SPARK-23931: -- I would like to try this one. > High-order function: zip(array1, array2[, ...]) → array > > > Key: SPARK-23931 > URL: https://issues.apache.org/jira/browse/SPARK-23931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Merges the given arrays, element-wise, into a single array of rows. The M-th > element of the N-th argument will be the N-th field of the M-th output > element. If the arguments have an uneven length, missing values are filled > with NULL. > {noformat} > SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, > null), ROW(null, '3b')] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
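The Presto semantics quoted above (element-wise merge, with missing values filled with NULL up to the longest input) can be modeled in a few lines of plain Python. This is only an illustration of the expected behavior, not Spark's implementation:

```python
def zip_arrays(*arrays):
    # Merge the arrays element-wise into rows (tuples); shorter inputs are
    # padded with None, standing in for SQL NULL.
    longest = max(len(a) for a in arrays)
    return [tuple(a[i] if i < len(a) else None for a in arrays)
            for i in range(longest)]
```

With the inputs from the SELECT example above, zip_arrays([1, 2], ['1b', None, '3b']) yields [(1, '1b'), (2, None), (None, '3b')], matching the documented result.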
[jira] [Created] (SPARK-23955) typo in parameter name 'rawPredicition'
John Bauer created SPARK-23955: -- Summary: typo in parameter name 'rawPredicition' Key: SPARK-23955 URL: https://issues.apache.org/jira/browse/SPARK-23955 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: John Bauer The classifier.py MultilayerPerceptronClassifier.__init__ API call had the typo rawPredicition instead of rawPrediction; the typo is also present in the doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23954) Converting spark dataframe containing int64 fields to R dataframes leads to unpredictable errors.
nicolas paris created SPARK-23954: - Summary: Converting spark dataframe containing int64 fields to R dataframes leads to unpredictable errors. Key: SPARK-23954 URL: https://issues.apache.org/jira/browse/SPARK-23954 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.3.0 Reporter: nicolas paris Converting a Spark dataframe containing int64 fields to an R dataframe leads to unpredictable errors. The problem comes from R, which does not handle int64 natively. As a result, a good workaround would be to convert bigint fields to strings when transforming Spark dataframes into R dataframes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
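The underlying issue is that base R stores numbers as 64-bit doubles, which represent integers exactly only up to 2^53, so larger int64 values are silently corrupted while their string form survives. A quick check (shown in Python, but the same double arithmetic applies to R numerics):

```python
# A perfectly valid int64 value that a 64-bit double cannot represent.
big = 2**53 + 1

# Through a double it collapses to the neighboring representable integer...
assert float(big) != big
assert float(big) == float(2**53)

# ...whereas a string round-trips exactly, which is why casting bigint to
# string before collecting into R is a reasonable workaround.
assert int(str(big)) == big
```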
[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs
[ https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432796#comment-16432796 ] Apache Spark commented on SPARK-19320: -- User 'yanji84' has created a pull request for this issue: https://github.com/apache/spark/pull/21033 > Allow guaranteed amount of GPU to be used when launching jobs > - > > Key: SPARK-19320 > URL: https://issues.apache.org/jira/browse/SPARK-19320 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen >Priority: Major > > Currently the only configuration for using GPUs with Mesos sets the > maximum number of GPUs a job will take from an offer, but it doesn't guarantee > exactly how many. > We should have a configuration that sets this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23912) High-order function: array_distinct(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432780#comment-16432780 ] Huaxin Gao commented on SPARK-23912: I will work on this. Thanks! > High-order function: array_distinct(x) → array > -- > > Key: SPARK-23912 > URL: https://issues.apache.org/jira/browse/SPARK-23912 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Remove duplicate values from the array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
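The expected behavior (drop repeated values from the array) can be modeled in a few lines of plain Python. This is an illustration only; keeping the first occurrence of each value in input order is an assumption here, since the Presto reference does not spell it out:

```python
def array_distinct(x):
    # Remove duplicates, keeping the first occurrence of each value in order.
    seen = set()
    out = []
    for v in x:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out
```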
[jira] [Updated] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21856: -- Fix Version/s: 2.3.0 > Update Python API for MultilayerPerceptronClassifierModel > - > > Key: SPARK-21856 > URL: https://issues.apache.org/jira/browse/SPARK-21856 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Chunsheng Ji >Priority: Minor > Fix For: 2.3.0 > > > SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so > python API also need update. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432746#comment-16432746 ] Apache Spark commented on SPARK-23529: -- User 'madanadit' has created a pull request for this issue: https://github.com/apache/spark/pull/21032 > Specify hostpath volume and mount the volume in Spark driver and executor > pods in Kubernetes > > > Key: SPARK-23529 > URL: https://issues.apache.org/jira/browse/SPARK-23529 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Suman Somasundar >Assignee: Anirudh Ramanathan >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432739#comment-16432739 ] H Lu commented on SPARK-23928: -- Can I take this one? > High-order function: shuffle(x) → array > --- > > Key: SPARK-23928 > URL: https://issues.apache.org/jira/browse/SPARK-23928 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Generate a random permutation of the given array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
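A uniform random permutation is conventionally produced with the Fisher-Yates shuffle; the sketch below is an illustrative reference in plain Python, not the eventual Spark implementation:

```python
import random

def shuffle_array(x, rng=random):
    # Fisher-Yates: walk from the end, swapping each position with a
    # uniformly chosen index at or before it. The input is left untouched.
    out = list(x)
    for i in range(len(out) - 1, 0, -1):
        j = rng.randrange(i + 1)
        out[i], out[j] = out[j], out[i]
    return out
```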
[jira] [Resolved] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp resolved SPARK-8571. Resolution: Not A Problem > spark streaming hanging processes upon build exit > - > > Key: SPARK-8571 > URL: https://issues.apache.org/jira/browse/SPARK-8571 > Project: Spark > Issue Type: Bug > Components: Build, DStreams > Environment: centos 6.6 amplab build system >Reporter: shane knapp >Assignee: shane knapp >Priority: Minor > Labels: build, test > > over the past 3 months i've been noticing that there are occasionally hanging > processes on our build system workers after various spark builds have > finished. these are all spark streaming processes. > today i noticed a 3+ hour spark build that was timed out after 200 minutes > (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/), > and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on > amp-jenkins-worker-02. after the timeout, it left the following process (and > all of its children) hanging. 
> the process' CLI command was: > {quote} > [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714 > jenkins1714 733 2.7 21342148 3642740 ?Sl 07:52 1713:41 java > -Dderby.system.durability=test -Djava.awt.headless=true > -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp > -Dspark.driver.allowMultipleContexts=true > -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos > -Dspark.testing=1 -Dspark.ui.enabled=false > -Dspark.ui.showConsoleProgress=false > -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming > -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m > org.scalatest.tools.Runner -R > /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes > /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes > -o -f > /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt > -u > /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/. > {quote} > stracing that process doesn't give us much: > {quote} > [root@amp-jenkins-worker-02 ~]# strace -p 1714 > Process 1714 attached - interrupt to quit > futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL > {quote} > stracing its children gives us a *little* bit more... 
some loop like this: > {quote} > > futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0 > futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0 > futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0 > futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, > {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 > futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1 > futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, > ) = -1 ETIMEDOUT (Connection timed out) > {quote} > and others loop on ptrace_attach (no such process) or restart_syscall > (resuming interrupted call) > even though this behavior has been solidly pinned to jobs timing out (which > ends w/an aborted, not failed, build), i've seen it happen for failed builds > as well. if i see any hanging processes from failed (not aborted) builds, i > will investigate them and update this bug as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml
[ https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-23751: - Assignee: Weichen Xu > Kolmogorov-Smirnoff test Python API in pyspark.ml > - > > Key: SPARK-23751 > URL: https://issues.apache.org/jira/browse/SPARK-23751 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > > Python wrapper for new DataFrame-based API for KS test -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml
[ https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23751. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20904 [https://github.com/apache/spark/pull/20904] > Kolmogorov-Smirnoff test Python API in pyspark.ml > - > > Key: SPARK-23751 > URL: https://issues.apache.org/jira/browse/SPARK-23751 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > > Python wrapper for new DataFrame-based API for KS test -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432725#comment-16432725 ] shane knapp commented on SPARK-8571: just doing some email archaeology and found this. no, it's not an issue anymore. > spark streaming hanging processes upon build exit > - > > Key: SPARK-8571 > URL: https://issues.apache.org/jira/browse/SPARK-8571 > Project: Spark > Issue Type: Bug > Components: Build, DStreams > Environment: centos 6.6 amplab build system >Reporter: shane knapp >Assignee: shane knapp >Priority: Minor > Labels: build, test > > over the past 3 months i've been noticing that there are occasionally hanging > processes on our build system workers after various spark builds have > finished. these are all spark streaming processes. > today i noticed a 3+ hour spark build that was timed out after 200 minutes > (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/), > and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on > amp-jenkins-worker-02. after the timeout, it left the following process (and > all of its children) hanging. 
> the process' CLI command was: > {quote} > [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714 > jenkins1714 733 2.7 21342148 3642740 ?Sl 07:52 1713:41 java > -Dderby.system.durability=test -Djava.awt.headless=true > -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp > -Dspark.driver.allowMultipleContexts=true > -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos > -Dspark.testing=1 -Dspark.ui.enabled=false > -Dspark.ui.showConsoleProgress=false > -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming > -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m > org.scalatest.tools.Runner -R > /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes > > /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes > -o -f > /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt > -u > /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/. > {quote} > stracing that process doesn't give us much: > {quote} > [root@amp-jenkins-worker-02 ~]# strace -p 1714 > Process 1714 attached - interrupt to quit > futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL > {quote} > stracing its children gives us a *little* bit more... 
some loop like this: > {quote} > > futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0 > futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0 > futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0 > futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, > {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 > futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1 > futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, > ) = -1 ETIMEDOUT (Connection timed out) > {quote} > and others loop on ptrace_attach (no such process) or restart_syscall > (resuming interrupted call) > even though this behavior has been solidly pinned to jobs timing out (which > ends w/an aborted, not failed, build), i've seen it happen for failed builds > as well. if i see any hanging processes from failed (not aborted) builds, i > will investigate them and update this bug as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8696) Streaming API for Online LDA
[ https://issues.apache.org/jira/browse/SPARK-8696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432686#comment-16432686 ] Joey Frazee commented on SPARK-8696: Is there still interest in this? The two use cases I've seen for this are (1) low latency, or near real-time topic generation -- imagine a dashboard or process depending on _always_ up-to-date topic dist. -- and (2) desire to do updates rather than fitting the entire dataset again because it's very large or very expensive to pre-process – though maybe merely having topic-word priors such as suggested in SPARK-9134 could be a good enough alternative for this second use case. I've seen both of those requirements appear in tandem. > Streaming API for Online LDA > > > Key: SPARK-8696 > URL: https://issues.apache.org/jira/browse/SPARK-8696 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang >Priority: Major > > Streaming LDA can be a natural extension from online LDA. > Yet for now we need to settle on the implementation for LDA prediction, to > support the predictOn method in the streaming version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23923) High-order function: cardinality(x) → bigint
[ https://issues.apache.org/jira/browse/SPARK-23923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23923: Assignee: Apache Spark > High-order function: cardinality(x) → bigint > > > Key: SPARK-23923 > URL: https://issues.apache.org/jira/browse/SPARK-23923 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html and > https://prestodb.io/docs/current/functions/map.html. > Returns the cardinality (size) of the array/map x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23923) High-order function: cardinality(x) → bigint
[ https://issues.apache.org/jira/browse/SPARK-23923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23923: Assignee: (was: Apache Spark) > High-order function: cardinality(x) → bigint > > > Key: SPARK-23923 > URL: https://issues.apache.org/jira/browse/SPARK-23923 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html and > https://prestodb.io/docs/current/functions/map.html. > Returns the cardinality (size) of the array/map x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23923) High-order function: cardinality(x) → bigint
[ https://issues.apache.org/jira/browse/SPARK-23923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432584#comment-16432584 ] Apache Spark commented on SPARK-23923: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/21031 > High-order function: cardinality(x) → bigint > > > Key: SPARK-23923 > URL: https://issues.apache.org/jira/browse/SPARK-23923 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html and > https://prestodb.io/docs/current/functions/map.html. > Returns the cardinality (size) of the array/map x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23953) Add get_json_scalar function
Timothy Chen created SPARK-23953: Summary: Add get_json_scalar function Key: SPARK-23953 URL: https://issues.apache.org/jira/browse/SPARK-23953 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.3.0 Reporter: Timothy Chen Besides get_json_object, which returns its result as a JSON string, we should add a function "get_json_scalar" that returns a scalar type, assuming the path maps to a scalar (boolean, number, string or null). It returns null when the path points to an object or array structure. This is also in Presto (https://prestodb.io/docs/current/functions/json.html). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
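As a rough illustration of the proposed semantics (not a Spark implementation, and using a simplified dotted path in place of the full JsonPath syntax Presto accepts), a scalar-only JSON getter might look like:

```python
import json

def get_json_scalar(json_text, path):
    """Illustrative sketch only: return the value at `path` if it is a
    scalar (boolean, number, string, or JSON null); return None for
    objects and arrays. The dotted-path lookup is a hypothetical
    simplification, not the JsonPath handling Spark or Presto use."""
    node = json.loads(json_text)
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    # Objects and arrays map to null, as the ticket describes.
    if isinstance(node, (dict, list)):
        return None
    return node
```

For example, `get_json_scalar('{"a": {"b": 3}}', "a.b")` yields the scalar `3`, while the same call with path `"a"` yields `None` because the path lands on an object.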
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432495#comment-16432495 ] Attila Zsolt Piros commented on SPARK-16630: Let me illustrate my problem with an example: - the limit for blacklisted nodes is configured to 2 - we have one node blacklisted close to the yarn allocator ("host1" -> expiryTime1), this is the new code I am working on - the scheduler requests new executors along with the (task-level) blacklisted nodes: "host2", "host3" (org.apache.spark.deploy.yarn.YarnAllocator#requestTotalExecutorsWithPreferredLocalities) So I have to choose 2 nodes to communicate towards YARN. My idea is to pass expiryTime2 and expiryTime3 to the YarnAllocator to choose the 2 most relevant nodes (the ones which expire later are more relevant). For this, in the case class RequestExecutors, the nodeBlacklist field type is changed to Map[String, Long] from Set[String]. > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major > > On YARN, it's possible that a node is messed up or misconfigured such that a > container won't launch on it. For instance if the Spark external shuffle > handler didn't get loaded on it, maybe it's just some other hardware issue or > hadoop configuration issue. > It would be nice if we could recognize this happening and stop trying to launch > executors on it, since that could end up causing us to hit our max number of > executor failures and then kill the job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
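The selection described in the comment (keep only the `limit` nodes whose blacklist entries expire last) can be sketched in Python. The function name and dict shape are illustrative, not Spark's actual YarnAllocator code:

```python
def pick_nodes_to_blacklist(node_expiry, limit):
    """Choose at most `limit` nodes to report to YARN, preferring the
    ones whose blacklist entries expire last, i.e. the freshest
    information.

    `node_expiry` is a dict of hostname -> expiry timestamp, the
    Map[String, Long] shape proposed in the comment. This helper is a
    sketch of the sorting idea, not Spark's implementation."""
    ranked = sorted(node_expiry.items(), key=lambda kv: kv[1], reverse=True)
    return {host for host, _ in ranked[:limit]}
```

With a limit of 2 and expiry times host1=100, host2=300, host3=200, this keeps host2 and host3, matching the "expires later is more relevant" rule.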
[jira] [Assigned] (SPARK-23952) remove type parameter in DataReaderFactory
[ https://issues.apache.org/jira/browse/SPARK-23952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23952: Assignee: Wenchen Fan (was: Apache Spark) > remove type parameter in DataReaderFactory > -- > > Key: SPARK-23952 > URL: https://issues.apache.org/jira/browse/SPARK-23952 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23952) remove type parameter in DataReaderFactory
[ https://issues.apache.org/jira/browse/SPARK-23952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432472#comment-16432472 ] Apache Spark commented on SPARK-23952: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/21029 > remove type parameter in DataReaderFactory > -- > > Key: SPARK-23952 > URL: https://issues.apache.org/jira/browse/SPARK-23952 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23864) Add Unsafe* copy methods to UnsafeWriter
[ https://issues.apache.org/jira/browse/SPARK-23864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-23864. --- Resolution: Fixed Fix Version/s: 2.4.0 > Add Unsafe* copy methods to UnsafeWriter > > > Key: SPARK-23864 > URL: https://issues.apache.org/jira/browse/SPARK-23864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23952) remove type parameter in DataReaderFactory
[ https://issues.apache.org/jira/browse/SPARK-23952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23952: Assignee: Apache Spark (was: Wenchen Fan) > remove type parameter in DataReaderFactory > -- > > Key: SPARK-23952 > URL: https://issues.apache.org/jira/browse/SPARK-23952 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23952) remove type parameter in DataReaderFactory
Wenchen Fan created SPARK-23952: --- Summary: remove type parameter in DataReaderFactory Key: SPARK-23952 URL: https://issues.apache.org/jira/browse/SPARK-23952 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432467#comment-16432467 ] Thomas Graves commented on SPARK-16630: --- sorry I don't follow, the list we get from the blacklist tracker is all nodes that are blacklisted currently that haven't met the expiry to unblacklist them. You just union them with the yarn allocator list. There is obviously some race condition there if one of the nodes is just about to be unblacklisted, but I don't see that as a major issue; the next allocation will not have it. Is there something I'm missing? > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major > > On YARN, it's possible that a node is messed up or misconfigured such that a > container won't launch on it. For instance if the Spark external shuffle > handler didn't get loaded on it, maybe it's just some other hardware issue or > hadoop configuration issue. > It would be nice if we could recognize this happening and stop trying to launch > executors on it, since that could end up causing us to hit our max number of > executor failures and then kill the job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column
[ https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432463#comment-16432463 ] Ed Lee commented on SPARK-20617: Thank you for the clarification. So conversely: {code:java} spark.sql("select null NOT in ('a')") {code} evaluates to null. When the filter is applied, null is treated as false, and therefore the filter doesn't return those rows. I see now that Pandas doesn't follow SQL standards: test_df.query("col1 not in (['a'])") > pyspark.sql filtering fails when using ~isin when there are nulls in column > --- > > Key: SPARK-20617 > URL: https://issues.apache.org/jira/browse/SPARK-20617 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 > Environment: Ubuntu Xenial 16.04, Python 3.5 >Reporter: Ed Lee >Priority: Major > > Hello, I encountered a filtering bug using 'isin' in pyspark sql on version > 2.2.0, Ubuntu 16.04. > Enclosed below is an example to replicate: > from pyspark.sql import SparkSession > from pyspark.sql import functions as sf > import pandas as pd > spark = SparkSession.builder.master("local").appName("Word > Count").getOrCreate() > test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"], > "col2": range(5) > }) > test_sdf = spark.createDataFrame(test_df) > test_sdf.show() > |col1|col2| > |null| 0| > |null| 1| > | a| 2| > | b| 3| > | c| 4| > # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null > are missing: > test_sdf.filter(sf.col("col1").isin(["a"]) == False).show() > Or: > test_sdf.filter(~sf.col("col1").isin(["a"])).show() > *Expecting*: > |col1|col2| > |null| 0| > |null| 1| > | b| 3| > | c| 4| > *Got*: > |col1|col2| > | b| 3| > | c| 4| > My workarounds: > 1. null is considered 'in', so add an OR isNull condition: > test_sdf.filter((sf.col("col1").isin(["a"])== False) | ( > sf.col("col1").isNull())).show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > 2. 
Use left join and filter > join_df = pd.DataFrame({"col1": ["a"], > "isin": 1 > }) > join_sdf = spark.createDataFrame(join_df) > test_sdf.join(join_sdf, on="col1", how="left") \ > .filter(sf.col("isin").isNull()) \ > .show() > To get: > |col1|col2|isin| > |null| 0|null| > |null| 1|null| > | c| 4|null| > | b| 3|null| > Thank you -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
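The three-valued logic behind this behavior is standard SQL rather than anything Spark-specific. It can be demonstrated with the Python stdlib sqlite3 module, which follows the same NULL rules: for a NULL col1, the predicate `col1 NOT IN ('a')` evaluates to NULL, and a WHERE clause keeps only rows where the predicate is true, so NULL rows drop out.

```python
import sqlite3

# Same data as test_df in the report, loaded into an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(None, 0), (None, 1), ("a", 2), ("b", 3), ("c", 4)])

# NULL rows are silently filtered out, mirroring the PySpark behavior.
not_in = [r[0] for r in conn.execute(
    "SELECT col2 FROM t WHERE col1 NOT IN ('a')")]

# Workaround 1 from the comment, translated to SQL: admit NULLs explicitly.
with_nulls = [r[0] for r in conn.execute(
    "SELECT col2 FROM t WHERE col1 NOT IN ('a') OR col1 IS NULL")]
```

Here `not_in` contains only 3 and 4 (the NULL rows vanish), while `with_nulls` recovers 0, 1, 3, and 4, matching the expected output in the report.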
[jira] [Assigned] (SPARK-23922) High-order function: arrays_overlap(x, y) → boolean
[ https://issues.apache.org/jira/browse/SPARK-23922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23922: Assignee: (was: Apache Spark) > High-order function: arrays_overlap(x, y) → boolean > --- > > Key: SPARK-23922 > URL: https://issues.apache.org/jira/browse/SPARK-23922 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Tests if arrays x and y have any non-null elements in common. Returns > null if there are no non-null elements in common but either array contains > null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23922) High-order function: arrays_overlap(x, y) → boolean
[ https://issues.apache.org/jira/browse/SPARK-23922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23922: Assignee: Apache Spark > High-order function: arrays_overlap(x, y) → boolean > --- > > Key: SPARK-23922 > URL: https://issues.apache.org/jira/browse/SPARK-23922 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Tests if arrays x and y have any non-null elements in common. Returns > null if there are no non-null elements in common but either array contains > null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23922) High-order function: arrays_overlap(x, y) → boolean
[ https://issues.apache.org/jira/browse/SPARK-23922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432423#comment-16432423 ] Apache Spark commented on SPARK-23922: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21028 > High-order function: arrays_overlap(x, y) → boolean > --- > > Key: SPARK-23922 > URL: https://issues.apache.org/jira/browse/SPARK-23922 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Tests if arrays x and y have any non-null elements in common. Returns > null if there are no non-null elements in common but either array contains > null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432415#comment-16432415 ] Attila Zsolt Piros commented on SPARK-16630: I would need the expiry times to choose the most relevant (freshest) subset of nodes to blacklist when the limit is less than the size of the union of all blacklist-able nodes. So the expiry time is only used for sorting. > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major > > On YARN, it's possible that a node is messed up or misconfigured such that a > container won't launch on it. For instance if the Spark external shuffle > handler didn't get loaded on it, maybe it's just some other hardware issue or > hadoop configuration issue. > It would be nice if we could recognize this happening and stop trying to launch > executors on it, since that could end up causing us to hit our max number of > executor failures and then kill the job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23943) Improve observability of MesosRestServer/MesosClusterDispatcher
[ https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23943: Assignee: (was: Apache Spark) > Improve observability of MesosRestServer/MesosClusterDispatcher > --- > > Key: SPARK-23943 > URL: https://issues.apache.org/jira/browse/SPARK-23943 > Project: Spark > Issue Type: Improvement > Components: Deploy, Mesos >Affects Versions: 2.2.1, 2.3.0 > Environment: > >Reporter: paul mackles >Priority: Minor > Fix For: 2.4.0 > > > Two changes in this PR: > * A /health endpoint for a quick binary indication on the health of > MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a > marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. > Returns a 503 status if the server is unhealthy and a 200 if the server is > healthy > * A /status endpoint for a more detailed examination on the current state of > a MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool > For both endpoints, regardless of status code, the following body is returned: > > {code:java} > { > "action" : "ServerStatusResponse", > "launchedDrivers" : 0, > "message" : "iamok", > "queuedDrivers" : 0, > "schedulerDriverStopped" : false, > "serverSparkVersion" : "2.3.1-SNAPSHOT", > "success" : true, > "pendingRetryDrivers" : 0 > }{code} > Aside from surfacing all of the scheduler metrics, the response also includes > the status of the Mesos SchedulerDriver. On numerous occasions now, we have > observed scenarios where the Mesos SchedulerDriver quietly exits due to some > other failure. When this happens, jobs queue up and the only way to clean > things up is to restart the service. > With the above health check, marathon can be configured to automatically > restart the MesosClusterDispatcher service when the health check fails, > lessening the need for manual intervention. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23943) Improve observability of MesosRestServer/MesosClusterDispatcher
[ https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23943: Assignee: Apache Spark > Improve observability of MesosRestServer/MesosClusterDispatcher > --- > > Key: SPARK-23943 > URL: https://issues.apache.org/jira/browse/SPARK-23943 > Project: Spark > Issue Type: Improvement > Components: Deploy, Mesos >Affects Versions: 2.2.1, 2.3.0 > Environment: > >Reporter: paul mackles >Assignee: Apache Spark >Priority: Minor > Fix For: 2.4.0 > > > Two changes in this PR: > * A /health endpoint for a quick binary indication on the health of > MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a > marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. > Returns a 503 status if the server is unhealthy and a 200 if the server is > healthy > * A /status endpoint for a more detailed examination on the current state of > a MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool > For both endpoints, regardless of status code, the following body is returned: > > {code:java} > { > "action" : "ServerStatusResponse", > "launchedDrivers" : 0, > "message" : "iamok", > "queuedDrivers" : 0, > "schedulerDriverStopped" : false, > "serverSparkVersion" : "2.3.1-SNAPSHOT", > "success" : true, > "pendingRetryDrivers" : 0 > }{code} > Aside from surfacing all of the scheduler metrics, the response also includes > the status of the Mesos SchedulerDriver. On numerous occasions now, we have > observed scenarios where the Mesos SchedulerDriver quietly exits due to some > other failure. When this happens, jobs queue up and the only way to clean > things up is to restart the service. > With the above health check, marathon can be configured to automatically > restart the MesosClusterDispatcher service when the health check fails, > lessening the need for manual intervention. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23943) Improve observability of MesosRestServer/MesosClusterDispatcher
[ https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432379#comment-16432379 ] Apache Spark commented on SPARK-23943: -- User 'pmackles' has created a pull request for this issue: https://github.com/apache/spark/pull/21027 > Improve observability of MesosRestServer/MesosClusterDispatcher > --- > > Key: SPARK-23943 > URL: https://issues.apache.org/jira/browse/SPARK-23943 > Project: Spark > Issue Type: Improvement > Components: Deploy, Mesos >Affects Versions: 2.2.1, 2.3.0 > Environment: > >Reporter: paul mackles >Priority: Minor > Fix For: 2.4.0 > > > Two changes in this PR: > * A /health endpoint for a quick binary indication on the health of > MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a > marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. > Returns a 503 status if the server is unhealthy and a 200 if the server is > healthy > * A /status endpoint for a more detailed examination on the current state of > a MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool > For both endpoints, regardless of status code, the following body is returned: > > {code:java} > { > "action" : "ServerStatusResponse", > "launchedDrivers" : 0, > "message" : "iamok", > "queuedDrivers" : 0, > "schedulerDriverStopped" : false, > "serverSparkVersion" : "2.3.1-SNAPSHOT", > "success" : true, > "pendingRetryDrivers" : 0 > }{code} > Aside from surfacing all of the scheduler metrics, the response also includes > the status of the Mesos SchedulerDriver. On numerous occasions now, we have > observed scenarios where the Mesos SchedulerDriver quietly exits due to some > other failure. When this happens, jobs queue up and the only way to clean > things up is to restart the service. 
> With the above health check, marathon can be configured to automatically > restart the MesosClusterDispatcher service when the health check fails, > lessening the need for manual intervention. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23951) Use java classed in ExprValue and simplify a bunch of stuff
[ https://issues.apache.org/jira/browse/SPARK-23951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432375#comment-16432375 ] Apache Spark commented on SPARK-23951: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/21026 > Use java classed in ExprValue and simplify a bunch of stuff > --- > > Key: SPARK-23951 > URL: https://issues.apache.org/jira/browse/SPARK-23951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23951) Use java classed in ExprValue and simplify a bunch of stuff
[ https://issues.apache.org/jira/browse/SPARK-23951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23951: Assignee: Herman van Hovell (was: Apache Spark) > Use java classed in ExprValue and simplify a bunch of stuff > --- > > Key: SPARK-23951 > URL: https://issues.apache.org/jira/browse/SPARK-23951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23951) Use Java classes in ExprValue and simplify a bunch of stuff
[ https://issues.apache.org/jira/browse/SPARK-23951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23951: Assignee: Apache Spark (was: Herman van Hovell) > Use java classed in ExprValue and simplify a bunch of stuff > --- > > Key: SPARK-23951 > URL: https://issues.apache.org/jira/browse/SPARK-23951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Herman van Hovell >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23888) Speculative task should not run on a host where another attempt is already running
[ https://issues.apache.org/jira/browse/SPARK-23888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-23888: - Labels: speculation (was: ) > speculative task should not run on a given host where another attempt is > already running on > --- > > Key: SPARK-23888 > URL: https://issues.apache.org/jira/browse/SPARK-23888 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Major > Labels: speculation > Fix For: 2.3.0 > > > There's a bug in: > {code:java} > /** Check whether a task is currently running an attempt on a given host */ > private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = { >taskAttempts(taskIndex).exists(_.host == host) > } > {code} > This will ignore hosts which have finished attempts, so we should check > whether the attempt is currently running on the given host. > And it is possible for a speculative task to run on a host where another > attempt failed here before. > Assume we have only two machines: host1, host2. We first run task0.0 on > host1. Then, due to a long time waiting for task0.0, we launch a speculative > task0.1 on host2. And, task0.1 finally failed on host1, but it can not re-run > since there's already a copy running on host2. After another long time, we > launch a new speculative task0.2. And, now, we can run task0.2 on host1 > again, since there's no more running attempt on host1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23888) Speculative task should not run on a host where another attempt is already running
[ https://issues.apache.org/jira/browse/SPARK-23888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-23888: - Component/s: Scheduler > speculative task should not run on a given host where another attempt is > already running on > --- > > Key: SPARK-23888 > URL: https://issues.apache.org/jira/browse/SPARK-23888 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Major > Labels: speculation > Fix For: 2.3.0 > > > There's a bug in: > {code:java} > /** Check whether a task is currently running an attempt on a given host */ > private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = { >taskAttempts(taskIndex).exists(_.host == host) > } > {code} > This will ignore hosts which have finished attempts, so we should check > whether the attempt is currently running on the given host. > And it is possible for a speculative task to run on a host where another > attempt failed here before. > Assume we have only two machines: host1, host2. We first run task0.0 on > host1. Then, due to a long time waiting for task0.0, we launch a speculative > task0.1 on host2. And, task0.1 finally failed on host1, but it can not re-run > since there's already a copy running on host2. After another long time, we > launch a new speculative task0.2. And, now, we can run task0.2 on host1 > again, since there's no more running attempt on host1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
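The flawed check above can be illustrated outside of Spark. The sketch below is pure Python — the `Attempt` record is a hypothetical stand-in for Spark's `TaskInfo`, not the real class — and shows why matching on host alone also counts finished attempts, and how restricting the check to running attempts (the fix direction the issue describes) behaves differently:

```python
from dataclasses import dataclass

# Hypothetical stand-in for Spark's TaskInfo: a host plus a running flag.
@dataclass
class Attempt:
    host: str
    running: bool

def has_attempt_on_host_buggy(attempts, host):
    # Mirrors the original check: it matches finished attempts too.
    return any(a.host == host for a in attempts)

def has_attempt_on_host_fixed(attempts, host):
    # Only count attempts that are still running on the given host.
    return any(a.host == host and a.running for a in attempts)

# Scenario from the issue: an attempt already failed on host1,
# while the speculative copy is still running on host2.
attempts = [Attempt("host1", running=False), Attempt("host2", running=True)]

# The buggy check still "sees" an attempt on host1 and would block a new
# speculative attempt there...
assert has_attempt_on_host_buggy(attempts, "host1") is True
# ...while the fixed check correctly reports host1 as free.
assert has_attempt_on_host_fixed(attempts, "host1") is False
```

This is only a model of the predicate; the real change lives in `TaskSetManager.hasAttemptOnHost` and must use Spark's actual task bookkeeping.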
[jira] [Comment Edited] (SPARK-23929) pandas_udf schema mapped by position and not by name
[ https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432344#comment-16432344 ] Omri edited comment on SPARK-23929 at 4/10/18 2:22 PM: --- [~icexelloss], I couldn't recreate the problem I had where the order was mixed, but I have a different example to illustrate the problem. Here the schema struct is [id,zeros,ones] but the user returned a data frame with [id,ones,zeros]. The column names are taken from the provided schema and not from the explicitly mentioned data frame. {code:java} from pyspark.sql.functions import pandas_udf, PandasUDFType from pyspark.sql.types import * import pandas as pd df = spark.createDataFrame( [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v")) schema = StructType([ StructField("id", LongType()), StructField("zeros", DoubleType()), StructField("ones", DoubleType()) ]) @pandas_udf(schema, PandasUDFType.GROUPED_MAP) def constants(grp): return pd.DataFrame({"id":grp.iloc[0]['id'],"ones":1,"zeros":0},index = [0]) df.groupby("id").apply(constants).show() {code} results: {code:java} +---+-++ | id|zeros|ones| +---+-++ | 1| 1.0| 0.0| | 2| 1.0| 0.0| +---+-++ {code} So the user was expecting ones to have 1 and zeros to have 0 but they get wrong results. was (Author: omri374): [~icexelloss], I couldn't recreate the problem I had where the order was mixed, but I have a different example to illustrate the problem. Here the schema struct is [id,zeros,ones] but the user returned a data frame with [id,ones,zeros]. The column names are taken from the provided schema and not from the explicitly mentioned data frame. 
{code:java} from pyspark.sql.functions import pandas_udf, PandasUDFType from pyspark.sql.types import * import pandas as pd df = spark.createDataFrame( [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v")) schema = StructType([ StructField("id", LongType()), StructField("zeros", DoubleType()), StructField("ones", DoubleType()) ]) @pandas_udf(schema, PandasUDFType.GROUPED_MAP) def median_per_group(grp): return pd.DataFrame({"id":grp.iloc[0]['id'],"ones":1,"zeros":0},index = [0]) df.groupby("id").apply(median_per_group).show() {code} results: {code:java} +---+-++ | id|zeros|ones| +---+-++ | 1| 1.0| 0.0| | 2| 1.0| 0.0| +---+-++ {code} So the user was expecting ones to have 1 and zeros to have 0 but they get wrong results. > pandas_udf schema mapped by position and not by name > > > Key: SPARK-23929 > URL: https://issues.apache.org/jira/browse/SPARK-23929 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: PySpark > Spark 2.3.0 > >Reporter: Omri >Priority: Major > > The return struct of a pandas_udf should be mapped to the provided schema by > name. Currently it's not the case. 
> Consider these two examples, where the only change is the order of the fields > in the provided schema struct: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP) > def normalize(pdf): > v = pdf.v > return pdf.assign(v=(v - v.mean()) / v.std()) > df.groupby("id").apply(normalize).show() > {code} > and this one: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP) > def normalize(pdf): > v = pdf.v > return pdf.assign(v=(v - v.mean()) / v.std()) > df.groupby("id").apply(normalize).show() > {code} > The results should be the same but they are different: > For the first code: > {code:java} > +---+---+ > | v| id| > +---+---+ > |1.0| 0| > |1.0| 0| > |2.0| 0| > |2.0| 0| > |2.0| 1| > +---+---+ > {code} > For the second code: > {code:java} > +---+---+ > | id| v| > +---+---+ > | 1|-0.7071067811865475| > | 1| 0.7071067811865475| > | 2|-0.8320502943378437| > | 2|-0.2773500981126146| > | 2| 1.1094003924504583| > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23929) pandas_udf schema mapped by position and not by name
[ https://issues.apache.org/jira/browse/SPARK-23929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432344#comment-16432344 ] Omri commented on SPARK-23929: -- [~icexelloss], I couldn't recreate the problem I had where the order was mixed, but I have a different example to illustrate the problem. Here the schema struct is [id,zeros,ones] but the user returned a data frame with [id,ones,zeros]. The column names are taken from the provided schema and not from the explicitly mentioned data frame. {code:java} from pyspark.sql.functions import pandas_udf, PandasUDFType from pyspark.sql.types import * import pandas as pd df = spark.createDataFrame( [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v")) schema = StructType([ StructField("id", LongType()), StructField("zeros", DoubleType()), StructField("ones", DoubleType()) ]) @pandas_udf(schema, PandasUDFType.GROUPED_MAP) def median_per_group(grp): return pd.DataFrame({"id":grp.iloc[0]['id'],"ones":1,"zeros":0},index = [0]) df.groupby("id").apply(median_per_group).show() {code} results: {code:java} +---+-++ | id|zeros|ones| +---+-++ | 1| 1.0| 0.0| | 2| 1.0| 0.0| +---+-++ {code} So the user was expecting ones to have 1 and zeros to have 0 but they get wrong results. > pandas_udf schema mapped by position and not by name > > > Key: SPARK-23929 > URL: https://issues.apache.org/jira/browse/SPARK-23929 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: PySpark > Spark 2.3.0 > >Reporter: Omri >Priority: Major > > The return struct of a pandas_udf should be mapped to the provided schema by > name. Currently it's not the case. 
> Consider these two examples, where the only change is the order of the fields > in the provided schema struct: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP) > def normalize(pdf): > v = pdf.v > return pdf.assign(v=(v - v.mean()) / v.std()) > df.groupby("id").apply(normalize).show() > {code} > and this one: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP) > def normalize(pdf): > v = pdf.v > return pdf.assign(v=(v - v.mean()) / v.std()) > df.groupby("id").apply(normalize).show() > {code} > The results should be the same but they are different: > For the first code: > {code:java} > +---+---+ > | v| id| > +---+---+ > |1.0| 0| > |1.0| 0| > |2.0| 0| > |2.0| 0| > |2.0| 1| > +---+---+ > {code} > For the second code: > {code:java} > +---+---+ > | id| v| > +---+---+ > | 1|-0.7071067811865475| > | 1| 0.7071067811865475| > | 2|-0.8320502943378437| > | 2|-0.2773500981126146| > | 2| 1.1094003924504583| > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
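The positional mapping can be reproduced without Spark. The sketch below is pure Python — `schema_names` stands in for the StructType field order and plain lists stand in for the returned pandas columns — and shows how pairing schema names with values by position mislabels the columns, and how reordering the returned columns to the schema's field order (the usual workaround) restores the intended labels:

```python
# The schema declares this field order: [id, zeros, ones].
schema_names = ["id", "zeros", "ones"]

# The UDF returns its columns in a different order: [id, ones, zeros].
returned_names = ["id", "ones", "zeros"]
returned_row = [1, 1.0, 0.0]  # id=1, ones=1.0, zeros=0.0

# Positional mapping (the behavior reported here): names come from the
# schema, values from the returned frame, matched purely by position.
positional = dict(zip(schema_names, returned_row))
assert positional == {"id": 1, "zeros": 1.0, "ones": 0.0}  # zeros/ones swapped!

# Workaround: reorder the returned columns to the schema order before
# handing the frame back (e.g. pdf[[f.name for f in schema.fields]]
# inside the real pandas_udf).
by_name = dict(zip(returned_names, returned_row))
reordered = [by_name[n] for n in schema_names]
correct = dict(zip(schema_names, reordered))
assert correct == {"id": 1, "zeros": 0.0, "ones": 1.0}
```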
[jira] [Commented] (SPARK-23884) hasLaunchedTask should be true when launchedAnyTask is true
[ https://issues.apache.org/jira/browse/SPARK-23884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432323#comment-16432323 ] Yu Wang commented on SPARK-23884: - [~Ngone51]I did not mention the patch and wanted to mention one > hasLaunchedTask should be true when launchedAnyTask be true > --- > > Key: SPARK-23884 > URL: https://issues.apache.org/jira/browse/SPARK-23884 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Major > Labels: easyfix > Fix For: 2.3.0 > > Attachments: SPARK-23884.patch > > > *hasLaunchedTask* should be *true* when *launchedAnyTask* be *true*, rather > than *task's size > 0.* > *task'size* would be geater than 0 as long as there‘s any *WorkOffers,*but > this dose not ensure there's any tasks launched. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
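The distinction the report draws can be sketched in a few lines of Python (a toy model — the function and variable names are illustrative, not Spark's actual types): a non-empty list of per-offer task buckets does not imply that any task was actually launched, so `tasks.size > 0` is the wrong proxy for `launchedAnyTask`.

```python
def resource_offers(offers, tasks_to_launch):
    """Toy model of the scheduler's offer-handling bookkeeping."""
    # One bucket per offer exists regardless of whether anything launches,
    # so the buckets list is non-empty whenever there are offers.
    task_buckets = [[] for _ in offers]
    launched_any_task = False
    for bucket, task in zip(task_buckets, tasks_to_launch):
        bucket.append(task)
        launched_any_task = True
    return task_buckets, launched_any_task

# Three offers arrive, but no tasks are ready to launch:
buckets, launched = resource_offers(["offer1", "offer2", "offer3"], [])

# The buggy proxy (tasks.size > 0) reports success even though
# nothing launched...
assert len(buckets) > 0
# ...while the explicit flag gives the right answer.
assert launched is False
```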
[jira] [Resolved] (SPARK-23841) NodeIdCache should unpersist the last cached nodeIdsForInstances
[ https://issues.apache.org/jira/browse/SPARK-23841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23841. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20956 [https://github.com/apache/spark/pull/20956] > NodeIdCache should unpersist the last cached nodeIdsForInstances > > > Key: SPARK-23841 > URL: https://issues.apache.org/jira/browse/SPARK-23841 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 2.4.0 > > > NodeIdCache forget to unpersist the last cached intermediate dataset: > > {code:java} > scala> import org.apache.spark.ml.classification._ > import org.apache.spark.ml.classification._ > scala> val df = > spark.read.format("libsvm").load("/Users/zrf/Dev/OpenSource/spark/data/mllib/sample_libsvm_data.txt") > 2018-04-02 11:48:25 WARN LibSVMFileFormat:66 - 'numFeatures' option not > specified, determining the number of features by going though the input. If > you know the number in advance, please specify it via 'numFeatures' option to > avoid the extra scan. 
> 2018-04-02 11:48:31 WARN ObjectStore:568 - Failed to get database > global_temp, returning NoSuchObjectException > df: org.apache.spark.sql.DataFrame = [label: double, features: vector] > scala> val rf = new RandomForestClassifier().setCacheNodeIds(true) > rf: org.apache.spark.ml.classification.RandomForestClassifier = > rfc_aab2b672546b > scala> val rfm = rf.fit(df) > rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = > RandomForestClassificationModel (uid=rfc_aab2b672546b) with 20 trees > scala> sc.getPersistentRDDs > res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(56 -> > MapPartitionsRDD[56] at map at NodeIdCache.scala:102){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23841) NodeIdCache should unpersist the last cached nodeIdsForInstances
[ https://issues.apache.org/jira/browse/SPARK-23841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23841: - Assignee: zhengruifeng > NodeIdCache should unpersist the last cached nodeIdsForInstances > > > Key: SPARK-23841 > URL: https://issues.apache.org/jira/browse/SPARK-23841 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > NodeIdCache forget to unpersist the last cached intermediate dataset: > > {code:java} > scala> import org.apache.spark.ml.classification._ > import org.apache.spark.ml.classification._ > scala> val df = > spark.read.format("libsvm").load("/Users/zrf/Dev/OpenSource/spark/data/mllib/sample_libsvm_data.txt") > 2018-04-02 11:48:25 WARN LibSVMFileFormat:66 - 'numFeatures' option not > specified, determining the number of features by going though the input. If > you know the number in advance, please specify it via 'numFeatures' option to > avoid the extra scan. > 2018-04-02 11:48:31 WARN ObjectStore:568 - Failed to get database > global_temp, returning NoSuchObjectException > df: org.apache.spark.sql.DataFrame = [label: double, features: vector] > scala> val rf = new RandomForestClassifier().setCacheNodeIds(true) > rf: org.apache.spark.ml.classification.RandomForestClassifier = > rfc_aab2b672546b > scala> val rfm = rf.fit(df) > rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = > RandomForestClassificationModel (uid=rfc_aab2b672546b) with 20 trees > scala> sc.getPersistentRDDs > res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(56 -> > MapPartitionsRDD[56] at map at NodeIdCache.scala:102){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
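The leak pattern here is generic: code that repeatedly swaps in a new cached generation must also release the final generation when it is torn down. A minimal Python sketch (the class and its `persisted` set are illustrative stand-ins for Spark's RDD persistence, not the real NodeIdCache API):

```python
class NodeIdCacheSketch:
    """Toy model of a cache that swaps in a new generation each iteration."""

    def __init__(self):
        self.persisted = set()   # stands in for sc.getPersistentRDDs
        self.current = None

    def update(self, name):
        previous = self.current
        self.current = name
        self.persisted.add(name)              # cache the new generation
        if previous is not None:
            self.persisted.discard(previous)  # unpersist the previous one

    def delete_all_checkpoints(self):
        # The fix: also unpersist the *last* cached generation on teardown.
        if self.current is not None:
            self.persisted.discard(self.current)
            self.current = None

cache = NodeIdCacheSketch()
for i in range(3):
    cache.update(f"nodeIds-{i}")

# Without the teardown step, the final generation stays persisted —
# exactly the leftover MapPartitionsRDD seen in getPersistentRDDs above.
assert cache.persisted == {"nodeIds-2"}
cache.delete_all_checkpoints()
assert cache.persisted == set()
```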
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432244#comment-16432244 ] Thomas Graves commented on SPARK-16630: --- yes I think it would make sense as the union of all blacklisted nodes. I'm not sure what you mean by your last question. The expiry currently is all handled in the BlacklistTracker, I wouldn't want to move that out into the yarn allocator. Just use the information passed to it unless there is a case it doesn't cover? > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major > > On YARN, its possible that a node is messed or misconfigured such that a > container won't launch on it. For instance if the Spark external shuffle > handler didn't get loaded on it , maybe its just some other hardware issue or > hadoop configuration issue. > It would be nice we could recognize this happening and stop trying to launch > executors on it since that could end up causing us to hit our max number of > executor failures and then kill the job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
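The union-with-expiry bookkeeping discussed in this thread can be sketched as a small merge (pure Python; the node names and the size limit are illustrative): combine the scheduler's blacklist with the allocator's own, keep the later expiry for nodes present in both, and when over the limit prefer the most recently expiring entries, matching the intuition above that newer blacklistings are more up to date.

```python
def merge_blacklists(scheduler_bl, allocator_bl, limit):
    """Union two {node: expiry_time} maps, keeping the later expiry per
    node, then keep at most `limit` nodes, preferring later expiries."""
    merged = dict(scheduler_bl)
    for node, expiry in allocator_bl.items():
        merged[node] = max(expiry, merged.get(node, expiry))
    newest_first = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return dict(newest_first[:limit])

scheduler_bl = {"node1": 100, "node2": 250}
allocator_bl = {"node2": 300, "node3": 200}

# node2 appears in both; the later expiry (300) wins, and with limit=2
# the oldest entry (node1) is dropped from what is sent to YARN.
merged = merge_blacklists(scheduler_bl, allocator_bl, limit=2)
assert merged == {"node2": 300, "node3": 200}
```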
[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory
[ https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432207#comment-16432207 ] Kingsley Jones commented on SPARK-12216: scala> val loader = Thread.currentThread.getContextClassLoader() loader: ClassLoader = scala.tools.nsc.interpreter.IMain$TranslatingClassLoader@3a1a20f scala> val parent1 = loader.getParent() parent1: ClassLoader = scala.reflect.internal.util.ScalaClassLoader$URLClassLoader@66e6af49 scala> val parent2 = parent1.getParent() parent2: ClassLoader = sun.misc.Launcher$AppClassLoader@5fcfe4b2 scala> val parent3 = parent2.getParent() parent3: ClassLoader = sun.misc.Launcher$ExtClassLoader@5257226b scala> val parent4 = parent3.getParent() parent4: ClassLoader = null I did experiment with trying to find the open ClassLoaders in the scala session (shown above). shows exposed methods on the loaders, but there is no close method: scala> loader. clearAssertionStatus getResource getResources setClassAssertionStatus setPackageAssertionStatus getParent getResourceAsStream loadClass setDefaultAssertionStatus scala> parent1. clearAssertionStatus getResource getResources setClassAssertionStatus setPackageAssertionStatus getParent getResourceAsStream loadClass setDefaultAssertionStatus There is no close method on any of these, so I could not try closing them prior to quitting the session. This was just a simple hack to see if there was any way to use reflection to find the open ClassLoaders. I thought perhaps it might be possible to walk this tree and then close them within ShutDownHookManager ??? 
> Spark failed to delete temp directory > -- > > Key: SPARK-12216 > URL: https://issues.apache.org/jira/browse/SPARK-12216 > Project: Spark > Issue Type: Bug > Components: Spark Shell > Environment: windows 7 64 bit > Spark 1.52 > Java 1.8.0.65 > PATH includes: > C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin > C:\ProgramData\Oracle\Java\javapath > C:\Users\Stefan\scala\bin > SYSTEM variables set are: > JAVA_HOME=C:\Program Files\Java\jre1.8.0_65 > HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin > (where the bin\winutils resides) > both \tmp and \tmp\hive have permissions > drwxrwxrwx as detected by winutils ls >Reporter: stefan >Priority: Minor > > The mailing list archives have no obvious solution to this: > scala> :q > Stopping spark context. > 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark > temp dir: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > java.io.IOException: Failed to delete: > C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Commented] (SPARK-23337) withWatermark raises an exception on struct objects
[ https://issues.apache.org/jira/browse/SPARK-23337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432200#comment-16432200 ] Aydin Kocas commented on SPARK-23337: - Hi Michael, in my case it's a blocking issue and unfortunately not just annoying because I can't use the watermarking-functionality when doing the readstream on a json file . Without the time limitation via the watermarking-functionality, I have concerns that my checkpoint-dir will increase with time because of not having any time boundaries. > withWatermark raises an exception on struct objects > --- > > Key: SPARK-23337 > URL: https://issues.apache.org/jira/browse/SPARK-23337 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.1 > Environment: Linux Ubuntu, Spark on standalone mode >Reporter: Aydin Kocas >Priority: Major > > Hi, > > when using a nested object (I mean an object within a struct, here concrete: > _source.createTime) from a json file as the parameter for the > withWatermark-method, I get an exception (see below). > Anything else works flawlessly with the nested object. 
> > +*{color:#14892c}works:{color}*+ > {code:java} > Dataset jsonRow = > spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("myTime", > "10 seconds").toDF();{code} > > json structure: > {code:java} > root > |-- _id: string (nullable = true) > |-- _index: string (nullable = true) > |-- _score: long (nullable = true) > |-- myTime: timestamp (nullable = true) > ..{code} > +*{color:#d04437}does not work - nested json{color}:*+ > {code:java} > Dataset jsonRow = > spark.readStream().schema(getJSONSchema()).json(file).dropDuplicates("_id").withWatermark("_source.createTime", > "10 seconds").toDF();{code} > > json structure: > > {code:java} > root > |-- _id: string (nullable = true) > |-- _index: string (nullable = true) > |-- _score: long (nullable = true) > |-- _source: struct (nullable = true) > | |-- createTime: timestamp (nullable = true) > .. > > Exception in thread "main" > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: > 'EventTimeWatermark '_source.createTime, interval 10 seconds > +- Deduplicate [_id#0], true > +- StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@5dbbb292,json,List(),Some(StructType(StructField(_id,StringType,true), > StructField(_index,StringType,true), StructField(_score,LongType,true), > StructField(_source,StructType(StructField(additionalData,StringType,true), > StructField(client,StringType,true), > StructField(clientDomain,BooleanType,true), > StructField(clientVersion,StringType,true), > StructField(country,StringType,true), > StructField(countryName,StringType,true), > StructField(createTime,TimestampType,true), > StructField(externalIP,StringType,true), > StructField(hostname,StringType,true), > StructField(internalIP,StringType,true), > StructField(location,StringType,true), > StructField(locationDestination,StringType,true), > StructField(login,StringType,true), > StructField(originalRequestString,StringType,true), > 
StructField(password,StringType,true), > StructField(peerIdent,StringType,true), > StructField(peerType,StringType,true), > StructField(recievedTime,TimestampType,true), > StructField(sessionEnd,StringType,true), > StructField(sessionStart,StringType,true), > StructField(sourceEntryAS,StringType,true), > StructField(sourceEntryIp,StringType,true), > StructField(sourceEntryPort,StringType,true), > StructField(targetCountry,StringType,true), > StructField(targetCountryName,StringType,true), > StructField(targetEntryAS,StringType,true), > StructField(targetEntryIp,StringType,true), > StructField(targetEntryPort,StringType,true), > StructField(targetport,StringType,true), > StructField(username,StringType,true), > StructField(vulnid,StringType,true)),true), > StructField(_type,StringType,true))),List(),None,Map(path -> ./input/),None), > FileSource[./input/], [_id#0, _index#1, _score#2L, _source#3, _type#4] > at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:854) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:796) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) >
[jira] [Updated] (SPARK-23943) Improve observability of MesosRestServer/MesosClusterDispatcher
[ https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] paul mackles updated SPARK-23943: - Description: Two changes in this PR: * A /health endpoint for a quick binary indication on the health of MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. Returns a 503 status if the server is unhealthy and a 200 if the server is healthy * A /status endpoint for a more detailed examination on the current state of a MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool For both endpoints, regardless of status code, the following body is returned: {code:java} { "action" : "ServerStatusResponse", "launchedDrivers" : 0, "message" : "iamok", "queuedDrivers" : 0, "schedulerDriverStopped" : false, "serverSparkVersion" : "2.3.1-SNAPSHOT", "success" : true, "pendingRetryDrivers" : 0 }{code} Aside from surfacing all of the scheduler metrics, the response also includes the status of the Mesos SchedulerDriver. On numerous occasions now, we have observed scenarios where the Mesos SchedulerDriver quietly exits due to some other failure. When this happens, jobs queue up and the only way to clean things up is to restart the service. With the above health check, marathon can be configured to automatically restart the MesosClusterDispatcher service when the health check fails, lessening the need for manual intervention. was: Two changes: First, a more robust [health-check|[http://mesosphere.github.io/marathon/docs/health-checks.html]] for anyone who runs MesosClusterDispatcher as a marathon app. Specifically, this check verifies that the MesosSchedulerDriver is still running as we have seen certain cases where it stops (rather quietly) and the only way to revive it is a restart. With this health check, marathon will restart the dispatcher if the MesosSchedulerDriver stops running. 
The health check lives at the url "/health" and returns a 204 when the server is healthy and a 503 when it is not (e.g. the MesosSchedulerDriver stopped running). Second, a server status endpoint that replies with some basic metrics about the server. The status endpoint resides at the url "/status" and responds with: {code:java} { "action" : "ServerStatusResponse", "launchedDrivers" : 0, "message" : "server OK", "queuedDrivers" : 0, "schedulerDriverStopped" : false, "serverSparkVersion" : "2.3.1-SNAPSHOT", "success" : true }{code} As you can see, it includes a snapshot of the metrics/health of the scheduler. Useful for quick debugging/troubleshooting/monitoring. > Improve observability of MesosRestServer/MesosClusterDispatcher > --- > > Key: SPARK-23943 > URL: https://issues.apache.org/jira/browse/SPARK-23943 > Project: Spark > Issue Type: Improvement > Components: Deploy, Mesos >Affects Versions: 2.2.1, 2.3.0 > Environment: > >Reporter: paul mackles >Priority: Minor > Fix For: 2.4.0 > > > Two changes in this PR: > * A /health endpoint for a quick binary indication on the health of > MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a > marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. > Returns a 503 status if the server is unhealthy and a 200 if the server is > healthy > * A /status endpoint for a more detailed examination on the current state of > a MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool > For both endpoints, regardless of status code, the following body is returned: > > {code:java} > { > "action" : "ServerStatusResponse", > "launchedDrivers" : 0, > "message" : "iamok", > "queuedDrivers" : 0, > "schedulerDriverStopped" : false, > "serverSparkVersion" : "2.3.1-SNAPSHOT", > "success" : true, > "pendingRetryDrivers" : 0 > }{code} > Aside from surfacing all of the scheduler metrics, the response also includes > the status of the Mesos SchedulerDriver. 
On numerous occasions now, we have > observed scenarios where the Mesos SchedulerDriver quietly exits due to some > other failure. When this happens, jobs queue up and the only way to clean > things up is to restart the service. > With the above health check, marathon can be configured to automatically > restart the MesosClusterDispatcher service when the health check fails, > lessening the need for manual intervention. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
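The /status body shown above is enough to drive a client-side health decision even without the /health endpoint. A minimal sketch (field names taken from the sample ServerStatusResponse; the exact decision rule here is an illustrative assumption, not part of the patch):

```python
import json

def is_healthy(status_body: str) -> bool:
    """Return True when a /status response describes a healthy dispatcher.

    Mirrors the documented contract: the dispatcher is considered unhealthy
    once the Mesos SchedulerDriver has stopped. Field names come from the
    sample ServerStatusResponse body in the issue description.
    """
    status = json.loads(status_body)
    return status.get("success", False) and not status.get("schedulerDriverStopped", True)

# Sample body from the issue description.
sample = """{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "iamok",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true,
  "pendingRetryDrivers" : 0
}"""

print(is_healthy(sample))  # → True; a stopped driver flips this to False
```

This is essentially what the server-side /health endpoint does before choosing between a 200 and a 503.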
[jira] [Comment Edited] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432165#comment-16432165 ] Attila Zsolt Piros edited comment on SPARK-16630 at 4/10/18 12:15 PM: -- I have a question regarding limiting the number of blacklisted nodes according to the cluster size. With this change there will be two sources of nodes to be blacklisted: - one list comes from the scheduler (the existing node-level blacklisting) - the other is computed here, close to the YARN allocator (stored along with the expiry times) I think it makes sense to apply the limit to the complete list (union) of blacklisted nodes, am I right? If the limit applies to the complete list, then for choosing the subset I think the newly blacklisted nodes are more up to date than the earlier blacklisted ones. So I would pass the expiry times from the scheduler to the YARN allocator to select the subset of blacklisted nodes to be communicated to YARN. What is your opinion? was (Author: attilapiros): I have question regarding limiting the number of blacklisted nodes according to the cluster size. With this change there will be two sources of nodes to be backlisted: - one list is coming from the scheduler (existing node level backlisting) - the other is computed here close to the YARN (stored along with the expiry times) I think it makes sense to have the limit for the complete list (union) of blacklisted nodes, am I right? If this limit is for the complete list then regarding the subset I think the newly blacklisted nodes are more up-to-date to be used then the earlier backlisted ones. So I would pass the expiry times from the scheduler to the YARN allocator to make the subset of backlisted nodes to be communicated to YARN. What is your opinion? > Blacklist a node if executors won't launch on it. 
> - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major > > On YARN, it's possible that a node is messed up or misconfigured such that a > container won't launch on it. For instance, if the Spark external shuffle > handler didn't get loaded on it, or maybe it's just some other hardware issue or > Hadoop configuration issue. > It would be nice if we could recognize this happening and stop trying to launch > executors on it, since that could end up causing us to hit our max number of > executor failures and then kill the job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16630) Blacklist a node if executors won't launch on it.
[ https://issues.apache.org/jira/browse/SPARK-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432165#comment-16432165 ] Attila Zsolt Piros commented on SPARK-16630: I have a question regarding limiting the number of blacklisted nodes according to the cluster size. With this change there will be two sources of nodes to be blacklisted: - one list comes from the scheduler (the existing node-level blacklisting) - the other is computed here, close to YARN (stored along with the expiry times) I think it makes sense to apply the limit to the complete list (union) of blacklisted nodes, am I right? If the limit applies to the complete list, then for choosing the subset I think the newly blacklisted nodes are more up to date than the earlier blacklisted ones. So I would pass the expiry times from the scheduler to the YARN allocator to select the subset of blacklisted nodes to be communicated to YARN. What is your opinion? > Blacklist a node if executors won't launch on it. > - > > Key: SPARK-16630 > URL: https://issues.apache.org/jira/browse/SPARK-16630 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.2 >Reporter: Thomas Graves >Priority: Major > > On YARN, it's possible that a node is messed up or misconfigured such that a > container won't launch on it. For instance, if the Spark external shuffle > handler didn't get loaded on it, or maybe it's just some other hardware issue or > Hadoop configuration issue. > It would be nice if we could recognize this happening and stop trying to launch > executors on it, since that could end up causing us to hit our max number of > executor failures and then kill the job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
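The union-then-cap idea in the comment can be sketched in a few lines. This is purely illustrative (the function and parameter names such as `max_nodes` are assumptions, not Spark APIs): merge the scheduler's node blacklist with the allocator's, keep the latest expiry per node, and prefer the most recently blacklisted nodes when the cap forces a subset.

```python
def nodes_for_yarn(scheduler_blacklist, allocator_blacklist, max_nodes):
    """Merge two {node: expiry_time} maps and pick at most max_nodes nodes.

    The later expiry wins on conflict; when the cap applies, the most
    recently blacklisted nodes (largest expiry) are kept, matching the
    comment's suggestion that newer entries are more up to date.
    """
    merged = dict(scheduler_blacklist)
    for node, expiry in allocator_blacklist.items():
        merged[node] = max(expiry, merged.get(node, 0))
    ranked = sorted(merged, key=merged.get, reverse=True)
    return ranked[:max_nodes]

sched = {"node-a": 100, "node-b": 250}   # from the existing node-level blacklisting
alloc = {"node-b": 300, "node-c": 150}   # computed near the YARN allocator
print(nodes_for_yarn(sched, alloc, 2))   # → ['node-b', 'node-c']
```

Passing the scheduler's expiry times into this merge is exactly what the comment proposes, since without them there is no principled way to rank the union.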
[jira] [Commented] (SPARK-23884) hasLaunchedTask should be true when launchedAnyTask be true
[ https://issues.apache.org/jira/browse/SPARK-23884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432149#comment-16432149 ] wuyi commented on SPARK-23884: -- [~gentlewang] why? > hasLaunchedTask should be true when launchedAnyTask be true > --- > > Key: SPARK-23884 > URL: https://issues.apache.org/jira/browse/SPARK-23884 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Major > Labels: easyfix > Fix For: 2.3.0 > > Attachments: SPARK-23884.patch > > > *hasLaunchedTask* should be *true* when *launchedAnyTask* is *true*, rather > than when *tasks' size > 0*. > *tasks' size* would be greater than 0 as long as there's any *WorkerOffer*, but > this does not ensure that any tasks were launched. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
[ https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang updated SPARK-23705: Attachment: SPARK-23705.patch > dataframe.groupBy() may inadvertently receive sequence of non-distinct strings > -- > > Key: SPARK-23705 > URL: https://issues.apache.org/jira/browse/SPARK-23705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Khoa Tran >Priority: Minor > Labels: beginner, easyfix, features, newbie, starter > Attachments: SPARK-23705.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > {code:java} > // code placeholder > package org.apache.spark.sql > . > . > . > class Dataset[T] private[sql]( > . > . > . > def groupBy(col1: String, cols: String*): RelationalGroupedDataset = { > val colNames: Seq[String] = col1 +: cols > RelationalGroupedDataset( > toDF(), colNames.map(colName => resolve(colName)), > RelationalGroupedDataset.GroupByType) > } > {code} > should append a `.distinct` after `colNames` when used in `groupBy` > > Not sure if the community agrees with this or it's up to the users to perform > the distinct operation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
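The proposed fix amounts to de-duplicating the column-name sequence before resolution (in the Scala snippet above, a `.distinct` on `colNames`). A plain-Python sketch of the same order-preserving de-duplication:

```python
def dedup_group_columns(col1, *cols):
    """Order-preserving de-duplication of grouping column names.

    Mirrors what `colNames.distinct` would do in the Scala groupBy shown
    above: repeated names are dropped, first occurrence order is kept.
    """
    seen = set()
    result = []
    for name in (col1, *cols):
        if name not in seen:
            seen.add(name)
            result.append(name)
    return result

print(dedup_group_columns("name", "age", "name"))  # → ['name', 'age']
```

Whether this belongs inside `groupBy` or is left to the caller is exactly the open question in the ticket.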
[jira] [Commented] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
[ https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432142#comment-16432142 ] Yu Wang commented on SPARK-23705: - [~khoatrantan2000] Could you assign this patch to me? > dataframe.groupBy() may inadvertently receive sequence of non-distinct strings > -- > > Key: SPARK-23705 > URL: https://issues.apache.org/jira/browse/SPARK-23705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Khoa Tran >Priority: Minor > Labels: beginner, easyfix, features, newbie, starter > Attachments: SPARK-23705.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > {code:java} > // code placeholder > package org.apache.spark.sql > . > . > . > class Dataset[T] private[sql]( > . > . > . > def groupBy(col1: String, cols: String*): RelationalGroupedDataset = { > val colNames: Seq[String] = col1 +: cols > RelationalGroupedDataset( > toDF(), colNames.map(colName => resolve(colName)), > RelationalGroupedDataset.GroupByType) > } > {code} > should append a `.distinct` after `colNames` when used in `groupBy` > > Not sure if the community agrees with this or it's up to the users to perform > the distinct operation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23884) hasLaunchedTask should be true when launchedAnyTask be true
[ https://issues.apache.org/jira/browse/SPARK-23884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432118#comment-16432118 ] Yu Wang edited comment on SPARK-23884 at 4/10/18 11:47 AM: --- [~Ngone51] Could you assign this task to me? was (Author: gentlewang): [~Ngone51]Can you assign this task to me? > hasLaunchedTask should be true when launchedAnyTask be true > --- > > Key: SPARK-23884 > URL: https://issues.apache.org/jira/browse/SPARK-23884 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Major > Labels: easyfix > Fix For: 2.3.0 > > Attachments: SPARK-23884.patch > > > *hasLaunchedTask* should be *true* when *launchedAnyTask* is *true*, rather > than when *tasks' size > 0*. > *tasks' size* would be greater than 0 as long as there's any *WorkerOffer*, but > this does not ensure that any tasks were launched. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23922) High-order function: arrays_overlap(x, y) → boolean
[ https://issues.apache.org/jira/browse/SPARK-23922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432124#comment-16432124 ] Marco Gaido commented on SPARK-23922: - I will work on this. > High-order function: arrays_overlap(x, y) → boolean > --- > > Key: SPARK-23922 > URL: https://issues.apache.org/jira/browse/SPARK-23922 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Tests if arrays x and y have any non-null elements in common. Returns > null if there are no non-null elements in common but either array contains > null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
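The three-valued result described in the ticket is easy to get wrong, so here is a plain-Python sketch of those semantics (not Spark's eventual implementation; `None` stands in for SQL NULL):

```python
def arrays_overlap(x, y):
    """Sketch of the semantics described in the ticket.

    True if the arrays share a non-null element; None (SQL NULL) when they
    do not but either array contains a null; False otherwise.
    """
    xs = {e for e in x if e is not None}
    ys = {e for e in y if e is not None}
    if xs & ys:
        return True
    if None in x or None in y:
        return None
    return False

print(arrays_overlap([1, 2], [2, 3]))   # → True
print(arrays_overlap([1, None], [3]))   # → None
print(arrays_overlap([1], [3]))         # → False
```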
[jira] [Commented] (SPARK-23918) High-order function: array_min(x) → x
[ https://issues.apache.org/jira/browse/SPARK-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432119#comment-16432119 ] Apache Spark commented on SPARK-23918: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21025 > High-order function: array_min(x) → x > - > > Key: SPARK-23918 > URL: https://issues.apache.org/jira/browse/SPARK-23918 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns the minimum value of input array. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
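For reference, the behavior being requested can be sketched in a few lines. The null handling here (nulls skipped, `None` for an empty or all-null array) is an assumption of this sketch, not a statement of the final Spark semantics:

```python
def array_min(arr):
    """Minimum of an array's non-null elements.

    Null handling is an assumption for this sketch: nulls are skipped, and
    an empty or all-null array yields None.
    """
    values = [e for e in arr if e is not None]
    return min(values) if values else None

print(array_min([3, 1, 2]))   # → 1
print(array_min([None, 5]))   # → 5
print(array_min([]))          # → None
```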
[jira] [Assigned] (SPARK-23918) High-order function: array_min(x) → x
[ https://issues.apache.org/jira/browse/SPARK-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23918: Assignee: (was: Apache Spark) > High-order function: array_min(x) → x > - > > Key: SPARK-23918 > URL: https://issues.apache.org/jira/browse/SPARK-23918 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns the minimum value of input array. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23918) High-order function: array_min(x) → x
[ https://issues.apache.org/jira/browse/SPARK-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23918: Assignee: Apache Spark > High-order function: array_min(x) → x > - > > Key: SPARK-23918 > URL: https://issues.apache.org/jira/browse/SPARK-23918 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns the minimum value of input array. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23884) hasLaunchedTask should be true when launchedAnyTask be true
[ https://issues.apache.org/jira/browse/SPARK-23884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Wang updated SPARK-23884: Attachment: SPARK-23884.patch > hasLaunchedTask should be true when launchedAnyTask be true > --- > > Key: SPARK-23884 > URL: https://issues.apache.org/jira/browse/SPARK-23884 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Major > Labels: easyfix > Fix For: 2.3.0 > > Attachments: SPARK-23884.patch > > > *hasLaunchedTask* should be *true* when *launchedAnyTask* is *true*, rather > than when *tasks' size > 0*. > *tasks' size* would be greater than 0 as long as there's any *WorkerOffer*, but > this does not ensure that any tasks were launched. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23884) hasLaunchedTask should be true when launchedAnyTask be true
[ https://issues.apache.org/jira/browse/SPARK-23884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432118#comment-16432118 ] Yu Wang commented on SPARK-23884: - [~Ngone51] Can you assign this task to me? > hasLaunchedTask should be true when launchedAnyTask be true > --- > > Key: SPARK-23884 > URL: https://issues.apache.org/jira/browse/SPARK-23884 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Major > Labels: easyfix > Fix For: 2.3.0 > > > *hasLaunchedTask* should be *true* when *launchedAnyTask* is *true*, rather > than when *tasks' size > 0*. > *tasks' size* would be greater than 0 as long as there's any *WorkerOffer*, but > this does not ensure that any tasks were launched. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
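The distinction the report draws can be shown with plain lists: the scheduler builds one per-offer buffer for every WorkerOffer, so the outer collection is non-empty even when no task was placed into any buffer. A sketch with hypothetical names, not the scheduler's actual code:

```python
def any_task_launched(tasks_per_offer):
    """True only when some offer actually received a task.

    Checking the size of the outer collection is the reported bug: it has
    one (possibly empty) entry per WorkerOffer regardless of launches.
    """
    return any(len(tasks) > 0 for tasks in tasks_per_offer)

offers_with_no_launches = [[], [], []]               # three offers, nothing launched
print(len(offers_with_no_launches) > 0)              # → True (the buggy check)
print(any_task_launched(offers_with_no_launches))    # → False (the intended check)
```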
[jira] [Updated] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks
[ https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jin xing updated SPARK-23948: - Description: SparkContext submitted a map stage from "submitMapStage" to DAGScheduler, "markMapStageJobAsFinished" is called only in (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933 and https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314); But think about below scenario: 1. stage0 and stage1 are all "ShuffleMapStage" and stage1 depends on stage0; 2. We submit stage1 by "submitMapStage"; 3. When stage 1 running, "FetchFailed" happened, stage0 and stage1 got resubmitted as stage0_1 and stage1_1; 4. When stage0_1 running, speculated tasks in old stage1 come as succeeded, but stage1 is not inside "runningStages". So even though all splits(including the speculated tasks) in stage1 succeeded, job listener in stage1 will not be called; 5. stage0_1 finished, stage1_1 starts running. When "submitMissingTasks", there is no missing tasks. But in current code, job listener is not triggered was: SparkContext submitted a map stage from "submitMapStage" to DAGScheduler, "markMapStageJobAsFinished" is called only in (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933 and https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314); But think about below scenario: 1. stage0 and stage1 are all "ShuffleMapStage" and stage1 depends on stage0; 2. We submit stage1 by "submitMapStage", there are 10 missing tasks in stage1 3. When stage 1 running, "FetchFailed" happened, stage0 and stage1 got resubmitted as stage0_1 and stage1_1; 4. When stage0_1 running, speculated tasks in old stage1 come as succeeded, but stage1 is not inside "runningStages". 
So even though all splits(including the speculated tasks) in stage1 succeeded, job listener in stage1 will not be called; 5. stage0_1 finished, stage1_1 starts running. When "submitMissingTasks", there is no missing tasks. But in current code, job listener is not triggered > Trigger mapstage's job listener in submitMissingTasks > - > > Key: SPARK-23948 > URL: https://issues.apache.org/jira/browse/SPARK-23948 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: jin xing >Priority: Major > > SparkContext submitted a map stage from "submitMapStage" to DAGScheduler, > "markMapStageJobAsFinished" is called only in > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933 > and > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314); > But think about below scenario: > 1. stage0 and stage1 are all "ShuffleMapStage" and stage1 depends on stage0; > 2. We submit stage1 by "submitMapStage"; > 3. When stage 1 running, "FetchFailed" happened, stage0 and stage1 got > resubmitted as stage0_1 and stage1_1; > 4. When stage0_1 running, speculated tasks in old stage1 come as succeeded, > but stage1 is not inside "runningStages". So even though all splits(including > the speculated tasks) in stage1 succeeded, job listener in stage1 will not be > called; > 5. stage0_1 finished, stage1_1 starts running. When "submitMissingTasks", > there is no missing tasks. But in current code, job listener is not triggered -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
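The fix the description points at can be sketched abstractly: when a resubmitted map stage turns out to have no missing partitions (because speculative copies from the old attempt already finished), its job listener must still be notified. All names below are hypothetical; in Spark the real logic lives in DAGScheduler.submitMissingTasks and markMapStageJobAsFinished.

```python
def submit_missing_tasks(missing_partitions, listeners, notified):
    """Notify map-stage listeners even when nothing is left to run.

    `notified` records listeners already triggered, mirroring the idea that
    a map-stage job is marked finished at most once.
    """
    if not missing_partitions:
        for listener in listeners:
            if listener not in notified:
                notified.add(listener)
                listener()
        return []
    return list(missing_partitions)

calls = []
notified = set()
# Resubmitted stage with zero missing partitions: the listener still fires.
submit_missing_tasks([], [lambda: calls.append("job-finished")], notified)
print(calls)  # → ['job-finished']
```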
[jira] [Created] (SPARK-23951) Use java classed in ExprValue and simplify a bunch of stuff
Herman van Hovell created SPARK-23951: - Summary: Use java classed in ExprValue and simplify a bunch of stuff Key: SPARK-23951 URL: https://issues.apache.org/jira/browse/SPARK-23951 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Herman van Hovell Assignee: Herman van Hovell -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23917) High-order function: array_max(x) → x
[ https://issues.apache.org/jira/browse/SPARK-23917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432084#comment-16432084 ] Apache Spark commented on SPARK-23917: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21024 > High-order function: array_max(x) → x > - > > Key: SPARK-23917 > URL: https://issues.apache.org/jira/browse/SPARK-23917 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Returns the maximum value of input array. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23917) High-order function: array_max(x) → x
[ https://issues.apache.org/jira/browse/SPARK-23917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23917: Assignee: (was: Apache Spark) > High-order function: array_max(x) → x > - > > Key: SPARK-23917 > URL: https://issues.apache.org/jira/browse/SPARK-23917 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Returns the maximum value of input array. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23917) High-order function: array_max(x) → x
[ https://issues.apache.org/jira/browse/SPARK-23917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23917: Assignee: Apache Spark > High-order function: array_max(x) → x > - > > Key: SPARK-23917 > URL: https://issues.apache.org/jira/browse/SPARK-23917 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Returns the maximum value of input array. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23949) makes "&&" supports the function of predicate operator "and"
[ https://issues.apache.org/jira/browse/SPARK-23949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23949: Assignee: Apache Spark > makes "&&" supports the function of predicate operator "and" > > > Key: SPARK-23949 > URL: https://issues.apache.org/jira/browse/SPARK-23949 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: hantiantian >Assignee: Apache Spark >Priority: Minor > > In mysql , symbol && supports the function of predicate operator "and", > maybe we can add support for the function in Spark SQL. > For example, > select * from tbl where id==1 && age=10 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23949) makes "&&" supports the function of predicate operator "and"
[ https://issues.apache.org/jira/browse/SPARK-23949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23949: Assignee: (was: Apache Spark) > makes "&&" supports the function of predicate operator "and" > > > Key: SPARK-23949 > URL: https://issues.apache.org/jira/browse/SPARK-23949 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: hantiantian >Priority: Minor > > In mysql , symbol && supports the function of predicate operator "and", > maybe we can add support for the function in Spark SQL. > For example, > select * from tbl where id==1 && age=10 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23949) makes "&&" supports the function of predicate operator "and"
[ https://issues.apache.org/jira/browse/SPARK-23949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432040#comment-16432040 ] Apache Spark commented on SPARK-23949: -- User 'httfighter' has created a pull request for this issue: https://github.com/apache/spark/pull/21023 > makes "&&" supports the function of predicate operator "and" > > > Key: SPARK-23949 > URL: https://issues.apache.org/jira/browse/SPARK-23949 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: hantiantian >Priority: Minor > > In mysql , symbol && supports the function of predicate operator "and", > maybe we can add support for the function in Spark SQL. > For example, > select * from tbl where id==1 && age=10 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input
[ https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432039#comment-16432039 ] Herman van Hovell commented on SPARK-23945: --- [~nchammas] we didn't add explicit dataset support because no-one asked for it, until now :) What do you want to support here? {{(NOT) IN}} and {{EXISTS}}? Or do you also want to add support for scalar subqueries, and subqueries in filters? > Column.isin() should accept a single-column DataFrame as input > -- > > Key: SPARK-23945 > URL: https://issues.apache.org/jira/browse/SPARK-23945 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Nicholas Chammas >Priority: Minor > > In SQL you can filter rows based on the result of a subquery: > {code:java} > SELECT * > FROM table1 > WHERE name NOT IN ( > SELECT name > FROM table2 > );{code} > In the Spark DataFrame API, the equivalent would probably look like this: > {code:java} > (table1 > .where( > ~col('name').isin( > table2.select('name') > ) > ) > ){code} > However, .isin() currently [only accepts a local list of > values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin]. > I imagine making this enhancement would happen as part of a larger effort to > support correlated subqueries in the DataFrame API. > Or perhaps there is no plan to support this style of query in the DataFrame > API, and queries like this should instead be written in a different way? How > would we write a query like the one I have above in the DataFrame API, > without needing to collect values locally for the NOT IN filter? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
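One reason explicit subquery support matters here is SQL's NOT IN null handling, which a naive local-list or anti-join translation can get wrong. A plain-Python sketch of the three-valued semantics (not Spark code; `None` stands in for SQL NULL):

```python
def not_in(value, subquery_values):
    """SQL NOT IN under three-valued logic.

    False on a match; None (UNKNOWN) when there is no match but the
    subquery produced a NULL or the tested value is NULL; True otherwise.
    An empty subquery makes NOT IN trivially True.
    """
    if value is None:
        return None if subquery_values else True
    if value in subquery_values:
        return False
    if None in subquery_values:
        return None
    return True

print(not_in("alice", {"bob", "carol"}))  # → True: row passes the filter
print(not_in("bob", {"bob", "carol"}))    # → False
print(not_in("alice", {"bob", None}))     # → None: row is filtered out too
```

The last case is the classic NOT IN surprise: a single NULL in the subquery result filters out every non-matching row.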
[jira] [Created] (SPARK-23950) Coalescing an empty dataframe to 1 partition
João Neves created SPARK-23950: -- Summary: Coalescing an empty dataframe to 1 partition Key: SPARK-23950 URL: https://issues.apache.org/jira/browse/SPARK-23950 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.1 Environment: Operating System: Windows 7 Tested in Jupyter notebooks using Python 2.7.14 and Python 3.6.3. Hardware specs not relevant to the issue. Reporter: João Neves Coalescing an empty dataframe to 1 partition returns an error. The funny thing is that coalescing an empty dataframe to 2 or more partitions seem to work. The test case is the following: {code} from pyspark.sql.types import StructType df = spark.createDataFrame(spark.sparkContext.emptyRDD(), StructType([])) print(df.coalesce(2).count()) print(df.coalesce(3).count()) print(df.coalesce(4).count()) df.coalesce(1).count(){code} Output: {code:java} 0 0 0 --- Py4JJavaError Traceback (most recent call last) in () 7 print(df.coalesce(4).count()) 8 > 9 print(df.coalesce(1).count()) C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py in count(self) 425 2 426 """ --> 427 return int(self._jdf.count()) 428 429 @ignore_unicode_prefix c:\python36\lib\site-packages\py4j\java_gateway.py in __call__(self, *args) 1131 answer = self.gateway_client.send_command(command) 1132 return_value = get_return_value( -> 1133 answer, self.gateway_client, self.target_id, self.name) 1134 1135 for temp_arg in temp_args: C:\spark-2.2.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw) 61 def deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() c:\python36\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name) 317 raise Py4JJavaError( 318 "An error occurred while calling {0}{1}{2}.\n". --> 319 format(target_id, ".", name), value) 320 else: 321 raise Py4JError( Py4JJavaError: An error occurred while calling o176.count. 
: java.util.NoSuchElementException: next on empty iterator at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) at scala.collection.IterableLike$class.head(IterableLike.scala:107) at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186) at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2435) at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2434) at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841) at org.apache.spark.sql.Dataset.count(Dataset.scala:2434) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Unknown Source){code} Shouldn't this be consistent? Thank you very much. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
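The stack trace shows count() taking the head of a collection of per-partition counts; the failure is the classic head-of-empty-collection problem when the physical plan produces zero result rows. A sketch of the defensive aggregation pattern (illustrative only, not the actual Spark fix):

```python
def total_count(partition_counts):
    """Sum per-partition row counts; an empty collection yields 0.

    The failing code in the trace instead took the first element of this
    collection unconditionally, which raises on an empty iterator.
    """
    return sum(partition_counts)

print(total_count([0, 0, 0]))  # → 0
print(total_count([]))         # → 0, where taking the head would raise
```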
[jira] [Created] (SPARK-23949) makes "&&" supports the function of predicate operator "and"
hantiantian created SPARK-23949: --- Summary: makes "&&" supports the function of predicate operator "and" Key: SPARK-23949 URL: https://issues.apache.org/jira/browse/SPARK-23949 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: hantiantian In MySQL, the symbol && supports the function of the predicate operator "and"; maybe we can add support for it in Spark SQL. For example, select * from tbl where id==1 && age=10 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
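Until such parser support exists, the client-side workaround is a token substitution. A deliberately naive sketch (it ignores string literals, so it is not safe for arbitrary SQL):

```python
import re

def rewrite_and(sql):
    """Replace the && symbol with AND. Naive: does not skip string literals."""
    return re.sub(r"\s*&&\s*", " AND ", sql)

print(rewrite_and("select * from tbl where id==1 && age=10"))
# → select * from tbl where id==1 AND age=10
```

The proper fix would instead add a `&&` alternative for AND in the SQL grammar, exactly as MySQL does.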
[jira] [Commented] (SPARK-23916) High-order function: array_join(x, delimiter, null_replacement) → varchar
[ https://issues.apache.org/jira/browse/SPARK-23916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431795#comment-16431795 ] Kazuaki Ishizaki commented on SPARK-23916: -- Sorry for my mistake regarding a PR with wrong JIRA number. > High-order function: array_join(x, delimiter, null_replacement) → varchar > - > > Key: SPARK-23916 > URL: https://issues.apache.org/jira/browse/SPARK-23916 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Concatenates the elements of the given array using the delimiter and an > optional string to replace nulls. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org