[jira] [Created] (SPARK-28167) Show global temporary view in database tool
Yuming Wang created SPARK-28167:
-----------------------------------

             Summary: Show global temporary view in database tool
                 Key: SPARK-28167
                 URL: https://issues.apache.org/jira/browse/SPARK-28167
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Yuming Wang


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22340) pyspark setJobGroup doesn't match java threads
[ https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22340:
------------------------------------

    Assignee:     (was: Apache Spark)

> pyspark setJobGroup doesn't match java threads
> ----------------------------------------------
>
>                 Key: SPARK-22340
>                 URL: https://issues.apache.org/jira/browse/SPARK-22340
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.2
>            Reporter: Leif Walsh
>            Priority: Major
>
> With pyspark, {{sc.setJobGroup}}'s documentation says
> {quote}
> Assigns a group ID to all the jobs started by this thread until the group ID
> is set to a different value or cleared.
> {quote}
> However, this doesn't appear to be associated with Python threads, only with
> Java threads. As such, a Python thread which calls this and then submits
> multiple jobs doesn't necessarily get its jobs associated with any particular
> Spark job group. For example:
> {code}
> def run_jobs():
>     sc.setJobGroup('hello', 'hello jobs')
>     x = sc.range(100).sum()
>     y = sc.range(1000).sum()
>     return x, y
>
> import concurrent.futures
> with concurrent.futures.ThreadPoolExecutor() as executor:
>     future = executor.submit(run_jobs)
>     sc.cancelJobGroup('hello')
>     future.result()
> {code}
> In this example, depending on how the action calls on the Python side are
> allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be
> assigned the job group {{hello}}.
> First, we should clarify the docs if this truly is the case.
> Second, it would be really helpful if we could make the job group assignment
> reliable for a Python thread, though I'm not sure the best way to do this.
> As it stands, job groups are pretty useless from the pyspark side if we
> can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then
> patch every point where job submission may take place to pass that in, but
> this feels pretty brittle. In my experience with py4j, controlling threading
> there is a challenge.
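The TLS-mimicking idea the reporter describes can be sketched in plain Python, independent of Spark. This is only an illustration of the propagation scheme, not PySpark's actual mechanism: `set_job_group`, `current_job_group`, and `submit_job` are hypothetical stand-ins for the patched submission points, with the real py4j calls reduced to comments.

```python
import concurrent.futures
import threading

# Sketch of the reporter's idea: keep the job group in Python
# thread-local storage and re-apply it at every submission point,
# instead of relying on which JVM thread py4j happens to reuse.
_job_group = threading.local()

def set_job_group(group_id):
    _job_group.group_id = group_id

def current_job_group():
    return getattr(_job_group, "group_id", None)

def submit_job(action):
    # In real PySpark this is where the JVM-side setJobGroup would be
    # invoked with current_job_group() before triggering the action.
    return (current_job_group(), action())

def run_jobs():
    set_job_group("hello")
    x = submit_job(lambda: sum(range(100)))
    y = submit_job(lambda: sum(range(1000)))
    return x, y

with concurrent.futures.ThreadPoolExecutor() as executor:
    future = executor.submit(run_jobs)
    (gx, x), (gy, y) = future.result()

# Both jobs carry the group set by their own Python thread,
# regardless of which worker thread ran them.
print(gx, gy)  # hello hello
```

Because the group travels with the Python thread rather than a JVM thread, both actions observe the same group, which is exactly the guarantee the current implementation lacks.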
[jira] [Assigned] (SPARK-22340) pyspark setJobGroup doesn't match java threads
[ https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22340:
------------------------------------

    Assignee: Apache Spark

> pyspark setJobGroup doesn't match java threads
> ----------------------------------------------
>
>                 Key: SPARK-22340
>                 URL: https://issues.apache.org/jira/browse/SPARK-22340
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.2
>            Reporter: Leif Walsh
>            Assignee: Apache Spark
>            Priority: Major
[jira] [Reopened] (SPARK-22340) pyspark setJobGroup doesn't match java threads
[ https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reopened SPARK-22340:
----------------------------------

> pyspark setJobGroup doesn't match java threads
> ----------------------------------------------
>
>                 Key: SPARK-22340
>                 URL: https://issues.apache.org/jira/browse/SPARK-22340
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.2
>            Reporter: Leif Walsh
>            Priority: Major
>              Labels: bulk-closed
[jira] [Updated] (SPARK-22340) pyspark setJobGroup doesn't match java threads
[ https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-22340:
---------------------------------
    Labels:   (was: bulk-closed)

> pyspark setJobGroup doesn't match java threads
> ----------------------------------------------
>
>                 Key: SPARK-22340
>                 URL: https://issues.apache.org/jira/browse/SPARK-22340
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.2
>            Reporter: Leif Walsh
>            Priority: Major
[jira] [Commented] (SPARK-28164) usage description does not match with shell scripts
[ https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872873#comment-16872873 ]

Hanna Kan commented on SPARK-28164:
-----------------------------------

But if you add some options, such as {{sbin/start-slave.sh -c $CORES_PER_WORKER -m 3G ${MASTER}}}, it will not work properly, because at the very beginning $1 is not the master.

> usage description does not match with shell scripts
> ---------------------------------------------------
>
>                 Key: SPARK-28164
>                 URL: https://issues.apache.org/jira/browse/SPARK-28164
>             Project: Spark
>          Issue Type: Bug
>          Components: Project Infra
>    Affects Versions: 2.4.3
>            Reporter: Hanna Kan
>            Priority: Major
>
> I found that "spark/sbin/start-slave.sh" may have an error.
> Line 43 gives: echo "Usage: ./sbin/start-slave.sh [options] <master>"
> But later in this script, line 59 has: MASTER=$1
> Is this a conflict?
[jira] [Resolved] (SPARK-27656) Safely register class for GraphX
[ https://issues.apache.org/jira/browse/SPARK-27656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng resolved SPARK-27656.
----------------------------------
    Resolution: Not A Problem

> Safely register class for GraphX
> --------------------------------
>
>                 Key: SPARK-27656
>                 URL: https://issues.apache.org/jira/browse/SPARK-27656
>             Project: Spark
>          Issue Type: Improvement
>          Components: GraphX
>    Affects Versions: 2.4.3
>            Reporter: zhengruifeng
>            Priority: Major
>
> GraphX common classes (such as Edge, EdgeTriplet) are not registered in Kryo by default.
> Users can register those classes via {{GraphXUtils.registerKryoClasses}}; however, it seems that no graphx-lib impls call it, and users tend to overlook this registration.
> So I prefer to safely register them in {{KryoSerializer.scala}}, like what SQL and ML do.
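In the meantime, a user can opt in to the registration themselves. A minimal sketch of the conf a PySpark caller might assemble, assuming the standard `spark.kryo.classesToRegister` conf key; the GraphX class names here are assumptions and should be verified against your Spark version:

```python
# Sketch: build a Spark conf that registers GraphX classes with Kryo
# by fully-qualified name. The class names below are assumed, not
# verified against a particular Spark release.
graphx_classes = [
    "org.apache.spark.graphx.Edge",
    "org.apache.spark.graphx.EdgeTriplet",
]

conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # spark.kryo.classesToRegister expects a comma-separated list:
    "spark.kryo.classesToRegister": ",".join(graphx_classes),
}
print(conf["spark.kryo.classesToRegister"])
```

On the Scala side, calling {{GraphXUtils.registerKryoClasses}} on the SparkConf (as the description notes) achieves the same effect without listing names by hand.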
[jira] [Updated] (SPARK-28159) Make the transform natively in ml framework to avoid extra conversion
[ https://issues.apache.org/jira/browse/SPARK-28159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng updated SPARK-28159:
---------------------------------
    Description:
It has been a long time since ML was released. However, there are still many TODOs (like in [ChiSqSelector.scala|https://github.com/apache/spark/pull/24963/files#diff-9b0bc8a01b34c38958ce45c14f9c5da5]: {{// TODO: Make the transformer natively in ml framework to avoid extra conversion.}}) on making transform work natively in the ml framework.
I am trying to make ml algs no longer need to convert ml-vector to mllib-vector in transforms, including:
LDA/ChiSqSelector/ElementwiseProduct/HashingTF/IDF/Normalizer/PCA/StandardScaler.

    was:
It has been a long time since ML was released. However, there are still many TODOs on making transform work natively in the ml framework.

> Make the transform natively in ml framework to avoid extra conversion
> ---------------------------------------------------------------------
>
>                 Key: SPARK-28159
>                 URL: https://issues.apache.org/jira/browse/SPARK-28159
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Priority: Major
[jira] [Commented] (SPARK-22340) pyspark setJobGroup doesn't match java threads
[ https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872852#comment-16872852 ]

Liang-Chi Hsieh commented on SPARK-22340:
-----------------------------------------

[~hyukjin.kwon] Should we reopen this, as you have opened a PR for it now?

> pyspark setJobGroup doesn't match java threads
> ----------------------------------------------
>
>                 Key: SPARK-22340
>                 URL: https://issues.apache.org/jira/browse/SPARK-22340
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.2
>            Reporter: Leif Walsh
>            Priority: Major
>              Labels: bulk-closed
[jira] [Resolved] (SPARK-27676) InMemoryFileIndex should hard-fail on missing files instead of logging and continuing
[ https://issues.apache.org/jira/browse/SPARK-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-27676.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0

Issue resolved by pull request 24668
[https://github.com/apache/spark/pull/24668]

> InMemoryFileIndex should hard-fail on missing files instead of logging and
> continuing
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-27676
>                 URL: https://issues.apache.org/jira/browse/SPARK-27676
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Major
>             Fix For: 3.0.0
>
> Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} exceptions are caught and logged as warnings (during [directory listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274] and [block location lookup|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L333]). I think that this is a dangerous default behavior and would prefer that Spark hard-fails by default (with the ignore-and-continue behavior guarded by a SQL session configuration).
> In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. Quoting from the PR for SPARK-17599:
> {quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. If a folder is deleted at any time between the paths were resolved and the file catalog can check for the folder, the Spark job fails. This may abruptly stop long running StructuredStreaming jobs for example. Folders may be deleted by users or automatically by retention policies. These cases should not prevent jobs from successfully completing.{quote}
> Let's say that I'm *not* expecting to ever delete input files for my job. In that case, this behavior can mask bugs.
> One straightforward masked bug class is accidental file deletion: if I'm never expecting to delete files then I'd prefer to fail my job if Spark sees deleted files.
> A more subtle bug can occur when using an S3 filesystem. Say I'm running a Spark job against a partitioned Parquet dataset which is laid out like this:
> {code:java}
> data/
>   date=1/
>     region=west/
>       0.parquet
>       1.parquet
>     region=east/
>       0.parquet
>       1.parquet{code}
> If I do {{spark.read.parquet("/data/date=1/")}} then Spark needs to perform multiple rounds of file listing, first listing {{/data/date=1}} to discover the partitions for that date, then listing within each partition to discover the leaf files. Due to the eventual consistency of S3 ListObjects, it's possible that the first listing will show the {{region=west}} and {{region=east}} partitions existing and then the next-level listing fails to return any files for some of the directories (e.g. {{/data/date=1/}} returns files but {{/data/date=1/region=west/}} throws a {{FileNotFoundException}} in S3A due to ListObjects inconsistency).
> If Spark propagated the {{FileNotFoundException}} and hard-failed in this case then I'd be able to fail the job where we _definitely_ know that the S3 listing is inconsistent. (Failing here doesn't guard against _all_ potential S3 list inconsistency issues, e.g. back-to-back listings which both return a subset of the true set of objects, but I think it's still an improvement to fail for the subset of cases that we _can_ detect, even if that's not a surefire failsafe against the more general problem.)
> Finally, I'm unsure if the original patch will have the desired effect: if a file is deleted once a Spark job expects to read it then that can cause problems at multiple layers, both in the driver (multiple rounds of file listing) and in executors (if the deletion occurs after the construction of the catalog but before the scheduling of the read tasks); I think the original patch only resolved the problem for the driver (unless I'm missing similar executor-side code specific to the original streaming use-case).
> Given all of these reasons, I think that the "ignore potentially deleted files during file index listing" behavior should be guarded behind a feature flag which defaults to {{false}}, consistent with the existing {{spark.files.ignoreMissingFiles}} and {{spark.sql.files.ignoreMissingFiles}} flags (which both default to false).
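The flag-guarded behavior the ticket proposes can be sketched in plain Python; this is an illustration of the semantics, not Spark's internals, and names like `list_files` and `ignore_missing_files` are hypothetical:

```python
# Sketch of flag-guarded listing: hard-fail on a file that vanishes
# mid-listing unless the caller opted in, mirroring the proposed
# default-false semantics of spark.sql.files.ignoreMissingFiles.

def list_files(paths, lookup, ignore_missing_files=False):
    """Gather listings for `paths`; `lookup` maps a path to its files
    or raises FileNotFoundError for a path that disappeared."""
    found = []
    for path in paths:
        try:
            found.extend(lookup(path))
        except FileNotFoundError:
            if not ignore_missing_files:
                raise  # default: surface the inconsistency to the caller
            # opted in: log-and-continue behavior (logging elided)
    return found

# Simulate the S3 scenario: the first-round listing promised two
# partitions, but only one is visible on the second round.
listings = {"date=1/region=west/": ["0.parquet", "1.parquet"]}

def lookup(path):
    if path not in listings:
        raise FileNotFoundError(path)
    return listings[path]

paths = ["date=1/region=west/", "date=1/region=east/"]
try:
    list_files(paths, lookup)
    outcome = "ok"
except FileNotFoundError:
    outcome = "failed"
print(outcome)  # failed
```

Under the proposed default the inconsistent listing fails loudly, while passing `ignore_missing_files=True` restores today's silent-skip behavior.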
[jira] [Commented] (SPARK-27802) SparkUI throws NoSuchElementException when inconsistency appears between `ExecutorStageSummaryWrapper`s and `ExecutorSummaryWrapper`s
[ https://issues.apache.org/jira/browse/SPARK-27802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872740#comment-16872740 ]

shahid commented on SPARK-27802:
--------------------------------

Could you please provide reproduction steps for the issue?

> SparkUI throws NoSuchElementException when inconsistency appears between
> `ExecutorStageSummaryWrapper`s and `ExecutorSummaryWrapper`s
> -------------------------------------------------------------------------
>
>                 Key: SPARK-27802
>                 URL: https://issues.apache.org/jira/browse/SPARK-27802
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.3.2
>            Reporter: liupengcheng
>            Priority: Major
>
> Recently, we hit this issue when testing Spark 2.3. It reports the following error messages when clicking on the stage UI link.
> We added more logs to print the executorId (here it is 10) to debug, and finally found out that it is caused by an inconsistency between the list of `ExecutorStageSummaryWrapper`s and the `ExecutorSummaryWrapper`s in the KVStore. The number of dead executors may have exceeded the threshold, causing an executor to be removed from the list of `ExecutorSummaryWrapper`s while still being kept in the list of `ExecutorStageSummaryWrapper`s in the store.
> {code:java}
> HTTP ERROR 500
> Problem accessing /stages/stage/.
> Reason:
>     Server Error
> Caused by:
> java.util.NoSuchElementException: 10
>     at org.apache.spark.util.kvstore.InMemoryStore.read(InMemoryStore.java:83)
>     at org.apache.spark.status.ElementTrackingStore.read(ElementTrackingStore.scala:95)
>     at org.apache.spark.status.AppStatusStore.executorSummary(AppStatusStore.scala:70)
>     at org.apache.spark.ui.jobs.ExecutorTable$$anonfun$createExecutorTable$2.apply(ExecutorTable.scala:99)
>     at org.apache.spark.ui.jobs.ExecutorTable$$anonfun$createExecutorTable$2.apply(ExecutorTable.scala:92)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>     at org.apache.spark.ui.jobs.ExecutorTable.createExecutorTable(ExecutorTable.scala:92)
>     at org.apache.spark.ui.jobs.ExecutorTable.toNodeSeq(ExecutorTable.scala:75)
>     at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:478)
>     at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
>     at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
>     at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     at org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166)
>     at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>     at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>     at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     at org.spark_project.jetty.server.Server.handle(Server.java:539)
>     at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>     at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>     at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>     at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
>     at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>     at
[jira] [Commented] (SPARK-28152) ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872739#comment-16872739 ]

Shiv Prashant Sood commented on SPARK-28152:
--------------------------------------------

Resolved as part of https://issues.apache.org/jira/browse/SPARK-28151

> ShortType and FloatTypes are not correctly mapped to right JDBC types when
> using JDBC connector
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-28152
>                 URL: https://issues.apache.org/jira/browse/SPARK-28152
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 2.4.3
>            Reporter: Shiv Prashant Sood
>            Priority: Minor
>
> ShortType and FloatType are not correctly mapped to the right JDBC types when using the JDBC connector. This results in tables and Spark data frames being created with unintended types.
> Some example issues:
> * A write from a df with a ShortType column results in a SQL table with the column typed as INTEGER, as opposed to SMALLINT; thus a larger table than expected.
> * A read results in a dataframe with type INTEGER, as opposed to ShortType.
> FloatType has an issue on the read path. In the write path the Spark data type 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but in the read path, when JDBC data types are converted to Catalyst data types (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.
[jira] [Updated] (SPARK-28151) ByteType, ShortType and FloatTypes are not correctly mapped for read/write of SQLServer tables
[ https://issues.apache.org/jira/browse/SPARK-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shiv Prashant Sood updated SPARK-28151:
---------------------------------------
    Description:

## ByteType issue

Writing a dataframe with column type ByteType fails when using the JDBC connector for SQL Server. Append and read of tables also fail. The problem is due to:

1. (Write path) Incorrect mapping of ByteType in getCommonJDBCType() in jdbcutils.scala, where ByteType gets mapped to the text "BYTE". It should be mapped to TINYINT:

case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))

In getCatalystType() (JDBC to Catalyst type mapping), TINYINT is mapped to INTEGER, while it should be mapped to ByteType. Mapping to INTEGER is ok from the point of view of upcasting, but will lead to a 4-byte allocation rather than 1 byte for ByteType.

2. (Read path) The read path ends up calling makeGetter(dt: DataType, metadata: Metadata). The function sets the value in the RDD row per the data type. There is no mapping for ByteType here, so this will result in an error once getCatalystType() is fixed.

Note: These issues were found when reading/writing with SQL Server. Will be submitting a PR soon to fix these mappings in MSSQLServerDialect.

Error seen when writing a table (JDBC write failed, com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable #2: *Cannot find data type BYTE*.):

com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable #2: Cannot find data type BYTE.
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
..

## ShortType and FloatType issue

ShortType and FloatType are not correctly mapped to the right JDBC types when using the JDBC connector. This results in tables and Spark data frames being created with unintended types. Some example issues:

* A write from a df with a ShortType column results in a SQL table with the column typed as INTEGER, as opposed to SMALLINT; thus a larger table than expected.
* A read results in a dataframe with type INTEGER, as opposed to ShortType.

FloatType has an issue on the read path. In the write path the Spark data type 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but in the read path, when JDBC data types are converted to Catalyst data types (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.

> ByteType, ShortType and FloatTypes are not correctly mapped for read/write of
> SQLServer tables
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-28151
>                 URL: https://issues.apache.org/jira/browse/SPARK-28151
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0, 2.4.3
>            Reporter: Shiv Prashant Sood
[jira] [Updated] (SPARK-28151) ByteType, ShortType and FloatTypes are not correctly mapped for read/write of SQLServer tables
[ https://issues.apache.org/jira/browse/SPARK-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiv Prashant Sood updated SPARK-28151: --- Summary: ByteType, ShortType and FloatTypes are not correctly mapped for read/write of SQLServer tables (was: ByteType is not correctly mapped for read/write of SQLServer tables) > ByteType, ShortType and FloatTypes are not correctly mapped for read/write of > SQLServer tables > -- > > Key: SPARK-28151 > URL: https://issues.apache.org/jira/browse/SPARK-28151 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 2.4.3 >Reporter: Shiv Prashant Sood >Priority: Minor > > Writing dataframe with column type BYTETYPE fails when using JDBC connector > for SQL Server. Append and Read of tables also fail. The problem is due > 1. (Write path) Incorrect mapping of BYTETYPE in getCommonJDBCType() in > jdbcutils.scala where BYTETYPE gets mapped to BYTE text. It should be mapped > to TINYINT > {color:#cc7832}case {color}ByteType => > Option(JdbcType({color:#6a8759}"BYTE"{color}{color:#cc7832}, > {color}java.sql.Types.{color:#9876aa}TINYINT{color})) > In getCatalystType() ( JDBC to Catalyst type mapping) TINYINT is mapped to > INTEGER, while it should be mapped to BYTETYPE. Mapping to integer is ok from > the point of view of upcasting, but will lead to 4 byte allocation rather > than 1 byte for BYTETYPE. > 2. (read path) Read path ends up calling makeGetter(dt: DataType, metadata: > Metadata). The function sets the value in RDD row. The value is set per the > data type. Here there is no mapping for BYTETYPE and thus results will result > in an error when getCatalystType() is fixed. > Note : These issues were found when reading/writing with SQLServer. Will be > submitting a PR soon to fix these mappings in MSSQLServerDialect. > Error seen when writing table > (JDBC Write failed,com.microsoft.sqlserver.jdbc.SQLServerException: Column, > parameter, or variable #2: *Cannot find data type BYTE*.) 
> com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or > variable #2: Cannot find data type BYTE. > com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254) > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608) > com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859) > .. > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
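The intended fix described above can be sketched as a round-trip check. This is a plain-Python model of the mappings, not the actual JdbcDialect code; the dict and function names are illustrative only:

```python
# Illustrative model (not Spark code) of the corrected SQL Server mappings.

# Write path: Catalyst type -> SQL Server column type
catalyst_to_sqlserver = {
    "ByteType": "TINYINT",   # was incorrectly "BYTE", which SQL Server rejects
}

# Read path: SQL Server/JDBC type -> Catalyst type
sqlserver_to_catalyst = {
    "TINYINT": "ByteType",   # was incorrectly IntegerType (4 bytes vs 1 byte)
}

def roundtrip(catalyst_type: str) -> str:
    """Map a Catalyst type out to SQL Server and back again."""
    return sqlserver_to_catalyst[catalyst_to_sqlserver[catalyst_type]]

# With both paths fixed, the type survives a write/read round trip.
assert roundtrip("ByteType") == "ByteType"
```

With the old mappings, the write path would emit the nonexistent `BYTE` type (the error quoted above), and the read path would widen `TINYINT` to a 4-byte integer.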
[jira] [Updated] (SPARK-28152) ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shiv Prashant Sood updated SPARK-28152: --- Description: ShortType and FloatType are not correctly mapped to the right JDBC types when using the JDBC connector. This results in tables and Spark dataframes being created with unintended types. Some example issues: * A write from a df with a ShortType column results in a SQL table with column type INTEGER as opposed to SMALLINT, and thus a larger table than expected. * A read results in a dataframe with type INTEGER as opposed to ShortType. FloatType has an issue on the read path. On the write path, the Spark data type 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real'. But on the read path, when JDBC data types are converted to Catalyst data types (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 'FloatType'. was: ShortType and FloatType are not correctly mapped to the right JDBC types when using the JDBC connector. This results in tables or Spark dataframes being created with unintended types. Some example issues: * A write from a df with a ShortType column results in a SQL table with column type INTEGER as opposed to SMALLINT, and thus a larger table than expected. * A read results in a dataframe with type INTEGER as opposed to ShortType. FloatType has an issue on the read path. On the write path, the Spark data type 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real'. But on the read path, when JDBC data types are converted to Catalyst data types (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 'FloatType'. 
> ShortType and FloatTypes are not correctly mapped to right JDBC types when > using JDBC connector > --- > > Key: SPARK-28152 > URL: https://issues.apache.org/jira/browse/SPARK-28152 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0, 2.4.3 > Reporter: Shiv Prashant Sood > Priority: Minor > > ShortType and FloatType are not correctly mapped to the right JDBC types when using the JDBC connector. This results in tables and Spark dataframes being created with unintended types. > Some example issues: > * A write from a df with a ShortType column results in a SQL table with column type INTEGER as opposed to SMALLINT, and thus a larger table than expected. > * A read results in a dataframe with type INTEGER as opposed to ShortType. > FloatType has an issue on the read path. On the write path, the Spark data type 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real'. But on the read path, when JDBC data types are converted to Catalyst data types (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 'FloatType'. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
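The intended two-way mappings for this issue can be sketched the same way, again as a plain-Python model rather than the actual dialect code (names are illustrative):

```python
# Illustrative model (not Spark code) of the intended JDBC type mappings.

# Write path: Catalyst type -> JDBC column type
write_path = {
    "ShortType": "SMALLINT",  # was widened to INTEGER, making tables larger
    "FloatType": "REAL",      # this direction was already correct
}

# Read path: JDBC type -> Catalyst type
read_path = {
    "SMALLINT": "ShortType",  # was read back as IntegerType
    "REAL":     "FloatType",  # was read back as DoubleType
}

def roundtrip(catalyst_type: str) -> str:
    """Write a column out via JDBC and read it back."""
    return read_path[write_path[catalyst_type]]

# With both directions fixed, each type survives the round trip.
for t in ("ShortType", "FloatType"):
    assert roundtrip(t) == t
```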
[jira] [Updated] (SPARK-28166) Query optimization for symmetric difference / disjunctive union of Datasets
[ https://issues.apache.org/jira/browse/SPARK-28166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-28166: --- Description: The *symmetric difference* (a.k.a. *disjunctive union*) of two sets is their set union minus their set intersection: it returns tuples which are in only one of the sets and omits tuples which are present in both sets (see [https://en.wikipedia.org/wiki/Symmetric_difference]). With the Datasets API, we can express this as either {code:java} a.union(b).except(a.intersect(b)){code} or {code:java} a.except(b).union(b.except(a)){code} Spark currently plans this query with two joins. However, it may be more efficient to represent this as a full outer join followed by a filter and a distinct (and, depending on the number of duplicates, we might want to push additional distinct clauses beneath the join, but I think that's a separate optimization). It would be cool if the optimizer could automatically perform this rewrite. This is a very low priority: I'm filing this ticket mostly for tracking / reference purposes (so searches for 'symmetric difference' turn up something useful in Spark's JIRA). was: The *symmetric difference* (a.k.a. *disjunctive union*) of two sets is their set union minus their set intersection: it returns tuples which are in only one of the sets and omits tuples which are present in both sets (see [https://en.wikipedia.org/wiki/Symmetric_difference]). With the Datasets API, we can express this as either {code:java} a.union(b).except(a.intersect(b)){code} or {code:java} a.except(b).union(b.except(a)){code} Spark currently plans this query with two joins. However, it may be more efficient to represent this as a full outer join followed by a filter and a distinct (and, depending on the number of duplicates, we might want to push additional distinct clauses beneath the join, but I think that's a separate optimization). It would be cool if the optimizer could automatically perform this rewrite. 
This is a pretty low priority: I'm filing this ticket mostly for tracking / reference purposes (so searches for 'symmetric difference' turn up something useful in Spark's JIRA). > Query optimization for symmetric difference / disjunctive union of Datasets > --- > > Key: SPARK-28166 > URL: https://issues.apache.org/jira/browse/SPARK-28166 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Josh Rosen > Priority: Minor > > The *symmetric difference* (a.k.a. *disjunctive union*) of two sets is their > set union minus their set intersection: it returns tuples which are in only > one of the sets and omits tuples which are present in both sets (see > [https://en.wikipedia.org/wiki/Symmetric_difference]). > With the Datasets API, we can express this as either > {code:java} > a.union(b).except(a.intersect(b)){code} > or > {code:java} > a.except(b).union(b.except(a)){code} > Spark currently plans this query with two joins. However, it may be more > efficient to represent this as a full outer join followed by a filter and a > distinct (and, depending on the number of duplicates, we might want to push > additional distinct clauses beneath the join, but I think that's a separate > optimization). It would be cool if the optimizer could automatically perform > this rewrite. > This is a very low priority: I'm filing this ticket mostly for tracking / > reference purposes (so searches for 'symmetric difference' turn up something > useful in Spark's JIRA). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
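The two Dataset expressions and the proposed full-outer-join plan can be modeled with plain Python sets. This is only a semantic sketch: Dataset `except`/`intersect` have set semantics but `union` keeps duplicates, which is exactly why the ticket mentions pushing extra distincts beneath the join.

```python
def symdiff_union_except(a: set, b: set) -> set:
    # a.union(b).except(a.intersect(b))
    return (a | b) - (a & b)

def symdiff_except_union(a: set, b: set) -> set:
    # a.except(b).union(b.except(a))
    return (a - b) | (b - a)

def symdiff_full_outer_join(a: set, b: set) -> set:
    # The proposed single-pass plan: full outer join on the whole row,
    # then a filter keeping rows that matched on exactly one side.
    out = set()
    for row in a | b:                  # every row appearing on either side
        left, right = row in a, row in b
        if left != right:              # present on exactly one side
            out.add(row)
    return out

a, b = {1, 2, 3}, {3, 4}
r1 = symdiff_union_except(a, b)
r2 = symdiff_except_union(a, b)
r3 = symdiff_full_outer_join(a, b)
assert r1 == r2 == r3 == {1, 2, 4}    # all three formulations agree
```

The join-based version touches each input once, whereas the first two formulations each imply two separate join-like operators over `a` and `b`.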
[jira] [Created] (SPARK-28166) Query optimization for symmetric difference / disjunctive union of Datasets
Josh Rosen created SPARK-28166: -- Summary: Query optimization for symmetric difference / disjunctive union of Datasets Key: SPARK-28166 URL: https://issues.apache.org/jira/browse/SPARK-28166 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Josh Rosen The *symmetric difference* (a.k.a. *disjunctive union*) of two sets is their set union minus their set intersection: it returns tuples which are in only one of the sets and omits tuples which are present in both sets (see [https://en.wikipedia.org/wiki/Symmetric_difference]). With the Datasets API, we can express this as either {code:java} a.union(b).except(a.intersect(b)){code} or {code:java} a.except(b).union(b.except(a)){code} Spark currently plans this query with two joins. However, it may be more efficient to represent this as a full outer join followed by a filter and a distinct (and, depending on the number of duplicates, we might want to push additional distinct clauses beneath the join, but I think that's a separate optimization). It would be cool if the optimizer could automatically perform this rewrite. This is a pretty low priority: I'm filing this ticket mostly for tracking / reference purposes (so searches for 'symmetric difference' turn up something useful in Spark's JIRA). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25390) data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872684#comment-16872684 ] Lars Francke commented on SPARK-25390: -- Is there any kind of end-user documentation on how to use these APIs to develop custom sources? Looking at the Spark homepage, one only finds this documentation: [https://spark.apache.org/docs/2.2.0/streaming-custom-receivers.html]. It'd be useful to have a version of this for the new APIs. > data source V2 API refactoring > -- > > Key: SPARK-25390 > URL: https://issues.apache.org/jira/browse/SPARK-25390 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Wenchen Fan > Priority: Major > > Currently it's not very clear how we should abstract the data source v2 API. The > abstraction should be unified between batch and streaming, or be similar but > have a well-defined difference between batch and streaming. And the > abstraction should also include catalog/table. > An example of the abstraction: > {code} > batch: catalog -> table -> scan > streaming: catalog -> table -> stream -> scan > {code} > We should refactor the data source v2 API according to this abstraction -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
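The quoted abstraction can be sketched as a toy object model. The class and method names below are illustrative only, chosen to mirror the `catalog -> table -> scan` chain in the ticket; they are not the actual DSv2 interfaces:

```python
# Toy model of the proposed abstraction; names are hypothetical.

class Scan:
    """Leaf of both chains: reads a batch of data."""

class Stream:
    """Streaming adds one level: a stream yields scans over time."""
    def scan(self) -> Scan:
        return Scan()

class Table:
    def scan(self) -> Scan:        # batch:     catalog -> table -> scan
        return Scan()
    def stream(self) -> Stream:    # streaming: catalog -> table -> stream -> scan
        return Stream()

class Catalog:
    def table(self, name: str) -> Table:
        return Table()

cat = Catalog()
batch_scan = cat.table("t").scan()
stream_scan = cat.table("t").stream().scan()
# Both chains bottom out in the same Scan abstraction, which is the
# "unified between batch and streaming" property the ticket asks for.
assert isinstance(batch_scan, Scan) and isinstance(stream_scan, Scan)
```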
[jira] [Commented] (SPARK-28135) ceil/ceiling/floor/power returns incorrect values
[ https://issues.apache.org/jira/browse/SPARK-28135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872657#comment-16872657 ] Tony Zhang commented on SPARK-28135: As commented in the PR, to fix this overflow we should not change Ceil's return type to double, and thus will have to add support for a 128-bit int type in the code base. That will be a different story. > ceil/ceiling/floor/power returns incorrect values > - > > Key: SPARK-28135 > URL: https://issues.apache.org/jira/browse/SPARK-28135 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Reporter: Yuming Wang > Priority: Major > > {noformat} > spark-sql> select ceil(double(1.2345678901234e+200)), > ceiling(double(1.2345678901234e+200)), floor(double(1.2345678901234e+200)), > power('1', 'NaN'); > 9223372036854775807 9223372036854775807 9223372036854775807 NaN > {noformat} > {noformat} > postgres=# select ceil(1.2345678901234e+200::float8), > ceiling(1.2345678901234e+200::float8), floor(1.2345678901234e+200::float8), > power('1', 'NaN'); > ceil | ceiling|floor | power > --+--+--+--- > 1.2345678901234e+200 | 1.2345678901234e+200 | 1.2345678901234e+200 | 1 > (1 row) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
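The incorrect value 9223372036854775807 in the report is exactly Long.MaxValue, which points at saturation when the double result is narrowed to a 64-bit long. A small Python check illustrates this; `to_long_saturating` is a hypothetical helper mimicking Java-style clamping narrowing, not Spark code:

```python
import math

LONG_MAX = 2**63 - 1   # 9223372036854775807, the value Spark returns
LONG_MIN = -2**63

x = 1.2345678901234e+200
# The true ceiling is far beyond what a 64-bit long can represent,
# so any long-typed ceil() of this value must lose information.
assert math.ceil(x) > LONG_MAX

def to_long_saturating(v: float) -> int:
    """Clamp an oversized result into the 64-bit long range (sketch of
    Java-style double-to-long narrowing)."""
    return max(LONG_MIN, min(LONG_MAX, math.ceil(v)))

print(to_long_saturating(x))  # 9223372036854775807
```

This is why the PostgreSQL results stay in double (`float8`) and return 1.2345678901234e+200 unchanged: keeping a long return type would require a wider integer type, as the comment says.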
[jira] [Resolved] (SPARK-27630) Stage retry causes totalRunningTasks calculation to be negative
[ https://issues.apache.org/jira/browse/SPARK-27630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-27630. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24497 [https://github.com/apache/spark/pull/24497] > Stage retry causes totalRunningTasks calculation to be negative > --- > > Key: SPARK-27630 > URL: https://issues.apache.org/jira/browse/SPARK-27630 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.3.0 > Reporter: dzcxzl > Assignee: dzcxzl > Priority: Minor > Fix For: 3.0.0 > > > Track tasks separately for each stage attempt (instead of tracking by stage), > and do NOT reset the numRunningTasks to 0 on StageCompleted. > In the case of a stage retry, the {{taskEnd}} event from the zombie stage > sometimes makes the number of {{totalRunningTasks}} negative, which causes the job to get stuck. > A similar problem also exists with {{stageIdToTaskIndices}} & > {{stageIdToSpeculativeTaskIndices}}. > If it is a failed {{taskEnd}} event of the zombie stage, this will cause > {{stageIdToTaskIndices}} or {{stageIdToSpeculativeTaskIndices}} to remove the > task index of the active stage, and the number of {{totalPendingTasks}} will > increase unexpectedly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27630) Stage retry causes totalRunningTasks calculation to be negative
[ https://issues.apache.org/jira/browse/SPARK-27630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-27630: Assignee: dzcxzl > Stage retry causes totalRunningTasks calculation to be negative > --- > > Key: SPARK-27630 > URL: https://issues.apache.org/jira/browse/SPARK-27630 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.3.0 > Reporter: dzcxzl > Assignee: dzcxzl > Priority: Minor > > > Track tasks separately for each stage attempt (instead of tracking by stage), > and do NOT reset the numRunningTasks to 0 on StageCompleted. > In the case of a stage retry, the {{taskEnd}} event from the zombie stage > sometimes makes the number of {{totalRunningTasks}} negative, which causes the job to get stuck. > A similar problem also exists with {{stageIdToTaskIndices}} & > {{stageIdToSpeculativeTaskIndices}}. > If it is a failed {{taskEnd}} event of the zombie stage, this will cause > {{stageIdToTaskIndices}} or {{stageIdToSpeculativeTaskIndices}} to remove the > task index of the active stage, and the number of {{totalPendingTasks}} will > increase unexpectedly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28157) Make SHS clear KVStore LogInfo for the blacklisted entries
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28157: -- Summary: Make SHS clear KVStore LogInfo for the blacklisted entries (was: Make SHS check Spark event log file permission changes) > Make SHS clear KVStore LogInfo for the blacklisted entries > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 > Reporter: Dongjoon Hyun > Priority: Major > > As of Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a blacklist for all event log files that failed > once at reading. The blacklisted log files are released back after > CLEAN_INTERVAL_S. > However, files whose size doesn't change are ignored forever, because > shouldReloadLog always returns false when the size is the same as the value > in the KVStore. This is recovered only via an SHS restart. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28151) ByteType is not correctly mapped for read/write of SQLServer tables
[ https://issues.apache.org/jira/browse/SPARK-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872577#comment-16872577 ] Shiv Prashant Sood commented on SPARK-28151: Fixed by [https://github.com/apache/spark/pull/24969] > ByteType is not correctly mapped for read/write of SQLServer tables > --- > > Key: SPARK-28151 > URL: https://issues.apache.org/jira/browse/SPARK-28151 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0, 2.4.3 > Reporter: Shiv Prashant Sood > Priority: Minor > > Writing a dataframe with column type BYTETYPE fails when using the JDBC connector for SQL Server. Append and read of tables also fail. The problem is due to: > 1. (Write path) Incorrect mapping of BYTETYPE in getCommonJDBCType() in jdbcutils.scala, where BYTETYPE gets mapped to the text BYTE. It should be mapped to TINYINT. > {code:java} > case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT)) > {code} > In getCatalystType() (the JDBC to Catalyst type mapping), TINYINT is mapped to INTEGER, while it should be mapped to BYTETYPE. Mapping to INTEGER is OK from the point of view of upcasting, but it will lead to a 4-byte allocation rather than 1 byte for BYTETYPE. > 2. (Read path) The read path ends up calling makeGetter(dt: DataType, metadata: Metadata). The function sets the value in the RDD row per the data type. There is no mapping for BYTETYPE here, and thus reads will result in an error once getCatalystType() is fixed. > Note: these issues were found when reading/writing with SQLServer. Will be submitting a PR soon to fix these mappings in MSSQLServerDialect. > Error seen when writing a table: > (JDBC Write failed, com.microsoft.sqlserver.jdbc.SQLServerException: Column, > parameter, or variable #2: *Cannot find data type BYTE*.) > com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or > variable #2: Cannot find data type BYTE. 
> com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254) > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608) > com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859) > .. > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28151) ByteType is not correctly mapped for read/write of SQLServer tables
[ https://issues.apache.org/jira/browse/SPARK-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28151: Assignee: Apache Spark > ByteType is not correctly mapped for read/write of SQLServer tables > --- > > Key: SPARK-28151 > URL: https://issues.apache.org/jira/browse/SPARK-28151 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0, 2.4.3 > Reporter: Shiv Prashant Sood > Assignee: Apache Spark > Priority: Minor > > Writing a dataframe with column type BYTETYPE fails when using the JDBC connector for SQL Server. Append and read of tables also fail. The problem is due to: > 1. (Write path) Incorrect mapping of BYTETYPE in getCommonJDBCType() in jdbcutils.scala, where BYTETYPE gets mapped to the text BYTE. It should be mapped to TINYINT. > {code:java} > case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT)) > {code} > In getCatalystType() (the JDBC to Catalyst type mapping), TINYINT is mapped to INTEGER, while it should be mapped to BYTETYPE. Mapping to INTEGER is OK from the point of view of upcasting, but it will lead to a 4-byte allocation rather than 1 byte for BYTETYPE. > 2. (Read path) The read path ends up calling makeGetter(dt: DataType, metadata: Metadata). The function sets the value in the RDD row per the data type. There is no mapping for BYTETYPE here, and thus reads will result in an error once getCatalystType() is fixed. > Note: these issues were found when reading/writing with SQLServer. Will be submitting a PR soon to fix these mappings in MSSQLServerDialect. > Error seen when writing a table: > (JDBC Write failed, com.microsoft.sqlserver.jdbc.SQLServerException: Column, > parameter, or variable #2: *Cannot find data type BYTE*.) > com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or > variable #2: Cannot find data type BYTE. 
> com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254) > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608) > com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859) > .. > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28151) ByteType is not correctly mapped for read/write of SQLServer tables
[ https://issues.apache.org/jira/browse/SPARK-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28151: Assignee: (was: Apache Spark) > ByteType is not correctly mapped for read/write of SQLServer tables > --- > > Key: SPARK-28151 > URL: https://issues.apache.org/jira/browse/SPARK-28151 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0, 2.4.3 > Reporter: Shiv Prashant Sood > Priority: Minor > > Writing a dataframe with column type BYTETYPE fails when using the JDBC connector for SQL Server. Append and read of tables also fail. The problem is due to: > 1. (Write path) Incorrect mapping of BYTETYPE in getCommonJDBCType() in jdbcutils.scala, where BYTETYPE gets mapped to the text BYTE. It should be mapped to TINYINT. > {code:java} > case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT)) > {code} > In getCatalystType() (the JDBC to Catalyst type mapping), TINYINT is mapped to INTEGER, while it should be mapped to BYTETYPE. Mapping to INTEGER is OK from the point of view of upcasting, but it will lead to a 4-byte allocation rather than 1 byte for BYTETYPE. > 2. (Read path) The read path ends up calling makeGetter(dt: DataType, metadata: Metadata). The function sets the value in the RDD row per the data type. There is no mapping for BYTETYPE here, and thus reads will result in an error once getCatalystType() is fixed. > Note: these issues were found when reading/writing with SQLServer. Will be submitting a PR soon to fix these mappings in MSSQLServerDialect. > Error seen when writing a table: > (JDBC Write failed, com.microsoft.sqlserver.jdbc.SQLServerException: Column, > parameter, or variable #2: *Cannot find data type BYTE*.) > com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or > variable #2: Cannot find data type BYTE. 
> com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254) > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608) > com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859) > .. > > > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28164) usage description does not match with shell scripts
[ https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872556#comment-16872556 ] Shivu Sondur commented on SPARK-28164: -- [~hannankan] ./sbin/start-slave.sh starts properly. Tested on master. > usage description does not match with shell scripts > --- > > Key: SPARK-28164 > URL: https://issues.apache.org/jira/browse/SPARK-28164 > Project: Spark > Issue Type: Bug > Components: Project Infra > Affects Versions: 2.4.3 > Reporter: Hanna Kan > Priority: Major > > I found that "spark/sbin/start-slave.sh" may have an error. > Line 43 gives --- echo "Usage: ./sbin/start-slave.sh [options] " > but later in this script, line 59 has MASTER=$1. > Is this a conflict? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28165) SHS does not delete old inprogress files until cleaner.maxAge after SHS start time
Imran Rashid created SPARK-28165: Summary: SHS does not delete old inprogress files until cleaner.maxAge after SHS start time Key: SPARK-28165 URL: https://issues.apache.org/jira/browse/SPARK-28165 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.3, 2.3.3 Reporter: Imran Rashid The SHS will not delete inprogress files until {{spark.history.fs.cleaner.maxAge}} time after it has started (7 days by default), regardless of when the last modification to the file was. This is particularly problematic if the SHS gets restarted regularly, as then you'll end up never deleting old files. There might not be much we can do about this -- we can't really trust the modification time of the file, as that isn't always updated reliably. We could take the last time of any event from the file, but then we'd have to turn off the optimization of SPARK-6951, to avoid reading the entire file just for the listing. *WORKAROUND*: have the SHS save state across restarts to local disk by specifying a path in {{spark.history.store.path}}. It'll still take 7 days from when you add that config for the cleaning to happen, but from then on the cleaning should happen reliably. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28114) Add Jenkins job for `Hadoop-3.2` profile
[ https://issues.apache.org/jira/browse/SPARK-28114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872530#comment-16872530 ] shane knapp commented on SPARK-28114: - [~dongjoon] the `--force` came back because someone manually edited the build configs via the jenkins gui, and when i auto-generated & deployed them via the jenkins job builder configs, those changes got clobbered. > Add Jenkins job for `Hadoop-3.2` profile > > > Key: SPARK-28114 > URL: https://issues.apache.org/jira/browse/SPARK-28114 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: shane knapp >Priority: Major > > Spark 3.0 is a major version change. We want to have the following new Jobs. > 1. SBT with hadoop-3.2 > 2. Maven with hadoop-3.2 (on JDK8 and JDK11) > Also, shall we have a limit for the concurrent run for the following existing > job? Currently, it invokes multiple jobs concurrently. We can save the > resource by limiting to 1 like the other jobs. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing > We will drop four `branch-2.3` jobs at the end of August, 2019. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27622) Avoid the network when block manager fetches disk persisted RDD blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27622. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24554 [https://github.com/apache/spark/pull/24554] > Avoid the network when block manager fetches disk persisted RDD blocks from > the same host > - > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.0.0 > > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27622) Avoid the network when block manager fetches disk persisted RDD blocks from the same host
[ https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-27622: -- Assignee: Attila Zsolt Piros > Avoid the network when block manager fetches disk persisted RDD blocks from > the same host > - > > Key: SPARK-27622 > URL: https://issues.apache.org/jira/browse/SPARK-27622 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > > Currently fetching blocks always uses the network even when the two block > managers are running on the same host. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28145) Executor pods polling source can fail to replace dead executors
[ https://issues.apache.org/jira/browse/SPARK-28145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-28145: -- Priority: Minor (was: Major) Issue Type: Improvement (was: New Feature) > Executor pods polling source can fail to replace dead executors > --- > > Key: SPARK-28145 > URL: https://issues.apache.org/jira/browse/SPARK-28145 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Onur Satici >Priority: Minor > > Scheduled task responsible for reporting executor snapshots to the executor > allocator in kubernetes will die on any error, killing subsequent runs of the > same task. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian
[ https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26985. --- Resolution: Fixed Assignee: ketan kunde Fix Version/s: 3.0.0 Resolved by https://github.com/apache/spark/pull/24861 > Test "access only some column of the all of columns " fails on big endian > - > > Key: SPARK-26985 > URL: https://issues.apache.org/jira/browse/SPARK-26985 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.2 > Environment: Linux Ubuntu 16.04 > openjdk version "1.8.0_202" > OpenJDK Runtime Environment (build 1.8.0_202-b08) > Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed > References 20190205_218 (JIT enabled, AOT enabled) > OpenJ9 - 90dd8cb40 > OMR - d2f4534b > JCL - d002501a90 based on jdk8u202-b08) > > Reporter: Anuja Jakhade > Assignee: ketan kunde > Priority: Major > Labels: BigEndian > Fix For: 3.0.0 > > Attachments: DataFrameTungstenSuite.txt, > InMemoryColumnarQuerySuite.txt, access only some column of the all of > columns.txt > > > While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am > observing test failures for 2 suites of project SQL: > 1. InMemoryColumnarQuerySuite > 2. DataFrameTungstenSuite > In both cases the test "access only some column of the all of columns" fails > due to a mismatch in the final assert. > Observed that the data obtained after df.cache() is causing the error. Please > find attached the log with the details. > cache() works perfectly fine if double and float values are not in the picture. > Inside test !!- access only some column of the all of columns *** FAILED > *** -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28154) GMM fix double caching
[ https://issues.apache.org/jira/browse/SPARK-28154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-28154. --- Resolution: Fixed Assignee: zhengruifeng Fix Version/s: 3.0.0 2.4.4 Resolved by https://github.com/apache/spark/pull/24919 > GMM fix double caching > -- > > Key: SPARK-28154 > URL: https://issues.apache.org/jira/browse/SPARK-28154 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0, 2.4.0, 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 2.4.4, 3.0.0 > > > The intermediate rdd is always cached. We should only cache it if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28117) LDA and BisectingKMeans cache the input dataset if necessary
[ https://issues.apache.org/jira/browse/SPARK-28117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-28117: - Assignee: zhengruifeng > LDA and BisectingKMeans cache the input dataset if necessary > > > Key: SPARK-28117 > URL: https://issues.apache.org/jira/browse/SPARK-28117 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > In MLlib LDA, the EM solver caches the dataset internally, while the > Online solver does not. > So in ML LDA, we need to cache the intermediate dataset if necessary. > > BisectingKMeans needs this too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28117) LDA and BisectingKMeans cache the input dataset if necessary
[ https://issues.apache.org/jira/browse/SPARK-28117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-28117. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24920 [https://github.com/apache/spark/pull/24920] > LDA and BisectingKMeans cache the input dataset if necessary > > > Key: SPARK-28117 > URL: https://issues.apache.org/jira/browse/SPARK-28117 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > In MLlib LDA, the EM solver caches the dataset internally, while the > Online solver does not. > So in ML LDA, we need to cache the intermediate dataset if necessary. > > BisectingKMeans needs this too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28117) LDA and BisectingKMeans cache the input dataset if necessary
[ https://issues.apache.org/jira/browse/SPARK-28117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-28117: -- Priority: Minor (was: Major) > LDA and BisectingKMeans cache the input dataset if necessary > > > Key: SPARK-28117 > URL: https://issues.apache.org/jira/browse/SPARK-28117 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > > In MLlib LDA, the EM solver caches the dataset internally, while the > Online solver does not. > So in ML LDA, we need to cache the intermediate dataset if necessary. > > BisectingKMeans needs this too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28045) add missing RankingEvaluator
[ https://issues.apache.org/jira/browse/SPARK-28045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-28045: -- Priority: Minor (was: Major) > add missing RankingEvaluator > > > Key: SPARK-28045 > URL: https://issues.apache.org/jira/browse/SPARK-28045 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > expose RankingEvaluator -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28045) add missing RankingEvaluator
[ https://issues.apache.org/jira/browse/SPARK-28045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-28045: - Assignee: zhengruifeng > add missing RankingEvaluator > > > Key: SPARK-28045 > URL: https://issues.apache.org/jira/browse/SPARK-28045 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > > expose RankingEvaluator -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28045) add missing RankingEvaluator
[ https://issues.apache.org/jira/browse/SPARK-28045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-28045. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24869 [https://github.com/apache/spark/pull/24869] > add missing RankingEvaluator > > > Key: SPARK-28045 > URL: https://issues.apache.org/jira/browse/SPARK-28045 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > Fix For: 3.0.0 > > > expose RankingEvaluator -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28164) usage description does not match with shell scripts
Hanna Kan created SPARK-28164: - Summary: usage description does not match with shell scripts Key: SPARK-28164 URL: https://issues.apache.org/jira/browse/SPARK-28164 Project: Spark Issue Type: Bug Components: Project Infra Affects Versions: 2.4.3 Reporter: Hanna Kan I found that "spark/sbin/start-slave.sh" may have an error. Line 43 gives echo "Usage: ./sbin/start-slave.sh [options] ", but later in this script, line 59 has MASTER=$1. Is this a conflict? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28163) Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and FETCH_OFFSET_RETRY_INTERVAL_MS
[ https://issues.apache.org/jira/browse/SPARK-28163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-28163: -- Description: There are "unsafe" conversions in the Kafka connector. CaseInsensitiveStringMap comes in, which is then converted the following way: {code:java} ... options.asScala.toMap ... {code} The main problem with this is that in such a case it loses its case-insensitive nature (a case-insensitive map converts the key to lower case when get/contains is called). was: There are "unsafe" conversions in the Kafka connector. CaseInsensitiveStringMap comes in, which is then converted the following way: {code:java} ... options.asScala.toMap ... {code} The main problem with this is that in such a case it loses its case-insensitive nature. A case-insensitive map converts the key to lower case when get/contains is called. > Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and > FETCH_OFFSET_RETRY_INTERVAL_MS > - > > Key: SPARK-28163 > URL: https://issues.apache.org/jira/browse/SPARK-28163 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > There are "unsafe" conversions in the Kafka connector. > CaseInsensitiveStringMap comes in, which is then converted the following way: > {code:java} > ... > options.asScala.toMap > ... > {code} > The main problem with this is that in such a case it loses its case-insensitive > nature > (a case-insensitive map converts the key to lower case when get/contains > is called). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
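The described loss of case insensitivity can be sketched in a few lines. The real connector code is Scala (CaseInsensitiveStringMap converted via `options.asScala.toMap`); the following is a hypothetical plain-Python stand-in, with illustrative class and option names, showing only the failure mode:

```python
class CaseInsensitiveMap(dict):
    """Minimal stand-in for a case-insensitive string map: keys are stored
    lower-cased, and lookups lower-case the key before matching."""
    def __init__(self, data):
        super().__init__({k.lower(): v for k, v in data.items()})
    def __getitem__(self, key):
        return super().__getitem__(key.lower())
    def __contains__(self, key):
        return super().__contains__(key.lower())

opts = CaseInsensitiveMap({"fetchOffset.numRetries": "5"})
assert "FETCHOFFSET.NUMRETRIES" in opts       # case-insensitive lookup works

# Analogous to options.asScala.toMap: copying into a plain map keeps only the
# stored (lower-cased) keys and drops the insensitive lookup behaviour.
plain = dict(opts)
assert "FETCHOFFSET.NUMRETRIES" not in plain  # insensitivity is lost
assert "fetchoffset.numretries" in plain      # only the lower-cased key survives
```

So after the conversion, any lookup with the user's original mixed-case key silently misses, which is consistent with user-set retry options being ignored.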
[jira] [Assigned] (SPARK-28163) Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and FETCH_OFFSET_RETRY_INTERVAL_MS
[ https://issues.apache.org/jira/browse/SPARK-28163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28163: Assignee: Apache Spark > Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and > FETCH_OFFSET_RETRY_INTERVAL_MS > - > > Key: SPARK-28163 > URL: https://issues.apache.org/jira/browse/SPARK-28163 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Apache Spark >Priority: Major > > There are "unsafe" conversions in the Kafka connector. > CaseInsensitiveStringMap comes in, which is then converted the following way: > {code:java} > ... > options.asScala.toMap > ... > {code} > The main problem with this is that in such a case it loses its case-insensitive > nature. > A case-insensitive map converts the key to lower case when get/contains > is called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28163) Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and FETCH_OFFSET_RETRY_INTERVAL_MS
[ https://issues.apache.org/jira/browse/SPARK-28163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28163: Assignee: (was: Apache Spark) > Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and > FETCH_OFFSET_RETRY_INTERVAL_MS > - > > Key: SPARK-28163 > URL: https://issues.apache.org/jira/browse/SPARK-28163 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > There are "unsafe" conversions in the Kafka connector. > CaseInsensitiveStringMap comes in, which is then converted the following way: > {code:java} > ... > options.asScala.toMap > ... > {code} > The main problem with this is that in such a case it loses its case-insensitive > nature. > A case-insensitive map converts the key to lower case when get/contains > is called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28157) Make SHS check Spark event log file permission changes
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28157: -- Description: At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to the file system, and maintains a blacklist for all event log files that failed once at reading. The blacklisted log files are released back after CLEAN_INTERVAL_S. However, the files whose size doesn't change are ignored forever, because shouldReloadLog always returns false when the size is the same as the value in the KVStore. This is recovered only via an SHS restart. was: At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to the file system, and maintains a permanent blacklist for all event log files that failed once at reading. Although this reduces a lot of invalid accesses, there is no way to see these log files again after the permissions are recovered correctly. The only way has been restarting the SHS. Apache Spark is unable to know about the permission recovery. However, we had better give those blacklisted files a second chance in a regular manner. > Make SHS check Spark event log file permission changes > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Priority: Major > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a blacklist for all event log files that failed > once at reading. The blacklisted log files are released back after > CLEAN_INTERVAL_S. > However, the files whose size doesn't change are ignored forever, because > shouldReloadLog always returns false when the size is the same as the value > in the KVStore. This is recovered only via an SHS restart. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
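The interaction described above (a blacklist release that never actually triggers a retry because the reload check looks only at file size) can be sketched as follows. This is a hypothetical plain-Python model of the failure mode, not the SHS implementation; the class and method names are illustrative:

```python
# Sketch: a history-server-style cache that decides whether to re-read a log
# purely from its size (as shouldReloadLog does with the value in the KVStore).
class LogCache:
    def __init__(self):
        self.known_sizes = {}   # path -> size recorded in the "KVStore"
        self.blacklist = set()  # logs that failed once (e.g. permission denied)

    def should_reload(self, path, current_size):
        return self.known_sizes.get(path) != current_size

    def try_read(self, path, current_size, readable):
        if path in self.blacklist and not self.should_reload(path, current_size):
            return "skipped"    # size unchanged -> never retried, even if readable again
        self.blacklist.discard(path)
        if not readable:
            self.blacklist.add(path)
            return "failed"
        self.known_sizes[path] = current_size
        return "read"

cache = LogCache()
cache.known_sizes["app-1"] = 100
assert cache.try_read("app-1", 100, readable=False) == "failed"
# Permissions are later fixed, but the finished log's size is still 100,
# so the blacklisted file is skipped forever:
assert cache.try_read("app-1", 100, readable=True) == "skipped"
```

A finished event log never grows, so for permission-only failures the size check alone can never clear the blacklist, matching the "recovered only via an SHS restart" observation.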
[jira] [Created] (SPARK-28163) Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and FETCH_OFFSET_RETRY_INTERVAL_MS
Gabor Somogyi created SPARK-28163: - Summary: Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and FETCH_OFFSET_RETRY_INTERVAL_MS Key: SPARK-28163 URL: https://issues.apache.org/jira/browse/SPARK-28163 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Gabor Somogyi There are "unsafe" conversions in the Kafka connector. CaseInsensitiveStringMap comes in which is then converted the following way: {code:java} ... options.asScala.toMap ... {code} The main problem with this that such case it looses its case insensitive nature. Case insensitive map is converting the key to lower case when get/contains called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28157) Make SHS check Spark event log file permission changes
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28157: Assignee: Apache Spark > Make SHS check Spark event log file permission changes > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a permanent blacklist for all event log files > that failed once at reading. Although this reduces a lot of invalid accesses, > there is no way to see these log files again after the permissions are > recovered correctly. The only way has been restarting the SHS. > Apache Spark is unable to know about the permission recovery. However, we had > better give those blacklisted files a second chance in a regular manner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28157) Make SHS check Spark event log file permission changes
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28157: Assignee: (was: Apache Spark) > Make SHS check Spark event log file permission changes > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Priority: Major > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a permanent blacklist for all event log files > that failed once at reading. Although this reduces a lot of invalid accesses, > there is no way to see these log files again after the permissions are > recovered correctly. The only way has been restarting the SHS. > Apache Spark is unable to know about the permission recovery. However, we had > better give those blacklisted files a second chance in a regular manner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28162) approximSimilarityJoin creating a bottleneck
Simone Iovane created SPARK-28162: - Summary: approximSimilarityJoin creating a bottleneck Key: SPARK-28162 URL: https://issues.apache.org/jira/browse/SPARK-28162 Project: Spark Issue Type: Bug Components: ML, MLlib, Scheduler, Spark Core Affects Versions: 2.4.3 Reporter: Simone Iovane Hi, I am using Spark MLlib and doing an approxSimilarityJoin between a 1M dataset and a 1k dataset. When I do it, I broadcast the 1k one. What I see is that the job stops going forward at the second-last task. All the executors are dead but one, which keeps running for a very long time until it reaches an out-of-memory error. I checked Ganglia and it shows memory keeps rising until it reaches the limit[!https://i.stack.imgur.com/gfhGg.png!|https://i.stack.imgur.com/gfhGg.png] and the disk space keeps going down until it finishes: [!https://i.stack.imgur.com/vbEmG.png!|https://i.stack.imgur.com/vbEmG.png] The action I called is a write, but it does the same with count. Now I wonder: is it possible that all the partitions in the cluster converge to only one node, creating this bottleneck? Is it a function bug? Here is my code snippet: {code:java} var dfW = cookesWb.withColumn("n", monotonically_increasing_id()) var bunchDf = dfW.filter(col("n").geq(0) && col("n").lt(100) ) bunchDf.repartition(3000) model. approxSimilarityJoin(bunchDf,broadcast(cookesNextLimited),80,"EuclideanDistance"). withColumn("min_distance", min(col("EuclideanDistance")).over(Window.partitionBy(col("datasetA.uid"))) ). filter(col("EuclideanDistance") === col("min_distance")). select(col("datasetA.uid").alias("weboId"), col("datasetB.nextploraId").alias("nextId"), col("EuclideanDistance")).write.format("parquet").mode("overwrite").save("approxJoin.parquet") {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
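One detail worth noting in the snippet above: the result of `bunchDf.repartition(3000)` is discarded. Spark DataFrames are immutable, so `repartition` returns a new DataFrame rather than changing `bunchDf` in place, and the join still runs on the original partitioning. A minimal plain-Python sketch of that immutable-API pattern (the `Frame` class here is an illustrative stand-in, not the Spark API):

```python
# Sketch: an immutable frame whose transformation methods return new objects,
# mirroring how DataFrame.repartition behaves.
class Frame:
    def __init__(self, partitions):
        self.partitions = partitions
    def repartition(self, n):
        return Frame(n)          # returns a NEW Frame; self is untouched

df = Frame(partitions=3)
df.repartition(3000)             # result discarded -- df still has 3 partitions
assert df.partitions == 3

df = df.repartition(3000)        # reassigning is what actually takes effect
assert df.partitions == 3000
```

So if the intent was to spread the join across 3000 partitions, the snippet would need `bunchDf = bunchDf.repartition(3000)`; whether that alone resolves the reported single-executor bottleneck is a separate question.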
[jira] [Reopened] (SPARK-28157) Make SHS check Spark event log file permission changes
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-28157: --- > Make SHS check Spark event log file permission changes > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Priority: Major > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a permanent blacklist for all event log files > failed once at reading. Although this reduces a lot of invalid accesses, > there is no way to see this log files back after the permissions are > recovered correctly. The only way has been restarting SHS. > Apache Spark is unable to know the permission recovery. However, we had > better give a second chances for those blacklisted files in a regular manner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-28157) Make SHS check Spark event log file permission changes
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28157: -- Comment: was deleted (was: My bad. This issue is invalid.) > Make SHS check Spark event log file permission changes > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Priority: Major > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a permanent blacklist for all event log files > failed once at reading. Although this reduces a lot of invalid accesses, > there is no way to see this log files back after the permissions are > recovered correctly. The only way has been restarting SHS. > Apache Spark is unable to know the permission recovery. However, we had > better give a second chances for those blacklisted files in a regular manner. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27463: Assignee: Apache Spark > Support Dataframe Cogroup via Pandas UDFs > -- > > Key: SPARK-27463 > URL: https://issues.apache.org/jira/browse/SPARK-27463 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Chris Martin >Assignee: Apache Spark >Priority: Major > > Recent work on Pandas UDFs in Spark has allowed for improved > interoperability between Pandas and Spark. This proposal aims to extend this > by introducing a new Pandas UDF type which would allow for a cogroup > operation to be applied to two PySpark DataFrames. > Full details are in the Google document linked below. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27463: Assignee: (was: Apache Spark) > Support Dataframe Cogroup via Pandas UDFs > -- > > Key: SPARK-27463 > URL: https://issues.apache.org/jira/browse/SPARK-27463 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Chris Martin >Priority: Major > > Recent work on Pandas UDFs in Spark has allowed for improved > interoperability between Pandas and Spark. This proposal aims to extend this > by introducing a new Pandas UDF type which would allow for a cogroup > operation to be applied to two PySpark DataFrames. > Full details are in the Google document linked below. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28160) TransportClient.sendRpcSync may hang forever
[ https://issues.apache.org/jira/browse/SPARK-28160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28160: Assignee: (was: Apache Spark) > TransportClient.sendRpcSync may hang forever > > > Key: SPARK-28160 > URL: https://issues.apache.org/jira/browse/SPARK-28160 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 3.0.0, 2.4.3 >Reporter: Lantao Jin >Priority: Major > > This is very similar to > [SPARK-26665|https://issues.apache.org/jira/browse/SPARK-26665] > `ByteBuffer.allocate` may throw OutOfMemoryError when the response is large > but not enough memory is available. However, when this happens, > TransportClient.sendRpcSync will just hang forever if the timeout is set to > unlimited. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28160) TransportClient.sendRpcSync may hang forever
[ https://issues.apache.org/jira/browse/SPARK-28160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28160: Assignee: Apache Spark > TransportClient.sendRpcSync may hang forever > > > Key: SPARK-28160 > URL: https://issues.apache.org/jira/browse/SPARK-28160 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 3.0.0, 2.4.3 >Reporter: Lantao Jin >Assignee: Apache Spark >Priority: Major > > This is very similar to > [SPARK-26665|https://issues.apache.org/jira/browse/SPARK-26665] > `ByteBuffer.allocate` may throw OutOfMemoryError when the response is large > but not enough memory is available. However, when this happens, > TransportClient.sendRpcSync will just hang forever if the timeout is set to > unlimited. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
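The hang described above has a simple shape: a synchronous caller blocks on a future that the response handler is supposed to complete, and the handler throws while building the response buffer before ever completing it. A hedged plain-Python sketch of that shape (the function names are illustrative; the real code is Java/Netty):

```python
import concurrent.futures

def allocate(n):
    # Stand-in for ByteBuffer.allocate failing with OutOfMemoryError
    raise MemoryError("simulated allocation failure")

def handle_response_buggy(future, payload):
    buf = allocate(len(payload))     # raises before the future is completed,
    future.set_result(buf)           # so this line is never reached

def handle_response_fixed(future, payload):
    try:
        future.set_result(allocate(len(payload)))
    except BaseException as e:
        future.set_exception(e)      # the waiting caller gets the error instead

f = concurrent.futures.Future()
try:
    handle_response_buggy(f, b"x" * 16)
except MemoryError:
    pass
assert not f.done()   # a sendRpcSync-style untimed wait on f would hang forever

f2 = concurrent.futures.Future()
handle_response_fixed(f2, b"x" * 16)
assert f2.done()      # the failure surfaces to the caller immediately
assert isinstance(f2.exception(), MemoryError)
```

The general fix pattern is the second handler: any error on the response path must complete the future exceptionally so an unbounded wait can terminate.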
[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics
[ https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872216#comment-16872216 ] Steve Loughran commented on SPARK-28091: (FWIW, I really like codahale; lines up very nicely with scala closures for on-demand eval) > Extend Spark metrics system with executor plugin metrics > > > Key: SPARK-28091 > URL: https://issues.apache.org/jira/browse/SPARK-28091 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > > This proposes to improve Spark instrumentation by adding a hook for Spark > executor plugin metrics to the Spark metrics systems implemented with the > Dropwizard/Codahale library. > Context: The Spark metrics system provides a large variety of metrics, see > also SPARK-26890, useful to monitor and troubleshoot Spark workloads. A > typical workflow is to sink the metrics to a storage system and build > dashboards on top of that. > Improvement: The original goal of this work was to add instrumentation for S3 > filesystem access metrics by Spark job. Currently, [[ExecutorSource]] > instruments HDFS and local filesystem metrics. Rather than extending the code > there, we propose to add a metrics plugin system which is more flexible > and of general use. > Advantages: > * The metric plugin system makes it easy to implement instrumentation for S3 > access by Spark jobs. > * The metrics plugin system allows for easy extensions of how Spark collects > HDFS-related workload metrics. This is currently done using the Hadoop > Filesystem GetAllStatistics method, which is deprecated in recent versions of > Hadoop. Recent versions of Hadoop Filesystem recommend using method > GetGlobalStorageStatistics, which also provides several additional metrics. > GetGlobalStorageStatistics is not available in Hadoop 2.7 (it was > introduced in Hadoop 2.8). 
Using a metric plugin for Spark would allow an > easy way to “opt in” using such new API calls for those deploying suitable > Hadoop versions. > * We also have the use case of adding Hadoop filesystem monitoring for a > custom Hadoop compliant filesystem in use in our organization (EOS using the > XRootD protocol). The metrics plugin infrastructure makes this easy to do. > Others may have similar use cases. > * More generally, this method makes it straightforward to plug in Filesystem > and other metrics to the Spark monitoring system. Future work on plugin > implementation can address extending monitoring to measure usage of external > resources (OS, filesystem, network, accelerator cards, etc), that maybe would > not normally be considered general enough for inclusion in Apache Spark code, > but that can nevertheless be useful for specialized use cases, tests or > troubleshooting. > Implementation: > The proposed implementation is currently a WIP open for comments and > improvements. It is based on the work on Executor Plugin of SPARK-24918 and > builds on recent work on extending Spark executor metrics, such as SPARK-25228 > Tests and examples: > This has been so far manually tested running Spark on YARN and K8S clusters, > in particular for monitoring S3 and for extending HDFS instrumentation with > the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric > plugin example and code used for testing are available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
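The "closures for on-demand eval" point in the comment above is worth making concrete: a Codahale-style Gauge is just a closure the metrics system invokes when it snapshots, so the value is computed lazily at report time rather than at registration time. A hedged plain-Python sketch of that pattern (illustrative names, not the Dropwizard API):

```python
# Sketch: a gauge registry that stores closures and evaluates them on demand,
# the way a Dropwizard/Codahale MetricRegistry evaluates Gauges at report time.
class MetricRegistry:
    def __init__(self):
        self.gauges = {}
    def register_gauge(self, name, fn):
        self.gauges[name] = fn          # store the closure; do not call it yet
    def snapshot(self):
        return {name: fn() for name, fn in self.gauges.items()}

bytes_read = [0]                        # mutable cell a plugin would update
reg = MetricRegistry()
reg.register_gauge("fs.s3a.bytes_read", lambda: bytes_read[0])

bytes_read[0] = 4096                    # counter moves after registration
assert reg.snapshot() == {"fs.s3a.bytes_read": 4096}   # fresh value at report time
```

This is why a plugin only needs to hand the metrics system a closure over whatever statistics object the filesystem exposes; each sink report then sees current values for free.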
[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics
[ https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872214#comment-16872214 ] Steve Loughran commented on SPARK-28091: thanks for the links. FWIW, if you call toString() on an S3A input stream you get a dump of the stream-specific stats; call it on an FS instance and you get the full file stats. That's for logging rather than anything else. The Impala team would actually like to get at those stream stats, but I've been -1 to date as I don't want them getting at unstable internals (see HADOOP-16379). I wonder if we added an accessor to StorageStatistics then for those S3A versions/input streams which offered it, you'd only need to ask for it (with some reflection work to still be compatible with older versions). We could still get away with only using long values to count (i.e. no slower atomic values), and make the iterator() call create a snapshot of the values to iterate over. Would that be of interest? > Extend Spark metrics system with executor plugin metrics > > > Key: SPARK-28091 > URL: https://issues.apache.org/jira/browse/SPARK-28091 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > > This proposes to improve Spark instrumentation by adding a hook for Spark > executor plugin metrics to the Spark metrics systems implemented with the > Dropwizard/Codahale library. > Context: The Spark metrics system provides a large variety of metrics, see > also SPARK-26890, useful to monitor and troubleshoot Spark workloads. A > typical workflow is to sink the metrics to a storage system and build > dashboards on top of that. > Improvement: The original goal of this work was to add instrumentation for S3 > filesystem access metrics by Spark job. Currently, [[ExecutorSource]] > instruments HDFS and local filesystem metrics. 
Rather than extending the code > there, we propose to add a metrics plugin system which is more flexible > and of general use. > Advantages: > * The metric plugin system makes it easy to implement instrumentation for S3 > access by Spark jobs. > * The metrics plugin system allows for easy extensions of how Spark collects > HDFS-related workload metrics. This is currently done using the Hadoop > Filesystem GetAllStatistics method, which is deprecated in recent versions of > Hadoop. Recent versions of Hadoop Filesystem recommend using method > GetGlobalStorageStatistics, which also provides several additional metrics. > GetGlobalStorageStatistics is not available in Hadoop 2.7 (it was > introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an > easy way to “opt in” using such new API calls for those deploying suitable > Hadoop versions. > * We also have the use case of adding Hadoop filesystem monitoring for a > custom Hadoop compliant filesystem in use in our organization (EOS using the > XRootD protocol). The metrics plugin infrastructure makes this easy to do. > Others may have similar use cases. > * More generally, this method makes it straightforward to plug in Filesystem > and other metrics to the Spark monitoring system. Future work on plugin > implementation can address extending monitoring to measure usage of external > resources (OS, filesystem, network, accelerator cards, etc), that maybe would > not normally be considered general enough for inclusion in Apache Spark code, > but that can nevertheless be useful for specialized use cases, tests or > troubleshooting. > Implementation: > The proposed implementation is currently a WIP open for comments and > improvements. 
It is based on the work on the Executor Plugin of SPARK-24918 and > builds on recent work on extending Spark executor metrics, such as SPARK-25228. > Tests and examples: > So far this has been manually tested by running Spark on YARN and K8S clusters, > in particular for monitoring S3 and for extending HDFS instrumentation with > the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. An executor metric > plugin example and the code used for testing are available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
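The "ask for it when offered" idea from the comment above — probe at runtime for the newer statistics accessor and fall back to the deprecated one otherwise — can be sketched in Python. This is the same pattern a Spark plugin would express with Java reflection; the two filesystem classes here are illustrative stand-ins for old and new Hadoop versions, not real Hadoop classes.

```python
class OldFileSystem:
    """Stand-in for Hadoop 2.7: only the deprecated aggregate call exists."""
    def getAllStatistics(self):
        return {"bytesRead": 100}

class NewFileSystem(OldFileSystem):
    """Stand-in for Hadoop 2.8+: adds getGlobalStorageStatistics."""
    def getGlobalStorageStatistics(self):
        return {"bytesRead": 100, "readOps": 7}

def collect_fs_metrics(fs):
    # Probe for the newer accessor by name -- the Python analogue of the
    # reflection check a plugin would do to stay compatible with Hadoop 2.7.
    newer = getattr(fs, "getGlobalStorageStatistics", None)
    if callable(newer):
        return newer()
    return fs.getAllStatistics()
```

With this shape, deployments on newer Hadoop versions opt in to the richer metrics automatically, while older ones keep working with the legacy call.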
[jira] [Updated] (SPARK-28161) Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212
[ https://issues.apache.org/jira/browse/SPARK-28161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Nigsch updated SPARK-28161: -- Description: Trying to build Spark from source based on the attached Dockerfile (launched with Docker on OS X) fails. Attempts to change/add the following things beyond what's recommended on the build page do not bring improvement: 1. adding {{RUN ./dev/change-scala-version.sh 2.11}} --> doesn't help 2. editing the pom.xml to exclude zinc as in one of the answers in [https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558] --> doesn't help 3. adding the option -DrecompileMode=all --> doesn't help I've downloaded the Java from Oracle directly (jdk-8u212-linux-x64.tar), which is manually put into /usr/java, as the Oracle Java seems to be recommended. The build fails at the Spark Project SQL module with: {code} [INFO] [INFO] Reactor Summary for Spark Project Parent POM 2.4.3: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 59.875 s] [INFO] Spark Project Tags . SUCCESS [ 20.386 s] [INFO] Spark Project Sketch ... SUCCESS [ 3.026 s] [INFO] Spark Project Local DB . SUCCESS [ 5.654 s] [INFO] Spark Project Networking ... SUCCESS [ 7.401 s] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 3.400 s] [INFO] Spark Project Unsafe ... SUCCESS [ 6.306 s] [INFO] Spark Project Launcher . SUCCESS [ 17.471 s] [INFO] Spark Project Core . SUCCESS [02:36 min] [INFO] Spark Project ML Local Library . SUCCESS [ 50.313 s] [INFO] Spark Project GraphX ... SUCCESS [ 21.097 s] [INFO] Spark Project Streaming SUCCESS [ 52.537 s] [INFO] Spark Project Catalyst . SUCCESS [02:44 min] [INFO] Spark Project SQL .. FAILURE [10:44 min] [INFO] Spark Project ML Library ... SKIPPED [INFO] Spark Project Tools SKIPPED [INFO] Spark Project Hive . SKIPPED [INFO] Spark Project REPL . SKIPPED [INFO] Spark Project Assembly . SKIPPED [INFO] Spark Integration for Kafka 0.10 ... SKIPPED [INFO] Kafka 0.10+ Source for Structured Streaming SKIPPED [INFO] Spark Project Examples . SKIPPED [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED [INFO] Spark Avro . SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 20:15 min [INFO] Finished at: 2019-06-25T09:45:49Z [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-sql_2.11: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.: CompileFailed -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn -rf :spark-sql_2.11 The command '/bin/sh -c ./build/mvn -DskipTests clean package' returned a non-zero code: 1 {code} Any help? I've been stuck on this for 2 days, hence I'm raising this issue.
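For anyone hitting the same {{CompileFailed}} on spark-sql, a couple of diagnostic steps can be sketched as a shell fragment. These are suggestions, not a confirmed fix; the heap size is a guess to try, based on the observation that scala-maven-plugin failures on large modules are frequently memory-related, and the resume target is taken from the Maven output above.

```shell
# Raise the compiler heap beyond the Dockerfile's 2g (value is a guess to try):
export MAVEN_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=1g"
# Resume from the failing module with -e so Maven prints the underlying
# scalac errors instead of the bare "CompileFailed" summary:
./build/mvn -e -DskipTests -rf :spark-sql_2.11 package
```

If the real compiler errors surface with -e, they usually point either at a memory limit or at a Scala/Java version mismatch.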
[jira] [Created] (SPARK-28161) Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212
Martin Nigsch created SPARK-28161: - Summary: Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212 Key: SPARK-28161 URL: https://issues.apache.org/jira/browse/SPARK-28161 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.3 Environment: # Pull base image. FROM ubuntu:16.04 RUN apt update --fix-missing RUN apt-get install -y software-properties-common RUN mkdir /usr/java ADD jdk-8u212-linux-x64.tar /usr/java ENV JAVA_HOME=/usr/java/jdk1.8.0_212 RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 2 RUN update-alternatives --install /usr/bin/javac javac ${JAVA_HOME%*/}/bin/javac 2 ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin" ENV MAVEN_VERSION 3.6.1 RUN apt-get install -y curl wget RUN curl -fsSL http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz | tar xzf - -C /usr/share \ && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \ && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn ENV MAVEN_HOME /usr/share/maven # Define commonly used JAVA_HOME variable ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m" ENV SPARK_SRC="/usr/src/spark" ENV BRANCH="v2.4.3" RUN apt-get update && apt-get install -y --no-install-recommends \ git python3 python3-setuptools r-base-dev r-cran-evaluate RUN mkdir -p $SPARK_SRC RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC WORKDIR $SPARK_SRC RUN ./build/mvn -DskipTests clean package #RUN ./dev/make-distribution.sh -e --name mn-spark-2.3.3 --pip --r --tgz -Psparkr -Phadoop-2.7 -Pmesos -Pyarn -Pkubernetes Reporter: Martin Nigsch Trying to build spark from source based on the Dockerfile attached locally (launched on docker on OSX) fails. Attempts to change/add the following things beyond what's recommended on the build page do not bring improvement: 1. adding ```*RUN ./dev/change-scala-version.sh 2.11```* --> doesn't help 2. 
editing the pom.xml to exclude zinc as in one of the answers in [https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558] --> doesn't help 3. adding the option -DrecompileMode=all --> doesn't help I've downloaded the Java from Oracle directly (jdk-8u212-linux-x64.tar), which is manually put into /usr/java, as the Oracle Java seems to be recommended. The build fails at the Spark Project SQL module with the same scala-maven-plugin CompileFailed error shown in the update above.
[jira] [Created] (SPARK-28160) TransportClient.sendRpcSync may hang forever
Lantao Jin created SPARK-28160: -- Summary: TransportClient.sendRpcSync may hang forever Key: SPARK-28160 URL: https://issues.apache.org/jira/browse/SPARK-28160 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.3, 2.3.3, 3.0.0 Reporter: Lantao Jin This is very similar to [SPARK-26665|https://issues.apache.org/jira/browse/SPARK-26665]. {{ByteBuffer.allocate}} may throw OutOfMemoryError when the response is large but not enough memory is available. When this happens, TransportClient.sendRpcSync will just hang forever if the timeout is set to unlimited. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
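The failure mode in this ticket can be sketched in Python (standing in for the Java networking code): a synchronous RPC blocks on a future, and unless every handler failure, including allocation errors, is routed into that future, a caller with no timeout waits forever. This is an illustrative sketch of the pattern, not Spark's actual TransportClient.

```python
import concurrent.futures
import threading

def send_rpc_sync(handler, timeout=None):
    """Run handler on another thread and wait synchronously for its result."""
    fut = concurrent.futures.Future()
    def on_response():
        try:
            fut.set_result(handler())   # handler may raise, e.g. MemoryError
        except BaseException as exc:
            # The fix: route *every* failure into the future so the waiter
            # wakes up with an error instead of blocking forever.
            fut.set_exception(exc)
    threading.Thread(target=on_response).start()
    # With no timeout, a swallowed handler error would mean hanging here.
    return fut.result(timeout=timeout)

def huge_response():
    # Simulates ByteBuffer.allocate blowing up on an oversized response.
    raise MemoryError("response too large to buffer")
```

Calling send_rpc_sync(huge_response) raises MemoryError promptly instead of hanging; without the set_exception branch, the caller would block indefinitely.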
[jira] [Comment Edited] (SPARK-28036) Built-in udf left/right has inconsistent behavior
[ https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872045#comment-16872045 ] Shivu Sondur edited comment on SPARK-28036 at 6/25/19 8:30 AM: --- [~yumwang] select left('ahoj', 2), right('ahoj', 2); used without the '-' sign, it works fine; I tested on the latest Spark. was (Author: shivuson...@gmail.com): [~yumwang] select left('ahoj', 2), right('ahoj', 2); used with the '-' sign, it works fine; I tested on the latest Spark. > Built-in udf left/right has inconsistent behavior > - > > Key: SPARK-28036 > URL: https://issues.apache.org/jira/browse/SPARK-28036 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {code:sql} > postgres=# select left('ahoj', -2), right('ahoj', -2); > left | right > --+--- > ah | oj > (1 row) > {code} > Spark SQL: > {code:sql} > spark-sql> select left('ahoj', -2), right('ahoj', -2); > spark-sql> > {code}
[jira] [Assigned] (SPARK-28159) Make the transform natively in ml framework to avoid extra conversion
[ https://issues.apache.org/jira/browse/SPARK-28159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28159: Assignee: (was: Apache Spark) > Make the transform natively in ml framework to avoid extra conversion > - > > Key: SPARK-28159 > URL: https://issues.apache.org/jira/browse/SPARK-28159 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > It has been a long time since ML was released. > However, there are still many TODOs on making transforms native in the ML > framework.
[jira] [Assigned] (SPARK-28159) Make the transform natively in ml framework to avoid extra conversion
[ https://issues.apache.org/jira/browse/SPARK-28159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28159: Assignee: Apache Spark > Make the transform natively in ml framework to avoid extra conversion > - > > Key: SPARK-28159 > URL: https://issues.apache.org/jira/browse/SPARK-28159 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Major > > It has been a long time since ML was released. > However, there are still many TODOs on making transforms native in the ML > framework.
[jira] [Created] (SPARK-28159) Make the transform natively in ml framework to avoid extra conversion
zhengruifeng created SPARK-28159: Summary: Make the transform natively in ml framework to avoid extra conversion Key: SPARK-28159 URL: https://issues.apache.org/jira/browse/SPARK-28159 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng It has been a long time since ML was released. However, there are still many TODOs on making transforms native in the ML framework.
[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics
[ https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872075#comment-16872075 ] Luca Canali commented on SPARK-28091: - Thank you [~ste...@apache.org] for your comment and clarifications. Indeed what I am trying to do is to collect executor-level metrics for S3A (and also for other Hadoop-compatible filesystems of interest). The goal is to bring the metrics into the Spark metrics system, so that they can be used, for example, in a performance dashboard and displayed together with the rest of the instrumentation metrics. The original work for this started from the need to measure I/O metrics for a custom HDFS-compatible filesystem that we use (called ROOT:) and more recently also for S3A. The first implementation we did was simple: a small change in [[ExecutorSource]], which already has code to collect metrics for "hdfs" and "file"/local filesystems at the executor level. That code is obviously very easy to extend; however, going that way feels like a short-term hack. My thought with this PR is to provide a flexible method to add instrumentation, profiting from our current use case related to I/O workload monitoring, but also open to several other use cases. I am also quite interested to see developments in this area for CPU counters and possibly also GPU-related instrumentation.
I think the proposal to use executor plugins for this goes in the original direction outlined by [~irashid] and collaborators with SPARK-24918. I add some links to references and related material: code of a few test executor metrics plugins that I am developing: [https://github.com/cerndb/SparkExecutorPlugins] The general idea of how to build a dashboard with Spark metrics is described in [https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark] > Extend Spark metrics system with executor plugin metrics > > > Key: SPARK-28091 > URL: https://issues.apache.org/jira/browse/SPARK-28091 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Luca Canali >Priority: Minor > > This proposes to improve Spark instrumentation by adding a hook for Spark > executor plugin metrics to the Spark metrics system implemented with the > Dropwizard/Codahale library. > Context: The Spark metrics system provides a large variety of metrics, see > also SPARK-26890, useful to monitor and troubleshoot Spark workloads. A > typical workflow is to sink the metrics to a storage system and build > dashboards on top of that. > Improvement: The original goal of this work was to add instrumentation for S3 > filesystem access metrics by Spark job. Currently, [[ExecutorSource]] > instruments HDFS and local filesystem metrics. Rather than extending the code > there, we propose to add a metrics plugin system, which is more flexible and > of more general use. > Advantages: > * The metrics plugin system makes it easy to implement instrumentation for S3 > access by Spark jobs. > * The metrics plugin system allows for easy extensions of how Spark collects > HDFS-related workload metrics. This is currently done using the Hadoop > Filesystem GetAllStatistics method, which is deprecated in recent versions of > Hadoop.
Recent versions of Hadoop Filesystem recommend using method > GetGlobalStorageStatistics, which also provides several additional metrics. > GetGlobalStorageStatistics is not available in Hadoop 2.7 (it was > introduced in Hadoop 2.8). Using a metrics plugin for Spark would allow an > easy way to “opt in” to such new API calls for those deploying suitable > Hadoop versions. > * We also have the use case of adding Hadoop filesystem monitoring for a > custom Hadoop-compliant filesystem in use in our organization (EOS using the > XRootD protocol). The metrics plugin infrastructure makes this easy to do. > Others may have similar use cases. > * More generally, this method makes it straightforward to plug in Filesystem > and other metrics to the Spark monitoring system. Future work on plugin > implementation can address extending monitoring to measure usage of external > resources (OS, filesystem, network, accelerator cards, etc), which would > perhaps not normally be considered general enough for inclusion in Apache > Spark code, but which can nevertheless be useful for specialized use cases, > tests or troubleshooting. > Implementation: > The proposed implementation is currently a WIP open for comments and > improvements. It is based on the work on Executor Plugin of SPARK-24918 and > builds on recent work on extending Spark executor metrics, such as SPARK-25228. > Tests and examples: > This has been so far manually tested running Spark on YARN and K8S clusters, > in
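A self-contained sketch of the kind of plugin hook the ticket proposes. The `MetricRegistry` below is a minimal stand-in mimicking the Dropwizard/Codahale type, and the `init(MetricRegistry)` signature is hypothetical: SPARK-24918's ExecutorPlugin only defines `init()`/`shutdown()`, and passing the executor's registry to the plugin is exactly what this proposal would add.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class ExecutorMetricsPluginSketch {

    // Minimal stand-in for com.codahale.metrics.MetricRegistry so the
    // sketch is self-contained; the real proposal would use the
    // Dropwizard/Codahale types directly.
    static class MetricRegistry {
        final Map<String, Supplier<Long>> gauges = new ConcurrentHashMap<>();
        void registerGauge(String name, Supplier<Long> gauge) { gauges.put(name, gauge); }
        long value(String name) { return gauges.get(name).get(); }
    }

    // Hypothetical plugin exposing an S3A-style counter: the registered
    // gauge is read lazily, so the Spark metrics sinks could ship it to a
    // dashboard alongside the built-in executor metrics.
    static class S3AMetricsPlugin {
        private long bytesRead = 0;

        void init(MetricRegistry registry) {
            registry.registerGauge("executor.s3a.bytesRead", () -> bytesRead);
        }

        void recordRead(long n) { bytesRead += n; }
    }

    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        S3AMetricsPlugin plugin = new S3AMetricsPlugin();
        plugin.init(registry);
        plugin.recordRead(42);
        System.out.println(registry.value("executor.s3a.bytesRead")); // prints 42
    }
}
```

The gauge pattern (register a supplier, read on scrape) is what makes this cheap at runtime: the plugin only bumps a counter on I/O, and the sink pulls the value on its own schedule.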
[jira] [Assigned] (SPARK-28149) Disable negative DNS caching
[ https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28149: Assignee: (was: Apache Spark) > Disable negative DNS caching > - > > Key: SPARK-28149 > URL: https://issues.apache.org/jira/browse/SPARK-28149 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Jose Luis Pedrosa >Priority: Minor > > By default, the JVM caches DNS resolution failures; they are cached for 10 > seconds. > The Alpine JDK used in the images for Kubernetes has a default of 5 seconds. > This means that in clusters with slow init time (network sidecar pods, slow > network start-up) executors will never run, because the first attempt to > connect to the driver will fail, and that failure will be cached, causing > the retries to happen in a tight loop without actually trying again. > > The proposed implementation is to extend entrypoint.sh (which is exclusive > to k8s) to alter the file with the DNS caching settings, and disable negative > caching if an environment variable such as "DISABLE_DNS_NEGATIVE_CACHING" is > defined. >
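For reference, the knob involved is the JDK security property `networkaddress.cache.negative.ttl`. The ticket's entrypoint.sh approach would edit the java.security file before the JVM starts; the same effect can be sketched in-process, assuming it runs before the first failed lookup is cached:

```java
import java.security.Security;

public class DisableNegativeDnsCaching {
    public static void main(String[] args) {
        // "0" disables caching of failed DNS lookups; "-1" would cache
        // them forever. Must be set before the first failure is cached,
        // which is why the ticket proposes doing it in entrypoint.sh.
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
        System.out.println(
            Security.getProperty("networkaddress.cache.negative.ttl")); // prints 0
    }
}
```

Setting it in the security configuration (rather than in application code) is the more robust choice here, since the executor may fail its very first driver lookup before any user code runs.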
[jira] [Assigned] (SPARK-28149) Disable negative DNS caching
[ https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28149: Assignee: Apache Spark > Disable negative DNS caching > - > > Key: SPARK-28149 > URL: https://issues.apache.org/jira/browse/SPARK-28149 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.3 >Reporter: Jose Luis Pedrosa >Assignee: Apache Spark >Priority: Minor > > By default, the JVM caches DNS resolution failures; they are cached for 10 > seconds. > The Alpine JDK used in the images for Kubernetes has a default of 5 seconds. > This means that in clusters with slow init time (network sidecar pods, slow > network start-up) executors will never run, because the first attempt to > connect to the driver will fail, and that failure will be cached, causing > the retries to happen in a tight loop without actually trying again. > > The proposed implementation is to extend entrypoint.sh (which is exclusive > to k8s) to alter the file with the DNS caching settings, and disable negative > caching if an environment variable such as "DISABLE_DNS_NEGATIVE_CACHING" is > defined. >
[jira] [Assigned] (SPARK-28158) Hive UDFs support UDT type
[ https://issues.apache.org/jira/browse/SPARK-28158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28158: Assignee: (was: Apache Spark) > Hive UDFs support UDT type > --- > > Key: SPARK-28158 > URL: https://issues.apache.org/jira/browse/SPARK-28158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 2.4.3 >Reporter: Genmao Yu >Priority: Minor >
[jira] [Assigned] (SPARK-28158) Hive UDFs support UDT type
[ https://issues.apache.org/jira/browse/SPARK-28158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28158: Assignee: Apache Spark > Hive UDFs support UDT type > --- > > Key: SPARK-28158 > URL: https://issues.apache.org/jira/browse/SPARK-28158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 2.4.3 >Reporter: Genmao Yu >Assignee: Apache Spark >Priority: Minor >
[jira] [Created] (SPARK-28158) Hive UDFs support UDT type
Genmao Yu created SPARK-28158: - Summary: Hive UDFs support UDT type Key: SPARK-28158 URL: https://issues.apache.org/jira/browse/SPARK-28158 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3, 3.0.0 Reporter: Genmao Yu
[jira] [Resolved] (SPARK-28157) Make SHS check Spark event log file permission changes
[ https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28157. --- Resolution: Invalid My bad. This issue is invalid. > Make SHS check Spark event log file permission changes > -- > > Key: SPARK-28157 > URL: https://issues.apache.org/jira/browse/SPARK-28157 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3 >Reporter: Dongjoon Hyun >Priority: Major > > At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to > the file system, and maintains a permanent blacklist for all event log files > that failed once at reading. Although this reduces a lot of invalid accesses, > there is no way to see these log files again after the permissions are > recovered correctly. The only way has been restarting SHS. > Apache Spark is unable to detect the permission recovery. However, we had > better give a second chance to those blacklisted files on a regular basis.
[jira] [Commented] (SPARK-28036) Built-in udf left/right has inconsistent behavior
[ https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872050#comment-16872050 ] Yuming Wang commented on SPARK-28036: -
{code:sql}
[root@spark-3267648 spark-3.0.0-SNAPSHOT-bin-3.2.0]# bin/spark-shell
19/06/24 23:10:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://spark-3267648.lvs02.dev.ebayc3.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1561443018277).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select left('ahoj', -2), right('ahoj', -2)").show
+----------------+-----------------+
|left('ahoj', -2)|right('ahoj', -2)|
+----------------+-----------------+
|                |                 |
+----------------+-----------------+

scala> spark.sql("select left('ahoj', 2), right('ahoj', 2)").show
+---------------+----------------+
|left('ahoj', 2)|right('ahoj', 2)|
+---------------+----------------+
|             ah|              oj|
+---------------+----------------+
{code}
> Built-in udf left/right has inconsistent behavior > - > > Key: SPARK-28036 > URL: https://issues.apache.org/jira/browse/SPARK-28036 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {code:sql} > postgres=# select left('ahoj', -2), right('ahoj', -2); > left | right > --+--- > ah | oj > (1 row) > {code} > Spark SQL: > {code:sql} > spark-sql> select left('ahoj', -2), right('ahoj', -2); > spark-sql> > {code}
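For reference, PostgreSQL gives a negative n the meaning "drop |n| characters from the other end", which is the behavior Spark currently does not match; a small sketch of those semantics (hypothetical helper names, not Spark's implementation):

```java
public class LeftRightSketch {
    // PostgreSQL semantics: left(s, -n) drops the last n characters,
    // right(s, -n) drops the first n characters.
    static String left(String s, int n) {
        if (n >= 0) return s.substring(0, Math.min(n, s.length()));
        return s.substring(0, Math.max(0, s.length() + n));
    }

    static String right(String s, int n) {
        if (n >= 0) return s.substring(s.length() - Math.min(n, s.length()));
        return s.substring(Math.min(-n, s.length()));
    }

    public static void main(String[] args) {
        System.out.println(left("ahoj", -2) + " " + right("ahoj", -2)); // prints "ah oj"
        System.out.println(left("ahoj", 2) + " " + right("ahoj", 2));   // prints "ah oj"
    }
}
```

Note that for this input, positive and negative arguments happen to yield the same substrings ('ah'/'oj'), which is why the ticket uses it: Spark returning empty strings for -2 is clearly inconsistent with PostgreSQL.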
[jira] [Commented] (SPARK-28036) Built-in udf left/right has inconsistent behavior
[ https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872045#comment-16872045 ] Shivu Sondur commented on SPARK-28036: -- [~yumwang] select left('ahoj', 2), right('ahoj', 2); used with the '-' sign, it works fine; I tested on the latest Spark. > Built-in udf left/right has inconsistent behavior > - > > Key: SPARK-28036 > URL: https://issues.apache.org/jira/browse/SPARK-28036 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > PostgreSQL: > {code:sql} > postgres=# select left('ahoj', -2), right('ahoj', -2); > left | right > --+--- > ah | oj > (1 row) > {code} > Spark SQL: > {code:sql} > spark-sql> select left('ahoj', -2), right('ahoj', -2); > spark-sql> > {code}