[jira] [Created] (SPARK-28167) Show global temporary view in database tool

2019-06-25 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28167:
---

 Summary: Show global temporary view in database tool
 Key: SPARK-28167
 URL: https://issues.apache.org/jira/browse/SPARK-28167
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22340) pyspark setJobGroup doesn't match java threads

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22340:


Assignee: (was: Apache Spark)

> pyspark setJobGroup doesn't match java threads
> --
>
> Key: SPARK-22340
> URL: https://issues.apache.org/jira/browse/SPARK-22340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Leif Walsh
>Priority: Major
>
> With pyspark, {{sc.setJobGroup}}'s documentation says
> {quote}
> Assigns a group ID to all the jobs started by this thread until the group ID 
> is set to a different value or cleared.
> {quote}
> However, this doesn't appear to be associated with Python threads, only with 
> Java threads.  As such, a Python thread which calls this and then submits 
> multiple jobs doesn't necessarily get its jobs associated with any particular 
> spark job group.  For example:
> {code}
> def run_jobs():
>     sc.setJobGroup('hello', 'hello jobs')
>     x = sc.range(100).sum()
>     y = sc.range(1000).sum()
>     return x, y
> import concurrent.futures
> with concurrent.futures.ThreadPoolExecutor() as executor:
>     future = executor.submit(run_jobs)
>     sc.cancelJobGroup('hello')
>     future.result()
> {code}
> In this example, depending on how the action calls on the Python side are 
> allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
> assigned the job group {{hello}}.
> First, we should clarify the docs if this truly is the case.
> Second, it would be really helpful if we could make the job group assignment 
> reliable for a Python thread, though I'm not sure of the best way to do this.  
> As it stands, job groups are pretty useless from the pyspark side, if we 
> can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then 
> patch every point where job submission may take place to pass that in, but 
> this feels pretty brittle. In my experience with py4j, controlling threading 
> there is a challenge. 
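
A minimal sketch (not from the ticket) of the "mimic TLS on the Python side" idea above: keep the group in Python thread-local storage and re-apply it right before each action. The helper names (set_job_group, run_action) are hypothetical, an existing SparkContext {{sc}} is assumed, and the approach stays best-effort because py4j may still route calls onto different JVM threads.

{code:python}
import threading

_job_group = threading.local()  # per-Python-thread storage for the current group


def set_job_group(sc, group_id, description):
    # Remember the desired group for the calling Python thread only.
    _job_group.value = (group_id, description)


def run_action(sc, action):
    # Re-apply the group immediately before the action so that whichever JVM
    # thread py4j uses for this call carries the group id (best effort).
    group = getattr(_job_group, "value", None)
    if group is not None:
        sc.setJobGroup(*group)
    return action()


def run_jobs(sc):
    set_job_group(sc, 'hello', 'hello jobs')
    x = run_action(sc, lambda: sc.range(100).sum())
    y = run_action(sc, lambda: sc.range(1000).sum())
    return x, y
{code}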



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22340) pyspark setJobGroup doesn't match java threads

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22340:


Assignee: Apache Spark

> pyspark setJobGroup doesn't match java threads
> --
>
> Key: SPARK-22340
> URL: https://issues.apache.org/jira/browse/SPARK-22340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Leif Walsh
>Assignee: Apache Spark
>Priority: Major
>
> With pyspark, {{sc.setJobGroup}}'s documentation says
> {quote}
> Assigns a group ID to all the jobs started by this thread until the group ID 
> is set to a different value or cleared.
> {quote}
> However, this doesn't appear to be associated with Python threads, only with 
> Java threads.  As such, a Python thread which calls this and then submits 
> multiple jobs doesn't necessarily get its jobs associated with any particular 
> spark job group.  For example:
> {code}
> def run_jobs():
>     sc.setJobGroup('hello', 'hello jobs')
>     x = sc.range(100).sum()
>     y = sc.range(1000).sum()
>     return x, y
> import concurrent.futures
> with concurrent.futures.ThreadPoolExecutor() as executor:
>     future = executor.submit(run_jobs)
>     sc.cancelJobGroup('hello')
>     future.result()
> {code}
> In this example, depending on how the action calls on the Python side are 
> allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
> assigned the job group {{hello}}.
> First, we should clarify the docs if this truly is the case.
> Second, it would be really helpful if we could make the job group assignment 
> reliable for a Python thread, though I'm not sure of the best way to do this.  
> As it stands, job groups are pretty useless from the pyspark side, if we 
> can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then 
> patch every point where job submission may take place to pass that in, but 
> this feels pretty brittle. In my experience with py4j, controlling threading 
> there is a challenge. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22340) pyspark setJobGroup doesn't match java threads

2019-06-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-22340:
--

> pyspark setJobGroup doesn't match java threads
> --
>
> Key: SPARK-22340
> URL: https://issues.apache.org/jira/browse/SPARK-22340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Leif Walsh
>Priority: Major
>  Labels: bulk-closed
>
> With pyspark, {{sc.setJobGroup}}'s documentation says
> {quote}
> Assigns a group ID to all the jobs started by this thread until the group ID 
> is set to a different value or cleared.
> {quote}
> However, this doesn't appear to be associated with Python threads, only with 
> Java threads.  As such, a Python thread which calls this and then submits 
> multiple jobs doesn't necessarily get its jobs associated with any particular 
> spark job group.  For example:
> {code}
> def run_jobs():
>     sc.setJobGroup('hello', 'hello jobs')
>     x = sc.range(100).sum()
>     y = sc.range(1000).sum()
>     return x, y
> import concurrent.futures
> with concurrent.futures.ThreadPoolExecutor() as executor:
>     future = executor.submit(run_jobs)
>     sc.cancelJobGroup('hello')
>     future.result()
> {code}
> In this example, depending on how the action calls on the Python side are 
> allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
> assigned the job group {{hello}}.
> First, we should clarify the docs if this truly is the case.
> Second, it would be really helpful if we could make the job group assignment 
> reliable for a Python thread, though I'm not sure of the best way to do this.  
> As it stands, job groups are pretty useless from the pyspark side, if we 
> can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then 
> patch every point where job submission may take place to pass that in, but 
> this feels pretty brittle. In my experience with py4j, controlling threading 
> there is a challenge. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22340) pyspark setJobGroup doesn't match java threads

2019-06-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22340:
-
Labels:   (was: bulk-closed)

> pyspark setJobGroup doesn't match java threads
> --
>
> Key: SPARK-22340
> URL: https://issues.apache.org/jira/browse/SPARK-22340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Leif Walsh
>Priority: Major
>
> With pyspark, {{sc.setJobGroup}}'s documentation says
> {quote}
> Assigns a group ID to all the jobs started by this thread until the group ID 
> is set to a different value or cleared.
> {quote}
> However, this doesn't appear to be associated with Python threads, only with 
> Java threads.  As such, a Python thread which calls this and then submits 
> multiple jobs doesn't necessarily get its jobs associated with any particular 
> spark job group.  For example:
> {code}
> def run_jobs():
>     sc.setJobGroup('hello', 'hello jobs')
>     x = sc.range(100).sum()
>     y = sc.range(1000).sum()
>     return x, y
> import concurrent.futures
> with concurrent.futures.ThreadPoolExecutor() as executor:
>     future = executor.submit(run_jobs)
>     sc.cancelJobGroup('hello')
>     future.result()
> {code}
> In this example, depending on how the action calls on the Python side are 
> allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
> assigned the job group {{hello}}.
> First, we should clarify the docs if this truly is the case.
> Second, it would be really helpful if we could make the job group assignment 
> reliable for a Python thread, though I'm not sure of the best way to do this.  
> As it stands, job groups are pretty useless from the pyspark side, if we 
> can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then 
> patch every point where job submission may take place to pass that in, but 
> this feels pretty brittle. In my experience with py4j, controlling threading 
> there is a challenge. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28164) usage description does not match with shell scripts

2019-06-25 Thread Hanna Kan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872873#comment-16872873
 ] 

Hanna Kan commented on SPARK-28164:
---

But if you add some options, such as "{{sbin/start-slave.sh -c $CORES_PER_WORKER 
-m 3G ${MASTER}}}", it will not work properly, because at the very beginning $1 
is not the master.

> usage description does not match with shell scripts
> ---
>
> Key: SPARK-28164
> URL: https://issues.apache.org/jira/browse/SPARK-28164
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.3
>Reporter: Hanna Kan
>Priority: Major
>
> I found that "spark/sbin/start-slave.sh" may have an error.
> Line 43 gives: echo "Usage: ./sbin/start-slave.sh [options] <master>"
> but later in this script, line 59 has MASTER=$1.
> Is this a conflict?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27656) Safely register class for GraphX

2019-06-25 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-27656.
--
Resolution: Not A Problem

> Safely register class for GraphX
> 
>
> Key: SPARK-27656
> URL: https://issues.apache.org/jira/browse/SPARK-27656
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.4.3
>Reporter: zhengruifeng
>Priority: Major
>
> GraphX common classes (such as Edge and EdgeTriplet) are not registered in Kryo 
> by default.
> Users can register those classes via {{GraphXUtils.registerKryoClasses}}; 
> however, it seems that none of the graphx-lib implementations call it, and 
> users tend to overlook this registration.
> So I prefer to safely register them in {{KryoSerializer.scala}}, like what 
> SQL and ML do.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28159) Make the transform natively in ml framework to avoid extra conversion

2019-06-25 Thread zhengruifeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-28159:
-
Description: 
It has been a long time since ML was released.

However, there are still many TODOs (like in 
[ChiSqSelector.scala|https://github.com/apache/spark/pull/24963/files#diff-9b0bc8a01b34c38958ce45c14f9c5da5]: 
{{// TODO: Make the transformer natively in ml framework to avoid extra 
conversion.}}) about making the transform work natively in the ML framework.

I am trying to make the ML algorithms no longer need to convert ml vectors to 
mllib vectors in their transforms.

Including: 
LDA/ChiSqSelector/ElementwiseProduct/HashingTF/IDF/Normalizer/PCA/StandardScaler.

 

  was:
It has been a long time since ML was released.

However, there are still many TODOs about making the transform work natively in 
the ML framework.


> Make the transform natively in ml framework to avoid extra conversion
> -
>
> Key: SPARK-28159
> URL: https://issues.apache.org/jira/browse/SPARK-28159
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> It has been a long time since ML was released.
> However, there are still many TODOs (like in 
> [ChiSqSelector.scala|https://github.com/apache/spark/pull/24963/files#diff-9b0bc8a01b34c38958ce45c14f9c5da5]: 
> {{// TODO: Make the transformer natively in ml framework to avoid extra 
> conversion.}}) about making the transform work natively in the ML framework.
> I am trying to make the ML algorithms no longer need to convert ml vectors to 
> mllib vectors in their transforms.
> Including: 
> LDA/ChiSqSelector/ElementwiseProduct/HashingTF/IDF/Normalizer/PCA/StandardScaler.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22340) pyspark setJobGroup doesn't match java threads

2019-06-25 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872852#comment-16872852
 ] 

Liang-Chi Hsieh commented on SPARK-22340:
-

[~hyukjin.kwon] Should we reopen this, as you have opened a PR for it now?

> pyspark setJobGroup doesn't match java threads
> --
>
> Key: SPARK-22340
> URL: https://issues.apache.org/jira/browse/SPARK-22340
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: Leif Walsh
>Priority: Major
>  Labels: bulk-closed
>
> With pyspark, {{sc.setJobGroup}}'s documentation says
> {quote}
> Assigns a group ID to all the jobs started by this thread until the group ID 
> is set to a different value or cleared.
> {quote}
> However, this doesn't appear to be associated with Python threads, only with 
> Java threads.  As such, a Python thread which calls this and then submits 
> multiple jobs doesn't necessarily get its jobs associated with any particular 
> spark job group.  For example:
> {code}
> def run_jobs():
>     sc.setJobGroup('hello', 'hello jobs')
>     x = sc.range(100).sum()
>     y = sc.range(1000).sum()
>     return x, y
> import concurrent.futures
> with concurrent.futures.ThreadPoolExecutor() as executor:
>     future = executor.submit(run_jobs)
>     sc.cancelJobGroup('hello')
>     future.result()
> {code}
> In this example, depending on how the action calls on the Python side are 
> allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
> assigned the job group {{hello}}.
> First, we should clarify the docs if this truly is the case.
> Second, it would be really helpful if we could make the job group assignment 
> reliable for a Python thread, though I'm not sure of the best way to do this.  
> As it stands, job groups are pretty useless from the pyspark side, if we 
> can't rely on this fact.
> My only idea so far is to mimic the TLS behavior on the Python side and then 
> patch every point where job submission may take place to pass that in, but 
> this feels pretty brittle. In my experience with py4j, controlling threading 
> there is a challenge. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27676) InMemoryFileIndex should hard-fail on missing files instead of logging and continuing

2019-06-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27676.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24668
[https://github.com/apache/spark/pull/24668]

> InMemoryFileIndex should hard-fail on missing files instead of logging and 
> continuing
> -
>
> Key: SPARK-27676
> URL: https://issues.apache.org/jira/browse/SPARK-27676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Fix For: 3.0.0
>
>
> Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} 
> exceptions are caught and logged as warnings (during [directory 
> listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274]
>  and [block location 
> lookup|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L333]).
>  I think that this is a dangerous default behavior and would prefer that 
> Spark hard-fails by default (with the ignore-and-continue behavior guarded by 
> a SQL session configuration).
> In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. 
> Quoting from the PR for SPARK-17599:
> {quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. 
> If a folder is deleted at any time between the paths were resolved and the 
> file catalog can check for the folder, the Spark job fails. This may abruptly 
> stop long running StructuredStreaming jobs for example.
> Folders may be deleted by users or automatically by retention policies. These 
> cases should not prevent jobs from successfully completing.
> {quote}
> Let's say that I'm *not* expecting to ever delete input files for my job. In 
> that case, this behavior can mask bugs.
> One straightforward masked bug class is accidental file deletion: if I'm 
> never expecting to delete files then I'd prefer to fail my job if Spark sees 
> deleted files.
> A more subtle bug can occur when using a S3 filesystem. Say I'm running a 
> Spark job against a partitioned Parquet dataset which is laid out like this:
> {code:java}
> data/
>   date=1/
> region=west/
>0.parquet
>1.parquet
> region=east/
>0.parquet
>1.parquet{code}
> If I do {{spark.read.parquet("/data/date=1/")}} then Spark needs to perform 
> multiple rounds of file listing, first listing {{/data/date=1}} to discover 
> the partitions for that date, then listing within each partition to discover 
> the leaf files. Due to the eventual consistency of S3 ListObjects, it's 
> possible that the first listing will show the {{region=west}} and 
> {{region=east}} partitions existing and then the next-level listing fails to 
> return any for some of the directories (e.g. {{/data/date=1/}} returns files 
> but {{/data/date=1/region=west/}} throws a {{FileNotFoundException}} in S3A 
> due to ListObjects inconsistency).
> If Spark propagated the {{FileNotFoundException}} and hard-failed in this 
> case then I'd be able to fail the job in this case where we _definitely_ know 
> that the S3 listing is inconsistent (failing here doesn't guard against _all_ 
> potential S3 list inconsistency issues (e.g. back-to-back listings which both 
> return a subset of the true set of objects), but I think it's still an 
> improvement to fail for the subset of cases that we _can_ detect even if 
> that's not a surefire failsafe against the more general problem).
> Finally, I'm unsure if the original patch will have the desired effect: if a 
> file is deleted once a Spark job expects to read it then that can cause 
> problems at multiple layers, both in the driver (multiple rounds of file 
> listing) and in executors (if the deletion occurs after the construction of 
> the catalog but before the scheduling of the read tasks); I think the 
> original patch only resolved the problem for the driver (unless I'm missing 
> similar executor-side code specific to the original streaming use-case).
> Given all of these reasons, I think that the "ignore potentially deleted 
> files during file index listing" behavior should be guarded behind a feature 
> flag which defaults to {{false}}, consistent with the existing 
> {{spark.files.ignoreMissingFiles}} and {{spark.sql.files.ignoreMissingFiles}} 
> flags (which both default to false).
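
For reference, a minimal PySpark sketch (not part of the ticket) of how the two existing flags mentioned above are set today; the app name is a placeholder and the path reuses the example layout from the description.

{code:python}
from pyspark.sql import SparkSession

# Both flags default to false, i.e. a missing file is a hard failure at scan time.
spark = (SparkSession.builder
         .appName("ignore-missing-files-example")                # placeholder name
         .config("spark.files.ignoreMissingFiles", "false")      # core-level flag
         .config("spark.sql.files.ignoreMissingFiles", "false")  # SQL-level flag
         .getOrCreate())

# With the SQL flag left at false, a file that disappears between listing and
# reading surfaces as an error instead of being silently skipped.
df = spark.read.parquet("/data/date=1/")
{code}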



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-27802) SparkUI throws NoSuchElementException when inconsistency appears between `ExecutorStageSummaryWrapper`s and `ExecutorSummaryWrapper`s

2019-06-25 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872740#comment-16872740
 ] 

shahid commented on SPARK-27802:


Could you please provide steps to reproduce the issue?

> SparkUI throws NoSuchElementException when inconsistency appears between 
> `ExecutorStageSummaryWrapper`s and `ExecutorSummaryWrapper`s
> -
>
> Key: SPARK-27802
> URL: https://issues.apache.org/jira/browse/SPARK-27802
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: liupengcheng
>Priority: Major
>
> Recently, we hit this issue when testing Spark 2.3. It reports the following 
> error messages when clicking on the stage UI link.
> We added more logs to print the executorId (here it is 10) to debug, and 
> finally found out that it's caused by an inconsistency between the list of 
> `ExecutorStageSummaryWrapper` and the `ExecutorSummaryWrapper` in the 
> KVStore. The number of deadExecutors may have exceeded the threshold, causing 
> the executor to be removed from the list of `ExecutorSummaryWrapper` while 
> still being kept in the list of `ExecutorStageSummaryWrapper` in the store.
> {code:java}
> HTTP ERROR 500
> Problem accessing /stages/stage/. Reason:
> Server Error
> Caused by:
> java.util.NoSuchElementException: 10
>   at 
> org.apache.spark.util.kvstore.InMemoryStore.read(InMemoryStore.java:83)
>   at 
> org.apache.spark.status.ElementTrackingStore.read(ElementTrackingStore.scala:95)
>   at 
> org.apache.spark.status.AppStatusStore.executorSummary(AppStatusStore.scala:70)
>   at 
> org.apache.spark.ui.jobs.ExecutorTable$$anonfun$createExecutorTable$2.apply(ExecutorTable.scala:99)
>   at 
> org.apache.spark.ui.jobs.ExecutorTable$$anonfun$createExecutorTable$2.apply(ExecutorTable.scala:92)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ui.jobs.ExecutorTable.createExecutorTable(ExecutorTable.scala:92)
>   at 
> org.apache.spark.ui.jobs.ExecutorTable.toNodeSeq(ExecutorTable.scala:75)
>   at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:478)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
>   at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>   at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:539)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>   at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>   at 
> org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
>   at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> 

[jira] [Commented] (SPARK-28152) ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector

2019-06-25 Thread Shiv Prashant Sood (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872739#comment-16872739
 ] 

Shiv Prashant Sood commented on SPARK-28152:


Resolved as part of https://issues.apache.org/jira/browse/SPARK-28151

> ShortType and FloatTypes are not correctly mapped to right JDBC types when 
> using JDBC connector
> ---
>
> Key: SPARK-28152
> URL: https://issues.apache.org/jira/browse/SPARK-28152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
> ShortType and FloatType are not correctly mapped to the right JDBC types when 
> using the JDBC connector. This results in tables and Spark data frames being 
> created with unintended types.
> Some example issues:
>  * A write from a df with a ShortType column results in a SQL table with column 
> type INTEGER as opposed to SMALLINT, thus a larger table than expected.
>  * A read results in a dataframe with type INTEGER as opposed to ShortType.
> FloatType has an issue with the read path. In the write path, the Spark data 
> type 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real'. 
> But in the read path, when JDBC data types need to be converted to Catalyst 
> data types (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' 
> rather than 'FloatType'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28151) ByteType, ShortType and FloatTypes are not correctly mapped for read/write of SQLServer tables

2019-06-25 Thread Shiv Prashant Sood (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiv Prashant Sood updated SPARK-28151:
---
Description: 
## ByteType issue

Writing a dataframe with column type BYTETYPE fails when using the JDBC connector 
for SQL Server. Append and read of tables also fail. The problem is due to:

1. (Write path) Incorrect mapping of BYTETYPE in getCommonJDBCType() in 
jdbcutils.scala, where BYTETYPE gets mapped to the text "BYTE". It should be 
mapped to TINYINT:

case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))

In getCatalystType() (JDBC to Catalyst type mapping), TINYINT is mapped to 
INTEGER, while it should be mapped to BYTETYPE. Mapping to integer is OK from 
the point of view of upcasting, but will lead to a 4-byte allocation rather than 
1 byte for BYTETYPE.

2. (Read path) The read path ends up calling makeGetter(dt: DataType, metadata: 
Metadata). The function sets the value in the RDD row. The value is set per the 
data type. Here there is no mapping for BYTETYPE, and thus this will result in 
an error when getCatalystType() is fixed.

Note: These issues were found when reading/writing with SQLServer. Will be 
submitting a PR soon to fix these mappings in MSSQLServerDialect.

Error seen when writing a table:

(JDBC Write failed,com.microsoft.sqlserver.jdbc.SQLServerException: Column, 
parameter, or variable #2: *Cannot find data type BYTE*.)
com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable 
#2: Cannot find data type BYTE.
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
 ..

## ShortType and FloatType issue

ShortType and FloatType are not correctly mapped to the right JDBC types when 
using the JDBC connector. This results in tables and Spark data frames being 
created with unintended types.

Some example issues:

A write from a df with a ShortType column results in a SQL table with column type 
INTEGER as opposed to SMALLINT, thus a larger table than expected.
A read results in a dataframe with type INTEGER as opposed to ShortType.

FloatType has an issue with the read path. In the write path, the Spark data type 
'FloatType' is correctly mapped to the JDBC equivalent data type 'Real'. But in 
the read path, when JDBC data types need to be converted to Catalyst data types 
(getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
'FloatType'.
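
A hedged PySpark sketch of the kind of write that surfaces these mappings; the JDBC URL, driver option, and table name are placeholders, not taken from the report.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ByteType, ShortType, FloatType

spark = SparkSession.builder.appName("jdbc-type-mapping-repro").getOrCreate()

schema = StructType([
    StructField("b", ByteType()),   # expected TINYINT in SQL Server
    StructField("s", ShortType()),  # expected SMALLINT
    StructField("f", FloatType()),  # expected REAL
])
df = spark.createDataFrame([(1, 2, 3.0)], schema=schema)

# With the mappings described above, the ByteType column makes the write fail
# ("Cannot find data type BYTE"), and ShortType is silently widened to INTEGER.
(df.write.format("jdbc")
   .option("url", "jdbc:sqlserver://example-host;databaseName=testdb")   # placeholder
   .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
   .option("dbtable", "dbo.type_mapping_repro")                          # placeholder
   .mode("overwrite")
   .save())
{code}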

 

 

 

 

 

 

  was:
Writing dataframe with column type BYTETYPE fails when using JDBC connector for 
SQL Server. Append and read of tables also fail. The problem is due to:

1. (Write path) Incorrect mapping of BYTETYPE in getCommonJDBCType() in 
jdbcutils.scala where BYTETYPE gets mapped to BYTE text. It should be mapped to 
TINYINT
case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))

In getCatalystType() ( JDBC to Catalyst type mapping) TINYINT is mapped to 
INTEGER, while it should be mapped to BYTETYPE. Mapping to integer is ok from 
the point of view of upcasting, but will lead to 4 byte allocation rather than 
1 byte for BYTETYPE.



2. (read path) Read path ends up calling makeGetter(dt: DataType, metadata: 
Metadata). The function sets the value in RDD row. The value is set per the 
data type. Here there is no mapping for BYTETYPE, and thus this will result 
in an error when getCatalystType() is fixed.

Note : These issues were found when reading/writing with SQLServer. Will be 
submitting a PR soon to fix these mappings in MSSQLServerDialect.

Error seen when writing table

(JDBC Write failed,com.microsoft.sqlserver.jdbc.SQLServerException: Column, 
parameter, or variable #2: *Cannot find data type BYTE*.)
com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable 
#2: Cannot find data type BYTE.
com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
 ..

 

 

 

 

 

 


> ByteType, ShortType and FloatTypes are not correctly mapped for read/write of 
> SQLServer tables
> --
>
> Key: SPARK-28151
> URL: https://issues.apache.org/jira/browse/SPARK-28151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>   

[jira] [Updated] (SPARK-28151) ByteType, ShortType and FloatTypes are not correctly mapped for read/write of SQLServer tables

2019-06-25 Thread Shiv Prashant Sood (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiv Prashant Sood updated SPARK-28151:
---
Summary: ByteType, ShortType and FloatTypes are not correctly mapped for 
read/write of SQLServer tables  (was: ByteType is not correctly mapped for 
read/write of SQLServer tables)

> ByteType, ShortType and FloatTypes are not correctly mapped for read/write of 
> SQLServer tables
> --
>
> Key: SPARK-28151
> URL: https://issues.apache.org/jira/browse/SPARK-28151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
> Writing dataframe with column type BYTETYPE fails when using JDBC connector 
> for SQL Server. Append and read of tables also fail. The problem is due to: 
> 1. (Write path) Incorrect mapping of BYTETYPE in getCommonJDBCType() in 
> jdbcutils.scala where BYTETYPE gets mapped to BYTE text. It should be mapped 
> to TINYINT
> case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))
> In getCatalystType() ( JDBC to Catalyst type mapping) TINYINT is mapped to 
> INTEGER, while it should be mapped to BYTETYPE. Mapping to integer is ok from 
> the point of view of upcasting, but will lead to 4 byte allocation rather 
> than 1 byte for BYTETYPE.
> 2. (read path) Read path ends up calling makeGetter(dt: DataType, metadata: 
> Metadata). The function sets the value in RDD row. The value is set per the 
> data type. Here there is no mapping for BYTETYPE, and thus this will result 
> in an error when getCatalystType() is fixed.
> Note : These issues were found when reading/writing with SQLServer. Will be 
> submitting a PR soon to fix these mappings in MSSQLServerDialect.
> Error seen when writing table
> (JDBC Write failed,com.microsoft.sqlserver.jdbc.SQLServerException: Column, 
> parameter, or variable #2: *Cannot find data type BYTE*.)
> com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or 
> variable #2: Cannot find data type BYTE.
> com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
> com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
> com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
>  ..



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28152) ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector

2019-06-25 Thread Shiv Prashant Sood (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shiv Prashant Sood updated SPARK-28152:
---
Description: 
ShortType and FloatType are not correctly mapped to the right JDBC types when 
using the JDBC connector. This results in tables and Spark data frames being 
created with unintended types.

Some example issues:
 * A write from a df with a ShortType column results in a SQL table with column 
type INTEGER as opposed to SMALLINT, thus a larger table than expected.
 * A read results in a dataframe with type INTEGER as opposed to ShortType.

FloatType has an issue with the read path. In the write path, the Spark data type 
'FloatType' is correctly mapped to the JDBC equivalent data type 'Real'. But in 
the read path, when JDBC data types need to be converted to Catalyst data types 
(getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
'FloatType'.

 

  was:
ShortType and FloatType are not correctly mapped to the right JDBC types when 
using the JDBC connector. This results in tables or Spark data frames being 
created with unintended types.

Some example issues:
 * A write from a df with a ShortType column results in a SQL table with column 
type INTEGER as opposed to SMALLINT, thus a larger table than expected.
 * A read results in a dataframe with type INTEGER as opposed to ShortType.

FloatType has an issue with the read path. In the write path, the Spark data type 
'FloatType' is correctly mapped to the JDBC equivalent data type 'Real'. But in 
the read path, when JDBC data types need to be converted to Catalyst data types 
(getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather than 
'FloatType'.

 


> ShortType and FloatTypes are not correctly mapped to right JDBC types when 
> using JDBC connector
> ---
>
> Key: SPARK-28152
> URL: https://issues.apache.org/jira/browse/SPARK-28152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
> ShortType and FloatType are not correctly mapped to the right JDBC types when 
> using the JDBC connector. This results in tables and Spark data frames being 
> created with unintended types.
> Some example issues:
>  * A write from a df with a ShortType column results in a SQL table with column 
> type INTEGER as opposed to SMALLINT, thus a larger table than expected.
>  * A read results in a dataframe with type INTEGER as opposed to ShortType.
> FloatType has an issue with the read path. In the write path, the Spark data 
> type 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real'. 
> But in the read path, when JDBC data types need to be converted to Catalyst 
> data types (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' 
> rather than 'FloatType'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28166) Query optimization for symmetric difference / disjunctive union of Datasets

2019-06-25 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-28166:
---
Description: 
The *symmetric difference* (a.k.a. *disjunctive union*) of two sets is their 
set union minus their set intersection: it returns tuples which are in only one 
of the sets and omits tuples which are present in both sets (see 
[https://en.wikipedia.org/wiki/Symmetric_difference]).

With the Datasets API, we can express this as either
{code:java}
a.union(b).except(a.intersect(b)){code}
or
{code:java}
a.except(b).union(b.except(a)){code}
Spark currently plans this query with two joins. However, it may be more 
efficient to represent this as a full outer join followed by a filter and a 
distinct (and, depending on the number of duplicates, we might want to push 
additional distinct clauses beneath the join, but I think that's a separate 
optimization). It would be cool if the optimizer could automatically perform this 
rewrite.

This is a very low priority: I'm filing this ticket mostly for tracking / 
reference purposes (so searches for 'symmetric difference' turn up something 
useful in Spark's JIRA).

  was:
The *symmetric difference* (a.k.a. *disjunctive union*) of two sets is their 
set union minus their set intersection: it returns tuples which are in only one 
of the sets and omits tuples which are present in both sets (see 
[https://en.wikipedia.org/wiki/Symmetric_difference]).

With the Datasets API, we can express this as either
{code:java}
a.union(b).except(a.intersect(b)){code}
or
{code:java}
a.except(b).union(b.except(a)){code}
Spark currently plans this query with two joins. However, it may be more 
efficient to represent this as a full outer join followed by a filter and a 
distinct (and, depending on the number of duplicates, we might want to push 
additional distinct clauses beneath the join, but I think that's a separate 
optimization). It would be cool if the optimizer could automatically perform this 
rewrite.

This is a pretty low priority: I'm filing this ticket mostly for tracking / 
reference purposes (so searches for 'symmetric difference' turn up something 
useful in Spark's JIRA).


> Query optimization for symmetric difference / disjunctive union of Datasets
> ---
>
> Key: SPARK-28166
> URL: https://issues.apache.org/jira/browse/SPARK-28166
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> The *symmetric difference* (a.k.a. *disjunctive union*) of two sets is their 
> set union minus their set intersection: it returns tuples which are in only 
> one of the sets and omits tuples which are present in both sets (see 
> [https://en.wikipedia.org/wiki/Symmetric_difference]).
> With the Datasets API, we can express this as either
> {code:java}
> a.union(b).except(a.intersect(b)){code}
> or
> {code:java}
> a.except(b).union(b.except(a)){code}
> Spark currently plans this query with two joins. However, it may be more 
> efficient to represent this as a full outer join followed by a filter and a 
> distinct (and, depending on the number of duplicates, we might want to push 
> additional distinct clauses beneath the join, but I think that's a separate 
> optimization). It would be cool if the optimizer could automatically perform 
> this rewrite.
> This is a very low priority: I'm filing this ticket mostly for tracking / 
> reference purposes (so searches for 'symmetric difference' turn up something 
> useful in Spark's JIRA).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28166) Query optimization for symmetric difference / disjunctive union of Datasets

2019-06-25 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-28166:
--

 Summary: Query optimization for symmetric difference / disjunctive 
union of Datasets
 Key: SPARK-28166
 URL: https://issues.apache.org/jira/browse/SPARK-28166
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen


The *symmetric difference* (a.k.a. *disjunctive union*) of two sets is their 
set union minus their set intersection: it returns tuples which are in only one 
of the sets and omits tuples which are present in both sets (see 
[https://en.wikipedia.org/wiki/Symmetric_difference]).

With the Datasets API, we can express this as either
{code:java}
a.union(b).except(a.intersect(b)){code}
or
{code:java}
a.except(b).union(b.except(a)){code}
Spark currently plans this query with two joins. However, it may be more 
efficient to represent this as a full outer join followed by a filter and a 
distinct (and, depending on the number of duplicates, we might want to push 
additional distinct clauses beneath the join, but I think that's a separate 
optimization). It would be cool if the optimizer could automatically perform this 
rewrite.

This is a pretty low priority: I'm filing this ticket mostly for tracking / 
reference purposes (so searches for 'symmetric difference' turn up something 
useful in Spark's JIRA).
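
For illustration, a minimal PySpark sketch (not an optimizer rule) of the full-outer-join + filter + distinct formulation, using a single assumed join column "id":

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("symmetric-difference-sketch").getOrCreate()

a = spark.createDataFrame([(1,), (2,), (3,)], ["id"])
b = spark.createDataFrame([(2,), (3,), (4,)], ["id"])

# Equivalent of a.union(b).except(a.intersect(b)), expressed with one join:
sym_diff = (
    a.withColumn("in_a", F.lit(True))
     .join(b.withColumn("in_b", F.lit(True)), on="id", how="full_outer")
     .where(F.col("in_a").isNull() | F.col("in_b").isNull())  # rows on only one side
     .select("id")
     .distinct()
)
sym_diff.show()  # ids 1 and 4
{code}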



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25390) data source V2 API refactoring

2019-06-25 Thread Lars Francke (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872684#comment-16872684
 ] 

Lars Francke commented on SPARK-25390:
--

Is there any kind of end-user documentation on how to use these APIs to develop 
custom sources?

When looking at the Spark homepage, one only finds this documentation: 
[https://spark.apache.org/docs/2.2.0/streaming-custom-receivers.html]. It'd be 
useful to have a version of this for the new APIs.

> data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently it's not very clear how we should abstract the data source v2 API. The 
> abstraction should be unified between batch and streaming, or similar but 
> have a well-defined difference between batch and streaming. And the 
> abstraction should also include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to the abstraction



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28135) ceil/ceiling/floor/power returns incorrect values

2019-06-25 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872657#comment-16872657
 ] 

Tony Zhang commented on SPARK-28135:


As commented in the PR, to fix this overflow we should not change the Ceil return 
type to double, and thus we will have to add support for a 128-bit int type in 
the code base. That will be a different story.

> ceil/ceiling/floor/power returns incorrect values
> -
>
> Key: SPARK-28135
> URL: https://issues.apache.org/jira/browse/SPARK-28135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> spark-sql> select ceil(double(1.2345678901234e+200)), 
> ceiling(double(1.2345678901234e+200)), floor(double(1.2345678901234e+200)), 
> power('1', 'NaN');
> 9223372036854775807   9223372036854775807 9223372036854775807 NaN
> {noformat}
> {noformat}
> postgres=# select ceil(1.2345678901234e+200::float8), 
> ceiling(1.2345678901234e+200::float8), floor(1.2345678901234e+200::float8), 
> power('1', 'NaN');
>  ceil |   ceiling|floor | power
> --+--+--+---
>  1.2345678901234e+200 | 1.2345678901234e+200 | 1.2345678901234e+200 | 1
> (1 row)
> {noformat}
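
For reference, a minimal PySpark repro sketch built from the SQL above (not part of the ticket). The saturation happens because ceil/ceiling/floor return LongType here, so doubles larger than Long.MaxValue clamp to 9223372036854775807.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ceil-overflow-repro").getOrCreate()

spark.sql(
    "select ceil(double(1.2345678901234e+200)), "
    "ceiling(double(1.2345678901234e+200)), "
    "floor(double(1.2345678901234e+200)), "
    "power('1', 'NaN')"
).show(truncate=False)
# On affected versions, the first three columns print 9223372036854775807.
{code}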



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27630) Stage retry causes totalRunningTasks calculation to be negative

2019-06-25 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-27630.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24497
[https://github.com/apache/spark/pull/24497]

> Stage retry causes totalRunningTasks calculation to be negative
> ---
>
> Key: SPARK-27630
> URL: https://issues.apache.org/jira/browse/SPARK-27630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
> Fix For: 3.0.0
>
>
> Track tasks separately for each stage attempt (instead of tracking by stage), 
> and do NOT reset the numRunningTasks to 0 on StageCompleted.
> In the case of stage retry, the {{taskEnd}} event from the zombie stage 
> sometimes makes the number of {{totalRunningTasks}} negative, which 
> causes the job to get stuck.
>  Similar problem also exists with {{stageIdToTaskIndices}} & 
> {{stageIdToSpeculativeTaskIndices}}.
>  If it is a failed {{taskEnd}} event of the zombie stage, this will cause 
> {{stageIdToTaskIndices}} or {{stageIdToSpeculativeTaskIndices}} to remove the 
> task index of the active stage, and the number of {{totalPendingTasks}} will 
> increase unexpectedly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27630) Stage retry causes totalRunningTasks calculation to be negative

2019-06-25 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-27630:


Assignee: dzcxzl

> Stage retry causes totalRunningTasks calculation to be negative
> ---
>
> Key: SPARK-27630
> URL: https://issues.apache.org/jira/browse/SPARK-27630
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: dzcxzl
>Assignee: dzcxzl
>Priority: Minor
>
> Track tasks separately for each stage attempt (instead of tracking by stage), 
> and do NOT reset the numRunningTasks to 0 on StageCompleted.
> In the case of stage retry, the {{taskEnd}} event from the zombie stage 
> sometimes makes the number of {{totalRunningTasks}} negative, which 
> causes the job to get stuck.
>  Similar problem also exists with {{stageIdToTaskIndices}} & 
> {{stageIdToSpeculativeTaskIndices}}.
>  If it is a failed {{taskEnd}} event of the zombie stage, this will cause 
> {{stageIdToTaskIndices}} or {{stageIdToSpeculativeTaskIndices}} to remove the 
> task index of the active stage, and the number of {{totalPendingTasks}} will 
> increase unexpectedly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28157) Make SHS clear KVStore LogInfo for the blacklisted entries

2019-06-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28157:
--
Summary: Make SHS clear KVStore LogInfo for the blacklisted entries  (was: 
Make SHS check Spark event log file permission changes)

> Make SHS clear KVStore LogInfo for the blacklisted entries
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegates access permission checks to 
> the file system and maintains a blacklist of all event log files that failed 
> once at reading. The blacklisted log files are released back after 
> CLEAN_INTERVAL_S.
> However, files whose size doesn't change are ignored forever, because 
> shouldReloadLog always returns false when the size is the same as the value 
> in the KVStore. This can be recovered only via an SHS restart.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28151) ByteType is not correctly mapped for read/write of SQLServer tables

2019-06-25 Thread Shiv Prashant Sood (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872577#comment-16872577
 ] 

Shiv Prashant Sood commented on SPARK-28151:


Fixed by [https://github.com/apache/spark/pull/24969]

 

> ByteType is not correctly mapped for read/write of SQLServer tables
> ---
>
> Key: SPARK-28151
> URL: https://issues.apache.org/jira/browse/SPARK-28151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
> Writing dataframe with column type BYTETYPE fails when using JDBC connector 
> for SQL Server. Append and read of tables also fail. The problem is due to: 
> 1. (Write path) Incorrect mapping of BYTETYPE in getCommonJDBCType() in 
> jdbcutils.scala where BYTETYPE gets mapped to BYTE text. It should be mapped 
> to TINYINT
> case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))
> In getCatalystType() ( JDBC to Catalyst type mapping) TINYINT is mapped to 
> INTEGER, while it should be mapped to BYTETYPE. Mapping to integer is ok from 
> the point of view of upcasting, but will lead to 4 byte allocation rather 
> than 1 byte for BYTETYPE.
> 2. (read path) Read path ends up calling makeGetter(dt: DataType, metadata: 
> Metadata). The function sets the value in RDD row. The value is set per the 
> data type. Here there is no mapping for BYTETYPE, and thus this will result 
> in an error when getCatalystType() is fixed.
> Note : These issues were found when reading/writing with SQLServer. Will be 
> submitting a PR soon to fix these mappings in MSSQLServerDialect.
> Error seen when writing table
> (JDBC Write failed,com.microsoft.sqlserver.jdbc.SQLServerException: Column, 
> parameter, or variable #2: *Cannot find data type BYTE*.)
> com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or 
> variable #2: Cannot find data type BYTE.
> com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
> com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
> com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
>  ..



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28151) ByteType is not correctly mapped for read/write of SQLServer tables

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28151:


Assignee: Apache Spark

> ByteType is not correctly mapped for read/write of SQLServer tables
> ---
>
> Key: SPARK-28151
> URL: https://issues.apache.org/jira/browse/SPARK-28151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Assignee: Apache Spark
>Priority: Minor
>
> Writing dataframe with column type BYTETYPE fails when using JDBC connector 
> for SQL Server. Append and read of tables also fail. The problem is due to: 
> 1. (Write path) Incorrect mapping of BYTETYPE in getCommonJDBCType() in 
> jdbcutils.scala where BYTETYPE gets mapped to BYTE text. It should be mapped 
> to TINYINT
> case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))
> In getCatalystType() ( JDBC to Catalyst type mapping) TINYINT is mapped to 
> INTEGER, while it should be mapped to BYTETYPE. Mapping to integer is ok from 
> the point of view of upcasting, but will lead to 4 byte allocation rather 
> than 1 byte for BYTETYPE.
> 2. (read path) Read path ends up calling makeGetter(dt: DataType, metadata: 
> Metadata). The function sets the value in RDD row. The value is set per the 
> data type. Here there is no mapping for BYTETYPE, and thus this will result 
> in an error when getCatalystType() is fixed.
> Note : These issues were found when reading/writing with SQLServer. Will be 
> submitting a PR soon to fix these mappings in MSSQLServerDialect.
> Error seen when writing table
> (JDBC Write failed,com.microsoft.sqlserver.jdbc.SQLServerException: Column, 
> parameter, or variable #2: *Cannot find data type BYTE*.)
> com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or 
> variable #2: Cannot find data type BYTE.
> com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
> com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
> com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
>  ..
>  
>  
>  
>  
>  
>  
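For illustration, a hedged sketch of the kind of dialect-level fix described above (not the actual PR; the object name is illustrative, but the overridden methods follow the public JdbcDialect API):
{code:java}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

// Sketch only: map ByteType to TINYINT on write, and TINYINT back to ByteType on read.
object MsSqlServerDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  // Write path: emit TINYINT instead of the non-existent BYTE type.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ByteType => Some(JdbcType("TINYINT", Types.TINYINT))
    case _ => None
  }

  // Read path: map TINYINT back to ByteType instead of IntegerType.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (sqlType == Types.TINYINT) Some(ByteType) else None
  }
}
// A custom dialect like this could be registered with JdbcDialects.registerDialect(...).
{code}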



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28151) ByteType is not correctly mapped for read/write of SQLServer tables

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28151:


Assignee: (was: Apache Spark)

> ByteType is not correctly mapped for read/write of SQLServer tables
> ---
>
> Key: SPARK-28151
> URL: https://issues.apache.org/jira/browse/SPARK-28151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
> Writing a dataframe with a column of type ByteType fails when using the JDBC 
> connector for SQL Server. Append and read of such tables also fail. The 
> problem is due to:
> 1. (Write path) Incorrect mapping of ByteType in getCommonJDBCType() in 
> JdbcUtils.scala, where ByteType gets mapped to the text "BYTE". It should be 
> mapped to TINYINT. The current mapping is:
> case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))
> In getCatalystType() (the JDBC to Catalyst type mapping) TINYINT is mapped to 
> INTEGER, while it should be mapped to ByteType. Mapping to INTEGER is ok from 
> the point of view of upcasting, but leads to a 4 byte allocation rather than 
> 1 byte for ByteType.
> 2. (Read path) The read path ends up calling makeGetter(dt: DataType, metadata: 
> Metadata). This function sets the value in the RDD row according to the data 
> type. There is no mapping for ByteType here, so reads will fail once 
> getCatalystType() is fixed.
> Note: These issues were found when reading/writing with SQL Server. A PR to fix 
> these mappings in MSSQLServerDialect will be submitted soon.
> Error seen when writing a table:
> (JDBC write failed, com.microsoft.sqlserver.jdbc.SQLServerException: Column, 
> parameter, or variable #2: *Cannot find data type BYTE*.)
> com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or 
> variable #2: Cannot find data type BYTE.
> com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:254)
> com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1608)
> com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:859)
>  ..
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28164) usage description does not match with shell scripts

2019-06-25 Thread Shivu Sondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872556#comment-16872556
 ] 

Shivu Sondur commented on SPARK-28164:
--

[~hannankan]

 ./sbin/start-slave.sh  

starts properly when the master URL is passed as the first argument. Tested on 
the master branch.

> usage description does not match with shell scripts
> ---
>
> Key: SPARK-28164
> URL: https://issues.apache.org/jira/browse/SPARK-28164
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.3
>Reporter: Hanna Kan
>Priority: Major
>
> I found that "spark/sbin/start-slave.sh" may have an error.
> Line 43 gives: echo "Usage: ./sbin/start-slave.sh [options] "
> but later in this script, line 59 has MASTER=$1.
> Is this a conflict?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28165) SHS does not delete old inprogress files until cleaner.maxAge after SHS start time

2019-06-25 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-28165:


 Summary: SHS does not delete old inprogress files until 
cleaner.maxAge after SHS start time
 Key: SPARK-28165
 URL: https://issues.apache.org/jira/browse/SPARK-28165
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.3, 2.3.3
Reporter: Imran Rashid


The SHS will not delete inprogress files until 
{{spark.history.fs.cleaner.maxAge}} time after it has started (7 days by 
default), regardless of when the last modification to the file was.  This is 
particularly problematic if the SHS gets restarted regularly, as then you'll 
end up never deleting old files.

There might not be much we can do about this -- we can't really trust the 
modification time of the file, as that isn't always updated reliably.

We could take the timestamp of the last event in the file, but then we'd have to 
turn off the optimization of SPARK-6951, which avoids reading the entire file just 
for the listing.

*WORKAROUND*: have the SHS save state across restarts to local disk by 
specifying a path in {{spark.history.store.path}}.  It'll still take 7 days 
from when you add that config for the cleaning to happen, but going forward 
the cleaning should happen reliably.
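
For reference, a minimal sketch of the workaround as spark-defaults.conf entries for the SHS (the store path below is just a placeholder):
{code}
# Persist SHS state (including the application listing) across restarts
spark.history.store.path          /var/lib/spark/history-store
# Cleaner settings mentioned above; maxAge defaults to 7d
spark.history.fs.cleaner.enabled  true
spark.history.fs.cleaner.maxAge   7d
{code}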




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28114) Add Jenkins job for `Hadoop-3.2` profile

2019-06-25 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872530#comment-16872530
 ] 

shane knapp commented on SPARK-28114:
-

[~dongjoon] the `--force` came back because someone manually edited the build 
configs via the Jenkins GUI, and when I auto-generated and deployed them via the 
Jenkins Job Builder configs, those changes got clobbered.

> Add Jenkins job for `Hadoop-3.2` profile
> 
>
> Key: SPARK-28114
> URL: https://issues.apache.org/jira/browse/SPARK-28114
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
>
> Spark 3.0 is a major version change. We want to have the following new jobs:
> 1. SBT with hadoop-3.2
> 2. Maven with hadoop-3.2 (on JDK8 and JDK11)
> Also, shall we put a limit on concurrent runs for the following existing 
> job? Currently, it invokes multiple jobs concurrently. We can save 
> resources by limiting it to 1 like the other jobs.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing
> We will drop four `branch-2.3` jobs at the end of August, 2019.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27622) Avoid the network when block manager fetches disk persisted RDD blocks from the same host

2019-06-25 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27622.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24554
[https://github.com/apache/spark/pull/24554]

> Avoid the network when block manager fetches disk persisted RDD blocks from 
> the same host
> -
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27622) Avoid the network when block manager fetches disk persisted RDD blocks from the same host

2019-06-25 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-27622:
--

Assignee: Attila Zsolt Piros

> Avoid the network when block manager fetches disk persisted RDD blocks from 
> the same host
> -
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28145) Executor pods polling source can fail to replace dead executors

2019-06-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28145:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: New Feature)

> Executor pods polling source can fail to replace dead executors
> ---
>
> Key: SPARK-28145
> URL: https://issues.apache.org/jira/browse/SPARK-28145
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Onur Satici
>Priority: Minor
>
> The scheduled task responsible for reporting executor snapshots to the executor 
> allocator in Kubernetes will die on any error, killing subsequent runs of the 
> same task.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-06-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26985.
---
   Resolution: Fixed
 Assignee: ketan kunde
Fix Version/s: 3.0.0

Resolved by https://github.com/apache/spark/pull/24861

> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Assignee: ketan kunde
>Priority: Major
>  Labels: BigEndian
> Fix For: 3.0.0
>
> Attachments: DataFrameTungstenSuite.txt, 
> InMemoryColumnarQuerySuite.txt, access only some column of the all of 
> columns.txt
>
>
> While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
> observing test failures in 2 suites of the SQL project:
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both cases the test "access only some column of the all of columns" fails 
> due to a mismatch in the final assert.
> I observed that the data obtained after df.cache() is causing the error. Please 
> find the log with the details attached.
> cache() works perfectly fine if double and float values are not in the picture.
> Inside test !!- access only some column of the all of columns *** FAILED 
> ***



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28154) GMM fix double caching

2019-06-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28154.
---
   Resolution: Fixed
 Assignee: zhengruifeng
Fix Version/s: 3.0.0
   2.4.4

Resolved by https://github.com/apache/spark/pull/24919

> GMM fix double caching
> --
>
> Key: SPARK-28154
> URL: https://issues.apache.org/jira/browse/SPARK-28154
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.4.4, 3.0.0
>
>
> The intermediate RDD is always cached. We should only cache it if necessary.
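
For context, a minimal sketch of the usual "cache only if needed" pattern (illustrative names, not the actual GaussianMixture code):
{code:java}
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Sketch: persist the intermediate data only when the caller has not already
// cached it, and unpersist it afterwards so we never double cache.
def withPersistenceIfNeeded[T, R](data: Dataset[T])(body: Dataset[T] => R): R = {
  val handlePersistence = data.storageLevel == StorageLevel.NONE
  if (handlePersistence) data.persist(StorageLevel.MEMORY_AND_DISK)
  try body(data)
  finally if (handlePersistence) data.unpersist()
}
{code}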



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28117) LDA and BisectingKMeans cache the input dataset if necessary

2019-06-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28117:
-

Assignee: zhengruifeng

> LDA and BisectingKMeans cache the input dataset if necessary
> 
>
> Key: SPARK-28117
> URL: https://issues.apache.org/jira/browse/SPARK-28117
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> In MLlib LDA, the EM solver caches the dataset internally, while the Online 
> solver does not.
> So in ML LDA, we need to cache the intermediate dataset if necessary.
>  
> BisectingKMeans needs this too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28117) LDA and BisectingKMeans cache the input dataset if necessary

2019-06-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28117.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24920
[https://github.com/apache/spark/pull/24920]

> LDA and BisectingKMeans cache the input dataset if necessary
> 
>
> Key: SPARK-28117
> URL: https://issues.apache.org/jira/browse/SPARK-28117
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> In MLlib LDA, the EM solver caches the dataset internally, while the Online 
> solver does not.
> So in ML LDA, we need to cache the intermediate dataset if necessary.
>  
> BisectingKMeans needs this too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28117) LDA and BisectingKMeans cache the input dataset if necessary

2019-06-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28117:
--
Priority: Minor  (was: Major)

> LDA and BisectingKMeans cache the input dataset if necessary
> 
>
> Key: SPARK-28117
> URL: https://issues.apache.org/jira/browse/SPARK-28117
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> In MLlib LDA, the EM solver caches the dataset internally, while the Online 
> solver does not.
> So in ML LDA, we need to cache the intermediate dataset if necessary.
>  
> BisectingKMeans needs this too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28045) add missing RankingEvaluator

2019-06-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28045:
--
Priority: Minor  (was: Major)

> add missing RankingEvaluator
> 
>
> Key: SPARK-28045
> URL: https://issues.apache.org/jira/browse/SPARK-28045
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> expose RankingEvaluator
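
For reference, a hedged Scala usage sketch of the evaluator being exposed here, assuming the API that ends up in 3.0 (column names and the `predictions` DataFrame are illustrative):
{code:java}
import org.apache.spark.ml.evaluation.RankingEvaluator

// `predictions` is assumed to hold an array-of-double prediction column and an
// array-of-double label column, as ranking metrics require.
val evaluator = new RankingEvaluator()
  .setMetricName("meanAveragePrecision")
  .setLabelCol("label")
  .setPredictionCol("prediction")
val meanAP = evaluator.evaluate(predictions)
{code}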



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28045) add missing RankingEvaluator

2019-06-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28045:
-

Assignee: zhengruifeng

> add missing RankingEvaluator
> 
>
> Key: SPARK-28045
> URL: https://issues.apache.org/jira/browse/SPARK-28045
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> expose RankingEvaluator



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28045) add missing RankingEvaluator

2019-06-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28045.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24869
[https://github.com/apache/spark/pull/24869]

> add missing RankingEvaluator
> 
>
> Key: SPARK-28045
> URL: https://issues.apache.org/jira/browse/SPARK-28045
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.0.0
>
>
> expose RankingEvaluator



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28164) usage description does not match with shell scripts

2019-06-25 Thread Hanna Kan (JIRA)
Hanna Kan created SPARK-28164:
-

 Summary: usage description does not match with shell scripts
 Key: SPARK-28164
 URL: https://issues.apache.org/jira/browse/SPARK-28164
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 2.4.3
Reporter: Hanna Kan


I found that "spark/sbin/start-slave.sh" may have some error. 

line 43 gives--- echo "Usage: ./sbin/start-slave.sh [options] "

but later this script,  I found line 59  MASTER=$1 

Is this a conflict?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28163) Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and FETCH_OFFSET_RETRY_INTERVAL_MS

2019-06-25 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-28163:
--
Description: 
There are "unsafe" conversions in the Kafka connector.
CaseInsensitiveStringMap comes in which is then converted the following way:
{code:java}
...
options.asScala.toMap
...
{code}
The main problem with this is that in such a case it loses its case-insensitive 
nature
(a case-insensitive map converts the key to lower case when get/contains is 
called).


  was:
There are "unsafe" conversions in the Kafka connector.
CaseInsensitiveStringMap comes in which is then converted the following way:
{code:java}
...
options.asScala.toMap
...
{code}
The main problem with this that such case it looses its case insensitive nature.
Case insensitive map is converting the key to lower case when get/contains 
called.



> Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and 
> FETCH_OFFSET_RETRY_INTERVAL_MS
> -
>
> Key: SPARK-28163
> URL: https://issues.apache.org/jira/browse/SPARK-28163
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> There are "unsafe" conversions in the Kafka connector.
> CaseInsensitiveStringMap comes in which is then converted the following way:
> {code:java}
> ...
> options.asScala.toMap
> ...
> {code}
> The main problem with this is that in such a case it loses its case-insensitive 
> nature
> (a case-insensitive map converts the key to lower case when get/contains is 
> called).
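
To illustrate the loss of case insensitivity, a minimal sketch (the option key is just an example, not a connector constant):
{code:java}
import scala.collection.JavaConverters._
import org.apache.spark.sql.util.CaseInsensitiveStringMap

val options = new CaseInsensitiveStringMap(Map("someOption" -> "5").asJava)

options.containsKey("SOMEOPTION")       // true: lookups lower-case the key

val plain: Map[String, String] = options.asScala.toMap
plain.contains("SOMEOPTION")            // false: plain Map lookups are case-sensitive
{code}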



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28163) Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and FETCH_OFFSET_RETRY_INTERVAL_MS

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28163:


Assignee: Apache Spark

> Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and 
> FETCH_OFFSET_RETRY_INTERVAL_MS
> -
>
> Key: SPARK-28163
> URL: https://issues.apache.org/jira/browse/SPARK-28163
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Major
>
> There are "unsafe" conversions in the Kafka connector.
> CaseInsensitiveStringMap comes in which is then converted the following way:
> {code:java}
> ...
> options.asScala.toMap
> ...
> {code}
> The main problem with this is that in such a case it loses its case-insensitive 
> nature.
> A case-insensitive map converts the key to lower case when get/contains is 
> called.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28163) Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and FETCH_OFFSET_RETRY_INTERVAL_MS

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28163:


Assignee: (was: Apache Spark)

> Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and 
> FETCH_OFFSET_RETRY_INTERVAL_MS
> -
>
> Key: SPARK-28163
> URL: https://issues.apache.org/jira/browse/SPARK-28163
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> There are "unsafe" conversions in the Kafka connector.
> CaseInsensitiveStringMap comes in which is then converted the following way:
> {code:java}
> ...
> options.asScala.toMap
> ...
> {code}
> The main problem with this is that in such a case it loses its case-insensitive 
> nature.
> A case-insensitive map converts the key to lower case when get/contains is 
> called.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28157) Make SHS check Spark event log file permission changes

2019-06-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28157:
--
Description: 
At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to 
the file system, and maintains a blacklist for all event log files failed once 
at reading. The blacklisted log files are released back after CLEAN_INTERVAL_S .

However, the files whose size don't changes are ignored forever because 
shouldReloadLog return false always when the size is the same with the value in 
KVStore. This is recovered only via SHS restart.

  was:
At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to 
the file system, and maintains a permanent blacklist for all event log files 
failed once at reading. Although this reduces a lot of invalid accesses, there 
is no way to see this log files back after the permissions are recovered 
correctly. The only way has been restarting SHS.

Apache Spark is unable to know the permission recovery. However, we had better 
give a second chances for those blacklisted files in a regular manner.


> Make SHS check Spark event log file permission changes
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to 
> the file system, and maintains a blacklist of all event log files that failed 
> once at reading. The blacklisted log files are released back after 
> CLEAN_INTERVAL_S.
> However, files whose size doesn't change are ignored forever, because 
> shouldReloadLog always returns false when the size is the same as the value 
> in the KVStore. This is recovered only via an SHS restart.
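
A simplified sketch of the kind of size-based check involved (illustrative names, not the actual FsHistoryProvider code):
{code:java}
// If reload is gated only on file size, a log whose permissions were fixed but
// whose size is unchanged is never re-read until the SHS restarts.
case class LogInfo(path: String, fileSize: Long)

def shouldReloadLog(stored: LogInfo, currentSize: Long): Boolean =
  currentSize > stored.fileSize
{code}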



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28163) Kafka ignores user configuration on FETCH_OFFSET_NUM_RETRY and FETCH_OFFSET_RETRY_INTERVAL_MS

2019-06-25 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-28163:
-

 Summary: Kafka ignores user configuration on 
FETCH_OFFSET_NUM_RETRY and FETCH_OFFSET_RETRY_INTERVAL_MS
 Key: SPARK-28163
 URL: https://issues.apache.org/jira/browse/SPARK-28163
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Gabor Somogyi


There are "unsafe" conversions in the Kafka connector.
CaseInsensitiveStringMap comes in which is then converted the following way:
{code:java}
...
options.asScala.toMap
...
{code}
The main problem with this is that in such a case it loses its case-insensitive nature.
A case-insensitive map converts the key to lower case when get/contains is 
called.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28157) Make SHS check Spark event log file permission changes

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28157:


Assignee: Apache Spark

> Make SHS check Spark event log file permission changes
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to 
> the file system, and maintains a permanent blacklist of all event log files 
> that failed once at reading. Although this reduces a lot of invalid accesses, 
> there is no way to see these log files again after the permissions are 
> recovered correctly. The only way has been restarting the SHS.
> Apache Spark is unable to detect the permission recovery. However, we had 
> better give those blacklisted files a second chance on a regular basis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28157) Make SHS check Spark event log file permission changes

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28157:


Assignee: (was: Apache Spark)

> Make SHS check Spark event log file permission changes
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to 
> the file system, and maintains a permanent blacklist of all event log files 
> that failed once at reading. Although this reduces a lot of invalid accesses, 
> there is no way to see these log files again after the permissions are 
> recovered correctly. The only way has been restarting the SHS.
> Apache Spark is unable to detect the permission recovery. However, we had 
> better give those blacklisted files a second chance on a regular basis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28162) approximSimilarityJoin creating a bottleneck

2019-06-25 Thread Simone Iovane (JIRA)
Simone Iovane created SPARK-28162:
-

 Summary: approximSimilarityJoin creating a bottleneck
 Key: SPARK-28162
 URL: https://issues.apache.org/jira/browse/SPARK-28162
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib, Scheduler, Spark Core
Affects Versions: 2.4.3
Reporter: Simone Iovane


Hi, I am using Spark MLlib and doing an approxSimilarityJoin between a 1M dataset 
and a 1k dataset.
When I do it I broadcast the 1k one.
What I see is that the job stops making progress at the second-to-last task.
All the executors are dead except one, which keeps running for a very long time 
until it hits an OutOfMemoryError.
I checked Ganglia and it shows memory rising until it reaches the 
limit [!https://i.stack.imgur.com/gfhGg.png!|https://i.stack.imgur.com/gfhGg.png]

and the disk space keeps going down until it runs out:
[!https://i.stack.imgur.com/vbEmG.png!|https://i.stack.imgur.com/vbEmG.png]
The action I called is a write, but it does the same with count.
Now I wonder: is it possible that all the partitions in the cluster converge to 
only one node, creating this bottleneck? Is it a bug in the function?

Here is my code snippet:
{code:java}
var dfW = cookesWb.withColumn("n", monotonically_increasing_id())
var bunchDf = dfW.filter(col("n").geq(0) && col("n").lt(100))
bunchDf.repartition(3000)
model
  .approxSimilarityJoin(bunchDf, broadcast(cookesNextLimited), 80, "EuclideanDistance")
  .withColumn("min_distance",
    min(col("EuclideanDistance")).over(Window.partitionBy(col("datasetA.uid"))))
  .filter(col("EuclideanDistance") === col("min_distance"))
  .select(
    col("datasetA.uid").alias("weboId"),
    col("datasetB.nextploraId").alias("nextId"),
    col("EuclideanDistance"))
  .write.format("parquet").mode("overwrite").save("approxJoin.parquet")
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-28157) Make SHS check Spark event log file permission changes

2019-06-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-28157:
---

> Make SHS check Spark event log file permission changes
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to 
> the file system, and maintains a permanent blacklist of all event log files 
> that failed once at reading. Although this reduces a lot of invalid accesses, 
> there is no way to see these log files again after the permissions are 
> recovered correctly. The only way has been restarting the SHS.
> Apache Spark is unable to detect the permission recovery. However, we had 
> better give those blacklisted files a second chance on a regular basis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-28157) Make SHS check Spark event log file permission changes

2019-06-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28157:
--
Comment: was deleted

(was: My bad. This issue is invalid.)

> Make SHS check Spark event log file permission changes
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> At Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegated access permission checks to 
> the file system, and maintains a permanent blacklist of all event log files 
> that failed once at reading. Although this reduces a lot of invalid accesses, 
> there is no way to see these log files again after the permissions are 
> recovered correctly. The only way has been restarting the SHS.
> Apache Spark is unable to detect the permission recovery. However, we had 
> better give those blacklisted files a second chance on a regular basis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27463:


Assignee: Apache Spark

> Support Dataframe Cogroup via Pandas UDFs 
> --
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Chris Martin
>Assignee: Apache Spark
>Priority: Major
>
> Recent work on Pandas UDFs in Spark has allowed for improved 
> interoperability between Pandas and Spark. This proposal aims to extend this 
> by introducing a new Pandas UDF type which would allow a cogroup 
> operation to be applied to two PySpark DataFrames.
> Full details are in the Google document linked below.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27463) Support Dataframe Cogroup via Pandas UDFs

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27463:


Assignee: (was: Apache Spark)

> Support Dataframe Cogroup via Pandas UDFs 
> --
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Chris Martin
>Priority: Major
>
> Recent work on Pandas UDFs in Spark has allowed for improved 
> interoperability between Pandas and Spark. This proposal aims to extend this 
> by introducing a new Pandas UDF type which would allow a cogroup 
> operation to be applied to two PySpark DataFrames.
> Full details are in the Google document linked below.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28160) TransportClient.sendRpcSync may hang forever

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28160:


Assignee: (was: Apache Spark)

> TransportClient.sendRpcSync may hang forever
> 
>
> Key: SPARK-28160
> URL: https://issues.apache.org/jira/browse/SPARK-28160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Lantao Jin
>Priority: Major
>
> This is very similar to 
> [SPARK-26665|https://issues.apache.org/jira/browse/SPARK-26665].
> `ByteBuffer.allocate` may throw OutOfMemoryError when the response is large 
> but not enough memory is available. However, when this happens, 
> TransportClient.sendRpcSync will just hang forever if the timeout is set to 
> unlimited.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28160) TransportClient.sendRpcSync may hang forever

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28160:


Assignee: Apache Spark

> TransportClient.sendRpcSync may hang forever
> 
>
> Key: SPARK-28160
> URL: https://issues.apache.org/jira/browse/SPARK-28160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Major
>
> This is very similar to 
> [SPARK-26665|https://issues.apache.org/jira/browse/SPARK-26665].
> `ByteBuffer.allocate` may throw OutOfMemoryError when the response is large 
> but not enough memory is available. However, when this happens, 
> TransportClient.sendRpcSync will just hang forever if the timeout is set to 
> unlimited.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics

2019-06-25 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872216#comment-16872216
 ] 

Steve Loughran commented on SPARK-28091:


(FWIW, I really like codahale; lines up very nicely with scala closures for 
on-demand eval)

> Extend Spark metrics system with executor plugin metrics
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to improve Spark instrumentation by adding a hook for Spark 
> executor plugin metrics to the Spark metrics system implemented with the 
> Dropwizard/Codahale library.
> Context: The Spark metrics system provides a large variety of metrics, see 
> also SPARK-26890, useful to monitor and troubleshoot Spark workloads. A 
> typical workflow is to sink the metrics to a storage system and build 
> dashboards on top of that.
> Improvement: The original goal of this work was to add instrumentation for S3 
> filesystem access metrics by Spark jobs. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code 
> there, we propose to add a metrics plugin system which is more flexible and 
> of more general use.
> Advantages:
>  * The metric plugin system makes it easy to implement instrumentation for S3 
> access by Spark jobs.
>  * The metrics plugin system allows for easy extensions of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> Filesystem GetAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop Filesystem recommend using method 
> GetGlobalStorageStatistics, which also provides several additional metrics. 
> GetGlobalStorageStatistics is not available in Hadoop 2.7 (had been 
> introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an 
> easy way to “opt in” using such new API calls for those deploying suitable 
> Hadoop versions.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop compliant filesystem in use in our organization (EOS using the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug in filesystem 
> and other metrics to the Spark monitoring system. Future work on the plugin 
> implementation can address extending monitoring to measure usage of external 
> resources (OS, filesystem, network, accelerator cards, etc.) that may not 
> normally be considered general enough for inclusion in Apache Spark code, 
> but that can nevertheless be useful for specialized use cases, tests or 
> troubleshooting.
> Implementation:
> The proposed implementation is currently a WIP open for comments and 
> improvements. It is based on the work on Executor Plugin of SPARK-24918 and 
> builds on recent work on extending Spark executor metrics, such as SPARK-25228
> Tests and examples:
> So far, this has been manually tested running Spark on YARN and K8S clusters, 
> in particular for monitoring S3 and for extending HDFS instrumentation with 
> the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric 
> plugin example and code used for testing are available.
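
For illustration, a hedged sketch of what such a plugin might look like, combining the existing ExecutorPlugin interface from SPARK-24918 with Dropwizard metrics; the registerGauge hook is hypothetical and stands in for whatever registration API the WIP implementation ends up exposing:
{code:java}
import java.util.concurrent.atomic.AtomicLong
import com.codahale.metrics.Gauge
import org.apache.spark.ExecutorPlugin

// Hypothetical sketch: `registerGauge` is NOT an existing Spark API; it stands in
// for the hook this proposal would add to expose plugin metrics via Dropwizard.
class S3AMetricsPlugin extends ExecutorPlugin {
  private val bytesRead = new AtomicLong(0)

  override def init(): Unit = {
    // Imagined registration hook into the executor's Dropwizard MetricRegistry.
    registerGauge("s3a.bytesRead", new Gauge[Long] {
      override def getValue: Long = bytesRead.get()
    })
  }

  override def shutdown(): Unit = ()

  // Placeholder for the proposed API; does nothing here.
  private def registerGauge(name: String, gauge: Gauge[Long]): Unit = ()
}
{code}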



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics

2019-06-25 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872214#comment-16872214
 ] 

Steve Loughran commented on SPARK-28091:


Thanks for the links. FWIW, if you call toString() on an S3A input stream you 
get a dump of the stream-specific stats; call it on an FS instance and you get 
the full FS stats. That's for logging rather than anything else.

The Impala team would actually like to get at those stream stats, but I've been 
-1 to date as I don't want them getting at unstable internals (see 
HADOOP-16379).

I wonder: if we added an accessor to StorageStatistics, then for those S3A 
versions/input streams which offered it you'd only need to ask for it (with 
some reflection work to stay compatible with older versions). We could 
still get away with only using long values to count (i.e. no slower atomic 
values), and make the iterator() call create a snapshot of the values to 
iterate over.

Would that be of interest?
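
A rough sketch of the reflection-based access hinted at above (assumptions: Hadoop 2.8+ exposes FileSystem.getStorageStatistics(); on older versions the method is simply absent, so we probe for it and fall back to an empty result):
{code:java}
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileSystem

// Sketch only: probe for FileSystem.getStorageStatistics() via reflection so the
// same code still works against Hadoop 2.7, where the method does not exist.
def storageStatsSnapshot(fs: FileSystem): Map[String, Long] = {
  try {
    val stats = fs.getClass.getMethod("getStorageStatistics").invoke(fs)
    val iter = stats.getClass.getMethod("getLongStatistics").invoke(stats)
      .asInstanceOf[java.util.Iterator[_]]
    iter.asScala.map { entry =>
      val name = entry.getClass.getMethod("getName").invoke(entry).asInstanceOf[String]
      val value = entry.getClass.getMethod("getValue").invoke(entry).asInstanceOf[java.lang.Long]
      name -> value.longValue()
    }.toMap
  } catch {
    case _: ReflectiveOperationException => Map.empty // Hadoop < 2.8: no storage statistics API
  }
}
{code}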

> Extend Spark metrics system with executor plugin metrics
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to improve Spark instrumentation by adding a hook for Spark 
> executor plugin metrics to the Spark metrics system implemented with the 
> Dropwizard/Codahale library.
> Context: The Spark metrics system provides a large variety of metrics, see 
> also SPARK-26890, useful to monitor and troubleshoot Spark workloads. A 
> typical workflow is to sink the metrics to a storage system and build 
> dashboards on top of that.
> Improvement: The original goal of this work was to add instrumentation for S3 
> filesystem access metrics by Spark jobs. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code 
> there, we propose to add a metrics plugin system which is more flexible and 
> of more general use.
> Advantages:
>  * The metric plugin system makes it easy to implement instrumentation for S3 
> access by Spark jobs.
>  * The metrics plugin system allows for easy extensions of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> Filesystem GetAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop Filesystem recommend using method 
> GetGlobalStorageStatistics, which also provides several additional metrics. 
> GetGlobalStorageStatistics is not available in Hadoop 2.7 (had been 
> introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an 
> easy way to “opt in” using such new API calls for those deploying suitable 
> Hadoop versions.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop compliant filesystem in use in our organization (EOS using the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug in filesystem 
> and other metrics to the Spark monitoring system. Future work on the plugin 
> implementation can address extending monitoring to measure usage of external 
> resources (OS, filesystem, network, accelerator cards, etc.) that may not 
> normally be considered general enough for inclusion in Apache Spark code, 
> but that can nevertheless be useful for specialized use cases, tests or 
> troubleshooting.
> Implementation:
> The proposed implementation is currently a WIP open for comments and 
> improvements. It is based on the work on Executor Plugin of SPARK-24918 and 
> builds on recent work on extending Spark executor metrics, such as SPARK-25228
> Tests and examples:
> So far, this has been manually tested running Spark on YARN and K8S clusters, 
> in particular for monitoring S3 and for extending HDFS instrumentation with 
> the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric 
> plugin example and code used for testing are available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28161) Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212

2019-06-25 Thread Martin Nigsch (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Nigsch updated SPARK-28161:
--
Description: 
Trying to build Spark from source locally, based on the attached Dockerfile 
(launched with Docker on OSX), fails. 

Attempts to change/add the following things beyond what's recommended on the 
build page do not bring improvement: 

1. adding {{RUN ./dev/change-scala-version.sh 2.11}} --> doesn't help
2. editing the pom.xml to exclude zinc as in one of the answers in 
[https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558]
 --> doesn't help
3. adding the option -DrecompileMode=all --> doesn't help

I've downloaded the Java from Oracle directly (jdk-8u212-linux-x64.tar), which 
is manually put into /usr/java, as the Oracle Java seems to be recommended. 

The build fails at the Spark Project SQL module with:


{code}
[INFO] 
[INFO] Reactor Summary for Spark Project Parent POM 2.4.3:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [ 59.875 s]
[INFO] Spark Project Tags . SUCCESS [ 20.386 s]
[INFO] Spark Project Sketch ... SUCCESS [ 3.026 s]
[INFO] Spark Project Local DB . SUCCESS [ 5.654 s]
[INFO] Spark Project Networking ... SUCCESS [ 7.401 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 3.400 s]
[INFO] Spark Project Unsafe ... SUCCESS [ 6.306 s]
[INFO] Spark Project Launcher . SUCCESS [ 17.471 s]
[INFO] Spark Project Core . SUCCESS [02:36 min]
[INFO] Spark Project ML Local Library . SUCCESS [ 50.313 s]
[INFO] Spark Project GraphX ... SUCCESS [ 21.097 s]
[INFO] Spark Project Streaming  SUCCESS [ 52.537 s]
[INFO] Spark Project Catalyst . SUCCESS [02:44 min]
[INFO] Spark Project SQL .. FAILURE [10:44 min]
[INFO] Spark Project ML Library ... SKIPPED
[INFO] Spark Project Tools  SKIPPED
[INFO] Spark Project Hive . SKIPPED
[INFO] Spark Project REPL . SKIPPED
[INFO] Spark Project Assembly . SKIPPED
[INFO] Spark Integration for Kafka 0.10 ... SKIPPED
[INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
[INFO] Spark Avro . SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 20:15 min
[INFO] Finished at: 2019-06-25T09:45:49Z
[INFO] 
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-sql_2.11: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.: CompileFailed -> 
[Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn  -rf :spark-sql_2.11
The command '/bin/sh -c ./build/mvn -DskipTests clean package' returned a 
non-zero code: 1
{code}

Any help? I've been stuck on this for 2 days, hence I'm raising this issue. 

  was:
Trying to build spark from source based on the Dockerfile attached locally 
(launched on docker on OSX) fails. 

Attempts to change/add the following things beyond what's recommended on the 
build page do not bring improvement: 

1. adding ```*RUN ./dev/change-scala-version.sh 2.11```* --> doesn't help
2. editing the pom.xml to exclude zinc as in one of the answers in 
[https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558]
 --> doesn't help
3. adding options -DrecompileMode=all   --> doesn't help

I've downloaded the java from Oracle directly ( jdk-8u212-linux-x64.tar ) which 
is manually put into /usr/java as the Oracle java seems to be recommended. 

Build fails at project streaming with:


{code}
[INFO] 

[jira] [Updated] (SPARK-28161) Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212

2019-06-25 Thread Martin Nigsch (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Nigsch updated SPARK-28161:
--
Description: 
Trying to build Spark from source locally, based on the attached Dockerfile 
(launched with Docker on OSX), fails. 

Attempts to change/add the following things beyond what's recommended on the 
build page do not bring improvement: 

1. adding {{RUN ./dev/change-scala-version.sh 2.11}} --> doesn't help
2. editing the pom.xml to exclude zinc as in one of the answers in 
[https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558]
 --> doesn't help
3. adding the option -DrecompileMode=all --> doesn't help

I've downloaded the Java from Oracle directly (jdk-8u212-linux-x64.tar), which 
is manually put into /usr/java, as the Oracle Java seems to be recommended. 

The build fails at the Spark Project SQL module with:


{code}
[INFO] 
[INFO] Reactor Summary for Spark Project Parent POM 2.4.3:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [ 59.875 s]
[INFO] Spark Project Tags . SUCCESS [ 20.386 s]
[INFO] Spark Project Sketch ... SUCCESS [ 3.026 s]
[INFO] Spark Project Local DB . SUCCESS [ 5.654 s]
[INFO] Spark Project Networking ... SUCCESS [ 7.401 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 3.400 s]
[INFO] Spark Project Unsafe ... SUCCESS [ 6.306 s]
[INFO] Spark Project Launcher . SUCCESS [ 17.471 s]
[INFO] Spark Project Core . SUCCESS [02:36 min]
[INFO] Spark Project ML Local Library . SUCCESS [ 50.313 s]
[INFO] Spark Project GraphX ... SUCCESS [ 21.097 s]
[INFO] Spark Project Streaming  SUCCESS [ 52.537 s]
[INFO] Spark Project Catalyst . SUCCESS [02:44 min]
[INFO] Spark Project SQL .. FAILURE [10:44 min]
[INFO] Spark Project ML Library ... SKIPPED
[INFO] Spark Project Tools  SKIPPED
[INFO] Spark Project Hive . SKIPPED
[INFO] Spark Project REPL . SKIPPED
[INFO] Spark Project Assembly . SKIPPED
[INFO] Spark Integration for Kafka 0.10 ... SKIPPED
[INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
[INFO] Spark Avro . SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 20:15 min
[INFO] Finished at: 2019-06-25T09:45:49Z
[INFO] 
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-sql_2.11: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.: CompileFailed -> 
[Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn  -rf :spark-sql_2.11

{code}


The command '/bin/sh -c ./build/mvn -DskipTests clean package' returned a 
non-zero code: 1
Any help? I've been stuck on this for 2 days, hence I'm raising this issue. 

  was:
Trying to build spark from source based on the Dockerfile attached locally 
(launched on docker on OSX) fails. 

Attempts to change/add the following things beyond what's recommended on the 
build page do not bring improvement: 

1. adding ```*RUN ./dev/change-scala-version.sh 2.11```* --> doesn't help
2. editing the pom.xml to exclude zinc as in one of the answers in 
[https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558]
 --> doesn't help
3. adding options -DrecompileMode=all   --> doesn't help

I've downloaded the java from Oracle directly ( jdk-8u212-linux-x64.tar ) which 
is manually put into /usr/java as the Oracle java seems to be recommended. 

Build fails at project streaming with:

```
[INFO] 

[jira] [Updated] (SPARK-28161) Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212

2019-06-25 Thread Martin Nigsch (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Nigsch updated SPARK-28161:
--
Environment: 
{code}
# Dockerfile
# Pull base image.
FROM ubuntu:16.04

RUN apt update --fix-missing
RUN apt-get install -y software-properties-common

RUN mkdir /usr/java
ADD jdk-8u212-linux-x64.tar /usr/java

ENV JAVA_HOME=/usr/java/jdk1.8.0_212
RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
2
RUN update-alternatives --install /usr/bin/javac javac 
${JAVA_HOME%*/}/bin/javac 2
ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"

ENV MAVEN_VERSION 3.6.1

RUN apt-get install -y curl wget
RUN curl -fsSL 
http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
 | tar xzf - -C /usr/share \
 && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
 && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn

ENV MAVEN_HOME /usr/share/maven
ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

ENV SPARK_SRC="/usr/src/spark"
ENV BRANCH="v2.4.3"

RUN apt-get update && apt-get install -y --no-install-recommends \
 git python3 python3-setuptools r-base-dev r-cran-evaluate

RUN mkdir -p $SPARK_SRC
RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC

WORKDIR $SPARK_SRC

RUN ./build/mvn -DskipTests clean package
{code}


  was:

{code:docker}
# Dockerfile
# Pull base image.
FROM ubuntu:16.04

RUN apt update --fix-missing
RUN apt-get install -y software-properties-common

RUN mkdir /usr/java
ADD jdk-8u212-linux-x64.tar /usr/java

ENV JAVA_HOME=/usr/java/jdk1.8.0_212
RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
2
RUN update-alternatives --install /usr/bin/javac javac 
${JAVA_HOME%*/}/bin/javac 2
ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"

ENV MAVEN_VERSION 3.6.1

RUN apt-get install -y curl wget
RUN curl -fsSL 
http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
 | tar xzf - -C /usr/share \
 && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
 && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn

ENV MAVEN_HOME /usr/share/maven
ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

ENV SPARK_SRC="/usr/src/spark"
ENV BRANCH="v2.4.3"

RUN apt-get update && apt-get install -y --no-install-recommends \
 git python3 python3-setuptools r-base-dev r-cran-evaluate

RUN mkdir -p $SPARK_SRC
RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC

WORKDIR $SPARK_SRC

RUN ./build/mvn -DskipTests clean package
{code}



> Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212
> --
>
> Key: SPARK-28161
> URL: https://issues.apache.org/jira/browse/SPARK-28161
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.3
> Environment: {code}
> # Dockerfile
> # Pull base image.
> FROM ubuntu:16.04
> RUN apt update --fix-missing
> RUN apt-get install -y software-properties-common
> RUN mkdir /usr/java
> ADD jdk-8u212-linux-x64.tar /usr/java
> ENV JAVA_HOME=/usr/java/jdk1.8.0_212
> RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
> 2
> RUN update-alternatives --install /usr/bin/javac javac 
> ${JAVA_HOME%*/}/bin/javac 2
> ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"
> ENV MAVEN_VERSION 3.6.1
> RUN apt-get install -y curl wget
> RUN curl -fsSL 
> http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
>  | tar xzf - -C /usr/share \
>  && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
>  && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn
> ENV MAVEN_HOME /usr/share/maven
> ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
> ENV SPARK_SRC="/usr/src/spark"
> ENV BRANCH="v2.4.3"
> RUN apt-get update && apt-get install -y --no-install-recommends \
>  git python3 python3-setuptools r-base-dev r-cran-evaluate
> RUN mkdir -p $SPARK_SRC
> RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC
> WORKDIR $SPARK_SRC
> RUN ./build/mvn -DskipTests clean package
> {code}
>Reporter: Martin Nigsch
>Priority: Minor
>
> Trying to build Spark from source locally, based on the attached Dockerfile 
> (run with Docker on OSX), fails. 
> Attempts to change/add the following things beyond what's recommended on the 
> build page bring no improvement: 
> 1. adding {{RUN ./dev/change-scala-version.sh 2.11}} --> doesn't help
> 2. editing the pom.xml to exclude zinc, as in one of the answers in 
> [https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558]
>  --> doesn't help
> 3. adding the option -DrecompileMode=all --> doesn't help
> I've downloaded the 

[jira] [Updated] (SPARK-28161) Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212

2019-06-25 Thread Martin Nigsch (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Nigsch updated SPARK-28161:
--
Environment: 

{code:docker}
# Dockerfile
# Pull base image.
FROM ubuntu:16.04

RUN apt update --fix-missing
RUN apt-get install -y software-properties-common

RUN mkdir /usr/java
ADD jdk-8u212-linux-x64.tar /usr/java

ENV JAVA_HOME=/usr/java/jdk1.8.0_212
RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
2
RUN update-alternatives --install /usr/bin/javac javac 
${JAVA_HOME%*/}/bin/javac 2
ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"

ENV MAVEN_VERSION 3.6.1

RUN apt-get install -y curl wget
RUN curl -fsSL 
http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
 | tar xzf - -C /usr/share \
 && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
 && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn

ENV MAVEN_HOME /usr/share/maven
ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

ENV SPARK_SRC="/usr/src/spark"
ENV BRANCH="v2.4.3"

RUN apt-get update && apt-get install -y --no-install-recommends \
 git python3 python3-setuptools r-base-dev r-cran-evaluate

RUN mkdir -p $SPARK_SRC
RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC

WORKDIR $SPARK_SRC

RUN ./build/mvn -DskipTests clean package
{code}


  was:
{quote}# Dockerfile
# Pull base image.
FROM ubuntu:16.04

RUN apt update --fix-missing
RUN apt-get install -y software-properties-common

RUN mkdir /usr/java
ADD jdk-8u212-linux-x64.tar /usr/java

ENV JAVA_HOME=/usr/java/jdk1.8.0_212
RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
2
RUN update-alternatives --install /usr/bin/javac javac 
${JAVA_HOME%*/}/bin/javac 2
ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"

ENV MAVEN_VERSION 3.6.1

RUN apt-get install -y curl wget
RUN curl -fsSL 
http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
 | tar xzf - -C /usr/share \
 && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
 && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn

ENV MAVEN_HOME /usr/share/maven
ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

ENV SPARK_SRC="/usr/src/spark"
ENV BRANCH="v2.4.3"

RUN apt-get update && apt-get install -y --no-install-recommends \
 git python3 python3-setuptools r-base-dev r-cran-evaluate

RUN mkdir -p $SPARK_SRC
RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC

WORKDIR $SPARK_SRC

RUN ./build/mvn -DskipTests clean package{quote}


> Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212
> --
>
> Key: SPARK-28161
> URL: https://issues.apache.org/jira/browse/SPARK-28161
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.3
> Environment: {code:docker}
> # Dockerfile
> # Pull base image.
> FROM ubuntu:16.04
> RUN apt update --fix-missing
> RUN apt-get install -y software-properties-common
> RUN mkdir /usr/java
> ADD jdk-8u212-linux-x64.tar /usr/java
> ENV JAVA_HOME=/usr/java/jdk1.8.0_212
> RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
> 2
> RUN update-alternatives --install /usr/bin/javac javac 
> ${JAVA_HOME%*/}/bin/javac 2
> ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"
> ENV MAVEN_VERSION 3.6.1
> RUN apt-get install -y curl wget
> RUN curl -fsSL 
> http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
>  | tar xzf - -C /usr/share \
>  && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
>  && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn
> ENV MAVEN_HOME /usr/share/maven
> ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
> ENV SPARK_SRC="/usr/src/spark"
> ENV BRANCH="v2.4.3"
> RUN apt-get update && apt-get install -y --no-install-recommends \
>  git python3 python3-setuptools r-base-dev r-cran-evaluate
> RUN mkdir -p $SPARK_SRC
> RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC
> WORKDIR $SPARK_SRC
> RUN ./build/mvn -DskipTests clean package
> {code}
>Reporter: Martin Nigsch
>Priority: Minor
>
> Trying to build Spark from source locally, based on the attached Dockerfile 
> (run with Docker on OSX), fails. 
> Attempts to change/add the following things beyond what's recommended on the 
> build page bring no improvement: 
> 1. adding {{RUN ./dev/change-scala-version.sh 2.11}} --> doesn't help
> 2. editing the pom.xml to exclude zinc, as in one of the answers in 
> [https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558]
>  --> doesn't help
> 3. adding the option -DrecompileMode=all --> doesn't help
> I've downloaded 

[jira] [Updated] (SPARK-28161) Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212

2019-06-25 Thread Martin Nigsch (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Nigsch updated SPARK-28161:
--
Environment: 
{quote}# Dockerfile
# Pull base image.
FROM ubuntu:16.04

RUN apt update --fix-missing
RUN apt-get install -y software-properties-common

RUN mkdir /usr/java
ADD jdk-8u212-linux-x64.tar /usr/java

ENV JAVA_HOME=/usr/java/jdk1.8.0_212
RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
2
RUN update-alternatives --install /usr/bin/javac javac 
${JAVA_HOME%*/}/bin/javac 2
ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"

ENV MAVEN_VERSION 3.6.1

RUN apt-get install -y curl wget
RUN curl -fsSL 
http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
 | tar xzf - -C /usr/share \
 && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
 && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn

ENV MAVEN_HOME /usr/share/maven
ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

ENV SPARK_SRC="/usr/src/spark"
ENV BRANCH="v2.4.3"

RUN apt-get update && apt-get install -y --no-install-recommends \
 git python3 python3-setuptools r-base-dev r-cran-evaluate

RUN mkdir -p $SPARK_SRC
RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC

WORKDIR $SPARK_SRC

RUN ./build/mvn -DskipTests clean package{quote}

  was:
# Dockerfile
# Pull base image.
FROM ubuntu:16.04

RUN apt update --fix-missing
RUN apt-get install -y software-properties-common

RUN mkdir /usr/java
ADD jdk-8u212-linux-x64.tar /usr/java

ENV JAVA_HOME=/usr/java/jdk1.8.0_212
RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
2
RUN update-alternatives --install /usr/bin/javac javac 
${JAVA_HOME%*/}/bin/javac 2
ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"

ENV MAVEN_VERSION 3.6.1

RUN apt-get install -y curl wget
RUN curl -fsSL 
http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
 | tar xzf - -C /usr/share \
 && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
 && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn

ENV MAVEN_HOME /usr/share/maven
ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

ENV SPARK_SRC="/usr/src/spark"
ENV BRANCH="v2.4.3"

RUN apt-get update && apt-get install -y --no-install-recommends \
 git python3 python3-setuptools r-base-dev r-cran-evaluate

RUN mkdir -p $SPARK_SRC
RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC

WORKDIR $SPARK_SRC

RUN ./build/mvn -DskipTests clean package


> Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212
> --
>
> Key: SPARK-28161
> URL: https://issues.apache.org/jira/browse/SPARK-28161
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.3
> Environment: {quote}# Dockerfile
> # Pull base image.
> FROM ubuntu:16.04
> RUN apt update --fix-missing
> RUN apt-get install -y software-properties-common
> RUN mkdir /usr/java
> ADD jdk-8u212-linux-x64.tar /usr/java
> ENV JAVA_HOME=/usr/java/jdk1.8.0_212
> RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
> 2
> RUN update-alternatives --install /usr/bin/javac javac 
> ${JAVA_HOME%*/}/bin/javac 2
> ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"
> ENV MAVEN_VERSION 3.6.1
> RUN apt-get install -y curl wget
> RUN curl -fsSL 
> http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
>  | tar xzf - -C /usr/share \
>  && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
>  && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn
> ENV MAVEN_HOME /usr/share/maven
> ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
> ENV SPARK_SRC="/usr/src/spark"
> ENV BRANCH="v2.4.3"
> RUN apt-get update && apt-get install -y --no-install-recommends \
>  git python3 python3-setuptools r-base-dev r-cran-evaluate
> RUN mkdir -p $SPARK_SRC
> RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC
> WORKDIR $SPARK_SRC
> RUN ./build/mvn -DskipTests clean package{quote}
>Reporter: Martin Nigsch
>Priority: Minor
>
> Trying to build Spark from source locally, based on the attached Dockerfile 
> (run with Docker on OSX), fails. 
> Attempts to change/add the following things beyond what's recommended on the 
> build page bring no improvement: 
> 1. adding {{RUN ./dev/change-scala-version.sh 2.11}} --> doesn't help
> 2. editing the pom.xml to exclude zinc, as in one of the answers in 
> [https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558]
>  --> doesn't help
> 3. adding the option -DrecompileMode=all --> doesn't help
> I've downloaded the java from Oracle directly ( 

[jira] [Updated] (SPARK-28161) Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212

2019-06-25 Thread Martin Nigsch (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Nigsch updated SPARK-28161:
--
Environment: 
# Dockerfile
# Pull base image.
FROM ubuntu:16.04

RUN apt update --fix-missing
RUN apt-get install -y software-properties-common

RUN mkdir /usr/java
ADD jdk-8u212-linux-x64.tar /usr/java

ENV JAVA_HOME=/usr/java/jdk1.8.0_212
RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
2
RUN update-alternatives --install /usr/bin/javac javac 
${JAVA_HOME%*/}/bin/javac 2
ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"

ENV MAVEN_VERSION 3.6.1

RUN apt-get install -y curl wget
RUN curl -fsSL 
http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
 | tar xzf - -C /usr/share \
 && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
 && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn

ENV MAVEN_HOME /usr/share/maven
ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

ENV SPARK_SRC="/usr/src/spark"
ENV BRANCH="v2.4.3"

RUN apt-get update && apt-get install -y --no-install-recommends \
 git python3 python3-setuptools r-base-dev r-cran-evaluate

RUN mkdir -p $SPARK_SRC
RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC

WORKDIR $SPARK_SRC

RUN ./build/mvn -DskipTests clean package

  was:
# Pull base image.
FROM ubuntu:16.04

RUN apt update --fix-missing
RUN apt-get install -y software-properties-common

RUN mkdir /usr/java
ADD jdk-8u212-linux-x64.tar /usr/java

ENV JAVA_HOME=/usr/java/jdk1.8.0_212
RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
2
RUN update-alternatives --install /usr/bin/javac javac 
${JAVA_HOME%*/}/bin/javac 2
ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"

ENV MAVEN_VERSION 3.6.1

RUN apt-get install -y curl wget
RUN curl -fsSL 
http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
 | tar xzf - -C /usr/share \
 && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
 && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn

ENV MAVEN_HOME /usr/share/maven

# Define commonly used JAVA_HOME variable
ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

ENV SPARK_SRC="/usr/src/spark"
ENV BRANCH="v2.4.3"

RUN apt-get update && apt-get install -y --no-install-recommends \
 git python3 python3-setuptools r-base-dev r-cran-evaluate

RUN mkdir -p $SPARK_SRC
RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC

WORKDIR $SPARK_SRC

RUN ./build/mvn -DskipTests clean package
#RUN ./dev/make-distribution.sh -e --name mn-spark-2.3.3 --pip --r --tgz 
-Psparkr -Phadoop-2.7 -Pmesos -Pyarn -Pkubernetes


> Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212
> --
>
> Key: SPARK-28161
> URL: https://issues.apache.org/jira/browse/SPARK-28161
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.3
> Environment: # Dockerfile
> # Pull base image.
> FROM ubuntu:16.04
> RUN apt update --fix-missing
> RUN apt-get install -y software-properties-common
> RUN mkdir /usr/java
> ADD jdk-8u212-linux-x64.tar /usr/java
> ENV JAVA_HOME=/usr/java/jdk1.8.0_212
> RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
> 2
> RUN update-alternatives --install /usr/bin/javac javac 
> ${JAVA_HOME%*/}/bin/javac 2
> ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"
> ENV MAVEN_VERSION 3.6.1
> RUN apt-get install -y curl wget
> RUN curl -fsSL 
> http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
>  | tar xzf - -C /usr/share \
>  && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
>  && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn
> ENV MAVEN_HOME /usr/share/maven
> ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
> ENV SPARK_SRC="/usr/src/spark"
> ENV BRANCH="v2.4.3"
> RUN apt-get update && apt-get install -y --no-install-recommends \
>  git python3 python3-setuptools r-base-dev r-cran-evaluate
> RUN mkdir -p $SPARK_SRC
> RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC
> WORKDIR $SPARK_SRC
> RUN ./build/mvn -DskipTests clean package
>Reporter: Martin Nigsch
>Priority: Minor
>
> Trying to build Spark from source locally, based on the attached Dockerfile 
> (run with Docker on OSX), fails. 
> Attempts to change/add the following things beyond what's recommended on the 
> build page bring no improvement: 
> 1. adding {{RUN ./dev/change-scala-version.sh 2.11}} --> doesn't help
> 2. editing the pom.xml to exclude zinc, as in one of the answers in 
> [https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558]
>  --> 

[jira] [Created] (SPARK-28161) Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212

2019-06-25 Thread Martin Nigsch (JIRA)
Martin Nigsch created SPARK-28161:
-

 Summary: Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 
SDK v212
 Key: SPARK-28161
 URL: https://issues.apache.org/jira/browse/SPARK-28161
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.3
 Environment: # Pull base image.
FROM ubuntu:16.04

RUN apt update --fix-missing
RUN apt-get install -y software-properties-common

RUN mkdir /usr/java
ADD jdk-8u212-linux-x64.tar /usr/java

ENV JAVA_HOME=/usr/java/jdk1.8.0_212
RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
2
RUN update-alternatives --install /usr/bin/javac javac 
${JAVA_HOME%*/}/bin/javac 2
ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"

ENV MAVEN_VERSION 3.6.1

RUN apt-get install -y curl wget
RUN curl -fsSL 
http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
 | tar xzf - -C /usr/share \
 && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
 && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn

ENV MAVEN_HOME /usr/share/maven

# Define commonly used JAVA_HOME variable
ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

ENV SPARK_SRC="/usr/src/spark"
ENV BRANCH="v2.4.3"

RUN apt-get update && apt-get install -y --no-install-recommends \
 git python3 python3-setuptools r-base-dev r-cran-evaluate

RUN mkdir -p $SPARK_SRC
RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC

WORKDIR $SPARK_SRC

RUN ./build/mvn -DskipTests clean package
#RUN ./dev/make-distribution.sh -e --name mn-spark-2.3.3 --pip --r --tgz 
-Psparkr -Phadoop-2.7 -Pmesos -Pyarn -Pkubernetes
Reporter: Martin Nigsch


Trying to build Spark from source locally, based on the attached Dockerfile 
(run with Docker on OSX), fails. 

Attempts to change/add the following things beyond what's recommended on the 
build page bring no improvement: 

1. adding {{RUN ./dev/change-scala-version.sh 2.11}} --> doesn't help
2. editing the pom.xml to exclude zinc, as in one of the answers in 
[https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558]
 --> doesn't help
3. adding the option -DrecompileMode=all --> doesn't help

I've downloaded the JDK from Oracle directly (jdk-8u212-linux-x64.tar), which 
is manually unpacked into /usr/java, as the Oracle JDK seems to be recommended. 

The build fails at the Spark Project SQL module with:

```
[INFO] 
[INFO] Reactor Summary for Spark Project Parent POM 2.4.3:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [ 59.875 s]
[INFO] Spark Project Tags . SUCCESS [ 20.386 s]
[INFO] Spark Project Sketch ... SUCCESS [ 3.026 s]
[INFO] Spark Project Local DB . SUCCESS [ 5.654 s]
[INFO] Spark Project Networking ... SUCCESS [ 7.401 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 3.400 s]
[INFO] Spark Project Unsafe ... SUCCESS [ 6.306 s]
[INFO] Spark Project Launcher . SUCCESS [ 17.471 s]
[INFO] Spark Project Core . SUCCESS [02:36 min]
[INFO] Spark Project ML Local Library . SUCCESS [ 50.313 s]
[INFO] Spark Project GraphX ... SUCCESS [ 21.097 s]
[INFO] Spark Project Streaming  SUCCESS [ 52.537 s]
[INFO] Spark Project Catalyst . SUCCESS [02:44 min]
[INFO] Spark Project SQL .. FAILURE [10:44 min]
[INFO] Spark Project ML Library ... SKIPPED
[INFO] Spark Project Tools  SKIPPED
[INFO] Spark Project Hive . SKIPPED
[INFO] Spark Project REPL . SKIPPED
[INFO] Spark Project Assembly . SKIPPED
[INFO] Spark Integration for Kafka 0.10 ... SKIPPED
[INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
[INFO] Spark Avro . SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 20:15 min
[INFO] Finished at: 2019-06-25T09:45:49Z
[INFO] 
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-sql_2.11: Execution scala-compile-first of goal 

[jira] [Created] (SPARK-28160) TransportClient.sendRpcSync may hang forever

2019-06-25 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-28160:
--

 Summary: TransportClient.sendRpcSync may hang forever
 Key: SPARK-28160
 URL: https://issues.apache.org/jira/browse/SPARK-28160
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.3, 2.3.3, 3.0.0
Reporter: Lantao Jin


This is very similar to 
[SPARK-26665|https://issues.apache.org/jira/browse/SPARK-26665].
`ByteBuffer.allocate` may throw OutOfMemoryError when the response is large but 
not enough memory is available. However, when this happens, 
TransportClient.sendRpcSync will just hang forever if the timeout is set to 
unlimited.
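
For illustration, a minimal sketch of the failure mode described above (an 
assumption for clarity, not Spark's actual TransportClient code): if the 
response callback dies with an Error such as OutOfMemoryError instead of 
completing the promise, a caller waiting without a time limit never wakes up.

{code:scala}
import java.nio.ByteBuffer

import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration

// Hedged sketch of the hang described above; names are illustrative and do
// not correspond to Spark's real RPC classes.
object SendRpcSyncHangSketch {
  def sendRpcSync(response: Array[Byte], timeout: Duration): Array[Byte] = {
    val result = Promise[Array[Byte]]()
    val callback = new Runnable {
      override def run(): Unit =
        try {
          val buf = ByteBuffer.allocate(response.length) // may throw OutOfMemoryError
          buf.put(response)
          result.success(buf.array())
        } catch {
          // Only Exceptions are handled here; an OutOfMemoryError (an Error)
          // escapes, so `result` is never completed and the caller hangs.
          case e: Exception => result.failure(e)
        }
    }
    new Thread(callback).start()
    Await.result(result.future, timeout) // blocks forever when timeout == Duration.Inf
  }
}
{code}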



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28036) Built-in udf left/right has inconsistent behavior

2019-06-25 Thread Shivu Sondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872045#comment-16872045
 ] 

Shivu Sondur edited comment on SPARK-28036 at 6/25/19 8:30 AM:
---

[~yumwang]
 select left('ahoj', 2), right('ahoj', 2);
 Used without the '-' sign, it works fine.

I tested this on the latest Spark.

 


was (Author: shivuson...@gmail.com):
[~yumwang]
select left('ahoj', 2), right('ahoj', 2);
use with '-' sign, it will work fine,

i tested in latest spark

 

> Built-in udf left/right has inconsistent behavior
> -
>
> Key: SPARK-28036
> URL: https://issues.apache.org/jira/browse/SPARK-28036
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> PostgreSQL:
> {code:sql}
> postgres=# select left('ahoj', -2), right('ahoj', -2);
>  left | right 
> ------+-------
>  ah   | oj
> (1 row)
> {code}
> Spark SQL:
> {code:sql}
> spark-sql> select left('ahoj', -2), right('ahoj', -2);
> spark-sql>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28159) Make the transform natively in ml framework to avoid extra conversion

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28159:


Assignee: (was: Apache Spark)

> Make the transform natively in ml framework to avoid extra conversion
> -
>
> Key: SPARK-28159
> URL: https://issues.apache.org/jira/browse/SPARK-28159
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> It has been a long time since ML was released.
> However, there are still many TODOs on making the transforms native in the 
> ML framework.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28159) Make the transform natively in ml framework to avoid extra conversion

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28159:


Assignee: Apache Spark

> Make the transform natively in ml framework to avoid extra conversion
> -
>
> Key: SPARK-28159
> URL: https://issues.apache.org/jira/browse/SPARK-28159
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> It has been a long time since ML was released.
> However, there are still many TODOs on making the transforms native in the 
> ML framework.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28159) Make the transform natively in ml framework to avoid extra conversion

2019-06-25 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28159:


 Summary: Make the transform natively in ml framework to avoid 
extra conversion
 Key: SPARK-28159
 URL: https://issues.apache.org/jira/browse/SPARK-28159
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


It has been a long time since ML was released.

However, there are still many TODOs on making the transforms native in the 
ML framework.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics

2019-06-25 Thread Luca Canali (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872075#comment-16872075
 ] 

Luca Canali commented on SPARK-28091:
-

Thank you [~ste...@apache.org] for your comment and clarifications. Indeed, what 
I am trying to do is to collect executor-level metrics for S3A (and also for 
other Hadoop-compatible filesystems of interest). The goal is to bring the 
metrics into the Spark metrics system, so that they can be used, for example, 
in a performance dashboard and displayed together with the rest of the 
instrumentation metrics.

The original work on this started from the need to measure I/O metrics for a 
custom HDFS-compatible filesystem that we use (called ROOT:) and, more 
recently, also for S3A. The first implementation we did was simple: a small 
change in [[ExecutorSource]], which already has code to collect metrics for 
"hdfs" and "file"/local filesystems at the executor level. That code is 
obviously very easy to extend, but going that way feels like a short-term hack.

My thought with this PR is to provide a flexible method to add instrumentation, 
driven by our current use case of I/O workload monitoring, but also open to 
several other use cases.
I am also quite interested to see developments in this area for CPU counters 
and possibly also GPU-related instrumentation.
I think the proposal to use executor plugins for this goes in the original 
direction outlined by [~irashid] and collaborators in SPARK-24918.

Some links to reference and related material: the code of a few test executor 
metrics plugins that I am developing is at 
[https://github.com/cerndb/SparkExecutorPlugins], and the general idea of how 
to build a dashboard with Spark metrics is described in 
[https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark]
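
For illustration only, a minimal sketch of what such an executor metrics plugin 
could look like. This assumes the hypothetical hook under discussion here (a 
plugin receiving the executor's Dropwizard MetricRegistry); the class name and 
init signature are illustrative, not a final API.

{code:scala}
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.hadoop.fs.FileSystem

import scala.collection.JavaConverters._

// Hedged sketch: assumes a hypothetical hook that hands the executor's
// Dropwizard MetricRegistry to a user plugin, as discussed in this ticket.
class S3ABytesReadPluginSketch {
  def init(registry: MetricRegistry): Unit = {
    registry.register("filesystem.s3a.bytesRead", new Gauge[Long] {
      // Uses the deprecated FileSystem.getAllStatistics mentioned above,
      // purely as an example of a value a plugin could expose per executor.
      override def getValue: Long =
        FileSystem.getAllStatistics.asScala
          .filter(_.getScheme == "s3a")
          .map(_.getBytesRead)
          .sum
    })
  }
}
{code}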

> Extend Spark metrics system with executor plugin metrics
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to improve Spark instrumentation by adding a hook for Spark 
> executor plugin metrics to the Spark metrics systems implemented with the 
> Dropwizard/Codahale library.
> Context: The Spark metrics system provides a large variety of metrics, see 
> also SPARK-26890, useful to monitor and troubleshoot Spark workloads. A 
> typical workflow is to sink the metrics to a storage system and build 
> dashboards on top of that.
> Improvement: The original goal of this work was to add instrumentation for S3 
> filesystem access metrics by Spark job. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code 
> there, we propose to add a metrics plugin system, which is more flexible and 
> of more general use.
> Advantages:
>  * The metric plugin system makes it easy to implement instrumentation for S3 
> access by Spark jobs.
>  * The metrics plugin system allows for easy extensions of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> Filesystem GetAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop Filesystem recommend using method 
> GetGlobalStorageStatistics, which also provides several additional metrics. 
> GetGlobalStorageStatistics is not available in Hadoop 2.7 (it was 
> introduced in Hadoop 2.8). Using a metric plugin for Spark would provide an 
> easy way to “opt in” to such new API calls for those deploying suitable 
> Hadoop versions.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop compliant filesystem in use in our organization (EOS using the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug in Filesystem 
> and other metrics to the Spark monitoring system. Future work on plugin 
> implementation can address extending monitoring to measure usage of external 
> resources (OS, filesystem, network, accelerator cards, etc.) that might not 
> normally be considered general enough for inclusion in Apache Spark code, 
> but that can be nevertheless useful for specialized use cases, tests or 
> troubleshooting.
> Implementation:
> The proposed implementation is currently a WIP open for comments and 
> improvements. It is based on the work on Executor Plugin of SPARK-24918 and 
> builds on recent work on extending Spark executor metrics, such as SPARK-25228
> Tests and examples:
> This has been so far manually tested running Spark on YARN and K8S clusters, 
> in 

[jira] [Assigned] (SPARK-28149) Disable negative DNS caching

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28149:


Assignee: (was: Apache Spark)

> Disable negative DNS caching
> -
>
> Key: SPARK-28149
> URL: https://issues.apache.org/jira/browse/SPARK-28149
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Priority: Minor
>
> By default the JVM caches DNS resolution failures; the default negative 
> cache TTL is 10 seconds.
> The Alpine JDK used in the Kubernetes images has a default timeout of 5 
> seconds.
> This means that in clusters with slow init time (network sidecar pods, slow 
> network start-up) executors will never run, because the first attempt to 
> connect to the driver will fail and that failure will be cached, causing 
> the retries to happen in a tight loop without actually resolving again.
>  
> The proposed implementation would be to extend entrypoint.sh (which is 
> exclusive to k8s) to alter the file that controls DNS caching, and disable 
> negative caching if an environment variable such as 
> "DISABLE_DNS_NEGATIVE_CACHING" is defined. 
>  
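
For illustration only, a minimal sketch of the same idea done programmatically 
rather than via entrypoint.sh (an assumption, not the proposed change): the 
JVM's negative DNS cache is controlled by the 
{{networkaddress.cache.negative.ttl}} security property, and setting it to 0 
before the first lookup disables caching of failed resolutions.

{code:scala}
import java.security.Security

// Hedged sketch only; the ticket proposes editing the java.security file from
// entrypoint.sh, this just shows the equivalent programmatic switch.
object DisableNegativeDnsCacheSketch {
  def apply(): Unit = {
    if (sys.env.contains("DISABLE_DNS_NEGATIVE_CACHING")) {
      // Must run before the first InetAddress lookup to take effect.
      Security.setProperty("networkaddress.cache.negative.ttl", "0")
    }
  }
}
{code}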



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28149) Disable negative DNS caching

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28149:


Assignee: Apache Spark

> Disable negative DNS caching
> -
>
> Key: SPARK-28149
> URL: https://issues.apache.org/jira/browse/SPARK-28149
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Jose Luis Pedrosa
>Assignee: Apache Spark
>Priority: Minor
>
> By default the JVM caches DNS resolution failures; the default negative 
> cache TTL is 10 seconds.
> The Alpine JDK used in the Kubernetes images has a default timeout of 5 
> seconds.
> This means that in clusters with slow init time (network sidecar pods, slow 
> network start-up) executors will never run, because the first attempt to 
> connect to the driver will fail and that failure will be cached, causing 
> the retries to happen in a tight loop without actually resolving again.
>  
> The proposed implementation would be to extend entrypoint.sh (which is 
> exclusive to k8s) to alter the file that controls DNS caching, and disable 
> negative caching if an environment variable such as 
> "DISABLE_DNS_NEGATIVE_CACHING" is defined. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28158) Hive UDFs supports UDT type

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28158:


Assignee: (was: Apache Spark)

> Hive UDFs supports UDT type
> ---
>
> Key: SPARK-28158
> URL: https://issues.apache.org/jira/browse/SPARK-28158
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Genmao Yu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28158) Hive UDFs supports UDT type

2019-06-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28158:


Assignee: Apache Spark

> Hive UDFs supports UDT type
> ---
>
> Key: SPARK-28158
> URL: https://issues.apache.org/jira/browse/SPARK-28158
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Genmao Yu
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28158) Hive UDFs supports UDT type

2019-06-25 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-28158:
-

 Summary: Hive UDFs supports UDT type
 Key: SPARK-28158
 URL: https://issues.apache.org/jira/browse/SPARK-28158
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3, 3.0.0
Reporter: Genmao Yu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28157) Make SHS check Spark event log file permission changes

2019-06-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28157.
---
Resolution: Invalid

My bad. This issue is invalid.

> Make SHS check Spark event log file permission changes
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegates access permission checks 
> to the file system and maintains a permanent blacklist of all event log files 
> that failed once at reading. Although this avoids a lot of invalid accesses, 
> there is no way to see these log files again after the permissions are 
> recovered correctly; the only way has been restarting the SHS.
> Apache Spark cannot detect the permission recovery. However, we had 
> better give those blacklisted files a second chance on a regular basis.
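
For illustration, a hedged sketch of the "second chance" idea described above 
(not the History Server's actual code; names and the expiry window are 
illustrative): blacklist entries expire instead of living forever, so a log 
file whose permissions were fixed is retried on a later scan.

{code:scala}
import java.util.concurrent.TimeUnit

import com.google.common.cache.CacheBuilder

// Hedged sketch only; not the SHS blacklist implementation.
object ExpiringBlacklistSketch {
  // Entries expire after a retry window instead of being permanent.
  private val blacklist = CacheBuilder.newBuilder()
    .expireAfterWrite(1, TimeUnit.HOURS) // window value is illustrative
    .build[String, java.lang.Boolean]()

  def markFailed(logPath: String): Unit = blacklist.put(logPath, java.lang.Boolean.TRUE)

  // Skip a log file only while its failure entry has not expired yet.
  def shouldSkip(logPath: String): Boolean = blacklist.getIfPresent(logPath) != null
}
{code}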



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28036) Built-in udf left/right has inconsistent behavior

2019-06-25 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872050#comment-16872050
 ] 

Yuming Wang commented on SPARK-28036:
-


{code:sql}
[root@spark-3267648 spark-3.0.0-SNAPSHOT-bin-3.2.0]# bin/spark-shell
19/06/24 23:10:12 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at http://spark-3267648.lvs02.dev.ebayc3.com:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1561443018277).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
  /_/

Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select left('ahoj', -2), right('ahoj', -2)").show
+----------------+-----------------+
|left('ahoj', -2)|right('ahoj', -2)|
+----------------+-----------------+
|                |                 |
+----------------+-----------------+


scala> spark.sql("select left('ahoj', 2), right('ahoj', 2)").show
+---------------+----------------+
|left('ahoj', 2)|right('ahoj', 2)|
+---------------+----------------+
|             ah|              oj|
+---------------+----------------+
{code}
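
For completeness, a hedged workaround sketch (not part of this ticket): 
PostgreSQL's negative length drops characters from the opposite end, which can 
be emulated with substring/length until left/right follow that behaviour.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged sketch only: emulating PostgreSQL's negative-length semantics for
// left/right using substring/length in Spark SQL.
object NegativeLeftRightSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("left-right").getOrCreate()
    // left('ahoj', -2) in PostgreSQL drops the last 2 characters ('ah');
    // right('ahoj', -2) drops the first 2 characters ('oj').
    spark.sql(
      "SELECT substring('ahoj', 1, length('ahoj') - 2) AS pg_left, " +
        "substring('ahoj', 3) AS pg_right").show()
    spark.stop()
  }
}
{code}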

> Built-in udf left/right has inconsistent behavior
> -
>
> Key: SPARK-28036
> URL: https://issues.apache.org/jira/browse/SPARK-28036
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> PostgreSQL:
> {code:sql}
> postgres=# select left('ahoj', -2), right('ahoj', -2);
>  left | right 
> ------+-------
>  ah   | oj
> (1 row)
> {code}
> Spark SQL:
> {code:sql}
> spark-sql> select left('ahoj', -2), right('ahoj', -2);
> spark-sql>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28036) Built-in udf left/right has inconsistent behavior

2019-06-25 Thread Shivu Sondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872045#comment-16872045
 ] 

Shivu Sondur commented on SPARK-28036:
--

[~yumwang]
select left('ahoj', 2), right('ahoj', 2);
use with '-' sign, it will work fine,

i tested in latest spark

 

> Built-in udf left/right has inconsistent behavior
> -
>
> Key: SPARK-28036
> URL: https://issues.apache.org/jira/browse/SPARK-28036
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> PostgreSQL:
> {code:sql}
> postgres=# select left('ahoj', -2), right('ahoj', -2);
>  left | right 
> ------+-------
>  ah   | oj
> (1 row)
> {code}
> Spark SQL:
> {code:sql}
> spark-sql> select left('ahoj', -2), right('ahoj', -2);
> spark-sql>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org