[jira] [Resolved] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts

2017-07-24 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K resolved SPARK-15142.
---
Resolution: Duplicate

> Spark Mesos dispatcher becomes unusable when the Mesos master restarts
> --
>
> Key: SPARK-15142
> URL: https://issues.apache.org/jira/browse/SPARK-15142
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Devaraj K
>Priority: Minor
> Attachments: 
> spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out
>
>
> While the Spark Mesos dispatcher is running, if the Mesos master gets restarted, 
> the dispatcher keeps running but only queues up all the submitted applications 
> and never launches them.






[jira] [Commented] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts

2017-07-24 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099535#comment-16099535
 ] 

Devaraj K commented on SPARK-15142:
---

[~skonto] Thanks for showing interest in this. I have already created a PR for 
SPARK-15359, which fixes this issue. 

> Spark Mesos dispatcher becomes unusable when the Mesos master restarts
> --
>
> Key: SPARK-15142
> URL: https://issues.apache.org/jira/browse/SPARK-15142
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Devaraj K
>Priority: Minor
> Attachments: 
> spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out
>
>
> While the Spark Mesos dispatcher is running, if the Mesos master gets restarted, 
> the dispatcher keeps running but only queues up all the submitted applications 
> and never launches them.






[jira] [Commented] (SPARK-16784) Configurable log4j settings

2017-07-24 Thread HanCheol Cho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099477#comment-16099477
 ] 

HanCheol Cho commented on SPARK-16784:
--

Hi, 

I used the following options, which allow both the driver and the executors to use 
a custom log4j.properties:
{code}
spark-submit \
  --driver-java-options=-Dlog4j.configuration=conf/log4j.properties \
  --files conf/log4j.properties \
  --conf "spark.executor.extraJavaOptions='-Dlog4j.configuration=log4j.properties'" \
  ...
{code}
I used a local log4j.properties file for the driver and the copy placed in the 
distributed cache (by the --files option) for the executors.
As shown above, the paths passed to the driver and to the executors are different.

I hope this is useful to others.
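
For reference, a minimal log4j.properties of the kind shipped above could look like 
the following. The logger names and levels here are only an example, not the actual 
file from this setup:
{code}
# Example conf/log4j.properties (illustrative only)
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Raise verbosity only for your own application packages (example package name)
log4j.logger.com.example.myapp=DEBUG
{code}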



> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Michael Gummelt
>
> I often want to change the logging configuration on a single Spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a Java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.






[jira] [Commented] (SPARK-21517) Fetch local data via block manager cause oom

2017-07-24 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099412#comment-16099412
 ] 

zhoukang commented on SPARK-21517:
--

Can anyone help verify the patch related to this issue? Thanks very much.

> Fetch local data via block manager cause oom
> 
>
> Key: SPARK-21517
> URL: https://issues.apache.org/jira/browse/SPARK-21517
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> In our production cluster, OOM happens when NettyBlockRpcServer receives an 
> OpenBlocks message. The reason we observed is below:
> When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses 
> Unpooled.wrappedBuffer(ByteBuffer... buffers), which uses the default 
> maxNumComponents=16 in the low-level CompositeByteBuf. When the number of 
> components is bigger than 16, the following consolidation is executed in 
> CompositeByteBuf:
> {code:java}
> private void consolidateIfNeeded() {
>     int numComponents = this.components.size();
>     if (numComponents > this.maxNumComponents) {
>         int capacity = ((CompositeByteBuf.Component) this.components.get(numComponents - 1)).endOffset;
>         ByteBuf consolidated = this.allocBuffer(capacity);
>         for (int c = 0; c < numComponents; ++c) {
>             CompositeByteBuf.Component c1 = (CompositeByteBuf.Component) this.components.get(c);
>             ByteBuf b = c1.buf;
>             consolidated.writeBytes(b);
>             c1.freeIfNecessary();
>         }
>         CompositeByteBuf.Component var7 = new CompositeByteBuf.Component(consolidated);
>         var7.endOffset = var7.length;
>         this.components.clear();
>         this.components.add(var7);
>     }
> }
> {code}
> This copies every component into a newly allocated buffer and therefore consumes 
> extra memory during the buffer copy.
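
For context, here is a small standalone sketch (not the patch under review) of how 
the copy can be avoided by passing an explicit maxNumComponents when wrapping many 
buffers; the chunk count and sizes are made up for illustration:
{code}
import java.nio.ByteBuffer

import io.netty.buffer.Unpooled

object WrappedBufferSketch {
  def main(args: Array[String]): Unit = {
    // Pretend we have many small chunks, as ChunkedByteBuffer does.
    val chunks: Array[ByteBuffer] = Array.fill(64)(ByteBuffer.allocate(1024))

    // Default overload: maxNumComponents is 16, so once the 17th component is
    // added the CompositeByteBuf consolidates, copying all chunks into one buffer.
    val consolidated = Unpooled.wrappedBuffer(chunks: _*)

    // Overload with an explicit maxNumComponents keeps the components as-is.
    val zeroCopy = Unpooled.wrappedBuffer(chunks.length, chunks: _*)

    println(s"components after default wrap: ${consolidated.nioBufferCount()}")
    println(s"components after explicit wrap: ${zeroCopy.nioBufferCount()}")
  }
}
{code}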






[jira] [Commented] (SPARK-21517) Fetch local data via block manager cause oom

2017-07-24 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099411#comment-16099411
 ] 

zhoukang commented on SPARK-21517:
--

Can anyone help verify the patch related to this issue?

> Fetch local data via block manager cause oom
> 
>
> Key: SPARK-21517
> URL: https://issues.apache.org/jira/browse/SPARK-21517
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> In our production cluster, OOM happens when NettyBlockRpcServer receives an 
> OpenBlocks message. The reason we observed is below:
> When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses 
> Unpooled.wrappedBuffer(ByteBuffer... buffers), which uses the default 
> maxNumComponents=16 in the low-level CompositeByteBuf. When the number of 
> components is bigger than 16, the following consolidation is executed in 
> CompositeByteBuf:
> {code:java}
> private void consolidateIfNeeded() {
>     int numComponents = this.components.size();
>     if (numComponents > this.maxNumComponents) {
>         int capacity = ((CompositeByteBuf.Component) this.components.get(numComponents - 1)).endOffset;
>         ByteBuf consolidated = this.allocBuffer(capacity);
>         for (int c = 0; c < numComponents; ++c) {
>             CompositeByteBuf.Component c1 = (CompositeByteBuf.Component) this.components.get(c);
>             ByteBuf b = c1.buf;
>             consolidated.writeBytes(b);
>             c1.freeIfNecessary();
>         }
>         CompositeByteBuf.Component var7 = new CompositeByteBuf.Component(consolidated);
>         var7.endOffset = var7.length;
>         this.components.clear();
>         this.components.add(var7);
>     }
> }
> {code}
> This copies every component into a newly allocated buffer and therefore consumes 
> extra memory during the buffer copy.






[jira] [Commented] (SPARK-21521) History service requires user is in any group

2017-07-24 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099392#comment-16099392
 ] 

Saisai Shao commented on SPARK-21521:
-

[~vanzin], I guess so. In the current logic of {{checkAccessPermission}} we 
don't differentiate special users, so the "root" user here is still just a normal 
user. Let me verify it.

> History service requires user is in any group
> -
>
> Key: SPARK-21521
> URL: https://issues.apache.org/jira/browse/SPARK-21521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Adrian Bridgett
>
> (Regression cf. 2.0.2)
> We run Spark as several users; these write to the history location, where the 
> files are saved as those users with permissions of 770 (this is hardcoded in 
> EventLoggingListener.scala).
> The history service runs as root so that it has permissions on these files 
> (see https://spark.apache.org/docs/latest/security.html).
> This worked fine in v2.0.2; however, in v2.2.0 the events are being skipped 
> unless I add the root user into each user's group, at which point they are seen.
> We currently have all acls configuration unset.






[jira] [Commented] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections

2017-07-24 Thread Iurii Antykhovych (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099390#comment-16099390
 ] 

Iurii Antykhovych commented on SPARK-21491:
---

Done, could you please re-check?

> Performance enhancement: eliminate creation of intermediate collections
> ---
>
> Key: SPARK-21491
> URL: https://issues.apache.org/jira/browse/SPARK-21491
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.2.0
>Reporter: Iurii Antykhovych
>Priority: Trivial
>
> Simple performance optimization in a few places of GraphX:
> {{Traversable.toMap}} can be replaced with {{collection.breakOut}}.
> This would eliminate the creation of an intermediate collection of tuples; see the
> [Stack Overflow 
> article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]
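
To illustrate the proposed pattern with a generic example (not one of the actual 
GraphX call sites):
{code}
import scala.collection.breakOut

object BreakOutSketch {
  def main(args: Array[String]): Unit = {
    val xs = Seq(1, 2, 3, 4)

    // Builds an intermediate Seq[(Int, Int)] and then converts it to a Map.
    val viaToMap: Map[Int, Int] = xs.map(x => x -> x * x).toMap

    // breakOut supplies a Map builder directly, so no intermediate Seq is created.
    val viaBreakOut: Map[Int, Int] = xs.map(x => x -> x * x)(breakOut)

    assert(viaToMap == viaBreakOut)
  }
}
{code}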






[jira] [Commented] (SPARK-21501) Spark shuffle index cache size should be memory based

2017-07-24 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099351#comment-16099351
 ] 

Kazuaki Ishizaki commented on SPARK-21501:
--

I see. I misunderstood the description.
You expect that the memory cache would still be used even when the number of entries 
is larger than {{spark.shuffle.service.index.cache.entries}}, as long as the total 
cache size is not large.

> Spark shuffle index cache size should be memory based
> -
>
> Key: SPARK-21501
> URL: https://issues.apache.org/jira/browse/SPARK-21501
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> Right now the Spark shuffle service has a cache for index files. It is based 
> on the number of files cached (spark.shuffle.service.index.cache.entries). This can 
> cause issues if people have a lot of reducers, because the size of each entry 
> can fluctuate based on the number of reducers. 
> We saw an issue with a job that had 17 reducers, and it caused the NM with the 
> Spark shuffle service to use 700-800MB of memory in the NM by itself.
> We should change this cache to be memory based and only allow a certain 
> amount of memory to be used. When I say memory based, I mean the cache should have a 
> limit of, say, 100MB.






[jira] [Commented] (SPARK-21526) Add support to ML LogisticRegression for setting initial model

2017-07-24 Thread John Brock (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099331#comment-16099331
 ] 

John Brock commented on SPARK-21526:


Related to SPARK-21386.

> Add support to ML LogisticRegression for setting initial model
> --
>
> Key: SPARK-21526
> URL: https://issues.apache.org/jira/browse/SPARK-21526
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: John Brock
>
> Make it possible to set the initial model when training a logistic regression 
> model.






[jira] [Created] (SPARK-21526) Add support to ML LogisticRegression for setting initial model

2017-07-24 Thread John Brock (JIRA)
John Brock created SPARK-21526:
--

 Summary: Add support to ML LogisticRegression for setting initial 
model
 Key: SPARK-21526
 URL: https://issues.apache.org/jira/browse/SPARK-21526
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.1.1
Reporter: John Brock


Make it possible to set the initial model when training a logistic regression 
model.






[jira] [Commented] (SPARK-21524) ValidatorParamsSuiteHelpers generates wrong temp files

2017-07-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099313#comment-16099313
 ] 

yuhao yang commented on SPARK-21524:


https://github.com/apache/spark/pull/18728

> ValidatorParamsSuiteHelpers generates wrong temp files
> --
>
> Key: SPARK-21524
> URL: https://issues.apache.org/jira/browse/SPARK-21524
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> ValidatorParamsSuiteHelpers.testFileMove() is generating temp dirs in the 
> wrong place and does not delete them.






[jira] [Created] (SPARK-21525) ReceiverSupervisorImpl seems to ignore the error code when writing to the WAL

2017-07-24 Thread Mark Grover (JIRA)
Mark Grover created SPARK-21525:
---

 Summary: ReceiverSupervisorImpl seems to ignore the error code 
when writing to the WAL
 Key: SPARK-21525
 URL: https://issues.apache.org/jira/browse/SPARK-21525
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.2.0
Reporter: Mark Grover


{{AddBlock}} returns an error code related to whether writing the block to the 
WAL was successful or not. In cases where a WAL may be unavailable temporarily, 
the write would fail but it seems like we are not using the return code (see 
[here|https://github.com/apache/spark/blob/ba8912e5f3d5c5a366cb3d1f6be91f2471d048d2/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala#L162]).

For example, when using the Flume receiver, we should be sending a nack (negative 
acknowledgement) back to Flume if the block wasn't written to the WAL. I haven't 
gone through the full code path yet, but at least from looking at 
ReceiverSupervisorImpl, it doesn't seem like that return code is being used.






[jira] [Commented] (SPARK-21523) Fix bug of strong wolfe linesearch `init` parameter lose effectiveness

2017-07-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099229#comment-16099229
 ] 

Joseph K. Bradley commented on SPARK-21523:
---

CC [~yanboliang] [~yuhaoyan] [~dbtsai], making a few people aware of this.

> Fix bug of strong wolfe linesearch `init` parameter lose effectiveness
> --
>
> Key: SPARK-21523
> URL: https://issues.apache.org/jira/browse/SPARK-21523
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We need merge this breeze bugfix into spark because it influence a series of 
> algos in MLlib which use LBFGS.
> https://github.com/scalanlp/breeze/pull/651






[jira] [Updated] (SPARK-21523) Fix bug of strong wolfe linesearch `init` parameter lose effectiveness

2017-07-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21523:
--
Description: 
We need merge this breeze bugfix into spark because it influence a series of 
algos in MLlib which use LBFGS.
https://github.com/scalanlp/breeze/pull/651

  was:
We need merge this breeze bugfix into spark because it influence a series of 
algos in MLLib which use LBFGS.
https://github.com/scalanlp/breeze/pull/651


> Fix bug of strong wolfe linesearch `init` parameter lose effectiveness
> --
>
> Key: SPARK-21523
> URL: https://issues.apache.org/jira/browse/SPARK-21523
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We need merge this breeze bugfix into spark because it influence a series of 
> algos in MLlib which use LBFGS.
> https://github.com/scalanlp/breeze/pull/651






[jira] [Created] (SPARK-21524) ValidatorParamsSuiteHelpers generates wrong temp files

2017-07-24 Thread yuhao yang (JIRA)
yuhao yang created SPARK-21524:
--

 Summary: ValidatorParamsSuiteHelpers generates wrong temp files
 Key: SPARK-21524
 URL: https://issues.apache.org/jira/browse/SPARK-21524
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang


ValidatorParamsSuiteHelpers.testFileMove() is generating temp dirs in the wrong 
place and does not delete them.






[jira] [Commented] (SPARK-21523) Fix bug of strong wolfe linesearch `init` parameter lose effectiveness

2017-07-24 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099225#comment-16099225
 ] 

Weichen Xu commented on SPARK-21523:


I will work on this once breeze cuts a new version containing this bugfix.

> Fix bug of strong wolfe linesearch `init` parameter lose effectiveness
> --
>
> Key: SPARK-21523
> URL: https://issues.apache.org/jira/browse/SPARK-21523
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We need merge this breeze bugfix into spark because it influence a series of 
> algos in MLLib which use LBFGS.
> https://github.com/scalanlp/breeze/pull/651






[jira] [Updated] (SPARK-21523) Fix bug of strong wolfe linesearch `init` parameter lose effectiveness

2017-07-24 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-21523:
---
Priority: Minor  (was: Major)

> Fix bug of strong wolfe linesearch `init` parameter lose effectiveness
> --
>
> Key: SPARK-21523
> URL: https://issues.apache.org/jira/browse/SPARK-21523
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We need merge this breeze bugfix into spark because it influence a series of 
> algos in MLLib which use LBFGS.
> https://github.com/scalanlp/breeze/pull/651






[jira] [Created] (SPARK-21523) Fix bug of strong wolfe linesearch `init` parameter lose effectiveness

2017-07-24 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-21523:
--

 Summary: Fix bug of strong wolfe linesearch `init` parameter lose 
effectiveness
 Key: SPARK-21523
 URL: https://issues.apache.org/jira/browse/SPARK-21523
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.2.0
Reporter: Weichen Xu


We need merge this breeze bugfix into spark because it influence a series of 
algos in MLLib which use LBFGS.
https://github.com/scalanlp/breeze/pull/651






[jira] [Comment Edited] (SPARK-21521) History service requires user is in any group

2017-07-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099114#comment-16099114
 ] 

Marcelo Vanzin edited comment on SPARK-21521 at 7/24/17 8:54 PM:
-

Just to double-correct myself, it should really be {{3777}}.


was (Author: vanzin):
Just do double-correct myself, it should really be {{3777}}.

> History service requires user is in any group
> -
>
> Key: SPARK-21521
> URL: https://issues.apache.org/jira/browse/SPARK-21521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Adrian Bridgett
>
> (Regression cf. 2.0.2)
> We run Spark as several users; these write to the history location, where the 
> files are saved as those users with permissions of 770 (this is hardcoded in 
> EventLoggingListener.scala).
> The history service runs as root so that it has permissions on these files 
> (see https://spark.apache.org/docs/latest/security.html).
> This worked fine in v2.0.2; however, in v2.2.0 the events are being skipped 
> unless I add the root user into each user's group, at which point they are seen.
> We currently have all acls configuration unset.






[jira] [Commented] (SPARK-21521) History service requires user is in any group

2017-07-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099114#comment-16099114
 ] 

Marcelo Vanzin commented on SPARK-21521:


Just do double-correct myself, it should really be {{3777}}.

> History service requires user is in any group
> -
>
> Key: SPARK-21521
> URL: https://issues.apache.org/jira/browse/SPARK-21521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Adrian Bridgett
>
> (Regression cf. 2.0.2)
> We run Spark as several users; these write to the history location, where the 
> files are saved as those users with permissions of 770 (this is hardcoded in 
> EventLoggingListener.scala).
> The history service runs as root so that it has permissions on these files 
> (see https://spark.apache.org/docs/latest/security.html).
> This worked fine in v2.0.2; however, in v2.2.0 the events are being skipped 
> unless I add the root user into each user's group, at which point they are seen.
> We currently have all acls configuration unset.






[jira] [Commented] (SPARK-21521) History service requires user is in any group

2017-07-24 Thread Adrian Bridgett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099111#comment-16099111
 ] 

Adrian Bridgett commented on SPARK-21521:
-

Thanks Marcelo - good idea regarding the setgid bit - that's definitely cleaner 
than my current workaround :)
(3775 for setgid rather than 4775 I think though - 2 for the setgid, 1 for the 
sticky bit)

And thanks for such a speedy response!

> History service requires user is in any group
> -
>
> Key: SPARK-21521
> URL: https://issues.apache.org/jira/browse/SPARK-21521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Adrian Bridgett
>
> (Regression cf. 2.0.2)
> We run Spark as several users; these write to the history location, where the 
> files are saved as those users with permissions of 770 (this is hardcoded in 
> EventLoggingListener.scala).
> The history service runs as root so that it has permissions on these files 
> (see https://spark.apache.org/docs/latest/security.html).
> This worked fine in v2.0.2; however, in v2.2.0 the events are being skipped 
> unless I add the root user into each user's group, at which point they are seen.
> We currently have all acls configuration unset.






[jira] [Commented] (SPARK-21521) History service requires user is in any group

2017-07-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099100#comment-16099100
 ] 

Marcelo Vanzin commented on SPARK-21521:


Hmm, if you're using a local FS it should work since root there should be able 
to read everything.

[~jerryshao] do you think the {{checkAccessPermission}} code you added could be 
misbehaving here? It doesn't seem to treat any user as special, so maybe.

[~abridgett] you could try setting the directory's permissions to {{4755}}; 
that should mimic the HDFS behavior of inheriting the directory's group when 
creating new files.

> History service requires user is in any group
> -
>
> Key: SPARK-21521
> URL: https://issues.apache.org/jira/browse/SPARK-21521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Adrian Bridgett
>
> (Regression cf. 2.0.2)
> We run Spark as several users; these write to the history location, where the 
> files are saved as those users with permissions of 770 (this is hardcoded in 
> EventLoggingListener.scala).
> The history service runs as root so that it has permissions on these files 
> (see https://spark.apache.org/docs/latest/security.html).
> This worked fine in v2.0.2; however, in v2.2.0 the events are being skipped 
> unless I add the root user into each user's group, at which point they are seen.
> We currently have all acls configuration unset.






[jira] [Comment Edited] (SPARK-21521) History service requires user is in any group

2017-07-24 Thread Adrian Bridgett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099090#comment-16099090
 ] 

Adrian Bridgett edited comment on SPARK-21521 at 7/24/17 8:40 PM:
--

Hmm, I didn't check that (it's actually just a local filesystem in this case, 
but I take the point that it's accessed via the Hadoop FS API).

Not sure why it's changed behaviour in 2.2.0 compared to 2.0.2 though. 

Just checked access via hadoop (using hdfs command line) and it's certainly 
still able to access it:
{noformat}
~# id
uid=0(root) gid=0(root) groups=0(root),997(airflow)

~# ls -l /var/log/spark/events/a5bab156-b4c2-41a2-93f2-11ba78c99c6e-12221.lz4
-rwxrwx--- 1 ubuntu ubuntu 13982 Jul 24 17:57 
/var/log/spark/events/a5bab156-b4c2-41a2-93f2-11ba78c99c6e-12221.lz4

~# hdfs dfs -cat 
/var/log/spark/events/a5bab156-b4c2-41a2-93f2-11ba78c99c6e-12221.lz4
..(stuff)
{noformat}



was (Author: abridgett):
Hmm, I didn't check that (it's actually just a local filesystem in this case, 
but I take the point that it's accessed via the Hadoop FS API).

Not sure why it's changed behaviour in 2.2.0 compared to 2.0.2 though. 

Just checked access via hadoop (using hdfs command line) and it's certainly 
still able to access it:
{noformat}
~# id
uid=0(root) gid=0(root) groups=0(root),997(airflow)

~# ls -l /var/log/spark/events/a5bab156-b4c2-41a2-93f2-11ba78c99c6e-12221.lz4
-rwxrwx--- 1 ubuntu ubuntu 13982 Jul 24 17:57 
/var/log/spark/events/a5bab156-b4c2-41a2-93f2-11ba78c99c6e-12221.lz4

~# hdfs dfs -cat 
/var/log/spark/events/a5bab156-b4c2-41a2-93f2-11ba78c99c6e-12221.lz4id
..(stuff)
{noformat}


> History service requires user is in any group
> -
>
> Key: SPARK-21521
> URL: https://issues.apache.org/jira/browse/SPARK-21521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Adrian Bridgett
>
> (Regression cf. 2.0.2)
> We run Spark as several users; these write to the history location, where the 
> files are saved as those users with permissions of 770 (this is hardcoded in 
> EventLoggingListener.scala).
> The history service runs as root so that it has permissions on these files 
> (see https://spark.apache.org/docs/latest/security.html).
> This worked fine in v2.0.2; however, in v2.2.0 the events are being skipped 
> unless I add the root user into each user's group, at which point they are seen.
> We currently have all acls configuration unset.






[jira] [Commented] (SPARK-21521) History service requires user is in any group

2017-07-24 Thread Adrian Bridgett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099090#comment-16099090
 ] 

Adrian Bridgett commented on SPARK-21521:
-

Hmm, I didn't check that (it's actually just a local filesystem in this case, 
but I take the point that it's accessed via the Hadoop FS API).

Not sure why it's changed behaviour in 2.2.0 compared to 2.0.2 though. 

Just checked access via hadoop (using hdfs command line) and it's certainly 
still able to access it:
{noformat}
~# id
uid=0(root) gid=0(root) groups=0(root),997(airflow)

~# ls -l /var/log/spark/events/a5bab156-b4c2-41a2-93f2-11ba78c99c6e-12221.lz4
-rwxrwx--- 1 ubuntu ubuntu 13982 Jul 24 17:57 
/var/log/spark/events/a5bab156-b4c2-41a2-93f2-11ba78c99c6e-12221.lz4

~# hdfs dfs -cat 
/var/log/spark/events/a5bab156-b4c2-41a2-93f2-11ba78c99c6e-12221.lz4id
..(stuff)
{noformat}


> History service requires user is in any group
> -
>
> Key: SPARK-21521
> URL: https://issues.apache.org/jira/browse/SPARK-21521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Adrian Bridgett
>
> (Regression cf. 2.0.2)
> We run Spark as several users; these write to the history location, where the 
> files are saved as those users with permissions of 770 (this is hardcoded in 
> EventLoggingListener.scala).
> The history service runs as root so that it has permissions on these files 
> (see https://spark.apache.org/docs/latest/security.html).
> This worked fine in v2.0.2; however, in v2.2.0 the events are being skipped 
> unless I add the root user into each user's group, at which point they are seen.
> We currently have all acls configuration unset.






[jira] [Commented] (SPARK-21521) History service requires user is in any group

2017-07-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099078#comment-16099078
 ] 

Marcelo Vanzin commented on SPARK-21521:


This smells of a configuration issue... {{root}} is not a special user on HDFS, 
so running the SHS as root doesn't mean much assuming you're using HDFS.

For HDFS, the event log directory should have "1777" permissions, and the group 
should be a group to which the SHS user belongs. e.g., in my cluster:

{noformat}
$ hdfs dfs -ls /user/spark
Found 1 items
drwxrwxrwt   - spark spark  0 2017-07-24 13:26 
/user/spark/applicationHistory
{noformat}

My SHS runs as user "spark" which belongs to group "spark".

> History service requires user is in any group
> -
>
> Key: SPARK-21521
> URL: https://issues.apache.org/jira/browse/SPARK-21521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Adrian Bridgett
>
> (Regression cf. 2.0.2)
> We run Spark as several users; these write to the history location, where the 
> files are saved as those users with permissions of 770 (this is hardcoded in 
> EventLoggingListener.scala).
> The history service runs as root so that it has permissions on these files 
> (see https://spark.apache.org/docs/latest/security.html).
> This worked fine in v2.0.2; however, in v2.2.0 the events are being skipped 
> unless I add the root user into each user's group, at which point they are seen.
> We currently have all acls configuration unset.






[jira] [Updated] (SPARK-21522) Flaky test: LauncherServerSuite.testStreamFiltering

2017-07-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-21522:
---
Summary: Flaky test: LauncherServerSuite.testStreamFiltering  (was: Flay 
test:)

> Flaky test: LauncherServerSuite.testStreamFiltering
> ---
>
> Key: SPARK-21522
> URL: https://issues.apache.org/jira/browse/SPARK-21522
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> We ran into this in our internal Jenkins servers. Partial stack trace:
> {noformat}
> java.net.SocketException: Broken pipe
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
>   at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
>   at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
>   at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
>   at 
> java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1285)
>   at 
> java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1576)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:350)
>   at 
> org.apache.spark.launcher.LauncherConnection.send(LauncherConnection.java:82)
>   at 
> org.apache.spark.launcher.LauncherServerSuite.testStreamFiltering(LauncherServerSuite.java:174)
> {noformat}






[jira] [Created] (SPARK-21522) Flay test:

2017-07-24 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-21522:
--

 Summary: Flay test:
 Key: SPARK-21522
 URL: https://issues.apache.org/jira/browse/SPARK-21522
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.2.0, 2.1.1
Reporter: Marcelo Vanzin
Priority: Minor


We ran into this in our internal Jenkins servers. Partial stack trace:

{noformat}
java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at 
java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1285)
at 
java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1576)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:350)
at 
org.apache.spark.launcher.LauncherConnection.send(LauncherConnection.java:82)
at 
org.apache.spark.launcher.LauncherServerSuite.testStreamFiltering(LauncherServerSuite.java:174)
{noformat}







[jira] [Comment Edited] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-07-24 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099030#comment-16099030
 ] 

Mitesh edited comment on SPARK-20112 at 7/24/17 7:37 PM:
-

Still seeing this on 2.1.0, new err file 
https://gist.github.com/MasterDDT/7eab5e8828ef9cbed2b8956a505c


was (Author: masterddt):
Still seeing this on 2.1.0, attached new err file

> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log, 
> hs_err_pid22870.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
> think that means it's not an issue with the code-gen, but I can't figure out 
> what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables. And 
> we partition/sort them all beforehand so it's always sort-merge-joins or 
> broadcast joins (with small tables).
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> {noformat}
> This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, 
> but that is marked as fixed in 2.0.0.






[jira] [Commented] (SPARK-20112) SIGSEGV in GeneratedIterator.sort_addToSorter

2017-07-24 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16099030#comment-16099030
 ] 

Mitesh commented on SPARK-20112:


Still seeing this on 2.1.0, attached new err file

> SIGSEGV in GeneratedIterator.sort_addToSorter
> -
>
> Key: SPARK-20112
> URL: https://issues.apache.org/jira/browse/SPARK-20112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS m4.10xlarge with EBS (io1 drive, 400g, 4000iops)
>Reporter: Mitesh
> Attachments: codegen_sorter_crash.log, hs_err_pid19271.log, 
> hs_err_pid22870.log
>
>
> I'm seeing a very weird crash in {{GeneratedIterator.sort_addToSorter}}. The 
> hs_err_pid and codegen file are attached (with query plans). It's not a 
> deterministic repro, but running a big query load, I eventually see it come 
> up within a few minutes.
> Here is some interesting repro information:
> - Using AWS r3.8xlarge machines, which have ephemeral attached drives, I can't 
> repro this. But it does repro with m4.10xlarge with an io1 EBS drive. So I 
> think that means it's not an issue with the code-gen, but I can't figure out 
> what the difference in behavior is.
> - The broadcast joins in the plan are all small tables. I have 
> autoJoinBroadcast=-1 because I always hint which tables should be broadcast.
> - As you can see from the plan, all the sources are cached memory tables. And 
> we partition/sort them all beforehand so it's always sort-merge-joins or 
> broadcast joins (with small tables).
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  [thread 139872345896704 also had an error]
> SIGSEGV (0xb) at pc=0x7f38a378caa3, pid=19271, tid=139872342738688
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> [thread 139872348002048 also had an error]# Problematic frame:
> # 
> J 28454 C1 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V
>  (369 bytes) @ 0x7f38a378caa3 [0x7f38a378b5e0+0x14c3]
> {noformat}
> This kind of looks like https://issues.apache.org/jira/browse/SPARK-15822, 
> but that is marked as fixed in 2.0.0.






[jira] [Commented] (SPARK-14239) Add load for LDAModel that supports both local and distributedModel

2017-07-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098948#comment-16098948
 ] 

yuhao yang commented on SPARK-14239:


Close overlooked stale jira.

> Add load for LDAModel that supports both local and distributedModel
> ---
>
> Key: SPARK-14239
> URL: https://issues.apache.org/jira/browse/SPARK-14239
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Add a load for LDAModel that supports loading both the local and the distributed 
> model, as discussed in https://github.com/apache/spark/pull/9894, so that users 
> don't have to know the details.






[jira] [Resolved] (SPARK-14239) Add load for LDAModel that supports both local and distributedModel

2017-07-24 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang resolved SPARK-14239.

Resolution: Won't Do

> Add load for LDAModel that supports both local and distributedModel
> ---
>
> Key: SPARK-14239
> URL: https://issues.apache.org/jira/browse/SPARK-14239
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Add a load for LDAModel that supports loading both the local and the distributed 
> model, as discussed in https://github.com/apache/spark/pull/9894, so that users 
> don't have to know the details.






[jira] [Commented] (SPARK-12875) Add Weight of Evidence and Information value to Spark.ml as a feature transformer

2017-07-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098946#comment-16098946
 ] 

yuhao yang commented on SPARK-12875:


Close stale jira.

> Add Weight of Evidence and Information value to Spark.ml as a feature 
> transformer
> -
>
> Key: SPARK-12875
> URL: https://issues.apache.org/jira/browse/SPARK-12875
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> As a feature transformer, WOE and IV enable one to:
> Consider each variable’s independent contribution to the outcome.
> Detect linear and non-linear relationships.
> Rank variables in terms of "univariate" predictive strength.
> Visualize the correlations between the predictive variables and the binary 
> outcome.
> http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/ gives 
> a good introduction to WoE and IV.
>  The Weight of Evidence or WoE value provides a measure of how well a 
> grouping of a feature is able to distinguish between a binary response (e.g. 
> "good" versus "bad"); it is widely used for grouping continuous features or 
> mapping categorical features to continuous values. It is computed from the 
> basic odds ratio:
> (Distribution of positive Outcomes) / (Distribution of negative Outcomes)
> where each distribution refers to the proportion of positives or negatives in the 
> respective group, relative to the column totals.
> The WoE recoding of features is particularly well suited for subsequent 
> modeling using Logistic Regression or MLP.
> In addition, the information value or IV can be computed based on WoE, which 
> is a popular technique to select variables in a predictive model.
> TODO: Currently we support only calculation for categorical features. Add an 
> estimator to estimate the proper grouping for continuous features. 
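
As a concrete illustration of these definitions, here is a standalone sketch (not 
the proposed transformer). Note that the usual WoE formula takes the natural log of 
the ratio above, and IV sums the differences of the two distributions weighted by 
the WoE:
{code}
object WoeIvSketch {
  // counts: category -> (number of positive outcomes, number of negative outcomes).
  // Assumes no zero cells; real implementations usually apply smoothing.
  def woeAndIv(counts: Map[String, (Long, Long)]): (Map[String, Double], Double) = {
    val totalPos = counts.values.map(_._1).sum.toDouble
    val totalNeg = counts.values.map(_._2).sum.toDouble
    val woe = counts.map { case (cat, (pos, neg)) =>
      cat -> math.log((pos / totalPos) / (neg / totalNeg))
    }
    val iv = counts.map { case (cat, (pos, neg)) =>
      (pos / totalPos - neg / totalNeg) * woe(cat)
    }.sum
    (woe, iv)
  }

  def main(args: Array[String]): Unit = {
    val counts = Map("A" -> (80L, 20L), "B" -> (30L, 70L))
    val (woe, iv) = woeAndIv(counts)
    println(s"WoE per category: $woe, IV: $iv")
  }
}
{code}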






[jira] [Resolved] (SPARK-12875) Add Weight of Evidence and Information value to Spark.ml as a feature transformer

2017-07-24 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang resolved SPARK-12875.

Resolution: Won't Do

> Add Weight of Evidence and Information value to Spark.ml as a feature 
> transformer
> -
>
> Key: SPARK-12875
> URL: https://issues.apache.org/jira/browse/SPARK-12875
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> As a feature transformer, WOE and IV enable one to:
> Consider each variable’s independent contribution to the outcome.
> Detect linear and non-linear relationships.
> Rank variables in terms of "univariate" predictive strength.
> Visualize the correlations between the predictive variables and the binary 
> outcome.
> http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/ gives 
> a good introduction to WoE and IV.
>  The Weight of Evidence or WoE value provides a measure of how well a 
> grouping of a feature is able to distinguish between a binary response (e.g. 
> "good" versus "bad"); it is widely used for grouping continuous features or 
> mapping categorical features to continuous values. It is computed from the 
> basic odds ratio:
> (Distribution of positive Outcomes) / (Distribution of negative Outcomes)
> where each distribution refers to the proportion of positives or negatives in the 
> respective group, relative to the column totals.
> The WoE recoding of features is particularly well suited for subsequent 
> modeling using Logistic Regression or MLP.
> In addition, the information value or IV can be computed based on WoE, which 
> is a popular technique to select variables in a predictive model.
> TODO: Currently we support only calculation for categorical features. Add an 
> estimator to estimate the proper grouping for continuous features. 






[jira] [Comment Edited] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit

2017-07-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098940#comment-16098940
 ] 

yuhao yang edited comment on SPARK-14760 at 7/24/17 6:23 PM:
-

Close stale jira since it's been overlooked for some time. Thanks for the 
review and comments.


was (Author: yuhaoyan):
Close it since it's been overlooked for some time. Thanks for the review and 
comments.

> Feature transformers should always invoke transformSchema in transform or fit
> -
>
> Key: SPARK-14760
> URL: https://issues.apache.org/jira/browse/SPARK-14760
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Since one of the primary functions of transformSchema is to conduct parameter 
> validation, transformers should always invoke transformSchema in transform 
> and fit.
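
For illustration, a minimal hypothetical transformer following that convention; 
schema and parameter checks live in transformSchema, and transform validates before 
doing any work:
{code}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Hypothetical no-op transformer, illustrating the convention only.
class NoopTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("noop"))

  // Parameter and schema validation belongs here.
  override def transformSchema(schema: StructType): StructType = schema

  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)  // always validate before transforming
    dataset.toDF()
  }

  override def copy(extra: ParamMap): NoopTransformer = defaultCopy(extra)
}
{code}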






[jira] [Commented] (SPARK-14760) Feature transformers should always invoke transformSchema in transform or fit

2017-07-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098940#comment-16098940
 ] 

yuhao yang commented on SPARK-14760:


Close it since it's been overlooked for some time. Thanks for the review and 
comments.

> Feature transformers should always invoke transformSchema in transform or fit
> -
>
> Key: SPARK-14760
> URL: https://issues.apache.org/jira/browse/SPARK-14760
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Since one of the primary functions of transformSchema is to conduct parameter 
> validation, transformers should always invoke transformSchema in transform 
> and fit.






[jira] [Resolved] (SPARK-13223) Add stratified sampling to ML feature engineering

2017-07-24 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang resolved SPARK-13223.

Resolution: Not A Problem

> Add stratified sampling to ML feature engineering
> -
>
> Key: SPARK-13223
> URL: https://issues.apache.org/jira/browse/SPARK-13223
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> I found it useful to add a sampling transformer during a fraud detection 
> case. It can be used for resampling or oversampling, which in turn is 
> required by ensembles and unbalanced-data processing.
> Internally, it invokes sampleByKey from the pair RDD operations.
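
A small sketch of that underlying operation, using generic toy data rather than the 
proposed transformer:
{code}
import org.apache.spark.sql.SparkSession

object StratifiedSamplingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stratified-sampling-sketch")
      .master("local[*]")
      .getOrCreate()

    // (label, record) pairs; class 0 is the majority class in this toy data.
    val labeled = spark.sparkContext.parallelize(
      Seq((0, "a"), (0, "b"), (0, "c"), (0, "d"), (1, "e"), (1, "f")))

    // Per-class sampling fractions: keep ~25% of class 0 and all of class 1.
    val fractions = Map(0 -> 0.25, 1 -> 1.0)
    val sampled = labeled.sampleByKey(withReplacement = false, fractions)

    sampled.collect().foreach(println)
    spark.stop()
  }
}
{code}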






[jira] [Commented] (SPARK-13223) Add stratified sampling to ML feature engineering

2017-07-24 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098933#comment-16098933
 ] 

yuhao yang commented on SPARK-13223:


Close it since it's been overlooked for some time and can be implemented with 
#17583 easily. 

> Add stratified sampling to ML feature engineering
> -
>
> Key: SPARK-13223
> URL: https://issues.apache.org/jira/browse/SPARK-13223
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> I found it useful to add an sampling transformer during a case of fraud 
> detection. It can be used in resampling or overSampling, which in turn is 
> required by ensemble and unbalanced data processing.
> Internally, it invoke the sampleByKey in Pair RDD operation.






[jira] [Resolved] (SPARK-21502) --supervise causing frameworkId conflicts in mesos cluster mode

2017-07-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21502.

   Resolution: Fixed
 Assignee: Stavros Kontopoulos
Fix Version/s: 2.3.0

> --supervise causing frameworkId conflicts in mesos cluster mode
> ---
>
> Key: SPARK-21502
> URL: https://issues.apache.org/jira/browse/SPARK-21502
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
> Fix For: 2.3.0
>
>
> This issue is described here: https://dcosjira.atlassian.net/browse/SPARK-8
> The --supervise option is actually broken in mesos cluster mode. 
> The error message you get when you re-launch a driver is:
> 17/05/16 13:26:09 ERROR MesosCoarseGrainedSchedulerBackend: Mesos error: 
> Framework has been removed
> Exception in thread "Thread-13" org.apache.spark.SparkException: Exiting due 
> to error from cluster scheduler: Framework has been removed
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:459)
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.error(MesosCoarseGrainedSchedulerBackend.scala:569)
> I0516 13:26:09.00334292 sched.cpp:2021] Asked to abort the driver
> I0516 13:26:09.00336292 sched.cpp:1217] Aborting framework 
> '1b9cb705-2e80-4354-8f3b-a4cffac7dac3-0003-driver-20170516131736-0004'
> 17/05/16 13:26:09 INFO MesosCoarseGrainedSchedulerBackend: driver.run() 
> returned with code DRIVER_ABORTED
> 17/05/16 13:26:09 ERROR SparkContext: Error initializing SparkContext.
> org.apache.spark.SparkException: Error starting driver, DRIVER_ABORTED
>   at 
> org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils$$anon$1.run(MesosSchedulerUtils.scala:125)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21501) Spark shuffle index cache size should be memory based

2017-07-24 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098907#comment-16098907
 ] 

Thomas Graves commented on SPARK-21501:
---

The issue was actually introduced by SPARK-15074. I have updated the affects 
version; I was thinking that change was added earlier, so thanks for pointing it out.

That cache should be memory based, not based on the number of entries.
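
A minimal sketch of what a size-bounded cache could look like, using Guava's maximumWeight and a Weigher (the IndexInfo type, the loader, and the 100MB cap below are illustrative assumptions, not the actual shuffle-service code):

{code}
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache, Weigher}

// Stand-in for the per-file index data the shuffle service keeps in memory.
case class IndexInfo(offsets: Array[Long]) {
  def sizeInBytes: Int = offsets.length * 8
}

val maxCacheBytes = 100L * 1024 * 1024  // cap total retained bytes instead of an entry count

val indexCache: LoadingCache[String, IndexInfo] = CacheBuilder.newBuilder()
  .maximumWeight(maxCacheBytes)
  .weigher(new Weigher[String, IndexInfo] {
    override def weigh(path: String, info: IndexInfo): Int = info.sizeInBytes
  })
  .build(new CacheLoader[String, IndexInfo] {
    // Hypothetical loader; real code would parse the shuffle index file at `path`.
    override def load(path: String): IndexInfo = IndexInfo(Array.fill(1024)(0L))
  })
{code}

With a weigher in place, entries are evicted once the summed weights exceed the configured byte budget, regardless of how many reducers (and therefore how large each index entry) a job has.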

> Spark shuffle index cache size should be memory based
> -
>
> Key: SPARK-21501
> URL: https://issues.apache.org/jira/browse/SPARK-21501
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> Right now the Spark shuffle service has a cache for index files. It is bounded 
> by a number of files cached (spark.shuffle.service.index.cache.entries). This can 
> cause issues if people have a lot of reducers, because the size of each entry 
> can fluctuate based on the number of reducers. 
> We saw an issue with a job that had 17 reducers, and it caused the NM with the 
> Spark shuffle service to use 700-800MB of memory in the NM by itself.
> We should change this cache to be memory based and only allow a certain 
> memory size to be used. When I say memory based, I mean the cache should have a 
> limit of, say, 100MB.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21501) Spark shuffle index cache size should be memory based

2017-07-24 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-21501:
--
Affects Version/s: (was: 2.0.0)
   2.1.0

> Spark shuffle index cache size should be memory based
> -
>
> Key: SPARK-21501
> URL: https://issues.apache.org/jira/browse/SPARK-21501
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> Right now the Spark shuffle service has a cache for index files. It is bounded 
> by a number of files cached (spark.shuffle.service.index.cache.entries). This can 
> cause issues if people have a lot of reducers, because the size of each entry 
> can fluctuate based on the number of reducers. 
> We saw an issue with a job that had 17 reducers, and it caused the NM with the 
> Spark shuffle service to use 700-800MB of memory in the NM by itself.
> We should change this cache to be memory based and only allow a certain 
> memory size to be used. When I say memory based, I mean the cache should have a 
> limit of, say, 100MB.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-07-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098612#comment-16098612
 ] 

Maciej Bryński commented on SPARK-20392:


Is it safe to merge it into 2.2?
I'm tracking down Catalyst performance problems, and this could be a solution.

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline (a small reproduction sketch 
> follows the list):
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 107_bucketizer_8c3d352915f7
> 
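A small sketch, referenced above the stage list, of how a pipeline of this shape can be reconstructed for profiling; the column names, splits, and stage count are assumptions for illustration, not the reporter's actual stages:

{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.Bucketizer

// Many Bucketizer stages over a wide, short DataFrame, mirroring the reported shape.
val bucketizers: Array[PipelineStage] = (0 until 300).map { i =>
  new Bucketizer()
    .setInputCol(s"c$i")
    .setOutputCol(s"c${i}_bucketized")
    .setSplits(Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
}.toArray

val pipeline = new Pipeline().setStages(bucketizers)
// val model = pipeline.fit(wideDf)   // wideDf: ~312 mostly-numeric columns, ~421 rows
{code}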

[jira] [Closed] (SPARK-21387) org.apache.spark.memory.TaskMemoryManager.allocatePage causes OOM

2017-07-24 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki closed SPARK-21387.

Resolution: Cannot Reproduce

> org.apache.spark.memory.TaskMemoryManager.allocatePage causes OOM
> -
>
> Key: SPARK-21387
> URL: https://issues.apache.org/jira/browse/SPARK-21387
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21387) org.apache.spark.memory.TaskMemoryManager.allocatePage causes OOM

2017-07-24 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki closed SPARK-21387.

Resolution: Fixed

> org.apache.spark.memory.TaskMemoryManager.allocatePage causes OOM
> -
>
> Key: SPARK-21387
> URL: https://issues.apache.org/jira/browse/SPARK-21387
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21387) org.apache.spark.memory.TaskMemoryManager.allocatePage causes OOM

2017-07-24 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki reopened SPARK-21387:
--

> org.apache.spark.memory.TaskMemoryManager.allocatePage causes OOM
> -
>
> Key: SPARK-21387
> URL: https://issues.apache.org/jira/browse/SPARK-21387
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21387) org.apache.spark.memory.TaskMemoryManager.allocatePage causes OOM

2017-07-24 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098545#comment-16098545
 ] 

Kazuaki Ishizaki commented on SPARK-21387:
--

While I got an OOM in my unit test, I have to reinvestigate whether the unit test 
reflects the actual restrictions.

> org.apache.spark.memory.TaskMemoryManager.allocatePage causes OOM
> -
>
> Key: SPARK-21387
> URL: https://issues.apache.org/jira/browse/SPARK-21387
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts

2017-07-24 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098239#comment-16098239
 ] 

Stavros Kontopoulos edited comment on SPARK-15142 at 7/24/17 3:01 PM:
--

I will give it a shot if it is ok. [~devaraj.k] Any details on how to reproduce 
this? You simply restart the node by backing up zk state, (I assume no HA) 
right?


was (Author: skonto):
I will give it a shot if it is ok. [~devaraj.k] Any details on how to reproduce 
this? You simply restart the node by backing up zk state?

> Spark Mesos dispatcher becomes unusable when the Mesos master restarts
> --
>
> Key: SPARK-15142
> URL: https://issues.apache.org/jira/browse/SPARK-15142
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Devaraj K
>Priority: Minor
> Attachments: 
> spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out
>
>
> While Spark Mesos dispatcher running if the Mesos master gets restarted then 
> Spark Mesos dispatcher will keep running and queues up all the submitted 
> applications and will not launch them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts

2017-07-24 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098239#comment-16098239
 ] 

Stavros Kontopoulos edited comment on SPARK-15142 at 7/24/17 3:01 PM:
--

I will give it a shot if it is ok. [~devaraj.k] Any details on how to reproduce 
this? You simply restart the node by backing up zk state first, (I assume no 
HA) right?


was (Author: skonto):
I will give it a shot if it is ok. [~devaraj.k] Any details on how to reproduce 
this? You simply restart the node by backing up zk state, (I assume no HA) 
right?

> Spark Mesos dispatcher becomes unusable when the Mesos master restarts
> --
>
> Key: SPARK-15142
> URL: https://issues.apache.org/jira/browse/SPARK-15142
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Devaraj K
>Priority: Minor
> Attachments: 
> spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out
>
>
> While Spark Mesos dispatcher running if the Mesos master gets restarted then 
> Spark Mesos dispatcher will keep running and queues up all the submitted 
> applications and will not launch them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts

2017-07-24 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098239#comment-16098239
 ] 

Stavros Kontopoulos edited comment on SPARK-15142 at 7/24/17 3:00 PM:
--

I will give it a shot if it is ok. [~devaraj.k] Any details on how to reproduce 
this? You simply restart the node by backing up zk state?


was (Author: skonto):
I will give it a shot if it is ok.

> Spark Mesos dispatcher becomes unusable when the Mesos master restarts
> --
>
> Key: SPARK-15142
> URL: https://issues.apache.org/jira/browse/SPARK-15142
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Devaraj K
>Priority: Minor
> Attachments: 
> spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out
>
>
> While Spark Mesos dispatcher running if the Mesos master gets restarted then 
> Spark Mesos dispatcher will keep running and queues up all the submitted 
> applications and will not launch them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21501) Spark shuffle index cache size should be memory based

2017-07-24 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098531#comment-16098531
 ] 

Kazuaki Ishizaki commented on SPARK-21501:
--

I guess that using Spark 2.1 or a later version alleviates this issue, thanks to 
SPARK-15074.

> Spark shuffle index cache size should be memory based
> -
>
> Key: SPARK-21501
> URL: https://issues.apache.org/jira/browse/SPARK-21501
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>
> Right now the Spark shuffle service has a cache for index files. It is bounded 
> by a number of files cached (spark.shuffle.service.index.cache.entries). This can 
> cause issues if people have a lot of reducers, because the size of each entry 
> can fluctuate based on the number of reducers. 
> We saw an issue with a job that had 17 reducers, and it caused the NM with the 
> Spark shuffle service to use 700-800MB of memory in the NM by itself.
> We should change this cache to be memory based and only allow a certain 
> memory size to be used. When I say memory based, I mean the cache should have a 
> limit of, say, 100MB.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2017-07-24 Thread Leif Walsh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098524#comment-16098524
 ] 

Leif Walsh commented on SPARK-21187:


Also, if you're unfamiliar: {{object}} columns are rather slow in pandas; to do 
anything with them you have to go through the Python interpreter.  It's 
generally better, when possible, to make sure your columns have primitive 
dtypes so that you can use vectorized operations on them.  For that reason, 
modeling a struct as a hierarchical index would probably be much faster to 
consume.

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> This is to track adding the remaining type support in Arrow Converters.  
> Currently, only primitive data types are supported.
> Remaining types:
> * *Date*
> * *Timestamp*
> * *Complex*: Struct, Array, Map
> * *Decimal*



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19214) Inconsistencies between DataFrame and Dataset APIs

2017-07-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19214.
---
Resolution: Won't Fix

> Inconsistencies between DataFrame and Dataset APIs
> --
>
> Key: SPARK-19214
> URL: https://issues.apache.org/jira/browse/SPARK-19214
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Alexander Alexandrov
>Priority: Trivial
>
> I am not sure whether this has been reported already, but there are some 
> confusing & annoying inconsistencies when programming the same expression in 
> the Dataset and the DataFrame APIs.
> Consider the following minimal example executed in a Spark Shell:
> {code}
> case class Point(x: Int, y: Int, z: Int)
> val ps = spark.createDataset(for {
>   x <- 1 to 10
>   y <- 1 to 10
>   z <- 1 to 10
> } yield Point(x, y, z))
> // Problem 1:
> // count produces different fields in the Dataset / DataFrame variants
> // count() on grouped DataFrame: field name is `count`
> ps.groupBy($"x").count().printSchema
> // root
> //  |-- x: integer (nullable = false)
> //  |-- count: long (nullable = false)
> // count() on grouped Dataset: field name is `count(1)`
> ps.groupByKey(_.x).count().printSchema
> // root
> //  |-- value: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> // Problem 2:
> // groupByKey produces different `key` field name depending
> // on the result type
> // this is especially confusing in the first case below (simple key types)
> // where the key field is actually named `value`
> // simple key types
> ps.groupByKey(p => p.x).count().printSchema
> // root
> //  |-- value: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> // complex key types
> ps.groupByKey(p => (p.x, p.y)).count().printSchema
> // root
> //  |-- key: struct (nullable = false)
> //  ||-- _1: integer (nullable = true)
> //  ||-- _2: integer (nullable = true)
> //  |-- count(1): long (nullable = false)
> {code}
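
As a side note, a minimal sketch of how the two results can be aligned today by renaming the Dataset columns (an illustrative workaround using the existing toDF API, not a fix for the inconsistency itself):

{code}
// Assuming the Dataset[Point] `ps` and the implicits from the description above.
val dfCounts = ps.groupBy($"x").count()                        // columns: x, count
val dsCounts = ps.groupByKey(_.x).count().toDF("x", "count")   // renames value / count(1)

dfCounts.printSchema()
dsCounts.printSchema()   // both now expose columns named x and count
{code}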



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2017-07-24 Thread Leif Walsh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098522#comment-16098522
 ] 

Leif Walsh commented on SPARK-21187:


[~rxin] [~bryanc], pandas does support array and map columns; it represents 
each value as a Python {{list}} or {{dict}} (with {{object}} dtype):

{code}
>>> pd.DataFrame({'x': [[1,2,3], [4,5]], 'y': [{'hello': 1}, {'world': 2, ('fizz', 'buzz'): 3}]})
           x                                  y
0  [1, 2, 3]                       {'hello': 1}
1     [4, 5]  {'world': 2, ('fizz', 'buzz'): 3}
{code}

You could also model structs as namedtuples:

{code}
>>> import collections
>>> person = collections.namedtuple('person', ['first', 'last'])
>>> pd.DataFrame({'participants': [person('Reynold', 'Xin'), person('Bryan', 'Cutler')]})
      participants
0   (Reynold, Xin)
1  (Bryan, Cutler)
{code}

This would also have {{object}} dtype.

Another choice is, for structs at least, you could model it as a hierarchical 
index on columns:

{code}
>>> pd.DataFrame(data=[['Reynold', 'Xin'], ['Bryan', 'Cutler']],
...              columns=pd.MultiIndex(levels=[['participant'], ['first', 'last']],
...                                    labels=[[0, 0], [0, 1]]))
  participant
        first    last
0     Reynold     Xin
1       Bryan  Cutler
{code}

Let me know if this is unclear and I should elaborate.

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> This is to track adding the remaining type support in Arrow Converters.  
> Currently, only primitive data types are supported.
> Remaining types:
> * *Date*
> * *Timestamp*
> * *Complex*: Struct, Array, Map
> * *Decimal*



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-07-24 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098344#comment-16098344
 ] 

Liang-Chi Hsieh commented on SPARK-21177:
-

[~hyukjin.kwon] I ran spark-shell with your code snippets, but I still can't 
reproduce the issue. I didn't create the Hive table.

> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> ...
> Time taken for time to append to hive: is 3055
> ...
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe 
> operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-07-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098305#comment-16098305
 ] 

Hyukjin Kwon commented on SPARK-21177:
--

[~viirya], I assume you did it the way I did?

> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> ...
> Time taken for time to append to hive: is 3055
> ...
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe 
> operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-07-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-21177:
--

I am reopening this as I can reproduce it:

{code}
def printTimeTaken(str: String, f: () => Unit) {
val start = System.nanoTime()
f()
val end = System.nanoTime()
val timetaken = end - start
import scala.concurrent.duration._
println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
}


for(i <- 1 to 1000) {
  printTimeTaken("time to append to hive:", () => { Seq(1, 
2).toDF().write.mode("append").saveAsTable("t1"); })
}
{code}


{code}
...

Time taken for time to append to hive: is 166

Time taken for time to append to hive: is 155

Time taken for time to append to hive: is 164

...

Time taken for time to append to hive: is 374

Time taken for time to append to hive: is 360

Time taken for time to append to hive: is 377
{code}

{code}
scala> sql("SELECT count(*) from t1").show()
++
|count(1)|
++
|2000|
++
{code}


What I did differently was to use {{format("hive")}}, with the table pre-created in Hive:

{code}
hive> create table t1 (value bigint);
{code}

Output:

{code}

...

Time taken for time to append to hive: is 593

Time taken for time to append to hive: is 587

Time taken for time to append to hive: is 580

...


Time taken for time to append to hive: is 506

Time taken for time to append to hive: is 511

Time taken for time to append to hive: is 507
{code}

{code}
scala> sql("SELECT count(*) from t1").show()
++
|count(1)|
++
|2000|
++
{code}
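
For clarity, a sketch of the write call exercised in the second run above, which goes through the Hive path against the pre-created table (the column name is an assumption):

{code}
// Table created beforehand in Hive: create table t1 (value bigint);
Seq(1, 2).toDF("value")
  .write
  .format("hive")
  .mode("append")
  .saveAsTable("t1")
{code}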


> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> ...
> Time taken for time to append to hive: is 3055
> ...
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe 
> operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-07-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098262#comment-16098262
 ] 

Hyukjin Kwon commented on SPARK-21177:
--

Oh, no. I created a Hive table and then inserted into it from Spark. I thought 
you intended to do so. The code path should be different in this case. I think 
I am reproducing this now. Let me reopen this after summarising what I did.

> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> ...
> Time taken for time to append to hive: is 3055
> ...
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe 
> operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-07-24 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098255#comment-16098255
 ] 

Prashant Sharma commented on SPARK-21177:
-

Yes. In fact, Hive should not be set up. Are you on some special hardware, 
like NVMe?

> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> ...
> Time taken for time to append to hive: is 3055
> ...
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe 
> operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-07-24 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-21177:

Description: 
In short, please use the following shell transcript for the reproducer. 

{code:java}

Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
  /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> def printTimeTaken(str: String, f: () => Unit) {
val start = System.nanoTime()
f()
val end = System.nanoTime()
val timetaken = end - start
import scala.concurrent.duration._
println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
  }
 |  |  |  |  |  |  | printTimeTaken: (str: String, 
f: () => Unit)Unit

scala> 
for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { Seq(1, 
2).toDF().write.mode("append").saveAsTable("t1"); })}
Time taken for time to append to hive: is 284

Time taken for time to append to hive: is 211

...
...

Time taken for time to append to hive: is 2615

...
Time taken for time to append to hive: is 3055
...
Time taken for time to append to hive: is 22425


{code}

Why does it matter ?

In a streaming job it is not possible to append to hive using this dataframe 
operation.

  was:
In short, please use the following shell transcript for the reproducer. 

{code:java}

Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
  /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> def printTimeTaken(str: String, f: () => Unit) {
val start = System.nanoTime()
f()
val end = System.nanoTime()
val timetaken = end - start
import scala.concurrent.duration._
println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
  }
 |  |  |  |  |  |  | printTimeTaken: (str: String, 
f: () => Unit)Unit

scala> 
for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { Seq(1, 
2).toDF().write.mode("append").saveAsTable("t1"); })}
Time taken for time to append to hive: is 284

Time taken for time to append to hive: is 211

...
...

Time taken for time to append to hive: is 2615

Time taken for time to append to hive: is 3055

Time taken for time to append to hive: is 22425


{code}

Why does it matter ?

In a streaming job it is not possible to append to hive using this dataframe 
operation.


> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> ...
> Time taken for time to append to hive: is 3055
> ...
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe 
> operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts

2017-07-24 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098239#comment-16098239
 ] 

Stavros Kontopoulos edited comment on SPARK-15142 at 7/24/17 11:52 AM:
---

I will give it a shot if it is ok.


was (Author: skonto):
I will give it a shot.

> Spark Mesos dispatcher becomes unusable when the Mesos master restarts
> --
>
> Key: SPARK-15142
> URL: https://issues.apache.org/jira/browse/SPARK-15142
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Devaraj K
>Priority: Minor
> Attachments: 
> spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out
>
>
> While Spark Mesos dispatcher running if the Mesos master gets restarted then 
> Spark Mesos dispatcher will keep running and queues up all the submitted 
> applications and will not launch them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts

2017-07-24 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098239#comment-16098239
 ] 

Stavros Kontopoulos commented on SPARK-15142:
-

I will give it a shot.

> Spark Mesos dispatcher becomes unusable when the Mesos master restarts
> --
>
> Key: SPARK-15142
> URL: https://issues.apache.org/jira/browse/SPARK-15142
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Mesos
>Reporter: Devaraj K
>Priority: Minor
> Attachments: 
> spark-devaraj-org.apache.spark.deploy.mesos.MesosClusterDispatcher-1-stobdtserver5.out
>
>
> While Spark Mesos dispatcher running if the Mesos master gets restarted then 
> Spark Mesos dispatcher will keep running and queues up all the submitted 
> applications and will not launch them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-07-24 Thread Luca Canali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-21519:

Description: 
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{code}
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell –jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORD").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
{code}

Comments: This proposal has been developed and tested for connecting the Spark 
JDBC data source to Oracle databases, however I believe it can be useful for 
other target DBs too, as it is quite generic.  
Note the proposed code allows to inject SQL into the target database. This is 
not a security concern as such, as it requires password authentication, however 
beware of the possibilities for injecting user-provided SQL (and PL/SQL) that 
this opens.


  was:
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{code}
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell –jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORD").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
{code}

Comments: This proposal has been developed and tested for connecting the Spark 
JDBC data source to Oracle databases, however it believe it can be useful for 
other target DBs too, as it is quite generic.  
Note the proposed code allows to inject SQL into the target database. This is 
not a security concern as such, as it requires password authentication, however 
beware of the possibilities for injecting user-provided SQL (and PL/SQL) that 
this opens.



> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes an option to the JDBC datasource, tentatively called " 
> sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
>  ) . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> {code}
> val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set time_zone=''+02:00''';
> end;
> """
> bin/spark-shell –jars ojdb6.jar
> 

[jira] [Updated] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-07-24 Thread Luca Canali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-21519:

Description: 
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{code}
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell –jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORD").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
{code}

Comments: This proposal has been developed and tested for connecting the Spark 
JDBC data source to Oracle databases, however I believe it can be useful for 
other target DBs too, as it is quite generic.  
Note the proposed code allows to inject SQL into the target database. This is 
not a security concern as such, as it requires password authentication, however 
beware of the possibilities of injecting user-provided SQL (and PL/SQL) that 
this opens.


  was:
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{code}
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell –jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORD").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
{code}

Comments: This proposal has been developed and tested for connecting the Spark 
JDBC data source to Oracle databases, however I believe it can be useful for 
other target DBs too, as it is quite generic.  
Note the proposed code allows to inject SQL into the target database. This is 
not a security concern as such, as it requires password authentication, however 
beware of the possibilities for injecting user-provided SQL (and PL/SQL) that 
this opens.



> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes an option to the JDBC datasource, tentatively called " 
> sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
>  ) . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> {code}
> val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set time_zone=''+02:00''';
> end;
> """
> bin/spark-shell –jars ojdb6.jar
> val 

[jira] [Commented] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-07-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098227#comment-16098227
 ] 

Hyukjin Kwon commented on SPARK-21177:
--

Wait... is the code itself a self-contained reproducer, without any Hive step?

> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> Time taken for time to append to hive: is 3055
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe 
> operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext

2017-07-24 Thread Mathieu Rossignol (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098199#comment-16098199
 ] 

Mathieu Rossignol commented on SPARK-10872:
---

Using Spark 2.1.0, I am still having this issue. 
I confirm that the workaround of deleting the metastore directory between each 
test method works for me (here is the JUnit code, using the Apache Commons IO 
FileUtils facility):

{code:java}
@Before
public void initTest()
{
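// Delete the local Derby metastore so the next test's HiveContext can boot a
// fresh metastore_db instead of hitting the XSDB6 "already booted" error.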
File hiveLocalMetaStorePath = new File("metastore_db");
try {
FileUtils.deleteDirectory(hiveLocalMetaStorePath);
} catch (IOException e) {
e.printStackTrace();
}
}
{code}

> Derby error (XSDB6) when creating new HiveContext after restarting 
> SparkContext
> ---
>
> Key: SPARK-10872
> URL: https://issues.apache.org/jira/browse/SPARK-10872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Dmytro Bielievtsov
>
> Starting from spark 1.4.0 (works well on 1.3.1), the following code fails 
> with "XSDB6: Another instance of Derby may have already booted the database 
> ~/metastore_db":
> {code:python}
> from pyspark import SparkContext, HiveContext
> sc = SparkContext("local[*]", "app1")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()
> sc.stop()
> sc = SparkContext("local[*]", "app2")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()  # Py4J error
> {code}
> This is related to [#SPARK-9539], and I intend to restart spark context 
> several times for isolated jobs to prevent cache cluttering and GC errors.
> Here's a larger part of the full error trace:
> {noformat}
> Failed to start database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
> org.datanucleus.exceptions.NucleusDataStoreException: Failed to start 
> database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at 
> org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
>   at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
>   at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>   at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>

[jira] [Created] (SPARK-21521) History service requires user is in any group

2017-07-24 Thread Adrian Bridgett (JIRA)
Adrian Bridgett created SPARK-21521:
---

 Summary: History service requires user is in any group
 Key: SPARK-21521
 URL: https://issues.apache.org/jira/browse/SPARK-21521
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Adrian Bridgett


(Regression cf. 2.0.2)

We run Spark as several users; these write to the history location, where the 
files are saved as those users with permissions of 770 (this is hardcoded in 
EventLoggingListener.scala).

The history service runs as root so that it has permission to read these files (see 
https://spark.apache.org/docs/latest/security.html).

This worked fine in v2.0.2; however, in v2.2.0 the events are skipped 
unless I add the root user to each user's group, at which point they are seen.

We currently have all ACL configuration unset.






[jira] [Commented] (SPARK-21520) Hivetable scan for all the columns the SQL statement contains the 'rand'

2017-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098190#comment-16098190
 ] 

Apache Spark commented on SPARK-21520:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/18725

> Hivetable scan for all the columns the SQL statement contains the 'rand'
> 
>
> Key: SPARK-21520
> URL: https://issues.apache.org/jira/browse/SPARK-21520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently, when the rand function is present in the SQL statement, hivetable 
> searches all columns in the table.
> e.g:
> select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, 
> ceil(c010) as cceila from XXX_table) a
> group by k,k;
> generate WholeStageCodegen subtrees:
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) 
> AS k#403L]
>+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
> == Subtree 2 / 2 ==
> *HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], 
> output=[k#403L, k#403L, sum(id)#797L])
> +- Exchange hashpartitioning(k#403L, 200)
>+- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
>   +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 
> 1.0)) AS k#403L]
>  +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
>
> All columns will be searched in HiveTableScans , Consequently, All column 
> data is read to a ORC table.
> e.g:
> INFO ReaderImpl: Reading ORC rows from 
> hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 
> with {include: [true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true], offset: 0, length: 
> 9223372036854775807}
> so, The execution of the SQL statement will become very slow.






[jira] [Assigned] (SPARK-21520) Hivetable scan for all the columns the SQL statement contains the 'rand'

2017-07-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21520:


Assignee: Apache Spark

> Hivetable scan for all the columns the SQL statement contains the 'rand'
> 
>
> Key: SPARK-21520
> URL: https://issues.apache.org/jira/browse/SPARK-21520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>
> Currently, when the rand function is present in the SQL statement, hivetable 
> searches all columns in the table.
> e.g:
> select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, 
> ceil(c010) as cceila from XXX_table) a
> group by k,k;
> generate WholeStageCodegen subtrees:
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) 
> AS k#403L]
>+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
> == Subtree 2 / 2 ==
> *HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], 
> output=[k#403L, k#403L, sum(id)#797L])
> +- Exchange hashpartitioning(k#403L, 200)
>+- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
>   +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 
> 1.0)) AS k#403L]
>  +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
>
> All columns will be searched in HiveTableScans , Consequently, All column 
> data is read to a ORC table.
> e.g:
> INFO ReaderImpl: Reading ORC rows from 
> hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 
> with {include: [true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true], offset: 0, length: 
> 9223372036854775807}
> so, The execution of the SQL statement will become very slow.






[jira] [Assigned] (SPARK-21520) Hivetable scan for all the columns the SQL statement contains the 'rand'

2017-07-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21520:


Assignee: (was: Apache Spark)

> Hivetable scan for all the columns the SQL statement contains the 'rand'
> 
>
> Key: SPARK-21520
> URL: https://issues.apache.org/jira/browse/SPARK-21520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently, when the rand function is present in the SQL statement, hivetable 
> searches all columns in the table.
> e.g:
> select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, 
> ceil(c010) as cceila from XXX_table) a
> group by k,k;
> generate WholeStageCodegen subtrees:
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) 
> AS k#403L]
>+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
> == Subtree 2 / 2 ==
> *HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], 
> output=[k#403L, k#403L, sum(id)#797L])
> +- Exchange hashpartitioning(k#403L, 200)
>+- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
>   +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 
> 1.0)) AS k#403L]
>  +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
>
> All columns will be searched in HiveTableScans , Consequently, All column 
> data is read to a ORC table.
> e.g:
> INFO ReaderImpl: Reading ORC rows from 
> hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 
> with {include: [true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true], offset: 0, length: 
> 9223372036854775807}
> so, The execution of the SQL statement will become very slow.






[jira] [Updated] (SPARK-21520) Hivetable scan for all the columns the SQL statement contains the 'rand'

2017-07-24 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21520:
--
Description: 
Currently, when the rand function is present in the SQL statement, the Hive 
table scan reads all columns in the table.
e.g.:
select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, 
ceil(c010) as cceila from XXX_table) a
group by k,k;

generate WholeStageCodegen subtrees:
== Subtree 1 / 2 ==
*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], 
output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS 
k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
== Subtree 2 / 2 ==
*HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], 
output=[k#403L, k#403L, sum(id)#797L])
+- Exchange hashpartitioning(k#403L, 200)
   +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
bigint))], output=[k#403L, sum#800L])
  +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 
1.0)) AS k#403L]
 +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
 
All columns are scanned in HiveTableScan; consequently, all the column data 
is read from the ORC table.
e.g.:
INFO ReaderImpl: Reading ORC rows from 
hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with 
{include: [true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true], offset: 0, length: 9223372036854775807}

As a result, the execution of the SQL statement becomes very slow.


  was:
Currently, when the rand function is present in the SQL statement, hivetable 
searches all columns in the table.
e.g:
select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, 
ceil(c010) as cceila from XXX_table) a
group by k,k;

generate WholeStageCodegen subtrees:
== Subtree 1 / 2 ==
*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], 
output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS 
k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
== Subtree 2 / 2 ==
*HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], 
output=[k#403L, k#403L, sum(id)#797L])
+- Exchange hashpartitioning(k#403L, 200)
   +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
bigint))], output=[k#403L, sum#800L])
  +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 
1.0)) AS k#403L]
 +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
 
All columns will be searched in HiveTableScans , Consequently, All column data 
is read to a ORC table.
e.g:
INFO ReaderImpl: Reading ORC rows from 
hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with 
{include: [true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, 

[jira] [Created] (SPARK-21520) Hivetable scan for all the columns the SQL statement contains the 'rand'

2017-07-24 Thread caoxuewen (JIRA)
caoxuewen created SPARK-21520:
-

 Summary: Hivetable scan for all the columns the SQL statement 
contains the 'rand'
 Key: SPARK-21520
 URL: https://issues.apache.org/jira/browse/SPARK-21520
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: caoxuewen


Currently, when the rand function is present in the SQL statement, the Hive 
table scan reads all columns in the table.
e.g.:
select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, 
ceil(c010) as cceila from XXX_table) a
group by k,k;

generate WholeStageCodegen subtrees:
== Subtree 1 / 2 ==
*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], 
output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS 
k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
== Subtree 2 / 2 ==
*HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], 
output=[k#403L, k#403L, sum(id)#797L])
+- Exchange hashpartitioning(k#403L, 200)
   +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
bigint))], output=[k#403L, sum#800L])
  +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 
1.0)) AS k#403L]
 +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
 
All columns are scanned in HiveTableScan; consequently, all the column data 
is read from the ORC table.
e.g.:
INFO ReaderImpl: Reading ORC rows from 
hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with 
{include: [true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true], offset: 0, length: 9223372036854775807}

As a result, the execution of the SQL statement becomes very slow.

Proposed solution:
mark the rand expression as deterministic (deterministic = true), so that column 
pruning can push the projection down to the Hive table scan.
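For reference, a minimal spark-shell sketch (reusing the placeholder table and 
column names from the description above, and assuming a Hive-enabled SparkSession; 
the rand multiplier is arbitrary) that compares the physical plans with and without 
rand(), so the difference in the HiveTableScan column list becomes visible:

{code}
// Placeholder table/column names; assumes spark-shell started with Hive support.
// Per the description above, the plan with rand() is expected to show a
// HiveTableScan listing every column of the table, while the deterministic
// variant should list only the columns that are actually referenced.
spark.sql(
  """SELECT k, sum(id)
    |FROM (SELECT d004 AS id, floor(rand() * 10000) AS k FROM xxx_table) a
    |GROUP BY k""".stripMargin).explain()

spark.sql(
  """SELECT k, sum(id)
    |FROM (SELECT d004 AS id, floor(c010) AS k FROM xxx_table) a
    |GROUP BY k""".stripMargin).explain()
{code}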






[jira] [Updated] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-07-24 Thread Luca Canali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-21519:

Description: 
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{code}
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell --jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORD").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
{code}

Comments: This proposal has been developed and tested for connecting the Spark 
JDBC data source to Oracle databases; however, I believe it can be useful for 
other target DBs too, as it is quite generic.  
Note that the proposed code allows SQL to be injected into the target database. This is 
not a security concern as such, since it requires password authentication; however, 
beware of the possibilities that injecting user-provided SQL (and PL/SQL) 
opens up.


  was:
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{code}
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell –jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORK").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
{code}

Comments: This proposal has been developed and tested for connecting the Spark 
JDBC data source to Oracle databases, however it believe it can be useful for 
other target DBs too, as it is quite generic.  
Note the proposed code allows to inject SQL into the target database. This is 
not a security concern as such, as it requires password authentication, however 
beware of the possibilities for injecting user-provided SQL (and PL/SQL) that 
this opens.



> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes an option to the JDBC datasource, tentatively called " 
> sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
>  ) . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> {code}
> val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set time_zone=''+02:00''';
> end;
> """
> bin/spark-shell –jars ojdb6.jar
> 

[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-07-24 Thread Manoj Mahalingam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098185#comment-16098185
 ] 

Manoj Mahalingam commented on SPARK-18016:
--

Is there a workaround we can apply when we hit this? Also, is this being 
backported to 2.1? I see that it was added and then reverted.

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 

[jira] [Commented] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-07-24 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098149#comment-16098149
 ] 

Prashant Sharma commented on SPARK-21177:
-

Well, I have tried it in three different environments: an Ubuntu laptop, a 
MacBook Pro, and a CentOS machine.
 Git sha a848d552ef6b5d0d3bb3b2da903478437a8b10aa.

What else can I help you with?

> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> Time taken for time to append to hive: is 3055
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe 
> operation.






[jira] [Commented] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-07-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098148#comment-16098148
 ] 

Hyukjin Kwon commented on SPARK-21177:
--

Meanwhile, let me give it another try.

> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> Time taken for time to append to hive: is 3055
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe 
> operation.






[jira] [Comment Edited] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-07-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098143#comment-16098143
 ] 

Hyukjin Kwon edited comment on SPARK-21177 at 7/24/17 9:35 AM:
---

Yea, I did but I could not reproduce. Could you describe your environment?


was (Author: hyukjin.kwon):
Yea, I did but I could not. Could you describe your environment?

> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> Time taken for time to append to hive: is 3055
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe 
> operation.






[jira] [Commented] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-07-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098143#comment-16098143
 ] 

Hyukjin Kwon commented on SPARK-21177:
--

Yea, I did but I could not. Could you describe your environment?

> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> Time taken for time to append to hive: is 3055
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe 
> operation.






[jira] [Updated] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-07-24 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-21177:

Description: 
In short, please use the following shell transcript for the reproducer. 

{code:java}

Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
  /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> def printTimeTaken(str: String, f: () => Unit) {
val start = System.nanoTime()
f()
val end = System.nanoTime()
val timetaken = end - start
import scala.concurrent.duration._
println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
  }
 |  |  |  |  |  |  | printTimeTaken: (str: String, 
f: () => Unit)Unit

scala> 
for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { Seq(1, 
2).toDF().write.mode("append").saveAsTable("t1"); })}
Time taken for time to append to hive: is 284

Time taken for time to append to hive: is 211

...
...

Time taken for time to append to hive: is 2615

Time taken for time to append to hive: is 3055

Time taken for time to append to hive: is 22425


{code}

Why does it matter ?

In a streaming job it is not possible to append to hive using this dataframe 
operation.

  was:

In short, please use the following shell transcript for the reproducer. 

{code:java}

Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
  /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> def printTimeTaken(str: String, f: () => Unit) {
val start = System.nanoTime()
f()
val end = System.nanoTime()
val timetaken = end - start
import scala.concurrent.duration._
println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
  }
 |  |  |  |  |  |  | printTimeTaken: (str: String, 
f: () => Unit)Unit

scala> 
for(i <- 1 to 1) {printTimeTaken("time to append to hive:", () => { Seq(1, 
2).toDF().write.mode("append").saveAsTable("t1"); })}
Time taken for time to append to hive: is 284

Time taken for time to append to hive: is 211

...
...

Time taken for time to append to hive: is 2615

Time taken for time to append to hive: is 3055

Time taken for time to append to hive: is 22425


{code}

Why does it matter ?

In a streaming job it is not possible to append to hive using this dataframe 
operation.


> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> Time taken for time to append to hive: is 3055
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe 
> operation.






[jira] [Commented] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-07-24 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098139#comment-16098139
 ] 

Prashant Sharma commented on SPARK-21177:
-

It is pretty easy to reproduce; you need to run it long enough. Increasing the 
number of iterations in the above reproducer might help.

E.g.

{code:java}

for(i <- 1 to 10) {printTimeTaken("time to append to hive:", () => { Seq(1, 
2).toDF().write.mode("append").saveAsTable("t1"); })}

{code}


> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 1) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> Time taken for time to append to hive: is 3055
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter ?
> In a streaming job it is not possible to append to hive using this dataframe 
> operation.






[jira] [Updated] (SPARK-21517) Fetch local data via block manager cause oom

2017-07-24 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21517:
-
Description: 
In our production cluster,oom happens when NettyBlockRpcServer receive 
OpenBlocks message.The reason we observed is below:
When BlockManagerManagedBuffer call ChunkedByteBuffer#toNetty, it will use 
Unpooled.wrappedBuffer(ByteBuffer... buffers) which use default 
maxNumComponents=16 in low-level CompositeByteBuf.When our component's number 
is bigger than 16, it will execute during buffer copy.

{code:java}
private void consolidateIfNeeded() {
int numComponents = this.components.size();
if(numComponents > this.maxNumComponents) {
int capacity = 
((CompositeByteBuf.Component)this.components.get(numComponents - 1)).endOffset;
ByteBuf consolidated = this.allocBuffer(capacity);

for(int c = 0; c < numComponents; ++c) {
CompositeByteBuf.Component c1 = 
(CompositeByteBuf.Component)this.components.get(c);
ByteBuf b = c1.buf;
consolidated.writeBytes(b);
c1.freeIfNecessary();
}

CompositeByteBuf.Component var7 = new 
CompositeByteBuf.Component(consolidated);
var7.endOffset = var7.length;
this.components.clear();
this.components.add(var7);
}

}
{code}
in CompositeByteBuf, which consumes additional memory during the buffer copy.
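
As a rough illustration of the consolidation path (this is a standalone sketch, 
not Spark code, and assumes only io.netty on the classpath), the Scala snippet 
below wraps more than 16 ByteBuffers with the default Unpooled.wrappedBuffer 
overload:

{code}
import java.nio.ByteBuffer
import io.netty.buffer.Unpooled

// Illustration only: wrap 32 chunks of 1 KB each. With the Netty versions
// discussed here, the default wrappedBuffer overload uses maxNumComponents = 16,
// so 32 components exceed the limit and the CompositeByteBuf consolidates them,
// i.e. copies all chunks into a single newly allocated buffer.
val chunks = Array.fill(32)(ByteBuffer.wrap(Array.fill(1024)(0.toByte)))
val composite = Unpooled.wrappedBuffer(chunks: _*)
println(composite.readableBytes())  // 32 * 1024 = 32768 readable bytes
{code}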


  was:
In our production cluster,oom happens when NettyBlockRpcServer receive 
OpenBlocks message.The reason we observed is below:
When BlockManagerManagedBuffer call ChunkedByteBuffer#toNetty, it will use 
Unpooled.wrappedBuffer(ByteBuffer... buffers) which use default 
maxNumComponents=16 in low-level CompositeByteBuf.When our component's number 
is bigger than 16, it will execute during buffer copy.

{code:java}
private void consolidateIfNeeded() {
int numComponents = this.components.size();
if(numComponents > this.maxNumComponents) {
int capacity = 
((CompositeByteBuf.Component)this.components.get(numComponents - 1)).endOffset;
ByteBuf consolidated = this.allocBuffer(capacity);

for(int c = 0; c < numComponents; ++c) {
CompositeByteBuf.Component c1 = 
(CompositeByteBuf.Component)this.components.get(c);
ByteBuf b = c1.buf;
consolidated.writeBytes(b);
c1.freeIfNecessary();
}

CompositeByteBuf.Component var7 = new 
CompositeByteBuf.Component(consolidated);
var7.endOffset = var7.length;
this.components.clear();
this.components.add(var7);
}

}
{code}
in CompositeByteBuf which will consume some memory.



> Fetch local data via block manager cause oom
> 
>
> Key: SPARK-21517
> URL: https://issues.apache.org/jira/browse/SPARK-21517
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> In our production cluster,oom happens when NettyBlockRpcServer receive 
> OpenBlocks message.The reason we observed is below:
> When BlockManagerManagedBuffer call ChunkedByteBuffer#toNetty, it will use 
> Unpooled.wrappedBuffer(ByteBuffer... buffers) which use default 
> maxNumComponents=16 in low-level CompositeByteBuf.When our component's number 
> is bigger than 16, it will execute during buffer copy.
> {code:java}
> private void consolidateIfNeeded() {
> int numComponents = this.components.size();
> if(numComponents > this.maxNumComponents) {
> int capacity = 
> ((CompositeByteBuf.Component)this.components.get(numComponents - 
> 1)).endOffset;
> ByteBuf consolidated = this.allocBuffer(capacity);
> for(int c = 0; c < numComponents; ++c) {
> CompositeByteBuf.Component c1 = 
> (CompositeByteBuf.Component)this.components.get(c);
> ByteBuf b = c1.buf;
> consolidated.writeBytes(b);
> c1.freeIfNecessary();
> }
> CompositeByteBuf.Component var7 = new 
> CompositeByteBuf.Component(consolidated);
> var7.endOffset = var7.length;
> this.components.clear();
> this.components.add(var7);
> }
> }
> {code}
> in CompositeByteBuf which will consume some memory during buffer copy.






[jira] [Updated] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-07-24 Thread Luca Canali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-21519:

Description: 
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{code}
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell --jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORK").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
{code}

Comments: This proposal has been developed and tested for connecting the Spark 
JDBC data source to Oracle databases; however, I believe it can be useful for 
other target DBs too, as it is quite generic.  
Note that the proposed code allows SQL to be injected into the target database. This is 
not a security concern as such, since it requires password authentication; however, 
beware of the possibilities that injecting user-provided SQL (and PL/SQL) 
opens up.


  was:
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:


val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell –jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORK").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)


Comments: This proposal has been developed and tested for connecting the Spark 
JDBC data source to Oracle databases, however it believe it can be useful for 
other target DBs too, as it is quite generic.  
Note the proposed code allows to inject SQL into the target database. This is 
not a security concern as such, as it requires password authentication, however 
beware of the possibilities for injecting user-provided SQL (and PL/SQL) that 
this opens.



> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes an option to the JDBC datasource, tentatively called " 
> sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
>  ) . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> {code}
> val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set time_zone=''+02:00''';
> end;
> """
> bin/spark-shell –jars ojdb6.jar
> val df = 

[jira] [Updated] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-07-24 Thread Luca Canali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-21519:

Description: 
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{quote}val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell --jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORK").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
{quote}

Comments: This proposal has been developed and tested for connecting the Spark
JDBC data source to Oracle databases; however, we believe it can be useful for
other target DBs too, as it is quite generic.
Note that the proposed code allows injecting SQL into the target database. This
is not a security concern as such, as it requires password authentication;
however, beware of the possibilities this opens for injecting user-provided SQL
(and PL/SQL).


  was:
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell --jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORK").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)

Comments: This proposal has been developed and tested for connecting the Spark
JDBC data source to Oracle databases; however, we believe it can be useful for
other target DBs too, as it is quite generic.
Note that the proposed code allows injecting SQL into the target database. This
is not a security concern as such, as it requires password authentication;
however, beware of the possibilities this opens for injecting user-provided SQL
(and PL/SQL).



> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes an option to the JDBC datasource, tentatively called " 
> sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
>  ) . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> {quote}val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set time_zone=''+02:00''';
> end;
> """
> bin/spark-shell --jars ojdb6.jar
> val df = 

[jira] [Updated] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-07-24 Thread Luca Canali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-21519:

Description: 
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:


val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell --jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORK").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)


Comments: This proposal has been developed and tested for connecting the Spark
JDBC data source to Oracle databases; however, we believe it can be useful for
other target DBs too, as it is quite generic.
Note that the proposed code allows injecting SQL into the target database. This
is not a security concern as such, as it requires password authentication;
however, beware of the possibilities this opens for injecting user-provided SQL
(and PL/SQL).


  was:
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

{quote}val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell --jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORK").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
{quote}

Comments: This proposal has been developed and tested for connecting the Spark
JDBC data source to Oracle databases; however, we believe it can be useful for
other target DBs too, as it is quite generic.
Note that the proposed code allows injecting SQL into the target database. This
is not a security concern as such, as it requires password authentication;
however, beware of the possibilities this opens for injecting user-provided SQL
(and PL/SQL).



> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes an option to the JDBC datasource, tentatively called " 
> sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
>  ) . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> 
> val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set time_zone=''+02:00''';
> end;
> """
> bin/spark-shell --jars ojdb6.jar
> val df = 

[jira] [Updated] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-07-24 Thread Luca Canali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-21519:

Description: 
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;
"""

bin/spark-shell --jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORK").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)

Comments: This proposal has been developed and tested for connecting the Spark
JDBC data source to Oracle databases; however, we believe it can be useful for
other target DBs too, as it is quite generic.
Note that the proposed code allows injecting SQL into the target database. This
is not a security concern as such, as it requires password authentication;
however, beware of the possibilities this opens for injecting user-provided SQL
(and PL/SQL).


  was:
This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

```
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;

bin/spark-shell --jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORK").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
```

Comments: This proposal has been developed and tested for connecting the Spark
JDBC data source to Oracle databases; however, we believe it can be useful for
other target DBs too, as it is quite generic.
Note that the proposed code allows injecting SQL into the target database. This
is not a security concern as such, as it requires password authentication;
however, beware of the possibilities this opens for injecting user-provided SQL
(and PL/SQL).



> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes an option to the JDBC datasource, tentatively called " 
> sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
>  ) . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set time_zone=''+02:00''';
> end;
> """
> bin/spark-shell --jars ojdb6.jar
> val df = 

[jira] [Commented] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098105#comment-16098105
 ] 

Apache Spark commented on SPARK-21519:
--

User 'LucaCanali' has created a pull request for this issue:
https://github.com/apache/spark/pull/18724

> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes an option to the JDBC datasource, tentatively called " 
> sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
>  ) . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> ```
> val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set time_zone=''+02:00''';
> end;
> bin/spark-shell --jars ojdb6.jar
> val df = spark.read.format("jdbc").option("url", 
> "jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
> "oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
> systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
> "MYUSER").option("password", 
> "MYPASSWORK").option("fetchsize",1000).option("sessionInitStatement", 
> preambleSQL).load()
> df.show(5,false)
> ```
> Comments: This proposal has been developed and tested for connecting the
> Spark JDBC data source to Oracle databases; however, we believe it can be
> useful for other target DBs too, as it is quite generic.
> Note that the proposed code allows injecting SQL into the target database.
> This is not a security concern as such, as it requires password
> authentication; however, beware of the possibilities this opens for injecting
> user-provided SQL (and PL/SQL).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-07-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21519:


Assignee: (was: Apache Spark)

> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes an option to the JDBC datasource, tentatively called " 
> sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
>  ) . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> ```
> val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set time_zone=''+02:00''';
> end;
> bin/spark-shell --jars ojdb6.jar
> val df = spark.read.format("jdbc").option("url", 
> "jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
> "oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
> systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
> "MYUSER").option("password", 
> "MYPASSWORK").option("fetchsize",1000).option("sessionInitStatement", 
> preambleSQL).load()
> df.show(5,false)
> ```
> Comments: This proposal has been developed and tested for connecting the
> Spark JDBC data source to Oracle databases; however, we believe it can be
> useful for other target DBs too, as it is quite generic.
> Note that the proposed code allows injecting SQL into the target database.
> This is not a security concern as such, as it requires password
> authentication; however, beware of the possibilities this opens for injecting
> user-provided SQL (and PL/SQL).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-07-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21519:


Assignee: Apache Spark

> Add an option to the JDBC data source to initialize the environment of the 
> remote database session
> --
>
> Key: SPARK-21519
> URL: https://issues.apache.org/jira/browse/SPARK-21519
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Luca Canali
>Assignee: Apache Spark
>Priority: Minor
>
> This proposes an option to the JDBC datasource, tentatively called " 
> sessionInitStatement" to implement the functionality of session 
> initialization present for example in the Sqoop connector for Oracle (see 
> https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
>  ) . After each database session is opened to the remote DB, and before 
> starting to read data, this option executes a custom SQL statement (or a 
> PL/SQL block in the case of Oracle).
> Example of usage, relevant to Oracle JDBC:
> ```
> val preambleSQL="""
> begin 
>   execute immediate 'alter session set tracefile_identifier=sparkora'; 
>   execute immediate 'alter session set "_serial_direct_read"=true';
>   execute immediate 'alter session set time_zone=''+02:00''';
> end;
> bin/spark-shell --jars ojdb6.jar
> val df = spark.read.format("jdbc").option("url", 
> "jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
> "oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
> systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
> "MYUSER").option("password", 
> "MYPASSWORK").option("fetchsize",1000).option("sessionInitStatement", 
> preambleSQL).load()
> df.show(5,false)
> ```
> Comments: This proposal has been developed and tested for connecting the
> Spark JDBC data source to Oracle databases; however, we believe it can be
> useful for other target DBs too, as it is quite generic.
> Note that the proposed code allows injecting SQL into the target database.
> This is not a security concern as such, as it requires password
> authentication; however, beware of the possibilities this opens for injecting
> user-provided SQL (and PL/SQL).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21519) Add an option to the JDBC data source to initialize the environment of the remote database session

2017-07-24 Thread Luca Canali (JIRA)
Luca Canali created SPARK-21519:
---

 Summary: Add an option to the JDBC data source to initialize the 
environment of the remote database session
 Key: SPARK-21519
 URL: https://issues.apache.org/jira/browse/SPARK-21519
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0, 2.1.1, 2.1.0
Reporter: Luca Canali
Priority: Minor


This proposes an option to the JDBC datasource, tentatively called " 
sessionInitStatement" to implement the functionality of session initialization 
present for example in the Sqoop connector for Oracle (see 
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_oraoop_oracle_session_initialization_statements
 ) . After each database session is opened to the remote DB, and before 
starting to read data, this option executes a custom SQL statement (or a PL/SQL 
block in the case of Oracle).
Example of usage, relevant to Oracle JDBC:

```
val preambleSQL="""
begin 
  execute immediate 'alter session set tracefile_identifier=sparkora'; 
  execute immediate 'alter session set "_serial_direct_read"=true';
  execute immediate 'alter session set time_zone=''+02:00''';
end;

bin/spark-shell --jars ojdb6.jar

val df = spark.read.format("jdbc").option("url", 
"jdbc:oracle:thin:@ORACLEDBSERVER:1521/service_name").option("driver", 
"oracle.jdbc.driver.OracleDriver").option("dbtable", "(select 1, sysdate, 
systimestamp, current_timestamp, localtimestamp from dual)").option("user", 
"MYUSER").option("password", 
"MYPASSWORK").option("fetchsize",1000).option("sessionInitStatement", 
preambleSQL).load()

df.show(5,false)
```

Comments: This proposal has been developed and tested for connecting the Spark
JDBC data source to Oracle databases; however, we believe it can be useful for
other target DBs too, as it is quite generic.
Note that the proposed code allows injecting SQL into the target database. This
is not a security concern as such, as it requires password authentication;
however, beware of the possibilities this opens for injecting user-provided SQL
(and PL/SQL).




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21518) Warnings if spark.mesos.task.labels is unset

2017-07-24 Thread Adrian Bridgett (JIRA)
Adrian Bridgett created SPARK-21518:
---

 Summary: Warnings if spark.mesos.task.labels is unset
 Key: SPARK-21518
 URL: https://issues.apache.org/jira/browse/SPARK-21518
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.2.0
Reporter: Adrian Bridgett
Priority: Minor


If spark.mesos.task.labels is unset, there are lots of warning messages:
{noformat}
17/07/24 08:13:46 WARN mesos.MesosCoarseGrainedSchedulerBackend: Unable to 
parse  into a key:value label for the task.
{noformat}

This doesn't break anything, just a little annoying.
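
As a stop-gap, the warning can be avoided by setting the property to well-formed
key:value pairs so the parser never sees the empty string; a small sketch (the
label keys and values are made up for illustration):

{code}
import org.apache.spark.SparkConf

// Labels are comma-separated key:value pairs; these particular labels are
// illustrative only.
val conf = new SparkConf().
  setAppName("labels-example").
  set("spark.mesos.task.labels", "environment:prod,team:data")
{code}

The same value can also be passed on the command line via
--conf spark.mesos.task.labels=environment:prod,team:data.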



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21517) Fetch local data via block manager cause oom

2017-07-24 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21517:
-
Description: 
In our production cluster, OOM happens when NettyBlockRpcServer receives an
OpenBlocks message. The reason we observed is below:
When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses
Unpooled.wrappedBuffer(ByteBuffer... buffers), which uses the default
maxNumComponents=16 in the low-level CompositeByteBuf. When the number of
components is bigger than 16, it will execute the following buffer copy

{code:java}
private void consolidateIfNeeded() {
int numComponents = this.components.size();
if(numComponents > this.maxNumComponents) {
int capacity = 
((CompositeByteBuf.Component)this.components.get(numComponents - 1)).endOffset;
ByteBuf consolidated = this.allocBuffer(capacity);

for(int c = 0; c < numComponents; ++c) {
CompositeByteBuf.Component c1 = 
(CompositeByteBuf.Component)this.components.get(c);
ByteBuf b = c1.buf;
consolidated.writeBytes(b);
c1.freeIfNecessary();
}

CompositeByteBuf.Component var7 = new 
CompositeByteBuf.Component(consolidated);
var7.endOffset = var7.length;
this.components.clear();
this.components.add(var7);
}

}
{code}
in CompositeByteBuf which will consume some memory.
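
A minimal sketch of one possible mitigation, assuming Netty 4.x:
Unpooled.wrappedBuffer has an overload that takes maxNumComponents, so passing
the actual number of chunks keeps the CompositeByteBuf below its consolidation
threshold. This is a sketch only, not necessarily what the eventual fix does:

{code}
import java.nio.ByteBuffer
import io.netty.buffer.{ByteBuf, Unpooled}

// Sketch of a toNetty-style helper: with maxNumComponents set to the chunk
// count, numComponents can never exceed the threshold, so the
// consolidateIfNeeded() copy shown above is never triggered by the wrap.
def wrapWithoutConsolidation(chunks: Array[ByteBuffer]): ByteBuf =
  Unpooled.wrappedBuffer(math.max(chunks.length, 1), chunks: _*)
{code}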


  was:
In our production cluster, OOM happens when NettyBlockRpcServer receives an
OpenBlocks message. The reason we observed is below:
When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses
Unpooled.wrappedBuffer(ByteBuffer... buffers), which uses the default
maxNumComponents=16 in the low-level CompositeByteBuf. When the number of
components is bigger than 16, it will execute

{code:java}
private void consolidateIfNeeded() {
int numComponents = this.components.size();
if(numComponents > this.maxNumComponents) {
int capacity = 
((CompositeByteBuf.Component)this.components.get(numComponents - 1)).endOffset;
ByteBuf consolidated = this.allocBuffer(capacity);

for(int c = 0; c < numComponents; ++c) {
CompositeByteBuf.Component c1 = 
(CompositeByteBuf.Component)this.components.get(c);
ByteBuf b = c1.buf;
consolidated.writeBytes(b);
c1.freeIfNecessary();
}

CompositeByteBuf.Component var7 = new 
CompositeByteBuf.Component(consolidated);
var7.endOffset = var7.length;
this.components.clear();
this.components.add(var7);
}

}
{code}
in CompositeByteBuf which will consume some memory.



> Fetch local data via block manager cause oom
> 
>
> Key: SPARK-21517
> URL: https://issues.apache.org/jira/browse/SPARK-21517
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> In our production cluster, OOM happens when NettyBlockRpcServer receives an
> OpenBlocks message. The reason we observed is below:
> When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses
> Unpooled.wrappedBuffer(ByteBuffer... buffers), which uses the default
> maxNumComponents=16 in the low-level CompositeByteBuf. When the number of
> components is bigger than 16, it will execute the following buffer copy
> {code:java}
> private void consolidateIfNeeded() {
> int numComponents = this.components.size();
> if(numComponents > this.maxNumComponents) {
> int capacity = 
> ((CompositeByteBuf.Component)this.components.get(numComponents - 
> 1)).endOffset;
> ByteBuf consolidated = this.allocBuffer(capacity);
> for(int c = 0; c < numComponents; ++c) {
> CompositeByteBuf.Component c1 = 
> (CompositeByteBuf.Component)this.components.get(c);
> ByteBuf b = c1.buf;
> consolidated.writeBytes(b);
> c1.freeIfNecessary();
> }
> CompositeByteBuf.Component var7 = new 
> CompositeByteBuf.Component(consolidated);
> var7.endOffset = var7.length;
> this.components.clear();
> this.components.add(var7);
> }
> }
> {code}
> in CompositeByteBuf which will consume some memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21517) Fetch local data via block manager cause oom

2017-07-24 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21517:
-
Description: 
In our production cluster, OOM happens when NettyBlockRpcServer receives an
OpenBlocks message. The reason we observed is below:
When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses
Unpooled.wrappedBuffer(ByteBuffer... buffers), which uses the default
maxNumComponents=16 in the low-level CompositeByteBuf. When the number of
components is bigger than 16, it will execute

{code:java}
private void consolidateIfNeeded() {
int numComponents = this.components.size();
if(numComponents > this.maxNumComponents) {
int capacity = 
((CompositeByteBuf.Component)this.components.get(numComponents - 1)).endOffset;
ByteBuf consolidated = this.allocBuffer(capacity);

for(int c = 0; c < numComponents; ++c) {
CompositeByteBuf.Component c1 = 
(CompositeByteBuf.Component)this.components.get(c);
ByteBuf b = c1.buf;
consolidated.writeBytes(b);
c1.freeIfNecessary();
}

CompositeByteBuf.Component var7 = new 
CompositeByteBuf.Component(consolidated);
var7.endOffset = var7.length;
this.components.clear();
this.components.add(var7);
}

}
{code}
in CompositeByteBuf which will consume some memory.


  was:
In our production cluster, OOM happens when NettyBlockRpcServer receives an
OpenBlocks message. The reason we observed is below:
When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses
Unpooled.wrappedBuffer(ByteBuffer... buffers), which uses the default
maxNumComponents=16 in the low-level CompositeByteBuf. When the number of
components is bigger than 16, it will execute

{code:java}
private void consolidateIfNeeded() {
int numComponents = this.components.size();
if(numComponents > this.maxNumComponents) {
int capacity = 
((CompositeByteBuf.Component)this.components.get(numComponents - 1)).endOffset;
ByteBuf consolidated = this.allocBuffer(capacity);

for(int c = 0; c < numComponents; ++c) {
CompositeByteBuf.Component c1 = 
(CompositeByteBuf.Component)this.components.get(c);
ByteBuf b = c1.buf;
consolidated.writeBytes(b);
c1.freeIfNecessary();
}

CompositeByteBuf.Component var7 = new 
CompositeByteBuf.Component(consolidated);
var7.endOffset = var7.length;
this.components.clear();
this.components.add(var7);
}

}
{code}
in CompositeByteBuf which will consume some memory.



> Fetch local data via block manager cause oom
> 
>
> Key: SPARK-21517
> URL: https://issues.apache.org/jira/browse/SPARK-21517
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> In our production cluster, OOM happens when NettyBlockRpcServer receives an
> OpenBlocks message. The reason we observed is below:
> When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses
> Unpooled.wrappedBuffer(ByteBuffer... buffers), which uses the default
> maxNumComponents=16 in the low-level CompositeByteBuf. When the number of
> components is bigger than 16, it will execute
> {code:java}
> private void consolidateIfNeeded() {
> int numComponents = this.components.size();
> if(numComponents > this.maxNumComponents) {
> int capacity = 
> ((CompositeByteBuf.Component)this.components.get(numComponents - 
> 1)).endOffset;
> ByteBuf consolidated = this.allocBuffer(capacity);
> for(int c = 0; c < numComponents; ++c) {
> CompositeByteBuf.Component c1 = 
> (CompositeByteBuf.Component)this.components.get(c);
> ByteBuf b = c1.buf;
> consolidated.writeBytes(b);
> c1.freeIfNecessary();
> }
> CompositeByteBuf.Component var7 = new 
> CompositeByteBuf.Component(consolidated);
> var7.endOffset = var7.length;
> this.components.clear();
> this.components.add(var7);
> }
> }
> {code}
> in CompositeByteBuf which will consume some memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21517) Fetch local data via block manager cause oom

2017-07-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21517:


Assignee: Apache Spark

> Fetch local data via block manager cause oom
> 
>
> Key: SPARK-21517
> URL: https://issues.apache.org/jira/browse/SPARK-21517
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>Assignee: Apache Spark
>
> In our production cluster, OOM happens when NettyBlockRpcServer receives an
> OpenBlocks message. The reason we observed is below:
> When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses
> Unpooled.wrappedBuffer(ByteBuffer... buffers), which uses the default
> maxNumComponents=16 in the low-level CompositeByteBuf. When the number of
> components is bigger than 16, it will execute
> {code:java}
> private void consolidateIfNeeded() {
> int numComponents = this.components.size();
> if(numComponents > this.maxNumComponents) {
> int capacity = 
> ((CompositeByteBuf.Component)this.components.get(numComponents - 
> 1)).endOffset;
> ByteBuf consolidated = this.allocBuffer(capacity);
> for(int c = 0; c < numComponents; ++c) {
> CompositeByteBuf.Component c1 = 
> (CompositeByteBuf.Component)this.components.get(c);
> ByteBuf b = c1.buf;
> consolidated.writeBytes(b);
> c1.freeIfNecessary();
> }
> CompositeByteBuf.Component var7 = new 
> CompositeByteBuf.Component(consolidated);
> var7.endOffset = var7.length;
> this.components.clear();
> this.components.add(var7);
> }
> }
> {code}
> in CompositeByteBuf which will consume some memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21517) Fetch local data via block manager cause oom

2017-07-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21517:


Assignee: (was: Apache Spark)

> Fetch local data via block manager cause oom
> 
>
> Key: SPARK-21517
> URL: https://issues.apache.org/jira/browse/SPARK-21517
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> In our production cluster, OOM happens when NettyBlockRpcServer receives an
> OpenBlocks message. The reason we observed is below:
> When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses
> Unpooled.wrappedBuffer(ByteBuffer... buffers), which uses the default
> maxNumComponents=16 in the low-level CompositeByteBuf. When the number of
> components is bigger than 16, it will execute
> {code:java}
> private void consolidateIfNeeded() {
> int numComponents = this.components.size();
> if(numComponents > this.maxNumComponents) {
> int capacity = 
> ((CompositeByteBuf.Component)this.components.get(numComponents - 
> 1)).endOffset;
> ByteBuf consolidated = this.allocBuffer(capacity);
> for(int c = 0; c < numComponents; ++c) {
> CompositeByteBuf.Component c1 = 
> (CompositeByteBuf.Component)this.components.get(c);
> ByteBuf b = c1.buf;
> consolidated.writeBytes(b);
> c1.freeIfNecessary();
> }
> CompositeByteBuf.Component var7 = new 
> CompositeByteBuf.Component(consolidated);
> var7.endOffset = var7.length;
> this.components.clear();
> this.components.add(var7);
> }
> }
> {code}
> in CompositeByteBuf which will consume some memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21517) Fetch local data via block manager cause oom

2017-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098074#comment-16098074
 ] 

Apache Spark commented on SPARK-21517:
--

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/18723

> Fetch local data via block manager cause oom
> 
>
> Key: SPARK-21517
> URL: https://issues.apache.org/jira/browse/SPARK-21517
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> In our production cluster, OOM happens when NettyBlockRpcServer receives an
> OpenBlocks message. The reason we observed is below:
> When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses
> Unpooled.wrappedBuffer(ByteBuffer... buffers), which uses the default
> maxNumComponents=16 in the low-level CompositeByteBuf. When the number of
> components is bigger than 16, it will execute
> {code:java}
> private void consolidateIfNeeded() {
> int numComponents = this.components.size();
> if(numComponents > this.maxNumComponents) {
> int capacity = 
> ((CompositeByteBuf.Component)this.components.get(numComponents - 
> 1)).endOffset;
> ByteBuf consolidated = this.allocBuffer(capacity);
> for(int c = 0; c < numComponents; ++c) {
> CompositeByteBuf.Component c1 = 
> (CompositeByteBuf.Component)this.components.get(c);
> ByteBuf b = c1.buf;
> consolidated.writeBytes(b);
> c1.freeIfNecessary();
> }
> CompositeByteBuf.Component var7 = new 
> CompositeByteBuf.Component(consolidated);
> var7.endOffset = var7.length;
> this.components.clear();
> this.components.add(var7);
> }
> }
> {code}
> in CompositeByteBuf which will consume some memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21517) Fetch local data via block manager cause oom

2017-07-24 Thread zhoukang (JIRA)
zhoukang created SPARK-21517:


 Summary: Fetch local data via block manager cause oom
 Key: SPARK-21517
 URL: https://issues.apache.org/jira/browse/SPARK-21517
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Affects Versions: 2.1.0, 1.6.1
Reporter: zhoukang


In our production cluster, OOM happens when NettyBlockRpcServer receives an
OpenBlocks message. The reason we observed is below:
When BlockManagerManagedBuffer calls ChunkedByteBuffer#toNetty, it uses
Unpooled.wrappedBuffer(ByteBuffer... buffers), which uses the default
maxNumComponents=16 in the low-level CompositeByteBuf. When the number of
components is bigger than 16, it will execute

{code:java}
private void consolidateIfNeeded() {
int numComponents = this.components.size();
if(numComponents > this.maxNumComponents) {
int capacity = 
((CompositeByteBuf.Component)this.components.get(numComponents - 1)).endOffset;
ByteBuf consolidated = this.allocBuffer(capacity);

for(int c = 0; c < numComponents; ++c) {
CompositeByteBuf.Component c1 = 
(CompositeByteBuf.Component)this.components.get(c);
ByteBuf b = c1.buf;
consolidated.writeBytes(b);
c1.freeIfNecessary();
}

CompositeByteBuf.Component var7 = new 
CompositeByteBuf.Component(consolidated);
var7.endOffset = var7.length;
this.components.clear();
this.components.add(var7);
}

}
{code}
in CompositeByteBuf which will consume some memory.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21491) Performance enhancement: eliminate creation of intermediate collections

2017-07-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098060#comment-16098060
 ] 

Sean Owen commented on SPARK-21491:
---

Looks reasonable. I'd say just add a very brief note in the code to explain 
that it efficiently computes toMap

> Performance enhancement: eliminate creation of intermediate collections
> ---
>
> Key: SPARK-21491
> URL: https://issues.apache.org/jira/browse/SPARK-21491
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 2.2.0
>Reporter: Iurii Antykhovych
>Priority: Trivial
>
> Simple performance optimization in a few places of GraphX:
> {{Traversable.toMap}} can be replaced with {{collection.breakOut}}.
> This would eliminate creation of an intermediate collection of tuples, see
> [Stack Overflow 
> article|https://stackoverflow.com/questions/1715681/scala-2-8-breakout]
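
For context, a small sketch of the pattern being discussed (Scala 2.11/2.12;
collection.breakOut was removed in 2.13), using hypothetical element types:

{code}
import scala.collection.breakOut

// Hypothetical data; the shape mirrors the "collection of tuples -> Map" pattern.
final case class Attr(id: Long, name: String)
val attrs = Seq(Attr(1L, "a"), Attr(2L, "b"))

// Builds an intermediate Seq[(Long, String)] first, then converts it to a Map.
val viaToMap: Map[Long, String] = attrs.map(a => a.id -> a.name).toMap

// breakOut supplies a CanBuildFrom that lets map() build the Map directly,
// skipping the intermediate collection of tuples.
val viaBreakOut: Map[Long, String] = attrs.map(a => a.id -> a.name)(breakOut)
{code}

As suggested above, a one-line comment at each call site noting that breakOut is
just an allocation-free toMap keeps the intent clear.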



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21508) Documentation on 'Spark Streaming Custom Receivers' has error in example code

2017-07-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098057#comment-16098057
 ] 

Sean Owen commented on SPARK-21508:
---

You don't need it assigned, not until you resolve it.

> Documentation on 'Spark Streaming Custom Receivers' has error in example code
> -
>
> Key: SPARK-21508
> URL: https://issues.apache.org/jira/browse/SPARK-21508
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Remis Haroon
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The example code provided in the documentation on 'Spark Streaming Custom 
> Receivers' has an error.
> https://spark.apache.org/docs/latest/streaming-custom-receivers.html
> {code}
> // Assuming ssc is the StreamingContext
> val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port))
> val words = lines.flatMap(_.split(" "))
> ...
> {code}
> instead of {code}lines.flatMap(_.split(" ")){code} it should be 
> {code}customReceiverStream.flatMap(_.split(" ")){code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21515) Spark ML Random Forest

2017-07-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21515.
---
Resolution: Invalid

This is a question for StackOverflow or the mailing list.
https://spark.apache.org/contributing.html

> Spark ML Random Forest
> --
>
> Key: SPARK-21515
> URL: https://issues.apache.org/jira/browse/SPARK-21515
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 2.1.1
>Reporter: KovvuriSriRamaReddy
>
> We are reading data from a flat file and storing it in a Dataset.
> We have one for loop, where we need to modify the Dataset and use it in the
> next iteration. [ For the first iteration we use the original Dataset ]
> We all know that a variable of type Dataset is immutable. But the scenario
> is: inside the for loop we perform some processing (Random Forest, Spark ML) on
> this Dataset variable and use the updated result in the next
> iteration. This process continues until all the iterations are completed. [
> The size of the dataset stays the same; only the values change ]
> Approach 1: we store the intermediate result in a new Dataset variable
> and use it in the next iteration. What we have observed is that the first loop
> iteration took only 1 sec to execute, while the remaining iterations took
> exponentially more time. [ i.e. the 2nd iteration takes 70 sec, the 3rd 90 sec,
> and so on... ]
> Approach 2: we write the intermediate Dataset to HDFS or an external file and
> read it back for each iteration; then each iteration completes faster than in
> the previous approach. However, writing and reading the data to/from HDFS or
> the external file takes more time.
> This is the problem we need to fine-tune. Could anyone please provide a better
> solution for this issue?
> Note: We unpersist and assign a NULL value to the previous Datasets at the end
> of the loop.
> Thanks in advance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20754) Add Function Alias For MOD/TRUNCT/POSITION

2017-07-24 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-20754.
-
Resolution: Fixed

> Add Function Alias For MOD/TRUNCT/POSITION
> --
>
> Key: SPARK-20754
> URL: https://issues.apache.org/jira/browse/SPARK-20754
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
>
> We already have the impl of the following functions. We can add the function 
> alias to be consistent with ANSI. 
> {noformat} 
> MOD(m, n)
> {noformat} 
> Returns the remainder of m divided by n. Returns m if n is 0.
> {noformat} 
> TRUNC
> {noformat} 
> Returns the number x, truncated to D decimals. If D is 0, the result will 
> have no decimal point or fractional part. If D is negative, the number is 
> zeroed out.
> {noformat} 
> POSITION
> {noformat} 
> Returns the position of the given character or substring in the source string, i.e. POSITION(substr IN str).
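
For reference, a sketch using Spark SQL built-ins that already cover the same
ground (run in spark-shell with an active SparkSession; whether the ANSI-style
aliases themselves are available depends on the Spark version):

{code}
// The % operator and pmod() already return the remainder that MOD(m, n) describes.
spark.sql("SELECT 10 % 3 AS rem, pmod(10, 3) AS prem").show()

// locate()/instr() already return the 1-based position that
// POSITION(substr IN str) describes (both return 4 here).
spark.sql("SELECT locate('bar', 'foobar') AS p1, instr('foobar', 'bar') AS p2").show()
{code}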



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21498) quick start -> one py demo have some bug in code

2017-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16098027#comment-16098027
 ] 

Apache Spark commented on SPARK-21498:
--

User 'lizhaoch' has created a pull request for this issue:
https://github.com/apache/spark/pull/18722

> quick start  -> one  py demo have some bug in code 
> ---
>
> Key: SPARK-21498
> URL: https://issues.apache.org/jira/browse/SPARK-21498
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.2.0
>Reporter: LiZhaochuan
>Priority: Trivial
>  Labels: beginner
> Attachments: screenshot-1.png
>
>
> http://spark.apache.org/docs/latest/quick-start.html
> Self-Contained Applications:
> spark = SparkSession.builder().appName(appName).master(master).getOrCreate()
> should change to:
> spark = SparkSession.builder.appName("SimpleApp").getOrCreate()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21498) quick start -> one py demo have some bug in code

2017-07-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21498:


Assignee: Apache Spark

> quick start  -> one  py demo have some bug in code 
> ---
>
> Key: SPARK-21498
> URL: https://issues.apache.org/jira/browse/SPARK-21498
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.2.0
>Reporter: LiZhaochuan
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: beginner
> Attachments: screenshot-1.png
>
>
> http://spark.apache.org/docs/latest/quick-start.html
> Self-Contained Applications:
> spark = SparkSession.builder().appName(appName).master(master).getOrCreate()
> should change to:
> spark = SparkSession.builder.appName("SimpleApp").getOrCreate()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


