[jira] [Updated] (SPARK-12326) Move GBT implementation from spark.mllib to spark.ml

2016-03-03 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-12326:
---
Assignee: Seth Hendrickson

> Move GBT implementation from spark.mllib to spark.ml
> 
>
> Key: SPARK-12326
> URL: https://issues.apache.org/jira/browse/SPARK-12326
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Several improvements can be made to gradient boosted trees (e.g. a 
> rawPrediction column, feature importance), but they are not possible without 
> moving the GBT implementation to spark.ml. This JIRA is for moving the 
> current GBT implementation to spark.ml, which will involve roughly the 
> following steps:
> 1. Copy the implementation to spark.ml and change spark.ml classes to use 
> that implementation. Current tests will ensure that the implementations learn 
> exactly the same models. 
> 2. Move the decision tree helper classes over to spark.ml (e.g. Impurity, 
> InformationGainStats, ImpurityStats, DTStatsAggregator, etc...). Since 
> eventually all tree implementations will reside in spark.ml, the helper 
> classes should as well.
> 3. Remove the spark.mllib implementation, and make the spark.mllib APIs 
> wrappers around the spark.ml implementation (see the sketch after this list). 
> The spark.ml tests will again ensure that we do not change any behavior.
> 4. Move the unit tests to spark.ml, and change the spark.mllib unit tests to 
> verify model equivalence.
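A minimal sketch of the wrapper approach in step 3, assuming a hypothetical spark.ml-side trainer ({{MlGbtImpl}} and {{MllibGbtWrapper}} are placeholder names, not the final API):

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.{DecisionTreeModel, GradientBoostedTreesModel}
import org.apache.spark.rdd.RDD

// Placeholder for the future spark.ml implementation produced by step 1.
trait MlGbtImpl {
  def run(input: RDD[LabeledPoint], strategy: BoostingStrategy): (Array[DecisionTreeModel], Array[Double])
}

// Step 3, sketched: the spark.mllib entry point keeps its signature but simply
// delegates to the spark.ml implementation and rebuilds the legacy model class.
class MllibGbtWrapper(impl: MlGbtImpl) {
  def train(input: RDD[LabeledPoint], boostingStrategy: BoostingStrategy): GradientBoostedTreesModel = {
    val (trees, weights) = impl.run(input, boostingStrategy)
    new GradientBoostedTreesModel(boostingStrategy.treeStrategy.algo, trees, weights)
  }
}
{code}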



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10610) Using AppName instead of AppId in the name of all metrics

2016-03-03 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179507#comment-15179507
 ] 

Pete Robbins commented on SPARK-10610:
--

I think the appId is an important piece of information when visualizing the 
metrics, along with hostname, executorId, etc. I'm writing a sink and reporter 
to push the metrics to Elasticsearch, and I include these in the metric types 
for better correlation, e.g.

{
  "timestamp": "2016-03-03T15:58:31.903+",
  "hostName": "9.20.187.127",
  "applicationId": "app-20160303155742-0005",
  "executorId": "driver",
  "BlockManager_memory_maxMem_MB": 3933
}

The appId and executorId I extract from the metric name. When the sink is 
instantiated I don't believe I have access to any Utils to obtain the current 
appId and executorId, so I'm relying on these being in the metric name for the 
moment.

Is it possible to make appId, applicationName and executorId available to me 
via some Utils function that I have access to in a metrics Sink?
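For reference, a rough sketch of the name-parsing workaround described above, assuming the default "appId.executorId.rest" metric-name layout (the helper name is mine, not a Spark API):

{code}
// Split a metric name such as
// "app-20160303155742-0005.driver.BlockManager.memory.maxMem_MB"
// into (appId, executorId, remainder). Returns None if the name has
// fewer than three dot-separated parts.
def splitMetricName(name: String): Option[(String, String, String)] =
  name.split("\\.", 3) match {
    case Array(appId, executorId, rest) => Some((appId, executorId, rest))
    case _ => None
  }
{code}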

> Using AppName instead of AppId in the name of all metrics
> -
>
> Key: SPARK-10610
> URL: https://issues.apache.org/jira/browse/SPARK-10610
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Yi Tian
>Priority: Minor
>
> When we use {{JMX}} to monitor a Spark system, we have to configure the names 
> of the target metrics in the monitoring system. But the current metric name is 
> {{appId}} + {{executorId}} + {{source}}, so whenever the Spark program is 
> restarted, we have to update the metric names in the monitoring system.
> We should add an optional configuration property to control whether the 
> appName is used instead of the appId in the Spark metrics system.






[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-03 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179491#comment-15179491
 ] 

Mark Grover commented on SPARK-13670:
-

cc [~vanzin] as fyi.
I ran into this when I was running {{dev/mima}}. It should have failed because 
I didn't have the tools jar; launcher.Main did throw an exception, but the 
subshell swallowed it and the MiMa testing continued, eventually giving me a 
ton of errors since the exclusions list wasn't correctly populated.

Anyway, there are a few ways I can think of to fix it; I'm open to others as 
well:
1. We could just fix the symptom and have the dev/mima script explicitly check 
for the tools directory and tools jar in bash, and bail out if they're not 
present. However, a) that only fixes the symptom, not the root cause, and b) we 
already have those checks in the launcher code.
2. We could have the command write its output to a file instead of reading it 
directly from a subshell, which would make error handling easier.
3. We could have the subshell kill the parent process on error, essentially 
running the subshell like so:
{code}
("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" || kill $$)
{code}
The side effect is that any other subshells spawned by the parent (i.e. 
spark-class) will be terminated as well. Based on a quick look, I didn't see 
any other subshells, though.

Thoughts? Preferences?

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
> {code}
> The problem is that if the launcher Main fails, this code still returns 
> success and continues, even though the top-level script is marked 
> {{set -e}}. This is because launcher.Main is run within a subshell.






[jira] [Commented] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation

2016-03-03 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179485#comment-15179485
 ] 

Saisai Shao commented on SPARK-13669:
-

Hi [~imranr], I think you fixed SPARK-9439 and know the scheduler and blacklist 
code well. Would you please comment on this issue? Thanks a lot.

> Job will always fail in the external shuffle service unavailable situation
> --
>
> Key: SPARK-13669
> URL: https://issues.apache.org/jira/browse/SPARK-13669
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>
> Currently we are running into an issue with YARN work preserving enabled plus 
> the external shuffle service.
> In the work-preserving scenario, the failure of a NodeManager (NM) does not 
> cause its executors to exit, so the executors can still accept and run tasks. 
> The problem is that when the NM is down, the external shuffle service is 
> actually inaccessible, so reduce tasks keep failing with "Fetch failure", and 
> the failure of the reduce stage makes the parent (map) stage rerun. The tricky 
> thing is that the Spark scheduler is not aware that the external shuffle 
> service is unavailable, so it reschedules the map tasks on the executors whose 
> NM has failed; the reduce stage again fails with "Fetch failure", and after 4 
> retries the job fails.
> So the actual problem is that Spark's scheduler is not aware of the 
> unavailability of the external shuffle service and will still assign tasks to 
> those nodes. The fix is to avoid assigning tasks to those nodes.
> Currently in Spark, one related configuration is 
> "spark.scheduler.executorTaskBlacklistTime", but I don't think it works in 
> this scenario: it is used to avoid re-running a failed task's re-attempt on 
> the same executor. Mechanisms like MapReduce's blacklist may not handle this 
> scenario either, since all the reduce tasks fail, so counting failed tasks 
> would mark all the executors as equally "bad".






[jira] [Updated] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-03 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-13670:

Description: 
There's a particular snippet in spark-class 
[here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
runs the spark-launcher code in a subshell.
{code}
# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
{code}

The problem is that if the launcher Main fails, this code still returns success 
and continues, even though the top-level script is marked {{set -e}}. This is 
because launcher.Main is run within a subshell.

  was:
There's a particular snippet in spark-class 
[here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
runs the spark-launcher code in a subshell.
{code}
# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
{code}

The problem is that if the launcher Main fails, this code still returns success 
and continues, even though the top-level script is marked {{set -e}}.
This is because launcher.Main is run within a subshell.


> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
> {code}
> The problem is that if the launcher Main fails, this code still returns 
> success and continues, even though the top-level script is marked 
> {{set -e}}. This is because launcher.Main is run within a subshell.






[jira] [Created] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-03 Thread Mark Grover (JIRA)
Mark Grover created SPARK-13670:
---

 Summary: spark-class doesn't bubble up error from launcher command
 Key: SPARK-13670
 URL: https://issues.apache.org/jira/browse/SPARK-13670
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.0.0
Reporter: Mark Grover
Priority: Minor


There's a particular snippet in spark-class 
[here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
runs the spark-launcher code in a subshell.
{code}
# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
{code}

The problem is that if the launcher Main fails, this code still returns success 
and continues, even though the top-level script is marked {{set -e}}.
This is because launcher.Main is run within a subshell.






[jira] [Created] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation

2016-03-03 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-13669:
---

 Summary: Job will always fail in the external shuffle service 
unavailable situation
 Key: SPARK-13669
 URL: https://issues.apache.org/jira/browse/SPARK-13669
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Reporter: Saisai Shao


Currently we are running into an issue with YARN work preserving enabled plus 
the external shuffle service.

In the work-preserving scenario, the failure of a NodeManager (NM) does not 
cause its executors to exit, so the executors can still accept and run tasks. 
The problem is that when the NM is down, the external shuffle service is 
actually inaccessible, so reduce tasks keep failing with "Fetch failure", and 
the failure of the reduce stage makes the parent (map) stage rerun. The tricky 
thing is that the Spark scheduler is not aware that the external shuffle 
service is unavailable, so it reschedules the map tasks on the executors whose 
NM has failed; the reduce stage again fails with "Fetch failure", and after 4 
retries the job fails.

So the actual problem is that Spark's scheduler is not aware of the 
unavailability of the external shuffle service and will still assign tasks to 
those nodes. The fix is to avoid assigning tasks to those nodes.

Currently in Spark, one related configuration is 
"spark.scheduler.executorTaskBlacklistTime", but I don't think it works in this 
scenario: it is used to avoid re-running a failed task's re-attempt on the same 
executor. Mechanisms like MapReduce's blacklist may not handle this scenario 
either, since all the reduce tasks fail, so counting failed tasks would mark 
all the executors as equally "bad".
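For reference, a minimal sketch of how the configuration mentioned above is set. It only delays re-running a failed task on the same executor, so, as argued here, it does not stop the scheduler from placing other tasks on a node whose NM and shuffle service are down:

{code}
import org.apache.spark.SparkConf

// Sketch only: enable the external shuffle service and the existing
// per-task executor blacklist timeout (in milliseconds).
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.scheduler.executorTaskBlacklistTime", "10000")
{code}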






[jira] [Resolved] (SPARK-13652) TransportClient.sendRpcSync returns wrong results

2016-03-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-13652.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.0.0
   1.6.2

> TransportClient.sendRpcSync returns wrong results
> -
>
> Key: SPARK-13652
> URL: https://issues.apache.org/jira/browse/SPARK-13652
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: huangyu
>Assignee: Shixiong Zhu
> Fix For: 1.6.2, 2.0.0
>
> Attachments: RankHandler.java, Test.java
>
>
> TransportClient is not thread-safe, and if it is called from multiple 
> threads, the messages can't be encoded and decoded correctly. Below is my 
> code, and it will print wrong messages.
> {code}
> public static void main(String[] args) throws IOException, InterruptedException {
>     TransportServer server = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<>())),
>             new RankHandler()).createServer(8081, new LinkedList<>());
>     TransportContext context = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<>())),
>             new NoOpRpcHandler(), true);
>     final TransportClientFactory clientFactory = context.createClientFactory();
>     List<Thread> ts = new ArrayList<>();
>     for (int i = 0; i < 10; i++) {
>         ts.add(new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 for (int j = 0; j < 1000; j++) {
>                     try {
>                         ByteBuf buf = Unpooled.buffer(8);
>                         buf.writeLong((long) j);
>                         ByteBuffer byteBuffer = clientFactory.createClient("localhost", 8081)
>                                 .sendRpcSync(buf.nioBuffer(), Long.MAX_VALUE);
>                         long response = byteBuffer.getLong();
>                         if (response != j) {
>                             System.err.println("send:" + j + ",response:" + response);
>                         }
>                     } catch (IOException e) {
>                         e.printStackTrace();
>                     }
>                 }
>             }
>         }));
>         ts.get(i).start();
>     }
>     for (Thread t : ts) {
>         t.join();
>     }
>     server.close();
> }
>
> public class RankHandler extends RpcHandler {
>     private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
>     private final StreamManager streamManager;
>
>     public RankHandler() {
>         this.streamManager = new OneForOneStreamManager();
>     }
>
>     @Override
>     public void receive(TransportClient client, ByteBuffer msg, RpcResponseCallback callback) {
>         callback.onSuccess(msg);
>     }
>
>     @Override
>     public StreamManager getStreamManager() {
>         return streamManager;
>     }
> }
> {code}
> It will print output like the following:
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ...






[jira] [Commented] (SPARK-13639) Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors

2016-03-03 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179447#comment-15179447
 ] 

Nick Pentreath commented on SPARK-13639:


For SPARK-13568, we can take one of two approaches:

1. Support imputing numerical DF columns, as well as imputing columns within a 
vector (itself a vector DF column);
2. Only support imputing numerical DF columns.

For #1, we then need {{Statistics.colStats}} to support ignoring NaN as an 
option (agree it should definitely not be default behaviour). Potentially we 
could only support it at a lower level (perhaps within 
{{MultivariateOnlineSummarizer}}).

For scikit-learn's Imputer, obviously it works on NaN vector elements, but 
since we are working with DataFrames, my initial idea was actually more along 
the lines of #2. The {{Imputer}} would tend to be among the early steps in a 
pipeline, before the relevant numerical columns were transformed into a vector.

So #1 is not an absolute requirement IMO, though obviously it would be more 
efficient to compute all the col stats for a set of columns together, and I do 
think it makes sense to support vector input types in {{Imputer}} if possible.

Open to ideas on SPARK-13568 also.
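For concreteness, a minimal sketch of NaN-ignoring column means over the RDD[Vector] from the issue description below. This only illustrates the intended semantics, not a proposed implementation; {{sc}} is assumed to be a SparkContext:

{code}
import org.apache.spark.mllib.linalg.Vectors

val denseData = Array(
  Vectors.dense(3.8, 0.0, 1.8),
  Vectors.dense(1.7, 0.9, 0.0),
  Vectors.dense(Double.NaN, 0.0, 0.0))
val rdd = sc.parallelize(denseData)

// Per-column sums and counts, skipping NaN entries.
val (sums, counts) = rdd.map(_.toArray).aggregate((Array.fill(3)(0.0), Array.fill(3)(0L)))(
  seqOp = { case ((s, c), values) =>
    values.zipWithIndex.foreach { case (v, i) => if (!v.isNaN) { s(i) += v; c(i) += 1 } }
    (s, c)
  },
  combOp = { case ((s1, c1), (s2, c2)) =>
    (s1.indices.map(i => s1(i) + s2(i)).toArray, c1.indices.map(i => c1(i) + c2(i)).toArray)
  })

val meansIgnoringNaN =
  sums.indices.map(i => if (counts(i) > 0) sums(i) / counts(i) else Double.NaN).toArray
// Array(2.75, 0.3, 0.6) instead of [NaN, 0.3, 0.6]
{code}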

> Statistics.colStats(rdd).mean and variance should handle NaN in the input 
> vectors
> -
>
> Key: SPARK-13639
> URL: https://issues.apache.org/jira/browse/SPARK-13639
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Priority: Trivial
>
>val denseData = Array(
>   Vectors.dense(3.8, 0.0, 1.8),
>   Vectors.dense(1.7, 0.9, 0.0),
>   Vectors.dense(Double.NaN, 0, 0.0)
> )
> val rdd = sc.parallelize(denseData)
> println(Statistics.colStats(rdd).mean)
> [NaN,0.3,0.6]
> This is just a proposal for discussion on how to handle NaN values in the 
> input vectors. We can either ignore the NaN values in the computation, or 
> keep outputting NaN as it does now but log a warning.






[jira] [Assigned] (SPARK-13642) Properly handle signal kill of ApplicationMaster

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13642:


Assignee: (was: Apache Spark)

> Properly handle signal kill of ApplicationMaster
> 
>
> Key: SPARK-13642
> URL: https://issues.apache.org/jira/browse/SPARK-13642
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> Currently, when running Spark on YARN in yarn-cluster mode, the default 
> application final state is "SUCCEED"; if any exception occurs, this final 
> state is changed to "FAILED", which triggers a reattempt if possible.
> This is OK in the normal case, but there is a race condition when the AM 
> receives a signal (SIGTERM) and no exception has occurred. In this situation, 
> the shutdown hook is invoked and marks the application as finished 
> successfully, and there is no further attempt.
> In such a situation, from Spark's perspective the application failed and 
> needs another attempt, but from YARN's perspective the application finished 
> successfully.
> This can happen when a NodeManager fails: the NM failure sends SIGTERM to the 
> AM, and the AM should mark this attempt as failed and rerun, not invoke 
> unregister.
> To increase the chance of hitting this race condition, here is the 
> reproduction code:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM is killed while sleeping, no exception is thrown, so from YARN's 
> point of view the application finished successfully, but from Spark's point 
> of view it should be reattempted.
> The log normally looks like this:
> {noformat}
> 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : 
> 192.168.0.105:45454
> 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2
> 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1
> 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready 
> for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 16/03/03 12:44:23 INFO YarnClusterScheduler: 
> YarnClusterScheduler.postStartHook done
> 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/metrics/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/api,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/static,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/rdd,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/pool/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/pool,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> 

[jira] [Commented] (SPARK-13642) Properly handle signal kill of ApplicationMaster

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179400#comment-15179400
 ] 

Apache Spark commented on SPARK-13642:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/11512

> Properly handle signal kill of ApplicationMaster
> 
>
> Key: SPARK-13642
> URL: https://issues.apache.org/jira/browse/SPARK-13642
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> Currently, when running Spark on YARN in yarn-cluster mode, the default 
> application final state is "SUCCEED"; if any exception occurs, this final 
> state is changed to "FAILED", which triggers a reattempt if possible.
> This is OK in the normal case, but there is a race condition when the AM 
> receives a signal (SIGTERM) and no exception has occurred. In this situation, 
> the shutdown hook is invoked and marks the application as finished 
> successfully, and there is no further attempt.
> In such a situation, from Spark's perspective the application failed and 
> needs another attempt, but from YARN's perspective the application finished 
> successfully.
> This can happen when a NodeManager fails: the NM failure sends SIGTERM to the 
> AM, and the AM should mark this attempt as failed and rerun, not invoke 
> unregister.
> To increase the chance of hitting this race condition, here is the 
> reproduction code:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM is killed while sleeping, no exception is thrown, so from YARN's 
> point of view the application finished successfully, but from Spark's point 
> of view it should be reattempted.
> The log normally looks like this:
> {noformat}
> 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : 
> 192.168.0.105:45454
> 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2
> 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1
> 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready 
> for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 16/03/03 12:44:23 INFO YarnClusterScheduler: 
> YarnClusterScheduler.postStartHook done
> 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/metrics/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/api,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/static,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/rdd,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/pool/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/pool,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> 

[jira] [Assigned] (SPARK-13642) Properly handle signal kill of ApplicationMaster

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13642:


Assignee: Apache Spark

> Properly handle signal kill of ApplicationMaster
> 
>
> Key: SPARK-13642
> URL: https://issues.apache.org/jira/browse/SPARK-13642
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>
> Currently, when running Spark on YARN in yarn-cluster mode, the default 
> application final state is "SUCCEED"; if any exception occurs, this final 
> state is changed to "FAILED", which triggers a reattempt if possible.
> This is OK in the normal case, but there is a race condition when the AM 
> receives a signal (SIGTERM) and no exception has occurred. In this situation, 
> the shutdown hook is invoked and marks the application as finished 
> successfully, and there is no further attempt.
> In such a situation, from Spark's perspective the application failed and 
> needs another attempt, but from YARN's perspective the application finished 
> successfully.
> This can happen when a NodeManager fails: the NM failure sends SIGTERM to the 
> AM, and the AM should mark this attempt as failed and rerun, not invoke 
> unregister.
> To increase the chance of hitting this race condition, here is the 
> reproduction code:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM is killed while sleeping, no exception is thrown, so from YARN's 
> point of view the application finished successfully, but from Spark's point 
> of view it should be reattempted.
> The log normally looks like this:
> {noformat}
> 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : 
> 192.168.0.105:45454
> 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2
> 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1
> 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready 
> for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 16/03/03 12:44:23 INFO YarnClusterScheduler: 
> YarnClusterScheduler.postStartHook done
> 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/metrics/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/api,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/static,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/rdd,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/pool/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/pool,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> 

[jira] [Updated] (SPARK-13642) Properly handle signal kill of ApplicationMaster

2016-03-03 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-13642:

Summary: Properly handle signal kill of ApplicationMaster  (was: Different 
finishing state between Spark and YARN)

> Properly handle signal kill of ApplicationMaster
> 
>
> Key: SPARK-13642
> URL: https://issues.apache.org/jira/browse/SPARK-13642
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> Currently, when running Spark on YARN in yarn-cluster mode, the default 
> application final state is "SUCCEED"; if any exception occurs, this final 
> state is changed to "FAILED", which triggers a reattempt if possible.
> This is OK in the normal case, but there is a race condition when the AM 
> receives a signal (SIGTERM) and no exception has occurred. In this situation, 
> the shutdown hook is invoked and marks the application as finished 
> successfully, and there is no further attempt.
> In such a situation, from Spark's perspective the application failed and 
> needs another attempt, but from YARN's perspective the application finished 
> successfully.
> This can happen when a NodeManager fails: the NM failure sends SIGTERM to the 
> AM, and the AM should mark this attempt as failed and rerun, not invoke 
> unregister.
> To increase the chance of hitting this race condition, here is the 
> reproduction code:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM is killed while sleeping, no exception is thrown, so from YARN's 
> point of view the application finished successfully, but from Spark's point 
> of view it should be reattempted.
> The log normally looks like this:
> {noformat}
> 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : 
> 192.168.0.105:45454
> 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2
> 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1
> 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready 
> for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 16/03/03 12:44:23 INFO YarnClusterScheduler: 
> YarnClusterScheduler.postStartHook done
> 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/metrics/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/api,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/static,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/rdd,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/pool/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/pool,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage,null}
> 16/03/03 

[jira] [Commented] (SPARK-13650) Usage of the window() function on DStream

2016-03-03 Thread Mario Briggs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179347#comment-15179347
 ] 

Mario Briggs commented on SPARK-13650:
--

Running locally on my Mac, here is the output:

{quote}
[Stage 0:>  (0 + 0) / 
2]16/03/04 10:38:18 INFO VerifiableProperties: Verifying properties
16/03/04 10:38:18 INFO VerifiableProperties: Property auto.offset.reset is 
overridden to smallest
16/03/04 10:38:18 INFO VerifiableProperties: Property group.id is overridden to 
16/03/04 10:38:18 INFO VerifiableProperties: Property zookeeper.connect is 
overridden to 

---
Time: 1457068098000 ms
---
10


[Stage 6:===>   (2 + 0) / 3]

[Stage 6:===>   (2 + 0) / 3]

{quote}

The '10' is the output from the first batch interval at time '1457068098000 ms'. 
Thereafter, the only output for the next ~2 minutes is the 'Stage 6' progress 
line. A ^C fails to stop the app, and I need to do a 'kill -9 pid'.

> Usage of the window() function on DStream
> -
>
> Key: SPARK-13650
> URL: https://issues.apache.org/jira/browse/SPARK-13650
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2, 1.6.0, 2.0.0
>Reporter: Mario Briggs
>Priority: Minor
>
> Is there some guidance on the usage of the window() function on DStream? Here 
> is my academic use case, for which it fails.
> Standard word count:
>  val ssc = new StreamingContext(sparkConf, Seconds(6))
>  val messages = KafkaUtils.createDirectStream(...)
>  val words = messages.map(_._2).flatMap(_.split(" "))
>  val window = words.window(Seconds(12), Seconds(6))
>  window.count().print()
> For the first batch interval it gives the count, and then it hangs (inside 
> the UnionRDD).
> I say the above use case is academic since one can achieve similar 
> functionality by instead using the more compact API
>  words.countByWindow(Seconds(12), Seconds(6))
> which works fine.
> Is the first approach above not the intended way of using the .window() API?






[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-03-03 Thread Zhong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179339#comment-15179339
 ] 

Zhong Wang commented on SPARK-13337:


The current join method with the usingColumns argument generates a result like 
TableC. The limitation is that it doesn't support a null-safe join.
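For comparison, a null-safe equi-join can already be expressed with the {{<=>}} (null-safe equality) column operator, at the cost of keeping both join columns in the output. A rough sketch, where {{left}}, {{right}} and the column name "key" are placeholders:

{code}
// Null-safe equi-join: rows whose "key" values are both null are matched too.
val joined = left.join(right, left("key") <=> right("key"), "outer")
{code}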

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.






[jira] [Assigned] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13668:


Assignee: Apache Spark

> Reorder filter/join predicates to short-circuit isNotNull checks
> 
>
> Key: SPARK-13668
> URL: https://issues.apache.org/jira/browse/SPARK-13668
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>
> If a filter predicate or a join condition consists of `IsNotNull` checks, we 
> should reorder these checks such that these non-nullability checks are 
> evaluated before the rest of the predicates.
> For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we 
> should rewrite this as `isNotNull(b) && a > 5` during physical plan 
> generation.
> cc [~nongli] [~yhuai]






[jira] [Assigned] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13668:


Assignee: (was: Apache Spark)

> Reorder filter/join predicates to short-circuit isNotNull checks
> 
>
> Key: SPARK-13668
> URL: https://issues.apache.org/jira/browse/SPARK-13668
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sameer Agarwal
>
> If a filter predicate or a join condition consists of `IsNotNull` checks, we 
> should reorder these checks such that these non-nullability checks are 
> evaluated before the rest of the predicates.
> For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we 
> should rewrite this as `isNotNull(b) && a > 5` during physical plan 
> generation.
> cc [~nongli] [~yhuai]






[jira] [Commented] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179318#comment-15179318
 ] 

Apache Spark commented on SPARK-13668:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/11511

> Reorder filter/join predicates to short-circuit isNotNull checks
> 
>
> Key: SPARK-13668
> URL: https://issues.apache.org/jira/browse/SPARK-13668
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sameer Agarwal
>
> If a filter predicate or a join condition consists of `IsNotNull` checks, we 
> should reorder these checks such that these non-nullability checks are 
> evaluated before the rest of the predicates.
> For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we 
> should rewrite this as `isNotNull(b) && a > 5` during physical plan 
> generation.
> cc [~nongli] [~yhuai]






[jira] [Updated] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks

2016-03-03 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-13668:
---
Description: 
If a filter predicate or a join condition consists of `IsNotNull` checks, we 
should reorder these checks such that these non-nullability checks are 
evaluated before the rest of the predicates.

For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we 
should rewrite this as `isNotNull(b) && a > 5` during physical plan generation.

cc [~nongli] [~yhuai]

  was:
If a filter predicate or a join condition consists of `IsNotNull` checks, we 
should reorder these checks such that these non-nullability checks are 
evaluated before the rest of the predicates.

For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we 
should rewrite this as `isNotNull(b) && a` during physical plan generation.

cc [~nongli] [~yhuai]


> Reorder filter/join predicates to short-circuit isNotNull checks
> 
>
> Key: SPARK-13668
> URL: https://issues.apache.org/jira/browse/SPARK-13668
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sameer Agarwal
>
> If a filter predicate or a join condition consists of `IsNotNull` checks, we 
> should reorder these checks such that these non-nullability checks are 
> evaluated before the rest of the predicates.
> For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we 
> should rewrite this as `isNotNull(b) && a > 5` during physical plan 
> generation.
> cc [~nongli] [~yhuai]






[jira] [Created] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks

2016-03-03 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-13668:
--

 Summary: Reorder filter/join predicates to short-circuit isNotNull 
checks
 Key: SPARK-13668
 URL: https://issues.apache.org/jira/browse/SPARK-13668
 Project: Spark
  Issue Type: Improvement
Reporter: Sameer Agarwal


If a filter predicate or a join condition consists of `IsNotNull` checks, we 
should reorder these checks such that these non-nullability checks are 
evaluated before the rest of the predicates.

For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we 
should rewrite this as `isNotNull(b) && a` during physical plan generation.

cc [~nongli] [~yhuai]
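A minimal sketch of such a reordering over a conjunction of Catalyst expressions, assuming we simply partition the conjuncts (this is an illustration of the idea, not the actual rule that will land):

{code}
import org.apache.spark.sql.catalyst.expressions.{And, Expression, IsNotNull}

// Flatten a conjunction into its conjuncts.
def splitConjuncts(e: Expression): Seq[Expression] = e match {
  case And(left, right) => splitConjuncts(left) ++ splitConjuncts(right)
  case other => Seq(other)
}

// Put IsNotNull checks first so they short-circuit the remaining predicates,
// e.g. a > 5 && isNotNull(b) becomes isNotNull(b) && a > 5.
def reorderConjuncts(condition: Expression): Expression = {
  val (notNulls, others) = splitConjuncts(condition).partition(_.isInstanceOf[IsNotNull])
  (notNulls ++ others).reduce(And(_, _))
}
{code}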






[jira] [Created] (SPARK-13667) Support for specifying custom date format for date and timestamp types

2016-03-03 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-13667:


 Summary: Support for specifying custom date format for date and 
timestamp types
 Key: SPARK-13667
 URL: https://issues.apache.org/jira/browse/SPARK-13667
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Minor


Currently, the CSV data source does not support parsing date and timestamp 
types in a custom format, nor inferring the timestamp type from a custom 
format.

It looks like quite a few users want this feature. It would be great to be able 
to set a custom date format.

This was reported in spark-csv.
https://github.com/databricks/spark-csv/issues/279
https://github.com/databricks/spark-csv/issues/262
https://github.com/databricks/spark-csv/issues/266

I have submitted a PR for this in spark-csv:
https://github.com/databricks/spark-csv/pull/280
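Roughly, the requested behaviour would look like the following when reading CSV, assuming a {{sqlContext}} in scope; the "dateFormat" option name follows the spark-csv PR above and should be treated as illustrative rather than a committed API:

{code}
// Illustrative sketch only: parse/infer date and timestamp columns using a
// user-supplied pattern instead of the default format.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy/MM/dd HH:mm:ss")
  .load("events.csv")
{code}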






[jira] [Resolved] (SPARK-13647) also check if numeric value is within allowed range in _verify_type

2016-03-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-13647.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11492
[https://github.com/apache/spark/pull/11492]

> also check if numeric value is within allowed range in _verify_type
> ---
>
> Key: SPARK-13647
> URL: https://issues.apache.org/jira/browse/SPARK-13647
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Wenchen Fan
> Fix For: 2.0.0
>
>







[jira] [Assigned] (SPARK-13626) SparkConf deprecation log messages are printed multiple times

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13626:


Assignee: Apache Spark

> SparkConf deprecation log messages are printed multiple times
> -
>
> Key: SPARK-13626
> URL: https://issues.apache.org/jira/browse/SPARK-13626
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> I noticed that if I have a deprecated config in my spark-defaults.conf, I'll 
> see multiple warnings when running, for example, spark-shell. I collected the 
> backtrace from when the messages are printed, and here are a few instances. 
> The first one is the only one I expect to be printed.
> {noformat}
> java.lang.Exception:
> ...
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:53)
> at org.apache.spark.repl.Main$.<init>(Main.scala:30)
> {noformat}
> The following ones are causing duplicate log messages and we should clean 
> those up:
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:53)
> at org.apache.spark.repl.Main$.createSparkContext(Main.scala:82)
> {noformat}
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.setAll(SparkConf.scala:139)
> at org.apache.spark.SparkConf.clone(SparkConf.scala:358)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
> at org.apache.spark.repl.Main$.createSparkContext(Main.scala:98)
> {noformat}
> There are also a few more caused by the use of {{SparkConf.clone()}}.
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:59)
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:53)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:48)
> {noformat}
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:53)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:93)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
> {noformat}
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:53)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:93)
> {noformat}
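One generic way to avoid duplicated warnings like these, sketched below, is to remember which keys have already been warned about process-wide and log each one only once. This is just the usual "log once" pattern, not necessarily what the eventual fix does:

{code}
import java.util.concurrent.ConcurrentHashMap

object DeprecationWarnings {
  // Keys already warned about in this JVM; each deprecated key is logged once.
  private val warned = ConcurrentHashMap.newKeySet[String]()

  def warnOnce(key: String)(log: String => Unit): Unit = {
    if (warned.add(key)) {
      log(s"The configuration key '$key' has been deprecated and may be removed in the future.")
    }
  }
}
{code}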






[jira] [Assigned] (SPARK-13626) SparkConf deprecation log messages are printed multiple times

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13626:


Assignee: (was: Apache Spark)

> SparkConf deprecation log messages are printed multiple times
> -
>
> Key: SPARK-13626
> URL: https://issues.apache.org/jira/browse/SPARK-13626
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> I noticed that if I have a deprecated config in my spark-defaults.conf, I'll 
> see multiple warnings when running, for example, spark-shell. I collected the 
> backtrace from when the messages are printed, and here are a few instances. 
> The first one is the only one I expect to be printed.
> {noformat}
> java.lang.Exception:
> ...
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:53)
> at org.apache.spark.repl.Main$.<init>(Main.scala:30)
> {noformat}
> The following ones are causing duplicate log messages and we should clean 
> those up:
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:53)
> at org.apache.spark.repl.Main$.createSparkContext(Main.scala:82)
> {noformat}
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.setAll(SparkConf.scala:139)
> at org.apache.spark.SparkConf.clone(SparkConf.scala:358)
> at org.apache.spark.SparkContext.(SparkContext.scala:368)
> at org.apache.spark.repl.Main$.createSparkContext(Main.scala:98)
> {noformat}
> There are also a few more caused by the use of {{SparkConf.clone()}}.
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.(SparkConf.scala:59)
> at org.apache.spark.SparkConf.(SparkConf.scala:53)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:48)
> {noformat}
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.(SparkConf.scala:53)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:93)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
> {noformat}
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.(SparkConf.scala:53)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:93)
> {noformat}
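
A minimal sketch of one way to deduplicate such warnings, assuming a simple
process-wide guard keyed by config name; this is an illustration only and does not
reflect how SparkConf's deprecation logging is actually structured.

{code}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical illustration only; names do not match Spark's internals.
object DeprecationWarnings {
  private val alreadyWarned = ConcurrentHashMap.newKeySet[String]()

  // Log the warning only the first time a given deprecated key is seen.
  def warnOnce(key: String, message: String): Unit = {
    if (alreadyWarned.add(key)) {
      println(s"WARN: $message") // stand-in for the real logger
    }
  }
}

object Demo extends App {
  // Both the original conf and its clone() would call warnOnce, but the
  // message is printed a single time per deprecated key.
  DeprecationWarnings.warnOnce("spark.old.setting", "spark.old.setting is deprecated")
  DeprecationWarnings.warnOnce("spark.old.setting", "spark.old.setting is deprecated")
}
{code}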



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13626) SparkConf deprecation log messages are printed multiple times

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179276#comment-15179276
 ] 

Apache Spark commented on SPARK-13626:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11510

> SparkConf deprecation log messages are printed multiple times
> -
>
> Key: SPARK-13626
> URL: https://issues.apache.org/jira/browse/SPARK-13626
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> I noticed that if I have a deprecated config in my spark-defaults.conf, I'll 
> see multiple warnings when running, for example, spark-shell. I collected the 
> backtrace from when the messages are printed, and here are a few instances. The 
> first one is the only one I expect to be printed.
> {noformat}
> java.lang.Exception:
> ...
> at org.apache.spark.SparkConf.(SparkConf.scala:53)
> at org.apache.spark.repl.Main$.(Main.scala:30)
> {noformat}
> The following ones are causing duplicate log messages and we should clean 
> those up:
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.(SparkConf.scala:53)
> at org.apache.spark.repl.Main$.createSparkContext(Main.scala:82)
> {noformat}
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.setAll(SparkConf.scala:139)
> at org.apache.spark.SparkConf.clone(SparkConf.scala:358)
> at org.apache.spark.SparkContext.(SparkContext.scala:368)
> at org.apache.spark.repl.Main$.createSparkContext(Main.scala:98)
> {noformat}
> There are also a few more caused by the use of {{SparkConf.clone()}}.
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.(SparkConf.scala:59)
> at org.apache.spark.SparkConf.(SparkConf.scala:53)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:48)
> {noformat}
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.(SparkConf.scala:53)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:93)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
> {noformat}
> {noformat}
> java.lang.Exception:
> at 
> org.apache.spark.SparkConf$$anonfun$logDeprecationWarning$2.apply(SparkConf.scala:682)
> ...
> at org.apache.spark.SparkConf.(SparkConf.scala:53)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:93)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179254#comment-15179254
 ] 

Jeff Zhang commented on SPARK-13587:


SPARK-13081 is almost done. But for this ticket, I definitely need your 
feedback and help; I am not a heavy Python user, so I may miss things that 
Python programmers care about.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive ones)
> * Another way is to install packages manually on each node (time consuming, and 
> not easy to switch between environments)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is about bringing these 2 
> tools to a distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179253#comment-15179253
 ] 

Juliet Hougland commented on SPARK-13587:
-

That is wonderful. Let me know if you'd like me to help work on it. It has been 
dangling on my mental todo list for a while.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive ones)
> * Another way is to install packages manually on each node (time consuming, and 
> not easy to switch between environments)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is about bringing these 2 
> tools to a distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179251#comment-15179251
 ] 

Jeff Zhang commented on SPARK-13587:


Actually I created this ticket several weeks ago: SPARK-13081 :)

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive ones)
> * Another way is to install packages manually on each node (time consuming, and 
> not easy to switch between environments)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is about bringing these 2 
> tools to a distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179250#comment-15179250
 ] 

Juliet Hougland commented on SPARK-13587:
-

I made a comment related to this below. TL;DR: I think the suggested --py-env 
option could be encompassed by an already needed --pyspark_python sort of option.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive ones)
> * Another way is to install packages manually on each node (time consuming, and 
> not easy to switch between environments)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is about bringing these 2 
> tools to a distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179247#comment-15179247
 ] 

Juliet Hougland commented on SPARK-13587:
-

Currently the way users specify the workers' Python interpreter is through the 
PYSPARK_PYTHON env variable. It would be beneficial to users to allow that path 
to be specified by a CLI flag; that is a current rough edge of using already 
installed envs on a cluster.

If this were added as a CLI flag, I could see valid options being 
'pyspark/python/path', 'venv' (temp virtualenv), and 'conda' (temp conda env), 
with a second flag to specify the requirements file. I think this helps 
prevent an explosion of flags for spark-submit while handling a very 
important and often-changed parameter for a job. What do you think of this?
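
A purely hypothetical sketch of how such flag values could be modelled; the flag
behaviour and the PythonEnvSpec type below are invented for illustration and are
not existing Spark options.

{code}
// Hypothetical sketch only: these values and types are not real Spark options.
sealed trait PythonEnvSpec
case class ExistingInterpreter(path: String) extends PythonEnvSpec          // a path like /opt/envs/py27/bin/python
case class TempVirtualenv(requirementsFile: String) extends PythonEnvSpec   // 'venv' plus a requirements file
case class TempCondaEnv(requirementsFile: String) extends PythonEnvSpec     // 'conda' plus a requirements file

object PythonEnvSpec {
  // Parse a hypothetical interpreter flag value plus an optional requirements-file flag.
  def parse(value: String, requirements: Option[String]): PythonEnvSpec =
    (value, requirements) match {
      case ("venv", Some(req))  => TempVirtualenv(req)
      case ("conda", Some(req)) => TempCondaEnv(req)
      case (path, _)            => ExistingInterpreter(path)
    }
}
{code}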

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially transitive ones)
> * Another way is to install packages manually on each node (time consuming, and 
> not easy to switch between environments)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is about bringing these 2 
> tools to a distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-03 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179241#comment-15179241
 ] 

Gayathri Murali commented on SPARK-13641:
-

[~xusen] I can remove c_feature from the VectorAssembler code, but is the naming 
convention intentional?

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets the features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which means the output attribute names are not equal to the original ones.
> E.g., we want to get the HouseVotes84 features' names "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
> appends suffixes (salts) to the column names.
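
A small sketch of the renaming, assuming a Spark 1.6/2.0-era shell with an existing
SQLContext named sqlContext; the toy data and column names are made up.

{code}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.RFormula

// Two categorical feature columns, V1 and V2, plus a string label column.
val df = sqlContext.createDataFrame(Seq(
  ("republican", "n", "y"),
  ("democrat", "y", "n")
)).toDF("Class", "V1", "V2")

val formula = new RFormula().setFormula("Class ~ .").setFeaturesCol("features")
val output = formula.fit(df).transform(df)

// The attribute names carry category suffixes (e.g. "V1_y") rather than "V1", "V2".
val attrs = AttributeGroup.fromStructField(output.schema("features"))
println(attrs.attributes.get.flatMap(_.name).mkString(", "))
{code}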



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13639) Statistics.colStats(rdd).mean and variance should handle NaN in the input vectors

2016-03-03 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179227#comment-15179227
 ] 

yuhao yang commented on SPARK-13639:


Perhaps we can find some other way for SPARK-13568. I just need to make sure 
the performance is acceptable.

> Statistics.colStats(rdd).mean and variance should handle NaN in the input 
> vectors
> -
>
> Key: SPARK-13639
> URL: https://issues.apache.org/jira/browse/SPARK-13639
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Priority: Trivial
>
> {code}
> val denseData = Array(
>   Vectors.dense(3.8, 0.0, 1.8),
>   Vectors.dense(1.7, 0.9, 0.0),
>   Vectors.dense(Double.NaN, 0, 0.0)
> )
> val rdd = sc.parallelize(denseData)
> println(Statistics.colStats(rdd).mean)
> // prints [NaN,0.3,0.6]
> {code}
> This is just a proposal for discussion on how to handle NaN values in the 
> input vectors. We can either ignore NaN values in the computation, or keep 
> outputting NaN as it does now and add a warning.
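
A rough sketch of the "ignore NaN" option, computed directly over the RDD rather
than through MultivariateOnlineSummarizer; this only illustrates the proposed
semantics and is not a suggested implementation.

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Per-column mean that skips NaN entries instead of propagating them.
def nanIgnoringMean(rdd: RDD[Vector]): Vector = {
  val dim = rdd.first().size
  // Accumulate (sum of non-NaN values, count of non-NaN values) per column.
  val (sums, counts) = rdd.aggregate((new Array[Double](dim), new Array[Long](dim)))(
    seqOp = { case ((s, c), v) =>
      var i = 0
      while (i < dim) {
        val x = v(i)
        if (!x.isNaN) { s(i) += x; c(i) += 1 }
        i += 1
      }
      (s, c)
    },
    combOp = { case ((s1, c1), (s2, c2)) =>
      (s1.zip(s2).map { case (a, b) => a + b }, c1.zip(c2).map { case (a, b) => a + b })
    })
  Vectors.dense(sums.indices.map(i => if (counts(i) > 0) sums(i) / counts(i) else Double.NaN).toArray)
}

// For the example above this would print [2.75,0.3,0.6] instead of [NaN,0.3,0.6].
{code}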



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13665) Initial separation of concerns

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13665:


Assignee: Apache Spark  (was: Michael Armbrust)

> Initial separation of concerns
> --
>
> Key: SPARK-13665
> URL: https://issues.apache.org/jira/browse/SPARK-13665
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Blocker
>
> The goal here is to break apart: File Management, code to deal with specific 
> formats and query planning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13665) Initial separation of concerns

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179224#comment-15179224
 ] 

Apache Spark commented on SPARK-13665:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/11509

> Initial separation of concerns
> --
>
> Key: SPARK-13665
> URL: https://issues.apache.org/jira/browse/SPARK-13665
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> The goal here is to break apart: File Management, code to deal with specific 
> formats and query planning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13665) Initial separation of concerns

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13665:


Assignee: Michael Armbrust  (was: Apache Spark)

> Initial separation of concerns
> --
>
> Key: SPARK-13665
> URL: https://issues.apache.org/jira/browse/SPARK-13665
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> The goal here is to break apart: File Management, code to deal with specific 
> formats and query planning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13652) TransportClient.sendRpcSync returns wrong results

2016-03-03 Thread huangyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179157#comment-15179157
 ] 

huangyu edited comment on SPARK-13652 at 3/4/16 3:09 AM:
-

I think the doc is about fetching stream rather than sendRpc


was (Author: huang_yuu):
I think the docs are about fetching stream rather than sendRpc

> TransportClient.sendRpcSync returns wrong results
> -
>
> Key: SPARK-13652
> URL: https://issues.apache.org/jira/browse/SPARK-13652
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: huangyu
> Attachments: RankHandler.java, Test.java
>
>
> TransportClient is not thread safe, and if it is called from multiple threads, 
> the messages can't be encoded and decoded correctly. Below is my code, and it 
> will print wrong messages.
> {code}
> public static void main(String[] args) throws IOException, 
> InterruptedException {
> TransportServer server = new TransportContext(new 
> TransportConf("test",
> new MapConfigProvider(new HashMap())), new 
> RankHandler()).
> createServer(8081, new 
> LinkedList());
> TransportContext context = new TransportContext(new 
> TransportConf("test",
> new MapConfigProvider(new HashMap())), new 
> NoOpRpcHandler(), true);
> final TransportClientFactory clientFactory = 
> context.createClientFactory();
> List ts = new ArrayList<>();
> for (int i = 0; i < 10; i++) {
> ts.add(new Thread(new Runnable() {
> @Override
> public void run() {
> for (int j = 0; j < 1000; j++) {
> try {
> ByteBuf buf = Unpooled.buffer(8);
> buf.writeLong((long) j);
> ByteBuffer byteBuffer = 
> clientFactory.createClient("localhost", 8081).
> sendRpcSync(buf.nioBuffer(), 
> Long.MAX_VALUE);
> long response = byteBuffer.getLong();
> if (response != j) {
> System.err.println("send:" + j + ",response:" 
> + response);
> }
> } catch (IOException e) {
> e.printStackTrace();
> }
> }
> }
> }));
> ts.get(i).start();
> }
> for (Thread t : ts) {
> t.join();
> }
> server.close();
> }
> public class RankHandler extends RpcHandler {
> private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
> private final StreamManager streamManager;
> public RankHandler() {
> this.streamManager = new OneForOneStreamManager();
> }
> @Override
> public void receive(TransportClient client, ByteBuffer msg, 
> RpcResponseCallback callback) {
> callback.onSuccess(msg);
> }
> @Override
> public StreamManager getStreamManager() {
> return streamManager;
> }
> }
> {code}
> it will print as below
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ...
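
Not the fix itself, but a sketch of a client-side workaround for a test like the
one above: serialize the concurrent sendRpcSync calls behind a shared lock so that
requests from different threads cannot interleave on the same connection. This only
sidesteps the symptom by removing concurrency; the thread-safety issue in the
transport layer is what this ticket is about.

{code}
import java.nio.ByteBuffer
import org.apache.spark.network.client.TransportClientFactory

// Sketch only: funnel all RPCs from many threads through one guarded call site.
class SerializedRpcClient(factory: TransportClientFactory, host: String, port: Int) {
  private val lock = new Object

  def sendRpcSync(message: ByteBuffer, timeoutMs: Long): ByteBuffer = lock.synchronized {
    factory.createClient(host, port).sendRpcSync(message, timeoutMs)
  }
}
{code}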



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13666) Annoying warnings from SQLConf in log output

2016-03-03 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-13666:
--

 Summary: Annoying warnings from SQLConf in log output
 Key: SPARK-13666
 URL: https://issues.apache.org/jira/browse/SPARK-13666
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Marcelo Vanzin
Priority: Minor


Whenever I run spark-shell I get a bunch of warnings about SQL configuration:

{noformat}
16/03/03 19:00:25 WARN hive.HiveSessionState$$anon$1: Attempt to set non-Spark 
SQL config in SQLConf: key = spark.yarn.driver.memoryOverhead, value = 26
16/03/03 19:00:25 WARN hive.HiveSessionState$$anon$1: Attempt to set non-Spark 
SQL config in SQLConf: key = spark.yarn.executor.memoryOverhead, value = 26
16/03/03 19:00:25 WARN hive.HiveSessionState$$anon$1: Attempt to set non-Spark 
SQL config in SQLConf: key = spark.executor.cores, value = 1
16/03/03 19:00:25 WARN hive.HiveSessionState$$anon$1: Attempt to set non-Spark 
SQL config in SQLConf: key = spark.executor.memory, value = 268435456
{noformat}

That shouldn't happen, since I'm not setting those values explicitly. They're 
either set internally by Spark or come from spark-defaults.conf.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13531) Some DataFrame joins stopped working with UnsupportedOperationException: No size estimation available for objects

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13531:


Assignee: (was: Apache Spark)

> Some DataFrame joins stopped working with UnsupportedOperationException: No 
> size estimation available for objects
> -
>
> Key: SPARK-13531
> URL: https://issues.apache.org/jira/browse/SPARK-13531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: koert kuipers
>Priority: Minor
>
> this is using spark 2.0.0-SNAPSHOT
> dataframe df1:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , obj#135: object, [if (input[0, object].isNullAt) 
> null else input[0, object].get AS x#128]
> +- MapPartitions , createexternalrow(if (isnull(x#9)) null else 
> x#9), [input[0, object] AS obj#135]
>+- WholeStageCodegen
>   :  +- Project [_1#8 AS x#9]
>   : +- Scan ExistingRDD[_1#8]{noformat}
> show:
> {noformat}+---+
> |  x|
> +---+
> |  2|
> |  3|
> +---+{noformat}
> dataframe df2:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true), 
> StructField(y,StringType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , createexternalrow(x#2, if (isnull(y#3)) null else 
> y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get 
> AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, 
> object].get, true) AS y#131]
> +- WholeStageCodegen
>:  +- Project [_1#0 AS x#2,_2#1 AS y#3]
>: +- Scan ExistingRDD[_1#0,_2#1]{noformat}
> show:
> {noformat}+---+---+
> |  x|  y|
> +---+---+
> |  1|  1|
> |  2|  2|
> |  3|  3|
> +---+---+{noformat}
> i run:
> df1.join(df2, Seq("x")).show
> i get:
> {noformat}java.lang.UnsupportedOperationException: No size estimation 
> available for objects.
> at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87){noformat}
> not sure what changed; this ran about a week ago without issues (in our 
> internal unit tests). It is fully reproducible, however when I tried to 
> minimize the issue I could not reproduce it by just creating data frames in 
> the repl with the same contents, so it probably has something to do with the way 
> these are created (from Row objects and StructTypes).
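
For context, a hypothetical sketch of the construction pattern the last paragraph
refers to (DataFrames built from Row objects plus an explicit StructType); the data
and the sqlContext/sc names are assumptions, and note the reporter could not
reproduce the failure with repl-created frames like these.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val df1 = sqlContext.createDataFrame(
  sc.parallelize(Seq(Row(2), Row(3))),
  StructType(Seq(StructField("x", IntegerType, nullable = true))))

val df2 = sqlContext.createDataFrame(
  sc.parallelize(Seq(Row(1, "1"), Row(2, "2"), Row(3, "3"))),
  StructType(Seq(
    StructField("x", IntegerType, nullable = true),
    StructField("y", StringType, nullable = true))))

df1.join(df2, Seq("x")).show() // the join that fails in the reporter's environment
{code}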



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13531) Some DataFrame joins stopped working with UnsupportedOperationException: No size estimation available for objects

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179201#comment-15179201
 ] 

Apache Spark commented on SPARK-13531:
--

User 'zuowang' has created a pull request for this issue:
https://github.com/apache/spark/pull/11508

> Some DataFrame joins stopped working with UnsupportedOperationException: No 
> size estimation available for objects
> -
>
> Key: SPARK-13531
> URL: https://issues.apache.org/jira/browse/SPARK-13531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: koert kuipers
>Priority: Minor
>
> this is using spark 2.0.0-SNAPSHOT
> dataframe df1:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , obj#135: object, [if (input[0, object].isNullAt) 
> null else input[0, object].get AS x#128]
> +- MapPartitions , createexternalrow(if (isnull(x#9)) null else 
> x#9), [input[0, object] AS obj#135]
>+- WholeStageCodegen
>   :  +- Project [_1#8 AS x#9]
>   : +- Scan ExistingRDD[_1#8]{noformat}
> show:
> {noformat}+---+
> |  x|
> +---+
> |  2|
> |  3|
> +---+{noformat}
> dataframe df2:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true), 
> StructField(y,StringType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , createexternalrow(x#2, if (isnull(y#3)) null else 
> y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get 
> AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, 
> object].get, true) AS y#131]
> +- WholeStageCodegen
>:  +- Project [_1#0 AS x#2,_2#1 AS y#3]
>: +- Scan ExistingRDD[_1#0,_2#1]{noformat}
> show:
> {noformat}+---+---+
> |  x|  y|
> +---+---+
> |  1|  1|
> |  2|  2|
> |  3|  3|
> +---+---+{noformat}
> i run:
> df1.join(df2, Seq("x")).show
> i get:
> {noformat}java.lang.UnsupportedOperationException: No size estimation 
> available for objects.
> at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87){noformat}
> not sure what changed; this ran about a week ago without issues (in our 
> internal unit tests). It is fully reproducible, however when I tried to 
> minimize the issue I could not reproduce it by just creating data frames in 
> the repl with the same contents, so it probably has something to do with the way 
> these are created (from Row objects and StructTypes).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13531) Some DataFrame joins stopped working with UnsupportedOperationException: No size estimation available for objects

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13531:


Assignee: Apache Spark

> Some DataFrame joins stopped working with UnsupportedOperationException: No 
> size estimation available for objects
> -
>
> Key: SPARK-13531
> URL: https://issues.apache.org/jira/browse/SPARK-13531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: koert kuipers
>Assignee: Apache Spark
>Priority: Minor
>
> this is using spark 2.0.0-SNAPSHOT
> dataframe df1:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , obj#135: object, [if (input[0, object].isNullAt) 
> null else input[0, object].get AS x#128]
> +- MapPartitions , createexternalrow(if (isnull(x#9)) null else 
> x#9), [input[0, object] AS obj#135]
>+- WholeStageCodegen
>   :  +- Project [_1#8 AS x#9]
>   : +- Scan ExistingRDD[_1#8]{noformat}
> show:
> {noformat}+---+
> |  x|
> +---+
> |  2|
> |  3|
> +---+{noformat}
> dataframe df2:
> schema:
> {noformat}StructType(StructField(x,IntegerType,true), 
> StructField(y,StringType,true)){noformat}
> explain:
> {noformat}== Physical Plan ==
> MapPartitions , createexternalrow(x#2, if (isnull(y#3)) null else 
> y#3.toString), [if (input[0, object].isNullAt) null else input[0, object].get 
> AS x#130,if (input[0, object].isNullAt) null else staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, 
> object].get, true) AS y#131]
> +- WholeStageCodegen
>:  +- Project [_1#0 AS x#2,_2#1 AS y#3]
>: +- Scan ExistingRDD[_1#0,_2#1]{noformat}
> show:
> {noformat}+---+---+
> |  x|  y|
> +---+---+
> |  1|  1|
> |  2|  2|
> |  3|  3|
> +---+---+{noformat}
> i run:
> df1.join(df2, Seq("x")).show
> i get:
> {noformat}java.lang.UnsupportedOperationException: No size estimation 
> available for objects.
> at org.apache.spark.sql.types.ObjectType.defaultSize(ObjectType.scala:41)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$6.apply(LogicalPlan.scala:323)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.sql.catalyst.plans.logical.UnaryNode.statistics(LogicalPlan.scala:323)
> at 
> org.apache.spark.sql.execution.SparkStrategies$CanBroadcast$.unapply(SparkStrategies.scala:87){noformat}
> not sure what changed; this ran about a week ago without issues (in our 
> internal unit tests). It is fully reproducible, however when I tried to 
> minimize the issue I could not reproduce it by just creating data frames in 
> the repl with the same contents, so it probably has something to do with the way 
> these are created (from Row objects and StructTypes).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13652) TransportClient.sendRpcSync returns wrong results

2016-03-03 Thread huangyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179157#comment-15179157
 ] 

huangyu edited comment on SPARK-13652 at 3/4/16 2:53 AM:
-

I think the docs are about fetching stream rather than sendRpc


was (Author: huang_yuu):
I think it is about fetching stream rather than sendRpc

> TransportClient.sendRpcSync returns wrong results
> -
>
> Key: SPARK-13652
> URL: https://issues.apache.org/jira/browse/SPARK-13652
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: huangyu
> Attachments: RankHandler.java, Test.java
>
>
> TransportClient is not thread safe, and if it is called from multiple threads, 
> the messages can't be encoded and decoded correctly. Below is my code, and it 
> will print wrong messages.
> {code}
> public static void main(String[] args) throws IOException, 
> InterruptedException {
> TransportServer server = new TransportContext(new 
> TransportConf("test",
> new MapConfigProvider(new HashMap())), new 
> RankHandler()).
> createServer(8081, new 
> LinkedList());
> TransportContext context = new TransportContext(new 
> TransportConf("test",
> new MapConfigProvider(new HashMap())), new 
> NoOpRpcHandler(), true);
> final TransportClientFactory clientFactory = 
> context.createClientFactory();
> List ts = new ArrayList<>();
> for (int i = 0; i < 10; i++) {
> ts.add(new Thread(new Runnable() {
> @Override
> public void run() {
> for (int j = 0; j < 1000; j++) {
> try {
> ByteBuf buf = Unpooled.buffer(8);
> buf.writeLong((long) j);
> ByteBuffer byteBuffer = 
> clientFactory.createClient("localhost", 8081).
> sendRpcSync(buf.nioBuffer(), 
> Long.MAX_VALUE);
> long response = byteBuffer.getLong();
> if (response != j) {
> System.err.println("send:" + j + ",response:" 
> + response);
> }
> } catch (IOException e) {
> e.printStackTrace();
> }
> }
> }
> }));
> ts.get(i).start();
> }
> for (Thread t : ts) {
> t.join();
> }
> server.close();
> }
> public class RankHandler extends RpcHandler {
> private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
> private final StreamManager streamManager;
> public RankHandler() {
> this.streamManager = new OneForOneStreamManager();
> }
> @Override
> public void receive(TransportClient client, ByteBuffer msg, 
> RpcResponseCallback callback) {
> callback.onSuccess(msg);
> }
> @Override
> public StreamManager getStreamManager() {
> return streamManager;
> }
> }
> {code}
> it will print as below
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13665) Initial separation of concerns

2016-03-03 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13665:


 Summary: Initial separation of concerns
 Key: SPARK-13665
 URL: https://issues.apache.org/jira/browse/SPARK-13665
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker


The goal here is to break apart: File Management, code to deal with specific 
formats and query planning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13664) Simplify and Speedup HadoopFSRelation

2016-03-03 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13664:


 Summary: Simplify and Speedup HadoopFSRelation
 Key: SPARK-13664
 URL: https://issues.apache.org/jira/browse/SPARK-13664
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker


A majority of Spark SQL queries likely run through {{HadoopFSRelation}}; however, 
there are currently several complexity and performance problems with this code 
path:
 - The class mixes the concerns of file management, schema reconciliation, scan 
building, bucketing, partitioning, and writing data.
 - For very large tables, we are broadcasting the entire list of files to every 
executor. [SPARK-11441]
 - For partitioned tables, we always do an extra projection. This results not 
only in a copy, but also undoes much of the performance gains that we are going to 
get from vectorized reads.

This is an umbrella ticket to track a set of improvements to this codepath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1

2016-03-03 Thread Ted Yu (JIRA)
Ted Yu created SPARK-13663:
--

 Summary: Upgrade Snappy Java to 1.1.2.1
 Key: SPARK-13663
 URL: https://issues.apache.org/jira/browse/SPARK-13663
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Ted Yu
Priority: Minor


The JVM memory leak problem reported in 
https://github.com/xerial/snappy-java/issues/131 has been resolved.

1.1.2.1 was released on Jan 22nd.

We should upgrade to this release.
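
Spark pins this version in its own build files, but for reference, a downstream sbt
project could depend on the fixed release like this (coordinates as published on
Maven Central); this is just an illustrative snippet, not a change to Spark's build.

{code}
// build.sbt fragment: depend on the snappy-java release containing the leak fix.
libraryDependencies += "org.xerial.snappy" % "snappy-java" % "1.1.2.1"
{code}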



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13652) TransportClient.sendRpcSync returns wrong results

2016-03-03 Thread huangyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179186#comment-15179186
 ] 

huangyu edited comment on SPARK-13652 at 3/4/16 2:39 AM:
-

I used your patch and it worked, great!!


was (Author: huang_yuu):
I use your patch and it worked, great!!

> TransportClient.sendRpcSync returns wrong results
> -
>
> Key: SPARK-13652
> URL: https://issues.apache.org/jira/browse/SPARK-13652
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: huangyu
> Attachments: RankHandler.java, Test.java
>
>
> TransportClient is not thread safe, and if it is called from multiple threads, 
> the messages can't be encoded and decoded correctly. Below is my code, and it 
> will print wrong messages.
> {code}
> public static void main(String[] args) throws IOException, 
> InterruptedException {
> TransportServer server = new TransportContext(new 
> TransportConf("test",
> new MapConfigProvider(new HashMap())), new 
> RankHandler()).
> createServer(8081, new 
> LinkedList());
> TransportContext context = new TransportContext(new 
> TransportConf("test",
> new MapConfigProvider(new HashMap())), new 
> NoOpRpcHandler(), true);
> final TransportClientFactory clientFactory = 
> context.createClientFactory();
> List ts = new ArrayList<>();
> for (int i = 0; i < 10; i++) {
> ts.add(new Thread(new Runnable() {
> @Override
> public void run() {
> for (int j = 0; j < 1000; j++) {
> try {
> ByteBuf buf = Unpooled.buffer(8);
> buf.writeLong((long) j);
> ByteBuffer byteBuffer = 
> clientFactory.createClient("localhost", 8081).
> sendRpcSync(buf.nioBuffer(), 
> Long.MAX_VALUE);
> long response = byteBuffer.getLong();
> if (response != j) {
> System.err.println("send:" + j + ",response:" 
> + response);
> }
> } catch (IOException e) {
> e.printStackTrace();
> }
> }
> }
> }));
> ts.get(i).start();
> }
> for (Thread t : ts) {
> t.join();
> }
> server.close();
> }
> public class RankHandler extends RpcHandler {
> private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
> private final StreamManager streamManager;
> public RankHandler() {
> this.streamManager = new OneForOneStreamManager();
> }
> @Override
> public void receive(TransportClient client, ByteBuffer msg, 
> RpcResponseCallback callback) {
> callback.onSuccess(msg);
> }
> @Override
> public StreamManager getStreamManager() {
> return streamManager;
> }
> }
> {code}
> it will print as below
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13652) TransportClient.sendRpcSync returns wrong results

2016-03-03 Thread huangyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179186#comment-15179186
 ] 

huangyu commented on SPARK-13652:
-

I use your patch and it worked, great!!

> TransportClient.sendRpcSync returns wrong results
> -
>
> Key: SPARK-13652
> URL: https://issues.apache.org/jira/browse/SPARK-13652
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: huangyu
> Attachments: RankHandler.java, Test.java
>
>
> TransportClient is not thread safe, and if it is called from multiple threads, 
> the messages can't be encoded and decoded correctly. Below is my code, and it 
> will print wrong messages.
> {code}
> public static void main(String[] args) throws IOException, 
> InterruptedException {
> TransportServer server = new TransportContext(new 
> TransportConf("test",
> new MapConfigProvider(new HashMap())), new 
> RankHandler()).
> createServer(8081, new 
> LinkedList());
> TransportContext context = new TransportContext(new 
> TransportConf("test",
> new MapConfigProvider(new HashMap())), new 
> NoOpRpcHandler(), true);
> final TransportClientFactory clientFactory = 
> context.createClientFactory();
> List ts = new ArrayList<>();
> for (int i = 0; i < 10; i++) {
> ts.add(new Thread(new Runnable() {
> @Override
> public void run() {
> for (int j = 0; j < 1000; j++) {
> try {
> ByteBuf buf = Unpooled.buffer(8);
> buf.writeLong((long) j);
> ByteBuffer byteBuffer = 
> clientFactory.createClient("localhost", 8081).
> sendRpcSync(buf.nioBuffer(), 
> Long.MAX_VALUE);
> long response = byteBuffer.getLong();
> if (response != j) {
> System.err.println("send:" + j + ",response:" 
> + response);
> }
> } catch (IOException e) {
> e.printStackTrace();
> }
> }
> }
> }));
> ts.get(i).start();
> }
> for (Thread t : ts) {
> t.join();
> }
> server.close();
> }
> public class RankHandler extends RpcHandler {
> private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
> private final StreamManager streamManager;
> public RankHandler() {
> this.streamManager = new OneForOneStreamManager();
> }
> @Override
> public void receive(TransportClient client, ByteBuffer msg, 
> RpcResponseCallback callback) {
> callback.onSuccess(msg);
> }
> @Override
> public StreamManager getStreamManager() {
> return streamManager;
> }
> }
> {code}
> it will print as below
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13629) Add binary toggle Param to CountVectorizer

2016-03-03 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179180#comment-15179180
 ] 

Gayathri Murali commented on SPARK-13629:
-

[~josephkb] In the case of discrete probabilistic models, where this is most 
applicable, would the min_df and max_df thresholds change, or should only the word 
count be set to 0 or 1 depending on the value of binary?

> Add binary toggle Param to CountVectorizer
> --
>
> Key: SPARK-13629
> URL: https://issues.apache.org/jira/browse/SPARK-13629
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It would be handy to add a binary toggle Param to CountVectorizer, as in the 
> scikit-learn one: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html]
> If set, then all non-zero counts will be set to 1.
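
A tiny sketch of the intended semantics (not the eventual Param or API): binarize a
term-count vector after the fact so that every non-zero count becomes 1.0.

{code}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

// Clamp every non-zero count to 1.0, preserving sparsity.
def binarize(counts: Vector): Vector = counts match {
  case sv: SparseVector =>
    Vectors.sparse(sv.size, sv.indices, sv.values.map(v => if (v != 0.0) 1.0 else 0.0))
  case dv: DenseVector =>
    Vectors.dense(dv.values.map(v => if (v != 0.0) 1.0 else 0.0))
}

println(binarize(Vectors.dense(0.0, 3.0, 1.0))) // counts (0, 3, 1) -> (0, 1, 1)
{code}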



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12221) Add CPU time metric to TaskMetrics

2016-03-03 Thread Jisoo Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179173#comment-15179173
 ] 

Jisoo Kim commented on SPARK-12221:
---

I made this backward-compatible, so it should be able to read history written 
with earlier versions.

> Add CPU time metric to TaskMetrics
> --
>
> Key: SPARK-12221
> URL: https://issues.apache.org/jira/browse/SPARK-12221
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 1.5.2
>Reporter: Jisoo Kim
>
> Currently TaskMetrics doesn't support executor CPU time. I'd like to have one 
> so I can retrieve the metric from History Server.
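
For what it's worth, per-thread CPU time is commonly measured on the JVM via
ThreadMXBean; the sketch below is only an assumption about how such a metric could
be collected, not a description of the eventual TaskMetrics change.

{code}
import java.lang.management.ManagementFactory

// Measure CPU time (in nanoseconds) consumed by the current thread around a body of work.
def measureCpuTimeNs[T](body: => T): (T, Long) = {
  val bean = ManagementFactory.getThreadMXBean
  val supported = bean.isCurrentThreadCpuTimeSupported
  val start = if (supported) bean.getCurrentThreadCpuTime else 0L
  val result = body
  val end = if (supported) bean.getCurrentThreadCpuTime else 0L
  (result, end - start)
}

val (_, cpuNs) = measureCpuTimeNs { (1 to 1000000).map(math.sqrt(_)).sum }
println(s"task used $cpuNs ns of CPU time")
{code}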



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13652) TransportClient.sendRpcSync returns wrong results

2016-03-03 Thread huangyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179157#comment-15179157
 ] 

huangyu commented on SPARK-13652:
-

I think it is about fetching stream rather than sendRpc

> TransportClient.sendRpcSync returns wrong results
> -
>
> Key: SPARK-13652
> URL: https://issues.apache.org/jira/browse/SPARK-13652
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: huangyu
> Attachments: RankHandler.java, Test.java
>
>
> TransportClient is not thread safe and if it is called from multiple threads, 
> the messages can't be encoded and decoded correctly. Below is my code,and it 
> will print wrong message.
> {code}
> public static void main(String[] args) throws IOException, 
> InterruptedException {
> TransportServer server = new TransportContext(new 
> TransportConf("test",
> new MapConfigProvider(new HashMap())), new 
> RankHandler()).
> createServer(8081, new 
> LinkedList());
> TransportContext context = new TransportContext(new 
> TransportConf("test",
> new MapConfigProvider(new HashMap())), new 
> NoOpRpcHandler(), true);
> final TransportClientFactory clientFactory = 
> context.createClientFactory();
> List ts = new ArrayList<>();
> for (int i = 0; i < 10; i++) {
> ts.add(new Thread(new Runnable() {
> @Override
> public void run() {
> for (int j = 0; j < 1000; j++) {
> try {
> ByteBuf buf = Unpooled.buffer(8);
> buf.writeLong((long) j);
> ByteBuffer byteBuffer = 
> clientFactory.createClient("localhost", 8081).
> sendRpcSync(buf.nioBuffer(), 
> Long.MAX_VALUE);
> long response = byteBuffer.getLong();
> if (response != j) {
> System.err.println("send:" + j + ",response:" 
> + response);
> }
> } catch (IOException e) {
> e.printStackTrace();
> }
> }
> }
> }));
> ts.get(i).start();
> }
> for (Thread t : ts) {
> t.join();
> }
> server.close();
> }
> public class RankHandler extends RpcHandler {
> private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
> private final StreamManager streamManager;
> public RankHandler() {
> this.streamManager = new OneForOneStreamManager();
> }
> @Override
> public void receive(TransportClient client, ByteBuffer msg, 
> RpcResponseCallback callback) {
> callback.onSuccess(msg);
> }
> @Override
> public StreamManager getStreamManager() {
> return streamManager;
> }
> }
> {code}
> it will print as below
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13642) Different finishing state between Spark and YARN

2016-03-03 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179102#comment-15179102
 ] 

Saisai Shao edited comment on SPARK-13642 at 3/4/16 1:45 AM:
-

Hi [~tgraves], thanks a lot for your comments. I think here {{stop()}} is not 
called explicitly by the user's code when the app is finished; it is triggered in 
a shutdown hook when receiving a signal.

If the AM is terminated by a signal, other than a deliberate kill, the application 
is actually in the middle of running, so does it make sense to mark it as 
successful? In this specific case, we expect another attempt but YARN fails to 
do it. Personally I think we should handle this situation properly.
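
A small, generic illustration of the mechanism being discussed (not Spark code):
the JVM runs registered shutdown hooks when it receives SIGTERM, which is why
SparkContext.stop() ends up being invoked even though no exception was thrown.

{code}
object ShutdownHookDemo extends App {
  // Send SIGTERM (kill <pid>) while this sleeps and the hook below still runs,
  // just as Spark's shutdown hook calls SparkContext.stop() on a signal.
  Runtime.getRuntime.addShutdownHook(new Thread() {
    override def run(): Unit = println("shutdown hook: stopping cleanly")
  })
  Thread.sleep(60000)
}
{code}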




was (Author: jerryshao):
Hi [~tgraves], thanks a lot for your comments. I think here {{stop()}} is not 
called explicitly by the user's code when app is finished, this is triggered in 
shutdown hook when receiving a signal.

Here in ApplicationMaster, if user's class is finished, either by calling 
{{System.exit()}} or to the end of main function, we will mark this application 
finished with success, you could see 
[here|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L553].

If the AM is terminated by signal, except deliberately killing, the application 
is actually in the middle of runtime, so does it make sense to mark it as 
successful? In this specific case, we expect another attempt but yarn fails to 
do it.



> Different finishing state between Spark and YARN
> 
>
> Key: SPARK-13642
> URL: https://issues.apache.org/jira/browse/SPARK-13642
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> Currently, when running Spark on YARN in yarn-cluster mode, the default 
> application final state is "SUCCEED"; if any exception occurs, this 
> final state is changed to "FAILED" and triggers a reattempt if 
> possible. 
> This is OK in the normal case, but there is a race condition when the AM receives a 
> signal (SIGTERM) and no exception has occurred. In this situation, the 
> shutdown hook is invoked and marks this application as finished with 
> success, and there is no further attempt.
> In that situation, from Spark's perspective the application has failed 
> and needs another attempt, but from YARN's perspective the application has finished 
> successfully.
> This can happen when an NM fails: the failure of the NM sends 
> SIGTERM to the AM, and the AM should mark this attempt as failed and rerun, not 
> invoke unregister.
> So, to increase the chance of hitting this race condition, here is the reproduction code:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM fails while sleeping, no exception is thrown, so from 
> YARN's point of view the application finished successfully, but from Spark's 
> point of view the application should be reattempted.
> The log normally looks like this:
> {noformat}
> 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : 
> 192.168.0.105:45454
> 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2
> 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1
> 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready 
> for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 16/03/03 12:44:23 INFO YarnClusterScheduler: 
> YarnClusterScheduler.postStartHook done
> 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/metrics/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/api,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/static,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> 

[jira] [Resolved] (SPARK-13415) Visualize subquery in SQL web UI

2016-03-03 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-13415.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11417
[https://github.com/apache/spark/pull/11417]

> Visualize subquery in SQL web UI
> 
>
> Key: SPARK-13415
> URL: https://issues.apache.org/jira/browse/SPARK-13415
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> Right now, uncorrelated scalar subqueries are not showed in SQL tab.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13642) Different finishing state between Spark and YARN

2016-03-03 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-13642:

Summary: Different finishing state between Spark and YARN  (was: 
Inconsistent finishing state between driver and AM )

> Different finishing state between Spark and YARN
> 
>
> Key: SPARK-13642
> URL: https://issues.apache.org/jira/browse/SPARK-13642
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> Currently, when running Spark on YARN in yarn-cluster mode, the default 
> application final state is "SUCCEED"; if any exception occurs, this 
> final state is changed to "FAILED" and triggers a reattempt if 
> possible. 
> This is OK in the normal case, but there is a race condition when the AM receives a 
> signal (SIGTERM) and no exception has occurred. In this situation, the 
> shutdown hook is invoked and marks this application as finished with 
> success, and there is no further attempt.
> In that situation, from Spark's perspective the application has failed 
> and needs another attempt, but from YARN's perspective the application has finished 
> successfully.
> This can happen when an NM fails: the failure of the NM sends 
> SIGTERM to the AM, and the AM should mark this attempt as failed and rerun, not 
> invoke unregister.
> So, to increase the chance of hitting this race condition, here is the reproduction code:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM is failed in sleeping, there's no exception been thrown, so from 
> Yarn's point this application is finished successfully, but from Spark's 
> point, this application should be reattempted.
> The log normally like this:
> {noformat}
> 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : 
> 192.168.0.105:45454
> 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2
> 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1
> 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready 
> for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 16/03/03 12:44:23 INFO YarnClusterScheduler: 
> YarnClusterScheduler.postStartHook done
> 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/metrics/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/api,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/static,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/rdd,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/pool/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/pool,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage,null}
> 

[jira] [Commented] (SPARK-13642) Inconsistent finishing state between driver and AM

2016-03-03 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179102#comment-15179102
 ] 

Saisai Shao commented on SPARK-13642:
-

Hi [~tgraves], thanks a lot for your comments. Here {{stop()}} is not called 
explicitly by the user's code when the app finishes; it is triggered in the 
shutdown hook upon receiving a signal.

In ApplicationMaster, if the user's class finishes, either by calling 
{{System.exit()}} or by reaching the end of its main function, we mark the 
application as finished with success; see 
[here|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L553].

If the AM is terminated by a signal, other than a deliberate kill, the 
application is actually still in the middle of running, so does it make sense 
to mark it as successful? In this specific case, we expect another attempt, but 
YARN does not make one.
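
For illustration, a minimal sketch of the behaviour argued for above; this is 
not the actual ApplicationMaster code, and the object and method names are 
hypothetical. The idea is that the shutdown path only reports success when the 
user class has genuinely completed, and otherwise leaves the attempt marked as 
failed so that YARN can schedule another attempt.

{code}
import java.util.concurrent.atomic.AtomicBoolean

object AmFinishSketch {
  // Set to true only after the user's main() has returned normally.
  private val userClassCompleted = new AtomicBoolean(false)

  def runUserClass(userMain: () => Unit): Unit = {
    userMain()
    userClassCompleted.set(true)
  }

  // Called from the shutdown hook, e.g. after SIGTERM.
  def onShutdown(): Unit = {
    if (userClassCompleted.get()) {
      reportFinalStatus("SUCCEEDED") // normal completion: unregister with success
    } else {
      reportFinalStatus("FAILED")    // interrupted mid-run: let YARN retry the attempt
    }
  }

  // Placeholder for unregistering from the ResourceManager with a final status.
  private def reportFinalStatus(status: String): Unit =
    println(s"final status: $status")
}
{code}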



> Inconsistent finishing state between driver and AM 
> ---
>
> Key: SPARK-13642
> URL: https://issues.apache.org/jira/browse/SPARK-13642
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
>
> Currently, when running Spark on YARN in yarn-cluster mode, the default 
> application final state is "SUCCEED"; if any exception occurs, this final 
> state is changed to "FAILED" and a reattempt is triggered if possible. 
> This is fine in the normal case, but there is a race condition when the AM 
> receives a signal (SIGTERM) and no exception has occurred. In this situation, 
> the shutdown hook is invoked and marks the application as finished with 
> success, and there is no further attempt.
> In such a situation, from Spark's perspective the application has failed and 
> needs another attempt, but from YARN's perspective the application finished 
> with success.
> This can happen when a NodeManager fails: the NM failure sends SIGTERM to the 
> AM, and the AM should mark this attempt as failed and rerun, rather than 
> invoke unregister.
> To increase the chance of hitting this race condition, here is the 
> reproduction code:
> {code}
> val sc = ...
> Thread.sleep(3L)
> sc.parallelize(1 to 100).collect()
> {code}
> If the AM is killed while sleeping, no exception is thrown, so from YARN's 
> point of view the application finished successfully, but from Spark's point 
> of view it should be reattempted.
> The log typically looks like this:
> {noformat}
> 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : 
> 192.168.0.105:45454
> 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2
> 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor 
> NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1
> 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready 
> for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 16/03/03 12:44:23 INFO YarnClusterScheduler: 
> YarnClusterScheduler.postStartHook done
> 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/metrics/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/api,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/static,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/threadDump,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/executors,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/environment,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped 
> o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
> 16/03/03 12:44:39 INFO 

[jira] [Commented] (SPARK-13601) Invoke task failure callbacks before calling outputstream.close()

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179078#comment-15179078
 ] 

Apache Spark commented on SPARK-13601:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11507

> Invoke task failure callbacks before calling outputstream.close()
> -
>
> Key: SPARK-13601
> URL: https://issues.apache.org/jira/browse/SPARK-13601
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> We need to submit another PR against Spark to call the task failure callbacks 
> before Spark calls the close function on various output streams.
> For example, we need to intercept an exception and call 
> TaskContext.markTaskFailed before calling close in the following code (in 
> PairRDDFunctions.scala):
> {code}
>   Utils.tryWithSafeFinally {
> while (iter.hasNext) {
>   val record = iter.next()
>   writer.write(record._1.asInstanceOf[AnyRef], 
> record._2.asInstanceOf[AnyRef])
>   // Update bytes written metric every few records
>   maybeUpdateOutputMetrics(outputMetricsAndBytesWrittenCallback, 
> recordsWritten)
>   recordsWritten += 1
> }
>   } {
> writer.close()
>   }
> {code}
> Changes to Spark should include unit tests to make sure this always works in 
> the future.
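
For illustration, a hedged sketch of the pattern described above on a toy 
writer interface; RecordWriter, writeAll and onTaskFailure are illustrative 
names, not the actual Spark internals. The point is that the failure callback 
runs before close(), and a secondary failure inside close() does not mask the 
original one.

{code}
object FailureCallbackSketch {
  trait RecordWriter[T] { def write(record: T): Unit; def close(): Unit }

  def writeAll[T](records: Iterator[T], writer: RecordWriter[T])
                 (onTaskFailure: Throwable => Unit): Unit = {
    var original: Throwable = null
    try {
      records.foreach(writer.write)
    } catch {
      case t: Throwable =>
        original = t
        onTaskFailure(t) // the callback sees the failure before the stream is closed
        throw t
    } finally {
      try writer.close() catch {
        // Don't let a failure in close() hide the original task failure.
        case t: Throwable => if (original == null) throw t
      }
    }
  }
}
{code}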



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12877) TrainValidationSplit is missing in pyspark.ml.tuning

2016-03-03 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12877:
--
Assignee: Jeremy

> TrainValidationSplit is missing in pyspark.ml.tuning
> 
>
> Key: SPARK-12877
> URL: https://issues.apache.org/jira/browse/SPARK-12877
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>Assignee: Jeremy
> Fix For: 2.0.0
>
>
> I was investigating progress in SPARK-10759 and I noticed that there is no 
> TrainValidationSplit class in the pyspark.ml.tuning module.
> The Java/Scala examples for SPARK-10759 use 
> org.apache.spark.ml.tuning.TrainValidationSplit, which is not available from 
> Python, and this blocks SPARK-10759.
> Does the class have a different name in PySpark, maybe? Also, I couldn't find 
> any JIRA task saying it needs to be implemented. Is it by design that the 
> TrainValidationSplit estimator is not ported to PySpark? If not, that is, if 
> the estimator needs porting, then I would like to contribute.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11861) Expose feature importances API for decision trees

2016-03-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11861:
--
Shepherd: Joseph K. Bradley
Assignee: Seth Hendrickson
Target Version/s: 2.0.0

> Expose feature importances API for decision trees
> -
>
> Key: SPARK-11861
> URL: https://issues.apache.org/jira/browse/SPARK-11861
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
>
> Feature importances should be added to decision trees leveraging the feature 
> importance implementation for Random Forests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13633) Move parser classes to o.a.s.sql.catalyst.parser package

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13633:


Assignee: Andrew Or  (was: Apache Spark)

> Move parser classes to o.a.s.sql.catalyst.parser package
> 
>
> Key: SPARK-13633
> URL: https://issues.apache.org/jira/browse/SPARK-13633
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13633) Move parser classes to o.a.s.sql.catalyst.parser package

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13633:


Assignee: Apache Spark  (was: Andrew Or)

> Move parser classes to o.a.s.sql.catalyst.parser package
> 
>
> Key: SPARK-13633
> URL: https://issues.apache.org/jira/browse/SPARK-13633
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13633) Move parser classes to o.a.s.sql.catalyst.parser package

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179033#comment-15179033
 ] 

Apache Spark commented on SPARK-13633:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/11506

> Move parser classes to o.a.s.sql.catalyst.parser package
> 
>
> Key: SPARK-13633
> URL: https://issues.apache.org/jira/browse/SPARK-13633
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12721) SQL generation support for script transformation

2016-03-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12721:

Comment: was deleted

(was: A PR is submitted https://github.com/apache/spark/pull/11503)

> SQL generation support for script transformation
> 
>
> Key: SPARK-12721
> URL: https://issues.apache.org/jira/browse/SPARK-12721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists

2016-03-03 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179016#comment-15179016
 ] 

Seth Hendrickson edited comment on SPARK-13068 at 3/4/16 12:32 AM:
---

I think these are good and valid points. I will give it some more thought. 

My concern is that the {{expectedType}} approach does not play nice with 
lists/numpy arrays/vectors. If {{expectedType=list}}, then we can cast numpy 
array to list, but if the numpy array dtype is float and Scala expects ints in 
the array, there will still be a Py4J classcast exception. Likewise, if someone 
passes [1,2,3] to an Array[Double] param in Scala, then we will get an 
exception. To me, it is a bit unsatisfying to provide a solution that works for 
one very common case, but still fails in other common cases.

I'm fine with not adding subclasses of Param for each type, but I think the 
Param validator functions would provide a comprehensive solution to some of the 
issues we're seeing. There is another 
[Jira|https://issues.apache.org/jira/browse/SPARK-10009] open about pyspark 
params working with lists/numpy arrays/vectors so I think addressing this issue 
in a robust way is important. I would love to hear feedback and thoughts on 
this or alternative approaches. Thanks!


was (Author: sethah):
I think these are good and valid points. I will give it some more thought. 

My concern is that the {{expectedType}} approach does not play nice with 
lists/numpy arrays/vectors. If {{expectedType=list}}, then we can cast numpy 
array to list, but if the numpy array dtype is float and Scala expects ints in 
the array, there will still be a Py4J classcast exception. Likewise, if someone 
passes [1,2,3] to an Array[Double] param in Scala, then we will get an 
exception. To me, it is a bit unsatisfying to provide a solution that works for 
one very common case, but still fails in other common cases.

I'm fine with not adding subclasses of Param for each type, but I think the 
Param validator functions would provide a comprehensive solution to some of the 
issues we're seeing. There is another 
[Jira|https://issues.apache.org/jira/browse/SPARK-10009] open about pyspark 
params working with lists/numpy arrays/vectors so I think addressing this issue 
in a robust way is important. I would love to hear feedback and others' 
thoughts and alternative approaches. Thanks!

> Extend pyspark ml paramtype conversion to support lists
> ---
>
> Key: SPARK-13068
> URL: https://issues.apache.org/jira/browse/SPARK-13068
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Trivial
>
> In SPARK-7675 we added type conversion for PySpark ML params. We should 
> follow up and support param type conversion for lists and nested structures 
> as required. This blocks having all PySpark ML params having type information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13631) getPreferredLocations race condition in spark 1.6.0?

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13631:


Assignee: (was: Apache Spark)

> getPreferredLocations race condition in spark 1.6.0?
> 
>
> Key: SPARK-13631
> URL: https://issues.apache.org/jira/browse/SPARK-13631
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Andy Sloane
>
> We are seeing something that looks a lot like a regression from spark 1.2. 
> When we run jobs with multiple threads, we have a crash somewhere inside 
> getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside 
> org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs 
> instead of DAGScheduler directly.
> I tried Spark 1.2 post-SPARK-4454 (before this patch it's only slightly 
> flaky), 1.4.1, and 1.5.2 and all are fine. 1.6.0 immediately crashes on our 
> threaded test case, though once in a while it passes.
> The stack trace is huge, but starts like this:
> Caused by: java.lang.NullPointerException: null
>   at 
> org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406)
>   at 
> org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366)
>   at 
> org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545)
> The full trace is available here:
> https://gist.github.com/andy256/97611f19924bbf65cf49



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12721) SQL generation support for script transformation

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12721:


Assignee: (was: Apache Spark)

> SQL generation support for script transformation
> 
>
> Key: SPARK-12721
> URL: https://issues.apache.org/jira/browse/SPARK-12721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12721) SQL generation support for script transformation

2016-03-03 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179017#comment-15179017
 ] 

Xiao Li commented on SPARK-12721:
-

A PR is submitted https://github.com/apache/spark/pull/11503

> SQL generation support for script transformation
> 
>
> Key: SPARK-12721
> URL: https://issues.apache.org/jira/browse/SPARK-12721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13631) getPreferredLocations race condition in spark 1.6.0?

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179019#comment-15179019
 ] 

Apache Spark commented on SPARK-13631:
--

User 'a1k0n' has created a pull request for this issue:
https://github.com/apache/spark/pull/11505

> getPreferredLocations race condition in spark 1.6.0?
> 
>
> Key: SPARK-13631
> URL: https://issues.apache.org/jira/browse/SPARK-13631
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Andy Sloane
>
> We are seeing something that looks a lot like a regression from spark 1.2. 
> When we run jobs with multiple threads, we have a crash somewhere inside 
> getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside 
> org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs 
> instead of DAGScheduler directly.
> I tried Spark 1.2 post-SPARK-4454 (before this patch it's only slightly 
> flaky), 1.4.1, and 1.5.2 and all are fine. 1.6.0 immediately crashes on our 
> threaded test case, though once in a while it passes.
> The stack trace is huge, but starts like this:
> Caused by: java.lang.NullPointerException: null
>   at 
> org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406)
>   at 
> org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366)
>   at 
> org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545)
> The full trace is available here:
> https://gist.github.com/andy256/97611f19924bbf65cf49



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12721) SQL generation support for script transformation

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12721:


Assignee: Apache Spark

> SQL generation support for script transformation
> 
>
> Key: SPARK-12721
> URL: https://issues.apache.org/jira/browse/SPARK-12721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12721) SQL generation support for script transformation

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179018#comment-15179018
 ] 

Apache Spark commented on SPARK-12721:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11503

> SQL generation support for script transformation
> 
>
> Key: SPARK-12721
> URL: https://issues.apache.org/jira/browse/SPARK-12721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13631) getPreferredLocations race condition in spark 1.6.0?

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13631:


Assignee: Apache Spark

> getPreferredLocations race condition in spark 1.6.0?
> 
>
> Key: SPARK-13631
> URL: https://issues.apache.org/jira/browse/SPARK-13631
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Andy Sloane
>Assignee: Apache Spark
>
> We are seeing something that looks a lot like a regression from spark 1.2. 
> When we run jobs with multiple threads, we have a crash somewhere inside 
> getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside 
> org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs 
> instead of DAGScheduler directly.
> I tried Spark 1.2 post-SPARK-4454 (before this patch it's only slightly 
> flaky), 1.4.1, and 1.5.2 and all are fine. 1.6.0 immediately crashes on our 
> threaded test case, though once in a while it passes.
> The stack trace is huge, but starts like this:
> Caused by: java.lang.NullPointerException: null
>   at 
> org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406)
>   at 
> org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366)
>   at 
> org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545)
> The full trace is available here:
> https://gist.github.com/andy256/97611f19924bbf65cf49



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists

2016-03-03 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179016#comment-15179016
 ] 

Seth Hendrickson commented on SPARK-13068:
--

I think these are good and valid points. I will give it some more thought. 

My concern is that the {{expectedType}} approach does not play nice with 
lists/numpy arrays/vectors. If {{expectedType=list}}, then we can cast numpy 
array to list, but if the numpy array dtype is float and Scala expects ints in 
the array, there will still be a Py4J classcast exception. Likewise, if someone 
passes [1,2,3] to an Array[Double] param in Scala, then we will get an 
exception. To me, it is a bit unsatisfying to provide a solution that works for 
one very common case, but still fails in other common cases.

I'm fine with not adding subclasses of Param for each type, but I think the 
Param validator functions would provide a comprehensive solution to some of the 
issues we're seeing. There is another 
[Jira|https://issues.apache.org/jira/browse/SPARK-10009] open about pyspark 
params working with lists/numpy arrays/vectors so I think addressing this issue 
in a robust way is important. I would love to hear feedback and others' 
thoughts and alternative approaches. Thanks!

> Extend pyspark ml paramtype conversion to support lists
> ---
>
> Key: SPARK-13068
> URL: https://issues.apache.org/jira/browse/SPARK-13068
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Trivial
>
> In SPARK-7675 we added type conversion for PySpark ML params. We should 
> follow up and support param type conversion for lists and nested structures 
> as required. This blocks having all PySpark ML params having type information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179012#comment-15179012
 ] 

Jeff Zhang commented on SPARK-13587:


Thanks for the feedback [~j_houg]. Yes, that's why I make the virtualenv 
application-scoped. Creating a Python package management tool for a whole 
cluster would be a big project and too heavyweight. And what does 
"--pyspark_python" mean?

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not for 
> complicated dependencies, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time consuming, 
> and not easy to switch between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA is about bringing these two tools 
> to the distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13601) Invoke task failure callbacks before calling outputstream.close()

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179011#comment-15179011
 ] 

Apache Spark commented on SPARK-13601:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/11504

> Invoke task failure callbacks before calling outputstream.close()
> -
>
> Key: SPARK-13601
> URL: https://issues.apache.org/jira/browse/SPARK-13601
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> We need to submit another PR against Spark to call the task failure callbacks 
> before Spark calls the close function on various output streams.
> For example, we need to intercept an exception and call 
> TaskContext.markTaskFailed before calling close in the following code (in 
> PairRDDFunctions.scala):
> {code}
>   Utils.tryWithSafeFinally {
> while (iter.hasNext) {
>   val record = iter.next()
>   writer.write(record._1.asInstanceOf[AnyRef], 
> record._2.asInstanceOf[AnyRef])
>   // Update bytes written metric every few records
>   maybeUpdateOutputMetrics(outputMetricsAndBytesWrittenCallback, 
> recordsWritten)
>   recordsWritten += 1
> }
>   } {
> writer.close()
>   }
> {code}
> Changes to Spark should include unit tests to make sure this always works in 
> the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13662) [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore

2016-03-03 Thread Evan Chan (JIRA)
Evan Chan created SPARK-13662:
-

 Summary: [SQL][Hive] Have SHOW TABLES return additional fields 
from Hive MetaStore 
 Key: SPARK-13662
 URL: https://issues.apache.org/jira/browse/SPARK-13662
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0, 1.5.2
 Environment: All
Reporter: Evan Chan


Currently, the SHOW TABLES command in Spark's Hive ThriftServer, or 
equivalently the HiveContext.tables method, returns a DataFrame with only two 
columns: the name of the table and whether it is temporary.  It would be really 
nice to add support to return some extra information, such as:

- Whether this table is Spark-only or a native Hive table
- If spark-only, the name of the data source
- potentially other properties

The first two are really useful for BI environments that connect to multiple 
data sources and work with both Hive and Spark.

Some thoughts:
- The SQL/HiveContext Catalog API might need to be expanded to return something 
like a TableEntry, rather than just a tuple of (name, temporary); a sketch of 
that shape follows below.
- I believe there is a Hive Catalog/client API to get information about each 
table. I suppose one concern would be the speed of using this API. Perhaps 
there are other APIs that can get this info faster.
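
A minimal sketch of what such a TableEntry could look like; the field names 
below are assumptions for illustration, not an agreed API.

{code}
// Hypothetical shape for the expanded catalog entry discussed above.
case class TableEntry(
    name: String,
    isTemporary: Boolean,
    isSparkOnly: Boolean,            // Spark data-source table vs. native Hive table
    dataSource: Option[String],      // e.g. Some("parquet") when Spark-only
    properties: Map[String, String] = Map.empty)

object TableEntry {
  // Today's (name, temporary) result could still be derived from it.
  def asTuple(e: TableEntry): (String, Boolean) = (e.name, e.isTemporary)
}
{code}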




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13631) getPreferredLocations race condition in spark 1.6.0?

2016-03-03 Thread Andy Sloane (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178980#comment-15178980
 ] 

Andy Sloane commented on SPARK-13631:
-

Hm, I see. It's just a coincidence that a thread from the thread pool tends to 
get scheduled while the shuffle task is running, and it tries to schedule a new 
job based on the locations of the (currently in-process) shuffle tasks. There's 
no blocking wait on an RDD that's causing this job to kick off -- it's just 
trying to schedule it.

It seems like we'd want to somehow defer finding preferred locations for an RDD 
that is currently being computed, but the job planning seems to happen 
completely up front and there aren't any indicators in an individual RDD that 
it's presently being computed, which means we are probably unnecessarily 
recomputing RDDs in this multithreaded scheme.



> getPreferredLocations race condition in spark 1.6.0?
> 
>
> Key: SPARK-13631
> URL: https://issues.apache.org/jira/browse/SPARK-13631
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Andy Sloane
>
> We are seeing something that looks a lot like a regression from spark 1.2. 
> When we run jobs with multiple threads, we have a crash somewhere inside 
> getPreferredLocations, as was fixed in SPARK-4454. Except now it's inside 
> org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs 
> instead of DAGScheduler directly.
> I tried Spark 1.2 post-SPARK-4454 (before this patch it's only slightly 
> flaky), 1.4.1, and 1.5.2 and all are fine. 1.6.0 immediately crashes on our 
> threaded test case, though once in a while it passes.
> The stack trace is huge, but starts like this:
> Caused by: java.lang.NullPointerException: null
>   at 
> org.apache.spark.MapOutputTrackerMaster.getLocationsWithLargestOutputs(MapOutputTracker.scala:406)
>   at 
> org.apache.spark.MapOutputTrackerMaster.getPreferredLocationsForShuffle(MapOutputTracker.scala:366)
>   at 
> org.apache.spark.rdd.ShuffledRDD.getPreferredLocations(ShuffledRDD.scala:92)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:257)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:256)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1545)
> The full trace is available here:
> https://gist.github.com/andy256/97611f19924bbf65cf49



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178883#comment-15178883
 ] 

Jeff Zhang commented on SPARK-13587:


Thanks for letting me know this. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not for 
> complicated dependencies, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time consuming, 
> and not easy to switch between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA is about bringing these two tools 
> to the distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13659) Remove returnValues from BlockStore APIs

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13659:


Assignee: Apache Spark  (was: Josh Rosen)

> Remove returnValues from BlockStore APIs
> 
>
> Key: SPARK-13659
> URL: https://issues.apache.org/jira/browse/SPARK-13659
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> In preparation for larger refactorings, I think that we should remove the 
> confusing returnValues() option from the BlockStore put() APIs: returning the 
> value is only useful in one place (caching) and in other situations, such as 
> block replication, it's simpler to put() and then get().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13659) Remove returnValues from BlockStore APIs

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13659:


Assignee: Josh Rosen  (was: Apache Spark)

> Remove returnValues from BlockStore APIs
> 
>
> Key: SPARK-13659
> URL: https://issues.apache.org/jira/browse/SPARK-13659
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In preparation for larger refactorings, I think that we should remove the 
> confusing returnValues() option from the BlockStore put() APIs: returning the 
> value is only useful in one place (caching) and in other situations, such as 
> block replication, it's simpler to put() and then get().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13661) Avoid the copy of UnsafeRow in HashedRelation

2016-03-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13661:
--

 Summary: Avoid the copy of UnsafeRow in HashedRelation
 Key: SPARK-13661
 URL: https://issues.apache.org/jira/browse/SPARK-13661
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


We usually build the HashedRelation on top of an array of UnsafeRow; the copy 
could be avoided.

The caller of HashedRelation would then need to do the copy if it is needed.

Another approach could be making UnsafeRow's copy() smart enough to know when 
it should copy the underlying bytes and when it should not; this could also be 
useful for other components. 
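
For illustration, a hedged sketch of the first option on a toy row type, 
deferring the copy to the caller; Row, buildRelation and the caller names are 
stand-ins, not the actual HashedRelation or UnsafeRow API.

{code}
object HashedRelationCopySketch {
  // The relation stores whatever rows it is given and never copies them itself.
  final case class Row(key: Int, payload: Array[Byte]) {
    def copyRow(): Row = copy(payload = payload.clone())
  }

  def buildRelation(rows: Iterator[Row]): Map[Int, Seq[Row]] =
    rows.toSeq.groupBy(_.key)

  // Caller A: its rows come from a reusable buffer, so it copies before building.
  def buildFromReusedRows(rows: Iterator[Row]): Map[Int, Seq[Row]] =
    buildRelation(rows.map(_.copyRow()))

  // Caller B: its rows already own their bytes, so no copy is needed.
  def buildFromOwnedRows(rows: Iterator[Row]): Map[Int, Seq[Row]] =
    buildRelation(rows)
}
{code}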



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13659) Remove returnValues from BlockStore APIs

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178869#comment-15178869
 ] 

Apache Spark commented on SPARK-13659:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11502

> Remove returnValues from BlockStore APIs
> 
>
> Key: SPARK-13659
> URL: https://issues.apache.org/jira/browse/SPARK-13659
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In preparation for larger refactorings, I think that we should remove the 
> confusing returnValues() option from the BlockStore put() APIs: returning the 
> value is only useful in one place (caching) and in other situations, such as 
> block replication, it's simpler to put() and then get().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13660) CommitFailureTestRelationSuite floods the logs with garbage

2016-03-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13660:
-
Labels: starter  (was: )

> CommitFailureTestRelationSuite floods the logs with garbage
> ---
>
> Key: SPARK-13660
> URL: https://issues.apache.org/jira/browse/SPARK-13660
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>  Labels: starter
>
> https://github.com/apache/spark/pull/11439 added a utility method 
> "testQuietly". We can use it for CommitFailureTestRelationSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13660) CommitFailureTestRelationSuite floods the logs with garbage

2016-03-03 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-13660:


 Summary: CommitFailureTestRelationSuite floods the logs with 
garbage
 Key: SPARK-13660
 URL: https://issues.apache.org/jira/browse/SPARK-13660
 Project: Spark
  Issue Type: Test
  Components: Tests
Reporter: Shixiong Zhu


https://github.com/apache/spark/pull/11439 added a utility method 
"testQuietly". We can use it for CommitFailureTestRelationSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12877) TrainValidationSplit is missing in pyspark.ml.tuning

2016-03-03 Thread Jeremy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178867#comment-15178867
 ] 

Jeremy commented on SPARK-12877:


Hi Xiangrui, 

Commenting!

> TrainValidationSplit is missing in pyspark.ml.tuning
> 
>
> Key: SPARK-12877
> URL: https://issues.apache.org/jira/browse/SPARK-12877
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
> Fix For: 2.0.0
>
>
> I was investigating progress in SPARK-10759 and I noticed that there is no 
> TrainValidationSplit class in the pyspark.ml.tuning module.
> The Java/Scala examples for SPARK-10759 use 
> org.apache.spark.ml.tuning.TrainValidationSplit, which is not available from 
> Python, and this blocks SPARK-10759.
> Does the class have a different name in PySpark, maybe? Also, I couldn't find 
> any JIRA task saying it needs to be implemented. Is it by design that the 
> TrainValidationSplit estimator is not ported to PySpark? If not, that is, if 
> the estimator needs porting, then I would like to contribute.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13584) ContinuousQueryManagerSuite floods the logs with garbage

2016-03-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-13584.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.0.0

> ContinuousQueryManagerSuite floods the logs with garbage
> 
>
> Key: SPARK-13584
> URL: https://issues.apache.org/jira/browse/SPARK-13584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> We should clean up the following outputs
> {code}
> [info] ContinuousQueryManagerSuite:
> 16:30:20.473 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 
> in stage 0.0 (TID 1)
> java.lang.ArithmeticException: / by zero
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> 16:30:20.506 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 0.0 (TID 1, localhost): java.lang.ArithmeticException: / by zero
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 

[jira] [Updated] (SPARK-13635) Enable LimitPushdown optimizer rule because we have whole-stage codegen for Limit

2016-03-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-13635:
---
Assignee: Liang-Chi Hsieh

> Enable LimitPushdown optimizer rule because we have whole-stage codegen for 
> Limit
> -
>
> Key: SPARK-13635
> URL: https://issues.apache.org/jira/browse/SPARK-13635
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> The LimitPushdown optimizer rule was disabled because there was no 
> whole-stage codegen for Limit. Now that we have whole-stage codegen for 
> Limit, we should enable the rule.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13657) Can't parse very long AND/OR expression

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13657:


Assignee: Davies Liu  (was: Apache Spark)

> Can't parse very long AND/OR expression
> ---
>
> Key: SPARK-13657
> URL: https://issues.apache.org/jira/browse/SPARK-13657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> CatalystQl can't parse an expression with hundreds of ANDs/ORs [1]; it fails 
> with a StackOverflowError. 
> [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql
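
For illustration, a toy sketch (not Catalyst code) of the failure mode and one 
way around it: any non-tail-recursive walk over an AND chain that is hundreds 
or thousands of levels deep can overflow the stack, while an explicit worklist 
does not care about depth.

{code}
object DeepAndChainSketch {
  sealed trait Expr
  case class Pred(name: String) extends Expr
  case class And(left: Expr, right: Expr) extends Expr

  // Recursive flattening: not tail-recursive, so a deep enough chain can
  // throw StackOverflowError.
  def conjunctsRecursive(e: Expr): Seq[Expr] = e match {
    case And(l, r) => conjunctsRecursive(l) ++ conjunctsRecursive(r)
    case other     => Seq(other)
  }

  // Iterative flattening with an explicit worklist: depth no longer matters.
  def conjunctsIterative(e: Expr): Seq[Expr] = {
    var stack: List[Expr] = List(e)
    val out = scala.collection.mutable.Buffer[Expr]()
    while (stack.nonEmpty) {
      val head = stack.head
      stack = stack.tail
      head match {
        case And(l, r) => stack = l :: r :: stack
        case other     => out += other
      }
    }
    out.toSeq
  }

  // A left-leaning chain like ((p1 AND p2) AND p3) ... with thousands of terms.
  val deepChain: Expr =
    (1 to 10000).map(i => Pred(s"p$i"): Expr).reduceLeft((a, b) => And(a, b))
}
{code}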



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13657) Can't parse very long AND/OR expression

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13657:


Assignee: Apache Spark  (was: Davies Liu)

> Can't parse very long AND/OR expression
> ---
>
> Key: SPARK-13657
> URL: https://issues.apache.org/jira/browse/SPARK-13657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> CatalystQl can't parse an expression with hundreds of ANDs/ORs [1]; it fails 
> with a StackOverflowError. 
> [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13657) Can't parse very long AND/OR expression

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178862#comment-15178862
 ] 

Apache Spark commented on SPARK-13657:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11501

> Can't parse very long AND/OR expression
> ---
>
> Key: SPARK-13657
> URL: https://issues.apache.org/jira/browse/SPARK-13657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> CatalystQl can't parse an expression with hundreds of ANDs/ORs [1]; it fails 
> with a StackOverflowError. 
> [1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13659) Remove returnValues from BlockStore APIs

2016-03-03 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-13659:
--

 Summary: Remove returnValues from BlockStore APIs
 Key: SPARK-13659
 URL: https://issues.apache.org/jira/browse/SPARK-13659
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Reporter: Josh Rosen
Assignee: Josh Rosen


In preparation for larger refactorings, I think that we should remove the 
confusing returnValues() option from the BlockStore put() APIs: returning the 
value is only useful in one place (caching) and in other situations, such as 
block replication, it's simpler to put() and then get().
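
A minimal sketch of the simplification being proposed, using a hypothetical, 
simplified store interface rather than the real BlockStore API: drop the 
return-values flag and have callers that need the data call get() after put().

{code}
object BlockStoreSketch {
  // Hypothetical interface used only to illustrate the API change.
  trait SimpleBlockStore {
    def put(blockId: String, values: Seq[Any]): Unit
    def get(blockId: String): Option[Seq[Any]]
  }

  // Instead of a put(..., returnValues = true) variant handing the data back,
  // a caller that needs the stored values (e.g. replication) fetches them back:
  def putThenGet(store: SimpleBlockStore, blockId: String,
                 values: Seq[Any]): Option[Seq[Any]] = {
    store.put(blockId, values)
    store.get(blockId) // only done where the values are actually needed
  }
}
{code}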



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13658) BooleanSimplification rule is slow with large boolean expressions

2016-03-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13658:
--

 Summary: BooleanSimplification rule is slow with large boolean 
expressions
 Key: SPARK-13658
 URL: https://issues.apache.org/jira/browse/SPARK-13658
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu


When running TPC-DS Q3 [1] with many predicates to filter out the partitions, the 
optimizer rule BooleanSimplification takes about 2 seconds (it makes heavy use of 
semanticEquals, which requires copying the whole tree).

It would be great if we could speed it up.

[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql

cc [~marmbrus]
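
For intuition, here is a toy sketch (hypothetical Leaf/And/Or case classes, not 
Catalyst's Expression) of the cost pattern: applying a whole-tree equality check 
at every node of a deep OR chain does roughly quadratic work, which is the same 
shape of overhead the rule pays when it repeatedly calls semanticEquals on a 
large boolean expression.

{code}
sealed trait Expr
final case class Leaf(name: String) extends Expr
final case class And(left: Expr, right: Expr) extends Expr
final case class Or(left: Expr, right: Expr) extends Expr

object BooleanCostSketch {
  // Whole-tree structural comparison: touches every node of both trees,
  // analogous in spirit to an equality check that canonicalizes the tree.
  def structurallyEqual(a: Expr, b: Expr): Boolean = (a, b) match {
    case (Leaf(x), Leaf(y))         => x == y
    case (And(l1, r1), And(l2, r2)) => structurallyEqual(l1, l2) && structurallyEqual(r1, r2)
    case (Or(l1, r1), Or(l2, r2))   => structurallyEqual(l1, l2) && structurallyEqual(r1, r2)
    case _                          => false
  }

  def allNodes(e: Expr): Seq[Expr] = e match {
    case And(l, r) => e +: (allNodes(l) ++ allNodes(r))
    case Or(l, r)  => e +: (allNodes(l) ++ allNodes(r))
    case _         => Seq(e)
  }

  def main(args: Array[String]): Unit = {
    // A left-leaning OR chain of n predicates, like a partition filter with
    // hundreds of disjuncts.
    val n = 500
    val chain = (1 until n).foldLeft[Expr](Leaf("p0")) { (acc, i) => Or(acc, Leaf(s"p$i")) }
    val nodes = allNodes(chain)

    // Compare every node against the full expression: n nodes x O(n) per
    // comparison = O(n^2) work overall, which is what makes the rule slow.
    val start = System.nanoTime()
    val hits = nodes.count(node => structurallyEqual(node, chain))
    println(f"$hits of ${nodes.size} nodes matched in ${(System.nanoTime() - start) / 1e6}%.1f ms")
  }
}
{code}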



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13632) Create new o.a.s.sql.execution.commands package

2016-03-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13632.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Create new o.a.s.sql.execution.commands package
> ---
>
> Key: SPARK-13632
> URL: https://issues.apache.org/jira/browse/SPARK-13632
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13577) Allow YARN to handle multiple jars, archive when uploading Spark dependencies

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13577:


Assignee: Apache Spark

> Allow YARN to handle multiple jars, archive when uploading Spark dependencies
> -
>
> Key: SPARK-13577
> URL: https://issues.apache.org/jira/browse/SPARK-13577
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> See parent bug for more details.
> Before we remove assemblies from Spark, we need the YARN backend to 
> understand how to find and upload multiple jars containing the Spark code. As 
> a feature request made during spec review, we should also allow the Spark 
> code to be provided as an archive that is uploaded as a single file to 
> the cluster but exploded when downloaded to the containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13577) Allow YARN to handle multiple jars, archive when uploading Spark dependencies

2016-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13577:


Assignee: (was: Apache Spark)

> Allow YARN to handle multiple jars, archive when uploading Spark dependencies
> -
>
> Key: SPARK-13577
> URL: https://issues.apache.org/jira/browse/SPARK-13577
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Marcelo Vanzin
>
> See parent bug for more details.
> Before we remove assemblies from Spark, we need the YARN backend to 
> understand how to find and upload multiple jars containing the Spark code. As 
> a feature request made during spec review, we should also allow the Spark 
> code to be provided as an archive that is uploaded as a single file to 
> the cluster but exploded when downloaded to the containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13577) Allow YARN to handle multiple jars, archive when uploading Spark dependencies

2016-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178831#comment-15178831
 ] 

Apache Spark commented on SPARK-13577:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11500

> Allow YARN to handle multiple jars, archive when uploading Spark dependencies
> -
>
> Key: SPARK-13577
> URL: https://issues.apache.org/jira/browse/SPARK-13577
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Marcelo Vanzin
>
> See parent bug for more details.
> Before we remove assemblies from Spark, we need the YARN backend to 
> understand how to find and upload multiple jars containing the Spark code. As 
> a feature request made during spec review, we should also allow the Spark 
> code to be provided as an archive that is uploaded as a single file to 
> the cluster but exploded when downloaded to the containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12347) Write script to run all MLlib examples for testing

2016-03-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178803#comment-15178803
 ] 

Joseph K. Bradley commented on SPARK-12347:
---

1. As in the JIRA description, there would need to be some list of special 
examples requiring input.

2. If an example requires an input file, it should be one of the input 
datasets that come with Spark, so I think it's OK to hard-code the names.
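
As a rough illustration of that design, a driver could build the example list 
programmatically and consult a small table of special cases. The paths, the 
"*Example.scala" filter, and the specialCases entries below are illustrative 
assumptions, not an agreed layout.

{code}
// Rough sketch only: discover examples by walking the source tree instead of
// keeping a fixed list, and special-case examples that need arguments.
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

object RunAllExamplesSketch {
  // Examples that cannot run without arguments, mapped to the args to use
  // (the entry below is a made-up placeholder).
  val specialCases: Map[String, Seq[String]] = Map(
    "ml.HypotheticalExampleNeedingInput" -> Seq("data/mllib/sample_libsvm_data.txt")
  )

  // Build the list programmatically by walking the examples source tree.
  def discoverExamples(root: Path): Seq[String] =
    Files.walk(root).iterator().asScala
      .filter(_.toString.endsWith("Example.scala"))
      .map { p =>
        root.relativize(p).toString
          .stripSuffix(".scala")
          .replace(java.io.File.separatorChar, '.')
      }
      .toSeq

  def main(args: Array[String]): Unit = {
    val root = Paths.get("examples/src/main/scala/org/apache/spark/examples")
    for (example <- discoverExamples(root)) {
      val cmd = Seq("bin/run-example", example) ++ specialCases.getOrElse(example, Seq.empty)
      // Printing instead of executing keeps the sketch side-effect free; a real
      // script would invoke each command and record pass/fail per example.
      println(cmd.mkString(" "))
    }
  }
}
{code}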

> Write script to run all MLlib examples for testing
> --
>
> Key: SPARK-12347
> URL: https://issues.apache.org/jira/browse/SPARK-12347
> Project: Spark
>  Issue Type: Test
>  Components: ML, MLlib, PySpark, SparkR, Tests
>Reporter: Joseph K. Bradley
>
> It would facilitate testing to have a script which runs all MLlib examples 
> for all languages.
> Design sketch to ensure all examples are run:
> * Generate a list of examples to run programmatically (not from a fixed list).
> * Use a list of special examples to handle examples which require command 
> line arguments.
> * Make sure data, etc. used are small to keep the tests quick.
> This could be broken into subtasks for each language, though it would be nice 
> to provide a single script.
> Not sure where the script should live; perhaps in {{bin/}}?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists

2016-03-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178794#comment-15178794
 ] 

Joseph K. Bradley commented on SPARK-13068:
---

I like the idea of automatic casting when possible.  A few comments:
* I don't think we should add special subclasses of Param for each type.  We 
had to do that in Scala to make it Java-friendly.  Instead, the casting could 
be handled by a single method that takes the value and the expected type (see 
the sketch after this comment).
* I don't want to go overboard on conversions.  E.g., if a {{str}} is required, 
I don't think we should try casting arbitrary values to a string.  (If a user 
calls {{setInputCol([1,2,3])}}, then they are probably making a mistake.)
* I don't think we should add ParamValidators in Python at this time.  We could 
consider doing it later, but it's really a separate issue.  I'm also worried 
about having validation in two places, since they could get out of sync.

How does that sound?
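
A language-neutral sketch of that single-method idea (the real change would 
live in pyspark.ml.param; the ExpectedType encoding below is purely 
illustrative): one conversion routine takes the value and the expected type, 
converts lists element-wise, and rejects anything that is not an unambiguous 
cast.

{code}
// Illustrative only: a single conversion routine instead of one Param
// subclass per type.
object ParamConversionSketch {
  sealed trait ExpectedType
  case object IntType extends ExpectedType
  case object DoubleType extends ExpectedType
  final case class ListOf(elem: ExpectedType) extends ExpectedType

  // Cast only when the conversion is unambiguous; otherwise reject, rather
  // than trying to coerce anything into anything (e.g. a list into a string).
  def convert(value: Any, expected: ExpectedType): Either[String, Any] = (value, expected) match {
    case (i: Int, IntType)          => Right(i)
    case (i: Int, DoubleType)       => Right(i.toDouble) // safe widening
    case (d: Double, DoubleType)    => Right(d)
    case (xs: Seq[_], ListOf(elem)) =>
      val converted = xs.map(x => convert(x, elem))
      converted.collectFirst { case Left(err) => Left(err) }
        .getOrElse(Right(converted.collect { case Right(v) => v }))
    case (other, exp)               => Left(s"cannot convert $other to $exp")
  }

  def main(args: Array[String]): Unit = {
    println(convert(Seq(1, 2, 3), ListOf(DoubleType))) // Right(List(1.0, 2.0, 3.0))
    println(convert("abc", DoubleType))                // Left(cannot convert ...)
  }
}
{code}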

> Extend pyspark ml paramtype conversion to support lists
> ---
>
> Key: SPARK-13068
> URL: https://issues.apache.org/jira/browse/SPARK-13068
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Trivial
>
> In SPARK-7675 we added type conversion for PySpark ML params. We should 
> follow up and support param type conversion for lists and nested structures 
> as required. This blocks having all PySpark ML params having type information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13599) Groovy-all ends up in spark-assembly if hive profile set

2016-03-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13599.
-
   Resolution: Fixed
 Assignee: Steve Loughran
Fix Version/s: 2.0.0

> Groovy-all ends up in spark-assembly if hive profile set
> 
>
> Key: SPARK-13599
> URL: https://issues.apache.org/jira/browse/SPARK-13599
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 2.0.0
>
>
> If you do a build with {{-Phive,hive-thriftserver}}, then the contents of 
> {{org.codehaus.groovy:groovy-all}} get into the spark-assembly.jar.
> This is bad because:
> * it makes the JAR bigger
> * it makes the build longer
> * it's an uber-JAR itself, so it can include things (maybe even conflicting 
> things)
> * it's something else that needs to be kept up to date security-wise



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13657) Can't parse very long AND/OR expression

2016-03-03 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13657:
--

 Summary: Can't parse very long AND/OR expression
 Key: SPARK-13657
 URL: https://issues.apache.org/jira/browse/SPARK-13657
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


CatalystQl can't parse an expression with hundreds of AND/OR operators [1]; it 
fails with a StackOverflowError.



[1] https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q3.sql
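
For reference, a minimal way to reproduce the shape of the failing input is to 
generate a WHERE clause that is one long OR chain, similar to the partition 
predicates in the TPC-DS Q3 variant above. The sketch below only builds the 
query string; the table and column names are placeholders, and the 
commented-out sqlContext.sql call (assuming a spark-shell style SQLContext) is 
where a deeply nested AND/OR parse can overflow the stack, depending on the 
Spark version and JVM stack size.

{code}
// Sketch: build a query whose WHERE clause is one long OR chain.
// Table/column names are placeholders; only the string generation runs here.
object LongPredicateRepro {
  def longWhereClause(n: Int): String =
    (0 until n).map(i => s"(ss_sold_date_sk = $i)").mkString(" OR ")

  def main(args: Array[String]): Unit = {
    val query = s"SELECT * FROM store_sales WHERE ${longWhereClause(2000)}"
    println(s"generated a query of ${query.length} characters")
    // In a Spark shell one would then run:
    //   sqlContext.sql(query)
    // which is where the deeply nested parse tree can trigger the StackOverflowError.
  }
}
{code}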



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13652) TransportClient.sendRpcSync returns wrong results

2016-03-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13652:
-
Summary: TransportClient.sendRpcSync returns wrong results  (was: spark 
netty network issue)

> TransportClient.sendRpcSync returns wrong results
> -
>
> Key: SPARK-13652
> URL: https://issues.apache.org/jira/browse/SPARK-13652
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: huangyu
> Attachments: RankHandler.java, Test.java
>
>
> TransportClient is not thread-safe: if it is called from multiple threads, the 
> messages can't be encoded and decoded correctly. Below is my code, and it 
> prints mismatched send/response pairs.
> {code}
> public static void main(String[] args) throws IOException, InterruptedException {
>     // Echo server: RankHandler sends every request payload straight back.
>     TransportServer server = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<String, String>())),
>             new RankHandler())
>             .createServer(8081, new LinkedList<TransportServerBootstrap>());
>     TransportContext context = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<String, String>())),
>             new NoOpRpcHandler(), true);
>     final TransportClientFactory clientFactory = context.createClientFactory();
>
>     // Ten threads share the same client and each sends 1000 sequential longs,
>     // checking that every response matches the value that was sent.
>     List<Thread> ts = new ArrayList<>();
>     for (int i = 0; i < 10; i++) {
>         ts.add(new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 for (int j = 0; j < 1000; j++) {
>                     try {
>                         ByteBuf buf = Unpooled.buffer(8);
>                         buf.writeLong((long) j);
>                         ByteBuffer byteBuffer = clientFactory.createClient("localhost", 8081)
>                                 .sendRpcSync(buf.nioBuffer(), Long.MAX_VALUE);
>                         long response = byteBuffer.getLong();
>                         if (response != j) {
>                             System.err.println("send:" + j + ",response:" + response);
>                         }
>                     } catch (IOException e) {
>                         e.printStackTrace();
>                     }
>                 }
>             }
>         }));
>         ts.get(i).start();
>     }
>     for (Thread t : ts) {
>         t.join();
>     }
>     server.close();
> }
>
> public class RankHandler extends RpcHandler {
>     private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
>     private final StreamManager streamManager;
>
>     public RankHandler() {
>         this.streamManager = new OneForOneStreamManager();
>     }
>
>     @Override
>     public void receive(TransportClient client, ByteBuffer msg, RpcResponseCallback callback) {
>         // Echo the request payload back as the RPC response.
>         callback.onSuccess(msg);
>     }
>
>     @Override
>     public StreamManager getStreamManager() {
>         return streamManager;
>     }
> }
> {code}
> It prints output like the following:
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13652) spark netty network issue

2016-03-03 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178723#comment-15178723
 ] 

Shixiong Zhu commented on SPARK-13652:
--

Updated "Affects Version/s". This bug only exists in 1.6.0 and master

> spark netty network issue
> -
>
> Key: SPARK-13652
> URL: https://issues.apache.org/jira/browse/SPARK-13652
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: huangyu
> Attachments: RankHandler.java, Test.java
>
>
> TransportClient is not thread-safe: if it is called from multiple threads, the 
> messages can't be encoded and decoded correctly. Below is my code, and it 
> prints mismatched send/response pairs.
> {code}
> public static void main(String[] args) throws IOException, InterruptedException {
>     // Echo server: RankHandler sends every request payload straight back.
>     TransportServer server = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<String, String>())),
>             new RankHandler())
>             .createServer(8081, new LinkedList<TransportServerBootstrap>());
>     TransportContext context = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<String, String>())),
>             new NoOpRpcHandler(), true);
>     final TransportClientFactory clientFactory = context.createClientFactory();
>
>     // Ten threads share the same client and each sends 1000 sequential longs,
>     // checking that every response matches the value that was sent.
>     List<Thread> ts = new ArrayList<>();
>     for (int i = 0; i < 10; i++) {
>         ts.add(new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 for (int j = 0; j < 1000; j++) {
>                     try {
>                         ByteBuf buf = Unpooled.buffer(8);
>                         buf.writeLong((long) j);
>                         ByteBuffer byteBuffer = clientFactory.createClient("localhost", 8081)
>                                 .sendRpcSync(buf.nioBuffer(), Long.MAX_VALUE);
>                         long response = byteBuffer.getLong();
>                         if (response != j) {
>                             System.err.println("send:" + j + ",response:" + response);
>                         }
>                     } catch (IOException e) {
>                         e.printStackTrace();
>                     }
>                 }
>             }
>         }));
>         ts.get(i).start();
>     }
>     for (Thread t : ts) {
>         t.join();
>     }
>     server.close();
> }
>
> public class RankHandler extends RpcHandler {
>     private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
>     private final StreamManager streamManager;
>
>     public RankHandler() {
>         this.streamManager = new OneForOneStreamManager();
>     }
>
>     @Override
>     public void receive(TransportClient client, ByteBuffer msg, RpcResponseCallback callback) {
>         // Echo the request payload back as the RPC response.
>         callback.onSuccess(msg);
>     }
>
>     @Override
>     public StreamManager getStreamManager() {
>         return streamManager;
>     }
> }
> {code}
> It prints output like the following:
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13652) spark netty network issue

2016-03-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13652:
-
Affects Version/s: (was: 1.5.2)
   (was: 1.5.1)
   (was: 1.4.1)
   (was: 1.3.1)
   (was: 1.2.2)

> spark netty network issue
> -
>
> Key: SPARK-13652
> URL: https://issues.apache.org/jira/browse/SPARK-13652
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: huangyu
> Attachments: RankHandler.java, Test.java
>
>
> TransportClient is not thread-safe: if it is called from multiple threads, the 
> messages can't be encoded and decoded correctly. Below is my code, and it 
> prints mismatched send/response pairs.
> {code}
> public static void main(String[] args) throws IOException, InterruptedException {
>     // Echo server: RankHandler sends every request payload straight back.
>     TransportServer server = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<String, String>())),
>             new RankHandler())
>             .createServer(8081, new LinkedList<TransportServerBootstrap>());
>     TransportContext context = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<String, String>())),
>             new NoOpRpcHandler(), true);
>     final TransportClientFactory clientFactory = context.createClientFactory();
>
>     // Ten threads share the same client and each sends 1000 sequential longs,
>     // checking that every response matches the value that was sent.
>     List<Thread> ts = new ArrayList<>();
>     for (int i = 0; i < 10; i++) {
>         ts.add(new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 for (int j = 0; j < 1000; j++) {
>                     try {
>                         ByteBuf buf = Unpooled.buffer(8);
>                         buf.writeLong((long) j);
>                         ByteBuffer byteBuffer = clientFactory.createClient("localhost", 8081)
>                                 .sendRpcSync(buf.nioBuffer(), Long.MAX_VALUE);
>                         long response = byteBuffer.getLong();
>                         if (response != j) {
>                             System.err.println("send:" + j + ",response:" + response);
>                         }
>                     } catch (IOException e) {
>                         e.printStackTrace();
>                     }
>                 }
>             }
>         }));
>         ts.get(i).start();
>     }
>     for (Thread t : ts) {
>         t.join();
>     }
>     server.close();
> }
>
> public class RankHandler extends RpcHandler {
>     private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
>     private final StreamManager streamManager;
>
>     public RankHandler() {
>         this.streamManager = new OneForOneStreamManager();
>     }
>
>     @Override
>     public void receive(TransportClient client, ByteBuffer msg, RpcResponseCallback callback) {
>         // Echo the request payload back as the RPC response.
>         callback.onSuccess(msg);
>     }
>
>     @Override
>     public StreamManager getStreamManager() {
>         return streamManager;
>     }
> }
> {code}
> It prints output like the following:
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13652) spark netty network issue

2016-03-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13652:
-
Affects Version/s: 1.2.2
   1.3.1
   1.4.1

> spark netty network issue
> -
>
> Key: SPARK-13652
> URL: https://issues.apache.org/jira/browse/SPARK-13652
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.1, 1.5.2, 1.6.0
>Reporter: huangyu
> Attachments: RankHandler.java, Test.java
>
>
> TransportClient is not thread-safe: if it is called from multiple threads, the 
> messages can't be encoded and decoded correctly. Below is my code, and it 
> prints mismatched send/response pairs.
> {code}
> public static void main(String[] args) throws IOException, InterruptedException {
>     // Echo server: RankHandler sends every request payload straight back.
>     TransportServer server = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<String, String>())),
>             new RankHandler())
>             .createServer(8081, new LinkedList<TransportServerBootstrap>());
>     TransportContext context = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<String, String>())),
>             new NoOpRpcHandler(), true);
>     final TransportClientFactory clientFactory = context.createClientFactory();
>
>     // Ten threads share the same client and each sends 1000 sequential longs,
>     // checking that every response matches the value that was sent.
>     List<Thread> ts = new ArrayList<>();
>     for (int i = 0; i < 10; i++) {
>         ts.add(new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 for (int j = 0; j < 1000; j++) {
>                     try {
>                         ByteBuf buf = Unpooled.buffer(8);
>                         buf.writeLong((long) j);
>                         ByteBuffer byteBuffer = clientFactory.createClient("localhost", 8081)
>                                 .sendRpcSync(buf.nioBuffer(), Long.MAX_VALUE);
>                         long response = byteBuffer.getLong();
>                         if (response != j) {
>                             System.err.println("send:" + j + ",response:" + response);
>                         }
>                     } catch (IOException e) {
>                         e.printStackTrace();
>                     }
>                 }
>             }
>         }));
>         ts.get(i).start();
>     }
>     for (Thread t : ts) {
>         t.join();
>     }
>     server.close();
> }
>
> public class RankHandler extends RpcHandler {
>     private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
>     private final StreamManager streamManager;
>
>     public RankHandler() {
>         this.streamManager = new OneForOneStreamManager();
>     }
>
>     @Override
>     public void receive(TransportClient client, ByteBuffer msg, RpcResponseCallback callback) {
>         // Echo the request payload back as the RPC response.
>         callback.onSuccess(msg);
>     }
>
>     @Override
>     public StreamManager getStreamManager() {
>         return streamManager;
>     }
> }
> {code}
> It prints output like the following:
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


